9201 Commits

Author SHA1 Message Date
Chris Lu 7bf2dfc9ab Bound the metadata-log flush queue (#9907)
* Bound the metadata-log flush queue

A stalled flush, e.g. slow volume servers under a reconnect storm, let up
to 256 queued 8MB buffer copies pin two gigabytes per log buffer while
producers kept filling the queue. Cap the queue at 16 so a sustained
stall backpressures writers instead of growing the heap. The flush
goroutine never feeds back into the buffer (system-log paths skip event
notification), so blocked producers cannot deadlock the consumer.

* Don't drop a force-flushed buffer on a full queue

ForceFlush enqueued with a two-second timeout, but by then the live
buffer was already sealed and reset, so a timed-out send silently lost
the copy. Block until the flush is queued; the wait for completion stays
bounded since the data is durable once the flush loop drains it.

* Never close the flush channel

ShutdownLogBuffer closed flushChan while producers could still be
blocked sending into it, which panics. Terminate loopFlush with a nil
sentinel instead, so the channel is never closed, and give every
producer-side send a shutdown escape so none parks forever once the
flush loop exits. Everything queued before the sentinel still drains,
preserving IsAllFlushed semantics.

* Copy the shutdown flush under the buffer lock

Every other copyToFlush call site holds the lock; the shutdown path read
the live buffer unlocked while producers could still be appending.
2026-06-10 10:57:30 -07:00
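The bounded queue, nil sentinel, and shutdown escape described in this commit can be sketched as below. This is a minimal illustration, not the actual log buffer code: `flushQueue`, `enqueue`, and `stop` are hypothetical names, and the real flush loop writes buffers out rather than counting them.

```go
package main

import "fmt"

// flushQueue is a hypothetical stand-in for the log buffer's flush channel.
type flushQueue struct {
	ch      chan []byte   // bounded and never closed
	done    chan struct{} // closed by the flush loop when it exits
	flushed int           // buffers drained by the loop
}

func newFlushQueue(capacity int) *flushQueue {
	q := &flushQueue{ch: make(chan []byte, capacity), done: make(chan struct{})}
	go q.loop()
	return q
}

// loop drains buffers until it receives the nil sentinel.
func (q *flushQueue) loop() {
	defer close(q.done)
	for buf := range q.ch {
		if buf == nil {
			return
		}
		q.flushed++
	}
}

// enqueue blocks while the queue is full, which backpressures the writer.
// It reports false instead of parking forever once the loop has exited.
func (q *flushQueue) enqueue(buf []byte) bool {
	select {
	case <-q.done:
		return false
	default:
	}
	select {
	case q.ch <- buf:
		return true
	case <-q.done:
		return false
	}
}

// stop queues the sentinel and waits; everything queued earlier drains first.
func (q *flushQueue) stop() {
	if q.enqueue(nil) {
		<-q.done
	}
}

func main() {
	q := newFlushQueue(16)
	for i := 0; i < 40; i++ {
		q.enqueue([]byte{byte(i)})
	}
	q.stop()
	fmt.Println(q.flushed, q.enqueue([]byte("late")))
}
```

Because the channel is never closed, a producer that is still sending during shutdown cannot panic; only buffers queued before the sentinel are guaranteed to drain.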
Chris Lu bf76040046 Share metadata-log replays per chunk instead of per file (#9906)
* Share metadata-log replays per chunk instead of per file

Log file chunks are immutable: each metadata-log flush uploads one whole
buffer of complete records as a new chunk, and appends only add chunks.
So cache decoded entries per chunk, with no age gate and no fingerprint
revalidation. The per-file cache excluded files younger than two flush
intervals, which is exactly the hot tail that every tailing or
reconnecting subscriber replays — each through a private chunk reader
holding an 8MB buffer and decoding the whole file from byte zero.

A chunk's flush time also upper-bounds every record timestamp inside it,
so a tail replay now skips cold chunks without reading them at all.

If a chunk does not decode standalone (records spanning chunk
boundaries, or a corrupt size prefix), fall back to streaming the whole
file as one byte stream, resuming after the last yielded entry.

* Evict idle metadata-log cache entries

The replay cache only evicted on insert, so once filled it held its full
budget forever. Stamp entries on use and sweep the LRU tail every minute,
dropping anything untouched for five minutes; the cache now holds memory
only while subscribers actually replay.

* Reject implausible records when decoding log chunks

proto.Unmarshal is permissive: empty payloads and unknown-field garbage
parse without error, so a chunk starting mid-record could decode by
coincidence and get cached instead of falling back to the byte stream.
Enforce what the writer guarantees - records are never empty and carry
strictly increasing positive timestamps within one flushed buffer.

* Gate the singleflight test on an open flight

The sleep alone only probabilistically created concurrent misses; a
started channel now proves the loader holds the flight before callers
are released.
2026-06-10 10:57:11 -07:00
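The plausibility rule from "Reject implausible records when decoding log chunks" might look roughly like this. The `record` type and `plausible` function are assumptions made for illustration; the real decoder works on protobuf log entries.

```go
package main

import "fmt"

// record is a hypothetical decoded log entry: a timestamp and its payload.
type record struct {
	tsNs    int64
	payload []byte
}

// plausible reports whether the records match what the writer guarantees:
// no empty payloads, and strictly increasing positive timestamps.
func plausible(records []record) bool {
	var last int64 // starting at zero also rejects non-positive timestamps
	for _, r := range records {
		if len(r.payload) == 0 || r.tsNs <= last {
			return false
		}
		last = r.tsNs
	}
	return true
}

func main() {
	good := []record{{10, []byte("a")}, {11, []byte("b")}}
	bad := []record{{10, []byte("a")}, {10, []byte("b")}}
	fmt.Println(plausible(good), plausible(bad))
}
```

A chunk that fails this check would not be cached and would instead take the byte-stream fallback described above.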
Lisandro Pin 5150c86934 Make shell command ec.scrub return shard details upon scrub failures in LOCAL mode. (#9913)
This is useful information to deal with issues requiring EC shard rebuilding,
such as https://github.com/seaweedfs/seaweedfs/issues/9872.
2026-06-10 10:55:16 -07:00
7y-9 7c0a9acb30 fix(s3api): normalize checksum trailer header names (#9905)
Problem: SigV4 chunked upload checksum trailer parsing rejected mixed-case checksum header names even though HTTP header field names are case-insensitive.

Root cause: extractChecksumAlgorithm compared the x-amz-trailer value and trailer header key against exact lowercase strings.

Fix: Trim and lowercase checksum trailer header names before matching supported checksum algorithms.

Reproduction: go test ./weed/s3api -run TestExtractChecksumAlgorithmIsCaseInsensitive -count=1 with X-Amz-Checksum-Crc32; before the fix it returned unsupported checksum algorithm.

Validation: gofmt -w weed/s3api/chunked_reader_v4.go weed/s3api/chunked_reader_v4_test.go; git diff --check; go test ./weed/s3api -run TestExtractChecksumAlgorithmIsCaseInsensitive -count=1; go test ./weed/s3api -count=1

Co-authored-by: Codex <noreply@openai.com>
2026-06-10 00:30:43 -07:00
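The normalization this fix describes can be sketched as follows. `checksumAlgorithmFor` is a hypothetical helper, not the real `extractChecksumAlgorithm`, and the set of supported headers shown here is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// checksumAlgorithmFor trims and lowercases a trailer header name before
// matching it, since HTTP header field names are case-insensitive.
func checksumAlgorithmFor(header string) (string, error) {
	switch strings.ToLower(strings.TrimSpace(header)) {
	case "x-amz-checksum-crc32":
		return "CRC32", nil
	case "x-amz-checksum-crc32c":
		return "CRC32C", nil
	case "x-amz-checksum-sha1":
		return "SHA1", nil
	case "x-amz-checksum-sha256":
		return "SHA256", nil
	default:
		return "", fmt.Errorf("unsupported checksum algorithm: %q", header)
	}
}

func main() {
	alg, err := checksumAlgorithmFor(" X-Amz-Checksum-Crc32 ")
	fmt.Println(alg, err)
}
```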
Chris Lu 9e98ec4b2e Share decoded metadata-log entries across subscriber replays (#9903)
perf(filer): share decoded log entries across metadata replays

Concurrent SubscribeMetadata replays of the same persisted log history each
opened a chunk reader per source filer and re-decoded the same files, so a
reconnect storm multiplied into many GB of buffers. Cache the decoded entries
of completed log files in a bounded LRU, coalescing concurrent loads with
single-flight and bounding concurrent decodes. Each hit is validated against
the file's current chunk set, so a file that received a late append is
reloaded rather than served stale; reads that stop on an unreachable chunk are
delivered but not cached so a transient outage re-probes on the next replay.
2026-06-09 13:34:11 -07:00
Chris Lu e12052ee6b fix(filer.sync): replicate a rename as an atomic move, not a no-op update (#9895)
* fix(filer.sync): replicate a rename as create-then-delete, not an in-place update

A rename arrives as a single metadata event carrying both the old and new
entry. The filer sink was routed to UpdateEntry, which looks up the old
path but issues the update against the new parent without changing the
name — and the filer UpdateEntry RPC cannot move an entry. So the rename
was dropped: the old path lingered and the new path never appeared
(same-dir renames rewrote the old name in place).

Route a real move (the sink path changed) through CreateEntry(new) then
DeleteEntry(old) in both the replicator and the filer.sync/backup driver,
the way the other sinks already handle it; reach UpdateEntry only for true
in-place updates. Create before delete so a crash between the two leaves
the entry visible rather than lost.

* fix(filer.sync): derive the rename delete key like the create key, guard the watched root

The rename delete leg rebuilt the old key with a raw util.Join, bypassing the
sink-side key normalization the create leg gets from buildKey — so a rename
could create the new entry and then fail to delete the old one under a
transformed key. Build the old key through buildKey too, and skip the delete
when the moved entry is the watched root itself (where the old key would
resolve to the target root and recursively delete the whole sink tree).

* test(filer.sync): cover the in-place update delete-then-create fallback order

The recording sinks always reported foundExisting, so the fallback that an
in-place update takes when the entry is missing on the sink was never run.
Make it configurable and assert the fallback deletes before it recreates the
same key, in both the replicator and the filer.sync drivers.

* feat(filer.sync): move filer-sink renames natively via AtomicRenameEntry

create-then-delete is unsafe for the filer sink: CreateEntry returns nil
without creating on a transient chunk-copy error, so the paired delete could
remove the only valid destination copy; a directory rename also deleted the
old subtree before descendants were recreated, and left old chunks behind.

Add an optional EntryMover sink capability and implement it on the filer sink
via AtomicRenameEntry — one atomic, metadata-only move that relocates a whole
subtree in a single transaction. Renames prefer it; sinks without a native
move keep create-then-delete. When the old path is already gone (a descendant
the parent rename moved, or one never replicated) MoveEntry creates the new
path instead, re-checking existence with a lookup so a rolled-back move that
left the old entry intact is retried rather than mistaken for gone.

* docs(filer.sync): note entryMissing's gRPC not-found string fallback is deliberate
2026-06-09 12:54:28 -07:00
7y-9 a9e4995d76 fix(http): accept no content delete responses (#9893)
* fix(http): accept no content delete responses

Problem: util/http.Delete reports an error for a successful HTTP 204 No Content response.

Root cause: Delete only treats 200 OK, 202 Accepted, and 404 Not Found as non-error responses, omitting the standard 204 status commonly returned by DELETE endpoints.

Fix: Include http.StatusNoContent in the Delete success status set.

Reproduction: go test ./weed/util/http -run TestDeleteTreatsNoContentAsSuccess -count=1 fails before the fix with an empty error for a 204 response.

Validation: go test ./weed/util/http -run TestDeleteTreatsNoContentAsSuccess -count=1; go test ./weed/util/http -count=1; git diff --check; git diff --cached --check
Co-authored-by: Codex <noreply@openai.com>

* Update weed/util/http/http_global_client_util_test.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-06-09 11:45:14 -07:00
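A small sketch of the status check this fix describes. `isDeleteSuccess` is a hypothetical name; the real `Delete` helper in `util/http` also performs the request and reads the body.

```go
package main

import (
	"fmt"
	"net/http"
)

// isDeleteSuccess reports whether a DELETE response status counts as a
// non-error outcome, now including 204 No Content.
func isDeleteSuccess(status int) bool {
	switch status {
	case http.StatusOK, http.StatusAccepted, http.StatusNoContent, http.StatusNotFound:
		return true
	}
	return false
}

func main() {
	for _, s := range []int{200, 202, 204, 404, 500} {
		fmt.Println(s, isDeleteSuccess(s))
	}
}
```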
Chris Lu 048f9ece2d Fix filer metadata-replay OOM under mount reconnect storms (#9901)
* fix(filer): propagate multi-filer metadata log read errors

A genuine (non not-found) read error in one filer's log stream was logged
and skipped, then the merged cursor advanced past the gap, silently
dropping that file's events. Abort the whole replay so the subscriber
re-reads from the unchanged position; chunk-not-found still skips.

* perf(mount): read persisted metadata log chunks directly from volume servers

Set LogFileReaderFn so the filer returns log file references and the mount
reads the chunk data itself, instead of the filer reading, decoding, and
streaming every persisted entry. Keeps a reconnect storm of many mounts
from concentrating hundreds of concurrent log replays in filer memory.

* perf(filer): pre-size chunk stream reader buffer to view size

The chunk size is known up front, so grow the buffer once instead of
letting bytes.Buffer double as the streamed pieces arrive (which
transiently overshoots to ~2x per reader).

* fix(filer): bound concurrent persisted-log replays

Each server-side replay holds an open chunk reader per source filer plus a
readahead buffer, so a reconnect storm of clients that predate the
metadata-chunks offload multiplies into many GB. Gate replays with a
semaphore; abort the acquire when the subscriber's stream is gone so
cancelled clients do not pile up parked goroutines.
2026-06-09 11:43:12 -07:00
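The semaphore gate from "bound concurrent persisted-log replays" can be approximated with a buffered channel and a context. `acquire`, `release`, and `demo` are illustrative names, and using the subscriber stream's context for cancellation is an assumption about how the real code detects a gone client.

```go
package main

import (
	"context"
	"fmt"
)

// acquire takes a replay slot, or gives up when the subscriber's context
// is cancelled so that departed clients do not pile up parked goroutines.
func acquire(ctx context.Context, sem chan struct{}) error {
	select {
	case sem <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// release returns a slot to the semaphore.
func release(sem chan struct{}) { <-sem }

// demo fills the only slot, then shows a cancelled subscriber failing fast.
func demo() (first, second error) {
	sem := make(chan struct{}, 1)
	first = acquire(context.Background(), sem)
	defer release(sem)

	ctx, cancel := context.WithCancel(context.Background())
	cancel() // the subscriber's stream is already gone
	second = acquire(ctx, sem)
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println(first, second)
}
```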
Chris Lu 8776b9d311 feat(filer): object size distribution metric and dashboard panels (#9902)
* feat(filer): record object size distribution histogram

Add SeaweedFS_filer_object_size_bytes, a histogram sampled when an
object is first created in the filer namespace, covering every write
protocol (S3, WebDAV, FUSE mount, direct HTTP). Buckets follow the
1KB/100KB/1MB/100MB/1GB ranges operators use to size collections.
Directories, overwrites, and metadata-only updates are not sampled, so
the bucket counts track the size distribution of distinct objects.

* feat(metrics): add filer object size distribution dashboard panels

Add a write-rate-by-size-range graph and a size-distribution bar gauge,
driven by SeaweedFS_filer_object_size_bytes, to the standalone and Helm
Grafana dashboards. Per-range subtractions are clamped at zero so
transient negative rate() samples do not render below the axis.
2026-06-09 10:41:11 -07:00
Chris Lu 7b07d8177a fix(filer.sync): scope filesystem key sanitization to the local sink (#9894)
* fix(filer.sync): scope filesystem key sanitization to the local sink

destKey ran every sink key through escapeKey, whose Windows build strips
colons. Colons are illegal in NTFS filenames so the local sink needs that,
but s3/filer/azure/gcs/b2 accept them as ordinary key bytes — stripping
them silently diverged the destination key (a source a:b replicated as ab).

Move the sanitization into the local sink behind a Windows build tag,
applied at every entry point so the previously-unescaped in-place-update
paths stay consistent. Non-local sinks now keep the raw key; non-Windows
builds are unchanged; a leading drive-letter colon is preserved.

* test(filer.sync): cover incremental destKey and localsink update/delete sanitization

Lock the colon-preserving behavior for the incremental destKey branch, and
extend the Windows local-sink test to assert UpdateEntry and DeleteEntry also
sanitize the key, not just CreateEntry.
2026-06-09 10:18:49 -07:00
Jaehoon Kim 202517c02a fix(filer.backup): skip replay events whose source chunk was superseded or deleted (#9886)
* fix(filer.backup): skip replay events whose chunk no longer exists on the source

"Source" is the filer we replicate FROM (e.g. green in a green->blue backup).

Replaying the metadata log from a checkpoint can hit an event whose chunk was
since overwritten/deleted and garbage-collected on the source volume. Fetching
it returns 0 bytes (a permanent size mismatch), which the sink propagated to the
subscription — so the same offset retried forever and replication stalled.

Skip the event only when proven stale; otherwise keep refusing so genuine loss
of a live file still halts loudly:

- onCorruptChunk centralizes the three errChunkSizeMismatch sites.
- getEntryMtimeNs compares mtime at nanosecond precision so same-second rewrites
  (git's config.lock dance) are ordered correctly.
- sourceSupersedes re-reads the entry's current state on the source: gone
  (ErrNotFound) or a strictly-newer mtime than the replayed version -> skip;
  any other lookup error keeps the entry.

Skipping is lossless: events are full-entry snapshots, so a later event
re-carries the current chunks and a delete event reconciles a removed file.

* test(filer.backup): cover the superseded-chunk skip decision

- TestSourceSupersedes: not-found (sentinel / wrapped / gRPC string) and nil
  entry -> skip; network error -> keep; source newer -> skip; same/older -> keep.
- TestGetEntryMtimeNs: nanosecond precision, same-second ordering, nil safety.
- TestOnCorruptChunkRefusesWhenSupersessionUnconfirmed: never skip silently when
  supersession cannot be confirmed.

* fix(filer.backup): don't infer supersession for incremental sinks

In incremental mode the sink key carries a date prefix
(sinkDir/YYYY-MM-DD/relPath) that cannot be reversed to a real source path, so a
source lookup would always be ErrNotFound and wrongly classify a live entry as
deleted — skipping it. Make targetPathToSourcePath report "unmappable" in
incremental mode; hasSourceNewerVersion already declines to skip when the source
path cannot be mapped.

Found in code review. Non-incremental sinks (filer.backup green->blue) are
unaffected.

* refactor(filer.backup): name the mtime param sourceMtimeNs; note ns overflow bound

- Rename the threaded sourceMtime parameter to sourceMtimeNs across the internal
  replicate/fetch helpers so the unit is explicit (it only feeds
  hasSourceNewerVersion, which compares in nanoseconds).
- Document that getEntryMtimeNs's int64 ns arithmetic is safe until ~year 2262.

No behavior change.

* fix(filer.backup): order same-second versions in the CreateEntry skip and update gates

The CreateEntry already-replicated short-circuit and chooseUpdateAction
still compared second-grained mtime, so a newer version written within
the same second could be skipped as already-replicated or overwritten by
an older same-second replay. Route both through getEntryMtimeNs, matching
the precision the chunk-replication path already uses.

* test(filer.backup): cover same-second update-action ordering

* docs(filer.backup): trim verbose comments to terse why

* fix(filer.backup): check supersession against the rename's new path

For a rename the filer sink updates in place (the delete+create branch is
skipped for sink name "filer"), so the corrupt-chunk supersession check
queried the pre-rename key. Its source-side ErrNotFound was read as
"superseded", silently advancing the checkpoint without applying the rename.
Map the incoming entry's new path (newParentPath/newEntry.Name) for both
update branches.

* fix(filer.backup): detect a deleted source even when the replayed mtime is epoch

hasSourceNewerVersion returned early when sourceMtimeNs <= 0, skipping the
source lookup, so a deleted entry with mtime 0 (a valid epoch timestamp) never
got the gone verdict and wedged on permanent retries. Always look up; gate only
the newer-mtime comparison on a valid replayed mtime.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-09 08:53:29 -07:00
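The nanosecond-precision comparison behind `getEntryMtimeNs` could be sketched like this. The `attrs` struct and its field names are assumptions for illustration; the real helper reads the filer entry's attributes.

```go
package main

import "fmt"

// attrs is a hypothetical view of an entry's modification time.
type attrs struct {
	mtimeSec int64 // whole seconds since the epoch
	mtimeNs  int32 // nanosecond remainder within that second
}

// mtimeNs combines seconds and nanoseconds into one int64; nil is treated
// as zero. int64 nanoseconds stay in range until roughly the year 2262.
func mtimeNs(a *attrs) int64 {
	if a == nil {
		return 0
	}
	return a.mtimeSec*1_000_000_000 + int64(a.mtimeNs)
}

// sourceIsNewer reports whether the source holds a strictly newer version
// than the replayed one, which is the condition for skipping the event.
func sourceIsNewer(source, replayed *attrs) bool {
	return mtimeNs(source) > mtimeNs(replayed)
}

func main() {
	replayed := &attrs{mtimeSec: 1780000000, mtimeNs: 100}
	source := &attrs{mtimeSec: 1780000000, mtimeNs: 200} // same second, later
	fmt.Println(sourceIsNewer(source, replayed), sourceIsNewer(replayed, source))
}
```

With second-grained comparison both versions above would look identical, which is the same-second ordering problem the later commits in this series address.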
7y-9 1cf92f6c2e fix(s3api): clear stale object lock years (#9890)
Problem: Re-storing object-lock default retention with Days left a previous Years extended attribute in place, so later loads could see both Days and stale Years.

Root cause: StoreObjectLockConfigurationInExtended only wrote period fields that were set on the new configuration and did not delete old Days or Years keys before writing the replacement rule.

Fix: Clear stored default-retention Days and Years keys before writing the current default retention period fields.

Reproduction: go test ./weed/s3api -run TestStoreObjectLockConfigurationClearsStaleYears -count=1 failed before the fix because the stale years key remained.

Validation: go test ./weed/s3api -run TestStoreObjectLockConfigurationClearsStaleYears -count=1; go test ./weed/s3api -count=1; git diff --check; git diff --cached --check

Co-authored-by: Codex <noreply@openai.com>
2026-06-09 00:48:38 -07:00
Chris Lu 7aba10fa1a fix(mongodb): merge URI auth fields with username/password override (#9889)
* fix(mongodb): merge URI auth fields with username/password override

SetAuth replaced the whole Credential parsed from the URI, dropping
AuthSource and AuthMechanism. Start from the URI-parsed Auth and only
override the username and password so credentials scoped to a specific
auth database keep working.

* fix(mongodb): set PasswordSet for explicit credentials

Required by GSSAPI auth when a password is supplied; ignored for other
mechanisms.
2026-06-09 00:18:33 -07:00
Chris Lu 2871e6552a fix(s3api): drop ancestor directory markers from prefixed ListObjectVersions (#9885)
processExplicitDirectory appended a directory-key object as a version
without checking it against the prefix. A versioned listing descends
through ancestor markers to reach a deeper prefix, so every ancestor
(Veeam/, Veeam/Backup/, ...) leaked into Versions even though none of
them match the prefix - which makes Veeam's immutable repository scan
abort on an unexpected key. Guard on the prefix so only keys at or under
it surface, matching ListObjectsV2 and AWS.
2026-06-09 00:01:06 -07:00
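The prefix guard described here amounts to a check like the one below. `filterVersions` is a hypothetical helper operating on plain key strings; the real `processExplicitDirectory` works on filer entries during the versioned listing walk.

```go
package main

import (
	"fmt"
	"strings"
)

// filterVersions keeps only the keys at or under the requested prefix, so
// ancestor directory markers visited on the way down are not reported.
func filterVersions(keys []string, prefix string) []string {
	var out []string
	for _, k := range keys {
		if strings.HasPrefix(k, prefix) {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	visited := []string{"Veeam/", "Veeam/Backup/", "Veeam/Backup/repo/", "Veeam/Backup/repo/a.vbk"}
	fmt.Println(filterVersions(visited, "Veeam/Backup/repo/"))
}
```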
7y-9 d569dd686f fix(shell): move files into existing destination directories (#9887)
* fix(shell): move files into existing destination directories

Problem: fs.mv /src/file /dst/dir treats an existing destination directory as a destination file path, so it renames the source to /dst/dir instead of moving it into /dst/dir/file.

Root cause: commandFsMv builds the destination LookupDirectoryEntryRequest with Directory and Name swapped, so the destination directory lookup misses.

Fix: Populate LookupDirectoryEntryRequest with Directory=destinationDir and Name=destinationName before deciding whether the destination is a directory.

Reproduction: env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/shell -run TestFsMvMovesIntoExistingDestinationDirectory -count=1

Validation: gofmt -w weed/shell/command_fs_mv.go weed/shell/command_fs_mv_test.go; git diff --check; git diff --cached --check; env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/shell -run TestFsMvMovesIntoExistingDestinationDirectory -count=1; env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/shell -count=1

* Update weed/shell/command_fs_mv_test.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-06-08 23:42:13 -07:00
Chris Lu 1c9039d3ac fix(seaweed-volume): stop EC shard deletion from phantom .dat on restart (#9874)
* fix(seaweed-volume): stop EC shard deletion from phantom .dat on restart

On startup load_existing_volumes() scans .vif/.idx entries (not just
.dat). For distributed EC, a volume's .vif can be mirrored onto a disk
whose .ecx lives on a sibling disk, so the per-disk ecx check is false
and the loader falls through to Volume::new, which always creates the
.dat if missing -> a phantom 8-byte superblock stub. The store-level
prune_incomplete_ec_with_sibling_dat then treats that stub as the
authoritative source and deletes the real EC shards on sibling disks. Go
guards the same case (disk_location.go: 'Without this guard NewVolume
below would create a phantom empty .dat') but only same-disk.

Fix A (root cause): in load_existing_volumes, don't create a .dat during
load. Skip the entry when there is no local .dat AND the .vif does not
reference remote files -- remote-tiered volumes have no local .dat but
must still load via the remote path. Uses the robust check_dat_file_exists
helper so a transient stat error doesn't skip a real volume. New volumes
go through create_volume(). Covers the cross-disk .vif/.ecx split Go's
same-disk hasEcxFile() misses.

Fix B (defense in depth, Go + Rust): when the EC .vif records no source
size (dat_file_size==0), require the sibling .dat to be strictly larger
than a bare superblock, so an empty 8-byte stub can't pass the
credibility gate. Previously it fell back to SUPER_BLOCK_SIZE, which an
8-byte stub exactly meets.

Adds regression tests reproducing the cross-disk lone-.vif phantom and
the 8-byte stub gate; updates an existing prune test to use a real
collection so its .ecx lookup matches the loaders.

* fix(storage): don't create phantom .dat from lone .vif on Go volume load

Mirror Fix A on the Go side. loadExistingVolume scans .vif/.idx entries,
and for distributed EC a .vif can be mirrored onto a disk whose .ecx is
on a sibling disk. The same-disk hasEcxFile() guard does not fire there,
so the loader falls through to NewVolume(createDatIfMissing=true) and
writes an 8-byte phantom .dat, which the sibling-.dat prune then uses to
delete the real EC shards on sibling disks. Skip the entry when there is
no local .dat AND the .vif has no remote file (via MaybeLoadVolumeInfo);
remote-tiered volumes have no local .dat but must still load.

Adds TestLoneVifDoesNotCreatePhantomDat (fails without the guard) and
TestRemoteTier_DiskScanLoadsRemoteOnlyVolume (fails if the guard skips a
remote-only volume).
2026-06-08 22:10:16 -07:00
7y-9 7bbd28634a fix(util): return full uint64 randomness (#9864)
Problem: RandomUint64 generated eight random bytes but returned int32, truncating the value before mount file and directory handles converted it to uint64. This reduced handle entropy to 32 bits and produced sign-extended handle values.

Root cause: the helper cast BytesToUint64 to int32 and exposed int32 as its return type.

Fix: make RandomUint64 return uint64 and return the full BytesToUint64 result.

Reproduction: go test ./weed/util -run TestRandomUint64ReturnsUint64 -count=1 failed before the fix because RandomUint64() had kind int32.

Validation: gofmt -w weed/util/bytes.go weed/util/bytes_test.go; git diff --check; go test ./weed/util -run TestRandomUint64ReturnsUint64 -count=1; go test ./weed/util -count=1; go test ./weed/mount -count=1; git diff --cached --check
2026-06-08 22:07:24 -07:00
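The truncation and sign extension this fix describes can be reproduced in a few lines. `randomUint64` and `truncateLikeOldHelper` are illustrative names, and using `crypto/rand` with big-endian decoding is an assumption, not necessarily what `weed/util` does.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// randomUint64 returns all 64 random bits.
func randomUint64() uint64 {
	var b [8]byte
	if _, err := rand.Read(b[:]); err != nil {
		panic(err)
	}
	return binary.BigEndian.Uint64(b[:])
}

// truncateLikeOldHelper mimics the old behavior: narrowing to int32 keeps
// only the low 32 bits, and widening back to uint64 sign-extends them.
func truncateLikeOldHelper(v uint64) uint64 {
	return uint64(int32(v))
}

func main() {
	fmt.Printf("%#x\n", randomUint64())
	fmt.Printf("%#x\n", truncateLikeOldHelper(0x0000000180000000))
}
```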
Chris Lu 3fadbef3eb feat(admin): export full cluster volume list as JSON (#9876)
Adds an "Export All (JSON)" button on the Cluster Volumes page that pulls
the whole cluster's volume list from the master in one call, a superset of
volume.list. Beyond the table columns it carries garbage and fullness
ratios, modified time, compact revision, remote tiering keys, per-disk
capacity counts, EC shard sizes with file/delete counts, and a cluster-wide
duplicate-volume-id scan. Honors the active collection filter. The existing
per-page CSV export stays as "Export Page".
2026-06-08 15:01:02 -07:00
Chris Lu ed470dccb1 mini: grow volumes one at a time
Mini auto-sizes a few large volume slots, but the master pre-grows 7
volumes per new collection. Under a filer group each S3 bucket is its
own collection, so the first buckets claimed every slot and later
writes failed to assign a volume. Cap mini's volume_growth copy counts
to 1.
2026-06-08 14:51:40 -07:00
Chris Lu d67fc48fbd fix(filer.sync): guard batched events against nil EventNotification (#9877)
* fix(filer.sync): guard batched events against nil EventNotification

The server folds a backlog into one response: the first event in the
top-level fields, the rest in resp.Events, and the pipelined sender can
drain an idle heartbeat (nil EventNotification) into that tail. Only the
envelope got the freshness-signal guard, so a batched heartbeat reached
AddSyncJob and nil-derefed in IsEmpty while replaying a backlog buffered
during a peer outage.

Route every event, envelope and batched, through one handler that sends
freshness signals (nil heartbeat, empty marker) to OnIdleHeartbeat.

* fix(filer): guard MetaAggregator batched events against nil EventNotification

The peer subscription's envelope is nil-guarded but its batched tail was
not. The aggregator doesn't enable idle heartbeats today, so the server
can't fold a nil EventNotification into the batch yet, but make the two
loops consistent so it can't nil-deref if that changes.
2026-06-08 13:56:16 -07:00
Chris Lu 4c050ad76b Don't mangle filer paths with the OS separator on Windows (#9878)
fix: don't mangle filer paths with the OS separator on Windows

filepath.Dir/Join use the platform separator, so on Windows they rewrite
a forward-slash filer path like /buckets/x into \buckets\x. The mangled
value then goes into a filer RPC and operates on the wrong key, so the
op silently targets nothing.

The admin file browser hit this in New Folder (the entry landed under
\buckets\my-bucket and never showed up under /buckets/my-bucket), and
the same way in delete, view and properties. MQ topic retention and
consumer-offset listing, and the SFTP home dir plus create-permission
parent lookup, had the same bug.

Switch all of these to the path package, which always uses "/".
2026-06-08 13:56:02 -07:00
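A short sketch of why the `path` package fits filer keys. `parentAndChild` is a hypothetical helper; the point is only that `path.Join` and `path.Dir` always use "/" on every platform, unlike their `path/filepath` counterparts.

```go
package main

import (
	"fmt"
	"path"
)

// parentAndChild builds a filer key with the slash-only path package, so
// the result is identical on Windows and on Unix-like systems.
func parentAndChild(dir, name string) (parent, full string) {
	full = path.Join(dir, name)
	return path.Dir(full), full
}

func main() {
	parent, full := parentAndChild("/buckets/my-bucket", "new-folder")
	fmt.Println(parent, full)
}
```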
Chris Lu 8cc10460b4 fix(remote): correct content and permissions when syncing/caching remote objects (#9879)
* fix(remote): reject short reads when caching remote objects

A short read from the remote (stale listing size, truncated or flaky
response) was silently zero-padded: the S3 and Azure clients pre-size
the buffer and discard the downloaded byte count, and the chunk is
recorded with the requested size. The cached file then matched the
expected size but its tail was NULL, and the entry was marked cached
so it never re-fetched.

Check the byte count against the requested size in both clients, and
add a backend-agnostic guard in FetchAndWriteNeedle. The cache now
fails loudly and the entry stays remote-only for a later retry.

* fix(remote): match S3 default modes when syncing remote metadata

Remote object listings carry no POSIX mode, so synced entries were
created with a hardcoded 0644. Against a SeaweedFS remote, whose S3
layer writes objects as 0660 and auto-creates directories as 0771
(0660|0111), the mounted copy ended up 0644/0755 and the permissions
visibly diverged from the source.

Default to the S3 modes instead (files 0660, directories 0771). The
filer derives parent-dir modes from the child as fileMode|0111, so
fixing the file default also brings the directories into line.

Directory mtimes still reflect sync time: S3 listings don't enumerate
directories, so the remote's directory timestamps aren't available.
2026-06-08 13:55:53 -07:00
Chris Lu 5a4ff2a122 fix(mq): don't cache topic non-existence on transient filer errors
TopicExists and getTopicConfFromCache negative-cached a topic for the full
30s TTL whenever a filer lookup failed for any reason, including timeouts.
A topic created earlier then looked gone until the TTL expired, and the
metadata auto-create path couldn't heal it (CreateTopic rejects an
already-persisted conf), so producers saw UNKNOWN_TOPIC_OR_PARTITION.

Only negative-cache on a definitive ErrNotFound; let transient errors fall
through and retry against the filer.
2026-06-08 12:04:48 -07:00
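The caching rule in this fix can be sketched as follows. `topicCache`, `errNotFound`, and `errTimeout` are hypothetical stand-ins for the real cache and filer errors, and the 30s TTL is omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errNotFound = errors.New("not found") // definitive: the topic does not exist
	errTimeout  = errors.New("timeout")   // transient: the filer was unreachable
)

// topicCache remembers topics that are known not to exist.
type topicCache struct {
	missing map[string]bool
}

func newTopicCache() *topicCache {
	return &topicCache{missing: make(map[string]bool)}
}

// exists consults the negative cache, then the lookup. Only a definitive
// not-found is cached; transient errors are returned so callers can retry.
func (c *topicCache) exists(topic string, lookup func(string) error) (bool, error) {
	if c.missing[topic] {
		return false, nil
	}
	err := lookup(topic)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, errNotFound):
		c.missing[topic] = true
		return false, nil
	default:
		return false, err
	}
}

func main() {
	c := newTopicCache()
	_, err := c.exists("orders", func(string) error { return errTimeout })
	ok, _ := c.exists("orders", func(string) error { return nil })
	fmt.Println(err, ok)
}
```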
7y-9 b408705f5b fix(s3api): accept HTTP-date conditionals (#9863)
* fix(s3api): accept HTTP-date conditionals

Problem: Object conditional headers rejected valid HTTP-date values in RFC850 or ANSIC format for If-Modified-Since and If-Unmodified-Since.

Root cause: parseConditionalHeaders used time.Parse(time.RFC1123), accepting only one HTTP-date representation instead of the standard formats accepted by net/http.ParseTime.

Fix: Parse conditional date headers with http.ParseTime so RFC1123, RFC850, and ANSIC HTTP-date forms are accepted.

Reproduction: go test ./weed/s3api -run TestParseConditionalHeadersAcceptsHTTPDateFormats -count=1 failed before the fix with ErrInvalidRequest for RFC850 and ANSIC date values.

Validation: env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/s3api -run TestParseConditionalHeadersAcceptsHTTPDateFormats -count=1; env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/s3api -count=1; git diff --check; git diff --cached --check

* fix(s3api): accept HTTP-date copy-source conditionals

Mirror the put-path http.ParseTime switch onto the copy-source If-Modified-Since / If-Unmodified-Since headers, which still rejected valid RFC850 and ANSIC dates.

* fix(s3api): keep RFC1123 UTC-zone dates working alongside http.ParseTime

http.ParseTime rejects the "UTC" zone that Go clients emit via t.UTC().Format(time.RFC1123), which the old RFC1123 parser accepted. Add a parseHTTPDate helper that tries http.ParseTime first and falls back to RFC1123, so the put and copy-source conditional date headers accept the union of HTTP-date formats plus the UTC zone.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-08 01:12:07 -07:00
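The `parseHTTPDate` helper described in the last commit of this series might look like the sketch below. The body is a guess at the shape from the commit message (try `http.ParseTime`, then fall back to `time.RFC1123`), not the actual implementation.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// parseHTTPDate accepts the three HTTP-date forms handled by
// http.ParseTime, then falls back to RFC1123 so a "UTC" zone also parses.
func parseHTTPDate(value string) (time.Time, error) {
	if t, err := http.ParseTime(value); err == nil {
		return t, nil
	}
	return time.Parse(time.RFC1123, value)
}

func main() {
	for _, v := range []string{
		"Mon, 08 Jun 2026 01:12:07 GMT",  // RFC1123 with GMT
		"Monday, 08-Jun-26 01:12:07 GMT", // RFC850
		"Mon Jun  8 01:12:07 2026",       // ANSIC
		"Mon, 08 Jun 2026 01:12:07 UTC",  // RFC1123 with the UTC zone
	} {
		t, err := parseHTTPDate(v)
		fmt.Println(t.UTC().Format(time.RFC3339), err)
	}
}
```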
Chris Lu 78da9572ae 4.32 2026-06-07 23:37:57 -07:00
Jaehoon Kim 1b5f1c1f3b feat(filer.backup): -initialSnapshot re-seeds a reinitialized destination (#9828)
* feat(filer.backup): add -resetCheckpoint to force a fresh sync

filer.backup resumes from a per-sink offset persisted in the source filer's KV.
There was no first-class way to discard that checkpoint and re-run from the
beginning short of guessing a large -timeAgo, which also skips -initialSnapshot.

Add -resetCheckpoint: before reading the offset, write 0 for this sink so
getOffset returns 0, isFreshSync stays true, and -initialSnapshot re-runs a full
walk. Effective only when -timeAgo is 0.

The flag is cleared after the first successful reset: runFilerBackup retries
doFilerBackup forever on error, so leaving it set would re-zero the checkpoint
on every retry and never make forward progress after a transient failure. Later
retries resume from the persisted checkpoint instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(filer.backup): keep fresh-sync intent when offset read fails after reset

After -resetCheckpoint writes offset 0, a transient getOffset read-back error
flipped isFreshSync to false, which skipped the -initialSnapshot walk the reset
explicitly requested. Track that the reset happened this iteration and, on a
getOffset error, preserve isFreshSync=true in that case (the non-reset path
keeps treating a read error as "not fresh" to avoid re-walking on transients).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(filer.backup): skip offset read-back on reset instead of tracking a flag

Replace the didReset bool by branching: on -resetCheckpoint, clear the offset and
start fresh without reading it back (we just wrote 0, so the state is known);
otherwise read the offset as before. This drops the redundant getOffset RPC after
a reset and removes the read-back error case entirely, so no separate flag is
needed to preserve isFreshSync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* filer.backup: -initialSnapshot re-seeds on every start; drop -resetCheckpoint

-initialSnapshot now walks the live tree whenever -timeAgo is 0, seeds the
destination, and overwrites the saved checkpoint, rather than running only on a
fresh sync. That re-seeds a reinitialized destination on its own, so the
separate -resetCheckpoint flag is gone.

The walk runs once per process: the in-memory flag is cleared after the
watermark is persisted, so the retry loop resumes from the persisted checkpoint
instead of re-walking on every transient error. A process restart re-walks, so
remove the flag once the backup is caught up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-07 23:35:53 -07:00
Chris Lu 8a4fdf06c0 admin/maintenance: reload in-flight tasks on startup instead of discarding them (#9857)
* admin/maintenance: reload in-flight tasks on startup instead of discarding

LoadTasksFromPersistence deleted all persisted task files on startup and
relied on the scanner to re-detect, so saved task state was never consumed
— the persistence was effectively write-only. Reload non-terminal tasks
(pending/assigned/in_progress) into the queue, resetting in-flight ones to
pending since their worker is gone after a restart (maintenance tasks are
idempotent). Terminal task files are dropped; the scanner still backfills
anything not persisted.
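
A minimal sketch of the reload rule, assuming hypothetical `Task` and `TaskStatus` types; the terminal status names and the `Worker` field are invented for illustration and are not taken from the admin code.

```go
package main

import "fmt"

type TaskStatus string

const (
	StatusPending    TaskStatus = "pending"
	StatusAssigned   TaskStatus = "assigned"
	StatusInProgress TaskStatus = "in_progress"
	StatusCompleted  TaskStatus = "completed" // hypothetical terminal state
	StatusFailed     TaskStatus = "failed"    // hypothetical terminal state
)

// Task is a hypothetical stand-in for the persisted maintenance task.
type Task struct {
	ID     string
	Status TaskStatus
	Worker string
}

// reloadTasks returns the tasks to put back on the queue after a restart.
// Pending tasks are kept, in-flight tasks are reset to pending because
// their worker is gone, and terminal tasks are dropped.
func reloadTasks(persisted []*Task) []*Task {
	var queue []*Task
	for _, task := range persisted {
		if task == nil {
			continue // corrupted state entry
		}
		switch task.Status {
		case StatusPending:
			queue = append(queue, task)
		case StatusAssigned, StatusInProgress:
			task.Status = StatusPending
			task.Worker = ""
			queue = append(queue, task)
		}
	}
	return queue
}

func main() {
	queue := reloadTasks([]*Task{
		{ID: "a", Status: StatusInProgress, Worker: "w1"},
		{ID: "b", Status: StatusCompleted},
	})
	for _, task := range queue {
		fmt.Println(task.ID, task.Status)
	}
}
```

Resetting in-flight work to pending is only safe because the tasks are idempotent, as the message notes.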

* address review: nil-guard reloaded tasks and SyncTask to ActiveTopology

- skip nil entries from LoadAllTaskStates (corrupted state)
- re-sync restored tasks with MaintenanceIntegration so ActiveTopology
  (in-memory, empty on startup) knows about them; otherwise GetNextTask's
  AssignTask rejects them as unknown and they never get assigned
2026-06-07 22:45:38 -07:00
Chris Lu 7c542128c7 vacuum: compact a read-only volume when an explicit volumeId is given (#9861)
* vacuum: compact a read-only volume when an explicit volumeId is given

The on-demand path no longer skips read-only volumes, so an operator can
reclaim a benignly read-only (full/oversized) volume without marking it
writable first. The background scan and all-volumes sweep still skip
read-only, where the flag usually signals an unhealthy disk.

* vacuum: copy locationList under lock for on-demand vacuum

The volumeId>0 path passed the live vid2location entry into the async
vacuum, where heartbeat-driven Register/UnRegister can mutate the slice
concurrently. Snapshot it under accessLock, matching the sweep path.
2026-06-07 22:42:51 -07:00
Chris Lu a549580e65 ec.balance: verify shard landed on destination before deleting the source (#9858)
* ec.balance: verify shard(s) landed on the destination before deleting source

The EC balance task copied/mounted a shard to the destination and then
immediately unmounted+deleted it from the source, reporting success as soon
as the RPCs returned. A copy/mount can return OK while the shard isn't
actually registered/loadable on the destination, so deleting the source
then loses the shard (and the scanner re-issues the same move every cycle).

Add a verification step (VolumeEcShardsInfo via VerifyShardsAcrossServers,
the same check the EC encode task uses before deleting originals): if the
destination doesn't report every moved shard, fail the task and keep the
source so the move is retried instead of losing data.
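
The verify-before-delete ordering might look like the sketch below. The destination inventory is modeled as a plain map and `deleteSource` as a callback; both are assumptions for illustration and do not reflect the real `VerifyShardsAcrossServers` signature.

```go
package main

import (
	"errors"
	"fmt"
)

// missingShards returns the moved shard ids that the destination does not
// report. The comma-ok read distinguishes "absent" from "present but false".
func missingShards(moved []int, reported map[int]bool) []int {
	var missing []int
	for _, id := range moved {
		if present, ok := reported[id]; !ok || !present {
			missing = append(missing, id)
		}
	}
	return missing
}

// finishMove deletes the source copy only after every moved shard is
// confirmed on the destination; otherwise it fails and keeps the source.
func finishMove(moved []int, reported map[int]bool, deleteSource func() error) error {
	if missing := missingShards(moved, reported); len(missing) > 0 {
		return fmt.Errorf("destination is missing shards %v, keeping source", missing)
	}
	return deleteSource()
}

func main() {
	err := finishMove([]int{1, 2}, map[int]bool{1: true}, func() error {
		return errors.New("should not be reached")
	})
	fmt.Println(err)
}
```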

* address review: use comma-ok when reading destination shard inventory
2026-06-07 21:31:53 -07:00
7y-9 e6ab9e7b09 fix(s3api): reject zero default retention years (#9860)
Problem: Default object-lock retention accepted an explicitly provided Years value of zero, even though a default retention period must be positive when present.

Root cause: validateDefaultRetention rejected zero Days but only rejected negative Years, leaving YearsSet with Years=0 as a successful validation path.

Fix: Treat an explicitly provided zero Years value as ErrInvalidRetentionPeriod, matching the existing Days validation.
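
A simplified sketch of the corrected check. The `DefaultRetention` struct and the `DaysSet` field are assumptions made for this example; the real validator likely performs further checks that are omitted here.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrInvalidRetentionPeriod = errors.New("invalid retention period")

// DefaultRetention is a hypothetical, reduced model of the parsed rule.
type DefaultRetention struct {
	Days, Years       int
	DaysSet, YearsSet bool
}

// validateDefaultRetention rejects any explicitly provided period that is
// not positive. Before the fix, Years was only rejected when negative.
func validateDefaultRetention(r DefaultRetention) error {
	if r.DaysSet && r.Days <= 0 {
		return ErrInvalidRetentionPeriod
	}
	if r.YearsSet && r.Years <= 0 {
		return ErrInvalidRetentionPeriod
	}
	return nil
}

func main() {
	fmt.Println(validateDefaultRetention(DefaultRetention{YearsSet: true, Years: 0}))
	fmt.Println(validateDefaultRetention(DefaultRetention{YearsSet: true, Years: 1}))
}
```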

Reproduction: go test ./weed/s3api -run TestValidateDefaultRetention -count=1 failed before the fix because the Zero years case returned nil.

Validation: go test ./weed/s3api -run TestValidateDefaultRetention -count=1; go test ./weed/s3api -count=1; git diff --check; git diff --cached --check
2026-06-07 20:53:45 -07:00
Chris Lu f9d3105e80 ec placement: spread EC shards evenly across machines, not onto the lowest-id one (#9855)
* ec placement: steer shards to less-loaded machines, not the lowest id

EC encode places every volume against one shared topology snapshot (it reserves the
shards it assigns so later volumes see reduced capacity), but node selection ranked
only by this volume's shard count and broke ties by sorted id. So the lowest-id
machine won the first shard of every volume and accumulated far more total shards
than the rest -- on a 6-machine cluster the first machines drifted to ~1.5x.

Rank eligible nodes by the machine's shards of this volume, then the machine's free
capacity, then the node's shards of this volume, then the node's free capacity. Free
capacity reflects the load already placed, so ties steer toward the least-loaded
machine instead of the lowest id, keeping total EC shards even across machines.
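
The four-level ranking can be sketched as a multi-key sort. The `candidate` struct and its field names are hypothetical, and the final tie-break on id is an assumption added only to keep the order deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical view of one eligible node.
type candidate struct {
	ID               string
	MachineVolShards int // shards of this volume on the node's machine
	MachineFree      int // free capacity of the machine
	NodeVolShards    int // shards of this volume on the node
	NodeFree         int // free capacity of the node
}

// rankCandidates orders nodes best-first: fewer shards of the volume on the
// machine, then more free machine capacity, then the same two keys per node.
func rankCandidates(cs []candidate) {
	sort.SliceStable(cs, func(i, j int) bool {
		a, b := cs[i], cs[j]
		if a.MachineVolShards != b.MachineVolShards {
			return a.MachineVolShards < b.MachineVolShards
		}
		if a.MachineFree != b.MachineFree {
			return a.MachineFree > b.MachineFree
		}
		if a.NodeVolShards != b.NodeVolShards {
			return a.NodeVolShards < b.NodeVolShards
		}
		if a.NodeFree != b.NodeFree {
			return a.NodeFree > b.NodeFree
		}
		return a.ID < b.ID // assumed last-resort tie-break
	})
}

func main() {
	cs := []candidate{
		{ID: "10.0.0.1:8080", MachineFree: 2, NodeFree: 2},
		{ID: "10.0.0.2:8080", MachineFree: 5, NodeFree: 5},
	}
	rankCandidates(cs)
	fmt.Println(cs[0].ID)
}
```

Because free capacity shrinks as shards are reserved against the shared snapshot, the second key changes between volumes, which is what stops the lowest id from winning every tie.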

* test: ec.balance converges to even per-machine load from a skew

Starts machine 10.0.0.1 at 4 shards/volume and the rest at 2, then runs repeated worker-style capped passes; asserts convergence to an even per-machine total (reaches exactly even in ~13 rounds).

* reduce comments on the placement fix

Trim narration to the non-obvious why.

* test: assert convergence and count zero-shard machines

Seed the per-machine map with every host so a fully drained machine still registers, and fail explicitly if balance doesn't converge before the round cap.
2026-06-07 20:45:17 -07:00
Chris Lu 89cbb1c558 admin: default -dataDir to "." so maintenance task state persists across restarts (#9856)
admin: default -dataDir to "." so maintenance task state persists

Previously -dataDir defaulted to empty, so the admin ran maintenance in
memory only: task state was never saved and maintenance tasks (notably EC
balance/rebuild) were re-issued every scan cycle without converging,
churning EC shards (moves landed shards without their .ecx index, leaving
EC volumes unloadable/missing shards).

Default -dataDir to "." (the process working directory, which under the
standard systemd unit is the admin's data dir) so state persists out of
the box.
2026-06-07 20:45:03 -07:00
Chris Lu f0d2a0d417 Treat co-located volume servers as one fault domain when balancing and allocating (#9854)
* admin/topology: carry the volume server address on DiskInfo

The planning DiskInfo exposed only the node id, which can be an opaque label rather than ip:port. Record the address too so callers can resolve the physical machine a disk sits on.

* ec.balance: spread a volume's shards across machines, not just nodes

Volume servers sharing a host are one fault domain, but the within-rack spread treated them as independent nodes, so one box could end up holding more shards of a volume than EC can afford to lose. Add a machine (host) tier between rack and node: the within-rack pass spreads each volume across machines, and the global load phase no longer re-concentrates a volume onto a machine it already sits on. Host defaults to the node id, so clusters with one server per host are unchanged.

* ec placement: prefer machines holding fewer of a volume's shards

EC allocation and repair picked the least-loaded node in a rack with no regard for which physical machine it sits on, so a volume's shards could pile onto several servers of one box. Rank candidate nodes by their machine's shard count first, then the node's own. The machine is derived from the volume server address carried on DiskInfo, falling back to the node id, matching how the balancer resolves it.

* volume.balance: don't move a replica onto a machine already holding one

isGoodMove only rejected a move onto the same data node, so two replicas could land on two volume servers of one box and a single machine failure would lose both. Reject a target whose host already holds another replica of the volume. Best-effort: balancing simply skips and tries the next target.

* volume allocation: spread same-rack replicas across machines

PickNodesByWeight filled the same-rack replica picks by weight alone, so replicas could co-locate on one box. Prefer candidates on not-yet-used hosts, falling back when too few distinct machines exist. Data-center and rack tiers have no host, so their ordering is unchanged.

* ec.balance: harden machine spread against re-concentration and capped machines

Two cases where the machine-aware spread could still leave a volume badly placed:

- The global load phase could move a shard of a volume onto a machine that
  already held it, raising that machine's count and undoing the within-rack
  spread (a 4/4/3/3 layout could become 3/5/3/3, past parity for 10+4). Limit
  the load-only fallback to same-machine moves, which leave a machine's count
  unchanged; cross-machine concentration is no longer allowed for load alone.

- The within-rack spread chose a destination machine by free slots alone, so if
  that machine's only nodes were already at the SameRackCount cap it skipped the
  move instead of trying another machine. Require a machine to have a node that
  can actually take the shard before selecting it.

* reduce comments across the machine-affinity change

Trim narration down to the non-obvious why; one terse line where a block was overkill.

* ec.balance: gate machine spread on fault-tolerance feasibility

Spreading a volume evenly across machines only helps when there are enough that
each can stay within EC's parity tolerance (numMachines >= ceil(total/parity)).
With fewer -- or wildly unequal -- machines it can't make a machine loss
survivable anyway, and forcing it fights capacity: e.g. a cluster of 12 volume
servers on one host and 2 on another would have half of every volume crammed onto
the 2-server box. So spread across machines only when it's achievable; otherwise
fall back to per-node spread and let capacity/global balancing decide.

The global load phase applies the same test: it protects a volume's machine spread
(no cross-machine move that raises a machine's count past the source's) only where
that spread is achievable, so heterogeneous clusters still level by fullness.
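
The feasibility condition `numMachines >= ceil(total/parity)` works out as below. The function name is hypothetical; the arithmetic is the test stated in the message.

```go
package main

import "fmt"

// ceilDiv returns ceil(a/b) for positive integers.
func ceilDiv(a, b int) int { return (a + b - 1) / b }

// machineSpreadFeasible reports whether shards can be spread so that no
// machine holds more than parityShards of them. The name is hypothetical.
func machineSpreadFeasible(numMachines, shards, parityShards int) bool {
	if parityShards <= 0 {
		return false
	}
	return numMachines >= ceilDiv(shards, parityShards)
}

func main() {
	// A 10+4 volume has 14 shards and tolerates the loss of 4.
	fmt.Println(machineSpreadFeasible(2, 14, 4)) // needs ceil(14/4) = 4 machines
	fmt.Println(machineSpreadFeasible(4, 14, 4))
}
```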

* ec.balance worker: group servers by host when planning

The worker built its planner topology without recording each server's host, so
automated ec.balance treated ports on one machine as independent nodes and could
concentrate a volume's shards on one physical box. Set the host from the volume
server address, matching the shell path.

* volume.balance worker: don't move a replica onto a machine holding one

The worker compared only node ids, and the replica map dropped the server address,
so it could move replicas onto different ports of one machine. Carry the host on
ReplicaLocation (from the server address) and reject a target whose host already
holds another replica of the volume. Best-effort, matching the shell.

* ec.balance: judge machine-spread feasibility by the rack's shards

The within-rack and global feasibility checks compared the whole volume's shard
count against a rack's machine count, so a rack holding only part of a volume after
cross-rack spreading -- e.g. 7 of a 10+4 volume across 2 machines -- was wrongly
judged infeasible and fell back to node spread, which could pile 6 shards onto one
host, past parity. Gate on the rack's own shard count of the volume instead.

* ec.balance: spread a volume's shards across machines by combined count

EC recovers from any loss within parity regardless of shard type, so what bounds a
machine's exposure is its total shards of the volume, not data and parity
separately. Spreading the two independently let each type's remainder land on the
same machine -- ceil(d/M)+ceil(p/M) can exceed ceil(total/M), e.g. a 5/3 split where
4/4 was achievable, past parity. Balance the combined count in one pass; disk-level
data/parity anti-affinity stays in pickBestDiskOnNode.
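
A worked instance of the inequality above, using illustrative numbers (5 data and 3 parity shards of one volume in a rack with 2 machines) chosen to reproduce the 5/3 versus 4/4 outcome; the function names are hypothetical.

```go
package main

import "fmt"

func ceilDiv(a, b int) int { return (a + b - 1) / b }

// worstSeparate is the largest per-machine count when data and parity are
// spread independently and both remainders land on the same machine.
func worstSeparate(data, parity, machines int) int {
	return ceilDiv(data, machines) + ceilDiv(parity, machines)
}

// worstCombined is the largest per-machine count when the combined shard
// count is balanced in one pass.
func worstCombined(data, parity, machines int) int {
	return ceilDiv(data+parity, machines)
}

func main() {
	fmt.Println(worstSeparate(5, 3, 2)) // ceil(5/2) + ceil(3/2) = 3 + 2
	fmt.Println(worstCombined(5, 3, 2)) // ceil(8/2)
}
```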

* ec.balance: don't let the imbalance threshold skip an over-parity machine

The within-rack spread gated on relative skew ((max-min)/avg > threshold), so a
worker threshold of 0.5 skipped an exactly-50%-skewed layout like 5/4/3 for a 10+4
volume, leaving 5 shards -- past parity -- on one machine. The even cap
(ceil(shards/groups)) is the real bound and the move loop already sheds only what
exceeds it, so drop the threshold gate from the within-rack phase (machine and node):
a balanced rack stays a no-op while any over-cap machine is always fixed.
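
The 5/4/3 case can be checked numerically. This sketch assumes per-machine shard counts arrive as a slice; the helper names are hypothetical.

```go
package main

import "fmt"

// relativeSkew computes (max-min)/avg over per-machine shard counts.
func relativeSkew(counts []int) float64 {
	if len(counts) == 0 {
		return 0
	}
	lo, hi, sum := counts[0], counts[0], 0
	for _, c := range counts {
		if c < lo {
			lo = c
		}
		if c > hi {
			hi = c
		}
		sum += c
	}
	return float64(hi-lo) / (float64(sum) / float64(len(counts)))
}

// overCap returns the indexes of machines holding more than the even cap,
// ceil(shards/groups).
func overCap(counts []int) []int {
	if len(counts) == 0 {
		return nil
	}
	sum := 0
	for _, c := range counts {
		sum += c
	}
	limit := (sum + len(counts) - 1) / len(counts)
	var over []int
	for i, c := range counts {
		if c > limit {
			over = append(over, i)
		}
	}
	return over
}

func main() {
	counts := []int{5, 4, 3}
	fmt.Println(relativeSkew(counts) > 0.5) // the threshold gate does not fire
	fmt.Println(overCap(counts))            // the cap still flags a machine
}
```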

* ec.balance: keep the imbalance threshold for the node fallback

Dropping the threshold from the whole within-rack phase made the node fallback too
eager: it runs only when machine fault tolerance is unachievable, so it is cosmetic
load distribution that should defer to the global utilization phase. Without the
gate it would, for a one-server-per-host 6/4 split at threshold 0.5, schedule a count
move that worsens utilization balance. Restore the threshold there; machine spreading
keeps bypassing it, since that bound is durability, not cosmetic skew.
2026-06-07 14:14:45 -07:00
7y-9 25f36cd13d fix(s3api): require space in v2 auth prefix (#9852)
* fix(s3api): require space in v2 auth prefix

Problem: Signature V2 Authorization headers with a malformed algorithm token such as AWSX... are accepted as if they were AWS ... headers.

Root cause: validateV2AuthHeader checks HasPrefix("AWS") but then slices past an assumed trailing space, so an extra character after AWS is skipped and the rest is parsed as credentials.

Fix: Require the Authorization header to start with the exact AWS plus space prefix before parsing fields.
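
A minimal sketch of the stricter prefix check, assuming the header has the form `AWS AccessKey:Signature`. The function `parseV2AuthHeader` is hypothetical and much smaller than the real `validateV2AuthHeader`.

```go
package main

import (
	"fmt"
	"strings"
)

// parseV2AuthHeader splits a Signature V2 Authorization header into access
// key and signature. It requires the exact "AWS " prefix, so a token such
// as "AWSX..." is rejected instead of having one character skipped.
func parseV2AuthHeader(header string) (accessKey, signature string, ok bool) {
	const prefix = "AWS "
	if !strings.HasPrefix(header, prefix) {
		return "", "", false
	}
	key, sig, found := strings.Cut(header[len(prefix):], ":")
	if !found || key == "" || sig == "" {
		return "", "", false
	}
	return key, sig, true
}

func main() {
	fmt.Println(parseV2AuthHeader("AWS AKIAEXAMPLE:c2lnbmF0dXJl"))
	fmt.Println(parseV2AuthHeader("AWSXAKIAEXAMPLE:c2lnbmF0dXJl"))
}
```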

Reproduction: go test ./weed/s3api -run 'TestValidateV2AuthHeader/algorithm_prefix_without_space|TestDoesSignV2Match/malformed_auth_-_no_space_after_AWS' -count=1 fails before the fix because AWSXAKIA... is accepted.

Validation: go test ./weed/s3api -run 'TestValidateV2AuthHeader/algorithm_prefix_without_space|TestDoesSignV2Match/malformed_auth_-_no_space_after_AWS' -count=1; go test ./weed/s3api -count=1; git diff --check; git diff --cached --check

* Update weed/s3api/auth_signature_v2.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-06-07 11:52:09 -07:00
7y-9 99bb5db1e3 fix(needle): use discovered file content type (#9851)
Problem: Multipart uploads where the first part was a form field and a later part contained the file used the first part's Content-Type for the file metadata.

Root cause: After finding a later part with a filename, parseUpload copied data and MD5 from part2 but read Content-Type from the original part variable.

Fix: Read Content-Type from the discovered file part.

Reproduction: go test ./weed/storage/needle -run TestParseUploadUsesDiscoveredFilePartContentType -count=1 failed before the fix because the parsed MIME type was text/plain instead of application/x-seaweed-test.

Validation: go test ./weed/storage/needle -run TestParseUploadUsesDiscoveredFilePartContentType -count=1; go test ./weed/storage/needle -count=1; git diff --check; git diff --cached --check
2026-06-07 11:50:34 -07:00
Chris Lu 058569c77b operation: index VidCache by map instead of slice (#9853)
VidCache.cache was a []VidInfo indexed directly by volume id, so caching
one volume with a large id grew the backing array to that many entries
(each 48 bytes), allocating a zeroed slot for every unused id below it. A
single id of 32M cost ~1.5GB resident, plus geometric realloc churn as the
append loop doubled the array.

Use map[uint32]VidInfo so memory scales with the number of volumes actually
cached rather than the largest id seen. Parse ids with ParseUint(.,32) so
values outside the uint32 volume-id range are rejected instead of silently
wrapping into a key.
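
A sketch of the map-backed cache, assuming base-10 ids and a reduced `VidInfo`; the method names and the mutex are assumptions, not the actual `operation` package API.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

// VidInfo is a reduced, hypothetical stand-in for the cached value.
type VidInfo struct {
	Locations []string
}

// VidCache keys entries by volume id, so memory tracks the number of cached
// volumes rather than the largest id seen.
type VidCache struct {
	mu    sync.RWMutex
	cache map[uint32]VidInfo
}

func NewVidCache() *VidCache {
	return &VidCache{cache: make(map[uint32]VidInfo)}
}

// parseVid rejects ids outside the uint32 range instead of wrapping them.
func parseVid(vid string) (uint32, error) {
	id, err := strconv.ParseUint(vid, 10, 32)
	if err != nil {
		return 0, err
	}
	return uint32(id), nil
}

func (vc *VidCache) Set(vid string, info VidInfo) error {
	id, err := parseVid(vid)
	if err != nil {
		return err
	}
	vc.mu.Lock()
	defer vc.mu.Unlock()
	vc.cache[id] = info
	return nil
}

func (vc *VidCache) Get(vid string) (VidInfo, bool) {
	id, err := parseVid(vid)
	if err != nil {
		return VidInfo{}, false
	}
	vc.mu.RLock()
	defer vc.mu.RUnlock()
	info, ok := vc.cache[id]
	return info, ok
}

func (vc *VidCache) Len() int {
	vc.mu.RLock()
	defer vc.mu.RUnlock()
	return len(vc.cache)
}

func main() {
	vc := NewVidCache()
	fmt.Println(vc.Set("33554432", VidInfo{Locations: []string{"10.0.0.1:8080"}}), vc.Len())
	fmt.Println(vc.Set("4294967296", VidInfo{}))
}
```

With the slice layout, the first `Set` above would have allocated about 32M slots of 48 bytes each; here it costs one map entry.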
2026-06-07 11:46:57 -07:00
Chris Lu 755af4adf4 s3: actually bind outbound connections when -ip.bind is set (#9849)
* s3: set outbound bind IP before the first filer dial

Standalone weed s3 dialed the filer for GetFilerConfiguration before
SetOutboundLocalIP ran, so that gRPC conn was created with the stock
dialer and no source address. gRPC caches conns by address and reuses
the original dialer on reconnect, so the s3->filer connection kept
leaving from the OS-chosen source for the life of the process even
after the bind IP was set a moment later.

* grpc: install the outbound-bind dialer unconditionally

The dialer was installed only when OutboundLocalAddr was already set at
GrpcDial time, baking the source-address decision into the cached conn,
so a conn dialed before the bind IP was configured never bound.

Install the context dialer always and decide per dial: bind through
OutboundDialContext once a source is set, otherwise fall back to the
stock net.Dialer so default deployments keep gRPC's dial timeout and
keepalive behavior. The bind now applies on the next reconnect
regardless of ordering, matching the HTTP transport's unconditional
DialContext.
2026-06-07 10:20:58 -07:00
Chris Lu 0e9fc6c5ba worker: drop ec.balance from the default admin script (#9848)
The dedicated ec_balance task worker handles EC shard balancing now,
so the periodic admin script no longer needs to run it.
2026-06-07 00:55:11 -07:00
Chris Lu b2127c86f4 admin: show S3 servers under Cluster (#9847)
* s3: register data center with master on startup

* admin: show S3 servers under Cluster

* admin: add S3 servers to the dashboard
2026-06-07 00:32:20 -07:00
Chris Lu d321f9efb4 s3: collapse suspended-versioning deletes onto one null marker (#9845)
A suspended-versioning DELETE was recorded with createDeleteMarker, which mints a
fresh real version id each time, so repeated suspended deletes piled up delete
markers instead of overwriting a single null marker as S3 specifies. Record the
suspended delete as a 'null' marker with a fixed file name (v_null) and point the
latest-version pointer at it explicitly; putSuspendedVersioningObject's existing
null-version cleanup removes it on the next suspended PUT, so the object undeletes
cleanly and at most one null marker exists. Enabled-versioning deletes are
unchanged (still distinct historical markers).

Update TestSuspendedVersioningDeleteBehavior to the AWS-correct counts: one null
marker after a suspended delete, and the null marker plus one real marker after a
re-enabled delete.
2026-06-06 20:49:38 -07:00
Chris Lu 309cb32416 s3: list directory key objects in versioned bucket version listings (#9842)
ListObjectVersions gated explicit directory objects on Mime ==
FolderMimeType, but an SDK PutObject of "dir/" carries a default
Content-Type (e.g. application/octet-stream), so those directory keys
were dropped from the version listing while ListObjectsV2 - which keys
off IsDirectoryKeyObject (any non-empty mime) - still showed them. Use
the same IsDirectoryKeyObject check so the two listings agree.

The directory test's storage-class assertion compared an ObjectStorageClass
constant against ObjectVersion.StorageClass (ObjectVersionStorageClass);
the values matched but the SDK enum types did not, so it only surfaced
once the directories started appearing. Use the matching constant.
2026-06-06 18:02:33 -07:00
Chris Lu 6c1fd3aeab s3: rescan .versions when the cached latest pointer is missing on a list (#9841)
* s3: rescan .versions when the cached latest pointer is missing on a list

ListObjectsV2 resolves each versioned object's current version from the
latest-version pointer cached on the .versions directory entry. When that
pointer is absent on the filer serving the list, the object was dropped
from the listing. Fall back to a read-only rescan of .versions/ to pick
the newest version - the version files are present locally even when the
cached pointer is not - so the object still lists. This mirrors the read
path's recoverLatestVersionWithoutPointer; the scan loop is shared.

Read-only by design: a list can touch many objects, so it does not persist
a pointer.

* s3: copy scanned Extended before stamping the version id
2026-06-06 18:02:30 -07:00
Chris Lu 9ede92a7cc filer: replicate RECOMPUTE_LATEST pointer updates to peers (#9840)
applyRecomputeLatest wrote the .versions latest-version pointer and the
demoted prior version's stamp through UpdateEntry without a following
NotifyUpdateEvent, so neither change entered the metadata log. Across
filers the pointer then lived only on whichever filer ran the mutation,
and ListObjects served by any other filer dropped those objects from a
versioned bucket. Emit the events the way PATCH_EXTENDED already does,
keeping a pre-update image for the notification diff.
2026-06-06 18:02:28 -07:00
Chris Lu 6e16994615 s3: make lifecycle TTL fast path per-bucket opt-in (#9825)
Stamping an Expiration.Days rule as a volume TTL at write time bakes an
irreversible TTL into the object: removing or lengthening the rule later
can't un-expire it, unlike worker-driven expiration. The metadata-only
delete it enables also skips per-chunk DeleteFile, so dead bytes linger in
a not-yet-expired TTL volume with no deleted-byte accounting until the
whole volume ages out.

Gate the resolver on a per-bucket flag, off by default; toggle with the
s3.bucket.lifecycle.fastpath shell command. Default writes take the worker
path: real deletes that honor current policy and let vacuum reclaim space.
2026-06-06 11:20:15 -07:00
Chris Lu be7f417a03 ip.bind: bind outbound connections to the configured address (#9834)
* ip.bind: bind outbound connections to the configured address

-ip.bind only governed listeners; outbound gRPC and HTTP connections let
the OS pick the source IP, which may not even be able to reach the
target. Mirror the bind address into a process-global source address and
apply it to outbound TCP dials: the gRPC context dialer, the per-client
HTTP transports, and the default transport. Loopback targets and unix
sockets keep the OS-chosen source so same-host traffic still works.

* ip.bind: first-write-wins source IP, skip on address-family mismatch

Make SetOutboundLocalIP first-write-wins so a `weed server` component's own
bind setting (run in its goroutine) can't clobber the process-wide source
address the top-level -ip.bind already established for the other components.

Skip source binding when the target is a literal IP of a different family
than the bind address, since forcing a mismatched source fails the dial.
2026-06-05 12:44:21 -07:00
Nguyễn Lộc Phúc 7f15a9fed4 fix(s3api): standardize ETag calculation in copy handlers (#9829)
* fix(s3api): standardize ETag calculation across S3 API handlers

* s3: make copyEntryETag nil-safe

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-05 12:41:18 -07:00
Chris Lu 6bd0091c72 master: grow rack-spanning volumes once per DC, capped at copy_N (#9835)
* master: grow rack-spanning volumes once per DC, capped at copy_N

The periodic rack-aware growth scan grew once per rack. For rack-spanning
replication (DiffRackCount > 0) a single logical volume already covers every
rack the placement needs, so a crowded volume made every rack report
should-grow and the scan created racks x step volumes, far more than needed:
with "010" across two racks that is 2 racks x step 2 = 4 logical (8 physical) volumes.

Plan one DC-wide grow for rack-spanning replication, and cap the per-event
step at master.volume_growth.copy_N so lowering it reduces periodic growth.

* master: distribute lastGrowCount evenly across uneven DCs

The non-rack-spanning grow divisor used the current DC's rack count, so DCs
with different rack counts each over-grew. Sum every rack up front and divide
lastGrowCount by that global count instead.
2026-06-05 12:39:59 -07:00
Chris Lu ab7be7867d security: hot-reload JWT signing keys on SIGHUP (#9826)
* security: reload JWT signing keys on SIGHUP

Signing keys were read once in the server constructors and never
refreshed. After a key rotation (Secret update, divergent reads) the
in-memory key stayed stale and every request kept failing "wrong jwt"
until the affected process was restarted.

Add Guard.UpdateSigningKeys and call it from the master, volume and
filer reload paths and the s3 reload hook, next to the existing
whitelist refresh. Make the global chunk-read JWT cache reloadable via
an atomic swap, and register the master's Reload with grace.OnReload --
it was never wired, so the master ignored SIGHUP entirely.

Mirror the same refresh in the Rust volume server's SIGHUP handler.

* security: swap signing keys behind an atomic pointer

Addresses review feedback on the in-place key swap: SigningKey is a
[]byte, so reassigning the Guard fields while a request handler reads
them is a data race that can tear the multi-word slice header and read
out of bounds.

Hold the four signing-key fields in an immutable signingConfig snapshot
behind atomic.Pointer; UpdateSigningKeys swaps the whole pointer, so a
reader sees either the old keys or the new ones. Reads go through new
SigningKey/ExpiresAfterSec/ReadSigningKey/ReadExpiresAfterSec accessors.
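
A minimal sketch of the snapshot pattern, requiring Go 1.19 or later for `atomic.Pointer`. The type and accessor names follow the message; the constructor, field names, and parameter lists are assumptions.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// signingConfig is immutable once published; a reload builds a new one.
type signingConfig struct {
	signingKey          []byte
	expiresAfterSec     int
	readSigningKey      []byte
	readExpiresAfterSec int
}

type Guard struct {
	signing atomic.Pointer[signingConfig]
}

func NewGuard(key []byte, expSec int, readKey []byte, readExpSec int) *Guard {
	g := &Guard{}
	g.UpdateSigningKeys(key, expSec, readKey, readExpSec)
	return g
}

// UpdateSigningKeys publishes a fresh snapshot with one pointer store, so a
// concurrent reader sees either all old values or all new ones.
func (g *Guard) UpdateSigningKeys(key []byte, expSec int, readKey []byte, readExpSec int) {
	g.signing.Store(&signingConfig{
		signingKey:          append([]byte(nil), key...), // copy: caller may reuse its slice
		expiresAfterSec:     expSec,
		readSigningKey:      append([]byte(nil), readKey...),
		readExpiresAfterSec: readExpSec,
	})
}

func (g *Guard) SigningKey() []byte       { return g.signing.Load().signingKey }
func (g *Guard) ExpiresAfterSec() int     { return g.signing.Load().expiresAfterSec }
func (g *Guard) ReadSigningKey() []byte   { return g.signing.Load().readSigningKey }
func (g *Guard) ReadExpiresAfterSec() int { return g.signing.Load().readExpiresAfterSec }

func main() {
	g := NewGuard([]byte("old-key"), 10, nil, 0)
	g.UpdateSigningKeys([]byte("new-key"), 20, []byte("read-key"), 5)
	fmt.Println(string(g.SigningKey()), g.ExpiresAfterSec())
}
```

A handler that needs both the key and its expiry from the same reload should load the snapshot once rather than call two accessors, which is the concern the later "read each signing key once per branch" change addresses.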

The Rust guard is already safe: every read and the SIGHUP write go
through the shared RwLock<Guard>.

* security: fold whitelist + auth state into the atomic snapshot

Review follow-up. UpdateSigningKeys still wrote isWriteActive while the
request path read it (and the whitelist maps) unsynchronized, so a SIGHUP
under load could expose an inconsistent mix of activation bits and
whitelist contents.

Move all hot-reloadable Guard state -- keys, expirations, whitelist, and
the activation flags -- into a single immutable guardState swapped behind
one atomic.Pointer. The Update* methods take a small mutex to serialize
the read-modify-write; readers stay lock-free. The concurrency test now
also rotates the whitelist and probes IsWhiteListed under -race.

Also read each signing key once per branch in the volume/filer JWT auth
checks, so a reload landing mid-check can't take the allow-fast-path
after auth was enabled or verify against a different key than the branch
saw.
2026-06-04 22:26:08 -07:00
Chris Lu 0d72023fac fix(master): advance maxVolumeId when registering EC shards (#9827)
* fix(master): advance maxVolumeId when registering EC shards

After EC encoding the original normal volume is deleted, so a
high-numbered volume can exist only as EC shards. Only regular volumes
advanced maxVolumeId (Disk.doAddOrUpdateVolume), so a master that
rebuilt its state from heartbeats (raft state not resumed) undercounted
the max and NextVolumeId could re-issue an id that EC shards still
occupy. A new volume then gets created on top of the EC volume id; new
writes land on it, but reads route to the old EC shards whose .ecx never
held the new needle, returning 404 and corrupting that object.

Advance maxVolumeId when EC shards are registered, mirroring the
regular-volume path. RegisterEcShards is the chokepoint both the full
and incremental heartbeat sync paths funnel through.

* test: cover incremental heartbeat path for EC maxVolumeId

Both SyncDataNodeEcShards and IncrementalSyncDataNodeEcShards funnel
through RegisterEcShards; assert the invariant on the incremental path
too.
2026-06-04 22:25:30 -07:00
Chris Lu 8d59069a0a s3: return BucketAlreadyOwnedByYou when recreating your own bucket (#9822)
* s3: return BucketAlreadyOwnedByYou when recreating your own bucket

PutBucket returned BucketAlreadyExists for every existing bucket, even
when the caller already owns it, so idempotent re-creation (e.g. a
container that creates its bucket on startup) couldn't tell "someone
else took the name" from "it's already mine".

Recreating a bucket you own now returns BucketAlreadyOwnedByYou, unless
the request conflicts with the existing bucket: a different Object Lock
setting, or an ACL on the request or the existing bucket. To detect the
latter, a requested non-default canned/grant ACL is now persisted on
creation instead of being dropped.

* s3: fail PutBucket when the existing bucket's config can't be read

When a bucket already exists, an unreadable config left the recreate
defaulting to BucketAlreadyOwnedByYou, masking the backend error and
possibly accepting a conflicting recreate (Object Lock / ACL unknown).
Surface the read error instead.

* s3: return the stored bucket ACL from GetBucketAcl

GetBucketAcl always returned the owner's default full-control grant and
ignored any stored ACL, so a bucket created with a canned ACL or one set
via PutBucketAcl never read back correctly. Decode the stored grants
instead, sharing one grants-to-XML helper with the object ACL handler.

The shared helper also emits each grantee's real xsi:type (e.g. Group for
public-read) instead of a hardcoded CanonicalUser, so group grants read
back correctly for both bucket and object ACLs.

* s3: resolve the right already-exists error on the concurrent-create race

When two requests create the same bucket at once, the loser's mkdir
fails and the handler fell back to a flat BucketAlreadyExists, bypassing
the same-owner idempotency check. Route both the pre-check and the race
fallback through one existingBucketError helper so a same-owner recreate
still gets BucketAlreadyOwnedByYou.

* s3: record the bucket owner's account id at creation

setBucketOwner only stored the creating identity name, so the canonical
account id wasn't available later. Persist it under ExtAmzOwnerKey too,
the same field PutBucketAcl writes, so the bucket owner can be reported
independently of whoever reads it.

* s3: report the bucket owner from GetBucketAcl, not the caller

GetBucketAcl built the ACL Owner from the caller's account header, so an
admin or cross-account read returned the wrong owner. Use the owner
persisted on the bucket, falling back to the caller only when none is
recorded.
2026-06-04 15:33:03 -07:00