9222 Commits

Author SHA1 Message Date
Chris Lu f724828bcb fix(ec): never delete recoverable EC shards on startup/reconcile (the non-empty-.dat sibling of the stub bug) (#9941)
* fix(ec): never delete recoverable shards on startup/reconcile (size-direction + byte-exact .dat)

EC startup validation and the cross-disk reconcile could delete the only
copy of distributed-EC shards whenever a non-empty .dat sat beside them.
This is the same data-loss class as the empty-.dat-stub fix, now for a
real (non-empty) stale or partial .dat.

validateEcVolume: the discriminating signal is the shard size relative to
the .dat's full encode, not the shard count.
  - shards smaller than expected: an interrupted local encode left partial
    shards and the .dat is the complete source -> reclaim the .dat.
  - shards equal to expected: a valid (or still-distributing) EC volume ->
    keep; the shards may be the only copy.
  - shards larger than expected: the .dat is the stale/partial side (e.g. an
    interrupted decode left a half-written .dat next to the real shards) ->
    keep.
Previously any size mismatch, a low shard count beside a .dat, or a
transient stat error returned "delete", wiping sole-copy shards. Now every
ambiguity (size mismatch in either direction, inconsistent shard sizes,
transient I/O error, partial shard set) keeps the data; only a credible
full source .dat with no partial set to lose is reclaimed.
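
The size-direction rule in the three bullets above can be sketched as follows. This is a minimal illustration, not the real validateEcVolume: the names decide, Decision, and errTransient are hypothetical, and the further conditions the commit mentions (such as the partial shard set check) are not modeled.

```go
package main

import (
	"errors"
	"fmt"
)

// Decision is what this sketch does with a volume that has shards beside a .dat.
type Decision string

const (
	KeepShards Decision = "keep shards"
	ReclaimDat Decision = "reclaim .dat"
)

// errTransient is a stand-in for a transient stat or I/O failure.
var errTransient = errors.New("transient stat error")

// decide is a hypothetical helper. expected is the shard size that a full
// encode of the sibling .dat would produce.
func decide(shardSizes []int64, expected int64, statErr error) Decision {
	if statErr != nil || len(shardSizes) == 0 {
		return KeepShards // nothing can be proven: keep the data
	}
	for _, s := range shardSizes[1:] {
		if s != shardSizes[0] {
			return KeepShards // inconsistent shard sizes: keep
		}
	}
	if shardSizes[0] < expected {
		return ReclaimDat // partial shards left by an interrupted local encode
	}
	return KeepShards // equal or larger: the shards may be the only copy
}

func main() {
	fmt.Println(decide([]int64{40, 40}, 100, nil))
	fmt.Println(decide([]int64{100}, 100, nil))
	fmt.Println(decide([]int64{160}, 100, nil))
	fmt.Println(decide([]int64{40}, 100, errTransient))
}
```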

handleFoundEcxFile: a shard load failure (corrupt/locked .ecx, EMFILE
during a mass restart, transient I/O) no longer deletes the EC files when a
.dat exists -- it only unloads and keeps the files for retry. All deletion
authority now flows through validateEcVolume.

pruneIncompleteEcWithSiblingDat: count shards NODE-WIDE (a set split across
sibling disks summing to >= dataShards is independently recoverable and is
left alone), and require the sibling .dat to byte-exactly match the size
.vif recorded at encode time before deleting -- the prior "at least this
big, or bigger than a superblock" gate could trust a stale .dat and wipe
sole-copy shards. EC encode records the source size in .vif, so this gate
works for real volumes; older volumes without it fail safe (kept).
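
A minimal sketch of the prune gate described above, assuming a hypothetical helper mayPrune that receives the per-disk shard counts and the two sizes; the real pruneIncompleteEcWithSiblingDat reads these from disk and from .vif.

```go
package main

import "fmt"

// mayPrune is a hypothetical gate: it reports whether an incomplete shard set
// may be deleted in favor of the sibling .dat.
// vifDatSize is the source size recorded at encode time (0 when not recorded).
func mayPrune(shardsPerDisk []int, dataShards int, datSize, vifDatSize int64) bool {
	nodeWide := 0
	for _, n := range shardsPerDisk {
		nodeWide += n // count across every disk of the node
	}
	if nodeWide >= dataShards {
		return false // the set is recoverable on its own: leave it alone
	}
	if vifDatSize <= 0 {
		return false // older volume without a recorded size: fail safe
	}
	return datSize == vifDatSize // byte-exact match, not "at least this big"
}

func main() {
	fmt.Println(mayPrune([]int{6, 5}, 10, 1000, 1000)) // split set, recoverable
	fmt.Println(mayPrune([]int{2, 1}, 10, 1000, 1000)) // partial set, exact .dat
	fmt.Println(mayPrune([]int{2}, 10, 1200, 1000))    // .dat size differs
}
```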

Rust volume server mirrors all of the above: size-direction + keep-on-
ambiguity in validate_ec_volume, keep-on-load-failure in
handle_found_ecx_file, and the node-wide + byte-exact gate in the prune.
The Rust validate/prune paths now resolve the data-shard count from the
volume's own .vif instead of hardcoding 10+4, so custom-ratio volumes are
not mis-sized and wrongly deleted on reboot.

Existing tests that encoded the old (unsafe) "delete on low count / size
mismatch" behavior are updated to the safe expectation, and new regression
tests cover the partial-decode-.dat-keeps-shards and transient-error-keeps
cases (Go and Rust); they fail on the pre-fix code.

* fix(ec): record DatFileSize in planted EC .vif for the prune test; trim comments

The multi-disk lifecycle e2e test planted a partial EC leftover with an
empty .vif, so the byte-exact prune gate (which a real encoded volume
satisfies via its recorded source size) kept it instead of cleaning up.
Record DatFileSize + the EC ratio in the planted .vif, matching production.

Also condense the verbose comments added in this change to the repo's
concise style.
2026-06-12 23:51:29 -07:00
Chris Lu 3718301599 shell: stop ec.encode/ec.rebuild from destroying live EC shards (no crash needed) (#9939)
* shell: stop ec.encode/ec.rebuild from destroying live EC shards

Three operator-triggered shell paths could destroy data with no crash:

ec.encode -volumeId on an already-EC volume tore down its shards before
failing. The volume-id path never checked the id was a regular volume:
the collection lookup scans only VolumeInfos (so an EC-only id maps to
""), and volumeLocations succeeds via the EC-location fallback, so
clearPreexistingEcShards full-teardown-deleted every shard cluster-wide
before doEcEncode failed. An EC volume has no .dat, so this is its only
copy. Add assertEncodableRegularVolumes: each requested id must be a
regular volume in the topology snapshot; an EC-only or unknown id is
refused before any teardown. A volume present as both a regular .dat and
stale orphan shards (a failed-encode retry) still passes. This closes
the operator-rerun/script-retry path; a worker racing the snapshot is a
fencing problem handled separately.

ec.rebuild dry-run (the default, without -apply) still issued real
VolumeEcShardsDelete RPCs: prepareDataToRecover appended every
would-copy shard to copiedShardIds even though the copy was skipped, and
the cleanup defer deleted that set unconditionally. Now a dry-run copies
nothing and records nothing to delete (a separate would-copy counter
drives the recoverability check so the dry-run still reports its plan),
and the cleanup runs only under -apply.

ec.rebuild could also self-destruct a live shard: localShardsInfo was
overwritten per disk instead of unioned, so a shard the rebuilder holds
on a non-last disk looked remote, got copied onto itself (in-place
O_TRUNC) and then node-wide deleted. Union local shards across all
disks, and never copy/delete a shard whose only listed holder is the
rebuilder itself.
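
The overwrite-versus-union difference can be illustrated with a small bitmask sketch. ShardBits and unionLocalShards are hypothetical names for this example, not the types used by the shell command.

```go
package main

import "fmt"

// ShardBits marks which shard ids are present, one bit per shard id.
type ShardBits uint32

// Has reports whether shard id is set.
func (b ShardBits) Has(id uint) bool { return b&(1<<id) != 0 }

// unionLocalShards merges the shards held on every disk of one node.
// Assigning per disk instead of OR-ing would keep only the last disk's shards.
func unionLocalShards(perDisk []ShardBits) ShardBits {
	var all ShardBits
	for _, b := range perDisk {
		all |= b
	}
	return all
}

func main() {
	disks := []ShardBits{1 << 0, 1 << 2} // shard 0 on disk A, shard 2 on disk B
	local := unionLocalShards(disks)
	fmt.Println(local.Has(0), local.Has(1), local.Has(2))
}
```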

* shell: address ec destructive-guards review comments

- countLocalShards: union shards across all of the rebuilder's disks so
  slot accounting matches what prepareDataToRecover treats as local;
  first-match counting overstated slotsNeeded on multi-disk rebuilders
- VolumeEcShardsCopy: resolve SourceDataNode via
  pb.NewServerAddressFromDataNode instead of the raw node id, which may
  not be a dialable host:port
- assertEncodableRegularVolumes: skip nil DiskInfo map entries, matching
  the other topology walks in this file; rename ecOnly to hasEcShards
  since the map marks any volume with shards, not only shard-only ones
2026-06-12 22:30:17 -07:00
Chris Lu 18cdb3819b fix(ec): crash-safe ecx-journal fold and shard rebuild (fsync before publish, no short-read-as-success) (#9938)
* fix(ec): make ecx-journal fold and shard rebuild crash-safe

Two EC rebuild paths could silently lose or corrupt data:

RebuildEcxFile folded the .ecj deletion journal into .ecx (in-place
WriteAt tombstones) and then unlinked the journal without flushing the
.ecx writes first. A crash could persist the unlink ahead of the
tombstones, resurrecting deleted needles on the next load. It also read
journal records with a bare n!=size break, so a torn tail silently
dropped the remaining tombstones before the unlink. Now: read records
with io.ReadFull (io.EOF ends cleanly, a torn tail aborts and leaves
.ecj in place for retry), fsync .ecx before removing the journal.
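
A sketch of the journal read loop described above, assuming a fixed 8-byte record for illustration (recordSize, readJournal, and countRecords are hypothetical names). io.ReadFull returns io.EOF only when no bytes were read and io.ErrUnexpectedEOF for a partial record, which is what separates a clean end from a torn tail.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

const recordSize = 8 // assumed record size for this sketch

// readJournal returns every complete record. A clean io.EOF ends the loop;
// a torn tail is reported as an error so the caller keeps the journal.
func readJournal(r io.Reader) ([][recordSize]byte, error) {
	var records [][recordSize]byte
	for {
		var rec [recordSize]byte
		_, err := io.ReadFull(r, rec[:])
		if err == io.EOF {
			return records, nil // ended exactly on a record boundary
		}
		if err != nil {
			return records, fmt.Errorf("torn journal tail: %w", err)
		}
		records = append(records, rec)
	}
}

// countRecords is a convenience wrapper for in-memory input.
func countRecords(b []byte) (int, error) {
	recs, err := readJournal(bytes.NewReader(b))
	return len(recs), err
}

func main() {
	fmt.Println(countRecords(make([]byte, 16))) // two whole records
	fmt.Println(countRecords(make([]byte, 20))) // two records plus a torn tail
}
```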

rebuildEcFiles treated a zero/short ReadAt as a clean end-of-input and
discarded the read error, so a truncated or unreadable input shard
produced truncated regenerated shards that were then published as
restored redundancy; the regenerated shards were also never fsynced on
the no-sidecar path. Now: derive the expected shard size from the
present inputs up front (rejecting a divergent/zero-size input), drive
the loop by that size, fail on any short read or short write, and fsync
every regenerated shard before it is mounted/renamed.
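
The fsync-before-publish ordering can be sketched with the standard library. The temp-file naming and the helpers publishShard and roundTrip are assumptions for this example; the commit itself only states that regenerated shards are fsynced before being mounted or renamed.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// publishShard writes data to a temporary file, fsyncs it, and only then
// renames it to its final name, so a crash never exposes a partial shard.
func publishShard(finalPath string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(finalPath), "shard-*.tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly once the rename succeeded
	if _, err := tmp.Write(data); err != nil { // Write errors on a short write
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush to stable storage first
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), finalPath)
}

// roundTrip publishes data into a scratch directory and reads it back.
func roundTrip(data string) (string, error) {
	dir, err := os.MkdirTemp("", "ec-sketch")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	p := filepath.Join(dir, "1.ec00")
	if err := publishShard(p, []byte(data)); err != nil {
		return "", err
	}
	got, err := os.ReadFile(p)
	return string(got), err
}

func main() {
	fmt.Println(roundTrip("shard-bytes"))
}
```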

Rust volume server mirrors the rebuild fix: rebuild_ec_files now checks
the read_at byte count (it previously discarded it, the same truncation
bug). The Rust ecx fold already synced .ecx before removing the journal.

Custom EC ratios are unaffected: the shard size derives from the input
shards and the loop uses the .vif-resolved data/parity counts, never a
hardcoded 10+4.

* storage: close ecx journal files via defer in RebuildEcxFile

Per review: a single deferred Close per file replaces the per-error-path
manual closes, so new early returns cannot leak descriptors. The journal
is still closed explicitly before its unlink since Windows cannot delete
an open file; the deferred second Close is a harmless no-op.
2026-06-12 22:28:56 -07:00
7y-9 5468707289 fix(util): ignore comment only sql input (#9933)
* fix(util): ignore comment only sql input

Problem: sqlutil.SplitStatements strips SQL comments while scanning, but when no statements remain it falls back to returning the original query. Inputs that contain only comments are therefore reported as executable SQL statements.

Root cause: The no-statements fallback did not distinguish a real single statement from input that had been fully removed by comment filtering.

Fix: Remove the original-query fallback and return an explicit empty slice when scanning produces no statements.
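
A simplified sketch of the fixed behavior. splitStatements here is a hypothetical stand-in for sqlutil.SplitStatements: it strips `--` and `/* */` comments and splits on `;`, but it does not handle string literals, so it only illustrates returning an empty slice for comment-only input.

```go
package main

import (
	"fmt"
	"strings"
)

// splitStatements strips comments, splits on ';', and returns an empty slice
// (never the original query) when no statement remains.
func splitStatements(query string) []string {
	var b strings.Builder
	for i := 0; i < len(query); {
		switch {
		case strings.HasPrefix(query[i:], "--"):
			j := strings.IndexByte(query[i:], '\n')
			if j < 0 {
				i = len(query) // comment runs to end of input
			} else {
				i += j // keep the newline as a separator
			}
		case strings.HasPrefix(query[i:], "/*"):
			j := strings.Index(query[i+2:], "*/")
			if j < 0 {
				i = len(query) // unterminated block comment
			} else {
				i += j + 4
			}
			b.WriteByte(' ') // a block comment separates tokens
		default:
			b.WriteByte(query[i])
			i++
		}
	}
	statements := []string{}
	for _, part := range strings.Split(b.String(), ";") {
		if s := strings.TrimSpace(part); s != "" {
			statements = append(statements, s)
		}
	}
	return statements
}

func main() {
	fmt.Println(len(splitStatements("-- just a comment")))
	fmt.Println(splitStatements("SELECT 1; -- trailing\nSELECT 2"))
}
```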

Reproduction: env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/util/sqlutil -run TestSplitStatements -count=1 failed before the fix because comment-only inputs returned the comment text as a statement.

Validation: gofmt -w weed/util/sqlutil/splitter.go weed/util/sqlutil/splitter_test.go; env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/util/sqlutil -run TestSplitStatements -count=1; env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/util/sqlutil -count=1; git diff --check; git diff --cached --check.

Duplicate check: Searched /private/tmp/seaweedfs-codex0610-old-branch-index.tsv and existing tests for sqlutil, SplitStatements, comments, and comment-only. Old PostgreSQL query branches cover malformed wire frames and SQL engine numeric parsing, not comment-only statement splitting.

Co-authored-by: Codex <noreply@openai.com>

* Update weed/util/sqlutil/splitter.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-06-12 10:10:27 -07:00
Chris Lu 0345658ea8 [s3] validate indirect filer path inputs (#9931)
* s3: validate indirect filer path inputs

* s3: avoid query parsing on common request path

* filer: scope copy/move source against JWT AllowedPrefixes

maybeCheckJwtAuthorization only checked r.URL.Path, but copy and move read
their source from the cp.from / mv.from query params. A prefix-restricted
token could copy or move data out of a subtree it cannot otherwise reach.
Check every path the request touches, reusing pathHasComponentPrefix so
`..` in the source is collapsed before the prefix match.
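
The idea of checking every touched path against the allowed prefixes, after collapsing `..`, might look like the sketch below. withinPrefix and allAllowed are hypothetical names; the commit's own helper is pathHasComponentPrefix, whose exact signature is not shown here.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// withinPrefix cleans both paths and matches on whole path components,
// so "/tenant/ab" does not match the prefix "/tenant/a".
func withinPrefix(p, prefix string) bool {
	p = path.Clean("/" + p)
	prefix = path.Clean("/" + prefix)
	if prefix == "/" {
		return true
	}
	return p == prefix || strings.HasPrefix(p, prefix+"/")
}

// allAllowed requires every path the request touches (the URL path and any
// copy or move source) to fall under at least one allowed prefix.
func allAllowed(prefixes []string, paths ...string) bool {
	for _, p := range paths {
		ok := false
		for _, pre := range prefixes {
			if withinPrefix(p, pre) {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	allowed := []string{"/tenant/a"}
	fmt.Println(allAllowed(allowed, "/tenant/a/dst", "/tenant/a/src"))
	fmt.Println(allAllowed(allowed, "/tenant/a/dst", "/tenant/a/../b/secret"))
}
```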

* s3: confine iceberg CreateTable location to the catalog bucket

CreateTable derived the metadata bucket and path from the client-supplied
req.Location / req.Name and wrote there directly, so a caller scoped to one
table bucket could place metadata in another bucket (and path.Join collapsed
any `..`). Require the parsed bucket to equal the request's catalog bucket
and reject traversal segments in the table path.

* webdav: clean client path before subFolder confinement

wrappedFs concatenated subFolder + name before the underlying FileSystem
ran path.Clean, so `..` in the request path or COPY/MOVE Destination
resolved across the FilerRootPath confinement boundary. Clean the name as a
rooted path first so traversal segments collapse below subFolder. Only the
non-default -filer.path (non-empty subFolder) setup was affected.
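
Cleaning the client name as a rooted path before joining can be sketched with path.Clean and path.Join. confine and naive are hypothetical helpers used only to contrast the two orderings.

```go
package main

import (
	"fmt"
	"path"
)

// confine cleans name as a rooted path first, so any ".." collapses at "/"
// and cannot climb above subFolder after the join.
func confine(subFolder, name string) string {
	return path.Join(subFolder, path.Clean("/"+name))
}

// naive concatenates first and cleans afterwards, which lets ".." escape.
func naive(subFolder, name string) string {
	return path.Clean(subFolder + "/" + name)
}

func main() {
	fmt.Println(naive("/data/tenant", "../../etc/passwd"))
	fmt.Println(confine("/data/tenant", "../../etc/passwd"))
}
```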

* filer: enforce read-only rule on real write path with destination header

The x-seaweedfs-destination header overrides the path used for storage-rule
matching while the entry is written at r.URL.Path, letting a caller select a
writable rule for a read-only target. When the header is present, also check
the read-only/quota rule against the actual write path.
2026-06-11 21:56:16 -07:00
Chris Lu 34f9b91d69 fix(storage): never let an empty .dat delete healthy distributed EC shards (#9930)
* fix(storage): never let an empty .dat delete healthy distributed EC shards

A leftover empty .dat stub (a phantom from the pre-fix loader; zero
needles) next to a distributed EC volume's local shards made startup
classify the volume as an interrupted local encode: validateEcVolume
requires >= dataShards local shards when a .dat is present, fails with
the 1-2 shards a distributed volume keeps per disk, and the cleanup
deletes those shards -- the only copies of that part of the volume.
Repeated across restart waves this destroys enough shards cluster-wide
to make the volume unrecoverable.

Go:
- loadExistingVolume: hoist the empty-stub sweep above the EC presence
  checks. Previously the .vif-next-to-.ecx guard returned before the
  sweep ever ran, so exactly the dangerous layout (stub + .ecx + local
  shards) kept its stub and then lost its shards in loadAllEcShards.
- validateEcVolume / checkDatFileExists: treat a .dat <= a superblock
  (zero needles) as absent. An empty .dat cannot be the encode source,
  so it must never gate shard deletion; this also covers stubs without
  a .vif, which the sweep cannot prove are EC leftovers.

Rust mirror (seaweed-volume): the same gate in validate_ec_volume and
check_dat_file_exists (the Rust sweep already ran before validation);
the volume-load skip keeps a plain existence check so fresh,
needle-less volumes still load.

Regression tests in Go and Rust reproduce the production layout (a
zero-byte .dat beside .ecx/.ecj and two shards of a 10+4 volume, with
and without a .vif); without the fix they fail because the shards are deleted.

* fix(ec): gate source volume deletion on a recoverable shard set

After EC encode, the shell command and the (plugin) worker task refused
to delete the source volume unless every shard was present, and aborted
otherwise -- leaving the source .dat next to live shards, exactly the
mixed state the startup cleanup mishandles.

Replace the full-set requirement with a recoverability gate shared by
both callers (RequireRecoverableShardSet): deleting a non-empty source
.dat requires at least dataShards distinct shards cluster-wide. Below
that the source is kept and the encode fails as before. A degraded but
recoverable set (>= dataShards, < total) now proceeds with a warning
instead of aborting: the missing shards can be rebuilt from the
survivors, while keeping the source would preserve the dangerous mixed
state. Empty stub replicas are still swept unguarded (OnlyEmpty) -- an
empty .dat has nothing to lose.

dataShards/totalShards stay parameters so enterprise custom EC ratios
share the helper verbatim.
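
A minimal sketch of such a recoverability gate, assuming the caller passes the shard ids seen cluster-wide (possibly with duplicates). The lowercase requireRecoverableShardSet below is an illustration only and does not claim to match the signature of the real RequireRecoverableShardSet.

```go
package main

import "fmt"

// requireRecoverableShardSet returns an error when fewer than dataShards
// distinct shards exist, and degraded=true when the set is recoverable but
// incomplete.
func requireRecoverableShardSet(shardIDs []int, dataShards, totalShards int) (degraded bool, err error) {
	distinct := make(map[int]struct{})
	for _, id := range shardIDs {
		if id >= 0 && id < totalShards {
			distinct[id] = struct{}{} // duplicates on other nodes count once
		}
	}
	if len(distinct) < dataShards {
		return false, fmt.Errorf("only %d distinct shards, need %d: keep the source", len(distinct), dataShards)
	}
	return len(distinct) < totalShards, nil
}

func main() {
	fmt.Println(requireRecoverableShardSet([]int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, 10, 14))
	fmt.Println(requireRecoverableShardSet([]int{0, 0, 1, 1, 2}, 10, 14))
}
```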

* test(ec): use recoverable shard verification gate
2026-06-11 20:26:20 -07:00
Chris Lu b44cf51fe9 s3: validate copy source path segments (#9929)
Reject copy sources whose bucket/object fail IsValidBucketName /
IsValidObjectKey, the helpers validateRequestPath already applies to the
request URL. The object is joined onto the bucket path and `.`/`..`
segments are collapsed by the filer, so without this the source need not
stay within the parsed bucket. Route UploadPartCopy through
ValidateCopySource too; it previously only checked for empty bucket/object.
2026-06-11 17:07:15 -07:00
Chris Lu 4f8af455bf feat(storage): sweep leftover empty EC .dat stubs on volume server startup (#9927)
* feat(storage): sweep leftover empty EC .dat stubs on volume server startup

An EC volume keeps no local .dat. The pre-fix loader left empty 8-byte
superblock .dat stubs next to EC metadata (one per lone .vif). Left in
place, each loads as a phantom empty volume, and the same vid's stub on
two disks of one server blocks Rust startup via the duplicate-vid check
in Store::add_location -- the prior fix stops creating new stubs but does
not clean up existing ones.

On startup, when a .dat is empty (<= a superblock, i.e. zero needles) and
its .vif marks the volume erasure-coded, remove the stub (+ empty .idx)
instead of loading it. The real data is in the EC shards, so the empty
stub holds nothing to lose. Non-EC empty .dat files (e.g. freshly
allocated volumes) are left alone.

Done in both Rust (load_existing_volumes) and Go (loadExistingVolume),
with regression tests that fail without the sweep.

* refactor(storage): extract empty EC .dat stub sweep into its own function

Move the startup stub-sweep into remove_empty_ec_dat_stub (Rust) and
removeEmptyEcDatStub + vifIsEcVolume (Go) for clearer logic, and look up
the .vif in both the data and idx directories (each read at most once) so
a stub is still found when -dir.idx is configured. Adds direct tests for
the idx-directory lookup on both engines.
2026-06-11 12:26:21 -07:00
Chris Lu 37962e2445 admin: configure maintenance tasks via admin.toml (#9926)
* admin: configure maintenance tasks via admin.toml

Maintenance task settings could only be edited in the admin UI and live
under <dataDir>/conf, so they silently reverted to defaults whenever the
data directory was recreated. An optional admin.toml now declares vacuum,
balance, and erasure coding settings; keys set there are written through
to the persisted task configs at every startup, overriding UI edits, so
the configuration stays declarative. Generate an example with
"weed scaffold -config=admin".

* vacuum: round min volume age up to whole hours

MinVolumeAgeSeconds was truncated by integer division when converted to
the hour-granular protobuf field, so a sub-hour setting silently became
0 and disabled the age guard.
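
The rounding fix amounts to ceiling division. A small sketch, with minAgeHours as a hypothetical helper name:

```go
package main

import "fmt"

// minAgeHours converts seconds to whole hours, rounding up so that a
// sub-hour setting becomes 1 instead of truncating to 0.
func minAgeHours(seconds int64) int64 {
	if seconds <= 0 {
		return 0
	}
	return (seconds + 3599) / 3600
}

func main() {
	fmt.Println(1800/3600, minAgeHours(1800)) // truncated vs rounded up
	fmt.Println(minAgeHours(3600), minAgeHours(3601))
}
```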

* admin: split and normalize preferred_tags from admin.toml

A comma-separated string, as set via environment variable, came through
viper as a single slice element. Split on commas and reuse
util.NormalizeTagList, matching the plugin config path.

* scaffold: clarify admin.toml wording
2026-06-11 11:04:52 -07:00
Chris Lu 42030381ae shell: volume.tier.move can move volumes between data centers (#9925)
* shell: volume.tier.move can move volumes between data centers

-fromDataCenter scopes volume selection to volumes with a replica in
that data center. -toDataCenter constrains move destinations and
replication fulfillment. With identical disk types both flags are
required, moving full volumes between data centers on the same tier.

* shell: assert node identity in data center filter test

* shell: tier move resumes when the volume is already on the target

A replica already on the target tier and data center, typically left by
an interrupted earlier run, anchors the move: skip the copy and only
complete replication fulfillment and old replica cleanup. Previously
such volumes hit the no-destination path and the stale source replicas
were never removed.
2026-06-11 10:46:34 -07:00
Chris Lu 3eb550a3f1 fix(tests): 32-bit build of EC e2e tests, type-check linux/386 in CI (#9922)
* fix(tests): keep EC e2e fid cookie arithmetic in uint32

The cookie constants 0x9490CA00 and 0x9500CA00 were added to the int
loop variable before conversion, overflowing 32-bit int at compile
time on linux/386 and linux/arm. Convert the loop variable instead so
the addition stays in uint32.
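
A sketch of the fix, using one of the constants named above; the function name cookie is hypothetical. Writing `i + 0x9490CA00` with an int `i` does not compile where int is 32 bits, because the constant does not fit in int.

```go
package main

import "fmt"

const cookieBase = 0x9490CA00 // fits in uint32, but not in a 32-bit int

// cookie converts the loop variable first, so the addition is done in uint32
// on every architecture.
func cookie(i int) uint32 {
	return uint32(i) + cookieBase
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Printf("%#x\n", cookie(i))
	}
}
```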

* fix(tests): pass s3client max backoff in milliseconds

MaxBackoffDelay is documented as milliseconds and multiplied by 1e6
before use, but the example set it to 5s in nanoseconds, yielding an
absurd backoff on 64-bit and a compile-time int overflow on 32-bit.

* ci: type-check code and tests for linux/386

64-bit-only constant arithmetic keeps slipping into test files and
breaking 32-bit downstream builds. Vet the whole root module under
GOOS=linux GOARCH=386 so these fail in CI instead of after release.

* fix(tests): convert s3client backoff to Duration before scaling

The ms-to-ns multiplication ran in int, wrapping at runtime on 32-bit;
scale by time.Millisecond after the Duration conversion instead.
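
The conversion order matters because time.Duration is an int64 while int may be 32 bits wide. A sketch, with backoff as a hypothetical helper:

```go
package main

import (
	"fmt"
	"time"
)

// backoff converts to Duration (int64) before scaling, so the multiplication
// cannot wrap on platforms where int is 32 bits.
func backoff(ms int) time.Duration {
	return time.Duration(ms) * time.Millisecond
}

func main() {
	fmt.Println(backoff(5000))
}
```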
2026-06-11 09:05:54 -07:00
Chris Lu 582b7268f5 s3: export per-bucket quota and read-only state metrics (#9923)
The quota enforcement loop already computes each bucket's configured
quota and effective read-only flag every minute, but neither was
visible to monitoring, so operators could not alert before a bucket
flips read-only.

Add two gauges next to the existing bucket size metrics:

  SeaweedFS_s3_bucket_quota_bytes  configured quota; the series is only
                                   present while the quota is enabled,
                                   so size/quota utilization queries
                                   never divide by zero
  SeaweedFS_s3_bucket_read_only    1 when the bucket's location rule is
                                   read-only (over quota or manually
                                   locked), 0 otherwise

Both are cleaned up with the other per-bucket gauges on bucket
deletion and inactivity TTL.
2026-06-11 09:03:00 -07:00
Chris Lu 55010be19b 4.33 2026-06-11 00:52:31 -07:00
Chris Lu 79ac279fe1 fix(ec): don't mix EC shards from different encode runs (#9880)
* feat(ec): add encode_ts_ns to EC shard metadata and the shard read RPC

EcShardConfig and VolumeEcShardReadRequest gain an int64 encode_ts_ns
(encode time in unix nanos). It rides in .vif and the read request so a
read can be scoped to the encode run that produced the index.

* fix(ec): stamp each encode and reject cross-run shard reads

Generate stamps EncodeTsNs into the volume's .vif. Reads carry it to the
shard's owning volume (resolved together via FindEcVolumeWithShard, so a
multi-disk server validates the disk that actually serves the bytes) and
reject a shard from a different encode run, recovering from parity. A
zero on either side (pre-upgrade volume) skips the guard.

* fix(ec): stamp the encode identity on the worker-generated .vif

The worker-local encode path now writes EncodeTsNs (and the resolved EC
ratio) into the .vif, so the read guard is not silently off for volumes
encoded by the maintenance worker.

* fix(ec): wipe stale EC artifacts before re-encoding

VolumeEcShardsGenerate evicts any in-memory EcVolume for the volume and
removes its on-disk shard/index/sidecar files before writing fresh ones,
so a retried encode never builds on a partial prior run and the unlink
frees the inodes instead of leaving open fds serving old bytes.

* fix(ec): unmount EC shards across all disks

UnmountEcShards walked only the first disk holding the shard, leaving a
duplicate copy mounted on a sibling disk (split-disk reconciled volumes)
still serving and heartbeating. Traverse every disk and emit one
deletion delta per disk.

* fix(ec): delete orphan shards without a local .ecx

deleteEcShardIdsForEachLocation gated shard-file removal on a local .ecx,
so it could not clean an orphan .ecNN left by a failed copy on a disk
with no index. Delete the requested shard files unconditionally; the
index-file (.ecx/.ecj/.vif) routing stays gated as before.

* fix(ec): clear stale EC shards cluster-wide before re-encoding

ec.encode unmounts and deletes EC shards for the target volumes on every
node before regenerating: fatal for the shards the topology reports
(mounted leftovers), best-effort for the rest (a sweep that catches
unmounted failed-copy orphans). A down node is a no-op.

* fix(ec): don't nil EC fds on close so reads can't race eviction

A reader resolves an EcVolume/shard under the lock then reads after it is
released, so an eviction that nils ecxFile/ecdFile would race that read
and panic. Close the fds without nilling the fields: the field is now
write-once (no data race) and a concurrent read hits a closed fd, getting
a clean error that the caller recovers from parity.
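
The "closed fd yields a clean error" behavior relies on how *os.File treats reads after Close. The sketch below demonstrates only that standard-library behavior; readAfterClose and isClosed are hypothetical helpers, unrelated to the EcVolume code.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// readAfterClose closes a file and then reads through the same *os.File,
// as a reader racing an eviction would. The result is an error, not a panic.
func readAfterClose() error {
	f, err := os.CreateTemp("", "ecx-sketch")
	if err != nil {
		return err
	}
	defer os.Remove(f.Name())
	if _, err := f.Write([]byte("index")); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	_, err = f.ReadAt(make([]byte, 4), 0) // the field is still non-nil
	return err
}

// isClosed reports whether err is the standard "file already closed" error.
func isClosed(err error) bool {
	return errors.Is(err, os.ErrClosed)
}

func main() {
	err := readAfterClose()
	fmt.Println(isClosed(err), err)
}
```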

* fix(ec): wipe stale EC artifacts on every disk and surface failures

The pre-encode wipe only deleted beside the source volume, so a stale
shard on a sibling disk survived and could be mounted against the new
index at reconcile. Sweep every disk. Removal also ignored os.Remove
errors, reporting a failed cleanup as success and letting a stale shard
join the next generation; surface the first real failure (treating
already-gone as success) from removeStaleEcArtifacts and the shard delete.

* fix(ec): log when a local shard is skipped for a different encode run

The cross-run guard returned errShardNotLocal, indistinguishable in logs
from a genuinely-absent shard. Add a V(1) line naming both EncodeTsNs so
operators can tell "wrong encode generation" from "shard not here".

* fix(ec): surface metadata removal failures in the shard delete path

deleteEcShardIdsForEachLocation still dropped os.Remove errors on the
.ecx/.ecj/.vif/sidecar cleanup. A surviving stale .ecx is the orphan-index
condition this path prevents, so route those through removeFileIfExists and
return the first real failure instead of reporting cleanup as success.

* fix(ec): fail orphan cleanup when a reachable node's delete fails

The pre-encode orphan sweep swallowed every error for unreported (node,
volume) pairs. That is only safe for an unreachable node, which cannot
receive this encode's new generation. A reachable node whose delete
genuinely failed (permission/IO) keeps an orphan shard that a later copy
re-stamps with the new run's volume-level .vif identity, so the read guard
would accept stale data. Surface those; stay best-effort only for
unreachable nodes (gRPC Unavailable / no status).

* fix(ec): guard ecjFile under its lock in the EC delete path

EcVolume.Close nils ecjFile under ecjFileAccessLock; a delete that resolved
its .ecx lookup before a concurrent eviction (the generate-time
UnloadEcVolume) could then reach the journal append with a nil fd. Bail
with a clear "volume closed" error under the lock instead.

* fix(ec): reject an unstamped shard when the caller has an encode identity

The read guard required both identities nonzero, so a current (stamped)
caller accepted a holder with identity 0 and could be served a stale
pre-upgrade shard. Reject when the caller is stamped and the holder
differs (including unstamped); stay lenient only when the caller itself
has no identity (pre-upgrade reader). A skipped shard recovers from parity.
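
The tightened guard reduces to a small predicate. acceptShard is a hypothetical name; the arguments stand for the caller's and the holder's EncodeTsNs, with 0 meaning unstamped.

```go
package main

import "fmt"

// acceptShard applies the read guard: an unstamped caller stays lenient,
// while a stamped caller accepts only a holder with the same identity.
func acceptShard(callerTs, holderTs int64) bool {
	if callerTs == 0 {
		return true // pre-upgrade reader: no identity to enforce
	}
	return callerTs == holderTs // rejects a different run and an unstamped holder
}

func main() {
	fmt.Println(acceptShard(100, 100)) // same encode run
	fmt.Println(acceptShard(100, 200)) // different encode run
	fmt.Println(acceptShard(100, 0))   // stamped caller, unstamped holder
	fmt.Println(acceptShard(0, 200))   // unstamped caller
}
```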

* fix(ec): full-teardown delete so cluster cleanup wipes a whole generation

The pre-encode cluster sweep deleted only the listed canonical shards on
remote nodes, leaving index/sidecar (and, on builds with versioned
generations, those too) behind. Add a full_teardown flag to
VolumeEcShardsDelete that evicts the volume and wipes every EC artifact for
it on every disk via removeStaleEcArtifacts; the shell and worker pre-encode
cleanup paths set it. Other delete callers (balance/decode/repair) are
unchanged.

* fix(ec): take ecjFileAccessLock before the nil-check in Sync and Close

Sync and Close read ev.ecjFile before acquiring ecjFileAccessLock while
Close nils it under the lock, a data race on the field. Take the lock
first, then nil-check inside, in both.

* fix(ec): acknowledge full_teardown so a pre-upgrade server can't fake success

An old volume server silently ignores full_teardown and returns success
for an ordinary delete, so the caller wrongly believes the generation was
wiped and copies a fresh gen-0 onto an unwiped node. Echo full_teardown_done
in the response; the worker destination cleanup fails when it is absent, and
the shell cluster sweep fails for a reported (mounted) leftover while staying
best-effort for an unreported node. An ignored encode_ts_ns remains an accepted
transient state (an old server just skips the new read guard, no regression).

* fix(ec): fail the pre-encode sweep for any reachable node that can't ack teardown

A reachable pre-upgrade server ignores full_teardown and returns success
without wiping an orphan, which a later copy then folds into the new
generation. Treat a missing full_teardown_done ack as fatal for every
reachable node (best-effort only for a gRPC-unreachable one), not just for
topology-reported pairs.

* fix(ec): return the served shard identity and validate it client-side

The encode identity was only enforced server-side, so a pre-upgrade server
ignored the request field and served bytes unchecked. Echo the served
shard's EncodeTsNs on every read response chunk and have the client reject a
mismatch (including 0 from an old server), so the guard holds regardless of
server version; a rejected read recovers from parity.

* fix(ec): reject a short/empty remote shard read instead of serving zeros

doReadRemoteEcShardInterval accepted an immediate EOF or a short stream and
returned success with a partly zero-filled, unvalidated buffer (the server
stamps the identity only on chunks that carry bytes). A non-deleted interval
must arrive whole: require n == len(buf), exempting the is_deleted
short-circuit (n=0), matching readLocalEcShardInterval's local check. A short
read now fails so the caller recovers from parity.
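
The whole-interval requirement can be sketched as below. fillInterval and errShortRead are hypothetical; the chunks slice stands in for the streamed response of the remote read.

```go
package main

import (
	"errors"
	"fmt"
)

var errShortRead = errors.New("short remote shard read")

// fillInterval copies streamed chunks into buf and requires the interval to
// arrive whole, except for the deleted short-circuit, which carries no bytes.
func fillInterval(buf []byte, chunks [][]byte, isDeleted bool) (int, error) {
	n := 0
	for _, c := range chunks {
		n += copy(buf[n:], c)
	}
	if isDeleted && n == 0 {
		return 0, nil
	}
	if n != len(buf) {
		// never return a partly zero-filled buffer as success
		return n, fmt.Errorf("%w: got %d of %d bytes", errShortRead, n, len(buf))
	}
	return n, nil
}

func main() {
	fmt.Println(fillInterval(make([]byte, 4), [][]byte{{1, 2}, {3, 4}}, false))
	fmt.Println(fillInterval(make([]byte, 4), [][]byte{{1, 2}}, false))
	fmt.Println(fillInterval(make([]byte, 4), nil, true))
}
```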

* test(ec): fake volume server echoes the full_teardown acknowledgement

The worker now fails a teardown delete that isn't acknowledged (so a
pre-upgrade server can't silently skip the wipe). The fake server's no-op
VolumeEcShardsDelete returned an empty response, which the worker read as a
skipped teardown and aborted the encode. Echo full_teardown_done.

* feat(ec): mirror the encode-run identity guard + full_teardown into the Rust volume server

The Go volume server stamps an encode-run identity (encode_ts_ns) into the .vif
and rejects a read served from a shard of a different run; full_teardown wipes a
whole generation and acknowledges it. The Rust volume server had none of it.
Mirror the shared logic: load encode_ts_ns from the .vif onto the EcVolume,
stamp it on every read response, and reject a request/response mismatch on both
the server and the distributed-read client (recovering from parity); handle
full_teardown by evicting the volume and wiping every EC artifact on each disk,
echoing full_teardown_done so the caller can detect a server that ignored it.

* fix(ec): remove a stale .vif on full teardown of a shard-only node

A shard copy installs shards + .ecx before .vif, so an interrupted copy after a
teardown could mount the new files under the previous run's identity / version /
shard ratio / dat_file_size carried by the surviving .vif. Remove .vif during
full teardown, gated on .idx absence so a source-volume holder keeps its live
.vif. In Rust this lives in a teardown-only helper so the reconcile / load-
fallback paths (which share the base removal) still preserve .vif.

* fix(ec): treat a missing teardown ack as fatal, not as an unreachable node

isNodeUnreachable returned true for any non-gRPC-status error, so a reachable
pre-upgrade server's missing full_teardown_done ack (a plain error) was
classified unreachable and the unreported pair was silently skipped. Classify
only a real codes.Unavailable as unreachable, and wrap the missing ack in a
sentinel the sweep treats as fatal regardless. A genuinely down node still
surfaces as Unavailable from the RPC and stays best-effort.

* fix(ec): reject a short shard read in the local EC needle reader

read_ec_shard_needle ignored the byte count from shard.read_at and appended the
whole pre-sized buffer, so a truncated shard's zero-filled tail passed the later
length check and parsed as garbage. Require n == buf.len() per interval, erroring
on a short read like the local interval reader already does.

* fix(ec): probe reachability before skipping a node that returns Unavailable

The pre-encode sweep skipped any node whose teardown delete returned
codes.Unavailable, but a reachable volume server in maintenance mode also
returns that code for the maintenance-gated delete, so its stale EC files were
left behind on a node that can still receive the new generation. Confirm with a
non-maintenance-gated empty-target Ping: skip only when the node fails the probe
too (genuinely unreachable).

* fix(ec): use try_exists for the teardown .vif .idx guard

The teardown-only .vif removal gated on Path::exists(), which returns false on a
permission/IO stat error, so a stat failure on a present .idx would read as a
shard-only node and delete the live source volume's .vif. Gate on
try_exists() == Ok(false) instead, preserving the sidecar on any stat error.

* fix(ec): only skip a sweep node when a Ping confirms it is transport-down

The pre-encode sweep skipped a node whenever its teardown delete and a liveness
Ping both failed, but it treated ANY Ping error as down -- an application-level
Internal/ResourceExhausted, or Unimplemented from a pre-Ping server, left a
reachable node's stale generation in place. Classify the Ping tri-state and skip
only when it transport-fails with codes.Unavailable; a reachable or inconclusive
node stays fatal.

* fix(ec): exclude sweep-skipped nodes from the encode's rebalance

The pre-encode sweep skips a genuinely-down node best-effort, but the rebalance
then recollected the current topology — a node that recovered between the two
could become a copy target and receive the new generation while still holding
its stale, never-cleared shards. Have the sweep return the skipped set and
exclude those nodes from the rebalance for this encode, so a node we could not
clean cannot receive the new generation. Standalone ec.balance is unaffected.

* fix(ec): re-sweep recovered nodes before generation so they aren't stranded

A node skipped as down by the pre-encode sweep is excluded from the rebalance,
but it can recover and become the generation host — mounting all shards locally,
then being excluded from distribution. Union-only verification accepts all
shards on one node and deletes the originals: a single point of failure. Re-sweep
the skipped nodes just before generation; one whose teardown now succeeds leaves
the skipped set and rebalances normally, while a node still down stays skipped.

* fix(ec): abort the encode if a selected source is still skipped after re-sweep

The re-sweep un-skips a recovered node, but the source was selected before it and
a node can stay down through the re-sweep then recover just in time to be the
generation host — mounting all shards locally while still excluded from the
rebalance, which union-only verification accepts before deleting the originals.
Abort the encode when a selected source remains skipped after the re-sweep.

* fix(ec): batch delete returns retriable 503 when a volume became EC mid-batch

If a volume is not EC at the batch-delete classification but is encoded to EC and
its .dat deleted before the regular-volume mutation, the mutation returns an exact
"not found" that the filer chunk-GC treats as completed, dropping the delete.
Recheck EC presence under the mutation lock and return a retriable 503 with the
"try again" token so the filer requeues it onto the EC path.

* fix(ec): recheck EC state before the regular batch-delete mutation

ec.encode mounts EC shards (copied from the .dat) before deleting the originals,
so a volume can be EC while its .dat still exists. The batch delete only rechecked
EC after a NotFound, so a successful regular-volume delete in that window wrote a
tombstone to the soon-removed .dat — the delete was lost and the needle resurrected
from the pre-tombstone shards. Recheck has_ec_volume under the write lock before
delete_volume_needle and return a retriable 503 so the filer requeues onto the EC path.

* fix(volume): make the metrics push test independent of test order

test_push_metrics_once asserted the pushed body contains the request-counter
family without ever touching the counter — a CounterVec with no children emits
nothing, so the assertion only held when another test had already created a
labelset in the shared registry. Create one in the test itself.
2026-06-10 22:31:18 -07:00
Bruce Zou 1dd292fb84 batch drain delta heartbeat messages (#9914) 2026-06-10 13:33:45 -07:00
Lisandro Pin 6b4d20a6f3 volume.scrub and ec.scrub shell commands: make the display of scrub details optional. (#9911)
On volumes failing scrubs, the detail output can get very verbose, which makes
reading results difficult. Most users won't care about this information to
begin with - just whether or not volumes pass scrub tests.

This MR gates the display of scrub result details behind a `--details` flag.
2026-06-10 13:29:07 -07:00
Chris Lu 594fc667d5 Cut per-subscriber replay decode and widen replay concurrency (#9917)
* Filter metadata events before unmarshaling them per subscriber

Every subscriber unmarshaled every log entry into a full event just to
run the path filter, and entries carry complete chunk lists, so a fleet
of path-filtered subscribers spends almost all replay CPU materializing
events it then discards. A shallow wire scan now extracts just the
directory, entry names and rename destination into a skeleton event,
feeds the same matcher, and skips the decode for entries the subscriber
cannot match. Any scan surprise (malformed bytes, merged duplicate
message fields) falls back to the full decode, and the unsynced-events
heartbeat keeps firing for skipped entries.

* Raise the legacy replay cap

The cap was sized when every replay pinned a private chunk reader per
source filer. Replays now share decoded chunks, so sixteen needlessly
serializes subscriber catch-up; the expensive part stays bounded by the
cache's load gate.

* Weight concurrent log-chunk loads by size

The flat eight-load gate let eight tiny chunks through as reluctantly as
eight full ones. Charge each load's chunk size against a 128MB in-flight
budget instead: small chunks decode wide open while full-size ones still
serialize enough to cap the transient peak. Oversized weights clamp to
the budget so they can always acquire.
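
A minimal sketch of the size-weighted gate described above, assuming a mutex-and-condition-variable design. The names weightedGate, acquire and release are hypothetical and are not taken from the change itself; only the 128MB budget and the clamping rule come from the message.

```go
package main

import (
	"fmt"
	"sync"
)

// weightedGate is a hypothetical admission gate that charges each load
// its size in bytes against a fixed in-flight budget.
type weightedGate struct {
	mu     sync.Mutex
	cond   *sync.Cond
	budget int64
	inUse  int64
}

func newWeightedGate(budget int64) *weightedGate {
	g := &weightedGate{budget: budget}
	g.cond = sync.NewCond(&g.mu)
	return g
}

// acquire blocks until the (clamped) weight fits in the budget and
// returns the weight actually charged, which the caller passes to release.
func (g *weightedGate) acquire(weight int64) int64 {
	if weight > g.budget {
		weight = g.budget // clamp so an oversized load can always acquire
	}
	if weight < 0 {
		weight = 0
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	for g.inUse+weight > g.budget {
		g.cond.Wait()
	}
	g.inUse += weight
	return weight
}

// release returns a previously charged weight to the budget.
func (g *weightedGate) release(charged int64) {
	g.mu.Lock()
	g.inUse -= charged
	g.mu.Unlock()
	g.cond.Broadcast()
}

func main() {
	g := newWeightedGate(128 << 20) // 128MB budget, as in the message
	small := g.acquire(1 << 20)
	fmt.Println("charged for a 1MB chunk:", small)
	g.release(small)
	big := g.acquire(512 << 20) // oversized: clamped to the whole budget
	fmt.Println("charged for a 512MB chunk:", big)
	g.release(big)
}
```

An oversized load ends up holding the entire budget, so it runs alone, while many small loads can be in flight together.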

* Propagate heartbeat send failures and reset the skip counter

A failed heartbeat send means the stream is gone, so end the replay
instead of scanning on. A delivered event also resets the skip counter,
keeping the heartbeat cadence relative to the last thing the client
actually received.

* Share the unsynced-events counter across the prefilter and delivery

Two independent counters could starve the heartbeat: alternating drops
reset each side before either reached its threshold. One shared counter
increments on every dropped entry, prefiltered or not, and only an
actual delivery resets it, restoring the original cadence exactly.
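
The shared-counter idea can be sketched as follows. This is an illustration only: heartbeatCounter and its methods are hypothetical names, the threshold value is arbitrary, and resetting the counter after a heartbeat fires is an assumption of the sketch.

```go
package main

import "fmt"

// heartbeatCounter is a hypothetical single counter shared by the
// prefilter path and the post-decode delivery path.
type heartbeatCounter struct {
	skipped   int
	threshold int
}

// onDropped records one dropped entry, whichever path dropped it, and
// reports whether a heartbeat is now due.
func (c *heartbeatCounter) onDropped() bool {
	c.skipped++
	if c.skipped >= c.threshold {
		c.skipped = 0 // assumption: a sent heartbeat restarts the count
		return true
	}
	return false
}

// onDelivered resets the count: the client just received a real event.
func (c *heartbeatCounter) onDelivered() {
	c.skipped = 0
}

func main() {
	c := &heartbeatCounter{threshold: 4}
	heartbeats := 0
	// Drops alternate between the two paths, but both feed one counter.
	for i := 0; i < 8; i++ {
		if c.onDropped() {
			heartbeats++
		}
	}
	fmt.Println("heartbeats after 8 alternating drops:", heartbeats)
}
```

With two independent counters, each side would only ever see half of the drops; with one counter the cadence depends on total drops since the last delivery.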

* Tighten comments

* Benchmark the subscription match paths

For a thousand-chunk event that the subscriber filters out, the shallow
scan matches in 10us and 9 allocations against 175us and 4031
allocations for the full decode.
2026-06-10 13:08:34 -07:00
Chris Lu e56a1c4c05 admin: pre-gzip embedded static assets, add cache headers (#9918)
The admin UI served embedded static files uncompressed and without
cache headers: embed.FS has zero mod times, so no Last-Modified, no
ETag, no 304s -- every page load re-downloaded ~700KB of css/js in
full, which gets painful over slow or tunneled links.

Gzip the static tree at generation time (go generate ./weed/admin)
and embed only the compressed mirror, shrinking the binary ~1.5MB.
The handler hands the pre-compressed bytes to gzip-capable clients,
decompresses for the rest, and sets Cache-Control, per-variant
content-hash ETags and Vary so repeat loads revalidate with a 304.
bootstrap.min.css goes 232KB -> 30KB on the wire.

A drift test keeps static_gz/ in sync with static/.
2026-06-10 12:54:36 -07:00
Chris Lu c2271d59bb log_buffer: stop dumping the whole log entry on callback errors (#9919)
The eachLogDataFn error path printed the full LogEntry proto. For an
entry carrying a large chunk manifest that is hundreds of KB of escaped
bytes in a single log line, burying the actual error -- often just a
subscriber disconnect -- at the very end. Log the key, timestamp,
offset and data size instead.
2026-06-10 12:47:35 -07:00
Chris Lu 2ac5aa72c7 add elastic8 filer store for Elasticsearch 8 (#9916)
* elastic: fix listing against a missing or empty directory index

The refresh 404 leaked into the named return, so the first listing of a
directory whose index does not exist yet returned an error instead of an
empty result. Sorting also fails on an index with no documents
("No mapping found for [_id] in order to sort on"); unmapped_type
keeps the resumed-listing path working there.

* add elastic8 filer store for Elasticsearch 8

Elasticsearch 8 disables _id fielddata by default, so the elastic7
store's directory listings fail with "Fielddata access on the _id
field is disallowed". elastic8 uses the same client and configuration
options, but also indexes the document id as an Id field and sorts
listings on Id.keyword.
2026-06-10 12:10:49 -07:00
7y-9 689b5b61bf fix(s3api): reject empty v4 signed header names (#9910)
Problem: Signature V4 SignedHeaders parsing accepted empty header name segments such as host; or ;host. Malformed Authorization headers could continue into signature verification instead of failing during header parsing.

Root cause: parseSignedHeader only checked that the SignedHeaders value was non-empty, then split it on semicolons without validating each element.

Fix: reject empty or whitespace-only signed header elements with ErrMissingFields before returning the parsed header list.

Reproduction: go test ./weed/s3api -run TestParseSignedHeaderRejectsEmptyHeaderNames -count=1 failed before the fix because SignedHeaders=host; returned ErrNone.

Validation: gofmt -w weed/s3api/auth_signature_v4.go weed/s3api/auth_signature_v4_test.go; git diff --check; go test ./weed/s3api -run TestParseSignedHeaderRejectsEmptyHeaderNames -count=1; go test ./weed/s3api -count=1
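
A minimal sketch of the per-element validation described under Fix. It is not the actual parseSignedHeader: the function name parseSignedHeaders and the errMissingFields variable are stand-ins for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errMissingFields stands in for the ErrMissingFields code named above.
var errMissingFields = errors.New("missing fields")

// parseSignedHeaders splits a SignedHeaders value on semicolons and
// rejects any empty or whitespace-only element.
func parseSignedHeaders(value string) ([]string, error) {
	if value == "" {
		return nil, errMissingFields
	}
	names := strings.Split(value, ";")
	for _, name := range names {
		if strings.TrimSpace(name) == "" {
			return nil, errMissingFields
		}
	}
	return names, nil
}

func main() {
	for _, v := range []string{"host;x-amz-date", "host;", ";host"} {
		names, err := parseSignedHeaders(v)
		fmt.Printf("%q -> %v, err=%v\n", v, names, err)
	}
}
```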

Co-authored-by: Codex <noreply@openai.com>
2026-06-10 11:00:35 -07:00
Chris Lu 7bf2dfc9ab Bound the metadata-log flush queue (#9907)
* Bound the metadata-log flush queue

A stalled flush, e.g. slow volume servers under a reconnect storm, let up
to 256 queued 8MB buffer copies pin two gigabytes per log buffer while
producers kept filling the queue. Cap the queue at 16 so a sustained
stall backpressures writers instead of growing the heap. The flush
goroutine never feeds back into the buffer (system-log paths skip event
notification), so blocked producers cannot deadlock the consumer.

* Don't drop a force-flushed buffer on a full queue

ForceFlush enqueued with a two-second timeout, but by then the live
buffer was already sealed and reset, so a timed-out send silently lost
the copy. Block until the flush is queued; the wait for completion stays
bounded since the data is durable once the flush loop drains it.

* Never close the flush channel

ShutdownLogBuffer closed flushChan while producers could still be
blocked sending into it, which panics. Terminate loopFlush with a nil
sentinel instead, so the channel is never closed, and give every
producer-side send a shutdown escape so none parks forever once the
flush loop exits. Everything queued before the sentinel still drains,
preserving IsAllFlushed semantics.
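
The nil-sentinel pattern can be sketched like this. The flusher type and its fields are hypothetical and much simpler than the real log buffer; the sketch assumes shutdown is called once, and it does not handle a producer that races past the shutdown check at the exact moment the sentinel is queued.

```go
package main

import (
	"fmt"
	"sync"
)

// flusher is a hypothetical stand-in for the log buffer's flush side.
type flusher struct {
	flushChan chan []byte   // never closed; a nil value ends the loop
	done      chan struct{} // closed on shutdown: the producers' escape
	wg        sync.WaitGroup
	flushed   [][]byte
}

func newFlusher() *flusher {
	f := &flusher{flushChan: make(chan []byte, 16), done: make(chan struct{})}
	f.wg.Add(1)
	go f.loop()
	return f
}

// loop drains the queue in order and stops at the nil sentinel.
func (f *flusher) loop() {
	defer f.wg.Done()
	for {
		buf := <-f.flushChan
		if buf == nil {
			return
		}
		f.flushed = append(f.flushed, buf)
	}
}

// enqueue reports false instead of parking forever once shutdown began.
func (f *flusher) enqueue(buf []byte) bool {
	select {
	case <-f.done:
		return false
	default:
	}
	select {
	case f.flushChan <- buf:
		return true
	case <-f.done:
		return false
	}
}

// shutdown releases blocked producers, queues the sentinel behind any
// pending buffers, and waits for the loop to drain them. Call it once.
func (f *flusher) shutdown() {
	close(f.done)
	f.flushChan <- nil
	f.wg.Wait()
}

func main() {
	f := newFlusher()
	f.enqueue([]byte("first"))
	f.enqueue([]byte("second"))
	f.shutdown()
	fmt.Println("flushed before exit:", len(f.flushed))
	fmt.Println("enqueue after shutdown accepted:", f.enqueue([]byte("late")))
}
```

Because the channel is never closed, a producer that is mid-send during shutdown cannot panic; it either completes the send or takes the done branch.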

* Copy the shutdown flush under the buffer lock

Every other copyToFlush call site holds the lock; the shutdown path read
the live buffer unlocked while producers could still be appending.
2026-06-10 10:57:30 -07:00
Chris Lu bf76040046 Share metadata-log replays per chunk instead of per file (#9906)
* Share metadata-log replays per chunk instead of per file

Log file chunks are immutable: each metadata-log flush uploads one whole
buffer of complete records as a new chunk, and appends only add chunks.
So cache decoded entries per chunk, with no age gate and no fingerprint
revalidation. The per-file cache excluded files younger than two flush
intervals, which is exactly the hot tail that every tailing or
reconnecting subscriber replays — each through a private chunk reader
holding an 8MB buffer and decoding the whole file from byte zero.

A chunk's flush time also upper-bounds every record timestamp inside it,
so a tail replay now skips cold chunks without reading them at all.

If a chunk does not decode standalone (records spanning chunk
boundaries, or a corrupt size prefix), fall back to streaming the whole
file as one byte stream, resuming after the last yielded entry.

* Evict idle metadata-log cache entries

The replay cache only evicted on insert, so once filled it held its full
budget forever. Stamp entries on use and sweep the LRU tail every minute,
dropping anything untouched for five minutes; the cache now holds memory
only while subscribers actually replay.

* Reject implausible records when decoding log chunks

proto.Unmarshal is permissive: empty payloads and unknown-field garbage
parse without error, so a chunk starting mid-record could decode by
coincidence and get cached instead of falling back to the byte stream.
Enforce what the writer guarantees - records are never empty and carry
strictly increasing positive timestamps within one flushed buffer.
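
A sketch of the plausibility rule stated above, applied after a chunk has been decoded. The record type and the plausible function are hypothetical; the real check works on decoded log entries.

```go
package main

import "fmt"

// record is a hypothetical decoded log record: a timestamp and a payload.
type record struct {
	tsNs    int64
	payload []byte
}

// plausible enforces what the writer guarantees for one flushed buffer:
// no empty records, and strictly increasing positive timestamps.
func plausible(records []record) bool {
	var last int64 // starts at 0, so the first timestamp must be positive
	for _, r := range records {
		if len(r.payload) == 0 || r.tsNs <= last {
			return false
		}
		last = r.tsNs
	}
	return true
}

func main() {
	good := []record{{1, []byte("a")}, {2, []byte("b")}}
	repeated := []record{{2, []byte("a")}, {2, []byte("b")}}
	fmt.Println(plausible(good), plausible(repeated))
}
```

A chunk that fails the check would take the byte-stream fallback rather than being cached.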

* Gate the singleflight test on an open flight

The sleep alone only probabilistically created concurrent misses; a
started channel now proves the loader holds the flight before callers
are released.
2026-06-10 10:57:11 -07:00
Lisandro Pin 5150c86934 Make shell command ec.scrub return shard details upon scrub failures in LOCAL mode. (#9913)
This is useful information to deal with issues requiring EC shard rebuilding,
such as https://github.com/seaweedfs/seaweedfs/issues/9872.
2026-06-10 10:55:16 -07:00
7y-9 7c0a9acb30 fix(s3api): normalize checksum trailer header names (#9905)
Problem: SigV4 chunked upload checksum trailer parsing rejected mixed-case checksum header names even though HTTP header field names are case-insensitive.

Root cause: extractChecksumAlgorithm compared the x-amz-trailer value and trailer header key against exact lowercase strings.

Fix: Trim and lowercase checksum trailer header names before matching supported checksum algorithms.

Reproduction: go test ./weed/s3api -run TestExtractChecksumAlgorithmIsCaseInsensitive -count=1 with X-Amz-Checksum-Crc32; before the fix it returned unsupported checksum algorithm.

Validation: gofmt -w weed/s3api/chunked_reader_v4.go weed/s3api/chunked_reader_v4_test.go; git diff --check; go test ./weed/s3api -run TestExtractChecksumAlgorithmIsCaseInsensitive -count=1; go test ./weed/s3api -count=1
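
The normalization under Fix can be sketched as below. This is not the actual extractChecksumAlgorithm; the function name and the set of supported headers in the map are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// supported maps lowercase trailer header names to algorithm labels.
// The exact set the server supports is an assumption of this sketch.
var supported = map[string]string{
	"x-amz-checksum-crc32":  "CRC32",
	"x-amz-checksum-crc32c": "CRC32C",
	"x-amz-checksum-sha1":   "SHA1",
	"x-amz-checksum-sha256": "SHA256",
}

// checksumAlgorithm trims and lowercases the header name before lookup,
// since HTTP header field names are case-insensitive.
func checksumAlgorithm(header string) (string, error) {
	name := strings.ToLower(strings.TrimSpace(header))
	if alg, ok := supported[name]; ok {
		return alg, nil
	}
	return "", fmt.Errorf("unsupported checksum algorithm: %q", header)
}

func main() {
	alg, err := checksumAlgorithm(" X-Amz-Checksum-Crc32 ")
	fmt.Println(alg, err)
}
```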

Co-authored-by: Codex <noreply@openai.com>
2026-06-10 00:30:43 -07:00
Chris Lu 9e98ec4b2e Share decoded metadata-log entries across subscriber replays (#9903)
perf(filer): share decoded log entries across metadata replays

Concurrent SubscribeMetadata replays of the same persisted log history each
opened a chunk reader per source filer and re-decoded the same files, so a
reconnect storm multiplied into many GB of buffers. Cache the decoded entries
of completed log files in a bounded LRU, coalescing concurrent loads with
single-flight and bounding concurrent decodes. Each hit is validated against
the file's current chunk set, so a file that received a late append is
reloaded rather than served stale; reads that stop on an unreachable chunk are
delivered but not cached so a transient outage re-probes on the next replay.
2026-06-09 13:34:11 -07:00
Chris Lu e12052ee6b fix(filer.sync): replicate a rename as an atomic move, not a no-op update (#9895)
* fix(filer.sync): replicate a rename as create-then-delete, not an in-place update

A rename arrives as a single metadata event carrying both the old and new
entry. The filer sink was routed to UpdateEntry, which looks up the old
path but issues the update against the new parent without changing the
name — and the filer UpdateEntry RPC cannot move an entry. So the rename
was dropped: the old path lingered and the new path never appeared
(same-dir renames rewrote the old name in place).

Route a real move (the sink path changed) through CreateEntry(new) then
DeleteEntry(old) in both the replicator and the filer.sync/backup driver,
the way the other sinks already handle it; reach UpdateEntry only for true
in-place updates. Create before delete so a crash between the two leaves
the entry visible rather than lost.

* fix(filer.sync): derive the rename delete key like the create key, guard the watched root

The rename delete leg rebuilt the old key with a raw util.Join, bypassing the
sink-side key normalization the create leg gets from buildKey — so a rename
could create the new entry and then fail to delete the old one under a
transformed key. Build the old key through buildKey too, and skip the delete
when the moved entry is the watched root itself (where the old key would
resolve to the target root and recursively delete the whole sink tree).

* test(filer.sync): cover the in-place update delete-then-create fallback order

The recording sinks always reported foundExisting, so the fallback that an
in-place update takes when the entry is missing on the sink was never run.
Make it configurable and assert the fallback deletes before it recreates the
same key, in both the replicator and the filer.sync drivers.

* feat(filer.sync): move filer-sink renames natively via AtomicRenameEntry

create-then-delete is unsafe for the filer sink: CreateEntry returns nil
without creating on a transient chunk-copy error, so the paired delete could
remove the only valid destination copy; a directory rename also deleted the
old subtree before descendants were recreated, and left old chunks behind.

Add an optional EntryMover sink capability and implement it on the filer sink
via AtomicRenameEntry — one atomic, metadata-only move that relocates a whole
subtree in a single transaction. Renames prefer it; sinks without a native
move keep create-then-delete. When the old path is already gone (a descendant
the parent rename moved, or one never replicated) MoveEntry creates the new
path instead, re-checking existence with a lookup so a rolled-back move that
left the old entry intact is retried rather than mistaken for gone.

* docs(filer.sync): note entryMissing's gRPC not-found string fallback is deliberate
2026-06-09 12:54:28 -07:00
7y-9 a9e4995d76 fix(http): accept no content delete responses (#9893)
* fix(http): accept no content delete responses

Problem: util/http.Delete reports an error for a successful HTTP 204 No Content response.

Root cause: Delete only treats 200 OK, 202 Accepted, and 404 Not Found as non-error responses, omitting the standard 204 status commonly returned by DELETE endpoints.

Fix: Include http.StatusNoContent in the Delete success status set.
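
As a sketch, the resulting status check looks roughly like this. The helper name deleteSucceeded is hypothetical; the four status codes are the ones listed in the message.

```go
package main

import (
	"fmt"
	"net/http"
)

// deleteSucceeded reports whether a DELETE response status should be
// treated as a non-error outcome.
func deleteSucceeded(status int) bool {
	switch status {
	case http.StatusOK, http.StatusAccepted, http.StatusNoContent, http.StatusNotFound:
		return true
	}
	return false
}

func main() {
	fmt.Println(deleteSucceeded(http.StatusNoContent))
	fmt.Println(deleteSucceeded(http.StatusInternalServerError))
}
```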

Reproduction: go test ./weed/util/http -run TestDeleteTreatsNoContentAsSuccess -count=1 fails before the fix with an empty error for a 204 response.

Validation: go test ./weed/util/http -run TestDeleteTreatsNoContentAsSuccess -count=1; go test ./weed/util/http -count=1; git diff --check; git diff --cached --check
Co-authored-by: Codex <noreply@openai.com>

* Update weed/util/http/http_global_client_util_test.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-06-09 11:45:14 -07:00
Chris Lu 048f9ece2d Fix filer metadata-replay OOM under mount reconnect storms (#9901)
* fix(filer): propagate multi-filer metadata log read errors

A genuine (non not-found) read error in one filer's log stream was logged
and skipped, then the merged cursor advanced past the gap, silently
dropping that file's events. Abort the whole replay so the subscriber
re-reads from the unchanged position; chunk-not-found still skips.

* perf(mount): read persisted metadata log chunks directly from volume servers

Set LogFileReaderFn so the filer returns log file references and the mount
reads the chunk data itself, instead of the filer reading, decoding, and
streaming every persisted entry. Keeps a reconnect storm of many mounts
from concentrating hundreds of concurrent log replays in filer memory.

* perf(filer): pre-size chunk stream reader buffer to view size

The chunk size is known up front, so grow the buffer once instead of
letting bytes.Buffer double as the streamed pieces arrive (which
transiently overshoots to ~2x per reader).

* fix(filer): bound concurrent persisted-log replays

Each server-side replay holds an open chunk reader per source filer plus a
readahead buffer, so a reconnect storm of clients that predate the
metadata-chunks offload multiplies into many GB. Gate replays with a
semaphore; abort the acquire when the subscriber's stream is gone so
cancelled clients do not pile up parked goroutines.
2026-06-09 11:43:12 -07:00
Chris Lu 8776b9d311 feat(filer): object size distribution metric and dashboard panels (#9902)
* feat(filer): record object size distribution histogram

Add SeaweedFS_filer_object_size_bytes, a histogram sampled when an
object is first created in the filer namespace, covering every write
protocol (S3, WebDAV, FUSE mount, direct HTTP). Buckets follow the
1KB/100KB/1MB/100MB/1GB ranges operators use to size collections.
Directories, overwrites, and metadata-only updates are not sampled, so
the bucket counts track the size distribution of distinct objects.

* feat(metrics): add filer object size distribution dashboard panels

Add a write-rate-by-size-range graph and a size-distribution bar gauge,
driven by SeaweedFS_filer_object_size_bytes, to the standalone and Helm
Grafana dashboards. Per-range subtractions are clamped at zero so
transient negative rate() samples do not render below the axis.
2026-06-09 10:41:11 -07:00
Chris Lu 7b07d8177a fix(filer.sync): scope filesystem key sanitization to the local sink (#9894)
* fix(filer.sync): scope filesystem key sanitization to the local sink

destKey ran every sink key through escapeKey, whose Windows build strips
colons. Colons are illegal in NTFS filenames so the local sink needs that,
but s3/filer/azure/gcs/b2 accept them as ordinary key bytes — stripping
them silently diverged the destination key (a source a:b replicated as ab).

Move the sanitization into the local sink behind a Windows build tag,
applied at every entry point so the previously-unescaped in-place-update
paths stay consistent. Non-local sinks now keep the raw key; non-Windows
builds are unchanged; a leading drive-letter colon is preserved.

* test(filer.sync): cover incremental destKey and localsink update/delete sanitization

Lock the colon-preserving behavior for the incremental destKey branch, and
extend the Windows local-sink test to assert UpdateEntry and DeleteEntry also
sanitize the key, not just CreateEntry.
2026-06-09 10:18:49 -07:00
Jaehoon Kim 202517c02a fix(filer.backup): skip replay events whose source chunk was superseded or deleted (#9886)
* fix(filer.backup): skip replay events whose chunk no longer exists on the source

"Source" is the filer we replicate FROM (e.g. green in a green->blue backup).

Replaying the metadata log from a checkpoint can hit an event whose chunk was
since overwritten/deleted and garbage-collected on the source volume. Fetching
it returns 0 bytes (a permanent size mismatch), which the sink propagated to the
subscription — so the same offset retried forever and replication stalled.

Skip the event only when proven stale; otherwise keep refusing so genuine loss
of a live file still halts loudly:

- onCorruptChunk centralizes the three errChunkSizeMismatch sites.
- getEntryMtimeNs compares mtime at nanosecond precision so same-second rewrites
  (git's config.lock dance) are ordered correctly.
- sourceSupersedes re-reads the entry's current state on the source: gone
  (ErrNotFound) or a strictly-newer mtime than the replayed version -> skip;
  any other lookup error keeps the entry.

Skipping is lossless: events are full-entry snapshots, so a later event
re-carries the current chunks and a delete event reconciles a removed file.

* test(filer.backup): cover the superseded-chunk skip decision

- TestSourceSupersedes: not-found (sentinel / wrapped / gRPC string) and nil
  entry -> skip; network error -> keep; source newer -> skip; same/older -> keep.
- TestGetEntryMtimeNs: nanosecond precision, same-second ordering, nil safety.
- TestOnCorruptChunkRefusesWhenSupersessionUnconfirmed: never skip silently when
  supersession cannot be confirmed.

* fix(filer.backup): don't infer supersession for incremental sinks

In incremental mode the sink key carries a date prefix
(sinkDir/YYYY-MM-DD/relPath) that cannot be reversed to a real source path, so a
source lookup would always be ErrNotFound and wrongly classify a live entry as
deleted — skipping it. Make targetPathToSourcePath report "unmappable" in
incremental mode; hasSourceNewerVersion already declines to skip when the source
path cannot be mapped.

Found in code review. Non-incremental sinks (filer.backup green->blue) are
unaffected.

* refactor(filer.backup): name the mtime param sourceMtimeNs; note ns overflow bound

- Rename the threaded sourceMtime parameter to sourceMtimeNs across the internal
  replicate/fetch helpers so the unit is explicit (it only feeds
  hasSourceNewerVersion, which compares in nanoseconds).
- Document that getEntryMtimeNs's int64 ns arithmetic is safe until ~year 2262.

No behavior change.

* fix(filer.backup): order same-second versions in the CreateEntry skip and update gates

The CreateEntry already-replicated short-circuit and chooseUpdateAction
still compared second-grained mtime, so a newer version written within
the same second could be skipped as already-replicated or overwritten by
an older same-second replay. Route both through getEntryMtimeNs, matching
the precision the chunk-replication path already uses.

* test(filer.backup): cover same-second update-action ordering

* docs(filer.backup): trim verbose comments to terse why

* fix(filer.backup): check supersession against the rename's new path

For a rename the filer sink updates in place (the delete+create branch is
skipped for sink name "filer"), so the corrupt-chunk supersession check
queried the pre-rename key. Its source-side ErrNotFound was read as
"superseded", silently advancing the checkpoint without applying the rename.
Map the incoming entry's new path (newParentPath/newEntry.Name) for both
update branches.

* fix(filer.backup): detect a deleted source even when the replayed mtime is epoch

hasSourceNewerVersion returned early when sourceMtimeNs <= 0, skipping the
source lookup, so a deleted entry with mtime 0 (a valid epoch timestamp) never
got the gone verdict and wedged on permanent retries. Always look up; gate only
the newer-mtime comparison on a valid replayed mtime.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-09 08:53:29 -07:00
7y-9 1cf92f6c2e fix(s3api): clear stale object lock years (#9890)
Problem: Re-storing object-lock default retention with Days left a previous Years extended attribute in place, so later loads could see both Days and stale Years.

Root cause: StoreObjectLockConfigurationInExtended only wrote period fields that were set on the new configuration and did not delete old Days or Years keys before writing the replacement rule.

Fix: Clear stored default-retention Days and Years keys before writing the current default retention period fields.

Reproduction: go test ./weed/s3api -run TestStoreObjectLockConfigurationClearsStaleYears -count=1 failed before the fix because the stale years key remained.

Validation: go test ./weed/s3api -run TestStoreObjectLockConfigurationClearsStaleYears -count=1; go test ./weed/s3api -count=1; git diff --check; git diff --cached --check
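
A minimal sketch of the clear-then-write order described under Fix. The key names and the storeDefaultRetention helper are hypothetical; the real extended attribute keys are not shown in this message.

```go
package main

import (
	"fmt"
	"strconv"
)

// Hypothetical extended-attribute keys for the default retention period.
const (
	keyDays  = "default-retention-days"
	keyYears = "default-retention-years"
)

// storeDefaultRetention removes both period keys first, then writes only
// the field that is set, so an old value cannot survive the replacement.
func storeDefaultRetention(extended map[string][]byte, days, years int) {
	delete(extended, keyDays)
	delete(extended, keyYears)
	if days > 0 {
		extended[keyDays] = []byte(strconv.Itoa(days))
	}
	if years > 0 {
		extended[keyYears] = []byte(strconv.Itoa(years))
	}
}

func main() {
	ext := map[string][]byte{}
	storeDefaultRetention(ext, 0, 1)  // first rule: one year
	storeDefaultRetention(ext, 30, 0) // replaced by thirty days
	_, staleYears := ext[keyYears]
	fmt.Println("days:", string(ext[keyDays]), "stale years present:", staleYears)
}
```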

Co-authored-by: Codex <noreply@openai.com>
2026-06-09 00:48:38 -07:00
Chris Lu 7aba10fa1a fix(mongodb): merge URI auth fields with username/password override (#9889)
* fix(mongodb): merge URI auth fields with username/password override

SetAuth replaced the whole Credential parsed from the URI, dropping
AuthSource and AuthMechanism. Start from the URI-parsed Auth and only
override the username and password so credentials scoped to a specific
auth database keep working.

* fix(mongodb): set PasswordSet for explicit credentials

Required by GSSAPI auth when a password is supplied; ignored for other
mechanisms.
2026-06-09 00:18:33 -07:00
Chris Lu 2871e6552a fix(s3api): drop ancestor directory markers from prefixed ListObjectVersions (#9885)
processExplicitDirectory appended a directory-key object as a version
without checking it against the prefix. A versioned listing descends
through ancestor markers to reach a deeper prefix, so every ancestor
(Veeam/, Veeam/Backup/, ...) leaked into Versions even though none of
them match the prefix - which makes Veeam's immutable repository scan
abort on an unexpected key. Guard on the prefix so only keys at or under
it surface, matching ListObjectsV2 and AWS.
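
The guard amounts to a prefix test on each directory key before it is appended. A sketch, with a hypothetical filterByPrefix helper and made-up keys under the Veeam/ example from the message:

```go
package main

import (
	"fmt"
	"strings"
)

// filterByPrefix keeps only keys at or under the requested prefix, so
// ancestor directory markers visited on the way down are dropped.
func filterByPrefix(keys []string, prefix string) []string {
	var kept []string
	for _, key := range keys {
		if strings.HasPrefix(key, prefix) {
			kept = append(kept, key)
		}
	}
	return kept
}

func main() {
	visited := []string{"Veeam/", "Veeam/Backup/", "Veeam/Backup/repo/", "Veeam/Backup/repo/file"}
	fmt.Println(filterByPrefix(visited, "Veeam/Backup/repo/"))
}
```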
2026-06-09 00:01:06 -07:00
7y-9 d569dd686f fix(shell): move files into existing destination directories (#9887)
* fix(shell): move files into existing destination directories

Problem: fs.mv /src/file /dst/dir treats an existing destination directory as a destination file path, so it renames the source to /dst/dir instead of moving it into /dst/dir/file.

Root cause: commandFsMv builds the destination LookupDirectoryEntryRequest with Directory and Name swapped, so the destination directory lookup misses.

Fix: Populate LookupDirectoryEntryRequest with Directory=destinationDir and Name=destinationName before deciding whether the destination is a directory.

Reproduction: env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/shell -run TestFsMvMovesIntoExistingDestinationDirectory -count=1

Validation: gofmt -w weed/shell/command_fs_mv.go weed/shell/command_fs_mv_test.go; git diff --check; git diff --cached --check; env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/shell -run TestFsMvMovesIntoExistingDestinationDirectory -count=1; env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/shell -count=1

* Update weed/shell/command_fs_mv_test.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-06-08 23:42:13 -07:00
Chris Lu 1c9039d3ac fix(seaweed-volume): stop EC shard deletion from phantom .dat on restart (#9874)
* fix(seaweed-volume): stop EC shard deletion from phantom .dat on restart

On startup load_existing_volumes() scans .vif/.idx entries (not just
.dat). For distributed EC, a volume's .vif can be mirrored onto a disk
whose .ecx lives on a sibling disk, so the per-disk ecx check is false
and the loader falls through to Volume::new, which always creates the
.dat if missing -> a phantom 8-byte superblock stub. The store-level
prune_incomplete_ec_with_sibling_dat then treats that stub as the
authoritative source and deletes the real EC shards on sibling disks. Go
guards the same case (disk_location.go: 'Without this guard NewVolume
below would create a phantom empty .dat') but only same-disk.

Fix A (root cause): in load_existing_volumes, don't create a .dat during
load. Skip the entry when there is no local .dat AND the .vif does not
reference remote files -- remote-tiered volumes have no local .dat but
must still load via the remote path. Uses the robust check_dat_file_exists
helper so a transient stat error doesn't skip a real volume. New volumes
go through create_volume(). Covers the cross-disk .vif/.ecx split Go's
same-disk hasEcxFile() misses.

Fix B (defense in depth, Go + Rust): when the EC .vif records no source
size (dat_file_size==0), require the sibling .dat to be strictly larger
than a bare superblock, so an empty 8-byte stub can't pass the
credibility gate. Previously it fell back to SUPER_BLOCK_SIZE, which an
8-byte stub exactly meets.
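
Fix B's gate can be sketched as a small predicate. The function name datIsCredible is hypothetical, and the at-least comparison for a recorded size reflects the gate as it stood at this commit.

```go
package main

import "fmt"

// superBlockSize is the size of a bare superblock stub, in bytes.
const superBlockSize = 8

// datIsCredible decides whether a sibling .dat may be trusted as the
// source. recordedSize is the dat_file_size from the EC .vif, or 0 when
// none was recorded.
func datIsCredible(datSize, recordedSize int64) bool {
	if recordedSize == 0 {
		// Strictly larger: an empty 8-byte stub must not pass.
		return datSize > superBlockSize
	}
	return datSize >= recordedSize
}

func main() {
	fmt.Println(datIsCredible(8, 0))    // phantom stub
	fmt.Println(datIsCredible(4096, 0)) // real data, no recorded size
}
```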

Adds regression tests reproducing the cross-disk lone-.vif phantom and
the 8-byte stub gate; updates an existing prune test to use a real
collection so its .ecx lookup matches the loaders.

* fix(storage): don't create phantom .dat from lone .vif on Go volume load

Mirror Fix A on the Go side. loadExistingVolume scans .vif/.idx entries,
and for distributed EC a .vif can be mirrored onto a disk whose .ecx is
on a sibling disk. The same-disk hasEcxFile() guard does not fire there,
so the loader falls through to NewVolume(createDatIfMissing=true) and
writes an 8-byte phantom .dat, which the sibling-.dat prune then uses to
delete the real EC shards on sibling disks. Skip the entry when there is
no local .dat AND the .vif has no remote file (via MaybeLoadVolumeInfo);
remote-tiered volumes have no local .dat but must still load.

Adds TestLoneVifDoesNotCreatePhantomDat (fails without the guard) and
TestRemoteTier_DiskScanLoadsRemoteOnlyVolume (fails if the guard skips a
remote-only volume).
2026-06-08 22:10:16 -07:00
7y-9 7bbd28634a fix(util): return full uint64 randomness (#9864)
Problem: RandomUint64 generated eight random bytes but returned int32, truncating the value before mount file and directory handles converted it to uint64. This reduced handle entropy to 32 bits and produced sign-extended handle values.

Root cause: the helper cast BytesToUint64 to int32 and exposed int32 as its return type.

Fix: make RandomUint64 return uint64 and return the full BytesToUint64 result.

Reproduction: go test ./weed/util -run TestRandomUint64ReturnsUint64 -count=1 failed before the fix because RandomUint64() had kind int32.

Validation: gofmt -w weed/util/bytes.go weed/util/bytes_test.go; git diff --check; go test ./weed/util -run TestRandomUint64ReturnsUint64 -count=1; go test ./weed/util -count=1; go test ./weed/mount -count=1; git diff --cached --check
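
The truncation described above can be reproduced with a small standalone sketch. The function names are hypothetical, and the big-endian decoding is an assumption of this example rather than a statement about BytesToUint64.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// randomUint64 returns all 64 random bits, as the fixed helper is meant to.
func randomUint64() uint64 {
	var buf [8]byte
	if _, err := rand.Read(buf[:]); err != nil {
		panic(err)
	}
	return binary.BigEndian.Uint64(buf[:])
}

// truncated mimics the old behaviour: narrow to int32, then widen back to
// uint64 the way a handle conversion would. The upper 32 bits are lost and
// replaced by sign extension.
func truncated(v uint64) uint64 {
	return uint64(int32(v))
}

func main() {
	v := uint64(0x123456789abcdef0)
	fmt.Printf("%#x\n", truncated(v)) // 0xffffffff9abcdef0
	fmt.Printf("%#x\n", randomUint64())
}
```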
2026-06-08 22:07:24 -07:00
Chris Lu 3fadbef3eb feat(admin): export full cluster volume list as JSON (#9876)
Adds an "Export All (JSON)" button on the Cluster Volumes page that pulls
the whole cluster's volume list from the master in one call, a superset of
volume.list. Beyond the table columns it carries garbage and fullness
ratios, modified time, compact revision, remote tiering keys, per-disk
capacity counts, EC shard sizes with file/delete counts, and a cluster-wide
duplicate-volume-id scan. Honors the active collection filter. The existing
per-page CSV export stays as "Export Page".
2026-06-08 15:01:02 -07:00
Chris Lu ed470dccb1 mini: grow volumes one at a time
Mini auto-sizes a few large volume slots, but the master pre-grows 7
volumes per new collection. Under a filer group each S3 bucket is its
own collection, so the first buckets claimed every slot and later
writes failed to assign a volume. Cap mini's volume_growth copy counts
to 1.
2026-06-08 14:51:40 -07:00
Chris Lu d67fc48fbd fix(filer.sync): guard batched events against nil EventNotification (#9877)
* fix(filer.sync): guard batched events against nil EventNotification

The server folds a backlog into one response: the first event in the
top-level fields, the rest in resp.Events, and the pipelined sender can
drain an idle heartbeat (nil EventNotification) into that tail. Only the
envelope got the freshness-signal guard, so a batched heartbeat reached
AddSyncJob and nil-derefed in IsEmpty while replaying a backlog buffered
during a peer outage.

Route every event, envelope and batched, through one handler that sends
freshness signals (nil heartbeat, empty marker) to OnIdleHeartbeat.

* fix(filer): guard MetaAggregator batched events against nil EventNotification

The peer subscription's envelope is nil-guarded but its batched tail was
not. The aggregator doesn't enable idle heartbeats today, so the server
can't fold a nil EventNotification into the batch yet, but make the two
loops consistent so it can't nil-deref if that changes.
2026-06-08 13:56:16 -07:00
Chris Lu 4c050ad76b Don't mangle filer paths with the OS separator on Windows (#9878)
fix: don't mangle filer paths with the OS separator on Windows

filepath.Dir/Join use the platform separator, so on Windows they rewrite
a forward-slash filer path like /buckets/x into \buckets\x. The mangled
value then goes into a filer RPC and operates on the wrong key, so the
op silently targets nothing.

The admin file browser hit this in New Folder (the entry landed under
\buckets\my-bucket and never showed up under /buckets/my-bucket), and
delete, view and properties failed the same way. MQ topic retention and
consumer-offset listing, and the SFTP home dir plus create-permission
parent lookup, had the same bug.

Switch all of these to the path package, which always uses "/".
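
A short sketch of the approach: the path package treats "/" as the only
separator on every platform, so filer keys stay intact. The helper names
parentOf and childOf are hypothetical.

```go
package main

import (
	"fmt"
	"path"
)

// parentOf returns the parent directory of a filer path. path.Dir always
// splits on "/", regardless of the operating system.
func parentOf(p string) string {
	return path.Dir(p)
}

// childOf joins a filer directory and an entry name with "/".
func childOf(parent, name string) string {
	return path.Join(parent, name)
}

func main() {
	fmt.Println(parentOf("/buckets/my-bucket/docs")) // /buckets/my-bucket
	fmt.Println(childOf("/buckets/my-bucket", "docs")) // /buckets/my-bucket/docs
}
```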
2026-06-08 13:56:02 -07:00
Chris Lu 8cc10460b4 fix(remote): correct content and permissions when syncing/caching remote objects (#9879)
* fix(remote): reject short reads when caching remote objects

A short read from the remote (stale listing size, truncated or flaky
response) was silently zero-padded: the S3 and Azure clients pre-size
the buffer and discard the downloaded byte count, and the chunk is
recorded with the requested size. The cached file then matched the
expected size but its tail was NUL bytes, and the entry was marked cached
so it never re-fetched.

Check the byte count against the requested size in both clients, and
add a backend-agnostic guard in FetchAndWriteNeedle. The cache now
fails loudly and the entry stays remote-only for a later retry.
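
A minimal sketch of the byte-count check, assuming a hypothetical helper
verifyFetched. The real clients and FetchAndWriteNeedle may structure the
guard differently.

```go
package main

import "fmt"

// verifyFetched compares the number of bytes actually downloaded with the
// number requested and returns an error on any mismatch, so a short read
// is never silently zero-padded.
func verifyFetched(got, want int64) error {
	if got != want {
		return fmt.Errorf("short read from remote: got %d bytes, want %d", got, want)
	}
	return nil
}

func main() {
	fmt.Println(verifyFetched(1024, 1024))
	fmt.Println(verifyFetched(700, 1024))
}
```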

* fix(remote): match S3 default modes when syncing remote metadata

Remote object listings carry no POSIX mode, so synced entries were
created with a hardcoded 0644. Against a SeaweedFS remote, whose S3
layer writes objects as 0660 and auto-creates directories as 0771
(0660|0111), the mounted copy ended up 0644/0755 and the permissions
visibly diverged from the source.

Default to the S3 modes instead (files 0660, directories 0771). The
filer derives parent-dir modes from the child as fileMode|0111, so
fixing the file default also brings the directories into line.
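
The mode arithmetic can be checked with a short sketch. dirModeFor is a
hypothetical name for the fileMode|0111 derivation described above.

```go
package main

import (
	"fmt"
	"os"
)

// dirModeFor derives a directory mode from a file mode by adding the
// execute (search) bit for user, group and other.
func dirModeFor(fileMode os.FileMode) os.FileMode {
	return fileMode | 0111
}

func main() {
	fmt.Printf("%04o\n", uint32(dirModeFor(0660))) // 0771
	fmt.Printf("%04o\n", uint32(dirModeFor(0644))) // 0755
}
```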

Directory mtimes still reflect sync time: S3 listings don't enumerate
directories, so the remote's directory timestamps aren't available.
2026-06-08 13:55:53 -07:00
Chris Lu 5a4ff2a122 fix(mq): don't cache topic non-existence on transient filer errors
TopicExists and getTopicConfFromCache negative-cached a topic for the full
30s TTL whenever a filer lookup failed for any reason, including timeouts.
A topic created earlier then looked gone until the TTL expired, and the
metadata auto-create path couldn't heal it (CreateTopic rejects an
already-persisted conf), so producers saw UNKNOWN_TOPIC_OR_PARTITION.

Only negative-cache on a definitive ErrNotFound; let transient errors fall
through and retry against the filer.
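
A minimal sketch of the rule, with stand-in errors: errNotFound and
errTimeout are hypothetical values defined here, not the filer's own.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in errors for this sketch only.
var (
	errNotFound = errors.New("not found")
	errTimeout  = errors.New("lookup timed out")
)

// shouldNegativeCache reports whether a failed lookup proves the topic does
// not exist. Only a definitive not-found qualifies; anything else is treated
// as transient and must not be cached.
func shouldNegativeCache(err error) bool {
	return errors.Is(err, errNotFound)
}

func main() {
	wrapped := fmt.Errorf("read topic conf: %w", errNotFound)
	fmt.Println(shouldNegativeCache(wrapped))    // true
	fmt.Println(shouldNegativeCache(errTimeout)) // false
}
```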
2026-06-08 12:04:48 -07:00
7y-9 b408705f5b fix(s3api): accept HTTP-date conditionals (#9863)
* fix(s3api): accept HTTP-date conditionals

Problem: Object conditional headers rejected valid HTTP-date values in RFC850 or ANSIC format for If-Modified-Since and If-Unmodified-Since.

Root cause: parseConditionalHeaders used time.Parse(time.RFC1123), accepting only one HTTP-date representation instead of the standard formats accepted by net/http.ParseTime.

Fix: Parse conditional date headers with http.ParseTime so RFC1123, RFC850, and ANSIC HTTP-date forms are accepted.

Reproduction: go test ./weed/s3api -run TestParseConditionalHeadersAcceptsHTTPDateFormats -count=1 failed before the fix with ErrInvalidRequest for RFC850 and ANSIC date values.

Validation: env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/s3api -run TestParseConditionalHeadersAcceptsHTTPDateFormats -count=1; env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/s3api -count=1; git diff --check; git diff --cached --check

* fix(s3api): accept HTTP-date copy-source conditionals

Mirror the put-path http.ParseTime switch onto the copy-source If-Modified-Since / If-Unmodified-Since headers, which still rejected valid RFC850 and ANSIC dates.

* fix(s3api): keep RFC1123 UTC-zone dates working alongside http.ParseTime

http.ParseTime rejects the "UTC" zone that Go clients emit via t.UTC().Format(time.RFC1123), which the old RFC1123 parser accepted. Add a parseHTTPDate helper that tries http.ParseTime first and falls back to RFC1123, so the put and copy-source conditional date headers accept the union of HTTP-date formats plus the UTC zone.
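
A sketch of what such a helper could look like, based only on the description above. The actual parseHTTPDate in the tree may differ, and parsesOK exists only for the demonstration.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// parseHTTPDate tries the standard HTTP-date formats first, then falls back
// to time.RFC1123 so that a "UTC" zone name is still accepted.
func parseHTTPDate(s string) (time.Time, error) {
	if t, err := http.ParseTime(s); err == nil {
		return t, nil
	}
	return time.Parse(time.RFC1123, s)
}

// parsesOK reports whether s parses to a date in 1994 (the year used in the
// sample values below).
func parsesOK(s string) bool {
	t, err := parseHTTPDate(s)
	return err == nil && t.Year() == 1994
}

func main() {
	for _, s := range []string{
		"Sun, 06 Nov 1994 08:49:37 GMT",  // RFC1123 with GMT
		"Sunday, 06-Nov-94 08:49:37 GMT", // RFC850
		"Sun Nov  6 08:49:37 1994",       // ANSIC
		"Sun, 06 Nov 1994 08:49:37 UTC",  // RFC1123 with UTC zone
	} {
		fmt.Println(s, parsesOK(s))
	}
}
```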

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-08 01:12:07 -07:00
Chris Lu 78da9572ae 4.32 2026-06-07 23:37:57 -07:00
Jaehoon Kim 1b5f1c1f3b feat(filer.backup): -initialSnapshot re-seeds a reinitialized destination (#9828)
* feat(filer.backup): add -resetCheckpoint to force a fresh sync

filer.backup resumes from a per-sink offset persisted in the source filer's KV.
There was no first-class way to discard that checkpoint and re-run from the
beginning short of guessing a large -timeAgo, which also skips -initialSnapshot.

Add -resetCheckpoint: before reading the offset, write 0 for this sink so
getOffset returns 0, isFreshSync stays true, and -initialSnapshot re-runs a full
walk. Effective only when -timeAgo is 0.

The flag is cleared after the first successful reset: runFilerBackup retries
doFilerBackup forever on error, so leaving it set would re-zero the checkpoint
on every retry and never make forward progress after a transient failure. Later
retries resume from the persisted checkpoint instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(filer.backup): keep fresh-sync intent when offset read fails after reset

After -resetCheckpoint writes offset 0, a transient getOffset read-back error
flipped isFreshSync to false, which skipped the -initialSnapshot walk the reset
explicitly requested. Track that the reset happened this iteration and, on a
getOffset error, preserve isFreshSync=true in that case (the non-reset path
keeps treating a read error as "not fresh" to avoid re-walking on transients).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(filer.backup): skip offset read-back on reset instead of tracking a flag

Replace the didReset bool by branching: on -resetCheckpoint, clear the offset and
start fresh without reading it back (we just wrote 0, so the state is known);
otherwise read the offset as before. This drops the redundant getOffset RPC after
a reset and removes the read-back error case entirely, so no separate flag is
needed to preserve isFreshSync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* filer.backup: -initialSnapshot re-seeds on every start; drop -resetCheckpoint

-initialSnapshot now walks the live tree whenever -timeAgo is 0, seeds the
destination, and overwrites the saved checkpoint, rather than running only on a
fresh sync. That re-seeds a reinitialized destination on its own, so the
separate -resetCheckpoint flag is gone.

The walk runs once per process: the in-memory flag is cleared after the
watermark is persisted, so the retry loop resumes from the persisted checkpoint
instead of re-walking on every transient error. A process restart re-walks, so
remove the flag once the backup is caught up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-07 23:35:53 -07:00
Chris Lu 8a4fdf06c0 admin/maintenance: reload in-flight tasks on startup instead of discarding them (#9857)
* admin/maintenance: reload in-flight tasks on startup instead of discarding

LoadTasksFromPersistence deleted all persisted task files on startup and
relied on the scanner to re-detect, so saved task state was never consumed
— the persistence was effectively write-only. Reload non-terminal tasks
(pending/assigned/in_progress) into the queue, resetting in-flight ones to
pending since their worker is gone after a restart (maintenance tasks are
idempotent). Terminal task files are dropped; the scanner still backfills
anything not persisted.

* address review: nil-guard reloaded tasks and SyncTask to ActiveTopology

- skip nil entries from LoadAllTaskStates (corrupted state)
- re-sync restored tasks with MaintenanceIntegration so ActiveTopology
  (in-memory, empty on startup) knows about them; otherwise GetNextTask's
  AssignTask rejects them as unknown and they never get assigned
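
The reload rules above can be sketched as follows. The task type, the
reloadTasks name and the string statuses are assumptions for illustration;
the re-sync with ActiveTopology is not modelled.

```go
package main

import "fmt"

// task is a simplified stand-in for a persisted maintenance task.
type task struct {
	ID     string
	Status string
}

// reloadTasks keeps non-terminal tasks, resets in-flight ones to pending
// (their worker is gone after a restart), and skips nil entries and
// terminal tasks.
func reloadTasks(saved []*task) []*task {
	var queue []*task
	for _, t := range saved {
		if t == nil {
			continue // corrupted state
		}
		switch t.Status {
		case "pending":
			// already queueable as-is
		case "assigned", "in_progress":
			t.Status = "pending"
		default:
			continue // terminal: dropped
		}
		queue = append(queue, t)
	}
	return queue
}

func main() {
	saved := []*task{
		{ID: "a", Status: "in_progress"},
		nil,
		{ID: "b", Status: "completed"},
		{ID: "c", Status: "pending"},
	}
	for _, t := range reloadTasks(saved) {
		fmt.Println(t.ID, t.Status)
	}
}
```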
2026-06-07 22:45:38 -07:00
Chris Lu 7c542128c7 vacuum: compact a read-only volume when an explicit volumeId is given (#9861)
* vacuum: compact a read-only volume when an explicit volumeId is given

The on-demand path no longer skips read-only volumes, so an operator can
reclaim a benignly read-only (full/oversized) volume without marking it
writable first. The background scan and all-volumes sweep still skip
read-only, where the flag usually signals an unhealthy disk.

* vacuum: copy locationList under lock for on-demand vacuum

The volumeId>0 path passed the live vid2location entry into the async
vacuum, where heartbeat-driven Register/UnRegister can mutate the slice
concurrently. Snapshot it under accessLock, matching the sweep path.
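
A minimal sketch of the snapshot-under-lock pattern. The registry type and
its fields are hypothetical stand-ins for the vid2location map and
accessLock named above.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a simplified stand-in for the volume layout.
type registry struct {
	mu        sync.RWMutex
	locations map[uint32][]string
}

// snapshot returns a private copy of the location list for vid, taken while
// holding the lock, so later mutations of the live slice cannot race with
// an asynchronous consumer.
func (r *registry) snapshot(vid uint32) []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return append([]string(nil), r.locations[vid]...)
}

func main() {
	r := &registry{locations: map[uint32][]string{7: {"node-a", "node-b"}}}
	copied := r.snapshot(7)

	// A concurrent Register/UnRegister would mutate the live slice like this.
	r.mu.Lock()
	r.locations[7][0] = "node-z"
	r.mu.Unlock()

	fmt.Println(copied) // [node-a node-b]
}
```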
2026-06-07 22:42:51 -07:00
Chris Lu a549580e65 ec.balance: verify shard landed on destination before deleting the source (#9858)
* ec.balance: verify shard(s) landed on the destination before deleting source

The EC balance task copied/mounted a shard to the destination and then
immediately unmounted+deleted it from the source, reporting success as soon
as the RPCs returned. A copy/mount can return OK while the shard isn't
actually registered/loadable on the destination, so deleting the source
then loses the shard (and the scanner re-issues the same move every cycle).

Add a verification step (VolumeEcShardsInfo via VerifyShardsAcrossServers,
the same check the EC encode task uses before deleting originals): if the
destination doesn't report every moved shard, fail the task and keep the
source so the move is retried instead of losing data.

* address review: use comma-ok when reading destination shard inventory
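
A sketch of the verify-before-delete check, including the comma-ok lookup.
The inventory shape and the missingOnDestination name are assumptions; the
real task queries VolumeEcShardsInfo through VerifyShardsAcrossServers.

```go
package main

import "fmt"

// missingOnDestination returns the moved shard ids that the destination
// does not report. inventory maps a server address to the set of shard ids
// it reports for the volume.
func missingOnDestination(inventory map[string]map[uint32]bool, dest string, moved []uint32) []uint32 {
	shards, ok := inventory[dest]
	if !ok {
		// The destination reported nothing at all: every shard is missing.
		return append([]uint32(nil), moved...)
	}
	var missing []uint32
	for _, id := range moved {
		if present, ok := shards[id]; !ok || !present {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	inventory := map[string]map[uint32]bool{
		"dst:8080": {1: true, 2: true},
	}
	missing := missingOnDestination(inventory, "dst:8080", []uint32{1, 2, 3})
	if len(missing) > 0 {
		fmt.Println("keep source; destination is missing shards", missing)
		return
	}
	fmt.Println("all shards verified; safe to delete from source")
}
```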
2026-06-07 21:31:53 -07:00