52 Commits

Author SHA1 Message Date
Chris Lu f724828bcb fix(ec): never delete recoverable EC shards on startup/reconcile (the non-empty-.dat sibling of the stub bug) (#9941)
* fix(ec): never delete recoverable shards on startup/reconcile (size-direction + byte-exact .dat)

EC startup validation and the cross-disk reconcile could delete the only
copy of distributed-EC shards whenever a non-empty .dat sat beside them.
This is the same data-loss class as the empty-.dat-stub fix, now for a
real (non-empty) stale or partial .dat.

validateEcVolume: the discriminating signal is the shard size relative to
the .dat's full encode, not the shard count.
  - shards smaller than expected: an interrupted local encode left partial
    shards and the .dat is the complete source -> reclaim the .dat.
  - shards equal to expected: a valid (or still-distributing) EC volume ->
    keep; the shards may be the only copy.
  - shards larger than expected: the .dat is the stale/partial side (e.g. an
    interrupted decode left a half-written .dat next to the real shards) ->
    keep.
Previously any size mismatch, a low shard count beside a .dat, or a
transient stat error returned "delete", wiping sole-copy shards. Now every
ambiguity (size mismatch in either direction, inconsistent shard sizes,
transient I/O error, partial shard set) keeps the data; only a credible
full source .dat with no partial set to lose is reclaimed.

handleFoundEcxFile: a shard load failure (corrupt/locked .ecx, EMFILE
during a mass restart, transient I/O) no longer deletes the EC files when a
.dat exists -- it only unloads and keeps the files for retry. All deletion
authority now flows through validateEcVolume.

pruneIncompleteEcWithSiblingDat: count shards NODE-WIDE (a set split across
sibling disks summing to >= dataShards is independently recoverable and is
left alone), and require the sibling .dat to byte-exactly match the size
.vif recorded at encode time before deleting -- the prior "at least this
big, or bigger than a superblock" gate could trust a stale .dat and wipe
sole-copy shards. EC encode records the source size in .vif, so this gate
works for real volumes; older volumes without it fail safe (kept).

Rust volume server mirrors all of the above: size-direction + keep-on-
ambiguity in validate_ec_volume, keep-on-load-failure in
handle_found_ecx_file, and the node-wide + byte-exact gate in the prune.
The Rust validate/prune paths now resolve the data-shard count from the
volume's own .vif instead of hardcoding 10+4, so custom-ratio volumes are
not mis-sized and wrongly deleted on reboot.

Existing tests that encoded the old (unsafe) "delete on low count / size
mismatch" behavior are updated to the safe expectation, and new regression
tests cover the partial-decode-.dat-keeps-shards and transient-error-keeps
cases (Go and Rust); they fail on the pre-fix code.

* fix(ec): record DatFileSize in planted EC .vif for the prune test; trim comments

The multi-disk lifecycle e2e test planted a partial EC leftover with an
empty .vif, so the byte-exact prune gate (which a real encoded volume
satisfies via its recorded source size) kept it instead of cleaning up.
Record DatFileSize + the EC ratio in the planted .vif, matching production.

Also condense the verbose comments added in this change to the repo's
concise style.
2026-06-12 23:51:29 -07:00
Chris Lu 18cdb3819b fix(ec): crash-safe ecx-journal fold and shard rebuild (fsync before publish, no short-read-as-success) (#9938)
* fix(ec): make ecx-journal fold and shard rebuild crash-safe

Two EC rebuild paths could silently lose or corrupt data:

RebuildEcxFile folded the .ecj deletion journal into .ecx (in-place
WriteAt tombstones) and then unlinked the journal without flushing the
.ecx writes first. A crash could persist the unlink ahead of the
tombstones, resurrecting deleted needles on the next load. It also read
journal records with a bare n!=size break, so a torn tail silently
dropped the remaining tombstones before the unlink. Now: read records
with io.ReadFull (io.EOF ends cleanly, a torn tail aborts and leaves
.ecj in place for retry), fsync .ecx before removing the journal.

rebuildEcFiles treated a zero/short ReadAt as a clean end-of-input and
discarded the read error, so a truncated or unreadable input shard
produced truncated regenerated shards that were then published as
restored redundancy; the regenerated shards were also never fsynced on
the no-sidecar path. Now: derive the expected shard size from the
present inputs up front (rejecting a divergent/zero-size input), drive
the loop by that size, fail on any short read or short write, and fsync
every regenerated shard before it is mounted/renamed.

Rust volume server mirrors the rebuild fix: rebuild_ec_files now checks
the read_at byte count (it previously discarded it, the same truncation
bug). The Rust ecx fold already synced .ecx before removing the journal.

Custom EC ratios are unaffected: the shard size derives from the input
shards and the loop uses the .vif-resolved data/parity counts, never a
hardcoded 10+4.

* storage: close ecx journal files via defer in RebuildEcxFile

Per review: a single deferred Close per file replaces the per-error-path
manual closes, so new early returns cannot leak descriptors. The journal
is still closed explicitly before its unlink since Windows cannot delete
an open file; the deferred second Close is a harmless no-op.
2026-06-12 22:28:56 -07:00
Chris Lu 34f9b91d69 fix(storage): never let an empty .dat delete healthy distributed EC shards (#9930)
* fix(storage): never let an empty .dat delete healthy distributed EC shards

A leftover empty .dat stub (a phantom from the pre-fix loader; zero
needles) next to a distributed EC volume's local shards made startup
classify the volume as an interrupted local encode: validateEcVolume
requires >= dataShards local shards when a .dat is present, fails with
the 1-2 shards a distributed volume keeps per disk, and the cleanup
deletes those shards -- the only copies of that part of the volume.
Repeated across restart waves this destroys enough shards cluster-wide
to make the volume unrecoverable.

Go:
- loadExistingVolume: hoist the empty-stub sweep above the EC presence
  checks. Previously the .vif-next-to-.ecx guard returned before the
  sweep ever ran, so exactly the dangerous layout (stub + .ecx + local
  shards) kept its stub and then lost its shards in loadAllEcShards.
- validateEcVolume / checkDatFileExists: treat a .dat <= a superblock
  (zero needles) as absent. An empty .dat cannot be the encode source,
  so it must never gate shard deletion; this also covers stubs without
  a .vif, which the sweep cannot prove are EC leftovers.

Rust mirror (seaweed-volume): the same gate in validate_ec_volume and
check_dat_file_exists (the Rust sweep already ran before validation);
the volume-load skip keeps a plain existence check so fresh,
needle-less volumes still load.

Regression tests in Go and Rust reproduce the production layout (a
zero-byte .dat beside .ecx/.ecj and two shards of a 10+4 volume, with
and without a .vif) and fail without the fix with the shards deleted.

* fix(ec): gate source volume deletion on a recoverable shard set

After EC encode, the shell command and the (plugin) worker task refused
to delete the source volume unless every shard was present, and aborted
otherwise -- leaving the source .dat next to live shards, exactly the
mixed state the startup cleanup mishandles.

Replace the full-set requirement with a recoverability gate shared by
both callers (RequireRecoverableShardSet): deleting a non-empty source
.dat requires at least dataShards distinct shards cluster-wide. Below
that the source is kept and the encode fails as before. A degraded but
recoverable set (>= dataShards, < total) now proceeds with a warning
instead of aborting: the missing shards can be rebuilt from the
survivors, while keeping the source would preserve the dangerous mixed
state. Empty stub replicas are still swept unguarded (OnlyEmpty) -- an
empty .dat has nothing to lose.
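
The gate can be sketched roughly as follows. The function name requireRecoverable, its signature, and the degraded return value are hypothetical; the real RequireRecoverableShardSet signature is not shown in this log. Only the thresholds (below dataShards: keep the source; between dataShards and the total: proceed with a warning) come from the description above.

```go
package main

import "fmt"

// requireRecoverable is a hypothetical stand-in for the shared gate.
// It counts distinct shard ids seen cluster-wide and reports:
//   - an error when fewer than dataShards are present (source must be kept),
//   - degraded=true when the set is recoverable but incomplete.
func requireRecoverable(shardIDs []int, dataShards, totalShards int) (degraded bool, err error) {
	distinct := make(map[int]struct{})
	for _, id := range shardIDs {
		if id >= 0 && id < totalShards {
			distinct[id] = struct{}{} // duplicates on several nodes count once
		}
	}
	if len(distinct) < dataShards {
		return false, fmt.Errorf("only %d distinct shards, need %d: keeping source",
			len(distinct), dataShards)
	}
	return len(distinct) < totalShards, nil
}

func main() {
	// Ten distinct shards of a 10+4 volume (shard 9 reported twice).
	degraded, err := requireRecoverable([]int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9}, 10, 14)
	fmt.Println(degraded, err) // true <nil>

	_, err = requireRecoverable([]int{0, 1, 1}, 10, 14)
	fmt.Println(err != nil) // true
}
```

Passing dataShards and totalShards as parameters, as the commit notes, keeps the same check usable for custom EC ratios.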

dataShards/totalShards stay parameters so enterprise custom EC ratios
share the helper verbatim.

* test(ec): use recoverable shard verification gate
2026-06-11 20:26:20 -07:00
Chris Lu 4f8af455bf feat(storage): sweep leftover empty EC .dat stubs on volume server startup (#9927)
* feat(storage): sweep leftover empty EC .dat stubs on volume server startup

An EC volume keeps no local .dat. The pre-fix loader left empty 8-byte
superblock .dat stubs next to EC metadata (one per lone .vif). Left in
place, each loads as a phantom empty volume, and the same vid's stub on
two disks of one server blocks Rust startup via the duplicate-vid check
in Store::add_location -- the prior fix stops creating new stubs but does
not clean up existing ones.

On startup, when a .dat is empty (<= a superblock, i.e. zero needles) and
its .vif marks the volume erasure-coded, remove the stub (+ empty .idx)
instead of loading it. The real data is in the EC shards, so the empty
stub holds nothing to lose. Non-EC empty .dat files (e.g. freshly
allocated volumes) are left alone.

Done in both Rust (load_existing_volumes) and Go (loadExistingVolume),
with regression tests that fail without the sweep.

* refactor(storage): extract empty EC .dat stub sweep into its own function

Move the startup stub-sweep into remove_empty_ec_dat_stub (Rust) and
removeEmptyEcDatStub + vifIsEcVolume (Go) for clearer logic, and look up
the .vif in both the data and idx directories (each read at most once) so
a stub is still found when -dir.idx is configured. Adds direct tests for
the idx-directory lookup on both engines.
2026-06-11 12:26:21 -07:00
Chris Lu 79ac279fe1 fix(ec): don't mix EC shards from different encode runs (#9880)
* feat(ec): add encode_ts_ns to EC shard metadata and the shard read RPC

EcShardConfig and VolumeEcShardReadRequest gain an int64 encode_ts_ns
(encode time in unix nanos). It rides in .vif and the read request so a
read can be scoped to the encode run that produced the index.

* fix(ec): stamp each encode and reject cross-run shard reads

Generate stamps EncodeTsNs into the volume's .vif. Reads carry it to the
shard's owning volume (resolved together via FindEcVolumeWithShard, so a
multi-disk server validates the disk that actually serves the bytes) and
reject a shard from a different encode run, recovering from parity. A
zero on either side (pre-upgrade volume) skips the guard.

* fix(ec): stamp the encode identity on the worker-generated .vif

The worker-local encode path now writes EncodeTsNs (and the resolved EC
ratio) into the .vif, so the read guard is not silently off for volumes
encoded by the maintenance worker.

* fix(ec): wipe stale EC artifacts before re-encoding

VolumeEcShardsGenerate evicts any in-memory EcVolume for the volume and
removes its on-disk shard/index/sidecar files before writing fresh ones,
so a retried encode never builds on a partial prior run and the unlink
frees the inodes instead of leaving open fds serving old bytes.

* fix(ec): unmount EC shards across all disks

UnmountEcShards walked only the first disk holding the shard, leaving a
duplicate copy mounted on a sibling disk (split-disk reconciled volumes)
still serving and heartbeating. Traverse every disk and emit one
deletion delta per disk.

* fix(ec): delete orphan shards without a local .ecx

deleteEcShardIdsForEachLocation gated shard-file removal on a local .ecx,
so it could not clean an orphan .ecNN left by a failed copy on a disk
with no index. Delete the requested shard files unconditionally; the
index-file (.ecx/.ecj/.vif) routing stays gated as before.

* fix(ec): clear stale EC shards cluster-wide before re-encoding

ec.encode unmounts and deletes EC shards for the target volumes on every
node before regenerating: fatal for the shards the topology reports
(mounted leftovers), best-effort for the rest (a sweep that catches
unmounted failed-copy orphans). A down node is a no-op.

* fix(ec): don't nil EC fds on close so reads can't race eviction

A reader resolves an EcVolume/shard under the lock then reads after it is
released, so an eviction that nils ecxFile/ecdFile would race that read
and panic. Close the fds without nilling the fields: the field is now
write-once (no data race) and a concurrent read hits a closed fd, getting
a clean error that the caller recovers from parity.

* fix(ec): wipe stale EC artifacts on every disk and surface failures

The pre-encode wipe only deleted beside the source volume, so a stale
shard on a sibling disk survived and could be mounted against the new
index at reconcile. Sweep every disk. Removal also ignored os.Remove
errors, reporting a failed cleanup as success and letting a stale shard
join the next generation; surface the first real failure (treating
already-gone as success) from removeStaleEcArtifacts and the shard delete.

* fix(ec): log when a local shard is skipped for a different encode run

The cross-run guard returned errShardNotLocal, indistinguishable in logs
from a genuinely-absent shard. Add a V(1) line naming both EncodeTsNs so
operators can tell "wrong encode generation" from "shard not here".

* fix(ec): surface metadata removal failures in the shard delete path

deleteEcShardIdsForEachLocation still dropped os.Remove errors on the
.ecx/.ecj/.vif/sidecar cleanup. A surviving stale .ecx is the orphan-index
condition this path prevents, so route those through removeFileIfExists and
return the first real failure instead of reporting cleanup as success.

* fix(ec): fail orphan cleanup when a reachable node's delete fails

The pre-encode orphan sweep swallowed every error for unreported (node,
volume) pairs. That is only safe for an unreachable node, which cannot
receive this encode's new generation. A reachable node whose delete
genuinely failed (permission/IO) keeps an orphan shard that a later copy
re-stamps with the new run's volume-level .vif identity, so the read guard
would accept stale data. Surface those; stay best-effort only for
unreachable nodes (gRPC Unavailable / no status).

* fix(ec): guard ecjFile under its lock in the EC delete path

EcVolume.Close nils ecjFile under ecjFileAccessLock; a delete that resolved
its .ecx lookup before a concurrent eviction (the generate-time
UnloadEcVolume) could then reach the journal append with a nil fd. Bail
with a clear "volume closed" error under the lock instead.

* fix(ec): reject an unstamped shard when the caller has an encode identity

The read guard required both identities nonzero, so a current (stamped)
caller accepted a holder with identity 0 and could be served a stale
pre-upgrade shard. Reject when the caller is stamped and the holder
differs (including unstamped); stay lenient only when the caller itself
has no identity (pre-upgrade reader). A skipped shard recovers from parity.
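
As an illustration only, the resulting accept/reject rule fits in a few lines. The helper name acceptShard is hypothetical; the two arguments stand for the caller's and the holder's EncodeTsNs, with 0 meaning unstamped.

```go
package main

import "fmt"

// acceptShard is a hypothetical helper expressing the guard:
// a caller without an identity (pre-upgrade reader) accepts anything;
// a stamped caller accepts only a holder with the same identity, so an
// unstamped holder (0) is rejected too.
func acceptShard(callerTsNs, holderTsNs int64) bool {
	if callerTsNs == 0 {
		return true
	}
	return callerTsNs == holderTsNs
}

func main() {
	fmt.Println(acceptShard(0, 42))  // true: pre-upgrade reader stays lenient
	fmt.Println(acceptShard(42, 42)) // true: same encode run
	fmt.Println(acceptShard(42, 7))  // false: different encode run
	fmt.Println(acceptShard(42, 0))  // false: stamped caller, unstamped holder
}
```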

* fix(ec): full-teardown delete so cluster cleanup wipes a whole generation

The pre-encode cluster sweep deleted only the listed canonical shards on
remote nodes, leaving index/sidecar files (and, on builds with versioned
generations, the generation files too) behind. Add a full_teardown flag to
VolumeEcShardsDelete that evicts the volume and wipes every EC artifact for
it on every disk via removeStaleEcArtifacts; the shell and worker pre-encode
cleanup paths set it. Other delete callers (balance/decode/repair) are
unchanged.

* fix(ec): take ecjFileAccessLock before the nil-check in Sync and Close

Sync and Close read ev.ecjFile before acquiring ecjFileAccessLock while
Close nils it under the lock, a data race on the field. Take the lock
first, then nil-check inside, in both.

* fix(ec): acknowledge full_teardown so a pre-upgrade server can't fake success

An old volume server silently ignores full_teardown and returns success
for an ordinary delete, so the caller wrongly believes the generation was
wiped and copies a fresh gen-0 onto an unwiped node. Echo full_teardown_done
in the response; the worker destination cleanup fails when it is absent, and
the shell cluster sweep fails for a reported (mounted) leftover while staying
best-effort for an unreported node. encode_ts_ns stays an accepted transient
(an old server just skips the new read guard, no regression).

* fix(ec): fail the pre-encode sweep for any reachable node that can't ack teardown

A reachable pre-upgrade server ignores full_teardown and returns success
without wiping an orphan, which a later copy then folds into the new
generation. Treat a missing full_teardown_done ack as fatal for every
reachable node (best-effort only for a gRPC-unreachable one), not just for
topology-reported pairs.

* fix(ec): return the served shard identity and validate it client-side

The encode identity was only enforced server-side, so a pre-upgrade server
ignored the request field and served bytes unchecked. Echo the served
shard's EncodeTsNs on every read response chunk and have the client reject a
mismatch (including 0 from an old server), so the guard holds regardless of
server version; a rejected read recovers from parity.

* fix(ec): reject a short/empty remote shard read instead of serving zeros

doReadRemoteEcShardInterval accepted an immediate EOF or a short stream and
returned success with a partly zero-filled, unvalidated buffer (the server
stamps the identity only on chunks that carry bytes). A non-deleted interval
must arrive whole: require n == len(buf), exempting the is_deleted
short-circuit (n=0), matching readLocalEcShardInterval's local check. A short
read now fails so the caller recovers from parity.

* test(ec): fake volume server echoes the full_teardown acknowledgement

The worker now fails a teardown delete that isn't acknowledged (so a
pre-upgrade server can't silently skip the wipe). The fake server's no-op
VolumeEcShardsDelete returned an empty response, which the worker read as a
skipped teardown and aborted the encode. Echo full_teardown_done.

* feat(ec): mirror the encode-run identity guard + full_teardown into the Rust volume server

The Go volume server stamps an encode-run identity (encode_ts_ns) into the .vif
and rejects a read served from a shard of a different run; full_teardown wipes a
whole generation and acknowledges it. The Rust volume server had none of it.
Mirror the shared logic: load encode_ts_ns from the .vif onto the EcVolume,
stamp it on every read response, and reject a request/response mismatch on both
the server and the distributed-read client (recovering from parity); handle
full_teardown by evicting the volume and wiping every EC artifact on each disk,
echoing full_teardown_done so the caller can detect a server that ignored it.

* fix(ec): remove a stale .vif on full teardown of a shard-only node

A shard copy installs shards + .ecx before .vif, so an interrupted copy after a
teardown could mount the new files under the previous run's identity / version /
shard ratio / dat_file_size carried by the surviving .vif. Remove .vif during
full teardown, gated on .idx absence so a source-volume holder keeps its live
.vif. In Rust this lives in a teardown-only helper so the reconcile / load-
fallback paths (which share the base removal) still preserve .vif.

* fix(ec): treat a missing teardown ack as fatal, not as an unreachable node

isNodeUnreachable returned true for any non-gRPC-status error, so a reachable
pre-upgrade server's missing full_teardown_done ack (a plain error) was
classified unreachable and the unreported pair was silently skipped. Classify
only a real codes.Unavailable as unreachable, and wrap the missing ack in a
sentinel the sweep treats as fatal regardless. A genuinely down node still
surfaces as Unavailable from the RPC and stays best-effort.

* fix(ec): reject a short shard read in the local EC needle reader

read_ec_shard_needle ignored the byte count from shard.read_at and appended the
whole pre-sized buffer, so a truncated shard's zero-filled tail passed the later
length check and parsed as garbage. Require n == buf.len() per interval, erroring
on a short read like the local interval reader already does.

* fix(ec): probe reachability before skipping a node that returns Unavailable

The pre-encode sweep skipped any node whose teardown delete returned
codes.Unavailable, but a reachable volume server in maintenance mode also
returns that code for the maintenance-gated delete, so its stale EC files were
left behind on a node that can still receive the new generation. Confirm with a
non-maintenance-gated empty-target Ping: skip only when the node fails the probe
too (genuinely unreachable).

* fix(ec): use try_exists for the teardown .vif .idx guard

The teardown-only .vif removal gated on Path::exists(), which returns false on a
permission/IO stat error, so a stat failure on a present .idx would read as a
shard-only node and delete the live source volume's .vif. Gate on
try_exists() == Ok(false) instead, preserving the sidecar on any stat error.

* fix(ec): only skip a sweep node when a Ping confirms it is transport-down

The pre-encode sweep skipped a node whenever its teardown delete and a liveness
Ping both failed, but it treated ANY Ping error as down — an application-level
Internal/ResourceExhausted, or Unimplemented from a pre-Ping server, left a
reachable node's stale generation in place. Classify the Ping tri-state and skip
only when it transport-fails with codes.Unavailable; a reachable or inconclusive
node stays fatal.

* fix(ec): exclude sweep-skipped nodes from the encode's rebalance

The pre-encode sweep skips a genuinely-down node best-effort, but the rebalance
then recollected the current topology — a node that recovered between the two
could become a copy target and receive the new generation while still holding
its stale, never-cleared shards. Have the sweep return the skipped set and
exclude those nodes from the rebalance for this encode, so a node we could not
clean cannot receive the new generation. Standalone ec.balance is unaffected.

* fix(ec): re-sweep recovered nodes before generation so they aren't stranded

A node skipped as down by the pre-encode sweep is excluded from the rebalance,
but it can recover and become the generation host — mounting all shards locally,
then being excluded from distribution. Union-only verification accepts all
shards on one node and deletes the originals: a single point of failure. Re-sweep
the skipped nodes just before generation; one whose teardown now succeeds leaves
the skipped set and rebalances normally, while a node still down stays skipped.

* fix(ec): abort the encode if a selected source is still skipped after re-sweep

The re-sweep un-skips a recovered node, but the source was selected before it and
a node can stay down through the re-sweep then recover just in time to be the
generation host — mounting all shards locally while still excluded from the
rebalance, which union-only verification accepts before deleting the originals.
Abort the encode when a selected source remains skipped after the re-sweep.

* fix(ec): batch delete returns retriable 503 when a volume became EC mid-batch

If a volume is not EC at the batch-delete classification but is encoded to EC and
its .dat deleted before the regular-volume mutation, the mutation returns an exact
"not found" that the filer chunk-GC treats as completed, dropping the delete.
Recheck EC presence under the mutation lock and return a retriable 503 with the
"try again" token so the filer requeues it onto the EC path.

* fix(ec): recheck EC state before the regular batch-delete mutation

ec.encode mounts EC shards (copied from the .dat) before deleting the originals,
so a volume can be EC while its .dat still exists. The batch delete only rechecked
EC after a NotFound, so a successful regular-volume delete in that window wrote a
tombstone to the soon-removed .dat — the delete was lost and the needle resurrected
from the pre-tombstone shards. Recheck has_ec_volume under the write lock before
delete_volume_needle and return a retriable 503 so the filer requeues onto the EC path.

* fix(volume): make the metrics push test independent of test order

test_push_metrics_once asserted the pushed body contains the request-counter
family without ever touching the counter — a CounterVec with no children emits
nothing, so the assertion only held when another test had already created a
labelset in the shared registry. Create one in the test itself.
2026-06-10 22:31:18 -07:00
Chris Lu 1c9039d3ac fix(seaweed-volume): stop EC shard deletion from phantom .dat on restart (#9874)
* fix(seaweed-volume): stop EC shard deletion from phantom .dat on restart

On startup load_existing_volumes() scans .vif/.idx entries (not just
.dat). For distributed EC, a volume's .vif can be mirrored onto a disk
whose .ecx lives on a sibling disk, so the per-disk ecx check is false
and the loader falls through to Volume::new, which always creates the
.dat if missing -> a phantom 8-byte superblock stub. The store-level
prune_incomplete_ec_with_sibling_dat then treats that stub as the
authoritative source and deletes the real EC shards on sibling disks. Go
guards the same case (disk_location.go: 'Without this guard NewVolume
below would create a phantom empty .dat') but only same-disk.

Fix A (root cause): in load_existing_volumes, don't create a .dat during
load. Skip the entry when there is no local .dat AND the .vif does not
reference remote files -- remote-tiered volumes have no local .dat but
must still load via the remote path. Uses the robust check_dat_file_exists
helper so a transient stat error doesn't skip a real volume. New volumes
go through create_volume(). Covers the cross-disk .vif/.ecx split Go's
same-disk hasEcxFile() misses.

Fix B (defense in depth, Go + Rust): when the EC .vif records no source
size (dat_file_size==0), require the sibling .dat to be strictly larger
than a bare superblock, so an empty 8-byte stub can't pass the
credibility gate. Previously it fell back to SUPER_BLOCK_SIZE, which an
8-byte stub exactly meets.

Adds regression tests reproducing the cross-disk lone-.vif phantom and
the 8-byte stub gate; updates an existing prune test to use a real
collection so its .ecx lookup matches the loaders.

* fix(storage): don't create phantom .dat from lone .vif on Go volume load

Mirror Fix A on the Go side. loadExistingVolume scans .vif/.idx entries,
and for distributed EC a .vif can be mirrored onto a disk whose .ecx is
on a sibling disk. The same-disk hasEcxFile() guard does not fire there,
so the loader falls through to NewVolume(createDatIfMissing=true) and
writes an 8-byte phantom .dat, which the sibling-.dat prune then uses to
delete the real EC shards on sibling disks. Skip the entry when there is
no local .dat AND the .vif has no remote file (via MaybeLoadVolumeInfo);
remote-tiered volumes have no local .dat but must still load.

Adds TestLoneVifDoesNotCreatePhantomDat (fails without the guard) and
TestRemoteTier_DiskScanLoadsRemoteOnlyVolume (fails if the guard skips a
remote-only volume).
2026-06-08 22:10:16 -07:00
Chris Lu ab7be7867d security: hot-reload JWT signing keys on SIGHUP (#9826)
* security: reload JWT signing keys on SIGHUP

Signing keys were read once in the server constructors and never
refreshed. After a key rotation (Secret update, divergent reads) the
in-memory key stayed stale and every request kept failing "wrong jwt"
until the affected process was restarted.

Add Guard.UpdateSigningKeys and call it from the master, volume and
filer reload paths and the s3 reload hook, next to the existing
whitelist refresh. Make the global chunk-read JWT cache reloadable via
an atomic swap, and register the master's Reload with grace.OnReload --
it was never wired, so the master ignored SIGHUP entirely.

Mirror the same refresh in the Rust volume server's SIGHUP handler.

* security: swap signing keys behind an atomic pointer

Addresses review feedback on the in-place key swap: SigningKey is a
[]byte, so reassigning the Guard fields while a request handler reads
them is a data race that can tear the multi-word slice header and read
out of bounds.

Hold the four signing-key fields in an immutable signingConfig snapshot
behind atomic.Pointer; UpdateSigningKeys swaps the whole pointer, so a
reader sees either the old keys or the new ones. Reads go through new
SigningKey/ExpiresAfterSec/ReadSigningKey/ReadExpiresAfterSec accessors.

The Rust guard is already safe: every read and the SIGHUP write go
through the shared RwLock<Guard>.

* security: fold whitelist + auth state into the atomic snapshot

Review follow-up. UpdateSigningKeys still wrote isWriteActive while the
request path read it (and the whitelist maps) unsynchronized, so a SIGHUP
under load could expose an inconsistent mix of activation bits and
whitelist contents.

Move all hot-reloadable Guard state -- keys, expirations, whitelist, and
the activation flags -- into a single immutable guardState swapped behind
one atomic.Pointer. The Update* methods take a small mutex to serialize
the read-modify-write; readers stay lock-free. The concurrency test now
also rotates the whitelist and probes IsWhiteListed under -race.

Also read each signing key once per branch in the volume/filer JWT auth
checks, so a reload landing mid-check can't take the allow-fast-path
after auth was enabled or verify against a different key than the branch
saw.
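
A minimal sketch of the snapshot-swap pattern described in this series, assuming Go 1.19+ for atomic.Pointer. The field names and the reduced Guard shape are illustrative, not the actual security package types.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// guardState is an immutable snapshot; field names are illustrative.
type guardState struct {
	signingKey []byte
	whiteList  map[string]struct{}
}

// Guard publishes one snapshot pointer. Readers are lock-free; writers
// serialize their read-modify-write with a small mutex.
type Guard struct {
	state atomic.Pointer[guardState]
	mu    sync.Mutex
}

func NewGuard(key []byte) *Guard {
	g := &Guard{}
	g.state.Store(&guardState{signingKey: key, whiteList: map[string]struct{}{}})
	return g
}

// UpdateSigningKey copies the current snapshot, changes one field, and swaps
// the whole pointer, so a reader sees either the old state or the new one.
func (g *Guard) UpdateSigningKey(key []byte) {
	g.mu.Lock()
	defer g.mu.Unlock()
	next := *g.state.Load() // shallow copy; published maps are never mutated
	next.signingKey = key
	g.state.Store(&next)
}

// SigningKey loads the snapshot once; callers should keep the returned value
// for the whole check instead of calling the accessor again.
func (g *Guard) SigningKey() []byte { return g.state.Load().signingKey }

func main() {
	g := NewGuard([]byte("old"))
	key := g.SigningKey() // read once per check
	g.UpdateSigningKey([]byte("new"))
	fmt.Println(string(key), string(g.SigningKey())) // old new
}
```

Holding key in a local variable mirrors the "read each signing key once per branch" point: a reload that lands mid-check cannot change the key the check is already using.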
2026-06-04 22:26:08 -07:00
Chris Lu e264e9883e fix(seaweed-volume): bound request body and stored-content expansion to prevent OOM under load (#9780)
* fix(seaweed-volume): bound request body and stored-content expansion to prevent OOM

The Rust volume server buffered the entire upload body with
to_bytes(usize::MAX) and only checked the file-size limit afterward, so a
single large upload — or many concurrent uploads, since the in-flight byte
throttle defaults to 0 (unlimited) — could exhaust memory and get the process
OOM-killed under load. The read path had two more single-request OOM vectors:
`vec![0u8; manifest.size]` allocated from an attacker-controlled chunk-manifest
size, and gzip decompression was unbounded (gzip bomb).

- Bound the upload body read by file_size_limit_bytes (plus a margin for
  multipart framing), mirroring Go's io.LimitReader(sizeLimit+1), and reject
  oversize before the whole body is buffered.
- Validate manifest.size (reject negative / oversized) before allocating.
- Cap gzip output in maybe_decompress_gzip and route the inline GzDecoder sites
  through it.

* fix(seaweed-volume): address review - chunk offset, 32-bit cast, decompress errors

- Validate chunk.offset before indexing in chunk-manifest expansion: a negative
  offset wrapped to a huge usize and underflowed `end - offset` (panic from a
  crafted manifest). Reject negative, skip out-of-range, use saturating math.
- Use usize::try_from for the upload body limit instead of `as usize`, so a
  >usize::MAX file_size_limit on 32-bit caps at usize::MAX rather than silently
  truncating to a tiny value.
- maybe_decompress_gzip now returns Result<_, GunzipError> distinguishing a
  decode failure (callers fall back to raw bytes, as before) from hitting the
  size cap (TooLarge), which now returns 413 instead of silently serving the
  still-compressed bytes.

* fix(seaweed-volume): inflate manifest chunks into the result window to cap peak memory

The chunk-manifest expansion still doubled memory: `result` was already allocated
at manifest.size (<=2 GiB) and each compressed chunk was inflated into a separate
Vec (also up to 2 GiB), so a single request could peak near 4 GiB.

Decompress compressed chunks directly into their result[offset..] window (bounded
by the remaining space) so a chunk never allocates a second large buffer; peak
stays at ~manifest.size. Bytes past the window are dropped (matching the prior
truncation), and a fully-undecodable chunk still falls back to its raw bytes.

* fix(seaweed-volume): fall back to raw chunk bytes on any decode failure

Per review: the gzip fallback must run on any decode error, not only when no
bytes were decoded. Clear the partially-written output and copy the chunk's raw
bytes (truncated to the window), restoring the prior decode-failure behavior.
2026-06-01 22:24:13 -07:00
Chris Lu dfa86b4313 volume: keep volume writable after a deletion-tail compaction (#9776)
makeupDiff replays post-snapshot changes onto the compacted volume. For a
replayed deletion it appended a tombstone to the new .dat but recorded the
.idx entry with offset 0. When that deletion is the last replayed change the
tombstone lands at the .dat tail, and the post-commit integrity check skips
offset-0 entries, so it sees 32 trailing bytes it can't account for and flips
the volume read-only, reloading it as a SortedFileNeedleMap instead of the
writable map.

Record the tombstone's real .dat offset, matching the normal delete path; the
needle map still treats the entry as deleted based on its negative size, so
lookups are unchanged. Mirror the same fix into the Rust volume server.
2026-06-01 13:15:08 -07:00
Chris Lu 05c6500453 volume: fix maxVolumeCount dead zone that stalled writes on auto-sized disks (#9755)
* volume: don't drop the last writable slot on auto-sized disks

MaybeAdjustVolumeMax subtracted 1 from the per-disk slot count, so a disk
with room for exactly one volume (free between 1x and 2x the size limit)
reported 0 slots. The master then never grew a writable volume and every
assign drained its retry budget, so writes failed with context deadline
exceeded. Count the full volumes that actually fit, floored at one for an
auto-sized disk that has free space.
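
A rough sketch of the slot arithmetic described above. autoVolumeSlots is a
hypothetical helper, not the real MaybeAdjustVolumeMax, and it ignores
volumes already on the disk and any reserved space.

```go
package main

// autoVolumeSlots counts the full volumes that fit into freeBytes and floors
// the result at one when the disk has any free space at all.
// Hypothetical helper for illustration only.
func autoVolumeSlots(freeBytes, volumeSizeLimit uint64) uint64 {
	if freeBytes == 0 || volumeSizeLimit == 0 {
		return 0
	}
	slots := freeBytes / volumeSizeLimit
	if slots == 0 {
		slots = 1 // auto-sized disk with free space keeps one writable slot
	}
	return slots
}
```

With free space at 1.5x the size limit the old "subtract one" rule would
report 0 slots; this sketch reports 1.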

* mini: show disk and volume capacity in the startup banner

Print free space, volume size, total volume count and free volume count
under the data directory line, so a volume size limit that outstrips the
disk is visible at startup instead of surfacing later as failed writes.
2026-05-30 23:45:17 -07:00
Chris Lu 3674f9d04d fix(storage): keep EC .vif when deleting a coexisting regular volume (#9723)
* fix(storage): keep EC .vif when deleting a coexisting regular volume

A regular volume and an EC volume for the same id share <base>.vif. When
EC shards are distributed onto a server that still holds the regular
volume — the encode source, or any replica the planner targets — the
post-encode VolumeDelete ran removeVolumeFiles and stripped the shared
.vif, leaving the freshly built EC volume without its info file.

Skip the .vif in removeVolumeFiles when an EC volume for the same id
exists on the disk (mounted, or a sealed .ecx on disk). The regular
volume's .dat/.idx still go; the EC sidecars survive.

A two-server end-to-end test encodes a volume whose source and a stub
replica both also receive shards, and asserts the final on-disk layout:
both .dat/.idx gone, each server holding only its assigned shards plus
.ecx/.vif. Storage unit tests cover the with-EC and no-EC cases, and the
Rust seaweed-volume port carries the same guard and tests.

* test(storage): assert .idx is removed in the no-EC destroy case

Strengthen TestDestroyRemovesVifWhenNoEc to confirm the full regular
volume cleanup (.dat, .idx, .vif) when no EC volume coexists.
2026-05-28 15:39:31 -07:00
Chris Lu 77dcb20a74 writeJson: drop unused JSONP branch (#9686)
* writeJson: drop unused JSONP branch

No in-tree caller uses ?callback=. Always serve application/json
with X-Content-Type-Options: nosniff.

* seaweed-volume: drop unused JSONP branch

Mirror Go: always serve application/json with
X-Content-Type-Options: nosniff.

* writeJson: drop unreachable StatusNotModified check

bodyAllowedForStatus already returns early for 304.

* test/volume_server: rename and rewrite JSONP test to assert callback is ignored

CI: /status?callback=myFunc now returns plain application/json
with X-Content-Type-Options: nosniff.
2026-05-26 01:05:07 -07:00
Chris Lu 5b42287c22 fix(storage): surface stat error on zero-size idx scrub, mirror to rust (#9612)
fix(storage): harden zero-size idx scrub and mirror to rust

When a zero-size .idx is found, openIndex stats the backing .dat through
v.DataBackend: wrap that GetStat failure with %w, fix the indices typo, and
guard both openIndex and scrubVolumeData against a nil DataBackend (closed or
remote-only volumes) instead of panicking.

Add rust scrub tests for empty (superblock-only .dat, zero-size .idx) and
healthy volumes, keeping the volume server in parity with the go zero-size
scrub handling.
2026-05-21 10:17:23 -07:00
Chris Lu cfc08fbf6c fix(volume): tombstone integrity check no longer flips volumes read-only (fixes #9563) (#9565)
* fix(volume): pass on-disk tombstone size to ReadData in verifyDeletedNeedleIntegrity

verifyDeletedNeedleIntegrity was forwarding TombstoneFileSize (-1) into
Needle.ReadData. A deletion tombstone is appended to .dat with DataSize=0
so the on-disk needle header carries Size=0; TombstoneFileSize is only
the .idx sentinel for "this entry is deleted" and is never written into
a needle header.

ReadBytes' size check therefore mismatched on every tombstone
(-1 != 0), returned ErrorSizeMismatch, and triggered the
4-byte-offset wrap-around retry in ReadData (offset + 32 GB). On any
volume large enough that offset+32 GB exceeds dat fileSize the retry
read EOF, CheckVolumeDataIntegrity reported corruption, and the loader
set noWriteOrDelete = true. Every volume whose last 10 .idx entries
included a deletion went read-only on startup — i.e. any healthy
volume where the most recent operations included a delete.

Pass Size(0) so the size check matches the on-disk tombstone header.

Add a regression test that writes three needles, deletes one, and
asserts CheckVolumeDataIntegrity succeeds with a tombstone at the .idx
tail. Without this fix the test reproduces the exact log shape from
the bug report:

  read 0 dataSize 32 offset <orig+32GB> fileSize <much smaller>: EOF
  verifyDeletedNeedleIntegrity ...idx failed: read data [N,N+32) : EOF

The Rust port guards its integrity-check size comparison with
!size.is_deleted() (seaweed-volume/src/storage/volume.rs) and never
hits this path, so no Rust mirror change is needed.

* test(seaweed-volume): mirror Go regression for deletion-tombstone integrity

The Rust integrity check already guards its size-mismatch comparison
with !size.is_deleted() (volume.rs:1859) and reads tombstone AppendAtNs
with body_size=0, so the Go regression fixed in the previous commit
does not apply. Lock that guarantee in with a parallel reload test:
write three needles, delete one, sync, reopen via Volume::new, assert
the volume is not flipped read-only.

Catches any future change that removes the deleted-entry guard or
re-introduces a size-strict path in check_volume_data_integrity for
tombstones.

* fix(volume): propagate io.EOF and ErrorSizeMismatch from verifyDeletedNeedleIntegrity

CheckVolumeDataIntegrity relies on identity comparison against io.EOF
and ErrorSizeMismatch to walk back through the last ten .idx entries
and tolerate a partial truncation at the tail (the "fix and continue"
loop). The live-needle branch in doCheckAndFixVolumeData already
returns those sentinels unwrapped; the deletion branch wrapped them
in fmt.Errorf, so a genuine .dat truncation past a tombstone offset
broke the recovery and flipped the volume read-only.

Mirror the live-needle handling: both verifyDeletedNeedleIntegrity
and doCheckAndFixVolumeData now short-circuit on io.EOF /
ErrorSizeMismatch and pass them through unwrapped. Other errors keep
their existing context wrapping.
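
Why the wrapping mattered, as a small sketch. errSizeMismatch stands in for
ErrorSizeMismatch, and wrapAlways, passSentinels, tolerable and demo are
hypothetical names; the point is only that an identity comparison stops
matching once a sentinel has been wrapped.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errSizeMismatch is a hypothetical stand-in for the ErrorSizeMismatch sentinel.
var errSizeMismatch = errors.New("size mismatch")

// wrapAlways models the old deletion branch: every error gets context.
func wrapAlways(err error) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("verify deleted needle: %w", err)
}

// passSentinels models the fix: sentinels pass through unwrapped,
// everything else keeps its context wrapping.
func passSentinels(err error) error {
	if err == nil || err == io.EOF || err == errSizeMismatch {
		return err
	}
	return fmt.Errorf("verify deleted needle: %w", err)
}

// tolerable is a caller that relies on identity comparison, not errors.Is.
func tolerable(err error) bool {
	return err == io.EOF || err == errSizeMismatch
}

// demo reports how the caller sees io.EOF through each branch.
func demo() (oldOK, newOK bool) {
	return tolerable(wrapAlways(io.EOF)), tolerable(passSentinels(io.EOF))
}
```

A caller using errors.Is would still match a %w-wrapped sentinel; the
recovery loop described above compares by identity, so the sentinels have to
arrive unwrapped.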

Also tighten the regression test to capture lastAppendAtNs and assert
it's non-zero, so a future regression that skips the tombstone body
(and therefore never populates AppendAtNs) is caught even when the
err check still passes.
2026-05-19 13:11:19 -07:00
Chris Lu 7c252e1f16 fix(volume): reopen .idx writable after MarkVolumeWritable (fixes #9515) (#9526)
* fix(volume): reopen .idx writable after MarkVolumeWritable

When .vif has ReadOnly=true, load() opens .idx as O_RDONLY and builds a
SortedFileNeedleMap whose Put returns os.ErrInvalid. MarkVolumeWritable
only flipped noWriteOrDelete back to false and rewrote .vif, so writes
still failed at v.nm.Put. Reopen .idx in O_RDWR and rebuild v.nm in its
writable form (in-memory or leveldb small/medium/large) before flipping
the flag.

Mirror the same fix in seaweed-volume: the Rust load path leaves
CompactNeedleMap/RedbNeedleMap with no idx_file writer when the volume
boots read-only, so post-MarkVolumeWritable puts silently succeeded
in-memory only and were lost on the next restart. set_writable now
reattaches an append-mode writer when one is missing.

* fix(volume): keep old needle map until replacement is built; defer writable flag

Go: build the writable needle map into a local before swapping. A
construction failure now leaves v.nm pointing at the original
SortedFileNeedleMap so MarkVolumeWritable can roll back, instead of
stranding the volume with v.nm == nil.

Rust: attach the .idx writer before flipping no_write_or_delete to
false. A transient open/metadata failure used to leave the volume
marked writable with no writer attached, and subsequent puts would
silently skip the on-disk append.
2026-05-18 20:51:04 -07:00
Chris Lu c11ff6657b fix(ec): mirror EC sidecars onto every shard-bearing disk at startup (#9525)
* fix(ec): mirror EC sidecars onto every shard-bearing disk at startup

In a multi-disk volume server, ec.balance and ec.rebuild can land shards
on a disk that does not also hold the matching .ecx / .ecj / .vif index
files. The orphan-shard reconciler in reconcileEcShardsAcrossDisks
already loads those shards by pointing the EcVolume at the sibling
disk's index files; reads work, but any failure on the index-owning
disk silently disables every shard on the other disk, even though those
shards are physically fine.

This change adds mirrorEcMetadataToShardDisks, a startup pass that
physically replicates .ecx / .ecj / .vif onto each disk that holds
shards but is missing them. Each copy is atomic (tmp + fsync + rename)
and idempotent (a destination that already has the sidecar is
preserved). After mirroring, the cross-disk reconciler prefers the
local IdxDirectory so the EcVolume mounts self-contained; the
cross-disk virtual mount remains as a fallback for volumes whose mirror
failed (read-only target, out of space, partial copy on a previous
boot).
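
The copy discipline above (tmp + fsync + rename, skip if present) can be
sketched roughly as follows. copyIfMissing, demo, the ".tmp" suffix and the
file names are assumptions for illustration, not the code in
mirrorEcMetadataToShardDisks; a production version would likely also fsync
the parent directory.

```go
package main

import (
	"io"
	"os"
	"path/filepath"
)

// copyIfMissing copies src to dst via a temp file, fsync, and rename.
// An existing dst is preserved, which makes repeated runs idempotent.
func copyIfMissing(src, dst string) (bool, error) {
	if _, err := os.Stat(dst); err == nil {
		return false, nil // destination already has the sidecar: keep it
	} else if !os.IsNotExist(err) {
		return false, err
	}
	in, err := os.Open(src)
	if err != nil {
		return false, err
	}
	defer in.Close()

	tmp := dst + ".tmp" // assumed temp-name convention
	out, err := os.Create(tmp)
	if err != nil {
		return false, err
	}
	if _, err = io.Copy(out, in); err == nil {
		err = out.Sync() // flush data before the rename publishes it
	}
	if cerr := out.Close(); err == nil {
		err = cerr
	}
	if err == nil {
		err = os.Rename(tmp, dst) // atomic within one filesystem
	}
	if err != nil {
		os.Remove(tmp)
		return false, err
	}
	return true, nil
}

// demo mirrors a file twice; the second call must leave dst untouched.
func demo() (first, second bool, content string, err error) {
	dir, err := os.MkdirTemp("", "mirror-sketch")
	if err != nil {
		return
	}
	defer os.RemoveAll(dir)
	src, dst := filepath.Join(dir, "owner.ecx"), filepath.Join(dir, "copy.ecx")
	if err = os.WriteFile(src, []byte("v1"), 0o644); err != nil {
		return
	}
	if first, err = copyIfMissing(src, dst); err != nil {
		return
	}
	if err = os.WriteFile(src, []byte("v2"), 0o644); err != nil {
		return
	}
	if second, err = copyIfMissing(src, dst); err != nil {
		return
	}
	b, err := os.ReadFile(dst)
	return first, second, string(b), err
}
```

Because the rename is the only step that makes dst visible, a crash mid-copy
leaves at most a stray temp file rather than a truncated sidecar.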

The same-disk invariant that the EC lifecycle (encode / decode / balance /
vacuum / repair) was already documented as promising is now actually
restored at boot, so a future failure of one disk in a split-shards
layout no longer takes the other disk's shards with it.

Tests cover the orphan-layout mirror (dir0 receives the .ecx / .ecj /
.vif from dir1) and idempotency (an existing destination .ecx is not
overwritten with the owner's copy).

* fix(ec): handle legacy pre-dir.idx sidecar layout in mirror skip-check

hasAllEcSidecarsLocally checked only the modern destination path
(IdxDirectory for .ecx/.ecj, Directory for .vif). A destination disk
that still had a legacy .ecx in its data dir (written before -dir.idx
was set) would report "not present" and the mirror would write a
second copy to IdxDirectory, leaving two .ecx files on disk.

Matches HasEcxFileOnDisk's open-with-fallback contract: check the
modern path first, then the opposite directory. Factored the
exists-and-not-a-dir check into a small statRegular helper so the
fallback ladder stays readable.

* rust(seaweed-volume): mirror EC sidecars onto shard-bearing disks at startup

Port of the Go fix (commit 088e26ea6) to the Rust volume server.
Adds Store::mirror_ec_metadata_to_shard_disks, called from
add_location / load_new_volumes before the cross-disk orphan
reconciler. Physically copies .ecx / .ecj / .vif from the disk that
owns the index files onto every disk holding shards but missing
sidecars, so each shard-bearing disk ends up self-contained.

The reconciler now prefers the local idx_directory when the mirror
has installed a .ecx there; the cross-disk virtual mount remains as
the fallback for volumes whose mirror failed (read-only target, out
of space, partial copy on a previous boot). Adds ec_local_ecx_path
helper shared between reconcile and mirror to detect the post-mirror
fast path.

Mirrors the Go-side fallback in hasAllEcSidecarsLocally: when
-dir.idx is configured and the destination still has a legacy .ecx
in its data dir, that's recognized so the mirror does not write a
duplicate copy into idx_directory.

Tests cover the two key cases: orphan layout (dir0 receives the
sidecars from dir1) and idempotency (a pre-existing destination .ecx
is not overwritten).

* trim verbose comments on EC mirror code

Comments now lead with the WHY (non-obvious constraints, the
post-mirror fast path, why local copies are authoritative) and drop
restate-the-code blocks, headers, and section dividers. Behavior is
unchanged; all existing tests still pass on both the Go volume
server and the seaweed-volume Rust port.

* drop github issue refs from added comments

Two stray "#9212" references slipped into comments I added on the
cross-disk reconciler call site. The git log carries the issue
history; comments stand on their own.

* test(ec): accept rebuild on either disk after sidecar mirror

TestEcLifecycleAcrossMultipleDisks asserted the rebuilt shard 9 must
land at the disk-0 path. With the boot-time sidecar mirror, every
shard-bearing disk owns its own .ecx, so VolumeEcShardsRebuild now
picks whichever disk hosts the most shards — disk 1 in this layout
after the deletion. The shard can legitimately rebuild on either
disk; the test now accepts both and uses the chosen path for the
subsequent mount + read verification.
2026-05-17 19:55:15 -07:00
Chris Lu 2a41e76101 fix(ec): blanket-clean every destination over the full shard range (#9512)
* fix(ec): blanket-clean every destination over the full shard range

The previous cleanup pass walked t.sources only, with the shard ids the
topology had reported at detection time. In the wild, a destination can
end up with EC shards mounted that the topology snapshot didn't list —
shards on a sibling disk that hadn't heartbeated, or shards left over
from a concurrent attempt's mount step. FindEcVolume still returns
true, so the next ReceiveFile trips the mounted-volume guard.

Cleanup now unions t.sources (with ShardIds) and t.targets and issues
unmount + delete over [0..totalShards-1] on each. Both RPCs are
idempotent on missing shards, so the wider sweep is free.

Two new tests cover the gap: shards mounted beyond what t.sources
lists, and a target-only destination with no source row.

* log(ec): include disk_id in EC unmount/delete/refusal log lines

The current logs identify the volume and shard but leave disk_id off,
which makes the cross-server cleanup story hard to follow when
multiple disks of one server hold pieces of the same volume:

  UnmountEcShards 4121.1                              -> add disk_id
  ec volume video-recordings_4121 shard delete [1 5]  -> add per-loc disk_id
  volume server X:Y deletes ec shards from 4121 [...] -> add disk_id
  ReceiveFile: ec volume 4121 is mounted; refusing... -> add disk_ids

ReceiveFile's refusal now names the disk_ids actually holding the
mount so operators can see whether the next cleanup pass needs to
target a sibling disk. Added Store.FindEcVolumeDiskIds /
Store::find_ec_volume_disk_ids as the supporting primitive.

Mirrored in seaweed-volume/src/ (unmount log in Store::unmount_ec_shard,
heartbeat delete log in diff_ec_shard_delta_messages, refusal in the
ReceiveFile handler).

* test(ec): stub VolumeEcShardsUnmount/Delete on the fake volume server

The plugin-worker EC tests boot a fake volume server that embeds
UnimplementedVolumeServerServer. After the worker started calling
VolumeEcShardsUnmount + VolumeEcShardsDelete pre-distribute, the
default Unimplemented response surfaced as fourteen "method not
implemented" errors and TestErasureCodingExecutionEncodesShards
failed. Both RPCs are no-ops here — nothing on the fake server has
mounted state or persisted shard files to remove.
2026-05-17 11:31:37 -07:00
Chris Lu d51454adf4 rust(seaweed-volume): distributed EC read across peer servers (#9516)
* feat(seaweed-volume): distributed EC read across peer servers

EcVolume::read_ec_shard_needle previously errored with NotFound when
any interval's shard wasn't local. In an RS(10,4)-across-N deployment
each server holds one shard, so every read needed >=9 peer fetches and
post-EC GETs returned 404 on volumes whose shards lived on more than
one server.

Mirror of weed/storage/store_ec.go's readOneEcShardInterval ->
readRemoteEcShardInterval -> recoverOneRemoteEcShardInterval chain:

  * server/store_ec.rs (new): entry point
    read_ec_shard_needle_distributed. Snapshots locate-needle + local
    reads under the Store sync lock, drops the lock, then async-fetches
    missing intervals via the peer's VolumeEcShardRead RPC. Falls back
    to Reed-Solomon reconstruction (read every other shard at the same
    (shard_offset, size) and run rs.reconstruct) when the direct peer
    read fails. Refreshes the per-EcVolume shard_locations cache from
    the master's LookupEcVolume RPC using Go's freshness thresholds
    (11s / 7min / 37min).
  * erasure_coding/ec_volume.rs: shard_locations now sits behind a
    std::sync::RwLock so the read path can refresh the map without
    holding the Store write lock. Adds shard_locations_refresh_time
    (Mutex<Option<Instant>>) for the staleness heuristic. Mirrors Go's
    ShardLocationsLock / ShardLocationsRefreshTime fields. set/get
    helpers updated for interior mutability.
  * server/handlers.rs: GET handler now tries the local-only fast
    path first, then falls through to the distributed path on
    NotFound.

* review: address PR 9516 feedback on distributed EC read

Five of the six PR-review comments addressed; the sixth (JWT on
outgoing peer gRPC) is deferred with an explicit TODO because the
crate-wide outgoing-JWT signing surface doesn't exist yet — adding it
in this one call site would split the credential plumbing across
peer paths that already lack it (copy_file_from_source, batch_delete,
…). Revisit when an outgoing-JWT helper lands.

Fixed in this commit:

  * Handlers: drop the two-tier (local-first, then distributed)
    read in handlers.rs. read_ec_shard_needle_distributed already
    does the local-first pass under the same store read lock; the
    redundant outer attempt re-read local intervals twice for any
    needle that spanned mixed-locality shards.
  * Scanner snapshot: replace inline locate-needle math with
    `ecv.locate_needle(needle_id)`. Same routine the local-only
    read path uses, so byte-identical on shard-size + interval
    boundaries.
  * EcVolume::set_shard_locations also advances
    shard_locations_refresh_time so the staleness check honors
    callers that populate the cache directly without going through
    the master LookupEcVolume RPC.
  * parse_grpc_address moved from grpc_server.rs into
    grpc_client.rs as `pub` and is reused by both grpc_server.rs
    and the new store_ec module. Single source of truth for the
    HTTP↔gRPC port-offset convention.
  * Reconstruction (recover_one_remote_ec_shard_interval) now seeds
    bufs from locally-mounted survivor shards BEFORE the remote
    fan-out. Previously the fan-out was remote-only, so when the
    shard_locations cache was cold or the master lookup failed,
    reconstruction errored even though enough siblings were on
    local disk to recover the missing interval.

* review: tighten parse_grpc_address; atomic shard-locations cache swap

Two follow-up findings from the PR 9516 review round 2:

  * `parse_grpc_address` now validates BOTH port components in the
    dotted form (`host:port.grpcPort`) — previously a non-numeric
    HTTP port like `host:abc.18080` slipped through and tripped a
    less-useful downstream URI parse error. The implicit form
    (`host:port` → port + 10000) also gains an overflow check so
    inputs like `host:60000` (which silently wrap past u16) are
    rejected here instead of producing an opaque connection
    failure later. Six unit tests cover each rejection path.

  * `EcVolume::set_shard_locations` no longer bumps the per-volume
    refresh timestamp. The previous fix introduced a freshness
    race: a multi-shard population that inserts shard-by-shard
    would flip `needs_refresh == false` on the first write, letting
    a concurrent reader observe a half-populated map already
    marked "fresh" and return NotFound for the not-yet-inserted
    shards. Added `EcVolume::replace_shard_locations(map)` for the
    atomic bulk swap; `write_back_shard_locations` in the
    distributed-read path uses it so the cache transitions
    old → fresh in a single observable step.
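
The address rules from the first bullet, sketched in Go rather than the Rust
of the actual change. parseGrpcAddress here is a hypothetical
re-implementation: the "host:grpcPort" return shape is an assumption, and
bracketed IPv6 hosts are not handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGrpcAddress accepts "host:port" or "host:port.grpcPort" and returns
// "host:grpcPort". Both port components must be numeric and fit in 16 bits;
// the implicit form adds 10000 and rejects a result past 65535.
func parseGrpcAddress(addr string) (string, error) {
	i := strings.LastIndex(addr, ":")
	if i < 0 {
		return "", fmt.Errorf("missing port in %q", addr)
	}
	host, ports := addr[:i], addr[i+1:]
	httpPart, grpcPart, dotted := strings.Cut(ports, ".")

	httpPort, err := strconv.ParseUint(httpPart, 10, 16)
	if err != nil {
		return "", fmt.Errorf("invalid HTTP port in %q: %w", addr, err)
	}
	if dotted {
		grpcPort, err := strconv.ParseUint(grpcPart, 10, 16)
		if err != nil {
			return "", fmt.Errorf("invalid gRPC port in %q: %w", addr, err)
		}
		return fmt.Sprintf("%s:%d", host, grpcPort), nil
	}
	grpcPort := httpPort + 10000
	if grpcPort > 65535 {
		return "", fmt.Errorf("gRPC port %d out of range for %q", grpcPort, addr)
	}
	return fmt.Sprintf("%s:%d", host, grpcPort), nil
}
```

Rejecting bad input at parse time gives the caller an error that names the
offending address, instead of a connection failure later on.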
2026-05-16 20:44:28 -07:00
Chris Lu 3a8389cd68 fix(ec): verify full shard set before deleting source volume (#9490) (#9493)
* fix(ec): verify full shard set before deleting source volume (#9490)

Before this change, both the worker EC task and the shell ec.encode
command would delete the source .dat as soon as MountEcShards returned —
even if distribute/mount failed partway, leaving fewer than 14 shards
in the cluster. The deletion was logged at V(2), so by the time someone
noticed missing data the only trace was a 0-byte .dat synthesized by
disk_location at next restart.

- Worker path adds Step 6: poll VolumeEcShardsInfo on every destination,
  union the bitmaps, and refuse to call deleteOriginalVolume unless all
  TotalShardsCount distinct shard ids are observed. A failed gate leaves
  the source readonly so the next detection scan can retry.
- Shell ec.encode adds the same gate after EcBalance, walking the master
  topology with collectEcNodeShardsInfo.
- VolumeDelete RPC success and .dat/.idx unlinks now log at V(0) so any
  source destruction is traceable in default-verbosity production logs.
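
The union-the-bitmaps gate can be sketched like this. checkFullShardSet is a
hypothetical name, and representing each destination's shards as a uint32
bitmap is an assumption of the sketch, not the signature of the real helper.

```go
package main

import (
	"fmt"
	"math/bits"
)

// checkFullShardSet ORs the per-destination shard bitmaps together and
// returns an error unless every shard id in [0, totalShards) was observed.
func checkFullShardSet(perServer []uint32, totalShards int) error {
	var union uint32
	for _, b := range perServer {
		union |= b
	}
	want := uint32(1)<<uint(totalShards) - 1 // low totalShards bits set
	if missing := want &^ union; missing != 0 {
		return fmt.Errorf("only %d of %d shards observed, missing mask %#x",
			bits.OnesCount32(union&want), totalShards, missing)
	}
	return nil
}
```

A caller would delete the source volume only when this returns nil and
otherwise leave it in place for the next scan.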

The EC-balance-vs-in-flight-encode race is intentionally left for a
follow-up; balance should refuse to move shards for a volume whose
encode job is not in Completed state.

* fix(ec): trim doc comments on the new shard-verification path

Drop WHAT-describing godoc on freshly added helpers; keep only the WHY
notes (query-error policy in VerifyShardsAcrossServers, the #9490
reference at the call sites).

* fix(ec): drop issue-number anchors from new comments

Issue references age poorly — the why behind each comment already
stands on its own.

* fix(ec): parametrize RequireFullShardSet on totalShards

Take totalShards as an argument instead of reading the package-level
TotalShardsCount constant. The OSS callers continue to pass 14, but the
helper is now usable with any DataShards+ParityShards ratio.

* test(plugin_workers): make fake volume server respond to VolumeEcShardsInfo

The new pre-delete verification gate calls VolumeEcShardsInfo on every
destination after mount, and the fake server's UnimplementedVolumeServer
returns Unimplemented — the verifier read that as zero shards on every
node and aborted source deletion. Build the response from recorded
mount requests so the integration test exercises the gate end-to-end.

* fix(rust/volume): log .dat/.idx unlink with size in remove_volume_files

Mirror the Go-side change in weed/storage/volume_write.go: stat each
file before removing and emit an info-level log for .dat/.idx so a
destructive call is always traceable. The OSS Rust crate previously
unlinked them silently.

* fix(ec/decode): verify regenerated .dat before deleting EC shards

After mountDecodedVolume succeeds, the previous code immediately
unmounts and deletes every EC shard. A silent failure in generate or
mount could leave the cluster with neither shards nor a valid normal
volume. Probe ReadVolumeFileStatus on the target and refuse to proceed
if dat or idx is 0 bytes.

Also make the fake volume server's VolumeEcShardsInfo reflect whichever
shard files exist on disk (seeded for tests as well as mounted via
RPC), so the new gate can be exercised end-to-end.

* fix(ec): address PR review nits in verification + fake server

- Drop unused ServerShardInventory.Sizes field.
- Skip shard ids >= MaxShardCount before bitmap Set so the ShardBits
  bound is explicit (Set already no-ops on overflow, this is for
  clarity).
- Nil-guard the fake server's VolumeEcShardsInfo so a malformed call
  doesn't panic the test process.
2026-05-13 19:29:24 -07:00
Chris Lu de28c4df61 fix(storage): prune partial EC shards when sibling disk has healthy .dat (#9478) (#9480)
* fix(storage): prune partial EC shards when sibling disk has healthy .dat (#9478)

handleFoundEcxFile only checks for .dat in the same disk location as the
EC shards. In a multi-disk volume server an interrupted encode can leave
.ec?? + .ecx on disk B while the source .dat still lives on disk A: the
per-disk loader sees no .dat next to .ecx, mistakes the leftover for a
distributed-EC layout, and mounts the partial shards. The volume server
then heartbeats both a regular replica and an EC shard for the same vid
and the master keeps both.

Sweep the store after per-disk loading and before the cross-disk
reconcile to delete partial EC files when a healthy .dat for the same
(collection, vid) exists on a sibling disk. Push DeletedEcShardsChan for
every pruned shard so master forgets the new-shard message the per-disk
pass already emitted, instead of waiting for the next periodic heartbeat.

* fix(seaweed-volume): mirror prune of partial EC with sibling .dat (#9478)

Rust port of the same Store-level prune added to weed/storage. The
per-disk EC loader in disk_location.rs only checks for .dat in the same
disk as the EC shards, so an interrupted encode that leaves .ec?? + .ecx
on disk B while the source .dat sits on disk A is mounted as if it were
a distributed-EC layout. The volume server then heartbeats both a
regular replica and an EC shard for the same vid.

Sweep the store after per-disk loading and before the cross-disk
reconcile, dropping in-memory EcVolumes with fewer than DATA_SHARDS_COUNT
shards when a .dat for the same (collection, vid) exists on a sibling
disk, and remove all on-disk EC artefacts for them. The Rust heartbeat
path already diff-emits deletes from the next ec_volumes snapshot, so no
explicit delete-channel push is needed here.

Tests cover both the issue 9478 layout and a distributed-EC layout with
no .dat anywhere on the store, which must be left alone.

* fix(storage): validate sibling .dat size before deleting partial EC (#9478)

The earlier prune deleted partial EC files whenever any .dat for the
same vid existed on a sibling disk — including a zero-byte shell. A
shell is no more useful than the partial shard it would replace, and
the partial shard might still combine with shards on other servers
in a recoverable distributed-EC layout. Wiping it based on a corrupt
sibling .dat is data loss masquerading as cleanup.

Tighten the check: when the EC's .vif recorded a non-zero source size
in datFileSize, require the sibling .dat to be at least that many
bytes; otherwise fall back to "at least a superblock". The .vif value
is what the encoder wrote at the moment the source was sealed, so a
sibling .dat smaller than that is provably truncated. Carry the size
through indexDatOwners alongside the location.

The Rust port had the same gap and an additional bug behind it:
EcVolume::new wasn't reading datFileSize from .vif, so the safety
check always fell back to the superblock floor. Wire datFileSize
through. The existing shard-size calculation in
LocateEcShardNeedleInterval already uses dat_file_size when non-zero,
so populating it also matches Go's behaviour there.

Tests cover the truncated-sibling case in both ports.
2026-05-13 09:25:10 -07:00
Chris Lu 10cc06333b cluster: restrict Ping RPC to known peers of the requested type (#9445)
Ping previously dialled whatever host:port the caller asked for. Gate
each server's Ping handler on cluster membership: masters check the
topology, registered cluster nodes, and configured master peers; volume
servers only accept their seed/current masters; filers accept tracked
peer filers, the master-learned volume server set, and configured
masters.

Use address-indexed peer lookups to keep Ping target validation O(1):
- topology maintains a pb.ServerAddress -> *DataNode index alongside
  the dc/rack/node tree, kept in sync from doLinkChildNode and
  UnlinkChildNode plus the ip/port-rewrite branch in
  GetOrCreateDataNode. GetTopology now returns nil on a detached
  subtree instead of panicking, so the linkage hooks can no-op safely.
- vid_map tracks a refcount per volume-server address so
  hasVolumeServer answers without scanning every vid location. The
  add path skips empty-address entries the same way the delete path
  already does, so a zero-value Location cannot leak a permanent
  serverRefCount[""] bucket.
- masters reuse a cached master-address set from MasterClient instead
  of walking the configured peer slice on every request.
- volume servers compare against a pre-built seed-master set and
  protect currentMaster reads/writes with an RWMutex, fixing the
  data race with the heartbeat goroutine. The seed slice is copied
  on construction so external mutation cannot desync it from the
  frozen lookup set.
- cluster.check drops the direct volume-to-volume sweep; volume
  servers no longer carry a peer-volume list, and the note next to
  the dropped probe is reworded to make clear that direct
  volume-to-volume reachability is intentionally not validated by
  this command.
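
The vid_map refcount bullet, sketched with hypothetical names (serverRefs,
add, remove, has). This is single-threaded for brevity; a shared map would
need locking.

```go
package main

// serverRefs counts how many volume locations reference each server address,
// so a membership check is a single map lookup.
type serverRefs struct {
	counts map[string]int
}

func newServerRefs() *serverRefs {
	return &serverRefs{counts: make(map[string]int)}
}

// add records one more location on addr. Empty addresses are skipped so a
// zero-value location cannot create a permanent "" bucket.
func (s *serverRefs) add(addr string) {
	if addr == "" {
		return
	}
	s.counts[addr]++
}

// remove drops one reference and deletes the bucket when it reaches zero.
func (s *serverRefs) remove(addr string) {
	if addr == "" {
		return
	}
	if n := s.counts[addr]; n <= 1 {
		delete(s.counts, addr)
	} else {
		s.counts[addr] = n - 1
	}
}

// has reports whether any location still references addr.
func (s *serverRefs) has(addr string) bool {
	return s.counts[addr] > 0
}
```

Keeping add and remove symmetric on the empty-address case is what prevents
the leaked bucket mentioned above.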

Update the volume-server integration tests that drove Ping through the
new admission gate: success-path coverage now targets the master peer
(the only type a volume server tracks), and the unknown/unreachable
path asserts the InvalidArgument the gate now returns instead of the
old downstream dial error.

Mirror the same admission gate in the Rust volume server crate: a
seed-master HashSet built once at startup plus a tokio RwLock over the
heartbeat-tracked current master, both consulted in is_known_ping_target
on every Ping, with InvalidArgument returned for any target that isn't
a recognised master.
2026-05-12 13:00:52 -07:00
Chris Lu 532b088262 fix(ec): preserve source disk type across EC encoding (#9423) (#9449)
* fix(ec): carry source disk type on VolumeEcShardsMount (#9423)

When EC shards land on a target whose disk type differs from the
source volume's, heartbeats to the master wrongly reported the shards under
the target disk's type. Add source_disk_type to VolumeEcShardsMountRequest; the
target server applies it to the in-memory EcVolume via SetDiskType so
the mount notification and steady-state heartbeat both carry the
source's disk type. Empty value falls back to the location's disk
type (used by disk-scan reload paths).

The override is not persisted with the volume — disk type stays an
environmental property and .vif remains portable.

* fix(ec): plumb source disk type through plugin worker (#9423)

Add source_disk_type to ErasureCodingTaskParams (field 8; 7 reserved),
populate it from the metric the detector already collects, thread it
through ec_task into the MountEcShards helper, and forward it on the
VolumeEcShardsMount RPC.

* fix(ec): mirror source disk type plumbing in rust volume server (#9423)

The volume_ec_shards_mount handler now forwards source_disk_type into
mount_ec_shard → DiskLocation::mount_ec_shards. When non-empty it
overrides ec_vol.disk_type (and each mounted shard's disk_type) via
the new set_disk_type method; empty value keeps the location's disk
type, so disk-scan reload and reconcile paths are unchanged.

Also picks up two pre-existing proto drifts that 'make gen' synced
from weed/pb (LockRingUpdate in master.proto, listing_cache_ttl_seconds
in remote.proto).

* feat(ec): bias placement toward preferred disk type (#9423)

Add DiskCandidate.DiskType and PlacementRequest.PreferredDiskType.
When PreferredDiskType is non-empty, SelectDestinations partitions
suitable disks into matching/fallback tiers and runs the rack/server/
disk-diversity passes on the matching tier first; the fallback tier
is only consulted if the matching pool can't satisfy ShardsNeeded.
PlacementResult.SpilledToOtherDiskType lets callers warn on spillover.

Empty PreferredDiskType keeps the existing single-pool behavior.
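
A minimal sketch of the two-tier selection described above, assuming a trimmed-down candidate type and ignoring the rack/server/disk-diversity passes. `diskCandidate` and `selectByTier` are hypothetical names for illustration, not the planner's actual API.

```go
package main

import "fmt"

// diskCandidate is a hypothetical, trimmed-down stand-in for DiskCandidate.
type diskCandidate struct {
	NodeID   string
	DiskType string
}

// selectByTier fills shardsNeeded slots from disks matching preferred first
// and only consults the remaining disks when the matching tier runs out.
// An empty preferred value keeps everything in a single pool.
func selectByTier(candidates []diskCandidate, preferred string, shardsNeeded int) (picked []diskCandidate, spilled bool) {
	var matching, fallback []diskCandidate
	for _, c := range candidates {
		if preferred == "" || c.DiskType == preferred {
			matching = append(matching, c)
		} else {
			fallback = append(fallback, c)
		}
	}
	for _, tier := range [][]diskCandidate{matching, fallback} {
		for _, c := range tier {
			if len(picked) == shardsNeeded {
				return picked, spilled
			}
			if preferred != "" && c.DiskType != preferred {
				spilled = true // caller can warn about spillover
			}
			picked = append(picked, c)
		}
	}
	return picked, spilled
}

func main() {
	candidates := []diskCandidate{
		{NodeID: "n1", DiskType: "hdd"},
		{NodeID: "n2", DiskType: "ssd"},
		{NodeID: "n3", DiskType: "hdd"},
	}
	picked, spilled := selectByTier(candidates, "ssd", 2)
	fmt.Println(picked, spilled)
}
```

The spill flag plays the role of PlacementResult.SpilledToOtherDiskType: it is set only when a disk from the fallback tier is actually used.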

* fix(ec): plumb source disk type into placement planner (#9423)

diskInfosToCandidates now copies DiskInfo.DiskType into the placement
candidate, and ecPlacementPlanner.selectDestinations forwards
metric.DiskType as PreferredDiskType so EC shards land on disks
matching the source volume's disk type when possible. A glog warning
fires when placement had to spill to other disk types.

* test(ec): integration coverage for source-disk-type plumbing (#9423)

store_ec_disk_type_test exercises Store.MountEcShards end-to-end: a
shard physically lives on an HDD location, MountEcShards is called
with sourceDiskType="ssd", and the test asserts that the in-memory
EcVolume, the mounted shard, the NewEcShardsChan notification, and
the steady-state heartbeat all report under the source's disk type.
A companion test pins the empty-source path so disk-scan reload
keeps the location's disk type.

detection_disk_type_test exercises the worker plumbing: with a
cluster of nodes carrying both HDD and SSD disks, planECDestinations
must place every shard on SSD when metric.DiskType="ssd"; with only
one SSD node and 13 HDD nodes it must still satisfy a 10+4 layout
via spillover (and log a warning).

* revert(ec): drop unrelated proto drift in seaweed-volume/proto (#9423)

make gen pulled two pre-existing OSS changes into the rust proto
tree (LockRingUpdate / by_plugin in master.proto,
listing_cache_ttl_seconds in remote.proto). Reviewers flagged it as
scope creep — none of the rust EC fix references those fields.
Restore both files to origin/master so this branch only touches
EC-related symbols.

* fix(ec placement): treat empty disk type as hdd and skip used racks on spill (#9423)

partitionByDiskType used raw string comparison, so a PreferredDiskType
of "hdd" never matched candidates whose DiskType is "" (the
HardDriveType sentinel that weed/storage/types uses). EC encoding of
an HDD source would spill onto any HDD reporting "" even when the
cluster has plenty of matching capacity. Normalize both sides
through normalizeDiskType, which lowercases and folds "" → "hdd",
mirroring types.ToDiskType without taking a dependency on it.
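
As a rough illustration of the normalization described above: the sketch below lowercases and folds "" into "hdd" before comparing. The function body and the `sameDiskType` helper are assumptions for illustration; only the name normalizeDiskType comes from the commit.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeDiskType lowercases the value and folds the empty string
// (the hard-drive sentinel) into "hdd".
func normalizeDiskType(diskType string) string {
	lowered := strings.ToLower(diskType)
	if lowered == "" {
		return "hdd"
	}
	return lowered
}

// sameDiskType is a hypothetical helper comparing two disk types after
// normalization instead of by raw string equality.
func sameDiskType(a, b string) bool {
	return normalizeDiskType(a) == normalizeDiskType(b)
}

func main() {
	fmt.Println(sameDiskType("hdd", "")) // raw comparison would say false
	fmt.Println(sameDiskType("SSD", "ssd"))
	fmt.Println(sameDiskType("ssd", ""))
}
```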

selectFromTier's rack-diversity pass also kept revisiting racks the
preferred tier had already used when running on the fallback tier,
which negated PreferDifferentRacks on spillover. Skip racks already
in usedRacks so fallback placements still spread onto new racks.

* fix(ec): empty-source remount must not clobber existing disk type (#9423)

mount_ec_shards_with_idx_dir runs more than once per vid (RPC mount,
disk-scan reload, orphan-shard reconcile). After an RPC sets the
source-derived disk type, any later call passing source_disk_type=""
was resetting ec_vol.disk_type back to the location's value, which
reintroduces the heartbeat drift this PR is meant to fix. Only
default to the location's disk type when the EC volume is fresh
(no shards mounted yet); otherwise leave the recorded type alone so
empty-source reloads preserve whatever the original mount RPC set.
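
The rule above can be sketched as follows, written in Go for illustration even though the fix is in the Rust server. `ecVolume` and `applyDiskType` are hypothetical names, and the sketch assumes the decision is made before the new shards are attached.

```go
package main

import "fmt"

// ecVolume is a hypothetical stand-in for the in-memory EC volume.
type ecVolume struct {
	diskType string
	shards   map[int]bool
}

// applyDiskType decides which disk type a (re)mount records:
//   - a non-empty source type always wins;
//   - an empty source type defaults to the location's type only when the
//     volume is fresh (no shards mounted yet);
//   - otherwise the previously recorded type is left alone.
func applyDiskType(v *ecVolume, sourceDiskType, locationDiskType string) {
	switch {
	case sourceDiskType != "":
		v.diskType = sourceDiskType
	case len(v.shards) == 0:
		v.diskType = locationDiskType
	}
}

func main() {
	v := &ecVolume{shards: map[int]bool{}}

	// RPC mount carrying the source's disk type.
	applyDiskType(v, "ssd", "hdd")
	v.shards[0] = true

	// Later disk-scan reload with an empty source type.
	applyDiskType(v, "", "hdd")
	fmt.Println(v.diskType)
}
```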
2026-05-11 20:21:50 -07:00
Chris Lu 487b93eb49 fix(volume): don't panic on read when needle map is nil (#9342)
* fix(volume): don't panic on read when needle map is nil

A failed CommitCompact reload (and #9335's new error path for a
remote-tiered volume with a stray .vif but no .idx) leaves v.nm == nil
on a volume that's still in the store. readNeedle / readNeedleDataInto
dereferenced v.nm with no guard, so the next GET panicked in the
HTTP handler with a nil-pointer dereference instead of returning an
error the client could retry on another replica.

Add the same v.nm == nil check the other Volume accessors already use,
including the slow-read inner loop where the lock is released between
iterations and a failed reload can race in.

Fixes #9339.

* match rust nm-nil read behavior; trim comments

seaweed-volume's read_needle_with_option / re_lookup_needle_data_offset
already lift Option<NeedleMap> through ok_or(NotFound). Use ErrorNotFound
on the Go side too instead of a generic 500-mapped error so both volume
servers respond identically when v.nm is nil.
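
A minimal sketch of the guard, assuming heavily simplified `volume` and `needleMap` types; `readNeedleOffset` is a hypothetical method, not the real readNeedle signature. Only the v.nm == nil check and the ErrorNotFound result come from the commit.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrorNotFound stands in for the storage package's not-found error.
var ErrorNotFound = errors.New("not found")

// needleMap is a hypothetical, simplified index: needle key -> offset.
type needleMap struct {
	offsets map[uint64]int64
}

// volume keeps nm as a pointer; a failed reload can leave it nil.
type volume struct {
	nm *needleMap
}

// readNeedleOffset returns ErrorNotFound instead of dereferencing a nil
// needle map, so the caller can retry on another replica.
func (v *volume) readNeedleOffset(key uint64) (int64, error) {
	if v.nm == nil {
		return 0, ErrorNotFound
	}
	offset, ok := v.nm.offsets[key]
	if !ok {
		return 0, ErrorNotFound
	}
	return offset, nil
}

func main() {
	halfLoaded := &volume{} // nm is nil after a failed reload
	_, err := halfLoaded.readNeedleOffset(42)
	fmt.Println(err)
}
```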

* log once when reads hit nil needle map

ErrorNotFound alone hides the real cause: a half-loaded volume just
returns 404s and the operator has nothing to grep for. Add a once-per-
volume Errorf on the nil path, reset on successful load. Mirror the
same in seaweed-volume via nm_or_not_found().

* trim comments

* drop once-flag, log inline on every nil-nm read
2026-05-06 18:23:06 -07:00
Chris Lu 1c0e24f06a fix(balance): don't move remote-tiered volumes; don't fatal on missing .idx (#9335)
* fix(volume): don't fatal on missing .idx for remote-tiered volume

A .vif left behind without its .idx (orphaned by a crashed move, partial
copy, or hand-edit) would trip glog.Fatalf in checkIdxFile and take the
whole volume server down on boot, killing every healthy volume on it
too. For remote-tiered volumes treat it as a per-volume load error so
the server can come up and the operator can clean up the stray .vif.

Refs #9331.

* fix(balance): skip remote-tiered volumes in admin balance detection

The admin/worker balance detector had no equivalent of the shell-side
guard ("does not move volume in remote storage" in
command_volume_balance.go), so it scheduled moves on remote-tiered
volumes. The "move" copies .idx/.vif to the destination and then calls
Volume.Destroy on the source, which calls backendStorage.DeleteFile —
deleting the remote object the destination's new .vif now points at.

Populate HasRemoteCopy on the metrics emitted by both the admin
maintenance scanner and the worker's master poll, then drop those
volumes at the top of Detection.

Fixes #9331.

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* fix(volume): keep remote data on volume-move-driven delete

The on-source delete after a volume move (admin/worker balance and
shell volume.move) ran Volume.Destroy with no way to opt out of the
remote-object cleanup. Volume.Destroy unconditionally calls
backendStorage.DeleteFile for remote-tiered volumes, so a successful
move would copy .idx/.vif to the destination and then nuke the cloud
object the destination's new .vif was already pointing at.

Add VolumeDeleteRequest.keep_remote_data and plumb it through
Store.DeleteVolume / DiskLocation.DeleteVolume / Volume.Destroy. The
balance task and shell volume.move set it to true; the post-tier-upload
cleanup of other replicas and the over-replication trim in
volume.fix.replication also set it to true since the remote object is
still referenced. Other real-delete callers keep the default. The
delete-before-receive path in VolumeCopy also sets it: the inbound copy
carries a .vif that may reference the same cloud object as the
existing volume.
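
To make the flag's effect concrete, here is a small sketch with an in-memory stand-in for the remote tier. `remoteTier`, `volume`, and their fields are hypothetical; the real Volume.Destroy does considerably more.

```go
package main

import "fmt"

// remoteTier is a hypothetical in-memory stand-in for a cloud backend.
type remoteTier struct {
	objects map[string]bool
}

func (r *remoteTier) DeleteFile(key string) { delete(r.objects, key) }

// volume is a simplified model: remoteKey is empty unless remote-tiered.
type volume struct {
	remoteKey string
	tier      *remoteTier
	destroyed bool
}

// Destroy always removes the local volume, but only deletes the remote
// object when keepRemoteData is false.
func (v *volume) Destroy(keepRemoteData bool) {
	if v.remoteKey != "" && !keepRemoteData {
		v.tier.DeleteFile(v.remoteKey)
	}
	v.destroyed = true
}

func main() {
	tier := &remoteTier{objects: map[string]bool{"vol-7": true}}

	// Move case: the destination's .vif still points at the object.
	source := &volume{remoteKey: "vol-7", tier: tier}
	source.Destroy(true)
	fmt.Println("after move:", tier.objects["vol-7"])

	// Real delete: the remote object goes away too.
	last := &volume{remoteKey: "vol-7", tier: tier}
	last.Destroy(false)
	fmt.Println("after delete:", tier.objects["vol-7"])
}
```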

Refs #9331.

* test(storage): in-process remote-tier integration tests

Cover the four operations the user is most likely to run against a
cloud-tiered volume — balance/move, vacuum, EC encode, EC decode — by
registering a local-disk-backed BackendStorage as the "remote" tier and
exercising the real Volume / DiskLocation / EC encoder code paths.

Locks in:
- Destroy(keepRemoteData=true) preserves the remote object (move case)
- Destroy(keepRemoteData=false) deletes it (real-delete case)
- Vacuum/compact on a remote-tier volume never deletes the remote object
- EC encode requires the local .dat (callers must download first)
- EC encode + rebuild round-trips after a tier-down

Tests run in-process and finish in under a second total — no cluster,
binary, or external storage required.

* fix(rust-volume): keep remote data on volume-move-driven delete

Mirror the Go fix in seaweed-volume: plumb keep_remote_data through
grpc volume_delete → Store.delete_volume → DiskLocation.delete_volume
→ Volume.destroy, and skip the s3-tier delete_file call when the flag
is set. The pre-receive cleanup in volume_copy passes true for the
same reason as the Go side: the inbound copy carries a .vif that may
reference the same cloud object as the existing volume.

The Rust loader already warns rather than fataling on a stray .vif
without an .idx (volume.rs load_index_inmemory / load_index_redb), so
no counterpart to the Go fatal-on-missing-idx fix is needed.

Refs #9331.

* fix(volume): preserve remote tier on IO-error eviction; fix EC test target

Two review nits:

- Store.MaybeAddVolumes' periodic cleanup pass deleted IO-errored
  volumes with keepRemoteData=false, so a transient local fault on a
  remote-tiered volume would also nuke the cloud object. Track the
  delete reason via a parallel slice and pass keepRemoteData=v.HasRemoteFile()
  for IO-error evictions; TTL-expired evictions still pass false.

- TestRemoteTier_ECEncodeDecode_AfterDownload deleted shards 0..3 but
  called them "parity" — by the klauspost/reedsolomon convention shards
  0..DataShardsCount-1 are data and DataShardsCount..TotalShardsCount-1
  are parity. Switch the loop to delete the parity range so the
  intent matches the indices.
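
The index convention in the second nit can be sketched like this, assuming the 10+4 layout mentioned elsewhere in this log. The constant values and the two helper functions are illustrative, not copied from the codebase.

```go
package main

import "fmt"

// Assumed 10+4 layout; the values are illustrative.
const (
	DataShardsCount   = 10
	ParityShardsCount = 4
	TotalShardsCount  = DataShardsCount + ParityShardsCount
)

// isParityShard reports whether shardID falls in the parity range
// DataShardsCount..TotalShardsCount-1.
func isParityShard(shardID int) bool {
	return shardID >= DataShardsCount && shardID < TotalShardsCount
}

// parityShardIDs lists every parity shard index in order.
func parityShardIDs() []int {
	ids := make([]int, 0, ParityShardsCount)
	for id := DataShardsCount; id < TotalShardsCount; id++ {
		ids = append(ids, id)
	}
	return ids
}

func main() {
	fmt.Println(parityShardIDs())
	fmt.Println(isParityShard(3)) // shards 0..3 are data shards
}
```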

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-05-06 15:19:43 -07:00
Chris Lu 2417ba0354 fix(volume): add authentication to destructive gRPC admin endpoints (#8876)
* fix(volume): add authentication to destructive gRPC admin endpoints

Three destructive VolumeServer gRPC endpoints (DeleteCollection,
VolumeDelete, VolumeServerLeave) had no authentication checks, unlike
their HTTP counterparts which are protected by the Guard whitelist.

Add IsWhiteListed(host) to security.Guard and a checkGrpcAdminAuth
helper on VolumeServer that extracts the peer IP from gRPC context and
validates it against the guard whitelist. Gate all three endpoints
behind this check.

* fix(volume): tolerate unparseable gRPC peer address in admin auth check

S3 Filer Group integration tests were failing with
PermissionDenied "bad peer address: address @: missing port in address"
when DeleteCollection ran across the in-process gRPC connection
between filer and volume server — the peer addr surfaces as "@" there
and net.SplitHostPort can't parse it. The check rejected before
IsWhiteListed could exercise its allow-all path for empty-whitelist
deployments.

Hand the raw peer string to IsWhiteListed when SplitHostPort fails.
With no whitelist configured (the test environment's mode) it accepts;
with a whitelist configured the unparseable host won't match anything
and the call still gets denied as it should.
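
A small sketch of that fallback, assuming an exact-match whitelist; the real guard's matching rules may differ. `hostForWhitelist` and `isWhiteListed` are hypothetical helpers; only net.SplitHostPort is the actual standard-library call.

```go
package main

import (
	"fmt"
	"net"
)

// hostForWhitelist extracts the host from a peer address and falls back to
// the raw string when the address cannot be split (e.g. "@" for an
// in-process connection).
func hostForWhitelist(peerAddr string) string {
	host, _, err := net.SplitHostPort(peerAddr)
	if err != nil {
		return peerAddr
	}
	return host
}

// isWhiteListed is a simplified membership check: an empty whitelist allows
// everyone, a populated one requires an exact match.
func isWhiteListed(whitelist map[string]bool, host string) bool {
	if len(whitelist) == 0 {
		return true
	}
	return whitelist[host]
}

func main() {
	locked := map[string]bool{"10.0.0.5": true}

	fmt.Println(isWhiteListed(nil, hostForWhitelist("@")))                 // no whitelist: allowed
	fmt.Println(isWhiteListed(locked, hostForWhitelist("@")))              // unparseable: denied
	fmt.Println(isWhiteListed(locked, hostForWhitelist("10.0.0.5:51234"))) // known host: allowed
}
```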

Adds three regression tests for IsWhiteListed pinning the empty-config
allow-all, populated-list reject-unknown, and signing-key-only allow-
all branches that the gRPC admin helper relies on.

* refactor(security): dedup checkWhiteList through IsWhiteListed

The HTTP-side checkWhiteList and the gRPC-side IsWhiteListed had the
same lookup logic in two places; future drift was just a matter of
time. Have checkWhiteList delegate so the membership semantics live
in exactly one function.

Behaviour is unchanged: the new path still returns nil for
isEmptyWhiteList (signing-key-only mode) and still rejects unknown
hosts when a whitelist is configured.

Addresses gemini medium review on PR #8876.

* fix(volume): protect remaining state-altering gRPC admin endpoints

DeleteCollection, VolumeDelete, and VolumeServerLeave were the
truly-destructive endpoints, but AllocateVolume, VolumeMount,
VolumeUnmount, VolumeConfigure, VolumeMarkReadonly, and
VolumeMarkWritable also modify server state and should sit behind
the same whitelist gate. Read-only endpoints (VolumeStatus,
VolumeServerStatus, VolumeNeedleStatus, Ping) stay open.

The check is a no-op when no whitelist is configured (the default),
so existing deployments keep working; operators who lock down their
volume servers via guard.white_list now get consistent coverage.

Addresses gemini security-high review on PR #8876.

* fix(volume): typed peer addr + audit log for gRPC admin auth

Prefer a typed *net.TCPAddr when extracting the peer IP — string
parsing was already a fallback for the in-process case but using the
typed form first is cleaner and skips an unnecessary parse on the
common path. Log failed authorization attempts at V(0) so an operator
running with a whitelist sees the host that was rejected (and the
raw remote address in case the IP lookup itself was the failure
mode), matching what the HTTP Guard already does.

Addresses gemini medium review on PR #8876.

* fix(volume): protect vacuum + scrub + EC-shards-delete admin endpoints

Five more master/admin-driven destructive operations live outside
volume_grpc_admin.go and were missing the same whitelist gate:

- VacuumVolumeCompact, VacuumVolumeCommit, VacuumVolumeCleanup
- ScrubVolume
- VolumeEcShardsDelete

VacuumVolumeCheck stays open (read-only). BatchDelete also stays
open: it's the data-plane multi-object delete called from the S3 API
and filer, not an admin operation; gating it would break ordinary S3
DeleteObjects calls.

Addresses gemini security-high review on PR #8876.

* fix(volume): simplify no-peer-info branch in gRPC admin auth

The IsWhiteListed("") fallback was defending against a scenario
that doesn't actually arise — real gRPC connections always populate
peer info. Drop the branch and just deny when peer info is missing,
which is the safer default and matches "if we don't know who the
caller is, refuse".

* fix(volume-rust): mirror gRPC admin auth on the rust volume server

The rust volume server has the same set of destructive admin
endpoints as the Go side and the same Guard infrastructure, but
nothing was wired together — every endpoint accepted unauthenticated
calls regardless of guard configuration. Same vulnerability class
the Go fix on this PR closes; this commit closes it on the rust
side too so the two stacks stay aligned.

Adds VolumeGrpcService::check_grpc_admin_auth that pulls the peer
SocketAddr off the tonic Request and runs Guard::check_whitelist on
its IP, then applies the helper to the same set the Go side covers:
DeleteCollection, AllocateVolume, VolumeMount, VolumeUnmount,
VolumeDelete, VolumeMarkReadonly, VolumeMarkWritable,
VolumeConfigure, VacuumVolumeCompact, VacuumVolumeCommit,
VacuumVolumeCleanup, VolumeServerLeave, ScrubVolume,
VolumeEcShardsDelete. Read-only endpoints stay open; BatchDelete
stays open as a data-plane multi-object delete.
2026-05-04 21:14:55 -07:00
Chris Lu e82789ea4b rust(volume): strip grpc-port suffix from master URL before HTTP lookup (#9276)
* rust(volume): strip grpc-port suffix from master URL before HTTP lookup

The volume server stores `master_url` in SeaweedFS's canonical
`host:httpPort.grpcPort` form (e.g. `node-a:5300.5310`). When
`lookup_volume` builds the master `/dir/lookup` URL, the appended
gRPC port turns the URL into `http://node-a:5300.5310/...`, which
reqwest rejects with "builder error". Every replicated write,
batch-delete lookup, and proxy/redirect read then fails.

Mirror Go's `pb.ServerAddress.ToHttpAddress()` with a new
`to_http_address` helper and apply it inside `lookup_volume`, the
single funnel for all three HTTP master lookups in the Rust volume
server. Other consumers of `master_url` already go through gRPC and
use `to_grpc_address` / `parse_grpc_address`.

Includes a regression test that mocks a master HTTP server and calls
`lookup_volume` with a `host:port.grpcPort` master URL — without the
fix it reproduces the exact "lookup request failed: builder error"
from issue #9274.

Fixes #9274

* rust(volume): only strip dotted suffix from address when both ports are numeric

Previously `to_http_address` rewrote any `host:foo.bar` to `host:foo`,
which would silently drop the suffix on malformed config (e.g. an
address like `host:abc.def` or `host:99999.19333`). Validate that
both halves of the dotted suffix parse as `u16` before stripping —
mirrors the validation that `to_grpc_address` already does in the
inverse direction. Also slice the input directly instead of going
through `format!`, since the result is just a prefix of `addr`.

Adds a test that asserts non-numeric / out-of-range dotted suffixes
are preserved unchanged.
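
The stripping rule can be sketched in Go as below. This is an illustration of the behaviour the commit describes, not the Rust helper itself nor Go's ToHttpAddress; `toHTTPAddress` and `isPort` are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// isPort reports whether s parses as an unsigned 16-bit decimal number.
func isPort(s string) bool {
	_, err := strconv.ParseUint(s, 10, 16)
	return err == nil
}

// toHTTPAddress turns host:httpPort.grpcPort into host:httpPort, but only
// when both halves of the dotted suffix are valid port numbers. Anything
// else is returned unchanged.
func toHTTPAddress(addr string) string {
	colon := strings.LastIndex(addr, ":")
	if colon < 0 {
		return addr
	}
	ports := addr[colon+1:]
	dot := strings.LastIndex(ports, ".")
	if dot < 0 {
		return addr
	}
	if !isPort(ports[:dot]) || !isPort(ports[dot+1:]) {
		return addr
	}
	// The result is just a prefix of the input.
	return addr[:colon+1+dot]
}

func main() {
	for _, addr := range []string{
		"node-a:5300.5310",
		"node-a:5300",
		"host:abc.def",
		"host:99999.19333",
		"[::1]:9333.19333",
	} {
		fmt.Printf("%s -> %s\n", addr, toHTTPAddress(addr))
	}
}
```

Searching for the last colon is what keeps bracketed IPv6 literals working: the match lands after the closing bracket, so the colons inside the literal are never inspected.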

* rust(volume): strip grpc-port suffix from peer URLs in replicate / proxy / redirect

In normal operation the master returns `Location.url` as plain
`host:port` and the gRPC port arrives in a separate field. But the
volume server already has defensive logic (`grpc_address_for_location`)
for the `host:httpPort.grpcPort` form on those same URLs, which
implies a code path where peer URLs do carry the suffix.

Apply `to_http_address` to `loc.url` / `target.url` before building
HTTP URLs in `do_replicated_request`, `proxy_request`, and
`redirect_request` to keep replicate-write, proxy-read, and redirect
paths from hitting the same `lookup request failed: builder error`
failure mode that #9274 documented for master lookups.

Adds a unit test exercising `redirect_request` with a `.grpcPort`
suffix on `target.url`.

* rust(volume): return Cow<str> from to_http_address to skip allocation on the no-suffix path

Most addresses pass through `to_http_address` unchanged (master and
peer URLs are normally plain `host:port`), so the previous String
return type allocated on every call for nothing. Switch to `Cow<str>`:
the common pass-through borrows from the input, and only the rewrite
branch allocates. Call sites use the result via `format!`/`Display`,
which both work transparently with `Cow<str>`.

Adds a test asserting the variant is Borrowed in the no-rewrite cases
and Owned only when the suffix is stripped.

* rust(volume): cover bracketed IPv6 literals in to_http_address tests

The current implementation already handles IPv6 correctly because
`rfind(':')` lands on the colon after the closing bracket, leaving
the dotted suffix logic unchanged. Add explicit test coverage so the
behavior is locked down for dual-stack deployments — both the strip
case (`[::1]:9333.19333` -> `[::1]:9333`) and the various passthrough
cases (no suffix, non-numeric, out-of-range, trailing colon).

* rust(volume): parse both ports in one tuple pattern match in to_http_address

Combine the two `is_ok()` checks into a single `if let (Ok(_), Ok(_))`
tuple match — equivalent semantics, slightly tighter expression of
intent. No behavior change.
2026-04-29 00:51:10 -07:00
Chris Lu 08d59750ef rust(volume): export Prometheus metrics for scrubbing operations (#9266)
* rust(volume): export Prometheus metrics for scrubbing operations

Mirrors #9264 in the Rust volume server. Adds three metrics that match
the Go names so the same dashboards/alerts work against either binary:

  - SeaweedFS_volumeServer_scrub_last_time_seconds (gauge)
  - SeaweedFS_volumeServer_scrub_volume_failures   (counter)
  - SeaweedFS_volumeServer_scrub_shard_failures    (counter)

Metrics are aggregated at the volume / EC shard level, labelled by
VolumeScrubMode (UNKNOWN/INDEX/FULL/LOCAL) to match Go's
req.GetMode().String().

* rust(volume): record scrub metrics before post-scrub error check

Address PR feedback:
  - Move metric emission before the mark_broken_volumes_readonly error
    check so scrub failures are persisted even when the follow-up
    mark-readonly admin action fails (matches Go's volume_grpc_scrub.go).
  - Extract the duplicated metric block into emit_scrub_metrics() shared
    by both ScrubVolume and ScrubEcVolume. The shard-failures family
    stays untouched on regular volume scrubs to mirror Go.
2026-04-28 13:29:32 -07:00
Chris Lu 9d6d068f41 feat(seaweed-volume): cross-disk EC shard reconciliation (#9212) (#9252)
* fix(seaweed-volume): fall back to idx dir when reading .vif

EcVolume::new and read_ec_shard_config only looked for .vif at the
data dir. With the cross-disk reconcile path (where shards live on
one disk and .ecx / .ecj / .vif live on a sibling disk —
seaweedfs/seaweedfs#9212 / #9244), this would either write a stub
.vif on the shard disk and lose the real EC config + dat_file_size
or fall back to default ratios despite a perfectly good .vif being
present elsewhere on the same volume server.

Add a small `locate_vif_path` helper that prefers the data dir and
falls back to the idx dir when it differs, and thread the data dir
+ idx dir pair through `read_ec_shard_config`. Three call sites in
grpc_server.rs (VolumeEcShardsGenerate, VolumeEcShardsRebuild, scrub)
updated; the scrub path passes the same dir for both args because
`find_ec_dir` is the only locator there.
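
A rough Go sketch of the lookup order, for illustration only. The `exists` callback is an assumption made so the example runs without touching the filesystem (a real caller would check the disk), and returning the data-dir path when neither file exists is also an assumption.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// locateVifPath prefers <dataDir>/<baseName>.vif and falls back to the idx
// dir only when it differs from the data dir and actually holds the file.
func locateVifPath(dataDir, idxDir, baseName string, exists func(string) bool) string {
	primary := filepath.Join(dataDir, baseName+".vif")
	if exists(primary) || idxDir == dataDir {
		return primary
	}
	if fallback := filepath.Join(idxDir, baseName+".vif"); exists(fallback) {
		return fallback
	}
	// Assumed default: report the data-dir path when nothing is found.
	return primary
}

func main() {
	// Split-disk layout: shards on disk0, .vif next to .ecx on disk1.
	onDisk := map[string]bool{
		filepath.Join("/disk1/idx", "7.vif"): true,
	}
	exists := func(path string) bool { return onDisk[path] }

	fmt.Println(locateVifPath("/disk0/data", "/disk1/idx", "7", exists))
}
```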

* feat(seaweed-volume): primitives for cross-disk EC shard reconcile

Adds the three small helpers the reconcile pass needs:

- DiskLocation::mount_ec_shards_with_idx_dir — mounts shards on this
  disk while pointing the EcVolume at a sibling disk's idx dir for
  .ecx / .ecj / .vif. Mirrors loadEcShardsWithIdxDir in
  weed/storage/disk_location_ec.go. The existing mount_ec_shards is
  kept as a thin wrapper over it.

- EcVolume::has_shard — `pub` accessor over the internal Vec<Option>
  shard slot so the reconcile pass can skip shards that are already
  registered.

- pub(crate) re-exports of parse_collection_volume_id and
  parse_ec_shard_extension under names parse_collection_volume_id_pub
  and is_ec_shard_extension so the reconcile module can call them
  without re-implementing the parsers.

No behaviour change. Reconciliation logic in the next commit.

* feat(seaweed-volume): cross-disk EC shard reconciliation (#9212)

Closes the loader half of seaweedfs/seaweedfs#9212 on the Rust side,
mirroring the Go fix in seaweedfs/seaweedfs#9244. With the auto-load
in feat/rust-load-all-ec-shards-9212 in place, the only remaining gap
is shards that landed on a disk without their `.ecx` — for example
when ec.balance / ec.rebuild moved them onto a destination node's
second disk while leaving the index files on the disk that already
held the volume. Without this, those orphan shards stay invisible to
the master and ec.rebuild reports the volume as unrepairable.

After every DiskLocation has finished its per-disk EC scan, sweep the
store for shards that live on a disk without local index files and
load them by reaching across to a sibling disk's `.ecx` / `.ecj` /
`.vif`:

  - Store::reconcile_ec_shards_across_disks walks each disk for
    orphan `.ec??` files (present on disk, not yet registered to an
    EcVolume) and matches them against an `(collection, vid) ->
    EcxOwnerInfo` map of which disk owns each `.ecx`.
  - Each matched group is mounted on its physical disk's ec_volumes
    map (so heartbeat reporting carries the right disk_id per shard)
    via `mount_ec_shards_with_idx_dir`, pointing the EcVolume at the
    sibling's idx dir.
  - `index_ecx_owners` records the directory each `.ecx` was found in
    (IdxDirectory or Directory) so the loader doesn't ENOENT when the
    legacy "written before -dir.idx was set" layout puts `.ecx` in
    the data dir. This mirrors the PR #9244 review fix from
    @gemini-code-assist / @coderabbitai (see Go commit af57cc652).
  - True orphans (no `.ecx` anywhere on this server) log a warning
    and stay on disk untouched — operator can restore the index later.

Wired into Store::add_location and Store::load_new_volumes so a fresh
restart and any later disk additions both pick up cross-disk shards.

Tests cover all four behaviour shapes:
- shards on dir0 + .ecx on dir1 → reconciled to dir0's ec_volumes
- .ecx in owner's data dir (legacy layout) → reconciled correctly
- self-contained disks → reconcile is a no-op
- truly-orphan shards (no .ecx anywhere) → left on disk, logged

* fix(seaweed-volume): propagate EcVolume::new errors instead of unwrap

mount_ec_shards_with_idx_dir built the missing EcVolume inside an
entry().or_insert_with() closure, which can't return a Result — so
any EcVolume::new failure (e.g. .ecx open error, .ecj create error,
malformed .vif) panicked the volume server via unwrap(). The
constructor already returns Result<>, so propagate it as
VolumeError::Io instead.

Reported in PR #9252 review by @gemini-code-assist (high) and
@coderabbitai (critical).

* perf(seaweed-volume): use DirEntry::metadata in collect_orphan_ec_shards

Replaced the extra fs::metadata(&path) lookup with ent.metadata() so
we don't pay an additional stat syscall per directory entry beyond
what read_dir already returned. Drops the now-unused std::path::Path
import alongside.

Reported in PR #9252 review by @gemini-code-assist.

* fix(seaweed-volume): scrub uses EcVolume's real dir_idx for split-disk volumes

After cross-disk reconciliation an EcVolume can legitimately have
ecv.dir != ecv.dir_idx (shards on one disk, .ecx / .ecj / .vif on a
sibling). The scrub path collapsed both args to find_ec_dir's single
answer, so read_ec_shard_config fell back to the wrong .vif location
for exactly the split-disk layout this PR loads — skewing
shard-count detection and verification results.

Use ecv.dir / ecv.dir_idx directly so scrub reads the metadata from
where the volume's index files actually live.

Reported in PR #9252 review by @coderabbitai.

* feat(seaweed-volume): primitives for split-disk EC volume operations

Reconciliation can mount the same `vid` on multiple DiskLocations
with disjoint shard subsets. The existing first-match `find_ec_volume`
isn't enough for read/unmount/delete/decode paths that need to act on
a specific shard or aggregate across the whole volume — they have to
walk every location and find the right home for each shard.

Add the small Store-level lookup primitives Go's findEcShard /
CollectEcShards already provide:

- `Store::find_ec_shard_location(vid, shard_id)` — returns the index
  of the location that has `(vid, shard_id)` mounted, if any.
- `Store::find_ec_volume_with_shard(vid, shard_id)` — same idea but
  returns the EcVolume directly.
- `Store::collect_ec_shard_dirs(vid, max_shard_count)` — returns
  the EcVolume to use for metadata plus per-shard data dirs (None
  when the shard isn't mounted on any disk). Mirrors
  `Store.CollectEcShards` in `weed/storage/store_ec.go`.

And the EcVolume accessors callers need:

- `EcVolume::has_shard(shard_id)` — was already added for the cross-
  disk reconcile but is now a load-bearing primitive for placement
  decisions on a per-shard basis. Pulled into the dedicated commit.
- `EcVolume::ecx_actual_dir()` — exposes the directory the `.ecx`
  was actually opened from. The decoder needs it for the .ecx
  lookup when shards are split across data dirs and `.ecx` lives on
  a sibling idx dir.

Plus a small defensive change to `DiskLocation::unmount_ec_shards`:
only decrement the per-shard gauge for shards that were actually
mounted. Without this, the upcoming `Store::unmount_ec_shards`
fan-out to every location would underflow the metric whenever a
shard is requested for unmount on a sibling disk that doesn't have
it.

No behaviour change at the call sites yet — wiring follows in the
next commits.

* fix(seaweed-volume): unmount_ec_shards visits every location with the vid

Store::unmount_ec_shards and Store::unmount_ec_shard returned after
the first DiskLocation with the volume id, even if that location did
not contain the requested shard. With reconciled split-disk volumes
(shards 0 and 12 on disk 0, shard 1 on disk 1; the issue #9212 layout
this PR loads), VolumeEcShardsUnmount for a later-disk shard became a
silent no-op and Store::delete_ec_shards could remove the shard file
while leaving an in-memory shard + open file handle stale on the
later location.

Walk all locations that have the EcVolume and ask each to unmount
whatever subset of `shard_ids` it actually has — the
`DiskLocation::unmount_ec_shards` defensive guard from the previous
commit makes the fan-out safe (no metric underflow when a sibling
disk is asked to unmount a shard it doesn't hold).
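
The fan-out and its guard can be sketched as follows, in Go for illustration. The types are hypothetical simplifications, and a plain integer stands in for the per-shard gauge.

```go
package main

import "fmt"

// diskLocation is a hypothetical stand-in: volume id -> mounted shard ids.
type diskLocation struct {
	ecShards map[uint32]map[int]bool
}

// unmountEcShards removes only the shards this disk actually holds and
// returns how many were removed, so the gauge is never over-decremented.
func (l *diskLocation) unmountEcShards(vid uint32, shardIDs []int) int {
	shards, ok := l.ecShards[vid]
	if !ok {
		return 0
	}
	removed := 0
	for _, id := range shardIDs {
		if shards[id] {
			delete(shards, id)
			removed++
		}
	}
	if len(shards) == 0 {
		delete(l.ecShards, vid) // drop the volume once its last shard is gone
	}
	return removed
}

type store struct {
	locations  []*diskLocation
	shardGauge int
}

// unmountEcShards visits every location instead of stopping at the first
// one that knows the volume id.
func (s *store) unmountEcShards(vid uint32, shardIDs []int) {
	for _, loc := range s.locations {
		s.shardGauge -= loc.unmountEcShards(vid, shardIDs)
	}
}

// newSplitStore builds the split layout: shards 0 and 12 on disk 0,
// shard 1 on disk 1, all for volume 7.
func newSplitStore() *store {
	return &store{
		locations: []*diskLocation{
			{ecShards: map[uint32]map[int]bool{7: {0: true, 12: true}}},
			{ecShards: map[uint32]map[int]bool{7: {1: true}}},
		},
		shardGauge: 3,
	}
}

func main() {
	s := newSplitStore()
	s.unmountEcShards(7, []int{1})
	fmt.Println("gauge:", s.shardGauge)
	fmt.Println("disk 0 shards:", len(s.locations[0].ecShards[7]))
}
```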

* fix(seaweed-volume): VolumeEcShardRead reads from the shard's home disk

VolumeEcShardRead resolved the EcVolume via first-match
`find_ec_volume(vid)` and then looked up the requested shard on that
single EcVolume. With reconciled split-disk volumes (the layout
seaweedfs/seaweedfs#9212 produces: shards 0 and 12 on disk 0, shard 1
on disk 1), a request for shard 1 hit disk 0 first and returned
"shard 1 not mounted" even though it was happily mounted on disk 1.

Switch to `find_ec_volume_with_shard(vid, shard_id)` so the lookup
walks every location and returns the EcVolume whose disk actually
holds the shard. The deleted-needle check still works because every
per-disk EcVolume for the same vid points at the same `.ecx` file
(post-reconcile, both disks open the same sealed index).

* fix(seaweed-volume): VolumeEcShardsToVolume aggregates shards across disks

VolumeEcShardsToVolume resolved a single EcVolume via
`find_ec_volume(vid)` and then checked `ec_vol.shards[i]` for each
data shard. With reconciled split-disk volumes that's the wrong
view: the first-match EcVolume only carries the shards on its disk,
so the presence check would either reject the request as
"missing shard" or — if shards happened to be on the first disk —
fall through to `write_dat_file_from_shards(&dir, ...)` which only
reads from the EcVolume's single dir.

Mirror Go's CollectEcShards by aggregating per-shard data dirs
across every location with the volume:

- Add `Store::collect_ec_shard_dirs` (in the previous primitives
  commit) returning the EcVolume to use for metadata + per-shard
  dir slots.
- Extend `find_dat_file_size` and `write_dat_file_from_shards` with
  `_with_dirs` variants that take the `.ec00` dir and per-shard
  dirs separately, so a decoded volume whose shards live on
  several disks can still be reconstructed. The original signatures
  delegate to the new ones with the same dir for all shards, so
  every existing caller keeps working unchanged.
- Rewire VolumeEcShardsToVolume through the helpers — presence
  check sees the union, dat_file_size reads `.ec00` from the right
  disk and `.ecx` from the EcVolume's actual idx dir, the decoder
  reads each shard from its own home dir.

* test(seaweed-volume): split-disk read / unmount / delete / collect

Five tests exercising the four behaviour shapes the PR #9252 review
flagged on multi-location EC volumes. Each builds the cross-disk
split layout from issue #9212 (shards 0 and 12 on disk 0, shard 1 +
.ecx on disk 1) via the new `build_split_disk_store` helper and
asserts:

- `find_ec_shard_location` / `find_ec_volume_with_shard` route to
  the disk that actually holds each shard (not first-match).
- `Store::unmount_ec_shards([1])` reaches disk 1 and removes shard 1
  while leaving disk 0's unrelated shards mounted (used to be a
  silent no-op).
- `Store::unmount_ec_shard(vid, 1)` ditto for the single-shard
  variant.
- `Store::delete_ec_shards` removes both the on-disk file and the
  in-memory mount on the right disk; previously deletion could
  remove the file while the in-memory shard with its open file
  handle survived on a different location.
- `collect_ec_shard_dirs` reports the right per-shard data dir for
  each location and `None` for unmounted shards.

* fix(seaweed-volume): retry same-disk legacy .ecx layout in reconcile

The unconditional `owner.location == loc_idx` skip missed the layout
where `idx_directory` is configured but the owner's `.ecx` / `.ecj` /
`.vif` still live in `loc.directory` (the legacy "written before
-dir.idx was set" shape). In that case the per-disk loader's
mount_ec_shards used `loc.idx_directory` and ENOENT'd, then this
branch suppressed the only recovery path — the owner disk's own
shards stayed unloaded after startup.

Tighten the skip so it only fires when the discovered owner dir is
already `loc.idx_directory` (the loader-already-tried-and-failed
case). When `owner.idx_dir` differs (legacy data-dir layout), queue
a same-disk retry through `mount_ec_shards_with_idx_dir(...,
&owner.idx_dir)` so reconcile becomes the recovery path.

Reported in PR #9252 review by @coderabbitai.

* fix(seaweed-volume): roll back partial mounts on cross-disk reconcile failure

mount_ec_shards_with_idx_dir adds shards one at a time and
increments the `ec_shards` gauge per shard that successfully attaches.
A mid-loop failure (e.g. an EcVolumeShard::open error after the
first few shards already attached) used to leave the EcVolume
half-mounted with stale metric increments — the warn!() branch only
logged the error.

Mirror DiskLocation::handle_found_ecx_file's recovery path: drive
the cleanup through `loc.unmount_ec_shards(vid, &shard_ids)` after
a failed mount. The defensive change in #9251 makes
unmount_ec_shards only decrement the gauge for shards that were
actually mounted and drops the EcVolume when it reaches zero
shards, so the rollback is safe even though some of `shard_ids`
never attached.
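
The rollback shape can be sketched with a toy model. The `Disk` struct, its fields, and the `openable` callback are assumptions made for illustration; the real types are `DiskLocation` and `EcVolume`, whose signatures are not shown in this log.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for a disk location: mounted shard ids plus a gauge.
struct Disk {
    mounted: BTreeSet<u8>,
    gauge: i64,
}

impl Disk {
    /// Mount shards one at a time, bumping the gauge per attached shard.
    /// `openable` simulates whether opening a shard file succeeds.
    fn mount_all(&mut self, shard_ids: &[u8], openable: impl Fn(u8) -> bool) -> Result<(), String> {
        for &id in shard_ids {
            if !openable(id) {
                return Err(format!("shard {} failed to open", id));
            }
            if self.mounted.insert(id) {
                self.gauge += 1;
            }
        }
        Ok(())
    }

    /// Decrement the gauge only for shards that were actually mounted.
    fn unmount(&mut self, shard_ids: &[u8]) {
        for id in shard_ids {
            if self.mounted.remove(id) {
                self.gauge -= 1;
            }
        }
    }

    /// On a mid-loop failure, roll back through `unmount` so neither the
    /// mounted set nor the gauge keeps a partial result.
    fn mount_or_roll_back(&mut self, shard_ids: &[u8], openable: impl Fn(u8) -> bool) -> Result<(), String> {
        let result = self.mount_all(shard_ids, openable);
        if result.is_err() {
            self.unmount(shard_ids);
        }
        result
    }
}
```

Because `unmount` ignores ids that never attached, passing the full `shard_ids` list to the rollback is safe, which is the property the change in #9251 provides.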

Reported in PR #9252 review by @coderabbitai.

* test(seaweed-volume): cover the two reconcile fixes from PR #9252 review

Two new tests in store_ec_reconcile:

- test_reconcile_recovers_same_disk_legacy_ecx_layout — sets up the
  layout where idx_directory is configured but the owner's .ecx
  lives in loc.directory. The per-disk loader's mount_ec_shards
  uses loc.idx_directory and fails; reconcile should retry on the
  same disk with the owner's actual idx_dir and the owner's own
  shards must come back online.

- test_reconcile_rolls_back_partial_mounts_on_failure — sabotages
  one of the orphan shard files (replaces it with a directory of
  the same name) so EcVolumeShard::open errors out partway through
  mount_ec_shards_with_idx_dir. Asserts the post-condition that no
  EcVolume entry retains a "shard mounted" claim that doesn't
  correspond to a real shard file.
2026-04-27 19:01:30 -07:00
Chris Lu 49e83a26cb feat(seaweed-volume): auto-load EC shards on startup (#9212) (#9251)
* feat(seaweed-volume): auto-load EC shards on startup

The Rust volume server's load_existing_volumes only scanned .dat
files; EC shards on disk stayed invisible until something explicitly
issued VolumeEcShardsMount. This is a strict superset of what issue
seaweedfs/seaweedfs#9212 reports for Go: after a fresh restart, every
local EC shard was missing from the master's view.

Port loadAllEcShards from weed/storage/disk_location_ec.go:

- DiskLocation::load_all_ec_shards walks Directory (and IdxDirectory
  if separate) sorted, groups .ec?? shard files by (collection, vid),
  validates and mounts each group when its matching .ecx is found.
- handle_found_ecx_file: validate_ec_volume + mount_ec_shards path,
  with cleanup when .dat exists and validation fails (incomplete
  encoding) or load fails.
- check_orphaned_shards: cleans up shard remnants whose .ecx never
  arrived AND whose stale .dat is still present (interrupted
  encoding); leaves them on disk otherwise so cross-disk
  reconciliation / operator recovery can find them.
- check_dat_file_exists / parse_collection_volume_id /
  parse_ec_shard_extension: small helpers mirroring Go's checkDatFileExists,
  parseCollectionVolumeId, and the `\.ec\d{2,3}` regex.
- Wire through load_existing_volumes after the .dat scan; failures
  log but don't fail the disk's startup.

Tests:
- test_parse_ec_shard_extension covers .ec00–.ec255 and the rejection
  of .ec0, .ec999, .ecx, .ecj, .dat, and missing leading dot.
- test_load_all_ec_shards_mounts_pairs_with_ecx: shards + .ecx + .vif
  on disk get mounted into ec_volumes after load_existing_volumes.
- test_load_all_ec_shards_keeps_orphan_shards_when_no_dat: orphan
  shards (no .ecx, no .dat) stay on disk untouched
  (distributed-EC scenario).
- test_load_all_ec_shards_cleans_orphan_shards_when_dat_exists:
  orphan shards alongside a stale .dat get cleaned up
  (interrupted-encoding scenario).

Prerequisite for porting the cross-disk orphan-shard reconciliation
in seaweedfs/seaweedfs#9244 to Rust.

* fix(seaweed-volume): dedupe filenames when scanning data + idx dirs

load_all_ec_shards scans both `directory` and `idx_directory` (when
they differ) so the loop can pair `.ec??` shards with their `.ecx`
regardless of which dir owns the index. If the same filename is
present in both (possible in legacy layouts that pre-date
`-dir.idx`), the previous implementation processed it
twice. mount_ec_shards increments the per-shard `ec_shards` metric
inside the loop, so a duplicated `.ec??` entry would double-count
the gauge.

Use a HashSet<String> while accumulating entries so each filename
is processed exactly once.
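
A minimal sketch of the dedupe step, assuming the two directory listings are already available as slices of file names. `merge_dir_listings` is a hypothetical helper, not the function in the repository.

```rust
use std::collections::HashSet;

/// Merge the data-dir and idx-dir listings so each filename appears once,
/// then sort so shards and their index are visited in a stable order.
fn merge_dir_listings(data_dir: &[&str], idx_dir: &[&str]) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut names = Vec::new();
    for name in data_dir.iter().chain(idx_dir.iter()) {
        // `insert` returns false for a name that was already recorded.
        if seen.insert((*name).to_string()) {
            names.push((*name).to_string());
        }
    }
    names.sort();
    names
}
```

With a file such as `1.ecx` present in both directories, the merged list contains it once, so any per-entry side effect (such as a gauge increment) runs once.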

Reported in PR #9251 review by @gemini-code-assist.

* fix(seaweed-volume): drive partial-mount cleanup through unmount_ec_shards

handle_found_ecx_file calls mount_ec_shards which adds shards one at
a time. mount_ec_shards increments the `ec_shards` gauge per shard
that successfully attaches. If mount fails halfway, plain
ec_volumes.remove(vid) drops the EcVolume but leaves the gauge
incremented for whatever did mount.

Drive the cleanup branches through unmount_ec_shards instead — it
mirror-decrements the gauge per shard and only then drops the
EcVolume. Same shape applied to both .dat-exists and distributed-EC
fallbacks.

Reported in PR #9251 review by @gemini-code-assist.

* docs(seaweed-volume): clarify parse_ec_shard_extension shard-id range

Doc previously said `.ec00`–`.ec999` but the implementation rejects
any shard id > 255 (matches the `EcVolumeShard` u8 typed shard id
and Go's `strconv.ParseInt(... 10, 64)` + `> 255` guard). Fix the
doc to say `.ec00`–`.ec255` and explain why the 3-digit form is
still recognised.
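
The documented range can be illustrated with a small parser. This is a sketch under stated assumptions: the `Option<u8>` return type and the exact body are guesses, and only the accepted shape (`.ec` followed by two or three digits, value at most 255) comes from the text above.

```rust
/// Parse an extension such as ".ec07" into a shard id.
/// Returns None unless the input is ".ec" plus 2 or 3 ASCII digits
/// whose value fits in a u8.
fn parse_ec_shard_extension(ext: &str) -> Option<u8> {
    let digits = ext.strip_prefix(".ec")?;
    let len_ok = (2..=3).contains(&digits.len());
    if !len_ok || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    // Three digits can encode up to 999, so parse wide and then guard.
    let id: u16 = digits.parse().ok()?;
    if id <= 255 {
        Some(id as u8)
    } else {
        None
    }
}
```

Three digits are needed to reach ids 100 through 255, which is why the pattern cannot simply be narrowed to two digits.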

Reported in PR #9251 review by @coderabbitai.
2026-04-27 16:41:46 -07:00
Chris Lu 933ae6e386 fix(seaweed-volume): port EC shard placement fix to Rust (#9212, mirrors #9245) (#9250)
* feat(seaweed-volume): add DiskLocation::has_ecx_file_on_disk

Mirrors `DiskLocation.HasEcxFileOnDisk` from the Go side
(seaweedfs/seaweedfs#9245). Reports whether this disk has a sealed
.ecx index file for (collection, vid) by stat'ing the IdxDirectory
first, then falling back to Directory if different — covers the
legacy "written before -dir.idx was set" layout. Skips entries that
are directories so a stray dir named `<col>_<vid>.ecx` doesn't
register as a present index file.

Unlike has_ec_volume() this does not require the EC volume to be
mounted in memory, which makes it the right primitive for placement
decisions during ec.balance / ec.rebuild flows where shards may
arrive before any VolumeEcShardsMount has happened on the receiving
disk.

Wiring + tests in follow-up commits.

* feat(seaweed-volume): add Store::find_ec_shard_target_location

Mirrors `Store.FindEcShardTargetLocation` from the Go side
(seaweedfs/seaweedfs#9245). Single canonical placement primitive for
new EC shard / index files. Selection order:

  1. a disk that already has the EC volume mounted (in-memory),
  2. a disk that owns the .ecx file on disk (volume not yet mounted),
  3. any HDD with free space,
  4. any disk with free space.

Step 2 is the missing primitive that pinned subsequent shards to the
first-shard disk during ec.rebuild — rebuild only sets
CopyEcxFile=true on the first shard, then relies on auto-select to
land later shards on the same disk. Without an on-disk check
has_ec_volume returns false (no mount yet) and the fallback picked
"any HDD with free space," splitting shards from their .ecx across
disks of the same node and producing the orphan-shard layout
seaweedfs/seaweedfs#9212 reports.

Implementation walks store.locations once with tier scoring; the
highest-tier disk wins, ties broken by free count. The earlier
4-pass waterfall in find_free_location_predicate would have
re-acquired locks per pass.
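
The single-pass tier scoring could look roughly like the following. `DiskView`, `tier`, and `pick_target` are hypothetical names, and the sketch assumes that the first two tiers do not require free space, since the selection order above only mentions free space for tiers 3 and 4.

```rust
/// Hypothetical snapshot of what the placement logic knows about a disk.
#[derive(Clone, Copy)]
struct DiskView {
    has_mounted_ec_volume: bool,
    has_ecx_on_disk: bool,
    is_hdd: bool,
    free_shard_slots: i64,
}

/// Higher tier wins; None means the disk is not eligible at all.
fn tier(d: &DiskView) -> Option<u8> {
    if d.has_mounted_ec_volume {
        Some(3)
    } else if d.has_ecx_on_disk {
        Some(2)
    } else if d.free_shard_slots <= 0 {
        None
    } else if d.is_hdd {
        Some(1)
    } else {
        Some(0)
    }
}

/// Walk the disks once and return the index of the best candidate,
/// breaking ties within a tier by free shard slots.
fn pick_target(disks: &[DiskView]) -> Option<usize> {
    disks
        .iter()
        .enumerate()
        .filter_map(|(i, d)| tier(d).map(|t| (t, d.free_shard_slots, i)))
        .max_by_key(|&(t, free, _)| (t, free))
        .map(|(_, _, i)| i)
}
```

Scoring every disk in one walk means each location is inspected once, instead of once per tier as in a multi-pass waterfall.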

ec_free_shard_count returns the free count in shard slots (not
volume-equivalent slots). The pre-existing find_free_location*
helpers divide by DATA_SHARDS_COUNT at the end; that truncation can
exclude a disk that has room for several individual shards
(MaxVolumeCount=1, EcShardCount=1, dsc=10 → reports 0 despite 9
free slots), which would re-route subsequent shards off the
.ecx-owning disk and re-introduce the orphan layout. Keep the result
in shard slots throughout. The unlimited-disk branch
(MaxVolumeCount==0) reports a synthetic large free count
decremented by current usage so unlimited disks stay eligible and
tie-breaks still prefer the less-loaded one.
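
The truncation case can be checked with plain integer arithmetic. Both functions below are hypothetical simplifications that ignore regular volumes and only model EC shard usage on a limited disk.

```rust
/// Free capacity expressed in volume-equivalent slots: the final division
/// truncates, so a partly used volume slot counts as no room at all.
fn free_volume_equivalents(max_volume_count: i64, ec_shard_count: i64, data_shards: i64) -> i64 {
    (max_volume_count * data_shards - ec_shard_count) / data_shards
}

/// Free capacity kept in shard slots: no division, so nothing is lost.
fn free_shard_slots(max_volume_count: i64, ec_shard_count: i64, data_shards: i64) -> i64 {
    max_volume_count * data_shards - ec_shard_count
}
```

For `MaxVolumeCount=1`, `EcShardCount=1`, and 10 data shards, the first form yields 0 while the second yields 9, matching the example in the message.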

data_shard_count is taken as a parameter rather than read from
DATA_SHARDS_COUNT so custom-ratio builds can swap the default
without touching this helper.

Tests cover: pinning to .ecx on disk, mounted-wins-over-stray-.ecx,
HDD fallback, MaxVolumeCount=0 unlimited handling, and the
tight-provisioning truncation case.

* fix(seaweed-volume): route EC shard auto-select through new helper

VolumeEcShardsCopy and the ReceiveFile EC branch both used a 3-tier
inline waterfall: in-memory has_ec_volume → any HDD → any disk. That
checked in-memory state only and missed disks that own the .ecx on
disk but haven't been mounted yet — the orphan-shard placement
hazard from seaweedfs/seaweedfs#9212.

Replace both with a single call to
Store::find_ec_shard_target_location, which adds the .ecx-on-disk
tier between mounted and HDD, and accounts for free space in shard
slots so tight-provisioning configurations don't incorrectly skip a
disk that still has room for individual shards.

Pass DATA_SHARDS_COUNT as the data-shard count for free-slot maths;
the helper takes it as a parameter so custom-ratio builds can swap
the default without touching this file.

* fix(seaweed-volume): grow UNLIMITED_FREE budget and saturate the math

ec_free_shard_count's unlimited branch (MaxVolumeCount=0) used to
clamp to a constant `1` once usage exceeded `1 << 30 ≈ 1e9` shard
slots. With several unlimited disks all past that threshold, every
placement decision among them tied at 1 — tie-break degraded to
"first eligible disk."

Bump the synthetic budget to `1 << 60 ≈ 1.15e18` and use
saturating arithmetic so even pathological usage never wraps i64.
Clamp the return value to `≥ 1` so the disk stays eligible for
placement at any load. Tie-breaks among unlimited disks now keep
preferring the less-loaded one across all realistic deployments.
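
A sketch of the saturating budget for the unlimited branch. The constant name matches the commit title, but the function name and signature are assumptions.

```rust
/// Synthetic free budget for disks with MaxVolumeCount == 0.
const UNLIMITED_FREE: i64 = 1 << 60;

/// Free shard slots for an unlimited disk: the budget minus current usage,
/// computed without overflow and never below 1 so the disk stays eligible.
fn unlimited_free_shard_slots(used_shard_slots: i64) -> i64 {
    UNLIMITED_FREE.saturating_sub(used_shard_slots).max(1)
}
```

Two unlimited disks with different usage now report different free counts for any usage below the budget, so the tie-break can still prefer the less-loaded disk.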

Reported in PR #9250 review by @gemini-code-assist.
2026-04-27 16:40:39 -07:00
Chris Lu 4c4d53ce23 fix(seaweed-volume): accept redb aliases for --index (#9237)
fix(seaweed-volume): accept redb aliases for --index and rename kinds

The Rust volume server's disk-backed index uses redb internally
(see RedbNeedleMap), but --index only accepted the legacy `leveldb`
spellings, contradicting the wiki and forcing users to read source to
figure out what value to pass.

- --index now accepts memory|redb|redbMedium|redbLarge as the canonical
  names, with leveldb/leveldbMedium/leveldbLarge kept as aliases.
- Rename NeedleMapKind variants LevelDb*->Redb* so the in-tree names
  match the actual backend.
- Update help text and add a parse-table test covering both names.

Refs #9234.
2026-04-27 01:44:40 -07:00
Chris Lu 045ace29d5 fix(seaweed-volume): parse host:port.grpcPort in master address (#9235)
The Go ServerAddress format encodes an optional explicit gRPC port as
host:port.grpcPort. The Rust heartbeat client only handled host:port
(falling back to port+10000), so feeding it host:port.grpcPort yielded
a malformed gRPC target like "host:port.grpcPort", which surfaced as
checkWithMaster transport errors.

Mirror pb.ServerToGrpcAddress(): if the part after the last ':' contains
a '.' followed by a valid u16, treat that suffix as the explicit gRPC
port; otherwise keep the +10000 default.
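
The parsing rule can be sketched as below. `to_grpc_address` is a hypothetical name, and the sketch returns `None` for addresses it cannot parse; how the real client reports such errors is not stated here.

```rust
/// Convert "host:port" or "host:port.grpcPort" into "host:grpcPort".
fn to_grpc_address(addr: &str) -> Option<String> {
    let (host, ports) = addr.rsplit_once(':')?;
    // Explicit form: the text after the last ':' is "port.grpcPort".
    if let Some((http, grpc)) = ports.split_once('.') {
        if http.parse::<u16>().is_ok() {
            if let Ok(grpc_port) = grpc.parse::<u16>() {
                return Some(format!("{}:{}", host, grpc_port));
            }
        }
    }
    // Default form: gRPC listens on the HTTP port plus 10000.
    // The sum is widened to u32 so the addition itself cannot overflow.
    let port: u16 = ports.parse().ok()?;
    Some(format!("{}:{}", host, u32::from(port) + 10_000))
}
```

For example, `master:9333` maps to `master:19333`, while `master:9333.19444` maps to `master:19444`.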

Refs #9234.
2026-04-27 01:44:11 -07:00
Chris Lu 503b6f2744 fix(seaweed-volume): ceil EC shard slots in maybe_adjust_volume_max (#9232)
Mirrors the volume-server side of seaweedfs/seaweedfs#9196: compute the
EC-shard contribution to maxVolumeCount with proper ceiling division
((N + D - 1) / D) instead of (N + D) / D, which over-counts by one slot
whenever the per-location EC-shard count is zero or an exact multiple of
DataShardsCount (10). The most common case -- a location with no EC
shards -- silently inflated maxVolumeCount by 1 on every recalculation.

The matching low-disk effective_max_count path in heartbeat.rs already
uses the correct ceiling form, and the master-side topology changes from
that PR have no Rust counterpart.
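
The difference between the two formulas is easy to check numerically. The function names are illustrative only.

```rust
/// Old form: equals floor(n / d) + 1, so it over-counts by one slot
/// whenever n is zero or an exact multiple of d.
fn ec_slots_old(n: u32, d: u32) -> u32 {
    (n + d) / d
}

/// Ceiling division: the smallest number of d-sized slots that hold n shards.
fn ec_slots_ceil(n: u32, d: u32) -> u32 {
    (n + d - 1) / d
}
```

With `d = 10`, the two agree for `n = 1` and `n = 11` but differ for `n = 0` (1 versus 0) and `n = 10` (2 versus 1).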
2026-04-26 22:31:56 -07:00
dependabot[bot] 352ffdffe1 build(deps): bump rustls-webpki from 0.103.10 to 0.103.13 in /seaweed-volume (#9216)
build(deps): bump rustls-webpki in /seaweed-volume

Bumps [rustls-webpki](https://github.com/rustls/webpki) from 0.103.10 to 0.103.13.
- [Release notes](https://github.com/rustls/webpki/releases)
- [Commits](https://github.com/rustls/webpki/compare/v/0.103.10...v/0.103.13)

---
updated-dependencies:
- dependency-name: rustls-webpki
  dependency-version: 0.103.13
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-24 11:44:20 -07:00
Chris Lu 036191c78a Merge branch 'master' of https://github.com/seaweedfs/seaweedfs 2026-04-23 11:09:59 -07:00
Chris Lu ae93f87a46 adjust logo 2026-04-23 10:05:51 -07:00
Chris Lu f438cc3544 fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard (#9184) (#9186)
* test(volume_server): reproduce #9184 ReceiveFile truncating a mounted shard

ReceiveFile for an EC shard calls os.Create(filePath) which opens the
path with O_TRUNC. When the shard is already mounted, the in-memory
EcVolume holds a file descriptor against the same inode, so a second
ReceiveFile call for the same (volume, shard) truncates the live shard
file beneath the reader.

Reproducer: generate and mount shard 0 for a populated volume, capture
the on-disk size, then send a smaller payload for the same shard via
ReceiveFile. The current handler accepts the overwrite and leaves the
shard truncated in place; this test pins that behavior. When the fix
lands the server should reject (or rename-then-swap) and this test
must be inverted.

* fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard

ReceiveFile used os.Create on EC shard paths, which opens with
O_TRUNC and truncates in place. When an EC shard is already
mounted, the in-memory EcVolume holds file descriptors against the
same inodes, so the truncation corrupts the live shard beneath any
ongoing read. On retries of an EC task this produced the "missing
parts" class of errors in #9184.

The fix rejects any ReceiveFile for an EC volume that currently
has mounted shards. The caller must unmount before retrying —
silent truncation is never an acceptable outcome. Non-EC writes and
ReceiveFile for volumes that have never been mounted on this server
continue to work as before.

Tests:
- TestReceiveFileRejectsOverwriteOfMountedEcShard: mounts a shard,
  attempts an overwrite, asserts the error response and that the
  on-disk file and live reads are undisturbed.
- TestReceiveFileAllowsEcShardWhenNoMount: pins the common-case
  contract that a first write to a target still succeeds.

* fix(volume-rust): refuse ReceiveFile overwrite of mounted EC shard

Mirror the Go-side change: reject receive_file for any EC volume that
currently has mounted shards on this server. std::fs::File::create
truncates in place and the in-memory EcVolume holds fds on the same
inodes, so an overwrite would corrupt live readers.
2026-04-22 16:47:01 -07:00
Chris Lu c4e1885053 fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement (#9184) (#9185)
* test(volume_server): reproduce #9184 EC ReceiveFile disk-placement bug

The plugin-worker EC task sends shards via ReceiveFile, which picks
Locations[0] as the target directory regardless of the admin planner's
TargetDisk assignment. ReceiveFileInfo has no disk_id field, so there
is no wire channel to honor the plan.

Adds StartSingleVolumeClusterWithDataDirs to the integration framework
so tests can launch a volume server with N data directories. The new
repro asserts the current (buggy) behavior: sending three distinct EC
shards via ReceiveFile leaves all three files in dir[0] and the other
dirs empty. When the fix adds disk_id to ReceiveFileInfo, this
assertion must flip to verify the planned placement is respected.

* fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement

Before this change, VolumeServer.ReceiveFile for EC shards always
selected the first HDD location (Locations[0]). The plugin-worker EC
task had no way to pass the admin planner's per-shard disk
assignment — ReceiveFileInfo carried no disk_id field — so every
received EC shard piled onto a single disk per destination server.
On multi-disk servers this caused uneven load (one disk absorbing all
EC shard I/O), frequent ENOSPC retries, and a growing EC backlog
under sustained ingest (see issue #9184).

Changes:
- proto: add disk_id to ReceiveFileInfo, mirroring
  VolumeEcShardsCopyRequest.disk_id.
- worker: DistributeEcShards tracks the planner-assigned disk per
  shard; sendShardFileToDestination forwards that disk id. Metadata
  files (ecx/ecj/vif) inherit the disk of the first data shard
  targeting the same node so they land next to the shards.
- server: ReceiveFile honors disk_id when > 0 with bounds
  validation; disk_id=0 (unset) falls back to the same
  auto-selection pattern as VolumeEcShardsCopy (prefer disk that
  already has shards for this volume, then any HDD with free space,
  then any location with free space).
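
The server-side rule can be sketched as follows. `resolve_target_disk` and the `auto_select` callback are hypothetical; the callback stands in for the auto-selection pattern shared with VolumeEcShardsCopy.

```rust
/// Pick the disk index for an incoming EC shard.
/// disk_id == 0 means "unset" and defers to auto-selection; any other
/// value must be a valid index into the server's locations.
fn resolve_target_disk(
    disk_id: u32,
    location_count: usize,
    auto_select: impl FnOnce() -> Option<usize>,
) -> Result<usize, String> {
    if disk_id == 0 {
        return auto_select().ok_or_else(|| "no location with free space".to_string());
    }
    // Compare in an unsigned domain wide enough for both values, so a
    // high-bit disk_id can never look like a small or negative index.
    if (disk_id as u64) < (location_count as u64) {
        Ok(disk_id as usize)
    } else {
        Err(format!("disk_id {} out of range ({} locations)", disk_id, location_count))
    }
}
```

One consequence of treating 0 as unset is that disk 0 can only be reached through auto-selection, which matches the `disk_id={1,2,0}` test described below.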

Tests updated:
- TestReceiveFileEcShardHonorsDiskID asserts three shards sent with
  disk_id={1,2,0} land on data dirs 1, 2, and 0 respectively.
- TestReceiveFileEcShardRejectsInvalidDiskID pins the out-of-range
  disk_id rejection path.

* fix(volume-rust): honor disk_id in ReceiveFile for EC shards

Mirror the Go-side change: when disk_id > 0 place the EC shard on the
requested disk; when unset, auto-select with the same preference order
as volume_ec_shards_copy (disk already holding shards, then any HDD,
then any disk).

* fix(volume): compare disk_id as uint32 to avoid 32-bit overflow

On 32-bit Go builds `int(fileInfo.DiskId) >= len(Locations)` can wrap a
high-bit uint32 to a negative int, bypassing the bounds check before the
index operation. Compare in the uint32 domain instead.

* test(ec): fail invalid-disk_id test on transport error

Previously a transport-level error from CloseAndRecv silently passed the
test by returning early, masking any real gRPC failure. Fail loudly so
only the structured ReceiveFileResponse rejection path counts as a pass.

* docs(test): explain why DiskId=0 auto-selects dir 0 in EC placement test

Documents the load-bearing assumption that shards are never mounted in
this test, so loc.FindEcVolume always returns false and auto-select
falls through to the first HDD. Saves future readers from re-deriving
the expected directory for the DiskId=0 case.

* fix(test): preserve baseDir/volume path for single-dir clusters

StartSingleVolumeClusterWithDataDirs started naming the data directory
volume0 even in the dataDirCount=1 case, which broke Scrub tests that
reach into baseDir/volume via CorruptDatFile / CorruptEcShardFile /
CorruptEcxFile. Keep the legacy name for single-dir clusters; only use
the indexed "volumeN" layout when multiple disks are requested.
2026-04-22 10:30:13 -07:00
Chris Lu 45578a42e9 fix(volume): keep vacuum running past dangling .idx entries (#9115)
* fix(volume): keep vacuum running past dangling .idx entries

Vacuum compaction aborted entirely on the first .idx entry whose offset
pointed past the end of the .dat file, surfacing as `cannot hydrate
needle from file: EOF` and stalling progress on every other volume.

In both Go and Rust:

- During compaction, skip an unreadable needle and continue. The bytes
  it pointed at were already unreachable via reads, so dropping the
  index reference makes the post-vacuum volume consistent. Real EIO
  still bails out so a disk fault is not silently papered over.

- At volume load, do a single linear scan of the .idx and confirm
  every (offset + actual size) fits inside .dat. The pre-existing
  integrity check only looked at the last 10 entries, so deeper
  corruption (e.g. left over from a crashed batched write) went
  undetected and only surfaced later as a vacuum EOF. A failure now
  marks the volume read-only at load time so an operator can react.

Refs #8928

* fix(volume): only skip permanent-corruption needle reads during vacuum

Address PR review feedback (gemini-code-assist + coderabbit):

The original patch skipped any non-EIO read failure, which would silently
drop needles on transient errors — Windows hardware bad-sector errors
(ERROR_CRC etc.) never surface as syscall.EIO; tiered-storage network
timeouts and EROFS would also slip through and shrink the volume.

Switch to an explicit whitelist of permanent-corruption shapes:

- Add needle.ErrorCorrupted sentinel and wrap CRC and "index out of
  range" errors with %w so callers can match via errors.Is.
- copyDataBasedOnIndexFile now skips only when the read failure is
  io.EOF, io.ErrUnexpectedEOF, ErrorSizeMismatch, ErrorSizeInvalid,
  or ErrorCorrupted. Anything else (real disk faults, environmental
  errors, Windows hardware codes) aborts the compaction so an
  operator notices.
- Mirror the same whitelist in the Rust volume server, matching on
  io::ErrorKind::UnexpectedEof and the NeedleError corruption variants
  (SizeMismatch, CrcMismatch, IndexOutOfRange, TailTooShort).
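
On the Rust side, the whitelist might be expressed as a single match. The `NeedleError` enum below is a simplified stand-in that only borrows the variant names listed above; the `Io` variant and the function name are assumptions.

```rust
use std::io;

/// Simplified stand-in for the volume server's needle error type.
#[derive(Debug)]
enum NeedleError {
    SizeMismatch,
    CrcMismatch,
    IndexOutOfRange,
    TailTooShort,
    Io(io::Error),
}

/// True only for errors that indicate permanent corruption of one needle.
/// Every other failure should abort the compaction.
fn is_skippable_needle_read_error(err: &NeedleError) -> bool {
    match err {
        NeedleError::SizeMismatch
        | NeedleError::CrcMismatch
        | NeedleError::IndexOutOfRange
        | NeedleError::TailTooShort => true,
        // A short read past the end of .dat is a dangling index entry.
        NeedleError::Io(e) => e.kind() == io::ErrorKind::UnexpectedEof,
    }
}
```

Listing the skippable cases explicitly means a new or unexpected error kind defaults to aborting rather than to silently dropping data.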

Also add `defer v.Close()` in TestVerifyIndexFitsInDat so Windows
t.TempDir() cleanup can release the .dat/.idx handles.

Refs #8928

* fix(volume): wrap entry-not-found size-mismatch with ErrorSizeMismatch

Address PR review: the fallback branch in ReadBytes returned an
unwrapped fmt.Errorf, so isSkippableNeedleReadError (and any caller
using errors.Is(..., ErrorSizeMismatch)) could not match it. Wrap
with %w so the whitelist applies, while leaving the existing direct
sentinel return for the OffsetSize==4 / offset<MaxPossibleVolumeSize
retry path unchanged so ReadData's `err == ErrorSizeMismatch` retry
still triggers.

Refs #8928

* fix(volume): integrate dangling-idx check into existing index load walk

Address PR review (gemini-code-assist, medium): the structural .idx
check used to do a second linear scan of the index file at every volume
load, doubling the disk-I/O cost on servers managing many volumes.

Track the largest (offset + actual size) seen during the existing
needle-map load walks (`LoadCompactNeedleMap`, `NewLevelDbNeedleMap`,
`NewSortedFileNeedleMap`'s `newNeedleMapMetricFromIndexFile`,
`DoOffsetLoading`) on a new `MaximumNeedleEnd` field on `mapMetric`,
exposed as `MaxNeedleEnd()` on the NeedleMapper interface.
`volume.load()` then compares `nm.MaxNeedleEnd()` to the .dat size
after the load is complete — pure numeric comparison, no extra I/O.

The standalone `verifyIndexFitsInDat` helper and its caller in
`CheckVolumeDataIntegrity` are removed; the test that used to drive
the helper directly now exercises the new path via
`LoadCompactNeedleMap`.

Mirror the same change in the Rust volume server: track
`max_needle_end` on `NeedleMapMetric`, expose via `max_needle_end()`
on `CompactNeedleMap`, `RedbNeedleMap`, and the `NeedleMap` enum.
The Rust load walk already happens in `load_from_idx` for both map
kinds, so the structural check becomes free.
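
A minimal sketch of tracking the largest needle end during the index walk. The struct is reduced to the one field discussed here, the `observe` and `fits_in_dat` methods are hypothetical, and offsets are assumed to be plain byte offsets.

```rust
/// Reduced stand-in for the metric struct: only the new field is modelled.
#[derive(Default)]
struct NeedleMapMetric {
    max_needle_end: u64,
}

impl NeedleMapMetric {
    /// Called once per .idx entry during the existing load walk.
    fn observe(&mut self, offset: u64, actual_size: u64) {
        let end = offset.saturating_add(actual_size);
        self.max_needle_end = self.max_needle_end.max(end);
    }

    /// The structural check after load: a pure numeric comparison.
    fn fits_in_dat(&self, dat_size: u64) -> bool {
        self.max_needle_end <= dat_size
    }
}
```

Since the walk already visits every entry, the extra cost is one addition and one comparison per entry, with no additional reads of the index file.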

Refs #8928
2026-04-16 22:01:34 -07:00
Chris Lu 300e906330 admin: report file and delete counts for EC volumes (#9060)
* admin: report file and delete counts for EC volumes

The admin bucket size fix (#9058) left object counts at zero for
EC-encoded data because VolumeEcShardInformationMessage carried no file
count. Billing/monitoring dashboards therefore still under-report
objects once a bucket is EC-encoded.

Thread file_count and delete_count end-to-end:

- Add file_count/delete_count to VolumeEcShardInformationMessage (proto
  fields 8 and 9) and regenerate master_pb.
- Compute them lazily on volume servers by walking the .ecx index once
  per EcVolume, cache on the struct, and keep the cache in sync inside
  DeleteNeedleFromEcx (distinguishing live vs already-tombstoned
  entries so idempotent deletes do not drift the counts).
- Populate the new proto fields from EcVolume.ToVolumeEcShardInformationMessage
  and carry them through the master-side EcVolumeInfo / topology sync.
- Aggregate in admin collectCollectionStats, deduping per volume id:
  every node holding shards of an EC volume reports the same counts, so
  summing across nodes would otherwise multiply the object count by the
  number of shard holders.

Regression tests cover the initial .ecx walk, live/tombstoned delete
bookkeeping (including idempotent and missing-key cases), and the admin
dedup path for an EC volume reported by multiple nodes.

* ec: include .ecj journal in EcVolume delete count

The initial delete count only reflected .ecx tombstones, missing any
needle that was journaled in .ecj but not yet folded into .ecx — e.g.
on partial recovery. Expand initCountsLocked to take the union of
.ecx tombstones and .ecj journal entries, deduped by needle id, so:

  - an id that is both tombstoned in .ecx and listed in .ecj counts once
  - a duplicate .ecj entry counts once
  - an .ecj id with a live .ecx entry is counted as deleted (not live)
  - an .ecj id with no matching .ecx entry is still counted

Covered by TestEcVolumeFileAndDeleteCountEcjUnion.

* ec: report delete count authoritatively and tombstone once per delete

Address two issues with the previous EcVolume file/delete count work:

1. The delete count was computed lazily on first heartbeat and mixed
   in a .ecj-union fallback to "recover" partial state. That diverged
   from how regular volumes report counts (always live from the needle
   map) and had drift cases when .ecj got reconciled. Replace with an
   eager walk of .ecx at NewEcVolume time, maintained incrementally on
   every DeleteNeedleFromEcx call. Semantics now match needle_map_metric:
   FileCount is the total number of needles ever recorded in .ecx
   (live + tombstoned), DeleteCount is the tombstones — so live =
   FileCount - DeleteCount. Drop the .ecj-union logic entirely.

2. A single EC needle delete fanned out to every node holding a replica
   of the primary data shard and called DeleteNeedleFromEcx on each,
   which inflated the per-volume delete total by the replica factor.
   Rewrite doDeleteNeedleFromRemoteEcShardServers to try replicas in
   order and stop at the first success (one tombstone per delete), and
   only fall back to other shards when the primary shard has no home
   (ErrEcShardMissing sentinel), not on transient RPC errors.

Admin aggregation now folds EC counts correctly: FileCount is deduped
per volume id (every shard holder has an identical .ecx) and DeleteCount
is summed across nodes (each delete tombstones exactly one node). Live
object count = deduped FileCount - summed DeleteCount.
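
The aggregation rule can be sketched as below. `EcShardReport` and `live_object_count` are hypothetical names, and clamping at zero when deletes exceed files is an assumption of this sketch, not a statement about the admin code.

```rust
use std::collections::HashMap;

/// One node's report for one EC volume.
struct EcShardReport {
    volume_id: u32,
    file_count: u64,
    delete_count: u64,
}

/// Live objects across all EC volumes: FileCount is taken once per volume
/// id, DeleteCount is summed over every node that reported the volume.
fn live_object_count(reports: &[EcShardReport]) -> u64 {
    let mut per_volume: HashMap<u32, (u64, u64)> = HashMap::new();
    for r in reports {
        // The first report seeds the file count; later ones only add deletes.
        let entry = per_volume.entry(r.volume_id).or_insert((r.file_count, 0));
        entry.1 += r.delete_count;
    }
    per_volume
        .values()
        .map(|&(files, deletes)| files.saturating_sub(deletes))
        .sum()
}
```

With three nodes reporting the same volume as `FileCount=100` and delete counts 5, 3, and 2, the result is 90.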

Tests updated to match the new semantics:
  - EC volume counts seed FileCount as total .ecx entries (live +
    tombstoned), DeleteCount as tombstones.
  - DeleteNeedleFromEcx keeps FileCount constant and increments
    DeleteCount only on live->tombstone transitions.
  - Admin dedup test uses distinct per-node delete counts (5 + 3 + 2)
    to prove they're summed, while FileCount=100 is applied once.

* ec: test fixture uses real vid; admin warns on skewed ec counts

- writeFixture now builds the .ecx/.ecj/.ec00/.vif filenames from the
  actual vid passed in, instead of hardcoding "_1". The existing tests
  all use vid=1 so behaviour is unchanged, but the helper no longer
  silently diverges from its documented parameter.
- collectCollectionStats logs a glog warning when an EC volume's summed
  delete count exceeds its deduped file count, surfacing the anomaly
  (stale heartbeat, counter drift, etc.) instead of silently dropping
  the volume from the object count.

* ec: derive file/delete counts from .ecx/.ecj file sizes

seedCountsFromEcx walked the full .ecx index at volume load, which is
wasted work: .ecx has fixed-size entries (NeedleMapEntrySize) and .ecj
has fixed-size deletion records (NeedleIdSize), so both counts are pure
file-size arithmetic.

  fileCount   = ecxFileSize / NeedleMapEntrySize
  deleteCount = ecjFileSize / NeedleIdSize

Rip out the cached counters, countsLock, seedCountsFromEcx, and the
recordDelete helper. Track ecjFileSize directly on the EcVolume struct,
seed it from Stat() at load, and bump it on every successful .ecj append
inside DeleteNeedleFromEcx under ecjFileAccessLock. Skip the .ecj write
entirely when the needle is already tombstoned so the derived delete
count stays idempotent on repeat deletes. Heartbeats now compute counts
in O(1).
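
The derivation is two integer divisions. The constant values below (16-byte index entries, 8-byte needle ids) are assumptions based on the default 4-byte-offset build; builds with a different offset width would use a different entry size.

```rust
/// Assumed sizes for the default build; treat both as illustrative.
const NEEDLE_MAP_ENTRY_SIZE: u64 = 16;
const NEEDLE_ID_SIZE: u64 = 8;

/// Derive (file_count, delete_count) from the two file sizes alone.
fn file_and_delete_count(ecx_file_size: u64, ecj_file_size: u64) -> (u64, u64) {
    (
        ecx_file_size / NEEDLE_MAP_ENTRY_SIZE,
        ecj_file_size / NEEDLE_ID_SIZE,
    )
}
```

A 160-byte `.ecx` with a 16-byte `.ecj` therefore reports 10 files and 2 deletes.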

Tests updated: the initial fixture pre-populates .ecj with two ids to
verify the file-size derivation end-to-end, and the delete test keeps
its idempotent-re-delete / missing-needle invariants (unchanged
externally, now enforced by the early return rather than a cache guard).

* ec: sync Rust volume server with Go file/delete count semantics

Mirror the Go-side EC file/delete count work in the Rust volume server
so mixed Go/Rust clusters report consistent bucket object counts in
the admin dashboard.

- Add file_count (8) and delete_count (9) to the Rust copy of
  VolumeEcShardInformationMessage (seaweed-volume/proto/master.proto).
- EcVolume gains ecj_file_size, seeded from the journal's metadata on
  open and bumped inside journal_delete on every successful append.
- file_and_delete_count() returns counts derived in O(1) from
  ecx_file_size / NEEDLE_MAP_ENTRY_SIZE and
  ecj_file_size / NEEDLE_ID_SIZE, matching Go's FileAndDeleteCount.
- to_volume_ec_shard_information_messages populates the new proto
  fields instead of defaulting them to zero.
- mark_needle_deleted_in_ecx now returns a DeleteOutcome enum
  (NotFound / AlreadyDeleted / Tombstoned) so journal_delete can skip
  both the .ecj append and the size bump when the needle is missing
  or already tombstoned, keeping the derived delete_count idempotent
  on repeat or no-op deletes.
- Rust's EcVolume::new no longer replays .ecj into .ecx on load. Go's
  RebuildEcxFile is only called from specific decode/rebuild gRPC
  handlers, not on volume open, and replaying on load was hiding the
  deletion journal from the new file-size-derived delete counter.
  rebuild_ecx_from_journal is kept as dead_code for future decode
  paths that may want the same replay semantics.

Also clean up the Go FileAndDeleteCount to drop unnecessary runtime
guards against zero constants — NeedleMapEntrySize and NeedleIdSize
are compile-time non-zero.

test_ec_volume_journal updated to pre-populate the .ecx with the
needles it deletes, and extended to verify that repeat and
missing-id deletes do not drift the derived counts.

* ec: document enterprise-reserved proto field range on ec shard info

Both OSS master.proto copies now note that fields 10-19 are reserved
for future upstream additions while 20+ are owned by the enterprise
fork. Enterprise already pins data_shards/parity_shards at 20/21, so
keeping OSS additions inside 8-19 avoids wire-level collisions for
mixed deployments.

* ec(rust): resolve .ecx/.ecj helpers from ecx_actual_dir

ecx_file_name() and ecj_file_name() resolved from self.dir_idx, but
new() opens the actual files from ecx_actual_dir (which may fall back
to the data dir when the idx dir does not contain the index). After a
fallback, read_deleted_needles() and rebuild_ecx_from_journal() would
read/rebuild the wrong (nonexistent) path while heartbeats reported
counts from the file actually in use — silently dropping deletes.

Point idx_base_name() at ecx_actual_dir, which is initialized to
dir_idx and only diverges after a successful fallback, so every call
site agrees with the file new() has open. The pre-fallback call in
new() (line 142) still returns the dir_idx path because
ecx_actual_dir == dir_idx at that point.

Update the destroy() sweep to build the dir_idx cleanup paths
explicitly instead of leaning on the helpers, so post-fallback stale
files in the idx dir are still removed.

* ec: reset ecj size after rebuild; rollback ecx tombstone on ecj failure

Two EC delete-count correctness fixes applied symmetrically to Go and
Rust volume servers.

1. rebuild_ecx_from_journal (Rust) now sets ecj_file_size = 0 after
   recreating the empty journal, matching the on-disk truth.
   Previously the cached size still reflected the pre-rebuild journal
   and file_and_delete_count() would keep reporting stale delete
   counts. The Go side has no equivalent bug because RebuildEcxFile
   runs in an offline helper that does not touch an EcVolume struct.

2. DeleteNeedleFromEcx / journal_delete used to tombstone the .ecx
   entry before writing the .ecj record. If the .ecj append then
   failed, the needle was permanently marked deleted but the
   heartbeat-reported delete_count never advanced (it is derived from
   .ecj file size), and a retry would see AlreadyDeleted and early-
   return, leaving the drift permanent.

   Both languages now capture the entry's file offset and original
   size bytes during the mark step, attempt the .ecj append, and on
   failure roll the .ecx tombstone back by writing the original size
   bytes at the known offset. A rollback that itself errors is
   logged (glog / tracing) but cannot re-sync the files — this is
   the same failure mode a double disk error would produce, and is
   unavoidable without a full on-disk transaction log.

Go: wrap MarkNeedleDeleted in a closure that captures the file
offset into an outer variable, then pass the offset + oldSize to the
new rollbackEcxTombstone helper on .ecj seek/write errors.

Rust: DeleteOutcome::Tombstoned now carries the size_offset and a
[u8; SIZE_SIZE] copy of the pre-tombstone size field. journal_delete
destructures on Tombstoned and calls restore_ecx_size on .ecj append
failure.
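
The capture-then-rollback step described above can be sketched in Go
against an in-memory index. This is a minimal illustration, not the
committed code: the 4-byte big-endian size field, the 0xFFFFFFFF
tombstone value, and the names deleteWithRollback and sizeAfterDelete
are assumptions made for the example.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Assumed layout for illustration only: a 4-byte big-endian size field
// whose tombstone value is 0xFFFFFFFF (-1 as int32).
const (
	sizeSize  = 4
	tombstone = uint32(0xFFFFFFFF)
)

// deleteWithRollback tombstones the size field at sizeOffset, then runs
// appendJournal. If the append fails, the original size bytes are written
// back so the index and the journal stay in agreement.
func deleteWithRollback(ecx []byte, sizeOffset int, appendJournal func() error) error {
	var old [sizeSize]byte
	copy(old[:], ecx[sizeOffset:sizeOffset+sizeSize]) // capture before marking
	binary.BigEndian.PutUint32(ecx[sizeOffset:], tombstone)
	if err := appendJournal(); err != nil {
		copy(ecx[sizeOffset:sizeOffset+sizeSize], old[:]) // roll the tombstone back
		return fmt.Errorf("journal append failed, tombstone rolled back: %w", err)
	}
	return nil
}

// sizeAfterDelete runs one delete against a single in-memory entry of
// size 1234 and returns the size field afterwards.
func sizeAfterDelete(failAppend bool) uint32 {
	ecx := make([]byte, sizeSize)
	binary.BigEndian.PutUint32(ecx, 1234)
	_ = deleteWithRollback(ecx, 0, func() error {
		if failAppend {
			return errors.New("injected .ecj write error")
		}
		return nil
	})
	return binary.BigEndian.Uint32(ecx)
}

func main() {
	fmt.Println(sizeAfterDelete(false) == tombstone) // true
	fmt.Println(sizeAfterDelete(true))               // 1234
}
```

A retry after the failed append sees the original size again rather
than a tombstone, so it can proceed instead of returning early.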

* test(ec): widen admin /health wait to 180s for cold CI

TestEcEndToEnd starts master, 14 volume servers, filer, 2 workers and
admin in sequence, and previously waited only 60s for admin's HTTP
server to come up. On cold GitHub runners the tail of the earlier
subprocess startups eats most of that budget and the wait occasionally
times out (last hit on run 24374773031). The local fast path is still
~20s total, so the bump only raises the timeout ceiling, not the
happy-path duration.

* test(ec): fork volume servers in parallel in TestEcEndToEnd

startWeed is non-blocking (just cmd.Start()), so the per-process fork +
mkdir + log-file-open overhead for 14 volume servers was serialized for
no reason. On cold CI disks that overhead stacks up and eats into the
subsequent admin /health wait, which is how run 24374773031 flaked.

Wrap the volume-server loop in a sync.WaitGroup and guard runningCmds
with a mutex so concurrent appends are safe. startWeed still calls
t.Fatalf on failure, which is fine from a goroutine for a fatal test
abort; the fail-fast isn't something we rely on for precise ordering.
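
The WaitGroup-plus-mutex pattern described above looks roughly like
this. It is a sketch only: startAll and the start callback are
hypothetical stand-ins for the test's volume-server loop and startWeed,
and the handles are plain strings rather than *exec.Cmd values.

```go
package main

import (
	"fmt"
	"sync"
)

// startAll launches n start calls concurrently and collects their
// handles. The mutex guards the shared slice because append is not safe
// for concurrent use.
func startAll(n int, start func(i int) string) []string {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		running []string
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			handle := start(i) // fork + mkdir + log-file open happen here
			mu.Lock()
			running = append(running, handle)
			mu.Unlock()
		}(i)
	}
	wg.Wait() // every server has been started before the caller continues
	return running
}

func main() {
	running := startAll(14, func(i int) string {
		return fmt.Sprintf("volume-%d", i)
	})
	fmt.Println(len(running)) // 14
}
```

The order of entries in the result is not deterministic. Separately,
the Go testing package documents that FailNow (and therefore Fatalf)
must be called from the goroutine running the test, so a stricter
variant would collect errors in the workers and fail after Wait.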

* ec: fsync ecx before ecj, truncate on failure, harden rebuild

Four correctness fixes covering both volume servers.

1. Durability ordering (Go + Rust). After marking the .ecx tombstone
   we now fsync .ecx before touching .ecj, so a crash between the two
   files cannot leave the journal with an entry for a needle whose
   tombstone is still sitting in page cache. Once the fsync returns,
   the tombstone is the source of truth: reads see "deleted",
   delete_count may under-count by one (benign, idempotent retries)
   but never over-reports. If the fsync itself fails we restore the
   original size bytes and surface the error. The .ecj append is then
   followed by its own Sync so the reported delete_count matches the
   on-disk journal once the write returns.

2. .ecj truncation on append failure. write_all may have extended the
   journal on disk before sync_all / Sync errors out, leaving the
   cached ecj_file_size out of sync with the physical length and
   drifting delete_count permanently after restart. Both languages
   now capture the pre-append size, truncate the file back via
   set_len / Truncate on any write or sync failure, and only then
   restore the .ecx tombstone. Truncation errors are logged — same-fd
   length resets cannot realistically fail — but cannot themselves
   re-sync the files.

3. Atomic rebuild_ecx_from_journal (Rust; dead code today, but any
   future decode path would call it). Previously a failed
   mark_needle_deleted_in_ecx call was swallowed with `let _ = ...`
   and the journal was still removed, silently losing tombstones.
   We now bubble up any non-NotFound error, fsync .ecx after the
   whole replay succeeds, and only then drop and recreate .ecj.
   NotFound is still ignored (expected race between delete and encode).

4. Missing-.ecx hardening (Rust). mark_needle_deleted_in_ecx used to
   return Ok(NotFound) when self.ecx_file was None, hiding a closed or
   corrupt volume behind what looks like an idempotent no-op. It now
   returns an io::Error carrying the volume id so callers (e.g.
   journal_delete) fail loudly instead.

Existing Go and Rust EC test suites stay green.
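
Fix 2 (truncate the journal back on append failure) can be sketched in
Go as follows. The helper names, the injectable syncFn parameter, and
the 8-byte record width are assumptions made so the failure path can be
exercised; the real code paths are not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

const recordSize = 8 // assumed journal record width for this sketch

// appendOrTruncate writes rec at the current end of f and syncs it. On
// any write or sync failure the file is truncated back to its pre-append
// length, so a cached size and the physical length cannot diverge.
func appendOrTruncate(f *os.File, rec []byte, syncFn func(*os.File) error) error {
	info, err := f.Stat()
	if err != nil {
		return err
	}
	preSize := info.Size()
	if _, err = f.WriteAt(rec, preSize); err == nil {
		err = syncFn(f)
	}
	if err != nil {
		if terr := f.Truncate(preSize); terr != nil {
			return fmt.Errorf("append failed: %w; truncate failed: %v", err, terr)
		}
		return err
	}
	return nil
}

// journalSizeAfter performs one good append and then a second append
// whose sync optionally fails, and returns the final file size.
func journalSizeAfter(failSecondSync bool) (int64, error) {
	f, err := os.CreateTemp("", "sketch-*.ecj")
	if err != nil {
		return 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	if err := appendOrTruncate(f, make([]byte, recordSize), (*os.File).Sync); err != nil {
		return 0, err
	}
	syncFn := (*os.File).Sync
	if failSecondSync {
		syncFn = func(*os.File) error { return errors.New("injected sync failure") }
	}
	_ = appendOrTruncate(f, make([]byte, recordSize), syncFn) // failure is the point
	info, err := f.Stat()
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	ok, _ := journalSizeAfter(false)
	failed, _ := journalSizeAfter(true)
	fmt.Println(ok, failed) // 16 8
}
```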

* ec: make .ecx immutable at runtime; track deletes in memory + .ecj

Refactors both volume servers so the sealed sorted .ecx index is never
mutated during normal operation. Runtime deletes are committed to the
.ecj deletion journal and tracked in an in-memory deleted-needle set;
read-path lookups consult that set to mask out deleted ids on top of
the immutable .ecx record. Mirrors the intended design on both Go and
Rust sides.

EcVolume gains a `deletedNeedles` / `deleted_needles` set seeded from
.ecj in NewEcVolume / EcVolume::new. DeleteNeedleFromEcx /
journal_delete:

  1. Looks the needle up read-only in .ecx.
  2. Missing needle -> no-op.
  3. Pre-existing .ecx tombstone (from a prior decode/rebuild) ->
     mirror into the in-memory set, no .ecj append.
  4. Otherwise append the id to .ecj, fsync, and only then publish
     the id into the set. A partial write is truncated back to the
     pre-append length so the on-disk journal and the in-memory set
     cannot drift.

FindNeedleFromEcx / find_needle_from_ecx now return
TombstoneFileSize when the id is in the in-memory set, even though
the bytes on disk still show the original size.

FileAndDeleteCount:
  fileCount   = .ecx size / NeedleMapEntrySize (unchanged)
  deleteCount = len(deletedNeedles) (was: .ecj size / NeedleIdSize)
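
The delete flow and the derived counts above can be modelled entirely
in memory. In this sketch a map stands in for the sealed .ecx, a slice
for the .ecj journal, and tombstoneSize for TombstoneFileSize; all type
and method names are hypothetical, and treating a repeat delete as a
no-op is an assumption rather than something the steps spell out.

```go
package main

import "fmt"

const tombstoneSize int32 = -1 // stand-in for TombstoneFileSize

type ecVolume struct {
	index   map[uint64]int32    // stand-in for the sealed .ecx; never mutated
	journal []uint64            // stand-in for the .ecj deletion journal
	deleted map[uint64]struct{} // in-memory deleted-needle set
}

func newEcVolume(index map[uint64]int32) *ecVolume {
	return &ecVolume{index: index, deleted: make(map[uint64]struct{})}
}

// delete follows the numbered steps: read-only lookup, no-op on a missing
// id, mirror a pre-existing tombstone without journaling, otherwise
// journal first and publish to the set afterwards.
func (v *ecVolume) delete(id uint64) {
	size, ok := v.index[id]
	if !ok {
		return // missing needle
	}
	if _, done := v.deleted[id]; done {
		return // assumed: repeat deletes do not journal again
	}
	if size != tombstoneSize {
		v.journal = append(v.journal, id) // the real code fsyncs here
	}
	v.deleted[id] = struct{}{}
}

// find masks deleted ids on top of the unchanged index record.
func (v *ecVolume) find(id uint64) (int32, bool) {
	size, ok := v.index[id]
	if !ok {
		return 0, false
	}
	if _, gone := v.deleted[id]; gone {
		return tombstoneSize, true
	}
	return size, true
}

// counts derives fileCount from the index and deleteCount from the set.
func (v *ecVolume) counts() (files, deletes int) {
	return len(v.index), len(v.deleted)
}

func main() {
	v := newEcVolume(map[uint64]int32{1: 100, 2: 200, 3: tombstoneSize})
	v.delete(1)
	v.delete(1)  // repeat
	v.delete(3)  // already tombstoned in the index
	v.delete(99) // missing id
	size, _ := v.find(1)
	files, deletes := v.counts()
	fmt.Println(size, files, deletes, len(v.journal), v.index[1]) // -1 3 2 1 100
}
```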

The RebuildEcxFile / rebuild_ecx_from_journal decode-time helpers
still fold .ecj into .ecx — that is the one place tombstones land in
the physical index, and it runs offline on closed files. Rust's
rebuild helper now also clears the in-memory set when it succeeds.

Dead code removed on the Rust side: `DeleteOutcome`,
`mark_needle_deleted_in_ecx`, `restore_ecx_size`. Go drops the
runtime `rollbackEcxTombstone` path. Neither helper was needed once
.ecx stopped being a runtime mutation target.

TestEcVolumeSyncEnsuresDeletionsVisible (issue #7751) is rewritten
as TestEcVolumeDeleteDurableToJournal, which exercises the full
durability chain: delete -> .ecj fsync -> FindNeedleFromEcx masks
via the in-memory set -> raw .ecx bytes are *unchanged* -> Close +
RebuildEcxFile folds the journal into .ecx -> raw bytes now show
the tombstone, as CopyFile in the decode path expects.
2026-04-13 21:10:36 -07:00
Chris Lu 10b0bdce02 feat: pass expected_data_size from clients for size-aware assignment (#9032)
* feat: pass expected_data_size from clients for size-aware assignment

Add expected_data_size field to AssignRequest (master proto) and
AssignVolumeRequest (filer proto) so clients can hint how large the
data will be. The master uses this instead of the 1MB default when
tracking pending volume sizes for weighted assignment.

- Add expected_data_size to master.proto AssignRequest
- Add expected_data_size to filer.proto AssignVolumeRequest
- Wire through filer AssignVolume handler
- Wire through HTTP submit handler (uses actual upload size)
- Add ExpectedDataSize to VolumeAssignRequest in operation package
- Topology.PickForWrite accepts optional expectedDataSize parameter

* fix: guard integer conversions in expected_data_size path

- common.go: clamp OriginalDataSize to non-negative before uint64 cast
- topology.go: cap expectedDataSize at math.MaxInt64 before int64 cast
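
The two guards amount to clamping before each cast. A minimal sketch,
with hypothetical helper names and int64 assumed as the signed source
type:

```go
package main

import (
	"fmt"
	"math"
)

// clampToUint64 maps negative inputs to 0 so the cast cannot wrap to a
// huge unsigned value.
func clampToUint64(n int64) uint64 {
	if n < 0 {
		return 0
	}
	return uint64(n)
}

// clampToInt64 caps values above math.MaxInt64 so the cast cannot wrap
// to a negative number.
func clampToInt64(n uint64) int64 {
	if n > math.MaxInt64 {
		return math.MaxInt64
	}
	return int64(n)
}

func main() {
	fmt.Println(clampToUint64(-5), clampToInt64(math.MaxUint64)) // 0 9223372036854775807
}
```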

* fix: parse dataSize hint in HTTP /dir/assign and test non-zero expectedDataSize

- HTTP /dir/assign now parses optional "dataSize" query parameter
  and passes it to PickForWrite instead of hardcoded 0
- Add test assertion for PickForWrite with non-zero expectedDataSize
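
Parsing the optional "dataSize" parameter could look like the sketch
below. parseDataSizeHint is a hypothetical helper, and falling back to
0 (no hint) on a missing or malformed value is an assumption about the
desired behavior, not something the commit states.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// parseDataSizeHint extracts the optional "dataSize" query parameter.
// Missing or unparseable values yield 0, meaning "no hint".
func parseDataSizeHint(rawQuery string) uint64 {
	values, err := url.ParseQuery(rawQuery)
	if err != nil {
		return 0
	}
	raw := values.Get("dataSize")
	if raw == "" {
		return 0
	}
	n, err := strconv.ParseUint(raw, 10, 64)
	if err != nil {
		return 0
	}
	return n
}

func main() {
	fmt.Println(parseDataSizeHint("count=1&dataSize=4096")) // 4096
	fmt.Println(parseDataSizeHint("count=1"))               // 0
}
```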
2026-04-11 11:30:47 -07:00
Chris Lu 3d17bab544 fix(seaweed-volume): eliminate global S3 tier registry races in tests
Multiple Rust tests were racing on the shared global S3TierRegistry by
calling clear(), which wiped entries registered by concurrently running
tests.  Use test-specific backend IDs and targeted remove() instead of
clear() so tests no longer interfere with each other.
2026-04-07 23:11:55 -07:00
Chris Lu 0220b67115 fix(seaweed-volume): fix flaky Rust unit tests
- Increase volume_size_limit in preallocate test from 1KB to 100MB so
  disk-free fluctuations between get_disk_stats calls cannot make the
  integer-division results equal.
- Add readiness synchronization to both spawn_fake_s3_server helpers so
  the test thread waits until axum is about to serve before proceeding.
- Fix test_remote_vif_load_blocks_writes_but_allows_delete: register a
  dummy S3 backend with a test-specific ID so the volume can load its
  remote .vif without racing with other tests on the global registry.
2026-04-07 22:11:31 -07:00
Chris Lu 0da1794856 fix(rust): remove transitive openssl dependency from seaweed-volume
reqwest's default features include native-tls which depends on
openssl-sys, causing builds to fail on musl targets where OpenSSL
headers are not available. Since we already use rustls-tls, disable
default features to eliminate the openssl-sys dependency entirely.
2026-04-04 14:07:01 -07:00
Chris Lu 9add18e169 fix(volume-rust): fix volume balance between Go and Rust servers (#8915)
Two bugs prevented reliable volume balancing when a Rust volume server
is the copy target:

1. find_last_append_at_ns returned None for delete tombstones (Size==0
   in dat header), falling back to file mtime truncated to seconds.
   This caused the tail step to re-send needles from the last sub-second
   window. Fix: change `needle_size <= 0` to `< 0` since Size==0 delete
   needles still have a valid timestamp in their tail.

2. VolumeTailReceiver called read_body_v2 on delete needles, which have
   no DataSize/Data/flags — only checksum+timestamp+padding after the
   header. Fix: skip read_body_v2 when size == 0, reject negative sizes.
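
Both fixes hinge on treating size 0 as a valid delete tombstone rather
than an error. Expressed in Go for illustration (the actual change is
in the Rust server, and every name below is hypothetical):

```go
package main

import "fmt"

type tailAction int

const (
	readBody     tailAction = iota // regular needle: parse the body
	skipBody                       // delete tombstone: only checksum+timestamp+padding follow
	rejectNeedle                   // negative size: invalid input
)

// classifyTailNeedle decides how a tail receiver treats a needle header.
func classifyTailNeedle(size int32) tailAction {
	switch {
	case size < 0:
		return rejectNeedle
	case size == 0:
		return skipBody
	default:
		return readBody
	}
}

// hasAppendTimestamp mirrors the first fix: only negative sizes lack a
// usable timestamp; size 0 still carries one in the needle tail.
func hasAppendTimestamp(size int32) bool {
	return size >= 0
}

func main() {
	fmt.Println(classifyTailNeedle(0) == skipBody, hasAppendTimestamp(0)) // true true
}
```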

Also:
- Unify gRPC server bind: use TcpListener::bind before spawn for both
  TLS and non-TLS paths, propagating bind errors at startup.
- Add mixed Go+Rust cluster test harness and integration tests covering
  VolumeCopy in both directions, copy with deletes, and full balance
  move with tail tombstone propagation and source deletion.
- Make FindOrBuildRustBinary configurable for default vs no-default
  features (4-byte vs 5-byte offsets).
2026-04-04 09:13:23 -07:00
Chris Lu 995dfc4d5d chore: remove ~50k lines of unreachable dead code (#8913)
* chore: remove unreachable dead code across the codebase

Remove ~50,000 lines of unreachable code identified by static analysis.

Major removals:
- weed/filer/redis_lua: entire unused Redis Lua filer store implementation
- weed/wdclient/net2, resource_pool: unused connection/resource pool packages
- weed/plugin/worker/lifecycle: unused lifecycle plugin worker
- weed/s3api: unused S3 policy templates, presigned URL IAM, streaming copy,
  multipart IAM, key rotation, and various SSE helper functions
- weed/mq/kafka: unused partition mapping, compression, schema, and protocol functions
- weed/mq/offset: unused SQL storage and migration code
- weed/worker: unused registry, task, and monitoring functions
- weed/query: unused SQL engine, parquet scanner, and type functions
- weed/shell: unused EC proportional rebalance functions
- weed/storage/erasure_coding/distribution: unused distribution analysis functions
- Individual unreachable functions removed from 150+ files across admin,
  credential, filer, iam, kms, mount, mq, operation, pb, s3api, server,
  shell, storage, topology, and util packages

* fix(s3): reset shared memory store in IAM test to prevent flaky failure

TestLoadIAMManagerFromConfig_EmptyConfigWithFallbackKey was flaky because
the MemoryStore credential backend is a singleton registered via init().
Earlier tests that create anonymous identities pollute the shared store,
causing LookupAnonymous() to unexpectedly return true.

Fix by calling Reset() on the memory store before the test runs.

* style: run gofmt on changed files

* fix: restore KMS functions used by integration tests

* fix(plugin): prevent panic on send to closed worker session channel

The Plugin.sendToWorker method could panic with "send on closed channel"
when a worker disconnected while a message was being sent. The race was
between streamSession.close() closing the outgoing channel and sendToWorker
writing to it concurrently.

Add a done channel to streamSession that is closed before the outgoing
channel, and check it in sendToWorker's select to safely detect closed
sessions without panicking.
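
A reduced sketch of the done-channel guard follows. It simplifies the
description above in one respect, which is an assumption of the
example: the outgoing channel is never closed at all, so closing done
is the only shutdown signal. The send and close method names are
hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

type streamSession struct {
	outgoing  chan string
	done      chan struct{}
	closeOnce sync.Once
}

func newStreamSession(buffer int) *streamSession {
	return &streamSession{
		outgoing: make(chan string, buffer),
		done:     make(chan struct{}),
	}
}

// close signals shutdown. sync.Once makes repeated calls safe.
func (s *streamSession) close() {
	s.closeOnce.Do(func() { close(s.done) })
}

// send reports whether msg was queued. It checks done first so a closed
// session is rejected even when the buffer still has room, then selects
// on both channels so a blocked send is released by a later close.
func (s *streamSession) send(msg string) bool {
	select {
	case <-s.done:
		return false
	default:
	}
	select {
	case <-s.done:
		return false
	case s.outgoing <- msg:
		return true
	}
}

func main() {
	s := newStreamSession(1)
	fmt.Println(s.send("a")) // true
	s.close()
	s.close() // second close is a no-op
	fmt.Println(s.send("b")) // false
}
```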
2026-04-03 16:04:27 -07:00
Chris Lu bb23939b36 fix(volume-rust): resolve gRPC bind address from hostname
SocketAddr::parse() only accepts numeric IPs, so binding the gRPC
server to "localhost:18833" panicked. Use tokio::net::lookup_host()
to resolve hostnames before passing to tonic's serve_with_shutdown.
2026-04-02 18:36:45 -07:00
Chris Lu 2a6f27eb08 Suppress unused_mut warning for admin_router on non-unix builds 2026-04-01 23:20:40 -07:00
Chris Lu 08f48e62c9 Fix missing std::io::Read import for Windows build in ec_encoder
The #[cfg(not(unix))] fallback path uses f.read() which requires
the Read trait to be in scope.
2026-04-01 23:16:11 -07:00
Chris Lu e29b685c20 Gate pprof dependency behind cfg(unix) to fix Windows build
The pprof crate uses Unix-only APIs (nix, libc::pthread_t,
libc::siginfo_t, etc.) that don't exist on Windows. Move it to
[target.'cfg(unix)'.dependencies] and gate all profiling/debug
module usage with #[cfg(unix)].
2026-04-01 21:32:24 -07:00