mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-06-13 23:36:45 +03:00
master
52 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
f724828bcb |
fix(ec): never delete recoverable EC shards on startup/reconcile (the non-empty-.dat sibling of the stub bug) (#9941)
* fix(ec): never delete recoverable shards on startup/reconcile (size-direction + byte-exact .dat)
EC startup validation and the cross-disk reconcile could delete the only
copy of distributed-EC shards whenever a non-empty .dat sat beside them.
This is the same data-loss class as the empty-.dat-stub fix, now for a
real (non-empty) stale or partial .dat.
validateEcVolume: the discriminating signal is the shard size relative to
the .dat's full encode, not the shard count.
- shards smaller than expected: an interrupted local encode left partial
shards and the .dat is the complete source -> reclaim the .dat.
- shards equal to expected: a valid (or still-distributing) EC volume ->
keep; the shards may be the only copy.
- shards larger than expected: the .dat is the stale/partial side (e.g. an
interrupted decode left a half-written .dat next to the real shards) ->
keep.
Previously any size mismatch, a low shard count beside a .dat, or a
transient stat error returned "delete", wiping sole-copy shards. Now every
ambiguity (size mismatch in either direction, inconsistent shard sizes,
transient I/O error, partial shard set) keeps the data; only a credible
full source .dat with no partial set to lose is reclaimed.
handleFoundEcxFile: a shard load failure (corrupt/locked .ecx, EMFILE
during a mass restart, transient I/O) no longer deletes the EC files when a
.dat exists -- it only unloads and keeps the files for retry. All deletion
authority now flows through validateEcVolume.
pruneIncompleteEcWithSiblingDat: count shards NODE-WIDE (a set split across
sibling disks summing to >= dataShards is independently recoverable and is
left alone), and require the sibling .dat to byte-exactly match the size
.vif recorded at encode time before deleting -- the prior "at least this
big, or bigger than a superblock" gate could trust a stale .dat and wipe
sole-copy shards. EC encode records the source size in .vif, so this gate
works for real volumes; older volumes without it fail safe (kept).
Rust volume server mirrors all of the above: size-direction + keep-on-
ambiguity in validate_ec_volume, keep-on-load-failure in
handle_found_ecx_file, and the node-wide + byte-exact gate in the prune.
The Rust validate/prune paths now resolve the data-shard count from the
volume's own .vif instead of hardcoding 10+4, so custom-ratio volumes are
not mis-sized and wrongly deleted on reboot.
Existing tests that encoded the old (unsafe) "delete on low count / size
mismatch" behavior are updated to the safe expectation, and new regression
tests cover the partial-decode-.dat-keeps-shards and transient-error-keeps
cases (Go and Rust); they fail on the pre-fix code.
* fix(ec): record DatFileSize in planted EC .vif for the prune test; trim comments
The multi-disk lifecycle e2e test planted a partial EC leftover with an
empty .vif, so the byte-exact prune gate (which a real encoded volume
satisfies via its recorded source size) kept it instead of cleaning up.
Record DatFileSize + the EC ratio in the planted .vif, matching production.
Also condense the verbose comments added in this change to the repo's
concise style.
|
||
|
|
18cdb3819b |
fix(ec): crash-safe ecx-journal fold and shard rebuild (fsync before publish, no short-read-as-success) (#9938)
* fix(ec): make ecx-journal fold and shard rebuild crash-safe Two EC rebuild paths could silently lose or corrupt data: RebuildEcxFile folded the .ecj deletion journal into .ecx (in-place WriteAt tombstones) and then unlinked the journal without flushing the .ecx writes first. A crash could persist the unlink ahead of the tombstones, resurrecting deleted needles on the next load. It also read journal records with a bare n!=size break, so a torn tail silently dropped the remaining tombstones before the unlink. Now: read records with io.ReadFull (io.EOF ends cleanly, a torn tail aborts and leaves .ecj in place for retry), fsync .ecx before removing the journal. rebuildEcFiles treated a zero/short ReadAt as a clean end-of-input and discarded the read error, so a truncated or unreadable input shard produced truncated regenerated shards that were then published as restored redundancy; the regenerated shards were also never fsynced on the no-sidecar path. Now: derive the expected shard size from the present inputs up front (rejecting a divergent/zero-size input), drive the loop by that size, fail on any short read or short write, and fsync every regenerated shard before it is mounted/renamed. Rust volume server mirrors the rebuild fix: rebuild_ec_files now checks the read_at byte count (it previously discarded it, the same truncation bug). The Rust ecx fold already synced .ecx before removing the journal. Custom EC ratios are unaffected: the shard size derives from the input shards and the loop uses the .vif-resolved data/parity counts, never a hardcoded 10+4. * storage: close ecx journal files via defer in RebuildEcxFile Per review: a single deferred Close per file replaces the per-error-path manual closes, so new early returns cannot leak descriptors. The journal is still closed explicitly before its unlink since Windows cannot delete an open file; the deferred second Close is a harmless no-op. |
||
|
|
34f9b91d69 |
fix(storage): never let an empty .dat delete healthy distributed EC shards (#9930)
* fix(storage): never let an empty .dat delete healthy distributed EC shards A leftover empty .dat stub (a phantom from the pre-fix loader; zero needles) next to a distributed EC volume's local shards made startup classify the volume as an interrupted local encode: validateEcVolume requires >= dataShards local shards when a .dat is present, fails with the 1-2 shards a distributed volume keeps per disk, and the cleanup deletes those shards -- the only copies of that part of the volume. Repeated across restart waves this destroys enough shards cluster-wide to make the volume unrecoverable. Go: - loadExistingVolume: hoist the empty-stub sweep above the EC presence checks. Previously the .vif-next-to-.ecx guard returned before the sweep ever ran, so exactly the dangerous layout (stub + .ecx + local shards) kept its stub and then lost its shards in loadAllEcShards. - validateEcVolume / checkDatFileExists: treat a .dat <= a superblock (zero needles) as absent. An empty .dat cannot be the encode source, so it must never gate shard deletion; this also covers stubs without a .vif, which the sweep cannot prove are EC leftovers. Rust mirror (seaweed-volume): the same gate in validate_ec_volume and check_dat_file_exists (the Rust sweep already ran before validation); the volume-load skip keeps a plain existence check so fresh, needle-less volumes still load. Regression tests in Go and Rust reproduce the production layout (a zero-byte .dat beside .ecx/.ecj and two shards of a 10+4 volume, with and without a .vif) and fail without the fix with the shards deleted. * fix(ec): gate source volume deletion on a recoverable shard set After EC encode, the shell command and the (plugin) worker task refused to delete the source volume unless every shard was present, and aborted otherwise -- leaving the source .dat next to live shards, exactly the mixed state the startup cleanup mishandles. Replace the full-set requirement with a recoverability gate shared by both callers (RequireRecoverableShardSet): deleting a non-empty source .dat requires at least dataShards distinct shards cluster-wide. Below that the source is kept and the encode fails as before. A degraded but recoverable set (>= dataShards, < total) now proceeds with a warning instead of aborting: the missing shards can be rebuilt from the survivors, while keeping the source would preserve the dangerous mixed state. Empty stub replicas are still swept unguarded (OnlyEmpty) -- an empty .dat has nothing to lose. dataShards/totalShards stay parameters so enterprise custom EC ratios share the helper verbatim. * test(ec): use recoverable shard verification gate |
||
|
|
4f8af455bf |
feat(storage): sweep leftover empty EC .dat stubs on volume server startup (#9927)
* feat(storage): sweep leftover empty EC .dat stubs on volume server startup An EC volume keeps no local .dat. The pre-fix loader left empty 8-byte superblock .dat stubs next to EC metadata (one per lone .vif). Left in place each loads as a phantom empty volume, and the same vid's stub on two disks of one server blocks Rust startup via the duplicate-vid check in Store::add_location -- the prior fix stops creating new stubs but does not clean up existing ones. On startup, when a .dat is empty (<= a superblock, i.e. zero needles) and its .vif marks the volume erasure-coded, remove the stub (+ empty .idx) instead of loading it. The real data is in the EC shards, so the empty stub holds nothing to lose. Non-EC empty .dat files (e.g. freshly allocated volumes) are left alone. Done in both Rust (load_existing_volumes) and Go (loadExistingVolume), with regression tests that fail without the sweep. * refactor(storage): extract empty EC .dat stub sweep into its own function Move the startup stub-sweep into remove_empty_ec_dat_stub (Rust) and removeEmptyEcDatStub + vifIsEcVolume (Go) for clearer logic, and look up the .vif in both the data and idx directories (each read at most once) so a stub is still found when -dir.idx is configured. Adds direct tests for the idx-directory lookup on both engines. |
||
|
|
79ac279fe1 |
fix(ec): don't mix EC shards from different encode runs (#9880)
* feat(ec): add encode_ts_ns to EC shard metadata and the shard read RPC EcShardConfig and VolumeEcShardReadRequest gain an int64 encode_ts_ns (encode time in unix nanos). It rides in .vif and the read request so a read can be scoped to the encode run that produced the index. * fix(ec): stamp each encode and reject cross-run shard reads Generate stamps EncodeTsNs into the volume's .vif. Reads carry it to the shard's owning volume (resolved together via FindEcVolumeWithShard, so a multi-disk server validates the disk that actually serves the bytes) and reject a shard from a different encode run, recovering from parity. A zero on either side (pre-upgrade volume) skips the guard. * fix(ec): stamp the encode identity on the worker-generated .vif The worker-local encode path now writes EncodeTsNs (and the resolved EC ratio) into the .vif, so the read guard is not silently off for volumes encoded by the maintenance worker. * fix(ec): wipe stale EC artifacts before re-encoding VolumeEcShardsGenerate evicts any in-memory EcVolume for the volume and removes its on-disk shard/index/sidecar files before writing fresh ones, so a retried encode never builds on a partial prior run and the unlink frees the inodes instead of leaving open fds serving old bytes. * fix(ec): unmount EC shards across all disks UnmountEcShards walked only the first disk holding the shard, leaving a duplicate copy mounted on a sibling disk (split-disk reconciled volumes) still serving and heartbeating. Traverse every disk and emit one deletion delta per disk. * fix(ec): delete orphan shards without a local .ecx deleteEcShardIdsForEachLocation gated shard-file removal on a local .ecx, so it could not clean an orphan .ecNN left by a failed copy on a disk with no index. Delete the requested shard files unconditionally; the index-file (.ecx/.ecj/.vif) routing stays gated as before. * fix(ec): clear stale EC shards cluster-wide before re-encoding ec.encode unmounts and deletes EC shards for the target volumes on every node before regenerating: fatal for the shards the topology reports (mounted leftovers), best-effort for the rest (a sweep that catches unmounted failed-copy orphans). A down node is a no-op. * fix(ec): don't nil EC fds on close so reads can't race eviction A reader resolves an EcVolume/shard under the lock then reads after it is released, so an eviction that nils ecxFile/ecdFile would race that read and panic. Close the fds without nilling the fields: the field is now write-once (no data race) and a concurrent read hits a closed fd, getting a clean error that the caller recovers from parity. * fix(ec): wipe stale EC artifacts on every disk and surface failures The pre-encode wipe only deleted beside the source volume, so a stale shard on a sibling disk survived and could be mounted against the new index at reconcile. Sweep every disk. Removal also ignored os.Remove errors, reporting a failed cleanup as success and letting a stale shard join the next generation; surface the first real failure (treating already-gone as success) from removeStaleEcArtifacts and the shard delete. * fix(ec): log when a local shard is skipped for a different encode run The cross-run guard returned errShardNotLocal, indistinguishable in logs from a genuinely-absent shard. Add a V(1) line naming both EncodeTsNs so operators can tell "wrong encode generation" from "shard not here". * fix(ec): surface metadata removal failures in the shard delete path deleteEcShardIdsForEachLocation still dropped os.Remove errors on the .ecx/.ecj/.vif/sidecar cleanup. A surviving stale .ecx is the orphan-index condition this path prevents, so route those through removeFileIfExists and return the first real failure instead of reporting cleanup as success. * fix(ec): fail orphan cleanup when a reachable node's delete fails The pre-encode orphan sweep swallowed every error for unreported (node, volume) pairs. That is only safe for an unreachable node, which cannot receive this encode's new generation. A reachable node whose delete genuinely failed (permission/IO) keeps an orphan shard that a later copy re-stamps with the new run's volume-level .vif identity, so the read guard would accept stale data. Surface those; stay best-effort only for unreachable nodes (gRPC Unavailable / no status). * fix(ec): guard ecjFile under its lock in the EC delete path EcVolume.Close nils ecjFile under ecjFileAccessLock; a delete that resolved its .ecx lookup before a concurrent eviction (the generate-time UnloadEcVolume) could then reach the journal append with a nil fd. Bail with a clear "volume closed" error under the lock instead. * fix(ec): reject an unstamped shard when the caller has an encode identity The read guard required both identities nonzero, so a current (stamped) caller accepted a holder with identity 0 and could be served a stale pre-upgrade shard. Reject when the caller is stamped and the holder differs (including unstamped); stay lenient only when the caller itself has no identity (pre-upgrade reader). A skipped shard recovers from parity. * fix(ec): full-teardown delete so cluster cleanup wipes a whole generation The pre-encode cluster sweep deleted only the listed canonical shards on remote nodes, leaving index/sidecar (and, on builds with versioned generations, those too) behind. Add a full_teardown flag to VolumeEcShardsDelete that evicts the volume and wipes every EC artifact for it on every disk via removeStaleEcArtifacts; the shell and worker pre-encode cleanup paths set it. Other delete callers (balance/decode/repair) are unchanged. * fix(ec): take ecjFileAccessLock before the nil-check in Sync and Close Sync and Close read ev.ecjFile before acquiring ecjFileAccessLock while Close nils it under the lock, a data race on the field. Take the lock first, then nil-check inside, in both. * fix(ec): acknowledge full_teardown so a pre-upgrade server can't fake success An old volume server silently ignores full_teardown and returns success for an ordinary delete, so the caller wrongly believes the generation was wiped and copies a fresh gen-0 onto an unwiped node. Echo full_teardown_done in the response; the worker destination cleanup fails when it is absent, and the shell cluster sweep fails for a reported (mounted) leftover while staying best-effort for an unreported node. encode_ts_ns stays an accepted transient (an old server just skips the new read guard, no regression). * fix(ec): fail the pre-encode sweep for any reachable node that can't ack teardown A reachable pre-upgrade server ignores full_teardown and returns success without wiping an orphan, which a later copy then folds into the new generation. Treat a missing full_teardown_done ack as fatal for every reachable node (best-effort only for a gRPC-unreachable one), not just for topology-reported pairs. * fix(ec): return the served shard identity and validate it client-side The encode identity was only enforced server-side, so a pre-upgrade server ignored the request field and served bytes unchecked. Echo the served shard's EncodeTsNs on every read response chunk and have the client reject a mismatch (including 0 from an old server), so the guard holds regardless of server version; a rejected read recovers from parity. * fix(ec): reject a short/empty remote shard read instead of serving zeros doReadRemoteEcShardInterval accepted an immediate EOF or a short stream and returned success with a partly zero-filled, unvalidated buffer (the server stamps the identity only on chunks that carry bytes). A non-deleted interval must arrive whole: require n == len(buf), exempting the is_deleted short-circuit (n=0), matching readLocalEcShardInterval's local check. A short read now fails so the caller recovers from parity. * test(ec): fake volume server echoes the full_teardown acknowledgement The worker now fails a teardown delete that isn't acknowledged (so a pre-upgrade server can't silently skip the wipe). The fake server's no-op VolumeEcShardsDelete returned an empty response, which the worker read as a skipped teardown and aborted the encode. Echo full_teardown_done. * feat(ec): mirror the encode-run identity guard + full_teardown into the Rust volume server The Go volume server stamps an encode-run identity (encode_ts_ns) into the .vif and rejects a read served from a shard of a different run; full_teardown wipes a whole generation and acknowledges it. The Rust volume server had none of it. Mirror the shared logic: load encode_ts_ns from the .vif onto the EcVolume, stamp it on every read response, and reject a request/response mismatch on both the server and the distributed-read client (recovering from parity); handle full_teardown by evicting the volume and wiping every EC artifact on each disk, echoing full_teardown_done so the caller can detect a server that ignored it. * fix(ec): remove a stale .vif on full teardown of a shard-only node A shard copy installs shards + .ecx before .vif, so an interrupted copy after a teardown could mount the new files under the previous run's identity / version / shard ratio / dat_file_size carried by the surviving .vif. Remove .vif during full teardown, gated on .idx absence so a source-volume holder keeps its live .vif. In Rust this lives in a teardown-only helper so the reconcile / load- fallback paths (which share the base removal) still preserve .vif. * fix(ec): treat a missing teardown ack as fatal, not as an unreachable node isNodeUnreachable returned true for any non-gRPC-status error, so a reachable pre-upgrade server's missing full_teardown_done ack (a plain error) was classified unreachable and the unreported pair was silently skipped. Classify only a real codes.Unavailable as unreachable, and wrap the missing ack in a sentinel the sweep treats as fatal regardless. A genuinely down node still surfaces as Unavailable from the RPC and stays best-effort. * fix(ec): reject a short shard read in the local EC needle reader read_ec_shard_needle ignored the byte count from shard.read_at and appended the whole pre-sized buffer, so a truncated shard's zero-filled tail passed the later length check and parsed as garbage. Require n == buf.len() per interval, erroring on a short read like the local interval reader already does. * fix(ec): probe reachability before skipping a node that returns Unavailable The pre-encode sweep skipped any node whose teardown delete returned codes.Unavailable, but a reachable volume server in maintenance mode also returns that code for the maintenance-gated delete, so its stale EC files were left behind on a node that can still receive the new generation. Confirm with a non-maintenance-gated empty-target Ping: skip only when the node fails the probe too (genuinely unreachable). * fix(ec): use try_exists for the teardown .vif .idx guard The teardown-only .vif removal gated on Path::exists(), which returns false on a permission/IO stat error, so a stat failure on a present .idx would read as a shard-only node and delete the live source volume's .vif. Gate on try_exists() == Ok(false) instead, preserving the sidecar on any stat error. * fix(ec): only skip a sweep node when a Ping confirms it is transport-down The pre-encode sweep skipped a node whenever its teardown delete and a liveness Ping both failed, but it treated ANY Ping error as down — an application-level Internal/ResourceExhausted, or Unimplemented from a pre-Ping server, left a reachable node's stale generation in place. Classify the Ping tri-state and skip only when it transport-fails with codes.Unavailable; a reachable or inconclusive node stays fatal. * fix(ec): exclude sweep-skipped nodes from the encode's rebalance The pre-encode sweep skips a genuinely-down node best-effort, but the rebalance then recollected the current topology — a node that recovered between the two could become a copy target and receive the new generation while still holding its stale, never-cleared shards. Have the sweep return the skipped set and exclude those nodes from the rebalance for this encode, so a node we could not clean cannot receive the new generation. Standalone ec.balance is unaffected. * fix(ec): re-sweep recovered nodes before generation so they aren't stranded A node skipped as down by the pre-encode sweep is excluded from the rebalance, but it can recover and become the generation host — mounting all shards locally, then being excluded from distribution. Union-only verification accepts all shards on one node and deletes the originals: a single point of failure. Re-sweep the skipped nodes just before generation; one whose teardown now succeeds leaves the skipped set and rebalances normally, while a node still down stays skipped. * fix(ec): abort the encode if a selected source is still skipped after re-sweep The re-sweep un-skips a recovered node, but the source was selected before it and a node can stay down through the re-sweep then recover just in time to be the generation host — mounting all shards locally while still excluded from the rebalance, which union-only verification accepts before deleting the originals. Abort the encode when a selected source remains skipped after the re-sweep. * fix(ec): batch delete returns retriable 503 when a volume became EC mid-batch If a volume is not EC at the batch-delete classification but is encoded to EC and its .dat deleted before the regular-volume mutation, the mutation returns an exact "not found" that the filer chunk-GC treats as completed, dropping the delete. Recheck EC presence under the mutation lock and return a retriable 503 with the "try again" token so the filer requeues it onto the EC path. * fix(ec): recheck EC state before the regular batch-delete mutation ec.encode mounts EC shards (copied from the .dat) before deleting the originals, so a volume can be EC while its .dat still exists. The batch delete only rechecked EC after a NotFound, so a successful regular-volume delete in that window wrote a tombstone to the soon-removed .dat — the delete was lost and the needle resurrected from the pre-tombstone shards. Recheck has_ec_volume under the write lock before delete_volume_needle and return a retriable 503 so the filer requeues onto the EC path. * fix(volume): make the metrics push test independent of test order test_push_metrics_once asserted the pushed body contains the request-counter family without ever touching the counter — a CounterVec with no children emits nothing, so the assertion only held when another test had already created a labelset in the shared registry. Create one in the test itself. |
||
|
|
1c9039d3ac |
fix(seaweed-volume): stop EC shard deletion from phantom .dat on restart (#9874)
* fix(seaweed-volume): stop EC shard deletion from phantom .dat on restart On startup load_existing_volumes() scans .vif/.idx entries (not just .dat). For distributed EC, a volume's .vif can be mirrored onto a disk whose .ecx lives on a sibling disk, so the per-disk ecx check is false and the loader falls through to Volume::new, which always creates the .dat if missing -> a phantom 8-byte superblock stub. The store-level prune_incomplete_ec_with_sibling_dat then treats that stub as the authoritative source and deletes the real EC shards on sibling disks. Go guards the same case (disk_location.go: 'Without this guard NewVolume below would create a phantom empty .dat') but only same-disk. Fix A (root cause): in load_existing_volumes, don't create a .dat during load. Skip the entry when there is no local .dat AND the .vif does not reference remote files -- remote-tiered volumes have no local .dat but must still load via the remote path. Uses the robust check_dat_file_exists helper so a transient stat error doesn't skip a real volume. New volumes go through create_volume(). Covers the cross-disk .vif/.ecx split Go's same-disk hasEcxFile() misses. Fix B (defense in depth, Go + Rust): when the EC .vif records no source size (dat_file_size==0), require the sibling .dat to be strictly larger than a bare superblock, so an empty 8-byte stub can't pass the credibility gate. Previously it fell back to SUPER_BLOCK_SIZE, which an 8-byte stub exactly meets. Adds regression tests reproducing the cross-disk lone-.vif phantom and the 8-byte stub gate; updates an existing prune test to use a real collection so its .ecx lookup matches the loaders. * fix(storage): don't create phantom .dat from lone .vif on Go volume load Mirror Fix A on the Go side. loadExistingVolume scans .vif/.idx entries, and for distributed EC a .vif can be mirrored onto a disk whose .ecx is on a sibling disk. The same-disk hasEcxFile() guard does not fire there, so the loader falls through to NewVolume(createDatIfMissing=true) and writes an 8-byte phantom .dat, which the sibling-.dat prune then uses to delete the real EC shards on sibling disks. Skip the entry when there is no local .dat AND the .vif has no remote file (via MaybeLoadVolumeInfo); remote-tiered volumes have no local .dat but must still load. Adds TestLoneVifDoesNotCreatePhantomDat (fails without the guard) and TestRemoteTier_DiskScanLoadsRemoteOnlyVolume (fails if the guard skips a remote-only volume). |
||
|
|
ab7be7867d |
security: hot-reload JWT signing keys on SIGHUP (#9826)
* security: reload JWT signing keys on SIGHUP Signing keys were read once in the server constructors and never refreshed. After a key rotation (Secret update, divergent reads) the in-memory key stayed stale and every request kept failing "wrong jwt" until the affected process was restarted. Add Guard.UpdateSigningKeys and call it from the master, volume and filer reload paths and the s3 reload hook, next to the existing whitelist refresh. Make the global chunk-read JWT cache reloadable via an atomic swap, and register the master's Reload with grace.OnReload -- it was never wired, so the master ignored SIGHUP entirely. Mirror the same refresh in the Rust volume server's SIGHUP handler. * security: swap signing keys behind an atomic pointer Addresses review feedback on the in-place key swap: SigningKey is a []byte, so reassigning the Guard fields while a request handler reads them is a data race that can tear the multi-word slice header and read out of bounds. Hold the four signing-key fields in an immutable signingConfig snapshot behind atomic.Pointer; UpdateSigningKeys swaps the whole pointer, so a reader sees either the old keys or the new ones. Reads go through new SigningKey/ExpiresAfterSec/ReadSigningKey/ReadExpiresAfterSec accessors. The Rust guard is already safe: every read and the SIGHUP write go through the shared RwLock<Guard>. * security: fold whitelist + auth state into the atomic snapshot Review follow-up. UpdateSigningKeys still wrote isWriteActive while the request path read it (and the whitelist maps) unsynchronized, so a SIGHUP under load could expose an inconsistent mix of activation bits and whitelist contents. Move all hot-reloadable Guard state -- keys, expirations, whitelist, and the activation flags -- into a single immutable guardState swapped behind one atomic.Pointer. The Update* methods take a small mutex to serialize the read-modify-write; readers stay lock-free. The concurrency test now also rotates the whitelist and probes IsWhiteListed under -race. Also read each signing key once per branch in the volume/filer JWT auth checks, so a reload landing mid-check can't take the allow-fast-path after auth was enabled or verify against a different key than the branch saw. |
||
|
|
e264e9883e |
fix(seaweed-volume): bound request body and stored-content expansion to prevent OOM under load (#9780)
* fix(seaweed-volume): bound request body and stored-content expansion to prevent OOM The Rust volume server buffered the entire upload body with to_bytes(usize::MAX) and only checked the file-size limit afterward, so a single large upload — or many concurrent uploads, since the in-flight byte throttle defaults to 0 (unlimited) — could exhaust memory and get the process OOM-killed under load. The read path had two more single-request OOM vectors: `vec![0u8; manifest.size]` allocated from an attacker-controlled chunk-manifest size, and gzip decompression was unbounded (gzip bomb). - Bound the upload body read by file_size_limit_bytes (plus a margin for multipart framing), mirroring Go's io.LimitReader(sizeLimit+1), and reject oversize before the whole body is buffered. - Validate manifest.size (reject negative / oversized) before allocating. - Cap gzip output in maybe_decompress_gzip and route the inline GzDecoder sites through it. * fix(seaweed-volume): address review - chunk offset, 32-bit cast, decompress errors - Validate chunk.offset before indexing in chunk-manifest expansion: a negative offset wrapped to a huge usize and underflowed `end - offset` (panic from a crafted manifest). Reject negative, skip out-of-range, use saturating math. - Use usize::try_from for the upload body limit instead of `as usize`, so a >usize::MAX file_size_limit on 32-bit caps at usize::MAX rather than silently truncating to a tiny value. - maybe_decompress_gzip now returns Result<_, GunzipError> distinguishing a decode failure (callers fall back to raw bytes, as before) from hitting the size cap (TooLarge), which now returns 413 instead of silently serving the still-compressed bytes. * fix(seaweed-volume): inflate manifest chunks into the result window to cap peak memory The chunk-manifest expansion still doubled memory: `result` was already allocated at manifest.size (<=2 GiB) and each compressed chunk was inflated into a separate Vec (also up to 2 GiB), so a single request could peak near 4 GiB. Decompress compressed chunks directly into their result[offset..] window (bounded by the remaining space) so a chunk never allocates a second large buffer; peak stays at ~manifest.size. Bytes past the window are dropped (matching the prior truncation), and a fully-undecodable chunk still falls back to its raw bytes. * fix(seaweed-volume): fall back to raw chunk bytes on any decode failure Per review: the gzip fallback must run on any decode error, not only when no bytes were decoded. Clear the partially-written output and copy the chunk's raw bytes (truncated to the window), restoring the prior decode-failure behavior. |
||
|
|
dfa86b4313 |
volume: keep volume writable after a deletion-tail compaction (#9776)
makeupDiff replays post-snapshot changes onto the compacted volume. For a replayed deletion it appended a tombstone to the new .dat but recorded the .idx entry with offset 0. When that deletion is the last replayed change the tombstone lands at the .dat tail, and the post-commit integrity check skips offset-0 entries, so it sees 32 trailing bytes it can't account for and flips the volume read-only, reloading it as a SortedFileNeedleMap instead of the writable map. Record the tombstone's real .dat offset, matching the normal delete path; the needle map still treats it as deleted off the negative size, so lookups are unchanged. Mirror the same fix into the Rust volume server. |
||
|
|
05c6500453 |
volume: fix maxVolumeCount dead zone that stalled writes on auto-sized disks (#9755)
* volume: don't drop the last writable slot on auto-sized disks MaybeAdjustVolumeMax subtracted 1 from the per-disk slot count, so a disk with room for exactly one volume (free between 1x and 2x the size limit) reported 0 slots. The master then never grew a writable volume and every assign drained its retry budget, so writes failed with context deadline exceeded. Count the full volumes that actually fit, floored at one for an auto-sized disk that has free space. * mini: show disk and volume capacity in the startup banner Print free space, volume size, total volume count and free volume count under the data directory line, so a volume size limit that outstrips the disk is visible at startup instead of surfacing later as failed writes. |
||
|
|
3674f9d04d |
fix(storage): keep EC .vif when deleting a coexisting regular volume (#9723)
* fix(storage): keep EC .vif when deleting a coexisting regular volume A regular volume and an EC volume for the same id share <base>.vif. When EC shards are distributed onto a server that still holds the regular volume — the encode source, or any replica the planner targets — the post-encode VolumeDelete ran removeVolumeFiles and stripped the shared .vif, leaving the freshly built EC volume without its info file. Skip the .vif in removeVolumeFiles when an EC volume for the same id exists on the disk (mounted, or a sealed .ecx on disk). The regular volume's .dat/.idx still go; the EC sidecars survive. A two-server end-to-end test encodes a volume whose source and a stub replica both also receive shards, and asserts the final on-disk layout: both .dat/.idx gone, each server holding only its assigned shards plus .ecx/.vif. Storage unit tests cover the with-EC and no-EC cases, and the Rust seaweed-volume port carries the same guard and tests. * test(storage): assert .idx is removed in the no-EC destroy case Strengthen TestDestroyRemovesVifWhenNoEc to confirm the full regular volume cleanup (.dat, .idx, .vif) when no EC volume coexists. |
||
|
|
77dcb20a74 |
writeJson: drop unused JSONP branch (#9686)
* writeJson: drop unused JSONP branch No in-tree caller uses ?callback=. Always serve application/json with X-Content-Type-Options: nosniff. * seaweed-volume: drop unused JSONP branch Mirror Go: always serve application/json with X-Content-Type-Options: nosniff. * writeJson: drop unreachable StatusNotModified check bodyAllowedForStatus already returns early for 304. * test/volume_server: rename and rewrite JSONP test to assert callback is ignored CI: /status?callback=myFunc now returns plain application/json with X-Content-Type-Options: nosniff. |
||
|
|
5b42287c22 |
fix(storage): surface stat error on zero-size idx scrub, mirror to rust (#9612)
fix(storage): harden zero-size idx scrub and mirror to rust When a zero-size .idx is found, openIndex stats the backing .dat through v.DataBackend: wrap that GetStat failure with %w, fix the indices typo, and guard both openIndex and scrubVolumeData against a nil DataBackend (closed or remote-only volumes) instead of panicking. Add rust scrub tests for empty (superblock-only .dat, zero-size .idx) and healthy volumes, keeping the volume server in parity with the go zero-size scrub handling. |
||
|
|
cfc08fbf6c |
fix(volume): tombstone integrity check no longer flips volumes read-only (fixes #9563) (#9565)
* fix(volume): pass on-disk tombstone size to ReadData in verifyDeletedNeedleIntegrity verifyDeletedNeedleIntegrity was forwarding TombstoneFileSize (-1) into Needle.ReadData. A deletion tombstone is appended to .dat with DataSize=0 so the on-disk needle header carries Size=0; TombstoneFileSize is only the .idx sentinel for "this entry is deleted" and is never written into a needle header. ReadBytes' size check therefore mismatched on every tombstone (-1 != 0), returned ErrorSizeMismatch, and triggered the 4-byte-offset wrap-around retry in ReadData (offset + 32 GB). On any volume large enough that offset+32 GB exceeds dat fileSize the retry read EOF, CheckVolumeDataIntegrity reported corruption, and the loader set noWriteOrDelete = true. Every volume whose last 10 .idx entries included a deletion went read-only on startup — i.e. any healthy volume where the most recent operations included a delete. Pass Size(0) so the size check matches the on-disk tombstone header. Add a regression test that writes three needles, deletes one, and asserts CheckVolumeDataIntegrity succeeds with a tombstone at the .idx tail. Without this fix the test reproduces the exact log shape from the bug report: read 0 dataSize 32 offset <orig+32GB> fileSize <much smaller>: EOF verifyDeletedNeedleIntegrity ...idx failed: read data [N,N+32) : EOF The Rust port guards its integrity-check size comparison with !size.is_deleted() (seaweed-volume/src/storage/volume.rs) and never hits this path, so no Rust mirror change is needed. * test(seaweed-volume): mirror Go regression for deletion-tombstone integrity The Rust integrity check already guards its size-mismatch comparison with !size.is_deleted() (volume.rs:1859) and reads tombstone AppendAtNs with body_size=0, so the Go regression fixed in the previous commit does not apply. Lock that guarantee in with a parallel reload test: write three needles, delete one, sync, reopen via Volume::new, assert the volume is not flipped read-only. Catches any future change that removes the deleted-entry guard or re-introduces a size-strict path in check_volume_data_integrity for tombstones. * fix(volume): propagate io.EOF and ErrorSizeMismatch from verifyDeletedNeedleIntegrity CheckVolumeDataIntegrity relies on identity comparison against io.EOF and ErrorSizeMismatch to walk back through the last ten .idx entries and tolerate a partial truncation at the tail (the "fix and continue" loop). The live-needle branch in doCheckAndFixVolumeData already returns those sentinels unwrapped; the deletion branch wrapped them in fmt.Errorf, so a genuine .dat truncation past a tombstone offset broke the recovery and flipped the volume read-only. Mirror the live-needle handling: both verifyDeletedNeedleIntegrity and doCheckAndFixVolumeData now short-circuit on io.EOF / ErrorSizeMismatch and pass them through unwrapped. Other errors keep their existing context wrapping. Also tighten the regression test to capture lastAppendAtNs and assert it's non-zero, so a future regression that skips the tombstone body (and therefore never populates AppendAtNs) is caught even when the err check still passes. |
||
|
|
7c252e1f16 |
fix(volume): reopen .idx writable after MarkVolumeWritable (fixes #9515) (#9526)
* fix(volume): reopen .idx writable after MarkVolumeWritable When .vif has ReadOnly=true, load() opens .idx as O_RDONLY and builds a SortedFileNeedleMap whose Put returns os.ErrInvalid. MarkVolumeWritable only flipped noWriteOrDelete back to false and rewrote .vif, so writes still failed at v.nm.Put. Reopen .idx in O_RDWR and rebuild v.nm in its writable form (in-memory or leveldb small/medium/large) before flipping the flag. Mirror the same fix in seaweed-volume: the Rust load path leaves CompactNeedleMap/RedbNeedleMap with no idx_file writer when the volume boots read-only, so post-MarkVolumeWritable puts silently succeeded in-memory only and were lost on the next restart. set_writable now reattaches an append-mode writer when one is missing. * fix(volume): keep old needle map until replacement is built; defer writable flag Go: build the writable needle map into a local before swapping. A construction failure now leaves v.nm pointing at the original SortedFileNeedleMap so MarkVolumeWritable can roll back, instead of stranding the volume with v.nm == nil. Rust: attach the .idx writer before flipping no_write_or_delete to false. A transient open/metadata failure used to leave the volume marked writable with no writer attached, and subsequent puts would silently skip the on-disk append. |
||
|
|
c11ff6657b |
fix(ec): mirror EC sidecars onto every shard-bearing disk at startup (#9525)
* fix(ec): mirror EC sidecars onto every shard-bearing disk at startup
In a multi-disk volume server, ec.balance and ec.rebuild can land shards
on a disk that does not also hold the matching .ecx / .ecj / .vif index
files. The orphan-shard reconciler in reconcileEcShardsAcrossDisks
already loads those shards by pointing the EcVolume at the sibling
disk's index files; reads work, but any failure on the index-owning
disk silently disables every shard on the other disk, even though those
shards are physically fine.
This change adds mirrorEcMetadataToShardDisks, a startup pass that
physically replicates .ecx / .ecj / .vif onto each disk that holds
shards but is missing them. Each copy is atomic (tmp + fsync + rename)
and idempotent (a destination that already has the sidecar is
preserved). After mirroring, the cross-disk reconciler prefers the
local IdxDirectory so the EcVolume mounts self-contained; the
cross-disk virtual mount remains as a fallback for volumes whose mirror
failed (read-only target, out of space, partial copy on a previous
boot).
The same-disk invariant the EC lifecycle (encode / decode / balance /
vacuum / repair) was already documented as promising is now actually
restored at boot, so a future failure of one disk in a split-shards
layout no longer takes the other disk's shards with it.
Tests cover the orphan-layout mirror (dir0 receives the .ecx / .ecj /
.vif from dir1) and idempotency (an existing destination .ecx is not
overwritten with the owner's copy).
* fix(ec): handle legacy pre-dir.idx sidecar layout in mirror skip-check
hasAllEcSidecarsLocally checked only the modern destination path
(IdxDirectory for .ecx/.ecj, Directory for .vif). A destination disk
that still had a legacy .ecx in its data dir (written before -dir.idx
was set) would report "not present" and the mirror would write a
second copy to IdxDirectory, leaving two .ecx files on disk.
Matches HasEcxFileOnDisk's open-with-fallback contract: check the
modern path first, then the opposite directory. Factored the
exists-and-not-a-dir check into a small statRegular helper so the
fallback ladder stays readable.
* rust(seaweed-volume): mirror EC sidecars onto shard-bearing disks at startup
Port of the Go fix (commit
|
||
|
|
2a41e76101 |
fix(ec): blanket-clean every destination over the full shard range (#9512)
* fix(ec): blanket-clean every destination over the full shard range The previous cleanup pass walked t.sources only, with the shard ids the topology had reported at detection time. In the wild, a destination can end up with EC shards mounted that the topology snapshot didn't list — shards on a sibling disk that hadn't heartbeated, or shards left over from a concurrent attempt's mount step. FindEcVolume still returns true, so the next ReceiveFile trips the mounted-volume guard. Cleanup now unions t.sources (with ShardIds) and t.targets and issues unmount + delete over [0..totalShards-1] on each. Both RPCs are idempotent on missing shards, so the wider sweep is free. Two new tests cover the gap: shards mounted beyond what t.sources lists, and a target-only destination with no source row. * log(ec): include disk_id in EC unmount/delete/refusal log lines The current logs identify the volume and shard but leave disk_id off, which makes the cross-server cleanup story hard to follow when multiple disks of one server hold pieces of the same volume: UnmountEcShards 4121.1 -> add disk_id ec volume video-recordings_4121 shard delete [1 5] -> add per-loc disk_id volume server X:Y deletes ec shards from 4121 [...] -> add disk_id ReceiveFile: ec volume 4121 is mounted; refusing... -> add disk_ids ReceiveFile's refusal now names the disk_ids actually holding the mount so operators can see whether the next cleanup pass needs to target a sibling disk. Added Store.FindEcVolumeDiskIds / Store::find_ec_volume_disk_ids as the supporting primitive. Mirrored in seaweed-volume/src/ (unmount log in Store::unmount_ec_shard, heartbeat delete log in diff_ec_shard_delta_messages, refusal in the ReceiveFile handler). * test(ec): stub VolumeEcShardsUnmount/Delete on the fake volume server The plugin-worker EC tests boot a fake volume server that embeds UnimplementedVolumeServerServer. After the worker started calling VolumeEcShardsUnmount + VolumeEcShardsDelete pre-distribute, the default Unimplemented response surfaced as fourteen "method not implemented" errors and TestErasureCodingExecutionEncodesShards failed. Both RPCs are no-ops here — nothing on the fake server has mounted state or persisted shard files to remove. |
||
|
|
d51454adf4 |
rust(seaweed-volume): distributed EC read across peer servers (#9516)
* feat(seaweed-volume): distributed EC read across peer servers
EcVolume::read_ec_shard_needle previously errored with NotFound when
any interval's shard wasn't local. In an RS(10,4)-across-N deployment
each server holds one shard, so every read needed >=9 peer fetches and
post-EC GETs returned 404 on volumes whose shards lived on more than
one server.
Mirror of weed/storage/store_ec.go's readOneEcShardInterval ->
readRemoteEcShardInterval -> recoverOneRemoteEcShardInterval chain:
* server/store_ec.rs (new): entry point
read_ec_shard_needle_distributed. Snapshots locate-needle + local
reads under the Store sync lock, drops the lock, then async-fetches
missing intervals via the peer's VolumeEcShardRead RPC. Falls back
to Reed-Solomon reconstruction (read every other shard at the same
(shard_offset, size) and run rs.reconstruct) when the direct peer
read fails. Refreshes the per-EcVolume shard_locations cache from
the master's LookupEcVolume RPC using Go's freshness thresholds
(11s / 7min / 37min).
* erasure_coding/ec_volume.rs: shard_locations now sits behind a
std::sync::RwLock so the read path can refresh the map without
holding the Store write lock. Adds shard_locations_refresh_time
(Mutex<Option<Instant>>) for the staleness heuristic. Mirrors Go's
ShardLocationsLock / ShardLocationsRefreshTime fields. set/get
helpers updated for interior mutability.
* server/handlers.rs: GET handler now tries the local-only fast
path first, then falls through to the distributed path on
NotFound.
* review: address PR 9516 feedback on distributed EC read
Five of the six PR-review comments addressed; the sixth (JWT on
outgoing peer gRPC) is deferred with an explicit TODO because the
crate-wide outgoing-JWT signing surface doesn't exist yet — adding it
in this one call site would split the credential plumbing across
peer paths that already lack it (copy_file_from_source, batch_delete,
…). Revisit when an outgoing-JWT helper lands.
Fixed in this commit:
* Handlers: drop the two-tier (local-first, then distributed)
read in handlers.rs. read_ec_shard_needle_distributed already
does the local-first pass under the same store read lock; the
redundant outer attempt re-read local intervals twice for any
needle that spanned mixed-locality shards.
* Scanner snapshot: replace inline locate-needle math with
`ecv.locate_needle(needle_id)`. Same routine the local-only
read path uses, so byte-identical on shard-size + interval
boundaries.
* EcVolume::set_shard_locations also advances
shard_locations_refresh_time so the staleness check honors
callers that populate the cache directly without going through
the master LookupEcVolume RPC.
* parse_grpc_address moved from grpc_server.rs into
grpc_client.rs as `pub` and is reused by both grpc_server.rs
and the new store_ec module. Single source of truth for the
HTTP↔gRPC port-offset convention.
* Reconstruction (recover_one_remote_ec_shard_interval) now seeds
bufs from locally-mounted survivor shards BEFORE the remote
fan-out. Previously the fan-out was remote-only, so when the
shard_locations cache was cold or the master lookup failed,
reconstruction errored even though enough siblings were on
local disk to recover the missing interval.
* review: tighten parse_grpc_address; atomic shard-locations cache swap
Two follow-up findings from the PR 9516 review round 2:
* `parse_grpc_address` now validates BOTH port components in the
dotted form (`host:port.grpcPort`) — previously a non-numeric
HTTP port like `host:abc.18080` slipped through and tripped a
less-useful downstream URI parse error. The implicit form
(`host:port` → port + 10000) also gains an overflow check so
inputs like `host:60000` (which silently wrap past u16) are
rejected here instead of producing an opaque connection
failure later. Six unit tests cover each rejection path.
* `EcVolume::set_shard_locations` no longer bumps the per-volume
refresh timestamp. The previous fix introduced a freshness
race: a multi-shard population that inserts shard-by-shard
would flip `needs_refresh == false` on the first write, letting
a concurrent reader observe a half-populated map already
marked "fresh" and return NotFound for the not-yet-inserted
shards. Added `EcVolume::replace_shard_locations(map)` for the
atomic bulk swap; `write_back_shard_locations` in the
distributed-read path uses it so the cache transitions
old → fresh in a single observable step.
|
||
|
|
3a8389cd68 |
fix(ec): verify full shard set before deleting source volume (#9490) (#9493)
* fix(ec): verify full shard set before deleting source volume (#9490) Before this change, both the worker EC task and the shell ec.encode command would delete the source .dat as soon as MountEcShards returned — even if distribute/mount failed partway, leaving fewer than 14 shards in the cluster. The deletion was logged at V(2), so by the time someone noticed missing data the only trace was a 0-byte .dat synthesized by disk_location at next restart. - Worker path adds Step 6: poll VolumeEcShardsInfo on every destination, union the bitmaps, and refuse to call deleteOriginalVolume unless all TotalShardsCount distinct shard ids are observed. A failed gate leaves the source readonly so the next detection scan can retry. - Shell ec.encode adds the same gate after EcBalance, walking the master topology with collectEcNodeShardsInfo. - VolumeDelete RPC success and .dat/.idx unlinks now log at V(0) so any source destruction is traceable in default-verbosity production logs. The EC-balance-vs-in-flight-encode race is intentionally left for a follow-up; balance should refuse to move shards for a volume whose encode job is not in Completed state. * fix(ec): trim doc comments on the new shard-verification path Drop WHAT-describing godoc on freshly added helpers; keep only the WHY notes (query-error policy in VerifyShardsAcrossServers, the #9490 reference at the call sites). * fix(ec): drop issue-number anchors from new comments Issue references age poorly — the why behind each comment already stands on its own. * fix(ec): parametrize RequireFullShardSet on totalShards Take totalShards as an argument instead of reading the package-level TotalShardsCount constant. The OSS callers continue to pass 14, but the helper is now usable with any DataShards+ParityShards ratio. * test(plugin_workers): make fake volume server respond to VolumeEcShardsInfo The new pre-delete verification gate calls VolumeEcShardsInfo on every destination after mount, and the fake server's UnimplementedVolumeServer returns Unimplemented — the verifier read that as zero shards on every node and aborted source deletion. Build the response from recorded mount requests so the integration test exercises the gate end-to-end. * fix(rust/volume): log .dat/.idx unlink with size in remove_volume_files Mirror the Go-side change in weed/storage/volume_write.go: stat each file before removing and emit an info-level log for .dat/.idx so a destructive call is always traceable. The OSS Rust crate previously unlinked them silently. * fix(ec/decode): verify regenerated .dat before deleting EC shards After mountDecodedVolume succeeds, the previous code immediately unmounts and deletes every EC shard. A silent failure in generate or mount could leave the cluster with neither shards nor a valid normal volume. Probe ReadVolumeFileStatus on the target and refuse to proceed if dat or idx is 0 bytes. Also make the fake volume server's VolumeEcShardsInfo reflect whichever shard files exist on disk (seeded for tests as well as mounted via RPC), so the new gate can be exercised end-to-end. * fix(ec): address PR review nits in verification + fake server - Drop unused ServerShardInventory.Sizes field. - Skip shard ids >= MaxShardCount before bitmap Set so the ShardBits bound is explicit (Set already no-ops on overflow, this is for clarity). - Nil-guard the fake server's VolumeEcShardsInfo so a malformed call doesn't panic the test process. |
||
|
|
de28c4df61 |
fix(storage): prune partial EC shards when sibling disk has healthy .dat (#9478) (#9480)
* fix(storage): prune partial EC shards when sibling disk has healthy .dat (#9478) handleFoundEcxFile only checks for .dat in the same disk location as the EC shards. In a multi-disk volume server an interrupted encode can leave .ec?? + .ecx on disk B while the source .dat still lives on disk A: the per-disk loader sees no .dat next to .ecx, mistakes the leftover for a distributed-EC layout, and mounts the partial shards. The volume server then heartbeats both a regular replica and an EC shard for the same vid and the master keeps both. Sweep the store after per-disk loading and before the cross-disk reconcile to delete partial EC files when a healthy .dat for the same (collection, vid) exists on a sibling disk. Push DeletedEcShardsChan for every pruned shard so master forgets the new-shard message the per-disk pass already emitted, instead of waiting for the next periodic heartbeat. * fix(seaweed-volume): mirror prune of partial EC with sibling .dat (#9478) Rust port of the same Store-level prune added to weed/storage. The per-disk EC loader in disk_location.rs only checks for .dat in the same disk as the EC shards, so an interrupted encode that leaves .ec?? + .ecx on disk B while the source .dat sits on disk A is mounted as if it were a distributed-EC layout. The volume server then heartbeats both a regular replica and an EC shard for the same vid. Sweep the store after per-disk loading and before the cross-disk reconcile, dropping in-memory EcVolumes with fewer than DATA_SHARDS_COUNT shards when a .dat for the same (collection, vid) exists on a sibling disk, and remove all on-disk EC artefacts for them. The Rust heartbeat path already diff-emits deletes from the next ec_volumes snapshot, so no explicit delete-channel push is needed here. Tests cover both the issue 9478 layout and a distributed-EC layout with no .dat anywhere on the store, which must be left alone. * fix(storage): validate sibling .dat size before deleting partial EC (#9478) The earlier prune deleted partial EC files whenever any .dat for the same vid existed on a sibling disk — including a zero-byte shell. A shell is no more useful than the partial shard it would replace, and the partial shard might still combine with shards on other servers in a recoverable distributed-EC layout. Wiping it based on a corrupt sibling .dat is data loss masquerading as cleanup. Tighten the check: when the EC's .vif recorded a non-zero source size in datFileSize, require the sibling .dat to be at least that many bytes; otherwise fall back to "at least a superblock". The .vif value is what the encoder wrote at the moment the source was sealed, so a sibling .dat smaller than that is provably truncated. Carry the size through indexDatOwners alongside the location. The Rust port had the same gap and an additional bug behind it: EcVolume::new wasn't reading datFileSize from .vif, so the safety check always fell back to the superblock floor. Wire datFileSize through. The existing shard-size calculation in LocateEcShardNeedleInterval already uses dat_file_size when non-zero, so populating it also matches Go's behaviour there. Tests cover the truncated-sibling case in both ports. |
||
|
|
10cc06333b |
cluster: restrict Ping RPC to known peers of the requested type (#9445)
Ping previously dialled whatever host:port the caller asked for. Gate each server's Ping handler on cluster membership: masters check the topology, registered cluster nodes, and configured master peers; volume servers only accept their seed/current masters; filers accept tracked peer filers, the master-learned volume server set, and configured masters. Use address-indexed peer lookups to keep Ping target validation O(1): - topology maintains a pb.ServerAddress -> *DataNode index alongside the dc/rack/node tree, kept in sync from doLinkChildNode and UnlinkChildNode plus the ip/port-rewrite branch in GetOrCreateDataNode. GetTopology now returns nil on a detached subtree instead of panicking, so the linkage hooks can no-op safely. - vid_map tracks a refcount per volume-server address so hasVolumeServer answers without scanning every vid location. The add path skips empty-address entries the same way the delete path already does, so a zero-value Location cannot leak a permanent serverRefCount[""] bucket. - masters reuse a cached master-address set from MasterClient instead of walking the configured peer slice on every request. - volume servers compare against a pre-built seed-master set and protect currentMaster reads/writes with an RWMutex, fixing the data race with the heartbeat goroutine. The seed slice is copied on construction so external mutation cannot desync it from the frozen lookup set. - cluster.check drops the direct volume-to-volume sweep; volume servers no longer carry a peer-volume list, and the note next to the dropped probe is reworded to make clear that direct volume-to-volume reachability is intentionally not validated by this command. Update the volume-server integration tests that drove Ping through the new admission gate: success-path coverage now targets the master peer (the only type a volume server tracks), and the unknown/unreachable path asserts the InvalidArgument the gate now returns instead of the old downstream dial error. Mirror the same admission gate in the Rust volume server crate: a seed-master HashSet built once at startup plus a tokio RwLock over the heartbeat-tracked current master, both consulted in is_known_ping_target on every Ping, with InvalidArgument returned for any target that isn't a recognised master. |
||
|
|
532b088262 |
fix(ec): preserve source disk type across EC encoding (#9423) (#9449)
* fix(ec): carry source disk type on VolumeEcShardsMount (#9423) When EC shards land on a target whose disk type differs from the source volume's, master heartbeats wrongly reported under the target disk's type. Add source_disk_type to VolumeEcShardsMountRequest; the target server applies it to the in-memory EcVolume via SetDiskType so the mount notification and steady-state heartbeat both carry the source's disk type. Empty value falls back to the location's disk type (used by disk-scan reload paths). The override is not persisted with the volume — disk type stays an environmental property and .vif remains portable. * fix(ec): plumb source disk type through plugin worker (#9423) Add source_disk_type to ErasureCodingTaskParams (field 8; 7 reserved), populate it from the metric the detector already collects, thread it through ec_task into the MountEcShards helper, and forward it on the VolumeEcShardsMount RPC. * fix(ec): mirror source disk type plumbing in rust volume server (#9423) The volume_ec_shards_mount handler now forwards source_disk_type into mount_ec_shard → DiskLocation::mount_ec_shards. When non-empty it overrides ec_vol.disk_type (and each mounted shard's disk_type) via the new set_disk_type method; empty value keeps the location's disk type, so disk-scan reload and reconcile paths are unchanged. Also picks up two pre-existing proto drifts that 'make gen' synced from weed/pb (LockRingUpdate in master.proto, listing_cache_ttl_seconds in remote.proto). * feat(ec): bias placement toward preferred disk type (#9423) Add DiskCandidate.DiskType and PlacementRequest.PreferredDiskType. When PreferredDiskType is non-empty, SelectDestinations partitions suitable disks into matching/fallback tiers and runs the rack/server/ disk-diversity passes on the matching tier first; the fallback tier is only consulted if the matching pool can't satisfy ShardsNeeded. PlacementResult.SpilledToOtherDiskType lets callers warn on spillover. Empty PreferredDiskType keeps the existing single-pool behavior. * fix(ec): plumb source disk type into placement planner (#9423) diskInfosToCandidates now copies DiskInfo.DiskType into the placement candidate, and ecPlacementPlanner.selectDestinations forwards metric.DiskType as PreferredDiskType so EC shards land on disks matching the source volume's disk type when possible. A glog warning fires when placement had to spill to other disk types. * test(ec): integration coverage for source-disk-type plumbing (#9423) store_ec_disk_type_test exercises Store.MountEcShards end-to-end: a shard physically lives on an HDD location, MountEcShards is called with sourceDiskType="ssd", and the test asserts that the in-memory EcVolume, the mounted shard, the NewEcShardsChan notification, and the steady-state heartbeat all report under the source's disk type. A companion test pins the empty-source path so disk-scan reload keeps the location's disk type. detection_disk_type_test exercises the worker plumbing: with a cluster of nodes carrying both HDD and SSD disks, planECDestinations must place every shard on SSD when metric.DiskType="ssd"; with only one SSD node and 13 HDD nodes it must still satisfy a 10+4 layout via spillover (and log a warning). * revert(ec): drop unrelated proto drift in seaweed-volume/proto (#9423) make gen pulled two pre-existing OSS changes into the rust proto tree (LockRingUpdate / by_plugin in master.proto, listing_cache_ttl_seconds in remote.proto). Reviewers flagged it as scope creep — none of the rust EC fix references those fields. Restore both files to origin/master so this branch only touches EC-related symbols. * fix(ec placement): treat empty disk type as hdd and skip used racks on spill (#9423) partitionByDiskType used raw string comparison, so a PreferredDiskType of "hdd" never matched candidates whose DiskType is "" (the HardDriveType sentinel that weed/storage/types uses). EC encoding of an HDD source would spill onto any HDD reporting "" even when the cluster has plenty of matching capacity. Normalize both sides through normalizeDiskType, which lowercases and folds "" → "hdd", mirroring types.ToDiskType without taking a dependency on it. selectFromTier's rack-diversity pass also kept revisiting racks the preferred tier had already used when running on the fallback tier, which negated PreferDifferentRacks on spillover. Skip racks already in usedRacks so fallback placements still spread onto new racks. * fix(ec): empty-source remount must not clobber existing disk type (#9423) mount_ec_shards_with_idx_dir runs more than once per vid (RPC mount, disk-scan reload, orphan-shard reconcile). After an RPC sets the source-derived disk type, any later call passing source_disk_type="" was resetting ec_vol.disk_type back to the location's value, which reintroduces the heartbeat drift this PR is meant to fix. Only default to the location's disk type when the EC volume is fresh (no shards mounted yet); otherwise leave the recorded type alone so empty-source reloads preserve whatever the original mount RPC set. |
||
|
|
487b93eb49 |
fix(volume): don't panic on read when needle map is nil (#9342)
* fix(volume): don't panic on read when needle map is nil A failed CommitCompact reload (and #9335's new error path for a remote-tiered volume with a stray .vif but no .idx) leaves v.nm == nil on a volume that's still in the store. readNeedle / readNeedleDataInto dereferenced v.nm with no guard, so the next GET segfaulted the http handler instead of returning an error the client could retry on another replica. Add the same v.nm == nil check the other Volume accessors already use, including the slow-read inner loop where the lock is released between iterations and a failed reload can race in. Fixes #9339. * match rust nm-nil read behavior; trim comments seaweed-volume's read_needle_with_option / re_lookup_needle_data_offset already lift Option<NeedleMap> through ok_or(NotFound). Use ErrorNotFound on the Go side too instead of a generic 500-mapped error so both volume servers respond identically when v.nm is nil. * log once when reads hit nil needle map ErrorNotFound alone hides the real cause: a half-loaded volume just returns 404s and the operator has nothing to grep for. Add a once-per- volume Errorf on the nil path, reset on successful load. Mirror the same in seaweed-volume via nm_or_not_found(). * trim comments * drop once-flag, log inline on every nil-nm read |
||
|
|
1c0e24f06a |
fix(balance): don't move remote-tiered volumes; don't fatal on missing .idx (#9335)
* fix(volume): don't fatal on missing .idx for remote-tiered volume A .vif left behind without its .idx (orphaned by a crashed move, partial copy, or hand-edit) would trip glog.Fatalf in checkIdxFile and take the whole volume server down on boot, killing every healthy volume on it too. For remote-tiered volumes treat it as a per-volume load error so the server can come up and the operator can clean up the stray .vif. Refs #9331. * fix(balance): skip remote-tiered volumes in admin balance detection The admin/worker balance detector had no equivalent of the shell-side guard ("does not move volume in remote storage" in command_volume_balance.go), so it scheduled moves on remote-tiered volumes. The "move" copies .idx/.vif to the destination and then calls Volume.Destroy on the source, which calls backendStorage.DeleteFile — deleting the remote object the destination's new .vif now points at. Populate HasRemoteCopy on the metrics emitted by both the admin maintenance scanner and the worker's master poll, then drop those volumes at the top of Detection. Fixes #9331. * Apply suggestion from @gemini-code-assist[bot] Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * fix(volume): keep remote data on volume-move-driven delete The on-source delete after a volume move (admin/worker balance and shell volume.move) ran Volume.Destroy with no way to opt out of the remote-object cleanup. Volume.Destroy unconditionally calls backendStorage.DeleteFile for remote-tiered volumes, so a successful move would copy .idx/.vif to the destination and then nuke the cloud object the destination's new .vif was already pointing at. Add VolumeDeleteRequest.keep_remote_data and plumb it through Store.DeleteVolume / DiskLocation.DeleteVolume / Volume.Destroy. The balance task and shell volume.move set it to true; the post-tier-upload cleanup of other replicas and the over-replication trim in volume.fix.replication also set it to true since the remote object is still referenced. Other real-delete callers keep the default. The delete-before-receive path in VolumeCopy also sets it: the inbound copy carries a .vif that may reference the same cloud object as the existing volume. Refs #9331. * test(storage): in-process remote-tier integration tests Cover the four operations the user is most likely to run against a cloud-tiered volume — balance/move, vacuum, EC encode, EC decode — by registering a local-disk-backed BackendStorage as the "remote" tier and exercising the real Volume / DiskLocation / EC encoder code paths. Locks in: - Destroy(keepRemoteData=true) preserves the remote object (move case) - Destroy(keepRemoteData=false) deletes it (real-delete case) - Vacuum/compact on a remote-tier volume never deletes the remote object - EC encode requires the local .dat (callers must download first) - EC encode + rebuild round-trips after a tier-down Tests run in-process and finish in under a second total — no cluster, binary, or external storage required. * fix(rust-volume): keep remote data on volume-move-driven delete Mirror the Go fix in seaweed-volume: plumb keep_remote_data through grpc volume_delete → Store.delete_volume → DiskLocation.delete_volume → Volume.destroy, and skip the s3-tier delete_file call when the flag is set. The pre-receive cleanup in volume_copy passes true for the same reason as the Go side: the inbound copy carries a .vif that may reference the same cloud object as the existing volume. The Rust loader already warns rather than fataling on a stray .vif without an .idx (volume.rs load_index_inmemory / load_index_redb), so no counterpart to the Go fatal-on-missing-idx fix is needed. Refs #9331. * fix(volume): preserve remote tier on IO-error eviction; fix EC test target Two review nits: - Store.MaybeAddVolumes' periodic cleanup pass deleted IO-errored volumes with keepRemoteData=false, so a transient local fault on a remote-tiered volume would also nuke the cloud object. Track the delete reason via a parallel slice and pass keepRemoteData=v.HasRemoteFile() for IO-error evictions; TTL-expired evictions still pass false. - TestRemoteTier_ECEncodeDecode_AfterDownload deleted shards 0..3 but called them "parity" — by the klauspost/reedsolomon convention shards 0..DataShardsCount-1 are data and DataShardsCount..TotalShardsCount-1 are parity. Switch the loop to delete the parity range so the intent matches the indices. --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> |
||
|
|
2417ba0354 |
fix(volume): add authentication to destructive gRPC admin endpoints (#8876)
* fix(volume): add authentication to destructive gRPC admin endpoints Three destructive VolumeServer gRPC endpoints (DeleteCollection, VolumeDelete, VolumeServerLeave) had no authentication checks, unlike their HTTP counterparts which are protected by the Guard whitelist. Add IsWhiteListed(host) to security.Guard and a checkGrpcAdminAuth helper on VolumeServer that extracts the peer IP from gRPC context and validates it against the guard whitelist. Gate all three endpoints behind this check. * fix(volume): tolerate unparseable gRPC peer address in admin auth check S3 Filer Group integration tests were failing with PermissionDenied "bad peer address: address @: missing port in address" when DeleteCollection ran across the in-process gRPC connection between filer and volume server — the peer addr surfaces as "@" there and net.SplitHostPort can't parse it. The check rejected before IsWhiteListed could exercise its allow-all path for empty-whitelist deployments. Hand the raw peer string to IsWhiteListed when SplitHostPort fails. With no whitelist configured (the test environment's mode) it accepts; with a whitelist configured the unparseable host won't match anything and the call still gets denied as it should. Adds three regression tests for IsWhiteListed pinning the empty-config allow-all, populated-list reject-unknown, and signing-key-only allow- all branches that the gRPC admin helper relies on. * refactor(security): dedup checkWhiteList through IsWhiteListed The HTTP-side checkWhiteList and the gRPC-side IsWhiteListed had the same lookup logic in two places; future drift was just a matter of time. Have checkWhiteList delegate so the membership semantics live in exactly one function. Behaviour is unchanged: the new path still returns nil for isEmptyWhiteList (signing-key-only mode) and still rejects unknown hosts when a whitelist is configured. Addresses gemini medium review on PR #8876. * fix(volume): protect remaining state-altering gRPC admin endpoints DeleteCollection, VolumeDelete, and VolumeServerLeave were the truly-destructive endpoints, but AllocateVolume, VolumeMount, VolumeUnmount, VolumeConfigure, VolumeMarkReadonly, and VolumeMarkWritable also modify server state and should sit behind the same whitelist gate. Read-only endpoints (VolumeStatus, VolumeServerStatus, VolumeNeedleStatus, Ping) stay open. The check is a no-op when no whitelist is configured (the default), so existing deployments keep working; operators who lock down their volume servers via guard.white_list now get consistent coverage. Addresses gemini security-high review on PR #8876. * fix(volume): typed peer addr + audit log for gRPC admin auth Prefer a typed *net.TCPAddr when extracting the peer IP — string parsing was already a fallback for the in-process case but using the typed form first is cleaner and skips an unnecessary parse on the common path. Log failed authorization attempts at V(0) so an operator running with a whitelist sees the host that was rejected (and the raw remote address in case the IP lookup itself was the failure mode), matching what the HTTP Guard already does. Addresses gemini medium review on PR #8876. * fix(volume): protect vacuum + scrub + EC-shards-delete admin endpoints Five more master/admin-driven destructive operations live outside volume_grpc_admin.go and were missing the same whitelist gate: - VacuumVolumeCompact, VacuumVolumeCommit, VacuumVolumeCleanup - ScrubVolume - VolumeEcShardsDelete VacuumVolumeCheck stays open (read-only). BatchDelete also stays open: it's the data-plane multi-object delete called from the S3 API and filer, not an admin operation; gating it would break ordinary S3 DeleteObjects calls. Addresses gemini security-high review on PR #8876. * fix(volume): simplify no-peer-info branch in gRPC admin auth The IsWhiteListed("") fallback was defending against a scenario that doesn't actually arise — real gRPC connections always populate peer info. Drop the branch and just deny when peer info is missing, which is the safer default and matches "if we don't know who the caller is, refuse". * fix(volume-rust): mirror gRPC admin auth on the rust volume server The rust volume server has the same set of destructive admin endpoints as the Go side and the same Guard infrastructure, but nothing was wired together — every endpoint accepted unauthenticated calls regardless of guard configuration. Same vulnerability class the Go fix on this PR closes; this commit closes it on the rust side too so the two stacks stay aligned. Adds VolumeGrpcService::check_grpc_admin_auth that pulls the peer SocketAddr off the tonic Request and runs Guard::check_whitelist on its IP, then applies the helper to the same set the Go side covers: DeleteCollection, AllocateVolume, VolumeMount, VolumeUnmount, VolumeDelete, VolumeMarkReadonly, VolumeMarkWritable, VolumeConfigure, VacuumVolumeCompact, VacuumVolumeCommit, VacuumVolumeCleanup, VolumeServerLeave, ScrubVolume, VolumeEcShardsDelete. Read-only endpoints stay open; BatchDelete stays open as a data-plane multi-object delete. |
||
|
|
e82789ea4b |
rust(volume): strip grpc-port suffix from master URL before HTTP lookup (#9276)
* rust(volume): strip grpc-port suffix from master URL before HTTP lookup The volume server stores `master_url` in SeaweedFS's canonical `host:httpPort.grpcPort` form (e.g. `node-a:5300.5310`). When `lookup_volume` builds the master `/dir/lookup` URL, the appended gRPC port turns the URL into `http://node-a:5300.5310/...`, which reqwest rejects with "builder error". Every replicated write, batch-delete lookup, and proxy/redirect read then fails. Mirror Go's `pb.ServerAddress.ToHttpAddress()` with a new `to_http_address` helper and apply it inside `lookup_volume`, the single funnel for all three HTTP master lookups in the Rust volume server. Other consumers of `master_url` already go through gRPC and use `to_grpc_address` / `parse_grpc_address`. Includes a regression test that mocks a master HTTP server and calls `lookup_volume` with a `host:port.grpcPort` master URL — without the fix it reproduces the exact "lookup request failed: builder error" from issue #9274. Fixes #9274 * rust(volume): only strip dotted suffix from address when both ports are numeric Previously `to_http_address` rewrote any `host:foo.bar` to `host:foo`, which would silently drop the suffix on malformed config (e.g. a hostname like `host:abc.def` or `host:99999.19333`). Validate that both halves of the dotted suffix parse as `u16` before stripping — mirrors the validation that `to_grpc_address` already does in the inverse direction. Also slice the input directly instead of going through `format!`, since the result is just a prefix of `addr`. Adds a test that asserts non-numeric / out-of-range dotted suffixes are preserved unchanged. * rust(volume): strip grpc-port suffix from peer URLs in replicate / proxy / redirect In normal operation the master returns `Location.url` as plain `host:port` and the gRPC port arrives in a separate field. But the volume server already has defensive logic (`grpc_address_for_location`) for the `host:httpPort.grpcPort` form on those same URLs, which implies a code path where peer URLs do carry the suffix. Apply `to_http_address` to `loc.url` / `target.url` before building HTTP URLs in `do_replicated_request`, `proxy_request`, and `redirect_request` to keep replicate-write, proxy-read, and redirect paths from hitting the same `lookup request failed: builder error` mode that #9274 documented for master lookups. Adds a unit test exercising `redirect_request` with a `.grpcPort` suffix on `target.url`. * rust(volume): return Cow<str> from to_http_address to skip allocation on the no-suffix path Most addresses pass through `to_http_address` unchanged (master and peer URLs are normally plain `host:port`), so the previous String return type allocated on every call for nothing. Switch to `Cow<str>`: the common pass-through borrows from the input, and only the rewrite branch allocates. Call sites use the result via `format!`/`Display`, which both work transparently with `Cow<str>`. Adds a test asserting the variant is Borrowed in the no-rewrite cases and Owned only when the suffix is stripped. * rust(volume): cover bracketed IPv6 literals in to_http_address tests The current implementation already handles IPv6 correctly because `rfind(':')` lands on the colon after the closing bracket, leaving the dotted suffix logic unchanged. Add explicit test coverage so the behavior is locked down for dual-stack deployments — both the strip case (`[::1]:9333.19333` -> `[::1]:9333`) and the various passthrough cases (no suffix, non-numeric, out-of-range, trailing colon). * rust(volume): parse both ports in one tuple pattern match in to_http_address Combine the two `is_ok()` checks into a single `if let (Ok(_), Ok(_))` tuple match — equivalent semantics, slightly tighter expression of intent. No behavior change. |
||
|
|
08d59750ef |
rust(volume): export Prometheus metrics for scrubbing operations (#9266)
* rust(volume): export Prometheus metrics for scrubbing operations Mirrors #9264 in the Rust volume server. Adds three metrics that match the Go names so the same dashboards/alerts work against either binary: - SeaweedFS_volumeServer_scrub_last_time_seconds (gauge) - SeaweedFS_volumeServer_scrub_volume_failures (counter) - SeaweedFS_volumeServer_scrub_shard_failures (counter) Metrics are aggregated at the volume / EC shard level, labelled by VolumeScrubMode (UNKNOWN/INDEX/FULL/LOCAL) to match Go's req.GetMode().String(). * rust(volume): record scrub metrics before post-scrub error check Address PR feedback: - Move metric emission before the mark_broken_volumes_readonly error check so scrub failures are persisted even when the follow-up mark-readonly admin action fails (matches Go's volume_grpc_scrub.go). - Extract the duplicated metric block into emit_scrub_metrics() shared by both ScrubVolume and ScrubEcVolume. The shard-failures family stays untouched on regular volume scrubs to mirror Go. |
||
|
|
9d6d068f41 |
feat(seaweed-volume): cross-disk EC shard reconciliation (#9212) (#9252)
* fix(seaweed-volume): fall back to idx dir when reading .vif
EcVolume::new and read_ec_shard_config only looked for .vif at the
data dir. With the cross-disk reconcile path (where shards live on
one disk and .ecx / .ecj / .vif live on a sibling disk —
seaweedfs/seaweedfs#9212 / #9244), this would either write a stub
.vif on the shard disk and lose the real EC config + dat_file_size
or fall back to default ratios despite a perfectly good .vif being
present elsewhere on the same volume server.
Add a small `locate_vif_path` helper that prefers the data dir and
falls back to the idx dir when it differs, and thread the data dir
+ idx dir pair through `read_ec_shard_config`. Three call sites in
grpc_server.rs (VolumeEcShardsGenerate, VolumeEcShardsRebuild, scrub)
updated; the scrub path passes the same dir for both args because
`find_ec_dir` is the only locator there.
* feat(seaweed-volume): primitives for cross-disk EC shard reconcile
Adds the three small helpers the reconcile pass needs:
- DiskLocation::mount_ec_shards_with_idx_dir — mounts shards on this
disk while pointing the EcVolume at a sibling disk's idx dir for
.ecx / .ecj / .vif. Mirrors loadEcShardsWithIdxDir in
weed/storage/disk_location_ec.go. The existing mount_ec_shards is
kept as a thin wrapper over it.
- EcVolume::has_shard — `pub` accessor over the internal Vec<Option>
shard slot so the reconcile pass can skip shards that are already
registered.
- pub(crate) re-exports of parse_collection_volume_id and
parse_ec_shard_extension under names parse_collection_volume_id_pub
and is_ec_shard_extension so the reconcile module can call them
without re-implementing the parsers.
No behaviour change. Reconciliation logic in the next commit.
* feat(seaweed-volume): cross-disk EC shard reconciliation (#9212)
Closes the loader half of seaweedfs/seaweedfs#9212 on the Rust side,
mirroring the Go fix in seaweedfs/seaweedfs#9244. With the auto-load
in feat/rust-load-all-ec-shards-9212 in place, the only remaining gap
is shards that landed on a disk without their `.ecx` — for example
when ec.balance / ec.rebuild moved them onto a destination node's
second disk while leaving the index files on the disk that already
held the volume. Without this, those orphan shards stay invisible to
the master and ec.rebuild reports the volume as unrepairable.
After every DiskLocation has finished its per-disk EC scan, sweep the
store for shards that live on a disk without local index files and
load them by reaching across to a sibling disk's `.ecx` / `.ecj` /
`.vif`:
- Store::reconcile_ec_shards_across_disks walks each disk for
orphan `.ec??` files (present on disk, not yet registered to an
EcVolume) and matches them against an `(collection, vid) ->
EcxOwnerInfo` map of which disk owns each `.ecx`.
- Each matched group is mounted on its physical disk's ec_volumes
map (so heartbeat reporting carries the right disk_id per shard)
via `mount_ec_shards_with_idx_dir`, pointing the EcVolume at the
sibling's idx dir.
- `index_ecx_owners` records the directory each `.ecx` was found in
(IdxDirectory or Directory) so the loader doesn't ENOENT when the
legacy "written before -dir.idx was set" layout puts `.ecx` in
the data dir. This mirrors the PR #9244 review fix from
@gemini-code-assist / @coderabbitai (see Go commit
|
||
|
|
49e83a26cb |
feat(seaweed-volume): auto-load EC shards on startup (#9212) (#9251)
* feat(seaweed-volume): auto-load EC shards on startup
The Rust volume server's load_existing_volumes only scanned .dat
files; EC shards on disk stayed invisible until something explicitly
issued VolumeEcShardsMount. Strict superset of the issue
seaweedfs/seaweedfs#9212 reports for Go: after a fresh restart, every
local EC shard was missing from the master's view.
Port loadAllEcShards from weed/storage/disk_location_ec.go:
- DiskLocation::load_all_ec_shards walks Directory (and IdxDirectory
if separate) sorted, groups .ec?? shard files by (collection, vid),
validates and mounts each group when its matching .ecx is found.
- handle_found_ecx_file: validate_ec_volume + mount_ec_shards path,
with cleanup when .dat exists and validation fails (incomplete
encoding) or load fails.
- check_orphaned_shards: cleans up shard remnants whose .ecx never
arrived AND whose stale .dat is still present (interrupted
encoding); leaves them on disk otherwise so cross-disk
reconciliation / operator recovery can find them.
- check_dat_file_exists / parse_collection_volume_id /
parse_ec_shard_extension: small helpers mirroring Go's checkDatFileExists,
parseCollectionVolumeId, and the `\.ec\d{2,3}` regex.
- Wire through load_existing_volumes after the .dat scan; failures
log but don't fail the disk's startup.
Tests:
- test_parse_ec_shard_extension covers .ec00–.ec255 and the rejection
of .ec0, .ec999, .ecx, .ecj, .dat, and missing leading dot.
- test_load_all_ec_shards_mounts_pairs_with_ecx: shards + .ecx + .vif
on disk get mounted into ec_volumes after load_existing_volumes.
- test_load_all_ec_shards_keeps_orphan_shards_when_no_dat: orphan
shards (no .ecx, no .dat) stay on disk untouched
(distributed-EC scenario).
- test_load_all_ec_shards_cleans_orphan_shards_when_dat_exists:
orphan shards alongside a stale .dat get cleaned up
(interrupted-encoding scenario).
Prerequisite for porting the cross-disk orphan-shard reconciliation
in seaweedfs/seaweedfs#9244 to Rust.
* fix(seaweed-volume): dedupe filenames when scanning data + idx dirs
load_all_ec_shards scans both `directory` and `idx_directory` (when
they differ) so the loop can pair `.ec??` shards with their `.ecx`
regardless of which dir owns the index. If the same filename is
present in both — possible in idempotent legacy layouts that
pre-date `-dir.idx` — the previous implementation processed it
twice. mount_ec_shards increments the per-shard `ec_shards` metric
inside the loop, so a duplicated `.ec??` entry would double-count
the gauge.
Use a HashSet<String> while accumulating entries so each filename
is processed exactly once.
Reported in PR #9251 review by @gemini-code-assist.
* fix(seaweed-volume): drive partial-mount cleanup through unmount_ec_shards
handle_found_ecx_file calls mount_ec_shards which adds shards one at
a time. mount_ec_shards increments the `ec_shards` gauge per shard
that successfully attaches. If mount fails halfway, plain
ec_volumes.remove(vid) drops the EcVolume but leaves the gauge
incremented for whatever did mount.
Drive the cleanup branches through unmount_ec_shards instead — it
mirror-decrements the gauge per shard and only then drops the
EcVolume. Same shape applied to both .dat-exists and distributed-EC
fallbacks.
Reported in PR #9251 review by @gemini-code-assist.
* docs(seaweed-volume): clarify parse_ec_shard_extension shard-id range
Doc previously said `.ec00`–`.ec999` but the implementation rejects
any shard id > 255 (matches the `EcVolumeShard` u8 typed shard id
and Go's `strconv.ParseInt(... 10, 64)` + `> 255` guard). Fix the
doc to say `.ec00`–`.ec255` and explain why the 3-digit form is
still recognised.
Reported in PR #9251 review by @coderabbitai.
|
||
|
|
933ae6e386 |
fix(seaweed-volume): port EC shard placement fix to Rust (#9212, mirrors #9245) (#9250)
* feat(seaweed-volume): add DiskLocation::has_ecx_file_on_disk Mirrors `DiskLocation.HasEcxFileOnDisk` from the Go side (seaweedfs/seaweedfs#9245). Reports whether this disk has a sealed .ecx index file for (collection, vid) by stat'ing the IdxDirectory first, then falling back to Directory if different — covers the legacy "written before -dir.idx was set" layout. Skips entries that are directories so a stray dir named `<col>_<vid>.ecx` doesn't register as a present index file. Unlike has_ec_volume() this does not require the EC volume to be mounted in memory, which makes it the right primitive for placement decisions during ec.balance / ec.rebuild flows where shards may arrive before any VolumeEcShardsMount has happened on the receiving disk. Wiring + tests in follow-up commits. * feat(seaweed-volume): add Store::find_ec_shard_target_location Mirrors `Store.FindEcShardTargetLocation` from the Go side (seaweedfs/seaweedfs#9245). Single canonical placement primitive for new EC shard / index files. Selection order: 1. a disk that already has the EC volume mounted (in-memory), 2. a disk that owns the .ecx file on disk (volume not yet mounted), 3. any HDD with free space, 4. any disk with free space. Step 2 is the missing primitive that pinned subsequent shards to the first-shard disk during ec.rebuild — rebuild only sets CopyEcxFile=true on the first shard, then relies on auto-select to land later shards on the same disk. Without an on-disk check has_ec_volume returns false (no mount yet) and the fallback picked "any HDD with free space," splitting shards from their .ecx across disks of the same node and producing the orphan-shard layout seaweedfs/seaweedfs#9212 reports. Implementation walks store.locations once with tier scoring; the highest-tier disk wins, ties broken by free count. The earlier 4-pass waterfall in find_free_location_predicate would have re-acquired locks per pass. ec_free_shard_count returns the free count in shard slots (not volume-equivalent slots). The pre-existing find_free_location* helpers divide by DATA_SHARDS_COUNT at the end; that truncation can exclude a disk that has room for several individual shards (MaxVolumeCount=1, EcShardCount=1, dsc=10 → reports 0 despite 9 free slots), which would re-route subsequent shards off the .ecx-owning disk and re-introduce the orphan layout. Keep the result in shard slots throughout. The unlimited-disk branch (MaxVolumeCount==0) reports a synthetic large free count decremented by current usage so unlimited disks stay eligible and tie-breaks still prefer the less-loaded one. data_shard_count is taken as a parameter rather than read from DATA_SHARDS_COUNT so custom-ratio builds can swap the default without touching this helper. Tests cover: pinning to .ecx on disk, mounted-wins-over-stray-.ecx, HDD fallback, MaxVolumeCount=0 unlimited handling, and the tight-provisioning truncation case. * fix(seaweed-volume): route EC shard auto-select through new helper VolumeEcShardsCopy and the ReceiveFile EC branch both used a 3-tier inline waterfall: in-memory has_ec_volume → any HDD → any disk. That checked in-memory state only and missed disks that own the .ecx on disk but haven't been mounted yet — the orphan-shard placement hazard from seaweedfs/seaweedfs#9212. Replace both with a single call to Store::find_ec_shard_target_location, which adds the .ecx-on-disk tier between mounted and HDD, and accounts for free space in shard slots so tight-provisioning configurations don't incorrectly skip a disk that still has room for individual shards. Pass DATA_SHARDS_COUNT as the data-shard count for free-slot maths; the helper takes it as a parameter so custom-ratio builds can swap the default without touching this file. * fix(seaweed-volume): grow UNLIMITED_FREE budget and saturate the math ec_free_shard_count's unlimited branch (MaxVolumeCount=0) used to clamp to a constant `1` once usage exceeded `1 << 30 ≈ 1e9` shard slots. With several unlimited disks all past that threshold, every placement decision among them tied at 1 — tie-break degraded to "first eligible disk." Bump the synthetic budget to `1 << 60 ≈ 1.15e18` and use saturating arithmetic so even pathological usage never wraps i64. Clamp the return value to `≥ 1` so the disk stays eligible for placement at any load. Tie-breaks among unlimited disks now keep preferring the less-loaded one across all realistic deployments. Reported in PR #9250 review by @gemini-code-assist. |
||
|
|
4c4d53ce23 |
fix(seaweed-volume): accept redb aliases for --index (#9237)
fix(seaweed-volume): accept redb aliases for --index and rename kinds The Rust volume server's disk-backed index uses redb internally (see RedbNeedleMap), but --index only accepted the legacy `leveldb` spellings, contradicting the wiki and forcing users to read source to figure out what value to pass. - --index now accepts memory|redb|redbMedium|redbLarge as the canonical names, with leveldb/leveldbMedium/leveldbLarge kept as aliases. - Rename NeedleMapKind variants LevelDb*->Redb* so the in-tree names match the actual backend. - Update help text and add a parse-table test covering both names. Refs #9234. |
||
|
|
045ace29d5 |
fix(seaweed-volume): parse host:port.grpcPort in master address (#9235)
The Go ServerAddress format encodes an optional explicit gRPC port as host:port.grpcPort. The Rust heartbeat client only handled host:port (falling back to port+10000), so feeding it host:port.grpcPort yielded a malformed gRPC target like "host:port.grpcPort", which manifests as checkWithMaster transport errors. Mirror pb.ServerToGrpcAddress(): if the part after the last ':' contains a '.' followed by a valid u16, treat that suffix as the explicit gRPC port; otherwise keep the +10000 default. Refs #9234. |
||
|
|
503b6f2744 |
fix(seaweed-volume): ceil EC shard slots in maybe_adjust_volume_max (#9232)
Mirrors the volume-server side of seaweedfs/seaweedfs#9196: compute the EC-shard contribution to maxVolumeCount with proper ceiling division ((N + D - 1) / D) instead of (N + D) / D, which over-counts by one slot whenever the per-location EC-shard count is zero or an exact multiple of DataShardsCount (10). The most common case -- a location with no EC shards -- silently inflated maxVolumeCount by 1 on every recalculation. The matching low-disk effective_max_count path in heartbeat.rs already uses the correct ceiling form, and the master-side topology changes from that PR have no Rust counterpart. |
||
|
|
352ffdffe1 |
build(deps): bump rustls-webpki from 0.103.10 to 0.103.13 in /seaweed-volume (#9216)
build(deps): bump rustls-webpki in /seaweed-volume Bumps [rustls-webpki](https://github.com/rustls/webpki) from 0.103.10 to 0.103.13. - [Release notes](https://github.com/rustls/webpki/releases) - [Commits](https://github.com/rustls/webpki/compare/v/0.103.10...v/0.103.13) --- updated-dependencies: - dependency-name: rustls-webpki dependency-version: 0.103.13 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
036191c78a | Merge branch 'master' of https://github.com/seaweedfs/seaweedfs | ||
|
|
ae93f87a46 | adjust logo | ||
|
|
f438cc3544 |
fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard (#9184) (#9186)
* test(volume_server): reproduce #9184 ReceiveFile truncating a mounted shard ReceiveFile for an EC shard calls os.Create(filePath) which opens the path with O_TRUNC. When the shard is already mounted, the in-memory EcVolume holds a file descriptor against the same inode, so a second ReceiveFile call for the same (volume, shard) truncates the live shard file beneath the reader. Reproducer: generate and mount shard 0 for a populated volume, capture the on-disk size, then send a smaller payload for the same shard via ReceiveFile. The current handler accepts the overwrite and leaves the shard truncated in place; this test pins that behavior. When the fix lands the server should reject (or rename-then-swap) and this test must be inverted. * fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard ReceiveFile used os.Create on EC shard paths, which opens with O_TRUNC and truncates in place. When an EC shard is already mounted, the in-memory EcVolume holds file descriptors against the same inodes, so the truncation corrupts the live shard beneath any ongoing read. On retries of an EC task this produced the "missing parts" class of errors in #9184. The fix rejects any ReceiveFile for an EC volume that currently has mounted shards. The caller must unmount before retrying — silent truncation is never an acceptable outcome. Non-EC writes and ReceiveFile for volumes that have never been mounted on this server continue to work as before. Tests: - TestReceiveFileRejectsOverwriteOfMountedEcShard: mounts a shard, attempts an overwrite, asserts the error response and that the on-disk file and live reads are undisturbed. - TestReceiveFileAllowsEcShardWhenNoMount: pins the common-case contract that a first write to a target still succeeds. * fix(volume-rust): refuse ReceiveFile overwrite of mounted EC shard Mirror the Go-side change: reject receive_file for any EC volume that currently has mounted shards on this server. std::fs::File::create truncates in place and the in-memory EcVolume holds fds on the same inodes, so an overwrite would corrupt live readers. |
||
|
|
c4e1885053 |
fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement (#9184) (#9185)
* test(volume_server): reproduce #9184 EC ReceiveFile disk-placement bug The plugin-worker EC task sends shards via ReceiveFile, which picks Locations[0] as the target directory regardless of the admin planner's TargetDisk assignment. ReceiveFileInfo has no disk_id field, so there is no wire channel to honor the plan. Adds StartSingleVolumeClusterWithDataDirs to the integration framework so tests can launch a volume server with N data directories. The new repro asserts the current (buggy) behavior: sending three distinct EC shards via ReceiveFile leaves all three files in dir[0] and the other dirs empty. When the fix adds disk_id to ReceiveFileInfo, this assertion must flip to verify the planned placement is respected. * fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement Before this change, VolumeServer.ReceiveFile for EC shards always selected the first HDD location (Locations[0]). The plugin-worker EC task had no way to pass the admin planner's per-shard disk assignment — ReceiveFileInfo carried no disk_id field — so every received EC shard piled onto a single disk per destination server. On multi-disk servers this caused uneven load (one disk absorbing all EC shard I/O), frequent ENOSPC retries, and a growing EC backlog under sustained ingest (see issue #9184). Changes: - proto: add disk_id to ReceiveFileInfo, mirroring VolumeEcShardsCopyRequest.disk_id. - worker: DistributeEcShards tracks the planner-assigned disk per shard; sendShardFileToDestination forwards that disk id. Metadata files (ecx/ecj/vif) inherit the disk of the first data shard targeting the same node so they land next to the shards. - server: ReceiveFile honors disk_id when > 0 with bounds validation; disk_id=0 (unset) falls back to the same auto-selection pattern as VolumeEcShardsCopy (prefer disk that already has shards for this volume, then any HDD with free space, then any location with free space). Tests updated: - TestReceiveFileEcShardHonorsDiskID asserts three shards sent with disk_id={1,2,0} land on data dirs 1, 2, and 0 respectively. - TestReceiveFileEcShardRejectsInvalidDiskID pins the out-of-range disk_id rejection path. * fix(volume-rust): honor disk_id in ReceiveFile for EC shards Mirror the Go-side change: when disk_id > 0 place the EC shard on the requested disk; when unset, auto-select with the same preference order as volume_ec_shards_copy (disk already holding shards, then any HDD, then any disk). * fix(volume): compare disk_id as uint32 to avoid 32-bit overflow On 32-bit Go builds `int(fileInfo.DiskId) >= len(Locations)` can wrap a high-bit uint32 to a negative int, bypassing the bounds check before the index operation. Compare in the uint32 domain instead. * test(ec): fail invalid-disk_id test on transport error Previously a transport-level error from CloseAndRecv silently passed the test by returning early, masking any real gRPC failure. Fail loudly so only the structured ReceiveFileResponse rejection path counts as a pass. * docs(test): explain why DiskId=0 auto-selects dir 0 in EC placement test Documents the load-bearing assumption that shards are never mounted in this test, so loc.FindEcVolume always returns false and auto-select falls through to the first HDD. Saves future readers from re-deriving the expected directory for the DiskId=0 case. * fix(test): preserve baseDir/volume path for single-dir clusters StartSingleVolumeClusterWithDataDirs started naming the data directory volume0 even in the dataDirCount=1 case, which broke Scrub tests that reach into baseDir/volume via CorruptDatFile / CorruptEcShardFile / CorruptEcxFile. Keep the legacy name for single-dir clusters; only use the indexed "volumeN" layout when multiple disks are requested. |
||
|
|
45578a42e9 |
fix(volume): keep vacuum running past dangling .idx entries (#9115)
* fix(volume): keep vacuum running past dangling .idx entries Vacuum compaction aborted entirely on the first .idx entry whose offset pointed past the end of the .dat file, surfacing as `cannot hydrate needle from file: EOF` and stalling progress on every other volume. In both Go and Rust: - During compaction, skip an unreadable needle and continue. The bytes it pointed at were already unreachable via reads, so dropping the index reference makes the post-vacuum volume consistent. Real EIO still bails out so a disk fault is not silently papered over. - At volume load, do a single linear scan of the .idx and confirm every (offset + actual size) fits inside .dat. The pre-existing integrity check only looked at the last 10 entries, so deeper corruption (e.g. left over from a crashed batched write) went undetected and only surfaced later as a vacuum EOF. A failure now marks the volume read-only at load time so an operator can react. Refs #8928 * fix(volume): only skip permanent-corruption needle reads during vacuum Address PR review feedback (gemini-code-assist + coderabbit): The original patch skipped any non-EIO read failure, which would silently drop needles on transient errors — Windows hardware bad-sector errors (ERROR_CRC etc.) never surface as syscall.EIO; tiered-storage network timeouts and EROFS would also slip through and shrink the volume. Switch to an explicit whitelist of permanent-corruption shapes: - Add needle.ErrorCorrupted sentinel and wrap CRC and "index out of range" errors with %w so callers can match via errors.Is. - copyDataBasedOnIndexFile now skips only when the read failure is io.EOF, io.ErrUnexpectedEOF, ErrorSizeMismatch, ErrorSizeInvalid, or ErrorCorrupted. Anything else (real disk faults, environmental errors, Windows hardware codes) aborts the compaction so an operator notices. - Mirror the same whitelist in the Rust volume server, matching on io::ErrorKind::UnexpectedEof and the NeedleError corruption variants (SizeMismatch, CrcMismatch, IndexOutOfRange, TailTooShort). Also add `defer v.Close()` in TestVerifyIndexFitsInDat so Windows t.TempDir() cleanup can release the .dat/.idx handles. Refs #8928 * fix(volume): wrap entry-not-found size-mismatch with ErrorSizeMismatch Address PR review: the fallback branch in ReadBytes returned an unwrapped fmt.Errorf, so isSkippableNeedleReadError (and any caller using errors.Is(..., ErrorSizeMismatch)) could not match it. Wrap with %w so the whitelist applies, while leaving the existing direct sentinel return for the OffsetSize==4 / offset<MaxPossibleVolumeSize retry path unchanged so ReadData's `err == ErrorSizeMismatch` retry still triggers. Refs #8928 * fix(volume): integrate dangling-idx check into existing index load walk Address PR review (gemini-code-assist, medium): the structural .idx check used to do a second linear scan of the index file at every volume load, doubling the disk-I/O cost on servers managing many volumes. Track the largest (offset + actual size) seen during the existing needle-map load walks (`LoadCompactNeedleMap`, `NewLevelDbNeedleMap`, `NewSortedFileNeedleMap`'s `newNeedleMapMetricFromIndexFile`, `DoOffsetLoading`) on a new `MaximumNeedleEnd` field on `mapMetric`, exposed as `MaxNeedleEnd()` on the NeedleMapper interface. `volume.load()` then compares `nm.MaxNeedleEnd()` to the .dat size after the load is complete — pure numeric comparison, no extra I/O. The standalone `verifyIndexFitsInDat` helper and its caller in `CheckVolumeDataIntegrity` are removed; the test that used to drive the helper directly now exercises the new path via `LoadCompactNeedleMap`. Mirror the same change in the Rust volume server: track `max_needle_end` on `NeedleMapMetric`, expose via `max_needle_end()` on `CompactNeedleMap`, `RedbNeedleMap`, and the `NeedleMap` enum. The Rust load walk already happens in `load_from_idx` for both map kinds, so the structural check becomes free. Refs #8928 |
||
|
|
300e906330 |
admin: report file and delete counts for EC volumes (#9060)
* admin: report file and delete counts for EC volumes The admin bucket size fix (#9058) left object counts at zero for EC-encoded data because VolumeEcShardInformationMessage carried no file count. Billing/monitoring dashboards therefore still under-report objects once a bucket is EC-encoded. Thread file_count and delete_count end-to-end: - Add file_count/delete_count to VolumeEcShardInformationMessage (proto fields 8 and 9) and regenerate master_pb. - Compute them lazily on volume servers by walking the .ecx index once per EcVolume, cache on the struct, and keep the cache in sync inside DeleteNeedleFromEcx (distinguishing live vs already-tombstoned entries so idempotent deletes do not drift the counts). - Populate the new proto fields from EcVolume.ToVolumeEcShardInformationMessage and carry them through the master-side EcVolumeInfo / topology sync. - Aggregate in admin collectCollectionStats, deduping per volume id: every node holding shards of an EC volume reports the same counts, so summing across nodes would otherwise multiply the object count by the number of shard holders. Regression tests cover the initial .ecx walk, live/tombstoned delete bookkeeping (including idempotent and missing-key cases), and the admin dedup path for an EC volume reported by multiple nodes. * ec: include .ecj journal in EcVolume delete count The initial delete count only reflected .ecx tombstones, missing any needle that was journaled in .ecj but not yet folded into .ecx — e.g. on partial recovery. Expand initCountsLocked to take the union of .ecx tombstones and .ecj journal entries, deduped by needle id, so: - an id that is both tombstoned in .ecx and listed in .ecj counts once - a duplicate .ecj entry counts once - an .ecj id with a live .ecx entry is counted as deleted (not live) - an .ecj id with no matching .ecx entry is still counted Covered by TestEcVolumeFileAndDeleteCountEcjUnion. * ec: report delete count authoritatively and tombstone once per delete Address two issues with the previous EcVolume file/delete count work: 1. The delete count was computed lazily on first heartbeat and mixed in a .ecj-union fallback to "recover" partial state. That diverged from how regular volumes report counts (always live from the needle map) and had drift cases when .ecj got reconciled. Replace with an eager walk of .ecx at NewEcVolume time, maintained incrementally on every DeleteNeedleFromEcx call. Semantics now match needle_map_metric: FileCount is the total number of needles ever recorded in .ecx (live + tombstoned), DeleteCount is the tombstones — so live = FileCount - DeleteCount. Drop the .ecj-union logic entirely. 2. A single EC needle delete fanned out to every node holding a replica of the primary data shard and called DeleteNeedleFromEcx on each, which inflated the per-volume delete total by the replica factor. Rewrite doDeleteNeedleFromRemoteEcShardServers to try replicas in order and stop at the first success (one tombstone per delete), and only fall back to other shards when the primary shard has no home (ErrEcShardMissing sentinel), not on transient RPC errors. Admin aggregation now folds EC counts correctly: FileCount is deduped per volume id (every shard holder has an identical .ecx) and DeleteCount is summed across nodes (each delete tombstones exactly one node). Live object count = deduped FileCount - summed DeleteCount. Tests updated to match the new semantics: - EC volume counts seed FileCount as total .ecx entries (live + tombstoned), DeleteCount as tombstones. - DeleteNeedleFromEcx keeps FileCount constant and increments DeleteCount only on live->tombstone transitions. - Admin dedup test uses distinct per-node delete counts (5 + 3 + 2) to prove they're summed, while FileCount=100 is applied once. * ec: test fixture uses real vid; admin warns on skewed ec counts - writeFixture now builds the .ecx/.ecj/.ec00/.vif filenames from the actual vid passed in, instead of hardcoding "_1". The existing tests all use vid=1 so behaviour is unchanged, but the helper no longer silently diverges from its documented parameter. - collectCollectionStats logs a glog warning when an EC volume's summed delete count exceeds its deduped file count, surfacing the anomaly (stale heartbeat, counter drift, etc.) instead of silently dropping the volume from the object count. * ec: derive file/delete counts from .ecx/.ecj file sizes seedCountsFromEcx walked the full .ecx index at volume load, which is wasted work: .ecx has fixed-size entries (NeedleMapEntrySize) and .ecj has fixed-size deletion records (NeedleIdSize), so both counts are pure file-size arithmetic. fileCount = ecxFileSize / NeedleMapEntrySize deleteCount = ecjFileSize / NeedleIdSize Rip out the cached counters, countsLock, seedCountsFromEcx, and the recordDelete helper. Track ecjFileSize directly on the EcVolume struct, seed it from Stat() at load, and bump it on every successful .ecj append inside DeleteNeedleFromEcx under ecjFileAccessLock. Skip the .ecj write entirely when the needle is already tombstoned so the derived delete count stays idempotent on repeat deletes. Heartbeats now compute counts in O(1). Tests updated: the initial fixture pre-populates .ecj with two ids to verify the file-size derivation end-to-end, and the delete test keeps its idempotent-re-delete / missing-needle invariants (unchanged externally, now enforced by the early return rather than a cache guard). * ec: sync Rust volume server with Go file/delete count semantics Mirror the Go-side EC file/delete count work in the Rust volume server so mixed Go/Rust clusters report consistent bucket object counts in the admin dashboard. - Add file_count (8) and delete_count (9) to the Rust copy of VolumeEcShardInformationMessage (seaweed-volume/proto/master.proto). - EcVolume gains ecj_file_size, seeded from the journal's metadata on open and bumped inside journal_delete on every successful append. - file_and_delete_count() returns counts derived in O(1) from ecx_file_size / NEEDLE_MAP_ENTRY_SIZE and ecj_file_size / NEEDLE_ID_SIZE, matching Go's FileAndDeleteCount. - to_volume_ec_shard_information_messages populates the new proto fields instead of defaulting them to zero. - mark_needle_deleted_in_ecx now returns a DeleteOutcome enum (NotFound / AlreadyDeleted / Tombstoned) so journal_delete can skip both the .ecj append and the size bump when the needle is missing or already tombstoned, keeping the derived delete_count idempotent on repeat or no-op deletes. - Rust's EcVolume::new no longer replays .ecj into .ecx on load. Go's RebuildEcxFile is only called from specific decode/rebuild gRPC handlers, not on volume open, and replaying on load was hiding the deletion journal from the new file-size-derived delete counter. rebuild_ecx_from_journal is kept as dead_code for future decode paths that may want the same replay semantics. Also clean up the Go FileAndDeleteCount to drop unnecessary runtime guards against zero constants — NeedleMapEntrySize and NeedleIdSize are compile-time non-zero. test_ec_volume_journal updated to pre-populate the .ecx with the needles it deletes, and extended to verify that repeat and missing-id deletes do not drift the derived counts. * ec: document enterprise-reserved proto field range on ec shard info Both OSS master.proto copies now note that fields 10-19 are reserved for future upstream additions while 20+ are owned by the enterprise fork. Enterprise already pins data_shards/parity_shards at 20/21, so keeping OSS additions inside 8-19 avoids wire-level collisions for mixed deployments. * ec(rust): resolve .ecx/.ecj helpers from ecx_actual_dir ecx_file_name() and ecj_file_name() resolved from self.dir_idx, but new() opens the actual files from ecx_actual_dir (which may fall back to the data dir when the idx dir does not contain the index). After a fallback, read_deleted_needles() and rebuild_ecx_from_journal() would read/rebuild the wrong (nonexistent) path while heartbeats reported counts from the file actually in use — silently dropping deletes. Point idx_base_name() at ecx_actual_dir, which is initialized to dir_idx and only diverges after a successful fallback, so every call site agrees with the file new() has open. The pre-fallback call in new() (line 142) still returns the dir_idx path because ecx_actual_dir == dir_idx at that point. Update the destroy() sweep to build the dir_idx cleanup paths explicitly instead of leaning on the helpers, so post-fallback stale files in the idx dir are still removed. * ec: reset ecj size after rebuild; rollback ecx tombstone on ecj failure Two EC delete-count correctness fixes applied symmetrically to Go and Rust volume servers. 1. rebuild_ecx_from_journal (Rust) now sets ecj_file_size = 0 after recreating the empty journal, matching the on-disk truth. Previously the cached size still reflected the pre-rebuild journal and file_and_delete_count() would keep reporting stale delete counts. The Go side has no equivalent bug because RebuildEcxFile runs in an offline helper that does not touch an EcVolume struct. 2. DeleteNeedleFromEcx / journal_delete used to tombstone the .ecx entry before writing the .ecj record. If the .ecj append then failed, the needle was permanently marked deleted but the heartbeat-reported delete_count never advanced (it is derived from .ecj file size), and a retry would see AlreadyDeleted and early- return, leaving the drift permanent. Both languages now capture the entry's file offset and original size bytes during the mark step, attempt the .ecj append, and on failure roll the .ecx tombstone back by writing the original size bytes at the known offset. A rollback that itself errors is logged (glog / tracing) but cannot re-sync the files — this is the same failure mode a double disk error would produce, and is unavoidable without a full on-disk transaction log. Go: wrap MarkNeedleDeleted in a closure that captures the file offset into an outer variable, then pass the offset + oldSize to the new rollbackEcxTombstone helper on .ecj seek/write errors. Rust: DeleteOutcome::Tombstoned now carries the size_offset and a [u8; SIZE_SIZE] copy of the pre-tombstone size field. journal_delete destructures on Tombstoned and calls restore_ecx_size on .ecj append failure. * test(ec): widen admin /health wait to 180s for cold CI TestEcEndToEnd starts master, 14 volume servers, filer, 2 workers and admin in sequence, then waited only 60s for admin's HTTP server to come up. On cold GitHub runners the tail of the earlier subprocess startups eats most of that budget and the wait occasionally times out (last hit on run 24374773031). The local fast path is still ~20s total, so the bump only extends the timeout ceiling, not the happy path. * test(ec): fork volume servers in parallel in TestEcEndToEnd startWeed is non-blocking (just cmd.Start()), so the per-process fork + mkdir + log-file-open overhead for 14 volume servers was serialized for no reason. On cold CI disks that overhead stacks up and eats into the subsequent admin /health wait, which is how run 24374773031 flaked. Wrap the volume-server loop in a sync.WaitGroup and guard runningCmds with a mutex so concurrent appends are safe. startWeed still calls t.Fatalf on failure, which is fine from a goroutine for a fatal test abort; the fail-fast isn't something we rely on for precise ordering. * ec: fsync ecx before ecj, truncate on failure, harden rebuild Four correctness fixes covering both volume servers. 1. Durability ordering (Go + Rust). After marking the .ecx tombstone we now fsync .ecx before touching .ecj, so a crash between the two files cannot leave the journal with an entry for a needle whose tombstone is still sitting in page cache. Once the fsync returns, the tombstone is the source of truth: reads see "deleted", delete_count may under-count by one (benign, idempotent retries) but never over-reports. If the fsync itself fails we restore the original size bytes and surface the error. The .ecj append is then followed by its own Sync so the reported delete_count matches the on-disk journal once the write returns. 2. .ecj truncation on append failure. write_all may have extended the journal on disk before sync_all / Sync errors out, leaving the cached ecj_file_size out of sync with the physical length and drifting delete_count permanently after restart. Both languages now capture the pre-append size, truncate the file back via set_len / Truncate on any write or sync failure, and only then restore the .ecx tombstone. Truncation errors are logged — same-fd length resets cannot realistically fail — but cannot themselves re-sync the files. 3. Atomic rebuild_ecx_from_journal (Rust, dead code today but wired up on any future decode path). Previously a failed mark_needle_deleted_in_ecx call was swallowed with `let _ = ...` and the journal was still removed, silently losing tombstones. We now bubble up any non-NotFound error, fsync .ecx after the whole replay succeeds, and only then drop and recreate .ecj. NotFound is still ignored (expected race between delete and encode). 4. Missing-.ecx hardening (Rust). mark_needle_deleted_in_ecx used to return Ok(NotFound) when self.ecx_file was None, hiding a closed or corrupt volume behind what looks like an idempotent no-op. It now returns an io::Error carrying the volume id so callers (e.g. journal_delete) fail loudly instead. Existing Go and Rust EC test suites stay green. * ec: make .ecx immutable at runtime; track deletes in memory + .ecj Refactors both volume servers so the sealed sorted .ecx index is never mutated during normal operation. Runtime deletes are committed to the .ecj deletion journal and tracked in an in-memory deleted-needle set; read-path lookups consult that set to mask out deleted ids on top of the immutable .ecx record. Mirrors the intended design on both Go and Rust sides. EcVolume gains a `deletedNeedles` / `deleted_needles` set seeded from .ecj in NewEcVolume / EcVolume::new. DeleteNeedleFromEcx / journal_delete: 1. Looks the needle up read-only in .ecx. 2. Missing needle -> no-op. 3. Pre-existing .ecx tombstone (from a prior decode/rebuild) -> mirror into the in-memory set, no .ecj append. 4. Otherwise append the id to .ecj, fsync, and only then publish the id into the set. A partial write is truncated back to the pre-append length so the on-disk journal and the in-memory set cannot drift. FindNeedleFromEcx / find_needle_from_ecx now return TombstoneFileSize when the id is in the in-memory set, even though the bytes on disk still show the original size. FileAndDeleteCount: fileCount = .ecx size / NeedleMapEntrySize (unchanged) deleteCount = len(deletedNeedles) (was: .ecj size / NeedleIdSize) The RebuildEcxFile / rebuild_ecx_from_journal decode-time helpers still fold .ecj into .ecx — that is the one place tombstones land in the physical index, and it runs offline on closed files. Rust's rebuild helper now also clears the in-memory set when it succeeds. Dead code removed on the Rust side: `DeleteOutcome`, `mark_needle_deleted_in_ecx`, `restore_ecx_size`. Go drops the runtime `rollbackEcxTombstone` path. Neither helper was needed once .ecx stopped being a runtime mutation target. TestEcVolumeSyncEnsuresDeletionsVisible (issue #7751) is rewritten as TestEcVolumeDeleteDurableToJournal, which exercises the full durability chain: delete -> .ecj fsync -> FindNeedleFromEcx masks via the in-memory set -> raw .ecx bytes are *unchanged* -> Close + RebuildEcxFile folds the journal into .ecx -> raw bytes now show the tombstone, as CopyFile in the decode path expects. |
||
|
|
10b0bdce02 |
feat: pass expected_data_size from clients for size-aware assignment (#9032)
* feat: pass expected_data_size from clients for size-aware assignment Add expected_data_size field to AssignRequest (master proto) and AssignVolumeRequest (filer proto) so clients can hint how large the data will be. The master uses this instead of the 1MB default when tracking pending volume sizes for weighted assignment. - Add expected_data_size to master.proto AssignRequest - Add expected_data_size to filer.proto AssignVolumeRequest - Wire through filer AssignVolume handler - Wire through HTTP submit handler (uses actual upload size) - Add ExpectedDataSize to VolumeAssignRequest in operation package - Topology.PickForWrite accepts optional expectedDataSize parameter * fix: guard integer conversions in expected_data_size path - common.go: clamp OriginalDataSize to non-negative before uint64 cast - topology.go: cap expectedDataSize at math.MaxInt64 before int64 cast * fix: parse dataSize hint in HTTP /dir/assign and test non-zero expectedDataSize - HTTP /dir/assign now parses optional "dataSize" query parameter and passes it to PickForWrite instead of hardcoded 0 - Add test assertion for PickForWrite with non-zero expectedDataSize |
||
|
|
3d17bab544 |
fix(seaweed-volume): eliminate global S3 tier registry races in tests
Multiple Rust tests were racing on the shared global S3TierRegistry by calling clear(), which wiped entries registered by concurrently running tests. Use test-specific backend IDs and targeted remove() instead of clear() so tests no longer interfere with each other. |
||
|
|
0220b67115 |
fix(seaweed-volume): fix flaky Rust unit tests
- Increase volume_size_limit in preallocate test from 1KB to 100MB so disk-free fluctuations between get_disk_stats calls cannot make the integer-division results equal. - Add readiness synchronization to both spawn_fake_s3_server helpers so the test thread waits until axum is about to serve before proceeding. - Fix test_remote_vif_load_blocks_writes_but_allows_delete: register a dummy S3 backend with a test-specific ID so the volume can load its remote .vif without racing with other tests on the global registry. |
||
|
|
0da1794856 |
fix(rust): remove transitive openssl dependency from seaweed-volume
reqwest's default features include native-tls which depends on openssl-sys, causing builds to fail on musl targets where OpenSSL headers are not available. Since we already use rustls-tls, disable default features to eliminate the openssl-sys dependency entirely. |
||
|
|
9add18e169 |
fix(volume-rust): fix volume balance between Go and Rust servers (#8915)
Two bugs prevented reliable volume balancing when a Rust volume server is the copy target: 1. find_last_append_at_ns returned None for delete tombstones (Size==0 in dat header), falling back to file mtime truncated to seconds. This caused the tail step to re-send needles from the last sub-second window. Fix: change `needle_size <= 0` to `< 0` since Size==0 delete needles still have a valid timestamp in their tail. 2. VolumeTailReceiver called read_body_v2 on delete needles, which have no DataSize/Data/flags — only checksum+timestamp+padding after the header. Fix: skip read_body_v2 when size == 0, reject negative sizes. Also: - Unify gRPC server bind: use TcpListener::bind before spawn for both TLS and non-TLS paths, propagating bind errors at startup. - Add mixed Go+Rust cluster test harness and integration tests covering VolumeCopy in both directions, copy with deletes, and full balance move with tail tombstone propagation and source deletion. - Make FindOrBuildRustBinary configurable for default vs no-default features (4-byte vs 5-byte offsets). |
||
|
|
995dfc4d5d |
chore: remove ~50k lines of unreachable dead code (#8913)
* chore: remove unreachable dead code across the codebase Remove ~50,000 lines of unreachable code identified by static analysis. Major removals: - weed/filer/redis_lua: entire unused Redis Lua filer store implementation - weed/wdclient/net2, resource_pool: unused connection/resource pool packages - weed/plugin/worker/lifecycle: unused lifecycle plugin worker - weed/s3api: unused S3 policy templates, presigned URL IAM, streaming copy, multipart IAM, key rotation, and various SSE helper functions - weed/mq/kafka: unused partition mapping, compression, schema, and protocol functions - weed/mq/offset: unused SQL storage and migration code - weed/worker: unused registry, task, and monitoring functions - weed/query: unused SQL engine, parquet scanner, and type functions - weed/shell: unused EC proportional rebalance functions - weed/storage/erasure_coding/distribution: unused distribution analysis functions - Individual unreachable functions removed from 150+ files across admin, credential, filer, iam, kms, mount, mq, operation, pb, s3api, server, shell, storage, topology, and util packages * fix(s3): reset shared memory store in IAM test to prevent flaky failure TestLoadIAMManagerFromConfig_EmptyConfigWithFallbackKey was flaky because the MemoryStore credential backend is a singleton registered via init(). Earlier tests that create anonymous identities pollute the shared store, causing LookupAnonymous() to unexpectedly return true. Fix by calling Reset() on the memory store before the test runs. * style: run gofmt on changed files * fix: restore KMS functions used by integration tests * fix(plugin): prevent panic on send to closed worker session channel The Plugin.sendToWorker method could panic with "send on closed channel" when a worker disconnected while a message was being sent. The race was between streamSession.close() closing the outgoing channel and sendToWorker writing to it concurrently. Add a done channel to streamSession that is closed before the outgoing channel, and check it in sendToWorker's select to safely detect closed sessions without panicking. |
||
|
|
bb23939b36 |
fix(volume-rust): resolve gRPC bind address from hostname
SocketAddr::parse() only accepts numeric IPs, so binding the gRPC server to "localhost:18833" panicked. Use tokio::net::lookup_host() to resolve hostnames before passing to tonic's serve_with_shutdown. |
||
|
|
2a6f27eb08 | Suppress unused_mut warning for admin_router on non-unix builds | ||
|
|
08f48e62c9 |
Fix missing std::io::Read import for Windows build in ec_encoder
The #[cfg(not(unix))] fallback path uses f.read() which requires the Read trait to be in scope. |
||
|
|
e29b685c20 |
Gate pprof dependency behind cfg(unix) to fix Windows build
The pprof crate uses Unix-only APIs (nix, libc::pthread_t, libc::siginfo_t, etc.) that don't exist on Windows. Move it to [target.'cfg(unix)'.dependencies] and gate all profiling/debug module usage with #[cfg(unix)]. |