seaweedfs/seaweed-volume
Chris Lu 9d6d068f41 feat(seaweed-volume): cross-disk EC shard reconciliation (#9212) (#9252)
* fix(seaweed-volume): fall back to idx dir when reading .vif

EcVolume::new and read_ec_shard_config only looked for .vif at the
data dir. With the cross-disk reconcile path (where shards live on
one disk and .ecx / .ecj / .vif live on a sibling disk —
seaweedfs/seaweedfs#9212 / #9244), this would either write a stub
.vif on the shard disk and lose the real EC config + dat_file_size
or fall back to default ratios despite a perfectly good .vif being
present elsewhere on the same volume server.

Add a small `locate_vif_path` helper that prefers the data dir and
falls back to the idx dir when it differs, and thread the data dir
+ idx dir pair through `read_ec_shard_config`. Three call sites in
grpc_server.rs (VolumeEcShardsGenerate, VolumeEcShardsRebuild, scrub)
updated; the scrub path passes the same dir for both args because
`find_ec_dir` is the only locator there.
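
A minimal sketch of the lookup order described above, assuming the helper takes the data dir, the idx dir, and a volume base name. The real signature of `locate_vif_path` is not shown in this commit message, so the parameters and return type here are illustrative only:

```rust
use std::path::{Path, PathBuf};

/// Illustrative only: returns the path of `<base_name>.vif`, preferring
/// `data_dir` and falling back to `idx_dir` when it is a different directory.
/// The parameter list is an assumption, not the crate's actual signature.
fn locate_vif_path(data_dir: &Path, idx_dir: &Path, base_name: &str) -> Option<PathBuf> {
    let file_name = format!("{base_name}.vif");

    // Preferred location: next to the shard files.
    let in_data = data_dir.join(&file_name);
    if in_data.is_file() {
        return Some(in_data);
    }

    // Fallback: the idx dir, but only when it is not the same directory.
    if idx_dir != data_dir {
        let in_idx = idx_dir.join(&file_name);
        if in_idx.is_file() {
            return Some(in_idx);
        }
    }
    None
}
```

Returning `None` rather than a default path leaves the "write a stub .vif" decision to the caller, which is one way to avoid shadowing a real `.vif` on a sibling disk.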

* feat(seaweed-volume): primitives for cross-disk EC shard reconcile

Adds the three small helpers the reconcile pass needs:

- DiskLocation::mount_ec_shards_with_idx_dir — mounts shards on this
  disk while pointing the EcVolume at a sibling disk's idx dir for
  .ecx / .ecj / .vif. Mirrors loadEcShardsWithIdxDir in
  weed/storage/disk_location_ec.go. The existing mount_ec_shards is
  kept as a thin wrapper over it.

- EcVolume::has_shard — `pub` accessor over the internal Vec<Option>
  shard slot so the reconcile pass can skip shards that are already
  registered.

- pub(crate) re-exports of parse_collection_volume_id and
  parse_ec_shard_extension under names parse_collection_volume_id_pub
  and is_ec_shard_extension so the reconcile module can call them
  without re-implementing the parsers.

No behaviour change. Reconciliation logic in the next commit.

* feat(seaweed-volume): cross-disk EC shard reconciliation (#9212)

Closes the loader half of seaweedfs/seaweedfs#9212 on the Rust side,
mirroring the Go fix in seaweedfs/seaweedfs#9244. With the auto-load
in feat/rust-load-all-ec-shards-9212 in place, the only remaining gap
is shards that landed on a disk without their `.ecx` — for example
when ec.balance / ec.rebuild moved them onto a destination node's
second disk while leaving the index files on the disk that already
held the volume. Without this, those orphan shards stay invisible to
the master and ec.rebuild reports the volume as unrepairable.

After every DiskLocation has finished its per-disk EC scan, sweep the
store for shards that live on a disk without local index files and
load them by reaching across to a sibling disk's `.ecx` / `.ecj` /
`.vif`:

  - Store::reconcile_ec_shards_across_disks walks each disk for
    orphan `.ec??` files (present on disk, not yet registered to an
    EcVolume) and matches them against an `(collection, vid) ->
    EcxOwnerInfo` map of which disk owns each `.ecx`.
  - Each matched group is mounted on its physical disk's ec_volumes
    map (so heartbeat reporting carries the right disk_id per shard)
    via `mount_ec_shards_with_idx_dir`, pointing the EcVolume at the
    sibling's idx dir.
  - `index_ecx_owners` records the directory each `.ecx` was found in
    (IdxDirectory or Directory) so the loader doesn't ENOENT when the
    legacy "written before -dir.idx was set" layout puts `.ecx` in
    the data dir. This mirrors the PR #9244 review fix from
    @gemini-code-assist / @coderabbitai (see Go commit af57cc652).
  - True orphans (no `.ecx` anywhere on this server) log a warning
    and stay on disk untouched — operator can restore the index later.
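
The matching step above can be sketched as a pure planning function. Only the `EcxOwnerInfo` name and the `(collection, vid)` key come from the description; the `OrphanShard` type, the field names, and `plan_reconcile` itself are hypothetical stand-ins, and the real code works on `DiskLocation` state rather than plain vectors:

```rust
use std::collections::{BTreeMap, HashMap};
use std::path::PathBuf;

/// Which disk owns a volume's `.ecx`, and the directory it was found in.
#[derive(Debug, Clone, PartialEq)]
struct EcxOwnerInfo {
    location: usize,
    idx_dir: PathBuf,
}

/// Hypothetical record for a `.ec??` file that is on disk but not registered.
#[derive(Debug, Clone, PartialEq)]
struct OrphanShard {
    location: usize,
    collection: String,
    vid: u32,
    shard_id: u8,
}

/// (physical disk, collection, vid) -> (sibling idx dir, shard ids to mount).
type MountPlan = BTreeMap<(usize, String, u32), (PathBuf, Vec<u8>)>;

/// Groups orphan shards by the disk they physically live on and pairs each
/// group with the owner's idx dir. Shards with no owner are returned as-is
/// so the caller can log a warning and leave them untouched.
fn plan_reconcile(
    orphans: &[OrphanShard],
    owners: &HashMap<(String, u32), EcxOwnerInfo>,
) -> (MountPlan, Vec<OrphanShard>) {
    let mut plan = MountPlan::new();
    let mut unowned = Vec::new();
    for shard in orphans {
        match owners.get(&(shard.collection.clone(), shard.vid)) {
            Some(owner) => plan
                .entry((shard.location, shard.collection.clone(), shard.vid))
                .or_insert_with(|| (owner.idx_dir.clone(), Vec::new()))
                .1
                .push(shard.shard_id),
            None => unowned.push(shard.clone()),
        }
    }
    (plan, unowned)
}
```

Keying the plan by the shard's physical disk, not the owner's, is what keeps each shard in the `ec_volumes` map of the disk it actually lives on.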

Wired into Store::add_location and Store::load_new_volumes so a fresh
restart and any later disk additions both pick up cross-disk shards.

Tests cover all four behaviour shapes:
- shards on dir0 + .ecx on dir1 → reconciled to dir0's ec_volumes
- .ecx in owner's data dir (legacy layout) → reconciled correctly
- self-contained disks → reconcile is a no-op
- truly-orphan shards (no .ecx anywhere) → left on disk, logged

* fix(seaweed-volume): propagate EcVolume::new errors instead of unwrap

mount_ec_shards_with_idx_dir built the missing EcVolume inside an
entry().or_insert_with() closure, which can't return a Result — so
any EcVolume::new failure (e.g. .ecx open error, .ecj create error,
malformed .vif) panicked the volume server via unwrap(). The
constructor already returns Result<>, so propagate it as
VolumeError::Io instead.

Reported in PR #9252 review by @gemini-code-assist (high) and
@coderabbitai (critical).
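
For illustration, one common way to make a fallible constructor work with a map entry is to match on `Entry` instead of using `or_insert_with`. The `EcVolume` and `VolumeError` types below are simplified stand-ins, and `get_or_create` is a hypothetical name, not the function changed in this commit:

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::io;

#[derive(Debug)]
enum VolumeError {
    Io(io::Error),
}

#[derive(Debug)]
struct EcVolume {
    vid: u32,
}

impl EcVolume {
    /// Stand-in constructor: fails for vid 0 to simulate an `.ecx` open error.
    fn new(vid: u32) -> io::Result<EcVolume> {
        if vid == 0 {
            return Err(io::Error::new(io::ErrorKind::NotFound, "missing .ecx"));
        }
        Ok(EcVolume { vid })
    }
}

/// Returns the existing EcVolume or builds a new one, propagating the
/// constructor error instead of panicking inside a closure.
fn get_or_create(
    map: &mut HashMap<u32, EcVolume>,
    vid: u32,
) -> Result<&mut EcVolume, VolumeError> {
    match map.entry(vid) {
        Entry::Occupied(slot) => Ok(slot.into_mut()),
        Entry::Vacant(slot) => {
            // `?` works here because we are in the function body, not a closure.
            let vol = EcVolume::new(vid).map_err(VolumeError::Io)?;
            Ok(slot.insert(vol))
        }
    }
}
```

On failure the map is left unchanged, so a bad `.ecx` does not leave a half-initialised entry behind.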

* perf(seaweed-volume): use DirEntry::metadata in collect_orphan_ec_shards

Replaced the extra fs::metadata(&path) lookup with ent.metadata() so
we don't pay an additional stat syscall per directory entry beyond
what read_dir already returned. Drops the now-unused std::path::Path
import alongside.

Reported in PR #9252 review by @gemini-code-assist.
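
A small sketch of the pattern, using a hypothetical `list_regular_files` helper rather than the real `collect_orphan_ec_shards`. Note that `DirEntry::metadata` does not traverse symlinks, unlike `fs::metadata`, and how much work it saves depends on the platform:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns (file name, size in bytes) for every regular file in `dir`,
/// sorted by name. Uses the entry's own metadata instead of a second
/// path-based lookup.
fn list_regular_files(dir: &Path) -> io::Result<Vec<(String, u64)>> {
    let mut out = Vec::new();
    for ent in fs::read_dir(dir)? {
        let ent = ent?;
        // No `fs::metadata(&ent.path())` round trip; symlinks are not followed.
        let meta = ent.metadata()?;
        if meta.is_file() {
            let name = ent.file_name().to_string_lossy().into_owned();
            out.push((name, meta.len()));
        }
    }
    out.sort();
    Ok(out)
}
```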

* fix(seaweed-volume): scrub uses EcVolume's real dir_idx for split-disk volumes

After cross-disk reconciliation an EcVolume can legitimately have
ecv.dir != ecv.dir_idx (shards on one disk, .ecx / .ecj / .vif on a
sibling). The scrub path collapsed both args to find_ec_dir's single
answer, so read_ec_shard_config fell back to the wrong .vif location
for exactly the split-disk layout this PR loads — skewing
shard-count detection and verification results.

Use ecv.dir / ecv.dir_idx directly so scrub reads the metadata from
where the volume's index files actually live.

Reported in PR #9252 review by @coderabbitai.

* feat(seaweed-volume): primitives for split-disk EC volume operations

Reconciliation can mount the same `vid` on multiple DiskLocations
with disjoint shard subsets. The existing first-match `find_ec_volume`
isn't enough for read/unmount/delete/decode paths that need to act on
a specific shard or aggregate across the whole volume — they have to
walk every location and find the right home for each shard.

Add the small Store-level lookup primitives Go's findEcShard /
CollectEcShards already provide:

- `Store::find_ec_shard_location(vid, shard_id)` — returns the index
  of the location that has `(vid, shard_id)` mounted, if any.
- `Store::find_ec_volume_with_shard(vid, shard_id)` — same idea but
  returns the EcVolume directly.
- `Store::collect_ec_shard_dirs(vid, max_shard_count)` — returns
  the EcVolume to use for metadata plus per-shard data dirs (None
  when the shard isn't mounted on any disk). Mirrors
  `Store.CollectEcShards` in `weed/storage/store_ec.go`.

And the EcVolume accessors callers need:

- `EcVolume::has_shard(shard_id)` — was already added for the cross-
  disk reconcile but is now a load-bearing primitive for placement
  decisions on a per-shard basis. Pulled into the dedicated commit.
- `EcVolume::ecx_actual_dir()` — exposes the directory the `.ecx`
  was actually opened from. The decoder needs it for the .ecx
  lookup when shards are split across data dirs and `.ecx` lives on
  a sibling idx dir.
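
To illustrate the difference from first-match lookup, here is a simplified sketch of the two shard-aware lookups and `has_shard`. The struct layouts are assumptions (a shard slot is modelled as `Option<PathBuf>`, and the test layout assumes 14 slots); only the method names come from the list above:

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Simplified EcVolume: one slot per shard id, `Some` when mounted.
struct EcVolume {
    shards: Vec<Option<PathBuf>>,
}

impl EcVolume {
    fn has_shard(&self, shard_id: u8) -> bool {
        self.shards
            .get(shard_id as usize)
            .map_or(false, |slot| slot.is_some())
    }
}

struct DiskLocation {
    ec_volumes: HashMap<u32, EcVolume>,
}

struct Store {
    locations: Vec<DiskLocation>,
}

impl Store {
    /// Index of the location that has `(vid, shard_id)` mounted, if any.
    fn find_ec_shard_location(&self, vid: u32, shard_id: u8) -> Option<usize> {
        self.locations.iter().position(|loc| {
            loc.ec_volumes
                .get(&vid)
                .map_or(false, |ecv| ecv.has_shard(shard_id))
        })
    }

    /// Same walk, but returns the EcVolume that holds the shard.
    fn find_ec_volume_with_shard(&self, vid: u32, shard_id: u8) -> Option<&EcVolume> {
        self.locations
            .iter()
            .filter_map(|loc| loc.ec_volumes.get(&vid))
            .find(|ecv| ecv.has_shard(shard_id))
    }
}
```

With shards 0 and 12 on disk 0 and shard 1 on disk 1, a lookup for shard 1 skips disk 0 even though it has the volume id, which is the case a first-match lookup gets wrong.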

Plus a small defensive change to `DiskLocation::unmount_ec_shards`:
only decrement the per-shard gauge for shards that were actually
mounted. Without this, the upcoming `Store::unmount_ec_shards`
fan-out to every location would underflow the metric whenever a
shard is requested for unmount on a sibling disk that doesn't have
it.

No behaviour change at the call sites yet — wiring follows in the
next commits.

* fix(seaweed-volume): unmount_ec_shards visits every location with the vid

Store::unmount_ec_shards and Store::unmount_ec_shard returned after
the first DiskLocation with the volume id, even if that location did
not contain the requested shard. With reconciled split-disk volumes
(shards 0/12 on disk 0, shard 1 on disk 1 — the issue #9212 layout
this PR loads), VolumeEcShardsUnmount for a later-disk shard became a
silent no-op and Store::delete_ec_shards could remove the shard file
while leaving an in-memory shard + open file handle stale on the
later location.

Walk all locations that have the EcVolume and ask each to unmount
whatever subset of `shard_ids` it actually has — the
`DiskLocation::unmount_ec_shards` defensive guard from the previous
commit makes the fan-out safe (no metric underflow when a sibling
disk is asked to unmount a shard it doesn't hold).
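
A rough sketch of the fan-out together with the defensive guard, under simplifying assumptions: mounted shards are a `BTreeSet<u8>` per volume id and the gauge is a plain `i64` passed in by the caller, neither of which reflects the real data structures or metrics API:

```rust
use std::collections::{BTreeSet, HashMap};

struct DiskLocation {
    /// vid -> shard ids currently mounted on this disk (simplified).
    ec_volumes: HashMap<u32, BTreeSet<u8>>,
}

impl DiskLocation {
    /// Unmounts whichever of `shard_ids` this disk actually holds and
    /// returns how many were removed. The gauge only moves for those.
    fn unmount_ec_shards(&mut self, vid: u32, shard_ids: &[u8], gauge: &mut i64) -> usize {
        let Some(mounted) = self.ec_volumes.get_mut(&vid) else {
            return 0;
        };
        let mut removed = 0;
        for id in shard_ids {
            if mounted.remove(id) {
                *gauge -= 1;
                removed += 1;
            }
        }
        // Drop the volume entry once its last shard is gone.
        if mounted.is_empty() {
            self.ec_volumes.remove(&vid);
        }
        removed
    }
}

struct Store {
    locations: Vec<DiskLocation>,
    ec_shards_gauge: i64,
}

impl Store {
    /// Visits every location instead of returning after the first match.
    fn unmount_ec_shards(&mut self, vid: u32, shard_ids: &[u8]) -> usize {
        let mut total = 0;
        for loc in self.locations.iter_mut() {
            total += loc.unmount_ec_shards(vid, shard_ids, &mut self.ec_shards_gauge);
        }
        total
    }
}
```

Because each location only reports what it really removed, asking a sibling disk about a shard it never held is harmless.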

* fix(seaweed-volume): VolumeEcShardRead reads from the shard's home disk

VolumeEcShardRead resolved the EcVolume via first-match
`find_ec_volume(vid)` and then looked up the requested shard on that
single EcVolume. With reconciled split-disk volumes (the layout
seaweedfs/seaweedfs#9212 produces — shards 0/12 on disk 0, shard 1
on disk 1), a request for shard 1 hit disk 0 first and returned
"shard 1 not mounted" even though it was happily mounted on disk 1.

Switch to `find_ec_volume_with_shard(vid, shard_id)` so the lookup
walks every location and returns the EcVolume whose disk actually
holds the shard. The deleted-needle check still works because every
per-disk EcVolume for the same vid points at the same `.ecx` file
(post-reconcile, both disks open the same sealed index).

* fix(seaweed-volume): VolumeEcShardsToVolume aggregates shards across disks

VolumeEcShardsToVolume resolved a single EcVolume via
`find_ec_volume(vid)` and then checked `ec_vol.shards[i]` for each
data shard. With reconciled split-disk volumes that's the wrong
view: the first-match EcVolume only carries the shards on its disk,
so the presence check would either reject the request as
"missing shard" or — if shards happened to be on the first disk —
fall through to `write_dat_file_from_shards(&dir, ...)` which only
reads from the EcVolume's single dir.

Mirror Go's CollectEcShards by aggregating per-shard data dirs
across every location with the volume:

- Add `Store::collect_ec_shard_dirs` (in the previous primitives
  commit) returning the EcVolume to use for metadata + per-shard
  dir slots.
- Extend `find_dat_file_size` and `write_dat_file_from_shards` with
  `_with_dirs` variants that take the `.ec00` dir and per-shard
  dirs separately, so a decoded volume whose shards live on
  several disks can still be reconstructed. The original signatures
  delegate to the new ones with the same dir for all shards, so
  every existing caller keeps working unchanged.
- Rewire VolumeEcShardsToVolume through the helpers — presence
  check sees the union, dat_file_size reads `.ec00` from the right
  disk and `.ecx` from the EcVolume's actual idx dir, the decoder
  reads each shard from its own home dir.
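
The delegation pattern in the second bullet can be sketched with a toy path resolver. `shard_paths` and `shard_paths_with_dirs` are hypothetical names used only to show the shape; the real `_with_dirs` functions read shard data rather than just building paths:

```rust
use std::path::{Path, PathBuf};

/// New-style variant: each shard has its own optional home dir.
/// Fails on the first shard that is not mounted anywhere.
fn shard_paths_with_dirs(
    base_name: &str,
    shard_dirs: &[Option<PathBuf>],
) -> Result<Vec<PathBuf>, String> {
    shard_dirs
        .iter()
        .enumerate()
        .map(|(i, dir)| {
            dir.as_ref()
                .map(|d| d.join(format!("{base_name}.ec{i:02}")))
                .ok_or_else(|| format!("missing shard {i}"))
        })
        .collect()
}

/// Original-style signature: delegates with the same dir for every shard,
/// so existing single-disk callers keep working unchanged.
fn shard_paths(base_name: &str, dir: &Path, shard_count: usize) -> Result<Vec<PathBuf>, String> {
    let dirs = vec![Some(dir.to_path_buf()); shard_count];
    shard_paths_with_dirs(base_name, &dirs)
}
```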

* test(seaweed-volume): split-disk read / unmount / delete / collect

Five tests exercising the four behaviour shapes the PR #9252 review
flagged on multi-location EC volumes. Each builds the cross-disk
split layout from issue #9212 (shards 0 and 12 on disk 0, shard 1 +
.ecx on disk 1) via the new `build_split_disk_store` helper and
asserts:

- `find_ec_shard_location` / `find_ec_volume_with_shard` route to
  the disk that actually holds each shard (not first-match).
- `Store::unmount_ec_shards([1])` reaches disk 1 and removes shard 1
  while leaving disk 0's unrelated shards mounted (used to be a
  silent no-op).
- `Store::unmount_ec_shard(vid, 1)` ditto for the single-shard
  variant.
- `Store::delete_ec_shards` removes both the on-disk file and the
  in-memory mount on the right disk; previously deletion could
  remove the file while the in-memory shard with its open file
  handle survived on a different location.
- `collect_ec_shard_dirs` reports the right per-shard data dir for
  each location and `None` for unmounted shards.

* fix(seaweed-volume): retry same-disk legacy .ecx layout in reconcile

The unconditional `owner.location == loc_idx` skip missed the layout
where `idx_directory` is configured but the owner's `.ecx` / `.ecj` /
`.vif` still live in `loc.directory` (the legacy "written before
-dir.idx was set" shape). In that case the per-disk loader's
mount_ec_shards used `loc.idx_directory` and ENOENT'd, then this
branch suppressed the only recovery path — the owner disk's own
shards stayed unloaded after startup.

Tighten the skip so it only fires when the discovered owner dir is
already `loc.idx_directory` (the loader-already-tried-and-failed
case). When `owner.idx_dir` differs (legacy data-dir layout), queue
a same-disk retry through `mount_ec_shards_with_idx_dir(...,
&owner.idx_dir)` so reconcile becomes the recovery path.

Reported in PR #9252 review by @coderabbitai.
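
The tightened condition can be read as a three-way decision. The `ReconcileAction` enum and `decide` function below are hypothetical and exist only to spell out the cases; the real code makes this choice inline:

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum ReconcileAction {
    /// Owner is this disk and its index already sits in `idx_directory`:
    /// the per-disk loader has tried this exact path, nothing new to do.
    Skip,
    /// Owner is this disk but the index lives elsewhere (legacy data-dir
    /// layout): retry the mount on the same disk with the owner's idx dir.
    SameDiskRetry,
    /// Owner is a sibling disk: mount here, pointing at the sibling's idx dir.
    CrossDiskMount,
}

fn decide(
    loc_idx: usize,
    loc_idx_directory: &Path,
    owner_location: usize,
    owner_idx_dir: &Path,
) -> ReconcileAction {
    if owner_location != loc_idx {
        ReconcileAction::CrossDiskMount
    } else if owner_idx_dir == loc_idx_directory {
        ReconcileAction::Skip
    } else {
        ReconcileAction::SameDiskRetry
    }
}
```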

* fix(seaweed-volume): roll back partial mounts on cross-disk reconcile failure

mount_ec_shards_with_idx_dir adds shards one at a time and
increments the `ec_shards` gauge per shard that successfully attaches.
A mid-loop failure (e.g. an EcVolumeShard::open error after the
first few shards already attached) used to leave the EcVolume
half-mounted with stale metric increments — the warn!() branch only
logged the error.

Mirror DiskLocation::handle_found_ecx_file's recovery path: drive
the cleanup through `loc.unmount_ec_shards(vid, &shard_ids)` after
a failed mount. The defensive change in #9251 makes
unmount_ec_shards only decrement the gauge for shards that were
actually mounted and drops the EcVolume when it reaches zero
shards, so the rollback is safe even though some of `shard_ids`
never attached.

Reported in PR #9252 review by @coderabbitai.
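
A simplified sketch of the rollback, assuming a `Location` that tracks mounted shard ids in a set and the gauge as a plain counter. The `open` callback stands in for `EcVolumeShard::open`, and all names here are illustrative:

```rust
use std::collections::BTreeSet;

struct Location {
    mounted: BTreeSet<u8>,
    gauge: i64,
}

impl Location {
    /// Only shards that were really mounted move the gauge.
    fn unmount(&mut self, shard_ids: &[u8]) {
        for id in shard_ids {
            if self.mounted.remove(id) {
                self.gauge -= 1;
            }
        }
    }

    /// Attaches shards one at a time. If any open fails, unmounts the whole
    /// requested set so no half-mounted state or stale gauge value remains.
    /// Assumes `shard_ids` are orphans that were not mounted before the call.
    fn mount_with_rollback(
        &mut self,
        shard_ids: &[u8],
        open: impl Fn(u8) -> Result<(), String>,
    ) -> Result<(), String> {
        for &id in shard_ids {
            if let Err(err) = open(id) {
                self.unmount(shard_ids);
                return Err(err);
            }
            if self.mounted.insert(id) {
                self.gauge += 1;
            }
        }
        Ok(())
    }
}
```

Passing the full `shard_ids` list to the cleanup is safe only because the unmount path ignores shards that never attached.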

* test(seaweed-volume): cover the two reconcile fixes from PR #9252 review

Two new tests in store_ec_reconcile:

- test_reconcile_recovers_same_disk_legacy_ecx_layout — sets up the
  layout where idx_directory is configured but the owner's .ecx
  lives in loc.directory. The per-disk loader's mount_ec_shards
  uses loc.idx_directory and fails; reconcile should retry on the
  same disk with the owner's actual idx_dir and the owner's own
  shards must come back online.

- test_reconcile_rolls_back_partial_mounts_on_failure — sabotages
  one of the orphan shard files (replaces it with a directory of
  the same name) so EcVolumeShard::open errors out partway through
  mount_ec_shards_with_idx_dir. Asserts the post-condition that no
  EcVolume entry retains a "shard mounted" claim that doesn't
  correspond to a real shard file.
2026-04-27 19:01:30 -07:00
SeaweedFS Volume Server (Rust)

A drop-in replacement for the SeaweedFS Go volume server, rewritten in Rust. It uses binary-compatible storage formats (.dat, .idx, .vif) and speaks the same HTTP and gRPC protocols, so it works with an unmodified Go master server.

Building

Requires Rust 1.75+ (2021 edition).

cd seaweed-volume
cargo build --release

The binary is produced at target/release/seaweed-volume.

Running

Start a Go master server first, then point the Rust volume server at it:

# Minimal
seaweed-volume --port 8080 --master localhost:9333 --dir /data/vol1 --max 7

# Multiple data directories
seaweed-volume --port 8080 --master localhost:9333 \
  --dir /mnt/ssd1,/mnt/ssd2 --max 100,100 --disk ssd

# With datacenter/rack topology
seaweed-volume --port 8080 --master localhost:9333 --dir /data/vol1 --max 7 \
  --dataCenter dc1 --rack rack1

# With JWT authentication
seaweed-volume --port 8080 --master localhost:9333 --dir /data/vol1 --max 7 \
  --securityFile /etc/seaweedfs/security.toml

# With TLS (configured in security.toml via [https.volume] and [grpc.volume] sections)
seaweed-volume --port 8080 --master localhost:9333 --dir /data/vol1 --max 7 \
  --securityFile /etc/seaweedfs/security.toml

Common flags

Flag               Default          Description
--port             8080             HTTP listen port
--port.grpc        port+10000       gRPC listen port
--master           localhost:9333   Comma-separated master server addresses
--dir              /tmp             Comma-separated data directories
--max              8                Max volumes per directory (comma-separated)
--ip               auto-detect      Server IP / identifier
--ip.bind          same as --ip     Bind address
--dataCenter       (none)           Datacenter name
--rack             (none)           Rack name
--disk             (none)           Disk type tag: hdd, ssd, or custom
--index            memory           Needle map type: memory, leveldb, leveldbMedium, leveldbLarge
--readMode         proxy            Non-local read mode: local, proxy, redirect
--fileSizeLimitMB  256              Max upload file size
--minFreeSpace     1 (percent)      Min free disk space before marking volumes read-only
--securityFile     (none)           Path to security.toml for JWT keys and TLS certs
--metricsPort      0 (disabled)     Prometheus metrics endpoint port
--whiteList        (none)           Comma-separated IPs with write permission
--preStopSeconds   10               Graceful drain period before shutdown
--compactionMBps   0 (unlimited)    Compaction I/O rate limit
--pprof            false            Enable pprof HTTP handlers

Set RUST_LOG=debug (or trace, info, warn) for log level control. Set SEAWEED_WRITE_QUEUE=1 to enable batched async write processing.

Features

  • Binary compatible -- reads and writes the same .dat/.idx/.vif files as the Go server; seamless migration with no data conversion.
  • HTTP + gRPC -- full implementation of the volume server HTTP API and all gRPC RPCs including streaming operations (copy, tail, incremental copy, vacuum).
  • Master heartbeat -- bidirectional streaming heartbeat with the Go master server; volume and EC shard registration, leader failover, graceful shutdown deregistration.
  • JWT authentication -- signing key configuration via security.toml with token source precedence (query > header > cookie), file_id claims validation, and separate read/write keys.
  • TLS -- HTTPS for the HTTP API and mTLS for gRPC, configured through security.toml.
  • Erasure coding -- Reed-Solomon EC shard management: mount/unmount, read, rebuild, copy, delete, and shard-to-volume reconstruction.
  • S3 remote storage -- FetchAndWriteNeedle reads from any S3-compatible backend (AWS, MinIO, Wasabi, Backblaze, etc.) and writes locally. Supports VolumeTierMoveDatToRemote/FromRemote for tiered storage.
  • Needle map backends -- in-memory HashMap, LevelDB (via rusty-leveldb), or redb (pure Rust disk-backed) needle maps.
  • Image processing -- on-the-fly resize/crop, JPEG EXIF orientation auto-fix, WebP support.
  • Streaming reads -- large files (>1MB) are streamed via spawn_blocking to avoid blocking the async runtime.
  • Auto-compression -- compressible file types (text, JSON, CSS, JS, SVG, etc.) are gzip-compressed on upload.
  • Prometheus metrics -- counters, histograms, and gauges exported at a dedicated metrics port; optional push gateway support.
  • Graceful shutdown -- SIGINT/SIGTERM handling with configurable preStopSeconds drain period.

Testing

Rust unit tests

cd seaweed-volume
cargo test

Go integration tests

The Go test suite can target either the Go or Rust volume server via the VOLUME_SERVER_IMPL environment variable:

# Run all HTTP + gRPC integration tests against the Rust server
VOLUME_SERVER_IMPL=rust go test -v -count=1 -timeout 1200s \
  ./test/volume_server/grpc/... ./test/volume_server/http/...

# Run a single test
VOLUME_SERVER_IMPL=rust go test -v -count=1 -timeout 60s \
  -run "TestName" ./test/volume_server/http/...

# Run S3 remote storage tests
VOLUME_SERVER_IMPL=rust go test -v -count=1 -timeout 180s \
  -run "TestFetchAndWriteNeedle" ./test/volume_server/grpc/...

Load testing

A load test harness is available at test/volume_server/loadtest/. See that directory for usage instructions and scenarios.

Architecture

The server runs three listeners concurrently:

  • HTTP (Axum 0.7) -- admin and public routers for file upload/download, status, and stats endpoints.
  • gRPC (Tonic 0.12) -- all VolumeServer RPCs from the SeaweedFS protobuf definition.
  • Metrics (optional) -- Prometheus scrape endpoint on a separate port.

Key source modules:

Path                         Description
src/main.rs                  Entry point, server startup, signal handling
src/config.rs                CLI parsing and configuration resolution
src/server/volume_server.rs  HTTP router setup and middleware
src/server/handlers.rs       HTTP request handlers (read, write, delete, status)
src/server/grpc_server.rs    gRPC service implementation
src/server/heartbeat.rs      Master heartbeat loop
src/storage/volume.rs        Volume read/write/delete logic
src/storage/needle.rs        Needle (file entry) serialization
src/storage/store.rs         Multi-volume store management
src/security.rs              JWT validation and IP whitelist guard
src/remote_storage/          S3 remote storage backend

See DEV_PLAN.md for the full development history and feature checklist.