* feat(ec): add encode_ts_ns to EC shard metadata and the shard read RPC EcShardConfig and VolumeEcShardReadRequest gain an int64 encode_ts_ns (encode time in unix nanos). It rides in .vif and the read request so a read can be scoped to the encode run that produced the index. * fix(ec): stamp each encode and reject cross-run shard reads Generate stamps EncodeTsNs into the volume's .vif. Reads carry it to the shard's owning volume (resolved together via FindEcVolumeWithShard, so a multi-disk server validates the disk that actually serves the bytes) and reject a shard from a different encode run, recovering from parity. A zero on either side (pre-upgrade volume) skips the guard. * fix(ec): stamp the encode identity on the worker-generated .vif The worker-local encode path now writes EncodeTsNs (and the resolved EC ratio) into the .vif, so the read guard is not silently off for volumes encoded by the maintenance worker. * fix(ec): wipe stale EC artifacts before re-encoding VolumeEcShardsGenerate evicts any in-memory EcVolume for the volume and removes its on-disk shard/index/sidecar files before writing fresh ones, so a retried encode never builds on a partial prior run and the unlink frees the inodes instead of leaving open fds serving old bytes. * fix(ec): unmount EC shards across all disks UnmountEcShards walked only the first disk holding the shard, leaving a duplicate copy mounted on a sibling disk (split-disk reconciled volumes) still serving and heartbeating. Traverse every disk and emit one deletion delta per disk. * fix(ec): delete orphan shards without a local .ecx deleteEcShardIdsForEachLocation gated shard-file removal on a local .ecx, so it could not clean an orphan .ecNN left by a failed copy on a disk with no index. Delete the requested shard files unconditionally; the index-file (.ecx/.ecj/.vif) routing stays gated as before. * fix(ec): clear stale EC shards cluster-wide before re-encoding ec.encode unmounts and deletes EC shards for the target volumes on every node before regenerating: fatal for the shards the topology reports (mounted leftovers), best-effort for the rest (a sweep that catches unmounted failed-copy orphans). A down node is a no-op. * fix(ec): don't nil EC fds on close so reads can't race eviction A reader resolves an EcVolume/shard under the lock then reads after it is released, so an eviction that nils ecxFile/ecdFile would race that read and panic. Close the fds without nilling the fields: the field is now write-once (no data race) and a concurrent read hits a closed fd, getting a clean error that the caller recovers from parity. * fix(ec): wipe stale EC artifacts on every disk and surface failures The pre-encode wipe only deleted beside the source volume, so a stale shard on a sibling disk survived and could be mounted against the new index at reconcile. Sweep every disk. Removal also ignored os.Remove errors, reporting a failed cleanup as success and letting a stale shard join the next generation; surface the first real failure (treating already-gone as success) from removeStaleEcArtifacts and the shard delete. * fix(ec): log when a local shard is skipped for a different encode run The cross-run guard returned errShardNotLocal, indistinguishable in logs from a genuinely-absent shard. Add a V(1) line naming both EncodeTsNs so operators can tell "wrong encode generation" from "shard not here". * fix(ec): surface metadata removal failures in the shard delete path deleteEcShardIdsForEachLocation still dropped os.Remove errors on the .ecx/.ecj/.vif/sidecar cleanup. A surviving stale .ecx is the orphan-index condition this path prevents, so route those through removeFileIfExists and return the first real failure instead of reporting cleanup as success. * fix(ec): fail orphan cleanup when a reachable node's delete fails The pre-encode orphan sweep swallowed every error for unreported (node, volume) pairs. That is only safe for an unreachable node, which cannot receive this encode's new generation. A reachable node whose delete genuinely failed (permission/IO) keeps an orphan shard that a later copy re-stamps with the new run's volume-level .vif identity, so the read guard would accept stale data. Surface those; stay best-effort only for unreachable nodes (gRPC Unavailable / no status). * fix(ec): guard ecjFile under its lock in the EC delete path EcVolume.Close nils ecjFile under ecjFileAccessLock; a delete that resolved its .ecx lookup before a concurrent eviction (the generate-time UnloadEcVolume) could then reach the journal append with a nil fd. Bail with a clear "volume closed" error under the lock instead. * fix(ec): reject an unstamped shard when the caller has an encode identity The read guard required both identities nonzero, so a current (stamped) caller accepted a holder with identity 0 and could be served a stale pre-upgrade shard. Reject when the caller is stamped and the holder differs (including unstamped); stay lenient only when the caller itself has no identity (pre-upgrade reader). A skipped shard recovers from parity. * fix(ec): full-teardown delete so cluster cleanup wipes a whole generation The pre-encode cluster sweep deleted only the listed canonical shards on remote nodes, leaving index/sidecar (and, on builds with versioned generations, those too) behind. Add a full_teardown flag to VolumeEcShardsDelete that evicts the volume and wipes every EC artifact for it on every disk via removeStaleEcArtifacts; the shell and worker pre-encode cleanup paths set it. Other delete callers (balance/decode/repair) are unchanged. * fix(ec): take ecjFileAccessLock before the nil-check in Sync and Close Sync and Close read ev.ecjFile before acquiring ecjFileAccessLock while Close nils it under the lock, a data race on the field. Take the lock first, then nil-check inside, in both. * fix(ec): acknowledge full_teardown so a pre-upgrade server can't fake success An old volume server silently ignores full_teardown and returns success for an ordinary delete, so the caller wrongly believes the generation was wiped and copies a fresh gen-0 onto an unwiped node. Echo full_teardown_done in the response; the worker destination cleanup fails when it is absent, and the shell cluster sweep fails for a reported (mounted) leftover while staying best-effort for an unreported node. encode_ts_ns stays an accepted transient (an old server just skips the new read guard, no regression). * fix(ec): fail the pre-encode sweep for any reachable node that can't ack teardown A reachable pre-upgrade server ignores full_teardown and returns success without wiping an orphan, which a later copy then folds into the new generation. Treat a missing full_teardown_done ack as fatal for every reachable node (best-effort only for a gRPC-unreachable one), not just for topology-reported pairs. * fix(ec): return the served shard identity and validate it client-side The encode identity was only enforced server-side, so a pre-upgrade server ignored the request field and served bytes unchecked. Echo the served shard's EncodeTsNs on every read response chunk and have the client reject a mismatch (including 0 from an old server), so the guard holds regardless of server version; a rejected read recovers from parity. * fix(ec): reject a short/empty remote shard read instead of serving zeros doReadRemoteEcShardInterval accepted an immediate EOF or a short stream and returned success with a partly zero-filled, unvalidated buffer (the server stamps the identity only on chunks that carry bytes). A non-deleted interval must arrive whole: require n == len(buf), exempting the is_deleted short-circuit (n=0), matching readLocalEcShardInterval's local check. A short read now fails so the caller recovers from parity. * test(ec): fake volume server echoes the full_teardown acknowledgement The worker now fails a teardown delete that isn't acknowledged (so a pre-upgrade server can't silently skip the wipe). The fake server's no-op VolumeEcShardsDelete returned an empty response, which the worker read as a skipped teardown and aborted the encode. Echo full_teardown_done. * feat(ec): mirror the encode-run identity guard + full_teardown into the Rust volume server The Go volume server stamps an encode-run identity (encode_ts_ns) into the .vif and rejects a read served from a shard of a different run; full_teardown wipes a whole generation and acknowledges it. The Rust volume server had none of it. Mirror the shared logic: load encode_ts_ns from the .vif onto the EcVolume, stamp it on every read response, and reject a request/response mismatch on both the server and the distributed-read client (recovering from parity); handle full_teardown by evicting the volume and wiping every EC artifact on each disk, echoing full_teardown_done so the caller can detect a server that ignored it. * fix(ec): remove a stale .vif on full teardown of a shard-only node A shard copy installs shards + .ecx before .vif, so an interrupted copy after a teardown could mount the new files under the previous run's identity / version / shard ratio / dat_file_size carried by the surviving .vif. Remove .vif during full teardown, gated on .idx absence so a source-volume holder keeps its live .vif. In Rust this lives in a teardown-only helper so the reconcile / load- fallback paths (which share the base removal) still preserve .vif. * fix(ec): treat a missing teardown ack as fatal, not as an unreachable node isNodeUnreachable returned true for any non-gRPC-status error, so a reachable pre-upgrade server's missing full_teardown_done ack (a plain error) was classified unreachable and the unreported pair was silently skipped. Classify only a real codes.Unavailable as unreachable, and wrap the missing ack in a sentinel the sweep treats as fatal regardless. A genuinely down node still surfaces as Unavailable from the RPC and stays best-effort. * fix(ec): reject a short shard read in the local EC needle reader read_ec_shard_needle ignored the byte count from shard.read_at and appended the whole pre-sized buffer, so a truncated shard's zero-filled tail passed the later length check and parsed as garbage. Require n == buf.len() per interval, erroring on a short read like the local interval reader already does. * fix(ec): probe reachability before skipping a node that returns Unavailable The pre-encode sweep skipped any node whose teardown delete returned codes.Unavailable, but a reachable volume server in maintenance mode also returns that code for the maintenance-gated delete, so its stale EC files were left behind on a node that can still receive the new generation. Confirm with a non-maintenance-gated empty-target Ping: skip only when the node fails the probe too (genuinely unreachable). * fix(ec): use try_exists for the teardown .vif .idx guard The teardown-only .vif removal gated on Path::exists(), which returns false on a permission/IO stat error, so a stat failure on a present .idx would read as a shard-only node and delete the live source volume's .vif. Gate on try_exists() == Ok(false) instead, preserving the sidecar on any stat error. * fix(ec): only skip a sweep node when a Ping confirms it is transport-down The pre-encode sweep skipped a node whenever its teardown delete and a liveness Ping both failed, but it treated ANY Ping error as down — an application-level Internal/ResourceExhausted, or Unimplemented from a pre-Ping server, left a reachable node's stale generation in place. Classify the Ping tri-state and skip only when it transport-fails with codes.Unavailable; a reachable or inconclusive node stays fatal. * fix(ec): exclude sweep-skipped nodes from the encode's rebalance The pre-encode sweep skips a genuinely-down node best-effort, but the rebalance then recollected the current topology — a node that recovered between the two could become a copy target and receive the new generation while still holding its stale, never-cleared shards. Have the sweep return the skipped set and exclude those nodes from the rebalance for this encode, so a node we could not clean cannot receive the new generation. Standalone ec.balance is unaffected. * fix(ec): re-sweep recovered nodes before generation so they aren't stranded A node skipped as down by the pre-encode sweep is excluded from the rebalance, but it can recover and become the generation host — mounting all shards locally, then being excluded from distribution. Union-only verification accepts all shards on one node and deletes the originals: a single point of failure. Re-sweep the skipped nodes just before generation; one whose teardown now succeeds leaves the skipped set and rebalances normally, while a node still down stays skipped. * fix(ec): abort the encode if a selected source is still skipped after re-sweep The re-sweep un-skips a recovered node, but the source was selected before it and a node can stay down through the re-sweep then recover just in time to be the generation host — mounting all shards locally while still excluded from the rebalance, which union-only verification accepts before deleting the originals. Abort the encode when a selected source remains skipped after the re-sweep. * fix(ec): batch delete returns retriable 503 when a volume became EC mid-batch If a volume is not EC at the batch-delete classification but is encoded to EC and its .dat deleted before the regular-volume mutation, the mutation returns an exact "not found" that the filer chunk-GC treats as completed, dropping the delete. Recheck EC presence under the mutation lock and return a retriable 503 with the "try again" token so the filer requeues it onto the EC path. * fix(ec): recheck EC state before the regular batch-delete mutation ec.encode mounts EC shards (copied from the .dat) before deleting the originals, so a volume can be EC while its .dat still exists. The batch delete only rechecked EC after a NotFound, so a successful regular-volume delete in that window wrote a tombstone to the soon-removed .dat — the delete was lost and the needle resurrected from the pre-tombstone shards. Recheck has_ec_volume under the write lock before delete_volume_needle and return a retriable 503 so the filer requeues onto the EC path. * fix(volume): make the metrics push test independent of test order test_push_metrics_once asserted the pushed body contains the request-counter family without ever touching the counter — a CounterVec with no children emits nothing, so the assertion only held when another test had already created a labelset in the shared registry. Create one in the test itself.
SeaweedFS Volume Server (Rust)
A drop-in replacement for the SeaweedFS Go volume server, rewritten in Rust. It uses binary-compatible storage formats (.dat, .idx, .vif) and speaks the same HTTP and gRPC protocols, so it works with an unmodified Go master server.
Building
Requires Rust 1.75+ (2021 edition).
cd seaweed-volume
cargo build --release
The binary is produced at target/release/seaweed-volume.
Running
Start a Go master server first, then point the Rust volume server at it:
# Minimal
seaweed-volume --port 8080 --master localhost:9333 --dir /data/vol1 --max 7
# Multiple data directories
seaweed-volume --port 8080 --master localhost:9333 \
--dir /mnt/ssd1,/mnt/ssd2 --max 100,100 --disk ssd
# With datacenter/rack topology
seaweed-volume --port 8080 --master localhost:9333 --dir /data/vol1 --max 7 \
--dataCenter dc1 --rack rack1
# With JWT authentication
seaweed-volume --port 8080 --master localhost:9333 --dir /data/vol1 --max 7 \
--securityFile /etc/seaweedfs/security.toml
# With TLS (configured in security.toml via [https.volume] and [grpc.volume] sections)
seaweed-volume --port 8080 --master localhost:9333 --dir /data/vol1 --max 7 \
--securityFile /etc/seaweedfs/security.toml
Common flags
| Flag | Default | Description |
|---|---|---|
--port |
8080 |
HTTP listen port |
--port.grpc |
port+10000 |
gRPC listen port |
--master |
localhost:9333 |
Comma-separated master server addresses |
--dir |
/tmp |
Comma-separated data directories |
--max |
8 |
Max volumes per directory (comma-separated) |
--ip |
auto-detect | Server IP / identifier |
--ip.bind |
same as --ip |
Bind address |
--dataCenter |
Datacenter name | |
--rack |
Rack name | |
--disk |
Disk type tag: hdd, ssd, or custom |
|
--index |
memory |
Needle map type: memory, leveldb, leveldbMedium, leveldbLarge |
--readMode |
proxy |
Non-local read mode: local, proxy, redirect |
--fileSizeLimitMB |
256 |
Max upload file size |
--minFreeSpace |
1 (percent) |
Min free disk space before marking volumes read-only |
--securityFile |
Path to security.toml for JWT keys and TLS certs |
|
--metricsPort |
0 (disabled) |
Prometheus metrics endpoint port |
--whiteList |
Comma-separated IPs with write permission | |
--preStopSeconds |
10 |
Graceful drain period before shutdown |
--compactionMBps |
0 (unlimited) |
Compaction I/O rate limit |
--pprof |
false |
Enable pprof HTTP handlers |
Set RUST_LOG=debug (or trace, info, warn) for log level control.
Set SEAWEED_WRITE_QUEUE=1 to enable batched async write processing.
Features
- Binary compatible -- reads and writes the same
.dat/.idx/.viffiles as the Go server; seamless migration with no data conversion. - HTTP + gRPC -- full implementation of the volume server HTTP API and all gRPC RPCs including streaming operations (copy, tail, incremental copy, vacuum).
- Master heartbeat -- bidirectional streaming heartbeat with the Go master server; volume and EC shard registration, leader failover, graceful shutdown deregistration.
- JWT authentication -- signing key configuration via
security.tomlwith token source precedence (query > header > cookie), file_id claims validation, and separate read/write keys. - TLS -- HTTPS for the HTTP API and mTLS for gRPC, configured through
security.toml. - Erasure coding -- Reed-Solomon EC shard management: mount/unmount, read, rebuild, copy, delete, and shard-to-volume reconstruction.
- S3 remote storage --
FetchAndWriteNeedlereads from any S3-compatible backend (AWS, MinIO, Wasabi, Backblaze, etc.) and writes locally. SupportsVolumeTierMoveDatToRemote/FromRemotefor tiered storage. - Needle map backends -- in-memory HashMap, LevelDB (via
rusty-leveldb), or redb (pure Rust disk-backed) needle maps. - Image processing -- on-the-fly resize/crop, JPEG EXIF orientation auto-fix, WebP support.
- Streaming reads -- large files (>1MB) are streamed via
spawn_blockingto avoid blocking the async runtime. - Auto-compression -- compressible file types (text, JSON, CSS, JS, SVG, etc.) are gzip-compressed on upload.
- Prometheus metrics -- counters, histograms, and gauges exported at a dedicated metrics port; optional push gateway support.
- Graceful shutdown -- SIGINT/SIGTERM handling with configurable
preStopSecondsdrain period.
Testing
Rust unit tests
cd seaweed-volume
cargo test
Go integration tests
The Go test suite can target either the Go or Rust volume server via the VOLUME_SERVER_IMPL environment variable:
# Run all HTTP + gRPC integration tests against the Rust server
VOLUME_SERVER_IMPL=rust go test -v -count=1 -timeout 1200s \
./test/volume_server/grpc/... ./test/volume_server/http/...
# Run a single test
VOLUME_SERVER_IMPL=rust go test -v -count=1 -timeout 60s \
-run "TestName" ./test/volume_server/http/...
# Run S3 remote storage tests
VOLUME_SERVER_IMPL=rust go test -v -count=1 -timeout 180s \
-run "TestFetchAndWriteNeedle" ./test/volume_server/grpc/...
Load testing
A load test harness is available at test/volume_server/loadtest/. See that directory for usage instructions and scenarios.
Architecture
The server runs three listeners concurrently:
- HTTP (Axum 0.7) -- admin and public routers for file upload/download, status, and stats endpoints.
- gRPC (Tonic 0.12) -- all
VolumeServerRPCs from the SeaweedFS protobuf definition. - Metrics (optional) -- Prometheus scrape endpoint on a separate port.
Key source modules:
| Path | Description |
|---|---|
src/main.rs |
Entry point, server startup, signal handling |
src/config.rs |
CLI parsing and configuration resolution |
src/server/volume_server.rs |
HTTP router setup and middleware |
src/server/handlers.rs |
HTTP request handlers (read, write, delete, status) |
src/server/grpc_server.rs |
gRPC service implementation |
src/server/heartbeat.rs |
Master heartbeat loop |
src/storage/volume.rs |
Volume read/write/delete logic |
src/storage/needle.rs |
Needle (file entry) serialization |
src/storage/store.rs |
Multi-volume store management |
src/security.rs |
JWT validation and IP whitelist guard |
src/remote_storage/ |
S3 remote storage backend |
See DEV_PLAN.md for the full development history and feature checklist.