mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-06-13 23:36:45 +03:00

Chris Lu f724828bcb fix(ec): never delete recoverable EC shards on startup/reconcile (the non-empty-.dat sibling of the stub bug) (#9941)

* fix(ec): never delete recoverable shards on startup/reconcile (size-direction + byte-exact .dat)

EC startup validation and the cross-disk reconcile could delete the only
copy of distributed-EC shards whenever a non-empty .dat sat beside them.
This is the same data-loss class as the empty-.dat-stub fix, now for a
real (non-empty) stale or partial .dat.

validateEcVolume: the discriminating signal is the shard size relative to
the .dat's full encode, not the shard count.
- shards smaller than expected: an interrupted local encode left partial
  shards and the .dat is the complete source -> reclaim the .dat.
- shards equal to expected: a valid (or still-distributing) EC volume ->
  keep; the shards may be the only copy.
- shards larger than expected: the .dat is the stale/partial side (e.g. an
  interrupted decode left a half-written .dat next to the real shards) ->
  keep.
Previously any size mismatch, a low shard count beside a .dat, or a
transient stat error returned "delete", wiping sole-copy shards. Now every
ambiguity (size mismatch in either direction, inconsistent shard sizes,
transient I/O error, partial shard set) keeps the data; only a credible
full source .dat with no partial set to lose is reclaimed.

handleFoundEcxFile: a shard load failure (corrupt/locked .ecx, EMFILE
during a mass restart, transient I/O) no longer deletes the EC files when a
.dat exists -- it only unloads and keeps the files for retry. All deletion
authority now flows through validateEcVolume.

pruneIncompleteEcWithSiblingDat: count shards NODE-WIDE (a set split across
sibling disks summing to >= dataShards is independently recoverable and is
left alone), and require the sibling .dat to byte-exactly match the size
.vif recorded at encode time before deleting -- the prior "at least this
big, or bigger than a superblock" gate could trust a stale .dat and wipe
sole-copy shards. EC encode records the source size in .vif, so this gate
works for real volumes; older volumes without it fail safe (kept).
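
A minimal sketch of the two-part prune gate described above, assuming the
caller has already counted shards across all disks on the node and read
the source size recorded in .vif. `mayPrune` and its parameter names are
hypothetical, and a recorded size of zero stands in for "older .vif
without the field".

```go
package main

import "fmt"

// mayPrune reports whether an incomplete EC shard set next to a sibling
// .dat may be deleted. It is an illustration, not the real
// pruneIncompleteEcWithSiblingDat.
func mayPrune(nodeWideShards, dataShards int, recordedDatSize, actualDatSize int64) bool {
	// A set split across sibling disks that reaches dataShards is
	// independently recoverable, so it is left alone.
	if nodeWideShards >= dataShards {
		return false
	}
	// No recorded source size (older volumes): fail safe and keep.
	if recordedDatSize <= 0 {
		return false
	}
	// Byte-exact match only; "at least this big" could trust a stale .dat.
	return actualDatSize == recordedDatSize
}

func main() {
	fmt.Println(mayPrune(3, 10, 4096, 4096))  // true: few shards, .dat matches exactly
	fmt.Println(mayPrune(3, 10, 4096, 8192))  // false: .dat is bigger than recorded
	fmt.Println(mayPrune(3, 10, 0, 4096))     // false: no recorded size
	fmt.Println(mayPrune(10, 10, 4096, 4096)) // false: recoverable node-wide
}
```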

Rust volume server mirrors all of the above: size-direction + keep-on-
ambiguity in validate_ec_volume, keep-on-load-failure in
handle_found_ecx_file, and the node-wide + byte-exact gate in the prune.
The Rust validate/prune paths now resolve the data-shard count from the
volume's own .vif instead of hardcoding 10+4, so custom-ratio volumes are
not mis-sized and wrongly deleted on reboot.

Existing tests that encoded the old (unsafe) "delete on low count / size
mismatch" behavior are updated to the safe expectation, and new regression
tests cover the partial-decode-.dat-keeps-shards and transient-error-keeps
cases (Go and Rust); they fail on the pre-fix code.

* fix(ec): record DatFileSize in planted EC .vif for the prune test; trim comments

The multi-disk lifecycle e2e test planted a partial EC leftover with an
empty .vif, so the byte-exact prune gate (which a real encoded volume
satisfies via its recorded source size) kept it instead of cleaning up.
Record DatFileSize + the EC ratio in the planted .vif, matching production.

Also condense the verbose comments added in this change to the repo's
concise style.

2026-06-12 23:51:29 -07:00

Name | Last commit | Date
framework | test(ec): end-to-end encode over a multi-server multi-disk stuck layout (#9728) | 2026-05-28 16:44:42 -07:00
grpc | fix(ec): never delete recoverable EC shards on startup/reconcile (the non-empty-.dat sibling of the stub bug) (#9941) | 2026-06-12 23:51:29 -07:00
http | writeJson: drop unused JSONP branch (#9686) | 2026-05-26 01:05:07 -07:00
loadtest | go fmt | 2026-04-10 17:31:14 -07:00
matrix | Rust volume server implementation with CI (#8539) | 2026-03-26 17:24:35 -07:00
merge | Adds volume.merge command with deduplication and disk-based backend (#8441) | 2026-02-25 10:12:09 -08:00
rust | Rust volume server implementation with CI (#8539) | 2026-03-26 17:24:35 -07:00
DEV_PLAN.md | Add volume server integration test suite and CI workflow (#8322) | 2026-02-13 00:40:56 -08:00
Makefile | Add volume server integration test suite and CI workflow (#8322) | 2026-02-13 00:40:56 -08:00
README.md | Add volume server integration test suite and CI workflow (#8322) | 2026-02-13 00:40:56 -08:00

README.md

Volume Server Integration Tests

This package contains integration tests for SeaweedFS volume server HTTP and gRPC APIs.

Run Tests

Run tests from repo root:

go test ./test/volume_server/... -v

If a weed binary is not found, the harness will build one automatically.

Optional environment variables

- WEED_BINARY: explicit path to the weed executable (disables auto-build).
- VOLUME_SERVER_IT_KEEP_LOGS=1: keep temporary test directories and process logs.
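
As a rough illustration of how a test harness could interpret these two
variables, here is a small sketch. The `harnessOptions` struct and
`optionsFromEnv` function are hypothetical and are not taken from the
actual framework package; only the variable names and their documented
meaning come from this README.

```go
package main

import (
	"fmt"
	"os"
)

// harnessOptions is a hypothetical container for this sketch.
type harnessOptions struct {
	BinaryPath string // explicit weed binary; empty when not set
	AutoBuild  bool   // build a binary only when no explicit path is given
	KeepLogs   bool   // keep temp directories and process logs after the run
}

// optionsFromEnv reads the documented variables through getenv so the
// logic can be exercised without touching the real environment.
func optionsFromEnv(getenv func(string) string) harnessOptions {
	bin := getenv("WEED_BINARY")
	return harnessOptions{
		BinaryPath: bin,
		AutoBuild:  bin == "",
		KeepLogs:   getenv("VOLUME_SERVER_IT_KEEP_LOGS") == "1",
	}
}

func main() {
	fmt.Printf("%+v\n", optionsFromEnv(os.Getenv))
}
```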

Current scope (Phase 0)

- Shared cluster/framework utilities
- Matrix profile definitions
- Initial HTTP admin endpoint checks
- Initial gRPC state/status checks

More API coverage is tracked in test/volume_server/DEV_PLAN.md.