fix(ec): never delete recoverable EC shards on startup/reconcile (the non-empty-.dat sibling of the stub bug) (#9941)

* fix(ec): never delete recoverable shards on startup/reconcile (size-direction + byte-exact .dat)

EC startup validation and the cross-disk reconcile could delete the only
copy of distributed-EC shards whenever a non-empty .dat sat beside them.
This is the same data-loss class as the empty-.dat-stub fix, now for a
real (non-empty) stale or partial .dat.

validateEcVolume: the discriminating signal is the shard size relative to
the .dat's full encode, not the shard count.
  - shards smaller than expected: an interrupted local encode left partial
    shards and the .dat is the complete source -> reclaim the .dat.
  - shards equal to expected: a valid (or still-distributing) EC volume ->
    keep; the shards may be the only copy.
  - shards larger than expected: the .dat is the stale/partial side (e.g. an
    interrupted decode left a half-written .dat next to the real shards) ->
    keep.
Previously any size mismatch, a low shard count beside a .dat, or a
transient stat error returned "delete", wiping sole-copy shards. Now every
ambiguity (size mismatch in either direction, inconsistent shard sizes,
transient I/O error, partial shard set) keeps the data; only a credible
full source .dat with no partial set to lose is reclaimed.
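
The size-direction rule above can be sketched roughly as follows. This is
an illustration only, not the actual validateEcVolume: the names Verdict,
KeepAll, ReclaimDat and classify are hypothetical, and the sketch ignores
shard counts and .vif handling that the real function also considers.

```go
package main

// Verdict is a hypothetical name for the outcome of the size check.
type Verdict int

const (
	KeepAll    Verdict = iota // leave shards and .dat untouched
	ReclaimDat                // drop the partial shards, keep the complete .dat
)

// classify sketches the size-direction rule only; every name is illustrative.
//
//	shardSizes: sizes of the shards found beside the .dat
//	expected:   shard size implied by a full encode of the .dat
//	statFailed: true if sizing any file hit a (possibly transient) error
func classify(shardSizes []int64, expected int64, statFailed bool) Verdict {
	if statFailed || len(shardSizes) == 0 {
		return KeepAll // cannot judge, so never delete
	}
	for _, s := range shardSizes[1:] {
		if s != shardSizes[0] {
			return KeepAll // inconsistent shard sizes are ambiguous
		}
	}
	if shardSizes[0] < expected {
		return ReclaimDat // interrupted local encode; the .dat is the full source
	}
	// Equal: a valid or still-distributing volume. Larger: the .dat is stale.
	return KeepAll
}
```

The point of the shape is that deletion of shards is never the default
branch: only the one unambiguous case (uniformly short shards beside a
readable .dat) leaves the KeepAll path.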

handleFoundEcxFile: a shard load failure (corrupt/locked .ecx, EMFILE
during a mass restart, transient I/O) no longer deletes the EC files when a
.dat exists -- it only unloads and keeps the files for retry. All deletion
authority now flows through validateEcVolume.

pruneIncompleteEcWithSiblingDat: count shards NODE-WIDE (a set split across
sibling disks summing to >= dataShards is independently recoverable and is
left alone), and require the sibling .dat size to match, byte for byte, the
size recorded in .vif at encode time before deleting. The prior "at least
this big, or bigger than a superblock" gate could trust a stale .dat and wipe
sole-copy shards. EC encode records the source size in .vif, so this gate
works for real volumes; older volumes without it fail safe (kept).
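
A minimal sketch of the two-part prune gate described above, under stated
assumptions: canPrune and its parameters are hypothetical names, and a
recorded size of zero stands in for "the .vif has no source size". The
real pruneIncompleteEcWithSiblingDat is not shown in this diff.

```go
package main

// canPrune reports whether a partial shard set beside a .dat may be deleted.
// Illustrative only; all names are assumptions, not the real API.
//
//	shardsPerDisk:  shard counts for this volume on each disk of the node
//	dataShards:     data-shard count of the volume's EC ratio
//	datSize:        current size of the sibling .dat
//	vifDatFileSize: source size recorded in .vif at encode time (0 if absent)
func canPrune(shardsPerDisk []int, dataShards int, datSize, vifDatFileSize int64) bool {
	total := 0
	for _, n := range shardsPerDisk {
		total += n
	}
	if total >= dataShards {
		return false // node-wide set is independently recoverable
	}
	if vifDatFileSize <= 0 {
		return false // older volume without a recorded size: fail safe
	}
	// Byte-exact match only; "at least this big" could trust a stale .dat.
	return datSize == vifDatFileSize
}
```

For example, canPrune([]int{6, 5}, 10, 1000, 1000) is false because the
two disks together hold 11 >= 10 shards, even though neither disk alone
holds a recoverable set.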

Rust volume server mirrors all of the above: size-direction + keep-on-
ambiguity in validate_ec_volume, keep-on-load-failure in
handle_found_ecx_file, and the node-wide + byte-exact gate in the prune.
The Rust validate/prune paths now resolve the data-shard count from the
volume's own .vif instead of hardcoding 10+4, so custom-ratio volumes are
not mis-sized and wrongly deleted on reboot.
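
To illustrate why a hardcoded ratio mis-sizes custom-ratio volumes, here
is a simplified model of the expected shard size as a function of the
data-shard count. The block sizes (1 GiB large rows, 1 MiB small rows)
and the name expectedShardSize are assumptions made for this sketch; the
real calculateExpectedShardSize is not part of this diff.

```go
package main

// Assumed block sizes for this sketch only.
const (
	largeBlock int64 = 1 << 30 // 1 GiB per shard per large row
	smallBlock int64 = 1 << 20 // 1 MiB per shard per small row
)

// expectedShardSize models the size of each shard after a full encode of a
// .dat of datSize bytes split across dataShards data shards.
func expectedShardSize(datSize, dataShards int64) int64 {
	largeRow := largeBlock * dataShards
	nLarge := datSize / largeRow
	size := nLarge * largeBlock
	if rest := datSize - nLarge*largeRow; rest > 0 {
		smallRow := smallBlock * dataShards
		// Round the remainder up to whole small rows.
		size += (rest + smallRow - 1) / smallRow * smallBlock
	}
	return size
}
```

Under these assumptions a 100 MiB .dat implies 10 MiB shards with 10 data
shards but 25 MiB shards with 4 data shards, so sizing a custom-ratio
volume against the default ratio yields a mismatch for healthy shards.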

Existing tests that encoded the old (unsafe) "delete on low count / size
mismatch" behavior are updated to the safe expectation, and new regression
tests cover the partial-decode-.dat-keeps-shards and transient-error-keeps
cases (Go and Rust); they fail on the pre-fix code.

* fix(ec): record DatFileSize in planted EC .vif for the prune test; trim comments

The multi-disk lifecycle e2e test planted a partial EC leftover with an
empty .vif, so the byte-exact prune gate (which a real encoded volume
satisfies via its recorded source size) kept it instead of cleaning up.
Record DatFileSize + the EC ratio in the planted .vif, matching production.

Also condense the verbose comments added in this change to the repo's
concise style.
Chris Lu
2026-06-12 23:51:29 -07:00
committed by GitHub
parent 3718301599
commit f724828bcb
8 changed files with 393 additions and 186 deletions
@@ -14,6 +14,8 @@ import (
 	"github.com/seaweedfs/seaweedfs/test/volume_server/matrix"
 	"github.com/seaweedfs/seaweedfs/weed/pb/volume_server_pb"
 	"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding"
+	"github.com/seaweedfs/seaweedfs/weed/storage/needle"
+	"github.com/seaweedfs/seaweedfs/weed/storage/volume_info"
 )
 
 // TestEcLifecycleAcrossMultipleDisks drives encode, mount, read, drop-dat,
@@ -482,7 +484,6 @@ func plantPartialEc(t testing.TB, dir, collection string, volumeID uint32, shard
 	}{
 		{".ecx", []byte("dummy ecx")},
 		{".ecj", nil},
-		{".vif", nil},
 	} {
 		p := filepath.Join(dir, sideFileName(collection, volumeID, side.ext))
 		f, err := os.Create(p)
@@ -497,6 +498,19 @@ func plantPartialEc(t testing.TB, dir, collection string, volumeID uint32, shard
 		}
 		f.Close()
 	}
+	// Record the encode-time source size so the prune's byte-exact gate
+	// recognizes the sibling .dat as the source, as a real encoded volume does.
+	vifPath := filepath.Join(dir, sideFileName(collection, volumeID, ".vif"))
+	if err := volume_info.SaveVolumeInfo(vifPath, &volume_server_pb.VolumeInfo{
+		Version:     uint32(needle.Version3),
+		DatFileSize: datFileSize,
+		EcShardConfig: &volume_server_pb.EcShardConfig{
+			DataShards:   uint32(erasure_coding.DataShardsCount),
+			ParityShards: uint32(erasure_coding.ParityShardsCount),
+		},
+	}); err != nil {
+		t.Fatalf("save planted .vif: %v", err)
+	}
 }
 
 // mirrors calculateExpectedShardSize in weed/storage/disk_location_ec.go