Files
Chris Lu 34f9b91d69 fix(storage): never let an empty .dat delete healthy distributed EC shards (#9930)
* fix(storage): never let an empty .dat delete healthy distributed EC shards

A leftover empty .dat stub (a phantom from the pre-fix loader; zero
needles) next to a distributed EC volume's local shards made startup
classify the volume as an interrupted local encode: validateEcVolume
requires >= dataShards local shards when a .dat is present, fails with
the 1-2 shards a distributed volume keeps per disk, and the cleanup
deletes those shards -- the only copies of that part of the volume.
Repeated across restart waves this destroys enough shards cluster-wide
to make the volume unrecoverable.

Go:
- loadExistingVolume: hoist the empty-stub sweep above the EC presence
  checks. Previously the .vif-next-to-.ecx guard returned before the
  sweep ever ran, so exactly the dangerous layout (stub + .ecx + local
  shards) kept its stub and then lost its shards in loadAllEcShards.
- validateEcVolume / checkDatFileExists: treat a .dat <= a superblock
  (zero needles) as absent. An empty .dat cannot be the encode source,
  so it must never gate shard deletion; this also covers stubs without
  a .vif, which the sweep cannot prove are EC leftovers.

Rust mirror (seaweed-volume): the same gate in validate_ec_volume and
check_dat_file_exists (the Rust sweep already ran before validation);
the volume-load skip keeps a plain existence check so fresh,
needle-less volumes still load.

Regression tests in Go and Rust reproduce the production layout (a
zero-byte .dat beside .ecx/.ecj and two shards of a 10+4 volume, with
and without a .vif) and fail without the fix with the shards deleted.

* fix(ec): gate source volume deletion on a recoverable shard set

After EC encode, the shell command and the (plugin) worker task refused
to delete the source volume unless every shard was present, and aborted
otherwise -- leaving the source .dat next to live shards, exactly the
mixed state the startup cleanup mishandles.

Replace the full-set requirement with a recoverability gate shared by
both callers (RequireRecoverableShardSet): deleting a non-empty source
.dat requires at least dataShards distinct shards cluster-wide. Below
that the source is kept and the encode fails as before. A degraded but
recoverable set (>= dataShards, < total) now proceeds with a warning
instead of aborting: the missing shards can be rebuilt from the
survivors, while keeping the source would preserve the dangerous mixed
state. Empty stub replicas are still swept unguarded (OnlyEmpty) -- an
empty .dat has nothing to lose.

dataShards/totalShards stay parameters so enterprise custom EC ratios
share the helper verbatim.

* test(ec): use recoverable shard verification gate
2026-06-11 20:26:20 -07:00
..
2026-04-09 19:00:06 -07:00
2026-04-09 19:00:06 -07:00