Files
Chris Lu 3718301599 shell: stop ec.encode/ec.rebuild from destroying live EC shards (no crash needed) (#9939)
* shell: stop ec.encode/ec.rebuild from destroying live EC shards

Three operator-triggered shell paths could destroy data with no crash:

ec.encode -volumeId on an already-EC volume tore down its shards before
failing. The volume-id path never checked the id was a regular volume:
the collection lookup scans only VolumeInfos (so an EC-only id maps to
""), and volumeLocations succeeds via the EC-location fallback, so
clearPreexistingEcShards full-teardown-deleted every shard cluster-wide
before doEcEncode failed. An EC volume has no .dat, so this is its only
copy. Add assertEncodableRegularVolumes: each requested id must be a
regular volume in the topology snapshot; an EC-only or unknown id is
refused before any teardown. A volume present as both a regular .dat and
stale orphan shards (a failed-encode retry) still passes. This closes
the operator-rerun/script-retry path; a worker racing the snapshot is a
fencing problem handled separately.

ec.rebuild dry-run (the default, without -apply) still issued real
VolumeEcShardsDelete RPCs: prepareDataToRecover appended every
would-copy shard to copiedShardIds even though the copy was skipped, and
the cleanup defer deleted that set unconditionally. Now a dry-run copies
nothing and records nothing to delete (a separate would-copy counter
drives the recoverability check so the dry-run still reports its plan),
and the cleanup runs only under -apply.

ec.rebuild could also self-destruct a live shard: localShardsInfo was
overwritten per disk instead of unioned, so a shard the rebuilder holds
on a non-last disk looked remote, got copied onto itself (in-place
O_TRUNC) and then node-wide deleted. Union local shards across all
disks, and never copy/delete a shard whose only listed holder is the
rebuilder itself.

* shell: address ec destructive-guards review comments

- countLocalShards: union shards across all of the rebuilder's disks so
  slot accounting matches what prepareDataToRecover treats as local;
  first-match counting overstated slotsNeeded on multi-disk rebuilders
- VolumeEcShardsCopy: resolve SourceDataNode via
  pb.NewServerAddressFromDataNode instead of the raw node id, which may
  not be a dialable host:port
- assertEncodableRegularVolumes: skip nil DiskInfo map entries, matching
  the other topology walks in this file; rename ecOnly to hasEcShards
  since the map marks any volume with shards, not only shard-only ones
2026-06-12 22:30:17 -07:00
..
2026-02-09 01:37:56 -08:00