fix(ec): blanket-clean every destination over the full shard range (#9512)

* fix(ec): blanket-clean every destination over the full shard range

The previous cleanup pass walked t.sources only, with the shard ids the
topology had reported at detection time. In the wild, a destination can
end up with EC shards mounted that the topology snapshot didn't list —
shards on a sibling disk that hadn't heartbeated, or shards left over
from a concurrent attempt's mount step. FindEcVolume still returns
true, so the next ReceiveFile trips the mounted-volume guard.

Cleanup now unions t.sources (with ShardIds) and t.targets and issues
unmount + delete over [0..totalShards-1] on each. Both RPCs are
idempotent on missing shards, so the wider sweep is free.

Two new tests cover the gap: shards mounted beyond what t.sources
lists, and a target-only destination with no source row.

* log(ec): include disk_id in EC unmount/delete/refusal log lines

The current logs identify the volume and shard but leave disk_id off,
which makes the cross-server cleanup story hard to follow when
multiple disks of one server hold pieces of the same volume:

  UnmountEcShards 4121.1                              -> add disk_id
  ec volume video-recordings_4121 shard delete [1 5]  -> add per-loc disk_id
  volume server X:Y deletes ec shards from 4121 [...] -> add disk_id
  ReceiveFile: ec volume 4121 is mounted; refusing... -> add disk_ids

ReceiveFile's refusal now names the disk_ids actually holding the
mount so operators can see whether the next cleanup pass needs to
target a sibling disk. Added Store.FindEcVolumeDiskIds /
Store::find_ec_volume_disk_ids as the supporting primitive.

Mirrored in seaweed-volume/src/ (unmount log in Store::unmount_ec_shard,
heartbeat delete log in diff_ec_shard_delta_messages, refusal in the
ReceiveFile handler).

* test(ec): stub VolumeEcShardsUnmount/Delete on the fake volume server

The plugin-worker EC tests boot a fake volume server that embeds
UnimplementedVolumeServerServer. After the worker started calling
VolumeEcShardsUnmount + VolumeEcShardsDelete pre-distribute, the
default Unimplemented response surfaced as fourteen "method not
implemented" errors and TestErasureCodingExecutionEncodesShards
failed. Both RPCs are no-ops here — nothing on the fake server has
mounted state or persisted shard files to remove.
This commit is contained in:
Chris Lu
2026-05-17 11:31:37 -07:00
committed by GitHub
parent bf9110ebd3
commit 2a41e76101
10 changed files with 255 additions and 42 deletions
+14
View File
@@ -292,6 +292,20 @@ func (v *VolumeServer) VolumeEcShardsMount(ctx context.Context, req *volume_serv
return &volume_server_pb.VolumeEcShardsMountResponse{}, nil
}
// VolumeEcShardsUnmount is a no-op stub: the worker's pre-distribute
// cleanup calls it against every destination, and the fake server has no
// mounted state to clear.
func (v *VolumeServer) VolumeEcShardsUnmount(ctx context.Context, req *volume_server_pb.VolumeEcShardsUnmountRequest) (*volume_server_pb.VolumeEcShardsUnmountResponse, error) {
return &volume_server_pb.VolumeEcShardsUnmountResponse{}, nil
}
// VolumeEcShardsDelete is a no-op stub paired with VolumeEcShardsUnmount
// above; the fake server doesn't persist shard files beyond what
// ReceiveFile wrote, so there's nothing to remove.
func (v *VolumeServer) VolumeEcShardsDelete(ctx context.Context, req *volume_server_pb.VolumeEcShardsDeleteRequest) (*volume_server_pb.VolumeEcShardsDeleteResponse, error) {
return &volume_server_pb.VolumeEcShardsDeleteResponse{}, nil
}
func (v *VolumeServer) VolumeEcShardsInfo(ctx context.Context, req *volume_server_pb.VolumeEcShardsInfoRequest) (*volume_server_pb.VolumeEcShardsInfoResponse, error) {
if req == nil {
return nil, fmt.Errorf("VolumeEcShardsInfo request is nil")