seaweedfs/weed/shell/command_ec_scrub.go
Chris Lu 9658f309d2 EC bitrot detection: per-shard checksum sidecars (#9761)
* ec: add EC bitrot checksum protobuf

EcBitrotProtection/EcShardChecksums/ChecksumAlgorithm sidecar messages,
copy_ecsum_file and unsafe_ignore_sidecar fields, and a CHECKSUM scrub mode.

* ec: bitrot checksum sidecar format, validation, and per-volume load

Per-shard CRC32C block checksums in an optional <base>.ecsum sidecar with a
self-integrity header; validation, rolling builder, backfill primitive, and
EcVolume load on mount + removal on destroy.
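
As a rough illustration of the per-shard block checksums described above, the sketch below computes one CRC32C value per fixed-size block using Go's standard hash/crc32 package. The helper name blockChecksums and the in-memory []byte input are assumptions made for the example; the actual .ecsum layout, header, and rolling builder are not shown in this commit message.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// castagnoli is the CRC32C (Castagnoli) table from the standard library.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// blockChecksums is a hypothetical helper: it returns one CRC32C per
// blockSize-byte block of data. The final block may be shorter.
func blockChecksums(data []byte, blockSize int) []uint32 {
	if blockSize <= 0 {
		return nil
	}
	sums := make([]uint32, 0, (len(data)+blockSize-1)/blockSize)
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		sums = append(sums, crc32.Checksum(data[off:end], castagnoli))
	}
	return sums
}

func main() {
	shard := []byte("0123456789") // stand-in for shard contents
	for i, s := range blockChecksums(shard, 4) {
		fmt.Printf("block %d: %08x\n", i, s)
	}
}
```

Verifying a shard then amounts to recomputing the per-block values and comparing them with the stored ones; a mismatch localizes the damage to one block of one shard.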

* ec: capture per-shard checksums at encode; verify-and-exclude on rebuild

WriteEcFilesWithContext returns the protection computed inline during encoding.
generateMissingEcFiles verifies present inputs against the sidecar, excludes
corrupt ones, regenerates in place, and re-verifies; fail-closed unless
unsafe_ignore_sidecar, removing all generated outputs on failure.

* ec: read-only checksum scrub with Reed-Solomon arbiter

ChecksumScrub verifies each local shard against the sidecar and reconstructs
flagged shards from the clean shards so stale-sidecar false positives are not
reported. Wired to the gRPC CHECKSUM mode and ec.scrub -mode checksum.

* ec: server-side bitrot sidecar write, copy, cleanup, and opportunistic backfill

Write .ecsum at fresh encode; propagate it with copy_ecsum_file (tolerant);
remove it on full delete and decode; rebuild honors unsafe_ignore_sidecar and
opportunistically backfills a sidecar when all shards are reachable.

* ec: volume server bitrot config flags

-ec.bitrotChecksum (default on) and -ec.bitrotBlockSizeMB (default 16).

* fix(ec_bitrot): bound -ec.bitrotBlockSizeMB before the int64 multiply

Validate the MiB value is in [1, 1024] before multiplying by 1 MiB, so a huge
flag value cannot overflow int64 and slip past the power-of-two check, and a
block size cannot collapse a sidecar to a few oversized blocks.
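
The ordering matters because int64 multiplication wraps silently in Go: a value such as 1<<44 MiB multiplied by 1 MiB wraps to 0, and 0 passes the usual size&(size-1) == 0 power-of-two test. A minimal sketch of the range-check-first approach follows; the helper name blockSizeBytes and the maxMB parameter are hypothetical, not the names used in the actual change.

```go
package main

import (
	"errors"
	"fmt"
)

const mib int64 = 1 << 20

// blockSizeBytes is a hypothetical helper. It range-checks the MiB value
// first, so the multiply below cannot overflow int64 (assuming maxMB itself
// is small), and only then requires the byte count to be a power of two.
func blockSizeBytes(sizeMB, maxMB int64) (int64, error) {
	if sizeMB < 1 || sizeMB > maxMB {
		return 0, fmt.Errorf("block size %d MiB out of range [1, %d]", sizeMB, maxMB)
	}
	size := sizeMB * mib
	if size&(size-1) != 0 {
		return 0, errors.New("block size must be a power of two")
	}
	return size, nil
}

func main() {
	for _, mb := range []int64{16, 24, 1 << 44} {
		size, err := blockSizeBytes(mb, 1024)
		fmt.Println(mb, size, err)
	}
}
```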

* fix(ec_bitrot): distribute the .ecsum sidecar from the worker encode path

The worker EC encode wrote the generation-0 sidecar locally but never added it
to shardFiles, so DistributeEcShards never shipped it and the distributed
holders came up unprotected. Append it to shardFiles and map the ecsum shard
type to its extension in the sender so it travels with the shards.

* fix(ec_bitrot): remove orphaned sidecars when the generation is gone

Gate sidecar removal on existingShardCount==0 alone rather than also requiring a
stray .ecx. A sidecar whose shards have all been deleted is orphaned and must be
removed even when no .ecx remains, or it leaks. .ecx/.ecj/.vif removal stays
gated on hasEcxFile as before.

* fix(ec_bitrot): do not fold checksum blocks scanned into TotalFiles

ChecksumScrub's first return is blocks scanned, not files. Discard it so the
scrub response's TotalFiles (a needle/file count) is not inflated by the block
count for CHECKSUM mode.

* test(ec_bitrot): clean up generated .ecsum sidecars in removeGeneratedFiles

* fix(ec_bitrot): reject an oversized sidecar payload before the uint32 cast

The header stores payload_len as a uint32; bound the payload before the
conversion so a pathological manifest cannot truncate the length field and
corrupt the sidecar. A real manifest is a few KB, so this never trips.
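
A small sketch of the guard, assuming a hypothetical helper payloadLenField that turns a payload length into the uint32 header field. Without the bound, the conversion silently keeps only the low 32 bits of the length.

```go
package main

import (
	"fmt"
	"math"
)

// payloadLenField is a hypothetical helper: it rejects lengths that do not
// fit in a uint32 before converting, so the header field is never truncated.
func payloadLenField(n int64) (uint32, error) {
	if n < 0 || n > math.MaxUint32 {
		return 0, fmt.Errorf("payload length %d does not fit in uint32", n)
	}
	return uint32(n), nil
}

func main() {
	n := int64(1<<32 + 5)
	// An unchecked cast keeps only the low 32 bits of n.
	fmt.Println("unchecked cast:", uint32(n))

	_, err := payloadLenField(n)
	fmt.Println("checked:", err)
}
```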

* fix(ec_bitrot): cap -ec.bitrotBlockSizeMB at 64 MiB

The block size becomes the per-shard scratch buffer the scrub/backfill path
allocates, so an over-large value (e.g. 1 GiB) is a memory hazard per concurrent
scrub worker. Lower the upper bound from 1024 to 64 MiB.

* fix(ec_bitrot): add -ecUnsafeIgnoreSidecar to weed tool fix -ecx

The -ecx recovery path reconstructs missing shards via RebuildEcFilesWithContext,
which fails closed on a malformed/stale .ecsum. Without an override flag an
operator could not complete the rebuild without manually deleting the sidecar.
Expose -ecUnsafeIgnoreSidecar (default false) and thread it through.

* fix(ec_bitrot): bound sidecar payload with a direct int constant; drop readFull

Guard len(payload) against a plain int constant (1 GiB) before the allocation
instead of a uint64 MaxUint32 compare, so the allocation-size value is provably
bounded (clears the CodeQL overflow alert) and the math import is no longer
needed. Inline os.File.ReadAt with io.EOF handling in verifyShardFileBlocks and
remove the now-redundant readFull helper (os.File.ReadAt fills the slice or
errors).

* test(ec_bitrot): use slices.Contains instead of a hand-rolled containsU32

* refactor(ec): fold the EcFiles WithContext variants into the base functions

RebuildEcFiles now takes the *ECContext directly (nil => derive from .vif as
before) and WriteEcFiles takes it too (nil => default), removing the parallel
RebuildEcFilesWithContext / WriteEcFilesWithContext names. Callers that had an
explicit context drop the WithContext suffix; the default-context callers pass
nil. No behavior change.

* refactor(ec): pass BackgroundECContext instead of nil to Write/RebuildEcFiles

Add a non-nil BackgroundECContext placeholder (analogous to context.Background())
and have callers with no specific layout pass it instead of a nil *ECContext.
WriteEcFiles resolves a zero/background context to the default ratio and
RebuildEcFiles resolves it from the .vif, so behavior is unchanged.

* fix(ec_bitrot): make BackgroundECContext a func; RebuildEcFiles fails closed on bad .vif

- BackgroundECContext is now a function returning a fresh *ECContext, so callers
  cannot mutate a shared singleton or race on it (and it mirrors context.Background,
  which is also a function).
- RebuildEcFiles now propagates the MaybeLoadVolumeInfo error: a present-but-
  unreadable .vif fails closed instead of silently rebuilding with the default
  ratio (which would corrupt a custom-ratio volume). Pass an explicit ctx to override.
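
The singleton hazard in the first point can be sketched as follows. The ECContext type here is a stand-in with hypothetical fields (DataShards, ParityShards); the real SeaweedFS type may look different. The point is only that a function returning a fresh pointer gives each caller its own value to mutate.

```go
package main

import "fmt"

// ECContext is a stand-in for illustration; its fields are assumptions.
type ECContext struct {
	DataShards   int
	ParityShards int
}

// BackgroundECContext returns a fresh zero-valued context on every call, so
// one caller's mutation cannot leak into, or race with, another caller's.
func BackgroundECContext() *ECContext {
	return &ECContext{}
}

func main() {
	a := BackgroundECContext()
	a.DataShards = 4 // mutate one caller's copy

	b := BackgroundECContext()
	fmt.Println(a != b, b.DataShards) // prints "true 0"
}
```

With a package-level variable instead of a function, both callers would hold the same pointer and the second would observe DataShards == 4.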
2026-05-31 18:52:44 -07:00

173 lines
5.2 KiB
Go

package shell

import (
	"context"
	"flag"
	"fmt"
	"io"
	"strconv"
	"strings"
	"sync"

	"github.com/seaweedfs/seaweedfs/weed/operation"
	"github.com/seaweedfs/seaweedfs/weed/pb"
	"github.com/seaweedfs/seaweedfs/weed/pb/volume_server_pb"
	"google.golang.org/grpc"
)

func init() {
	Commands = append(Commands, &commandEcVolumeScrub{})
}

type commandEcVolumeScrub struct {
	env               *CommandEnv
	volumeServerAddrs []pb.ServerAddress
	volumeIDs         []uint32
	mode              volume_server_pb.VolumeScrubMode
	grpcDialOption    grpc.DialOption
}

func (c *commandEcVolumeScrub) Name() string {
	return "ec.scrub"
}

func (c *commandEcVolumeScrub) Help() string {
	return `scrubs EC volume contents on volume servers.
Supports either scrubbing only needle data, or deep scrubbing file contents as well.
Scrubbing can be limited to specific EC volume IDs for specific volume servers.
By default, all volume IDs across all servers are processed.
`
}

func (c *commandEcVolumeScrub) HasTag(CommandTag) bool {
	return false
}

// Do parses the command flags, resolves the target volume servers, volume IDs,
// and scrub mode, then runs the scrub.
func (c *commandEcVolumeScrub) Do(args []string, commandEnv *CommandEnv, writer io.Writer) (err error) {
	volScrubCommand := flag.NewFlagSet(c.Name(), flag.ContinueOnError)
	nodesStr := volScrubCommand.String("node", "", "comma-separated list of volume server <host>:<port> (optional)")
	volumeIDsStr := volScrubCommand.String("volumeId", "", "comma-separated EC volume IDs to process (optional)")
	mode := volScrubCommand.String("mode", "local", "scrubbing mode (index/local/full/checksum)")
	maxParallelization := volScrubCommand.Int("maxParallelization", DefaultMaxParallelization, "run up to X tasks in parallel, whenever possible")
	if err = volScrubCommand.Parse(args); err != nil {
		return err
	}
	if err = commandEnv.confirmIsLocked(args); err != nil {
		return
	}

	// Use the servers given by -node, or collect the data nodes when none are given.
	c.volumeServerAddrs = []pb.ServerAddress{}
	if *nodesStr != "" {
		for _, addr := range strings.Split(*nodesStr, ",") {
			c.volumeServerAddrs = append(c.volumeServerAddrs, pb.ServerAddress(addr))
		}
	} else {
		dns, err := collectDataNodes(commandEnv, 0)
		if err != nil {
			return err
		}
		for _, dn := range dns {
			c.volumeServerAddrs = append(c.volumeServerAddrs, pb.ServerAddress(dn.Address))
		}
	}

	// Parse -volumeId; blank entries are skipped, and an empty list is sent as-is.
	c.volumeIDs = []uint32{}
	if *volumeIDsStr != "" {
		for _, vids := range strings.Split(*volumeIDsStr, ",") {
			vids = strings.TrimSpace(vids)
			if vids == "" {
				continue
			}
			if vid, err := strconv.ParseUint(vids, 10, 32); err == nil {
				c.volumeIDs = append(c.volumeIDs, uint32(vid))
			} else {
				return fmt.Errorf("invalid volume ID %q", vids)
			}
		}
	}

	// The mode name is matched case-insensitively.
	switch strings.ToUpper(*mode) {
	case "INDEX":
		c.mode = volume_server_pb.VolumeScrubMode_INDEX
	case "LOCAL":
		c.mode = volume_server_pb.VolumeScrubMode_LOCAL
	case "FULL":
		c.mode = volume_server_pb.VolumeScrubMode_FULL
	case "CHECKSUM":
		c.mode = volume_server_pb.VolumeScrubMode_CHECKSUM
	default:
		return fmt.Errorf("unsupported scrubbing mode %q", *mode)
	}
	fmt.Fprintf(writer, "using %s mode\n", c.mode.String())

	c.env = commandEnv
	return c.scrubEcVolumes(writer, *maxParallelization)
}

// scrubEcVolumes sends one ScrubEcVolume request per volume server, at most
// maxParallelization at a time, and prints an aggregated summary.
func (c *commandEcVolumeScrub) scrubEcVolumes(writer io.Writer, maxParallelization int) error {
	var brokenVolumesStr, brokenShardsStr []string
	var details []string
	var totalVolumes, brokenVolumes, brokenShards, totalFiles uint64
	// mu guards count, the totals above, and writes to the progress output.
	var mu sync.Mutex

	ewg := NewErrorWaitGroup(maxParallelization)
	count := 0
	for _, addr := range c.volumeServerAddrs {
		ewg.Add(func() error {
			mu.Lock()
			count++
			fmt.Fprintf(writer, "Scrubbing %s (%d/%d)...\n", addr.String(), count, len(c.volumeServerAddrs))
			mu.Unlock()

			err := operation.WithVolumeServerClient(false, addr, c.env.option.GrpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
				res, err := volumeServerClient.ScrubEcVolume(context.Background(), &volume_server_pb.ScrubEcVolumeRequest{
					Mode:      c.mode,
					VolumeIds: c.volumeIDs,
				})
				if err != nil {
					return err
				}

				mu.Lock()
				defer mu.Unlock()
				totalVolumes += res.GetTotalVolumes()
				totalFiles += res.GetTotalFiles()
				brokenVolumes += uint64(len(res.GetBrokenVolumeIds()))
				brokenShards += uint64(len(res.GetBrokenShardInfos()))
				for _, d := range res.GetDetails() {
					details = append(details, fmt.Sprintf("[%s] %s", addr, d))
				}
				for _, vid := range res.GetBrokenVolumeIds() {
					brokenVolumesStr = append(brokenVolumesStr, fmt.Sprintf("%s:%v", addr, vid))
				}
				for _, si := range res.GetBrokenShardInfos() {
					brokenShardsStr = append(brokenShardsStr, fmt.Sprintf("%s:%v:%v", addr, si.VolumeId, si.ShardId))
				}
				return nil
			})
			return err
		})
	}
	if err := ewg.Wait(); err != nil {
		return err
	}

	fmt.Fprintf(writer, "Scrubbed %d EC files and %d volumes on %d nodes\n", totalFiles, totalVolumes, len(c.volumeServerAddrs))
	if brokenVolumes != 0 {
		fmt.Fprintf(writer, "\nGot scrub failures on %d EC volumes and %d EC shards :(\n", brokenVolumes, brokenShards)
		fmt.Fprintf(writer, "Affected volumes: %s\n", strings.Join(brokenVolumesStr, ", "))
		if len(brokenShardsStr) != 0 {
			fmt.Fprintf(writer, "Affected shards: %s\n", strings.Join(brokenShardsStr, ", "))
		}
		if len(details) != 0 {
			fmt.Fprintf(writer, "Details:\n\t%s\n", strings.Join(details, "\n\t"))
		}
	}
	return nil
}