mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-06-13 23:36:45 +03:00
9658f309d2
* ec: add EC bitrot checksum protobuf
  EcBitrotProtection/EcShardChecksums/ChecksumAlgorithm sidecar messages, copy_ecsum_file and unsafe_ignore_sidecar fields, and a CHECKSUM scrub mode.
* ec: bitrot checksum sidecar format, validation, and per-volume load
  Per-shard CRC32C block checksums in an optional <base>.ecsum sidecar with a self-integrity header; validation, rolling builder, backfill primitive, and EcVolume load on mount + removal on destroy.
* ec: capture per-shard checksums at encode; verify-and-exclude on rebuild
  WriteEcFilesWithContext returns the protection computed inline during encoding. generateMissingEcFiles verifies present inputs against the sidecar, excludes corrupt ones, regenerates in place, and re-verifies; fail-closed unless unsafe_ignore_sidecar, removing all generated outputs on failure.
* ec: read-only checksum scrub with Reed-Solomon arbiter
  ChecksumScrub verifies each local shard against the sidecar and reconstructs flagged shards from the clean shards so stale-sidecar false positives are not reported. Wired to the gRPC CHECKSUM mode and ec.scrub -mode checksum.
* ec: server-side bitrot sidecar write, copy, cleanup, and opportunistic backfill
  Write .ecsum at fresh encode; propagate it with copy_ecsum_file (tolerant); remove it on full delete and decode; rebuild honors unsafe_ignore_sidecar and opportunistically backfills a sidecar when all shards are reachable.
* ec: volume server bitrot config flags
  -ec.bitrotChecksum (default on) and -ec.bitrotBlockSizeMB (default 16).
* fix(ec_bitrot): bound -ec.bitrotBlockSizeMB before the int64 multiply
  Validate the MiB value is in [1, 1024] before multiplying by 1 MiB, so a huge flag value cannot overflow int64 and slip past the power-of-two check, and a block size cannot collapse a sidecar to a few oversized blocks.
* fix(ec_bitrot): distribute the .ecsum sidecar from the worker encode path
  The worker EC encode wrote the generation-0 sidecar locally but never added it to shardFiles, so DistributeEcShards never shipped it and the distributed holders came up unprotected. Append it to shardFiles and map the ecsum shard type to its extension in the sender so it travels with the shards.
* fix(ec_bitrot): remove orphaned sidecars when the generation is gone
  Gate sidecar removal on existingShardCount==0 alone rather than also requiring a stray .ecx. A sidecar whose shards have all been deleted is orphaned and must be removed even when no .ecx remains, or it leaks. .ecx/.ecj/.vif removal stays gated on hasEcxFile as before.
* fix(ec_bitrot): do not fold checksum blocks scanned into TotalFiles
  ChecksumScrub's first return is blocks scanned, not files. Discard it so the scrub response's TotalFiles (a needle/file count) is not inflated by the block count for CHECKSUM mode.
* test(ec_bitrot): clean up generated .ecsum sidecars in removeGeneratedFiles
* fix(ec_bitrot): reject an oversized sidecar payload before the uint32 cast
  The header stores payload_len as a uint32; bound the payload before the conversion so a pathological manifest cannot truncate the length field and corrupt the sidecar. A real manifest is a few KB, so this never trips.
* fix(ec_bitrot): cap -ec.bitrotBlockSizeMB at 64 MiB
  The block size becomes the per-shard scratch buffer the scrub/backfill path allocates, so an over-large value (e.g. 1 GiB) is a memory hazard per concurrent scrub worker. Lower the upper bound from 1024 to 64 MiB.
* fix(ec_bitrot): add -ecUnsafeIgnoreSidecar to weed tool fix -ecx
  The -ecx recovery path reconstructs missing shards via RebuildEcFilesWithContext, which fails closed on a malformed/stale .ecsum. Without an override flag an operator could not complete the rebuild without manually deleting the sidecar. Expose -ecUnsafeIgnoreSidecar (default false) and thread it through.
* fix(ec_bitrot): bound sidecar payload with a direct int constant; drop readFull
  Guard len(payload) against a plain int constant (1 GiB) before the allocation instead of a uint64 MaxUint32 compare, so the allocation-size value is provably bounded (clears the CodeQL overflow alert) and the math import is no longer needed. Inline os.File.ReadAt with io.EOF handling in verifyShardFileBlocks and remove the now-redundant readFull helper (os.File.ReadAt fills the slice or errors).
* test(ec_bitrot): use slices.Contains instead of a hand-rolled containsU32
* refactor(ec): fold the EcFiles WithContext variants into the base functions
  RebuildEcFiles now takes the *ECContext directly (nil => derive from .vif as before) and WriteEcFiles takes it too (nil => default), removing the parallel RebuildEcFilesWithContext / WriteEcFilesWithContext names. Callers that had an explicit context drop the WithContext suffix; the default-context callers pass nil. No behavior change.
* refactor(ec): pass BackgroundECContext instead of nil to Write/RebuildEcFiles
  Add a non-nil BackgroundECContext placeholder (analogous to context.Background()) and have callers with no specific layout pass it instead of a nil *ECContext. WriteEcFiles resolves a zero/background context to the default ratio and RebuildEcFiles resolves it from the .vif, so behavior is unchanged.
* fix(ec_bitrot): make BackgroundECContext a func; RebuildEcFiles fails closed on bad .vif
  - BackgroundECContext is now a function returning a fresh *ECContext, so callers cannot mutate a shared singleton or race on it (and it mirrors context.Background, which is also a function).
  - RebuildEcFiles now propagates the MaybeLoadVolumeInfo error: a present-but-unreadable .vif fails closed instead of silently rebuilding with the default ratio (which would corrupt a custom-ratio volume). Pass an explicit ctx to override.
173 lines
5.2 KiB
Go
package shell

import (
	"context"
	"flag"
	"fmt"
	"io"
	"strconv"
	"strings"
	"sync"

	"github.com/seaweedfs/seaweedfs/weed/operation"
	"github.com/seaweedfs/seaweedfs/weed/pb"
	"github.com/seaweedfs/seaweedfs/weed/pb/volume_server_pb"
	"google.golang.org/grpc"
)

func init() {
	Commands = append(Commands, &commandEcVolumeScrub{})
}

type commandEcVolumeScrub struct {
	env               *CommandEnv
	volumeServerAddrs []pb.ServerAddress
	volumeIDs         []uint32
	mode              volume_server_pb.VolumeScrubMode
	grpcDialOption    grpc.DialOption
}

func (c *commandEcVolumeScrub) Name() string {
	return "ec.scrub"
}

func (c *commandEcVolumeScrub) Help() string {
	return `scrubs EC volume contents on volume servers.

	Supports either scrubbing only needle data, or deep scrubbing file contents as well.

	Scrubbing can be limited to specific EC volume IDs for specific volume servers.
	By default, all volume IDs across all servers are processed.
`
}

func (c *commandEcVolumeScrub) HasTag(CommandTag) bool {
	return false
}

func (c *commandEcVolumeScrub) Do(args []string, commandEnv *CommandEnv, writer io.Writer) (err error) {
	volScrubCommand := flag.NewFlagSet(c.Name(), flag.ContinueOnError)
	nodesStr := volScrubCommand.String("node", "", "comma-separated list of volume server <host>:<port> (optional)")
	volumeIDsStr := volScrubCommand.String("volumeId", "", "comma-separated EC volume IDs to process (optional)")
	mode := volScrubCommand.String("mode", "local", "scrubbing mode (index/local/full/checksum)")
	maxParallelization := volScrubCommand.Int("maxParallelization", DefaultMaxParallelization, "run up to X tasks in parallel, whenever possible")

	if err = volScrubCommand.Parse(args); err != nil {
		return err
	}
	if err = commandEnv.confirmIsLocked(args); err != nil {
		return
	}

	c.volumeServerAddrs = []pb.ServerAddress{}
	if *nodesStr != "" {
		for _, addr := range strings.Split(*nodesStr, ",") {
			c.volumeServerAddrs = append(c.volumeServerAddrs, pb.ServerAddress(addr))
		}
	} else {
		dns, err := collectDataNodes(commandEnv, 0)
		if err != nil {
			return err
		}
		for _, dn := range dns {
			c.volumeServerAddrs = append(c.volumeServerAddrs, pb.ServerAddress(dn.Address))
		}
	}

	c.volumeIDs = []uint32{}
	if *volumeIDsStr != "" {
		for _, vids := range strings.Split(*volumeIDsStr, ",") {
			vids = strings.TrimSpace(vids)
			if vids == "" {
				continue
			}
			if vid, err := strconv.ParseUint(vids, 10, 32); err == nil {
				c.volumeIDs = append(c.volumeIDs, uint32(vid))
			} else {
				return fmt.Errorf("invalid volume ID %q", vids)
			}
		}
	}

	switch strings.ToUpper(*mode) {
	case "INDEX":
		c.mode = volume_server_pb.VolumeScrubMode_INDEX
	case "LOCAL":
		c.mode = volume_server_pb.VolumeScrubMode_LOCAL
	case "FULL":
		c.mode = volume_server_pb.VolumeScrubMode_FULL
	case "CHECKSUM":
		c.mode = volume_server_pb.VolumeScrubMode_CHECKSUM
	default:
		return fmt.Errorf("unsupported scrubbing mode %q", *mode)
	}
	fmt.Fprintf(writer, "using %s mode\n", c.mode.String())
	c.env = commandEnv

	return c.scrubEcVolumes(writer, *maxParallelization)
}

func (c *commandEcVolumeScrub) scrubEcVolumes(writer io.Writer, maxParallelization int) error {
	var brokenVolumesStr, brokenShardsStr []string
	var details []string
	var totalVolumes, brokenVolumes, brokenShards, totalFiles uint64
	var mu sync.Mutex

	ewg := NewErrorWaitGroup(maxParallelization)
	count := 0
	for _, addr := range c.volumeServerAddrs {
		ewg.Add(func() error {
			mu.Lock()
			count++
			fmt.Fprintf(writer, "Scrubbing %s (%d/%d)...\n", addr.String(), count, len(c.volumeServerAddrs))
			mu.Unlock()

			err := operation.WithVolumeServerClient(false, addr, c.env.option.GrpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
				res, err := volumeServerClient.ScrubEcVolume(context.Background(), &volume_server_pb.ScrubEcVolumeRequest{
					Mode:      c.mode,
					VolumeIds: c.volumeIDs,
				})
				if err != nil {
					return err
				}

				mu.Lock()
				defer mu.Unlock()

				totalVolumes += res.GetTotalVolumes()
				totalFiles += res.GetTotalFiles()
				brokenVolumes += uint64(len(res.GetBrokenVolumeIds()))
				brokenShards += uint64(len(res.GetBrokenShardInfos()))
				for _, d := range res.GetDetails() {
					details = append(details, fmt.Sprintf("[%s] %s", addr, d))
				}
				for _, vid := range res.GetBrokenVolumeIds() {
					brokenVolumesStr = append(brokenVolumesStr, fmt.Sprintf("%s:%v", addr, vid))
				}
				for _, si := range res.GetBrokenShardInfos() {
					brokenShardsStr = append(brokenShardsStr, fmt.Sprintf("%s:%v:%v", addr, si.VolumeId, si.ShardId))
				}

				return nil
			})
			return err
		})
	}
	if err := ewg.Wait(); err != nil {
		return err
	}

	fmt.Fprintf(writer, "Scrubbed %d EC files and %d volumes on %d nodes\n", totalFiles, totalVolumes, len(c.volumeServerAddrs))
	if brokenVolumes != 0 {
		fmt.Fprintf(writer, "\nGot scrub failures on %d EC volumes and %d EC shards :(\n", brokenVolumes, brokenShards)
		fmt.Fprintf(writer, "Affected volumes: %s\n", strings.Join(brokenVolumesStr, ", "))
		if len(brokenShardsStr) != 0 {
			fmt.Fprintf(writer, "Affected shards: %s\n", strings.Join(brokenShardsStr, ", "))
		}
		if len(details) != 0 {
			fmt.Fprintf(writer, "Details:\n\t%s\n", strings.Join(details, "\n\t"))
		}
	}
	return nil
}