mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-06-13 23:36:45 +03:00
79ac279fe1
* feat(ec): add encode_ts_ns to EC shard metadata and the shard read RPC

EcShardConfig and VolumeEcShardReadRequest gain an int64 encode_ts_ns (encode time in unix nanos). It rides in .vif and the read request so a read can be scoped to the encode run that produced the index.

* fix(ec): stamp each encode and reject cross-run shard reads

Generate stamps EncodeTsNs into the volume's .vif. Reads carry it to the shard's owning volume (resolved together via FindEcVolumeWithShard, so a multi-disk server validates the disk that actually serves the bytes) and reject a shard from a different encode run, recovering from parity. A zero on either side (pre-upgrade volume) skips the guard.

* fix(ec): stamp the encode identity on the worker-generated .vif

The worker-local encode path now writes EncodeTsNs (and the resolved EC ratio) into the .vif, so the read guard is not silently off for volumes encoded by the maintenance worker.

* fix(ec): wipe stale EC artifacts before re-encoding

VolumeEcShardsGenerate evicts any in-memory EcVolume for the volume and removes its on-disk shard/index/sidecar files before writing fresh ones, so a retried encode never builds on a partial prior run and the unlink frees the inodes instead of leaving open fds serving old bytes.

* fix(ec): unmount EC shards across all disks

UnmountEcShards walked only the first disk holding the shard, leaving a duplicate copy mounted on a sibling disk (split-disk reconciled volumes) still serving and heartbeating. Traverse every disk and emit one deletion delta per disk.

* fix(ec): delete orphan shards without a local .ecx

deleteEcShardIdsForEachLocation gated shard-file removal on a local .ecx, so it could not clean an orphan .ecNN left by a failed copy on a disk with no index. Delete the requested shard files unconditionally; the index-file (.ecx/.ecj/.vif) routing stays gated as before.
* fix(ec): clear stale EC shards cluster-wide before re-encoding

ec.encode unmounts and deletes EC shards for the target volumes on every node before regenerating: fatal for the shards the topology reports (mounted leftovers), best-effort for the rest (a sweep that catches unmounted failed-copy orphans). A down node is a no-op.

* fix(ec): don't nil EC fds on close so reads can't race eviction

A reader resolves an EcVolume/shard under the lock then reads after it is released, so an eviction that nils ecxFile/ecdFile would race that read and panic. Close the fds without nilling the fields: the field is now write-once (no data race) and a concurrent read hits a closed fd, getting a clean error that the caller recovers from parity.

* fix(ec): wipe stale EC artifacts on every disk and surface failures

The pre-encode wipe only deleted beside the source volume, so a stale shard on a sibling disk survived and could be mounted against the new index at reconcile. Sweep every disk. Removal also ignored os.Remove errors, reporting a failed cleanup as success and letting a stale shard join the next generation; surface the first real failure (treating already-gone as success) from removeStaleEcArtifacts and the shard delete.

* fix(ec): log when a local shard is skipped for a different encode run

The cross-run guard returned errShardNotLocal, indistinguishable in logs from a genuinely-absent shard. Add a V(1) line naming both EncodeTsNs so operators can tell "wrong encode generation" from "shard not here".

* fix(ec): surface metadata removal failures in the shard delete path

deleteEcShardIdsForEachLocation still dropped os.Remove errors on the .ecx/.ecj/.vif/sidecar cleanup. A surviving stale .ecx is the orphan-index condition this path prevents, so route those through removeFileIfExists and return the first real failure instead of reporting cleanup as success.
* fix(ec): fail orphan cleanup when a reachable node's delete fails

The pre-encode orphan sweep swallowed every error for unreported (node, volume) pairs. That is only safe for an unreachable node, which cannot receive this encode's new generation. A reachable node whose delete genuinely failed (permission/IO) keeps an orphan shard that a later copy re-stamps with the new run's volume-level .vif identity, so the read guard would accept stale data. Surface those; stay best-effort only for unreachable nodes (gRPC Unavailable / no status).

* fix(ec): guard ecjFile under its lock in the EC delete path

EcVolume.Close nils ecjFile under ecjFileAccessLock; a delete that resolved its .ecx lookup before a concurrent eviction (the generate-time UnloadEcVolume) could then reach the journal append with a nil fd. Bail with a clear "volume closed" error under the lock instead.

* fix(ec): reject an unstamped shard when the caller has an encode identity

The read guard required both identities nonzero, so a current (stamped) caller accepted a holder with identity 0 and could be served a stale pre-upgrade shard. Reject when the caller is stamped and the holder differs (including unstamped); stay lenient only when the caller itself has no identity (pre-upgrade reader). A skipped shard recovers from parity.

* fix(ec): full-teardown delete so cluster cleanup wipes a whole generation

The pre-encode cluster sweep deleted only the listed canonical shards on remote nodes, leaving index/sidecar (and, on builds with versioned generations, those too) behind. Add a full_teardown flag to VolumeEcShardsDelete that evicts the volume and wipes every EC artifact for it on every disk via removeStaleEcArtifacts; the shell and worker pre-encode cleanup paths set it. Other delete callers (balance/decode/repair) are unchanged.
* fix(ec): take ecjFileAccessLock before the nil-check in Sync and Close

Sync and Close read ev.ecjFile before acquiring ecjFileAccessLock while Close nils it under the lock, a data race on the field. Take the lock first, then nil-check inside, in both.

* fix(ec): acknowledge full_teardown so a pre-upgrade server can't fake success

An old volume server silently ignores full_teardown and returns success for an ordinary delete, so the caller wrongly believes the generation was wiped and copies a fresh gen-0 onto an unwiped node. Echo full_teardown_done in the response; the worker destination cleanup fails when it is absent, and the shell cluster sweep fails for a reported (mounted) leftover while staying best-effort for an unreported node. encode_ts_ns stays an accepted transient (an old server just skips the new read guard, no regression).

* fix(ec): fail the pre-encode sweep for any reachable node that can't ack teardown

A reachable pre-upgrade server ignores full_teardown and returns success without wiping an orphan, which a later copy then folds into the new generation. Treat a missing full_teardown_done ack as fatal for every reachable node (best-effort only for a gRPC-unreachable one), not just for topology-reported pairs.

* fix(ec): return the served shard identity and validate it client-side

The encode identity was only enforced server-side, so a pre-upgrade server ignored the request field and served bytes unchecked. Echo the served shard's EncodeTsNs on every read response chunk and have the client reject a mismatch (including 0 from an old server), so the guard holds regardless of server version; a rejected read recovers from parity.

* fix(ec): reject a short/empty remote shard read instead of serving zeros

doReadRemoteEcShardInterval accepted an immediate EOF or a short stream and returned success with a partly zero-filled, unvalidated buffer (the server stamps the identity only on chunks that carry bytes). A non-deleted interval must arrive whole: require n == len(buf), exempting the is_deleted short-circuit (n=0), matching readLocalEcShardInterval's local check. A short read now fails so the caller recovers from parity.

* test(ec): fake volume server echoes the full_teardown acknowledgement

The worker now fails a teardown delete that isn't acknowledged (so a pre-upgrade server can't silently skip the wipe). The fake server's no-op VolumeEcShardsDelete returned an empty response, which the worker read as a skipped teardown and aborted the encode. Echo full_teardown_done.

* feat(ec): mirror the encode-run identity guard + full_teardown into the Rust volume server

The Go volume server stamps an encode-run identity (encode_ts_ns) into the .vif and rejects a read served from a shard of a different run; full_teardown wipes a whole generation and acknowledges it. The Rust volume server had none of it. Mirror the shared logic: load encode_ts_ns from the .vif onto the EcVolume, stamp it on every read response, and reject a request/response mismatch on both the server and the distributed-read client (recovering from parity); handle full_teardown by evicting the volume and wiping every EC artifact on each disk, echoing full_teardown_done so the caller can detect a server that ignored it.

* fix(ec): remove a stale .vif on full teardown of a shard-only node

A shard copy installs shards + .ecx before .vif, so an interrupted copy after a teardown could mount the new files under the previous run's identity / version / shard ratio / dat_file_size carried by the surviving .vif. Remove .vif during full teardown, gated on .idx absence so a source-volume holder keeps its live .vif. In Rust this lives in a teardown-only helper so the reconcile / load-fallback paths (which share the base removal) still preserve .vif.
* fix(ec): treat a missing teardown ack as fatal, not as an unreachable node

isNodeUnreachable returned true for any non-gRPC-status error, so a reachable pre-upgrade server's missing full_teardown_done ack (a plain error) was classified unreachable and the unreported pair was silently skipped. Classify only a real codes.Unavailable as unreachable, and wrap the missing ack in a sentinel the sweep treats as fatal regardless. A genuinely down node still surfaces as Unavailable from the RPC and stays best-effort.

* fix(ec): reject a short shard read in the local EC needle reader

read_ec_shard_needle ignored the byte count from shard.read_at and appended the whole pre-sized buffer, so a truncated shard's zero-filled tail passed the later length check and parsed as garbage. Require n == buf.len() per interval, erroring on a short read like the local interval reader already does.

* fix(ec): probe reachability before skipping a node that returns Unavailable

The pre-encode sweep skipped any node whose teardown delete returned codes.Unavailable, but a reachable volume server in maintenance mode also returns that code for the maintenance-gated delete, so its stale EC files were left behind on a node that can still receive the new generation. Confirm with a non-maintenance-gated empty-target Ping: skip only when the node fails the probe too (genuinely unreachable).

* fix(ec): use try_exists for the teardown .vif .idx guard

The teardown-only .vif removal gated on Path::exists(), which returns false on a permission/IO stat error, so a stat failure on a present .idx would read as a shard-only node and delete the live source volume's .vif. Gate on try_exists() == Ok(false) instead, preserving the sidecar on any stat error.
* fix(ec): only skip a sweep node when a Ping confirms it is transport-down

The pre-encode sweep skipped a node whenever its teardown delete and a liveness Ping both failed, but it treated ANY Ping error as down — an application-level Internal/ResourceExhausted, or Unimplemented from a pre-Ping server, left a reachable node's stale generation in place. Classify the Ping tri-state and skip only when it transport-fails with codes.Unavailable; a reachable or inconclusive node stays fatal.

* fix(ec): exclude sweep-skipped nodes from the encode's rebalance

The pre-encode sweep skips a genuinely-down node best-effort, but the rebalance then recollected the current topology — a node that recovered between the two could become a copy target and receive the new generation while still holding its stale, never-cleared shards. Have the sweep return the skipped set and exclude those nodes from the rebalance for this encode, so a node we could not clean cannot receive the new generation. Standalone ec.balance is unaffected.

* fix(ec): re-sweep recovered nodes before generation so they aren't stranded

A node skipped as down by the pre-encode sweep is excluded from the rebalance, but it can recover and become the generation host — mounting all shards locally, then being excluded from distribution. Union-only verification accepts all shards on one node and deletes the originals: a single point of failure. Re-sweep the skipped nodes just before generation; one whose teardown now succeeds leaves the skipped set and rebalances normally, while a node still down stays skipped.

* fix(ec): abort the encode if a selected source is still skipped after re-sweep

The re-sweep un-skips a recovered node, but the source was selected before it and a node can stay down through the re-sweep then recover just in time to be the generation host — mounting all shards locally while still excluded from the rebalance, which union-only verification accepts before deleting the originals. Abort the encode when a selected source remains skipped after the re-sweep.

* fix(ec): batch delete returns retriable 503 when a volume became EC mid-batch

If a volume is not EC at the batch-delete classification but is encoded to EC and its .dat deleted before the regular-volume mutation, the mutation returns an exact "not found" that the filer chunk-GC treats as completed, dropping the delete. Recheck EC presence under the mutation lock and return a retriable 503 with the "try again" token so the filer requeues it onto the EC path.

* fix(ec): recheck EC state before the regular batch-delete mutation

ec.encode mounts EC shards (copied from the .dat) before deleting the originals, so a volume can be EC while its .dat still exists. The batch delete only rechecked EC after a NotFound, so a successful regular-volume delete in that window wrote a tombstone to the soon-removed .dat — the delete was lost and the needle resurrected from the pre-tombstone shards. Recheck has_ec_volume under the write lock before delete_volume_needle and return a retriable 503 so the filer requeues onto the EC path.

* fix(volume): make the metrics push test independent of test order

test_push_metrics_once asserted the pushed body contains the request-counter family without ever touching the counter — a CounterVec with no children emits nothing, so the assertion only held when another test had already created a labelset in the shared registry. Create one in the test itself.
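The read-guard rule that the commits above converge on (a stamped caller rejects any holder whose identity differs, including an unstamped one; an unstamped caller stays lenient) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `sameEncodeRun` and the sample values are hypothetical and are not taken from the SeaweedFS source, and the real guard additionally recovers a rejected read from parity.

```go
package main

import "fmt"

// sameEncodeRun is a hypothetical helper (not a SeaweedFS function) that
// models the encode-identity check described in the commit message.
// callerTsNs is the encode_ts_ns the reader expects; servedTsNs is the
// identity of the shard that would serve (or has served) the bytes.
func sameEncodeRun(callerTsNs, servedTsNs int64) bool {
	if callerTsNs == 0 {
		// Pre-upgrade reader: it has no identity to enforce, so stay lenient.
		return true
	}
	// Stamped caller: accept only an exact match. An unstamped holder (0)
	// or a shard from another encode run is rejected.
	return callerTsNs == servedTsNs
}

func main() {
	// Illustrative identities only; real values are unix-nano timestamps.
	cases := []struct{ caller, served int64 }{
		{0, 42},  // unstamped caller
		{42, 42}, // same encode run
		{42, 0},  // stamped caller, unstamped holder
		{42, 43}, // different encode runs
	}
	for _, c := range cases {
		fmt.Printf("caller=%d served=%d accept=%v\n", c.caller, c.served, sameEncodeRun(c.caller, c.served))
	}
}
```

The same comparison applies on the client side when it validates the identity echoed on each read response chunk, which is why a reply of 0 from a pre-upgrade server is rejected for a stamped caller.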
861 lines
34 KiB
Go
package shell

import (
	"context"
	"errors"
	"flag"
	"fmt"
	"io"
	"regexp"
	"sort"
	"strconv"
	"sync"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/storage/types"

	"github.com/seaweedfs/seaweedfs/weed/glog"
	"github.com/seaweedfs/seaweedfs/weed/pb"
	"github.com/seaweedfs/seaweedfs/weed/wdclient"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	"github.com/seaweedfs/seaweedfs/weed/operation"
	"github.com/seaweedfs/seaweedfs/weed/pb/master_pb"
	"github.com/seaweedfs/seaweedfs/weed/pb/volume_server_pb"
	"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding"
	"github.com/seaweedfs/seaweedfs/weed/storage/needle"
	"github.com/seaweedfs/seaweedfs/weed/storage/volume_replica"
)

func init() {
	Commands = append(Commands, &commandEcEncode{})
}

type commandEcEncode struct {
}

func (c *commandEcEncode) Name() string {
	return "ec.encode"
}

func (c *commandEcEncode) Help() string {
	return `apply erasure coding to a volume

	ec.encode [-collection=""] [-fullPercent=95 -quietFor=1h] [-verbose] [-sourceDiskType=<disk_type>] [-diskType=<disk_type>]
	ec.encode [-collection=""] [-volumeId=<volume_id>] [-verbose] [-diskType=<disk_type>]

	This command will:
	1. freeze one volume
	2. apply erasure coding to the volume
	3. (optionally) re-balance encoded shards across multiple volume servers

	The erasure coding is 10.4 (10 data shards plus 4 parity shards). So ideally you have at least
	14 volume servers, and you can afford to lose 4 volume servers.

	If the number of volume servers is not high, the worst case is that you only have 4 volume servers,
	and the shards are spread as 4,4,3,3, respectively. You can afford to lose one volume server.

	If you have fewer than 4 volume servers, with erasure coding, you can at least afford to
	have 4 corrupted shard files.

	The -collection parameter supports regular expressions for pattern matching:
	  - Use exact match: ec.encode -collection="^mybucket$"
	  - Match multiple buckets: ec.encode -collection="bucket.*"
	  - Match all collections: ec.encode -collection=".*"

	Options:
	  -verbose: show detailed reasons why volumes are not selected for encoding
	  -sourceDiskType: filter source volumes by disk type (hdd, ssd, or empty for all)
	  -diskType: target disk type for EC shards (hdd, ssd, or empty for default hdd)

	Examples:
	  # Encode SSD volumes to SSD EC shards (same tier)
	  ec.encode -collection=mybucket -sourceDiskType=ssd -diskType=ssd

	  # Encode SSD volumes to HDD EC shards (tier migration to cheaper storage)
	  ec.encode -collection=mybucket -sourceDiskType=ssd -diskType=hdd

	  # Encode all volumes to SSD EC shards
	  ec.encode -collection=mybucket -diskType=ssd

	Re-balancing algorithm:
	` + ecBalanceAlgorithmDescription
}

func (c *commandEcEncode) HasTag(CommandTag) bool {
	return false
}

func (c *commandEcEncode) Do(args []string, commandEnv *CommandEnv, writer io.Writer) (err error) {

	encodeCommand := flag.NewFlagSet(c.Name(), flag.ContinueOnError)
	volumeId := encodeCommand.Int("volumeId", 0, "the volume id")
	collection := encodeCommand.String("collection", "", "the collection name")
	fullPercentage := encodeCommand.Float64("fullPercent", 95, "the volume reaches the percentage of max volume size")
	quietPeriod := encodeCommand.Duration("quietFor", time.Hour, "select volumes with no writes for this period")
	maxParallelization := encodeCommand.Int("maxParallelization", DefaultMaxParallelization, "run up to X tasks in parallel, whenever possible")
	forceChanges := encodeCommand.Bool("force", false, "force the encoding even if the cluster has fewer than the recommended 4 nodes")
	shardReplicaPlacement := encodeCommand.String("shardReplicaPlacement", "", "replica placement for EC shards, or master default if empty")
	sourceDiskTypeStr := encodeCommand.String("sourceDiskType", "", "filter source volumes by disk type (hdd, ssd, or empty for all)")
	diskTypeStr := encodeCommand.String("diskType", "", "target disk type for EC shards (hdd, ssd, or empty for default hdd)")
	applyBalancing := encodeCommand.Bool("rebalance", true, "re-balance EC shards after creation (default: true)")
	verbose := encodeCommand.Bool("verbose", false, "show detailed reasons why volumes are not selected for encoding")

	if err = encodeCommand.Parse(args); err != nil {
		return nil
	}
	if err = commandEnv.confirmIsLocked(args); err != nil {
		return
	}
	rp, err := parseReplicaPlacementArg(commandEnv, *shardReplicaPlacement)
	if err != nil {
		return err
	}

	// Parse source disk type filter (optional)
	var sourceDiskType *types.DiskType
	if *sourceDiskTypeStr != "" {
		sdt := types.ToDiskType(*sourceDiskTypeStr)
		sourceDiskType = &sdt
	}

	// Parse target disk type for EC shards
	diskType := types.ToDiskType(*diskTypeStr)

	// collect topology information
	topologyInfo, _, err := collectTopologyInfo(commandEnv, 0)
	if err != nil {
		return err
	}

	if !*forceChanges {
		var nodeCount int
		eachDataNode(topologyInfo, func(dc DataCenterId, rack RackId, dn *master_pb.DataNodeInfo) {
			nodeCount++
		})
		if nodeCount < erasure_coding.ParityShardsCount {
			glog.V(0).Infof("skip erasure coding with %d nodes, less than recommended %d nodes", nodeCount, erasure_coding.ParityShardsCount)
			return nil
		}
	}

	var volumeIds []needle.VolumeId
	var balanceCollections []string
	if vid := needle.VolumeId(*volumeId); vid != 0 {
		// volumeId is provided
		volumeIds = append(volumeIds, vid)
		balanceCollections = collectCollectionsForVolumeIds(topologyInfo, volumeIds)
	} else {
		// apply to all volumes for the given collection pattern (regex)
		volumeIds, balanceCollections, err = collectVolumeIdsForEcEncode(commandEnv, *collection, sourceDiskType, *fullPercentage, *quietPeriod, *verbose)
		if err != nil {
			return err
		}
	}
	if len(volumeIds) == 0 {
		fmt.Println("No volumes, nothing to do.")
		return nil
	}

	// Collect volume ID to collection name mapping for the sync operation
	volumeIdToCollection := collectVolumeIdToCollection(topologyInfo, volumeIds)

	// Collect volume locations BEFORE EC encoding starts to avoid race condition
	// where the master metadata is updated after EC encoding but before deletion
	fmt.Printf("Collecting volume locations for %d volumes before EC encoding...\n", len(volumeIds))
	volumeLocationsMap, err := volumeLocations(commandEnv, volumeIds)
	if err != nil {
		return fmt.Errorf("failed to collect volume locations before EC encoding: %w", err)
	}

	// Pre-flight check: verify the target disk type has capacity for EC shards
	// This prevents encoding shards only to fail during rebalance
	_, totalFreeEcSlots, err := collectEcNodesForDC(commandEnv, "", diskType)
	if err != nil {
		return fmt.Errorf("failed to check EC shard capacity: %w", err)
	}

	// Calculate required slots: each volume needs TotalShardsCount (14) shards distributed
	requiredSlots := len(volumeIds) * erasure_coding.TotalShardsCount
	if totalFreeEcSlots < 1 {
		// No capacity at all on the target disk type
		if diskType != types.HardDriveType {
			return fmt.Errorf("no free ec shard slots on disk type '%s'. The target disk type has no capacity.\n"+
				"Your volumes are likely on a different disk type. Try:\n"+
				" ec.encode -collection=%s -diskType=hdd\n"+
				"Or omit -diskType to use the default (hdd)", diskType, *collection)
		}
		return fmt.Errorf("no free ec shard slots. only %d left on disk type '%s'", totalFreeEcSlots, diskType)
	}

	if totalFreeEcSlots < requiredSlots {
		fmt.Printf("Warning: limited EC shard capacity. Need %d slots for %d volumes, but only %d slots available on disk type '%s'.\n",
			requiredSlots, len(volumeIds), totalFreeEcSlots, diskType)
		fmt.Printf("Rebalancing may not achieve optimal distribution.\n")
	}

	// encode all requested volumes...
	skippedNodes, err := doEcEncode(commandEnv, writer, volumeIdToCollection, volumeIds, *maxParallelization, topologyInfo)
	if err != nil {
		return fmt.Errorf("ec encode for volumes %v: %w", volumeIds, err)
	}
	// ...re-balance ec shards, excluding nodes the orphan sweep could not reach so
	// a recovered node's stale orphan is never paired with a new-generation shard...
	if err := EcBalance(commandEnv, balanceCollections, "", rp, diskType, *maxParallelization, *applyBalancing, skippedNodes); err != nil {
		return fmt.Errorf("re-balance ec shards for collection(s) %v: %w", balanceCollections, err)
	}
	// A partial encode followed by source deletion is unrecoverable.
	if err := verifyEcShardsBeforeDelete(commandEnv, volumeIds, diskType); err != nil {
		return fmt.Errorf("verify EC shards before deleting originals: %w", err)
	}
	// ...then delete original volumes using pre-collected locations.
	fmt.Printf("Deleting original volumes after EC encoding...\n")
	if err := doDeleteVolumesWithLocations(commandEnv, volumeIds, volumeLocationsMap, *maxParallelization); err != nil {
		return fmt.Errorf("delete original volumes after EC encoding: %w", err)
	}
	fmt.Printf("Successfully completed EC encoding for %d volumes\n", len(volumeIds))

	return nil
}

func volumeLocations(commandEnv *CommandEnv, volumeIds []needle.VolumeId) (map[needle.VolumeId][]wdclient.Location, error) {
	res := map[needle.VolumeId][]wdclient.Location{}
	for _, vid := range volumeIds {
		ls, ok := commandEnv.MasterClient.GetLocationsClone(uint32(vid))
		if !ok {
			return nil, fmt.Errorf("volume %d not found", vid)
		}
		res[vid] = ls
	}

	return res, nil
}

func doEcEncode(commandEnv *CommandEnv, writer io.Writer, volumeIdToCollection map[needle.VolumeId]string, volumeIds []needle.VolumeId, maxParallelization int, topologyInfo *master_pb.TopologyInfo) (skippedNodes map[pb.ServerAddress]struct{}, err error) {
	if !commandEnv.isLocked() {
		return nil, fmt.Errorf("lock is lost")
	}
	locations, err := volumeLocations(commandEnv, volumeIds)
	if err != nil {
		return nil, fmt.Errorf("failed to get volume locations for EC encoding: %w", err)
	}

	// Clear EC shards left by a previous failed/partial encode so a retry
	// starts clean and never mixes two encode runs. A node skipped here as
	// unreachable is excluded from the later balance: it may still hold a stale
	// orphan that, paired with a new-generation shard from a balance copy, would
	// mix generations on that node.
	skippedNodes, err = clearPreexistingEcShards(commandEnv, topologyInfo, volumeIds, volumeIdToCollection, maxParallelization)
	if err != nil {
		return nil, fmt.Errorf("clear pre-existing ec shards before encoding: %w", err)
	}

	// Build a map of (volumeId, serverAddress) -> freeVolumeCount.
	// Key by dn.Address so it matches wdclient.Location.Url. In deployments
	// where dn.Id is a short name (e.g. Kubernetes StatefulSet pod name)
	// while dn.Address is a FQDN:port, keying by dn.Id would never match the
	// location Url during the health-check lookup below.
	freeVolumeCountMap := make(map[string]int) // key: volumeId-serverAddress
	eachDataNode(topologyInfo, func(dc DataCenterId, rack RackId, dn *master_pb.DataNodeInfo) {
		addr := dn.Address
		if addr == "" {
			addr = dn.Id // older nodes use ip:port as id
		}
		for _, diskInfo := range dn.DiskInfos {
			for _, v := range diskInfo.VolumeInfos {
				key := fmt.Sprintf("%d-%s", v.Id, addr)
				freeVolumeCountMap[key] = int(diskInfo.FreeVolumeCount)
			}
		}
	})

	// Filter replicas by free capacity BEFORE marking volumes readonly so that
	// a failed health check does not strand volumes in readonly state.
	filteredLocations := make(map[needle.VolumeId][]wdclient.Location)
	for _, vid := range volumeIds {
		var filteredLocs []wdclient.Location
		for _, l := range locations[vid] {
			key := fmt.Sprintf("%d-%s", vid, l.Url)
			if freeCount, found := freeVolumeCountMap[key]; found && freeCount >= 2 {
				filteredLocs = append(filteredLocs, l)
			}
		}
		if len(filteredLocs) == 0 {
			return nil, fmt.Errorf("no healthy replicas (FreeVolumeCount >= 2) found for volume %d to use as source for EC encoding", vid)
		}
		filteredLocations[vid] = filteredLocs
	}

	// mark volumes as readonly
	ewg := NewErrorWaitGroup(maxParallelization)
	for _, vid := range volumeIds {
		for _, l := range locations[vid] {
			ewg.Add(func() error {
				if err := markVolumeReplicaWritable(commandEnv.option.GrpcDialOption, vid, l, false, false); err != nil {
					return fmt.Errorf("mark volume %d as readonly on %s: %v", vid, l.Url, err)
				}
				return nil
			})
		}
	}
	if err := ewg.Wait(); err != nil {
		return nil, err
	}

	// Sync replicas and select the best one for each volume (with highest file count)
	// This addresses data inconsistency risk in multi-replica volumes (issue #7797)
	// by syncing missing entries between replicas before encoding
	bestReplicas := make(map[needle.VolumeId]wdclient.Location)
	for _, vid := range volumeIds {
		collection := volumeIdToCollection[vid]

		// Sync missing entries between replicas, then select the best one
		bestLoc, selectErr := volume_replica.SyncAndSelectBestReplica(commandEnv.option.GrpcDialOption, vid, collection, filteredLocations[vid], "", writer)
		if selectErr != nil {
			return nil, fmt.Errorf("failed to sync and select replica for volume %d: %v", vid, selectErr)
		}
		bestReplicas[vid] = bestLoc
	}

	// Re-attempt the orphan sweep on the nodes skipped as unreachable, now that
	// any node that recovered during readonly-marking and replica sync answers
	// again. A node whose teardown now succeeds is clean (and the generation host
	// re-wipes its own disks regardless), so it leaves the skipped set and can be
	// a balance source/target — otherwise its shards would never distribute off
	// it. A node that is still down stays skipped and excluded, preserving the
	// leniency for a genuinely-down node; such a node also cannot be the
	// generation host below, since VolumeEcShardsGenerate would fail to read .dat.
	if err := resweepSkippedNodes(commandEnv, skippedNodes, volumeIds, volumeIdToCollection, maxParallelization); err != nil {
		return nil, err
	}

	// A selected generation host still in skippedNodes after the re-sweep was
	// transport-down when we tried to clean it, so its stale orphans were never
	// removed and EcBalance excludes it as both source and target. If it recovers
	// just in time for generation, all shards land on a node we can neither clean
	// nor balance off — a single point of failure that union-only verification
	// still accepts, after which the originals are deleted. Abort instead.
	for _, vid := range volumeIds {
		genHost := bestReplicas[vid].ServerAddress()
		if _, stillSkipped := skippedNodes[genHost]; stillSkipped {
			return nil, fmt.Errorf("generate ec shards for volume %d aborted: selected source %s is still skipped after the orphan re-sweep", vid, genHost)
		}
	}

	// generate ec shards using the best replica for each volume
	ewg.Reset()
	for _, vid := range volumeIds {
		target := bestReplicas[vid]
		collection := volumeIdToCollection[vid]
		ewg.Add(func() error {
			if err := generateEcShards(commandEnv.option.GrpcDialOption, vid, collection, target.ServerAddress()); err != nil {
				return fmt.Errorf("generate ec shards for volume %d on %s: %v", vid, target.Url, err)
			}
			return nil
		})
	}
	if err := ewg.Wait(); err != nil {
		return nil, err
	}

	// mount all ec shards for the converted volume
	shardIds := erasure_coding.AllShardIds()

	ewg.Reset()
	for _, vid := range volumeIds {
		target := bestReplicas[vid]
		collection := volumeIdToCollection[vid]
		ewg.Add(func() error {
			if err := mountEcShards(commandEnv.option.GrpcDialOption, collection, vid, target.ServerAddress(), shardIds); err != nil {
				return fmt.Errorf("mount ec shards for volume %d on %s: %v", vid, target.Url, err)
			}
			return nil
		})
	}
	if err := ewg.Wait(); err != nil {
		return nil, err
	}

	return skippedNodes, nil
}

// clearPreexistingEcShards removes EC shards and index files left over from a
// previous (failed or partial) encode of the given volume ids, on every node
// that still reports them, so a fresh encode regenerates from a clean slate.
// Scans all disk types. The normal .dat/.idx — the source of truth for this
// encode — is untouched; only orphaned EC artifacts are deleted.
//
// Returns the set of nodes skipped as unreachable. A skipped node may still hold
// an un-deleted orphan from a prior run; if it recovers it must be kept out of
// this encode's shard distribution, or the balance could install the new
// generation alongside the stale orphan and mix generations on one node.
func clearPreexistingEcShards(commandEnv *CommandEnv, topologyInfo *master_pb.TopologyInfo, volumeIds []needle.VolumeId, volumeIdToCollection map[needle.VolumeId]string, maxParallelization int) (skipped map[pb.ServerAddress]struct{}, err error) {
	wanted := make(map[uint32]bool, len(volumeIds))
	for _, vid := range volumeIds {
		wanted[uint32(vid)] = true
	}

	// Note which (node, vid) pairs the topology already reports EC shards for:
	// those are mounted leftovers and cleaning them is required (fatal on
	// error). Every other (node, vid) is swept best-effort to catch UNMOUNTED
	// orphans left by a failed copy — invisible to the heartbeat, so absent
	// here. A node that is down or holds nothing is a harmless no-op; a node
	// unreachable now also cannot receive this encode's new generation, so a
	// surviving orphan there keeps its old identity and the read guard rejects
	// it. Always delete the full shard-id range so a wider custom ratio's
	// leftovers are covered too.
	reportedKey := func(addr pb.ServerAddress, vid uint32) string {
		return string(addr) + "\x00" + strconv.Itoa(int(vid))
	}
	reported := make(map[string]struct{})
	var nodes []pb.ServerAddress
	eachDataNode(topologyInfo, func(dc DataCenterId, rack RackId, dn *master_pb.DataNodeInfo) {
		addr := pb.NewServerAddressFromDataNode(dn)
		nodes = append(nodes, addr)
		for _, diskInfo := range dn.DiskInfos {
			for _, ecInfo := range diskInfo.EcShardInfos {
				if wanted[ecInfo.Id] {
					reported[reportedKey(addr, ecInfo.Id)] = struct{}{}
				}
			}
		}
	})

	allShardIds := make([]erasure_coding.ShardId, erasure_coding.MaxShardCount)
	for i := range allShardIds {
		allShardIds[i] = erasure_coding.ShardId(i)
	}

	if len(reported) > 0 {
		fmt.Printf("clearing stale EC shards reported for %d (node,volume) pair(s) before regenerating...\n", len(reported))
|
|
}
|
|
// Nodes skipped as unreachable, accumulated across the concurrent sweep tasks.
|
|
skipped = make(map[pb.ServerAddress]struct{})
|
|
var skippedMu sync.Mutex
|
|
ewg := NewErrorWaitGroup(maxParallelization)
|
|
for _, addr := range nodes {
|
|
for _, vid := range volumeIds {
|
|
fatal := false
|
|
if _, ok := reported[reportedKey(addr, uint32(vid))]; ok {
|
|
fatal = true
|
|
}
|
|
collection := volumeIdToCollection[vid]
|
|
ewg.Add(func() error {
|
|
if err := unmountAndDeleteEcShardsQuiet(commandEnv.option.GrpcDialOption, collection, vid, addr, allShardIds); err != nil {
|
|
// Surface a reachable node whose delete genuinely failed (its orphan would
|
|
// be re-stamped by a later copy installing the new .vif). A missing
|
|
// full_teardown ack from a reachable pre-upgrade node is fatal too: it may
|
|
// still hold an orphan a later copy would re-stamp into the new generation.
|
|
// Stay best-effort only for a node that is truly unreachable: codes.Unavailable
|
|
// alone is ambiguous — a genuinely-down node and a reachable Rust volume
|
|
// server in maintenance mode both return it (a Go server returns Unknown for
|
|
// maintenance, already fatal above). Confirm with a non-maintenance-gated Ping
|
|
// before skipping; skip only when the Ping itself transport-failed (nodeDown).
|
|
// A reachable maintenance node (nodeUp) CAN receive this generation, and an
|
|
// inconclusive Ping (nodeLivenessUnknown, e.g. a pre-Ping server returning
|
|
// Unimplemented — which means the node is up) does not prove the node is down,
|
|
// so both stay fatal rather than silently leaving a stale EC generation.
|
|
if fatal || errors.Is(err, errFullTeardownNotAcked) || !isNodeUnreachable(err) ||
|
|
classifyNodeLiveness(pingVolumeServer(commandEnv.option.GrpcDialOption, addr)) != nodeDown {
|
|
return fmt.Errorf("clear stale ec shards for volume %d on %s: %w", vid, addr, err)
|
|
}
|
|
glog.V(1).Infof("orphan sweep: volume %d on %s skipped (unreachable): %v", vid, addr, err)
|
|
skippedMu.Lock()
|
|
skipped[addr] = struct{}{}
|
|
skippedMu.Unlock()
|
|
}
|
|
return nil
|
|
})
|
|
}
|
|
}
|
|
if err := ewg.Wait(); err != nil {
|
|
return nil, err
|
|
}
|
|
return skipped, nil
|
|
}

// resweepSkippedNodes re-attempts the orphan teardown on the nodes that the
// initial sweep skipped as unreachable, just before shard generation. A node
// that recovered in the meantime — and is therefore eligible to host this
// encode's generation — has its teardown retried; if it now fully succeeds it is
// removed from skipped so the rebalance can use it as a source and move its
// shards off, instead of stranding all shards on the single generation host and
// collapsing fault tolerance. A node still transport-down stays skipped (the
// same leniency the initial sweep grants), and a node that came back reachable
// but whose delete genuinely failed is fatal, exactly as in the initial sweep,
// so a stale generation is never silently left behind. Mutates skipped in place.
func resweepSkippedNodes(commandEnv *CommandEnv, skipped map[pb.ServerAddress]struct{}, volumeIds []needle.VolumeId, volumeIdToCollection map[needle.VolumeId]string, maxParallelization int) error {
	if len(skipped) == 0 {
		return nil
	}

	allShardIds := make([]erasure_coding.ShardId, erasure_coding.MaxShardCount)
	for i := range allShardIds {
		allShardIds[i] = erasure_coding.ShardId(i)
	}

	addrs := make([]pb.ServerAddress, 0, len(skipped))
	for addr := range skipped {
		addrs = append(addrs, addr)
	}

	fmt.Printf("re-checking %d node(s) skipped by the orphan sweep before generating shards...\n", len(addrs))

	// A node still down on every retried vid stays skipped; one that fully
	// succeeds is un-skipped. Track per-node whether any retry still failed
	// (down) so a node whose state is mixed across vids never gets un-skipped.
	stillDown := make(map[pb.ServerAddress]struct{})
	var mu sync.Mutex
	ewg := NewErrorWaitGroup(maxParallelization)
	for _, addr := range addrs {
		for _, vid := range volumeIds {
			collection := volumeIdToCollection[vid]
			ewg.Add(func() error {
				if err := unmountAndDeleteEcShardsQuiet(commandEnv.option.GrpcDialOption, collection, vid, addr, allShardIds); err != nil {
					// Same decision as the initial sweep: a reachable node whose delete
					// genuinely failed (or did not ack a full teardown, or whose liveness is
					// inconclusive) is fatal, since it could hold an orphan a later copy
					// re-stamps into this generation. Only a node still transport-down stays
					// skipped.
					if errors.Is(err, errFullTeardownNotAcked) || !isNodeUnreachable(err) ||
						classifyNodeLiveness(pingVolumeServer(commandEnv.option.GrpcDialOption, addr)) != nodeDown {
						return fmt.Errorf("re-clear stale ec shards for volume %d on %s: %w", vid, addr, err)
					}
					glog.V(1).Infof("orphan re-sweep: volume %d on %s still skipped (unreachable): %v", vid, addr, err)
					mu.Lock()
					stillDown[addr] = struct{}{}
					mu.Unlock()
				}
				return nil
			})
		}
	}
	if err := ewg.Wait(); err != nil {
		return err
	}

	for _, addr := range addrs {
		if _, down := stillDown[addr]; !down {
			delete(skipped, addr)
			glog.V(0).Infof("orphan re-sweep: node %s recovered and was cleaned; it will participate in the EC rebalance", addr)
		}
	}
	return nil
}

// isNodeUnreachable reports whether err means the volume server could not be
// reached at all, as opposed to an RPC that reached the node and failed. Only an
// unreachable node is safe to skip in the orphan sweep. A dead peer surfaces as
// a gRPC codes.Unavailable from the RPC (the dial is lazy, so it never fails at
// connect time); any non-status error reached node logic and is treated as
// reachable, so the sweep stays fatal rather than silently leaving stale state.
func isNodeUnreachable(err error) bool {
	if err == nil {
		return false
	}
	st, ok := status.FromError(err)
	return ok && st.Code() == codes.Unavailable
}

// nodeLiveness is the tri-state result of a pingVolumeServer probe.
type nodeLiveness int

const (
	// nodeUp: Ping succeeded — the node is reachable (e.g. a Rust volume server
	// in maintenance mode that fails the delete but answers Ping).
	nodeUp nodeLiveness = iota
	// nodeDown: Ping itself transport-failed with codes.Unavailable — the node is
	// confirmed unreachable. The only state the orphan sweep may skip.
	nodeDown
	// nodeLivenessUnknown: Ping reached failing logic with any non-Unavailable
	// code (Internal, ResourceExhausted, Unimplemented from a pre-Ping server, …)
	// or a non-status error. This does NOT prove the node is down, so it is fatal.
	nodeLivenessUnknown
)

// classifyNodeLiveness maps a pingVolumeServer error into the tri-state. A nil
// error is nodeUp, a transport codes.Unavailable is nodeDown (reusing the same
// rule as isNodeUnreachable), and every other Ping failure is nodeLivenessUnknown.
func classifyNodeLiveness(pingErr error) nodeLiveness {
	if pingErr == nil {
		return nodeUp
	}
	if isNodeUnreachable(pingErr) {
		return nodeDown
	}
	return nodeLivenessUnknown
}
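
// The skip-versus-fatal rule that both orphan sweeps apply to a failed teardown
// can be sketched on its own. This is a minimal, stdlib-only illustration and
// not the shell's code: sweepOutcome, the liveness type, and the boolean inputs
// are hypothetical stand-ins for the topology report, the errFullTeardownNotAcked
// check, isNodeUnreachable, and classifyNodeLiveness. Unlike the real condition,
// which only pings when the earlier checks pass, the sketch takes the ping
// result as a plain argument.

```go
package main

import "fmt"

// liveness mirrors the tri-state ping result; the names are local to this sketch.
type liveness int

const (
	up liveness = iota
	down
	unknown
)

// sweepOutcome returns "skip" only when every condition points at a node that
// is genuinely down; any other combination is "fatal".
//   reported:         the topology reported EC shards for this (node, volume)
//   teardownNotAcked: the node did not acknowledge a full teardown
//   rpcUnavailable:   the delete RPC failed with an Unavailable status
//   ping:             result of the follow-up liveness probe
func sweepOutcome(reported, teardownNotAcked, rpcUnavailable bool, ping liveness) string {
	if reported || teardownNotAcked || !rpcUnavailable || ping != down {
		return "fatal"
	}
	return "skip"
}

func main() {
	fmt.Println(sweepOutcome(false, false, true, down))    // skip
	fmt.Println(sweepOutcome(false, false, true, up))      // fatal
	fmt.Println(sweepOutcome(false, false, true, unknown)) // fatal
	fmt.Println(sweepOutcome(true, false, true, down))     // fatal
}
```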

func verifyEcShardsBeforeDelete(commandEnv *CommandEnv, volumeIds []needle.VolumeId, diskType types.DiskType) error {
	// Shard relocations from the preceding EC balance reach the master via
	// volume-server heartbeats, so freshly distributed shards may not all be
	// visible in the master topology immediately. Poll a few times before
	// concluding the shard set is incomplete, so a heartbeat-propagation lag is
	// not mistaken for missing data (which would abort the encode). Genuine loss
	// still fails after the retries are exhausted.
	const maxAttempts = 10
	const retryInterval = 2 * time.Second

	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		topoInfo, _, err := collectTopologyInfo(commandEnv, 0)
		if err != nil {
			return fmt.Errorf("fetch topology for shard verification: %w", err)
		}

		lastErr = nil
		for _, vid := range volumeIds {
			nodeShards, _ := collectEcNodeShardsInfo(topoInfo, vid, diskType)

			var union erasure_coding.ShardBits
			for _, info := range nodeShards {
				union = erasure_coding.ShardBits(uint32(union) | info.Bitmap())
			}

			totalShards := erasure_coding.TotalShardsCount
			if err := erasure_coding.RequireFullShardSet(uint32(vid), union, totalShards); err != nil {
				summary := make([]string, 0, len(nodeShards))
				for node, info := range nodeShards {
					summary = append(summary, fmt.Sprintf("%s=%s", node, info.String()))
				}
				sort.Strings(summary)
				lastErr = fmt.Errorf("volume %d: %w (observed: %v)", vid, err, summary)
				break
			}

			glog.V(0).Infof("EC shard verification ok for volume %d on diskType %q: %d/%d shards present across %d nodes",
				vid, diskType.ReadableString(), union.Count(), totalShards, len(nodeShards))
		}

		if lastErr == nil {
			return nil
		}
		if attempt < maxAttempts-1 {
			glog.V(0).Infof("EC shard verification incomplete (attempt %d/%d), waiting for shard locations to propagate: %v",
				attempt+1, maxAttempts, lastErr)
			time.Sleep(retryInterval)
		}
	}

	glog.Errorf("EC shard verification failed after %d attempts: %v", maxAttempts, lastErr)
	return lastErr
}
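
// The completeness check above ORs each node's shard bitmap into one union and
// requires every shard id to be present. The sketch below shows that idea with
// plain uint32 bitmaps. It is an illustration under assumptions, not the
// erasure_coding package: shardCoverage is a hypothetical helper, and the total
// of 14 shards used in main is an assumed ratio chosen only for the example.

```go
package main

import (
	"fmt"
	"math/bits"
)

// shardCoverage ORs the per-node bitmaps (bit i set means shard i is present)
// and returns how many distinct shards were seen plus the ids still missing
// from the range [0, total).
func shardCoverage(perNode []uint32, total int) (present int, missing []int) {
	var union uint32
	for _, b := range perNode {
		union |= b
	}
	for id := 0; id < total; id++ {
		if union&(1<<uint(id)) == 0 {
			missing = append(missing, id)
		}
	}
	return bits.OnesCount32(union), missing
}

func main() {
	// Node A holds shards 0-6, node B holds shards 8-13: shard 7 is absent.
	present, missing := shardCoverage([]uint32{0x007F, 0x3F00}, 14)
	fmt.Println(present, missing) // 13 [7]

	// Once node A also reports shard 7, the set is complete.
	present, missing = shardCoverage([]uint32{0x00FF, 0x3F00}, 14)
	fmt.Println(present, missing) // 14 []
}
```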

// doDeleteVolumesWithLocations deletes volumes using pre-collected location information.
// This avoids race conditions where master metadata is updated after EC encoding.
func doDeleteVolumesWithLocations(commandEnv *CommandEnv, volumeIds []needle.VolumeId, volumeLocationsMap map[needle.VolumeId][]wdclient.Location, maxParallelization int) error {
	if !commandEnv.isLocked() {
		return fmt.Errorf("lock is lost")
	}

	ewg := NewErrorWaitGroup(maxParallelization)
	for _, vid := range volumeIds {
		locations, found := volumeLocationsMap[vid]
		if !found {
			fmt.Printf("warning: no locations found for volume %d, skipping deletion\n", vid)
			continue
		}

		for _, l := range locations {
			ewg.Add(func() error {
				if err := deleteVolume(commandEnv.option.GrpcDialOption, vid, l.ServerAddress(), false, false); err != nil {
					return fmt.Errorf("deleteVolume %s volume %d: %v", l.Url, vid, err)
				}
				fmt.Printf("deleted volume %d from %s\n", vid, l.Url)
				return nil
			})
		}
	}
	if err := ewg.Wait(); err != nil {
		return err
	}

	return nil
}

func generateEcShards(grpcDialOption grpc.DialOption, volumeId needle.VolumeId, collection string, sourceVolumeServer pb.ServerAddress) error {

	fmt.Printf("generateEcShards %d (collection %q) on %s ...\n", volumeId, collection, sourceVolumeServer)

	err := operation.WithVolumeServerClient(false, sourceVolumeServer, grpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
		_, genErr := volumeServerClient.VolumeEcShardsGenerate(context.Background(), &volume_server_pb.VolumeEcShardsGenerateRequest{
			VolumeId:   uint32(volumeId),
			Collection: collection,
		})
		return genErr
	})

	return err

}

func collectVolumeIdsForEcEncode(commandEnv *CommandEnv, collectionPattern string, sourceDiskType *types.DiskType, fullPercentage float64, quietPeriod time.Duration, verbose bool) (vids []needle.VolumeId, matchedCollections []string, err error) {
	// compile regex pattern for collection matching
	collectionRegex, err := compileCollectionPattern(collectionPattern)
	if err != nil {
		return nil, nil, fmt.Errorf("invalid collection pattern '%s': %v", collectionPattern, err)
	}

	// collect topology information
	topologyInfo, volumeSizeLimitMb, err := collectTopologyInfo(commandEnv, 0)
	if err != nil {
		return
	}

	quietSeconds := int64(quietPeriod / time.Second)
	nowUnixSeconds := time.Now().Unix()

	fmt.Printf("collect volumes with collection pattern '%s', quiet for: %d seconds and %.1f%% full\n", collectionPattern, quietSeconds, fullPercentage)

	vids, matchedCollections = selectVolumeIdsFromTopology(topologyInfo, volumeSizeLimitMb, collectionRegex, sourceDiskType, quietSeconds, nowUnixSeconds, fullPercentage, verbose)
	return
}

func selectVolumeIdsFromTopology(topologyInfo *master_pb.TopologyInfo, volumeSizeLimitMb uint64, collectionRegex *regexp.Regexp, sourceDiskType *types.DiskType, quietSeconds int64, nowUnixSeconds int64, fullPercentage float64, verbose bool) (vids []needle.VolumeId, matchedCollections []string) {
	// Statistics for verbose mode
	var (
		totalVolumes    int
		remoteVolumes   int
		wrongCollection int
		wrongDiskType   int
		tooRecent       int
		tooSmall        int
		noFreeDisk      int
	)

	vidMap := make(map[uint32]bool)
	collectionSet := make(map[string]bool)
	eachDataNode(topologyInfo, func(dc DataCenterId, rack RackId, dn *master_pb.DataNodeInfo) {
		for _, diskInfo := range dn.DiskInfos {
			for _, v := range diskInfo.VolumeInfos {
				totalVolumes++

				// ignore remote volumes
				if v.RemoteStorageName != "" && v.RemoteStorageKey != "" {
					remoteVolumes++
					if verbose {
						fmt.Printf("skip volume %d on %s: remote volume (storage: %s, key: %s)\n",
							v.Id, dn.Id, v.RemoteStorageName, v.RemoteStorageKey)
					}
					continue
				}

				// check collection against regex pattern
				if !collectionRegex.MatchString(v.Collection) {
					wrongCollection++
					if verbose {
						fmt.Printf("skip volume %d on %s: collection doesn't match pattern (pattern: %s, actual: %s)\n",
							v.Id, dn.Id, collectionRegex.String(), v.Collection)
					}
					continue
				}

				// track matched collection
				collectionSet[v.Collection] = true

				// check disk type
				if sourceDiskType != nil && types.ToDiskType(v.DiskType) != *sourceDiskType {
					wrongDiskType++
					if verbose {
						fmt.Printf("skip volume %d on %s: wrong disk type (expected: %s, actual: %s)\n",
							v.Id, dn.Id, sourceDiskType.ReadableString(), types.ToDiskType(v.DiskType).ReadableString())
					}
					continue
				}

				// check quiet period
				if v.ModifiedAtSecond+quietSeconds >= nowUnixSeconds {
					tooRecent++
					if verbose {
						fmt.Printf("skip volume %d on %s: too recently modified (last modified: %d seconds ago, required: %d seconds)\n",
							v.Id, dn.Id, nowUnixSeconds-v.ModifiedAtSecond, quietSeconds)
					}
					continue
				}

				// check size
				sizeThreshold := fullPercentage / 100 * float64(volumeSizeLimitMb) * 1024 * 1024
				if float64(v.Size) <= sizeThreshold {
					tooSmall++
					if verbose {
						fmt.Printf("skip volume %d on %s: too small (size: %.1f MB, threshold: %.1f MB, %.1f%% full)\n",
							v.Id, dn.Id, float64(v.Size)/(1024*1024), sizeThreshold/(1024*1024),
							float64(v.Size)*100/(float64(volumeSizeLimitMb)*1024*1024))
					}
					continue
				}

				// check free disk space
				if diskInfo.FreeVolumeCount < 2 {
					glog.V(0).Infof("replica %s %d on %s has no free disk", v.Collection, v.Id, dn.Id)
					if verbose {
						fmt.Printf("skip replica of volume %d on %s: insufficient free disk space (free volumes: %d, required: 2)\n",
							v.Id, dn.Id, diskInfo.FreeVolumeCount)
					}
					if _, found := vidMap[v.Id]; !found {
						vidMap[v.Id] = false
					}
				} else {
					if verbose {
						fmt.Printf("selected volume %d on %s: size %.1f MB (%.1f%% full), last modified %d seconds ago, free volumes: %d\n",
							v.Id, dn.Id, float64(v.Size)/(1024*1024),
							float64(v.Size)*100/(float64(volumeSizeLimitMb)*1024*1024),
							nowUnixSeconds-v.ModifiedAtSecond, diskInfo.FreeVolumeCount)
					}
					vidMap[v.Id] = true
				}
			}
		}
	})

	for vid, good := range vidMap {
		if good {
			vids = append(vids, needle.VolumeId(vid))
		} else {
			noFreeDisk++
		}
	}

	// Convert collection set to slice
	for collection := range collectionSet {
		matchedCollections = append(matchedCollections, collection)
	}
	sort.Strings(matchedCollections)

	// Print summary statistics in verbose mode or when no volumes selected
	if verbose || len(vids) == 0 {
		fmt.Printf("\nVolume selection summary:\n")
		fmt.Printf("  Total volumes examined: %d\n", totalVolumes)
		fmt.Printf("  Selected for encoding: %d\n", len(vids))
		fmt.Printf("  Collections matched: %v\n", matchedCollections)

		if totalVolumes > 0 {
			fmt.Printf("\nReasons for exclusion:\n")
			if remoteVolumes > 0 {
				fmt.Printf("  Remote volumes: %d\n", remoteVolumes)
			}
			if wrongCollection > 0 {
				fmt.Printf("  Collection doesn't match pattern: %d\n", wrongCollection)
			}
			if wrongDiskType > 0 {
				fmt.Printf("  Wrong disk type: %d\n", wrongDiskType)
			}
			if tooRecent > 0 {
				fmt.Printf("  Too recently modified: %d\n", tooRecent)
			}
			if tooSmall > 0 {
				fmt.Printf("  Too small (< %.1f%% full): %d\n", fullPercentage, tooSmall)
			}
			if noFreeDisk > 0 {
				fmt.Printf("  Insufficient free disk space: %d\n", noFreeDisk)
			}
		}
		fmt.Println()
	}

	return
}
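
// The quiet-period and size checks in selectVolumeIdsFromTopology reduce to two
// comparisons, which the sketch below isolates so the boundary behaviour is easy
// to see. It is a minimal stdlib-only illustration: sizeThresholdBytes,
// isQuietEnough, and isFullEnough are hypothetical helper names, and the numbers
// in main are made up for the example.

```go
package main

import "fmt"

// sizeThresholdBytes converts a fullness percentage of the volume size limit
// (given in MB) into a byte threshold, using the same arithmetic as the
// selection loop.
func sizeThresholdBytes(fullPercentage float64, volumeSizeLimitMb uint64) float64 {
	return fullPercentage / 100 * float64(volumeSizeLimitMb) * 1024 * 1024
}

// isQuietEnough is the negation of the "too recently modified" skip: a volume
// modified exactly quietSeconds ago is still treated as too recent.
func isQuietEnough(modifiedAtSecond, quietSeconds, nowUnixSeconds int64) bool {
	return modifiedAtSecond+quietSeconds < nowUnixSeconds
}

// isFullEnough is the negation of the "too small" skip: the size must be
// strictly greater than the threshold.
func isFullEnough(size uint64, fullPercentage float64, volumeSizeLimitMb uint64) bool {
	return float64(size) > sizeThresholdBytes(fullPercentage, volumeSizeLimitMb)
}

func main() {
	// 50% of a 1024 MB limit is 512 MiB.
	fmt.Println(sizeThresholdBytes(50, 1024)) // 5.36870912e+08

	// Exactly at the threshold is not enough; one byte more is.
	fmt.Println(isFullEnough(536870912, 50, 1024)) // false
	fmt.Println(isFullEnough(536870913, 50, 1024)) // true

	// Modified exactly one hour ago with a one-hour quiet period: still too recent.
	fmt.Println(isQuietEnough(1000, 3600, 4600)) // false
	fmt.Println(isQuietEnough(1000, 3600, 4601)) // true
}
```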