mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-06-13 23:36:45 +03:00
0566fbd552
* Add shared super_block.ResolveReplicaPlacement; use it in ec_balance
* Add ecbalancer.FromActiveTopology snapshot constructor for EC encode/repair
* Add ecbalancer.Place greenfield/repair placement core (strict + durability-first)
* topology: add GetEffectiveAvailableEcShardSlots; FromActiveTopology uses shard-granular free slots
GetDisksWithEffectiveCapacity flattens reserved shard slots into volume slots via
integer truncation, so an in-flight EC task reserving a non-multiple-of-
DataShardsCount number of shards was lost from the snapshot and freeSlots was
over-reported. GetEffectiveAvailableEcShardSlots subtracts the full reservation
impact at shard granularity.
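The truncation described above can be reproduced with plain integer arithmetic. This is a minimal sketch, not the topology package's code: the function names are hypothetical, and it assumes one volume slot holds `DataShardsCount` (10) shard slots.

```go
package main

import "fmt"

// shardsPerVolumeSlot is an assumed conversion factor: one volume slot is
// treated as room for 10 EC shards (the standard data-shard count).
const shardsPerVolumeSlot = 10

// freeShardSlotsTruncated converts the reservation to whole volume slots
// first, so any remainder below shardsPerVolumeSlot is silently dropped.
func freeShardSlotsTruncated(freeVolumeSlots, reservedShards int) int {
	reservedVolumeSlots := reservedShards / shardsPerVolumeSlot // 7/10 == 0
	return (freeVolumeSlots - reservedVolumeSlots) * shardsPerVolumeSlot
}

// freeShardSlotsExact subtracts the reservation at shard granularity and
// clamps the result at zero.
func freeShardSlotsExact(freeVolumeSlots, reservedShards int) int {
	free := freeVolumeSlots*shardsPerVolumeSlot - reservedShards
	if free < 0 {
		return 0
	}
	return free
}

func main() {
	// Two free volume slots with an in-flight task reserving 7 shards.
	fmt.Println(freeShardSlotsTruncated(2, 7)) // 20: the reservation vanished
	fmt.Println(freeShardSlotsExact(2, 7))     // 13: the reservation is charged
}
```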
* ecbalancer.Place: reject nodes without a free disk of the requested type
FromActiveTopology keeps all disk types in the snapshot, so an SSD-only request
could be routed to a node with only HDD capacity (pickBestDiskOnNode then returns
disk 0 on the wrong tier). Filter rack/node selection to those with a free disk
of the requested type.
* ecbalancer.Place: enforce ReplicaPlacement DiffDataCenterCount (per-DC shard cap)
* ecbalancer: enforce DiffDataCenterCount in balance (cross-DC phase + cross-rack DC cap)
Adds a cross-DC corrective phase that drains data centers holding more than
DiffDataCenterCount shards of a volume, and a per-DC cap on cross-rack move
targets. Both are no-ops when DiffDataCenterCount is unset, so balance output is
unchanged for non-DC placements.
* topology: ratio-aware EC shard slots and provisional empty-disk slot
GetEffectiveAvailableEcShardSlots now takes the target collection's data-shard
count, so a 4+2 volume's larger shards are not over-counted at 10 per volume slot;
and it keeps the one provisional slot for freshly started empty servers that
report max=0, matching getEffectiveAvailableCapacityUnsafe. FromActiveTopology
threads the ratio through.
* ecbalancer.Place: explicit disk-type filter signal (fix HDD vs any ambiguity)
HardDriveType normalizes to "", which collided with "" meaning any disk. Add
Constraints.FilterDiskType and normalize both sides so an hdd request matches disks
reported as "" and never leaks to SSD, while filter=false still means any.
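The ambiguity and its fix can be sketched as follows. Only the `FilterDiskType` name comes from the message above; the `constraints` struct, `normalizeDiskType`, and `diskMatches` are hypothetical stand-ins, and the normalization rule (treating "hdd" and "" as the same type) is an assumption drawn from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// constraints is a hypothetical stand-in for the placement constraints.
type constraints struct {
	FilterDiskType bool   // explicit signal: false means "any disk type"
	DiskType       string // only consulted when FilterDiskType is true
}

// normalizeDiskType maps "hdd" (any case) to "", the assumed canonical form
// of the hard-drive type; other types are lower-cased.
func normalizeDiskType(diskType string) string {
	d := strings.ToLower(strings.TrimSpace(diskType))
	if d == "hdd" {
		return ""
	}
	return d
}

// diskMatches reports whether a disk of the given type satisfies c.
func diskMatches(c constraints, diskType string) bool {
	if !c.FilterDiskType {
		return true
	}
	return normalizeDiskType(c.DiskType) == normalizeDiskType(diskType)
}

func main() {
	hddOnly := constraints{FilterDiskType: true, DiskType: "hdd"}
	fmt.Println(diskMatches(hddOnly, ""))         // true: "" is an hdd disk
	fmt.Println(diskMatches(hddOnly, "ssd"))      // false: never leaks to SSD
	fmt.Println(diskMatches(constraints{}, "ssd")) // true: no filter means any
}
```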
* ecbalancer: add clearShardAccounting for repair snapshot reconciliation
Clears one disk's copy of a shard from per-domain accounting and recomputes the
node-level union (preserving a kept copy on another disk of the same node), without
crediting capacity. Repair uses it to drop to-be-deleted copies before placing
missing shards.
* ecbalancer: don't cap cross-DC target racks when DiffRackCount is unset
len(racks)+1 wrongly limited each target rack (3 in a 2-rack cluster), so draining
a DC could stop short of the DiffDataCenterCount cap. Use MaxShardCount+1 as the
effectively-unlimited default.
* topology/ecbalancer: ratio-correct EC capacity accounting
Reservation shard slots (default ShardsPerVolumeSlot units) are now converted to
the target ratio before subtracting, and existing EC shards are charged by size
(targetDataShards/shardDataShards) so a 2+1 shard isn't counted as one 10+4 slot.
Per-shard ratio lookup is behind shardDataShards (OSS uses the standard ratio).
* ecbalancer.Place: candidate tiering and eligible-rack caps
Adds a per-disk eligibility/preference abstraction so Place supports:
- preferred-tag whole-plan retry (try disks carrying the earliest tags first,
widen to all only if a tier cannot place every shard; reports
SpilledOutsidePreferredTags),
- soft disk-type spill via DiskTypePolicy (Any/Prefer/Require): Prefer fills the
preferred type then spills, reporting SpilledToOtherDiskType; Require filters,
- even per-rack caps that divide by racks holding an eligible disk, so a tiered
cluster (e.g. SSDs in 2 of 4 racks) isn't capped impossibly low.
Disk tags carried via Node.AddDiskTags + FromActiveTopology.
* ecbalancer: export ClearShardAccounting for repair snapshot reconciliation
* ecbalancer: address review feedback (ratio rounding, bitmap walk, same-DC moves)
- topology/ecbalancer: round shard-reservation and existing-shard footprint up
when converting to target-ratio shard slots, so a sub-slot reservation is not
truncated to zero and free capacity is not overstated for low-data-shard
layouts (targetDataShards < ds).
- erasure_coding: add ShardBits.All iterator and use it across the balancer,
cross-DC phase, and placement scoring instead of scanning 0..MaxShardCount and
probing Has on every id.
- ecbalancer: allow same-DC cross-rack moves when a DC already sits at its
DiffDataCenterCount cap; a same-DC move leaves the DC total unchanged. Add a
regression test that fails without the guard.
- ecbalancer cross-DC phase: pick targets via the eligible-aware
pickNodeInRackEligible/pickBestDiskEligible helpers so the disk-type filter is
honored and a 0 disk id is not mistaken for a valid selection.
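The bitmap walk mentioned in the list above can be illustrated with `math/bits`. This is a sketch of the general technique (visit only the set bits), not the actual `ShardBits.All` iterator, whose signature is not shown here; `setShardIDs` is a hypothetical name.

```go
package main

import (
	"fmt"
	"math/bits"
)

// setShardIDs returns the ids of the set bits in ascending order. It visits
// only set bits instead of probing every id from 0 to the maximum.
func setShardIDs(shardBits uint32) []int {
	ids := make([]int, 0, bits.OnesCount32(shardBits))
	for b := shardBits; b != 0; b &= b - 1 { // b &= b-1 clears the lowest set bit
		ids = append(ids, bits.TrailingZeros32(b))
	}
	return ids
}

func main() {
	fmt.Println(setShardIDs(1<<0 | 1<<2 | 1<<5)) // [0 2 5]
}
```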
* ecbalancer: test ecShardSlotsOnDisk fractional round-up
Cover the mixed-ratio path (targetDataShards < existing data shards) so a
shard's fractional footprint is never floored to zero and free capacity is not
overstated. Exercises the round-up via the targetDataShards parameter; OSS uses
the standard ratio at runtime while the enterprise build hits it with real
per-volume ratios.
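A worked sketch of the round-up, under stated assumptions: a shard from a layout with `shardDataShards` data shards occupies 1/`shardDataShards` of a volume, and a target slot is 1/`targetDataShards` of a volume. The function names are hypothetical, and whether the real code rounds per shard or per group is not stated in the message.

```go
package main

import "fmt"

// ceilDiv returns ceil(a/b) for positive integers.
func ceilDiv(a, b int) int { return (a + b - 1) / b }

// slotsRoundedUp converts count shards of one layout into target-ratio shard
// slots, rounding the fractional footprint up.
func slotsRoundedUp(count, shardDataShards, targetDataShards int) int {
	return ceilDiv(count*targetDataShards, shardDataShards)
}

// slotsFloored is the same conversion with truncation, shown for contrast.
func slotsFloored(count, shardDataShards, targetDataShards int) int {
	return count * targetDataShards / shardDataShards
}

func main() {
	// One 10+4 shard measured in 2+1 slots: 2/10 of a slot.
	fmt.Println(slotsFloored(1, 10, 2))   // 0: the shard would cost nothing
	fmt.Println(slotsRoundedUp(1, 10, 2)) // 1: charged as a full slot
	// One 2+1 shard measured in 10+4 slots: 10/2 = 5 slots.
	fmt.Println(slotsRoundedUp(1, 2, 10)) // 5
}
```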
* ecbalancer: assert node B rack in TestFromActiveTopology
* ecbalancer: split Destination into separate DataCenter and bare Rack
Replace the composite "dc:rack" Rack field on Destination with separate
DataCenter and bare Rack values, matching topology.DiskInfo and the worker-task
convention. Callers (and tests) read the data center directly instead of parsing
the composite with strings.SplitN.
* shell ec.balance: use utilization-based global balancing (parity with worker)
The shell's global rebalance phase balanced by raw shard count; switch it to
fractional fullness (shards/capacity), as the worker already does. On uniform
capacity the two agree; on heterogeneous capacity it fills nodes proportionally
instead of driving small-capacity nodes toward full.
Updates the heterogeneous-capacity regression test to assert even fullness
(~equal shards/capacity per node) rather than even shard count.
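The difference between the two balancing criteria shows up as soon as capacities differ. This is an illustrative sketch only: the `node` type and both selection helpers are hypothetical, and it assumes a non-empty node list with positive capacities.

```go
package main

import "fmt"

// node is a hypothetical view of a server: shards held and total shard capacity.
type node struct {
	name     string
	shards   int
	capacity int
}

// fullness is the fractional utilization shards/capacity.
func fullness(n node) float64 { return float64(n.shards) / float64(n.capacity) }

// leastByCount picks the node with the fewest shards (the old criterion).
func leastByCount(nodes []node) node {
	best := nodes[0]
	for _, n := range nodes[1:] {
		if n.shards < best.shards {
			best = n
		}
	}
	return best
}

// leastByFullness picks the node with the lowest shards/capacity ratio.
func leastByFullness(nodes []node) node {
	best := nodes[0]
	for _, n := range nodes[1:] {
		if fullness(n) < fullness(best) {
			best = n
		}
	}
	return best
}

func main() {
	nodes := []node{
		{name: "big", shards: 20, capacity: 100}, // 20% full
		{name: "small", shards: 5, capacity: 10}, // 50% full
	}
	fmt.Println(leastByCount(nodes).name)    // small: pushed further toward full
	fmt.Println(leastByFullness(nodes).name) // big: filled proportionally
}
```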
* ecbalancer: bounded-proportional per-DC shard spread
DiffDataCenterCount was enforced only as a ceiling (drain-to-cap), which could
leave a within-cap-but-lopsided DC distribution under a loose cap (e.g. 10/4 of 14
with cap=10). Now the cross-DC phase, the cross-rack DC guard, and Place all target
boundedMaxPerDC = min(DiffDataCenterCount, max(ceil(total/numDCs), parityShards)):
shards spread proportionally across DCs, but no tighter than the durability floor
(once each DC holds <= parityShards a DC loss is recoverable, so further spreading
only adds cross-DC/WAN traffic). No-op when DiffDataCenterCount is 0; identical to
before when the cap is the binding constraint.
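The formula above can be checked with a few inputs. This sketch only evaluates the stated expression; the function signature is hypothetical, and returning `total` for the unset case is an assumption standing in for "no cap". Note that the next entry in this log removes the logic again.

```go
package main

import "fmt"

// boundedMaxPerDC evaluates
// min(DiffDataCenterCount, max(ceil(total/numDCs), parityShards)).
// A zero cap (or zero DCs) is treated as "no cap" and returns total.
func boundedMaxPerDC(diffDataCenterCount, total, numDCs, parityShards int) int {
	if diffDataCenterCount == 0 || numDCs == 0 {
		return total
	}
	spread := (total + numDCs - 1) / numDCs // ceil(total/numDCs)
	if spread < parityShards {
		spread = parityShards // durability floor: no tighter than parityShards
	}
	if diffDataCenterCount < spread {
		return diffDataCenterCount // the configured cap is the binding constraint
	}
	return spread
}

func main() {
	fmt.Println(boundedMaxPerDC(10, 14, 2, 4)) // 7: a 10/4 split is rebalanced to 7/7
	fmt.Println(boundedMaxPerDC(10, 14, 5, 4)) // 4: ceil(14/5)=3 is raised to the floor
	fmt.Println(boundedMaxPerDC(3, 14, 2, 4))  // 3: the cap binds
	fmt.Println(boundedMaxPerDC(0, 14, 2, 4))  // 14: unset, no cap
}
```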
* ecbalancer: drop DiffDataCenterCount enforcement for EC placement
The 1-byte volume ReplicaPlacement packs xyz into x*100+y*10+z<=255, so the DC
digit can only be 0-2 -- far too small to be a meaningful per-DC EC shard cap (a
cap of 1-2 would demand 7-14 DCs for a 10+4 volume). It's volume replica-placement,
not an EC spec. Removes the cross-DC balance phase, the DC guard in the cross-rack
phase, and the per-DC cap in Place (and the just-added bounded-proportional logic);
EC relies on the RP-independent rack/node even spread instead. Rack/node caps
(DiffRackCount/SameRackCount) are unchanged. Per-domain EC caps are left for a real
EC placement spec.
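The arithmetic behind this decision can be verified directly. The helpers below are hypothetical and only restate the numbers in the message: the packed value `x*100+y*10+z` must fit in one byte, and a 10+4 volume has 14 shards.

```go
package main

import "fmt"

// fitsInByte reports whether the packed xyz value fits in a single byte.
func fitsInByte(dc, rack, node int) bool {
	return dc*100+rack*10+node <= 255
}

// maxDCDigit returns the largest DC digit that can be packed at all
// (with the rack and node digits at zero).
func maxDCDigit() int {
	largest := 0
	for dc := 0; dc <= 9; dc++ {
		if fitsInByte(dc, 0, 0) {
			largest = dc
		}
	}
	return largest
}

// dcsNeeded returns how many data centers a per-DC shard cap would demand.
func dcsNeeded(totalShards, perDCCap int) int {
	return (totalShards + perDCCap - 1) / perDCCap
}

func main() {
	fmt.Println(maxDCDigit())     // 2
	fmt.Println(dcsNeeded(14, 1)) // 14 DCs for a cap of 1
	fmt.Println(dcsNeeded(14, 2)) // 7 DCs for a cap of 2
}
```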
* ecbalancer: enforce per-disk durability cap; symmetric reserve/release
Place now refuses to put more than parityShards shards of a volume on a single
disk (pickBestDiskEligible skips a disk once it holds parityShards of the volume,
a hard cap not relaxed even in durability-first). Previously Place assigned by
free capacity, so a skewed near-full cluster could pile >parityShards onto one
disk -> losing it loses the volume; only distinct-disk count was checked. This
covers encode and repair (both route through Place); the caller skips/leaves the
volume rather than minting an unrecoverable layout.
Also makes reserveShard decrement freeSlots unconditionally, symmetric with
releaseShard's unconditional increment (the old guarded decrement could credit a
phantom slot on release if a shard were ever reserved onto a full disk).
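The phantom-slot problem from the second paragraph can be shown with a toy counter. The `disk` type and function names here are hypothetical; the sketch only contrasts a guarded decrement with an unconditional one against the same unconditional release.

```go
package main

import "fmt"

// disk is a hypothetical capacity counter for one disk.
type disk struct{ freeSlots int }

// reserveGuarded is the old behavior: skip the decrement on a full disk.
func reserveGuarded(d *disk) {
	if d.freeSlots > 0 {
		d.freeSlots--
	}
}

// reserve is the symmetric behavior: always decrement.
func reserve(d *disk) { d.freeSlots-- }

// release always increments, in both variants.
func release(d *disk) { d.freeSlots++ }

func main() {
	old := &disk{freeSlots: 0}
	reserveGuarded(old)
	release(old)
	fmt.Println(old.freeSlots) // 1: a slot that never existed

	fixed := &disk{freeSlots: 0}
	reserve(fixed)
	release(fixed)
	fmt.Println(fixed.freeSlots) // 0: reserve and release cancel out
}
```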
* ecbalancer: add Topology.ReleaseVolumeShards (clear + credit) for greenfield encode
Releases all of a volume's shards from the snapshot and credits the freed disk
capacity, so a greenfield encode can plan as if stale EC shards from a prior failed
attempt are gone. Safe to credit because the encode task deletes stale shards
(cleanupStaleEcShards) before distributing the new ones. Distinct from
ClearShardAccounting (repair), which does not credit.
* ecbalancer: ReleaseVolumeShards credits node freeSlots, not just disks
releaseShard only increments per-disk freeSlots, but rack capacity is summed from
node freeSlots (buildRacks) and node freeSlots gates node eligibility. Crediting
only disks left a node/rack looking full after releasing stale shards, so a
greenfield encode still couldn't use the freed capacity. Now credits the node by
the total disk-slots freed.
* ecbalancer: correct PlacementMode docs (encode uses durability-first)
PlaceStrict was labeled '(encode)' but encode uses PlaceDurabilityFirst. Clarify
that durability-first is used by both encode and repair, reports relaxations in
PlaceResult.Relaxed, and never relaxes the per-disk durability cap.
* ecbalancer: treat SameRackCount as a direct per-node shard cap
The 3rd ReplicaPlacement digit now caps shards per node at exactly the digit
value, matching how DiffRackCount (2nd digit) caps per rack, instead of allowing
digit+1 per node. This makes the per-rack and per-node caps consistent and
matches the documented "digits cap EC shards per rack and per node" semantics;
e.g. 011 now means at most one shard per rack and one per node.
* EC encode: place shards via ecbalancer.Place + configurable replica placement
Encode now plans destinations through the shared ecbalancer.Place policy
(durability-first: prefers the source disk type and honors replica placement /
caps / anti-affinity, relaxing rather than failing when capacity is tight) instead
of the EC-only placement planner. Targets and capacity reservations use Place's
actual per-disk shard assignment, not a round-robin guess; cross-volume in-cycle
capacity is tracked by ActiveTopology's pending task, so the cached planner is no
longer consulted. Adds a configurable replica_placement (proto field 6 + worker
form + reader) that overrides the master default replication.
The placement-package planner code is left in place (now unused) and removed in a
follow-up that drops the package.
* EC encode: drop unused dataShards param from createECTargets
Addresses review feedback: after switching to Place's per-disk shardsPerPlan
assignment, createECTargets no longer needs the data-shard count.
* EC encode: fix packed-target validation, greenfield stale-shard accounting, RP docs
- Validate counts distinct shard ids across targets, not target rows, so packed
plans (fewer (node,disk) targets than shards) aren't rejected.
- planECDestinations releases the volume's stale EC shards from the snapshot before
Place (ReleaseVolumeShards), crediting their capacity. The encode task deletes
stale shards before distributing, so a retry on tight capacity no longer fails
planning by counting shards that are about to be removed.
- replica_placement config/form help no longer claims a data-center limit (the DC
digit is ignored for EC); detection logs a warning when a DC digit is set.
* EC encode: surface relaxed placement; mark replica_placement best-effort
Encode places with PlaceDurabilityFirst (the chosen lenient behavior), which can
relax caps/anti-affinity/replica-placement to avoid deferring. That was silent
(only disk-type/tag spills were logged). Now logs PlaceResult.Relaxed so a tight
replica placement isn't weakened unnoticed, and the config/form help states the
rack/node caps are best-effort during encode (enforced by rebalancing).
* EC encode: key per-disk shard grouping by struct, not formatted string
planECDestinations grouped destinations using a fmt.Sprintf("%s:%d") map key
per shard; use a {node,diskID} struct key and pre-size the map/slice to the
shard count to drop the per-shard string allocation.
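Grouping by a comparable struct key looks roughly like this. It is a sketch of the technique, not `planECDestinations` itself: the `destination` and `diskKey` types and the function name are hypothetical, and the slice index is assumed to be the shard id.

```go
package main

import "fmt"

// destination is a hypothetical per-shard placement result.
type destination struct {
	node   string
	diskID uint32
}

// diskKey is a comparable struct, usable as a map key without formatting a
// string for every shard.
type diskKey struct {
	node   string
	diskID uint32
}

// groupShardsByDisk groups shard ids (the slice index) by (node, disk) and
// returns the keys in first-seen order. Map and slice are pre-sized to the
// shard count, the upper bound on distinct disks.
func groupShardsByDisk(dests []destination) (map[diskKey][]uint32, []diskKey) {
	grouped := make(map[diskKey][]uint32, len(dests))
	order := make([]diskKey, 0, len(dests))
	for shardID, d := range dests {
		key := diskKey{node: d.node, diskID: d.diskID}
		if _, seen := grouped[key]; !seen {
			order = append(order, key)
		}
		grouped[key] = append(grouped[key], uint32(shardID))
	}
	return grouped, order
}

func main() {
	grouped, order := groupShardsByDisk([]destination{
		{node: "a", diskID: 0}, {node: "a", diskID: 1}, {node: "a", diskID: 0},
	})
	for _, key := range order {
		fmt.Println(key.node, key.diskID, grouped[key])
	}
}
```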
551 lines
20 KiB
Go
package erasure_coding

import (
	"context"
	"fmt"
	"testing"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/admin/topology"
	"github.com/seaweedfs/seaweedfs/weed/pb/master_pb"
	"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding"
	"github.com/seaweedfs/seaweedfs/weed/worker/types"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestECPlacementPlannerApplyReservations(t *testing.T) {
	activeTopology := buildActiveTopology(t, 1, []string{"hdd"}, 10, 0)

	planner := newECPlacementPlanner(activeTopology, nil)
	require.NotNil(t, planner)

	key := ecDiskKey("10.0.0.1:8080", 0)
	candidate, ok := planner.candidateByKey[key]
	require.True(t, ok)
	assert.Equal(t, 10, candidate.FreeSlots)
	assert.Equal(t, 0, candidate.ShardCount)
	assert.Equal(t, 0, candidate.LoadCount)

	shardImpact := topology.CalculateECShardStorageImpact(1, 1)
	destinations := make([]topology.TaskDestinationSpec, 10)
	for i := 0; i < 10; i++ {
		destinations[i] = topology.TaskDestinationSpec{
			ServerID:      "10.0.0.1:8080",
			DiskID:        0,
			StorageImpact: &shardImpact,
		}
	}

	planner.applyTaskReservations(1024, nil, destinations)

	candidate = planner.candidateByKey[key]
	assert.Equal(t, 9, candidate.FreeSlots, "10 shard slots should reduce available volume slots by 1")
	assert.Equal(t, 10, candidate.ShardCount)
	assert.Equal(t, 1, candidate.LoadCount, "load should only be incremented once per disk")
}

func TestPlanECDestinationsUsesPlanner(t *testing.T) {
	activeTopology := buildActiveTopology(t, 7, []string{"hdd", "ssd"}, 100, 0)

	metric := &types.VolumeHealthMetrics{
		VolumeID:   1,
		Server:     "10.0.0.1:8080",
		Size:       100 * 1024 * 1024,
		Collection: "",
	}

	plan, shardsPerPlan, err := planECDestinations(activeTopology, metric, NewDefaultConfig(), nil, erasure_coding.DataShardsCount, erasure_coding.ParityShardsCount)
	require.NoError(t, err)
	require.NotNil(t, plan)
	requireAllShardsPlaced(t, plan, shardsPerPlan)
}

func TestECPlacementPlannerPrefersTaggedDisks(t *testing.T) {
	activeTopology := buildActiveTopology(t, 3, []string{"hdd"}, 10, 0)
	topo := activeTopology.GetTopologyInfo()
	for _, dc := range topo.DataCenterInfos {
		for _, rack := range dc.RackInfos {
			for k, node := range rack.DataNodeInfos {
				for diskType := range node.DiskInfos {
					if k < 2 {
						node.DiskInfos[diskType].Tags = []string{"fast"}
					} else {
						node.DiskInfos[diskType].Tags = []string{"slow"}
					}
				}
			}
		}
	}
	require.NoError(t, activeTopology.UpdateTopology(topo))

	planner := newECPlacementPlanner(activeTopology, []string{"fast"})
	require.NotNil(t, planner)

	selected, err := planner.selectDestinations("", "", "", 2)
	require.NoError(t, err)
	require.Len(t, selected, 2)

	for _, candidate := range selected {
		key := ecDiskKey(candidate.NodeID, candidate.DiskID)
		assert.True(t, diskHasTag(planner.diskTags[key], "fast"))
	}
}

func TestECPlacementPlannerFallsBackWhenTagsInsufficient(t *testing.T) {
	activeTopology := buildActiveTopology(t, 3, []string{"hdd"}, 10, 0)
	topo := activeTopology.GetTopologyInfo()
	for _, dc := range topo.DataCenterInfos {
		for _, rack := range dc.RackInfos {
			for i, node := range rack.DataNodeInfos {
				for diskType := range node.DiskInfos {
					if i == 0 {
						node.DiskInfos[diskType].Tags = []string{"fast"}
					}
				}
			}
		}
	}
	require.NoError(t, activeTopology.UpdateTopology(topo))

	planner := newECPlacementPlanner(activeTopology, []string{"fast"})
	require.NotNil(t, planner)

	selected, err := planner.selectDestinations("", "", "", 3)
	require.NoError(t, err)
	require.Len(t, selected, 3)

	taggedCount := 0
	for _, candidate := range selected {
		key := ecDiskKey(candidate.NodeID, candidate.DiskID)
		if diskHasTag(planner.diskTags[key], "fast") {
			taggedCount++
		}
	}
	assert.Less(t, taggedCount, len(selected))
}

// TestDetectionSkipsWhenECShardsAlreadyExist guards against issue #9448: a
// regular replica that survived a previous successful EC encode (source
// delete didn't clean it up for some reason) gets re-proposed for encoding,
// the new encode collides with the already-mounted shards on the targets
// ("ec volume %d is mounted; refusing overwrite"), and detection loops
// forever on the same volume. Detection must see the existing shards and
// skip the volume so an admin can clean it up out-of-band.
//
// The guard fires ONLY when the EC shard set is complete (count >=
// totalShards), so a partially-distributed previous attempt still falls
// through to the existing recovery branch in the encode path.
func TestDetectionSkipsWhenECShardsAlreadyExist(t *testing.T) {
	const volumeID uint32 = 42
	activeTopology := buildStuckSourceTopology(t, volumeID, erasure_coding.TotalShardsCount)

	clusterInfo := &types.ClusterInfo{ActiveTopology: activeTopology}
	metrics := buildStuckSourceMetrics(volumeID, "127.0.0.1:8080")

	results, hasMore, err := Detection(context.Background(), metrics, clusterInfo, NewDefaultConfig(), 0)
	require.NoError(t, err)
	require.False(t, hasMore)
	require.Empty(t, results, "stuck source replica with all EC shards present must not yield a new encoding proposal")
}

// TestDetectionAllowsRegularReplicaWhenShardsPartial covers the partial-EC
// branch of the #9448 guard: when fewer than totalShards exist, the volume
// is allowed to flow through to the normal encoding path so the existing
// recovery branch (the `existingECShards` block in the encode arm) can fold
// the partial shards into the new task. A bug here would either (a) skip
// the volume entirely or (b) emit a proposal that later collides on the
// mounted shards.
func TestDetectionAllowsRegularReplicaWhenShardsPartial(t *testing.T) {
	const volumeID uint32 = 43
	activeTopology := buildStuckSourceTopology(t, volumeID, erasure_coding.DataShardsCount-1)

	clusterInfo := &types.ClusterInfo{ActiveTopology: activeTopology}
	metrics := buildStuckSourceMetrics(volumeID, "127.0.0.1:8080")

	results, _, err := Detection(context.Background(), metrics, clusterInfo, NewDefaultConfig(), 0)
	require.NoError(t, err)
	// Partial shards are not a "stuck source": the encode arm must keep
	// its chance to either propose a fresh task that folds the partial
	// shards into cleanup, or fail planning on the constrained topology.
	// We don't require len(results) > 0 because the constrained topology
	// (one disk per node, the orphaned shards already taking slots) can
	// legitimately fail destination planning. The assertion that matters
	// is: the #9448 guard did NOT silently swallow the volume into a
	// skippedAlreadyEC counter, and any emitted result is still an EC
	// task and not a no-op.
	for _, r := range results {
		require.Equal(t, types.TaskTypeErasureCoding, r.TaskType, "any emitted result should still be an EC task, not a no-op")
	}
}

// buildStuckSourceTopology constructs a topology that mimics the #9448 stuck
// state: a regular volume replica on node 0 plus `presentShardCount` EC
// shards distributed across nodes 0..presentShardCount-1.
func buildStuckSourceTopology(t *testing.T, volumeID uint32, presentShardCount int) *topology.ActiveTopology {
	t.Helper()
	require.LessOrEqual(t, presentShardCount, erasure_coding.TotalShardsCount)
	activeTopology := topology.NewActiveTopology(10)
	nodes := make([]*master_pb.DataNodeInfo, 0, erasure_coding.TotalShardsCount)
	for i := 0; i < erasure_coding.TotalShardsCount; i++ {
		nodeID := fmt.Sprintf("127.0.0.1:%d", 8080+i)
		diskInfo := &master_pb.DiskInfo{
			DiskId:         0,
			VolumeCount:    1,
			MaxVolumeCount: 100,
		}
		if i < presentShardCount {
			diskInfo.EcShardInfos = []*master_pb.VolumeEcShardInformationMessage{{
				Id:          volumeID,
				Collection:  "",
				EcIndexBits: uint32(1) << uint(i),
				DiskId:      0,
			}}
		}
		if i == 0 {
			diskInfo.VolumeInfos = []*master_pb.VolumeInformationMessage{{
				Id:       volumeID,
				DiskId:   0,
				DiskType: "hdd",
				Size:     200 * 1024 * 1024,
			}}
		}
		nodes = append(nodes, &master_pb.DataNodeInfo{
			Id:        nodeID,
			DiskInfos: map[string]*master_pb.DiskInfo{"hdd": diskInfo},
		})
	}
	require.NoError(t, activeTopology.UpdateTopology(&master_pb.TopologyInfo{
		DataCenterInfos: []*master_pb.DataCenterInfo{{
			Id: "dc1",
			RackInfos: []*master_pb.RackInfo{{
				Id:            "rack1",
				DataNodeInfos: nodes,
			}},
		}},
	}))
	return activeTopology
}

// buildStuckSourceMetrics returns a metric that already satisfies the EC
// criteria (Age, FullnessRatio, Size), with `Age` derived from `LastModified`
// so the two fields stay consistent for any reader.
func buildStuckSourceMetrics(volumeID uint32, server string) []*types.VolumeHealthMetrics {
	lastModified := time.Now().Add(-2 * time.Hour)
	return []*types.VolumeHealthMetrics{{
		VolumeID:      volumeID,
		Server:        server,
		Size:          200 * 1024 * 1024,
		Collection:    "",
		FullnessRatio: 0.96,
		LastModified:  lastModified,
		Age:           time.Since(lastModified),
	}}
}

// TestCountExistingEcShardsForVolume verifies that the helper walks the
// EcIndexBits bitmap (not just len(EcShardInfos)) so it correctly counts
// distinct shard ids even when a single info entry on one disk carries
// multiple shards.
func TestCountExistingEcShardsForVolume(t *testing.T) {
	const volumeID uint32 = 99
	activeTopology := topology.NewActiveTopology(10)
	require.NoError(t, activeTopology.UpdateTopology(&master_pb.TopologyInfo{
		DataCenterInfos: []*master_pb.DataCenterInfo{{
			Id: "dc1",
			RackInfos: []*master_pb.RackInfo{{
				Id: "rack1",
				DataNodeInfos: []*master_pb.DataNodeInfo{
					{
						Id: "127.0.0.1:8080",
						DiskInfos: map[string]*master_pb.DiskInfo{
							"hdd": {
								DiskId:         0,
								MaxVolumeCount: 100,
								// One info entry, three shards present (ids 0, 2, 5).
								EcShardInfos: []*master_pb.VolumeEcShardInformationMessage{{
									Id:          volumeID,
									Collection:  "",
									EcIndexBits: (uint32(1) << 0) | (uint32(1) << 2) | (uint32(1) << 5),
									DiskId:      0,
								}},
							},
						},
					},
					{
						Id: "127.0.0.1:8081",
						DiskInfos: map[string]*master_pb.DiskInfo{
							"hdd": {
								DiskId:         0,
								MaxVolumeCount: 100,
								// One info entry, one shard (id 3); no overlap with the first node.
								EcShardInfos: []*master_pb.VolumeEcShardInformationMessage{{
									Id:          volumeID,
									Collection:  "",
									EcIndexBits: uint32(1) << 3,
									DiskId:      0,
								}},
							},
						},
					},
				},
			}},
		}},
	}))

	assert.Equal(t, 4, countExistingEcShardsForVolume(activeTopology, volumeID, ""))
	assert.Equal(t, 0, countExistingEcShardsForVolume(activeTopology, volumeID, "other-collection"))
	assert.Equal(t, 0, countExistingEcShardsForVolume(nil, volumeID, ""))
}

func TestDetectionContextCancellation(t *testing.T) {
	activeTopology := buildActiveTopology(t, 5, []string{"hdd", "ssd"}, 50, 0)
	clusterInfo := &types.ClusterInfo{ActiveTopology: activeTopology}
	metrics := buildVolumeMetricsForIDs(50)

	ctx, cancel := context.WithCancel(context.Background())
	cancel()

	_, _, err := Detection(ctx, metrics, clusterInfo, NewDefaultConfig(), 0)
	require.ErrorIs(t, err, context.Canceled)
}

func TestDetectionMaxResultsHonorsLimit(t *testing.T) {
	// One node per shard so each shard gets its own disk (#9369).
	activeTopology := buildActiveTopology(t, erasure_coding.TotalShardsCount, []string{"hdd"}, 20, 0)
	clusterInfo := &types.ClusterInfo{ActiveTopology: activeTopology}
	metrics := buildVolumeMetricsForIDs(3)

	results, hasMore, err := Detection(context.Background(), metrics, clusterInfo, NewDefaultConfig(), 1)
	require.NoError(t, err)
	assert.Len(t, results, 1)
	assert.True(t, hasMore)
}

// #9369: 7 servers × 2 physical HDDs must yield 14 distinct (server, disk_id)
// destinations, not 7 destinations doubled up on the same disk.
func TestPlanECDestinationsSpreadsAcrossPhysicalDisks(t *testing.T) {
	const numServers = 7
	const disksPerServer = 2

	activeTopology := topology.NewActiveTopology(10)
	nodes := make([]*master_pb.DataNodeInfo, 0, numServers)
	for i := 1; i <= numServers; i++ {
		volumeInfos := make([]*master_pb.VolumeInformationMessage, 0, disksPerServer)
		for d := uint32(0); d < disksPerServer; d++ {
			volumeInfos = append(volumeInfos, &master_pb.VolumeInformationMessage{
				Id:       uint32(i*100 + int(d)),
				DiskId:   d,
				DiskType: "hdd",
			})
		}
		nodes = append(nodes, &master_pb.DataNodeInfo{
			Id: fmt.Sprintf("127.0.0.1:%d", 8080+i),
			DiskInfos: map[string]*master_pb.DiskInfo{
				"hdd": {
					DiskId:         0,
					VolumeCount:    int64(disksPerServer),
					MaxVolumeCount: 200,
					VolumeInfos:    volumeInfos,
				},
			},
		})
	}
	require.NoError(t, activeTopology.UpdateTopology(&master_pb.TopologyInfo{
		DataCenterInfos: []*master_pb.DataCenterInfo{{
			Id: "dc1",
			RackInfos: []*master_pb.RackInfo{{
				Id:            "rack1",
				DataNodeInfos: nodes,
			}},
		}},
	}))

	metric := &types.VolumeHealthMetrics{
		VolumeID:   42,
		Server:     "127.0.0.1:8081",
		Size:       100 * 1024 * 1024,
		Collection: "",
	}

	plan, shardsPerPlan, err := planECDestinations(activeTopology, metric, NewDefaultConfig(), nil, erasure_coding.DataShardsCount, erasure_coding.ParityShardsCount)
	require.NoError(t, err)
	require.NotNil(t, plan)
	requireAllShardsPlaced(t, plan, shardsPerPlan)
}

func TestPlanECDestinationsFailsWithInsufficientCapacity(t *testing.T) {
	activeTopology := buildActiveTopology(t, 1, []string{"hdd"}, 1, 1)

	metric := &types.VolumeHealthMetrics{
		VolumeID:   2,
		Server:     "10.0.0.1:8080",
		Size:       10 * 1024 * 1024,
		Collection: "",
	}

	_, _, err := planECDestinations(activeTopology, metric, NewDefaultConfig(), nil, erasure_coding.DataShardsCount, erasure_coding.ParityShardsCount)
	require.Error(t, err)
}

// #9586: with fewer single-disk servers than total shards, EC must still plan
// by packing several shards onto a disk (ec.encode's "4,4,3,3" fallback) rather
// than refusing. The reporter has 8 single-disk servers across 3 racks and a
// 10+4 scheme: 8 disks for 14 shards. minTotalDisks (ceil(14/4)=4) keeps every
// disk at or below parityShards shards, so durability holds.
func TestPlanECDestinationsPacksWhenFewerDisksThanShards(t *testing.T) {
	const numServers = 8
	// rack3 holds 4 servers, rack1 and rack2 hold 2 each, mirroring the report.
	racks := []string{"rack3", "rack3", "rack3", "rack3", "rack1", "rack1", "rack2", "rack2"}

	activeTopology := topology.NewActiveTopology(10)
	rackNodes := make(map[string][]*master_pb.DataNodeInfo)
	for i := 0; i < numServers; i++ {
		nodeID := fmt.Sprintf("192.168.1.%d:%d", 143+i/3, 8080+i)
		rackNodes[racks[i]] = append(rackNodes[racks[i]], &master_pb.DataNodeInfo{
			Id: nodeID,
			DiskInfos: map[string]*master_pb.DiskInfo{
				"hdd": {
					DiskId:         0,
					VolumeCount:    1,
					MaxVolumeCount: 200,
					VolumeInfos: []*master_pb.VolumeInformationMessage{{
						Id:       uint32(i + 1),
						DiskId:   0,
						DiskType: "hdd",
					}},
				},
			},
		})
	}
	rackInfos := make([]*master_pb.RackInfo, 0, len(rackNodes))
	for _, rackID := range []string{"rack1", "rack2", "rack3"} {
		rackInfos = append(rackInfos, &master_pb.RackInfo{Id: rackID, DataNodeInfos: rackNodes[rackID]})
	}
	require.NoError(t, activeTopology.UpdateTopology(&master_pb.TopologyInfo{
		DataCenterInfos: []*master_pb.DataCenterInfo{{Id: "dc1", RackInfos: rackInfos}},
	}))

	metric := &types.VolumeHealthMetrics{
		VolumeID:   4569,
		Server:     "192.168.1.145:8081",
		Size:       100 * 1024 * 1024,
		Collection: "",
	}

	plan, shardsPerPlan, err := planECDestinations(activeTopology, metric, NewDefaultConfig(), nil, erasure_coding.DataShardsCount, erasure_coding.ParityShardsCount)
	require.NoError(t, err)
	require.NotNil(t, plan)
	// Packed onto the available disks: more than one shard per disk but never more
	// than the 8 disks, and at least the durability floor of distinct disks.
	require.LessOrEqual(t, len(plan.Plans), numServers)
	require.GreaterOrEqual(t, len(plan.Plans), (erasure_coding.TotalShardsCount+erasure_coding.ParityShardsCount-1)/erasure_coding.ParityShardsCount)

	// createECTargets must cover all 14 shards exactly once, packing onto the
	// available disks without any disk exceeding parityShards shards.
	targets := createECTargets(plan, shardsPerPlan)
	require.Equal(t, len(plan.Plans), len(targets))

	seenShards := make(map[uint32]bool)
	for _, target := range targets {
		require.LessOrEqual(t, len(target.ShardIds), erasure_coding.ParityShardsCount,
			"no disk may hold more than parityShards shards, else losing it loses the volume")
		for _, shardId := range target.ShardIds {
			require.False(t, seenShards[shardId], "shard %d assigned to more than one target", shardId)
			seenShards[shardId] = true
		}
	}
	require.Len(t, seenShards, erasure_coding.TotalShardsCount, "every shard must be placed exactly once")
}

// requireAllShardsPlaced asserts every EC shard landed exactly once, on a distinct
// (node,disk) target, with no disk holding more than parityShards shards (so losing
// any one disk cannot lose the volume). shardsPerPlan is parallel to plan.Plans.
func requireAllShardsPlaced(t *testing.T, plan *topology.MultiDestinationPlan, shardsPerPlan [][]uint32) {
	t.Helper()
	require.Equal(t, len(plan.Plans), len(shardsPerPlan), "one shard list per plan entry")
	keys := make(map[string]bool, len(plan.Plans))
	seen := make(map[uint32]bool)
	for i, p := range plan.Plans {
		key := fmt.Sprintf("%s:%d", p.TargetNode, p.TargetDisk)
		require.False(t, keys[key], "duplicate (node,disk) target %s", key)
		keys[key] = true
		require.LessOrEqual(t, len(shardsPerPlan[i]), erasure_coding.ParityShardsCount,
			"disk %s holds %d shards, over parityShards", key, len(shardsPerPlan[i]))
		for _, s := range shardsPerPlan[i] {
			require.False(t, seen[s], "shard %d placed more than once", s)
			seen[s] = true
		}
	}
	require.Len(t, seen, erasure_coding.TotalShardsCount, "every shard must be placed exactly once")
}

func buildVolumeMetricsForIDs(count int) []*types.VolumeHealthMetrics {
	metrics := make([]*types.VolumeHealthMetrics, 0, count)
	now := time.Now()
	for id := 1; id <= count; id++ {
		metrics = append(metrics, &types.VolumeHealthMetrics{
			VolumeID:      uint32(id),
			Server:        "10.0.0.1:8080",
			Size:          200 * 1024 * 1024,
			Collection:    "",
			FullnessRatio: 0.96,
			LastModified:  now.Add(-2 * time.Hour),
			Age:           2 * time.Hour,
		})
	}
	return metrics
}

func buildActiveTopology(t *testing.T, nodeCount int, diskTypes []string, maxVolumeCount, usedVolumeCount int64) *topology.ActiveTopology {
	t.Helper()
	activeTopology := topology.NewActiveTopology(10)

	nodes := make([]*master_pb.DataNodeInfo, 0, nodeCount)
	for i := 1; i <= nodeCount; i++ {
		diskInfos := make(map[string]*master_pb.DiskInfo)
		for diskIndex, diskType := range diskTypes {
			used := usedVolumeCount
			if used > maxVolumeCount {
				used = maxVolumeCount
			}
			volumeInfos := make([]*master_pb.VolumeInformationMessage, 0, 200)
			for vid := 1; vid <= 200; vid++ {
				volumeInfos = append(volumeInfos, &master_pb.VolumeInformationMessage{
					Id:         uint32(vid),
					Collection: "",
					DiskId:     uint32(diskIndex),
				})
			}
			diskInfos[diskType] = &master_pb.DiskInfo{
				DiskId:         uint32(diskIndex),
				VolumeCount:    used,
				MaxVolumeCount: maxVolumeCount,
				VolumeInfos:    volumeInfos,
			}
		}

		nodes = append(nodes, &master_pb.DataNodeInfo{
			Id:        fmt.Sprintf("10.0.0.%d:8080", i),
			DiskInfos: diskInfos,
		})
	}

	topologyInfo := &master_pb.TopologyInfo{
		DataCenterInfos: []*master_pb.DataCenterInfo{
			{
				Id: "dc1",
				RackInfos: []*master_pb.RackInfo{
					{
						Id:            "rack1",
						DataNodeInfos: nodes,
					},
				},
			},
		},
	}

	require.NoError(t, activeTopology.UpdateTopology(topologyInfo))
	return activeTopology
}