Files
seaweedfs/weed/worker/tasks/erasure_coding/detection_test.go
T
Chris Lu 0566fbd552 EC encode: place shards via ecbalancer.Place + configurable replica placement (#9623)
* Add shared super_block.ResolveReplicaPlacement; use it in ec_balance

* Add ecbalancer.FromActiveTopology snapshot constructor for EC encode/repair

* Add ecbalancer.Place greenfield/repair placement core (strict + durability-first)

* topology: add GetEffectiveAvailableEcShardSlots; FromActiveTopology uses shard-granular free slots

GetDisksWithEffectiveCapacity flattens reserved shard slots into volume slots via
integer truncation, so an in-flight EC task reserving a non-multiple-of-
DataShardsCount number of shards was lost from the snapshot and freeSlots was
over-reported. GetEffectiveAvailableEcShardSlots subtracts the full reservation
impact at shard granularity.

* ecbalancer.Place: reject nodes without a free disk of the requested type

FromActiveTopology keeps all disk types in the snapshot, so an SSD-only request
could be routed to a node with only HDD capacity (pickBestDiskOnNode then returns
disk 0 on the wrong tier). Filter rack/node selection to those with a free disk
of the requested type.

* ecbalancer.Place: enforce ReplicaPlacement DiffDataCenterCount (per-DC shard cap)

* ecbalancer: enforce DiffDataCenterCount in balance (cross-DC phase + cross-rack DC cap)

Adds a cross-DC corrective phase that drains data centers holding more than
DiffDataCenterCount shards of a volume, and a per-DC cap on cross-rack move
targets. Both are no-ops when DiffDataCenterCount is unset, so balance output is
unchanged for non-DC placements.

* topology: ratio-aware EC shard slots and provisional empty-disk slot

GetEffectiveAvailableEcShardSlots now takes the target collection's data-shard
count, so a 4+2 volume's larger shards are not over-counted at 10 per volume slot;
and it keeps the one provisional slot for freshly started empty servers that
report max=0, matching getEffectiveAvailableCapacityUnsafe. FromActiveTopology
threads the ratio through.

* ecbalancer.Place: explicit disk-type filter signal (fix HDD vs any ambiguity)

HardDriveType normalizes to "", which collided with "" meaning any disk. Add
Constraints.FilterDiskType and normalize both sides so a hdd request matches disks
reported as "" and never leaks to SSD, while filter=false still means any.

* ecbalancer: add clearShardAccounting for repair snapshot reconciliation

Clears one disk's copy of a shard from per-domain accounting and recomputes the
node-level union (preserving a kept copy on another disk of the same node), without
crediting capacity. Repair uses it to drop to-be-deleted copies before placing
missing shards.

* ecbalancer: don't cap cross-DC target racks when DiffRackCount is unset

len(racks)+1 wrongly limited each target rack (3 in a 2-rack cluster), so draining
a DC could stop short of the DiffDataCenterCount cap. Use MaxShardCount+1 as the
effectively-unlimited default.

* topology/ecbalancer: ratio-correct EC capacity accounting

Reservation shard slots (default ShardsPerVolumeSlot units) are now converted to
the target ratio before subtracting, and existing EC shards are charged by size
(targetDataShards/shardDataShards) so a 2+1 shard isn't counted as one 10+4 slot.
Per-shard ratio lookup is behind shardDataShards (OSS uses the standard ratio).

* ecbalancer.Place: candidate tiering and eligible-rack caps

Adds a per-disk eligibility/preference abstraction so Place supports:
- preferred-tag whole-plan retry (try disks carrying the earliest tags first,
  widen to all only if a tier cannot place every shard; reports
  SpilledOutsidePreferredTags),
- soft disk-type spill via DiskTypePolicy (Any/Prefer/Require): Prefer fills the
  preferred type then spills, reporting SpilledToOtherDiskType; Require filters,
- even per-rack caps that divide by racks holding an eligible disk, so a tiered
  cluster (e.g. SSDs in 2 of 4 racks) isn't capped impossibly low.
Disk tags carried via Node.AddDiskTags + FromActiveTopology.

* ecbalancer: export ClearShardAccounting for repair snapshot reconciliation

* ecbalancer: address review feedback (ratio rounding, bitmap walk, same-DC moves)

- topology/ecbalancer: round shard-reservation and existing-shard footprint up
  when converting to target-ratio shard slots, so a sub-slot reservation is not
  truncated to zero and free capacity is not overstated for low-data-shard
  layouts (targetDataShards < ds).
- erasure_coding: add ShardBits.All iterator and use it across the balancer,
  cross-DC phase, and placement scoring instead of scanning 0..MaxShardCount and
  probing Has on every id.
- ecbalancer: allow same-DC cross-rack moves when a DC already sits at its
  DiffDataCenterCount cap; a same-DC move leaves the DC total unchanged. Add a
  regression test that fails without the guard.
- ecbalancer cross-DC phase: pick targets via the eligible-aware
  pickNodeInRackEligible/pickBestDiskEligible helpers so the disk-type filter is
  honored and a 0 disk id is not mistaken for a valid selection.

* ecbalancer: test ecShardSlotsOnDisk fractional round-up

Cover the mixed-ratio path (targetDataShards < existing data shards) so a
shard's fractional footprint is never floored to zero and free capacity is not
overstated. Exercises the round-up via the targetDataShards parameter; OSS uses
the standard ratio at runtime while the enterprise build hits it with real
per-volume ratios.

* ecbalancer: assert node B rack in TestFromActiveTopology

* ecbalancer: split Destination into separate DataCenter and bare Rack

Replace the composite "dc:rack" Rack field on Destination with separate
DataCenter and bare Rack values, matching topology.DiskInfo and the worker-task
convention. Callers (and tests) read the data center directly instead of parsing
the composite with strings.SplitN.

* shell ec.balance: use utilization-based global balancing (parity with worker)

The shell's global rebalance phase balanced by raw shard count; switch it to
fractional fullness (shards/capacity), as the worker already does. On uniform
capacity the two agree; on heterogeneous capacity it fills nodes proportionally
instead of driving small-capacity nodes toward full.

Updates the heterogeneous-capacity regression test to assert even fullness
(~equal shards/capacity per node) rather than even shard count.

* ecbalancer: bounded-proportional per-DC shard spread

DiffDataCenterCount was enforced only as a ceiling (drain-to-cap), which could
leave a within-cap-but-lopsided DC distribution under a loose cap (e.g. 10/4 of 14
with cap=10). Now the cross-DC phase, the cross-rack DC guard, and Place all target
boundedMaxPerDC = min(DiffDataCenterCount, max(ceil(total/numDCs), parityShards)):
shards spread proportionally across DCs, but no tighter than the durability floor
(once each DC holds <= parityShards a DC loss is recoverable, so further spreading
only adds cross-DC/WAN traffic). No-op when DiffDataCenterCount is 0; identical to
before when the cap is the binding constraint.

* ecbalancer: drop DiffDataCenterCount enforcement for EC placement

The 1-byte volume ReplicaPlacement packs xyz into x*100+y*10+z<=255, so the DC
digit can only be 0-2 -- far too small to be a meaningful per-DC EC shard cap (a
cap of 1-2 would demand 7-14 DCs for a 10+4 volume). It's volume replica-placement,
not an EC spec. Removes the cross-DC balance phase, the DC guard in the cross-rack
phase, and the per-DC cap in Place (and the just-added bounded-proportional logic);
EC relies on the RP-independent rack/node even spread instead. Rack/node caps
(DiffRackCount/SameRackCount) are unchanged. Per-domain EC caps are left for a real
EC placement spec.

* ecbalancer: enforce per-disk durability cap; symmetric reserve/release

Place now refuses to put more than parityShards shards of a volume on a single
disk (pickBestDiskEligible skips a disk once it holds parityShards of the volume,
a hard cap not relaxed even in durability-first). Previously Place assigned by
free capacity, so a skewed near-full cluster could pile >parityShards onto one
disk -> losing it loses the volume; only distinct-disk count was checked. This
covers encode and repair (both route through Place); the caller skips/leaves the
volume rather than minting an unrecoverable layout.

Also makes reserveShard decrement freeSlots unconditionally, symmetric with
releaseShard's unconditional increment (the old guarded decrement could credit a
phantom slot on release if a shard were ever reserved onto a full disk).

* ecbalancer: add Topology.ReleaseVolumeShards (clear + credit) for greenfield encode

Releases all of a volume's shards from the snapshot and credits the freed disk
capacity, so a greenfield encode can plan as if stale EC shards from a prior failed
attempt are gone. Safe to credit because the encode task deletes stale shards
(cleanupStaleEcShards) before distributing the new ones. Distinct from
ClearShardAccounting (repair), which does not credit.

* ecbalancer: ReleaseVolumeShards credits node freeSlots, not just disks

releaseShard only increments per-disk freeSlots, but rack capacity is summed from
node freeSlots (buildRacks) and node freeSlots gates node eligibility. Crediting
only disks left a node/rack looking full after releasing stale shards, so a
greenfield encode still couldn't use the freed capacity. Now credits the node by
the total disk-slots freed.

* ecbalancer: correct PlacementMode docs (encode uses durability-first)

PlaceStrict was labeled '(encode)' but encode uses PlaceDurabilityFirst. Clarify
that durability-first is used by both encode and repair, reports relaxations in
PlaceResult.Relaxed, and never relaxes the per-disk durability cap.

* ecbalancer: treat SameRackCount as a direct per-node shard cap

The 3rd ReplicaPlacement digit now caps shards per node at exactly the digit
value, matching how DiffRackCount (2nd digit) caps per rack, instead of allowing
digit+1 per node. This makes the per-rack and per-node caps consistent and
matches the documented "digits cap EC shards per rack and per node" semantics;
e.g. 011 now means at most one shard per rack and one per node.

* EC encode: place shards via ecbalancer.Place + configurable replica placement

Encode now plans destinations through the shared ecbalancer.Place policy
(durability-first: prefers the source disk type and honors replica placement /
caps / anti-affinity, relaxing rather than failing when capacity is tight) instead
of the EC-only placement planner. Targets and capacity reservations use Place's
actual per-disk shard assignment, not a round-robin guess; cross-volume in-cycle
capacity is tracked by ActiveTopology's pending task, so the cached planner is no
longer consulted. Adds a configurable replica_placement (proto field 6 + worker
form + reader) that overrides the master default replication.

The placement-package planner code is left in place (now unused) and removed in a
follow-up that drops the package.

* EC encode: drop unused dataShards param from createECTargets

Addresses review feedback: after switching to Place's per-disk shardsPerPlan
assignment, createECTargets no longer needs the data-shard count.

* EC encode: fix packed-target validation, greenfield stale-shard accounting, RP docs

- Validate counts distinct shard ids across targets, not target rows, so packed
  plans (fewer (node,disk) targets than shards) aren't rejected.
- planECDestinations releases the volume's stale EC shards from the snapshot before
  Place (ReleaseVolumeShards), crediting their capacity. The encode task deletes
  stale shards before distributing, so a retry on tight capacity no longer fails
  planning by counting shards that are about to be removed.
- replica_placement config/form help no longer claims a data-center limit (the DC
  digit is ignored for EC); detection logs a warning when a DC digit is set.

* EC encode: surface relaxed placement; mark replica_placement best-effort

Encode places with PlaceDurabilityFirst (the chosen lenient behavior), which can
relax caps/anti-affinity/replica-placement to avoid deferring. That was silent
(only disk-type/tag spills were logged). Now logs PlaceResult.Relaxed so a tight
replica placement isn't weakened unnoticed, and the config/form help states the
rack/node caps are best-effort during encode (enforced by rebalancing).

* EC encode: key per-disk shard grouping by struct, not formatted string

planECDestinations grouped destinations using a fmt.Sprintf("%s:%d") map key
per shard; use a {node,diskID} struct key and pre-size the map/slice to the
shard count to drop the per-shard string allocation.
2026-05-22 20:22:30 -07:00

551 lines
20 KiB
Go
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
package erasure_coding
import (
"context"
"fmt"
"testing"
"time"
"github.com/seaweedfs/seaweedfs/weed/admin/topology"
"github.com/seaweedfs/seaweedfs/weed/pb/master_pb"
"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding"
"github.com/seaweedfs/seaweedfs/weed/worker/types"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestECPlacementPlannerApplyReservations(t *testing.T) {
activeTopology := buildActiveTopology(t, 1, []string{"hdd"}, 10, 0)
planner := newECPlacementPlanner(activeTopology, nil)
require.NotNil(t, planner)
key := ecDiskKey("10.0.0.1:8080", 0)
candidate, ok := planner.candidateByKey[key]
require.True(t, ok)
assert.Equal(t, 10, candidate.FreeSlots)
assert.Equal(t, 0, candidate.ShardCount)
assert.Equal(t, 0, candidate.LoadCount)
shardImpact := topology.CalculateECShardStorageImpact(1, 1)
destinations := make([]topology.TaskDestinationSpec, 10)
for i := 0; i < 10; i++ {
destinations[i] = topology.TaskDestinationSpec{
ServerID: "10.0.0.1:8080",
DiskID: 0,
StorageImpact: &shardImpact,
}
}
planner.applyTaskReservations(1024, nil, destinations)
candidate = planner.candidateByKey[key]
assert.Equal(t, 9, candidate.FreeSlots, "10 shard slots should reduce available volume slots by 1")
assert.Equal(t, 10, candidate.ShardCount)
assert.Equal(t, 1, candidate.LoadCount, "load should only be incremented once per disk")
}
func TestPlanECDestinationsUsesPlanner(t *testing.T) {
activeTopology := buildActiveTopology(t, 7, []string{"hdd", "ssd"}, 100, 0)
metric := &types.VolumeHealthMetrics{
VolumeID: 1,
Server: "10.0.0.1:8080",
Size: 100 * 1024 * 1024,
Collection: "",
}
plan, shardsPerPlan, err := planECDestinations(activeTopology, metric, NewDefaultConfig(), nil, erasure_coding.DataShardsCount, erasure_coding.ParityShardsCount)
require.NoError(t, err)
require.NotNil(t, plan)
requireAllShardsPlaced(t, plan, shardsPerPlan)
}
func TestECPlacementPlannerPrefersTaggedDisks(t *testing.T) {
activeTopology := buildActiveTopology(t, 3, []string{"hdd"}, 10, 0)
topo := activeTopology.GetTopologyInfo()
for _, dc := range topo.DataCenterInfos {
for _, rack := range dc.RackInfos {
for k, node := range rack.DataNodeInfos {
for diskType := range node.DiskInfos {
if k < 2 {
node.DiskInfos[diskType].Tags = []string{"fast"}
} else {
node.DiskInfos[diskType].Tags = []string{"slow"}
}
}
}
}
}
require.NoError(t, activeTopology.UpdateTopology(topo))
planner := newECPlacementPlanner(activeTopology, []string{"fast"})
require.NotNil(t, planner)
selected, err := planner.selectDestinations("", "", "", 2)
require.NoError(t, err)
require.Len(t, selected, 2)
for _, candidate := range selected {
key := ecDiskKey(candidate.NodeID, candidate.DiskID)
assert.True(t, diskHasTag(planner.diskTags[key], "fast"))
}
}
func TestECPlacementPlannerFallsBackWhenTagsInsufficient(t *testing.T) {
activeTopology := buildActiveTopology(t, 3, []string{"hdd"}, 10, 0)
topo := activeTopology.GetTopologyInfo()
for _, dc := range topo.DataCenterInfos {
for _, rack := range dc.RackInfos {
for i, node := range rack.DataNodeInfos {
for diskType := range node.DiskInfos {
if i == 0 {
node.DiskInfos[diskType].Tags = []string{"fast"}
}
}
}
}
}
require.NoError(t, activeTopology.UpdateTopology(topo))
planner := newECPlacementPlanner(activeTopology, []string{"fast"})
require.NotNil(t, planner)
selected, err := planner.selectDestinations("", "", "", 3)
require.NoError(t, err)
require.Len(t, selected, 3)
taggedCount := 0
for _, candidate := range selected {
key := ecDiskKey(candidate.NodeID, candidate.DiskID)
if diskHasTag(planner.diskTags[key], "fast") {
taggedCount++
}
}
assert.Less(t, taggedCount, len(selected))
}
// TestDetectionSkipsWhenECShardsAlreadyExist guards against issue #9448: a
// regular replica that survived a previous successful EC encode (source
// delete didn't clean it up for some reason) gets re-proposed for encoding,
// the new encode collides with the already-mounted shards on the targets
// ("ec volume %d is mounted; refusing overwrite"), and detection loops
// forever on the same volume. Detection must see the existing shards and
// skip the volume so an admin can clean it up out-of-band.
//
// The guard fires ONLY when the EC shard set is complete (count >=
// totalShards), so a partially-distributed previous attempt still falls
// through to the existing recovery branch in the encode path.
func TestDetectionSkipsWhenECShardsAlreadyExist(t *testing.T) {
const volumeID uint32 = 42
activeTopology := buildStuckSourceTopology(t, volumeID, erasure_coding.TotalShardsCount)
clusterInfo := &types.ClusterInfo{ActiveTopology: activeTopology}
metrics := buildStuckSourceMetrics(volumeID, "127.0.0.1:8080")
results, hasMore, err := Detection(context.Background(), metrics, clusterInfo, NewDefaultConfig(), 0)
require.NoError(t, err)
require.False(t, hasMore)
require.Empty(t, results, "stuck source replica with all EC shards present must not yield a new encoding proposal")
}
// TestDetectionAllowsRegularReplicaWhenShardsPartial covers the partial-EC
// branch of the #9448 guard: when fewer than totalShards exist, the volume
// is allowed to flow through to the normal encoding path so the existing
// recovery branch (the `existingECShards` block in the encode arm) can fold
// the partial shards into the new task. A bug here would either (a) skip
// the volume entirely or (b) emit a proposal that later collides on the
// mounted shards.
func TestDetectionAllowsRegularReplicaWhenShardsPartial(t *testing.T) {
const volumeID uint32 = 43
activeTopology := buildStuckSourceTopology(t, volumeID, erasure_coding.DataShardsCount-1)
clusterInfo := &types.ClusterInfo{ActiveTopology: activeTopology}
metrics := buildStuckSourceMetrics(volumeID, "127.0.0.1:8080")
results, _, err := Detection(context.Background(), metrics, clusterInfo, NewDefaultConfig(), 0)
require.NoError(t, err)
// Partial shards are not a "stuck source" — the encode arm must keep
// its chance to either propose a fresh task that folds the partial
// shards into cleanup, or fail planning on the constrained topology.
// We don't require len(results) > 0 because the constrained topology
// (one disk per node, the orphaned shards already taking slots) can
// legitimately fail destination planning. The assertion that matters
// is: the #9448 guard did NOT silently swallow the volume into a
// skippedAlreadyEC counter, and any emitted result is still an EC
// task and not a no-op.
for _, r := range results {
require.Equal(t, types.TaskTypeErasureCoding, r.TaskType, "any emitted result should still be an EC task, not a no-op")
}
}
// buildStuckSourceTopology constructs a topology that mimics the #9448 stuck
// state: a regular volume replica on node 0 plus `presentShardCount` EC
// shards distributed across nodes 0..presentShardCount-1.
func buildStuckSourceTopology(t *testing.T, volumeID uint32, presentShardCount int) *topology.ActiveTopology {
t.Helper()
require.LessOrEqual(t, presentShardCount, erasure_coding.TotalShardsCount)
activeTopology := topology.NewActiveTopology(10)
nodes := make([]*master_pb.DataNodeInfo, 0, erasure_coding.TotalShardsCount)
for i := 0; i < erasure_coding.TotalShardsCount; i++ {
nodeID := fmt.Sprintf("127.0.0.1:%d", 8080+i)
diskInfo := &master_pb.DiskInfo{
DiskId: 0,
VolumeCount: 1,
MaxVolumeCount: 100,
}
if i < presentShardCount {
diskInfo.EcShardInfos = []*master_pb.VolumeEcShardInformationMessage{{
Id: volumeID,
Collection: "",
EcIndexBits: uint32(1) << uint(i),
DiskId: 0,
}}
}
if i == 0 {
diskInfo.VolumeInfos = []*master_pb.VolumeInformationMessage{{
Id: volumeID,
DiskId: 0,
DiskType: "hdd",
Size: 200 * 1024 * 1024,
}}
}
nodes = append(nodes, &master_pb.DataNodeInfo{
Id: nodeID,
DiskInfos: map[string]*master_pb.DiskInfo{"hdd": diskInfo},
})
}
require.NoError(t, activeTopology.UpdateTopology(&master_pb.TopologyInfo{
DataCenterInfos: []*master_pb.DataCenterInfo{{
Id: "dc1",
RackInfos: []*master_pb.RackInfo{{
Id: "rack1",
DataNodeInfos: nodes,
}},
}},
}))
return activeTopology
}
// buildStuckSourceMetrics returns a metric that already satisfies the EC
// criteria (Age, FullnessRatio, Size), with `Age` derived from `LastModified`
// so the two fields stay consistent for any reader.
func buildStuckSourceMetrics(volumeID uint32, server string) []*types.VolumeHealthMetrics {
lastModified := time.Now().Add(-2 * time.Hour)
return []*types.VolumeHealthMetrics{{
VolumeID: volumeID,
Server: server,
Size: 200 * 1024 * 1024,
Collection: "",
FullnessRatio: 0.96,
LastModified: lastModified,
Age: time.Since(lastModified),
}}
}
// TestCountExistingEcShardsForVolume verifies that the helper walks the
// EcIndexBits bitmap (not just len(EcShardInfos)) so it correctly counts
// distinct shard ids even when a single info entry on one disk carries
// multiple shards.
func TestCountExistingEcShardsForVolume(t *testing.T) {
const volumeID uint32 = 99
activeTopology := topology.NewActiveTopology(10)
require.NoError(t, activeTopology.UpdateTopology(&master_pb.TopologyInfo{
DataCenterInfos: []*master_pb.DataCenterInfo{{
Id: "dc1",
RackInfos: []*master_pb.RackInfo{{
Id: "rack1",
DataNodeInfos: []*master_pb.DataNodeInfo{
{
Id: "127.0.0.1:8080",
DiskInfos: map[string]*master_pb.DiskInfo{
"hdd": {
DiskId: 0,
MaxVolumeCount: 100,
// One info entry, three shards present (ids 0, 2, 5).
EcShardInfos: []*master_pb.VolumeEcShardInformationMessage{{
Id: volumeID,
Collection: "",
EcIndexBits: (uint32(1) << 0) | (uint32(1) << 2) | (uint32(1) << 5),
DiskId: 0,
}},
},
},
},
{
Id: "127.0.0.1:8081",
DiskInfos: map[string]*master_pb.DiskInfo{
"hdd": {
DiskId: 0,
MaxVolumeCount: 100,
// One info entry, one shard (id 3) — overlaps with neither.
EcShardInfos: []*master_pb.VolumeEcShardInformationMessage{{
Id: volumeID,
Collection: "",
EcIndexBits: uint32(1) << 3,
DiskId: 0,
}},
},
},
},
},
}},
}},
}))
assert.Equal(t, 4, countExistingEcShardsForVolume(activeTopology, volumeID, ""))
assert.Equal(t, 0, countExistingEcShardsForVolume(activeTopology, volumeID, "other-collection"))
assert.Equal(t, 0, countExistingEcShardsForVolume(nil, volumeID, ""))
}
func TestDetectionContextCancellation(t *testing.T) {
activeTopology := buildActiveTopology(t, 5, []string{"hdd", "ssd"}, 50, 0)
clusterInfo := &types.ClusterInfo{ActiveTopology: activeTopology}
metrics := buildVolumeMetricsForIDs(50)
ctx, cancel := context.WithCancel(context.Background())
cancel()
_, _, err := Detection(ctx, metrics, clusterInfo, NewDefaultConfig(), 0)
require.ErrorIs(t, err, context.Canceled)
}
func TestDetectionMaxResultsHonorsLimit(t *testing.T) {
// One node per shard so each shard gets its own disk (#9369).
activeTopology := buildActiveTopology(t, erasure_coding.TotalShardsCount, []string{"hdd"}, 20, 0)
clusterInfo := &types.ClusterInfo{ActiveTopology: activeTopology}
metrics := buildVolumeMetricsForIDs(3)
results, hasMore, err := Detection(context.Background(), metrics, clusterInfo, NewDefaultConfig(), 1)
require.NoError(t, err)
assert.Len(t, results, 1)
assert.True(t, hasMore)
}
// #9369: 7 servers × 2 physical HDDs must yield 14 distinct (server, disk_id)
// destinations, not 7 destinations doubled up on the same disk.
func TestPlanECDestinationsSpreadsAcrossPhysicalDisks(t *testing.T) {
const numServers = 7
const disksPerServer = 2
activeTopology := topology.NewActiveTopology(10)
nodes := make([]*master_pb.DataNodeInfo, 0, numServers)
for i := 1; i <= numServers; i++ {
volumeInfos := make([]*master_pb.VolumeInformationMessage, 0, disksPerServer)
for d := uint32(0); d < disksPerServer; d++ {
volumeInfos = append(volumeInfos, &master_pb.VolumeInformationMessage{
Id: uint32(i*100 + int(d)),
DiskId: d,
DiskType: "hdd",
})
}
nodes = append(nodes, &master_pb.DataNodeInfo{
Id: fmt.Sprintf("127.0.0.1:%d", 8080+i),
DiskInfos: map[string]*master_pb.DiskInfo{
"hdd": {
DiskId: 0,
VolumeCount: int64(disksPerServer),
MaxVolumeCount: 200,
VolumeInfos: volumeInfos,
},
},
})
}
require.NoError(t, activeTopology.UpdateTopology(&master_pb.TopologyInfo{
DataCenterInfos: []*master_pb.DataCenterInfo{{
Id: "dc1",
RackInfos: []*master_pb.RackInfo{{
Id: "rack1",
DataNodeInfos: nodes,
}},
}},
}))
metric := &types.VolumeHealthMetrics{
VolumeID: 42,
Server: "127.0.0.1:8081",
Size: 100 * 1024 * 1024,
Collection: "",
}
plan, shardsPerPlan, err := planECDestinations(activeTopology, metric, NewDefaultConfig(), nil, erasure_coding.DataShardsCount, erasure_coding.ParityShardsCount)
require.NoError(t, err)
require.NotNil(t, plan)
requireAllShardsPlaced(t, plan, shardsPerPlan)
}
func TestPlanECDestinationsFailsWithInsufficientCapacity(t *testing.T) {
activeTopology := buildActiveTopology(t, 1, []string{"hdd"}, 1, 1)
metric := &types.VolumeHealthMetrics{
VolumeID: 2,
Server: "10.0.0.1:8080",
Size: 10 * 1024 * 1024,
Collection: "",
}
_, _, err := planECDestinations(activeTopology, metric, NewDefaultConfig(), nil, erasure_coding.DataShardsCount, erasure_coding.ParityShardsCount)
require.Error(t, err)
}
// #9586: with fewer single-disk servers than total shards, EC must still plan
// by packing several shards onto a disk (ec.encode's "4,4,3,3" fallback) rather
// than refusing. The reporter has 8 single-disk servers across 3 racks and a
// 10+4 scheme — 8 disks for 14 shards. minTotalDisks (ceil(14/4)=4) keeps any
// disk under parityShards shards, so durability holds.
func TestPlanECDestinationsPacksWhenFewerDisksThanShards(t *testing.T) {
const numServers = 8
// rack3 holds 4 servers, rack1 and rack2 hold 2 each, mirroring the report.
racks := []string{"rack3", "rack3", "rack3", "rack3", "rack1", "rack1", "rack2", "rack2"}
activeTopology := topology.NewActiveTopology(10)
rackNodes := make(map[string][]*master_pb.DataNodeInfo)
for i := 0; i < numServers; i++ {
nodeID := fmt.Sprintf("192.168.1.%d:%d", 143+i/3, 8080+i)
rackNodes[racks[i]] = append(rackNodes[racks[i]], &master_pb.DataNodeInfo{
Id: nodeID,
DiskInfos: map[string]*master_pb.DiskInfo{
"hdd": {
DiskId: 0,
VolumeCount: 1,
MaxVolumeCount: 200,
VolumeInfos: []*master_pb.VolumeInformationMessage{{
Id: uint32(i + 1),
DiskId: 0,
DiskType: "hdd",
}},
},
},
})
}
rackInfos := make([]*master_pb.RackInfo, 0, len(rackNodes))
for _, rackID := range []string{"rack1", "rack2", "rack3"} {
rackInfos = append(rackInfos, &master_pb.RackInfo{Id: rackID, DataNodeInfos: rackNodes[rackID]})
}
require.NoError(t, activeTopology.UpdateTopology(&master_pb.TopologyInfo{
DataCenterInfos: []*master_pb.DataCenterInfo{{Id: "dc1", RackInfos: rackInfos}},
}))
metric := &types.VolumeHealthMetrics{
VolumeID: 4569,
Server: "192.168.1.145:8081",
Size: 100 * 1024 * 1024,
Collection: "",
}
plan, shardsPerPlan, err := planECDestinations(activeTopology, metric, NewDefaultConfig(), nil, erasure_coding.DataShardsCount, erasure_coding.ParityShardsCount)
require.NoError(t, err)
require.NotNil(t, plan)
// Packed onto the available disks: more than one shard per disk but never more
// than the 8 disks, and at least the durability floor of distinct disks.
require.LessOrEqual(t, len(plan.Plans), numServers)
require.GreaterOrEqual(t, len(plan.Plans), (erasure_coding.TotalShardsCount+erasure_coding.ParityShardsCount-1)/erasure_coding.ParityShardsCount)
// createECTargets must cover all 14 shards exactly once, packing onto the
// available disks without any disk exceeding parityShards shards.
targets := createECTargets(plan, shardsPerPlan)
require.Equal(t, len(plan.Plans), len(targets))
seenShards := make(map[uint32]bool)
for _, target := range targets {
require.LessOrEqual(t, len(target.ShardIds), erasure_coding.ParityShardsCount,
"no disk may hold more than parityShards shards, else losing it loses the volume")
for _, shardId := range target.ShardIds {
require.False(t, seenShards[shardId], "shard %d assigned to more than one target", shardId)
seenShards[shardId] = true
}
}
require.Len(t, seenShards, erasure_coding.TotalShardsCount, "every shard must be placed exactly once")
}
// requireAllShardsPlaced asserts every EC shard landed exactly once, on a distinct
// (node,disk) target, with no disk holding more than parityShards shards (so losing
// any one disk cannot lose the volume). shardsPerPlan is parallel to plan.Plans.
func requireAllShardsPlaced(t *testing.T, plan *topology.MultiDestinationPlan, shardsPerPlan [][]uint32) {
t.Helper()
require.Equal(t, len(plan.Plans), len(shardsPerPlan), "one shard list per plan entry")
keys := make(map[string]bool, len(plan.Plans))
seen := make(map[uint32]bool)
for i, p := range plan.Plans {
key := fmt.Sprintf("%s:%d", p.TargetNode, p.TargetDisk)
require.False(t, keys[key], "duplicate (node,disk) target %s", key)
keys[key] = true
require.LessOrEqual(t, len(shardsPerPlan[i]), erasure_coding.ParityShardsCount,
"disk %s holds %d shards, over parityShards", key, len(shardsPerPlan[i]))
for _, s := range shardsPerPlan[i] {
require.False(t, seen[s], "shard %d placed more than once", s)
seen[s] = true
}
}
require.Len(t, seen, erasure_coding.TotalShardsCount, "every shard must be placed exactly once")
}
func buildVolumeMetricsForIDs(count int) []*types.VolumeHealthMetrics {
metrics := make([]*types.VolumeHealthMetrics, 0, count)
now := time.Now()
for id := 1; id <= count; id++ {
metrics = append(metrics, &types.VolumeHealthMetrics{
VolumeID: uint32(id),
Server: "10.0.0.1:8080",
Size: 200 * 1024 * 1024,
Collection: "",
FullnessRatio: 0.96,
LastModified: now.Add(-2 * time.Hour),
Age: 2 * time.Hour,
})
}
return metrics
}
func buildActiveTopology(t *testing.T, nodeCount int, diskTypes []string, maxVolumeCount, usedVolumeCount int64) *topology.ActiveTopology {
t.Helper()
activeTopology := topology.NewActiveTopology(10)
nodes := make([]*master_pb.DataNodeInfo, 0, nodeCount)
for i := 1; i <= nodeCount; i++ {
diskInfos := make(map[string]*master_pb.DiskInfo)
for diskIndex, diskType := range diskTypes {
used := usedVolumeCount
if used > maxVolumeCount {
used = maxVolumeCount
}
volumeInfos := make([]*master_pb.VolumeInformationMessage, 0, 200)
for vid := 1; vid <= 200; vid++ {
volumeInfos = append(volumeInfos, &master_pb.VolumeInformationMessage{
Id: uint32(vid),
Collection: "",
DiskId: uint32(diskIndex),
})
}
diskInfos[diskType] = &master_pb.DiskInfo{
DiskId: uint32(diskIndex),
VolumeCount: used,
MaxVolumeCount: maxVolumeCount,
VolumeInfos: volumeInfos,
}
}
nodes = append(nodes, &master_pb.DataNodeInfo{
Id: fmt.Sprintf("10.0.0.%d:8080", i),
DiskInfos: diskInfos,
})
}
topologyInfo := &master_pb.TopologyInfo{
DataCenterInfos: []*master_pb.DataCenterInfo{
{
Id: "dc1",
RackInfos: []*master_pb.RackInfo{
{
Id: "rack1",
DataNodeInfos: nodes,
},
},
},
},
}
require.NoError(t, activeTopology.UpdateTopology(topologyInfo))
return activeTopology
}