Files
seaweedfs/weed/shell/command_volume_balance_test.go
T
Chris Lu f0d2a0d417 Treat co-located volume servers as one fault domain when balancing and allocating (#9854)
* admin/topology: carry the volume server address on DiskInfo

The planning DiskInfo exposed only the node id, which can be an opaque label rather than ip:port. Record the address too so callers can resolve the physical machine a disk sits on.

* ec.balance: spread a volume's shards across machines, not just nodes

Volume servers sharing a host are one fault domain, but the within-rack spread treated them as independent nodes, so one box could end up holding more shards of a volume than EC can afford to lose. Add a machine (host) tier between rack and node: the within-rack pass spreads each volume across machines, and the global load phase no longer re-concentrates a volume onto a machine it already sits on. Host defaults to the node id, so clusters with one server per host are unchanged.

* ec placement: prefer machines holding fewer of a volume's shards

EC allocation and repair picked the least-loaded node in a rack with no regard for which physical machine it sits on, so a volume's shards could pile onto several servers of one box. Rank candidate nodes by their machine's shard count first, then the node's own. The machine is derived from the volume server address carried on DiskInfo, falling back to the node id, matching how the balancer resolves it.

* volume.balance: don't move a replica onto a machine already holding one

isGoodMove only rejected a move onto the same data node, so two replicas could land on two volume servers of one box and a single machine failure would lose both. Reject a target whose host already holds another replica of the volume. Best-effort: balancing simply skips and tries the next target.

* volume allocation: spread same-rack replicas across machines

PickNodesByWeight filled the same-rack replica picks by weight alone, so replicas could co-locate on one box. Prefer candidates on not-yet-used hosts, falling back when too few distinct machines exist. Data-center and rack tiers have no host, so their ordering is unchanged.

* ec.balance: harden machine spread against re-concentration and capped machines

Two cases where the machine-aware spread could still leave a volume badly placed:

- The global load phase could move a shard of a volume onto a machine that
  already held it, raising that machine's count and undoing the within-rack
  spread (a 4/4/3/3 layout could become 3/5/3/3, past parity for 10+4). Limit
  the load-only fallback to same-machine moves, which leave a machine's count
  unchanged; cross-machine concentration is no longer allowed for load alone.

- The within-rack spread chose a destination machine by free slots alone, so if
  that machine's only nodes were already at the SameRackCount cap it skipped the
  move instead of trying another machine. Require a machine to have a node that
  can actually take the shard before selecting it.

* reduce comments across the machine-affinity change

Trim narration down to the non-obvious why; one terse line where a block was overkill.

* ec.balance: gate machine spread on fault-tolerance feasibility

Spreading a volume evenly across machines only helps when there are enough that
each can stay within EC's parity tolerance (numMachines >= ceil(total/parity)).
With fewer -- or wildly unequal -- machines it can't make a machine loss
survivable anyway, and forcing it fights capacity: e.g. a cluster of 12 volume
servers on one host and 2 on another would have half of every volume crammed onto
the 2-server box. So spread across machines only when it's achievable; otherwise
fall back to per-node spread and let capacity/global balancing decide.

The global load phase applies the same test: it protects a volume's machine spread
(no cross-machine move that raises a machine's count past the source's) only where
that spread is achievable, so heterogeneous clusters still level by fullness.

* ec.balance worker: group servers by host when planning

The worker built its planner topology without recording each server's host, so
automated ec.balance treated ports on one machine as independent nodes and could
concentrate a volume's shards on one physical box. Set the host from the volume
server address, matching the shell path.

* volume.balance worker: don't move a replica onto a machine holding one

The worker compared only node ids, and the replica map dropped the server address,
so it could move replicas onto different ports of one machine. Carry the host on
ReplicaLocation (from the server address) and reject a target whose host already
holds another replica of the volume. Best-effort, matching the shell.

* ec.balance: judge machine-spread feasibility by the rack's shards

The within-rack and global feasibility checks compared the whole volume's shard
count against a rack's machine count, so a rack holding only part of a volume after
cross-rack spreading -- e.g. 7 of a 10+4 volume across 2 machines -- was wrongly
judged infeasible and fell back to node spread, which could pile 6 shards onto one
host, past parity. Gate on the rack's own shard count of the volume instead.

* ec.balance: spread a volume's shards across machines by combined count

EC recovers from any loss within parity regardless of shard type, so what bounds a
machine's exposure is its total shards of the volume, not data and parity
separately. Spreading the two independently let each type's remainder land on the
same machine -- ceil(d/M)+ceil(p/M) can exceed ceil(total/M), e.g. a 5/3 split where
4/4 was achievable, past parity. Balance the combined count in one pass; disk-level
data/parity anti-affinity stays in pickBestDiskOnNode.

* ec.balance: don't let the imbalance threshold skip an over-parity machine

The within-rack spread gated on relative skew ((max-min)/avg > threshold), so a
worker threshold of 0.5 skipped an exactly-50%-skewed layout like 5/4/3 for a 10+4
volume, leaving 5 shards -- past parity -- on one machine. The even cap
(ceil(shards/groups)) is the real bound and the move loop already sheds only what
exceeds it, so drop the threshold gate from the within-rack phase (machine and node):
a balanced rack stays a no-op while any over-cap machine is always fixed.

* ec.balance: keep the imbalance threshold for the node fallback

Dropping the threshold from the whole within-rack phase made the node fallback too
eager: it runs only when machine fault tolerance is unachievable, so it is cosmetic
load distribution that should defer to the global utilization phase. Without the
gate it would, for a one-server-per-host 6/4 split at threshold 0.5, schedule a count
move that worsens utilization balance. Restore the threshold there; machine spreading
keeps bypassing it, since that bound is durability, not cosmetic skew.
2026-06-07 14:14:45 -07:00

467 lines
15 KiB
Go

package shell
import (
"fmt"
"testing"
"github.com/seaweedfs/seaweedfs/weed/storage/types"
"github.com/stretchr/testify/assert"
"github.com/seaweedfs/seaweedfs/weed/pb/master_pb"
"github.com/seaweedfs/seaweedfs/weed/storage/super_block"
)
type testMoveCase struct {
name string
replication string
replicas []*VolumeReplica
sourceLocation location
targetLocation location
expected bool
}
func TestIsGoodMove(t *testing.T) {
var tests = []testMoveCase{
{
name: "test 100 move to wrong data centers",
replication: "100",
replicas: []*VolumeReplica{
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn1"}},
},
{
location: &location{"dc2", "r2", &master_pb.DataNodeInfo{Id: "dn2"}},
},
},
sourceLocation: location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn1"}},
targetLocation: location{"dc2", "r3", &master_pb.DataNodeInfo{Id: "dn3"}},
expected: false,
},
{
name: "test 100 move to spread into proper data centers",
replication: "100",
replicas: []*VolumeReplica{
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn1"}},
},
{
location: &location{"dc1", "r2", &master_pb.DataNodeInfo{Id: "dn2"}},
},
},
sourceLocation: location{"dc1", "r2", &master_pb.DataNodeInfo{Id: "dn2"}},
targetLocation: location{"dc2", "r2", &master_pb.DataNodeInfo{Id: "dn3"}},
expected: true,
},
{
name: "test move to the same node",
replication: "001",
replicas: []*VolumeReplica{
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn1"}},
},
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn2"}},
},
},
sourceLocation: location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn2"}},
targetLocation: location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn2"}},
expected: false,
},
{
name: "test move to the same rack, but existing node",
replication: "001",
replicas: []*VolumeReplica{
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn1"}},
},
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn2"}},
},
},
sourceLocation: location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn2"}},
targetLocation: location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn1"}},
expected: false,
},
{
name: "test move to the same rack, a new node",
replication: "001",
replicas: []*VolumeReplica{
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn1"}},
},
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn2"}},
},
},
sourceLocation: location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn2"}},
targetLocation: location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn3"}},
expected: true,
},
{
name: "test 010 move all to the same rack",
replication: "010",
replicas: []*VolumeReplica{
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn1"}},
},
{
location: &location{"dc1", "r2", &master_pb.DataNodeInfo{Id: "dn2"}},
},
},
sourceLocation: location{"dc1", "r2", &master_pb.DataNodeInfo{Id: "dn2"}},
targetLocation: location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn3"}},
expected: false,
},
{
name: "test 010 move to spread racks",
replication: "010",
replicas: []*VolumeReplica{
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn1"}},
},
{
location: &location{"dc1", "r2", &master_pb.DataNodeInfo{Id: "dn2"}},
},
},
sourceLocation: location{"dc1", "r2", &master_pb.DataNodeInfo{Id: "dn2"}},
targetLocation: location{"dc1", "r3", &master_pb.DataNodeInfo{Id: "dn3"}},
expected: true,
},
{
name: "test 010 move to spread racks",
replication: "010",
replicas: []*VolumeReplica{
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn1"}},
},
{
location: &location{"dc1", "r2", &master_pb.DataNodeInfo{Id: "dn2"}},
},
},
sourceLocation: location{"dc1", "r2", &master_pb.DataNodeInfo{Id: "dn2"}},
targetLocation: location{"dc1", "r2", &master_pb.DataNodeInfo{Id: "dn3"}},
expected: true,
},
{
name: "test 011 switch which rack has more replicas",
replication: "011",
replicas: []*VolumeReplica{
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn1"}},
},
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn2"}},
},
{
location: &location{"dc1", "r2", &master_pb.DataNodeInfo{Id: "dn3"}},
},
},
sourceLocation: location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn1"}},
targetLocation: location{"dc1", "r2", &master_pb.DataNodeInfo{Id: "dn4"}},
expected: true,
},
{
name: "test 011 move the lonely replica to another racks",
replication: "011",
replicas: []*VolumeReplica{
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn1"}},
},
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn2"}},
},
{
location: &location{"dc1", "r2", &master_pb.DataNodeInfo{Id: "dn3"}},
},
},
sourceLocation: location{"dc1", "r2", &master_pb.DataNodeInfo{Id: "dn3"}},
targetLocation: location{"dc1", "r3", &master_pb.DataNodeInfo{Id: "dn4"}},
expected: true,
},
{
name: "test 011 move to wrong racks",
replication: "011",
replicas: []*VolumeReplica{
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn1"}},
},
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn2"}},
},
{
location: &location{"dc1", "r2", &master_pb.DataNodeInfo{Id: "dn3"}},
},
},
sourceLocation: location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn1"}},
targetLocation: location{"dc1", "r3", &master_pb.DataNodeInfo{Id: "dn4"}},
expected: false,
},
{
name: "test 011 move all to the same rack",
replication: "011",
replicas: []*VolumeReplica{
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn1"}},
},
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn2"}},
},
{
location: &location{"dc1", "r2", &master_pb.DataNodeInfo{Id: "dn3"}},
},
},
sourceLocation: location{"dc1", "r2", &master_pb.DataNodeInfo{Id: "dn3"}},
targetLocation: location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "dn4"}},
expected: false,
},
{
// rep 001 allows two copies in one rack; replica-placement alone would
// permit this, but the target shares a host with another replica, so the
// machine anti-affinity must reject it.
name: "test 001 reject move onto a machine already holding a replica",
replication: "001",
replicas: []*VolumeReplica{
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "10.0.0.1:8080"}},
},
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "10.0.0.2:8080"}},
},
},
sourceLocation: location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "10.0.0.2:8080"}},
targetLocation: location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "10.0.0.1:8081"}},
expected: false,
},
{
name: "test 001 allow move onto a different machine in the rack",
replication: "001",
replicas: []*VolumeReplica{
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "10.0.0.1:8080"}},
},
{
location: &location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "10.0.0.2:8080"}},
},
},
sourceLocation: location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "10.0.0.2:8080"}},
targetLocation: location{"dc1", "r1", &master_pb.DataNodeInfo{Id: "10.0.0.3:8080"}},
expected: true,
},
}
for _, tt := range tests {
replicaPlacement, _ := super_block.NewReplicaPlacementFromString(tt.replication)
println("replication:", tt.replication, "expected", tt.expected, "name:", tt.name)
sourceNode := &Node{
info: tt.sourceLocation.dataNode,
dc: tt.sourceLocation.dc,
rack: tt.sourceLocation.rack,
}
targetNode := &Node{
info: tt.targetLocation.dataNode,
dc: tt.targetLocation.dc,
rack: tt.targetLocation.rack,
}
if isGoodMove(replicaPlacement, tt.replicas, sourceNode, targetNode) != tt.expected {
t.Errorf("%s: expect %v move from %v to %s, replication:%v",
tt.name, tt.expected, tt.sourceLocation, tt.targetLocation, tt.replication)
}
}
}
func TestBalance(t *testing.T) {
topologyInfo := parseOutput(topoData)
volumeServers := collectVolumeServersByDcRackNode(topologyInfo, "", "", "")
volumeReplicas, _ := collectVolumeReplicaLocations(topologyInfo)
diskTypes := collectVolumeDiskTypes(topologyInfo)
c := &commandVolumeBalance{}
if err := c.balanceVolumeServers(diskTypes, volumeReplicas, volumeServers, nil, "ALL_COLLECTIONS"); err != nil {
t.Errorf("balance: %v", err)
}
}
// Regression test: a freshly added empty volume server must end up sharing the
// data roughly evenly, not having every volume drained onto it. Before the fix,
// adjustAfterMove never updated the per-disk VolumeInfos that the density-based
// capacity function reads, so the planner saw a stale topology and moved every
// volume from the full node onto the empty one.
func TestBalanceDoesNotDrainOntoOneNode(t *testing.T) {
const mb = 1024 * 1024
volumeSizeLimitMb := uint64(100)
makeNode := func(id string, volumes []*master_pb.VolumeInformationMessage) *Node {
return &Node{
info: &master_pb.DataNodeInfo{
Id: id,
DiskInfos: map[string]*master_pb.DiskInfo{
"": {
MaxVolumeCount: 10,
VolumeCount: int64(len(volumes)),
VolumeInfos: volumes,
},
},
},
dc: "dc1",
rack: "rack1",
}
}
var fullVolumes []*master_pb.VolumeInformationMessage
for id := uint32(1); id <= 6; id++ {
fullVolumes = append(fullVolumes, &master_pb.VolumeInformationMessage{Id: id, Size: 95 * mb})
}
fullNode := makeNode("full", fullVolumes)
emptyNode := makeNode("empty", nil)
nodes := []*Node{fullNode, emptyNode}
volumeReplicas := map[uint32][]*VolumeReplica{}
for _, v := range fullVolumes {
loc := newLocation("dc1", "rack1", fullNode.info)
volumeReplicas[v.Id] = []*VolumeReplica{{location: &loc, info: v}}
}
for _, n := range nodes {
n.selectVolumes(func(v *master_pb.VolumeInformationMessage) bool { return true })
}
if err := balanceSelectedVolume(nil, types.HardDriveType, volumeReplicas, nodes, sortWritableVolumes, volumeSizeLimitMb, false); err != nil {
t.Fatalf("balanceSelectedVolume: %v", err)
}
fullCount := len(fullNode.info.DiskInfos[""].VolumeInfos)
emptyCount := len(emptyNode.info.DiskInfos[""].VolumeInfos)
if fullCount == 0 || emptyCount == 0 {
t.Fatalf("expected volumes spread across both nodes, got full=%d empty=%d", fullCount, emptyCount)
}
if diff := fullCount - emptyCount; diff > 1 || diff < -1 {
t.Fatalf("expected balanced distribution within one volume, got full=%d empty=%d", fullCount, emptyCount)
}
}
func TestVolumeSelection(t *testing.T) {
topologyInfo := parseOutput(topoData)
vids, err := collectVolumeIdsForTierChange(topologyInfo, 1000, types.ToDiskType(types.HddType), "", 20.0, 0)
if err != nil {
t.Errorf("collectVolumeIdsForTierChange: %v", err)
}
assert.Equal(t, 378, len(vids))
}
func TestDeleteEmptySelection(t *testing.T) {
topologyInfo := parseOutput(topoData)
eachDataNode(topologyInfo, func(dc DataCenterId, rack RackId, dn *master_pb.DataNodeInfo) {
for _, diskInfo := range dn.DiskInfos {
for _, v := range diskInfo.VolumeInfos {
if v.Size <= super_block.SuperBlockSize && v.ModifiedAtSecond > 0 {
fmt.Printf("empty volume %d from %s\n", v.Id, dn.Id)
}
}
}
})
}
func TestSplitCSVSet(t *testing.T) {
tests := []struct {
name string
in string
want map[string]bool
}{
{"empty input is empty set (no filter)", "", map[string]bool{}},
{"whitespace only is empty set (no filter)", " ", map[string]bool{}},
{"commas only is empty set (no filter)", ",,,", map[string]bool{}},
{"whitespace and commas only is empty set (no filter)", " , , ", map[string]bool{}},
{"single", "rack1", map[string]bool{"rack1": true}},
{"multi", "rack1,rack2", map[string]bool{"rack1": true, "rack2": true}},
{"trims whitespace", " rack1 , rack2 ", map[string]bool{"rack1": true, "rack2": true}},
{"skips empty items", "rack1,,rack2,", map[string]bool{"rack1": true, "rack2": true}},
}
for _, tc := range tests {
t.Run(tc.name, func(t *testing.T) {
assert.Equal(t, tc.want, splitCSVSet(tc.in))
})
}
}
// Regression test for the rack/node filter that previously used
// strings.Contains, which falsely matched any id that was a substring of the
// user-supplied flag value (e.g. -racks=rack10 also matched rack1).
func TestCollectVolumeServersByDcRackNode_RackFilter(t *testing.T) {
topo := &master_pb.TopologyInfo{
DataCenterInfos: []*master_pb.DataCenterInfo{{
Id: "dc1",
RackInfos: []*master_pb.RackInfo{
{Id: "rack1", DataNodeInfos: []*master_pb.DataNodeInfo{{Id: "n1"}}},
{Id: "rack10", DataNodeInfos: []*master_pb.DataNodeInfo{{Id: "n10"}}},
{Id: "rack2", DataNodeInfos: []*master_pb.DataNodeInfo{{Id: "n2"}}},
},
}},
}
got := collectVolumeServersByDcRackNode(topo, "", "rack10", "")
if assert.Len(t, got, 1, "-racks=rack10 should not match rack1") {
assert.Equal(t, "rack10", got[0].rack)
}
got = collectVolumeServersByDcRackNode(topo, "", "rack1,rack2", "")
racks := map[string]bool{}
for _, n := range got {
racks[n.rack] = true
}
assert.Equal(t, map[string]bool{"rack1": true, "rack2": true}, racks,
"-racks=rack1,rack2 should match exactly those two, not rack10")
}
// Regression test for the -nodes filter, mirroring the rack-filter case.
// Uses bare ids (no :port suffix) so that "node1" is a true substring of
// "node10": under the old strings.Contains implementation,
// -nodes=node10 wrongly included node1 as well.
func TestCollectVolumeServersByDcRackNode_NodeFilter(t *testing.T) {
topo := &master_pb.TopologyInfo{
DataCenterInfos: []*master_pb.DataCenterInfo{{
Id: "dc1",
RackInfos: []*master_pb.RackInfo{{
Id: "rack1",
DataNodeInfos: []*master_pb.DataNodeInfo{
{Id: "node1"},
{Id: "node10"},
{Id: "node2"},
},
}},
}},
}
got := collectVolumeServersByDcRackNode(topo, "", "", "node10")
if assert.Len(t, got, 1, "-nodes=node10 should not match node1") {
assert.Equal(t, "node10", got[0].info.Id)
}
got = collectVolumeServersByDcRackNode(topo, "", "", "node1,node2")
nodes := map[string]bool{}
for _, n := range got {
nodes[n.info.Id] = true
}
assert.Equal(t, map[string]bool{"node1": true, "node2": true}, nodes,
"-nodes=node1,node2 should match exactly those two, not node10")
}