mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-06-13 23:36:45 +03:00
filer-per-path-lock
13931 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
56c51e7f50 |
filer: serialize same-path mutations with a local lock
CreateEntry is a FindEntry-then-write with no lock, so concurrent creates to the same path race: OExcl can admit two creators, and a conditional write has no atomic check-then-act. Add a per-path exclusive lock (util.LockTable, which evicts idle keys so it stays bounded) in the CreateEntry handler so the read and the write are atomic on this filer. Once callers route a key's writes to its owner filer, this local lock is the authoritative serialization point. AppendToEntry moves from the distributed lock to the same per-path lock. |
||
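The per-path lock described in this commit can be pictured with a small sketch. This is not the util.LockTable implementation: `pathLocks`, `lockEntry`, `store`, and `createExclusive` are hypothetical names, and eviction is assumed to be a simple reference count that deletes a key once nobody holds or waits on it.

```go
package main

import (
	"fmt"
	"sync"
)

// lockEntry is one path's mutex plus a count of holders and waiters.
type lockEntry struct {
	mu   sync.Mutex
	refs int
}

// pathLocks is a hypothetical per-path lock table. Idle keys are deleted
// on release, so the map only holds paths with in-flight operations.
type pathLocks struct {
	mu    sync.Mutex
	locks map[string]*lockEntry
}

func newPathLocks() *pathLocks {
	return &pathLocks{locks: make(map[string]*lockEntry)}
}

// Lock blocks until the caller owns path and returns the release function.
func (p *pathLocks) Lock(path string) (unlock func()) {
	p.mu.Lock()
	e, ok := p.locks[path]
	if !ok {
		e = &lockEntry{}
		p.locks[path] = e
	}
	e.refs++
	p.mu.Unlock()

	e.mu.Lock()
	return func() {
		e.mu.Unlock()
		p.mu.Lock()
		e.refs--
		if e.refs == 0 {
			delete(p.locks, path) // evict the idle key
		}
		p.mu.Unlock()
	}
}

// store is safe for concurrent use, but find followed by insert is not atomic.
type store struct {
	mu      sync.Mutex
	entries map[string]string
}

func (s *store) find(path string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.entries[path]
	return ok
}

func (s *store) insert(path, owner string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.entries[path] = owner
}

// createExclusive holds the per-path lock across the check and the write.
func createExclusive(l *pathLocks, s *store, path, owner string) bool {
	unlock := l.Lock(path)
	defer unlock()
	if s.find(path) {
		return false
	}
	s.insert(path, owner)
	return true
}

// raceCreates runs n concurrent exclusive creates of one path and reports
// how many succeeded and how many lock entries remain afterwards.
func raceCreates(n int) (wins, remaining int) {
	l := newPathLocks()
	s := &store{entries: make(map[string]string)}
	var wg sync.WaitGroup
	var winMu sync.Mutex
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if createExclusive(l, s, "/dir/file", fmt.Sprint(id)) {
				winMu.Lock()
				wins++
				winMu.Unlock()
			}
		}(i)
	}
	wg.Wait()
	return wins, len(l.locks)
}

func main() {
	wins, remaining := raceCreates(50)
	fmt.Println(wins, remaining) // prints "1 0"
}
```

Without the lock around `find` and `insert`, several goroutines could pass the existence check before any of them writes, which is the race the commit message describes.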
|
|
d1665750e1 |
Delete the EC placement package now that encode/repair use ecbalancer.Place (#9624)
Delete the EC placement package and the dead encode planner code
Now that encode (and repair) place via ecbalancer.Place, nothing uses the erasure_coding/placement package or the EC-only planner machinery (ecPlacementPlanner, diskInfosToCandidates, calculateECScoreCandidate, distributeECShards) in detection.go. Removes them and the package, along with the planner-direct unit tests. |
||
|
|
0566fbd552 |
EC encode: place shards via ecbalancer.Place + configurable replica placement (#9623)
* Add shared super_block.ResolveReplicaPlacement; use it in ec_balance
* Add ecbalancer.FromActiveTopology snapshot constructor for EC encode/repair
* Add ecbalancer.Place greenfield/repair placement core (strict + durability-first)
* topology: add GetEffectiveAvailableEcShardSlots; FromActiveTopology uses shard-granular free slots
GetDisksWithEffectiveCapacity flattens reserved shard slots into volume slots via
integer truncation, so an in-flight EC task reserving a non-multiple-of-
DataShardsCount number of shards was lost from the snapshot and freeSlots was
over-reported. GetEffectiveAvailableEcShardSlots subtracts the full reservation
impact at shard granularity.
* ecbalancer.Place: reject nodes without a free disk of the requested type
FromActiveTopology keeps all disk types in the snapshot, so an SSD-only request
could be routed to a node with only HDD capacity (pickBestDiskOnNode then returns
disk 0 on the wrong tier). Filter rack/node selection to those with a free disk
of the requested type.
* ecbalancer.Place: enforce ReplicaPlacement DiffDataCenterCount (per-DC shard cap)
* ecbalancer: enforce DiffDataCenterCount in balance (cross-DC phase + cross-rack DC cap)
Adds a cross-DC corrective phase that drains data centers holding more than
DiffDataCenterCount shards of a volume, and a per-DC cap on cross-rack move
targets. Both are no-ops when DiffDataCenterCount is unset, so balance output is
unchanged for non-DC placements.
* topology: ratio-aware EC shard slots and provisional empty-disk slot
GetEffectiveAvailableEcShardSlots now takes the target collection's data-shard
count, so a 4+2 volume's larger shards are not over-counted at 10 per volume slot;
and it keeps the one provisional slot for freshly started empty servers that
report max=0, matching getEffectiveAvailableCapacityUnsafe. FromActiveTopology
threads the ratio through.
* ecbalancer.Place: explicit disk-type filter signal (fix HDD vs any ambiguity)
HardDriveType normalizes to "", which collided with "" meaning any disk. Add
Constraints.FilterDiskType and normalize both sides so a hdd request matches disks
reported as "" and never leaks to SSD, while filter=false still means any.
* ecbalancer: add clearShardAccounting for repair snapshot reconciliation
Clears one disk's copy of a shard from per-domain accounting and recomputes the
node-level union (preserving a kept copy on another disk of the same node), without
crediting capacity. Repair uses it to drop to-be-deleted copies before placing
missing shards.
* ecbalancer: don't cap cross-DC target racks when DiffRackCount is unset
len(racks)+1 wrongly limited each target rack (3 in a 2-rack cluster), so draining
a DC could stop short of the DiffDataCenterCount cap. Use MaxShardCount+1 as the
effectively-unlimited default.
* topology/ecbalancer: ratio-correct EC capacity accounting
Reservation shard slots (default ShardsPerVolumeSlot units) are now converted to
the target ratio before subtracting, and existing EC shards are charged by size
(targetDataShards/shardDataShards) so a 2+1 shard isn't counted as one 10+4 slot.
Per-shard ratio lookup is behind shardDataShards (OSS uses the standard ratio).
* ecbalancer.Place: candidate tiering and eligible-rack caps
Adds a per-disk eligibility/preference abstraction so Place supports:
- preferred-tag whole-plan retry (try disks carrying the earliest tags first,
widen to all only if a tier cannot place every shard; reports
SpilledOutsidePreferredTags),
- soft disk-type spill via DiskTypePolicy (Any/Prefer/Require): Prefer fills the
preferred type then spills, reporting SpilledToOtherDiskType; Require filters,
- even per-rack caps that divide by racks holding an eligible disk, so a tiered
cluster (e.g. SSDs in 2 of 4 racks) isn't capped impossibly low.
Disk tags carried via Node.AddDiskTags + FromActiveTopology.
* ecbalancer: export ClearShardAccounting for repair snapshot reconciliation
* ecbalancer: address review feedback (ratio rounding, bitmap walk, same-DC moves)
- topology/ecbalancer: round shard-reservation and existing-shard footprint up
when converting to target-ratio shard slots, so a sub-slot reservation is not
truncated to zero and free capacity is not overstated for low-data-shard
layouts (targetDataShards < ds).
- erasure_coding: add ShardBits.All iterator and use it across the balancer,
cross-DC phase, and placement scoring instead of scanning 0..MaxShardCount and
probing Has on every id.
- ecbalancer: allow same-DC cross-rack moves when a DC already sits at its
DiffDataCenterCount cap; a same-DC move leaves the DC total unchanged. Add a
regression test that fails without the guard.
- ecbalancer cross-DC phase: pick targets via the eligible-aware
pickNodeInRackEligible/pickBestDiskEligible helpers so the disk-type filter is
honored and a 0 disk id is not mistaken for a valid selection.
* ecbalancer: test ecShardSlotsOnDisk fractional round-up
Cover the mixed-ratio path (targetDataShards < existing data shards) so a
shard's fractional footprint is never floored to zero and free capacity is not
overstated. Exercises the round-up via the targetDataShards parameter; OSS uses
the standard ratio at runtime while the enterprise build hits it with real
per-volume ratios.
* ecbalancer: assert node B rack in TestFromActiveTopology
* ecbalancer: split Destination into separate DataCenter and bare Rack
Replace the composite "dc:rack" Rack field on Destination with separate
DataCenter and bare Rack values, matching topology.DiskInfo and the worker-task
convention. Callers (and tests) read the data center directly instead of parsing
the composite with strings.SplitN.
* shell ec.balance: use utilization-based global balancing (parity with worker)
The shell's global rebalance phase balanced by raw shard count; switch it to
fractional fullness (shards/capacity), as the worker already does. On uniform
capacity the two agree; on heterogeneous capacity it fills nodes proportionally
instead of driving small-capacity nodes toward full.
Updates the heterogeneous-capacity regression test to assert even fullness
(~equal shards/capacity per node) rather than even shard count.
* ecbalancer: bounded-proportional per-DC shard spread
DiffDataCenterCount was enforced only as a ceiling (drain-to-cap), which could
leave a within-cap-but-lopsided DC distribution under a loose cap (e.g. 10/4 of 14
with cap=10). Now the cross-DC phase, the cross-rack DC guard, and Place all target
boundedMaxPerDC = min(DiffDataCenterCount, max(ceil(total/numDCs), parityShards)):
shards spread proportionally across DCs, but no tighter than the durability floor
(once each DC holds <= parityShards a DC loss is recoverable, so further spreading
only adds cross-DC/WAN traffic). No-op when DiffDataCenterCount is 0; identical to
before when the cap is the binding constraint.
* ecbalancer: drop DiffDataCenterCount enforcement for EC placement
The 1-byte volume ReplicaPlacement packs xyz into x*100+y*10+z<=255, so the DC
digit can only be 0-2 -- far too small to be a meaningful per-DC EC shard cap (a
cap of 1-2 would demand 7-14 DCs for a 10+4 volume). It's volume replica-placement,
not an EC spec. Removes the cross-DC balance phase, the DC guard in the cross-rack
phase, and the per-DC cap in Place (and the just-added bounded-proportional logic);
EC relies on the RP-independent rack/node even spread instead. Rack/node caps
(DiffRackCount/SameRackCount) are unchanged. Per-domain EC caps are left for a real
EC placement spec.
* ecbalancer: enforce per-disk durability cap; symmetric reserve/release
Place now refuses to put more than parityShards shards of a volume on a single
disk (pickBestDiskEligible skips a disk once it holds parityShards of the volume,
a hard cap not relaxed even in durability-first). Previously Place assigned by
free capacity, so a skewed near-full cluster could pile >parityShards onto one
disk -> losing it loses the volume; only distinct-disk count was checked. This
covers encode and repair (both route through Place); the caller skips/leaves the
volume rather than minting an unrecoverable layout.
Also makes reserveShard decrement freeSlots unconditionally, symmetric with
releaseShard's unconditional increment (the old guarded decrement could credit a
phantom slot on release if a shard were ever reserved onto a full disk).
* ecbalancer: add Topology.ReleaseVolumeShards (clear + credit) for greenfield encode
Releases all of a volume's shards from the snapshot and credits the freed disk
capacity, so a greenfield encode can plan as if stale EC shards from a prior failed
attempt are gone. Safe to credit because the encode task deletes stale shards
(cleanupStaleEcShards) before distributing the new ones. Distinct from
ClearShardAccounting (repair), which does not credit.
* ecbalancer: ReleaseVolumeShards credits node freeSlots, not just disks
releaseShard only increments per-disk freeSlots, but rack capacity is summed from
node freeSlots (buildRacks) and node freeSlots gates node eligibility. Crediting
only disks left a node/rack looking full after releasing stale shards, so a
greenfield encode still couldn't use the freed capacity. Now credits the node by
the total disk-slots freed.
* ecbalancer: correct PlacementMode docs (encode uses durability-first)
PlaceStrict was labeled '(encode)' but encode uses PlaceDurabilityFirst. Clarify
that durability-first is used by both encode and repair, reports relaxations in
PlaceResult.Relaxed, and never relaxes the per-disk durability cap.
* ecbalancer: treat SameRackCount as a direct per-node shard cap
The 3rd ReplicaPlacement digit now caps shards per node at exactly the digit
value, matching how DiffRackCount (2nd digit) caps per rack, instead of allowing
digit+1 per node. This makes the per-rack and per-node caps consistent and
matches the documented "digits cap EC shards per rack and per node" semantics;
e.g. 011 now means at most one shard per rack and one per node.
* EC encode: place shards via ecbalancer.Place + configurable replica placement
Encode now plans destinations through the shared ecbalancer.Place policy
(durability-first: prefers the source disk type and honors replica placement /
caps / anti-affinity, relaxing rather than failing when capacity is tight) instead
of the EC-only placement planner. Targets and capacity reservations use Place's
actual per-disk shard assignment, not a round-robin guess; cross-volume in-cycle
capacity is tracked by ActiveTopology's pending task, so the cached planner is no
longer consulted. Adds a configurable replica_placement (proto field 6 + worker
form + reader) that overrides the master default replication.
The placement-package planner code is left in place (now unused) and removed in a
follow-up that drops the package.
* EC encode: drop unused dataShards param from createECTargets
Addresses review feedback: after switching to Place's per-disk shardsPerPlan
assignment, createECTargets no longer needs the data-shard count.
* EC encode: fix packed-target validation, greenfield stale-shard accounting, RP docs
- Validate counts distinct shard ids across targets, not target rows, so packed
plans (fewer (node,disk) targets than shards) aren't rejected.
- planECDestinations releases the volume's stale EC shards from the snapshot before
Place (ReleaseVolumeShards), crediting their capacity. The encode task deletes
stale shards before distributing, so a retry on tight capacity no longer fails
planning by counting shards that are about to be removed.
- replica_placement config/form help no longer claims a data-center limit (the DC
digit is ignored for EC); detection logs a warning when a DC digit is set.
* EC encode: surface relaxed placement; mark replica_placement best-effort
Encode places with PlaceDurabilityFirst (the chosen lenient behavior), which can
relax caps/anti-affinity/replica-placement to avoid deferring. That was silent
(only disk-type/tag spills were logged). Now logs PlaceResult.Relaxed so a tight
replica placement isn't weakened unnoticed, and the config/form help states the
rack/node caps are best-effort during encode (enforced by rebalancing).
* EC encode: key per-disk shard grouping by struct, not formatted string
planECDestinations grouped destinations using a fmt.Sprintf("%s:%d") map key
per shard; use a {node,diskID} struct key and pre-size the map/slice to the
shard count to drop the per-shard string allocation.
|
||
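The ratio round-up mentioned in the capacity-accounting bullets can be illustrated with a ceiling division. This is a sketch under an assumed reading of the message (a shard's footprint in target-ratio slots is n * targetDataShards / shardDataShards, rounded up); `shardSlotsInTargetRatio` is a hypothetical name and not the ecShardSlotsOnDisk code.

```go
package main

import "fmt"

// shardSlotsInTargetRatio converts n shards of a volume encoded with
// shardDataShards data shards into slots sized for targetDataShards.
// It rounds up so a fractional footprint is never floored to zero.
// Hypothetical helper; the formula is an assumption drawn from the message.
func shardSlotsInTargetRatio(n, shardDataShards, targetDataShards int) int {
	if n <= 0 || shardDataShards <= 0 || targetDataShards <= 0 {
		return 0
	}
	return (n*targetDataShards + shardDataShards - 1) / shardDataShards
}

func main() {
	// One 10+4 shard measured in 4+2 slots: 0.4 of a slot, charged as 1.
	fmt.Println(shardSlotsInTargetRatio(1, 10, 4)) // prints 1
	// Three 2+1 shards measured in 10+4 slots: each is five slots wide.
	fmt.Println(shardSlotsInTargetRatio(3, 2, 10)) // prints 15
	// Same ratio on both sides: one shard is one slot.
	fmt.Println(shardSlotsInTargetRatio(14, 10, 10)) // prints 14
}
```

Plain integer division would return 0 for the first case, which is the overstated free capacity the review-feedback bullet guards against.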
|
|
d4e39b499b |
EC placement: shared replica-placement resolver, snapshot + Place core, capacity fixes, tiering (#9621)
* Add shared super_block.ResolveReplicaPlacement; use it in ec_balance * Add ecbalancer.FromActiveTopology snapshot constructor for EC encode/repair * Add ecbalancer.Place greenfield/repair placement core (strict + durability-first) * topology: add GetEffectiveAvailableEcShardSlots; FromActiveTopology uses shard-granular free slots GetDisksWithEffectiveCapacity flattens reserved shard slots into volume slots via integer truncation, so an in-flight EC task reserving a non-multiple-of- DataShardsCount number of shards was lost from the snapshot and freeSlots was over-reported. GetEffectiveAvailableEcShardSlots subtracts the full reservation impact at shard granularity. * ecbalancer.Place: reject nodes without a free disk of the requested type FromActiveTopology keeps all disk types in the snapshot, so an SSD-only request could be routed to a node with only HDD capacity (pickBestDiskOnNode then returns disk 0 on the wrong tier). Filter rack/node selection to those with a free disk of the requested type. * ecbalancer.Place: enforce ReplicaPlacement DiffDataCenterCount (per-DC shard cap) * ecbalancer: enforce DiffDataCenterCount in balance (cross-DC phase + cross-rack DC cap) Adds a cross-DC corrective phase that drains data centers holding more than DiffDataCenterCount shards of a volume, and a per-DC cap on cross-rack move targets. Both are no-ops when DiffDataCenterCount is unset, so balance output is unchanged for non-DC placements. * topology: ratio-aware EC shard slots and provisional empty-disk slot GetEffectiveAvailableEcShardSlots now takes the target collection's data-shard count, so a 4+2 volume's larger shards are not over-counted at 10 per volume slot; and it keeps the one provisional slot for freshly started empty servers that report max=0, matching getEffectiveAvailableCapacityUnsafe. FromActiveTopology threads the ratio through. 
* ecbalancer.Place: explicit disk-type filter signal (fix HDD vs any ambiguity) HardDriveType normalizes to "", which collided with "" meaning any disk. Add Constraints.FilterDiskType and normalize both sides so a hdd request matches disks reported as "" and never leaks to SSD, while filter=false still means any. * ecbalancer: add clearShardAccounting for repair snapshot reconciliation Clears one disk's copy of a shard from per-domain accounting and recomputes the node-level union (preserving a kept copy on another disk of the same node), without crediting capacity. Repair uses it to drop to-be-deleted copies before placing missing shards. * ecbalancer: don't cap cross-DC target racks when DiffRackCount is unset len(racks)+1 wrongly limited each target rack (3 in a 2-rack cluster), so draining a DC could stop short of the DiffDataCenterCount cap. Use MaxShardCount+1 as the effectively-unlimited default. * topology/ecbalancer: ratio-correct EC capacity accounting Reservation shard slots (default ShardsPerVolumeSlot units) are now converted to the target ratio before subtracting, and existing EC shards are charged by size (targetDataShards/shardDataShards) so a 2+1 shard isn't counted as one 10+4 slot. Per-shard ratio lookup is behind shardDataShards (OSS uses the standard ratio). * ecbalancer.Place: candidate tiering and eligible-rack caps Adds a per-disk eligibility/preference abstraction so Place supports: - preferred-tag whole-plan retry (try disks carrying the earliest tags first, widen to all only if a tier cannot place every shard; reports SpilledOutsidePreferredTags), - soft disk-type spill via DiskTypePolicy (Any/Prefer/Require): Prefer fills the preferred type then spills, reporting SpilledToOtherDiskType; Require filters, - even per-rack caps that divide by racks holding an eligible disk, so a tiered cluster (e.g. SSDs in 2 of 4 racks) isn't capped impossibly low. Disk tags carried via Node.AddDiskTags + FromActiveTopology. 
* ecbalancer: export ClearShardAccounting for repair snapshot reconciliation * ecbalancer: address review feedback (ratio rounding, bitmap walk, same-DC moves) - topology/ecbalancer: round shard-reservation and existing-shard footprint up when converting to target-ratio shard slots, so a sub-slot reservation is not truncated to zero and free capacity is not overstated for low-data-shard layouts (targetDataShards < ds). - erasure_coding: add ShardBits.All iterator and use it across the balancer, cross-DC phase, and placement scoring instead of scanning 0..MaxShardCount and probing Has on every id. - ecbalancer: allow same-DC cross-rack moves when a DC already sits at its DiffDataCenterCount cap; a same-DC move leaves the DC total unchanged. Add a regression test that fails without the guard. - ecbalancer cross-DC phase: pick targets via the eligible-aware pickNodeInRackEligible/pickBestDiskEligible helpers so the disk-type filter is honored and a 0 disk id is not mistaken for a valid selection. * ecbalancer: test ecShardSlotsOnDisk fractional round-up Cover the mixed-ratio path (targetDataShards < existing data shards) so a shard's fractional footprint is never floored to zero and free capacity is not overstated. Exercises the round-up via the targetDataShards parameter; OSS uses the standard ratio at runtime while the enterprise build hits it with real per-volume ratios. * ecbalancer: assert node B rack in TestFromActiveTopology * ecbalancer: split Destination into separate DataCenter and bare Rack Replace the composite "dc:rack" Rack field on Destination with separate DataCenter and bare Rack values, matching topology.DiskInfo and the worker-task convention. Callers (and tests) read the data center directly instead of parsing the composite with strings.SplitN. 
* shell ec.balance: use utilization-based global balancing (parity with worker) The shell's global rebalance phase balanced by raw shard count; switch it to fractional fullness (shards/capacity), as the worker already does. On uniform capacity the two agree; on heterogeneous capacity it fills nodes proportionally instead of driving small-capacity nodes toward full. Updates the heterogeneous-capacity regression test to assert even fullness (~equal shards/capacity per node) rather than even shard count. * ecbalancer: bounded-proportional per-DC shard spread DiffDataCenterCount was enforced only as a ceiling (drain-to-cap), which could leave a within-cap-but-lopsided DC distribution under a loose cap (e.g. 10/4 of 14 with cap=10). Now the cross-DC phase, the cross-rack DC guard, and Place all target boundedMaxPerDC = min(DiffDataCenterCount, max(ceil(total/numDCs), parityShards)): shards spread proportionally across DCs, but no tighter than the durability floor (once each DC holds <= parityShards a DC loss is recoverable, so further spreading only adds cross-DC/WAN traffic). No-op when DiffDataCenterCount is 0; identical to before when the cap is the binding constraint. * ecbalancer: drop DiffDataCenterCount enforcement for EC placement The 1-byte volume ReplicaPlacement packs xyz into x*100+y*10+z<=255, so the DC digit can only be 0-2 -- far too small to be a meaningful per-DC EC shard cap (a cap of 1-2 would demand 7-14 DCs for a 10+4 volume). It's volume replica-placement, not an EC spec. Removes the cross-DC balance phase, the DC guard in the cross-rack phase, and the per-DC cap in Place (and the just-added bounded-proportional logic); EC relies on the RP-independent rack/node even spread instead. Rack/node caps (DiffRackCount/SameRackCount) are unchanged. Per-domain EC caps are left for a real EC placement spec. 
* ecbalancer: enforce per-disk durability cap; symmetric reserve/release Place now refuses to put more than parityShards shards of a volume on a single disk (pickBestDiskEligible skips a disk once it holds parityShards of the volume, a hard cap not relaxed even in durability-first). Previously Place assigned by free capacity, so a skewed near-full cluster could pile >parityShards onto one disk -> losing it loses the volume; only distinct-disk count was checked. This covers encode and repair (both route through Place); the caller skips/leaves the volume rather than minting an unrecoverable layout. Also makes reserveShard decrement freeSlots unconditionally, symmetric with releaseShard's unconditional increment (the old guarded decrement could credit a phantom slot on release if a shard were ever reserved onto a full disk). * ecbalancer: add Topology.ReleaseVolumeShards (clear + credit) for greenfield encode Releases all of a volume's shards from the snapshot and credits the freed disk capacity, so a greenfield encode can plan as if stale EC shards from a prior failed attempt are gone. Safe to credit because the encode task deletes stale shards (cleanupStaleEcShards) before distributing the new ones. Distinct from ClearShardAccounting (repair), which does not credit. * ecbalancer: ReleaseVolumeShards credits node freeSlots, not just disks releaseShard only increments per-disk freeSlots, but rack capacity is summed from node freeSlots (buildRacks) and node freeSlots gates node eligibility. Crediting only disks left a node/rack looking full after releasing stale shards, so a greenfield encode still couldn't use the freed capacity. Now credits the node by the total disk-slots freed. * ecbalancer: correct PlacementMode docs (encode uses durability-first) PlaceStrict was labeled '(encode)' but encode uses PlaceDurabilityFirst. 
Clarify that durability-first is used by both encode and repair, reports relaxations in PlaceResult.Relaxed, and never relaxes the per-disk durability cap. * ecbalancer: treat SameRackCount as a direct per-node shard cap The 3rd ReplicaPlacement digit now caps shards per node at exactly the digit value, matching how DiffRackCount (2nd digit) caps per rack, instead of allowing digit+1 per node. This makes the per-rack and per-node caps consistent and matches the documented "digits cap EC shards per rack and per node" semantics; e.g. 011 now means at most one shard per rack and one per node. |
||
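The one-byte ReplicaPlacement arithmetic behind dropping the DC cap, and the direct per-rack and per-node caps, can be sketched as follows. `replicaPlacement`, `parsePlacement`, and `ecCaps` are hypothetical names, not the super_block API, and treating a 0 digit as "no cap" is an assumption based on the "unset" wording in the message.

```go
package main

import (
	"errors"
	"fmt"
)

// replicaPlacement holds the three digits xyz of a placement string.
type replicaPlacement struct {
	diffDataCenter int // x
	diffRack       int // y
	sameRack       int // z
}

// parsePlacement accepts exactly three digits whose packed value
// x*100 + y*10 + z fits in one byte. Hypothetical parser for illustration.
func parsePlacement(s string) (replicaPlacement, error) {
	if len(s) != 3 {
		return replicaPlacement{}, errors.New("want three digits")
	}
	var d [3]int
	for i := 0; i < 3; i++ {
		if s[i] < '0' || s[i] > '9' {
			return replicaPlacement{}, errors.New("non-digit in placement")
		}
		d[i] = int(s[i] - '0')
	}
	if d[0]*100+d[1]*10+d[2] > 255 {
		return replicaPlacement{}, errors.New("placement does not fit in one byte")
	}
	return replicaPlacement{d[0], d[1], d[2]}, nil
}

// ecCaps returns the per-rack and per-node EC shard caps taken directly
// from the 2nd and 3rd digits. The DC digit is ignored. Zero is assumed
// to mean "no cap".
func (rp replicaPlacement) ecCaps() (perRack, perNode int) {
	return rp.diffRack, rp.sameRack
}

func main() {
	rp, _ := parsePlacement("011")
	fmt.Println(rp.ecCaps()) // prints "1 1"

	_, err := parsePlacement("300")
	fmt.Println(err != nil) // prints "true": the DC digit cannot exceed 2
}
```

Because the packed value tops out at 255, the first digit is limited to 0-2, which is why the message calls it too small to act as a per-DC shard cap.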
|
|
adfd731bb8 | 4.28 | ||
|
|
917a87928c |
fix(s3api/list): cancel ListEntries stream in hasChildren (#9617)
* fix(s3api/list): cancel ListEntries stream in hasChildren
* fix(s3api): use filer_pb.List in hasChildren
filer_pb.List already wraps the ListEntries stream in a cancellable context, so the single-entry probe needs no separate helper or manual context plumbing to avoid the leaked gRPC stream goroutine.
* fix(s3api): propagate request context into hasChildren
Thread r.Context() through listFilerEntries and hasChildren so the implicit-directory probe cancels when the client disconnects, instead of running on context.Background().
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com> |
||
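The leak this fix addresses can be shown with a stand-in for a server stream. The sketch below does not use filer_pb or gRPC: `listEntries`, `hasChildren`, and `probe` are hypothetical, and a goroutine feeding a channel plays the role of the stream. Reading one entry and then cancelling the context lets the producer exit instead of blocking forever on its next send.

```go
package main

import (
	"context"
	"fmt"
)

// listEntries is a hypothetical stand-in for a server stream. It sends
// names until they run out or ctx is cancelled, then closes done.
func listEntries(ctx context.Context, names []string) (<-chan string, <-chan struct{}) {
	out := make(chan string)
	done := make(chan struct{})
	go func() {
		defer close(done)
		defer close(out)
		for _, n := range names {
			select {
			case out <- n:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out, done
}

// hasChildren reads at most one entry. The deferred cancel stops the
// producer once the probe returns, so its goroutine does not leak.
func hasChildren(parent context.Context, names []string) (bool, <-chan struct{}) {
	ctx, cancel := context.WithCancel(parent)
	defer cancel()
	entries, done := listEntries(ctx, names)
	_, found := <-entries
	return found, done
}

// probe runs hasChildren and waits for the producer goroutine to exit.
func probe(names []string) bool {
	found, done := hasChildren(context.Background(), names)
	<-done // returns only because the stream was cancelled or drained
	return found
}

func main() {
	fmt.Println(probe([]string{"a", "b", "c"})) // prints "true"
	fmt.Println(probe(nil))                     // prints "false"
}
```

Passing the request's context as `parent` instead of `context.Background()` would additionally stop the probe when the client disconnects, which is what the third bullet describes.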
|
|
8fa769f29a |
feat(helm): add volume.rust toggle to run the Rust volume server (#9618)
feat(helm): add volume.rust to run the Rust volume server
When set, the volume statefulset execs /usr/bin/weed-volume instead of 'weed volume', dropping the Go-only -logtostderr/-logdir/-v flags and the 'volume' subcommand. All shared flags and extraArgs carry over unchanged. |
||
|
|
7c635c4508 |
fix(docker): restore executable bit on prebuilt weed-volume (#9616)
GitHub Actions artifacts drop the executable bit, so the pre-built Rust volume server lands in the image as 0644 and 'weed-volume' fails to start with 'exec: Permission denied'. chmod it 0755 after copying. |
||
|
|
fbdcec1cba |
fix(s3): list empty directories as directory markers (#9615)
* fix(s3): list empty directories as directory markers
A real but empty directory created out of band (mount, mkdir, filer API) carries no MIME, so it was hidden from S3 listings. hadoop-aws getFileStatus probes LIST prefix=dir/&delimiter=/ and reads an empty result as a missing path, which breaks Spark's eventLog.dir when it points at an empty directory. Surface such directories as directory markers, matching directories created via PutObject with a trailing "/". Emptiness comes from the recursion result, and the marker MIME is set only on the in-memory listing entry, so empty directories stay eligible for empty-folder cleanup.
* fix(s3): only surface empty directory markers for explicit dir probes
Restrict the empty-directory marker to a trailing-slash prefix probe (prefix=dir/), the pattern hadoop-aws getFileStatus uses. Plain listings are left as before, so an empty directory left behind by deleted objects (e.g. after lifecycle expiration) is no longer shown as a phantom key. |
||
|
|
0accff0e4a |
fix(ec): log EC destination planning failures at v=2
The maintenance scanner tries to plan EC destinations for every eligible volume, so clusters that can't place EC logged a warning per volume every cycle. The min-node gate already skips clusters with fewer nodes than parity shards; demote the rest to V(2). |
||
|
|
9021225591 |
master: accept volume-server Ping targets on follower masters (#9614)
cluster.check asks every master to ping every volume server, but the Ping gate validated volume-server targets only against the local topology. Only the leader receives volume-server heartbeats, so a follower's topology is empty and every probe through it failed with "unknown ping target ... of type volumeServer". Fall back to the volume-server set the master learns over its own MasterClient subscription to the leader, the same source the filer gate already trusts. The anti-SSRF intent is preserved: Ping still only dials recognized cluster members. |
||
|
|
5b42287c22 |
fix(storage): surface stat error on zero-size idx scrub, mirror to rust (#9612)
fix(storage): harden zero-size idx scrub and mirror to rust
When a zero-size .idx is found, openIndex stats the backing .dat through v.DataBackend: wrap that GetStat failure with %w, fix the indices typo, and guard both openIndex and scrubVolumeData against a nil DataBackend (closed or remote-only volumes) instead of panicking. Add rust scrub tests for empty (superblock-only .dat, zero-size .idx) and healthy volumes, keeping the volume server in parity with the go zero-size scrub handling. |
||
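The two hardening steps in this commit, wrapping the stat error with %w and guarding against a nil backend, follow a common Go pattern. The sketch below uses hypothetical types (`backend`, `volume`, `failing`) rather than the real storage package.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

// backend is a hypothetical stand-in for a volume's data backend.
type backend interface {
	Stat() (size int64, err error)
}

type volume struct {
	data backend // nil for closed or remote-only volumes in this sketch
}

var errNoBackend = errors.New("volume has no data backend")

// datSize returns the .dat size, guarding the nil backend and wrapping
// the stat failure so callers can still match the underlying cause.
func (v *volume) datSize() (int64, error) {
	if v.data == nil {
		return 0, errNoBackend
	}
	size, err := v.data.Stat()
	if err != nil {
		return 0, fmt.Errorf("stat .dat for zero-size index: %w", err)
	}
	return size, nil
}

// failing is a backend whose Stat always reports a missing file.
type failing struct{}

func (failing) Stat() (int64, error) { return 0, fs.ErrNotExist }

// isNotExist reports whether err wraps fs.ErrNotExist.
func isNotExist(err error) bool { return errors.Is(err, fs.ErrNotExist) }

func main() {
	_, err := (&volume{data: failing{}}).datSize()
	fmt.Println(isNotExist(err)) // prints "true": the cause survives wrapping

	_, err = (&volume{}).datSize()
	fmt.Println(err == errNoBackend) // prints "true": no nil dereference
}
```

Formatting the error with %v instead of %w would keep the text but drop the chain, so `errors.Is` would no longer find the cause.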
|
|
3392493f0a |
test(volume): fix race in TestReplicatedUploadSucceedsImmediatelyAfterAllocate (#9613)
test(volume): wait for master to register both replicas before replicated upload
TestReplicatedUploadSucceedsImmediatelyAfterAllocate allocated the volume on
both nodes via direct AllocateVolume gRPC calls, then uploaded immediately. The
master only learns about replica locations through volume-server heartbeats,
which lag behind those direct gRPC calls, so the replicated write could look up
the master before the second replica was registered and fail with a 500
("replicating operations [1] is less than volume replication copy count [2]").
In production a client obtains its fid from the master assign flow, which
guarantees the master already knows every replica. The test crafts the fid by
hand, bypassing that guarantee, so wait until the master reports both replicas
before uploading.
|
||
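The test fix amounts to polling until the master reports the expected replica count. A minimal sketch of that wait loop follows. `waitFor`, `waitForReplicas`, and the lookup callback are hypothetical and do not reflect the actual test helpers.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitFor polls cond every interval until it returns true or timeout elapses.
func waitFor(timeout, interval time.Duration, cond func() bool) error {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for condition")
		}
		time.Sleep(interval)
	}
}

// waitForReplicas blocks until lookup reports at least want replica
// locations. lookup stands in for a master volume-lookup call.
func waitForReplicas(lookup func() int, want, timeoutMillis int) error {
	timeout := time.Duration(timeoutMillis) * time.Millisecond
	return waitFor(timeout, time.Millisecond, func() bool {
		return lookup() >= want
	})
}

func main() {
	// Simulate heartbeats arriving: each lookup sees one more replica.
	seen := 0
	err := waitForReplicas(func() int { seen++; return seen }, 2, 1000)
	fmt.Println(err, seen) // prints "<nil> 2"
}
```

Only after this wait succeeds would the test issue the replicated upload with its hand-crafted fid.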
|
|
d82b3a8d6a |
refactor(s3): drop unused source path in copy ETag check
ETagEntry derives the tag from chunks/Md5/remote-etag, never the entry path, so the conditional-copy check no longer builds a bogus FullPath. |
||
|
|
39e9294907 |
Have volume scrubs account for zero-sized volumes. (#9609)
Fixes scrubbing for pre-allocated volumes with zero-size indices by reworking the validation code to allow zero-size indices on zero-size volumes. |
||
|
|
3825035f07 |
test(ec): deterministically populate disks before multi-disk EC balance check (#9611)
The disk-spread assertion raced volume growth and heartbeats. volume.grow -count is a writable-target topup, not add-N, and swallows partial-failure errors, so one grow could leave a node's data on a single disk; ec.encode then piles all that node's shards there and ec.balance can't spread them. Retry grow on under-spread nodes until the master topology shows every node holding volumes on at least two physical disks, then encode. |
||
|
|
83b7ea5e7b |
fix(s3): keep server-side copy data in the bucket collection (#9607)
* fix(s3): keep server-side copy data in the bucket collection
UploadPartCopy and SSE-C CopyObject assigned destination volumes against r.URL.Path, the S3 request URI. The filer derives a bucket's collection only when the assign path sits under its buckets folder, so an S3 URI routed copied bytes to the default collection instead of the destination bucket's. Assign against the destination's real filer path.
* refactor(s3): centralize copy-part path and thread dstPath into SSE-C copy
Extract copyPartLocation so the fast path and writeEmptyCopyPart share one definition of the .uploads/<id>/<n>_copy.part location. Pass the destination filer path into copyChunksWithSSEC instead of re-deriving it from the request, and thread it through key rotation so re-encrypt copies also assign in the destination bucket's collection. |
||
|
|
eae8f33db5 |
fix(filersink): return lock-free snapshot from ActiveTransfers (#9604)
ChunkTransferStatus embeds a sync.RWMutex, so returning a slice of it made callers copy the lock when ranging. Split out a copyable ChunkTransferSnapshot holding the data fields and return that instead. |
||
|
|
2c2b2d4d3e |
chore(skiplist): remove unused NameList/NameBatch implementation (#9603)
NameList, NameBatch and their serde were an earlier in-memory directory batch implementation. The redis3 filer store uses its own ItemList backed by Redis sorted sets, so these types had no production callers (NameList only via its own test, LoadNameList none at all). Drop them and the now-orphaned NameBatchData proto message, regenerating skiplist.pb.go with the repo-standard protoc-gen-go v1.36.6. |
||
|
|
cd15ae1395 |
fix(ec): bring ec.encode worker and EC/volume helpers to parity with shell (#9599)
* refactor(volume): extract replica sync/select into shared volume_replica package
Move the volume replica reconciliation helpers (status, union builder, SyncAndSelectBestReplica, ReadNeedleMeta) out of the shell into a new weed/storage/volume_replica package so both the shell (ec.encode, volume.tier.move, volume.check.disk) and the EC encode worker can reuse them. No behavior change.
* fix(ec): bring ec.encode worker to parity with the shell
- Sync replicas and encode the most-complete one (via the shared volume_replica.SyncAndSelectBestReplica) instead of a possibly-stale replica, marking all replicas readonly first. Prevents silent data loss when a stale replica is encoded and the originals deleted.
- Skip remote/tiered volumes in detection (shell ec.encode excludes them).
- Min-node safety gate: refuse to encode when cluster nodes < parity shards.
- Align default thresholds with the shell (fullness 0.95, quiet 1h).
* fix(vacuum): plugin path honors min_volume_age_seconds override
deriveVacuumConfig hard-coded MinVolumeAgeSeconds=0, dropping any configured value. Read it from worker config (default 0, matching the shell/master vacuum which has no age gate) so an explicit override is honored.
* address review feedback
- config.go: align GetConfigSpec schema defaults (quiet_for_seconds=3600, fullness_ratio=0.95) with the runtime defaults so UI/bootstrap flows match the shell (coderabbitai).
- ec_task.go: roll back readonly when markReplicasReadonly fails partway, so already-marked replicas don't stay readonly (coderabbitai).
- volume_replica: pass the caller's replica statuses into buildUnionReplica instead of re-fetching them, and skip the per-needle ReadNeedleMeta RPC when the source replica is read-only (gemini-code-assist).
* test(plugin_workers/ec): make fixtures eligible under the new defaults
The default EC encode thresholds were raised to match the shell (fullness 0.95, quiet 1h), but the plugin-worker integration fixtures still used 90%-full / 10-minute-old volumes, so detection found no eligible volumes and the tests failed in CI. Bump the eligible fixtures to 96% full and 2h old. |
||
|
|
3f6410fdc3 |
fix(redis3): prevent filer crash from inconsistent skiplist ends (#9602)
* fix(redis3): prevent filer crash from inconsistent skiplist ends DeleteByKey updated the two ends asymmetrically: the start side decided whether to clear StartLevels[index] by comparing a cached reference key, while the end side cleared EndLevels[index] structurally. The redis3 ItemList re-keys a node while keeping its id, so that cached key drifts. When such a node was the only one at a level and got deleted, the stale key comparison left StartLevels dangling while EndLevels was cleared to nil. The next InsertByKey then dereferenced a nil EndLevels[0] and took down the whole filer during a rename or delete. Match the deleted node by its unique id so both ends stay consistent, and guard each end in InsertByKey so an already-corrupted skiplist persisted in Redis self-heals instead of crashing on load. * fix(redis3): propagate errors from WriteName node split The case 2.3 split path returned nil instead of the error in seven branches. Because the split runs a multi-step sequence (DeleteByKey on the skiplist, ItemAdd, redis range-delete, ItemAdd), a swallowed failure let WriteName report success while the skiplist was half-updated, which the caller then persisted - silently corrupting the directory listing and setting up the very inconsistent-ends state that crashes the filer. |
||
|
|
87fdea5330 |
fix(admin): carry filer addresses as ServerAddress in plugin cluster context (#9600)
The plugin cluster context forwarded filers as gRPC-only addresses (host:grpcPort). The admin-script worker stored that in ShellOptions.FilerAddress, whose shell commands re-derive the gRPC port via ToGrpcAddress() and re-add the +10000 offset, dialing a non-existent host:28888. Carry filers in pb.ServerAddress form (host:httpPort.grpcPort) and let each consumer convert when it dials: the admin shell uses it verbatim, while the s3_lifecycle and iceberg workers collapse it to a gRPC address. Rename the proto field filer_grpc_addresses -> filer_addresses so the name matches the content. |
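The double-offset failure described above can be reproduced with a small sketch. `toGrpcAddress` below is a hypothetical stand-in written for illustration, not SeaweedFS's `pb.ServerAddress` code; it assumes only the conventions stated in the message (`host:httpPort.grpcPort`, and a gRPC port of `httpPort+10000` when no explicit gRPC port is given).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// toGrpcAddress is a hypothetical helper. It accepts "host:httpPort.grpcPort"
// or "host:port"; in the second form it assumes the gRPC port is port+10000.
func toGrpcAddress(addr string) (string, error) {
	colon := strings.LastIndex(addr, ":")
	if colon < 0 {
		return "", fmt.Errorf("missing port in %q", addr)
	}
	host, ports := addr[:colon], addr[colon+1:]
	if dot := strings.LastIndex(ports, "."); dot >= 0 {
		// Explicit gRPC port: use it as-is.
		grpcPort, err := strconv.Atoi(ports[dot+1:])
		if err != nil {
			return "", err
		}
		return fmt.Sprintf("%s:%d", host, grpcPort), nil
	}
	port, err := strconv.Atoi(ports)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s:%d", host, port+10000), nil
}

func main() {
	// Full form: the explicit gRPC port survives the conversion.
	good, _ := toGrpcAddress("filer1:8888.18888")
	fmt.Println(good) // filer1:18888

	// A gRPC-only address stored where the full form is expected gets the
	// offset applied a second time.
	bad, _ := toGrpcAddress("filer1:18888")
	fmt.Println(bad) // filer1:28888
}
```

Carrying the full form end to end means the conversion runs once, at the point where a consumer dials.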
||
|
|
303c2be38d |
feat(fix): rebuild lost EC index (.ecx) and .vif from local shards (#9596)
weed fix -ecx reconstructs the .dat from the local data shards, scans the needles, and writes a fresh ascending-sorted .ecx containing only live entries — the same on-disk index WriteSortedFileFromIdx emits at encode time. When the .vif is also missing it is regenerated from the inferred EC ratio (flags > .vif > shard-count inference / 10+4) and the .dat size recovered from the scan. When some data shards are missing but at least dataShards shards survive, the missing shards are first reconstructed from the survivors via Reed-Solomon, so a partial shard set is repaired too. Also makes erasure_coding.WriteDatFile de-stripe using len(shardFileNames) instead of the DataShardsCount constant, so the caller's actual data-shard count is honored (behavior-preserving for the default 10, and fixing the existing caller that already passes ECContext.DataShards). This recovers an EC volume whose sealed index was lost from every node while enough shards survive, a state neither ec.rebuild nor ec.decode can repair because both require an existing .ecx. Flags: -ecx, -ecDataShards, -ecParityShards. Run with the volume server stopped. |
||
|
|
9b9fdb5b76 |
fix(s3): sync IAM policies to advanced IAM Manager policy engine (#9577)
* fix(s3): sync IAM policies to advanced IAM Manager policy engine * test(s3): add unit tests for PutPolicy/DeletePolicy IAM Manager sync * fix(s3): flush loaded policies in SetIAMIntegration, drop extra reload Sync the policies already loaded from the credential store into the IAM Manager's engine from SetIAMIntegration itself, instead of re-running a full LoadS3ApiConfigurationFromCredentialManager after setup. This covers both startup orderings without a second filer round-trip or racing the async loader goroutine: if the load won, the policies are in memory to push; if SetIAMIntegration won, the load's own sync runs afterward. Move the runtime PutPolicy/DeletePolicy sync out of the iam.m write lock so the per-request auth RLock path isn't blocked by the policy recompile. * fix(s3): serialize IAM manager policy resync to avoid stale snapshots SyncRuntimePolicies replaces the manager's full policy set, so applying a policy view captured before a later mutation can resurrect a deleted policy or drop a new one. Funnel every path (PutPolicy, DeletePolicy, SetIAMIntegration, and the credential-manager load) through a single resyncIAMManagerPolicies that serializes on a dedicated mutex and reads iam.policies fresh at apply time, so the live map always wins regardless of interleaving. The load now installs the config into iam.policies before resyncing, closing the window where the manager held policies the map didn't yet have. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com> |
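The serialization idea in the last sub-commit can be sketched as follows. Every type and method name here (`syncer`, `resync`, `put`, `remove`) is hypothetical and chosen for illustration; the sketch only assumes what the message states: a full-set replacement, a dedicated resync mutex, and a fresh read of the live map at apply time.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// syncer pairs a live policy map with an engine whose policy set is replaced
// wholesale. All names are hypothetical stand-ins, not SeaweedFS types.
type syncer struct {
	mu       sync.RWMutex      // guards policies (the live map)
	policies map[string]string // policy name -> document

	resyncMu sync.Mutex // dedicated mutex: serializes every resync
	engine   []string   // names currently installed in the engine
}

func newSyncer() *syncer {
	return &syncer{policies: make(map[string]string)}
}

// resync installs the live map into the engine. The map is read while
// resyncMu is held, so a view captured earlier can never be applied late.
func (s *syncer) resync() {
	s.resyncMu.Lock()
	defer s.resyncMu.Unlock()

	s.mu.RLock()
	names := make([]string, 0, len(s.policies))
	for name := range s.policies {
		names = append(names, name)
	}
	s.mu.RUnlock()

	sort.Strings(names)
	s.engine = names
}

func (s *syncer) put(name, doc string) {
	s.mu.Lock()
	s.policies[name] = doc
	s.mu.Unlock()
	s.resync() // outside the write lock, so readers are not blocked
}

func (s *syncer) remove(name string) {
	s.mu.Lock()
	delete(s.policies, name)
	s.mu.Unlock()
	s.resync()
}

func main() {
	s := newSyncer()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			name := fmt.Sprintf("p%d", i)
			s.put(name, "{}")
			if i%2 == 1 {
				s.remove(name)
			}
		}(i)
	}
	wg.Wait()
	fmt.Println(s.engine) // [p0 p2 p4 p6]
}
```

Because the resync that runs last always reads the map after the last mutation, the engine ends up matching the live map whatever the interleaving.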
||
|
|
7e4691f2dc |
test(ec): make multi-disk EC balance disk-spread assertion deterministic (#9595)
test(ec): pre-populate disks so multi-disk EC balance spread is deterministic The multidisk shard-loss regression asserts EC shards spread across more than one disk per node, but that only holds for disks the balancer can see. The master enumerates a physical disk only when it already holds a volume or EC shard — an empty disk leaves no trace, since heartbeats aggregate capacity per disk type, not per physical disk. So whether the post-encode balance spread shards depended on how the master happened to place the filler volumes across disks, which varies by environment: the test passed locally (shards on 5 disks) but produced one disk per node in CI and failed the "got 3 disks across 3 nodes" assertion. Grow a few volumes on each server before encoding so every physical disk holds a volume and is visible to the balancer. The volume server places each new volume on its least-loaded disk, so a handful of grows touches every disk, making the spread deterministic. The assertion still has teeth: it counts disks holding shard files, so a balancer that failed to spread would still collapse to one disk per node. |
||
|
|
391f543ff2 |
fix(ec): correct multi-disk disk counting and EC balance shard attribution (#9594)
* fix(shell): count physical disks in cluster.status on multi-disk nodes
The master keys DataNodeInfo.DiskInfos by disk type, so several same-type
physical disks on one node collapse into a single DiskInfo entry. cluster.status
(printClusterInfo) and CountTopologyResources counted len(DiskInfos), reporting
one disk per node instead of the real physical disk count, while volume.list and
the admin ActiveTopology already split per physical disk.
Route both counters through DiskInfo.SplitByPhysicalDisk so a node with N
same-type disks reports N. Cosmetic/diagnostic only; placement already uses the
per-disk activeDisk map.
* fix(ec): attribute EC balance source disk per shard and reject same-node moves
On multi-disk nodes the EC balance worker built a node-level view that kept only
the first physical disk id per (node, volume), so a move of a shard living on a
different disk reported the wrong source disk. That source disk drives the
per-disk capacity reservation, so the wrong disk drifts the capacity model the
EC placement planner relies on. Track shards per physical disk and resolve the
actual source disk for every emitted move (dedup, cross-rack, within-rack,
global), keeping the per-disk view consistent as simulated moves are applied.
Also close a data-loss trap: VolumeEcShardsDelete is node-wide (it removes the
shard from every disk on the node) and copyAndMountShard skips the copy when
source and target addresses match, so a same-node move would erase a shard it
never copied. isDedupPhase now requires the same node AND disk, and Validate /
Execute reject same-node cross-disk moves outright.
* fix(ec): spread EC balance moves across destination disks
Port the shell ec.balance pickBestDiskOnNode heuristic to the EC balance
worker so a moved shard is placed on a good physical disk instead of always
deferring to the volume server (target disk 0). The detection now builds a
per-physical-disk view of each node (free slots split from the node total, exact
EC shard count, disk type, discovered from both regular volumes and EC shards)
and, for each cross-rack, within-rack, and global move, chooses the destination
disk by ascending score:
- fewer total EC shards on the disk,
- far fewer shards of the same volume on the disk (spread a volume's shards
across disks for fault tolerance), and
- data/parity anti-affinity (a data shard avoids disks holding the volume's
parity shards and vice versa).
Planned placements are reserved on the in-memory model during a run so multiple
shards moved to the same node spread across its disks rather than piling on one.
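A minimal sketch of ascending-score disk selection with in-run reservation follows. The weights and the names `disk`, `score`, and `place` are assumptions made for illustration; the message gives the three criteria but not their numeric weights, so this is not the pickBestDiskOnNode implementation.

```go
package main

import "fmt"

// disk is a hypothetical per-physical-disk view.
type disk struct {
	id          int
	totalShards int // all EC shards on the disk
	volData     int // data shards of the volume being placed
	volParity   int // parity shards of the volume being placed
}

// Hypothetical weights: same-volume shards cost far more than unrelated ones.
const (
	sameVolumeWeight   = 10
	antiAffinityWeight = 5
)

// score is lower for better destinations.
func score(d disk, isData bool) int {
	s := d.totalShards + sameVolumeWeight*(d.volData+d.volParity)
	if isData {
		s += antiAffinityWeight * d.volParity // data avoids parity
	} else {
		s += antiAffinityWeight * d.volData // parity avoids data
	}
	return s
}

// place picks the lowest-scoring disk (ties go to the lowest index), reserves
// the placement on the in-memory model, and returns the chosen disk id.
// disks must be non-empty.
func place(disks []disk, isData bool) int {
	best := 0
	for i := 1; i < len(disks); i++ {
		if score(disks[i], isData) < score(disks[best], isData) {
			best = i
		}
	}
	disks[best].totalShards++
	if isData {
		disks[best].volData++
	} else {
		disks[best].volParity++
	}
	return disks[best].id
}

func main() {
	disks := []disk{{id: 0}, {id: 1}}
	var placed []int
	for i := 0; i < 3; i++ {
		placed = append(placed, place(disks, true))
	}
	fmt.Println(placed) // [0 1 0]
}
```

Without the reservation step, all three shards would see two empty disks and land on disk 0.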
* fix(ec): bring EC balance worker to parity with shell ec.balance
The worker's cross-rack and within-rack balancing balanced shards by total
count; the shell balances data and parity shards separately with anti-affinity
and honors replica placement. Port that logic so the automatic balancer makes
the same fault-tolerance-aware decisions as the manual command:
- Cross-rack and within-rack now run a two-pass balance: data shards spread
first, then parity shards spread while avoiding racks/nodes that already hold
the volume's data shards (anti-affinity), mirroring doBalanceEcShardsAcrossRacks
and doBalanceEcShardsWithinOneRack.
- Optional replica placement: a new replica_placement config (e.g. "020")
constrains shards per rack (DiffRackCount) and per node (SameRackCount); empty
keeps the previous even-spread behavior.
- The data/parity boundary is resolved from a per-collection EC ratio (standard
10+4 here), replacing the previously hardcoded constant at the call sites.
Selection is deterministic (sorted keys) to keep behavior reproducible.
* refactor(ec): extract shared ecbalancer package for shell and worker
The EC shard balancing policy was duplicated between the shell ec.balance
command and the admin EC balance worker, and the two had drifted (multi-disk
handling, data/parity anti-affinity, replica placement). Extract the policy into
a new pure package, weed/storage/erasure_coding/ecbalancer, that both callers
share so it cannot drift again.
- ecbalancer.Plan(topology, options) runs the full policy (dedup, cross-rack and
within-rack data/parity two-pass with anti-affinity, global per-rack balance,
and diversity-aware disk selection) over a caller-built Topology snapshot and
returns the shard Moves. It depends only on erasure_coding and super_block.
- The worker builds the Topology from the master topology and turns Moves into
task proposals; the shell builds it from its EcNode model and executes Moves
via the existing move/delete RPCs. Per-collection EC ratio resolution stays in
each caller (passed as Options.Ratio).
- Options expose the two genuine policy differences: GlobalUtilizationBased
(worker balances by fractional fullness; shell by raw count) and
GlobalMaxMovesPerRack (worker moves incrementally across cycles; shell drains
in one pass).
The shell keeps pickBestDiskOnNode for the evacuate command. Policy tests move to
the ecbalancer package; the shell and worker keep their adapter/execution tests.
* fix(ec): restore parallelism and per-type/full-range balancing after ecbalancer refactor
Address regressions and gaps from the ecbalancer extraction:
- Shell ec.balance honors -maxParallelization again: planned moves run phase by
phase (preserving cross-phase dependencies) with bounded concurrency within a
phase. Apply mode does only the RPCs concurrently; dry-run stays sequential and
updates the in-memory model for inspection.
- Rack and node balancing gate on per-type spread (data and parity separately)
instead of combined totals, so a data/parity skew is corrected even when the
per-rack/node totals are even.
- Global rack balancing iterates the full shard-id space (MaxShardCount) so
custom EC ratios with more than the standard total are candidates.
- Cross-rack planning decrements the destination node's free slots per planned
move, so limited-capacity targets are no longer over-planned.
* fix(ec): make EC dedup keeper deterministic and capacity-aware
When a shard is duplicated across nodes, keep the copy on the node with the most
free slots and delete the duplicates from the more-constrained nodes, relieving
capacity pressure where it is tightest. Tie-break on node id so the choice is
deterministic. This unifies the shell and worker (the shell previously kept the
least-free node, an incidental default) on the more sensible behavior.
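The keeper rule can be sketched in a few lines. `holder` and `pickKeeper` are hypothetical names for illustration; the sketch assumes only the stated rule (most free slots wins, ties broken by node id).

```go
package main

import (
	"fmt"
	"sort"
)

// holder is a hypothetical record of one node holding a copy of a shard.
type holder struct {
	nodeID    string
	freeSlots int
}

// pickKeeper returns the copy to keep and the copies to delete.
// holders must contain at least one element.
func pickKeeper(holders []holder) (keep holder, drop []holder) {
	sorted := append([]holder(nil), holders...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].freeSlots != sorted[j].freeSlots {
			return sorted[i].freeSlots > sorted[j].freeSlots // most free first
		}
		return sorted[i].nodeID < sorted[j].nodeID // deterministic tie-break
	})
	return sorted[0], sorted[1:]
}

func main() {
	keep, drop := pickKeeper([]holder{{"n3", 5}, {"n1", 5}, {"n2", 2}})
	fmt.Println(keep.nodeID, len(drop)) // n1 2
}
```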
* fix(ec): restore global volume-diversity and per-volume move serialization
Two more behaviors lost in the ecbalancer refactor:
- Global rack balancing again prefers moving a shard of a volume the destination
does not hold at all before adding another shard of an already-present volume
(two-pass, mirroring the old balanceEcRack), keeping each volume's shards
spread across nodes.
- Shell apply-mode execution serializes a single volume's moves within a phase
while still running different volumes in parallel, so concurrent moves of the
same volume cannot race on its shared .ecx/.ecj/.vif sidecar files.
* fix(ec): key EC balance shards by (collection, volume id)
A numeric volume id can be reused across collections, and EC identity is
(collection, vid) (see store_ec_attach_reservation.go). The ecbalancer keyed
Node.shards by vid alone, so volumes sharing an id across collections merged into
one entry — letting dedup delete a "duplicate" that is actually a different
collection's shard, and letting moves act across collections. Key shards by
(collection, vid) throughout so each volume stays distinct.
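To illustrate why the composite key matters, here is a small sketch with hypothetical types (`volumeKey`, `shardRecord`, `groupShards`); it is not the ecbalancer's actual data model.

```go
package main

import "fmt"

// volumeKey identifies an EC volume by collection and numeric id.
type volumeKey struct {
	collection string
	vid        uint32
}

// shardRecord is a hypothetical flattened shard report.
type shardRecord struct {
	collection string
	vid        uint32
	shardID    uint
}

// groupShards keys shards by (collection, vid) so that volumes sharing a
// numeric id across collections stay distinct.
func groupShards(records []shardRecord) map[volumeKey][]uint {
	out := make(map[volumeKey][]uint)
	for _, r := range records {
		k := volumeKey{r.collection, r.vid}
		out[k] = append(out[k], r.shardID)
	}
	return out
}

func main() {
	records := []shardRecord{
		{"photos", 7, 0},
		{"logs", 7, 0}, // same numeric id, different collection
	}
	byVid := make(map[uint32][]uint)
	for _, r := range records {
		byVid[r.vid] = append(byVid[r.vid], r.shardID)
	}
	// Keyed by vid alone, shard 0 looks duplicated; keyed by the pair it does not.
	fmt.Println(len(byVid), len(groupShards(records))) // 1 2
}
```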
* fix(ec): credit freed capacity from dedup before later balance phases
Dedup deletions are simulated only by applyMovesToTopology, which cleared shard
bits but did not return the freed disk/node/rack slots. Later phases reject
destinations with no free slots, so a slot opened by dedup could not be reused in
the same Plan/ec.balance run. applyMovesToTopology now credits the freed
disk/node/rack capacity for dedup moves (non-dedup moves still rely on the inline
accounting their phase already did).
* test(ec): add multi-disk EC balance integration test
Cover issue 9593 end-to-end at the unit level the old tests missed: build the
master's actual multi-disk wire format (same-type disks collapsed into one
DiskInfo, real DiskId only in per-shard records), run it through a real
ActiveTopology and the Detection entry point, then replay the planned moves with
the volume server's true semantics (node-wide VolumeEcShardsDelete) and assert no
EC shard is ever lost. Covers a balanced spread, a one-node-concentrated volume,
and a multi-rack spread, and asserts moves are safe (no same-node cross-disk),
correctly attributed to the source disk, and redistribute concentrated volumes
across both other racks and multiple destination disks.
* fix(ec): aggregate per-disk EC shards when verifying multi-disk volumes
collectEcNodeShardsInfo overwrote its per-server entry for each EcShardInfo of a
volume. A multi-disk node reports one EcShardInfo per physical disk holding shards
of the volume, so only the last disk's shards survived — the node looked like it
was missing shards it actually had. This made ec.encode's pre-delete verification
(and ec.decode) under-count volumes whose shards are spread across disks on one
server, falsely aborting the encode on multi-disk clusters. Union the per-disk
shard sets per server instead.
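The overwrite-versus-union difference is easy to see with shard bitmasks. The types below are hypothetical simplifications for illustration (one `uint32` mask per report), not the real EcShardInfo message.

```go
package main

import (
	"fmt"
	"math/bits"
)

// ecShardInfo is a hypothetical per-disk report: one entry per physical disk
// that holds shards of the volume.
type ecShardInfo struct {
	volumeID  uint32
	diskID    uint32
	shardBits uint32 // bit i set means shard i is present
}

// unionShards merges the per-disk reports of one server into one mask per volume.
func unionShards(infos []ecShardInfo) map[uint32]uint32 {
	out := make(map[uint32]uint32)
	for _, info := range infos {
		// Using `=` here instead of `|=` would keep only the last disk's shards.
		out[info.volumeID] |= info.shardBits
	}
	return out
}

func main() {
	infos := []ecShardInfo{
		{volumeID: 7, diskID: 0, shardBits: 0b0000111},
		{volumeID: 7, diskID: 1, shardBits: 0b0111000},
	}
	fmt.Println(bits.OnesCount32(unionShards(infos)[7])) // 6
}
```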
Also make verifyEcShardsBeforeDelete poll briefly: shard relocations reach the
master via volume-server heartbeats, so a freshly distributed shard set may not be
fully visible the instant the balance returns. Retry before concluding the set is
incomplete; genuine loss still fails after the retries are exhausted.
* test(ec): end-to-end multi-disk EC balance shard-loss regression
Start a real cluster of multi-disk volume servers (3 servers x 4 disks),
EC-encode a volume, run ec.balance, and assert hard invariants the prior
integration tests only logged: after encode all 14 shards exist, ec.balance loses
no shard, shards span more than one disk per node, and cluster.status counts
physical disks (not one per node). This reproduces issue 9593 end to end and would
have caught the multi-disk shard-aggregation bug fixed alongside it.
* fix(ec): bring EC balance worker/plugin path to parity with shell
- Per-volume serialization and phase order: key the plugin proposal dedupe by
(collection, volume) instead of (volume, shard, source), so the scheduler runs
only one of a volume's moves at a time (within a run and against in-flight jobs).
Concurrent same-volume moves raced on the volume's .ecx/.ecj/.vif sidecars; and
because the planner emits a volume's moves in phase order, they now execute in
order across detection cycles, matching the shell.
- disk_type "hdd": normalize via ToDiskType (hdd -> "" HardDriveType) while keeping
a "filter requested" flag, so disk_type=hdd matches the empty-keyed HDD disks
instead of nothing; apply the canonical type to planner options and move params.
- Replica placement: expose shard_replica_placement in the admin config form and
read it into the worker config, mirroring ec.balance -shardReplicaPlacement.
* test(ec): rename worker in-process test (not a real integration test)
The worker-package multi-disk tests build a fake master topology and simulate
move execution; they are not real-cluster integration tests. Rename
integration_test.go -> multidisk_detection_test.go and drop the Integration
prefix so 'integration' refers only to the real-cluster E2Es in test/erasure_coding.
* ci(ec): remove redundant ec-integration workflow
ec-integration.yml duplicated EC Integration Tests under the same workflow name
but ran only 'go test ec_integration_test.go' (one file), so it never ran new
test files (e.g. multidisk_shardloss_test.go) and was a strict, path-filtered
subset of ec-integration-tests.yml, which already runs 'go test -v' over the whole
test/erasure_coding package on every push/PR.
* fix(ec): worker falls back to master default replication for EC balance
For strict parity with the shell, the EC balance worker now uses the master's
configured default replication as the replica-placement fallback when no explicit
shard_replica_placement is set, instead of always defaulting to even spread.
The maintenance scanner reads it via GetMasterConfiguration each cycle and passes
it through ClusterInfo.DefaultReplicaPlacement; detection resolves the constraint
(explicit config wins, else master default, else none) in resolveReplicaPlacement.
A zero-replication default (the common 000 case) still means even spread, so the
common configuration is unchanged.
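The precedence described above can be sketched as a small function. This `resolvePlacement` is a hypothetical string-based simplification, not the resolveReplicaPlacement in the repository; in particular, treating an explicit all-zero value as "no constraint" is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// resolvePlacement returns the replica placement to enforce and whether any
// constraint applies. Explicit config wins, else the master default, else none.
// An empty or all-zero result means even spread (assumption for explicit zeros).
func resolvePlacement(explicit, masterDefault string) (string, bool) {
	chosen := explicit
	if chosen == "" {
		chosen = masterDefault
	}
	if strings.Trim(chosen, "0") == "" {
		return "", false
	}
	return chosen, true
}

func main() {
	fmt.Println(resolvePlacement("020", "010")) // 020 true
	fmt.Println(resolvePlacement("", "010"))    // 010 true
	fmt.Println(resolvePlacement("", "000"))    //  false
}
```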
* fix(ec): plugin path populates master default replication too
The plugin worker built ClusterInfo with only ActiveTopology, so the master
default replication fallback added for the maintenance path never reached
plugin-driven EC balance detection — empty shard_replica_placement still meant
even spread there. Fetch the master default via GetMasterConfiguration (new
pluginworker.FetchDefaultReplicaPlacement) and set ClusterInfo.DefaultReplicaPlacement
so both detection paths resolve replica placement identically to the shell.
* docs(ec): empty shard replica placement uses master default, not even spread
The EC balance config text (admin plugin form, legacy form help text, and
the struct/proto field comments) still said an empty shard_replica_placement
spreads evenly. The runtime resolves empty to the master default replication
(resolveReplicaPlacement), matching shell ec.balance, with even spread only
when that default is empty or zero. Update the text to match and regenerate
worker_pb for the proto comment change.
|
||
|
|
afcc491517 |
test: fix fd leak in the Samba DLM handoff test (promote xfail checks) (#9592)
test(mount): fix fd leak that deadlocked the DLM handoff check The cross-mount handoff checks held a file open on mount 2 via fd 9 to keep the distributed lock, then started the SMB writer in a background subshell. The subshell inherited fd 9, so the SMB writer kept the file open and waited on a lock held by its own descriptor; the put could never complete, and the two checks were parked as expected-fail. Close fd 9 in the subshell (9>&-) so the writer does not hold the file. The waiter now acquires the freed lock within ~1s, so the two checks are real assertions and the xfail machinery is gone. |
||
|
|
a5d0e4a735 |
Samba-over-FUSE integration test and distributed-lock handoff fixes (#9590)
* test(mount): add Samba over FUSE integration test Export a SeaweedFS FUSE mount over SMB with smbd and drive it with smbclient: file round-trips, directories, rename, large-file chunking, recursive upload, cross-protocol consistency, and deletes. A second -dlm mount adds locking coverage: POSIX fcntl byte-range locks, distributed-lock write coordination, and concurrent writers. The two cross-mount handoff checks currently fail and pin a known limitation - the distributed lock is released on FUSE Release, which the kernel can delay under contention. Runs locally via test/samba/run.sh or in Docker via the compose file; wired into CI as samba-integration.yml. * fix(cluster): release distributed lock without racing the renewal goroutine Stop() closed the cancel channel, slept 10ms, then unlocked using renewToken. A renewal in flight during that window rotates the token on the server, so the unlock may be sent with a stale token, fail with a mismatch, and leave the lock to linger until its TTL expires - stalling other mounts waiting to write the same file. Wait for the renewal goroutine to exit before unlocking. The channel close also makes the renewToken read happen-after the last renewal. * fix(cluster): poll for distributed lock acquisition without exponential backoff A mount waiting to write a file held by another mount acquired through util.RetryUntil, whose backoff grows to several seconds. Once the holder released, the waiter could sleep that long before retrying, stretching the cross-mount handoff past client timeouts. Poll at the steady ~1s cadence AttemptToLock already enforces instead. * test(mount): tighten Samba harness and mark the DLM handoff checks xfail Run the workflow for weed/cluster changes, fail fast when the filer or smbd port never opens, and fold the recursive mput result into its own assertion so it cannot false-pass. 
Mark the two cross-mount handoff checks expected-fail: they pin the remaining DLM liveness bug (the lock is freed only on the delayed FUSE Release) without failing CI, and turn the suite red if the handoff is ever fixed. * fix(cluster): keep a wedged renewal shutdown from sending a stale unlock If the renewal goroutine is stuck in a slow RPC, Stop() fell through to unlock anyway once it timed out waiting. A late renewal can rotate renewToken, so that unlock races it, is rejected on a stale token, and leaves the lock lingering until its TTL regardless. On the timeout path, skip the unlock and let the TTL expire the lock instead. * fix(cluster): wake the long-lived lock renewal loop promptly on Stop StartLongLivedLock's renewal loop slept uninterruptibly between attempts, up to 5*renewInterval (2.5*lockTTL) while unlocked. Stop() waits only lockTTL+2s for the goroutine to exit, so a Stop() during that backoff would time out before the goroutine woke and closed renewalDone, breaking the shutdown synchronization. Sleep on a timer with a select on cancelCh so the loop exits immediately. |
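The shutdown synchronization these fixes describe can be sketched as below. `liveLock`, `startLock`, `stop`, and `demo` are hypothetical names, and the renewal is simulated by rotating a local token; this is an illustration of the pattern (interruptible timer wait, wait for goroutine exit, skip the unlock on timeout), not the weed/cluster code.

```go
package main

import (
	"fmt"
	"time"
)

// liveLock is a hypothetical long-lived lock with a background renewal loop.
type liveLock struct {
	cancelCh    chan struct{}
	renewalDone chan struct{}
	renewToken  string
}

func startLock(renewInterval time.Duration) *liveLock {
	l := &liveLock{
		cancelCh:    make(chan struct{}),
		renewalDone: make(chan struct{}),
		renewToken:  "t0",
	}
	go func() {
		defer close(l.renewalDone)
		timer := time.NewTimer(renewInterval)
		defer timer.Stop()
		for n := 1; ; n++ {
			select {
			case <-l.cancelCh:
				return // wake immediately on stop instead of sleeping
			case <-timer.C:
				l.renewToken = fmt.Sprintf("t%d", n) // a renewal rotates the token
				timer.Reset(renewInterval)
			}
		}
	}()
	return l
}

// stop returns the token to unlock with and whether an unlock should be sent.
// If the renewal goroutine does not exit in time, the unlock is skipped and
// the lock is left to expire via its TTL.
func (l *liveLock) stop(wait time.Duration) (string, bool) {
	close(l.cancelCh)
	select {
	case <-l.renewalDone:
		// The goroutine has exited, so renewToken can no longer change.
		return l.renewToken, true
	case <-time.After(wait):
		return "", false
	}
}

// demo stops a lock whose next renewal is far in the future.
func demo() (string, bool) {
	return startLock(time.Hour).stop(time.Second)
}

func main() {
	fmt.Println(demo()) // t0 true
}
```

Reading the token only after `renewalDone` is closed gives the happens-after ordering the message mentions.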
||
|
|
a17dca7009 |
fix(filer): don't disable the SQL idle connection pool when unconfigured (#9591)
* fix(filer): don't disable the SQL idle connection pool when unconfigured The mysql/mysql2/postgres stores called SetMaxIdleConns(maxIdle) unconditionally, so an unset connection_max_idle (0) actively kept zero idle connections - every query opened and closed a fresh connection instead of reusing the pool. Only apply the value when it's set; otherwise leave database/sql's default idle pool of 2 in place. * comments: shorten idle-pool note * fix(filer): default the SQL idle pool via config, keep explicit 0 honored Apply the idle-pool default at the config layer with SetDefault instead of guarding the SetMaxIdleConns call. An absent connection_max_idle now reads back as 2 (pool stays on), while an explicit 0 flows through to SetMaxIdleConns(0) so operators can still disable idle pooling on purpose. |
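The difference between "absent" and "explicitly 0" can be sketched with a plain map standing in for the configuration layer. `effectiveMaxIdle` is a hypothetical helper; the only outside fact it relies on is that `database/sql` keeps up to 2 idle connections by default and that `SetMaxIdleConns(0)` retains none.

```go
package main

import "fmt"

// defaultMaxIdle mirrors database/sql's default idle pool size.
const defaultMaxIdle = 2

// effectiveMaxIdle returns the value that would be handed to SetMaxIdleConns.
// An absent key falls back to the default; an explicit 0 is honored.
func effectiveMaxIdle(cfg map[string]int) int {
	if v, ok := cfg["connection_max_idle"]; ok {
		return v
	}
	return defaultMaxIdle
}

func main() {
	fmt.Println(effectiveMaxIdle(map[string]int{}))                          // 2
	fmt.Println(effectiveMaxIdle(map[string]int{"connection_max_idle": 0}))  // 0
	fmt.Println(effectiveMaxIdle(map[string]int{"connection_max_idle": 10})) // 10
}
```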
||
|
|
024b59fb31 |
fix(ec): pack EC shards onto fewer disks instead of refusing the task (#9588)
The planner refused to create an EC task unless it found totalShards distinct (server, disk_id) targets, so a cluster with fewer disks than shards (e.g. 8 single-disk servers for a 10+4 scheme) could never encode. A disk safely holds several distinct shards of one volume: each is its own .ecNN file and ReceiveFile keys by that extension. Drop the strict check and let createECTargets round-robin shards across the available disks, matching ec.encode's "4,4,3,3" fallback. The minTotalDisks floor (ceil(total/parity)) already keeps every disk at or below parityShards shards, so the volume still survives losing any one disk. Reserve capacity for the actual per-disk shard count rather than assuming one shard each, so packing doesn't over-commit disk slots. |
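The arithmetic behind the packing can be checked with a short sketch. `minTotalDisks` and `packShards` here are hypothetical re-implementations for illustration, assuming plain round-robin over equally eligible disks.

```go
package main

import "fmt"

// minTotalDisks is the floor ceil(total/parity) on the number of disks.
func minTotalDisks(totalShards, parityShards int) int {
	return (totalShards + parityShards - 1) / parityShards
}

// packShards round-robins shards across disks and returns per-disk counts.
func packShards(totalShards, disks int) []int {
	counts := make([]int, disks)
	for shard := 0; shard < totalShards; shard++ {
		counts[shard%disks]++
	}
	return counts
}

func main() {
	fmt.Println(minTotalDisks(14, 4)) // 4
	fmt.Println(packShards(14, 8))    // [2 2 2 2 2 2 1 1]
	fmt.Println(packShards(14, 4))    // [4 4 3 3]
}
```

With 10+4 and the minimum of 4 disks, the fullest disk holds 4 shards, so losing any one disk still leaves at least 10 shards to reconstruct from.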
||
|
|
5af7d12f04 |
fix(filer.sync): keep sync_offset fresh while the source is read-only (#9589)
* fix(filer.sync): keep sync_offset fresh while the source is read-only sync_offset holds the timestamp of the last replicated source event, so monitoring derives lag from now-sync_offset. A read-only source emits no metadata events, so the gauge froze at the last write and the derived lag grew without bound, making thresholds unusable. The source filer now sends an idle heartbeat carrying its current time while a subscriber is caught up to the buffer head. filer.sync uses it to advance the gauge, so now-sync_offset reflects real lag. Heartbeats are opt-in (client_supports_idle_heartbeat), are never written to the metadata log, and do not move the resume checkpoint, so a restart still resumes from the last real event. * fix(filer.sync): gate idle heartbeat on the read cursor, not SinceNs In metadata-chunks mode persisted entries replay as log file refs and never reach eachLogEntryFn, so lastSeenTsNs stays put and a caught-up subscriber with an old SinceNs would never get a heartbeat. Use the read cursor (lastReadTime), which advances in that mode too, max'd with lastSeenTsNs so the in-memory backlog-then-idle case still works while the cursor returned to the caller has not yet updated. |
||
|
|
4385b86bf1 |
fix(shell): volumeServer.evacuate no longer panics on a nil volume (#9587)
adjustAfterMove now removes the moved volume from the source disk's VolumeInfos in place: it swaps the entry with the last one and nils the tail. evacuateNormalVolumes ranges directly over that same slice, so the niled tail slot is later read as a nil *VolumeInformationMessage and the move attempt panics on vol.DiskType. Iterate over a snapshot of the slice so in-place removals during a move cannot leave nil holes in the loop. |
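A minimal sketch of the snapshot pattern, with hypothetical types (`volume`, `disk`) standing in for the shell's topology structures: the loop walks a copy, so the swap-and-nil removal on the live slice cannot expose a nil entry to it.

```go
package main

import "fmt"

type volume struct{ id int }

// disk is a hypothetical stand-in for a disk's volume list.
type disk struct{ volumes []*volume }

// remove deletes a volume in place: swap with the last entry, nil the tail.
func (d *disk) remove(id int) {
	for i, v := range d.volumes {
		if v.id == id {
			last := len(d.volumes) - 1
			d.volumes[i] = d.volumes[last]
			d.volumes[last] = nil
			d.volumes = d.volumes[:last]
			return
		}
	}
}

// evacuate moves every volume off the disk and returns the moved ids.
func evacuate(d *disk) []int {
	// Iterate over a snapshot; ranging over d.volumes directly would read
	// the nil-ed tail slots left behind by remove.
	snapshot := append([]*volume(nil), d.volumes...)
	var moved []int
	for _, v := range snapshot {
		moved = append(moved, v.id)
		d.remove(v.id)
	}
	return moved
}

func main() {
	d := &disk{volumes: []*volume{{1}, {2}, {3}}}
	fmt.Println(evacuate(d), len(d.volumes)) // [1 2 3] 0
}
```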
||
|
|
c00aa90990 |
fix(s3/audit): populate requester for GET/HEAD/IAM operations (#9581)
Authentication records the identity with r.WithContext, which returns a request copy. Handlers that log their own audit entry (PUT, DELETE, tagging) see it, but GET/HEAD object and IAM operations rely on track()'s fallback entry, which is built from the original request the auth copy never reached - so requester came out empty. Install a mutable identity holder on the request before authentication and have SetIdentityNameInContext record into it. The holder is shared by pointer across every request copy, so the fallback entry recovers the authenticated requester. The per-request context value still takes precedence, so nothing changes for handlers that see the auth copy. |
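The holder technique relies on request copies sharing context values by pointer. The sketch below uses only the standard library; `identityHolder`, `installHolder`, `setIdentity`, and `requester` are hypothetical names, not the s3api functions.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

type holderKey struct{}
type traceKey struct{}

// identityHolder is mutable and shared by pointer across request copies.
type identityHolder struct{ name string }

// installHolder attaches an empty holder before authentication runs.
func installHolder(r *http.Request) *http.Request {
	return r.WithContext(context.WithValue(r.Context(), holderKey{}, &identityHolder{}))
}

// setIdentity records the authenticated name into the shared holder.
func setIdentity(r *http.Request, name string) {
	if h, ok := r.Context().Value(holderKey{}).(*identityHolder); ok {
		h.name = name
	}
}

// requester reads the name back from whichever request copy it is given.
func requester(r *http.Request) string {
	if h, ok := r.Context().Value(holderKey{}).(*identityHolder); ok {
		return h.name
	}
	return ""
}

// demoRequester authenticates on a copy and reads from the original.
func demoRequester() string {
	orig, err := http.NewRequest(http.MethodGet, "http://example.invalid/bucket/key", nil)
	if err != nil {
		return ""
	}
	orig = installHolder(orig)
	// Auth layers derive a copy; the original request never sees that copy.
	authCopy := orig.WithContext(context.WithValue(orig.Context(), traceKey{}, "x"))
	setIdentity(authCopy, "alice")
	return requester(orig)
}

func main() {
	fmt.Println(demoRequester()) // alice
}
```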
||
|
|
e332b97d52 |
fix(shell): volume.balance no longer drains all volumes onto one server (#9579)
* fix(shell): volume.balance no longer drains all volumes onto one server
The density-based capacity function reads per-disk VolumeInfos sizes, but adjustAfterMove only updated VolumeCount and the selectedVolumes map. The planner re-read a stale topology after every move, so the source node's density never dropped and it kept moving volumes until that node was empty. Move the volume's size accounting between disks after each planned move so the density recomputes and the loop converges to an even distribution.
* refactor(shell): O(1) volume removal and direct disk lookup in adjustAfterMove
removeVolumeInfo swaps with the last element instead of shifting, and the disk is fetched by key rather than ranging the DiskInfos map. |
||
|
|
868849392c | 4.27 | ||
|
|
a4415c39aa |
fix(mount): keep periodic metadata flush from dropping concurrent chunk uploads (#9574)
* fix(mount): keep periodic metadata flush from dropping concurrent chunk uploads The periodic flush snapshotted entry.Chunks, then ran CompactFileChunks and MaybeManifestize (the manifest upload is a network round trip) before reassigning entry.Chunks. Async uploaders append freshly uploaded chunks during that window, and the reassignment overwrote them: the data stayed on the volumes but the file lost those chunk references, leaving zero-filled holes on read. Large sequential writes such as cat of two 15 GiB files hit several flush cycles and ended up corrupted. Snapshot the chunk list under the entry lock with a length marker, do the slow compaction and manifestization on the snapshot, then splice the processed prefix back in front of whatever chunks arrived after the snapshot. * mount: drop redundant slice copies in the flush splice processedPrefix is freshly built and the tail sub-slice is consumed immediately under the entry lock, so append straight onto processedPrefix instead of allocating two throwaway copies. |
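The snapshot-and-splice approach can be sketched as follows. The `entry` type and `flush` method are hypothetical, chunks are reduced to strings, and the sketch assumes chunks are only appended between the snapshot and the splice and that a single flusher runs at a time.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// entry is a hypothetical file entry whose chunk list grows as uploads finish.
type entry struct {
	mu     sync.Mutex
	chunks []string
}

func (e *entry) appendChunk(c string) {
	e.mu.Lock()
	e.chunks = append(e.chunks, c)
	e.mu.Unlock()
}

// flush processes a snapshot without holding the lock, then splices the
// processed prefix in front of any chunks appended in the meantime.
func (e *entry) flush(process func([]string) []string) {
	e.mu.Lock()
	snapshot := append([]string(nil), e.chunks...)
	marker := len(snapshot) // everything past this index arrived later
	e.mu.Unlock()

	processed := process(snapshot) // slow: compaction, manifest upload

	e.mu.Lock()
	e.chunks = append(processed, e.chunks[marker:]...)
	e.mu.Unlock()
}

func main() {
	e := &entry{chunks: []string{"a", "b", "c"}}
	e.flush(func(snap []string) []string {
		e.appendChunk("d") // simulates an upload finishing mid-flush
		return []string{strings.Join(snap, "+")}
	})
	fmt.Println(e.chunks) // [a+b+c d]
}
```

Assigning `processed` directly to `e.chunks` would have dropped `d`, which is the lost-reference bug the message describes.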
||
|
|
9914e6af30 |
chore(weed/command): prune unused functions (#9573)
* chore(weed/command): prune unused functions * drop now-unused closed field and renderLocked guard --------- Co-authored-by: Chris Lu <chris.lu@gmail.com> |
||
|
|
cc5ef1b741 |
feat(s3): add TagUser, UntagUser, ListUserTags IAM actions (#9572)
* feat(s3): add TagUser, UntagUser, ListUserTags IAM actions Adds AWS IAM-compatible user tag operations on the embedded IAM endpoint. Tags persist in the Identity proto as a repeated UserTag field; the existing 50-tag / 128-byte-key / 256-byte-value AWS limits are enforced. Pagination is stubbed (IsTruncated=false) since the 50-tag cap means all tags fit in a single response. * review: validate UntagUser TagKeys entries parseTagKeysParams now rejects empty keys and keys past MaxUserTagKeyLength; UntagUser additionally requires at least one TagKeys.member.N entry to match AWS validation behavior. * review: pre-allocate user-tag merge and filter slices mergeUserTags now allocates the combined existing+incoming capacity up front; UntagUser builds the filtered slice via make with the full ident.Tags capacity instead of ident.Tags[:0:0], which forced a reallocation on every append. * review: cover duplicate-in-request and invalid TagKeys cases Regression tests assert TagUser rejects two members with the same key in one request, and UntagUser rejects missing/empty/oversized TagKeys entries. |
||
|
|
37b6a14b0d |
feat(s3): add four bucket configuration handlers (#9570)
* feat(s3): add four bucket configuration handlers - GetBucketPolicyStatus: computes IsPublic from the existing bucket policy - PutBucketRequestPayment: companion writer to the existing GET; accepts only BucketOwner - GetBucketAccelerateConfiguration: returns <Status>Suspended</Status> - GetBucketLogging: returns an empty BucketLoggingStatus Lets AWS SDK probes succeed instead of returning MethodNotAllowed. * review: route GetBucketPolicyStatus through checkBucket Mirrors the existence/auth gating used by other bucket handlers and drops the bespoke filer_pb lookup so NoSuchBucket precedence is consistent across the API surface. * review: cap PutBucketRequestPayment body with MaxBytesReader The body is unmarshalled as RequestPaymentConfiguration, which is a handful of bytes; reject excessively large payloads up front and defer Close immediately after wrapping. * review: gate static getters on checkBucket GetBucketAccelerateConfiguration and GetBucketLogging now run the standard bucket existence check before returning the static Suspended / empty-status response so a missing bucket cannot appear to have valid configuration. * review: share cache helper across misc tests; check io.ReadAll error Accelerate and Logging tests now run through newMiscTestServer like the others so the checkBucket guard sees a cached bucket; the ReadAll error is explicitly checked. |
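A sketch of capping a small XML body with `http.MaxBytesReader` follows. The handler, the 1 KiB cap, and the plain-text error responses are assumptions for illustration; the real handler returns S3-style errors and its limit is not stated in the message.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

type requestPaymentConfiguration struct {
	XMLName xml.Name `xml:"RequestPaymentConfiguration"`
	Payer   string   `xml:"Payer"`
}

const maxBody = 1 << 10 // hypothetical cap; the real limit may differ

// putRequestPayment is a simplified, hypothetical handler.
func putRequestPayment(w http.ResponseWriter, r *http.Request) {
	r.Body = http.MaxBytesReader(w, r.Body, maxBody)
	defer r.Body.Close()

	var cfg requestPaymentConfiguration
	if err := xml.NewDecoder(r.Body).Decode(&cfg); err != nil {
		http.Error(w, "malformed or oversized body", http.StatusBadRequest)
		return
	}
	if cfg.Payer != "BucketOwner" {
		http.Error(w, "only BucketOwner is supported", http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// status runs the handler against a body and returns the HTTP status code.
func status(body string) int {
	req := httptest.NewRequest(http.MethodPut, "/bucket?requestPayment", strings.NewReader(body))
	rec := httptest.NewRecorder()
	putRequestPayment(rec, req)
	return rec.Code
}

func wrap(payer string) string {
	return "<RequestPaymentConfiguration><Payer>" + payer + "</Payer></RequestPaymentConfiguration>"
}

// oversizedBody exceeds maxBody before the root element closes.
func oversizedBody() string {
	return wrap(strings.Repeat("A", 4*maxBody))
}

func main() {
	fmt.Println(status(wrap("BucketOwner"))) // 200
	fmt.Println(status(wrap("Requester")))   // 400
	fmt.Println(status(oversizedBody()))     // 400
}
```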
||
|
|
cee2bf697c |
feat(s3): stub bucket configuration list endpoints (#9571)
* feat(s3): stub bucket configuration list endpoints
Adds Get and List handlers for Analytics, Inventory, IntelligentTiering, and Metrics bucket configurations. List returns an empty result with IsTruncated=false; single-get returns NoSuchConfiguration so SDK error parsing remains predictable.
* review: gate stubs on bucket existence
All eight stub handlers now call checkBucket via stubBucketGuard so NoSuchBucket takes precedence over NoSuchConfiguration / empty-list responses, matching AWS S3 precedence. Tests provide a cached bucket so the guard sees it as present.
|
||
|
|
285025eb73 |
s3api: support group inline policies + Condition enforcement (#9569)
* test(s3api): cover IAM inline policy aws:SourceIp + group inline gap
Unit tests under weed/s3api/ drive PutUserPolicy / PutGroupPolicy → reload → VerifyActionPermission with a synthetic 127.0.0.1 request and assert that the policy's IpAddress condition flips the outcome.
The user-policy cases pass on master (hydrateRuntimePolicies already routes inline docs through the policy engine, so Condition blocks are honored end-to-end). The group-policy case fails: PutGroupPolicy still returns NotImplemented, so a group inline doc never lands in the engine.
Integration counterparts live under test/s3/iam/ and exercise the same paths against a live SeaweedFS S3+IAM endpoint.
* s3api: support group inline policies + Condition enforcement
PutGroupPolicy/GetGroupPolicy/DeleteGroupPolicy/ListGroupPolicies used to return NotImplemented in embedded IAM mode, so anything attached to a group as an inline doc, including aws:SourceIp or any other Condition, was simply unreachable.
Wire the four endpoints to the credential-store methods that were already in place (memory, postgres, filer_etc all implement GroupInlinePolicyStore). On every config reload, hydrateRuntimePolicies now also walks LoadGroupInlinePolicies, registers each doc in the IAM policy engine under __inline_group_policy__/<group>/<policy>, and appends that key to Group.PolicyNames so evaluateIAMPolicies picks it up through its existing group walk. PutGroupPolicy/DeleteGroupPolicy are added to the ReloadConfiguration trigger list in DoActions.
Side fix: MemoryStore.LoadConfiguration now surfaces store.groups too. Without it iam.groups never repopulated on a memory-store reload, so group policy evaluation silently no-op'd whether the policy was inline or attached. The existing tests didn't notice because no test reloaded through cm after creating a group.
The NotImplemented unit test is inverted to drive the new round-trip.
* s3api: drop redundant refreshIAMConfiguration from Put/DeleteGroupPolicy DoActions already triggers ReloadConfiguration for both actions via the explicit reload list, so calling refreshIAMConfiguration inline runs the load twice per request. Per PR review. * s3api: scope group-policy resource names per test; tighten deny polling - Integration test resource names get a per-test suffix so retried or parallel CI jobs don't trip EntityAlreadyExists / BucketAlreadyExists. - Deny-path Eventually loops gate on AccessDenied via a typed helper rather than any non-nil error; transient setup errors no longer end the wait prematurely. - ListGroupPolicies returns ServiceFailure when the credential manager is nil, matching Put/Get/DeleteGroupPolicy. * test(s3 iam): cover both IPv4 and IPv6 loopback in allow CIDRs CI runners with happy-eyeballs resolve `localhost` to ::1 first, in which case a 127.0.0.0/8-only allow would silently never match and the deny-driven enforcement test would hang for the allow case. Add ::1/128 to every loopback-matching policy so the allow path works regardless of which loopback family the SDK lands on. |
||
|
|
77ac781bbd |
fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers (#9568)
* fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers When a volume server holds EC shards for the same vid across more than one disk, each DiskLocation registers its own EcVolume entry and Store.FindEcVolume returns whichever one it hits first. The shard-info RPC iterated only that single EcVolume's Shards, so the response missed every shard mounted on a sibling disk. The worker's verifyEcShardsBeforeDelete sums the per-server responses into a union bitmap and refuses to delete the source volume when the union falls short of dataShards+parityShards. On multi-disk destinations, the union was systematically under-counted and source deletion got blocked even though all shards were physically present and mounted. Walk every DiskLocation in the handler and emit the deduplicated union of all shards. The .ecx-backed fields (file counts, volume size) still come from a single EcVolume since every disk's entry opens the same .ecx via NewEcVolume's cross-disk fallback. Tests: - TestVolumeEcShardsInfo_AggregatesAcrossDisks unit test in weed/server/. - test/volume_server/grpc/ec_verify_multi_disk_test.go integration test drives the full generate -> mount -> redistribute -> restart -> reconcile path and asserts both VolumeEcShardsInfo and VerifyShardsAcrossServers + RequireFullShardSet (the production source-deletion gate) report all 14 shards. - ec_multi_disk_lifecycle_test.go tightened: replaces the "VolumeEcShardsInfo only sees one disk's EcVolume" workaround with a full-shard-set assertion. * review: use ShardBits bitmask + cap-pre-allocation for shard dedup |
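The cross-disk union described above can be sketched in Go with a plain bitmask. This is an illustration under stated assumptions, not the SeaweedFS implementation: `shardBits`, `unionAcrossDisks`, and `hasFullShardSet` are hypothetical names, and the total of 14 shards is taken from the test description in the commit message.

```go
package main

import (
	"fmt"
	"math/bits"
)

// shardBits is a hypothetical bitmask: bit i set means shard i is present.
type shardBits uint32

// totalShards follows the "all 14 shards" figure quoted in the commit message.
const totalShards = 14

// add returns the mask with shard id set.
func (b shardBits) add(id uint) shardBits { return b | 1<<id }

// count returns how many distinct shards are set.
func (b shardBits) count() int { return bits.OnesCount32(uint32(b)) }

// unionAcrossDisks ORs together the shard ids found on every disk,
// so a shard seen on two disks is still counted once.
func unionAcrossDisks(disks [][]uint) shardBits {
	var union shardBits
	for _, shards := range disks {
		for _, id := range shards {
			union = union.add(id)
		}
	}
	return union
}

// hasFullShardSet reports whether every shard id below totalShards is present.
func hasFullShardSet(b shardBits) bool { return b.count() == totalShards }

func main() {
	disks := [][]uint{
		{0, 1, 2, 3, 4, 5, 6},
		{7, 8, 9, 10, 11, 12, 13},
	}
	all := unionAcrossDisks(disks)
	firstOnly := unionAcrossDisks(disks[:1])
	fmt.Println("all disks full set:", hasFullShardSet(all))
	fmt.Println("first disk only full set:", hasFullShardSet(firstOnly))
}
```

Looking at only the first disk reproduces the under-count that blocked source deletion: the per-disk view holds seven shards even though all fourteen are mounted on the server.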
||
|
|
f72983c1fd |
fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table" (#9566)
* fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table"
The S3 Tables REST endpoints share top-level paths with the regular S3
API (/buckets for ListTableBuckets/CreateTableBucket, /get-table for
GetTable). They are registered first on the same router as the bucket
subrouter, so a path-style request such as GET /buckets?list-type=2 on
a bucket actually named "buckets" matched ListTableBuckets and returned
JSON. AWS SDK V2 (and Hadoop s3a / Spark) then failed XML parsing with
"Unexpected character '{' (code 123) in prolog".
Disambiguate by requiring the AWS V4 credential scope to name the
s3tables service on the colliding routes. Regular S3 SDKs sign with
service=s3, S3 Tables SDKs sign with service=s3tables, and the scope is
present in both the Authorization header and the X-Amz-Credential query
parameter for presigned URLs, so the matcher works for both flavors.
ARN-bearing S3 Tables routes (/buckets/<arn>, /namespaces/<arn>, etc.)
already cannot collide because colons are not valid in bucket names, so
they are left untouched.
* fix(s3): accept AWS JSON RPC content type as S3 Tables intent signal
The Iceberg catalog integration tests send unsigned PUT /buckets with
Content-Type: application/x-amz-json-1.1 to create table buckets. With
only the credential-scope check, those requests fell through to the
regular S3 CreateBucket handler and the suite went red on this branch.
Extend the matcher so a request is recognized as S3 Tables when either:
- its AWS V4 credential scope names SERVICE=s3tables; or
- it carries the canonical AWS JSON RPC 1.1 content type and is
unsigned (a request explicitly signed for SERVICE=s3 still wins).
The regular S3 SDKs do not send application/x-amz-json-1.1, so the
signal is safe for the colliding paths (/buckets, /get-table).
Also add an AWS SDK V2 for Go integration test under
test/s3/sdk_v2_routing/ that drives the SDK's own XML deserializer
against a bucket literally named "buckets" and "get-table" — the SDK
errors before the test asserts if the server returns the wrong body
shape. Wired up via .github/workflows/s3-sdk-v2-routing-tests.yml,
mirroring the etag/acl workflow.
* s3api: extend service matcher to all S3 Tables routes; simplify scope check
- Apply serviceMatcher to every S3 Tables route, not just the bare-path
ones. ARN-bearing paths could otherwise be hit by an S3 object key
that starts with arn:aws:s3tables:..., inside a bucket named
"buckets", "namespaces", "tables", or "tag". One matcher everywhere
closes both collision classes.
- Replace strings.Split + index lookup with strings.Contains for the
credential-scope check. The scope shape is fixed at
AK/DATE/REGION/SERVICE/aws4_request, slashes only delimit components,
and access keys are alphanumeric — so /s3tables/ matches iff SERVICE
is exactly s3tables. Existing unit cases (including the
access-key-substring case) still pass.
- Read the GetObject body in the SDK v2 routing test with io.ReadAll;
the single Read could return short and make the equality check flaky.
* s3api: drop content-type fallback; sign s3 tables harness traffic instead
The content-type fallback in isS3TablesSignedRequest let an anonymous
regular-S3 request whose body type is application/x-amz-json-1.1 hit
an S3 Tables route when the path-style object key happened to be
shaped like an S3 Tables ARN (e.g. PutObject on bucket "buckets"
with key arn:aws:s3tables:.../bucket/foo/policy). Narrow the matcher
back to the AWS V4 credential scope so only requests signed for
SERVICE=s3tables match the S3 Tables routes.
Update the Iceberg catalog test harness — the only caller still
sending unsigned PUT /buckets — to sign with SERVICE=s3tables. The
mini instance runs in default-allow mode, so the signature itself is
not verified; only the credential scope matters for the route match.
Drop the stale unit cases for the JSON-RPC content-type signal and
the routing test that exercised unsigned harness traffic.
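A minimal Go sketch of the credential-scope check described above, assuming the fixed scope shape AK/DATE/REGION/SERVICE/aws4_request quoted in the commit message. The function names and sample access keys are hypothetical and do not reproduce the actual matcher.

```go
package main

import (
	"fmt"
	"strings"
)

// credentialFromAuthorization extracts the Credential value from an
// AWS V4 Authorization header, or returns "" when none is present.
func credentialFromAuthorization(header string) string {
	_, rest, ok := strings.Cut(header, "Credential=")
	if !ok {
		return ""
	}
	cred, _, _ := strings.Cut(rest, ",")
	return strings.TrimSpace(cred)
}

// scopeNamesS3Tables reports whether the credential scope names the
// s3tables service. Slashes only delimit scope components, so the
// substring "/s3tables/" can only match the SERVICE component.
func scopeNamesS3Tables(credential string) bool {
	return strings.Contains(credential, "/s3tables/")
}

func main() {
	tables := "AWS4-HMAC-SHA256 Credential=AKEXAMPLE/20260101/us-east-1/s3tables/aws4_request, SignedHeaders=host, Signature=abc"
	plain := "AWS4-HMAC-SHA256 Credential=AKEXAMPLE/20260101/us-east-1/s3/aws4_request, SignedHeaders=host, Signature=abc"

	fmt.Println(scopeNamesS3Tables(credentialFromAuthorization(tables)))
	fmt.Println(scopeNamesS3Tables(credentialFromAuthorization(plain)))
}
```

For presigned URLs the same check would apply to the X-Amz-Credential query parameter, since the commit message notes the scope appears in both places.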
|
||
|
|
cfc08fbf6c |
fix(volume): tombstone integrity check no longer flips volumes read-only (fixes #9563) (#9565)
* fix(volume): pass on-disk tombstone size to ReadData in verifyDeletedNeedleIntegrity verifyDeletedNeedleIntegrity was forwarding TombstoneFileSize (-1) into Needle.ReadData. A deletion tombstone is appended to .dat with DataSize=0 so the on-disk needle header carries Size=0; TombstoneFileSize is only the .idx sentinel for "this entry is deleted" and is never written into a needle header. ReadBytes' size check therefore mismatched on every tombstone (-1 != 0), returned ErrorSizeMismatch, and triggered the 4-byte-offset wrap-around retry in ReadData (offset + 32 GB). On any volume large enough that offset+32 GB exceeds dat fileSize the retry read EOF, CheckVolumeDataIntegrity reported corruption, and the loader set noWriteOrDelete = true. Every volume whose last 10 .idx entries included a deletion went read-only on startup — i.e. any healthy volume where the most recent operations included a delete. Pass Size(0) so the size check matches the on-disk tombstone header. Add a regression test that writes three needles, deletes one, and asserts CheckVolumeDataIntegrity succeeds with a tombstone at the .idx tail. Without this fix the test reproduces the exact log shape from the bug report: read 0 dataSize 32 offset <orig+32GB> fileSize <much smaller>: EOF verifyDeletedNeedleIntegrity ...idx failed: read data [N,N+32) : EOF The Rust port guards its integrity-check size comparison with !size.is_deleted() (seaweed-volume/src/storage/volume.rs) and never hits this path, so no Rust mirror change is needed. * test(seaweed-volume): mirror Go regression for deletion-tombstone integrity The Rust integrity check already guards its size-mismatch comparison with !size.is_deleted() (volume.rs:1859) and reads tombstone AppendAtNs with body_size=0, so the Go regression fixed in the previous commit does not apply. Lock that guarantee in with a parallel reload test: write three needles, delete one, sync, reopen via Volume::new, assert the volume is not flipped read-only. 
Catches any future change that removes the deleted-entry guard or re-introduces a size-strict path in check_volume_data_integrity for tombstones. * fix(volume): propagate io.EOF and ErrorSizeMismatch from verifyDeletedNeedleIntegrity CheckVolumeDataIntegrity relies on identity comparison against io.EOF and ErrorSizeMismatch to walk back through the last ten .idx entries and tolerate a partial truncation at the tail (the "fix and continue" loop). The live-needle branch in doCheckAndFixVolumeData already returns those sentinels unwrapped; the deletion branch wrapped them in fmt.Errorf, so a genuine .dat truncation past a tombstone offset broke the recovery and flipped the volume read-only. Mirror the live-needle handling: both verifyDeletedNeedleIntegrity and doCheckAndFixVolumeData now short-circuit on io.EOF / ErrorSizeMismatch and pass them through unwrapped. Other errors keep their existing context wrapping. Also tighten the regression test to capture lastAppendAtNs and assert it's non-zero, so a future regression that skips the tombstone body (and therefore never populates AppendAtNs) is caught even when the err check still passes. |
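The sentinel-propagation point above comes down to how Go compares errors. The sketch below uses hypothetical helper names to show why an identity comparison against io.EOF stops matching once the error has been passed through fmt.Errorf. The `errors.Is` variant is included only as a general Go alternative; the commit itself returns the sentinels unwrapped.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// plainEOF returns the sentinel unwrapped, as the live-needle branch does.
func plainEOF() error { return io.EOF }

// wrappedEOF adds context around the sentinel, as the deletion branch used to.
func wrappedEOF() error { return fmt.Errorf("verify tombstone: %w", io.EOF) }

// toleratedByIdentity models a caller that compares with ==.
func toleratedByIdentity(err error) bool { return err == io.EOF }

// toleratedByIs models a caller that unwraps before comparing.
func toleratedByIs(err error) bool { return errors.Is(err, io.EOF) }

func main() {
	fmt.Println("identity, plain:  ", toleratedByIdentity(plainEOF()))
	fmt.Println("identity, wrapped:", toleratedByIdentity(wrappedEOF()))
	fmt.Println("errors.Is, wrapped:", toleratedByIs(wrappedEOF()))
}
```

A wrapped error is a distinct value from the sentinel whether `%w` or `%v` is used, so any caller relying on `==` needs the sentinel returned as-is.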
||
|
|
d57de6dc20 |
fix(s3): keep anonymous access working with EnableIam default (fixes #9557) (#9567)
fix(s3): keep anonymous access working with EnableIam default `docker run seaweedfs` (and `weed mini` with no config) start with EnableIam=true but no IAM config file and no identities. The advanced-IAM init path was failing in 4.25 because of the missing STS signing key, which masked a latent bug: SetIAMIntegration unconditionally flipped isAuthEnabled to true, and isEnabled() also treated a non-nil iamIntegration as auth-on. Once the mini SSE-S3 KEK landed in 4.26 the STS fallback started succeeding, the integration got installed end to end, and every anonymous S3 request bounced as AccessDenied. Separate the two concerns: SetIAMIntegration just plumbs in the OIDC / embedded-IAM machinery, and a new EnableAuthEnforcement opts in to enforcement. The startup path calls it only when -s3.iam.config is actually provided, so operators with explicit IAM configs still get auth (preserves #7726). isEnabled() now reads isAuthEnabled only. |
||
|
|
4476cb282b |
feat(filer): add atime to FuseAttributes + TouchAccessTime RPC (#9556)
* feat(filer): add atime field and TouchAccessTime RPC to filer proto
Introduce POSIX-style access-time tracking on the filer:
- FuseAttributes gains atime (field 22) and atime_ns (field 23).
- New TouchAccessTime RPC (and Touch{Access,Time}{Request,Response})
lets read paths bump atime without going through UpdateEntry's
chunk-rewrite/EqualEntry short-circuit.
Additive proto changes only; zero atime is treated as unset and
existing clients are unaffected. Java client proto is kept in lock
step.
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(filer): wire Atime through Attr codec with mtime fallback
Add Attr.Atime and round-trip it through EntryAttributeToPb /
EntryAttributeToExistingPb / PbToEntryAttribute. A zero proto atime
decodes as Mtime, so legacy entries report a sensible value and
freshly-created/updated entries default Atime to Mtime when callers
do not set it explicitly.
CreateEntry and UpdateEntry stamp Atime = Mtime (or Crtime) when it
is zero. TouchAccessTime later bypasses this path to write atime
alone via Store.UpdateEntry.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(filer): preserve atime in first epoch second on decode
The Atime decode branch previously treated any attr.Atime == 0 as
unset and overwrote it with Mtime, which drops valid timestamps in
the first second of the unix epoch where attr.Atime is 0 but
attr.AtimeNs > 0. Check both fields so we only fall back to Mtime
when both are zero.
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
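The decode rule from the last fix above (fall back to Mtime only when both atime fields are zero) can be sketched as follows. The function name and the integer field types are assumptions for illustration; the actual codec may represent these differently.

```go
package main

import "fmt"

// decodeAtime returns the access time to report for an entry.
// It falls back to mtime only when both stored atime fields are zero,
// so a timestamp within the first epoch second (sec == 0, ns > 0) survives.
func decodeAtime(atimeSec int64, atimeNs int32, mtimeSec int64) (sec int64, ns int32) {
	if atimeSec == 0 && atimeNs == 0 {
		return mtimeSec, 0 // legacy entry: atime was never set
	}
	return atimeSec, atimeNs
}

func main() {
	const mtime = 1700000000

	fmt.Println(decodeAtime(0, 0, mtime))   // unset atime
	fmt.Println(decodeAtime(0, 500, mtime)) // first epoch second
	fmt.Println(decodeAtime(42, 0, mtime))  // ordinary atime
}
```

Checking only the seconds field would treat the second case as unset and overwrite it with Mtime, which is the bug the fix describes.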
|
||
|
|
b63610cf8f |
volume: accept legacy needle CRC encoding on read (#9564)
Volumes written by versions before 3.09 (commit
|
||
|
|
c61d227613 |
s3api: verify source permission on CopyObject and UploadPartCopy (#9555)
* s3api: verify source permission on CopyObject and UploadPartCopy The Auth middleware only authorized the destination because routes key on the request URL. The source from X-Amz-Copy-Source was never evaluated, so an STS session token scoped to one prefix could copy from any other prefix in the same bucket. Add AuthorizeCopySource on IdentityAccessManagement to run the full bucket-policy + IAM/identity flow against the source, using a synthetic GetObject request so action resolution lands on s3:GetObject (or s3:GetObjectVersion when a source versionId is supplied). Both CopyObjectHandler and CopyObjectPartHandler now invoke it before reading the source. * s3api: preserve presigned-URL session token on copy-source check Presigned CopyObject / UploadPartCopy requests carry the STS session token in the query string (X-Amz-Security-Token), not in a header. Rebuilding the synthetic source URL from scratch dropped that token, so the source authorization would fall through to non-STS paths and miss session policy enforcement. Forward X-Amz-Security-Token from the original query (alongside versionId), still excluding unrelated params like uploadId/partNumber that would steer ResolveS3Action away from s3:GetObject. |
||
|
|
7c252e1f16 |
fix(volume): reopen .idx writable after MarkVolumeWritable (fixes #9515) (#9526)
* fix(volume): reopen .idx writable after MarkVolumeWritable When .vif has ReadOnly=true, load() opens .idx as O_RDONLY and builds a SortedFileNeedleMap whose Put returns os.ErrInvalid. MarkVolumeWritable only flipped noWriteOrDelete back to false and rewrote .vif, so writes still failed at v.nm.Put. Reopen .idx in O_RDWR and rebuild v.nm in its writable form (in-memory or leveldb small/medium/large) before flipping the flag. Mirror the same fix in seaweed-volume: the Rust load path leaves CompactNeedleMap/RedbNeedleMap with no idx_file writer when the volume boots read-only, so post-MarkVolumeWritable puts silently succeeded in-memory only and were lost on the next restart. set_writable now reattaches an append-mode writer when one is missing. * fix(volume): keep old needle map until replacement is built; defer writable flag Go: build the writable needle map into a local before swapping. A construction failure now leaves v.nm pointing at the original SortedFileNeedleMap so MarkVolumeWritable can roll back, instead of stranding the volume with v.nm == nil. Rust: attach the .idx writer before flipping no_write_or_delete to false. A transient open/metadata failure used to leave the volume marked writable with no writer attached, and subsequent puts would silently skip the on-disk append. |
||
|
|
7c5296dfb1 |
fix(admin): switch file browser upload/download to filer gRPC + volume HTTP (#9538)
* fix(admin): switch file browser upload/download to filer gRPC + volume HTTP The admin file browser proxied uploads and downloads through the filer's HTTP listener, so the whole feature 404'd against filers started with -disableHttp=true even though S3 still worked on its own port. Re-route through the filer gRPC service: LookupDirectoryEntry + StreamContent for reads (chunks flow straight from the volume servers), AssignVolume + volume HTTP POST + CreateEntry for writes. Volume read tokens come from jwt.signing.read.key when configured; the old jwt.filer_signing tokens no longer apply since the filer HTTP surface is bypassed. * admin file browser: propagate request context + track response writes Pass r.Context() into uploadFileToFiler so a client disconnect cancels the in-flight chunked upload instead of letting it run to completion against the volume servers. For DownloadFile, replace the Content-Type probe with a small response-writer wrapper that records whether headers or bytes have actually been sent, so the error path can't silently convert a pre-stream failure into a partial response if future code moves the header-setting around. |