Commit Graph

14102 Commits

Author SHA1 Message Date
7y-9 e6ab9e7b09 fix(s3api): reject zero default retention years (#9860)
Problem: Default object-lock retention accepted an explicitly provided Years value of zero, even though a default retention period must be positive when present.

Root cause: validateDefaultRetention rejected zero Days but only rejected negative Years, leaving YearsSet with Years=0 as a successful validation path.

Fix: Treat an explicitly provided zero Years value as ErrInvalidRetentionPeriod, matching the existing Days validation.

Reproduction: go test ./weed/s3api -run TestValidateDefaultRetention -count=1 failed before the fix because the Zero years case returned nil.

Validation: go test ./weed/s3api -run TestValidateDefaultRetention -count=1; go test ./weed/s3api -count=1; git diff --check; git diff --cached --check
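
The rule described above can be sketched as follows. This is a minimal illustration, not the project's code: the DefaultRetention struct layout is hypothetical, and only the names validateDefaultRetention, YearsSet and ErrInvalidRetentionPeriod come from the commit message. It covers the positivity checks only.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInvalidRetentionPeriod reuses the error name from the commit message.
var ErrInvalidRetentionPeriod = errors.New("invalid retention period")

// DefaultRetention is a hypothetical stand-in for the real configuration type.
type DefaultRetention struct {
	Days     int
	DaysSet  bool
	Years    int
	YearsSet bool
}

// validateDefaultRetention rejects any explicitly provided period that is not
// positive. Years now gets the same zero check that Days already had.
func validateDefaultRetention(r DefaultRetention) error {
	if r.DaysSet && r.Days <= 0 {
		return ErrInvalidRetentionPeriod
	}
	if r.YearsSet && r.Years <= 0 {
		return ErrInvalidRetentionPeriod
	}
	return nil
}

func main() {
	fmt.Println(validateDefaultRetention(DefaultRetention{Years: 0, YearsSet: true}))
	fmt.Println(validateDefaultRetention(DefaultRetention{Years: 1, YearsSet: true}))
}
```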
2026-06-07 20:53:45 -07:00
Chris Lu f9d3105e80 ec placement: spread EC shards evenly across machines, not onto the lowest-id one (#9855)
* ec placement: steer shards to less-loaded machines, not the lowest id

EC encode places every volume against one shared topology snapshot (it reserves the
shards it assigns so later volumes see reduced capacity), but node selection ranked
only by this volume's shard count and broke ties by sorted id. So the lowest-id
machine won the first shard of every volume and accumulated far more total shards
than the rest -- on a 6-machine cluster the first machines drifted to ~1.5x.

Rank eligible nodes by the machine's shards of this volume, then the machine's free
capacity, then the node's shards of this volume, then the node's free capacity. Free
capacity reflects the load already placed, so ties steer toward the least-loaded
machine instead of the lowest id, keeping total EC shards even across machines.
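
A rough sketch of the four-level ranking described above. The candidate type and its field names are hypothetical, and the final id comparison is an assumption added only to keep the sort deterministic; the real planner may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical view of one eligible node for a shard.
type candidate struct {
	ID                  string
	MachineVolumeShards int // shards of this volume already on the node's machine
	MachineFreeSlots    int // free capacity of the whole machine
	NodeVolumeShards    int // shards of this volume already on the node
	NodeFreeSlots       int // free capacity of the node
}

// rankCandidates orders nodes best-first: fewest machine shards of the volume,
// then most machine free capacity, then fewest node shards, then most node
// free capacity. The id is consulted only when everything else ties.
func rankCandidates(cs []candidate) {
	sort.SliceStable(cs, func(i, j int) bool {
		a, b := cs[i], cs[j]
		if a.MachineVolumeShards != b.MachineVolumeShards {
			return a.MachineVolumeShards < b.MachineVolumeShards
		}
		if a.MachineFreeSlots != b.MachineFreeSlots {
			return a.MachineFreeSlots > b.MachineFreeSlots
		}
		if a.NodeVolumeShards != b.NodeVolumeShards {
			return a.NodeVolumeShards < b.NodeVolumeShards
		}
		if a.NodeFreeSlots != b.NodeFreeSlots {
			return a.NodeFreeSlots > b.NodeFreeSlots
		}
		return a.ID < b.ID
	})
}

func main() {
	cs := []candidate{
		{ID: "10.0.0.1:8080", MachineFreeSlots: 5, NodeFreeSlots: 5},
		{ID: "10.0.0.2:8080", MachineFreeSlots: 9, NodeFreeSlots: 9},
	}
	rankCandidates(cs)
	fmt.Println(cs[0].ID) // the less-loaded machine, not the lowest id
}
```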

* test: ec.balance converges to even per-machine load from a skew

Starts machine 10.0.0.1 at 4 shards/volume and the rest at 2, then runs repeated worker-style capped passes; asserts convergence to an even per-machine total (reaches exactly even in ~13 rounds).

* reduce comments on the placement fix

Trim narration to the non-obvious why.

* test: assert convergence and count zero-shard machines

Seed the per-machine map with every host so a fully drained machine still registers, and fail explicitly if balance doesn't converge before the round cap.
2026-06-07 20:45:17 -07:00
Chris Lu 89cbb1c558 admin: default -dataDir to "." so maintenance task state persists across restarts (#9856)
admin: default -dataDir to "." so maintenance task state persists

Previously -dataDir defaulted to empty, so the admin ran maintenance in
memory only: task state was never saved and maintenance tasks (notably EC
balance/rebuild) were re-issued every scan cycle without converging. The
resulting churn moved EC shards without their .ecx index, leaving EC
volumes unloadable or missing shards.

Default -dataDir to "." (the process working directory, which under the
standard systemd unit is the admin's data dir) so state persists out of
the box.
2026-06-07 20:45:03 -07:00
Chris Lu f0d2a0d417 Treat co-located volume servers as one fault domain when balancing and allocating (#9854)
* admin/topology: carry the volume server address on DiskInfo

The planning DiskInfo exposed only the node id, which can be an opaque label rather than ip:port. Record the address too so callers can resolve the physical machine a disk sits on.

* ec.balance: spread a volume's shards across machines, not just nodes

Volume servers sharing a host are one fault domain, but the within-rack spread treated them as independent nodes, so one box could end up holding more shards of a volume than EC can afford to lose. Add a machine (host) tier between rack and node: the within-rack pass spreads each volume across machines, and the global load phase no longer re-concentrates a volume onto a machine it already sits on. Host defaults to the node id, so clusters with one server per host are unchanged.

* ec placement: prefer machines holding fewer of a volume's shards

EC allocation and repair picked the least-loaded node in a rack with no regard for which physical machine it sits on, so a volume's shards could pile onto several servers of one box. Rank candidate nodes by their machine's shard count first, then the node's own. The machine is derived from the volume server address carried on DiskInfo, falling back to the node id, matching how the balancer resolves it.

* volume.balance: don't move a replica onto a machine already holding one

isGoodMove only rejected a move onto the same data node, so two replicas could land on two volume servers of one box and a single machine failure would lose both. Reject a target whose host already holds another replica of the volume. Best-effort: balancing simply skips and tries the next target.

* volume allocation: spread same-rack replicas across machines

PickNodesByWeight filled the same-rack replica picks by weight alone, so replicas could co-locate on one box. Prefer candidates on not-yet-used hosts, falling back when too few distinct machines exist. Data-center and rack tiers have no host, so their ordering is unchanged.

* ec.balance: harden machine spread against re-concentration and capped machines

Two cases where the machine-aware spread could still leave a volume badly placed:

- The global load phase could move a shard of a volume onto a machine that
  already held it, raising that machine's count and undoing the within-rack
  spread (a 4/4/3/3 layout could become 3/5/3/3, past parity for 10+4). Limit
  the load-only fallback to same-machine moves, which leave a machine's count
  unchanged; cross-machine concentration is no longer allowed for load alone.

- The within-rack spread chose a destination machine by free slots alone, so if
  that machine's only nodes were already at the SameRackCount cap it skipped the
  move instead of trying another machine. Require a machine to have a node that
  can actually take the shard before selecting it.

* reduce comments across the machine-affinity change

Trim narration down to the non-obvious why; one terse line where a block was overkill.

* ec.balance: gate machine spread on fault-tolerance feasibility

Spreading a volume evenly across machines only helps when there are enough that
each can stay within EC's parity tolerance (numMachines >= ceil(total/parity)).
With fewer -- or wildly unequal -- machines it can't make a machine loss
survivable anyway, and forcing it fights capacity: e.g. a cluster of 12 volume
servers on one host and 2 on another would have half of every volume crammed onto
the 2-server box. So spread across machines only when it's achievable; otherwise
fall back to per-node spread and let capacity/global balancing decide.

The global load phase applies the same test: it protects a volume's machine spread
(no cross-machine move that raises a machine's count past the source's) only where
that spread is achievable, so heterogeneous clusters still level by fullness.
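
The feasibility test numMachines >= ceil(total/parity) can be sketched like this. The function name and signature are illustrative assumptions, not the project's API.

```go
package main

import "fmt"

// ceilDiv returns ceil(a/b) for positive b.
func ceilDiv(a, b int) int { return (a + b - 1) / b }

// machineSpreadFeasible reports whether an even spread over numMachines can
// keep every machine at or below the parity count, so that losing one machine
// stays recoverable.
func machineSpreadFeasible(numMachines, shards, parity int) bool {
	if numMachines <= 0 || parity <= 0 {
		return false
	}
	return numMachines >= ceilDiv(shards, parity)
}

func main() {
	// 10+4 EC: 14 shards, parity 4, so at least ceil(14/4) = 4 machines.
	fmt.Println(machineSpreadFeasible(4, 14, 4))
	// Two hosts cannot keep each machine within parity.
	fmt.Println(machineSpreadFeasible(2, 14, 4))
}
```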

* ec.balance worker: group servers by host when planning

The worker built its planner topology without recording each server's host, so
automated ec.balance treated ports on one machine as independent nodes and could
concentrate a volume's shards on one physical box. Set the host from the volume
server address, matching the shell path.

* volume.balance worker: don't move a replica onto a machine holding one

The worker compared only node ids, and the replica map dropped the server address,
so it could move replicas onto different ports of one machine. Carry the host on
ReplicaLocation (from the server address) and reject a target whose host already
holds another replica of the volume. Best-effort, matching the shell.

* ec.balance: judge machine-spread feasibility by the rack's shards

The within-rack and global feasibility checks compared the whole volume's shard
count against a rack's machine count, so a rack holding only part of a volume after
cross-rack spreading -- e.g. 7 of a 10+4 volume across 2 machines -- was wrongly
judged infeasible and fell back to node spread, which could pile 6 shards onto one
host, past parity. Gate on the rack's own shard count of the volume instead.

* ec.balance: spread a volume's shards across machines by combined count

EC recovers from any loss within parity regardless of shard type, so what bounds a
machine's exposure is its total shards of the volume, not data and parity
separately. Spreading the two independently let each type's remainder land on the
same machine -- ceil(d/M)+ceil(p/M) can exceed ceil(total/M), e.g. a 5/3 split where
4/4 was achievable, past parity. Balance the combined count in one pass; disk-level
data/parity anti-affinity stays in pickBestDiskOnNode.
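
A small worked check of the inequality above. The values d=5, p=3, M=2 are assumed for illustration; they are one combination that reproduces the 5/3 versus 4/4 example.

```go
package main

import "fmt"

// ceilDiv returns ceil(a/b) for positive b.
func ceilDiv(a, b int) int { return (a + b - 1) / b }

// worstSeparate is the largest per-machine total when data and parity shards
// are spread independently and both remainders land on the same machine.
func worstSeparate(data, parity, machines int) int {
	return ceilDiv(data, machines) + ceilDiv(parity, machines)
}

// worstCombined is the largest per-machine total when the combined shard
// count is spread in one pass.
func worstCombined(data, parity, machines int) int {
	return ceilDiv(data+parity, machines)
}

func main() {
	d, p, m := 5, 3, 2
	fmt.Println(worstSeparate(d, p, m), worstCombined(d, p, m))
}
```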

* ec.balance: don't let the imbalance threshold skip an over-parity machine

The within-rack spread gated on relative skew ((max-min)/avg > threshold), so a
worker threshold of 0.5 skipped an exactly-50%-skewed layout like 5/4/3 for a 10+4
volume, leaving 5 shards -- past parity -- on one machine. The even cap
(ceil(shards/groups)) is the real bound and the move loop already sheds only what
exceeds it, so drop the threshold gate from the within-rack phase (machine and node):
a balanced rack stays a no-op while any over-cap machine is always fixed.
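
To see why the relative-skew gate misses the 5/4/3 layout while the even cap catches it, consider this sketch. Both helper names are hypothetical.

```go
package main

import "fmt"

// skewExceeds is the old gate: (max-min)/avg > threshold.
func skewExceeds(counts []int, threshold float64) bool {
	if len(counts) == 0 {
		return false
	}
	lo, hi, sum := counts[0], counts[0], 0
	for _, c := range counts {
		if c < lo {
			lo = c
		}
		if c > hi {
			hi = c
		}
		sum += c
	}
	if sum == 0 {
		return false
	}
	avg := float64(sum) / float64(len(counts))
	return float64(hi-lo)/avg > threshold
}

// overCap returns the indexes of groups holding more than the even cap,
// ceil(shards/groups).
func overCap(counts []int) []int {
	if len(counts) == 0 {
		return nil
	}
	sum := 0
	for _, c := range counts {
		sum += c
	}
	limit := (sum + len(counts) - 1) / len(counts)
	var over []int
	for i, c := range counts {
		if c > limit {
			over = append(over, i)
		}
	}
	return over
}

func main() {
	layout := []int{5, 4, 3}
	fmt.Println(skewExceeds(layout, 0.5)) // skew is exactly 0.5, so the gate does not fire
	fmt.Println(overCap(layout))          // the machine holding 5 is above the cap of 4
}
```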

* ec.balance: keep the imbalance threshold for the node fallback

Dropping the threshold from the whole within-rack phase made the node fallback too
eager: it runs only when machine fault tolerance is unachievable, so it is cosmetic
load distribution that should defer to the global utilization phase. Without the
gate it would, for a one-server-per-host 6/4 split at threshold 0.5, schedule a count
move that worsens utilization balance. Restore the threshold there; machine spreading
keeps bypassing it, since that bound is durability, not cosmetic skew.
2026-06-07 14:14:45 -07:00
7y-9 25f36cd13d fix(s3api): require space in v2 auth prefix (#9852)
* fix(s3api): require space in v2 auth prefix

Problem: Signature V2 Authorization headers with a malformed algorithm token such as AWSX... are accepted as if they were AWS ... headers.

Root cause: validateV2AuthHeader checks HasPrefix("AWS") but then slices past an assumed trailing space, so an extra character after AWS is skipped and the rest is parsed as credentials.

Fix: Require the Authorization header to start with the exact AWS plus space prefix before parsing fields.
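
A minimal sketch of the stricter prefix check, assuming the standard Signature V2 header shape "AWS AccessKeyId:Signature". The function name splitV2Auth is hypothetical and is not the project's validateV2AuthHeader.

```go
package main

import (
	"fmt"
	"strings"
)

// v2Prefix includes the trailing space, so "AWSX..." and "AWS4-HMAC-SHA256 ..."
// are both rejected before any slicing happens.
const v2Prefix = "AWS "

// splitV2Auth returns the access key and signature from a V2 Authorization
// header, or ok=false when the header is malformed.
func splitV2Auth(header string) (string, string, bool) {
	if !strings.HasPrefix(header, v2Prefix) {
		return "", "", false
	}
	accessKey, signature, found := strings.Cut(header[len(v2Prefix):], ":")
	if !found || accessKey == "" || signature == "" {
		return "", "", false
	}
	return accessKey, signature, true
}

func main() {
	fmt.Println(splitV2Auth("AWS AKIAEXAMPLE:c2lnbmF0dXJl"))
	fmt.Println(splitV2Auth("AWSXAKIAEXAMPLE:c2lnbmF0dXJl"))
}
```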

Reproduction: go test ./weed/s3api -run 'TestValidateV2AuthHeader/algorithm_prefix_without_space|TestDoesSignV2Match/malformed_auth_-_no_space_after_AWS' -count=1 fails before the fix because AWSXAKIA... is accepted.

Validation: go test ./weed/s3api -run 'TestValidateV2AuthHeader/algorithm_prefix_without_space|TestDoesSignV2Match/malformed_auth_-_no_space_after_AWS' -count=1; go test ./weed/s3api -count=1; git diff --check; git diff --cached --check

* Update weed/s3api/auth_signature_v2.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-06-07 11:52:09 -07:00
7y-9 99bb5db1e3 fix(needle): use discovered file content type (#9851)
Problem: Multipart uploads where the first part was a form field and a later part contained the file used the first part's Content-Type for the file metadata.

Root cause: After finding a later part with a filename, parseUpload copied data and MD5 from part2 but read Content-Type from the original part variable.

Fix: Read Content-Type from the discovered file part.

Reproduction: go test ./weed/storage/needle -run TestParseUploadUsesDiscoveredFilePartContentType -count=1 failed before the fix because the parsed MIME type was text/plain instead of application/x-seaweed-test.

Validation: go test ./weed/storage/needle -run TestParseUploadUsesDiscoveredFilePartContentType -count=1; go test ./weed/storage/needle -count=1; git diff --check; git diff --cached --check
2026-06-07 11:50:34 -07:00
Chris Lu 058569c77b operation: index VidCache by map instead of slice (#9853)
VidCache.cache was a []VidInfo indexed directly by volume id, so caching
one volume with a large id grew the backing array to that many entries
(each 48 bytes), allocating a zeroed slot for every unused id below it. A
single id of 32M cost ~1.5GB resident, plus geometric realloc churn as the
append loop doubled the array.

Use map[uint32]VidInfo so memory scales with the number of volumes actually
cached rather than the largest id seen. Parse ids with strconv.ParseUint (bitSize 32) so
values outside the uint32 volume-id range are rejected instead of silently
wrapping into a key.
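
The map-keyed cache might look roughly like this. It is a simplified sketch: VidInfo's fields and the Set/Get method names are assumptions, and locking and expiry are left out.

```go
package main

import (
	"fmt"
	"strconv"
)

// VidInfo is a hypothetical, trimmed-down cache value.
type VidInfo struct {
	Locations []string
}

// VidCache keys entries by volume id, so memory follows the number of cached
// volumes rather than the largest id seen.
type VidCache struct {
	cache map[uint32]VidInfo
}

// Set parses the id with bitSize 32, so ids outside the uint32 range return
// an error instead of wrapping into a valid key.
func (c *VidCache) Set(id string, info VidInfo) error {
	n, err := strconv.ParseUint(id, 10, 32)
	if err != nil {
		return fmt.Errorf("invalid volume id %q: %w", id, err)
	}
	if c.cache == nil {
		c.cache = make(map[uint32]VidInfo)
	}
	c.cache[uint32(n)] = info
	return nil
}

// Get looks up a cached entry by its string id.
func (c *VidCache) Get(id string) (VidInfo, bool) {
	n, err := strconv.ParseUint(id, 10, 32)
	if err != nil {
		return VidInfo{}, false
	}
	info, ok := c.cache[uint32(n)]
	return info, ok
}

func main() {
	c := &VidCache{}
	fmt.Println(c.Set("32000000", VidInfo{Locations: []string{"10.0.0.1:8080"}}))
	fmt.Println(len(c.cache)) // one entry, not 32 million slots
	fmt.Println(c.Set("4294967296", VidInfo{}) != nil)
}
```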
2026-06-07 11:46:57 -07:00
Chris Lu 755af4adf4 s3: actually bind outbound connections when -ip.bind is set (#9849)
* s3: set outbound bind IP before the first filer dial

Standalone weed s3 dialed the filer for GetFilerConfiguration before
SetOutboundLocalIP ran, so that gRPC conn was created with the stock
dialer and no source address. gRPC caches conns by address and reuses
the original dialer on reconnect, so the s3->filer connection kept
leaving from the OS-chosen source for the life of the process even
after the bind IP was set a moment later.

* grpc: install the outbound-bind dialer unconditionally

The dialer was installed only when OutboundLocalAddr was already set at
GrpcDial time, baking the source-address decision into the cached conn,
so a conn dialed before the bind IP was configured never bound.

Install the context dialer always and decide per dial: bind through
OutboundDialContext once a source is set, otherwise fall back to the
stock net.Dialer so default deployments keep gRPC's dial timeout and
keepalive behavior. The bind now applies on the next reconnect
regardless of ordering, matching the HTTP transport's unconditional
DialContext.
2026-06-07 10:20:58 -07:00
Chris Lu 0e9fc6c5ba worker: drop ec.balance from the default admin script (#9848)
The dedicated ec_balance task worker handles EC shard balancing now,
so the periodic admin script no longer needs to run it.
2026-06-07 00:55:11 -07:00
Chris Lu b2127c86f4 admin: show S3 servers under Cluster (#9847)
* s3: register data center with master on startup

* admin: show S3 servers under Cluster

* admin: add S3 servers to the dashboard
2026-06-07 00:32:20 -07:00
Chris Lu 01637410e2 test(s3): address review feedback on the versioning suite (#9846)
- Different-users bucket test: use getNewBucketName() so the bucket carries the
  tracked prefix and run id and gets swept if the test leaks, instead of an
  untracked name.
- Makefile: clarify that '.' matches the opt-in stress tests but they self-skip
  without ENABLE_STRESS_TESTS, so they don't execute in the default run.
- Versioned list test: guard the Object.Size dereference with require.NotNil.
2026-06-06 20:50:09 -07:00
Chris Lu d321f9efb4 s3: collapse suspended-versioning deletes onto one null marker (#9845)
A suspended-versioning DELETE was recorded with createDeleteMarker, which mints a
fresh real version id each time, so repeated suspended deletes piled up delete
markers instead of overwriting a single null marker as S3 specifies. Record the
suspended delete as a 'null' marker with a fixed file name (v_null) and point the
latest-version pointer at it explicitly; putSuspendedVersioningObject's existing
null-version cleanup removes it on the next suspended PUT, so the object undeletes
cleanly and at most one null marker exists. Enabled-versioning deletes are
unchanged (still distinct historical markers).

Update TestSuspendedVersioningDeleteBehavior to the AWS-correct counts: one null
marker after a suspended delete, and the null marker plus one real marker after a
re-enabled delete.
2026-06-06 20:49:38 -07:00
Chris Lu fa9bf58c86 test(s3): make the whole versioning suite pass and gate it in CI (#9844)
* test(s3): correct bucket-recreate expectations and cover the different-owner case

A same-owner CreateBucket on an existing bucket returns BucketAlreadyOwnedByYou
(idempotent recreate); the suite expected BucketAlreadyExists, which only applies
when the name is owned by someone else. Fix the same-owner cases (plain and
Object-Lock) and implement the previously-skipped different-owner test, which now
exercises the BucketAlreadyExists path via a second identity.

* test(s3): assert the deletion invariant for suspended-versioning delete

A suspended-versioning DELETE removes the null version and records a delete marker
so the object reads as deleted; the test expected no marker, which would let an
older version resurface. Assert that a marker is recorded (and read DeleteMarker
through aws.ToBool) rather than an exact count, so it holds whether or not the
suspended-marker id/dedup is later collapsed to AWS's single null marker.

* test(s3): run the whole versioning suite by default

TEST_PATTERN was TestVersioning, which left bucket-creation, suspended-delete and
directory/version-listing tests ungated. Default to '.' so every test runs; opt-in
stress tests self-skip without ENABLE_STRESS_TESTS and keep their own targets.
2026-06-06 18:38:28 -07:00
Chris Lu 795349d796 test(s3): deref Object.Size in versioned list assertion (#9843)
TestVersionedObjectListBehavior compared int64 against listedObject.Size,
which is *int64, so the assertion always failed on a type mismatch once
reached. Dereference it (and in the log line).
2026-06-06 18:02:36 -07:00
Chris Lu 309cb32416 s3: list directory key objects in versioned bucket version listings (#9842)
ListObjectVersions gated explicit directory objects on Mime ==
FolderMimeType, but an SDK PutObject of "dir/" carries a default
Content-Type (e.g. application/octet-stream), so those directory keys
were dropped from the version listing while ListObjectsV2 - which keys
off IsDirectoryKeyObject (any non-empty mime) - still showed them. Use
the same IsDirectoryKeyObject check so the two listings agree.

The directory test's storage-class assertion compared an ObjectStorageClass
constant against ObjectVersion.StorageClass (ObjectVersionStorageClass);
the values matched but the SDK enum types did not, so it only surfaced
once the directories started appearing. Use the matching constant.
2026-06-06 18:02:33 -07:00
Chris Lu 6c1fd3aeab s3: rescan .versions when the cached latest pointer is missing on a list (#9841)
* s3: rescan .versions when the cached latest pointer is missing on a list

ListObjectsV2 resolves each versioned object's current version from the
latest-version pointer cached on the .versions directory entry. When that
pointer is absent on the filer serving the list, the object was dropped
from the listing. Fall back to a read-only rescan of .versions/ to pick
the newest version - the version files are present locally even when the
cached pointer is not - so the object still lists. This mirrors the read
path's recoverLatestVersionWithoutPointer; the scan loop is shared.

Read-only by design: a list can touch many objects, so it does not persist
a pointer.

* s3: copy scanned Extended before stamping the version id
2026-06-06 18:02:30 -07:00
Chris Lu 9ede92a7cc filer: replicate RECOMPUTE_LATEST pointer updates to peers (#9840)
applyRecomputeLatest wrote the .versions latest-version pointer and the
demoted prior version's stamp through UpdateEntry without a following
NotifyUpdateEvent, so neither change entered the metadata log. Across
filers the pointer then lived only on whichever filer ran the mutation,
and ListObjects served by any other filer dropped those objects from a
versioned bucket. Emit the events the way PATCH_EXTENDED already does,
keeping a pre-update image for the notification diff.
2026-06-06 18:02:28 -07:00
Chris Lu 6e16994615 s3: make lifecycle TTL fast path per-bucket opt-in (#9825)
Stamping an Expiration.Days rule as a volume TTL at write time bakes an
irreversible TTL into the object: removing or lengthening the rule later
can't un-expire it, unlike worker-driven expiration. The metadata-only
delete it enables also skips per-chunk DeleteFile, so dead bytes linger in
a not-yet-expired TTL volume with no deleted-byte accounting until the
whole volume ages out.

Gate the resolver on a per-bucket flag, off by default; toggle with the
s3.bucket.lifecycle.fastpath shell command. Default writes take the worker
path: real deletes that honor current policy and let vacuum reclaim space.
2026-06-06 11:20:15 -07:00
Aleksei Sviridkin 3688be82f5 fix(helm): deduplicate all-in-one extra environment variables (#9837)
* fix(helm): deduplicate all-in-one extra environment variables

The all-in-one Deployment looped global.seaweedfs.extraEnvironmentVars and
allInOne.extraEnvironmentVars in two separate ranges, so any key present in
both maps was emitted as two env entries with conflicting values. It also
computed a merged map for the cluster-default lookup but never used it for
the env loop.

Use the existing seaweedfs.mergeExtraEnvironmentVars helper (as the filer,
master and s3 templates already do) so a key set in both maps renders once
with the component value taking precedence, and add a chart-CI render
assertion covering it.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>

* ci(helm): drop checkmark glyphs from chart test output

---------

Signed-off-by: Aleksei Sviridkin <f@lex.la>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-05 15:31:18 -07:00
Aleksei Sviridkin ae4ad6859d fix(helm): suspend bucket versioning for YAML bool false (#9836)
* fix(helm): suspend bucket versioning for YAML bool false

createBuckets[].versioning accepts both a YAML bool and a string. The
string branch maps "false"/"disable"/"suspended" to Suspended, but the
bool branch only handled true (Enabled) and left false as a silent no-op.
The same logical value therefore behaved differently depending on its
YAML type: `versioning: false` did nothing while `versioning: "false"`
suspended the bucket.

Mirror the string behaviour in the bool branch so bool false suspends the
bucket, and add a chart-CI render assertion covering it.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>

* ci(helm): trim versioning regression-test comment

* chart: document bool false for createBuckets versioning

---------

Signed-off-by: Aleksei Sviridkin <f@lex.la>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-05 15:18:10 -07:00
Chris Lu be7f417a03 ip.bind: bind outbound connections to the configured address (#9834)
* ip.bind: bind outbound connections to the configured address

-ip.bind only governed listeners; outbound gRPC and HTTP connections let
the OS pick the source IP, which may not even be able to reach the
target. Mirror the bind address into a process-global source address and
apply it to outbound TCP dials: the gRPC context dialer, the per-client
HTTP transports, and the default transport. Loopback targets and unix
sockets keep the OS-chosen source so same-host traffic still works.

* ip.bind: first-write-wins source IP, skip on address-family mismatch

Make SetOutboundLocalIP first-write-wins so a `weed server` component's own
bind setting (run in its goroutine) can't clobber the process-wide source
address the top-level -ip.bind already established for the other components.

Skip source binding when the target is a literal IP of a different family
than the bind address, since forcing a mismatched source fails the dial.
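
The source-address rules in this commit (first-write-wins, loopback exemption, address-family check) can be sketched with the standard library alone. The outboundSource type and its methods are hypothetical, and unix sockets and hostname resolution are out of scope here.

```go
package main

import (
	"fmt"
	"net"
	"sync/atomic"
)

// outboundSource holds the process-wide source IP for outbound dials.
type outboundSource struct {
	ip atomic.Pointer[net.IP]
}

// Set records the bind address only if none has been set yet and reports
// whether this call won.
func (s *outboundSource) Set(addr string) bool {
	ip := net.ParseIP(addr)
	if ip == nil {
		return false
	}
	return s.ip.CompareAndSwap(nil, &ip)
}

// SourceFor returns the source IP to bind for a target host, or "" when the
// OS should pick the source.
func (s *outboundSource) SourceFor(targetHost string) string {
	p := s.ip.Load()
	if p == nil || targetHost == "localhost" {
		return ""
	}
	bind := *p
	if target := net.ParseIP(targetHost); target != nil {
		if target.IsLoopback() {
			return "" // same-host traffic keeps the OS-chosen source
		}
		if (bind.To4() == nil) != (target.To4() == nil) {
			return "" // IPv4/IPv6 mismatch: binding would fail the dial
		}
	}
	return bind.String()
}

func main() {
	var s outboundSource
	fmt.Println(s.Set("10.0.0.5"), s.Set("10.0.0.9"))
	fmt.Println(s.SourceFor("10.0.0.7"))
}
```

A dialer could then set net.Dialer.LocalAddr from the returned address when it is non-empty.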
2026-06-05 12:44:21 -07:00
Nguyễn Lộc Phúc 7f15a9fed4 fix(s3api): standardize ETag calculation in copy handlers (#9829)
* fix(s3api): standardize ETag calculation across S3 API handlers

* s3: make copyEntryETag nil-safe

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-05 12:41:18 -07:00
Chris Lu 6bd0091c72 master: grow rack-spanning volumes once per DC, capped at copy_N (#9835)
* master: grow rack-spanning volumes once per DC, capped at copy_N

The periodic rack-aware growth scan grew once per rack. For rack-spanning
replication (DiffRackCount > 0) a single logical volume already covers every
rack the placement needs, so a crowded volume made every rack report
should-grow and the scan created racks x step volumes, more than needed: with "010"
across two racks that is 2 racks x step 2 = 4 logical (8 physical) volumes.

Plan one DC-wide grow for rack-spanning replication, and cap the per-event
step at master.volume_growth.copy_N so lowering it reduces periodic growth.

* master: distribute lastGrowCount evenly across uneven DCs

The non-rack-spanning grow divisor used the current DC's rack count, so DCs
with different rack counts each over-grew. Sum every rack up front and divide
lastGrowCount by that global count instead.
2026-06-05 12:39:59 -07:00
Chris Lu ab7be7867d security: hot-reload JWT signing keys on SIGHUP (#9826)
* security: reload JWT signing keys on SIGHUP

Signing keys were read once in the server constructors and never
refreshed. After a key rotation (Secret update, divergent reads) the
in-memory key stayed stale and every request kept failing "wrong jwt"
until the affected process was restarted.

Add Guard.UpdateSigningKeys and call it from the master, volume and
filer reload paths and the s3 reload hook, next to the existing
whitelist refresh. Make the global chunk-read JWT cache reloadable via
an atomic swap, and register the master's Reload with grace.OnReload --
it was never wired, so the master ignored SIGHUP entirely.

Mirror the same refresh in the Rust volume server's SIGHUP handler.

* security: swap signing keys behind an atomic pointer

Addresses review feedback on the in-place key swap: SigningKey is a
[]byte, so reassigning the Guard fields while a request handler reads
them is a data race that can tear the multi-word slice header and read
out of bounds.

Hold the four signing-key fields in an immutable signingConfig snapshot
behind atomic.Pointer; UpdateSigningKeys swaps the whole pointer, so a
reader sees either the old keys or the new ones. Reads go through new
SigningKey/ExpiresAfterSec/ReadSigningKey/ReadExpiresAfterSec accessors.
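
A simplified sketch of the snapshot-behind-a-pointer pattern described above. The field and accessor names follow the commit message; the constructor and the exact signatures are assumptions.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// signingConfig is an immutable snapshot; it is never mutated after Store.
type signingConfig struct {
	SigningKey          []byte
	ExpiresAfterSec     int
	ReadSigningKey      []byte
	ReadExpiresAfterSec int
}

// Guard exposes the current keys through accessors only.
type Guard struct {
	cfg atomic.Pointer[signingConfig]
}

// NewGuard is a hypothetical constructor for this sketch.
func NewGuard(key string, expires int, readKey string, readExpires int) *Guard {
	g := &Guard{}
	g.UpdateSigningKeys(key, expires, readKey, readExpires)
	return g
}

// UpdateSigningKeys swaps the whole snapshot, so a concurrent reader sees
// either all old values or all new ones.
func (g *Guard) UpdateSigningKeys(key string, expires int, readKey string, readExpires int) {
	g.cfg.Store(&signingConfig{
		SigningKey:          []byte(key),
		ExpiresAfterSec:     expires,
		ReadSigningKey:      []byte(readKey),
		ReadExpiresAfterSec: readExpires,
	})
}

// snapshot returns the current config, or an empty one for a zero Guard.
func (g *Guard) snapshot() *signingConfig {
	if c := g.cfg.Load(); c != nil {
		return c
	}
	return &signingConfig{}
}

func (g *Guard) SigningKey() []byte     { return g.snapshot().SigningKey }
func (g *Guard) ExpiresAfterSec() int   { return g.snapshot().ExpiresAfterSec }
func (g *Guard) ReadSigningKey() []byte { return g.snapshot().ReadSigningKey }

func main() {
	g := NewGuard("old-key", 10, "old-read-key", 60)
	g.UpdateSigningKeys("new-key", 20, "new-read-key", 120) // e.g. from a SIGHUP handler
	fmt.Println(string(g.SigningKey()), g.ExpiresAfterSec())
}
```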

The Rust guard is already safe: every read and the SIGHUP write go
through the shared RwLock<Guard>.

* security: fold whitelist + auth state into the atomic snapshot

Review follow-up. UpdateSigningKeys still wrote isWriteActive while the
request path read it (and the whitelist maps) unsynchronized, so a SIGHUP
under load could expose an inconsistent mix of activation bits and
whitelist contents.

Move all hot-reloadable Guard state -- keys, expirations, whitelist, and
the activation flags -- into a single immutable guardState swapped behind
one atomic.Pointer. The Update* methods take a small mutex to serialize
the read-modify-write; readers stay lock-free. The concurrency test now
also rotates the whitelist and probes IsWhiteListed under -race.

Also read each signing key once per branch in the volume/filer JWT auth
checks, so a reload landing mid-check can't take the allow-fast-path
after auth was enabled or verify against a different key than the branch
saw.
2026-06-04 22:26:08 -07:00
Chris Lu 0d72023fac fix(master): advance maxVolumeId when registering EC shards (#9827)
* fix(master): advance maxVolumeId when registering EC shards

After EC encoding the original normal volume is deleted, so a
high-numbered volume can exist only as EC shards. Only regular volumes
advanced maxVolumeId (Disk.doAddOrUpdateVolume), so a master that
rebuilt its state from heartbeats (raft state not resumed) undercounted
the max and NextVolumeId could re-issue an id that EC shards still
occupy. A new volume then gets created on top of the EC volume id; new
writes land on it, but reads route to the old EC shards whose .ecx never
held the new needle, returning 404 and corrupting that object.

Advance maxVolumeId when EC shards are registered, mirroring the
regular-volume path. RegisterEcShards is the chokepoint both the full
and incremental heartbeat sync paths funnel through.

* test: cover incremental heartbeat path for EC maxVolumeId

Both SyncDataNodeEcShards and IncrementalSyncDataNodeEcShards funnel
through RegisterEcShards; assert the invariant on the incremental path
too.
2026-06-04 22:25:30 -07:00
Chris Lu 8d59069a0a s3: return BucketAlreadyOwnedByYou when recreating your own bucket (#9822)
* s3: return BucketAlreadyOwnedByYou when recreating your own bucket

PutBucket returned BucketAlreadyExists for every existing bucket, even
when the caller already owns it, so idempotent re-creation (e.g. a
container that creates its bucket on startup) couldn't tell "someone
else took the name" from "it's already mine".

Recreating a bucket you own now returns BucketAlreadyOwnedByYou, unless
the request conflicts with the existing bucket: a different Object Lock
setting, or an ACL on the request or the existing bucket. To detect the
latter, a requested non-default canned/grant ACL is now persisted on
creation instead of being dropped.

* s3: fail PutBucket when the existing bucket's config can't be read

When a bucket already exists, an unreadable config left the recreate
defaulting to BucketAlreadyOwnedByYou, masking the backend error and
possibly accepting a conflicting recreate (Object Lock / ACL unknown).
Surface the read error instead.

* s3: return the stored bucket ACL from GetBucketAcl

GetBucketAcl always returned the owner's default full-control grant and
ignored any stored ACL, so a bucket created with a canned ACL or one set
via PutBucketAcl never read back correctly. Decode the stored grants
instead, sharing one grants-to-XML helper with the object ACL handler.

The shared helper also emits each grantee's real xsi:type (e.g. Group for
public-read) instead of a hardcoded CanonicalUser, so group grants read
back correctly for both bucket and object ACLs.

* s3: resolve the right already-exists error on the concurrent-create race

When two requests create the same bucket at once, the loser's mkdir
fails and the handler fell back to a flat BucketAlreadyExists, bypassing
the same-owner idempotency check. Route both the pre-check and the race
fallback through one existingBucketError helper so a same-owner recreate
still gets BucketAlreadyOwnedByYou.

* s3: record the bucket owner's account id at creation

setBucketOwner only stored the creating identity name, so the canonical
account id wasn't available later. Persist it under ExtAmzOwnerKey too,
the same field PutBucketAcl writes, so the bucket owner can be reported
independently of whoever reads it.

* s3: report the bucket owner from GetBucketAcl, not the caller

GetBucketAcl built the ACL Owner from the caller's account header, so an
admin or cross-account read returned the wrong owner. Use the owner
persisted on the bucket, falling back to the caller only when none is
recorded.
2026-06-04 15:33:03 -07:00
Chris Lu a24f4844d3 filer: keep S3 list order byte-lexicographic regardless of SQL name column collation (#9824)
* mysql: keep S3 list order byte-lexicographic regardless of name column collation

ORDER BY name and the name > ? pagination predicate follow the column
collation, so a case-insensitive filemeta.name (e.g. utf8mb3_general_ci)
returns S3 keys out of byte order and breaks clients that merge two sorted
listings.

Detect the live name collation at startup; only when it isn't binary, wrap
the list comparison, prefix, and ORDER BY in BINARY name so order and
pagination stay consistent. Correctly configured utf8mb4_bin tables keep
their indexed range scan unchanged, and the operator gets a warning to
convert the column.

* postgres: keep S3 list order byte-lexicographic regardless of name column collation

ORDER BY name and the name > $n pagination predicate follow the column or
database collation, so a locale-aware filemeta.name (e.g. the en_US.UTF-8
database default) returns S3 keys out of byte order and breaks clients that
merge two sorted listings.

Detect the live name collation at startup; only when it isn't byte-ordered,
wrap the list comparison, prefix, and ORDER BY in COLLATE "C" so order and
pagination stay consistent. A byte-ordered (C/POSIX/C.UTF-8) column keeps its
indexed range scan unchanged, and the operator gets a warning to declare the
column COLLATE "C".
2026-06-04 14:33:41 -07:00
Chris Lu 8c2d9f466f filer: stream persisted log files when serving metadata subscriptions (#9821)
* filer: stream persisted log files when serving metadata subscriptions

readFileEntries buffered every LogEntry of a whole log file into memory
before returning them one by one, so the memory held by a subscription
read was O(entries in one log file) instead of O(one entry). On a filer
with large per-entry metadata, many concurrent SubscribeMetadata streams
each loading a full log file exhausted memory.

Keep a current LogFileIterator and return one entry at a time, advancing
files as each is exhausted. The deleted-volume skip is preserved.

* filer: close the log file iterator on read errors too

A genuine read error returned early without closing the current
LogFileIterator, leaving its ChunkStreamReader alive until GC. Close on
every exit path and propagate only a real error.

* filer: close persisted-log iterators when a subscription stops early

The streaming iterator keeps a log file reader open across calls, so a
subscription that returns before EOF (early stop, cancellation) left the
reader alive until GC. Add idempotent Close on LogFileQueueIterator and
OrderedLogVisitor, and have ReadPersistedLogBuffer wait for the readahead
goroutine and close the visitor on the way out.
2026-06-04 13:27:25 -07:00
7y-9 6e8002f065 fix: handle meta backup offset errors safely (#9818)
* fix: log meta backup offset errors

* fix: exit on meta backup offset errors

Exit with a non-zero status when the initial metadata backup offset cannot be persisted.

Classify offset-read failures during streaming so the backup process exits instead of retrying forever, allowing supervisors to restart and bootstrap from a missing checkpoint.

* meta backup: read offset in the loop, drop offset error type

Reading the saved offset inside the retry loop makes an offset read
failure a clean exit and a stream error a retry, without a typed error
to tell them apart. streamMetadataBackup now takes the start time.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-04 10:53:23 -07:00
Chris Lu 3e8ec879c4 s3: keep dynamic IAM live when -iam.config is set (#9817)
* s3: keep dynamic IAM live when -iam.config is set

-iam.config was treated like a static -config identity file: it set
useStaticConfig, which makes the filer metadata subscription skip
reloads. Identities and policies created at runtime (the IAM gRPC API)
then never took effect, so advanced IAM (OIDC/STS) and dynamic IAM were
mutually exclusive.

Gate useStaticConfig on whether inline identities were actually loaded.
An OIDC/STS-only config carries none, so it keeps the dynamic credential
store live; a -config identity file still freezes its identities as
before.

* s3: mark static identities on config reload too

A -config reload (grace.OnReload) re-reads the file, but only the startup
path marked its identities static, so identities added to the file and
reloaded were left unprotected from dynamic filer updates. Move the
marking into loadS3ApiConfigurationFromFile and make it additive and
scoped to the file's identities, so a reload protects newly added ones
without freezing dynamic filer-managed identities.

* s3: sync reloaded static identities into the credential manager

After marking a (re)loaded config file's identities static, push the
updated set into the credential manager so reloaded identities still
appear in listings and survive later dynamic merges. Centralize the sync
in loadS3ApiConfigurationFromFile and drop the now-redundant call in the
reload hook.
2026-06-03 23:28:25 -07:00
Fabian Hardt ce6a51468a sftpd: support SSH user certificates signed by a trusted CA (#9815)
* sftpd: support SSH user certificates signed by a trusted CA

Adds a new "certificate" auth method to weed sftp. When enabled, the server
loads trusted CA public keys from -trustedUserCAKeysFile (OpenSSH
authorized_keys format, one or more keys) and accepts only ssh.Certificate
blobs of type UserCert on the public-key channel. Validation uses
ssh.CertChecker, which checks the CA signature, the ValidAfter/ValidBefore
window, and that ValidPrincipals is non-empty and contains the SSH login
user. The authenticated user must exist in the user store; home directory
and permissions resolve as before.

Behaviour mirrors MinIO's --sftp=trusted-user-ca-key and OpenSSH's
TrustedUserCAKeys: when certificate auth is active, plain (non-cert) public
keys are rejected even if "publickey" is also listed. Default authMethods
remain "password,publickey", so existing deployments are unaffected.

* Update weed/sftpd/auth/certificate.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* sftpd: address review feedback on certificate auth

- Pre-marshal trusted CA public keys in IsUserAuthority instead of
  re-marshaling on every authentication attempt (gemini-code-assist).
- Differentiate user-not-found from underlying store errors via
  errors.As(*user.UserNotFoundError) so backend/read failures are no
  longer reported as bad credentials (coderabbitai).
- Fix the corresponding sanity check in the missing-file test to use
  errors.As instead of errors.Is (UserNotFoundError has no Is method,
  so the previous check never matched) (coderabbitai).

* sftpd: register trustedUserCAKeysFile flag in filer and server commands

The new field on SftpOptions is dereferenced unconditionally in
resolvePaths(), but only the standalone `weed sftp` command was wiring
its flag. `weed filer` and `weed server` both embed an SftpOptions value
and call resolvePaths() on it, so they hit a nil pointer dereference at
startup.

Register `-sftp.trustedUserCAKeysFile` in both commands and update the
-sftp.authMethods help text to mention the new "certificate" method.

Fixes the SFTP Integration Tests CI failure on this PR.

* helm: expose SFTP certificate auth in the SeaweedFS chart

Adds Helm-chart support for the new SSH user-certificate auth method:

- values.yaml (sftp:) gains `trustedUserCAKeys` (inline OpenSSH
  authorized_keys-format CA public keys) and `existingCAKeysSecret`
  (reference an externally managed Secret). Same pair added under
  allInOne.sftp with a null default that falls back to the top-level
  sftp.* setting.
- New template templates/sftp/sftp-ca-secret.yaml renders a
  chart-managed Secret <release>-sftp-ca-secret with `ca_user.pub`,
  but only when SFTP is enabled, "certificate" is in authMethods,
  inline keys are provided, and no existingCAKeysSecret is set.
- templates/sftp/sftp-deployment.yaml and the all-in-one deployment
  template add `-trustedUserCAKeysFile=/etc/sw/sftp_ca/ca_user.pub`
  to the weed sftp command, mount the CA secret at /etc/sw/sftp_ca
  and add the corresponding volume. All cert-auth bits are guarded
  by `contains "certificate" authMethods` so existing users see no
  change.
- authMethods help text updated to mention "certificate".

Verified end-to-end on a local k3d cluster: cert login succeeds,
plain-pubkey login is rejected with "public key without certificate
not allowed".

* helm: fail render when SFTP certificate auth lacks CA keys

When certificate is in authMethods but neither trustedUserCAKeys nor
existingCAKeysSecret is set, the deployment mounted a secret that the
chart never renders, leaving the pod stuck on a missing volume. Fail at
template time with a clear message instead.

* sftpd: fix stale auth-method list in SFTPServiceOptions comment

keyboard-interactive was never implemented; certificate is the new
supported method. Match the CLI help text.

* sftpd: test Manager wiring of certificate vs public-key channel

Cover the channel takeover at the Manager level: certificate auth
displaces plain public-key auth when both are enabled, public-key auth
stays put otherwise, and enabling certificate without a CA file errors.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-03 22:32:47 -07:00
Chris Lu df879e1ed7 filer: bound TraverseBfsMetadata memory by queuing directory paths (#9814)
* filer: bound TraverseBfsMetadata memory by queuing directory paths

The BFS enqueued every entry, so it held the whole subtree in memory
including each file's chunk list. A filer serving a peer's first-time
bootstrap traversal of a large tree could exhaust memory and get killed.

Stream each entry as it is visited and queue only directory paths to
descend into. Memory is now bounded by the number of directories rather
than the entire tree, and the streamed output order is unchanged.
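
A rough sketch of the directory-only queue, assuming an in-memory map in place of the filer's listing call. The entry type, the tree map, and the visit callback are all hypothetical; real entries also carry chunk lists, which is what made queuing them expensive.

```go
package main

import (
	"path"
	"sort"
)

// entry is a simplified directory entry.
type entry struct {
	name  string
	isDir bool
}

// traverse streams every entry under root to visit and queues only the
// paths of directories still to be listed. tree maps a directory path to
// its children and stands in for a listing call.
func traverse(tree map[string][]entry, root string, visit func(p string)) {
	queue := []string{root}
	for len(queue) > 0 {
		dir := queue[0]
		queue = queue[1:]
		children := append([]entry(nil), tree[dir]...)
		sort.Slice(children, func(i, j int) bool {
			return children[i].name < children[j].name
		})
		for _, c := range children {
			p := path.Join(dir, c.name)
			visit(p) // streamed immediately; the entry itself is not retained
			if c.isDir {
				queue = append(queue, p)
			}
		}
	}
}
```

The queue holds one string per pending directory, so its size tracks the directory count rather than the total number of files.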

* filer: match excluded prefixes on path-component boundaries

Only treat an excluded prefix as a match when it ends at a path
boundary, so excluding /a/b does not also drop a sibling like /a/bc.
Short-circuit the trie walk on the first real match.
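
The boundary rule can be shown without the trie. The helper below is a sketch with a hypothetical name; it assumes slash-separated paths and treats an empty prefix as matching everything.

```go
package main

import "strings"

// hasPathPrefix reports whether prefix covers path on a path-component
// boundary: the match must end at the end of path or just before a '/'.
func hasPathPrefix(path, prefix string) bool {
	prefix = strings.TrimSuffix(prefix, "/")
	if prefix == "" {
		return true // assumption: an empty or root prefix covers every path
	}
	if !strings.HasPrefix(path, prefix) {
		return false
	}
	return len(path) == len(prefix) || path[len(prefix)] == '/'
}
```

Excluding /a/b therefore covers /a/b and /a/b/c but leaves the sibling /a/bc alone.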
2026-06-03 10:28:42 -07:00
Konstantin Lebedev df833d485f [test] update docker image for s3test (#9811) 2026-06-03 09:45:00 -07:00
Lars Lehtonen d321e463e9 chore(weed/storage/needle): prune unused test functions (#9812) 2026-06-03 09:26:28 -07:00
Chris Lu ef1aa4f936 s3: defer a recently-unreachable owner that is also the current filer (#9808)
Blanking preferred kept route-by-key reads from dialing a flagged owner first,
but withFilerClientFailover always re-adds the current filer, so when the owner
is the gateway's current filer it stayed in the candidate list and got dialed
anyway. Treat a recently-unreachable filer as unhealthy in the health partition
so it is deferred to the last-resort tail instead of tried before healthy
replicas; preferred is still tried first, and a live owner is unaffected.
2026-06-03 00:28:56 -07:00
Chris Lu 2d1b8be22b s3: route object reads to the key's owner filer (#9806)
* s3: route object reads to the key's owner filer

Writes already route by key to the owner filer on the lock ring, where the
entry is created. Reads went to the gateway's local filer and treated its
NotFound as authoritative, so a GET on one gateway could miss an object
another gateway had just written until the filers' metadata replication
caught up.

Resolve an object's entry from the key's owner first, failing over to the
gateway's filer set only on transport errors. An owner NotFound stays
authoritative: no fan-out across filers, and no resurrecting a peer's
not-yet-replicated tombstone, so a delete routed to the owner is visible at
once and a genuine miss costs one lookup. Keys owned by the local filer are
unchanged. Objects written through the non-routed lock path land on a
gateway's local filer, so they can still read as absent on the owner until
they replicate.

withFilerClientFailover takes a preferred start filer; the object-entry
reads pass the owner, every other caller passes "" and keeps the
current-filer fast path.

* s3: consult the prior owner on a rebalance-window read miss

Owner-first reads route a key to its current ring owner. When a filer joins,
~1/N of keys reassign to it, and the new owner may not have replicated a
just-moved key yet, so an owner NotFound would surface a transient 404 for an
object that already exists elsewhere.

Retain the previous ring on the gateway's LockClient for a cooling-off window
(PriorOwnerForKey, mirroring the master's LockRing.PriorOwner) and, on the
owner's NotFound, probe the key's previous owner once before treating the miss
as final. The probe is scoped to keys whose ownership actually moved and only
within the window, so steady-state reads are untouched.

This trades the transient scale-up 404 for a transient stale read if a delete
routed to the new owner races the same window. It is the same
authoritative-NotFound tradeoff, narrowed to the rebalance.

* s3: try healthy filers before unhealthy ones on failover

The candidate list probed its first entry (usually the current filer)
unconditionally, so a health-flagged current filer cost a transport timeout on
every ordinary call before failover reached a replica. Partition candidates into
healthy and unhealthy, keep priority within each, and fall back to unhealthy
ones only when all healthy ones fail.

* reduce comments on the routed read and lock client paths

* s3: skip a recently-unreachable owner on route-by-key reads

The gateway's filer health tracking no-ops for an owner outside the static
-filer list, so during a sustained owner outage every route-by-key read
re-dials the dead owner before failing over. Flag an owner whose owner-first
read hit a transport error and skip it (read local-first) for a short TTL, so
reads pay one dead dial per TTL instead of one per request; the flag expires so
owner-first reads resume once the owner or the ring recovers.

* s3: always try the preferred owner first, health-order only the rest

The healthy/unhealthy partition also demoted a health-flagged preferred owner
behind healthy replicas, so a replica's authoritative NotFound could mask a
write that had only reached the owner — the read-after-write race this routing
exists to close. Pull preferred out of the partition and keep it first; the
recently-unreachable gate already steers reads away from a genuinely dead owner.
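
The resulting dial order can be sketched as below, assuming filers are identified by address strings and health is a simple set. The function name and parameters are hypothetical and do not mirror withFilerClientFailover's real signature.

```go
package main

// orderCandidates returns the order in which filers would be tried:
// the preferred owner first even if it is health-flagged, then healthy
// candidates in their original priority, then unhealthy ones as a last resort.
func orderCandidates(preferred string, candidates []string, unhealthy map[string]bool) []string {
	ordered := make([]string, 0, len(candidates)+1)
	seen := map[string]bool{}
	if preferred != "" {
		ordered = append(ordered, preferred)
		seen[preferred] = true
	}
	var tail []string
	for _, c := range candidates {
		if seen[c] {
			continue // skip duplicates, including the preferred owner
		}
		seen[c] = true
		if unhealthy[c] {
			tail = append(tail, c)
		} else {
			ordered = append(ordered, c)
		}
	}
	return append(ordered, tail...)
}
```

Keeping the preferred owner outside the partition is the point of the last commit: a healthy replica's NotFound cannot answer before the owner has been asked.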
2026-06-03 00:12:28 -07:00
Chris Lu 4e5839ce82 fix(iam): return a valid user ARN from CreateUser and GetUser (#9794)
* fix(iam): return a valid user ARN from CreateUser and GetUser

The terraform aws provider 6.41 reads a user back after creating it and
blocks until GetUser returns a value that passes arn.IsARN. We only set
UserName, so the ARN was empty and apply hung until the 2m timeout.
Populate Arn (and Path) via a shared iam.NewUser helper in both the
embedded and standalone IAM handlers.

* fix(iam): use the userName parameter directly in NewUser

Drop the redundant local copy; the value parameter is already function-local.

* fix(iam): return full user objects with ARNs from GetGroup

GetGroup listed members with only UserName set. Build them via the shared
NewUser helper so group members carry a valid Arn and Path like the other
user responses, in both the embedded and standalone IAM handlers.
2026-06-02 22:01:57 -07:00
Chris Lu f711868fb6 fix(log_buffer): re-check buffer before bailing with ResumeFromDiskError (#9804)
ReadFromBuffer and HasData() take the read lock separately, so a write
that lands between them can make a subscriber which just read a
momentarily empty buffer return ResumeFromDiskError even though the data
is now servable from memory. Re-read under a fresh lock and only bail
when the position is genuinely behind the in-memory window (flushed to
disk); otherwise loop back and read it.
2026-06-02 21:37:15 -07:00
dependabot[bot] 24159fbff9 build(deps): bump opentofu/setup-opentofu from 1 to 2 (#9801)
Bumps [opentofu/setup-opentofu](https://github.com/opentofu/setup-opentofu) from 1 to 2.
- [Release notes](https://github.com/opentofu/setup-opentofu/releases)
- [Commits](https://github.com/opentofu/setup-opentofu/compare/v1...v2)

---
updated-dependencies:
- dependency-name: opentofu/setup-opentofu
  dependency-version: '2'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
2026-06-02 21:37:05 -07:00
Chris Lu 7b44cf5627 fix(iam): implement CreatePolicyVersion for managed policies (#9795)
* fix(iam): implement CreatePolicyVersion for managed policies

The AWS Terraform provider updates a managed policy in place via
CreatePolicyVersion, which returned 501 NotImplemented and broke
terraform apply on any policy change.

Implement CreatePolicyVersion (plus ListPolicyVersions, GetPolicyVersion
and DeletePolicyVersion) on both the standalone IAM server and the
embedded S3 IAM API. Managed policies keep a single current document, so
each is modeled as one default version "v1": CreatePolicyVersion replaces
the document, List/GetPolicyVersion expose it, and DeletePolicyVersion
rejects deleting the default. GetPolicy now reports DefaultVersionId so
the provider's read can fetch the document. The standalone path also
refreshes the cached Identity.Actions of every identity the policy is
attached to so the new document takes effect.

* fix(iam): reject CreatePolicyVersion unless SetAsDefault=true

With a single always-default managed-policy version, a request with
SetAsDefault=false (or omitted) would stage a non-default version on AWS
but here silently replaced the active document. Reject it on both the
standalone and embedded paths.

Isolate the new policy-version tests from the shared package fixtures so
they stay order-independent, and assert IsDefaultVersion on the response.
2026-06-02 21:35:02 -07:00
Chris Lu b6a0bde16b test(s3/iam): scope ListBucket isolation via s3:prefix condition (#9805)
The username-isolation policy denied s3:ListBucket through an object-path
NotResource. ListBucket is bucket-level, so its resource ARN is the bucket
and never matches an object path: the Deny always fired and a user could
not list their own prefix. Scope the per-user List deny with a StringNotLike
s3:prefix condition instead, the same mechanism the matching Allow uses.
2026-06-02 18:41:10 -07:00
dependabot[bot] 6bffe3f56a build(deps): bump google.golang.org/grpc from 1.80.0 to 1.81.1 (#9797)
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.80.0 to 1.81.1.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.80.0...v1.81.1)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.81.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-02 18:27:52 -07:00
dependabot[bot] 69bc7325ca build(deps): bump github.com/ydb-platform/ydb-go-sdk/v3 from 3.134.2 to 3.139.5 (#9796)
Bumps [github.com/ydb-platform/ydb-go-sdk/v3](https://github.com/ydb-platform/ydb-go-sdk) from 3.134.2 to 3.139.5.
- [Release notes](https://github.com/ydb-platform/ydb-go-sdk/releases)
- [Changelog](https://github.com/ydb-platform/ydb-go-sdk/blob/master/CHANGELOG.md)
- [Commits](https://github.com/ydb-platform/ydb-go-sdk/compare/v3.134.2...v3.139.5)

---
updated-dependencies:
- dependency-name: github.com/ydb-platform/ydb-go-sdk/v3
  dependency-version: 3.139.5
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
2026-06-02 17:22:03 -07:00
dependabot[bot] 348de64f13 build(deps): bump github.com/aws/aws-sdk-go-v2/config from 1.32.14 to 1.32.21 (#9798)
Bumps [github.com/aws/aws-sdk-go-v2/config](https://github.com/aws/aws-sdk-go-v2) from 1.32.14 to 1.32.21.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/config/v1.32.14...config/v1.32.21)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2/config
  dependency-version: 1.32.21
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-02 17:14:58 -07:00
dependabot[bot] 30e0116611 build(deps): bump go.etcd.io/etcd/client/pkg/v3 from 3.6.11 to 3.6.12 (#9799)
Bumps [go.etcd.io/etcd/client/pkg/v3](https://github.com/etcd-io/etcd) from 3.6.11 to 3.6.12.
- [Release notes](https://github.com/etcd-io/etcd/releases)
- [Commits](https://github.com/etcd-io/etcd/compare/v3.6.11...v3.6.12)

---
updated-dependencies:
- dependency-name: go.etcd.io/etcd/client/pkg/v3
  dependency-version: 3.6.12
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-02 17:14:50 -07:00
dependabot[bot] d6db545b4c build(deps): bump github.com/pierrec/lz4/v4 from 4.1.26 to 4.1.27 (#9800)
Bumps [github.com/pierrec/lz4/v4](https://github.com/pierrec/lz4) from 4.1.26 to 4.1.27.
- [Release notes](https://github.com/pierrec/lz4/releases)
- [Commits](https://github.com/pierrec/lz4/compare/v4.1.26...v4.1.27)

---
updated-dependencies:
- dependency-name: github.com/pierrec/lz4/v4
  dependency-version: 4.1.27
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-02 17:14:28 -07:00
dependabot[bot] cb67542d01 build(deps): bump docker/setup-qemu-action from 4.0.0 to 4.1.0 (#9802)
Bumps [docker/setup-qemu-action](https://github.com/docker/setup-qemu-action) from 4.0.0 to 4.1.0.
- [Release notes](https://github.com/docker/setup-qemu-action/releases)
- [Commits](https://github.com/docker/setup-qemu-action/compare/v4...v4.1.0)

---
updated-dependencies:
- dependency-name: docker/setup-qemu-action
  dependency-version: 4.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-02 17:13:52 -07:00
dependabot[bot] 6908445c5d build(deps): bump actions/checkout from 5 to 6 (#9803)
Bumps [actions/checkout](https://github.com/actions/checkout) from 5 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Commits](https://github.com/actions/checkout/compare/v5...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-02 16:26:55 -07:00
Chris Lu 160e68dd65 fix(s3api): keep ListBucket resource ARN at bucket level (#9792)
* fix(s3api): keep ListBucket resource ARN at bucket level

ListObjects with ?prefix= was denied for IAM users granted s3:ListBucket
on the bucket ARN. authRequestWithAuthType promotes the prefix into object
so the legacy CanDo path can honor prefix-scoped Action strings, and that
promoted object leaked into the policy resource ARN, producing
arn:aws:s3:::bucket/<prefix> which never matches a bucket-level statement.

Keep the resource bucket-level for List in the bucket-policy and
IAM-attached-policy evaluators; prefix scoping stays in the s3:prefix
Condition. The CanDo path is untouched.

* fix(s3api): resolve List action at bucket level when prefix is promoted

The IAM evaluator built a bucket-level resource ARN but still passed the
prefix-promoted object to ResolveS3Action, so listing with a prefix made
hasObject true and misresolved ListBucketVersions/ListBucketMultipartUploads
to ListBucket. Resolve the action against the same zeroed object, and trim
the resource-ARN comments.
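
A small sketch of the ARN rule, assuming actions arrive as s3:-prefixed strings. The helper name is hypothetical; it only shows that List actions drop the prefix-promoted object before the resource is built.

```go
package main

import "strings"

// resourceARN builds the policy resource for a request. List actions are
// bucket-level, so any object promoted from ?prefix= is ignored for them;
// prefix scoping is left to the s3:prefix condition.
func resourceARN(action, bucket, object string) string {
	if strings.HasPrefix(action, "s3:ListBucket") {
		object = ""
	}
	if object == "" {
		return "arn:aws:s3:::" + bucket
	}
	return "arn:aws:s3:::" + bucket + "/" + object
}
```

The prefix check also covers s3:ListBucketVersions and s3:ListBucketMultipartUploads, which share the s3:ListBucket stem.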
2026-06-02 14:45:45 -07:00
Chris Lu 8e4022d5c7 fix(s3api): authorize DeleteObjects per key so object-scoped policies match (#9793)
Bulk DeleteObjects carries the keys in the request body, so the route Auth
middleware ran a single bucket-level check with object="", building the
resource ARN as arn:aws:s3:::<bucket>. That never matches an s3:DeleteObject
policy scoped to <bucket>/*, so the entire batch was denied even though the
single-key DELETE worked with the same credentials.

Defer authorization to the handler and check each key via AuthorizeBatchDeleteKey,
mirroring AuthorizeCopySource: a synthetic DELETE /<bucket>/<key> request resolves
s3:DeleteObject (or s3:DeleteObjectVersion when a versionId is given) against the
object ARN. Denied keys come back as per-key errors while authorized keys still
delete, matching AWS semantics.
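
The per-key check can be sketched like this, assuming a policy evaluator reduced to a callback. The types, the allow callback, and the AccessDenied string are illustrative and are not AuthorizeBatchDeleteKey's real interface.

```go
package main

// deleteKey is one entry from a DeleteObjects request body.
type deleteKey struct {
	Key       string
	VersionID string
}

// deleteResult separates keys that may be deleted from per-key denials.
type deleteResult struct {
	Deleted []string
	Denied  map[string]string // key -> error code
}

// authorizeBatch checks every key against its own object ARN instead of
// running one bucket-level check for the whole batch.
func authorizeBatch(bucket string, keys []deleteKey, allow func(action, arn string) bool) deleteResult {
	res := deleteResult{Denied: map[string]string{}}
	for _, k := range keys {
		action := "s3:DeleteObject"
		if k.VersionID != "" {
			action = "s3:DeleteObjectVersion"
		}
		arn := "arn:aws:s3:::" + bucket + "/" + k.Key
		if allow(action, arn) {
			res.Deleted = append(res.Deleted, k.Key)
		} else {
			res.Denied[k.Key] = "AccessDenied"
		}
	}
	return res
}
```

A denied key shows up only in Denied, so the rest of the batch still proceeds.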
2026-06-02 14:45:05 -07:00