seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-06-13 23:36:45 +03:00

Author	SHA1	Message	Date
7y-9	e6ab9e7b09	fix(s3api): reject zero default retention years (#9860) Problem: Default object-lock retention accepted an explicitly provided Years value of zero, even though a default retention period must be positive when present. Root cause: validateDefaultRetention rejected zero Days but only rejected negative Years, leaving YearsSet with Years=0 as a successful validation path. Fix: Treat an explicitly provided zero Years value as ErrInvalidRetentionPeriod, matching the existing Days validation. Reproduction: go test ./weed/s3api -run TestValidateDefaultRetention -count=1 failed before the fix because the Zero years case returned nil. Validation: go test ./weed/s3api -run TestValidateDefaultRetention -count=1; go test ./weed/s3api -count=1; git diff --check; git diff --cached --check	2026-06-07 20:53:45 -07:00
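The rule described in e6ab9e7b09 can be sketched roughly as below. This is an illustration only, not the code in weed/s3api: the `defaultRetention` struct, its `DaysSet`/`YearsSet` flags, and `errInvalidRetentionPeriod` are hypothetical stand-ins chosen to mirror the names in the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidRetentionPeriod is a hypothetical stand-in for the s3api error.
var errInvalidRetentionPeriod = errors.New("invalid retention period")

// defaultRetention is an assumed shape: the *Set flags record whether the
// field was explicitly present in the request.
type defaultRetention struct {
	Days, Years       int
	DaysSet, YearsSet bool
}

// validateDefaultRetention rejects any explicitly provided period that is
// not positive, treating Days and Years the same way.
func validateDefaultRetention(r defaultRetention) error {
	if r.DaysSet && r.Days <= 0 {
		return errInvalidRetentionPeriod
	}
	if r.YearsSet && r.Years <= 0 {
		return errInvalidRetentionPeriod
	}
	return nil
}

func main() {
	fmt.Println(validateDefaultRetention(defaultRetention{Years: 0, YearsSet: true}))
	fmt.Println(validateDefaultRetention(defaultRetention{Years: 1, YearsSet: true}))
}
```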
Chris Lu	f9d3105e80	ec placement: spread EC shards evenly across machines, not onto the lowest-id one (#9855 ) * ec placement: steer shards to less-loaded machines, not the lowest id EC encode places every volume against one shared topology snapshot (it reserves the shards it assigns so later volumes see reduced capacity), but node selection ranked only by this volume's shard count and broke ties by sorted id. So the lowest-id machine won the first shard of every volume and accumulated far more total shards than the rest -- on a 6-machine cluster the first machines drifted to ~1.5x. Rank eligible nodes by the machine's shards of this volume, then the machine's free capacity, then the node's shards of this volume, then the node's free capacity. Free capacity reflects the load already placed, so ties steer toward the least-loaded machine instead of the lowest id, keeping total EC shards even across machines. * test: ec.balance converges to even per-machine load from a skew Starts machine 10.0.0.1 at 4 shards/volume and the rest at 2, then runs repeated worker-style capped passes; asserts convergence to an even per-machine total (reaches exactly even in ~13 rounds). * reduce comments on the placement fix Trim narration to the non-obvious why. * test: assert convergence and count zero-shard machines Seed the per-machine map with every host so a fully drained machine still registers, and fail explicitly if balance doesn't converge before the round cap.	2026-06-07 20:45:17 -07:00
Chris Lu	89cbb1c558	admin: default -dataDir to "." so maintenance task state persists across restarts (#9856) admin: default -dataDir to "." so maintenance task state persists Previously -dataDir defaulted to empty, so the admin ran maintenance in memory only: task state was never saved and maintenance tasks (notably EC balance/rebuild) were re-issued every scan cycle without converging, churning EC shards (moves landed shards without their .ecx index, leaving EC volumes unloadable/missing shards). Default -dataDir to "." (the process working directory, which under the standard systemd unit is the admin's data dir) so state persists out of the box.	2026-06-07 20:45:03 -07:00
Chris Lu	f0d2a0d417	Treat co-located volume servers as one fault domain when balancing and allocating (#9854 ) * admin/topology: carry the volume server address on DiskInfo The planning DiskInfo exposed only the node id, which can be an opaque label rather than ip:port. Record the address too so callers can resolve the physical machine a disk sits on. * ec.balance: spread a volume's shards across machines, not just nodes Volume servers sharing a host are one fault domain, but the within-rack spread treated them as independent nodes, so one box could end up holding more shards of a volume than EC can afford to lose. Add a machine (host) tier between rack and node: the within-rack pass spreads each volume across machines, and the global load phase no longer re-concentrates a volume onto a machine it already sits on. Host defaults to the node id, so clusters with one server per host are unchanged. * ec placement: prefer machines holding fewer of a volume's shards EC allocation and repair picked the least-loaded node in a rack with no regard for which physical machine it sits on, so a volume's shards could pile onto several servers of one box. Rank candidate nodes by their machine's shard count first, then the node's own. The machine is derived from the volume server address carried on DiskInfo, falling back to the node id, matching how the balancer resolves it. * volume.balance: don't move a replica onto a machine already holding one isGoodMove only rejected a move onto the same data node, so two replicas could land on two volume servers of one box and a single machine failure would lose both. Reject a target whose host already holds another replica of the volume. Best-effort: balancing simply skips and tries the next target. * volume allocation: spread same-rack replicas across machines PickNodesByWeight filled the same-rack replica picks by weight alone, so replicas could co-locate on one box. 
Prefer candidates on not-yet-used hosts, falling back when too few distinct machines exist. Data-center and rack tiers have no host, so their ordering is unchanged. * ec.balance: harden machine spread against re-concentration and capped machines Two cases where the machine-aware spread could still leave a volume badly placed: - The global load phase could move a shard of a volume onto a machine that already held it, raising that machine's count and undoing the within-rack spread (a 4/4/3/3 layout could become 3/5/3/3, past parity for 10+4). Limit the load-only fallback to same-machine moves, which leave a machine's count unchanged; cross-machine concentration is no longer allowed for load alone. - The within-rack spread chose a destination machine by free slots alone, so if that machine's only nodes were already at the SameRackCount cap it skipped the move instead of trying another machine. Require a machine to have a node that can actually take the shard before selecting it. * reduce comments across the machine-affinity change Trim narration down to the non-obvious why; one terse line where a block was overkill. * ec.balance: gate machine spread on fault-tolerance feasibility Spreading a volume evenly across machines only helps when there are enough that each can stay within EC's parity tolerance (numMachines >= ceil(total/parity)). With fewer -- or wildly unequal -- machines it can't make a machine loss survivable anyway, and forcing it fights capacity: e.g. a cluster of 12 volume servers on one host and 2 on another would have half of every volume crammed onto the 2-server box. So spread across machines only when it's achievable; otherwise fall back to per-node spread and let capacity/global balancing decide. The global load phase applies the same test: it protects a volume's machine spread (no cross-machine move that raises a machine's count past the source's) only where that spread is achievable, so heterogeneous clusters still level by fullness. 
* ec.balance worker: group servers by host when planning The worker built its planner topology without recording each server's host, so automated ec.balance treated ports on one machine as independent nodes and could concentrate a volume's shards on one physical box. Set the host from the volume server address, matching the shell path. * volume.balance worker: don't move a replica onto a machine holding one The worker compared only node ids, and the replica map dropped the server address, so it could move replicas onto different ports of one machine. Carry the host on ReplicaLocation (from the server address) and reject a target whose host already holds another replica of the volume. Best-effort, matching the shell. * ec.balance: judge machine-spread feasibility by the rack's shards The within-rack and global feasibility checks compared the whole volume's shard count against a rack's machine count, so a rack holding only part of a volume after cross-rack spreading -- e.g. 7 of a 10+4 volume across 2 machines -- was wrongly judged infeasible and fell back to node spread, which could pile 6 shards onto one host, past parity. Gate on the rack's own shard count of the volume instead. * ec.balance: spread a volume's shards across machines by combined count EC recovers from any loss within parity regardless of shard type, so what bounds a machine's exposure is its total shards of the volume, not data and parity separately. Spreading the two independently let each type's remainder land on the same machine -- ceil(d/M)+ceil(p/M) can exceed ceil(total/M), e.g. a 5/3 split where 4/4 was achievable, past parity. Balance the combined count in one pass; disk-level data/parity anti-affinity stays in pickBestDiskOnNode. 
* ec.balance: don't let the imbalance threshold skip an over-parity machine The within-rack spread gated on relative skew ((max-min)/avg > threshold), so a worker threshold of 0.5 skipped an exactly-50%-skewed layout like 5/4/3 for a 10+4 volume, leaving 5 shards -- past parity -- on one machine. The even cap (ceil(shards/groups)) is the real bound and the move loop already sheds only what exceeds it, so drop the threshold gate from the within-rack phase (machine and node): a balanced rack stays a no-op while any over-cap machine is always fixed. * ec.balance: keep the imbalance threshold for the node fallback Dropping the threshold from the whole within-rack phase made the node fallback too eager: it runs only when machine fault tolerance is unachievable, so it is cosmetic load distribution that should defer to the global utilization phase. Without the gate it would, for a one-server-per-host 6/4 split at threshold 0.5, schedule a count move that worsens utilization balance. Restore the threshold there; machine spreading keeps bypassing it, since that bound is durability, not cosmetic skew.	2026-06-07 14:14:45 -07:00
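The arithmetic behind the feasibility gate and the even cap in f0d2a0d417 can be checked with a small sketch. The function names below (`ceilDiv`, `machineSpreadFeasible`, `evenCap`) are hypothetical and only restate the formulas quoted in the commit message (`numMachines >= ceil(total/parity)` and `ceil(shards/groups)`); they are not the balancer's actual code.

```go
package main

import "fmt"

// ceilDiv returns ceil(a/b) for positive b.
func ceilDiv(a, b int) int {
	return (a + b - 1) / b
}

// machineSpreadFeasible reports whether `shards` shards can be spread over
// numMachines machines so that no machine holds more than `parity` of them.
func machineSpreadFeasible(numMachines, shards, parity int) bool {
	if parity <= 0 || numMachines <= 0 {
		return false
	}
	return numMachines >= ceilDiv(shards, parity)
}

// evenCap is the most shards any one group should hold when spread evenly.
func evenCap(shards, groups int) int {
	return ceilDiv(shards, groups)
}

func main() {
	// A rack holding 7 shards of a 10+4 volume on 2 machines is feasible,
	// while judging by the whole volume (14 shards) would say it is not.
	fmt.Println(machineSpreadFeasible(2, 7, 4), machineSpreadFeasible(2, 14, 4))

	// Spreading 5 data and 3 parity shards separately over 2 machines can
	// put ceil(5/2)+ceil(3/2) on one machine; the combined cap is lower.
	fmt.Println(ceilDiv(5, 2)+ceilDiv(3, 2), evenCap(8, 2))

	// A 5/4/3 layout of 12 shards on 3 machines exceeds the even cap.
	fmt.Println(5 > evenCap(12, 3))
}
```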
7y-9	25f36cd13d	fix(s3api): require space in v2 auth prefix (#9852 ) * fix(s3api): require space in v2 auth prefix Problem: Signature V2 Authorization headers with a malformed algorithm token such as AWSX... are accepted as if they were AWS ... headers. Root cause: validateV2AuthHeader checks HasPrefix("AWS") but then slices past an assumed trailing space, so an extra character after AWS is skipped and the rest is parsed as credentials. Fix: Require the Authorization header to start with the exact AWS plus space prefix before parsing fields. Reproduction: go test ./weed/s3api -run 'TestValidateV2AuthHeader/algorithm_prefix_without_space\|TestDoesSignV2Match/malformed_auth_-_no_space_after_AWS' -count=1 fails before the fix because AWSXAKIA... is accepted. Validation: go test ./weed/s3api -run 'TestValidateV2AuthHeader/algorithm_prefix_without_space\|TestDoesSignV2Match/malformed_auth_-_no_space_after_AWS' -count=1; go test ./weed/s3api -count=1; git diff --check; git diff --cached --check * Update weed/s3api/auth_signature_v2.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-06-07 11:52:09 -07:00
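A minimal sketch of the prefix check described in 25f36cd13d, assuming the Signature V2 header has the form `AWS AccessKey:Signature`. The function `parseV2Credentials` is hypothetical and is not the `validateV2AuthHeader` implementation; it only shows why the exact `"AWS "` prefix (with the space) must be matched before slicing.

```go
package main

import (
	"fmt"
	"strings"
)

// v2Prefix includes the trailing space, so "AWSX..." does not match.
const v2Prefix = "AWS "

// parseV2Credentials is a hypothetical helper that splits a V2 header into
// access key and signature, rejecting any header without the exact prefix.
func parseV2Credentials(header string) (accessKey, signature string, ok bool) {
	if !strings.HasPrefix(header, v2Prefix) {
		return "", "", false
	}
	rest := strings.TrimPrefix(header, v2Prefix)
	accessKey, signature, ok = strings.Cut(rest, ":")
	if !ok || accessKey == "" || signature == "" {
		return "", "", false
	}
	return accessKey, signature, true
}

func main() {
	for _, h := range []string{"AWS AKIAEXAMPLE:c2ln", "AWSXAKIAEXAMPLE:c2ln"} {
		key, _, ok := parseV2Credentials(h)
		fmt.Printf("%q -> key=%q ok=%v\n", h, key, ok)
	}
}
```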
7y-9	99bb5db1e3	fix(needle): use discovered file content type (#9851) Problem: Multipart uploads where the first part was a form field and a later part contained the file used the first part's Content-Type for the file metadata. Root cause: After finding a later part with a filename, parseUpload copied data and MD5 from part2 but read Content-Type from the original part variable. Fix: Read Content-Type from the discovered file part. Reproduction: go test ./weed/storage/needle -run TestParseUploadUsesDiscoveredFilePartContentType -count=1 failed before the fix because the parsed MIME type was text/plain instead of application/x-seaweed-test. Validation: go test ./weed/storage/needle -run TestParseUploadUsesDiscoveredFilePartContentType -count=1; go test ./weed/storage/needle -count=1; git diff --check; git diff --cached --check	2026-06-07 11:50:34 -07:00
Chris Lu	058569c77b	operation: index VidCache by map instead of slice (#9853) VidCache.cache was a []VidInfo indexed directly by volume id, so caching one volume with a large id grew the backing array to that many entries (each 48 bytes), allocating a zeroed slot for every unused id below it. A single id of 32M cost ~1.5GB resident, plus geometric realloc churn as the append loop doubled the array. Use map[uint32]VidInfo so memory scales with the number of volumes actually cached rather than the largest id seen. Parse ids with ParseUint(.,32) so values outside the uint32 volume-id range are rejected instead of silently wrapping into a key.	2026-06-07 11:46:57 -07:00
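The change in 058569c77b can be illustrated with a simplified cache keyed by volume id. With the slice layout, an id of 32,000,000 at 48 bytes per entry needs about 1.5 GB; with a map, one cached volume costs one entry. The types and methods below (`vidInfo`, `vidCache`, `set`, `get`) are hypothetical simplifications, not the real `VidCache` API.

```go
package main

import (
	"fmt"
	"strconv"
)

// vidInfo is a simplified stand-in for the cached volume locations.
type vidInfo struct {
	Locations []string
}

// vidCache keys entries by volume id, so memory follows the number of
// cached volumes rather than the largest id seen.
type vidCache struct {
	cache map[uint32]vidInfo
}

// parseVolumeID rejects ids outside the uint32 range instead of letting
// them wrap into an unrelated key.
func parseVolumeID(s string) (uint32, error) {
	id, err := strconv.ParseUint(s, 10, 32)
	if err != nil {
		return 0, err
	}
	return uint32(id), nil
}

func (vc *vidCache) set(id string, info vidInfo) error {
	key, err := parseVolumeID(id)
	if err != nil {
		return err
	}
	if vc.cache == nil {
		vc.cache = make(map[uint32]vidInfo)
	}
	vc.cache[key] = info
	return nil
}

func (vc *vidCache) get(id string) (vidInfo, bool) {
	key, err := parseVolumeID(id)
	if err != nil {
		return vidInfo{}, false
	}
	info, ok := vc.cache[key]
	return info, ok
}

func main() {
	var vc vidCache
	fmt.Println(vc.set("32000000", vidInfo{Locations: []string{"10.0.0.1:8080"}}))
	fmt.Println(len(vc.cache))
	fmt.Println(vc.set("4294967296", vidInfo{}) != nil) // out of uint32 range
}
```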
Chris Lu	755af4adf4	s3: actually bind outbound connections when -ip.bind is set (#9849 ) * s3: set outbound bind IP before the first filer dial Standalone weed s3 dialed the filer for GetFilerConfiguration before SetOutboundLocalIP ran, so that gRPC conn was created with the stock dialer and no source address. gRPC caches conns by address and reuses the original dialer on reconnect, so the s3->filer connection kept leaving from the OS-chosen source for the life of the process even after the bind IP was set a moment later. * grpc: install the outbound-bind dialer unconditionally The dialer was installed only when OutboundLocalAddr was already set at GrpcDial time, baking the source-address decision into the cached conn, so a conn dialed before the bind IP was configured never bound. Install the context dialer always and decide per dial: bind through OutboundDialContext once a source is set, otherwise fall back to the stock net.Dialer so default deployments keep gRPC's dial timeout and keepalive behavior. The bind now applies on the next reconnect regardless of ordering, matching the HTTP transport's unconditional DialContext.	2026-06-07 10:20:58 -07:00
Chris Lu	0e9fc6c5ba	worker: drop ec.balance from the default admin script (#9848) The dedicated ec_balance task worker handles EC shard balancing now, so the periodic admin script no longer needs to run it.	2026-06-07 00:55:11 -07:00
Chris Lu	b2127c86f4	admin: show S3 servers under Cluster (#9847) * s3: register data center with master on startup * admin: show S3 servers under Cluster * admin: add S3 servers to the dashboard	2026-06-07 00:32:20 -07:00
Chris Lu	01637410e2	test(s3): address review feedback on the versioning suite (#9846) - Different-users bucket test: use getNewBucketName() so the bucket carries the tracked prefix and run id and gets swept if the test leaks, instead of an untracked name. - Makefile: clarify that '.' matches the opt-in stress tests but they self-skip without ENABLE_STRESS_TESTS, so they don't execute in the default run. - Versioned list test: guard the Object.Size dereference with require.NotNil.	2026-06-06 20:50:09 -07:00
Chris Lu	d321f9efb4	s3: collapse suspended-versioning deletes onto one null marker (#9845 ) A suspended-versioning DELETE was recorded with createDeleteMarker, which mints a fresh real version id each time, so repeated suspended deletes piled up delete markers instead of overwriting a single null marker as S3 specifies. Record the suspended delete as a 'null' marker with a fixed file name (v_null) and point the latest-version pointer at it explicitly; putSuspendedVersioningObject's existing null-version cleanup removes it on the next suspended PUT, so the object undeletes cleanly and at most one null marker exists. Enabled-versioning deletes are unchanged (still distinct historical markers). Update TestSuspendedVersioningDeleteBehavior to the AWS-correct counts: one null marker after a suspended delete, and the null marker plus one real marker after a re-enabled delete.	2026-06-06 20:49:38 -07:00
Chris Lu	fa9bf58c86	test(s3): make the whole versioning suite pass and gate it in CI (#9844 ) * test(s3): correct bucket-recreate expectations and cover the different-owner case A same-owner CreateBucket on an existing bucket returns BucketAlreadyOwnedByYou (idempotent recreate); the suite expected BucketAlreadyExists, which only applies when the name is owned by someone else. Fix the same-owner cases (plain and Object-Lock) and implement the previously-skipped different-owner test, which now exercises the BucketAlreadyExists path via a second identity. * test(s3): assert the deletion invariant for suspended-versioning delete A suspended-versioning DELETE removes the null version and records a delete marker so the object reads as deleted; the test expected no marker, which would let an older version resurface. Assert that a marker is recorded (and read DeleteMarker through aws.ToBool) rather than an exact count, so it holds whether or not the suspended-marker id/dedup is later collapsed to AWS's single null marker. * test(s3): run the whole versioning suite by default TEST_PATTERN was TestVersioning, which left bucket-creation, suspended-delete and directory/version-listing tests ungated. Default to '.' so every test runs; opt-in stress tests self-skip without ENABLE_STRESS_TESTS and keep their own targets.	2026-06-06 18:38:28 -07:00
Chris Lu	795349d796	test(s3): deref Object.Size in versioned list assertion (#9843) TestVersionedObjectListBehavior compared int64 against listedObject.Size, which is *int64, so the assertion always failed on a type mismatch once reached. Dereference it (and in the log line).	2026-06-06 18:02:36 -07:00
Chris Lu	309cb32416	s3: list directory key objects in versioned bucket version listings (#9842 ) ListObjectVersions gated explicit directory objects on Mime == FolderMimeType, but an SDK PutObject of "dir/" carries a default Content-Type (e.g. application/octet-stream), so those directory keys were dropped from the version listing while ListObjectsV2 - which keys off IsDirectoryKeyObject (any non-empty mime) - still showed them. Use the same IsDirectoryKeyObject check so the two listings agree. The directory test's storage-class assertion compared an ObjectStorageClass constant against ObjectVersion.StorageClass (ObjectVersionStorageClass); the values matched but the SDK enum types did not, so it only surfaced once the directories started appearing. Use the matching constant.	2026-06-06 18:02:33 -07:00
Chris Lu	6c1fd3aeab	s3: rescan .versions when the cached latest pointer is missing on a list (#9841 ) * s3: rescan .versions when the cached latest pointer is missing on a list ListObjectsV2 resolves each versioned object's current version from the latest-version pointer cached on the .versions directory entry. When that pointer is absent on the filer serving the list, the object was dropped from the listing. Fall back to a read-only rescan of .versions/ to pick the newest version - the version files are present locally even when the cached pointer is not - so the object still lists. This mirrors the read path's recoverLatestVersionWithoutPointer; the scan loop is shared. Read-only by design: a list can touch many objects, so it does not persist a pointer. * s3: copy scanned Extended before stamping the version id	2026-06-06 18:02:30 -07:00
Chris Lu	9ede92a7cc	filer: replicate RECOMPUTE_LATEST pointer updates to peers (#9840 ) applyRecomputeLatest wrote the .versions latest-version pointer and the demoted prior version's stamp through UpdateEntry without a following NotifyUpdateEvent, so neither change entered the metadata log. Across filers the pointer then lived only on whichever filer ran the mutation, and ListObjects served by any other filer dropped those objects from a versioned bucket. Emit the events the way PATCH_EXTENDED already does, keeping a pre-update image for the notification diff.	2026-06-06 18:02:28 -07:00
Chris Lu	6e16994615	s3: make lifecycle TTL fast path per-bucket opt-in (#9825 ) Stamping an Expiration.Days rule as a volume TTL at write time bakes an irreversible TTL into the object: removing or lengthening the rule later can't un-expire it, unlike worker-driven expiration. The metadata-only delete it enables also skips per-chunk DeleteFile, so dead bytes linger in a not-yet-expired TTL volume with no deleted-byte accounting until the whole volume ages out. Gate the resolver on a per-bucket flag, off by default; toggle with the s3.bucket.lifecycle.fastpath shell command. Default writes take the worker path: real deletes that honor current policy and let vacuum reclaim space.	2026-06-06 11:20:15 -07:00
Aleksei Sviridkin	3688be82f5	fix(helm): deduplicate all-in-one extra environment variables (#9837 ) * fix(helm): deduplicate all-in-one extra environment variables The all-in-one Deployment looped global.seaweedfs.extraEnvironmentVars and allInOne.extraEnvironmentVars in two separate ranges, so any key present in both maps was emitted as two env entries with conflicting values. It also computed a merged map for the cluster-default lookup but never used it for the env loop. Use the existing seaweedfs.mergeExtraEnvironmentVars helper (as the filer, master and s3 templates already do) so a key set in both maps renders once with the component value taking precedence, and add a chart-CI render assertion covering it. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la> * ci(helm): drop checkmark glyphs from chart test output --------- Signed-off-by: Aleksei Sviridkin <f@lex.la> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-05 15:31:18 -07:00
Aleksei Sviridkin	ae4ad6859d	fix(helm): suspend bucket versioning for YAML bool false (#9836 ) * fix(helm): suspend bucket versioning for YAML bool false createBuckets[].versioning accepts both a YAML bool and a string. The string branch maps "false"/"disable"/"suspended" to Suspended, but the bool branch only handled true (Enabled) and left false as a silent no-op. The same logical value therefore behaved differently depending on its YAML type: `versioning: false` did nothing while `versioning: "false"` suspended the bucket. Mirror the string behaviour in the bool branch so bool false suspends the bucket, and add a chart-CI render assertion covering it. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la> * ci(helm): trim versioning regression-test comment * chart: document bool false for createBuckets versioning --------- Signed-off-by: Aleksei Sviridkin <f@lex.la> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-05 15:18:10 -07:00
Chris Lu	be7f417a03	ip.bind: bind outbound connections to the configured address (#9834 ) * ip.bind: bind outbound connections to the configured address -ip.bind only governed listeners; outbound gRPC and HTTP connections let the OS pick the source IP, which may not even be able to reach the target. Mirror the bind address into a process-global source address and apply it to outbound TCP dials: the gRPC context dialer, the per-client HTTP transports, and the default transport. Loopback targets and unix sockets keep the OS-chosen source so same-host traffic still works. * ip.bind: first-write-wins source IP, skip on address-family mismatch Make SetOutboundLocalIP first-write-wins so a `weed server` component's own bind setting (run in its goroutine) can't clobber the process-wide source address the top-level -ip.bind already established for the other components. Skip source binding when the target is a literal IP of a different family than the bind address, since forcing a mismatched source fails the dial.	2026-06-05 12:44:21 -07:00
Nguyễn Lộc Phúc	7f15a9fed4	fix(s3api): standardize ETag calculation in copy handlers (#9829) * fix(s3api): standardize ETag calculation across S3 API handlers * s3: make copyEntryETag nil-safe --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-05 12:41:18 -07:00
Chris Lu	6bd0091c72	master: grow rack-spanning volumes once per DC, capped at copy_N (#9835 ) * master: grow rack-spanning volumes once per DC, capped at copy_N The periodic rack-aware growth scan grew once per rack. For rack-spanning replication (DiffRackCount > 0) a single logical volume already covers every rack the placement needs, so a crowded volume made every rack report should-grow and the scan created racks×step too many volumes: with "010" across two racks that is 2 racks x step 2 = 4 logical (8 physical) volumes. Plan one DC-wide grow for rack-spanning replication, and cap the per-event step at master.volume_growth.copy_N so lowering it reduces periodic growth. * master: distribute lastGrowCount evenly across uneven DCs The non-rack-spanning grow divisor used the current DC's rack count, so DCs with different rack counts each over-grew. Sum every rack up front and divide lastGrowCount by that global count instead.	2026-06-05 12:39:59 -07:00
Chris Lu	ab7be7867d	security: hot-reload JWT signing keys on SIGHUP (#9826 ) * security: reload JWT signing keys on SIGHUP Signing keys were read once in the server constructors and never refreshed. After a key rotation (Secret update, divergent reads) the in-memory key stayed stale and every request kept failing "wrong jwt" until the affected process was restarted. Add Guard.UpdateSigningKeys and call it from the master, volume and filer reload paths and the s3 reload hook, next to the existing whitelist refresh. Make the global chunk-read JWT cache reloadable via an atomic swap, and register the master's Reload with grace.OnReload -- it was never wired, so the master ignored SIGHUP entirely. Mirror the same refresh in the Rust volume server's SIGHUP handler. * security: swap signing keys behind an atomic pointer Addresses review feedback on the in-place key swap: SigningKey is a []byte, so reassigning the Guard fields while a request handler reads them is a data race that can tear the multi-word slice header and read out of bounds. Hold the four signing-key fields in an immutable signingConfig snapshot behind atomic.Pointer; UpdateSigningKeys swaps the whole pointer, so a reader sees either the old keys or the new ones. Reads go through new SigningKey/ExpiresAfterSec/ReadSigningKey/ReadExpiresAfterSec accessors. The Rust guard is already safe: every read and the SIGHUP write go through the shared RwLock<Guard>. * security: fold whitelist + auth state into the atomic snapshot Review follow-up. UpdateSigningKeys still wrote isWriteActive while the request path read it (and the whitelist maps) unsynchronized, so a SIGHUP under load could expose an inconsistent mix of activation bits and whitelist contents. Move all hot-reloadable Guard state -- keys, expirations, whitelist, and the activation flags -- into a single immutable guardState swapped behind one atomic.Pointer. 
The Update* methods take a small mutex to serialize the read-modify-write; readers stay lock-free. The concurrency test now also rotates the whitelist and probes IsWhiteListed under -race. Also read each signing key once per branch in the volume/filer JWT auth checks, so a reload landing mid-check can't take the allow-fast-path after auth was enabled or verify against a different key than the branch saw.	2026-06-04 22:26:08 -07:00
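The snapshot-swap pattern described in ab7be7867d can be sketched as follows: all reloadable state lives in one immutable struct behind an `atomic.Pointer`, writers serialize on a small mutex, and readers stay lock-free. The field and method names here are simplified assumptions loosely following the commit message; this is not the actual `Guard` type.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// guardState is an immutable snapshot; it is never modified after Store.
type guardState struct {
	signingKey      []byte
	expiresAfterSec int
	whiteList       map[string]struct{}
}

// Guard holds the current snapshot behind an atomic pointer.
type Guard struct {
	mu    sync.Mutex // serializes read-modify-write in the Update* methods
	state atomic.Pointer[guardState]
}

func NewGuard(key []byte, expiresAfterSec int) *Guard {
	g := &Guard{}
	g.state.Store(&guardState{
		signingKey:      append([]byte(nil), key...),
		expiresAfterSec: expiresAfterSec,
		whiteList:       map[string]struct{}{},
	})
	return g
}

// SigningKey is a lock-free read of the current snapshot.
func (g *Guard) SigningKey() []byte { return g.state.Load().signingKey }

// UpdateSigningKeys copies the current snapshot, changes the key fields on
// the copy, and swaps the pointer, so a reader sees old or new, never a mix.
func (g *Guard) UpdateSigningKeys(key []byte, expiresAfterSec int) {
	g.mu.Lock()
	defer g.mu.Unlock()
	next := *g.state.Load()
	next.signingKey = append([]byte(nil), key...)
	next.expiresAfterSec = expiresAfterSec
	g.state.Store(&next)
}

func main() {
	g := NewGuard([]byte("old-key"), 10)
	g.UpdateSigningKeys([]byte("new-key"), 20)
	fmt.Println(string(g.SigningKey()))
}
```

A request handler that needs a consistent view should call `g.state.Load()` once and read every field from that snapshot, which matches the commit's note about reading each signing key once per branch.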
Chris Lu	0d72023fac	fix(master): advance maxVolumeId when registering EC shards (#9827 ) * fix(master): advance maxVolumeId when registering EC shards After EC encoding the original normal volume is deleted, so a high-numbered volume can exist only as EC shards. Only regular volumes advanced maxVolumeId (Disk.doAddOrUpdateVolume), so a master that rebuilt its state from heartbeats (raft state not resumed) undercounted the max and NextVolumeId could re-issue an id that EC shards still occupy. A new volume then gets created on top of the EC volume id; new writes land on it, but reads route to the old EC shards whose .ecx never held the new needle, returning 404 and corrupting that object. Advance maxVolumeId when EC shards are registered, mirroring the regular-volume path. RegisterEcShards is the chokepoint both the full and incremental heartbeat sync paths funnel through. * test: cover incremental heartbeat path for EC maxVolumeId Both SyncDataNodeEcShards and IncrementalSyncDataNodeEcShards funnel through RegisterEcShards; assert the invariant on the incremental path too.	2026-06-04 22:25:30 -07:00
Chris Lu	8d59069a0a	s3: return BucketAlreadyOwnedByYou when recreating your own bucket (#9822 ) * s3: return BucketAlreadyOwnedByYou when recreating your own bucket PutBucket returned BucketAlreadyExists for every existing bucket, even when the caller already owns it, so idempotent re-creation (e.g. a container that creates its bucket on startup) couldn't tell "someone else took the name" from "it's already mine". Recreating a bucket you own now returns BucketAlreadyOwnedByYou, unless the request conflicts with the existing bucket: a different Object Lock setting, or an ACL on the request or the existing bucket. To detect the latter, a requested non-default canned/grant ACL is now persisted on creation instead of being dropped. * s3: fail PutBucket when the existing bucket's config can't be read When a bucket already exists, an unreadable config left the recreate defaulting to BucketAlreadyOwnedByYou, masking the backend error and possibly accepting a conflicting recreate (Object Lock / ACL unknown). Surface the read error instead. * s3: return the stored bucket ACL from GetBucketAcl GetBucketAcl always returned the owner's default full-control grant and ignored any stored ACL, so a bucket created with a canned ACL or one set via PutBucketAcl never read back correctly. Decode the stored grants instead, sharing one grants-to-XML helper with the object ACL handler. The shared helper also emits each grantee's real xsi:type (e.g. Group for public-read) instead of a hardcoded CanonicalUser, so group grants read back correctly for both bucket and object ACLs. * s3: resolve the right already-exists error on the concurrent-create race When two requests create the same bucket at once, the loser's mkdir fails and the handler fell back to a flat BucketAlreadyExists, bypassing the same-owner idempotency check. Route both the pre-check and the race fallback through one existingBucketError helper so a same-owner recreate still gets BucketAlreadyOwnedByYou. 
* s3: record the bucket owner's account id at creation setBucketOwner only stored the creating identity name, so the canonical account id wasn't available later. Persist it under ExtAmzOwnerKey too, the same field PutBucketAcl writes, so the bucket owner can be reported independently of whoever reads it. * s3: report the bucket owner from GetBucketAcl, not the caller GetBucketAcl built the ACL Owner from the caller's account header, so an admin or cross-account read returned the wrong owner. Use the owner persisted on the bucket, falling back to the caller only when none is recorded.	2026-06-04 15:33:03 -07:00
Chris Lu	a24f4844d3	filer: keep S3 list order byte-lexicographic regardless of SQL name column collation (#9824 ) * mysql: keep S3 list order byte-lexicographic regardless of name column collation ORDER BY name and the name > ? pagination predicate follow the column collation, so a case-insensitive filemeta.name (e.g. utf8mb3_general_ci) returns S3 keys out of byte order and breaks clients that merge two sorted listings. Detect the live name collation at startup; only when it isn't binary, wrap the list comparison, prefix, and ORDER BY in BINARY name so order and pagination stay consistent. Correctly configured utf8mb4_bin tables keep their indexed range scan unchanged, and the operator gets a warning to convert the column. * postgres: keep S3 list order byte-lexicographic regardless of name column collation ORDER BY name and the name > $n pagination predicate follow the column or database collation, so a locale-aware filemeta.name (e.g. the en_US.UTF-8 database default) returns S3 keys out of byte order and breaks clients that merge two sorted listings. Detect the live name collation at startup; only when it isn't byte-ordered, wrap the list comparison, prefix, and ORDER BY in COLLATE "C" so order and pagination stay consistent. A byte-ordered (C/POSIX/C.UTF-8) column keeps its indexed range scan unchanged, and the operator gets a warning to declare the column COLLATE "C".	2026-06-04 14:33:41 -07:00
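To see why a case-insensitive collation breaks S3 list order as a24f4844d3 describes, compare byte order with a case-folded order on the same keys. The example is plain Go with made-up keys; it does not touch the filer's SQL stores, and `caseInsensitiveOrder` only approximates what a `_ci` collation does.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// byteOrder sorts keys byte-lexicographically, the order S3 listings use.
func byteOrder(keys []string) []string {
	out := append([]string(nil), keys...)
	sort.Strings(out)
	return out
}

// caseInsensitiveOrder approximates a case-insensitive column collation.
func caseInsensitiveOrder(keys []string) []string {
	out := append([]string(nil), keys...)
	sort.SliceStable(out, func(i, j int) bool {
		return strings.ToLower(out[i]) < strings.ToLower(out[j])
	})
	return out
}

func main() {
	keys := []string{"apple", "Banana", "cherry"}
	fmt.Println(byteOrder(keys))            // uppercase 'B' (0x42) sorts before 'a' (0x61)
	fmt.Println(caseInsensitiveOrder(keys)) // case folding puts "apple" first
}
```

A client that merges two listings assuming byte order would misplace `Banana` if the store returned the second ordering, which is the failure the commit guards against.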
Chris Lu	8c2d9f466f	filer: stream persisted log files when serving metadata subscriptions (#9821 ) * filer: stream persisted log files when serving metadata subscriptions readFileEntries buffered every LogEntry of a whole log file into memory before returning them one by one, making a subscription read O(entries in one log file) instead of O(one entry). On a filer with large per-entry metadata, many concurrent SubscribeMetadata streams each loading a full log file exhausted memory. Keep a current LogFileIterator and return one entry at a time, advancing files as each is exhausted. The deleted-volume skip is preserved. * filer: close the log file iterator on read errors too A genuine read error returned early without closing the current LogFileIterator, leaving its ChunkStreamReader alive until GC. Close on every exit path and propagate only a real error. * filer: close persisted-log iterators when a subscription stops early The streaming iterator keeps a log file reader open across calls, so a subscription that returns before EOF (early stop, cancellation) left the reader alive until GC. Add idempotent Close on LogFileQueueIterator and OrderedLogVisitor, and have ReadPersistedLogBuffer wait for the readahead goroutine and close the visitor on the way out.	2026-06-04 13:27:25 -07:00
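The streaming approach in the commit above (hold one open file reader, return one entry per call, close on every exit path) might look roughly like the sketch below. All types and names here (`fileReader`, `queueIterator`, `errDone`) are hypothetical stand-ins, not the actual LogFileIterator or LogFileQueueIterator, and in-memory slices replace real log files.

```go
package main

import (
	"errors"
	"fmt"
)

// errDone signals that every file has been exhausted.
var errDone = errors.New("no more entries")

// fileReader stands in for a per-file iterator over log entries.
type fileReader struct {
	entries []string
	pos     int
	closed  bool
}

func (r *fileReader) next() (string, bool) {
	if r.pos >= len(r.entries) {
		return "", false
	}
	e := r.entries[r.pos]
	r.pos++
	return e, true
}

func (r *fileReader) Close() { r.closed = true }

// queueIterator walks a list of files, opening each one lazily and
// keeping at most one reader open at a time.
type queueIterator struct {
	open  func(name string) *fileReader
	files []string
	cur   *fileReader
}

// Next returns a single entry, advancing to the next file when the
// current one is exhausted.
func (q *queueIterator) Next() (string, error) {
	for {
		if q.cur == nil {
			if len(q.files) == 0 {
				return "", errDone
			}
			q.cur = q.open(q.files[0])
			q.files = q.files[1:]
		}
		if e, ok := q.cur.next(); ok {
			return e, nil
		}
		q.cur.Close()
		q.cur = nil
	}
}

// Close releases the open reader, if any. Calling it twice is safe.
func (q *queueIterator) Close() {
	if q.cur != nil {
		q.cur.Close()
		q.cur = nil
	}
	q.files = nil
}

func main() {
	data := map[string][]string{"f1": {"a", "b"}, "f2": {}, "f3": {"c"}}
	it := &queueIterator{
		files: []string{"f1", "f2", "f3"},
		open:  func(n string) *fileReader { return &fileReader{entries: data[n]} },
	}
	defer it.Close() // covers early stops and cancellation
	for {
		e, err := it.Next()
		if err != nil {
			break
		}
		fmt.Println(e)
	}
}
```

The idempotent `Close` is what lets a caller defer it unconditionally, whether the subscription ends at EOF or earlier.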
7y-9	6e8002f065	fix: handle meta backup offset errors safely (#9818 ) * fix: log meta backup offset errors * fix: log meta backup offset errors * fix: exit on meta backup offset errors Exit with a non-zero status when the initial metadata backup offset cannot be persisted. Classify offset-read failures during streaming so the backup process exits instead of retrying forever, allowing supervisors to restart and bootstrap from a missing checkpoint. * meta backup: read offset in the loop, drop offset error type Reading the saved offset inside the retry loop makes an offset read failure a clean exit and a stream error a retry, without a typed error to tell them apart. streamMetadataBackup now takes the start time. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-04 10:53:23 -07:00
Chris Lu	3e8ec879c4	s3: keep dynamic IAM live when -iam.config is set (#9817 ) * s3: keep dynamic IAM live when -iam.config is set -iam.config was treated like a static -config identity file: it set useStaticConfig, which makes the filer metadata subscription skip reloads. Identities and policies created at runtime (the IAM gRPC API) then never took effect, so advanced IAM (OIDC/STS) and dynamic IAM were mutually exclusive. Gate useStaticConfig on whether inline identities were actually loaded. An OIDC/STS-only config carries none, so it keeps the dynamic credential store live; a -config identity file still freezes its identities as before. * s3: mark static identities on config reload too A -config reload (grace.OnReload) re-reads the file, but only the startup path marked its identities static, so identities added to the file and reloaded were left unprotected from dynamic filer updates. Move the marking into loadS3ApiConfigurationFromFile and make it additive and scoped to the file's identities, so a reload protects newly added ones without freezing dynamic filer-managed identities. * s3: sync reloaded static identities into the credential manager After marking a (re)loaded config file's identities static, push the updated set into the credential manager so reloaded identities still appear in listings and survive later dynamic merges. Centralize the sync in loadS3ApiConfigurationFromFile and drop the now-redundant call in the reload hook.	2026-06-03 23:28:25 -07:00
Fabian Hardt	ce6a51468a	sftpd: support SSH user certificates signed by a trusted CA (#9815 ) * sftpd: support SSH user certificates signed by a trusted CA Adds a new "certificate" auth method to weed sftp. When enabled, the server loads trusted CA public keys from -trustedUserCAKeysFile (OpenSSH authorized_keys format, one or more keys) and accepts only ssh.Certificate blobs of type UserCert on the public-key channel. Validation uses ssh.CertChecker: CA signature, ValidAfter/ValidBefore, non-empty ValidPrincipals and SSH login user must appear in ValidPrincipals. The authenticated user must exist in the user store; home dir and permissions resolve as before. Behaviour mirrors MinIO's --sftp=trusted-user-ca-key and OpenSSH's TrustedUserCAKeys: when certificate auth is active, plain (non-cert) public keys are rejected even if "publickey" is also listed. Default authMethods remain "password,publickey", so existing deployments are unaffected. * Update weed/sftpd/auth/certificate.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * sftpd: address review feedback on certificate auth - Pre-marshal trusted CA public keys in IsUserAuthority instead of re-marshaling on every authentication attempt (gemini-code-assist). - Differentiate user-not-found from underlying store errors via errors.As(user.UserNotFoundError) so backend/read failures are no longer reported as bad credentials (coderabbitai). - Fix the corresponding sanity check in the missing-file test to use errors.As instead of errors.Is (UserNotFoundError has no Is method, so the previous check never matched) (coderabbitai). sftpd: register trustedUserCAKeysFile flag in filer and server commands The new field on SftpOptions is dereferenced unconditionally in resolvePaths(), but only the standalone `weed sftp` command was wiring its flag. 
`weed filer` and `weed server` both embed an SftpOptions value and call resolvePaths() on it, so they hit a nil pointer dereference at startup. Register `-sftp.trustedUserCAKeysFile` in both commands and update the -sftp.authMethods help text to mention the new "certificate" method. Fixes the SFTP Integration Tests CI failure on this PR. * helm: expose SFTP certificate auth in the SeaweedFS chart Adds Helm-chart support for the new SSH user-certificate auth method: - values.yaml (sftp:) gains `trustedUserCAKeys` (inline OpenSSH authorized_keys-format CA public keys) and `existingCAKeysSecret` (reference an externally managed Secret). Same pair added under allInOne.sftp with a null default that falls back to the top-level sftp.* setting. - New template templates/sftp/sftp-ca-secret.yaml renders a chart-managed Secret <release>-sftp-ca-secret with `ca_user.pub`, but only when SFTP is enabled, "certificate" is in authMethods, inline keys are provided, and no existingCAKeysSecret is set. - templates/sftp/sftp-deployment.yaml and the all-in-one deployment template add `-trustedUserCAKeysFile=/etc/sw/sftp_ca/ca_user.pub` to the weed sftp command, mount the CA secret at /etc/sw/sftp_ca and add the corresponding volume. All cert-auth bits are guarded by `contains "certificate" authMethods` so existing users see no change. - authMethods help text updated to mention "certificate". Verified end-to-end on a local k3d cluster: cert login succeeds, plain-pubkey login is rejected with "public key without certificate not allowed". * helm: fail render when SFTP certificate auth lacks CA keys When certificate is in authMethods but neither trustedUserCAKeys nor existingCAKeysSecret is set, the deployment mounted a secret that the chart never renders, leaving the pod stuck on a missing volume. Fail at template time with a clear message instead. 
* sftpd: fix stale auth-method list in SFTPServiceOptions comment keyboard-interactive was never implemented; certificate is the new supported method. Match the CLI help text. * sftpd: test Manager wiring of certificate vs public-key channel Cover the channel takeover at the Manager level: certificate auth displaces plain public-key auth when both are enabled, public-key auth stays put otherwise, and enabling certificate without a CA file errors. --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-03 22:32:47 -07:00
Chris Lu	df879e1ed7	filer: bound TraverseBfsMetadata memory by queuing directory paths (#9814 ) * filer: bound TraverseBfsMetadata memory by queuing directory paths The BFS enqueued every entry, so it held the whole subtree in memory including each file's chunk list. A filer serving a peer's first-time bootstrap traversal of a large tree could exhaust memory and get killed. Stream each entry as it is visited and queue only directory paths to descend into. Memory is now bounded by the number of directories rather than the entire tree, and the streamed output order is unchanged. * filer: match excluded prefixes on path-component boundaries Only treat an excluded prefix as a match when it ends at a path boundary, so excluding /a/b does not also drop a sibling like /a/bc. Short-circuit the trie walk on the first real match.	2026-06-03 10:28:42 -07:00
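The path-boundary rule from the second bullet of the commit above can be illustrated with plain strings. This is a sketch under assumptions: `hasPathPrefix` is a hypothetical helper, and the real change walks a trie rather than calling a function like this.

```go
package main

import (
	"fmt"
	"strings"
)

// hasPathPrefix reports whether path equals prefix or lies beneath it,
// treating "/" as the component separator. Excluding /a/b therefore
// matches /a/b and /a/b/c but not the sibling /a/bc.
func hasPathPrefix(path, prefix string) bool {
	prefix = strings.TrimSuffix(prefix, "/")
	if !strings.HasPrefix(path, prefix) {
		return false
	}
	rest := path[len(prefix):]
	return rest == "" || rest[0] == '/'
}

func main() {
	for _, p := range []string{"/a/b", "/a/b/c", "/a/bc"} {
		fmt.Printf("%-8s excluded by /a/b: %v\n", p, hasPathPrefix(p, "/a/b"))
	}
}
```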
Konstantin Lebedev	df833d485f	[test] update docker image for s3test (#9811 )	2026-06-03 09:45:00 -07:00
Lars Lehtonen	d321e463e9	chore(weed/storage/needle): prune unused test functions (#9812 )	2026-06-03 09:26:28 -07:00
Chris Lu	ef1aa4f936	s3: defer a recently-unreachable owner that is also the current filer (#9808 ) Blanking preferred kept route-by-key reads from dialing a flagged owner first, but withFilerClientFailover always re-adds the current filer, so when the owner is the gateway's current filer it stayed in the candidate list and got dialed anyway. Treat a recently-unreachable filer as unhealthy in the health partition so it is deferred to the last-resort tail instead of tried before healthy replicas; preferred is still tried first, and a live owner is unaffected.	2026-06-03 00:28:56 -07:00
Chris Lu	2d1b8be22b	s3: route object reads to the key's owner filer (#9806 ) * s3: route object reads to the key's owner filer Writes already route by key to the owner filer on the lock ring, where the entry is created. Reads went to the gateway's local filer and treated its NotFound as authoritative, so a GET on one gateway could miss an object another gateway had just written until the filers' metadata replication caught up. Resolve an object's entry from the key's owner first, failing over to the gateway's filer set only on transport errors. An owner NotFound stays authoritative: no fan-out across filers, and no resurrecting a peer's not-yet-replicated tombstone, so a delete routed to the owner is visible at once and a genuine miss costs one lookup. Keys owned by the local filer are unchanged. Objects written through the non-routed lock path land on a gateway's local filer, so they can still read as absent on the owner until they replicate. withFilerClientFailover takes a preferred start filer; the object-entry reads pass the owner, every other caller passes "" and keeps the current-filer fast path. * s3: consult the prior owner on a rebalance-window read miss Owner-first reads route a key to its current ring owner. When a filer joins, ~1/N of keys reassign to it, and the new owner may not have replicated a just-moved key yet, so an owner NotFound would surface a transient 404 for an object that already exists elsewhere. Retain the previous ring on the gateway's LockClient for a cooling-off window (PriorOwnerForKey, mirroring the master's LockRing.PriorOwner) and, on the owner's NotFound, probe the key's previous owner once before treating the miss as final. The probe is scoped to keys whose ownership actually moved and only within the window, so steady-state reads are untouched. 
This trades the transient scale-up 404 for a transient stale read if a delete routed to the new owner races the same window — the same authoritative-NotFound tradeoff, narrowed to the rebalance. * s3: try healthy filers before unhealthy ones on failover The candidate list probed its first entry (usually the current filer) unconditionally, so a health-flagged current filer cost a transport timeout on every ordinary call before failover reached a replica. Partition candidates into healthy and unhealthy, keep priority within each, and fall back to unhealthy ones only when all healthy ones fail. * reduce comments on the routed read and lock client paths * s3: skip a recently-unreachable owner on route-by-key reads The gateway's filer health tracking no-ops for an owner outside the static -filer list, so during a sustained owner outage every route-by-key read re-dials the dead owner before failing over. Flag an owner whose owner-first read hit a transport error and skip it (read local-first) for a short TTL, so reads pay one dead dial per TTL instead of one per request; the flag expires so owner-first reads resume once the owner or the ring recovers. * s3: always try the preferred owner first, health-order only the rest The healthy/unhealthy partition also demoted a health-flagged preferred owner behind healthy replicas, so a replica's authoritative NotFound could mask a write that had only reached the owner — the read-after-write race this routing exists to close. Pull preferred out of the partition and keep it first; the recently-unreachable gate already steers reads away from a genuinely dead owner.	2026-06-03 00:12:28 -07:00
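The candidate ordering this commit ends up with (preferred owner first, then healthy filers, then unhealthy ones as a last resort) can be sketched as follows. `orderCandidates` and its parameters are hypothetical names for illustration; this is not the signature of withFilerClientFailover.

```go
package main

import "fmt"

// orderCandidates returns the order in which filers would be dialed:
// the preferred filer first (when set), then healthy filers in their
// original priority, then unhealthy filers. Duplicates are dropped.
func orderCandidates(preferred string, filers []string, unhealthy map[string]bool) []string {
	seen := make(map[string]bool)
	var ordered, deferred []string
	if preferred != "" {
		// The preferred owner is never demoted, even if flagged.
		ordered = append(ordered, preferred)
		seen[preferred] = true
	}
	for _, f := range filers {
		if seen[f] {
			continue
		}
		seen[f] = true
		if unhealthy[f] {
			deferred = append(deferred, f)
		} else {
			ordered = append(ordered, f)
		}
	}
	return append(ordered, deferred...)
}

func main() {
	filers := []string{"current", "replica1", "owner", "replica2"}
	unhealthy := map[string]bool{"current": true}
	fmt.Println(orderCandidates("owner", filers, unhealthy))
	fmt.Println(orderCandidates("", filers, unhealthy))
}
```

Passing an empty preferred value models the callers that keep the plain health-ordered list.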
Chris Lu	4e5839ce82	fix(iam): return a valid user ARN from CreateUser and GetUser (#9794 ) * fix(iam): return a valid user ARN from CreateUser and GetUser The terraform aws provider 6.41 reads a user back after creating it and blocks until GetUser returns a value that passes arn.IsARN. We only set UserName, so the ARN was empty and apply hung until the 2m timeout. Populate Arn (and Path) via a shared iam.NewUser helper in both the embedded and standalone IAM handlers. * fix(iam): use the userName parameter directly in NewUser Drop the redundant local copy; the value parameter is already function-local. * fix(iam): return full user objects with ARNs from GetGroup GetGroup listed members with only UserName set. Build them via the shared NewUser helper so group members carry a valid Arn and Path like the other user responses, in both the embedded and standalone IAM handlers.	2026-06-02 22:01:57 -07:00
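For reference, an IAM user ARN has the shape `arn:aws:iam::<account-id>:user<path><name>`, where the path starts and ends with "/". The sketch below builds one; `userARN` is a hypothetical helper rather than the iam.NewUser function named in the commit, and the account id shown is a placeholder, since the commit does not say which value SeaweedFS uses.

```go
package main

import "fmt"

// userARN builds an IAM user ARN. An empty path defaults to "/",
// which is the IAM default path.
func userARN(accountID, path, userName string) string {
	if path == "" {
		path = "/"
	}
	return "arn:aws:iam::" + accountID + ":user" + path + userName
}

func main() {
	// "000000000000" is a placeholder account id for illustration.
	fmt.Println(userARN("000000000000", "", "alice"))
	fmt.Println(userARN("000000000000", "/eng/", "bob"))
}
```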
Chris Lu	f711868fb6	fix(log_buffer): re-check buffer before bailing with ResumeFromDiskError (#9804 ) ReadFromBuffer and HasData() take the read lock separately, so a write that lands between them can make a subscriber which just read a momentarily empty buffer return ResumeFromDiskError even though the data is now servable from memory. Re-read under a fresh lock and only bail when the position is genuinely behind the in-memory window (flushed to disk); otherwise loop back and read it.	2026-06-02 21:37:15 -07:00
dependabot[bot]	24159fbff9	build(deps): bump opentofu/setup-opentofu from 1 to 2 (#9801 ) Bumps [opentofu/setup-opentofu](https://github.com/opentofu/setup-opentofu) from 1 to 2. - [Release notes](https://github.com/opentofu/setup-opentofu/releases) - [Commits](https://github.com/opentofu/setup-opentofu/compare/v1...v2) --- updated-dependencies: - dependency-name: opentofu/setup-opentofu dependency-version: '2' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>	2026-06-02 21:37:05 -07:00
Chris Lu	7b44cf5627	fix(iam): implement CreatePolicyVersion for managed policies (#9795 ) * fix(iam): implement CreatePolicyVersion for managed policies The AWS Terraform provider updates a managed policy in place via CreatePolicyVersion, which returned 501 NotImplemented and broke terraform apply on any policy change. Implement CreatePolicyVersion (plus ListPolicyVersions, GetPolicyVersion and DeletePolicyVersion) on both the standalone IAM server and the embedded S3 IAM API. Managed policies keep a single current document, so each is modeled as one default version "v1": CreatePolicyVersion replaces the document, List/GetPolicyVersion expose it, and DeletePolicyVersion rejects deleting the default. GetPolicy now reports DefaultVersionId so the provider's read can fetch the document. The standalone path also refreshes the cached Identity.Actions of every identity the policy is attached to so the new document takes effect. * fix(iam): reject CreatePolicyVersion unless SetAsDefault=true With a single always-default managed-policy version, a request with SetAsDefault=false (or omitted) would stage a non-default version on AWS but here silently replaced the active document. Reject it on both the standalone and embedded paths. Isolate the new policy-version tests from the shared package fixtures so they stay order-independent, and assert IsDefaultVersion on the response.	2026-06-02 21:35:02 -07:00
Chris Lu	b6a0bde16b	test(s3/iam): scope ListBucket isolation via s3:prefix condition (#9805 ) The username-isolation policy denied s3:ListBucket through an object-path NotResource. ListBucket is bucket-level, so its resource ARN is the bucket and never matches an object path: the Deny always fired and a user could not list their own prefix. Scope the per-user List deny with a StringNotLike s3:prefix condition instead, the same mechanism the matching Allow uses.	2026-06-02 18:41:10 -07:00
dependabot[bot]	6bffe3f56a	build(deps): bump google.golang.org/grpc from 1.80.0 to 1.81.1 (#9797 ) Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.80.0 to 1.81.1. - [Release notes](https://github.com/grpc/grpc-go/releases) - [Commits](https://github.com/grpc/grpc-go/compare/v1.80.0...v1.81.1) --- updated-dependencies: - dependency-name: google.golang.org/grpc dependency-version: 1.81.1 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-06-02 18:27:52 -07:00
dependabot[bot]	69bc7325ca	build(deps): bump github.com/ydb-platform/ydb-go-sdk/v3 from 3.134.2 to 3.139.5 (#9796 ) build(deps): bump github.com/ydb-platform/ydb-go-sdk/v3 Bumps [github.com/ydb-platform/ydb-go-sdk/v3](https://github.com/ydb-platform/ydb-go-sdk) from 3.134.2 to 3.139.5. - [Release notes](https://github.com/ydb-platform/ydb-go-sdk/releases) - [Changelog](https://github.com/ydb-platform/ydb-go-sdk/blob/master/CHANGELOG.md) - [Commits](https://github.com/ydb-platform/ydb-go-sdk/compare/v3.134.2...v3.139.5) --- updated-dependencies: - dependency-name: github.com/ydb-platform/ydb-go-sdk/v3 dependency-version: 3.139.5 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>	2026-06-02 17:22:03 -07:00
dependabot[bot]	348de64f13	build(deps): bump github.com/aws/aws-sdk-go-v2/config from 1.32.14 to 1.32.21 (#9798 ) build(deps): bump github.com/aws/aws-sdk-go-v2/config Bumps [github.com/aws/aws-sdk-go-v2/config](https://github.com/aws/aws-sdk-go-v2) from 1.32.14 to 1.32.21. - [Release notes](https://github.com/aws/aws-sdk-go-v2/releases) - [Commits](https://github.com/aws/aws-sdk-go-v2/compare/config/v1.32.14...config/v1.32.21) --- updated-dependencies: - dependency-name: github.com/aws/aws-sdk-go-v2/config dependency-version: 1.32.21 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-06-02 17:14:58 -07:00
dependabot[bot]	30e0116611	build(deps): bump go.etcd.io/etcd/client/pkg/v3 from 3.6.11 to 3.6.12 (#9799 ) Bumps [go.etcd.io/etcd/client/pkg/v3](https://github.com/etcd-io/etcd) from 3.6.11 to 3.6.12. - [Release notes](https://github.com/etcd-io/etcd/releases) - [Commits](https://github.com/etcd-io/etcd/compare/v3.6.11...v3.6.12) --- updated-dependencies: - dependency-name: go.etcd.io/etcd/client/pkg/v3 dependency-version: 3.6.12 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-06-02 17:14:50 -07:00
dependabot[bot]	d6db545b4c	build(deps): bump github.com/pierrec/lz4/v4 from 4.1.26 to 4.1.27 (#9800 ) Bumps [github.com/pierrec/lz4/v4](https://github.com/pierrec/lz4) from 4.1.26 to 4.1.27. - [Release notes](https://github.com/pierrec/lz4/releases) - [Commits](https://github.com/pierrec/lz4/compare/v4.1.26...v4.1.27) --- updated-dependencies: - dependency-name: github.com/pierrec/lz4/v4 dependency-version: 4.1.27 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-06-02 17:14:28 -07:00
dependabot[bot]	cb67542d01	build(deps): bump docker/setup-qemu-action from 4.0.0 to 4.1.0 (#9802 ) Bumps [docker/setup-qemu-action](https://github.com/docker/setup-qemu-action) from 4.0.0 to 4.1.0. - [Release notes](https://github.com/docker/setup-qemu-action/releases) - [Commits](https://github.com/docker/setup-qemu-action/compare/v4...v4.1.0) --- updated-dependencies: - dependency-name: docker/setup-qemu-action dependency-version: 4.1.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-06-02 17:13:52 -07:00
dependabot[bot]	6908445c5d	build(deps): bump actions/checkout from 5 to 6 (#9803 ) Bumps [actions/checkout](https://github.com/actions/checkout) from 5 to 6. - [Release notes](https://github.com/actions/checkout/releases) - [Commits](https://github.com/actions/checkout/compare/v5...v6) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-06-02 16:26:55 -07:00
Chris Lu	160e68dd65	fix(s3api): keep ListBucket resource ARN at bucket level (#9792 ) * fix(s3api): keep ListBucket resource ARN at bucket level ListObjects with ?prefix= was denied for IAM users granted s3:ListBucket on the bucket ARN. authRequestWithAuthType promotes the prefix into object so the legacy CanDo path can honor prefix-scoped Action strings, and that promoted object leaked into the policy resource ARN, producing arn:aws:s3:::bucket/<prefix> which never matches a bucket-level statement. Keep the resource bucket-level for List in the bucket-policy and IAM-attached-policy evaluators; prefix scoping stays in the s3:prefix Condition. The CanDo path is untouched. * fix(s3api): resolve List action at bucket level when prefix is promoted The IAM evaluator built a bucket-level resource ARN but still passed the prefix-promoted object to ResolveS3Action, so listing with a prefix made hasObject true and misresolved ListBucketVersions/ListBucketMultipartUploads to ListBucket. Resolve the action against the same zeroed object, and trim the resource-ARN comments.	2026-06-02 14:45:45 -07:00
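The resource-ARN rule in the commit above comes down to ignoring the promoted prefix for List actions. A minimal sketch, assuming a hypothetical `resourceARN` helper with an explicit `isList` flag (the real evaluators derive this from the resolved action):

```go
package main

import (
	"fmt"
	"strings"
)

// resourceARN builds the ARN a policy statement is matched against.
// List actions are bucket-level, so any object value (for example a
// prefix promoted from ?prefix=) is ignored for them.
func resourceARN(bucket, object string, isList bool) string {
	if isList || object == "" {
		return "arn:aws:s3:::" + bucket
	}
	return "arn:aws:s3:::" + bucket + "/" + strings.TrimPrefix(object, "/")
}

func main() {
	// ListObjects with ?prefix=logs/ stays at bucket level.
	fmt.Println(resourceARN("mybucket", "logs/", true))
	// GetObject on logs/a.txt resolves to the object ARN.
	fmt.Println(resourceARN("mybucket", "logs/a.txt", false))
}
```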
Chris Lu	8e4022d5c7	fix(s3api): authorize DeleteObjects per key so object-scoped policies match (#9793 ) Bulk DeleteObjects carries the keys in the request body, so the route Auth middleware ran a single bucket-level check with object="", building the resource ARN as arn:aws:s3:::<bucket>. That never matches an s3:DeleteObject policy scoped to <bucket>/*, so the entire batch was denied even though the single-key DELETE worked with the same credentials. Defer authorization to the handler and check each key via AuthorizeBatchDeleteKey, mirroring AuthorizeCopySource: a synthetic DELETE /<bucket>/<key> request resolves s3:DeleteObject (or s3:DeleteObjectVersion when a versionId is given) against the object ARN. Denied keys come back as per-key errors while authorized keys still delete, matching AWS semantics.	2026-06-02 14:45:05 -07:00

14102 Commits