401 Commits

Author SHA1 Message Date
Chris Lu f724828bcb fix(ec): never delete recoverable EC shards on startup/reconcile (the non-empty-.dat sibling of the stub bug) (#9941)
* fix(ec): never delete recoverable shards on startup/reconcile (size-direction + byte-exact .dat)

EC startup validation and the cross-disk reconcile could delete the only
copy of distributed-EC shards whenever a non-empty .dat sat beside them.
This is the same data-loss class as the empty-.dat-stub fix, now for a
real (non-empty) stale or partial .dat.

validateEcVolume: the discriminating signal is the shard size relative to
the .dat's full encode, not the shard count.
  - shards smaller than expected: an interrupted local encode left partial
    shards and the .dat is the complete source -> reclaim the .dat.
  - shards equal to expected: a valid (or still-distributing) EC volume ->
    keep; the shards may be the only copy.
  - shards larger than expected: the .dat is the stale/partial side (e.g. an
    interrupted decode left a half-written .dat next to the real shards) ->
    keep.
Previously any size mismatch, a low shard count beside a .dat, or a
transient stat error returned "delete", wiping sole-copy shards. Now every
ambiguity (size mismatch in either direction, inconsistent shard sizes,
transient I/O error, partial shard set) keeps the data; only a credible
full source .dat with no partial set to lose is reclaimed.

handleFoundEcxFile: a shard load failure (corrupt/locked .ecx, EMFILE
during a mass restart, transient I/O) no longer deletes the EC files when a
.dat exists -- it only unloads and keeps the files for retry. All deletion
authority now flows through validateEcVolume.

pruneIncompleteEcWithSiblingDat: count shards NODE-WIDE (a set split across
sibling disks summing to >= dataShards is independently recoverable and is
left alone), and require the sibling .dat to byte-exactly match the size
.vif recorded at encode time before deleting -- the prior "at least this
big, or bigger than a superblock" gate could trust a stale .dat and wipe
sole-copy shards. EC encode records the source size in .vif, so this gate
works for real volumes; older volumes without it fail safe (kept).

Rust volume server mirrors all of the above: size-direction + keep-on-
ambiguity in validate_ec_volume, keep-on-load-failure in
handle_found_ecx_file, and the node-wide + byte-exact gate in the prune.
The Rust validate/prune paths now resolve the data-shard count from the
volume's own .vif instead of hardcoding 10+4, so custom-ratio volumes are
not mis-sized and wrongly deleted on reboot.

Existing tests that encoded the old (unsafe) "delete on low count / size
mismatch" behavior are updated to the safe expectation, and new regression
tests cover the partial-decode-.dat-keeps-shards and transient-error-keeps
cases (Go and Rust); they fail on the pre-fix code.

* fix(ec): record DatFileSize in planted EC .vif for the prune test; trim comments

The multi-disk lifecycle e2e test planted a partial EC leftover with an
empty .vif, so the byte-exact prune gate (which a real encoded volume
satisfies via its recorded source size) kept it instead of cleaning up.
Record DatFileSize + the EC ratio in the planted .vif, matching production.

Also condense the verbose comments added in this change to the repo's
concise style.
2026-06-12 23:51:29 -07:00
Chris Lu 34f9b91d69 fix(storage): never let an empty .dat delete healthy distributed EC shards (#9930)
* fix(storage): never let an empty .dat delete healthy distributed EC shards

A leftover empty .dat stub (a phantom from the pre-fix loader; zero
needles) next to a distributed EC volume's local shards made startup
classify the volume as an interrupted local encode: validateEcVolume
requires >= dataShards local shards when a .dat is present, fails with
the 1-2 shards a distributed volume keeps per disk, and the cleanup
deletes those shards -- the only copies of that part of the volume.
Repeated across restart waves this destroys enough shards cluster-wide
to make the volume unrecoverable.

Go:
- loadExistingVolume: hoist the empty-stub sweep above the EC presence
  checks. Previously the .vif-next-to-.ecx guard returned before the
  sweep ever ran, so exactly the dangerous layout (stub + .ecx + local
  shards) kept its stub and then lost its shards in loadAllEcShards.
- validateEcVolume / checkDatFileExists: treat a .dat <= a superblock
  (zero needles) as absent. An empty .dat cannot be the encode source,
  so it must never gate shard deletion; this also covers stubs without
  a .vif, which the sweep cannot prove are EC leftovers.

Rust mirror (seaweed-volume): the same gate in validate_ec_volume and
check_dat_file_exists (the Rust sweep already ran before validation);
the volume-load skip keeps a plain existence check so fresh,
needle-less volumes still load.

Regression tests in Go and Rust reproduce the production layout (a
zero-byte .dat beside .ecx/.ecj and two shards of a 10+4 volume, with
and without a .vif) and fail without the fix with the shards deleted.

* fix(ec): gate source volume deletion on a recoverable shard set

After EC encode, the shell command and the (plugin) worker task refused
to delete the source volume unless every shard was present, and aborted
otherwise -- leaving the source .dat next to live shards, exactly the
mixed state the startup cleanup mishandles.

Replace the full-set requirement with a recoverability gate shared by
both callers (RequireRecoverableShardSet): deleting a non-empty source
.dat requires at least dataShards distinct shards cluster-wide. Below
that the source is kept and the encode fails as before. A degraded but
recoverable set (>= dataShards, < total) now proceeds with a warning
instead of aborting: the missing shards can be rebuilt from the
survivors, while keeping the source would preserve the dangerous mixed
state. Empty stub replicas are still swept unguarded (OnlyEmpty) -- an
empty .dat has nothing to lose.

dataShards/totalShards stay parameters so enterprise custom EC ratios
share the helper verbatim.

* test(ec): use recoverable shard verification gate
2026-06-11 20:26:20 -07:00
Chris Lu 3eb550a3f1 fix(tests): 32-bit build of EC e2e tests, type-check linux/386 in CI (#9922)
* fix(tests): keep EC e2e fid cookie arithmetic in uint32

The cookie constants 0x9490CA00 and 0x9500CA00 were added to the int
loop variable before conversion, overflowing 32-bit int at compile
time on linux/386 and linux/arm. Convert the loop variable instead so
the addition stays in uint32.

* fix(tests): pass s3client max backoff in milliseconds

MaxBackoffDelay is documented as milliseconds and multiplied by 1e6
before use, but the example set it to 5s in nanoseconds, yielding an
absurd backoff on 64-bit and a compile-time int overflow on 32-bit.

* ci: type-check code and tests for linux/386

64-bit-only constant arithmetic keeps slipping into test files and
breaking 32-bit downstream builds. Vet the whole root module under
GOOS=linux GOARCH=386 so these fail in CI instead of after release.

* fix(tests): convert s3client backoff to Duration before scaling

The ms-to-ns multiplication ran in int, wrapping at runtime on 32-bit;
scale by time.Millisecond after the Duration conversion instead.
2026-06-11 09:05:54 -07:00
Chris Lu 79ac279fe1 fix(ec): don't mix EC shards from different encode runs (#9880)
* feat(ec): add encode_ts_ns to EC shard metadata and the shard read RPC

EcShardConfig and VolumeEcShardReadRequest gain an int64 encode_ts_ns
(encode time in unix nanos). It rides in .vif and the read request so a
read can be scoped to the encode run that produced the index.

* fix(ec): stamp each encode and reject cross-run shard reads

Generate stamps EncodeTsNs into the volume's .vif. Reads carry it to the
shard's owning volume (resolved together via FindEcVolumeWithShard, so a
multi-disk server validates the disk that actually serves the bytes) and
reject a shard from a different encode run, recovering from parity. A
zero on either side (pre-upgrade volume) skips the guard.

* fix(ec): stamp the encode identity on the worker-generated .vif

The worker-local encode path now writes EncodeTsNs (and the resolved EC
ratio) into the .vif, so the read guard is not silently off for volumes
encoded by the maintenance worker.

* fix(ec): wipe stale EC artifacts before re-encoding

VolumeEcShardsGenerate evicts any in-memory EcVolume for the volume and
removes its on-disk shard/index/sidecar files before writing fresh ones,
so a retried encode never builds on a partial prior run and the unlink
frees the inodes instead of leaving open fds serving old bytes.

* fix(ec): unmount EC shards across all disks

UnmountEcShards walked only the first disk holding the shard, leaving a
duplicate copy mounted on a sibling disk (split-disk reconciled volumes)
still serving and heartbeating. Traverse every disk and emit one
deletion delta per disk.

* fix(ec): delete orphan shards without a local .ecx

deleteEcShardIdsForEachLocation gated shard-file removal on a local .ecx,
so it could not clean an orphan .ecNN left by a failed copy on a disk
with no index. Delete the requested shard files unconditionally; the
index-file (.ecx/.ecj/.vif) routing stays gated as before.

* fix(ec): clear stale EC shards cluster-wide before re-encoding

ec.encode unmounts and deletes EC shards for the target volumes on every
node before regenerating: fatal for the shards the topology reports
(mounted leftovers), best-effort for the rest (a sweep that catches
unmounted failed-copy orphans). A down node is a no-op.

* fix(ec): don't nil EC fds on close so reads can't race eviction

A reader resolves an EcVolume/shard under the lock then reads after it is
released, so an eviction that nils ecxFile/ecdFile would race that read
and panic. Close the fds without nilling the fields: the field is now
write-once (no data race) and a concurrent read hits a closed fd, getting
a clean error that the caller recovers from parity.

* fix(ec): wipe stale EC artifacts on every disk and surface failures

The pre-encode wipe only deleted beside the source volume, so a stale
shard on a sibling disk survived and could be mounted against the new
index at reconcile. Sweep every disk. Removal also ignored os.Remove
errors, reporting a failed cleanup as success and letting a stale shard
join the next generation; surface the first real failure (treating
already-gone as success) from removeStaleEcArtifacts and the shard delete.

* fix(ec): log when a local shard is skipped for a different encode run

The cross-run guard returned errShardNotLocal, indistinguishable in logs
from a genuinely-absent shard. Add a V(1) line naming both EncodeTsNs so
operators can tell "wrong encode generation" from "shard not here".

* fix(ec): surface metadata removal failures in the shard delete path

deleteEcShardIdsForEachLocation still dropped os.Remove errors on the
.ecx/.ecj/.vif/sidecar cleanup. A surviving stale .ecx is the orphan-index
condition this path prevents, so route those through removeFileIfExists and
return the first real failure instead of reporting cleanup as success.

* fix(ec): fail orphan cleanup when a reachable node's delete fails

The pre-encode orphan sweep swallowed every error for unreported (node,
volume) pairs. That is only safe for an unreachable node, which cannot
receive this encode's new generation. A reachable node whose delete
genuinely failed (permission/IO) keeps an orphan shard that a later copy
re-stamps with the new run's volume-level .vif identity, so the read guard
would accept stale data. Surface those; stay best-effort only for
unreachable nodes (gRPC Unavailable / no status).

* fix(ec): guard ecjFile under its lock in the EC delete path

EcVolume.Close nils ecjFile under ecjFileAccessLock; a delete that resolved
its .ecx lookup before a concurrent eviction (the generate-time
UnloadEcVolume) could then reach the journal append with a nil fd. Bail
with a clear "volume closed" error under the lock instead.

* fix(ec): reject an unstamped shard when the caller has an encode identity

The read guard required both identities nonzero, so a current (stamped)
caller accepted a holder with identity 0 and could be served a stale
pre-upgrade shard. Reject when the caller is stamped and the holder
differs (including unstamped); stay lenient only when the caller itself
has no identity (pre-upgrade reader). A skipped shard recovers from parity.

* fix(ec): full-teardown delete so cluster cleanup wipes a whole generation

The pre-encode cluster sweep deleted only the listed canonical shards on
remote nodes, leaving index/sidecar (and, on builds with versioned
generations, those too) behind. Add a full_teardown flag to
VolumeEcShardsDelete that evicts the volume and wipes every EC artifact for
it on every disk via removeStaleEcArtifacts; the shell and worker pre-encode
cleanup paths set it. Other delete callers (balance/decode/repair) are
unchanged.

* fix(ec): take ecjFileAccessLock before the nil-check in Sync and Close

Sync and Close read ev.ecjFile before acquiring ecjFileAccessLock while
Close nils it under the lock, a data race on the field. Take the lock
first, then nil-check inside, in both.

* fix(ec): acknowledge full_teardown so a pre-upgrade server can't fake success

An old volume server silently ignores full_teardown and returns success
for an ordinary delete, so the caller wrongly believes the generation was
wiped and copies a fresh gen-0 onto an unwiped node. Echo full_teardown_done
in the response; the worker destination cleanup fails when it is absent, and
the shell cluster sweep fails for a reported (mounted) leftover while staying
best-effort for an unreported node. encode_ts_ns stays an accepted transient
(an old server just skips the new read guard, no regression).

* fix(ec): fail the pre-encode sweep for any reachable node that can't ack teardown

A reachable pre-upgrade server ignores full_teardown and returns success
without wiping an orphan, which a later copy then folds into the new
generation. Treat a missing full_teardown_done ack as fatal for every
reachable node (best-effort only for a gRPC-unreachable one), not just for
topology-reported pairs.

* fix(ec): return the served shard identity and validate it client-side

The encode identity was only enforced server-side, so a pre-upgrade server
ignored the request field and served bytes unchecked. Echo the served
shard's EncodeTsNs on every read response chunk and have the client reject a
mismatch (including 0 from an old server), so the guard holds regardless of
server version; a rejected read recovers from parity.

* fix(ec): reject a short/empty remote shard read instead of serving zeros

doReadRemoteEcShardInterval accepted an immediate EOF or a short stream and
returned success with a partly zero-filled, unvalidated buffer (the server
stamps the identity only on chunks that carry bytes). A non-deleted interval
must arrive whole: require n == len(buf), exempting the is_deleted
short-circuit (n=0), matching readLocalEcShardInterval's local check. A short
read now fails so the caller recovers from parity.

* test(ec): fake volume server echoes the full_teardown acknowledgement

The worker now fails a teardown delete that isn't acknowledged (so a
pre-upgrade server can't silently skip the wipe). The fake server's no-op
VolumeEcShardsDelete returned an empty response, which the worker read as a
skipped teardown and aborted the encode. Echo full_teardown_done.

* feat(ec): mirror the encode-run identity guard + full_teardown into the Rust volume server

The Go volume server stamps an encode-run identity (encode_ts_ns) into the .vif
and rejects a read served from a shard of a different run; full_teardown wipes a
whole generation and acknowledges it. The Rust volume server had none of it.
Mirror the shared logic: load encode_ts_ns from the .vif onto the EcVolume,
stamp it on every read response, and reject a request/response mismatch on both
the server and the distributed-read client (recovering from parity); handle
full_teardown by evicting the volume and wiping every EC artifact on each disk,
echoing full_teardown_done so the caller can detect a server that ignored it.

* fix(ec): remove a stale .vif on full teardown of a shard-only node

A shard copy installs shards + .ecx before .vif, so an interrupted copy after a
teardown could mount the new files under the previous run's identity / version /
shard ratio / dat_file_size carried by the surviving .vif. Remove .vif during
full teardown, gated on .idx absence so a source-volume holder keeps its live
.vif. In Rust this lives in a teardown-only helper so the reconcile / load-
fallback paths (which share the base removal) still preserve .vif.

* fix(ec): treat a missing teardown ack as fatal, not as an unreachable node

isNodeUnreachable returned true for any non-gRPC-status error, so a reachable
pre-upgrade server's missing full_teardown_done ack (a plain error) was
classified unreachable and the unreported pair was silently skipped. Classify
only a real codes.Unavailable as unreachable, and wrap the missing ack in a
sentinel the sweep treats as fatal regardless. A genuinely down node still
surfaces as Unavailable from the RPC and stays best-effort.

* fix(ec): reject a short shard read in the local EC needle reader

read_ec_shard_needle ignored the byte count from shard.read_at and appended the
whole pre-sized buffer, so a truncated shard's zero-filled tail passed the later
length check and parsed as garbage. Require n == buf.len() per interval, erroring
on a short read like the local interval reader already does.

* fix(ec): probe reachability before skipping a node that returns Unavailable

The pre-encode sweep skipped any node whose teardown delete returned
codes.Unavailable, but a reachable volume server in maintenance mode also
returns that code for the maintenance-gated delete, so its stale EC files were
left behind on a node that can still receive the new generation. Confirm with a
non-maintenance-gated empty-target Ping: skip only when the node fails the probe
too (genuinely unreachable).

* fix(ec): use try_exists for the teardown .vif .idx guard

The teardown-only .vif removal gated on Path::exists(), which returns false on a
permission/IO stat error, so a stat failure on a present .idx would read as a
shard-only node and delete the live source volume's .vif. Gate on
try_exists() == Ok(false) instead, preserving the sidecar on any stat error.

* fix(ec): only skip a sweep node when a Ping confirms it is transport-down

The pre-encode sweep skipped a node whenever its teardown delete and a liveness
Ping both failed, but it treated ANY Ping error as down — an application-level
Internal/ResourceExhausted, or Unimplemented from a pre-Ping server, left a
reachable node's stale generation in place. Classify the Ping tri-state and skip
only when it transport-fails with codes.Unavailable; a reachable or inconclusive
node stays fatal.

* fix(ec): exclude sweep-skipped nodes from the encode's rebalance

The pre-encode sweep skips a genuinely-down node best-effort, but the rebalance
then recollected the current topology — a node that recovered between the two
could become a copy target and receive the new generation while still holding
its stale, never-cleared shards. Have the sweep return the skipped set and
exclude those nodes from the rebalance for this encode, so a node we could not
clean cannot receive the new generation. Standalone ec.balance is unaffected.

* fix(ec): re-sweep recovered nodes before generation so they aren't stranded

A node skipped as down by the pre-encode sweep is excluded from the rebalance,
but it can recover and become the generation host — mounting all shards locally,
then being excluded from distribution. Union-only verification accepts all
shards on one node and deletes the originals: a single point of failure. Re-sweep
the skipped nodes just before generation; one whose teardown now succeeds leaves
the skipped set and rebalances normally, while a node still down stays skipped.

* fix(ec): abort the encode if a selected source is still skipped after re-sweep

The re-sweep un-skips a recovered node, but the source was selected before it and
a node can stay down through the re-sweep then recover just in time to be the
generation host — mounting all shards locally while still excluded from the
rebalance, which union-only verification accepts before deleting the originals.
Abort the encode when a selected source remains skipped after the re-sweep.

* fix(ec): batch delete returns retriable 503 when a volume became EC mid-batch

If a volume is not EC at the batch-delete classification but is encoded to EC and
its .dat deleted before the regular-volume mutation, the mutation returns an exact
"not found" that the filer chunk-GC treats as completed, dropping the delete.
Recheck EC presence under the mutation lock and return a retriable 503 with the
"try again" token so the filer requeues it onto the EC path.

* fix(ec): recheck EC state before the regular batch-delete mutation

ec.encode mounts EC shards (copied from the .dat) before deleting the originals,
so a volume can be EC while its .dat still exists. The batch delete only rechecked
EC after a NotFound, so a successful regular-volume delete in that window wrote a
tombstone to the soon-removed .dat — the delete was lost and the needle resurrected
from the pre-tombstone shards. Recheck has_ec_volume under the write lock before
delete_volume_needle and return a retriable 503 so the filer requeues onto the EC path.

* fix(volume): make the metrics push test independent of test order

test_push_metrics_once asserted the pushed body contains the request-counter
family without ever touching the counter — a CounterVec with no children emits
nothing, so the assertion only held when another test had already created a
labelset in the shared registry. Create one in the test itself.
2026-06-10 22:31:18 -07:00
Chris Lu caadd6ca79 ci(s3tables): stop Lakekeeper flaking on Docker Hub pull timeouts (#9920)
* ci(s3tables): drop docker pre-pull from Lakekeeper job

The lakekeeper repro is pure Go against the local weed binary; the job
kept failing on Docker Hub timeouts pulling python:3 and localstack
images the test never runs. Also drop the stale python-in-docker
comments left from the old harness.

* ci(s3tables): serve python:3 from GHA cache in the STS job

Retried pulls still die when both mirror.gcr.io and registry-1.docker.io
are unreachable from the runner. Cache the saved image tarball under a
weekly key: an exact hit skips the registry entirely, a miss pulls fresh
and refreshes the cache, and a stale tarball from a previous week is the
fallback when Docker Hub is down.

* ci(spark): pre-pull the spark tag the test actually runs

The workflow warmed apache/spark:3.5.8 with retries while the
testcontainers setup runs apache/spark:3.5.1, so the real image was
pulled at test time with no retry at all.
2026-06-10 13:26:30 -07:00
Chris Lu 2871e6552a fix(s3api): drop ancestor directory markers from prefixed ListObjectVersions (#9885)
processExplicitDirectory appended a directory-key object as a version
without checking it against the prefix. A versioned listing descends
through ancestor markers to reach a deeper prefix, so every ancestor
(Veeam/, Veeam/Backup/, ...) leaked into Versions even though none of
them match the prefix - which makes Veeam's immutable repository scan
abort on an unexpected key. Guard on the prefix so only keys at or under
it surface, matching ListObjectsV2 and AWS.
2026-06-09 00:01:06 -07:00
dependabot[bot] 2945f7e226 build(deps): bump io.netty:netty-handler from 4.2.13.Final to 4.2.15.Final in /test/java/spark (#9875)
build(deps): bump io.netty:netty-handler in /test/java/spark

Bumps [io.netty:netty-handler](https://github.com/netty/netty) from 4.2.13.Final to 4.2.15.Final.
- [Release notes](https://github.com/netty/netty/releases)
- [Commits](https://github.com/netty/netty/compare/netty-4.2.13.Final...netty-4.2.15.Final)

---
updated-dependencies:
- dependency-name: io.netty:netty-handler
  dependency-version: 4.2.15.Final
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-08 22:08:07 -07:00
Chris Lu 01637410e2 test(s3): address review feedback on the versioning suite (#9846)
- Different-users bucket test: use getNewBucketName() so the bucket carries the
  tracked prefix and run id and gets swept if the test leaks, instead of an
  untracked name.
- Makefile: clarify that '.' matches the opt-in stress tests but they self-skip
  without ENABLE_STRESS_TESTS, so they don't execute in the default run.
- Versioned list test: guard the Object.Size dereference with require.NotNil.
2026-06-06 20:50:09 -07:00
Chris Lu d321f9efb4 s3: collapse suspended-versioning deletes onto one null marker (#9845)
A suspended-versioning DELETE was recorded with createDeleteMarker, which mints a
fresh real version id each time, so repeated suspended deletes piled up delete
markers instead of overwriting a single null marker as S3 specifies. Record the
suspended delete as a 'null' marker with a fixed file name (v_null) and point the
latest-version pointer at it explicitly; putSuspendedVersioningObject's existing
null-version cleanup removes it on the next suspended PUT, so the object undeletes
cleanly and at most one null marker exists. Enabled-versioning deletes are
unchanged (still distinct historical markers).

Update TestSuspendedVersioningDeleteBehavior to the AWS-correct counts: one null
marker after a suspended delete, and the null marker plus one real marker after a
re-enabled delete.
2026-06-06 20:49:38 -07:00
Chris Lu fa9bf58c86 test(s3): make the whole versioning suite pass and gate it in CI (#9844)
* test(s3): correct bucket-recreate expectations and cover the different-owner case

A same-owner CreateBucket on an existing bucket returns BucketAlreadyOwnedByYou
(idempotent recreate); the suite expected BucketAlreadyExists, which only applies
when the name is owned by someone else. Fix the same-owner cases (plain and
Object-Lock) and implement the previously-skipped different-owner test, which now
exercises the BucketAlreadyExists path via a second identity.

* test(s3): assert the deletion invariant for suspended-versioning delete

A suspended-versioning DELETE removes the null version and records a delete marker
so the object reads as deleted; the test expected no marker, which would let an
older version resurface. Assert that a marker is recorded (and read DeleteMarker
through aws.ToBool) rather than an exact count, so it holds whether or not the
suspended-marker id/dedup is later collapsed to AWS's single null marker.

* test(s3): run the whole versioning suite by default

TEST_PATTERN was TestVersioning, which left bucket-creation, suspended-delete and
directory/version-listing tests ungated. Default to '.' so every test runs; opt-in
stress tests self-skip without ENABLE_STRESS_TESTS and keep their own targets.
2026-06-06 18:38:28 -07:00
Chris Lu 795349d796 test(s3): deref Object.Size in versioned list assertion (#9843)
TestVersionedObjectListBehavior compared int64 against listedObject.Size,
which is *int64, so the assertion always failed on a type mismatch once
reached. Dereference it (and in the log line).
2026-06-06 18:02:36 -07:00
Chris Lu 309cb32416 s3: list directory key objects in versioned bucket version listings (#9842)
ListObjectVersions gated explicit directory objects on Mime ==
FolderMimeType, but an SDK PutObject of "dir/" carries a default
Content-Type (e.g. application/octet-stream), so those directory keys
were dropped from the version listing while ListObjectsV2 - which keys
off IsDirectoryKeyObject (any non-empty mime) - still showed them. Use
the same IsDirectoryKeyObject check so the two listings agree.

The directory test's storage-class assertion compared an ObjectStorageClass
constant against ObjectVersion.StorageClass (ObjectVersionStorageClass);
the values matched but the SDK enum types did not, so it only surfaced
once the directories started appearing. Use the matching constant.
2026-06-06 18:02:33 -07:00
Konstantin Lebedev df833d485f [test] update docker image for s3test (#9811) 2026-06-03 09:45:00 -07:00
Chris Lu b6a0bde16b test(s3/iam): scope ListBucket isolation via s3:prefix condition (#9805)
The username-isolation policy denied s3:ListBucket through an object-path
NotResource. ListBucket is bucket-level, so its resource ARN is the bucket
and never matches an object path: the Deny always fired and a user could
not list their own prefix. Scope the per-user List deny with a StringNotLike
s3:prefix condition instead, the same mechanism the matching Allow uses.
2026-06-02 18:41:10 -07:00
Jaehoon Kim 4b23204023 fix(vacuum): writable volume re-notification after worker VACUUM (#9732)
* fix(vacuum): notify master writable after worker vacuum commit

Add Phase 3 (markWritableOne) that walks vacuumTargets and calls
VolumeMarkWritable on each replica's volume server, mirroring
batchVacuumVolumeCommit's per-replica SetVolumeAvailable. Failures are
logged at WARN; the task does not fail because the vacuum itself
already succeeded. See upstream seaweedfs#9685.

* fix(vacuum): delay Phase 3 to let post-commit heartbeats settle

Phase 3's VolumeMarkWritable can race with the volume server's first
post-commit heartbeat. SetVolumeWritable adds the vid to writables,
but a racing heartbeat whose ReadOnly value changed re-runs
EnsureCorrectWritables against the master's per-replica cache, and any
replica still cached as ReadOnly=true silently removes the vid again
— with no further heartbeat change to trigger another recovery.

Sleep 30s after Phase 2 (Commit) so every replica's post-vacuum
heartbeat has reached the master before Phase 3 fires. Cancel cleanly
on ctx.Done so a shutdown during the wait still exits.

* fix(vacuum): reduce post-commit settle from 30s to 10s

VolumePulsePeriod is 5s, so 10s (2x) is enough margin for every
replica's post-commit heartbeat to reach the master before Phase 3
fires. 30s was overly conservative and made TestVacuumExecutionIntegration
hit its 30s context deadline.

* fix(vacuum): use flat 1m timeout for VolumeMarkWritable RPC

VolumeMarkWritable on the volume server is a metadata operation
(reopen idx + flags + master ReadOnly=false heartbeat), independent
of volume size. Scaling via vacuumTimeout(time.Minute) gave it tens
of minutes — even hours on TB volumes — so a single unresponsive
replica could block Phase 3 indefinitely. Use a flat 1m cap.

* fix(vacuum): gate post-vacuum mark-writable on commit read-only state

Phase 3 force-called VolumeMarkWritable on every replica unconditionally,
clearing the read-only flag and persisting ReadOnly=false even for a
replica left read-only by an operator, an EIO quarantine, or low disk.
That overrode states the master deliberately keeps out of writables;
master built-in vacuum gates the same step on the commit's IsReadOnly via
SetVolumeAvailable.

Capture the VacuumVolumeCommit response and skip Phase 3 when any replica
came back read-only, letting it recover on its own ReadOnly=false
heartbeat. Drop the 10s post-commit settle sleep: the heartbeat race it
guarded needed a replica cached read-only at the master, which the gate
now excludes.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-29 23:43:24 -07:00
Chris Lu f5b833ab6a test(ec): end-to-end encode over a multi-server multi-disk stuck layout (#9728)
* test(framework): support multiple disks per server in MultiVolumeCluster

StartMultiVolumeClusterWithDisks gives each volume server N data
directories (one DiskLocation each), passed to -dir as a comma list, with
a per-server disk-dir accessor for file inspection. StartMultiVolumeCluster
keeps its one-disk default.

* test(ec): end-to-end encode over a multi-server multi-disk stuck layout

A volume in the stuck state — real .dat source, a 0-byte stub replica, and
partial stale EC shards from an interrupted encode — must converge to one
valid EC layout. Asserts the full shard set across servers, .ecx/.vif kept
per server (info file survives the source-volume delete), stale shards
cleared, and no regular .dat/.idx left behind.
2026-05-28 16:44:42 -07:00
Chris Lu dfd05d14cb refactor(filer): remove the inode->path index and the NFS gateway (#9724)
* fix(filer): derive inodes by hash instead of a snowflake sequencer

Compute the same inode the FUSE mount would: non-hard-linked entries hash path + crtime, hard links hash their shared HardLinkId so every link resolves to one inode. Removes the snowflake inodeSequencer and the SEAWEEDFS_FILER_SNOWFLAKE_ID knob; inodes are now deterministic across filers.

* chore: remove the experimental NFS gateway

The NFS frontend ('weed nfs') was the only consumer of the inode->path index. Remove the weed/server/nfs package, the command and its registration, the integration test harness, and the CI workflow; go mod tidy drops the willscott/go-nfs and go-nfs-client dependencies.

* refactor(filer): drop the inode->path index

With the NFS gateway gone, nothing reads it. A regular file's inode is a pure hash of its path and a hard link's is a hash of its shared HardLinkId -- both derivable on demand -- so the secondary KV index and its write/remove hooks are dead. Removes filer_inode_index.go and the recordInodeIndex hooks from the store wrapper.
2026-05-28 15:00:18 -07:00
Konstantin Lebedev 3537312045 [docker] add make test_keycloak_s3 for local develop and debug (#9719)
* add make test_keylock_s3 for local develop and debug

* fix typos

* add condition oidc:azp

* docker: reuse test/s3/iam realm and iam config for keycloak dev compose

Point the keycloak dev compose at the existing test/s3/iam configs instead
of a parallel realm/port/key/role set. Adds one declarative realm import
(seaweedfs-test-realm.json) as the single realm source and drops the
duplicated iam.json/s3.json.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-28 13:39:32 -07:00
Jaehoon Kim d00acded8a fix(vacuum): batch all replicas in a single plugin worker task (#9702)
* fix(vacuum): batch all replicas in a single plugin worker task

The plugin worker vacuum path emitted one TaskDetectionResult per
(volume, server) replica, but the dispatcher gates duplicate tasks per
volume via ActiveTopology.HasAnyTask. The first replica's task was
created and the remaining N-1 replicas were silently dropped, so only
one replica per volume was ever vacuumed — leaving the others with all
their garbage intact.

Mirror the master built-in flow (topology.vacuumOneVolumeId →
batchVacuumVolumeCheck/Compact/Commit/Cleanup) by:

- aggregating detection metrics by VolumeID so a single task carries
  every replica in TaskParams.Sources
- having VacuumTask accept []string servers (instead of a single
  string), re-check each replica's garbage ratio at execute time to
  derive a vacuumTargets subset, and run Compact/Commit/Cleanup against
  only that subset
- updating the dispatcher (plugin_handler.Execute, register.CreateTask)
  to forward every Sources node to NewVacuumTask

* fix(vacuum): run all-replica vacuum in two phases to keep failure atomic

The prior implementation iterated Compact → Commit → Cleanup against
each replica in sequence. A Compact failure on the second replica left
the first one already committed (its active files swapped with the
.cp* files), producing replica divergence with no automatic recovery.

Split performVacuum into two phases, matching topology.vacuumOneVolumeId:

  Phase 1 — Compact all targets. If any fails, run VacuumVolumeCleanup
  on every target to drop the .cpd/.cpx/.cpldb temp files, then abort.
  No replica has swapped yet, so every replica returns to its original
  state.

  Phase 2 — Commit all targets. Best-effort, matching
  batchVacuumVolumeCommit: per-replica errors are collected and
  surfaced together. Once any replica has swapped there is no clean
  rollback, so a partial Phase 2 failure requires operator
  reconciliation.

Adds compactOne / commitOne / cleanupOne / cleanupAll helpers and
removes the old performVacuumOne.

* fix(vacuum): abort when any replica's garbage check fails

The prior check tolerated per-replica RPC errors and only failed the
task if every replica errored — partial failures were silently treated
as "ineligible" so the responding replicas would still be vacuumed.
That produces divergence the moment the unreachable replica comes
back: it still carries the original garbage while the others have
been compacted.

Match topology.batchVacuumVolumeCheck's contract instead — its return
value (errCount == 0 && len(vacuumLocationList.list) > 0) gates the
whole vacuum on every replica's check succeeding. If any replica is
unreachable or its VacuumVolumeCheck RPC errors, abort the task; the
volume will be retried on the next detection cycle once the replica
is healthy.

* fix(vacuum): guard against nil metrics and TaskSource entries

Detection's bucket-building loop dereferenced m.VolumeID without
checking m for nil. VacuumTask.Validate built sourceSet from
params.Sources without checking each entry for nil. Both paths would
panic on a malformed protobuf payload that managed to deliver a nil
slot. Skip nil entries in both loops — neutral with the existing
nil/empty filtering already done in register.CreateTask and
plugin_handler.Execute.

* test(vacuum): success path no longer calls VacuumVolumeCleanup

The plugin worker vacuum is now two-phase (Compact-all → Commit-all,
with Cleanup only invoked on Compact failure to roll back .cp* temp
files). This matches topology.vacuumOneVolumeId, where
batchVacuumVolumeCleanup runs only on the Compact-failure branch.

On a successful Commit the temp files do not linger:
  - CommitCompactVolume renames .cpd → .dat and .cpx → .idx
  - leveldb needle map renames .cpldb → .ldb (needle_map_leveldb.go)

so calling VacuumVolumeCleanup afterwards is a redundant no-op. The
prior worker code called it unconditionally and the integration test
asserted that — switch the expectation to cleanupCalls == 0 to
document the new (and master-aligned) contract.
2026-05-27 11:15:25 -07:00
Chris Lu 4f17c6661a test: keep AllocateMiniPorts off weed mini default ports
Random allocation could pick 33646 = admin.port (23646) + GrpcPortOffset.
weed mini reserves that as Admin's gRPC port even when the test only
overrides Master/Filer/S3/Iceberg, so the explicit Filer flag failed
with "reserved for gRPC calculation" and TestRisingWaveIcebergCatalog
flaked. Pre-seed the reserved set with every mini default HTTP port
plus its +10000 offset so a random pick (or its own gRPC offset) cannot
land on a service the caller left at its default.
2026-05-26 16:48:46 -07:00
Chris Lu 77dcb20a74 writeJson: drop unused JSONP branch (#9686)
* writeJson: drop unused JSONP branch

No in-tree caller uses ?callback=. Always serve application/json
with X-Content-Type-Options: nosniff.

* seaweed-volume: drop unused JSONP branch

Mirror Go: always serve application/json with
X-Content-Type-Options: nosniff.

* writeJson: drop unreachable StatusNotModified check

bodyAllowedForStatus already returns early for 304.

* test/volume_server: rename and rewrite JSONP test to assert callback is ignored

CI: /status?callback=myFunc now returns plain application/json
with X-Content-Type-Options: nosniff.
2026-05-26 01:05:07 -07:00
Chris Lu b21c263328 test/fuse_dlm: cross-mount POSIX locks + survival across a ring change (#9677)
Adds two FUSE integration tests on the existing dlm cluster harness (the
-dlm mounts route advisory locks to the owner filer):

- TestPosixLockCrossMount: an flock taken on one mount blocks the other,
  and is grantable after release — the routed-to-owner path end to end.
- TestPosixLockSurvivesFilerLoss: hold flocks on many files, stop filer1
  so keys it owned migrate to filer0; after the ring settles and the
  holding mount re-asserts, every lock is still honored. Asserts only the
  settled state; the transient migration window is unit-covered.

Locks are taken on read-only fds so the -dlm whole-file write lock (a
different mechanism, held until close) isn't involved. Skipped on
non-Linux: only Linux forwards advisory locks (SETLK) to the FUSE server;
macFUSE handles flock in-kernel per mount.
2026-05-25 16:20:23 -07:00
Chris Lu e8e7cd6fac filer: POSIX advisory lock set primitive (phase 1 of distributed FUSE locking) (#9660)
* filer: POSIX advisory lock set primitive (phase 1)

Pure per-inode conflict/coalesce/range-split logic for fcntl byte-range
and flock whole-file locks, extracted from the mount's PosixLockTable
without its wait queue or inode-map concurrency. Owner identity is
(Sid, Owner) so the same FUSE owner on different mounts never aliases,
and ReleaseSession reaps a dead mount's locks. The owner filer will hold
one Set per inode under the per-path lock; no concurrency control here.

* test: tolerate transient FUSE invisibility in ConcurrentReadWrite

A concurrent truncating overwrite leaves a short-lived dentry/cache window
where the file is momentarily ENOENT to another opener. Retry the reads and
writes a few times before failing, as ConcurrentDirectoryOperations does.
2026-05-24 21:56:48 -07:00
Chris Lu 6fc212cedb test: wait for a writable volume before lifecycle tests' first write (#9658)
Probe one throwaway write once per process before the lifecycle tests run, absorbing the post-start volume-growth window so the first real PutObject doesn't race volume growth and 500. Each call is bounded by the remaining 60s budget; CreateBucket is retried within it.
2026-05-24 14:01:13 -07:00
Chris Lu dc5621d2ae s3: use oidc: prefix for trust-policy conditions in IAM example (#9653)
* s3: use oidc: prefix for trust-policy conditions in IAM example

Trust-policy conditions for AssumeRoleWithWebIdentity see OIDC claims
under the oidc: prefix, so the docker example's bare "roles" key never
matched and denied every web-identity assume against those roles. Switch
the three roles to oidc:roles.

Also document the available trust-policy condition keys (oidc:iss/sub/aud,
oidc:<claim>, aws:FederatedProvider, aws:userid, sts:DurationSeconds) and
note that roleMapping selects the role for direct OIDC bearer auth while
STS uses the explicit RoleArn plus trust policy.

* s3: clarify aws:userid differs between trust policy and request auth

aws:userid is the raw sub claim during trust-policy evaluation, but a
stable sub+iss hash (ComputeParentUser) during S3 request authorization
after the role is assumed. Note both so the two contexts aren't conflated.
2026-05-23 20:02:48 -07:00
Chris Lu 3392493f0a test(volume): fix race in TestReplicatedUploadSucceedsImmediatelyAfterAllocate (#9613)
test(volume): wait for master to register both replicas before replicated upload

TestReplicatedUploadSucceedsImmediatelyAfterAllocate allocated the volume on
both nodes via direct AllocateVolume gRPC calls, then uploaded immediately. The
master only learns about replica locations through volume-server heartbeats,
which lag behind those direct gRPC calls, so the replicated write could look up
the master before the second replica was registered and fail with a 500
("replicating operations [1] is less than volume replication copy count [2]").

In production a client obtains its fid from the master assign flow, which
guarantees the master already knows every replica. The test crafts the fid by
hand, bypassing that guarantee, so wait until the master reports both replicas
before uploading.
2026-05-21 09:58:37 -07:00
Chris Lu 3825035f07 test(ec): deterministically populate disks before multi-disk EC balance check (#9611)
The disk-spread assertion raced volume growth and heartbeats. volume.grow
-count is a writable-target topup, not add-N, and swallows partial-failure
errors, so one grow could leave a node's data on a single disk; ec.encode
then piles all that node's shards there and ec.balance can't spread them.

Retry grow on under-spread nodes until the master topology shows every node
holding volumes on at least two physical disks, then encode.
2026-05-21 09:39:55 -07:00
Chris Lu cd15ae1395 fix(ec): bring ec.encode worker and EC/volume helpers to parity with shell (#9599)
* refactor(volume): extract replica sync/select into shared volume_replica package

Move the volume replica reconciliation helpers (status, union builder,
SyncAndSelectBestReplica, ReadNeedleMeta) out of the shell into a new
weed/storage/volume_replica package so both the shell (ec.encode, volume.tier.move,
volume.check.disk) and the EC encode worker can reuse them. No behavior change.

* fix(ec): bring ec.encode worker to parity with the shell

- Sync replicas and encode the most-complete one (via the shared
  volume_replica.SyncAndSelectBestReplica) instead of a possibly-stale replica,
  marking all replicas readonly first. Prevents silent data loss when a stale
  replica is encoded and the originals deleted.
- Skip remote/tiered volumes in detection (shell ec.encode excludes them).
- Min-node safety gate: refuse to encode when cluster nodes < parity shards.
- Align default thresholds with the shell (fullness 0.95, quiet 1h).

* fix(vacuum): plugin path honors min_volume_age_seconds override

deriveVacuumConfig hard-coded MinVolumeAgeSeconds=0, dropping any configured
value. Read it from worker config (default 0, matching the shell/master vacuum
which has no age gate) so an explicit override is honored.

* address review feedback

- config.go: align GetConfigSpec schema defaults (quiet_for_seconds=3600,
  fullness_ratio=0.95) with the runtime defaults so UI/bootstrap flows match the
  shell (coderabbitai).
- ec_task.go: roll back readonly when markReplicasReadonly fails partway, so
  already-marked replicas don't stay readonly (coderabbitai).
- volume_replica: pass the caller's replica statuses into buildUnionReplica instead
  of re-fetching them, and skip the per-needle ReadNeedleMeta RPC when the source
  replica is read-only (gemini-code-assist).

* test(plugin_workers/ec): make fixtures eligible under the new defaults

The default EC encode thresholds were raised to match the shell (fullness 0.95,
quiet 1h), but the plugin-worker integration fixtures still used 90%-full /
10-minute-old volumes, so detection found no eligible volumes and the tests failed
in CI. Bump the eligible fixtures to 96% full and 2h old.
2026-05-21 02:16:28 -07:00
Chris Lu 7e4691f2dc test(ec): make multi-disk EC balance disk-spread assertion deterministic (#9595)
test(ec): pre-populate disks so multi-disk EC balance spread is deterministic

The multidisk shard-loss regression asserts EC shards spread across more
than one disk per node, but that only holds for disks the balancer can see.
The master enumerates a physical disk only when it already holds a volume
or EC shard — an empty disk leaves no trace, since heartbeats aggregate
capacity per disk type, not per physical disk. So whether the post-encode
balance spread shards depended on how the master happened to place the
filler volumes across disks, which varies by environment: the test passed
locally (shards on 5 disks) but produced one disk per node in CI and failed
the "got 3 disks across 3 nodes" assertion.

Grow a few volumes on each server before encoding so every physical disk
holds a volume and is visible to the balancer. The volume server places
each new volume on its least-loaded disk, so a handful of grows touches
every disk, making the spread deterministic. The assertion still has teeth:
it counts disks holding shard files, so a balancer that failed to spread
would still collapse to one disk per node.
2026-05-21 00:17:14 -07:00
Chris Lu 391f543ff2 fix(ec): correct multi-disk disk counting and EC balance shard attribution (#9594)
* fix(shell): count physical disks in cluster.status on multi-disk nodes

The master keys DataNodeInfo.DiskInfos by disk type, so several same-type
physical disks on one node collapse into a single DiskInfo entry. cluster.status
(printClusterInfo) and CountTopologyResources counted len(DiskInfos), reporting
one disk per node instead of the real physical disk count, while volume.list and
the admin ActiveTopology already split per physical disk.

Route both counters through DiskInfo.SplitByPhysicalDisk so a node with N
same-type disks reports N. Cosmetic/diagnostic only; placement already uses the
per-disk activeDisk map.

* fix(ec): attribute EC balance source disk per shard and reject same-node moves

On multi-disk nodes the EC balance worker built a node-level view that kept only
the first physical disk id per (node, volume), so a move of a shard living on a
different disk reported the wrong source disk. That source disk drives the
per-disk capacity reservation, so the wrong disk drifts the capacity model the
EC placement planner relies on. Track shards per physical disk and resolve the
actual source disk for every emitted move (dedup, cross-rack, within-rack,
global), keeping the per-disk view consistent as simulated moves are applied.

Also close a data-loss trap: VolumeEcShardsDelete is node-wide (it removes the
shard from every disk on the node) and copyAndMountShard skips the copy when
source and target addresses match, so a same-node move would erase a shard it
never copied. isDedupPhase now requires the same node AND disk, and Validate /
Execute reject same-node cross-disk moves outright.

* fix(ec): spread EC balance moves across destination disks

Port the shell ec.balance pickBestDiskOnNode heuristic to the EC balance
worker so a moved shard is placed on a good physical disk instead of always
deferring to the volume server (target disk 0). The detection now builds a
per-physical-disk view of each node (free slots split from the node total, exact
EC shard count, disk type, discovered from both regular volumes and EC shards)
and, for each cross-rack, within-rack, and global move, chooses the destination
disk by ascending score:
  - fewer total EC shards on the disk,
  - far fewer shards of the same volume on the disk (spread a volume's shards
    across disks for fault tolerance), and
  - data/parity anti-affinity (a data shard avoids disks holding the volume's
    parity shards and vice versa).

Planned placements are reserved on the in-memory model during a run so multiple
shards moved to the same node spread across its disks rather than piling on one.

* fix(ec): bring EC balance worker to parity with shell ec.balance

The worker's cross-rack and within-rack balancing balanced shards by total
count; the shell balances data and parity shards separately with anti-affinity
and honors replica placement. Port that logic so the automatic balancer makes
the same fault-tolerance-aware decisions as the manual command:

- Cross-rack and within-rack now run a two-pass balance: data shards spread
  first, then parity shards spread while avoiding racks/nodes that already hold
  the volume's data shards (anti-affinity), mirroring doBalanceEcShardsAcrossRacks
  and doBalanceEcShardsWithinOneRack.
- Optional replica placement: a new replica_placement config (e.g. "020")
  constrains shards per rack (DiffRackCount) and per node (SameRackCount); empty
  keeps the previous even-spread behavior.
- The data/parity boundary is resolved from a per-collection EC ratio (standard
  10+4 here), replacing the previously hardcoded constant at the call sites.

Selection is deterministic (sorted keys) to keep behavior reproducible.

* refactor(ec): extract shared ecbalancer package for shell and worker

The EC shard balancing policy was duplicated between the shell ec.balance
command and the admin EC balance worker, and the two had drifted (multi-disk
handling, data/parity anti-affinity, replica placement). Extract the policy into
a new pure package, weed/storage/erasure_coding/ecbalancer, that both callers
share so it cannot drift again.

- ecbalancer.Plan(topology, options) runs the full policy (dedup, cross-rack and
  within-rack data/parity two-pass with anti-affinity, global per-rack balance,
  and diversity-aware disk selection) over a caller-built Topology snapshot and
  returns the shard Moves. It depends only on erasure_coding and super_block.
- The worker builds the Topology from the master topology and turns Moves into
  task proposals; the shell builds it from its EcNode model and executes Moves
  via the existing move/delete RPCs. Per-collection EC ratio resolution stays in
  each caller (passed as Options.Ratio).
- Options expose the two genuine policy differences: GlobalUtilizationBased
  (worker balances by fractional fullness; shell by raw count) and
  GlobalMaxMovesPerRack (worker moves incrementally across cycles; shell drains
  in one pass).

The shell keeps pickBestDiskOnNode for the evacuate command. Policy tests move to
the ecbalancer package; the shell and worker keep their adapter/execution tests.

* fix(ec): restore parallelism and per-type/full-range balancing after ecbalancer refactor

Address regressions and gaps from the ecbalancer extraction:

- Shell ec.balance honors -maxParallelization again: planned moves run phase by
  phase (preserving cross-phase dependencies) with bounded concurrency within a
  phase. Apply mode does only the RPCs concurrently; dry-run stays sequential and
  updates the in-memory model for inspection.
- Rack and node balancing gate on per-type spread (data and parity separately)
  instead of combined totals, so a data/parity skew is corrected even when the
  per-rack/node totals are even.
- Global rack balancing iterates the full shard-id space (MaxShardCount) so
  custom EC ratios with more than the standard total are candidates.
- Cross-rack planning decrements the destination node's free slots per planned
  move, so limited-capacity targets are no longer over-planned.

* fix(ec): make EC dedup keeper deterministic and capacity-aware

When a shard is duplicated across nodes, keep the copy on the node with the most
free slots and delete the duplicates from the more-constrained nodes, relieving
capacity pressure where it is tightest. Tie-break on node id so the choice is
deterministic. This unifies the shell and worker (the shell previously kept the
least-free node, an incidental default) on the more sensible behavior.

* fix(ec): restore global volume-diversity and per-volume move serialization

Two more behaviors lost in the ecbalancer refactor:

- Global rack balancing again prefers moving a shard of a volume the destination
  does not hold at all before adding another shard of an already-present volume
  (two-pass, mirroring the old balanceEcRack), keeping each volume's shards
  spread across nodes.
- Shell apply-mode execution serializes a single volume's moves within a phase
  while still running different volumes in parallel, so concurrent moves of the
  same volume cannot race on its shared .ecx/.ecj/.vif sidecar files.

* fix(ec): key EC balance shards by (collection, volume id)

A numeric volume id can be reused across collections, and EC identity is
(collection, vid) (see store_ec_attach_reservation.go). The ecbalancer keyed
Node.shards by vid alone, so volumes sharing an id across collections merged into
one entry — letting dedup delete a "duplicate" that is actually a different
collection's shard, and letting moves act across collections. Key shards by
(collection, vid) throughout so each volume stays distinct.

* fix(ec): credit freed capacity from dedup before later balance phases

Dedup deletions are simulated only by applyMovesToTopology, which cleared shard
bits but did not return the freed disk/node/rack slots. Later phases reject
destinations with no free slots, so a slot opened by dedup could not be reused in
the same Plan/ec.balance run. applyMovesToTopology now credits the freed
disk/node/rack capacity for dedup moves (non-dedup moves still rely on the inline
accounting their phase already did).

* test(ec): add multi-disk EC balance integration test

Cover issue 9593 end-to-end at the unit level the old tests missed: build the
master's actual multi-disk wire format (same-type disks collapsed into one
DiskInfo, real DiskId only in per-shard records), run it through a real
ActiveTopology and the Detection entry point, then replay the planned moves with
the volume server's true semantics (node-wide VolumeEcShardsDelete) and assert no
EC shard is ever lost. Covers a balanced spread, a one-node-concentrated volume,
and a multi-rack spread, and asserts moves are safe (no same-node cross-disk),
correctly attributed to the source disk, and redistribute concentrated volumes
across both other racks and multiple destination disks.

* fix(ec): aggregate per-disk EC shards when verifying multi-disk volumes

collectEcNodeShardsInfo overwrote its per-server entry for each EcShardInfo of a
volume. A multi-disk node reports one EcShardInfo per physical disk holding shards
of the volume, so only the last disk's shards survived — the node looked like it
was missing shards it actually had. This made ec.encode's pre-delete verification
(and ec.decode) under-count volumes whose shards are spread across disks on one
server, falsely aborting the encode on multi-disk clusters. Union the per-disk
shard sets per server instead.

Also make verifyEcShardsBeforeDelete poll briefly: shard relocations reach the
master via volume-server heartbeats, so a freshly distributed shard set may not be
fully visible the instant the balance returns. Retry before concluding the set is
incomplete; genuine loss still fails after the retries are exhausted.

* test(ec): end-to-end multi-disk EC balance shard-loss regression

Start a real cluster of multi-disk volume servers (3 servers x 4 disks),
EC-encode a volume, run ec.balance, and assert hard invariants the prior
integration tests only logged: after encode all 14 shards exist, ec.balance loses
no shard, shards span more than one disk per node, and cluster.status counts
physical disks (not one per node). This reproduces issue 9593 end to end and would
have caught the multi-disk shard-aggregation bug fixed alongside it.

* fix(ec): bring EC balance worker/plugin path to parity with shell

- Per-volume serialization and phase order: key the plugin proposal dedupe by
  (collection, volume) instead of (volume, shard, source), so the scheduler runs
  only one of a volume's moves at a time (within a run and against in-flight jobs).
  Concurrent same-volume moves raced on the volume's .ecx/.ecj/.vif sidecars; and
  because the planner emits a volume's moves in phase order, they now execute in
  order across detection cycles, matching the shell.
- disk_type "hdd": normalize via ToDiskType (hdd -> "" HardDriveType) while keeping
  a "filter requested" flag, so disk_type=hdd matches the empty-keyed HDD disks
  instead of nothing; apply the canonical type to planner options and move params.
- Replica placement: expose shard_replica_placement in the admin config form and
  read it into the worker config, mirroring ec.balance -shardReplicaPlacement.

* test(ec): rename worker in-process test (not a real integration test)

The worker-package multi-disk tests build a fake master topology and simulate
move execution; they are not real-cluster integration tests. Rename
integration_test.go -> multidisk_detection_test.go and drop the Integration
prefix so 'integration' refers only to the real-cluster E2Es in test/erasure_coding.

* ci(ec): remove redundant ec-integration workflow

ec-integration.yml duplicated EC Integration Tests under the same workflow name
but ran only 'go test ec_integration_test.go' (one file), so it never ran new
test files (e.g. multidisk_shardloss_test.go) and was a strict, path-filtered
subset of ec-integration-tests.yml, which already runs 'go test -v' over the whole
test/erasure_coding package on every push/PR.

* fix(ec): worker falls back to master default replication for EC balance

For strict parity with the shell, the EC balance worker now uses the master's
configured default replication as the replica-placement fallback when no explicit
shard_replica_placement is set, instead of always defaulting to even spread.

The maintenance scanner reads it via GetMasterConfiguration each cycle and passes
it through ClusterInfo.DefaultReplicaPlacement; detection resolves the constraint
(explicit config wins, else master default, else none) in resolveReplicaPlacement.
A zero-replication default (the common 000 case) still means even spread, so the
common configuration is unchanged.

* fix(ec): plugin path populates master default replication too

The plugin worker built ClusterInfo with only ActiveTopology, so the master
default replication fallback added for the maintenance path never reached
plugin-driven EC balance detection — empty shard_replica_placement still meant
even spread there. Fetch the master default via GetMasterConfiguration (new
pluginworker.FetchDefaultReplicaPlacement) and set ClusterInfo.DefaultReplicaPlacement
so both detection paths resolve replica placement identically to the shell.

* docs(ec): empty shard replica placement uses master default, not even spread

The EC balance config text (admin plugin form, legacy form help text, and
the struct/proto field comments) still said an empty shard_replica_placement
spreads evenly. The runtime resolves empty to the master default replication
(resolveReplicaPlacement), matching shell ec.balance, with even spread only
when that default is empty or zero. Update the text to match and regenerate
worker_pb for the proto comment change.
2026-05-20 23:31:21 -07:00
Chris Lu afcc491517 test: fix fd leak in the Samba DLM handoff test (promote xfail checks) (#9592)
test(mount): fix fd leak that deadlocked the DLM handoff check

The cross-mount handoff checks held a file open on mount 2 via fd 9 to
keep the distributed lock, then started the SMB writer in a background
subshell. The subshell inherited fd 9, so the SMB writer kept the file
open and waited on a lock held by its own descriptor; the put could
never complete, and the two checks were parked as expected-fail.

Close fd 9 in the subshell (9>&-) so the writer does not hold the file.
The waiter now acquires the freed lock within ~1s, so the two checks are
real assertions and the xfail machinery is gone.
2026-05-20 16:17:13 -07:00
Chris Lu a5d0e4a735 Samba-over-FUSE integration test and distributed-lock handoff fixes (#9590)
* test(mount): add Samba over FUSE integration test

Export a SeaweedFS FUSE mount over SMB with smbd and drive it with
smbclient: file round-trips, directories, rename, large-file chunking,
recursive upload, cross-protocol consistency, and deletes.

A second -dlm mount adds locking coverage: POSIX fcntl byte-range locks,
distributed-lock write coordination, and concurrent writers. The two
cross-mount handoff checks currently fail and pin a known limitation -
the distributed lock is released on FUSE Release, which the kernel can
delay under contention.

Runs locally via test/samba/run.sh or in Docker via the compose file;
wired into CI as samba-integration.yml.

* fix(cluster): release distributed lock without racing the renewal goroutine

Stop() closed the cancel channel, slept 10ms, then unlocked using
renewToken. A renewal in flight during that window rotates the token on
the server, so the unlock may be sent with a stale token, fail with a
mismatch, and leave the lock to linger until its TTL expires - stalling
other mounts waiting to write the same file.

Wait for the renewal goroutine to exit before unlocking. The channel
close also makes the renewToken read happen-after the last renewal.

* fix(cluster): poll for distributed lock acquisition without exponential backoff

A mount waiting to write a file held by another mount acquired through
util.RetryUntil, whose backoff grows to several seconds. Once the holder
released, the waiter could sleep that long before retrying, stretching
the cross-mount handoff past client timeouts.

Poll at the steady ~1s cadence AttemptToLock already enforces instead.

* test(mount): tighten Samba harness and mark the DLM handoff checks xfail

Run the workflow for weed/cluster changes, fail fast when the filer or
smbd port never opens, and fold the recursive mput result into its own
assertion so it cannot false-pass.

Mark the two cross-mount handoff checks expected-fail: they pin the
remaining DLM liveness bug (the lock is freed only on the delayed FUSE
Release) without failing CI, and turn the suite red if the handoff is
ever fixed.

* fix(cluster): keep a wedged renewal shutdown from sending a stale unlock

If the renewal goroutine is stuck in a slow RPC, Stop() fell through to
unlock anyway once it timed out waiting. A late renewal can rotate
renewToken, so that unlock races it, is rejected on a stale token, and
leaves the lock lingering until its TTL regardless. On the timeout path,
skip the unlock and let the TTL expire the lock instead.

* fix(cluster): wake the long-lived lock renewal loop promptly on Stop

StartLongLivedLock's renewal loop slept uninterruptibly between attempts,
up to 5*renewInterval (2.5*lockTTL) while unlocked. Stop() waits only
lockTTL+2s for the goroutine to exit, so a Stop() during that backoff
would time out before the goroutine woke and closed renewalDone,
breaking the shutdown synchronization. Sleep on a timer with a select on
cancelCh so the loop exits immediately.
2026-05-20 14:52:17 -07:00
Chris Lu 285025eb73 s3api: support group inline policies + Condition enforcement (#9569)
* test(s3api): cover IAM inline policy aws:SourceIp + group inline gap

Unit tests under weed/s3api/ drive PutUserPolicy / PutGroupPolicy → reload
→ VerifyActionPermission with a synthetic 127.0.0.1 request and assert that
the policy's IpAddress condition flips the outcome.

The user-policy cases pass on master (hydrateRuntimePolicies already routes
inline docs through the policy engine, so Condition blocks are honored end-
to-end). The group-policy case fails: PutGroupPolicy still returns
NotImplemented, so a group inline doc never lands in the engine.

Integration counterparts live under test/s3/iam/ and exercise the same
paths against a live SeaweedFS S3+IAM endpoint.

* s3api: support group inline policies + Condition enforcement

PutGroupPolicy/GetGroupPolicy/DeleteGroupPolicy/ListGroupPolicies used to
return NotImplemented in embedded IAM mode, so anything attached to a
group as an inline doc — including aws:SourceIp or any other Condition —
was simply unreachable.

Wire the four endpoints to the credential-store methods that were
already in place (memory, postgres, filer_etc all implement
GroupInlinePolicyStore). On every config reload, hydrateRuntimePolicies
now also walks LoadGroupInlinePolicies, registers each doc in the IAM
policy engine under __inline_group_policy__/<group>/<policy>, and
appends that key to Group.PolicyNames so evaluateIAMPolicies picks it up
through its existing group walk. PutGroupPolicy/DeleteGroupPolicy are
added to the ReloadConfiguration trigger list in DoActions.

Side fix: MemoryStore.LoadConfiguration now surfaces store.groups too.
Without it iam.groups never repopulated on a memory-store reload, so
group policy evaluation silently no-op'd whether the policy was inline
or attached. The existing tests didn't notice because no test reloaded
through cm after creating a group.

The NotImplemented unit test is inverted to drive the new round-trip.

* s3api: drop redundant refreshIAMConfiguration from Put/DeleteGroupPolicy

DoActions already triggers ReloadConfiguration for both actions via the
explicit reload list, so calling refreshIAMConfiguration inline runs the
load twice per request. Per PR review.

* s3api: scope group-policy resource names per test; tighten deny polling

- Integration test resource names get a per-test suffix so retried or
  parallel CI jobs don't trip EntityAlreadyExists / BucketAlreadyExists.
- Deny-path Eventually loops gate on AccessDenied via a typed helper
  rather than any non-nil error; transient setup errors no longer end
  the wait prematurely.
- ListGroupPolicies returns ServiceFailure when the credential manager
  is nil, matching Put/Get/DeleteGroupPolicy.

* test(s3 iam): cover both IPv4 and IPv6 loopback in allow CIDRs

CI runners with happy-eyeballs resolve `localhost` to ::1 first, in
which case a 127.0.0.0/8-only allow would silently never match and the
deny-driven enforcement test would hang for the allow case. Add ::1/128
to every loopback-matching policy so the allow path works regardless of
which loopback family the SDK lands on.
2026-05-19 16:03:45 -07:00
Chris Lu 77ac781bbd fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers (#9568)
* fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers

When a volume server holds EC shards for the same vid across more than
one disk, each DiskLocation registers its own EcVolume entry and
Store.FindEcVolume returns whichever one it hits first. The shard-info
RPC iterated only that single EcVolume's Shards, so the response missed
every shard mounted on a sibling disk.

The worker's verifyEcShardsBeforeDelete sums the per-server responses
into a union bitmap and refuses to delete the source volume when the
union falls short of dataShards+parityShards. On multi-disk
destinations, the union was systematically under-counted and source
deletion got blocked even though all shards were physically present and
mounted.

Walk every DiskLocation in the handler and emit the deduplicated union
of all shards. The .ecx-backed fields (file counts, volume size) still
come from a single EcVolume since every disk's entry opens the same
.ecx via NewEcVolume's cross-disk fallback.

Tests:
- TestVolumeEcShardsInfo_AggregatesAcrossDisks unit test in
  weed/server/.
- test/volume_server/grpc/ec_verify_multi_disk_test.go integration test
  drives the full generate -> mount -> redistribute -> restart ->
  reconcile path and asserts both VolumeEcShardsInfo and
  VerifyShardsAcrossServers + RequireFullShardSet (the production
  source-deletion gate) report all 14 shards.
- ec_multi_disk_lifecycle_test.go tightened: replaces the
  "VolumeEcShardsInfo only sees one disk's EcVolume" workaround with a
  full-shard-set assertion.

* review: use ShardBits bitmask + cap-pre-allocation for shard dedup
2026-05-19 14:58:56 -07:00
Chris Lu f72983c1fd fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table" (#9566)
* fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table"

The S3 Tables REST endpoints share top-level paths with the regular S3
API (/buckets for ListTableBuckets/CreateTableBucket, /get-table for
GetTable). They are registered first on the same router as the bucket
subrouter, so a path-style request such as GET /buckets?list-type=2 on
a bucket actually named "buckets" matched ListTableBuckets and returned
JSON. AWS SDK V2 (and Hadoop s3a / Spark) then failed XML parsing with
"Unexpected character '{' (code 123) in prolog".

Disambiguate by requiring the AWS V4 credential scope to name the
s3tables service on the colliding routes. Regular S3 SDKs sign with
service=s3, S3 Tables SDKs sign with service=s3tables, and the scope is
present in both the Authorization header and the X-Amz-Credential query
parameter for presigned URLs, so the matcher works for both flavors.

ARN-bearing S3 Tables routes (/buckets/<arn>, /namespaces/<arn>, etc.)
already cannot collide because colons are not valid in bucket names, so
they are left untouched.

* fix(s3): accept AWS JSON RPC content type as S3 Tables intent signal

The Iceberg catalog integration tests send unsigned PUT /buckets with
Content-Type: application/x-amz-json-1.1 to create table buckets. With
only the credential-scope check, those requests fell through to the
regular S3 CreateBucket handler and the suite went red on this branch.

Extend the matcher so a request is recognized as S3 Tables when either:

  - its AWS V4 credential scope names SERVICE=s3tables; or
  - it carries the canonical AWS JSON RPC 1.1 content type and is
    unsigned (a request explicitly signed for SERVICE=s3 still wins).

The regular S3 SDKs do not send application/x-amz-json-1.1, so the
signal is safe for the colliding paths (/buckets, /get-table).

Also add an AWS SDK V2 for Go integration test under
test/s3/sdk_v2_routing/ that drives the SDK's own XML deserializer
against a bucket literally named "buckets" and "get-table" — the SDK
errors before the test asserts if the server returns the wrong body
shape. Wired up via .github/workflows/s3-sdk-v2-routing-tests.yml,
mirroring the etag/acl workflow.

* s3api: extend service matcher to all S3 Tables routes; simplify scope check

- Apply serviceMatcher to every S3 Tables route, not just the bare-path
  ones. ARN-bearing paths could otherwise be hit by an S3 object key
  that starts with arn:aws:s3tables:..., inside a bucket named
  "buckets", "namespaces", "tables", or "tag". One matcher everywhere
  closes both collision classes.
- Replace strings.Split + index lookup with strings.Contains for the
  credential-scope check. The scope shape is fixed at
  AK/DATE/REGION/SERVICE/aws4_request, slashes only delimit components,
  and access keys are alphanumeric — so /s3tables/ matches iff SERVICE
  is exactly s3tables. Existing unit cases (including the
  access-key-substring case) still pass.
- Read the GetObject body in the SDK v2 routing test with io.ReadAll;
  the single Read could return short and make the equality check flaky.

* s3api: drop content-type fallback; sign s3 tables harness traffic instead

The content-type fallback in isS3TablesSignedRequest let an anonymous
regular-S3 request whose body type is application/x-amz-json-1.1 hit
an S3 Tables route when the path-style object key happened to be
shaped like an S3 Tables ARN (e.g. PutObject on bucket "buckets"
with key arn:aws:s3tables:.../bucket/foo/policy). Narrow the matcher
back to the AWS V4 credential scope so only requests signed for
SERVICE=s3tables match the S3 Tables routes.

Update the Iceberg catalog test harness — the only caller still
sending unsigned PUT /buckets — to sign with SERVICE=s3tables. The
mini instance runs in default-allow mode, so the signature itself is
not verified; only the credential scope matters for the route match.

Drop the stale unit cases for the JSON-RPC content-type signal and
the routing test that exercised unsigned harness traffic.
2026-05-19 14:24:25 -07:00
Chris Lu a761441926 fix(test): reserve mini ports on all interfaces; bound risingwave cleanup shell (#9545)
The 127.0.0.1-only reservation in AllocateMiniPorts/AllocatePortSet let
another process hold the gRPC port on a different interface, so weed
mini's isPortAvailable check failed and it shifted master.grpc. weed
shell -master=<HTTP> still derives grpc as HTTP+10000 and dialed the
unused port, hanging until the 30s context deadline killed it. Bind the
reservation listeners on :port to match mini's check.

Also bound listFilerContents in catalog_risingwave with a 30s
exec.CommandContext so a hung weed shell during failure-cleanup can't
burn the 20-minute test budget.
2026-05-18 14:16:22 -07:00
Chris Lu 6cab199400 fix(iceberg): dial filer gRPC address verbatim in plugin worker (#9527)
* fix(iceberg): dial filer gRPC address verbatim in plugin worker

dialFiler was running its address argument through pb.ServerAddress.ToGrpcAddress,
whose single-port fallback adds +10000 to any host:port — so when the admin
forwards ClusterContext.FilerGrpcAddresses (already host:grpcPort) to the worker,
the iceberg handler turns the real gRPC port (e.g. 18888) into a non-existent
28888 and dispatched jobs fail with connection refused.

Drop the conversion; the address is already dialable. Tests that produced fake
filer addresses in dual-port form now return host:grpcPort to match the new
contract.

* test(ec): use renamed detection_interval_minutes field

The admin_runtime.detection_interval_seconds field was renamed to
detection_interval_minutes back in May. This integration test was not
updated, so the unknown JSON field was silently ignored and the scheduler
fell back to the default detection interval (17 min for erasure_coding),
which exceeds the test's 5-minute wait and times out.

Switch to detection_interval_minutes: 1 — local run completes in ~120s.
2026-05-17 23:03:00 -07:00
Chris Lu c11ff6657b fix(ec): mirror EC sidecars onto every shard-bearing disk at startup (#9525)
* fix(ec): mirror EC sidecars onto every shard-bearing disk at startup

In a multi-disk volume server, ec.balance and ec.rebuild can land shards
on a disk that does not also hold the matching .ecx / .ecj / .vif index
files. The orphan-shard reconciler in reconcileEcShardsAcrossDisks
already loads those shards by pointing the EcVolume at the sibling
disk's index files; reads work, but any failure on the index-owning
disk silently disables every shard on the other disk, even though those
shards are physically fine.

This change adds mirrorEcMetadataToShardDisks, a startup pass that
physically replicates .ecx / .ecj / .vif onto each disk that holds
shards but is missing them. Each copy is atomic (tmp + fsync + rename)
and idempotent (a destination that already has the sidecar is
preserved). After mirroring, the cross-disk reconciler prefers the
local IdxDirectory so the EcVolume mounts self-contained; the
cross-disk virtual mount remains as a fallback for volumes whose mirror
failed (read-only target, out of space, partial copy on a previous
boot).

The same-disk invariant the EC lifecycle (encode / decode / balance /
vacuum / repair) was already documented as promising is now actually
restored at boot, so a future failure of one disk in a split-shards
layout no longer takes the other disk's shards with it.

Tests cover the orphan-layout mirror (dir0 receives the .ecx / .ecj /
.vif from dir1) and idempotency (an existing destination .ecx is not
overwritten with the owner's copy).

* fix(ec): handle legacy pre-dir.idx sidecar layout in mirror skip-check

hasAllEcSidecarsLocally checked only the modern destination path
(IdxDirectory for .ecx/.ecj, Directory for .vif). A destination disk
that still had a legacy .ecx in its data dir (written before -dir.idx
was set) would report "not present" and the mirror would write a
second copy to IdxDirectory, leaving two .ecx files on disk.

Matches HasEcxFileOnDisk's open-with-fallback contract: check the
modern path first, then the opposite directory. Factored the
exists-and-not-a-dir check into a small statRegular helper so the
fallback ladder stays readable.

* rust(seaweed-volume): mirror EC sidecars onto shard-bearing disks at startup

Port of the Go fix (commit 088e26ea6) to the Rust volume server.
Adds Store::mirror_ec_metadata_to_shard_disks, called from
add_location / load_new_volumes before the cross-disk orphan
reconciler. Physically copies .ecx / .ecj / .vif from the disk that
owns the index files onto every disk holding shards but missing
sidecars, so each shard-bearing disk ends up self-contained.

The reconciler now prefers the local idx_directory when the mirror
has installed a .ecx there; the cross-disk virtual mount remains as
the fallback for volumes whose mirror failed (read-only target, out
of space, partial copy on a previous boot). Adds ec_local_ecx_path
helper shared between reconcile and mirror to detect the post-mirror
fast path.

Mirrors the Go-side fallback in hasAllEcSidecarsLocally: when
-dir.idx is configured and the destination still has a legacy .ecx
in its data dir, that's recognized so the mirror does not write a
duplicate copy into idx_directory.

Tests cover the two key cases: orphan layout (dir0 receives the
sidecars from dir1) and idempotency (a pre-existing destination .ecx
is not overwritten).

* trim verbose comments on EC mirror code

Comments now lead with the WHY (non-obvious constraints, the
post-mirror fast path, why local copies are authoritative) and drop
restate-the-code blocks, headers, and section dividers. Behavior is
unchanged; all existing tests still pass on both the Go volume
server and the seaweed-volume Rust port.

* drop github issue refs from added comments

Two stray "#9212" references slipped into comments I added on the
cross-disk reconciler call site. The git log carries the issue
history; comments stand on their own.

* test(ec): accept rebuild on either disk after sidecar mirror

TestEcLifecycleAcrossMultipleDisks asserted the rebuilt shard 9 must
land at the disk-0 path. With the boot-time sidecar mirror, every
shard-bearing disk owns its own .ecx, so VolumeEcShardsRebuild now
picks whichever disk hosts the most shards — disk 1 in this layout
after the deletion. The shard can legitimately rebuild on either
disk; the test now accepts both and uses the chosen path for the
subsequent mount + read verification.
2026-05-17 19:55:15 -07:00
Chris Lu b4289abb0a admin: convert filer address to gRPC form before dispatch (#9523)
The master returns each registered filer in pb.ServerAddress dual-port
form (host:httpPort.grpcPort, e.g. 10.0.0.1:8888.18888). The admin's
plugin context builder forwarded that string verbatim as
filer_grpc_address, so workers calling grpc.DialContext on it failed
every job in ~3ms with "dial tcp: lookup tcp/8888.18888: unknown port".

Run each entry through pb.ServerAddress.ToGrpcAddress before populating
ClusterContext.FilerGrpcAddresses.

The lifecycle integration test now pins filer.port.grpc to a value that
breaks the FILER_PORT+10000 assumption, and a new dispatch test drives
the admin's /api/plugin/job-types/s3_lifecycle/run path end-to-end and
asserts the dispatched job both reaches the filer and deletes the
backdated object.
2026-05-17 11:33:54 -07:00
Chris Lu 2a41e76101 fix(ec): blanket-clean every destination over the full shard range (#9512)
* fix(ec): blanket-clean every destination over the full shard range

The previous cleanup pass walked t.sources only, with the shard ids the
topology had reported at detection time. In the wild, a destination can
end up with EC shards mounted that the topology snapshot didn't list —
shards on a sibling disk that hadn't heartbeated, or shards left over
from a concurrent attempt's mount step. FindEcVolume still returns
true, so the next ReceiveFile trips the mounted-volume guard.

Cleanup now unions t.sources (with ShardIds) and t.targets and issues
unmount + delete over [0..totalShards-1] on each. Both RPCs are
idempotent on missing shards, so the wider sweep is free.

Two new tests cover the gap: shards mounted beyond what t.sources
lists, and a target-only destination with no source row.

* log(ec): include disk_id in EC unmount/delete/refusal log lines

The current logs identify the volume and shard but leave disk_id off,
which makes the cross-server cleanup story hard to follow when
multiple disks of one server hold pieces of the same volume:

  UnmountEcShards 4121.1                              -> add disk_id
  ec volume video-recordings_4121 shard delete [1 5]  -> add per-loc disk_id
  volume server X:Y deletes ec shards from 4121 [...] -> add disk_id
  ReceiveFile: ec volume 4121 is mounted; refusing... -> add disk_ids

ReceiveFile's refusal now names the disk_ids actually holding the
mount so operators can see whether the next cleanup pass needs to
target a sibling disk. Added Store.FindEcVolumeDiskIds /
Store::find_ec_volume_disk_ids as the supporting primitive.

Mirrored in seaweed-volume/src/ (unmount log in Store::unmount_ec_shard,
heartbeat delete log in diff_ec_shard_delta_messages, refusal in the
ReceiveFile handler).

* test(ec): stub VolumeEcShardsUnmount/Delete on the fake volume server

The plugin-worker EC tests boot a fake volume server that embeds
UnimplementedVolumeServerServer. After the worker started calling
VolumeEcShardsUnmount + VolumeEcShardsDelete pre-distribute, the
default Unimplemented response surfaced as fourteen "method not
implemented" errors and TestErasureCodingExecutionEncodesShards
failed. Both RPCs are no-ops here — nothing on the fake server has
mounted state or persisted shard files to remove.
2026-05-17 11:31:37 -07:00
Chris Lu 3a8389cd68 fix(ec): verify full shard set before deleting source volume (#9490) (#9493)
* fix(ec): verify full shard set before deleting source volume (#9490)

Before this change, both the worker EC task and the shell ec.encode
command would delete the source .dat as soon as MountEcShards returned —
even if distribute/mount failed partway, leaving fewer than 14 shards
in the cluster. The deletion was logged at V(2), so by the time someone
noticed missing data the only trace was a 0-byte .dat synthesized by
disk_location at next restart.

- Worker path adds Step 6: poll VolumeEcShardsInfo on every destination,
  union the bitmaps, and refuse to call deleteOriginalVolume unless all
  TotalShardsCount distinct shard ids are observed. A failed gate leaves
  the source readonly so the next detection scan can retry.
- Shell ec.encode adds the same gate after EcBalance, walking the master
  topology with collectEcNodeShardsInfo.
- VolumeDelete RPC success and .dat/.idx unlinks now log at V(0) so any
  source destruction is traceable in default-verbosity production logs.

The EC-balance-vs-in-flight-encode race is intentionally left for a
follow-up; balance should refuse to move shards for a volume whose
encode job is not in Completed state.

* fix(ec): trim doc comments on the new shard-verification path

Drop WHAT-describing godoc on freshly added helpers; keep only the WHY
notes (query-error policy in VerifyShardsAcrossServers, the #9490
reference at the call sites).

* fix(ec): drop issue-number anchors from new comments

Issue references age poorly — the why behind each comment already
stands on its own.

* fix(ec): parametrize RequireFullShardSet on totalShards

Take totalShards as an argument instead of reading the package-level
TotalShardsCount constant. The OSS callers continue to pass 14, but the
helper is now usable with any DataShards+ParityShards ratio.

* test(plugin_workers): make fake volume server respond to VolumeEcShardsInfo

The new pre-delete verification gate calls VolumeEcShardsInfo on every
destination after mount, and the fake server's UnimplementedVolumeServer
returns Unimplemented — the verifier read that as zero shards on every
node and aborted source deletion. Build the response from recorded
mount requests so the integration test exercises the gate end-to-end.

* fix(rust/volume): log .dat/.idx unlink with size in remove_volume_files

Mirror the Go-side change in weed/storage/volume_write.go: stat each
file before removing and emit an info-level log for .dat/.idx so a
destructive call is always traceable. The OSS Rust crate previously
unlinked them silently.

* fix(ec/decode): verify regenerated .dat before deleting EC shards

After mountDecodedVolume succeeds, the previous code immediately
unmounts and deletes every EC shard. A silent failure in generate or
mount could leave the cluster with neither shards nor a valid normal
volume. Probe ReadVolumeFileStatus on the target and refuse to proceed
if dat or idx is 0 bytes.

Also make the fake volume server's VolumeEcShardsInfo reflect whichever
shard files exist on disk (seeded for tests as well as mounted via
RPC), so the new gate can be exercised end-to-end.

* fix(ec): address PR review nits in verification + fake server

- Drop unused ServerShardInventory.Sizes field.
- Skip shard ids >= MaxShardCount before bitmap Set so the ShardBits
  bound is explicit (Set already no-ops on overflow, this is for
  clarity).
- Nil-guard the fake server's VolumeEcShardsInfo so a malformed call
  doesn't panic the test process.
2026-05-13 19:29:24 -07:00
dependabot[bot] 453c735d02 build(deps): bump github.com/go-git/go-billy/v5 from 5.8.0 to 5.9.0 in /test/kafka (#9489)
build(deps): bump github.com/go-git/go-billy/v5 in /test/kafka

Bumps [github.com/go-git/go-billy/v5](https://github.com/go-git/go-billy) from 5.8.0 to 5.9.0.
- [Release notes](https://github.com/go-git/go-billy/releases)
- [Commits](https://github.com/go-git/go-billy/compare/v5.8.0...v5.9.0)

---
updated-dependencies:
- dependency-name: github.com/go-git/go-billy/v5
  dependency-version: 5.9.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-13 14:45:44 -07:00
Chris Lu d5c0a7b153 fix(ec): make multi-disk same-server EC reads work + full-lifecycle integration test (#9487)
* fix(master): include GrpcPort in LookupEcVolume response

LookupVolume already passes loc.GrpcPort through to the client; LookupEcVolume
builds Location with only Url / PublicUrl / DataCenter, so callers fall back to
ServerToGrpcAddress (httpPort + 10000). On any deployment where that
convention does not hold — multi-disk integration tests, custom port layouts
— EC reads dial the wrong port and quietly degrade to parity recovery.

* fix(volume/ec): probe every DiskLocation when serving local shard reads

reconcileEcShardsAcrossDisks (issue 9212) registers each .ec?? against the
DiskLocation that physically owns it, so a multi-disk volume server can hold
shards for the same vid in two separate ecVolumes — one per disk — with .ecx
on whichever disk owned the original .dat. The read path only consulted the
single EcVolume FindEcVolume picked, so requests for shards on the sibling
disk fell through to errShardNotLocal and then to remote/loopback recovery.

Walk all DiskLocations after the first probe in both readLocalEcShardInterval
and the VolumeEcShardRead gRPC handler; the latter also covers the loopback
that recoverOneRemoteEcShardInterval falls back to when a peer dial fails.

* test(volume/ec): cover the multi-disk EC lifecycle end-to-end

Two integration tests against a real volume server with two data dirs:

TestEcLifecycleAcrossMultipleDisks drives encode -> mount -> HTTP read ->
drop .dat -> stop -> redistribute shards across disks -> restart -> verify
reconcileEcShardsAcrossDisks attached the orphan shards and reads still
work -> blob delete -> stop -> drop a shard -> restart -> VolumeEcShardsRebuild
pulls input from both disks -> reads still work.

TestEcPartialShardsOnSiblingDiskCleanedUpOnRestart is the issue 9478
reproducer at the cluster level: seed a healthy .dat on disk 0, plant the
on-disk footprint of an interrupted EC encode on disk 1, restart, and assert
pruneIncompleteEcWithSiblingDat wipes disk 1 without touching disk 0.

Framework gets RestartVolumeServer / StopVolumeServer helpers; the previous
run's volume.log is rotated to volume.log.previous so a startup regression on
the second run does not lose the first run's diagnostics.

* review: trim verbose comments

* review: drop racy fast-path, use locked findEcShard directly

gemini-code-assist flagged the two-step lookup in readLocalEcShardInterval
and VolumeEcShardRead: the first probe (ecVolume.FindEcVolumeShard) reads
the EcVolume's Shards slice without holding ecVolumesLock, so a concurrent
mount / unmount could race with it. findEcShard already walks every
DiskLocation under the right lock, so the fast-path adds nothing but the
race. Collapse both call sites to a single locked call.

Also note in RestartVolumeServer why the log-rotation error is swallowed:
absence on first call is benign; anything else surfaces in the next
os.Create in startVolume.
2026-05-13 13:56:20 -07:00
Chris Lu b1d59b04a8 fix(s3/lifecycle): walker dispatch uses entry.Path for ABORT_MPU (#9477)
* fix(s3/lifecycle): WalkerDispatcher uses entry.Path for ABORT_MPU + shell announces load

Two CI-surfaced bugs caught by PR #9471's S3 Lifecycle Tests run on
master after PRs #9475 + #9466:

1. Walker dispatch for ABORT_MPU was sending entry.DestKey as
   req.ObjectPath. The server's ABORT_MPU handler
   (weed/s3api/s3api_internal_lifecycle.go) strips the .uploads/
   prefix to extract the upload id and reads the init record from
   that directory, so it expects the .uploads/<id> path verbatim.
   DestKey looks like a regular object path; the server's prefix
   check fails and the dispatch returns BLOCKED with
   "FATAL_EVENT_ERROR: ABORT_MPU object_path missing .uploads/
   prefix". The test fix renames TestWalkerDispatcher_MPUInitUsesDestKey
   to ...UsesUploadsPath and inverts the assertion to match the
   actual server contract.

   DestKey is still used for the WalkBuckets shard predicate and
   for rule-prefix matching in bootstrap.walker; both surfaces want
   the user's intended path, while DISPATCH wants the .uploads/<id>
   directory. The bootstrap test
   (TestLifecycleAbortIncompleteMultipartUpload) caught this when
   the walker's BLOCKED error surfaced as FATAL output.

2. test/s3/lifecycle/s3_lifecycle_empty_bucket_test.go asserts the
   shell command logs "loaded lifecycle for N bucket(s)" so a
   regression that produces half-shaped output (no load summary)
   is caught. The restored shell command (PR #9475) didn't print
   that line; add it back on the first pass that finds non-zero
   inputs.

* fix(s3/lifecycle): walker fires for walker-only buckets (empty replay path)

runShard's empty-replay sentinel (rsh == [32]byte{}) was returning
BEFORE the steady-state walker check. A bucket whose only lifecycle
rule was walker-only (ExpirationDate / ExpiredDeleteMarker /
NewerNoncurrent) would never have it dispatched because:

  - ReplayContentHash only hashes replay-eligible kinds, so
    walker-only-only snapshots produce rsh == empty.
  - The early-return persisted the empty cursor and exited before
    the steady-state walker block at the bottom of the function.

Move the walker invocation INTO the empty-replay branch so walker-
only rules dispatch on the same path as mixed-rule buckets.

TestLifecycleExpirationDateInThePast and
TestLifecycleExpiredDeleteMarkerCleanup were both timing out their
"object must be deleted" Eventually polls because of this. Caught
on PR #9471's S3 Lifecycle Tests run after PR #9475 restored the
shell entry point that exercises the integration tests.

* fix(s3/lifecycle): cold-start walker covers pre-existing objects

runShard only walked the bucket tree on the recovery branch (found
&& hash mismatch). For a fresh worker with no persisted cursor,
found=false, so the recovery walker never fired and the meta-log
replay only scanned runNow - maxTTL of events. Objects PUT before
that window — including pre-existing objects in a newly-rule-enabled
bucket — never matched the rule.

The streaming worker handled this with scheduler.BucketBootstrapper.
Daily-replay needed the equivalent: walk the live tree once on the
first run for each shard so pre-existing objects get evaluated even
when their PUT events are outside meta-log scan window.

Restructured the recovery branch to fire the walker on either
(found && mismatch) OR !found. On cold-start the cursor isn't
rewound — we keep TsNs=0 and let the drain below floor to
runNow - maxTTL like before; the walker just handles whatever the
sliding window can't reach.

TestLifecycleBootstrapWalkOnExistingObjects was the exact CI failure
this addresses (https://github.com/seaweedfs/seaweedfs/actions/runs/25777823522/job/75714014151).

* fix(s3/lifecycle): restore walker tag and null-version state

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(s3/lifecycle): parallelize shell shard sweeps

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(s3/lifecycle): bound each runPass ctx + refresh in runLifecycleShard

Two CI bugs surfaced after PR #9466 deleted the streaming worker:

1. The shell command's -refresh loop never fires. runPass used the
   outer ctx (full -runtime), so dailyrun.Run blocked for the entire
   1800s s3tests window — the background worker only ran one pass
   and never re-loaded configs that tests created mid-run.
   test_lifecycle_expiration sees 6 objects when expecting 4 because
   expire1/* never reaches the worker's snapshot. Cap each pass to
   cadence+5s when cadence>0; one-shot (cadence=0) keeps the full ctx.

2. TestLifecycleExpiredDeleteMarkerCleanup's docstring says
   "pass 1 cleans v1; pass 2 removes the now-orphaned marker," but
   runLifecycleShard invoked with no -refresh — only one pass ran.
   The marker rule can't fire in the same pass that dispatches v1's
   delete because v1 is still in .versions/. Add -refresh 1s so the
   10s runtime gets multiple passes.

* fix(s3/lifecycle): persist cursor with fresh ctx after passCtx timeout

drainShardEvents only exits via ctx cancellation for an idle subscription
— that's the steady-state when all replayed events are already past.
Saving the cursor with the canceled passCtx silently drops every
advance, so the next pass re-subscribes from the same floor and
re-replays the same events. Symptom in s3tests: status=error shards=16
errors=16 on every pass, and 1/6 expire3/* dispatches lost to a race
between concurrent shard drains all retrying the same events.

Use a 5s timeout derived from context.Background for the save, and
treat passCtx Deadline/Canceled from drain as a clean end-of-pass —
not a shard-level error to log.

* fix(s3/lifecycle): trust persisted cursor; never bump past pending events

The drain freezes cursorAdvanceTo at the last pre-skip event so pending
matches (DueTime > runNow) re-enter the subscription next pass. Combined
with the new cursor persistence, the floor bump (runNow - maxTTL) then
orphans the very events the drain stopped at.

Concrete: a rule with TTL == maxTTL fires at runNow == PUT_TIME +
maxTTL, so floor (= runNow - maxTTL) lands exactly on PUT_TIME. If the
last advance saved a cursor right before the not-yet-due PUT (e.g.,
keep2/* between expire1/* and expire3/* on the same shard), the floor
bump on pass 9 skips past the expire3 event itself — the worker never
re-reads it. Test symptom: expire3/* never expires when worker shards
include other earlier no-match events.

Cold start (found=false) still subscribes from runNow - maxTTL. Steady
state honors the cursor verbatim.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-13 00:19:05 -07:00
Chris Lu 10cc06333b cluster: restrict Ping RPC to known peers of the requested type (#9445)
Ping previously dialled whatever host:port the caller asked for. Gate
each server's Ping handler on cluster membership: masters check the
topology, registered cluster nodes, and configured master peers; volume
servers only accept their seed/current masters; filers accept tracked
peer filers, the master-learned volume server set, and configured
masters.

Use address-indexed peer lookups to keep Ping target validation O(1):
- topology maintains a pb.ServerAddress -> *DataNode index alongside
  the dc/rack/node tree, kept in sync from doLinkChildNode and
  UnlinkChildNode plus the ip/port-rewrite branch in
  GetOrCreateDataNode. GetTopology now returns nil on a detached
  subtree instead of panicking, so the linkage hooks can no-op safely.
- vid_map tracks a refcount per volume-server address so
  hasVolumeServer answers without scanning every vid location. The
  add path skips empty-address entries the same way the delete path
  already does, so a zero-value Location cannot leak a permanent
  serverRefCount[""] bucket.
- masters reuse a cached master-address set from MasterClient instead
  of walking the configured peer slice on every request.
- volume servers compare against a pre-built seed-master set and
  protect currentMaster reads/writes with an RWMutex, fixing the
  data race with the heartbeat goroutine. The seed slice is copied
  on construction so external mutation cannot desync it from the
  frozen lookup set.
- cluster.check drops the direct volume-to-volume sweep; volume
  servers no longer carry a peer-volume list, and the note next to
  the dropped probe is reworded to make clear that direct
  volume-to-volume reachability is intentionally not validated by
  this command.

Update the volume-server integration tests that drove Ping through the
new admission gate: success-path coverage now targets the master peer
(the only type a volume server tracks), and the unknown/unreachable
path asserts the InvalidArgument the gate now returns instead of the
old downstream dial error.

Mirror the same admission gate in the Rust volume server crate: a
seed-master HashSet built once at startup plus a tokio RwLock over the
heartbeat-tracked current master, both consulted in is_known_ping_target
on every Ping, with InvalidArgument returned for any target that isn't
a recognised master.
2026-05-12 13:00:52 -07:00
Chris Lu 69da20bdae volume: gate FetchAndWriteNeedle behind admin auth and refuse internal endpoints (#9441)
volume: require admin auth and refuse loopback endpoints in FetchAndWriteNeedle

Gate the RPC behind checkGrpcAdminAuth for parity with the rest of the
destructive volume-server RPCs, and reject cluster-internal remote S3
endpoints (loopback / link-local / IMDS / RFC 1918 / CGNAT) before
dialing. Pin the validated address against DNS rebinding by routing the
AWS SDK through an HTTP transport whose DialContext re-resolves the host
and re-applies the deny list on every dial, so an endpoint that resolves
to a public IP at validate-time and then flips to 127.0.0.1 at connect
time is refused. Operators that legitimately fetch from private hosts
can opt out with -volume.allowUntrustedRemoteEndpoints.
2026-05-12 10:11:20 -07:00
Chris Lu 46bb70d93e feat(s3): stamp noncurrent_since on versioned demotions (#9431)
* feat(s3): stamp noncurrent_since on versioned demotions

A version's noncurrent TTL clock starts when the next version is
written, not at its own mtime. Today the lifecycle engine derives
that moment from the next-newer sibling's mtime — a heuristic that
drifts if the sibling is later modified and is unavailable when
the demoting event sits outside meta-log retention.

Stamp Seaweed-X-Amz-Noncurrent-Since-Ns on the demoted entry at
the two places where a PUT flips the latest pointer:
updateLatestVersionInDirectory and
updateIsLatestFlagsForSuspendedVersioning. Timestamp source is
time.Now().UnixNano() captured once per demotion — the documented
Phase 1 fallback until the filer write API surfaces its own TsNs.

Engine reads the stamp on both the bootstrap walker path and the
event-driven router; missing/zero falls back to the legacy
sibling-mtime derivation, so pre-stamp entries keep working.

Prerequisite for the daily-replay lifecycle worker (Phase 2+).

* fix(s3): address CI failure and PR review feedback

- Backdating tests must move both clocks: the lifecycle integration
  tests backdate version mtimes to simulate aging, but my earlier
  commit made the engine prefer the explicit demotion stamp over
  sibling mtime, so a real-now stamp dominated a backdated mtime and
  the rule never fired. Update backdateVersionedMtime to also rewrite
  Seaweed-X-Amz-Noncurrent-Since-Ns when the entry already carries it.
  This is a test simplification — production stamps record when the
  successor was written, not the demoted version's own mtime — but the
  resulting clock is correctly old enough.

- Refactor stamp parsing into one shared helper. Per gemini-code-assist:
  the parsing logic for ExtNoncurrentSinceNsKey was duplicated in
  router/router.go and scheduler/bootstrap.go. Move it to a new
  weed/s3api/s3lifecycle/noncurrent_since.go as exported
  SuccessorFromEntryStamp; both call sites now go through it.

- Make the parser ordering test deterministic. Per coderabbitai:
  time.Now().UnixNano() drops the monotonic clock component, so
  two back-to-back calls can decrease if the wall clock steps
  backward — the prior test was exercising OS clock behavior rather
  than the parser. Replace with fixed nanosecond values.

- Close a suspended-versioning race. Per coderabbitai: the prior
  putSuspendedVersioningObject called updateIsLatestFlagsForSuspendedVersioning
  after putToFiler returned, i.e. after the object write lock released.
  A concurrent PUT could promote a newer latest version, which we'd
  then wipe — leaving the older "null" object incorrectly current.
  Move the cleanup into the afterCreate callback so the null write and
  the .versions pointer clear (including the new demotion stamp) run
  atomically under the same lock. Best-effort logging is preserved.

* fix(s3/lifecycle): clear noncurrent_since stamp on test backdate

Backdating a version's mtime in tests is not a coherent claim about
when it became noncurrent — production stamps record the successor's
PUT time, which the test doesn't manipulate. The prior commit rewrote
the stamp to the backdated instant, but for TestLifecycleNewerNoncurrent
that creates an inconsistent state: v3's stamp says "demoted 30 days
ago" while v4's mtime (the supposed demoter) is real-now. With both
NewerNoncurrentVersions and NoncurrentDays in the same rule, the
NoncurrentDays floor passes against the backdated stamp and the
rank-based check then deletes v3 via the meta-log historical replay
that misranks against current state.

Clearing the stamp instead lets the lifecycle engine fall back to the
sibling-mtime derivation the tests were originally written against:
the legacy code path is preserved end-to-end while the new explicit-
stamp path is exercised by the unit tests in s3lifecycle/noncurrent_since_test.go
and the bootstrap-walker integration in scheduler/bootstrap_test.go.

The deeper interaction — historical meta-log replay ranking against
current state inside routePointerTransitionExpand — is pre-existing
and is no longer masked by the freshly-PUT successor's mtime once the
stamp is read. Tracked separately; not blocking this PR.

* fix(s3): stamp noncurrent_since before the .versions/ pointer flip

The pointer-flip on the .versions/ directory emits a meta-log event that
the lifecycle router consumes via routePointerTransition. The router
then calls LookupVersion on the demoted version's id. With the prior
ordering — pointer flip first, stamp second — the router could read
the demoted entry before markVersionNoncurrent landed and fall back to
the legacy sibling-mtime derivation.

Versioned COPY is the clean break: the new latest version keeps the
source object's mtime instead of recording the moment v_old was
demoted, so the fallback's successor clock can be arbitrarily wrong.
Reorder both updateLatestVersionInDirectory and
updateIsLatestFlagsForSuspendedVersioning so the stamp is written
first; the pointer flip then emits an event into a state where the
stamp is already present.

Failure of the stamp write remains non-fatal — lifecycle still falls
back to the legacy derivation in that case, with the same caveats as
before the PR but no race window.
2026-05-11 13:41:33 -07:00
Chris Lu c7b01c72b2 test(s3/lifecycle): integration coverage for versioning + filters (#9415)
* test(s3/lifecycle): integration coverage for versioning + filters

First integration-test bundle building on the existing single-test
backdating harness. Each scenario follows the same shape: create
bucket, set lifecycle, PUT object, backdate mtime via filer
UpdateEntry, run the shell command for one shard sweep, assert
S3-side state.

Five new tests:

- TestLifecycleVersionedBucketCreatesDeleteMarker: Expiration on a
  versioned bucket must produce a delete marker (latest after worker
  runs is a marker) AND keep the original version directly addressable
  by versionId. ListObjectVersions confirms IsLatest=true on the
  marker.

- TestLifecycleNoncurrentVersionExpiration: NoncurrentVersionExpiration
  fires only on demoted versions. PUT v1, PUT v2 (so v1 → noncurrent),
  backdate v1, run worker. v1 must be gone, v2 still current.

- TestLifecycleExpiredDeleteMarkerCleanup: combined rule (noncurrent +
  expired-delete-marker) cleans up a sole-survivor marker. PUT v1,
  DELETE (creates marker), backdate both, run worker. Every version
  AND marker must be gone for the key.

- TestLifecycleDisabledRuleSkipsObject: rule with Status=Disabled
  must not produce dispatches even on a backdated match. Negative
  test for the engine's enabled-status gate.

- TestLifecycleTagFilter: rule with And{Prefix, Tag} only matches
  objects carrying the tag. Two backdated objects (one tagged, one
  not) — only the tagged one is removed.

Helpers extracted to keep each test focused: putVersioningEnabled,
putNoncurrentExpirationLifecycle, putExpiredDeleteMarkerLifecycle,
backdateVersionedMtime (ages a specific .versions/v_<id> entry),
runLifecycleShard (one-shot shell invocation with FATAL guard).

* test(s3/lifecycle): tighten noncurrent expiration diagnostics

Local run showed TestLifecycleNoncurrentVersionExpiration failing
with a bare 404 on HEAD(latest), not enough to tell whether v2 was
deleted, the bare-key pointer was removed, or a delete marker was
synthesized. Strengthen the test to:

- HEAD by versionId=v2 first, so we pin "v2 file still on disk"
  separately from "the latest pointer resolves to v2"
- on HEAD(latest) failure, log ListObjectVersions output (versions +
  markers, with IsLatest) so the next failure shows which side the
  bug is on rather than just NotFound

* test(s3/lifecycle): integration coverage for AbortIncompleteMultipartUpload

Exercises the lifecycleAbortMPU handler path that the prefix-based
expiration tests can't reach — routing keys off of .uploads/<id>/
directory events, not regular object events, and the dispatcher uses
a different RPC path (rm on the .uploads/<id>/ folder).

Setup: AbortIncompleteMultipartUpload rule with DaysAfterInitiation=1,
CreateMultipartUpload, UploadPart (so the directory carries the
right shape), backdate the .uploads/<uploadID>/ directory entry 30
days, run the worker. The upload must drop out of
ListMultipartUploads.

Helpers added: putAbortMPULifecycle, backdateUploadDir.

* test(s3/lifecycle): integration coverage for NewerNoncurrentVersions

NewerNoncurrentVersions=N keeps the N most recent noncurrent versions
and expires the rest. Distinct from per-version NoncurrentDays —
depends on per-version rank, not just per-version age — and routes
through routePointerTransition's "needs full expansion" path.

Setup: PUT v1, v2, v3, v4 on a versioned bucket (v4 current; v1-v3
noncurrent), backdate v1+v2+v3 so all satisfy the NoncurrentDays>=1
floor, run the worker. Expect v1+v2 expired (older noncurrent),
v3 (newest noncurrent within keep=1) and v4 (current) preserved.

Helper added: putNewerNoncurrentLifecycle.

* test(s3/lifecycle): integration coverage for suspended-versioning Expiration

Suspended versioning takes a distinct code path in lifecycleDispatch:
the VersioningSuspended branch first deletes the null version (via
deleteSpecificObjectVersion(versionId="null")) and then writes a
fresh delete marker on top. Other branches (Enabled → only writes a
marker; Off → straight rm) miss this two-step.

Setup: enable versioning, PUT v1 (real versionId), suspend
versioning, PUT again (creates the null version, demotes v1 to
noncurrent), set the Expiration rule, backdate the null at the
bare path. Expect: latest is now a fresh delete marker, the
"null" version is gone from ListObjectVersions, and v1 (noncurrent
under Enabled) still addressable directly — suspended Expiration
must only touch the null, not other versions.

Helper added: putVersioningSuspended.

* test(s3/lifecycle): integration coverage for multi-bucket sweep

A single shell-driven shard sweep must process every bucket carrying
lifecycle config, not just the first one alphabetically. Pinned
because the scheduler iterates the buckets directory and a regression
that returns early after the first match would silently disable
lifecycle for every later bucket.

Two buckets, each with their own prefix-expiration rule and a
backdated object. Both must be expired after the same sweep.

* test(s3/lifecycle): integration coverage for ObjectSizeGreaterThan filter

ObjectSizeGreaterThan is a strict > gate (filterAllows uses
ev.Size <= rule.FilterSizeGreaterThan to reject). Pinned at the
boundary: an object whose size equals the threshold must remain;
only an object strictly larger expires. Catches a > vs >= flip.

Two backdated objects on the same prefix, sizes 100 and 150 with
threshold=100 — boundary survives, larger expires.

* test(s3/lifecycle): scrub bucket lifecycle config + versions on cleanup

Tests share one weed mini server. Two pollution modes were producing
order-dependent failures:

- A later test's shard sweep would still load the prior test's
  lifecycle config (the worker reads every bucket's XML from filer
  state, and DeleteBucket alone doesn't drop lifecycle config
  cleanly on this codebase).
- Versioned-bucket tests left versions + delete markers behind that
  ListObjectsV2 can't see, so the existing best-effort empty-then-
  delete didn't actually empty those buckets.
- The AbortMPU test intentionally leaves an in-flight upload; without
  an explicit AbortMultipartUpload the bucket DELETE hits NotEmpty.

Cleanup now runs DeleteBucketLifecycle, ListObjectVersions →
DeleteObject(versionId), ListObjectsV2 → DeleteObject (catches what
ListObjectVersions missed), ListMultipartUploads → AbortMultipartUpload,
then DeleteBucket. Best-effort throughout so a half-torn-down bucket
doesn't fail the cleanup chain.

* test(s3/lifecycle): backdate both versions for NoncurrentDays clock

Per codex review: NoncurrentDays is clocked from the SUCCESSOR
version's mtime (when the displaced version became noncurrent), not
from the displaced version's own mtime. Backdating only v1 left the
clock (v2's mtime) at "now" and the rule never fired — the test was
wrong, not the production path.

Backdate v1=31d and v2=30d so v1 sits past the 1-day threshold
relative to v2, the noncurrent rule fires, and v2 stays current.

* test(s3/lifecycle): assert specific NotFound on multi-bucket deletion

Per codex review: TestLifecycleMultipleBucketsInOneSweep treated any
HeadObject error as "deleted", which lets a transport failure or
dead endpoint mask a real bug. Recognize NoSuchKey/NotFound/HTTP-404
specifically via a small isS3NotFound helper so the assertion
actually proves deletion happened, not just that the call broke.

* test(s3/lifecycle): gofmt size-filter test

* test(s3/lifecycle): integration coverage for Object Lock skip

Object Lock retention must override the lifecycle rule. The handler's
enforceObjectLockProtections check (s3api_internal_lifecycle.go:47)
returns an error when retention is active; the dispatcher then
classifies the outcome as SKIPPED_OBJECT_LOCK and the object stays.
No existing integration test reaches that outcome.

Setup: bucket created with ObjectLockEnabledForBucket=true, expiration
rule on prefix "lock/", two backdated objects under the same prefix —
one with GOVERNANCE retention until 1h from now, one without. After
the worker runs, the unlocked object expires (positive control); the
locked one survives.

Custom cleanup uses BypassGovernanceRetention so the test can drop
the locked version when the test finishes — otherwise the retention
window keeps the bucket from being deleted.

* test(s3/lifecycle): integration coverage for config update between sweeps

An operator changes the lifecycle rule between two shell-driven
sweeps. The second sweep must respect the NEW rule, not a cached
copy of the old one. Each runLifecycleShard invocation spawns a
fresh weed shell subprocess, so cached engine state from a previous
sweep doesn't persist — but a regression that caches rules across
PutBucketLifecycleConfiguration calls within the S3 server itself
would still surface here.

Sweep 1: rule prefix="first/", PUT + backdate firstKey, run worker
→ firstKey expires.

Update rule to prefix="second/", PUT + backdate secondKey AND a
new key under the OLD prefix ("first/post-update.txt"). Sweep 2
must expire only the second-prefix object; the post-update old-
prefix one must survive — config replacement, not merge.

* test(s3/lifecycle): integration coverage for ExpirationDate (past)

Rules with Expiration{Date: <past>} route through ScanAtDate in the
engine (decideMode's ActionKindExpirationDate case) — a separate
compile + dispatch branch from the EventDriven delay-group path the
Days-based tests exercise.

Past date + in-prefix object → must expire. Out-of-prefix object →
must remain. Object also backdated as defense-in-depth so the
assertion doesn't depend on whether the dispatcher consults
MinTriggerAge for date kinds.

* test(s3/lifecycle): integration coverage for bootstrap walk on existing objects

Production scenario: operator enables lifecycle on a bucket that
already holds objects from before the policy. The worker must
discover them via the bootstrap walk (BucketBootstrapper) — there
were no meta-log events to observe because the objects predate the
rule. Without the bootstrap path, only NEW writes would ever match.

Setup: PUT 5 objects (no lifecycle config yet) + 1 out-of-prefix
survivor, backdate all, THEN set the Expiration rule, run the
worker. Every in-prefix pre-existing object must be expired; the
out-of-prefix one must remain.

* test(s3/lifecycle): integration coverage for DeleteBucketLifecycle stops dispatching

Operator UX: after DeleteBucketLifecycle, the worker must observe the
removal on the next sweep and stop expiring objects under the now-gone
rule. A regression that caches old configs across
PutBucketLifecycleConfiguration → DeleteBucketLifecycle would keep
silently dropping objects.

Setup: positive control (rule active, backdated obj expires) →
DeleteBucketLifecycle → PUT + backdate a fresh object → second
sweep. The fresh object must remain.

* test(s3/lifecycle): integration coverage for empty bucket sweep no-op

A bucket carrying lifecycle config but no objects must produce a
successful sweep — no hangs, no errors, no dispatches. Pinned
because the bootstrap walker iterates bucket directories, and an
empty directory is a corner of that traversal that's easy to break
(slice-bounds bug on the first listing returning zero entries).

Asserts: worker logs "loaded lifecycle for" and "shards 0-15
complete", no FATAL output, bucket still exists after the sweep.

* test(s3/lifecycle): fix Object Lock backdate path + skip unwired ScanAtDate

ObjectLock: enabling Object Lock on a bucket implicitly enables
versioning, so PUT objects land at .versions/v_<id>, not at the bare
key. The test was calling backdateMtime (bare path) and failing in
the helper with "filer: no entry is found". Switch to
backdateVersionedMtime with the versionId returned by PutObject.

ExpirationDate: ScanAtDate dispatch path isn't wired to the run-shard
shell command yet — the bootstrap walker explicitly skips actions in
ModeScanAtDate (walker.go:141 says "SCAN_AT_DATE runs its own date-
triggered bootstrap" but no such bootstrap exists in the scheduler or
shell). Skip with a t.Skip + explanation so the test activates the
moment the date-triggered path lands.

* fix(s3/lifecycle): wire ExpirationDate dispatch through bootstrap walker

The walker explicitly skipped ModeScanAtDate actions on the comment
"SCAN_AT_DATE runs its own date-triggered bootstrap" — but no such
bootstrap exists in the scheduler or shell layer. The result: rules
with Expiration{Date: ...} compiled correctly, populated the
snapshot's dateActions map, and were never dispatched.
ExpirationDate is silently a no-op in production.

EvaluateAction already handles ActionKindExpirationDate correctly
(rejects when now.Before(rule.ExpirationDate), otherwise emits
ActionDeleteObject). The walker just needed to fall through instead
of skipping. Pre-date walks become no-ops via EvaluateAction's date
check; post-date walks expire eligible objects.

Un-skip TestLifecycleExpirationDateInThePast — it now exercises the
fixed path end-to-end.

* test(s3/lifecycle): integration coverage for multiple rules per bucket

A single bucket carries two independent Expiration rules with disjoint
prefix filters and different Days thresholds. Each rule must fire
only on its prefix; objects outside both prefixes must survive.

Pinned because Compile builds one CompiledAction per rule per kind
all sharing the same bucket index — a bug that lets one rule's
prefix or threshold leak into another (e.g. last-write-wins on a
shared map) would silently expire wrong objects.

Setup: rule A with prefix=logs/ Days=1, rule B with prefix=tmp/
Days=7. Three backdated objects: logs/access.log, tmp/scratch.bin,
data/keep.bin. After the worker runs, logs/ + tmp/ are gone;
data/ — outside both rule prefixes — survives.

* fix(s3/lifecycle): mark ScanAtDate actions active in Compile

Two layers were silently filtering ScanAtDate actions out of routing:
the walker's mode skip (fixed in e785f59d6) and Compile only marking
ModeEventDriven actions active. MatchPath / MatchOriginalWrite both
require IsActive() to emit a key, so a ScanAtDate action that's never
marked active never reaches a dispatch path even after the walker
falls through.

ScanAtDate's only dispatch path is the bootstrap walk's MatchPath
call — there's no bootstrap-completion rendezvous to wait on. Make
the active flag include ModeScanAtDate alongside the
EventDriven+BootstrapComplete combination.

ExpirationDate-based rules now actually fire end-to-end. The
TestLifecycleExpirationDateInThePast integration test exercises this.

* fix(s3/lifecycle): route date kinds via ComputeDueAt

ExpirationDate has MinTriggerAge=0, so router computed
dueTime = info.ModTime + 0 = info.ModTime. For a backdated entry
that mtime is BEFORE rule.ExpirationDate, so EvaluateAction's
now.Before(rule.ExpirationDate) check returned ActionNone and the
date rule never fired through the event-driven path.

ComputeDueAt already knows the per-kind shape — rule.ExpirationDate
for date kinds, ModTime+Days for the rest — so use it as the
single source of truth for dueTime in Route's main loop.

* test(s3/lifecycle): pin bootstrap walker date dispatch

The original TestWalk_DateActionsSkipped pinned the pre-e785f59d6
behavior that the regular walker skipped ExpirationDate. That
walker was rewired to fire date rules whose date has passed (the
SCAN_AT_DATE bootstrap was never wired); update the test to match.

Split into two: post-date entries dispatch, pre-date entries don't.

* test(s3/lifecycle): drop unused putExpiredDeleteMarkerLifecycle

The helper was never called — TestLifecycleExpiredDeleteMarkerCleanup
constructs a combined noncurrent + expired-marker rule inline, which
the helper doesn't cover. The blank-assignment workaround was just
hiding dead code; remove both.

* test(s3/lifecycle): tighten HeadObject termination check to typed not-found

Generic err != nil also passes on transport/auth/timeouts, letting
the test go green without proving the lifecycle action actually
fired. Switch the three Eventuallyf HeadObject predicates to
isS3NotFound, matching the pattern already in the multi-bucket and
expiration-date tests.

* test(s3/lifecycle): guard ListObjectVersions diagnostic against nil

When ListObjectVersions errors, listOut is nil and the diagnostic
log path panics on listOut.Versions before the real assertion fires.
Branch on (listErr != nil || listOut == nil) so the failure log is
robust whatever ListObjectVersions returned.
2026-05-10 09:30:50 -07:00
Chris Lu 85abf3ca88 feat(shell): s3.lifecycle.run-shard + integration test (#9361)
* feat(shell): s3.lifecycle.run-shard for manual Phase 3 dispatch

Subscribes to the filer meta-log filtered to one (bucket, key-prefix-hash)
shard, routes events through the compiled lifecycle engine, and dispatches
due actions to the S3 server's LifecycleDelete RPC. Persists the per-shard
cursor to /etc/s3/lifecycle/cursors/shard-NN.json so subsequent runs resume.

Operator-runnable harness for end-to-end Phase 3 validation while the
plugin-worker auto-scheduler is still pending. EventBudget bounds a single
invocation; flags expose dispatch + checkpoint cadence.

Discovers buckets by walking the configured DirBuckets path and reading
each bucket entry's Extended[s3-bucket-lifecycle-configuration-xml]
through lifecycle_xml.ParseCanonical. All compiled actions are seeded
BootstrapComplete=true so the run dispatches whatever fires immediately;
production bootstrap walks set this incrementally per bucket.

* test(s3/lifecycle): integration test driving the run-shard shell command

Spins up 'weed mini', creates a bucket with a 1-day expiration on a prefix,
PUTs the target object, then rewrites the entry's Mtime via filer
UpdateEntry to 30 days ago. Runs 's3.lifecycle.run-shard' for every
shard via 'weed shell' subprocess and asserts the backdated object is
deleted within 30s, and the in-prefix-but-recent object remains.

The S3 API rejects Expiration.Days < 1, so 'wait a day' is unworkable.
Backdating via the filer's gRPC sidesteps that constraint while still
exercising the real Reader -> Router -> Schedule -> Dispatcher ->
LifecycleDelete RPC path end-to-end.

Wires a new s3-lifecycle-tests job into s3-go-tests.yml. The test runs
all 16 shards because ShardID(bucket, key) is hash-based and the test
shouldn't couple to that detail; running every shard keeps the test
independent of the hash function.

* fix(shell/s3.lifecycle.run-shard): address review findings

- Reject negative -events explicitly. Help text already defines 0 as
  unbounded; negative budgets created ambiguous behavior in pipeline.Run.
- Bound the gRPC dial with a 30s timeout instead of context.Background()
  so an unreachable S3 endpoint doesn't hang the shell.
- Paginate the bucket listing in loadLifecycleCompileInputs. SeaweedList
  takes a single-RPC limit; the prior 4096 silently dropped buckets
  past that page on large clusters. Loop with startFrom until a page
  comes back short.
- Surface parse errors instead of swallowing them. Buckets with
  malformed lifecycle XML now print the first three errors verbatim
  and a count for the rest, so an operator running this command for
  diagnostics can find what's wrong.

* feat(shell/s3.lifecycle.run-shard): -shards range/set with one subscription

Adds -shards "lo-hi" or "a,b,c" to the manual run command and threads
the same model through Reader and Pipeline.

- reader.Reader gains ShardPredicate (func(int) bool) and StartTsNs;
  ShardID stays for the single-shard short form. Event carries the
  computed ShardID so consumers can route per-shard without rehashing.
- dispatcher.Pipeline gains Shards []int. When set, Run holds one
  Cursor + Schedule + Dispatcher per shard, opens one filer
  SubscribeMetadata stream with a predicate covering the whole set,
  and routes events into the matching shard's schedule from a single
  dispatch goroutine — no per-shard goroutine fan-out.
- shell command parses -shard or -shards (mutually exclusive),
  formats progress messages with a contiguous-range label when
  applicable, and validates against ShardCount.

Integration test now uses -shards 0-15 (one subprocess invocation)
instead of a 16-iteration loop.

* fix(s3/lifecycle): allow Reader with StartTsNs=0 + Cursor=nil

The reader rejected the legitimate 'fresh subscription from epoch'
state when called from a fresh Pipeline.Run on a multi-shard worker
(no cursor file yet, all shards' MinTsNs=0). The downstream
SubscribeMetadata call handles SinceNs=0 fine; the up-front check
was over-defensive and broke the auto-scheduler completely (CI
showed 5-second-cadence retries with this exact error).

* fix(s3/lifecycle): schedule from ModTime not eventTime

A backdated or out-of-band entry update has eventTime ≈ now while
ModTime is far in the past; eventTime+Delay would push the dispatch
into the future even though the rule already fires. ModTime+Delay
is the correct fire moment. The dispatcher's identity-CAS still
catches drift between schedule and dispatch.

* fix(s3/lifecycle): -runtime cap on run-shard so it exits on quiet shards

The CI integration test sets -events 200 expecting the subprocess to
return after 200 in-shard events. But -events counts only events that
pass the shard filter; the test produces ~5 such events (bucket
create, lifecycle PUT, two object PUTs, mtime backdate), so the
reader stays in stream.Recv forever and runShellCommand hangs the
test deadline.

- weed/shell/command_s3_lifecycle_run_shard.go: add -runtime D flag.
  When > 0, Pipeline.Run runs under context.WithTimeout(D); on
  expiry the reader/dispatcher drain cleanly and the cursor saves.
- weed/s3api/s3lifecycle/dispatcher/pipeline.go: treat
  context.DeadlineExceeded the same as context.Canceled at exit
  (both are graceful shutdown signals).

* test(s3/lifecycle): pass -runtime 10s to run-shard

Pair with the new -runtime flag so the subprocess exits cleanly
after 10s instead of waiting for an event budget that never lands
on quiet shards.

* refactor(s3/lifecycle): extract HashExtended to s3lifecycle pkg

The worker's router needs the same length-prefixed sha256 of the entry's
Extended map; pulling it out of the s3api private file lets both sides
import it.

* fix(s3/lifecycle): worker captures ExtendedHash for identity-CAS

Without this, the dispatcher sends ExpectedIdentity.ExtendedHash = nil
while the live entry on the server has a non-nil hash, so every dispatch
returns NOOP_RESOLVED:STALE_IDENTITY and nothing is ever deleted.

* fix(s3/lifecycle): identity HeadFid via GetFileIdString

Meta-log events go through BeforeEntrySerialization, which clears
FileChunk.FileId and writes the Fid struct instead. Reading .FileId
directly returns "" on the worker side while the server's freshly
fetched entry still has a populated string, so the identity-CAS would
mismatch and every expiration ended in NOOP_RESOLVED:STALE_IDENTITY.

* fix(s3/lifecycle): treat gRPC Canceled/DeadlineExceeded as graceful exit

errors.Is doesn't unwrap a gRPC status error back to the stdlib ctx
errors, so a subscription that ends because runCtx was canceled was
being logged as a fatal reader error. Check status.Code as well so the
shell's -runtime cap exits cleanly.

* fix(test/s3/lifecycle): pass the gRPC port (not HTTP) to run-shard

run-shard's -s3 flag dials the LifecycleDelete gRPC service, which
listens on s3.port + 10000. The integration test was passing the HTTP
port instead, so the dispatcher's RPC just timed out and the shell
command exited under -runtime with no work done.

* chore(test/s3/lifecycle): drop emoji from Makefile output

* docs(test/s3/lifecycle): correct '-shards 0-15' wording

* fix(s3/lifecycle): reject out-of-range shard IDs in Pipeline.Run

The shell's parseShardsSpec already validates, but a programmatic caller
(scheduler, future worker config) shouldn't be able to silently produce
no-op states by passing -1 or 99.

* fix(s3/lifecycle): bound drain + final-save with their own timeouts

Shutdown was using context.Background, so a stuck dispatcher RPC or
filer save could keep Pipeline.Run from ever returning.

* fix(test/s3/lifecycle): drop self-killing pkill in stop-server

The pkill pattern \"weed mini -dir=...\" is also in the running shell's
argv (it's the recipe body), so pkill -f matches its own bash and the
recipe exits with Terminated. CI test job passed but the cleanup step
failed with exit 2. The PID file is sufficient on its own.

* docs(test/s3/lifecycle): document S3_GRPC_ENDPOINT env var
2026-05-08 09:59:10 -07:00
dependabot[bot] c918660901 build(deps): bump io.netty:netty-transport-native-epoll from 4.1.132.Final to 4.2.13.Final in /test/java/spark (#9365) 2026-05-08 06:00:51 -07:00