1493 Commits

Author SHA1 Message Date
Chris Lu 37962e2445 admin: configure maintenance tasks via admin.toml (#9926)
* admin: configure maintenance tasks via admin.toml

Maintenance task settings could only be edited in the admin UI and live
under <dataDir>/conf, so they silently reverted to defaults whenever the
data directory was recreated. An optional admin.toml now declares vacuum,
balance, and erasure coding settings; keys set there are written through
to the persisted task configs at every startup, overriding UI edits, so
the configuration stays declarative. Generate an example with
"weed scaffold -config=admin".

* vacuum: round min volume age up to whole hours

MinVolumeAgeSeconds was truncated by integer division when converted to
the hour-granular protobuf field, so a sub-hour setting silently became
0 and disabled the age guard.
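
A minimal Go sketch of the round-up conversion described above, next to the truncating division it replaces. The helper name ceilHours is hypothetical and is not taken from the SeaweedFS source.

```go
package main

import "fmt"

// ceilHours converts seconds to whole hours, rounding up, so that any
// positive sub-hour value becomes 1 instead of 0.
// Hypothetical helper for illustration only.
func ceilHours(seconds int64) int64 {
	if seconds <= 0 {
		return 0
	}
	return (seconds + 3599) / 3600
}

func main() {
	for _, s := range []int64{0, 1, 1800, 3600, 3601} {
		// s/3600 is the truncating conversion that turned 1800 s into 0 h.
		fmt.Printf("%d s: truncated=%d h, rounded up=%d h\n", s, s/3600, ceilHours(s))
	}
}
```

With rounding up, a 30-minute setting maps to 1 hour, so the age guard stays active.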

* admin: split and normalize preferred_tags from admin.toml

A comma-separated string, as set via environment variable, came through
viper as a single slice element. Split on commas and reuse
util.NormalizeTagList, matching the plugin config path.
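
A rough Go sketch of the split step, assuming the normalization trims whitespace, drops empty items, and removes duplicates. What util.NormalizeTagList actually does is not shown in this log, so the function below is a hypothetical stand-in.

```go
package main

import (
	"fmt"
	"strings"
)

// splitAndNormalizeTags expands comma-separated elements into individual
// tags. The trimming and de-duplication rules are assumptions.
func splitAndNormalizeTags(raw []string) []string {
	seen := make(map[string]struct{})
	var out []string
	for _, item := range raw {
		for _, tag := range strings.Split(item, ",") {
			tag = strings.TrimSpace(tag)
			if tag == "" {
				continue // skip empty items such as the gap in "a,,b"
			}
			if _, dup := seen[tag]; dup {
				continue
			}
			seen[tag] = struct{}{}
			out = append(out, tag)
		}
	}
	return out
}

func main() {
	// A value set through an environment variable arrives as one element.
	fmt.Println(splitAndNormalizeTags([]string{"ssd, fast,,ssd"})) // prints [ssd fast]
}
```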

* scaffold: clarify admin.toml wording
2026-06-11 11:04:52 -07:00
Chris Lu e56a1c4c05 admin: pre-gzip embedded static assets, add cache headers (#9918)
The admin UI served embedded static files uncompressed and without
cache headers: embed.FS has zero mod times, so no Last-Modified, no
ETag, no 304s -- every page load re-downloaded ~700KB of css/js in
full, which gets painful over slow or tunneled links.

Gzip the static tree at generation time (go generate ./weed/admin)
and embed only the compressed mirror, shrinking the binary by ~1.5MB.
The handler hands the pre-compressed bytes to gzip-capable clients,
decompresses for the rest, and sets Cache-Control, per-variant
content-hash ETags and Vary so repeat loads revalidate with a 304.
bootstrap.min.css goes 232KB -> 30KB on the wire.

A drift test keeps static_gz/ in sync with static/.
2026-06-10 12:54:36 -07:00
Chris Lu 2ac5aa72c7 add elastic8 filer store for Elasticsearch 8 (#9916)
* elastic: fix listing against a missing or empty directory index

The refresh 404 leaked into the named return, so the first listing of a
directory whose index does not exist yet returned an error instead of an
empty result. Sorting also fails on an index with no documents
("No mapping found for [_id] in order to sort on"); unmapped_type
keeps the resumed-listing path working there.

* add elastic8 filer store for Elasticsearch 8

Elasticsearch 8 disables _id fielddata by default, so the elastic7
store's directory listings fail with "Fielddata access on the _id
field is disallowed". elastic8 uses the same client and configuration
options, but also indexes the document id as an Id field and sorts
listings on Id.keyword.
2026-06-10 12:10:49 -07:00
Chris Lu e12052ee6b fix(filer.sync): replicate a rename as an atomic move, not a no-op update (#9895)
* fix(filer.sync): replicate a rename as create-then-delete, not an in-place update

A rename arrives as a single metadata event carrying both the old and new
entry. The filer sink was routed to UpdateEntry, which looks up the old
path but issues the update against the new parent without changing the
name — and the filer UpdateEntry RPC cannot move an entry. So the rename
was dropped: the old path lingered and the new path never appeared
(same-dir renames rewrote the old name in place).

Route a real move (the sink path changed) through CreateEntry(new) then
DeleteEntry(old) in both the replicator and the filer.sync/backup driver,
the way the other sinks already handle it; reach UpdateEntry only for true
in-place updates. Create before delete so a crash between the two leaves
the entry visible rather than lost.

* fix(filer.sync): derive the rename delete key like the create key, guard the watched root

The rename delete leg rebuilt the old key with a raw util.Join, bypassing the
sink-side key normalization the create leg gets from buildKey — so a rename
could create the new entry and then fail to delete the old one under a
transformed key. Build the old key through buildKey too, and skip the delete
when the moved entry is the watched root itself (where the old key would
resolve to the target root and recursively delete the whole sink tree).

* test(filer.sync): cover the in-place update delete-then-create fallback order

The recording sinks always reported foundExisting, so the fallback that an
in-place update takes when the entry is missing on the sink was never run.
Make it configurable and assert the fallback deletes before it recreates the
same key, in both the replicator and the filer.sync drivers.

* feat(filer.sync): move filer-sink renames natively via AtomicRenameEntry

create-then-delete is unsafe for the filer sink: CreateEntry returns nil
without creating on a transient chunk-copy error, so the paired delete could
remove the only valid destination copy; a directory rename also deleted the
old subtree before descendants were recreated, and left old chunks behind.

Add an optional EntryMover sink capability and implement it on the filer sink
via AtomicRenameEntry — one atomic, metadata-only move that relocates a whole
subtree in a single transaction. Renames prefer it; sinks without a native
move keep create-then-delete. When the old path is already gone (a descendant
the parent rename moved, or one never replicated) MoveEntry creates the new
path instead, re-checking existence with a lookup so a rolled-back move that
left the old entry intact is retried rather than mistaken for gone.

* docs(filer.sync): note entryMissing's gRPC not-found string fallback is deliberate
2026-06-09 12:54:28 -07:00
Chris Lu 7b07d8177a fix(filer.sync): scope filesystem key sanitization to the local sink (#9894)
* fix(filer.sync): scope filesystem key sanitization to the local sink

destKey ran every sink key through escapeKey, whose Windows build strips
colons. Colons are illegal in NTFS filenames so the local sink needs that,
but s3/filer/azure/gcs/b2 accept them as ordinary key bytes — stripping
them silently diverged the destination key (a source a:b replicated as ab).

Move the sanitization into the local sink behind a Windows build tag,
applied at every entry point so the previously-unescaped in-place-update
paths stay consistent. Non-local sinks now keep the raw key; non-Windows
builds are unchanged; a leading drive-letter colon is preserved.

* test(filer.sync): cover incremental destKey and localsink update/delete sanitization

Lock the colon-preserving behavior for the incremental destKey branch, and
extend the Windows local-sink test to assert UpdateEntry and DeleteEntry also
sanitize the key, not just CreateEntry.
2026-06-09 10:18:49 -07:00
Chris Lu ed470dccb1 mini: grow volumes one at a time
Mini auto-sizes a few large volume slots, but the master pre-grows 7
volumes per new collection. Under a filer group each S3 bucket is its
own collection, so the first buckets claimed every slot and later
writes failed to assign a volume. Cap mini's volume_growth copy counts
to 1.
2026-06-08 14:51:40 -07:00
Jaehoon Kim 1b5f1c1f3b feat(filer.backup): -initialSnapshot re-seeds a reinitialized destination (#9828)
* feat(filer.backup): add -resetCheckpoint to force a fresh sync

filer.backup resumes from a per-sink offset persisted in the source filer's KV.
There was no first-class way to discard that checkpoint and re-run from the
beginning short of guessing a large -timeAgo, which also skips -initialSnapshot.

Add -resetCheckpoint: before reading the offset, write 0 for this sink so
getOffset returns 0, isFreshSync stays true, and -initialSnapshot re-runs a full
walk. Effective only when -timeAgo is 0.

The flag is cleared after the first successful reset: runFilerBackup retries
doFilerBackup forever on error, so leaving it set would re-zero the checkpoint
on every retry and never make forward progress after a transient failure. Later
retries resume from the persisted checkpoint instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(filer.backup): keep fresh-sync intent when offset read fails after reset

After -resetCheckpoint writes offset 0, a transient getOffset read-back error
flipped isFreshSync to false, which skipped the -initialSnapshot walk the reset
explicitly requested. Track that the reset happened this iteration and, on a
getOffset error, preserve isFreshSync=true in that case (the non-reset path
keeps treating a read error as "not fresh" to avoid re-walking on transients).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(filer.backup): skip offset read-back on reset instead of tracking a flag

Replace the didReset bool by branching: on -resetCheckpoint, clear the offset and
start fresh without reading it back (we just wrote 0, so the state is known);
otherwise read the offset as before. This drops the redundant getOffset RPC after
a reset and removes the read-back error case entirely, so no separate flag is
needed to preserve isFreshSync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* filer.backup: -initialSnapshot re-seeds on every start; drop -resetCheckpoint

-initialSnapshot now walks the live tree whenever -timeAgo is 0, seeds the
destination, and overwrites the saved checkpoint, rather than running only on a
fresh sync. That re-seeds a reinitialized destination on its own, so the
separate -resetCheckpoint flag is gone.

The walk runs once per process: the in-memory flag is cleared after the
watermark is persisted, so the retry loop resumes from the persisted checkpoint
instead of re-walking on every transient error. A process restart re-walks, so
remove the flag once the backup is caught up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-07 23:35:53 -07:00
Chris Lu 89cbb1c558 admin: default -dataDir to "." so maintenance task state persists across restarts (#9856)
admin: default -dataDir to "." so maintenance task state persists

Previously -dataDir defaulted to empty, so the admin ran maintenance in
memory only: task state was never saved and maintenance tasks (notably EC
balance/rebuild) were re-issued every scan cycle without converging,
churning EC shards (moves landed shards without their .ecx index, leaving
EC volumes unloadable/missing shards).

Default -dataDir to "." (the process working directory, which under the
standard systemd unit is the admin's data dir) so state persists out of
the box.
2026-06-07 20:45:03 -07:00
Chris Lu 755af4adf4 s3: actually bind outbound connections when -ip.bind is set (#9849)
* s3: set outbound bind IP before the first filer dial

Standalone weed s3 dialed the filer for GetFilerConfiguration before
SetOutboundLocalIP ran, so that gRPC conn was created with the stock
dialer and no source address. gRPC caches conns by address and reuses
the original dialer on reconnect, so the s3->filer connection kept
leaving from the OS-chosen source for the life of the process even
after the bind IP was set a moment later.

* grpc: install the outbound-bind dialer unconditionally

The dialer was installed only when OutboundLocalAddr was already set at
GrpcDial time, baking the source-address decision into the cached conn,
so a conn dialed before the bind IP was configured never bound.

Install the context dialer always and decide per dial: bind through
OutboundDialContext once a source is set, otherwise fall back to the
stock net.Dialer so default deployments keep gRPC's dial timeout and
keepalive behavior. The bind now applies on the next reconnect
regardless of ordering, matching the HTTP transport's unconditional
DialContext.
2026-06-07 10:20:58 -07:00
Chris Lu be7f417a03 ip.bind: bind outbound connections to the configured address (#9834)
* ip.bind: bind outbound connections to the configured address

-ip.bind only governed listeners; outbound gRPC and HTTP connections let
the OS pick the source IP, which may not even be able to reach the
target. Mirror the bind address into a process-global source address and
apply it to outbound TCP dials: the gRPC context dialer, the per-client
HTTP transports, and the default transport. Loopback targets and unix
sockets keep the OS-chosen source so same-host traffic still works.

* ip.bind: first-write-wins source IP, skip on address-family mismatch

Make SetOutboundLocalIP first-write-wins so a `weed server` component's own
bind setting (run in its goroutine) can't clobber the process-wide source
address the top-level -ip.bind already established for the other components.

Skip source binding when the target is a literal IP of a different family
than the bind address, since forcing a mismatched source fails the dial.
2026-06-05 12:44:21 -07:00
Chris Lu ab7be7867d security: hot-reload JWT signing keys on SIGHUP (#9826)
* security: reload JWT signing keys on SIGHUP

Signing keys were read once in the server constructors and never
refreshed. After a key rotation (Secret update, divergent reads) the
in-memory key stayed stale and every request kept failing "wrong jwt"
until the affected process was restarted.

Add Guard.UpdateSigningKeys and call it from the master, volume and
filer reload paths and the s3 reload hook, next to the existing
whitelist refresh. Make the global chunk-read JWT cache reloadable via
an atomic swap, and register the master's Reload with grace.OnReload --
it was never wired, so the master ignored SIGHUP entirely.

Mirror the same refresh in the Rust volume server's SIGHUP handler.

* security: swap signing keys behind an atomic pointer

Addresses review feedback on the in-place key swap: SigningKey is a
[]byte, so reassigning the Guard fields while a request handler reads
them is a data race that can tear the multi-word slice header and read
out of bounds.

Hold the four signing-key fields in an immutable signingConfig snapshot
behind atomic.Pointer; UpdateSigningKeys swaps the whole pointer, so a
reader sees either the old keys or the new ones. Reads go through new
SigningKey/ExpiresAfterSec/ReadSigningKey/ReadExpiresAfterSec accessors.
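
A simplified Go sketch of the snapshot-swap pattern, with two fields instead of four. The guard and signingSnapshot types and their method signatures are assumptions made for illustration; only the use of atomic.Pointer follows the description above.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// signingSnapshot is never mutated after it is published.
// Hypothetical, reduced to two fields.
type signingSnapshot struct {
	signingKey      []byte
	expiresAfterSec int
}

type guard struct {
	snap atomic.Pointer[signingSnapshot]
}

func newGuard(key []byte, expiresAfterSec int) *guard {
	g := &guard{}
	g.updateSigningKey(key, expiresAfterSec)
	return g
}

// updateSigningKey builds a fresh snapshot and swaps the whole pointer,
// so no reader ever observes a half-written slice header.
func (g *guard) updateSigningKey(key []byte, expiresAfterSec int) {
	g.snap.Store(&signingSnapshot{
		signingKey:      append([]byte(nil), key...), // private copy
		expiresAfterSec: expiresAfterSec,
	})
}

// current returns one consistent snapshot: all old values or all new.
func (g *guard) current() *signingSnapshot { return g.snap.Load() }

func main() {
	g := newGuard([]byte("old-key"), 10)
	before := g.current() // a request handler loads once and keeps this
	g.updateSigningKey([]byte("new-key"), 60)
	fmt.Println(string(before.signingKey), before.expiresAfterSec)
	fmt.Println(string(g.current().signingKey), g.current().expiresAfterSec)
}
```

A handler that loads the snapshot once keeps a matching key and expiry even if a reload lands while the request is in progress.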

The Rust guard is already safe: every read and the SIGHUP write go
through the shared RwLock<Guard>.

* security: fold whitelist + auth state into the atomic snapshot

Review follow-up. UpdateSigningKeys still wrote isWriteActive while the
request path read it (and the whitelist maps) unsynchronized, so a SIGHUP
under load could expose an inconsistent mix of activation bits and
whitelist contents.

Move all hot-reloadable Guard state -- keys, expirations, whitelist, and
the activation flags -- into a single immutable guardState swapped behind
one atomic.Pointer. The Update* methods take a small mutex to serialize
the read-modify-write; readers stay lock-free. The concurrency test now
also rotates the whitelist and probes IsWhiteListed under -race.

Also read each signing key once per branch in the volume/filer JWT auth
checks, so a reload landing mid-check can't take the allow-fast-path
after auth was enabled or verify against a different key than the branch
saw.
2026-06-04 22:26:08 -07:00
7y-9 6e8002f065 fix: handle meta backup offset errors safely (#9818)
* fix: log meta backup offset errors

* fix: exit on meta backup offset errors

Exit with a non-zero status when the initial metadata backup offset cannot be persisted.

Classify offset-read failures during streaming so the backup process exits instead of retrying forever, allowing supervisors to restart and bootstrap from a missing checkpoint.

* meta backup: read offset in the loop, drop offset error type

Reading the saved offset inside the retry loop makes an offset read
failure a clean exit and a stream error a retry, without a typed error
to tell them apart. streamMetadataBackup now takes the start time.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-04 10:53:23 -07:00
Fabian Hardt ce6a51468a sftpd: support SSH user certificates signed by a trusted CA (#9815)
* sftpd: support SSH user certificates signed by a trusted CA

Adds a new "certificate" auth method to weed sftp. When enabled, the server
loads trusted CA public keys from -trustedUserCAKeysFile (OpenSSH
authorized_keys format, one or more keys) and accepts only ssh.Certificate
blobs of type UserCert on the public-key channel. Validation uses
ssh.CertChecker: CA signature, ValidAfter/ValidBefore, non-empty
ValidPrincipals, and the SSH login user must appear in ValidPrincipals. The
authenticated user must exist in the user store; home dir and permissions
resolve as before.

Behaviour mirrors MinIO's --sftp=trusted-user-ca-key and OpenSSH's
TrustedUserCAKeys: when certificate auth is active, plain (non-cert) public
keys are rejected even if "publickey" is also listed. Default authMethods
remain "password,publickey", so existing deployments are unaffected.

* Update weed/sftpd/auth/certificate.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* sftpd: address review feedback on certificate auth

- Pre-marshal trusted CA public keys in IsUserAuthority instead of
  re-marshaling on every authentication attempt (gemini-code-assist).
- Differentiate user-not-found from underlying store errors via
  errors.As(*user.UserNotFoundError) so backend/read failures are no
  longer reported as bad credentials (coderabbitai).
- Fix the corresponding sanity check in the missing-file test to use
  errors.As instead of errors.Is (UserNotFoundError has no Is method,
  so the previous check never matched) (coderabbitai).

* sftpd: register trustedUserCAKeysFile flag in filer and server commands

The new field on SftpOptions is dereferenced unconditionally in
resolvePaths(), but only the standalone `weed sftp` command was wiring
its flag. `weed filer` and `weed server` both embed an SftpOptions value
and call resolvePaths() on it, so they hit a nil pointer dereference at
startup.

Register `-sftp.trustedUserCAKeysFile` in both commands and update the
-sftp.authMethods help text to mention the new "certificate" method.

Fixes the SFTP Integration Tests CI failure on this PR.

* helm: expose SFTP certificate auth in the SeaweedFS chart

Adds Helm-chart support for the new SSH user-certificate auth method:

- values.yaml (sftp:) gains `trustedUserCAKeys` (inline OpenSSH
  authorized_keys-format CA public keys) and `existingCAKeysSecret`
  (reference an externally managed Secret). Same pair added under
  allInOne.sftp with a null default that falls back to the top-level
  sftp.* setting.
- New template templates/sftp/sftp-ca-secret.yaml renders a
  chart-managed Secret <release>-sftp-ca-secret with `ca_user.pub`,
  but only when SFTP is enabled, "certificate" is in authMethods,
  inline keys are provided, and no existingCAKeysSecret is set.
- templates/sftp/sftp-deployment.yaml and the all-in-one deployment
  template add `-trustedUserCAKeysFile=/etc/sw/sftp_ca/ca_user.pub`
  to the weed sftp command, mount the CA secret at /etc/sw/sftp_ca
  and add the corresponding volume. All cert-auth bits are guarded
  by `contains "certificate" authMethods` so existing users see no
  change.
- authMethods help text updated to mention "certificate".

Verified end-to-end on a local k3d cluster: cert login succeeds,
plain-pubkey login is rejected with "public key without certificate
not allowed".

* helm: fail render when SFTP certificate auth lacks CA keys

When certificate is in authMethods but neither trustedUserCAKeys nor
existingCAKeysSecret is set, the deployment mounted a secret that the
chart never renders, leaving the pod stuck on a missing volume. Fail at
template time with a clear message instead.

* sftpd: fix stale auth-method list in SFTPServiceOptions comment

keyboard-interactive was never implemented; certificate is the new
supported method. Match the CLI help text.

* sftpd: test Manager wiring of certificate vs public-key channel

Cover the channel takeover at the Manager level: certificate auth
displaces plain public-key auth when both are enabled, public-key auth
stays put otherwise, and enabling certificate without a CA file errors.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-03 22:32:47 -07:00
Aleksey e3e02d3364 [CheckDisk]: implement disk health detection (#9560)
* [CheckDisk][GRPC]: implement MVP for disk health detection, add timeout for new grpc connections

* fix(volume): build disk health check on every platform

setDiskStatus only existed behind the statfs build tag, so disk.go failed
to compile on windows, openbsd, solaris, netbsd and plan9. Move the timeout
wrapper and failure tracking into the shared disk.go and have each platform's
fillInDiskStatus return an error, so every platform gets the same protection
from a stuck filesystem.

Also restore the uint64(fs.Bavail) cast: Bavail is int64 on freebsd, so the
unguarded multiply broke the freebsd build.

* fix(volume): keep one outstanding statfs probe per disk

A stuck statfs used to leave isChecking cleared by the timeout path, so the
next check spawned another goroutine while the previous one was still blocked
in the syscall, leaking one goroutine per minute on a hung disk. Clear the
flag only when statfs returns and treat an overlapping check as a failure, so
a hung filesystem keeps a single outstanding probe and still gets reported.

* fix(volume): assume disk available until the first health check

isDiskAvailable defaulted to false, and CollectHeartbeat skips locations that
are not available. A freshly started volume server would therefore omit every
volume from its first heartbeats until the async CheckDiskSpace ran, so the
master could briefly treat all of them as missing.

* fix(volume): label the disk error metric by data directory

The new gauge tagged the series with IdxDirectory while every neighbouring
resource gauge uses Directory, so the error series would not line up with them
in dashboards. Also log the underlying error instead of a generic message.

* test(volume): cover disk health success and repeated-failure paths

* fix(volume): make a healthy disk the zero-value default

Track the disk as isDiskUnavailable instead of isDiskAvailable so the safe
state is the zero value, matching isDiskSpaceLow. CollectHeartbeat only skips a
location once a check has actively marked it unavailable, so any DiskLocation
built without running CheckDiskSpace (tests, future call sites) still reports
its volumes instead of silently dropping them.

* feat(disk): detect degraded disks using IO latency probes

* feat(stats): introduce configurable disk I/O health probe with EWMA-based latency detection

* feat(disk): replace EWMA with sliding window algorithm for disk health detection and add user-friendly options

* feat(disk): improve disk health probing and recovery

* feat(volume): configure disk health checks via volume.toml

* fix(volume): Remove disk IO probe CLI options

---------

Co-authored-by: ptukha <ptukha@tochka.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-02 09:02:05 -07:00
Chris Lu 2386fa550a grpc: don't tear down the shared master connection on a caller's own timeout (#9775)
A Canceled/DeadlineExceeded from the caller's per-request context was
treated like a dead channel: it closed the shared cached ClientConn and
cancelled every other in-flight RPC on it with "the client connection is
closing". Under a burst of concurrent chunk assigns (e.g. a large S3
multipart upload) one slow assign hitting its 10s attempt timeout could
poison the connection for all the rest, cascading into a flood of 500s.

Thread the caller's context into shouldInvalidateConnection and only
invalidate on Canceled/DeadlineExceeded while that context is still live,
which isolates the genuine stale-channel signal (a peer restart behind a
k8s Service VIP). To carry the context, add a ctx parameter to the
existing WithGrpcClient, WithMasterClient, and WithMasterServerClient; the
master assign and volume-lookup paths pass their per-attempt context and
every other caller passes context.Background().
2026-06-01 15:11:02 -07:00
Chris Lu 80dd3b2621 EC bitrot follow-ups: protect destination sidecar on optional copy; cap sidecar block_size (#9763)
* fix(ec_bitrot): cap sidecar block_size in ValidateBitrotManifest

A sidecar loaded from disk (or supplied via a backfill/peer RPC) could carry a
huge power-of-two block_size that passed validation, then force a multi-GiB
scratch-buffer allocation in scrub/verify. Add a shared MaxBitrotBlockSize
(64 MiB) constant, enforce it as an upper bound in isPow2MultipleOf1MiB, and
derive the volume flag cap from the same constant so they cannot drift.
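
A short Go sketch of the bounded check. The names isPow2MultipleOf1MiB and MaxBitrotBlockSize come from the message above; the function body is an assumption about how such a check could be written.

```go
package main

import "fmt"

const (
	mib                int64 = 1 << 20
	maxBitrotBlockSize       = 64 * mib // 64 MiB cap from the commit message
)

// isPow2MultipleOf1MiB accepts only powers of two between 1 MiB and the
// cap. A power of two that is at least 1 MiB is a multiple of 1 MiB.
func isPow2MultipleOf1MiB(n int64) bool {
	if n < mib || n > maxBitrotBlockSize {
		return false
	}
	return n&(n-1) == 0
}

func main() {
	for _, sizeMiB := range []int64{0, 1, 3, 16, 64, 128, 4096} {
		fmt.Printf("%d MiB: %v\n", sizeMiB, isPow2MultipleOf1MiB(sizeMiB*mib))
	}
}
```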

* fix(ec_bitrot): don't destroy a valid destination sidecar on an optional copy

writeToFile opened the destination with O_TRUNC before knowing whether the
source had the file, so an optional copy (ignoreSourceFileNotFound) from a source
that lacks the .ecsum truncated and then removed a valid pre-existing destination
sidecar. Stage the optional copy into a temp sibling and commit it with an atomic
rename only when the source actually delivered the file; a missing source is now
a no-op. Mandatory copies keep their in-place behavior.
2026-05-31 23:42:33 -07:00
Chris Lu 9658f309d2 EC bitrot detection: per-shard checksum sidecars (#9761)
* ec: add EC bitrot checksum protobuf

EcBitrotProtection/EcShardChecksums/ChecksumAlgorithm sidecar messages,
copy_ecsum_file and unsafe_ignore_sidecar fields, and a CHECKSUM scrub mode.

* ec: bitrot checksum sidecar format, validation, and per-volume load

Per-shard CRC32C block checksums in an optional <base>.ecsum sidecar with a
self-integrity header; validation, rolling builder, backfill primitive, and
EcVolume load on mount + removal on destroy.

* ec: capture per-shard checksums at encode; verify-and-exclude on rebuild

WriteEcFilesWithContext returns the protection computed inline during encoding.
generateMissingEcFiles verifies present inputs against the sidecar, excludes
corrupt ones, regenerates in place, and re-verifies; fail-closed unless
unsafe_ignore_sidecar, removing all generated outputs on failure.

* ec: read-only checksum scrub with Reed-Solomon arbiter

ChecksumScrub verifies each local shard against the sidecar and reconstructs
flagged shards from the clean shards so stale-sidecar false positives are not
reported. Wired to the gRPC CHECKSUM mode and ec.scrub -mode checksum.

* ec: server-side bitrot sidecar write, copy, cleanup, and opportunistic backfill

Write .ecsum at fresh encode; propagate it with copy_ecsum_file (tolerant);
remove it on full delete and decode; rebuild honors unsafe_ignore_sidecar and
opportunistically backfills a sidecar when all shards are reachable.

* ec: volume server bitrot config flags

-ec.bitrotChecksum (default on) and -ec.bitrotBlockSizeMB (default 16).

* fix(ec_bitrot): bound -ec.bitrotBlockSizeMB before the int64 multiply

Validate the MiB value is in [1, 1024] before multiplying by 1 MiB, so a huge
flag value cannot overflow int64 and slip past the power-of-two check, and a
block size cannot collapse a sidecar to a few oversized blocks.

* fix(ec_bitrot): distribute the .ecsum sidecar from the worker encode path

The worker EC encode wrote the generation-0 sidecar locally but never added it
to shardFiles, so DistributeEcShards never shipped it and the distributed
holders came up unprotected. Append it to shardFiles and map the ecsum shard
type to its extension in the sender so it travels with the shards.

* fix(ec_bitrot): remove orphaned sidecars when the generation is gone

Gate sidecar removal on existingShardCount==0 alone rather than also requiring a
stray .ecx. A sidecar whose shards have all been deleted is orphaned and must be
removed even when no .ecx remains, or it leaks. .ecx/.ecj/.vif removal stays
gated on hasEcxFile as before.

* fix(ec_bitrot): do not fold checksum blocks scanned into TotalFiles

ChecksumScrub's first return is blocks scanned, not files. Discard it so the
scrub response's TotalFiles (a needle/file count) is not inflated by the block
count for CHECKSUM mode.

* test(ec_bitrot): clean up generated .ecsum sidecars in removeGeneratedFiles

* fix(ec_bitrot): reject an oversized sidecar payload before the uint32 cast

The header stores payload_len as a uint32; bound the payload before the
conversion so a pathological manifest cannot truncate the length field and
corrupt the sidecar. A real manifest is a few KB, so this never trips.

* fix(ec_bitrot): cap -ec.bitrotBlockSizeMB at 64 MiB

The block size becomes the per-shard scratch buffer the scrub/backfill path
allocates, so an over-large value (e.g. 1 GiB) is a memory hazard per concurrent
scrub worker. Lower the upper bound from 1024 to 64 MiB.

* fix(ec_bitrot): add -ecUnsafeIgnoreSidecar to weed tool fix -ecx

The -ecx recovery path reconstructs missing shards via RebuildEcFilesWithContext,
which fails closed on a malformed/stale .ecsum. Without an override flag an
operator could not complete the rebuild without manually deleting the sidecar.
Expose -ecUnsafeIgnoreSidecar (default false) and thread it through.

* fix(ec_bitrot): bound sidecar payload with a direct int constant; drop readFull

Guard len(payload) against a plain int constant (1 GiB) before the allocation
instead of a uint64 MaxUint32 compare, so the allocation-size value is provably
bounded (clears the CodeQL overflow alert) and the math import is no longer
needed. Inline os.File.ReadAt with io.EOF handling in verifyShardFileBlocks and
remove the now-redundant readFull helper (os.File.ReadAt fills the slice or
errors).

* test(ec_bitrot): use slices.Contains instead of a hand-rolled containsU32

* refactor(ec): fold the EcFiles WithContext variants into the base functions

RebuildEcFiles now takes the *ECContext directly (nil => derive from .vif as
before) and WriteEcFiles takes it too (nil => default), removing the parallel
RebuildEcFilesWithContext / WriteEcFilesWithContext names. Callers that had an
explicit context drop the WithContext suffix; the default-context callers pass
nil. No behavior change.

* refactor(ec): pass BackgroundECContext instead of nil to Write/RebuildEcFiles

Add a non-nil BackgroundECContext placeholder (analogous to context.Background())
and have callers with no specific layout pass it instead of a nil *ECContext.
WriteEcFiles resolves a zero/background context to the default ratio and
RebuildEcFiles resolves it from the .vif, so behavior is unchanged.

* fix(ec_bitrot): make BackgroundECContext a func; RebuildEcFiles fails closed on bad .vif

- BackgroundECContext is now a function returning a fresh *ECContext, so callers
  cannot mutate a shared singleton or race on it (and it mirrors context.Background,
  which is also a function).
- RebuildEcFiles now propagates the MaybeLoadVolumeInfo error: a present-but-
  unreadable .vif fails closed instead of silently rebuilding with the default
  ratio (which would corrupt a custom-ratio volume). Pass an explicit ctx to override.
2026-05-31 18:52:44 -07:00
Chris Lu 05c6500453 volume: fix maxVolumeCount dead zone that stalled writes on auto-sized disks (#9755)
* volume: don't drop the last writable slot on auto-sized disks

MaybeAdjustVolumeMax subtracted 1 from the per-disk slot count, so a disk
with room for exactly one volume (free between 1x and 2x the size limit)
reported 0 slots. The master then never grew a writable volume and every
assign drained its retry budget, so writes failed with context deadline
exceeded. Count the full volumes that actually fit, floored at one for an
auto-sized disk that has free space.
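
A simplified Go sketch of the slot arithmetic. Both functions are hypothetical reductions: the real MaybeAdjustVolumeMax takes more into account than free space and the size limit, and legacySlots is only one reading of "subtracted 1".

```go
package main

import "fmt"

// legacySlots models the old behavior: one slot was subtracted, so a
// disk that fits exactly one volume reported zero.
func legacySlots(freeBytes, volumeSizeLimit uint64) uint64 {
	if volumeSizeLimit == 0 {
		return 0
	}
	n := freeBytes / volumeSizeLimit
	if n == 0 {
		return 0
	}
	return n - 1
}

// autoSizedSlots counts the full volumes that fit, floored at one when
// the disk has any free space.
func autoSizedSlots(freeBytes, volumeSizeLimit uint64) uint64 {
	if volumeSizeLimit == 0 || freeBytes == 0 {
		return 0
	}
	n := freeBytes / volumeSizeLimit
	if n == 0 {
		return 1
	}
	return n
}

func main() {
	const gib = uint64(1) << 30
	limit := 30 * gib
	for _, free := range []uint64{10 * gib, 45 * gib, 90 * gib} {
		fmt.Printf("free=%d GiB: legacy=%d, fixed=%d\n",
			free/gib, legacySlots(free, limit), autoSizedSlots(free, limit))
	}
}
```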

* mini: show disk and volume capacity in the startup banner

Print free space, volume size, total volume count and free volume count
under the data directory line, so a volume size limit that outstrips the
disk is visible at startup instead of surfacing later as failed writes.
2026-05-30 23:45:17 -07:00
Chris Lu 5834c834e3 Refine enterprise edition feature blurb in version output and docs 2026-05-30 09:29:06 -07:00
Chris Lu c9623007a2 fix(filer.sync): keep sync_offset fresh through filtered-event markers (#9733)
On a read-only watched path the idle heartbeat keeps sync_offset fresh,
but a busy source filer still emits a MaxUnsyncedEvents marker after many
filtered events. The marker has a non-nil but empty EventNotification, so
the client routed it to the event path, where it advanced no real
watermark yet drove offsetFunc to republish the stale processed
watermark — regressing the gauge between heartbeats and spiking the
derived lag every time a filtered-event burst landed.

Route the empty marker through OnIdleHeartbeat like the idle heartbeat so
its fresh timestamp keeps the gauge current; it still advances the
in-stream resume cursor.
2026-05-28 23:29:59 -07:00
Chris Lu 2f0643e5b1 fix(volume): stop flipping volumes read-only on a non-append-ordered .idx (#9726)
* fix(volume): verify the .dat-tail needle in the integrity check

CheckVolumeDataIntegrity checked the last entry by file position in the .idx
and, for a live needle, flipped the volume read-only when fileSize > fileTailOffset.
That entry is the .dat tail only when the .idx is in append order; a key-sorted
.idx (weed fix and other rebuilds listed entries by key) puts the highest-key
needle last, whose tail sits mid-file, so healthy volumes went read-only on every
load and re-running weed fix only reproduced the sorted index.

Locate the needle at the maximum offset — the one physically last in the .dat —
and verify the .dat ends exactly at it, regardless of .idx ordering. The
append-ordered common case stays O(1) (the last entry's on-disk end matches the
.dat size); only a key-sorted index pays a single linear scan. Deletion
tombstones at the tail are now verified too, instead of skipping the file-size
check.
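
A simplified sketch of the tail lookup, assuming each index entry carries an offset and an on-disk length. `idxEntry` and `tailEntry` are illustrative names, and the real needle layout (padding, tombstones) is not modeled.

```go
package main

import "fmt"

// idxEntry is a simplified index record.
type idxEntry struct {
	Offset int64 // byte offset of the needle in the .dat
	Size   int64 // on-disk length of the needle record
}

// tailEntry returns the entry physically last in the .dat and whether the
// .dat ends exactly at it.
func tailEntry(entries []idxEntry, datSize int64) (idxEntry, bool) {
	if len(entries) == 0 {
		return idxEntry{}, false
	}
	last := entries[len(entries)-1]
	if last.Offset+last.Size == datSize {
		return last, true // append-ordered fast path, O(1)
	}
	// Key-sorted index: one linear scan for the maximum offset.
	tail := entries[0]
	for _, e := range entries[1:] {
		if e.Offset > tail.Offset {
			tail = e
		}
	}
	return tail, tail.Offset+tail.Size == datSize
}

func main() {
	keySorted := []idxEntry{{8, 16}, {40, 24}, {24, 16}}
	fmt.Println(tailEntry(keySorted, 64))
}
```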

* fix(command): weed fix rebuilds the .idx in .dat offset order

SaveToIdx wrote entries via AscendingVisit — sorted by key, the .sdx/.ecx shape
— so the rebuilt .idx put the highest-key needle last instead of the .dat-tail
needle, and dropped tombstones whose live needle was gone. Collect the live and
deleted entries, sort by .dat offset, and write them in append order so the .idx
stays a faithful log whose last entry is the real .dat tail.
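
The reordering step amounts to a sort by offset before writing. The `entry` type and `appendOrder` function below are hypothetical and stand in for the real rebuild code.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a simplified index record; Deleted marks a tombstone.
type entry struct {
	Key     uint64
	Offset  int64
	Deleted bool
}

// appendOrder merges live and deleted entries and sorts them by .dat offset,
// so the last element is the needle physically last in the .dat.
func appendOrder(live, deleted []entry) []entry {
	all := make([]entry, 0, len(live)+len(deleted))
	all = append(all, live...)
	all = append(all, deleted...)
	sort.Slice(all, func(i, j int) bool { return all[i].Offset < all[j].Offset })
	return all
}

func main() {
	live := []entry{{Key: 1, Offset: 8}, {Key: 9, Offset: 24}}
	deleted := []entry{{Key: 3, Offset: 40, Deleted: true}}
	fmt.Println(appendOrder(live, deleted))
}
```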
2026-05-28 18:04:31 -07:00
Chris Lu dfd05d14cb refactor(filer): remove the inode->path index and the NFS gateway (#9724)
* fix(filer): derive inodes by hash instead of a snowflake sequencer

Compute the same inode the FUSE mount would: non-hard-linked entries hash path + crtime, hard links hash their shared HardLinkId so every link resolves to one inode. Removes the snowflake inodeSequencer and the SEAWEEDFS_FILER_SNOWFLAKE_ID knob; inodes are now deterministic across filers.
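
A sketch of the idea, using FNV-1a purely as a stand-in: the commit does not say which hash the filer and the FUSE mount share, and `inodeFor` is a hypothetical name.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// inodeFor derives a deterministic inode. Hard links hash their shared id so
// every link resolves to one inode; other entries hash path + crtime.
// FNV-1a is an assumption made for this sketch only.
func inodeFor(path string, crtime int64, hardLinkID []byte) uint64 {
	h := fnv.New64a()
	if len(hardLinkID) > 0 {
		h.Write(hardLinkID)
		return h.Sum64()
	}
	h.Write([]byte(path))
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], uint64(crtime))
	h.Write(buf[:])
	return h.Sum64()
}

func main() {
	link := []byte{0xAB, 0xCD}
	fmt.Println(inodeFor("/a/x", 1, link) == inodeFor("/b/y", 2, link))
	fmt.Println(inodeFor("/a/x", 1, nil) == inodeFor("/a/x", 1, nil))
}
```

Because the value depends only on the entry's own attributes, any filer computes the same inode without coordination.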

* chore: remove the experimental NFS gateway

The NFS frontend ('weed nfs') was the only consumer of the inode->path index. Remove the weed/server/nfs package, the command and its registration, the integration test harness, and the CI workflow; go mod tidy drops the willscott/go-nfs and go-nfs-client dependencies.

* refactor(filer): drop the inode->path index

With the NFS gateway gone, nothing reads it. A regular file's inode is a pure hash of its path and a hard link's is a hash of its shared HardLinkId -- both derivable on demand -- so the secondary KV index and its write/remove hooks are dead. Removes filer_inode_index.go and the recordInodeIndex hooks from the store wrapper.
2026-05-28 15:00:18 -07:00
Chris Lu 3481f13f54 mount: route POSIX advisory locks to the owner filer under -dlm (#9669)
With -dlm, GetLk/SetLk/SetLkw and the flush/release cleanup paths go to
the inode's owner filer via the PosixLock RPC instead of the local table,
so flock/fcntl are honored across mounts. Advisory locking rides the same
switch as whole-file write coordination — and is therefore off under
writeback cache, which implies single-writer. Keys are the inode identity
(HardLinkId else path); SetLkw is client-side polling with the FUSE cancel
channel (no server wait queue); a per-mount session id namespaces owners;
a local hint avoids a release RPC on every close. Background unlock/release
RPCs are bounded so a stuck filer can't hang close().
2026-05-24 23:56:37 -07:00
Chris Lu 25beb7ec48 admin: expose Prometheus metrics (#9652)
* admin: add -metricsPort flag to expose Prometheus metrics

The admin command had no metrics endpoint, so passing -metricsPort
(as the operator does for spec.admin.metricsPort) crashed the process
with "flag provided but not defined". Wire up -metricsPort/-metricsIp
and start the shared Prometheus metrics server, matching filer, master,
and volume.

* admin: emit maintenance task and worker fleet metrics

Add Prometheus metrics for the admin server's distinctive work: the
maintenance task queue and the worker fleet that executes it.

Task lifecycle: maintenance_tasks_by_status / _by_type gauges (snapshot
of the queue), maintenance_tasks_completed_total{type,outcome} counter
and maintenance_task_duration_seconds{type} histogram (recorded when a
task reaches a terminal state), and last/next scan timestamp gauges.

Worker fleet: workers_connected and worker_slots{used,max} gauges, plus
worker_events_total{event} counting register/unregister/stale removals.

Gauges are snapshotted by a background goroutine on the admin server;
counters and the histogram are recorded at their event sites.

* admin: read worker slot totals under lock, clear next-scan gauge when idle

GetWorkers returns live worker pointers; summing CurrentLoad/MaxConcurrent
outside the queue lock races with task assignment and completion. Add
GetWorkerSlotTotals to aggregate under the lock.
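
A minimal sketch of aggregating under the lock. The `queue` and `worker` types are simplified stand-ins; only the field names CurrentLoad and MaxConcurrent and the method name come from the commit text.

```go
package main

import (
	"fmt"
	"sync"
)

type worker struct {
	CurrentLoad   int
	MaxConcurrent int
}

type queue struct {
	mu      sync.RWMutex
	workers map[string]*worker
}

// GetWorkerSlotTotals sums slot usage while holding the queue lock, so the
// totals cannot race with task assignment or completion.
func (q *queue) GetWorkerSlotTotals() (used, capacity int) {
	q.mu.RLock()
	defer q.mu.RUnlock()
	for _, w := range q.workers {
		used += w.CurrentLoad
		capacity += w.MaxConcurrent
	}
	return used, capacity
}

func main() {
	q := &queue{workers: map[string]*worker{
		"w1": {CurrentLoad: 1, MaxConcurrent: 4},
		"w2": {CurrentLoad: 3, MaxConcurrent: 4},
	}}
	fmt.Println(q.GetWorkerSlotTotals())
}
```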

Also reset maintenance_next_scan_timestamp_seconds to 0 when the scanner
is not running, so it can't retain a stale value after a stop.
2026-05-24 14:09:02 -07:00
Chris Lu 303c2be38d feat(fix): rebuild lost EC index (.ecx) and .vif from local shards (#9596)
weed fix -ecx reconstructs the .dat from the local data shards, scans the
needles, and writes a fresh ascending-sorted .ecx containing only live
entries — the same on-disk index WriteSortedFileFromIdx emits at encode
time. When the .vif is also missing it is regenerated from the inferred
EC ratio (flags > .vif > shard-count inference / 10+4) and the .dat size
recovered from the scan.

When some data shards are missing but at least dataShards shards survive,
the missing shards are first reconstructed from the survivors via
Reed-Solomon, so a partial shard set is repaired too.

Also makes erasure_coding.WriteDatFile de-stripe using len(shardFileNames)
instead of the DataShardsCount constant, so the caller's actual data-shard
count is honored (behavior-preserving for the default 10, and fixing the
existing caller that already passes ECContext.DataShards).

This recovers an EC volume whose sealed index was lost from every node
while enough shards survive, a state neither ec.rebuild nor ec.decode can
repair because both require an existing .ecx.

Flags: -ecx, -ecDataShards, -ecParityShards. Run with the volume server
stopped.
2026-05-21 00:41:27 -07:00
Chris Lu 5af7d12f04 fix(filer.sync): keep sync_offset fresh while the source is read-only (#9589)
* fix(filer.sync): keep sync_offset fresh while the source is read-only

sync_offset holds the timestamp of the last replicated source event, so
monitoring derives lag from now-sync_offset. A read-only source emits no
metadata events, so the gauge froze at the last write and the derived lag
grew without bound, making thresholds unusable.

The source filer now sends an idle heartbeat carrying its current time
while a subscriber is caught up to the buffer head. filer.sync uses it to
advance the gauge, so now-sync_offset reflects real lag. Heartbeats are
opt-in (client_supports_idle_heartbeat), are never written to the metadata
log, and do not move the resume checkpoint, so a restart still resumes
from the last real event.

* fix(filer.sync): gate idle heartbeat on the read cursor, not SinceNs

In metadata-chunks mode persisted entries replay as log file refs and
never reach eachLogEntryFn, so lastSeenTsNs stays put and a caught-up
subscriber with an old SinceNs would never get a heartbeat. Use the
read cursor (lastReadTime), which advances in that mode too, max'd with
lastSeenTsNs so the in-memory backlog-then-idle case still works while
the cursor returned to the caller has not yet updated.
2026-05-20 11:26:37 -07:00
Lars Lehtonen 9914e6af30 chore(weed/command): prune unused functions (#9573)
* chore(weed/command): prune unused functions

* drop now-unused closed field and renderLocked guard

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-19 17:45:50 -07:00
Chris Lu 3d872a1416 fix(filer): load -s3.config static identities into the filer's CredentialManager (#9537)
When weed filer started its embedded S3 gateway with -s3 -s3.config, only
the S3 server loaded the s3.json static identities — the filer's own
CredentialManager stayed empty, so the IAM gRPC service backing the admin
UI and weed shell returned only dynamic users. Mirror the wiring weed
server already does and hand the same config path to the filer.
2026-05-18 13:41:30 -07:00
Chris Lu 6b94701213 mini: quieter startup with a docker-compose-style progress board (#9524)
* mini: quieter startup with a docker-compose-style progress board

Replaces noisy startup/shutdown logs with a single in-place progress
table on a TTY (or one line per state change off-TTY). Each component
renders as `pending -> starting -> ready` during startup and
`stopping -> stopped` during shutdown, with elapsed time on transition.

Also folds in a few cleanups uncovered while making this readable:

- route the admin.go startup prints through glog so quietMiniLogs()
  filters them under mini but standalone weed admin still shows them
- generate a dev SSE-S3 KEK + passphrase on first run via WEED_S3_SSE_KEK
  and WEED_S3_SSE_KEK_PASSPHRASE env vars (viper.Set has a nested-key
  conflict between s3.sse.kek and s3.sse.kek.passphrase); persisted under
  the data folder so restarts reuse the same key
- demote worker/master gRPC Recv 'context canceled' to V(1); those are
  the normal shutdown signal, not Errors/Warnings
- drop the 'Optimized Settings' block and the 'credentials loaded from
  environment variables' message from the welcome banner
- only show the credentials setup hints when no S3 identities exist
  (new s3api.HasAnyIdentity accessor backed by an atomic.Bool)
- use S3_BUCKET in the credentials hint so it pairs with
  AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
- reorder running-services list to master / volume / filer / webdav /
  s3 / iceberg / admin

* mini: refuse in-memory-only SSE-S3 dev keys; surface admin serve errors

loadOrCreateMiniHexSecret returns "" when os.WriteFile fails, so SSE-S3
won't encrypt data under a KEK that the next restart can't reproduce
(which would orphan whatever was written this run). The caller already
treats "" as "skip setting WEED_S3_SSE_* env vars", so SSE-S3 and IAM
just stay disabled for this run.

startAdminServer's serve goroutine used to only log ListenAndServe
failures, so a bind error left the caller blocked on ctx.Done() with
no listener. Forward the error through a buffered channel and select
on it alongside ctx.Done().

* ci(s3-proxy-signature): match weed mini's new progress-board ready line

The readiness probe grepped for "S3 (gateway|service).*(started|ready)",
which matched weed mini's old "S3 service is ready at ..." line. Mini
now emits "  S3           ready (Xs)" from its progress board, so the
old pattern misses and the test timed out at the 30-second wait.

Widen the alternation to also accept "S3\s+ready". The curl HEAD
fallback already covers any remaining cases.
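
The widened pattern can be checked in isolation. The CI probe itself uses grep; Go's regexp package is used here only as a convenient stand-in, and the sample lines are modeled on the ones quoted above.

```go
package main

import (
	"fmt"
	"regexp"
)

// s3Ready accepts both the old log line and the new progress-board line.
var s3Ready = regexp.MustCompile(`S3 (gateway|service).*(started|ready)|S3\s+ready`)

func main() {
	for _, line := range []string{
		"S3 service is ready at 127.0.0.1:8333", // old style (address is illustrative)
		"  S3           ready (1.2s)",           // new progress-board style
		"  S3           starting",
	} {
		fmt.Printf("%q matches: %v\n", line, s3Ready.MatchString(line))
	}
}
```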
2026-05-17 19:13:09 -07:00
Chris Lu 62821964dd filer/iam-grpc: make admin Bearer auth opt-in (fixes #9509) (#9514)
PR #9442 made the filer refuse to register the IAM gRPC service unless
jwt.filer_signing.key was set in security.toml, which broke the admin
UI Users/Groups/Policies pages for every deployment that ships without
a security.toml — weed mini, plain Helm, vanilla weed filer. The Users
tab returns Unimplemented and the page is unusable. Issues #9504,
#9505 and #9509 all trace to this gap.

The rest of the filer's gRPC surface is unauthenticated by default;
treat IAM the same way. The service now always registers, and the
auth gate is a no-op when no signing key is configured. When the key
is set, every RPC still requires an admin-signed Bearer token, matching
the post-#9442 behaviour. Operators who expose the filer gRPC port
beyond a trusted network should set the key on both filer and admin.

The admin client (IamGrpcStore.withIamClient) already skips attaching
the authorization metadata when its key is empty, so no changes there.
2026-05-15 13:15:20 -07:00
Chris Lu bfb2661fec fix(tests): make 32-bit GOARCH tests build and run (#9507)
fix(tests): make 32-bit GOARCH tests build and run (#9503)

verifyTestFilerClient had bare int64 atomic counters after a map header,
so atomic.AddInt64 panicked with "unaligned 64-bit atomic operation" on
linux/386. Switch to atomic.Int64, which the stdlib guarantees is
8-byte aligned on all platforms.
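
A small sketch of the fix. The struct is illustrative, not the real verifyTestFilerClient; it shows a counter placed after a map field, which is safe with atomic.Int64 on 32-bit targets as well.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type counters struct {
	seen  map[string]bool // pointer-sized on 32-bit; can misalign a bare int64 that follows
	calls atomic.Int64    // aligned for 64-bit atomic access on every platform
}

// countConcurrently increments the counter from n goroutines.
func countConcurrently(n int) int64 {
	c := &counters{seen: map[string]bool{}}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.calls.Add(1)
		}()
	}
	wg.Wait()
	return c.calls.Load()
}

func main() {
	fmt.Println(countConcurrently(100))
}
```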

rpc_version_filter_test.go passed the untyped constant 0xdeadbeef to
t.Errorf, where it default-promoted to int and overflowed 32-bit int.
Bind it to a typed uint32 const used in both the comparison and the
error message.
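
The constant fix can be sketched as below; `magic` and `describe` are hypothetical names. Passing the untyped literal 0xdeadbeef directly to a variadic `...any` parameter would convert it to int, which does not fit in 32 bits.

```go
package main

import "fmt"

// magic is typed, so it stays a uint32 when passed to fmt or t.Errorf.
const magic uint32 = 0xdeadbeef

// describe compares a value against magic and formats a mismatch message.
func describe(got uint32) string {
	if got != magic {
		return fmt.Sprintf("got %#x, want %#x", got, magic)
	}
	return "ok"
}

func main() {
	fmt.Println(describe(magic))
	fmt.Println(describe(1))
}
```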
2026-05-14 20:55:37 -07:00
Chris Lu f51468cf73 Revert #9443 — heartbeat peer binding breaks hostname-based clusters (#9474)
Revert "master: bind heartbeat claims to the connecting peer (#9443)"

This reverts commit f28c7ce6df.

The strict heartbeat-ip-vs-peer match in authorizeHeartbeatPeer rejects
every hostname-based deployment. In docker-compose / k8s the volume
server is started with -ip=<service-name> and the gRPC peer surfaces
as the container/pod IP, so the two never match and every heartbeat
fails with `heartbeat ip "volume" does not match peer "172.18.0.3"`.
The master therefore never learns about any volume, growth fails, and
fio writes against the mount return EIO.

After the #9440 revert merged (43a8c4fdc), the e2e workflow is still
failing for this reason; see
https://github.com/seaweedfs/seaweedfs/actions/runs/25767265775 .

Reverting to unblock e2e. A narrower re-do should accept the heartbeat
when heartbeat.Ip resolves (DNS) to the peer address, so the spoof
hardening can return without breaking hostname-based clusters.
2026-05-12 18:22:21 -07:00
Chris Lu 43a8c4fdca Revert #9440 — volume admin fail-closed gate breaks multi-host clusters (#9472)
* Revert "volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)"

This reverts commit 21054b6c18.

The fail-closed gate broke any multi-host cluster: in compose / k8s /
remote-host deployments the master's IP isn't loopback, so every
master->volume admin RPC (AllocateVolume, BatchDelete, EC reroute,
vacuum, scrub, ...) is rejected with PermissionDenied unless the
operator manually configures -whiteList. The e2e workflow has been
failing since 10cc06333 with `not authorized: 172.18.0.2` on
AllocateVolume; downstream symptom is fio fsync EIO because zero
volumes can be grown.

The gate's intent was to lock down destructive admin tooling, but the
same RPCs are the master's normal mechanism for growing and managing
volumes. Reverting to restore cluster-internal operation; a narrower
re-do should distinguish operator/admin callers from the master peer
(e.g. trust IPs resolved from -master) before going back in.

* security: skip invalid CIDR in UpdateWhiteList so IsWhiteListed can't panic

The revert in the previous commit also rolled back an unrelated bug fix
that lived inside #9440: UpdateWhiteList logged on net.ParseCIDR error
but did not continue, so the nil *net.IPNet was stored in whiteListCIDR
and IsWhiteListed would panic dereferencing cidrnet.Contains(remote) on
the next gRPC admin check.

Restore the continue. Orthogonal to the fail-closed semantics this PR
is reverting.
2026-05-12 16:00:44 -07:00
Chris Lu f28c7ce6df master: bind heartbeat claims to the connecting peer (#9443)
SendHeartbeat used to accept whatever Ip/Port/Volumes the caller put on
the wire. Three changes tighten that:

- Reject heartbeats whose Ip does not match the gRPC peer's source
  address. Loopback peers are still trusted; operators behind a proxy
  can opt out with -master.allowUntrustedHeartbeat.
- Track which (ip, port) first claimed a volume id or an ec shard slot
  and drop foreign re-claims. Non-EC volume claims are bounded by the
  replica copy count so legitimate replicas still register. EC
  ownership is keyed by (vid, shard_id) so the same vid can legitimately
  be split across many peers as long as their EcIndexBits are disjoint;
  rejected bits are cleared from the bitmap and the parallel ShardSizes
  array is compacted in lock-step.
- Maintain reverse indexes owner -> volumes and owner -> ec shard slots
  so disconnect cleanup is O(M) in what that peer held rather than O(N)
  over the whole map.

Bindings are also released when a heartbeat reports that the peer no
longer holds an id, either via explicit Deleted{Volumes,EcShards}
entries or by omitting it from a full snapshot. Without this, a planned
rebalance that moved a vid or an ec shard from peer A to peer B would
leave B's heartbeats permanently filtered out until A disconnected,
breaking ec encode/decode flows that delete shards on the source as
soon as the move completes.

The (vid -> owners) binding still does not track which replica slot
each peer occupies, so the first N claims under the copy count win;
strict per-slot mapping is a follow-up.
2026-05-12 15:38:52 -07:00
Chris Lu 21054b6c18 volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)
Add Guard.IsAdminAuthorized, a fail-closed variant of IsWhiteListed, and use
it to gate destructive volume admin RPCs. IsWhiteListed keeps its
allow-all-when-empty semantics for HTTP compatibility.

For TCP peers with an empty whitelist, off-host callers are rejected but
loopback (127.0.0.0/8, ::1) is still trusted. A volume server commonly
cohabits with the master/filer on a single host and in integration-test
clusters; the loopback exception keeps cluster-internal admin traffic
working without -whiteList while still locking out off-host attackers.

Non-TCP peers (in-process / bufconn / unix-socket) bypass the host check
entirely. When `weed server` runs master+volume+filer in a single process
the master dials the volume server in-process and the peer address surfaces
as "@", which has no parseable IP. Such a caller shares our OS process and
cannot be spoofed by a remote attacker, so we treat it as trusted by
construction.

The gate also tolerates a nil guard (developmental / embedded path) and only
enforces once a guard is wired up. UpdateWhiteList skips entries whose CIDR
fails to parse so the IP-iteration path can no longer hit a nil *net.IPNet.
2026-05-12 12:35:27 -07:00
Chris Lu 69da20bdae volume: gate FetchAndWriteNeedle behind admin auth and refuse internal endpoints (#9441)
volume: require admin auth and refuse loopback endpoints in FetchAndWriteNeedle

Gate the RPC behind checkGrpcAdminAuth for parity with the rest of the
destructive volume-server RPCs, and reject cluster-internal remote S3
endpoints (loopback / link-local / IMDS / RFC 1918 / CGNAT) before
dialing. Pin the validated address against DNS rebinding by routing the
AWS SDK through an HTTP transport whose DialContext re-resolves the host
and re-applies the deny list on every dial, so an endpoint that resolves
to a public IP at validate-time and then flips to 127.0.0.1 at connect
time is refused. Operators that legitimately fetch from private hosts
can opt out with -volume.allowUntrustedRemoteEndpoints.
2026-05-12 10:11:20 -07:00
Chris Lu 5e8f99f40a filer: require admin-signed JWT on the IAM gRPC service (#9442)
Every IAM RPC (CreateUser, PutPolicy, CreateAccessKey, ...) now requires
a Bearer token in the authorization metadata, signed with the filer
write-signing key. The service refuses to register on a filer that has
no jwt.filer_signing.key set, so the unauthenticated default is gone:
operators who use these RPCs must configure the key and attach a token
on every call.

Bearer scheme matching is case-insensitive (RFC 6750), every handler
nil-checks req before dereferencing it, and tests now cover the
expired-token path.
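
Case-insensitive scheme matching can be sketched as follows. `bearerToken` is a hypothetical helper; signature verification and expiry checks are out of scope for the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// bearerToken extracts the token from an authorization value, accepting the
// scheme name in any letter case.
func bearerToken(header string) (string, bool) {
	scheme, token, ok := strings.Cut(header, " ")
	if !ok || !strings.EqualFold(scheme, "Bearer") {
		return "", false
	}
	token = strings.TrimSpace(token)
	return token, token != ""
}

func main() {
	fmt.Println(bearerToken("bearer abc.def.ghi"))
	fmt.Println(bearerToken("Basic dXNlcjpwYXNz"))
}
```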
2026-05-12 10:11:08 -07:00
Chris Lu 05d31a04b6 fix(s3tests): wire lifecycle worker for expiration suite (#9374)
* fix(s3tests): wire lifecycle worker for expiration suite

The upstream s3-tests `test_lifecycle_expiration` / `test_lifecyclev2_expiration`
exercise the "set rule, wait, verify deletion" path. Phase 4 (#9367) intentionally
stripped the PUT-time back-stamp, so pre-existing objects no longer pick up TtlSec
on a freshly-applied rule. The bare-bones `weed -s3` that the s3tests CI runs
had nothing left driving expiration.

Three changes that work together:

- Engine scales `Days` by `util.LifeCycleInterval`. Production keeps the 24h day;
  the `s3tests` build tag shrinks it to 10s so a `Days: 1` rule completes inside
  the suite's 30s polling window. Exported `DaysToDuration` so sibling-package
  tests pin to the same scale.
- Scheduler/dispatcher tick defaults split into `_default` / `_s3tests` files.
  Production stays 5s/30s/5m; the test build runs at 500ms/2s/2s so deletions
  land within a couple ticks of becoming due.
- s3tests.yml spawns `weed shell s3.lifecycle.run-shard -shards 0-15 -events 0
  -runtime 1800s` alongside the s3 server in both the basic and SQL blocks; the
  shell command runs the full pipeline (reader + scheduler + dispatcher) for the
  duration of the suite. `test_lifecycle_expiration_versioning_enabled` is left
  out for now — versioned-bucket expiration via the worker still needs its own
  pass.

Drive-by: bump `TestWorkerDefaultJobTypes` to 7 to match the registered
handler count (8b87ceb0d updated `mini_plugin_test.go` for the s3_lifecycle
plugin but missed this twin test).

Two retention-gate engine tests `t.Skip` under the s3tests build because they
rely on absolute lookback-vs-retention math the day-rescale collapses; the prod
build still covers them.

* review: harden lifecycle worker spawn + assert handler identity

- Workflow: aliveness check on the backgrounded `weed shell` (a bad command
  exits in <1s and the suite would otherwise just opaque-timeout); move
  worker/server teardown into a `trap cleanup EXIT` so failure paths still
  print the worker log and reap the data dir.
- worker_test: check the actual job-type set by name, not just the count.

* fix(shell): keep s3.lifecycle.run-shard alive when no rules exist yet

The s3-tests CI runs the worker BEFORE any test creates a bucket, so
LoadCompileInputs returns empty and the shell command was bailing out
with "no buckets with enabled lifecycle rules found" within ~1s. The
aliveness check then fired exit 1 before tox ever started.

Two changes:

- Don't early-exit on empty inputs. Compile against the empty set, log a
  one-liner, and let the pipeline run normally — the meta-log subscription
  is already up, so events for buckets created later DO arrive; they just
  need the engine to know about them when they do.
- Add `-refresh <duration>` (default 5m, 2s in s3tests CI) that
  periodically re-runs LoadCompileInputs + engine.Compile so rules added
  after startup land in the snapshot the dispatcher reads on its next
  tick. Production deployments keep the 5m default; only the CI workflow
  drops to 2s.

Workflow passes `-refresh 2s` in both basic and SQL blocks.

* fix(shell): backfill pre-rule entries via bootstrap walker

The reader-driven path only sees meta-log events created AFTER its
engine snapshot knows the rule. The s3-tests CI scenario PUTs objects
first, then PUTs the lifecycle config, so by the time the engine
refresh picks up the new bucket the object events have already been
seen-and-dropped (BucketActionKeys returned empty for the bucket).

Wire bootstrap.Walk into the shell command:

- bucketBootstrapper tracks buckets seen so far. kickOffNew spawns one
  loop goroutine per fresh bucket.
- Each goroutine re-walks the bucket every walkInterval (defaults to
  the same value as -refresh, i.e. 2s in s3tests CI, 5m in prod) and
  feeds each entry through bootstrap.Walk; due actions dispatch via a
  direct LifecycleDelete RPC. Not-yet-due entries are silently skipped
  and picked up on a later iteration once they age past their (rescaled
  or real) threshold.
- LifecycleDelete is called with no expected_identity; the server-side
  identityMatches treats nil as "skip CAS", which is the right call
  for bootstrap (the bootstrap entry doesn't carry chunk fid /
  extended hash anyway).

The dispatcher's pkg-private toProtoActionKind is duplicated in the
shell file rather than exported, since the shape is six lines and the
reverse import would pull a proto dep into the s3lifecycle root.

* refactor(s3/lifecycle): hoist bucket bootstrapper into scheduler pkg

The shell command got the backfill in the previous commit but the worker
plugin (weed/worker/tasks/s3_lifecycle/handler.go) drives Scheduler.Run
directly and missed it — same root cause: the reader-driven path only
sees events created after the rule lands, so a daily cron picking up a
freshly-PUT rule wouldn't expire any pre-rule object.

Move the looping bucket walker into scheduler.BucketBootstrapper:

- Scheduler.Run now constructs one and calls KickOffNew on every engine
  refresh. Per-bucket goroutines re-walk every BootstrapWalkInterval
  (defaults to RefreshInterval — 5m in prod, 2s under s3tests).
- The shell command consumes the same struct instead of its own copy
  so the two paths can't drift in semantics.

* refactor(s3/lifecycle): walk-once + schedule via event injection

Previous per-bucket walker re-listed every WalkInterval forever. For a
bucket with N objects under a long rule, the worker did O(N * runtime /
walkInterval) listings even when nothing was newly due — way too much
for production-scale buckets.

New approach: walk each bucket exactly once on first sight, synthesize
one *reader.Event per existing entry, push it onto Pipeline.events.
Router.Route builds a Match with DueTime=mtime+delay; future-due matches
sit in the per-shard Schedule and fire when their DueTime arrives.
Currently-due matches fire on the very next dispatch tick.

Wiring:

- dispatcher.Pipeline lifts its events channel into a struct field
  with sync.Once init, and exposes InjectEvent(ctx, ev). Reader no
  longer closes the channel — the dispatch goroutine exits on runCtx
  cancellation, which works the same as channel-close did.
- scheduler.BucketBootstrapper drops the WalkInterval ticker. KickOffNew
  spawns one walker goroutine per fresh bucket; the goroutine lists,
  synthesizes events, then exits.
- scheduler.Scheduler builds its pipelines up front and exposes a
  pipelineFanout (shard -> Pipeline) as the EventInjector, so a multi-
  worker scheduler routes each synthesized event to the pipeline that
  owns its shard.
- Shell command's single-pipeline path passes pipeline.InjectEvent
  directly.

Synthesized events carry TsNs=0; dispatcher.advance treats that as a
no-op so the reader's persisted cursor isn't ratcheted past unprocessed
meta-log events. Identity (HeadFid + ExtendedHash) is still computed
from the real filer entry, so the server's identity-CAS catches an
overwrite between bootstrap and dispatch.

* debug(s3tests): make lifecycle worker progress visible in CI logs

The previous CI failure dumped an empty $LC_LOG even though the worker
was running. Two reasons:

1. weed shell suppresses glog by default (logtostderr / alsologtostderr
   set to false). Pass `-debug` so the bootstrapper's V(0) lines reach
   stderr instead of disappearing into /tmp/weed.*.log.
2. cleanup used `kill -9` which skips Go's stdout flush. SIGTERM first
   with a 1s grace, then SIGKILL the holdout, then read the log.

While here: bump the bootstrap walker's two informational logs to V(0)
so the diagnosis from CI doesn't require -v=1 on the worker.

* fix(s3/lifecycle/dispatcher): refresh snap on every event

Pipeline.Run captured snap at startup and only refreshed it on the
dispatch tick. With bootstrap event injection, the walker pushes events
seconds after engine.Compile sees the bucket — typically WITHIN the
same dispatch interval. Routing against the cached (empty) snap then
silently dropped every match because BucketActionKeys returned nil for
the bucket-not-yet-in-snapshot case.

Re-fetch on each event. Engine.Snapshot is an atomic.Pointer.Load, so
the cost is negligible. The dispatch-tick branch keeps using a fresh
local read for its own loop, so its semantics are unchanged.
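
A minimal sketch of the per-event re-fetch. `Engine`, `Snapshot`, and `route` are simplified stand-ins for the real types; only the use of atomic.Pointer is taken from the commit text.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Snapshot maps a bucket to its compiled action keys (simplified).
type Snapshot struct {
	actionKeys map[string][]string
}

type Engine struct {
	snap atomic.Pointer[Snapshot]
}

// Compile publishes a new snapshot atomically.
func (e *Engine) Compile(rules map[string][]string) {
	e.snap.Store(&Snapshot{actionKeys: rules})
}

// Snapshot is a single atomic load, cheap enough to call per event.
func (e *Engine) Snapshot() *Snapshot { return e.snap.Load() }

// route reads the current snapshot for every event instead of a cached copy.
func route(e *Engine, bucket string) []string {
	snap := e.Snapshot()
	if snap == nil {
		return nil
	}
	return snap.actionKeys[bucket]
}

func main() {
	e := &Engine{}
	fmt.Println(len(route(e, "photos"))) // bucket not compiled yet
	e.Compile(map[string][]string{"photos": {"expire"}})
	fmt.Println(len(route(e, "photos"))) // visible to the next event
}
```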
2026-05-08 17:29:47 -07:00
Chris Lu 8b87ceb0d1 refactor(s3api): strip back-stamp from PutBucketLifecycleConfiguration (Phase 4) (#9367)
* refactor(s3api): strip back-stamp from PutBucketLifecycleConfiguration

The handler used to walk every existing entry under the rule's prefix
and stamp entry.Attributes.TtlSec + the SeaweedFSExpiresS3 flag so that
the filer's compaction filter would expire them. With the event-driven
lifecycle worker live, that retroactive walk is redundant — the worker
drives expiration off the meta-log and a one-time bootstrap scan, so a
PUT lifecycle stays O(rules) instead of O(objects).

New writes still inherit TTL from the filer.conf location entry above;
that volume-routing path is unchanged here and will move to an explicit
operator command later (Phase 11).

Drops updateEntriesTTL + processDirectoryTTL + processTTLBatch +
updateEntryTTL from filer_util.go.

* fix(s3api): clear stale lifecycle TTL entries on PUT

PutBucketLifecycleConfiguration only ever appended/updated filer.conf
entries; it never cleared the entries for rules the operator had removed,
moved to a different prefix, disabled, restricted with a tag filter, or
taken off the fast path by enabling bucket versioning. The stale day-TTL
kept routing new writes (and would expire old ones if any landed under the
prefix) after the policy was updated.

Treat PUT as a full replacement: walk this bucket's existing day-TTL
entries, clear them, then add fresh entries from the new rule set.

* test(command): bump mini default plugin job-type count to 7

The s3_lifecycle plugin handler registered in #9362 is the seventh
default; the test still asserted six.

* fix(s3api): delete stale lifecycle PathConf instead of blanking Ttl

Just clearing pathConf.Ttl leaves the rule's Collection, Replication,
and VolumeGrowthCount in place, so new writes still match the stale
prefix and inherit outdated routing/placement. Use
fc.DeleteLocationConf so the lifecycle-owned PathConf goes away
entirely. Same fix in DeleteBucketLifecycleHandler, which had the
same bug.
2026-05-08 11:03:03 -07:00
Chris Lu c567da7164 feat(s3): register SeaweedS3LifecycleInternal gRPC service (#9359)
Phase 2 added the LifecycleDelete handler on S3ApiServer but never
registered it on a running gRPC server, so workers had no endpoint to
dial. Embed UnimplementedSeaweedS3LifecycleInternalServer on
S3ApiServer and register it on the s3 command's grpc server alongside
SeaweedS3IamCacheServer.
2026-05-07 18:19:42 -07:00
Chris Lu 1c0e24f06a fix(balance): don't move remote-tiered volumes; don't fatal on missing .idx (#9335)
* fix(volume): don't fatal on missing .idx for remote-tiered volume

A .vif left behind without its .idx (orphaned by a crashed move, partial
copy, or hand-edit) would trip glog.Fatalf in checkIdxFile and take the
whole volume server down on boot, killing every healthy volume on it
too. For remote-tiered volumes treat it as a per-volume load error so
the server can come up and the operator can clean up the stray .vif.

Refs #9331.

* fix(balance): skip remote-tiered volumes in admin balance detection

The admin/worker balance detector had no equivalent of the shell-side
guard ("does not move volume in remote storage" in
command_volume_balance.go), so it scheduled moves on remote-tiered
volumes. The "move" copies .idx/.vif to the destination and then calls
Volume.Destroy on the source, which calls backendStorage.DeleteFile —
deleting the remote object the destination's new .vif now points at.

Populate HasRemoteCopy on the metrics emitted by both the admin
maintenance scanner and the worker's master poll, then drop those
volumes at the top of Detection.

Fixes #9331.

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* fix(volume): keep remote data on volume-move-driven delete

The on-source delete after a volume move (admin/worker balance and
shell volume.move) ran Volume.Destroy with no way to opt out of the
remote-object cleanup. Volume.Destroy unconditionally calls
backendStorage.DeleteFile for remote-tiered volumes, so a successful
move would copy .idx/.vif to the destination and then nuke the cloud
object the destination's new .vif was already pointing at.

Add VolumeDeleteRequest.keep_remote_data and plumb it through
Store.DeleteVolume / DiskLocation.DeleteVolume / Volume.Destroy. The
balance task and shell volume.move set it to true; the post-tier-upload
cleanup of other replicas and the over-replication trim in
volume.fix.replication also set it to true since the remote object is
still referenced. Other real-delete callers keep the default. The
delete-before-receive path in VolumeCopy also sets it: the inbound copy
carries a .vif that may reference the same cloud object as the
existing volume.
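
The effect of the flag can be illustrated with a minimal sketch. The
types below (remoteTier, volume, destroy) are hypothetical stand-ins
for illustration, not the real Volume / BackendStorage code; only the
keepRemoteData idea comes from this commit.

```go
package main

import "fmt"

// remoteTier is a hypothetical stand-in for the cloud backend: key -> object exists.
type remoteTier map[string]bool

// volume is a simplified, assumed model of a volume that may be remote-tiered.
type volume struct {
	remoteKey string // empty when the volume is not remote-tiered
	loaded    bool
}

func (v *volume) hasRemoteFile() bool { return v.remoteKey != "" }

// destroy drops the local volume and deletes the remote object only when
// the caller did not ask to keep it.
func (v *volume) destroy(remote remoteTier, keepRemoteData bool) {
	v.loaded = false
	if v.hasRemoteFile() && !keepRemoteData {
		delete(remote, v.remoteKey)
	}
}

func main() {
	remote := remoteTier{"vol-7.dat": true}
	src := &volume{remoteKey: "vol-7.dat", loaded: true}
	src.destroy(remote, true) // move case: destination still references the object
	fmt.Println("remote object present after move:", remote["vol-7.dat"])
}
```

In this sketch a move passes true so the destination's .vif keeps a
valid target, while a real delete passes false.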

Refs #9331.

* test(storage): in-process remote-tier integration tests

Cover the four operations the user is most likely to run against a
cloud-tiered volume — balance/move, vacuum, EC encode, EC decode — by
registering a local-disk-backed BackendStorage as the "remote" tier and
exercising the real Volume / DiskLocation / EC encoder code paths.

Locks in:
- Destroy(keepRemoteData=true) preserves the remote object (move case)
- Destroy(keepRemoteData=false) deletes it (real-delete case)
- Vacuum/compact on a remote-tier volume never deletes the remote object
- EC encode requires the local .dat (callers must download first)
- EC encode + rebuild round-trips after a tier-down

Tests run in-process and finish in under a second total — no cluster,
binary, or external storage required.

* fix(rust-volume): keep remote data on volume-move-driven delete

Mirror the Go fix in seaweed-volume: plumb keep_remote_data through
grpc volume_delete → Store.delete_volume → DiskLocation.delete_volume
→ Volume.destroy, and skip the s3-tier delete_file call when the flag
is set. The pre-receive cleanup in volume_copy passes true for the
same reason as the Go side: the inbound copy carries a .vif that may
reference the same cloud object as the existing volume.

The Rust loader already warns rather than fataling on a stray .vif
without an .idx (volume.rs load_index_inmemory / load_index_redb), so
no counterpart to the Go fatal-on-missing-idx fix is needed.

Refs #9331.

* fix(volume): preserve remote tier on IO-error eviction; fix EC test target

Two review nits:

- Store.MaybeAddVolumes' periodic cleanup pass deleted IO-errored
  volumes with keepRemoteData=false, so a transient local fault on a
  remote-tiered volume would also nuke the cloud object. Track the
  delete reason via a parallel slice and pass keepRemoteData=v.HasRemoteFile()
  for IO-error evictions; TTL-expired evictions still pass false.

- TestRemoteTier_ECEncodeDecode_AfterDownload deleted shards 0..3 but
  called them "parity" — by the klauspost/reedsolomon convention shards
  0..DataShardsCount-1 are data and DataShardsCount..TotalShardsCount-1
  are parity. Switch the loop to delete the parity range so the
  intent matches the indices.
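
The index convention above can be sketched as follows. The 10 data +
4 parity layout is an assumption for illustration, and the constant
and function names are hypothetical rather than the test's actual
identifiers.

```go
package main

import "fmt"

// Assumed layout for illustration: data shards first, parity shards after.
const (
	dataShards   = 10
	parityShards = 4
	totalShards  = dataShards + parityShards
)

// parityShardIDs returns the indices dataShards..totalShards-1.
func parityShardIDs() []int {
	ids := make([]int, 0, parityShards)
	for i := dataShards; i < totalShards; i++ {
		ids = append(ids, i)
	}
	return ids
}

// isParity reports whether a shard index falls in the parity range.
func isParity(shard int) bool {
	return shard >= dataShards && shard < totalShards
}

func main() {
	fmt.Println("parity shards:", parityShardIDs())
	fmt.Println("shard 0 is parity:", isParity(0))
}
```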

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-05-06 15:19:43 -07:00
Chris Lu 6141222ab0 fix(test/s3/policy): allocate fresh admin port per subtest (#9332)
* fix(test/s3/policy): allocate fresh admin port per subtest

startMiniCluster ran weed mini in-process and explicitly assigned
master/volume/filer/s3 ports allocated by MustAllocatePorts, but it
left -admin.port and -admin.port.grpc unset, so each subtest reused
the hardcoded defaults 23646 / 33646.

The package's subtests run sequentially within the same go test
process. The previous subtest's admin goroutine is still bound to
23646 by the time the next subtest spins up its own mini, so the
new admin can never bind, mini.go's waitForAdminServerReady hits
its 240-attempt cap, and glog.Fatalf kills the test binary. This
has been the dominant cause of "admin server did not become ready"
flakes across recent IAM PRs.

Allocate two extra ports for admin and pass them through. The other
subprocess-based tests (s3tables/*) are not affected because each
launches weed mini in a fresh OS process.

* fix(mini): make admin readiness wait context-aware

waitForAdminServerReady polled for 240 attempts × 500ms regardless of
whether the surrounding mini context was cancelled. When mini is run
in-process from a test harness (test/s3/policy/...) and the test calls
its cancel func, the leftover wait keeps spinning for the full two
minutes and then glog.Fatalf's, terminating the entire test binary —
including any sibling subtest that has since started its own mini.

Thread the existing miniClientsCtx through the wait so a Stop / cancel
returns context.Canceled immediately. The caller (startMiniAdminWithWorker)
treats a context-cancelled outcome as a graceful shutdown signal and
logs+returns instead of fataling.
2026-05-05 11:24:43 -07:00
Chris Lu 95560076e6 fix(mini): raise admin readiness timeout to 2 minutes (#9329)
The 30-second ceiling on waitForAdminServerReady was too tight on busy
CI runners. master + filer + volume + admin all start in parallel on a
shared worker, and S3 Policy Shell Integration Tests has been flaking
across multiple PRs with "admin server did not become ready... after
60 attempts" even though the server still comes up within a minute or
two. Two minutes (240 attempts at 500ms) leaves headroom for runner
contention without being absurd in a local-dev run.
2026-05-05 07:59:25 -07:00
Chris Lu d605feb403 refactor(command): expand "~" in all path-style CLI flags (#9306)
* refactor(command): expand "~" in all path-style CLI flags

Many of weed's path-bearing flags (-s3.config, -s3.iam.config,
-admin.dataDir, -webdav.cacheDir, -volume.dir.idx, TLS cert/key
files, profile output paths, mount cache dirs, sftp key files, ...)
were never run through util.ResolvePath, so a value like "~/iam.json"
was used literally. Tilde only worked when the shell expanded it,
which silently fails for the common -flag=~/path form (bash leaves
the tilde literal in --opt=~/path).

- Extend util.ResolvePath to also handle "~user" / "~user/rest",
  matching shell tilde expansion. Add unit tests.
- Apply util.ResolvePath at the top of each shared start* function
  (s3, webdav, sftp) so mini/server/filer/standalone callers all
  inherit it; resolve at the few one-off use sites (mount cache
  dirs, volume idx folder, mini admin.dataDir, profile paths).
- Drop the duplicate expandHomeDir helper from admin.go in favor of
  the now-equivalent util.ResolvePath.

* fixup: handle comma-separated -dir flags for tilde expansion

`weed mini -dir`, `weed server -dir`, and `weed volume -dir` accept
comma-separated paths (`dir[,dir]...`). Calling util.ResolvePath on
the whole string mishandled multi-folder values with tilde, e.g.
"~/d1,~/d2" would resolve as if "d1,~/d2" were a single subpath.

- Add util.ResolveCommaSeparatedPaths: split on ",", run each entry
  through ResolvePath, rejoin. Short-circuits when no "~" present.
- Use it for *miniDataFolders (mini.go), *volumeDataFolders (server.go),
  and resolve each entry of v.folders in-place (volume.go) so all
  downstream consumers see resolved paths.
- Add 7-case TestResolveCommaSeparatedPaths covering empty, single,
  multiple, and mixed inputs.

* address PR review: metaFolder + Windows backslash

- master.go: resolve *m.metaFolder at the top of runMaster so
  util.FullPath(*m.metaFolder) on the next line sees an expanded
  path. Drop the now-redundant ResolvePath in TestFolderWritable.
- server.go: same treatment for *masterOptions.metaFolder, paired
  with the existing cpu/mem profile resolves. Drop the redundant
  inner ResolvePath at TestFolderWritable.
- file_util.go: ResolvePath now accepts filepath.Separator as a
  separator after the tilde, so "~\\data" works on Windows. Other
  platforms keep current behaviour (backslash stays literal because
  it is a valid filename character in usernames and paths).
- file_util_test.go: add two cases using filepath.Separator that
  exercise the new code path on Windows and remain a no-op on Unix.

* address PR review: resolve "~" in remaining command path flags

Comprehensive sweep of path-bearing flags across every weed
subcommand, applying util.ResolvePath in-place at the top of each
run* function so all downstream consumers see expanded paths.

- webdav.go: resolve *wo.cacheDir at the top of startWebDav so
  mini/server/filer/standalone callers all inherit it.
- mount_std.go: cpu/mem profile paths.
- filer_sync.go: cpu/mem profile paths.
- mq_broker.go: cpu/mem profile paths.
- benchmark.go: cpuprofile output path.
- backup.go: -dir resolved once at runBackup; drop the duplicated
  inline ResolvePath in NewVolume calls.
- compact.go: -dir resolved at runCompact; drop inline ResolvePath.
- export.go: -dir and -o resolved at runExport; drop inline
  ResolvePath in LoadFromIdx and ScanVolumeFile.
- download.go: -dir resolved at runDownload; drop inline.
- update.go: -dir resolved at runUpdate so filepath.Join uses the
  expanded path; drop inline ResolvePath in TestFolderWritable.
- scaffold.go: -output expanded before filepath.Join.
- worker.go: -workingDir expanded before being passed to runtime.

* address PR review: resolve option-struct paths at run* entry points

server.go:381 propagates s3Options.config to filerOptions.s3ConfigFile
*before* startS3Server runs, which meant the filer-side code saw the
unresolved tilde-prefixed pointer. Same pattern for webdavOptions and
sftpOptions (and equivalent in mini.go / filer.go).

The fix: hoist resolution from the shared start* functions up to the
run* entry points, where every shared pointer is set up before any
propagation happens.

- s3.go, webdav.go, sftp.go: extract a resolvePaths() method on each
  Options struct that runs every path field through util.ResolvePath
  in-place. Idempotent.
- runS3, runWebDav, runSftp: call the standalone struct's resolvePaths
  before starting metrics / loading security config.
- runServer, runMini, runFiler: call resolvePaths on every embedded
  options struct, plus resolve loose flags (serverIamConfig,
  miniS3Config, miniIamConfig, miniMasterOptions.metaFolder, and
  filer's defaultLevelDbDirectory) so they're expanded before any
  pointer copy or use.
- Drop the now-redundant inline ResolvePath at filer's
  defaultLevelDbDirectory composition.

* address PR review: re-resolve mini -dir post-config, cover misc paths

- mini.go: applyConfigFileOptions can overwrite -dir with a literal
  ~/data from mini.options. Re-resolve *miniDataFolders after the
  config-file apply, alongside the other path resolves, so the mini
  filer no longer ends up with a literal ~/data/filerldb2.
- benchmark.go: resolve *b.idListFile (-list).
- filer_sync.go: resolve *syncOptions.aSecurity / .bSecurity
  (-a.security / -b.security) before LoadClientTLSFromFile.
- filer_cat.go: resolve *filerCat.output (-o) before os.OpenFile.
- admin.go: drop trailing blank line at EOF (git diff --check).

* address PR review: resolve -a.security/-b.security/-config before use

Four follow-up fixes:

- filer_sync.go: the -a.security / -b.security resolves were placed
  *after* LoadClientTLSFromFile / LoadHTTPClientFromFile were called,
  so weed filer.sync -a.security=~/a.toml still passed the literal
  tilde path. Hoist the resolves above the security-loading block so
  TLS clients see expanded paths.
- filer_sync_verify.go: same flag pair was never resolved at all in
  the verify command; resolve at the top of runFilerSyncVerify.
- filer_meta_backup.go: -config (the backup_filer.toml path) was
  passed directly to viper. Resolve at the top of runFilerMetaBackup.
- mini.go: master.dir defaulted to the entire comma-joined
  miniDataFolders. With weed mini -dir=~/d1,~/d2 (or any multi-dir
  setup), TestFolderWritable then stat'd the joined string instead
  of a single directory. Default to the first entry via StringSplit
  to mirror the disk-space calculation a few lines below, and drop
  the now-redundant ResolvePath in TestFolderWritable.
2026-05-03 21:46:21 -07:00
Chris Lu f16353de0b feat(mini): add -bucket flag to pre-create an S3 bucket on startup (#9302)
* feat(mini): add -bucket flag to pre-create an S3 bucket on startup

Lets users hand a pre-provisioned object store to clients/CI without a
post-start `weed shell s3.bucket.create` step. The flag is a no-op when
empty (default) and idempotent on subsequent starts.

* mini: bound bucket-creation RPCs with a timeout off miniClientsCtx

Address PR review feedback: derive the lookup/mkdir context from
miniClientsCtx() so Ctrl+C cancels the bucket RPCs, and cap with a 5s
timeout so a stalled filer cannot block the welcome message
indefinitely. Also wrap the DoMkdir error for parity with the lookup
path.

* mini: fall back to S3_BUCKET env var for -bucket

Mirrors the existing -s3.externalUrl / S3_EXTERNAL_URL pattern so
container/Kubernetes deployments can pre-create the bucket via env
without overriding the entrypoint command.

* docs(readme): lead weed mini quick start with credentials + bucket

Promote the one-line setup (env vars + bucket) so users get a
ready-to-use S3 endpoint without hopping between sections to find
credential and bucket setup.

* mini: accept comma-separated -bucket list

Lets a single startup pre-create multiple S3 buckets, e.g.
-bucket=bucket1,bucket2 (or S3_BUCKET=bucket1,bucket2). Names are
trimmed and deduped; per-bucket errors are logged and the loop continues
so one bad name does not block the rest.
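
The trim-and-dedupe step might look like the sketch below.
parseBucketList is a hypothetical name, and skipping empty entries is
an assumption not stated in the commit.

```go
package main

import (
	"fmt"
	"strings"
)

// parseBucketList splits on commas, trims whitespace, and drops duplicates
// while keeping first-seen order. Empty entries are skipped (assumption).
func parseBucketList(raw string) []string {
	seen := make(map[string]struct{})
	var names []string
	for _, part := range strings.Split(raw, ",") {
		name := strings.TrimSpace(part)
		if name == "" {
			continue
		}
		if _, dup := seen[name]; dup {
			continue
		}
		seen[name] = struct{}{}
		names = append(names, name)
	}
	return names
}

func main() {
	fmt.Println(parseBucketList(" bucket1, bucket2,bucket1"))
}
```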

* mini: add -tableBucket flag for pre-creating S3 Tables buckets

Mirrors -bucket but creates S3 Tables (Iceberg) buckets via
s3tables.Manager so users can hand the all-in-one binary a ready-to-use
table catalog without a follow-up weed shell call. Comma-separated, env
fallback to S3_TABLE_BUCKET, idempotent on restart, owned by the
DefaultAccountID placeholder.

* mini: use errors.Is for ErrNotFound check in bucket lookup

Matches the rest of the codebase (~20 call sites in weed/s3api). The
direct equality works today because LookupEntry returns ErrNotFound
unwrapped, but errors.Is future-proofs against any future wrapping.
2026-05-02 21:02:21 -07:00
Chris Lu 1f6f473995 refactor(worker): co-locate plugin handlers with their task packages (#9301)
* refactor(worker): co-locate plugin handlers with their task packages

Move every per-task plugin handler from weed/plugin/worker/ into the
matching weed/worker/tasks/<name>/ package, so each task owns its
detection, scheduling, execution, and plugin handler in one place.

Step 0 (within pluginworker, no behavior change): extract shared helpers
that previously lived inside individual handler files into dedicated
files and export the ones now consumed across packages.

  - activity.go: BuildExecutorActivity, BuildDetectorActivity
  - config.go: ReadStringConfig/Double/Int64/Bytes/StringList, MapTaskPriority
  - interval.go: ShouldSkipDetectionByInterval
  - volume_state.go: VolumeState + consts, FilterMetricsByVolumeState/Location
  - collection_filter.go: CollectionFilterMode + consts
  - volume_metrics.go: export CollectVolumeMetricsFromMasters,
    MasterAddressCandidates, FetchVolumeList
  - testing_senders_test.go: shared test stubs

Phase 1: move the per-task plugin handlers (and the iceberg subpackage)
into their task packages.

  weed/plugin/worker/vacuum_handler.go         -> weed/worker/tasks/vacuum/plugin_handler.go
  weed/plugin/worker/ec_balance_handler.go     -> weed/worker/tasks/ec_balance/plugin_handler.go
  weed/plugin/worker/erasure_coding_handler.go -> weed/worker/tasks/erasure_coding/plugin_handler.go
  weed/plugin/worker/volume_balance_handler.go -> weed/worker/tasks/balance/plugin_handler.go
  weed/plugin/worker/iceberg/                   -> weed/worker/tasks/iceberg/

  weed/plugin/worker/handlers/handlers.go now blank-imports all five
  task subpackages so their init() registrations fire.

  weed/command/mini.go and the worker tests construct the handler with
  vacuum.DefaultMaxExecutionConcurrency (the constant moved with the
  vacuum handler).

admin_script remains in weed/plugin/worker/ because there is no
underlying weed/worker/tasks/admin_script/ package to merge with.

* refactor(worker): update test/plugin_workers imports for moved handlers

Three handler constructors moved out of pluginworker into their task
packages — update the integration test files in test/plugin_workers/
to import from the new locations:

  pluginworker.NewVacuumHandler        -> vacuum.NewVacuumHandler
  pluginworker.NewVolumeBalanceHandler -> balance.NewVolumeBalanceHandler
  pluginworker.NewErasureCodingHandler -> erasure_coding.NewErasureCodingHandler

The pluginworker import is kept where the file still uses
pluginworker.WorkerOptions / pluginworker.JobHandler.

* refactor(worker): update test/s3tables iceberg import path

The iceberg subpackage moved from weed/plugin/worker/iceberg/ to
weed/worker/tasks/iceberg/. test/s3tables/maintenance/maintenance_integration_test.go
still imported the old path, breaking S3 Tables / RisingWave / Trino /
Spark / Iceberg-catalog / STS integration test builds.

Mirrors the OSS-side fix needed by every job in the run that
transitively imports test/s3tables/maintenance.

* chore: gofmt PR-touched files

The S3 Tables Format Check job runs `gofmt -l` over weed/s3api/s3tables
and test/s3tables, then fails if anything is unformatted. Files this
PR moved or modified had import-grouping and trailing-spacing issues
introduced by perl-based renames; reformat them with gofmt -w.

Touched files:
  test/plugin_workers/erasure_coding/{detection,execution}_test.go
  test/s3tables/maintenance/maintenance_integration_test.go
  weed/plugin/worker/handlers/handlers.go
  weed/worker/tasks/{balance,ec_balance,erasure_coding,vacuum}/plugin_handler*.go

* refactor(worker): bounds-checked int conversions for plugin config values

CodeQL flagged 18 go/incorrect-integer-conversion warnings on the moved
plugin handler files: results of pluginworker.ReadInt64Config (which
ultimately calls strconv.ParseInt with bit size 64) were being narrowed
to int32/uint32/int without an upper-bound check, so a malicious or
malformed admin/worker config value could overflow the target type.

Add three helpers in weed/plugin/worker/config.go that wrap
ReadInt64Config and clamp out-of-range values back to the caller's
fallback:

  ReadInt32Config (math.MinInt32 .. math.MaxInt32)
  ReadUint32Config (0 .. math.MaxUint32)
  ReadIntConfig    (math.MinInt32 .. math.MaxInt32, platform-portable)

Update each flagged call site in the four moved task packages to use
the bounds-checked helper. For protobuf uint32 fields (volume IDs)
the variable type also becomes uint32, removing the trailing
uint32(volumeID) casts and changing the "missing volume_id" check
from `<= 0` to `== 0`.

Touched files:
  weed/plugin/worker/config.go
  weed/worker/tasks/balance/plugin_handler.go
  weed/worker/tasks/erasure_coding/plugin_handler.go
  weed/worker/tasks/vacuum/plugin_handler.go

* refactor(worker): use ReadIntConfig for clamped derive-worker-config helpers

CodeQL still flagged three call sites where ReadInt64Config was being
narrowed to int after a value-range clamp (max_concurrent_moves <= 50,
batch_size <= 100, min_server_count >= 2). The clamp is correct but
CodeQL's flow analysis didn't recognize the bound, so it flagged them
as unbounded narrowing.

Switch to ReadIntConfig (already int32-bounded by the helper) for
those three sites, drop the now-redundant int64 intermediate variables.

Also drops the now-unused `> math.MaxInt32` clamp in
ec_balance.deriveECBalanceWorkerConfig (the helper covers it).
2026-05-02 18:03:13 -07:00
Jaehoon Kim be451d22b5 feat(filer.sync): add -verifySync mode to filer.sync for cross-cluster file comparison (#9284)
* Add -verifySync flag to filer.sync for cross-cluster file comparison

Add a verification mode to filer.sync that compares entries between two
filers without performing actual synchronization. Uses directory-level
sorted merge of ListEntries to detect missing files, size mismatches,
and ETag mismatches. Supports -isActivePassive for unidirectional check
and -modifyTimeAgo to skip recently modified files during sync lag.

* Add mtime annotation and JSON output to filer.sync -verifySync

Add automatic mtime relation analysis for SIZE_MISMATCH and
ETAG_MISMATCH diffs, and an NDJSON output mode for external tooling.

mtime classification:
- B_NEWER => "late_updates_skip_likely" hint. Surfaces the case
  where target has a stub entry whose mtime is ahead of source's
  real file, causing UpdateEntry's mtime guard in filersink to
  permanently skip the update.
- A_NEWER => "sync_lag_or_event_miss" hint.
- EQUAL   => no hint (chunk-level issue suspected).
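
The classification above amounts to a three-way comparison. A minimal
sketch, assuming mtimes are compared as Unix seconds and using a
hypothetical classifyMtime function:

```go
package main

import "fmt"

// classifyMtime compares source (A) and target (B) mtimes and returns the
// relation plus the hint attached to it; EQUAL carries no hint.
func classifyMtime(aMtime, bMtime int64) (relation, hint string) {
	switch {
	case bMtime > aMtime:
		return "B_NEWER", "late_updates_skip_likely"
	case aMtime > bMtime:
		return "A_NEWER", "sync_lag_or_event_miss"
	default:
		return "EQUAL", ""
	}
}

func main() {
	rel, hint := classifyMtime(1_700_000_000, 1_723_000_000)
	fmt.Println(rel, hint)
}
```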

Text output example:
  [SIZE_MISMATCH] /path (a=996, b=0, B newer +274d [late-updates skip likely])

Add -verifyJsonOutput flag. When set, emits one JSON object per
line (NDJSON) for diffs and a final SUMMARY object, suitable for
piping into external diagnostic pipelines.

Concurrent writes from the directory worker pool are now serialized
via outputMu to keep both text lines and JSON records atomic.

* fix(filer.sync): use shared global semaphore in verifySync to bound goroutine explosion

Replace the per-call local semaphore in compareDirectory with a single
shared semaphore created in runVerifySync. The old per-level semaphore
applied a limit of verifySyncConcurrency only within each directory level,
allowing effective concurrency to grow as verifySyncConcurrency^depth on
deep trees.

The shared semaphore is held only for each directory's I/O phase
(listEntries + merge) and released before recursing into subdirectories,
so a parent never blocks waiting for children to acquire slots — which
would deadlock once tree depth exceeds the semaphore capacity.

Extract the capacity into a named constant (verifySyncConcurrency = 5)
with a comment explaining the memory vs. performance trade-off.
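
A rough sketch of the shared-semaphore idea, under simplifying
assumptions: the tree is in memory, the "I/O phase" is just reading
the child list, and all names (node, walker, run) are hypothetical.
The slot is released before children are spawned, so no goroutine
holds a slot while waiting on descendants.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type node struct{ children []*node }

// buildTree returns a full binary tree with the given number of levels.
func buildTree(levels int) *node {
	n := &node{}
	if levels > 1 {
		n.children = []*node{buildTree(levels - 1), buildTree(levels - 1)}
	}
	return n
}

type walker struct {
	sem                   chan struct{} // one semaphore shared by every level
	wg                    sync.WaitGroup
	active, peak, visited int64
}

func (w *walker) walk(n *node) {
	defer w.wg.Done()
	w.sem <- struct{}{} // acquire for the listing phase only
	cur := atomic.AddInt64(&w.active, 1)
	for {
		p := atomic.LoadInt64(&w.peak)
		if cur <= p || atomic.CompareAndSwapInt64(&w.peak, p, cur) {
			break
		}
	}
	atomic.AddInt64(&w.visited, 1)
	children := n.children // stands in for listing + merge
	atomic.AddInt64(&w.active, -1)
	<-w.sem // release before recursing, so depth cannot deadlock
	for _, c := range children {
		w.wg.Add(1)
		go w.walk(c)
	}
}

// run walks the tree with at most limit concurrent listing phases.
func run(root *node, limit int) (visited, peak int64) {
	w := &walker{sem: make(chan struct{}, limit)}
	w.wg.Add(1)
	go w.walk(root)
	w.wg.Wait()
	return atomic.LoadInt64(&w.visited), atomic.LoadInt64(&w.peak)
}

func main() {
	visited, peak := run(buildTree(10), 5)
	fmt.Println("visited:", visited, "peak within limit:", peak <= 5)
}
```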

Add unit tests:
- correctness: missing file, only-in-B, size mismatch, active-passive mode
- concurrency bound: peak concurrent listings ≤ verifySyncConcurrency
- no-deadlock: binary tree of depth 10 completes within timeout

* fix(filer.sync): stream directory entries to prevent OOM on large directories

Replace the listEntries helper (which accumulated all entries into a
single []filer_pb.Entry slice) with an entryStream type that pages
through the directory in the background and forwards entries one at a
time through a buffered channel. Memory per directory comparison is now
O(channel buffer size = 64) regardless of how many entries the directory
contains.

Key design points:
- entryStream wraps a goroutine + buffered channel with a one-entry
  lookahead (peek/advance) so the two-pointer sorted merge in
  compareDirectory can work without buffering any full listing.
- A child context (mergeCtx) is passed to both stream goroutines so
  they are cancelled promptly if compareDirectory returns early (e.g.
  on error); the ctx.Done() select arm in the callback prevents
  goroutine leaks when the consumer stops reading.
- stream.err is written by the goroutine before close(ch), so it is
  safe to read after the channel is exhausted (Go memory model:
  channel close happens-before the zero-value receive).
- countMissingRecursive is rewritten to use ReadDirAllEntries with a
  direct callback, eliminating its own slice allocation.
- listEntries is removed; it is no longer called anywhere.

* fix(filer.sync): address verifySync review findings

Four real bugs found and fixed; one finding already resolved (shared
semaphore was introduced in a prior commit).

path.Join for child paths (filer_sync_verify.go)
  fmt.Sprintf("%s/%s", dir, name) produced "//name" when dir was "/".
  Replace all child-path concatenations with path.Join so root-level
  walks emit clean paths.

cutoffTime check for ONLY_IN_B entries (filer_sync_verify.go)
  The B-only branch ignored -modifyTimeAgo, so files recently written
  to B were reported as ONLY_IN_B instead of being skipped. Mirror the
  A-side mtime guard: skip and increment skippedRecent when the entry
  is newer than cutoffTime.

Summary emitted before error check (filer_sync_verify.go)
  A filer I/O error mid-walk still caused a SUMMARY record (or text
  summary) to be printed, making partial runs appear complete. Move the
  error check to before summary emission; on error, return immediately
  without printing any summary.

Return false on verification failure (filer_sync.go)
  runVerifySync returned true (exit 0) even when diffs were found or the
  walk failed. Return false so the main binary sets exit status 1,
  consistent with how all other commands signal failure.

* test(filer.sync): add missing verifySync test coverage

Four new tests covering gaps identified during review:

TestVerifySyncETagMismatch
  Verifies that two files with identical size but different Md5 checksums
  are counted as etagMismatch (not sizeMismatch). Exercises the second
  branch of compareEntries that was previously untested.

TestVerifySyncCutoffTime (4 subtests)
  A-only recent  — recent file skipped (skippedRecent++), not MISSING
  A-only old     — old file reported as MISSING
  B-only recent  — recent file skipped (skippedRecent++), not ONLY_IN_B
  B-only old     — old file reported as ONLY_IN_B
  The B-only subtests specifically cover the cutoffTime fix added in the
  previous commit.

TestVerifySyncRootPath
  Regression for the path.Join fix: walks from "/" and verifies that the
  child directory is reached and compared correctly (the old Sprintf
  produced "//data" which would silently produce wrong results).
  Asserts dirCount=2 and fileCount=1 to confirm the full tree is walked.

* fix(filer.sync): use os.Exit(2) instead of return false on verify failure

return false triggered weed.go's error handler which printed the full
command usage — appropriate for invalid arguments, not for a completed
verification that found differences. Use os.Exit(2) consistent with
the existing pattern in filer_sync.go (lines 251, 293).

* refactor(filer.sync.verify): split verify into its own command

The verify mode is a one-shot batch operation with a fundamentally
different lifecycle from the long-running sync subscriber, and most of
filer.sync's flags (replication, metrics port, debug pprof, concurrency,
etc.) do not apply to it. Extract it into a sibling command alongside
filer.copy/filer.backup/filer.export rather than a flag mode on
filer.sync.

Also rename modifyTimeAgo to modifiedTimeAgo (grammatical) and drop the
verifyJsonOutput prefix to plain jsonOutput now that the verify context
is implicit in the command name.

* fix(filer.sync.verify): address review comments

- Bounded worker pool: cap subdirectory goroutines per level via a
  jobs channel and min(verifySyncConcurrency, len(subDirs)) workers
  instead of spawning one goroutine per child. Wide directories no
  longer park ~2KB per queued goroutine.

- Don't gate recursion on a directory's mtime: a fresh child write
  bumps the parent mtime, but older files inside should still be
  reported as missing. Always recurse for missing-in-B directories
  and apply the cutoff per-file inside countMissingRecursive.

- Apply -modifiedTimeAgo symmetrically: matched-name files now skip
  the comparison when EITHER side is recently modified, not just A.
  This restores lag tolerance when B was just rewritten.

Adds tests for both new behaviors and a shared isTooRecent helper.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-04-29 12:33:53 -07:00
Chris Lu 35fe3c801b feat(nfs): UDP MOUNT v3 responder + real-Linux e2e mount harness (#9267)
* feat(nfs): add UDP MOUNT v3 responder

The upstream willscott/go-nfs library only serves the MOUNT protocol
over TCP. Linux's mount.nfs and the in-kernel NFS client default
mountproto to UDP in many configurations, so against a stock weed nfs
deployment the kernel queries portmap for "MOUNT v3 UDP", gets port=0
("not registered"), and either falls back inconsistently or surfaces
EPROTONOSUPPORT, which users see as the "requested NFS version or
transport protocol is not supported" error reported in #9263. The user has
to add `mountproto=tcp` or `mountport=2049` to mount options to coerce
TCP just for the MOUNT phase.

Add a small UDP responder that speaks just enough of MOUNT v3 to handle
the procedures the kernel actually invokes during mount setup and
teardown: NULL, MNT, and UMNT. The wire layout for MNT mirrors
handler.go's TCP path so both transports produce the same root
filehandle and the same auth flavor list for the same export. Other
v3 procedures (DUMP, EXPORT, UMNTALL) cleanly return PROC_UNAVAIL.
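
The procedure dispatch can be sketched as below. The procedure numbers
follow the MOUNT v3 definition in RFC 1813 and the accept_stat values
follow RFC 5531; the Go identifiers are hypothetical and this is not
the responder's actual code.

```go
package main

import "fmt"

// MOUNT v3 procedure numbers (RFC 1813).
const (
	mountProcNull    uint32 = 0
	mountProcMnt     uint32 = 1
	mountProcDump    uint32 = 2
	mountProcUmnt    uint32 = 3
	mountProcUmntAll uint32 = 4
	mountProcExport  uint32 = 5
)

// RPC accept_stat values used here (RFC 5531).
const (
	acceptSuccess     uint32 = 0
	acceptProcUnavail uint32 = 3
)

// acceptStatFor reports how a NULL/MNT/UMNT-only responder answers a call.
func acceptStatFor(proc uint32) uint32 {
	switch proc {
	case mountProcNull, mountProcMnt, mountProcUmnt:
		return acceptSuccess
	default:
		return acceptProcUnavail
	}
}

func main() {
	for _, p := range []uint32{mountProcNull, mountProcMnt, mountProcDump, mountProcUmnt, mountProcUmntAll, mountProcExport} {
		fmt.Printf("proc %d -> accept_stat %d\n", p, acceptStatFor(p))
	}
}
```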

This commit only adds the responder; portmap-advertise and Server.Start
wire-up follow in subsequent commits so each step stays independently
reviewable.

References: RFC 1813 §5 (NFSv3/MOUNTv3), RFC 5531 (RPC). Existing
constants and parseRPCCall / encodeAcceptedReply helpers from
portmap.go are reused so behaviour stays consistent across both UDP
listening goroutines.

* feat(nfs): advertise UDP MOUNT v3 in the portmap responder

The portmap responder advertised TCP-only entries because go-nfs only
serves TCP, but with the new UDP MOUNT responder in place we can now
honestly advertise MOUNT v3 over UDP as well. Linux clients whose
default mountproto is UDP query portmap during mount setup; if the
answer is "not registered" some kernels translate the result to
EPROTONOSUPPORT instead of falling back to TCP, which is exactly the
failure pattern reported in #9263.

Add the entry, refresh the doc comment, and extend the existing
GETPORT and DUMP unit tests so a regression that drops the entry shows
up at unit-test granularity rather than only in an end-to-end mount.

* feat(nfs): start UDP MOUNT v3 responder alongside the TCP NFS listener

Plug the new mountUDPServer into Server.Start so it comes up on the
same bind/port as the TCP NFS listener. Started before portmap so a
portmap query that races a fast client never returns a UDP MOUNT entry
the responder isn't actually answering, and shut down via the same
defer chain so a portmap-or-listener startup failure doesn't leave the
UDP responder dangling.

The portmap startup log now reflects all three advertised entries
(NFS v3 tcp, MOUNT v3 tcp, MOUNT v3 udp) so operators can confirm at a
glance that the UDP MOUNT path is up.

Verified end-to-end: built a Linux/arm64 binary, ran weed nfs in a
container with -portmap.bind, and mounted from another container using
both the user-reported failing setup from #9263 (vers=3 + tcp without
mountport) and an explicit mountproto=udp to force the new code path.
The trace `mount.nfs: trying ... prog 100005 vers 3 prot UDP port 2049`
now leads to a successful mount instead of EPROTONOSUPPORT.

* docs(nfs): note that the plain mount form works on UDP-default clients

With UDP MOUNT v3 now served alongside TCP, the only path that ever
required mountproto=tcp / mountport=2049 — clients whose default
mountproto is UDP — works against the plain mount example. Update the
startup mount hint and the `weed nfs` long help so users don't go
hunting for a mount-option workaround that no longer applies.

The "without -portmap.bind" branch is unchanged: that path still has
to bypass portmap entirely because there is no portmap responder for
the kernel to query.

* test(nfs): add kernel-mount e2e tests under test/nfs

The existing test/nfs/ harness boots a real master + volume + filer +
weed nfs subprocess stack and drives it via go-nfs-client. That covers
protocol behaviour from a Go client's perspective, but encoding bugs
that only surface when a real Linux kernel parses the wire bytes stay
invisible: both ends of the test use the same RPC library, so identical
bugs round-trip cleanly. The two NFS issues hit recently were exactly that
shape — NFSv4 mis-routed to v3 SETATTR (#9262) and missing UDP MOUNT v3
— and only surfaced in a real client.

Add three end-to-end tests that mount the harness's running NFS server
through the in-tree Linux client:

  - TestKernelMountV3TCP: NFSv3 + MOUNT v3 over TCP (baseline).
  - TestKernelMountV3MountProtoUDP: NFSv3 over TCP, MOUNT v3 over UDP
    only — regression test for the new UDP MOUNT v3 responder.
  - TestKernelMountV4RejectsCleanly: vers=4 against the v3-only server,
    asserting the kernel surfaces a protocol/version-level error rather
    than a generic "mount system call failed" — regression test for the
    PROG_MISMATCH path from #9262.

The tests pass explicit port=/mountport= mount options so the kernel
never queries portmap, which means the harness doesn't need to bind
the privileged port 111 and won't collide with a system rpcbind on a
shared CI runner. They t.Skip cleanly when the host isn't Linux, when
mount.nfs isn't installed, or when the test process isn't running as
root.
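
The skip conditions above could be expressed roughly as follows. This is a sketch only: `kernelMountSkipReason` is a hypothetical helper, not the one in test/nfs, and the skip messages are made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"
)

// kernelMountSkipReason is a hypothetical helper. It returns a non-empty
// reason when a kernel-mount test should be skipped, or "" when it can run.
func kernelMountSkipReason(goos string, haveMountNFS bool, euid int) string {
	switch {
	case goos != "linux":
		return "kernel NFS mount requires Linux"
	case !haveMountNFS:
		return "mount.nfs is not installed"
	case euid != 0:
		return "kernel NFS mount requires root"
	}
	return ""
}

func main() {
	// Probe the current host the same way a test would.
	_, err := exec.LookPath("mount.nfs")
	reason := kernelMountSkipReason(runtime.GOOS, err == nil, os.Geteuid())
	if reason != "" {
		fmt.Println("would skip:", reason)
		return
	}
	fmt.Println("would run kernel-mount tests")
}
```

Inside a test, the returned reason would be passed to `t.Skip` when it is non-empty.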

Run locally with:

	cd test/nfs
	sudo go test -v -run TestKernelMount ./...

CI wiring follows in the next commit.

* ci(nfs): run kernel-mount e2e tests in nfs-tests workflow

Wire the new TestKernelMount* tests from test/nfs into the existing
NFS workflow:

  - Existing protocol-layer step now skips '^TestKernelMount' so a
    "skipped because not root" line doesn't appear on every run.
  - New "Install kernel NFS client" step pulls nfs-common (mount.nfs +
    helpers) and netbase (/etc/protocols, which mount.nfs's protocol-
    name lookups need to resolve `tcp`/`udp`).
  - New privileged step runs only the kernel-mount tests under sudo,
    preserving PATH and pointing GOMODCACHE/GOCACHE at the user's
    caches so the second `go test` invocation reuses already-built
    test binaries instead of redownloading modules under root.

The summary block now lists the three kernel-mount cases explicitly
so a regression in either the #9262 fix or this PR's UDP MOUNT change is
traceable from the workflow run page.
2026-04-28 14:06:35 -07:00
Chris Lu 735e94f6ba mount: expose -fuse.maxBackground and -fuse.congestionThreshold flags (closes #9258) (#9268)
* mount: expose `-fuse.maxBackground` flag (closes #9258)

The Linux FUSE driver caps in-flight async requests via
`/sys/fs/fuse/connections/<id>/max_background` (and a derived
`congestion_threshold = 3/4 * max_background`). Heavy upload workloads
need this raised, but the cap currently lives only in `/sys`, so it
resets on reboot/remount.

`weed mount` was hardcoding `MaxBackground: 128`. Promote it to a flag,
default unchanged. Setting `-fuse.maxBackground=2048` has the same effect
as the manual `echo 2048 > .../max_background` (and automatically yields
1536 for congestion_threshold), and it persists across remounts.

`congestion_threshold` is not exposed as a separate flag because
go-fuse derives it as 3/4 of MaxBackground in InitOut and offers no
hook to override; users wanting a different ratio can still write
/sys/fs/fuse/connections/<id>/congestion_threshold post-mount.

* mount: add `-fuse.congestionThreshold` flag, bump go-fuse to v2.9.3

go-fuse v2.9.3 exposes CongestionThreshold as a separate MountOption,
so we can now let users override the kernel's default 3/4-of-max_background
ratio at mount time instead of having to write
/sys/fs/fuse/connections/<id>/congestion_threshold post-mount on every
remount/reboot.

Default 0 preserves existing behavior (kernel derives it as
3/4 * max_background). Non-zero is sent to the kernel verbatim; the
kernel clamps it to max_background if higher.
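
The arithmetic described in these two commits can be sketched as below. `effectiveCongestionThreshold` is a hypothetical helper that only models the behaviour as this commit message states it (3/4 derivation when the flag is 0, clamping to max_background otherwise); it is not go-fuse or kernel code.

```go
package main

import "fmt"

// effectiveCongestionThreshold is a hypothetical model of the value that
// ends up in effect, following the description in the commit message.
func effectiveCongestionThreshold(maxBackground, congestionThreshold int) int {
	if congestionThreshold == 0 {
		// Default: derived as 3/4 of max_background.
		return maxBackground * 3 / 4
	}
	if congestionThreshold > maxBackground {
		// Values above max_background are clamped down to it.
		return maxBackground
	}
	return congestionThreshold
}

func main() {
	fmt.Println(effectiveCongestionThreshold(128, 0))     // default MaxBackground
	fmt.Println(effectiveCongestionThreshold(2048, 0))    // -fuse.maxBackground=2048
	fmt.Println(effectiveCongestionThreshold(2048, 512))  // explicit override
	fmt.Println(effectiveCongestionThreshold(2048, 4096)) // override above the cap
}
```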
2026-04-28 13:42:58 -07:00
Chris Lu 0fa0a56a5a filer(mysql): TLS hostname/SNI knobs + MariaDB upsert documentation (#9260)
* refactor(filer/mysql): set tls.Config per-instance via Connector instead of global registry

Replace the use of `mysql.RegisterTLSConfig("mysql-tls", ...)` and the
`&tls=mysql-tls` DSN suffix with a per-instance setup that assigns the
`*tls.Config` directly to `mysql.Config.TLS` and opens the database via
`mysql.NewConnector` + `sql.OpenDB`.

The driver's TLS-config registry is process-wide; if a second `MysqlStore`
were ever initialized with different TLS settings (e.g., a filer plus a
separately configured store) the second registration would silently
overwrite the first. The connector pattern keeps the TLS configuration
attached to the connector and avoids that global side effect.

Behavior is otherwise unchanged: TLS is enabled when `enable_tls=true`,
the same `ca_crt`/`client_crt`/`client_key` knobs are honored, and the
TLS minimum version remains 1.2.

* filer(mysql): use system root CAs when ca_crt is empty

Previously, setting `enable_tls=true` while leaving `ca_crt` empty returned an
unhelpful empty-path read error. Many managed MySQL/MariaDB providers serve
certificates that chain to a public CA already in the host's trust store, so
requiring an explicit CA bundle adds friction with no security benefit.

Leave `RootCAs` unset when `ca_crt` is empty so Go's `tls.Config` falls back
to the system trust store, matching the standard behavior of `mysql --ssl`.
Existing setups with `ca_crt` configured are unaffected.

Also wraps the CA read/parse errors with the file path for easier diagnosis.

* filer(mysql): fail loudly when client_crt / client_key are unreadable

The previous implementation called `tls.LoadX509KeyPair` and silently
discarded any error, falling back to a non-mTLS connection. A typo or
permissions problem in `client_crt` / `client_key` therefore appeared as a
confusing server-side handshake error rather than as a config error,
because the server was expecting a client cert that the filer never sent.

Treat the keypair as required when either path is set, and surface the
underlying load error with both filenames so the misconfiguration is
obvious. The default (both paths empty) is unchanged: no client cert is
sent.

* filer(mysql): add tls_insecure_skip_verify and tls_server_name knobs

When the filer connects to a MySQL/MariaDB cluster whose server
certificate's SAN does not match the connection address (common with
internal load balancers, IP-only connection strings, or self-signed
cluster certs), the TLS handshake fails with `x509: certificate is valid
for X, not Y`. There was previously no way to fix this short of reissuing
the cert.

Expose two new optional knobs on `[mysql]`:

- `tls_server_name` overrides the SNI / cert hostname used for
  verification — the standard fix when the cert SAN is correct but the
  connection address is not.
- `tls_insecure_skip_verify` disables verification entirely as an escape
  hatch for testing or for clusters with no usable SAN.

Both default to off, so existing configurations continue to verify the
server certificate against the connection address as before.

* docs(scaffold/filer.toml): document mysql TLS knobs and MariaDB upsert override

- Document the new `tls_insecure_skip_verify` and `tls_server_name` options.
- Update the `ca_crt` comment to reflect that it is optional and that the
  system trust store is used when the path is empty (matches the runtime
  behavior in mysql_store.go).
- Reword the client cert comments to make the mTLS pairing requirement
  explicit (both `client_crt` and `client_key` must be set together).
- Add a commented-out MariaDB / MySQL 5.7 alternative for `upsertQuery`,
  noting that the default (`AS new` row alias) requires MySQL 8.0.19+.

* filer(mysql): drop redundant blank import of go-sql-driver/mysql

The package was imported twice: once with the `mysql` alias (used for
`mysql.MySQLError`, `mysql.Config`, `mysql.NewConnector`, etc.) and once
as `_` to register the driver. The named import already triggers
`init()` and registers the driver, so the blank import is dead weight.
2026-04-28 01:29:41 -07:00