seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-06-13 23:36:45 +03:00

Author	SHA1	Message	Date
Chris Lu	37962e2445	admin: configure maintenance tasks via admin.toml (#9926 ) * admin: configure maintenance tasks via admin.toml Maintenance task settings could only be edited in the admin UI and live under <dataDir>/conf, so they silently reverted to defaults whenever the data directory was recreated. An optional admin.toml now declares vacuum, balance, and erasure coding settings; keys set there are written through to the persisted task configs at every startup, overriding UI edits, so the configuration stays declarative. Generate an example with "weed scaffold -config=admin". * vacuum: round min volume age up to whole hours MinVolumeAgeSeconds was truncated by integer division when converted to the hour-granular protobuf field, so a sub-hour setting silently became 0 and disabled the age guard. * admin: split and normalize preferred_tags from admin.toml A comma-separated string, as set via environment variable, came through viper as a single slice element. Split on commas and reuse util.NormalizeTagList, matching the plugin config path. * scaffold: clarify admin.toml wording	2026-06-11 11:04:52 -07:00
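The rounding fix in 37962e2445 (sub-hour MinVolumeAgeSeconds truncating to 0 hours) comes down to ceiling division. A minimal sketch, assuming a hypothetical helper name `secondsToHoursCeil`; it is not the function used in SeaweedFS:

```go
package main

import "fmt"

// secondsToHoursCeil converts seconds to whole hours, rounding up so that any
// non-zero sub-hour value becomes 1 instead of truncating to 0.
// Hypothetical helper for illustration only.
func secondsToHoursCeil(seconds uint64) uint64 {
	hours := seconds / 3600
	if seconds%3600 != 0 {
		hours++
	}
	return hours
}

func main() {
	for _, s := range []uint64{0, 1800, 3600, 3601} {
		fmt.Printf("%d s: truncated=%d h, rounded up=%d h\n", s, s/3600, secondsToHoursCeil(s))
	}
}
```

With plain integer division a 1800-second setting maps to 0 hours, which is the case the commit describes as disabling the age guard.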
Chris Lu	e56a1c4c05	admin: pre-gzip embedded static assets, add cache headers (#9918) The admin UI served embedded static files uncompressed and without cache headers: embed.FS has zero mod times, so no Last-Modified, no ETag, no 304s -- every page load re-downloaded ~700KB of css/js in full, which gets painful over slow or tunneled links. Gzip the static tree at generation time (go generate ./weed/admin) and embed only the compressed mirror, shrinking the binary ~1.5MB. The handler hands the pre-compressed bytes to gzip-capable clients, decompresses for the rest, and sets Cache-Control, per-variant content-hash ETags and Vary so repeat loads revalidate with a 304. bootstrap.min.css goes 232KB -> 30KB on the wire. A drift test keeps static_gz/ in sync with static/.	2026-06-10 12:54:36 -07:00
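The handler behavior described in e56a1c4c05 (serve pre-compressed bytes to gzip-capable clients, decompress for the rest, revalidate through per-variant ETags) might look roughly like the sketch below. Everything here is an assumption for illustration: the `gzAsset` type, the ETag derivation from the compressed bytes, the `no-cache` policy, and the naive Accept-Encoding check are not taken from the SeaweedFS source.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// gzAsset is a hypothetical pre-compressed static file with one ETag per variant.
type gzAsset struct {
	gz              []byte
	contentType     string
	etagGz, etagRaw string
}

func newGzAsset(gz []byte, contentType string) gzAsset {
	sum := sha256.Sum256(gz) // content hash; hashing the gzip bytes is an assumption
	h := hex.EncodeToString(sum[:8])
	return gzAsset{gz: gz, contentType: contentType, etagGz: `"` + h + `-gz"`, etagRaw: `"` + h + `"`}
}

func (a gzAsset) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// Simplified check; a real handler would parse q-values.
	acceptsGzip := strings.Contains(r.Header.Get("Accept-Encoding"), "gzip")
	etag := a.etagRaw
	if acceptsGzip {
		etag = a.etagGz
	}
	h := w.Header()
	h.Set("Vary", "Accept-Encoding")
	h.Set("Cache-Control", "no-cache") // assumed policy: cache, but revalidate
	h.Set("ETag", etag)
	if r.Header.Get("If-None-Match") == etag {
		w.WriteHeader(http.StatusNotModified)
		return
	}
	h.Set("Content-Type", a.contentType)
	if acceptsGzip {
		h.Set("Content-Encoding", "gzip")
		w.Write(a.gz) // hand over the embedded bytes untouched
		return
	}
	zr, err := gzip.NewReader(bytes.NewReader(a.gz))
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer zr.Close()
	io.Copy(w, zr) // decompress on the fly for clients without gzip support
}

// gzipBytes stands in for the generation-time compression step.
func gzipBytes(b []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(b)
	zw.Close()
	return buf.Bytes()
}

// fetch issues one in-memory request against the asset.
func fetch(a gzAsset, acceptEncoding, ifNoneMatch string) *httptest.ResponseRecorder {
	req := httptest.NewRequest(http.MethodGet, "/static/app.css", nil)
	if acceptEncoding != "" {
		req.Header.Set("Accept-Encoding", acceptEncoding)
	}
	if ifNoneMatch != "" {
		req.Header.Set("If-None-Match", ifNoneMatch)
	}
	rec := httptest.NewRecorder()
	a.ServeHTTP(rec, req)
	return rec
}

func main() {
	a := newGzAsset(gzipBytes([]byte("body{margin:0}")), "text/css")
	first := fetch(a, "gzip", "")
	fmt.Println(first.Code, first.Header().Get("Content-Encoding"))
	again := fetch(a, "gzip", first.Header().Get("ETag"))
	fmt.Println(again.Code)
	plain := fetch(a, "", "")
	fmt.Println(plain.Code, plain.Body.String())
}
```

Using a distinct ETag for the compressed and uncompressed variants keeps a cache from answering a non-gzip client with a validator that belongs to the gzip body.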
Chris Lu	2ac5aa72c7	add elastic8 filer store for Elasticsearch 8 (#9916 ) * elastic: fix listing against a missing or empty directory index The refresh 404 leaked into the named return, so the first listing of a directory whose index does not exist yet returned an error instead of an empty result. Sorting also fails on an index with no documents ("No mapping found for [_id] in order to sort on"); unmapped_type keeps the resumed-listing path working there. * add elastic8 filer store for Elasticsearch 8 Elasticsearch 8 disables _id fielddata by default, so the elastic7 store's directory listings fail with "Fielddata access on the _id field is disallowed". elastic8 uses the same client and configuration options, but also indexes the document id as an Id field and sorts listings on Id.keyword.	2026-06-10 12:10:49 -07:00
Chris Lu	e12052ee6b	fix(filer.sync): replicate a rename as an atomic move, not a no-op update (#9895 ) * fix(filer.sync): replicate a rename as create-then-delete, not an in-place update A rename arrives as a single metadata event carrying both the old and new entry. The filer sink was routed to UpdateEntry, which looks up the old path but issues the update against the new parent without changing the name — and the filer UpdateEntry RPC cannot move an entry. So the rename was dropped: the old path lingered and the new path never appeared (same-dir renames rewrote the old name in place). Route a real move (the sink path changed) through CreateEntry(new) then DeleteEntry(old) in both the replicator and the filer.sync/backup driver, the way the other sinks already handle it; reach UpdateEntry only for true in-place updates. Create before delete so a crash between the two leaves the entry visible rather than lost. * fix(filer.sync): derive the rename delete key like the create key, guard the watched root The rename delete leg rebuilt the old key with a raw util.Join, bypassing the sink-side key normalization the create leg gets from buildKey — so a rename could create the new entry and then fail to delete the old one under a transformed key. Build the old key through buildKey too, and skip the delete when the moved entry is the watched root itself (where the old key would resolve to the target root and recursively delete the whole sink tree). * test(filer.sync): cover the in-place update delete-then-create fallback order The recording sinks always reported foundExisting, so the fallback that an in-place update takes when the entry is missing on the sink was never run. Make it configurable and assert the fallback deletes before it recreates the same key, in both the replicator and the filer.sync drivers. 
* feat(filer.sync): move filer-sink renames natively via AtomicRenameEntry create-then-delete is unsafe for the filer sink: CreateEntry returns nil without creating on a transient chunk-copy error, so the paired delete could remove the only valid destination copy; a directory rename also deleted the old subtree before descendants were recreated, and left old chunks behind. Add an optional EntryMover sink capability and implement it on the filer sink via AtomicRenameEntry — one atomic, metadata-only move that relocates a whole subtree in a single transaction. Renames prefer it; sinks without a native move keep create-then-delete. When the old path is already gone (a descendant the parent rename moved, or one never replicated) MoveEntry creates the new path instead, re-checking existence with a lookup so a rolled-back move that left the old entry intact is retried rather than mistaken for gone. * docs(filer.sync): note entryMissing's gRPC not-found string fallback is deliberate	2026-06-09 12:54:28 -07:00
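The routing described in e12052ee6b (prefer a native move when the sink offers one, otherwise create before delete) can be sketched with an optional interface and a type assertion. The `EntryMover` and `MoveEntry` names come from the commit message; the pared-down `Sink` interface, the signatures, and the recording sinks are assumptions made for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// Sink is a hypothetical, pared-down replication sink.
type Sink interface {
	CreateEntry(key string) error
	DeleteEntry(key string) error
}

// EntryMover is the optional capability; only sinks with a native move implement it.
type EntryMover interface {
	MoveEntry(oldKey, newKey string) error
}

// replicateRename prefers a native move and falls back to create-then-delete.
func replicateRename(s Sink, oldKey, newKey string) error {
	if m, ok := s.(EntryMover); ok {
		return m.MoveEntry(oldKey, newKey)
	}
	// Create first: a failure here leaves the old entry in place.
	if err := s.CreateEntry(newKey); err != nil {
		return err
	}
	return s.DeleteEntry(oldKey)
}

// recordingSink logs the operations it receives.
type recordingSink struct {
	ops        []string
	failCreate bool
}

func (s *recordingSink) CreateEntry(key string) error {
	if s.failCreate {
		return errors.New("create failed")
	}
	s.ops = append(s.ops, "create "+key)
	return nil
}

func (s *recordingSink) DeleteEntry(key string) error {
	s.ops = append(s.ops, "delete "+key)
	return nil
}

// movingSink adds the native move on top of the basic sink.
type movingSink struct{ recordingSink }

func (s *movingSink) MoveEntry(oldKey, newKey string) error {
	s.ops = append(s.ops, "move "+oldKey+" -> "+newKey)
	return nil
}

func main() {
	plain := &recordingSink{}
	fmt.Println(replicateRename(plain, "/dir/a", "/dir/b"), plain.ops)

	native := &movingSink{}
	fmt.Println(replicateRename(native, "/dir/a", "/dir/b"), native.ops)
}
```

The sketch leaves out the cases the commit also handles, such as an old path that is already gone and the guard for the watched root.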
Chris Lu	7b07d8177a	fix(filer.sync): scope filesystem key sanitization to the local sink (#9894 ) * fix(filer.sync): scope filesystem key sanitization to the local sink destKey ran every sink key through escapeKey, whose Windows build strips colons. Colons are illegal in NTFS filenames so the local sink needs that, but s3/filer/azure/gcs/b2 accept them as ordinary key bytes — stripping them silently diverged the destination key (a source a:b replicated as ab). Move the sanitization into the local sink behind a Windows build tag, applied at every entry point so the previously-unescaped in-place-update paths stay consistent. Non-local sinks now keep the raw key; non-Windows builds are unchanged; a leading drive-letter colon is preserved. * test(filer.sync): cover incremental destKey and localsink update/delete sanitization Lock the colon-preserving behavior for the incremental destKey branch, and extend the Windows local-sink test to assert UpdateEntry and DeleteEntry also sanitize the key, not just CreateEntry.	2026-06-09 10:18:49 -07:00
Chris Lu	ed470dccb1	mini: grow volumes one at a time Mini auto-sizes a few large volume slots, but the master pre-grows 7 volumes per new collection. Under a filer group each S3 bucket is its own collection, so the first buckets claimed every slot and later writes failed to assign a volume. Cap mini's volume_growth copy counts to 1.	2026-06-08 14:51:40 -07:00
Jaehoon Kim	1b5f1c1f3b	feat(filer.backup): -initialSnapshot re-seeds a reinitialized destination (#9828 ) * feat(filer.backup): add -resetCheckpoint to force a fresh sync filer.backup resumes from a per-sink offset persisted in the source filer's KV. There was no first-class way to discard that checkpoint and re-run from the beginning short of guessing a large -timeAgo, which also skips -initialSnapshot. Add -resetCheckpoint: before reading the offset, write 0 for this sink so getOffset returns 0, isFreshSync stays true, and -initialSnapshot re-runs a full walk. Effective only when -timeAgo is 0. The flag is cleared after the first successful reset: runFilerBackup retries doFilerBackup forever on error, so leaving it set would re-zero the checkpoint on every retry and never make forward progress after a transient failure. Later retries resume from the persisted checkpoint instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(filer.backup): keep fresh-sync intent when offset read fails after reset After -resetCheckpoint writes offset 0, a transient getOffset read-back error flipped isFreshSync to false, which skipped the -initialSnapshot walk the reset explicitly requested. Track that the reset happened this iteration and, on a getOffset error, preserve isFreshSync=true in that case (the non-reset path keeps treating a read error as "not fresh" to avoid re-walking on transients). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(filer.backup): skip offset read-back on reset instead of tracking a flag Replace the didReset bool by branching: on -resetCheckpoint, clear the offset and start fresh without reading it back (we just wrote 0, so the state is known); otherwise read the offset as before. This drops the redundant getOffset RPC after a reset and removes the read-back error case entirely, so no separate flag is needed to preserve isFreshSync. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * filer.backup: -initialSnapshot re-seeds on every start; drop -resetCheckpoint -initialSnapshot now walks the live tree whenever -timeAgo is 0, seeds the destination, and overwrites the saved checkpoint, rather than running only on a fresh sync. That re-seeds a reinitialized destination on its own, so the separate -resetCheckpoint flag is gone. The walk runs once per process: the in-memory flag is cleared after the watermark is persisted, so the retry loop resumes from the persisted checkpoint instead of re-walking on every transient error. A process restart re-walks, so remove the flag once the backup is caught up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-07 23:35:53 -07:00
Chris Lu	89cbb1c558	admin: default -dataDir to "." so maintenance task state persists across restarts (#9856) admin: default -dataDir to "." so maintenance task state persists Previously -dataDir defaulted to empty, so the admin ran maintenance in memory only: task state was never saved and maintenance tasks (notably EC balance/rebuild) were re-issued every scan cycle without converging, churning EC shards (moves landed shards without their .ecx index, leaving EC volumes unloadable/missing shards). Default -dataDir to "." (the process working directory, which under the standard systemd unit is the admin's data dir) so state persists out of the box.	2026-06-07 20:45:03 -07:00
Chris Lu	755af4adf4	s3: actually bind outbound connections when -ip.bind is set (#9849 ) * s3: set outbound bind IP before the first filer dial Standalone weed s3 dialed the filer for GetFilerConfiguration before SetOutboundLocalIP ran, so that gRPC conn was created with the stock dialer and no source address. gRPC caches conns by address and reuses the original dialer on reconnect, so the s3->filer connection kept leaving from the OS-chosen source for the life of the process even after the bind IP was set a moment later. * grpc: install the outbound-bind dialer unconditionally The dialer was installed only when OutboundLocalAddr was already set at GrpcDial time, baking the source-address decision into the cached conn, so a conn dialed before the bind IP was configured never bound. Install the context dialer always and decide per dial: bind through OutboundDialContext once a source is set, otherwise fall back to the stock net.Dialer so default deployments keep gRPC's dial timeout and keepalive behavior. The bind now applies on the next reconnect regardless of ordering, matching the HTTP transport's unconditional DialContext.	2026-06-07 10:20:58 -07:00
Chris Lu	be7f417a03	ip.bind: bind outbound connections to the configured address (#9834 ) * ip.bind: bind outbound connections to the configured address -ip.bind only governed listeners; outbound gRPC and HTTP connections let the OS pick the source IP, which may not even be able to reach the target. Mirror the bind address into a process-global source address and apply it to outbound TCP dials: the gRPC context dialer, the per-client HTTP transports, and the default transport. Loopback targets and unix sockets keep the OS-chosen source so same-host traffic still works. * ip.bind: first-write-wins source IP, skip on address-family mismatch Make SetOutboundLocalIP first-write-wins so a `weed server` component's own bind setting (run in its goroutine) can't clobber the process-wide source address the top-level -ip.bind already established for the other components. Skip source binding when the target is a literal IP of a different family than the bind address, since forcing a mismatched source fails the dial.	2026-06-05 12:44:21 -07:00
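The source-address rules from be7f417a03 (bind outbound TCP dials, but leave loopback targets, unix sockets, and address-family mismatches to the OS) can be sketched as a predicate plus a `net.Dialer` with `LocalAddr` set. The function names, the treatment of the literal host `localhost`, and the choice to bind for unresolved hostnames are assumptions of this sketch, not the SeaweedFS implementation.

```go
package main

import (
	"fmt"
	"net"
)

// shouldBindSource reports whether an outbound dial to target should use
// bindAddr as its source IP. Hypothetical helper for illustration.
func shouldBindSource(bindAddr, network, target string) bool {
	bindIP := net.ParseIP(bindAddr)
	if bindIP == nil || network == "unix" {
		return false // nothing configured, or not a TCP dial
	}
	host, _, err := net.SplitHostPort(target)
	if err != nil {
		host = target // target had no port
	}
	if host == "localhost" {
		return false // assumption: treat the name like a loopback literal
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return true // a hostname: family unknown until resolution (assumption)
	}
	if ip.IsLoopback() {
		return false // same-host traffic keeps the OS-chosen source
	}
	// Forcing an IPv4 source onto an IPv6 target (or the reverse) fails the dial.
	return (ip.To4() != nil) == (bindIP.To4() != nil)
}

// newDialer returns a dialer whose source address is set only when allowed.
func newDialer(bindAddr, network, target string) *net.Dialer {
	d := &net.Dialer{}
	if shouldBindSource(bindAddr, network, target) {
		d.LocalAddr = &net.TCPAddr{IP: net.ParseIP(bindAddr)}
	}
	return d
}

func main() {
	for _, target := range []string{"10.0.0.9:8888", "127.0.0.1:8888", "[2001:db8::1]:8888"} {
		fmt.Println(target, shouldBindSource("10.0.0.5", "tcp", target))
	}
}
```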
Chris Lu	ab7be7867d	security: hot-reload JWT signing keys on SIGHUP (#9826 ) * security: reload JWT signing keys on SIGHUP Signing keys were read once in the server constructors and never refreshed. After a key rotation (Secret update, divergent reads) the in-memory key stayed stale and every request kept failing "wrong jwt" until the affected process was restarted. Add Guard.UpdateSigningKeys and call it from the master, volume and filer reload paths and the s3 reload hook, next to the existing whitelist refresh. Make the global chunk-read JWT cache reloadable via an atomic swap, and register the master's Reload with grace.OnReload -- it was never wired, so the master ignored SIGHUP entirely. Mirror the same refresh in the Rust volume server's SIGHUP handler. * security: swap signing keys behind an atomic pointer Addresses review feedback on the in-place key swap: SigningKey is a []byte, so reassigning the Guard fields while a request handler reads them is a data race that can tear the multi-word slice header and read out of bounds. Hold the four signing-key fields in an immutable signingConfig snapshot behind atomic.Pointer; UpdateSigningKeys swaps the whole pointer, so a reader sees either the old keys or the new ones. Reads go through new SigningKey/ExpiresAfterSec/ReadSigningKey/ReadExpiresAfterSec accessors. The Rust guard is already safe: every read and the SIGHUP write go through the shared RwLock<Guard>. * security: fold whitelist + auth state into the atomic snapshot Review follow-up. UpdateSigningKeys still wrote isWriteActive while the request path read it (and the whitelist maps) unsynchronized, so a SIGHUP under load could expose an inconsistent mix of activation bits and whitelist contents. Move all hot-reloadable Guard state -- keys, expirations, whitelist, and the activation flags -- into a single immutable guardState swapped behind one atomic.Pointer. 
The Update* methods take a small mutex to serialize the read-modify-write; readers stay lock-free. The concurrency test now also rotates the whitelist and probes IsWhiteListed under -race. Also read each signing key once per branch in the volume/filer JWT auth checks, so a reload landing mid-check can't take the allow-fast-path after auth was enabled or verify against a different key than the branch saw.	2026-06-04 22:26:08 -07:00
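The final shape described in ab7be7867d (one immutable snapshot behind an `atomic.Pointer`, a small mutex for writers, lock-free readers) can be sketched as follows. The `guardState` name and the `UpdateSigningKeys`-style methods echo the commit message; the field set, the signatures, and the constructor are simplified assumptions. The sketch needs Go 1.19 or later for `atomic.Pointer`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// guardState is an immutable snapshot: it is never modified after being published.
type guardState struct {
	signingKey      []byte
	expiresAfterSec int
	whiteList       map[string]struct{}
}

// Guard is a simplified stand-in for the real guard type.
type Guard struct {
	mu    sync.Mutex // serializes writers' read-modify-write
	state atomic.Pointer[guardState]
}

func toSet(hosts []string) map[string]struct{} {
	set := make(map[string]struct{}, len(hosts))
	for _, h := range hosts {
		set[h] = struct{}{}
	}
	return set
}

func NewGuard(key string, expiresAfterSec int, allowed ...string) *Guard {
	g := &Guard{}
	g.state.Store(&guardState{[]byte(key), expiresAfterSec, toSet(allowed)})
	return g
}

// UpdateSigningKey copies the current snapshot, changes the copy, and swaps it in.
func (g *Guard) UpdateSigningKey(key string, expiresAfterSec int) {
	g.mu.Lock()
	defer g.mu.Unlock()
	next := *g.state.Load() // the shared whitelist map is safe: it is never mutated
	next.signingKey = []byte(key)
	next.expiresAfterSec = expiresAfterSec
	g.state.Store(&next)
}

// UpdateWhiteList publishes a snapshot with a freshly built whitelist.
func (g *Guard) UpdateWhiteList(allowed ...string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	next := *g.state.Load()
	next.whiteList = toSet(allowed)
	g.state.Store(&next)
}

// Readers load the pointer once and see either the old or the new snapshot.
func (g *Guard) SigningKey() string { return string(g.state.Load().signingKey) }

func (g *Guard) IsWhiteListed(host string) bool {
	_, ok := g.state.Load().whiteList[host]
	return ok
}

func main() {
	g := NewGuard("old-key", 10, "10.0.0.1")
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				_ = g.SigningKey()
				_ = g.IsWhiteListed("10.0.0.1")
			}
		}()
	}
	g.UpdateSigningKey("new-key", 20)
	g.UpdateWhiteList("10.0.0.2")
	wg.Wait()
	fmt.Println(g.SigningKey(), g.IsWhiteListed("10.0.0.1"), g.IsWhiteListed("10.0.0.2"))
}
```

A request that needs several fields consistently should load the snapshot once and read all of them from that one value, which matches the commit's note about reading each signing key once per branch.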
7y-9	6e8002f065	fix: handle meta backup offset errors safely (#9818) * fix: log meta backup offset errors * fix: log meta backup offset errors * fix: exit on meta backup offset errors Exit with a non-zero status when the initial metadata backup offset cannot be persisted. Classify offset-read failures during streaming so the backup process exits instead of retrying forever, allowing supervisors to restart and bootstrap from a missing checkpoint. * meta backup: read offset in the loop, drop offset error type Reading the saved offset inside the retry loop makes an offset read failure a clean exit and a stream error a retry, without a typed error to tell them apart. streamMetadataBackup now takes the start time. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-04 10:53:23 -07:00
Fabian Hardt	ce6a51468a	sftpd: support SSH user certificates signed by a trusted CA (#9815 ) * sftpd: support SSH user certificates signed by a trusted CA Adds a new "certificate" auth method to weed sftp. When enabled, the server loads trusted CA public keys from -trustedUserCAKeysFile (OpenSSH authorized_keys format, one or more keys) and accepts only ssh.Certificate blobs of type UserCert on the public-key channel. Validation uses ssh.CertChecker: CA signature, ValidAfter/ValidBefore, non-empty ValidPrincipals and SSH login user must appear in ValidPrincipals. The authenticated user must exist in the user store; home dir and permissions resolve as before. Behaviour mirrors MinIO's --sftp=trusted-user-ca-key and OpenSSH's TrustedUserCAKeys: when certificate auth is active, plain (non-cert) public keys are rejected even if "publickey" is also listed. Default authMethods remain "password,publickey", so existing deployments are unaffected. * Update weed/sftpd/auth/certificate.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * sftpd: address review feedback on certificate auth - Pre-marshal trusted CA public keys in IsUserAuthority instead of re-marshaling on every authentication attempt (gemini-code-assist). - Differentiate user-not-found from underlying store errors via errors.As(user.UserNotFoundError) so backend/read failures are no longer reported as bad credentials (coderabbitai). - Fix the corresponding sanity check in the missing-file test to use errors.As instead of errors.Is (UserNotFoundError has no Is method, so the previous check never matched) (coderabbitai). sftpd: register trustedUserCAKeysFile flag in filer and server commands The new field on SftpOptions is dereferenced unconditionally in resolvePaths(), but only the standalone `weed sftp` command was wiring its flag. 
`weed filer` and `weed server` both embed an SftpOptions value and call resolvePaths() on it, so they hit a nil pointer dereference at startup. Register `-sftp.trustedUserCAKeysFile` in both commands and update the -sftp.authMethods help text to mention the new "certificate" method. Fixes the SFTP Integration Tests CI failure on this PR. * helm: expose SFTP certificate auth in the SeaweedFS chart Adds Helm-chart support for the new SSH user-certificate auth method: - values.yaml (sftp:) gains `trustedUserCAKeys` (inline OpenSSH authorized_keys-format CA public keys) and `existingCAKeysSecret` (reference an externally managed Secret). Same pair added under allInOne.sftp with a null default that falls back to the top-level sftp.* setting. - New template templates/sftp/sftp-ca-secret.yaml renders a chart-managed Secret <release>-sftp-ca-secret with `ca_user.pub`, but only when SFTP is enabled, "certificate" is in authMethods, inline keys are provided, and no existingCAKeysSecret is set. - templates/sftp/sftp-deployment.yaml and the all-in-one deployment template add `-trustedUserCAKeysFile=/etc/sw/sftp_ca/ca_user.pub` to the weed sftp command, mount the CA secret at /etc/sw/sftp_ca and add the corresponding volume. All cert-auth bits are guarded by `contains "certificate" authMethods` so existing users see no change. - authMethods help text updated to mention "certificate". Verified end-to-end on a local k3d cluster: cert login succeeds, plain-pubkey login is rejected with "public key without certificate not allowed". * helm: fail render when SFTP certificate auth lacks CA keys When certificate is in authMethods but neither trustedUserCAKeys nor existingCAKeysSecret is set, the deployment mounted a secret that the chart never renders, leaving the pod stuck on a missing volume. Fail at template time with a clear message instead. 
* sftpd: fix stale auth-method list in SFTPServiceOptions comment keyboard-interactive was never implemented; certificate is the new supported method. Match the CLI help text. * sftpd: test Manager wiring of certificate vs public-key channel Cover the channel takeover at the Manager level: certificate auth displaces plain public-key auth when both are enabled, public-key auth stays put otherwise, and enabling certificate without a CA file errors. --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-03 22:32:47 -07:00
Aleksey	e3e02d3364	[CheckDisk]: implement disk health detection (#9560 ) * [CheckDisk][GRPC]: implement MVP for disk health detection, added timeout for new grpc connections * fix(volume): build disk health check on every platform setDiskStatus only existed behind the statfs build tag, so disk.go failed to compile on windows, openbsd, solaris, netbsd and plan9. Move the timeout wrapper and failure tracking into the shared disk.go and have each platform's fillInDiskStatus return an error, so every platform gets the same protection from a stuck filesystem. Also restore the uint64(fs.Bavail) cast: Bavail is int64 on freebsd, so the unguarded multiply broke the freebsd build. * fix(volume): keep one outstanding statfs probe per disk A stuck statfs used to leave isChecking cleared by the timeout path, so the next check spawned another goroutine while the previous one was still blocked in the syscall, leaking one goroutine per minute on a hung disk. Clear the flag only when statfs returns and treat an overlapping check as a failure, so a hung filesystem keeps a single outstanding probe and still gets reported. * fix(volume): assume disk available until the first health check isDiskAvailable defaulted to false, and CollectHeartbeat skips locations that are not available. A freshly started volume server would therefore omit every volume from its first heartbeats until the async CheckDiskSpace ran, so the master could briefly treat all of them as missing. * fix(volume): label the disk error metric by data directory The new gauge tagged the series with IdxDirectory while every neighbouring resource gauge uses Directory, so the error series would not line up with them in dashboards. Also log the underlying error instead of a generic message. 
* test(volume): cover disk health success and repeated-failure paths * fix(volume): make a healthy disk the zero-value default Track the disk as isDiskUnavailable instead of isDiskAvailable so the safe state is the zero value, matching isDiskSpaceLow. CollectHeartbeat only skips a location once a check has actively marked it unavailable, so any DiskLocation built without running CheckDiskSpace (tests, future call sites) still reports its volumes instead of silently dropping them. * feat(disk): detect degraded disks using IO latency probes * feat(stats): introduce configurable disk I/O health probe with EWMA-based latency detection * feat(disk): replace EWMA with sliding window algorithm for disk health detection and added user-friendly options * feat(disk): improve disk health probing and recovery * feat(volume): configure disk health checks via volume.toml * fix(volume): Remove disk IO probe CLI options --------- Co-authored-by: ptukha <ptukha@tochka.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-02 09:02:05 -07:00
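One step in e3e02d3364, keeping a single outstanding statfs probe per disk, can be sketched with an atomic in-flight flag that is cleared only when the probe returns. The `diskProbe` type, the error values, and the `check` method are hypothetical; the real code also tracks failures and latency windows, which this sketch omits. It needs Go 1.19 or later for `atomic.Bool`.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

var (
	errProbeTimeout = errors.New("disk probe timed out")
	errProbeBusy    = errors.New("previous disk probe still running")
)

// diskProbe wraps a probe function (for example a statfs call) that may block
// indefinitely on a hung filesystem. Hypothetical type for illustration.
type diskProbe struct {
	inFlight atomic.Bool
	probe    func() error
}

// check runs the probe with a timeout. While an earlier probe is still blocked,
// it reports a failure instead of starting another goroutine.
func (p *diskProbe) check(timeout time.Duration) error {
	if !p.inFlight.CompareAndSwap(false, true) {
		return errProbeBusy
	}
	done := make(chan error, 1) // buffered so a late result never blocks the goroutine
	go func() {
		err := p.probe()
		p.inFlight.Store(false) // cleared only when the probe really returns
		done <- err
	}()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errProbeTimeout // the flag stays set: the probe is still outstanding
	}
}

func main() {
	healthy := &diskProbe{probe: func() error { return nil }}
	fmt.Println("healthy:", healthy.check(time.Second))

	release := make(chan struct{})
	hung := &diskProbe{probe: func() error { <-release; return nil }}
	fmt.Println("first check:", hung.check(20*time.Millisecond))
	fmt.Println("second check:", hung.check(20*time.Millisecond))
	close(release) // let the blocked probe finish
}
```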
Chris Lu	2386fa550a	grpc: don't tear down the shared master connection on a caller's own timeout (#9775) A Canceled/DeadlineExceeded from the caller's per-request context was treated like a dead channel: it closed the shared cached ClientConn and cancelled every other in-flight RPC on it with "the client connection is closing". Under a burst of concurrent chunk assigns (e.g. a large S3 multipart upload) one slow assign hitting its 10s attempt timeout could poison the connection for all the rest, cascading into a flood of 500s. Thread the caller's context into shouldInvalidateConnection and only invalidate on Canceled/DeadlineExceeded while that context is still live, which isolates the genuine stale-channel signal (a peer restart behind a k8s Service VIP). To carry the context, add a ctx parameter to the existing WithGrpcClient, WithMasterClient, and WithMasterServerClient; the master assign and volume-lookup paths pass their per-attempt context and every other caller passes context.Background().	2026-06-01 15:11:02 -07:00
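The decision rule in 2386fa550a can be sketched with the standard library alone. The function name `shouldInvalidateConnection` appears in the commit message, but this body is an assumption: it uses `context.Canceled` and `context.DeadlineExceeded` as stand-ins for the corresponding gRPC status codes and models only that branch, ignoring any other error classes the real function may handle.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errAttemptTimeout mimics an RPC error that wraps a deadline expiry.
var errAttemptTimeout = fmt.Errorf("assign: %w", context.DeadlineExceeded)

// shouldInvalidateConnection decides whether the shared cached connection
// should be dropped after an RPC error. Simplified sketch.
func shouldInvalidateConnection(callerCtx context.Context, err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
		// If the caller's own context has ended, the error is that caller's
		// timeout or cancellation, not evidence of a dead channel.
		return callerCtx.Err() == nil
	}
	return false // other error classes are out of scope for this sketch
}

// classify runs the check with a live or an already-cancelled caller context.
func classify(callerDone bool, err error) bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if callerDone {
		cancel()
	}
	return shouldInvalidateConnection(ctx, err)
}

func main() {
	fmt.Println("caller timed out:", classify(true, errAttemptTimeout))
	fmt.Println("caller still live:", classify(false, errAttemptTimeout))
}
```

Under this rule a slow assign that hits its own attempt timeout leaves the shared connection alone, so the other in-flight RPCs on it are not cancelled.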
Chris Lu	80dd3b2621	EC bitrot follow-ups: protect destination sidecar on optional copy; cap sidecar block_size (#9763 ) * fix(ec_bitrot): cap sidecar block_size in ValidateBitrotManifest A sidecar loaded from disk (or supplied via a backfill/peer RPC) could carry a huge power-of-two block_size that passed validation, then force a multi-GiB scratch-buffer allocation in scrub/verify. Add a shared MaxBitrotBlockSize (64 MiB) constant, enforce it as an upper bound in isPow2MultipleOf1MiB, and derive the volume flag cap from the same constant so they cannot drift. * fix(ec_bitrot): don't destroy a valid destination sidecar on an optional copy writeToFile opened the destination with O_TRUNC before knowing whether the source had the file, so an optional copy (ignoreSourceFileNotFound) from a source that lacks the .ecsum truncated and then removed a valid pre-existing destination sidecar. Stage the optional copy into a temp sibling and commit it with an atomic rename only when the source actually delivered the file; a missing source is now a no-op. Mandatory copies keep their in-place behavior.	2026-05-31 23:42:33 -07:00
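The block_size bound from 80dd3b2621 reduces to a range check plus a power-of-two test. A minimal sketch, assuming the hypothetical name `validBlockSize`; only the 64 MiB cap and the power-of-two, multiple-of-1-MiB requirement come from the commit message:

```go
package main

import "fmt"

const (
	mib                = int64(1) << 20
	maxBitrotBlockSize = 64 * mib // the 64 MiB cap named in the commit message
)

// validBlockSize accepts powers of two between 1 MiB and the cap. Any power of
// two that is at least 1 MiB is automatically a multiple of 1 MiB.
// Hypothetical helper for illustration.
func validBlockSize(n int64) bool {
	if n < mib || n > maxBitrotBlockSize {
		return false
	}
	return n&(n-1) == 0
}

func main() {
	for _, n := range []int64{16 * mib, 64 * mib, 128 * mib, 3 * mib} {
		fmt.Printf("%d MiB valid=%v\n", n/mib, validBlockSize(n))
	}
}
```

Checking the upper bound before anything allocates a scratch buffer of that size is what keeps an oversized value in a sidecar from turning into a multi-GiB allocation.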
Chris Lu	9658f309d2	EC bitrot detection: per-shard checksum sidecars (#9761 ) * ec: add EC bitrot checksum protobuf EcBitrotProtection/EcShardChecksums/ChecksumAlgorithm sidecar messages, copy_ecsum_file and unsafe_ignore_sidecar fields, and a CHECKSUM scrub mode. * ec: bitrot checksum sidecar format, validation, and per-volume load Per-shard CRC32C block checksums in an optional <base>.ecsum sidecar with a self-integrity header; validation, rolling builder, backfill primitive, and EcVolume load on mount + removal on destroy. * ec: capture per-shard checksums at encode; verify-and-exclude on rebuild WriteEcFilesWithContext returns the protection computed inline during encoding. generateMissingEcFiles verifies present inputs against the sidecar, excludes corrupt ones, regenerates in place, and re-verifies; fail-closed unless unsafe_ignore_sidecar, removing all generated outputs on failure. * ec: read-only checksum scrub with Reed-Solomon arbiter ChecksumScrub verifies each local shard against the sidecar and reconstructs flagged shards from the clean shards so stale-sidecar false positives are not reported. Wired to the gRPC CHECKSUM mode and ec.scrub -mode checksum. * ec: server-side bitrot sidecar write, copy, cleanup, and opportunistic backfill Write .ecsum at fresh encode; propagate it with copy_ecsum_file (tolerant); remove it on full delete and decode; rebuild honors unsafe_ignore_sidecar and opportunistically backfills a sidecar when all shards are reachable. * ec: volume server bitrot config flags -ec.bitrotChecksum (default on) and -ec.bitrotBlockSizeMB (default 16). * fix(ec_bitrot): bound -ec.bitrotBlockSizeMB before the int64 multiply Validate the MiB value is in [1, 1024] before multiplying by 1 MiB, so a huge flag value cannot overflow int64 and slip past the power-of-two check, and a block size cannot collapse a sidecar to a few oversized blocks. 
* fix(ec_bitrot): distribute the .ecsum sidecar from the worker encode path The worker EC encode wrote the generation-0 sidecar locally but never added it to shardFiles, so DistributeEcShards never shipped it and the distributed holders came up unprotected. Append it to shardFiles and map the ecsum shard type to its extension in the sender so it travels with the shards. * fix(ec_bitrot): remove orphaned sidecars when the generation is gone Gate sidecar removal on existingShardCount==0 alone rather than also requiring a stray .ecx. A sidecar whose shards have all been deleted is orphaned and must be removed even when no .ecx remains, or it leaks. .ecx/.ecj/.vif removal stays gated on hasEcxFile as before. * fix(ec_bitrot): do not fold checksum blocks scanned into TotalFiles ChecksumScrub's first return is blocks scanned, not files. Discard it so the scrub response's TotalFiles (a needle/file count) is not inflated by the block count for CHECKSUM mode. * test(ec_bitrot): clean up generated .ecsum sidecars in removeGeneratedFiles * fix(ec_bitrot): reject an oversized sidecar payload before the uint32 cast The header stores payload_len as a uint32; bound the payload before the conversion so a pathological manifest cannot truncate the length field and corrupt the sidecar. A real manifest is a few KB, so this never trips. * fix(ec_bitrot): cap -ec.bitrotBlockSizeMB at 64 MiB The block size becomes the per-shard scratch buffer the scrub/backfill path allocates, so an over-large value (e.g. 1 GiB) is a memory hazard per concurrent scrub worker. Lower the upper bound from 1024 to 64 MiB. * fix(ec_bitrot): add -ecUnsafeIgnoreSidecar to weed tool fix -ecx The -ecx recovery path reconstructs missing shards via RebuildEcFilesWithContext, which fails closed on a malformed/stale .ecsum. Without an override flag an operator could not complete the rebuild without manually deleting the sidecar. Expose -ecUnsafeIgnoreSidecar (default false) and thread it through. 
* fix(ec_bitrot): bound sidecar payload with a direct int constant; drop readFull Guard len(payload) against a plain int constant (1 GiB) before the allocation instead of a uint64 MaxUint32 compare, so the allocation-size value is provably bounded (clears the CodeQL overflow alert) and the math import is no longer needed. Inline os.File.ReadAt with io.EOF handling in verifyShardFileBlocks and remove the now-redundant readFull helper (os.File.ReadAt fills the slice or errors). * test(ec_bitrot): use slices.Contains instead of a hand-rolled containsU32 * refactor(ec): fold the EcFiles WithContext variants into the base functions RebuildEcFiles now takes the ECContext directly (nil => derive from .vif as before) and WriteEcFiles takes it too (nil => default), removing the parallel RebuildEcFilesWithContext / WriteEcFilesWithContext names. Callers that had an explicit context drop the WithContext suffix; the default-context callers pass nil. No behavior change. refactor(ec): pass BackgroundECContext instead of nil to Write/RebuildEcFiles Add a non-nil BackgroundECContext placeholder (analogous to context.Background()) and have callers with no specific layout pass it instead of a nil ECContext. WriteEcFiles resolves a zero/background context to the default ratio and RebuildEcFiles resolves it from the .vif, so behavior is unchanged. fix(ec_bitrot): make BackgroundECContext a func; RebuildEcFiles fails closed on bad .vif - BackgroundECContext is now a function returning a fresh *ECContext, so callers cannot mutate a shared singleton or race on it (and it mirrors context.Background, which is also a function). - RebuildEcFiles now propagates the MaybeLoadVolumeInfo error: a present-but- unreadable .vif fails closed instead of silently rebuilding with the default ratio (which would corrupt a custom-ratio volume). Pass an explicit ctx to override.	2026-05-31 18:52:44 -07:00
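The per-shard CRC32C block checksums introduced in 9658f309d2 can be illustrated with the standard `hash/crc32` package and its Castagnoli polynomial. The function names and the in-memory layout are assumptions of this sketch; the actual .ecsum sidecar format (header, manifest, generations) is not reproduced here.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// castagnoli is the CRC32C table.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// blockChecksums returns one CRC32C per blockSize-sized block of a shard;
// the last block may be shorter. Hypothetical helper for illustration.
func blockChecksums(shard []byte, blockSize int) []uint32 {
	if blockSize <= 0 {
		return nil
	}
	var sums []uint32
	for off := 0; off < len(shard); off += blockSize {
		end := off + blockSize
		if end > len(shard) {
			end = len(shard)
		}
		sums = append(sums, crc32.Checksum(shard[off:end], castagnoli))
	}
	return sums
}

// corruptBlocks lists the block indexes whose checksum differs from the
// recorded one, including blocks present on only one side.
func corruptBlocks(shard []byte, blockSize int, want []uint32) []int {
	got := blockChecksums(shard, blockSize)
	n := len(got)
	if len(want) > n {
		n = len(want)
	}
	var bad []int
	for i := 0; i < n; i++ {
		if i >= len(got) || i >= len(want) || got[i] != want[i] {
			bad = append(bad, i)
		}
	}
	return bad
}

func main() {
	shard := []byte("0123456789") // blocks of 4: "0123", "4567", "89"
	recorded := blockChecksums(shard, 4)
	fmt.Println("clean:", corruptBlocks(shard, 4, recorded))

	shard[5] ^= 0xFF // flip bits inside the second block
	fmt.Println("after bit flip:", corruptBlocks(shard, 4, recorded))
}
```

Block-level checksums localize damage to one block, which is what lets a scrub or rebuild exclude a corrupt shard rather than trust it as Reed-Solomon input.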
Chris Lu	05c6500453	volume: fix maxVolumeCount dead zone that stalled writes on auto-sized disks (#9755) * volume: don't drop the last writable slot on auto-sized disks MaybeAdjustVolumeMax subtracted 1 from the per-disk slot count, so a disk with room for exactly one volume (free between 1x and 2x the size limit) reported 0 slots. The master then never grew a writable volume and every assign drained its retry budget, so writes failed with context deadline exceeded. Count the full volumes that actually fit, floored at one for an auto-sized disk that has free space. * mini: show disk and volume capacity in the startup banner Print free space, volume size, total volume count and free volume count under the data directory line, so a volume size limit that outstrips the disk is visible at startup instead of surfacing later as failed writes.	2026-05-30 23:45:17 -07:00
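The dead zone described in 05c6500453 is easy to see with the arithmetic written out. This is a sketch of the slot calculation only, with hypothetical function names; the real MaybeAdjustVolumeMax also accounts for volumes that already exist, which is left out here.

```go
package main

import "fmt"

// oldSlots mirrors the behavior the commit describes as buggy: one slot is
// always subtracted, so free space between 1x and 2x the limit yields 0.
func oldSlots(freeBytes, volumeSizeLimit uint64) uint64 {
	if volumeSizeLimit == 0 {
		return 0
	}
	n := freeBytes / volumeSizeLimit
	if n == 0 {
		return 0
	}
	return n - 1
}

// autoVolumeSlots counts the full volumes that fit, floored at one when the
// disk has any free space. Hypothetical helper for illustration.
func autoVolumeSlots(freeBytes, volumeSizeLimit uint64) uint64 {
	if volumeSizeLimit == 0 || freeBytes == 0 {
		return 0
	}
	n := freeBytes / volumeSizeLimit
	if n == 0 {
		n = 1
	}
	return n
}

func main() {
	const gib = uint64(1) << 30
	limit := 30 * gib
	for _, free := range []uint64{45 * gib, 90 * gib} {
		fmt.Printf("free=%d GiB old=%d new=%d\n", free/gib, oldSlots(free, limit), autoVolumeSlots(free, limit))
	}
}
```

With 45 GiB free and a 30 GiB limit the old calculation reports 0 slots, so the master never grows a writable volume; the new one reports 1.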
Chris Lu	5834c834e3	Refine enterprise edition feature blurb in version output and docs	2026-05-30 09:29:06 -07:00
Chris Lu	c9623007a2	fix(filer.sync): keep sync_offset fresh through filtered-event markers (#9733 ) On a read-only watched path the idle heartbeat keeps sync_offset fresh, but a busy source filer still emits a MaxUnsyncedEvents marker after many filtered events. The marker has a non-nil but empty EventNotification, so the client routed it to the event path, where it advanced no real watermark yet drove offsetFunc to republish the stale processed watermark — regressing the gauge between heartbeats and spiking the derived lag every time a filtered-event burst landed. Route the empty marker through OnIdleHeartbeat like the idle heartbeat so its fresh timestamp keeps the gauge current; it still advances the in-stream resume cursor.	2026-05-28 23:29:59 -07:00
Chris Lu	2f0643e5b1	fix(volume): stop flipping volumes read-only on a non-append-ordered .idx (#9726 ) * fix(volume): verify the .dat-tail needle in the integrity check CheckVolumeDataIntegrity checked the last entry by file position in the .idx and, for a live needle, flipped the volume read-only when fileSize > fileTailOffset. That entry is the .dat tail only when the .idx is in append order; a key-sorted .idx (weed fix and other rebuilds listed entries by key) puts the highest-key needle last, whose tail sits mid-file, so healthy volumes went read-only on every load and re-running weed fix only reproduced the sorted index. Locate the needle at the maximum offset — the one physically last in the .dat — and verify the .dat ends exactly at it, regardless of .idx ordering. The append-ordered common case stays O(1) (the last entry's on-disk end matches the .dat size); only a key-sorted index pays a single linear scan. Deletion tombstones at the tail are now verified too, instead of skipping the file-size check. * fix(command): weed fix rebuilds the .idx in .dat offset order SaveToIdx wrote entries via AscendingVisit — sorted by key, the .sdx/.ecx shape — so the rebuilt .idx put the highest-key needle last instead of the .dat-tail needle, and dropped tombstones whose live needle was gone. Collect the live and deleted entries, sort by .dat offset, and write them in append order so the .idx stays a faithful log whose last entry is the real .dat tail.	2026-05-28 18:04:31 -07:00
Chris Lu	dfd05d14cb	refactor(filer): remove the inode->path index and the NFS gateway (#9724 ) * fix(filer): derive inodes by hash instead of a snowflake sequencer Compute the same inode the FUSE mount would: non-hard-linked entries hash path + crtime, hard links hash their shared HardLinkId so every link resolves to one inode. Removes the snowflake inodeSequencer and the SEAWEEDFS_FILER_SNOWFLAKE_ID knob; inodes are now deterministic across filers. * chore: remove the experimental NFS gateway The NFS frontend ('weed nfs') was the only consumer of the inode->path index. Remove the weed/server/nfs package, the command and its registration, the integration test harness, and the CI workflow; go mod tidy drops the willscott/go-nfs and go-nfs-client dependencies. * refactor(filer): drop the inode->path index With the NFS gateway gone, nothing reads it. A regular file's inode is a pure hash of its path and a hard link's is a hash of its shared HardLinkId -- both derivable on demand -- so the secondary KV index and its write/remove hooks are dead. Removes filer_inode_index.go and the recordInodeIndex hooks from the store wrapper.	2026-05-28 15:00:18 -07:00
Chris Lu	3481f13f54	mount: route POSIX advisory locks to the owner filer under -dlm (#9669 ) With -dlm, GetLk/SetLk/SetLkw and the flush/release cleanup paths go to the inode's owner filer via the PosixLock RPC instead of the local table, so flock/fcntl are honored across mounts. Advisory locking rides the same switch as whole-file write coordination — and is therefore off under writeback cache, which implies single-writer. Keys are the inode identity (HardLinkId else path); SetLkw is client-side polling with the FUSE cancel channel (no server wait queue); a per-mount session id namespaces owners; a local hint avoids a release RPC on every close. Background unlock/release RPCs are bounded so a stuck filer can't hang close().	2026-05-24 23:56:37 -07:00
Chris Lu	25beb7ec48	admin: expose Prometheus metrics (#9652 ) * admin: add -metricsPort flag to expose Prometheus metrics The admin command had no metrics endpoint, so passing -metricsPort (as the operator does for spec.admin.metricsPort) crashed the process with "flag provided but not defined". Wire up -metricsPort/-metricsIp and start the shared Prometheus metrics server, matching filer, master, and volume. * admin: emit maintenance task and worker fleet metrics Add Prometheus metrics for the admin server's distinctive work: the maintenance task queue and the worker fleet that executes it. Task lifecycle: maintenance_tasks_by_status / _by_type gauges (snapshot of the queue), maintenance_tasks_completed_total{type,outcome} counter and maintenance_task_duration_seconds{type} histogram (recorded when a task reaches a terminal state), and last/next scan timestamp gauges. Worker fleet: workers_connected and worker_slots{used,max} gauges, plus worker_events_total{event} counting register/unregister/stale removals. Gauges are snapshotted by a background goroutine on the admin server; counters and the histogram are recorded at their event sites. * admin: read worker slot totals under lock, clear next-scan gauge when idle GetWorkers returns live worker pointers; summing CurrentLoad/MaxConcurrent outside the queue lock races with task assignment and completion. Add GetWorkerSlotTotals to aggregate under the lock. Also reset maintenance_next_scan_timestamp_seconds to 0 when the scanner is not running, so it can't retain a stale value after a stop.	2026-05-24 14:09:02 -07:00
Chris Lu	303c2be38d	feat(fix): rebuild lost EC index (.ecx) and .vif from local shards (#9596 ) weed fix -ecx reconstructs the .dat from the local data shards, scans the needles, and writes a fresh ascending-sorted .ecx containing only live entries — the same on-disk index WriteSortedFileFromIdx emits at encode time. When the .vif is also missing it is regenerated from the inferred EC ratio (flags > .vif > shard-count inference / 10+4) and the .dat size recovered from the scan. When some data shards are missing but at least dataShards shards survive, the missing shards are first reconstructed from the survivors via Reed-Solomon, so a partial shard set is repaired too. Also makes erasure_coding.WriteDatFile de-stripe using len(shardFileNames) instead of the DataShardsCount constant, so the caller's actual data-shard count is honored (behavior-preserving for the default 10, and fixing the existing caller that already passes ECContext.DataShards). This recovers an EC volume whose sealed index was lost from every node while enough shards survive, a state neither ec.rebuild nor ec.decode can repair because both require an existing .ecx. Flags: -ecx, -ecDataShards, -ecParityShards. Run with the volume server stopped.	2026-05-21 00:41:27 -07:00
Chris Lu	5af7d12f04	fix(filer.sync): keep sync_offset fresh while the source is read-only (#9589 ) * fix(filer.sync): keep sync_offset fresh while the source is read-only sync_offset holds the timestamp of the last replicated source event, so monitoring derives lag from now-sync_offset. A read-only source emits no metadata events, so the gauge froze at the last write and the derived lag grew without bound, making thresholds unusable. The source filer now sends an idle heartbeat carrying its current time while a subscriber is caught up to the buffer head. filer.sync uses it to advance the gauge, so now-sync_offset reflects real lag. Heartbeats are opt-in (client_supports_idle_heartbeat), are never written to the metadata log, and do not move the resume checkpoint, so a restart still resumes from the last real event. * fix(filer.sync): gate idle heartbeat on the read cursor, not SinceNs In metadata-chunks mode persisted entries replay as log file refs and never reach eachLogEntryFn, so lastSeenTsNs stays put and a caught-up subscriber with an old SinceNs would never get a heartbeat. Use the read cursor (lastReadTime), which advances in that mode too, max'd with lastSeenTsNs so the in-memory backlog-then-idle case still works while the cursor returned to the caller has not yet updated.	2026-05-20 11:26:37 -07:00
Lars Lehtonen	9914e6af30	chore(weed/command): prune unused functions (#9573) * chore(weed/command): prune unused functions * drop now-unused closed field and renderLocked guard --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-05-19 17:45:50 -07:00
Chris Lu	3d872a1416	fix(filer): load -s3.config static identities into the filer's CredentialManager (#9537) When weed filer started its embedded S3 gateway with -s3 -s3.config, only the S3 server loaded the s3.json static identities — the filer's own CredentialManager stayed empty, so the IAM gRPC service backing the admin UI and weed shell returned only dynamic users. Mirror the wiring weed server already does and hand the same config path to the filer.	2026-05-18 13:41:30 -07:00
Chris Lu	6b94701213	mini: quieter startup with a docker-compose-style progress board (#9524 ) * mini: quieter startup with a docker-compose-style progress board Replaces noisy startup/shutdown logs with a single in-place progress table on a TTY (or one line per state change off-TTY). Each component renders as `pending -> starting -> ready` during startup and `stopping -> stopped` during shutdown, with elapsed time on transition. Also folds in a few cleanups uncovered while making this readable: - route the admin.go startup prints through glog so quietMiniLogs() filters them under mini but standalone weed admin still shows them - generate a dev SSE-S3 KEK + passphrase on first run via WEED_S3_SSE_KEK and WEED_S3_SSE_KEK_PASSPHRASE env vars (viper.Set has a nested-key conflict between s3.sse.kek and s3.sse.kek.passphrase); persisted under the data folder so restarts reuse the same key - demote worker/master gRPC Recv 'context canceled' to V(1); those are the normal shutdown signal, not Errors/Warnings - drop the 'Optimized Settings' block and the 'credentials loaded from environment variables' message from the welcome banner - only show the credentials setup hints when no S3 identities exist (new s3api.HasAnyIdentity accessor backed by an atomic.Bool) - use S3_BUCKET in the credentials hint so it pairs with AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY - reorder running-services list to master / volume / filer / webdav / s3 / iceberg / admin * mini: refuse in-memory-only SSE-S3 dev keys; surface admin serve errors loadOrCreateMiniHexSecret returns "" when os.WriteFile fails, so SSE-S3 won't encrypt data under a KEK that the next restart can't reproduce (which would orphan whatever was written this run). The caller already treats "" as "skip setting WEED_S3_SSE_* env vars", so SSE-S3 and IAM just stay disabled for this run. startAdminServer's serve goroutine used to only log ListenAndServe failures, so a bind error left the caller blocked on ctx.Done() with no listener. 
Forward the error through a buffered channel and select on it alongside ctx.Done(). * ci(s3-proxy-signature): match weed mini's new progress-board ready line The readiness probe grepped for "S3 (gateway\|service).*(started\|ready)", which matched weed mini's old "S3 service is ready at ..." line. Mini now emits " S3 ready (Xs)" from its progress board, so the old pattern misses and the test timed out at the 30-second wait. Widen the alternation to also accept "S3\s+ready". The curl HEAD fallback already covers any remaining cases.	2026-05-17 19:13:09 -07:00
Chris Lu	62821964dd	filer/iam-grpc: make admin Bearer auth opt-in (fixes #9509 ) (#9514 ) PR #9442 made the filer refuse to register the IAM gRPC service unless jwt.filer_signing.key was set in security.toml, which broke the admin UI Users/Groups/Policies pages for every deployment that ships without a security.toml — weed mini, plain Helm, vanilla weed filer. The Users tab returns Unimplemented and the page is unusable. Issues #9504, #9505 and #9509 all trace to this gap. The rest of the filer's gRPC surface is unauthenticated by default; treat IAM the same way. The service now always registers, and the auth gate is a no-op when no signing key is configured. When the key is set, every RPC still requires an admin-signed Bearer token, matching the post-#9442 behaviour. Operators who expose the filer gRPC port beyond a trusted network should set the key on both filer and admin. The admin client (IamGrpcStore.withIamClient) already skips attaching the authorization metadata when its key is empty, so no changes there.	2026-05-15 13:15:20 -07:00
Chris Lu	bfb2661fec	fix(tests): make 32-bit GOARCH tests build and run (#9507 ) fix(tests): make 32-bit GOARCH tests build and run (#9503) verifyTestFilerClient had bare int64 atomic counters after a map header, so atomic.AddInt64 panicked with "unaligned 64-bit atomic operation" on linux/386. Switch to atomic.Int64, which the stdlib guarantees is 8-byte aligned on all platforms. rpc_version_filter_test.go passed the untyped constant 0xdeadbeef to t.Errorf, where it default-promoted to int and overflowed 32-bit int. Bind it to a typed uint32 const used in both the comparison and the error message.	2026-05-14 20:55:37 -07:00
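The alignment fix in the 32-bit test commit above can be illustrated with a small struct. This is a sketch, not the verifyTestFilerClient type: the counters struct and countConcurrently are hypothetical. On 32-bit platforms a bare int64 field placed after a pointer-sized field may be only 4-byte aligned, whereas sync/atomic's Int64 type (Go 1.19+) is documented to be 8-byte aligned everywhere.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// counters is a hypothetical stand-in for a test client that keeps a map
// followed by an atomically updated counter.
type counters struct {
	seen  map[string]bool // pointer-sized header; 4 bytes on linux/386
	calls atomic.Int64    // safe on 32-bit, unlike a bare int64 used with atomic.AddInt64
}

// countConcurrently increments the counter from n goroutines and returns
// the final value.
func countConcurrently(n int) int64 {
	c := &counters{seen: map[string]bool{}}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.calls.Add(1)
		}()
	}
	wg.Wait()
	return c.calls.Load()
}

func main() {
	fmt.Println("calls:", countConcurrently(100))
}
```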
Chris Lu	f51468cf73	Revert #9443 — heartbeat peer binding breaks hostname-based clusters (#9474 ) Revert "master: bind heartbeat claims to the connecting peer (#9443)" This reverts commit `f28c7ce6df`. The strict heartbeat-ip-vs-peer match in authorizeHeartbeatPeer rejects every hostname-based deployment. In docker-compose / k8s the volume server is started with -ip=<service-name> and the gRPC peer surfaces as the container/pod IP, so the two never match and every heartbeat fails with `heartbeat ip "volume" does not match peer "172.18.0.3"`. The master therefore never learns about any volume, growth fails, and fio writes against the mount return EIO. After the #9440 revert merged (`43a8c4fdc`), the e2e workflow is still failing for this reason; see https://github.com/seaweedfs/seaweedfs/actions/runs/25767265775 . Reverting to unblock e2e. A narrower re-do should accept the heartbeat when heartbeat.Ip resolves (DNS) to the peer address, so the spoof hardening can return without breaking hostname-based clusters.	2026-05-12 18:22:21 -07:00
Chris Lu	43a8c4fdca	Revert #9440 — volume admin fail-closed gate breaks multi-host clusters (#9472 ) * Revert "volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)" This reverts commit `21054b6c18`. The fail-closed gate broke any multi-host cluster: in compose / k8s / remote-host deployments the master's IP isn't loopback, so every master->volume admin RPC (AllocateVolume, BatchDelete, EC reroute, vacuum, scrub, ...) is rejected with PermissionDenied unless the operator manually configures -whiteList. The e2e workflow has been failing since `10cc06333` with `not authorized: 172.18.0.2` on AllocateVolume; downstream symptom is fio fsync EIO because zero volumes can be grown. The gate's intent was to lock down destructive admin tooling, but the same RPCs are the master's normal mechanism for growing and managing volumes. Reverting to restore cluster-internal operation; a narrower re-do should distinguish operator/admin callers from the master peer (e.g. trust IPs resolved from -master) before going back in. * security: skip invalid CIDR in UpdateWhiteList so IsWhiteListed can't panic The revert in the previous commit also rolled back an unrelated bug fix that lived inside #9440: UpdateWhiteList logged on net.ParseCIDR error but did not continue, so the nil *net.IPNet was stored in whiteListCIDR and IsWhiteListed would panic dereferencing cidrnet.Contains(remote) on the next gRPC admin check. Restore the continue. Orthogonal to the fail-closed semantics this PR is reverting.	2026-05-12 16:00:44 -07:00
Chris Lu	f28c7ce6df	master: bind heartbeat claims to the connecting peer (#9443 ) SendHeartbeat used to accept whatever Ip/Port/Volumes the caller put on the wire. Three changes tighten that: - Reject heartbeats whose Ip does not match the gRPC peer's source address. Loopback peers are still trusted; operators behind a proxy can opt out with -master.allowUntrustedHeartbeat. - Track which (ip, port) first claimed a volume id or an ec shard slot and drop foreign re-claims. Non-EC volume claims are bounded by the replica copy count so legitimate replicas still register. EC ownership is keyed by (vid, shard_id) so the same vid can legitimately be split across many peers as long as their EcIndexBits are disjoint; rejected bits are cleared from the bitmap and the parallel ShardSizes array is compacted in lock-step. - Maintain reverse indexes owner -> volumes and owner -> ec shard slots so disconnect cleanup is O(M) in what that peer held rather than O(N) over the whole map. Bindings are also released when a heartbeat reports that the peer no longer holds an id, either via explicit Deleted{Volumes,EcShards} entries or by omitting it from a full snapshot. Without this, a planned rebalance that moved a vid or an ec shard from peer A to peer B would leave B's heartbeats permanently filtered out until A disconnected, breaking ec encode/decode flows that delete shards on the source as soon as the move completes. The (vid -> owners) binding still does not track which replica slot each peer occupies, so the first N claims under the copy count win; strict per-slot mapping is a follow-up.	2026-05-12 15:38:52 -07:00
Chris Lu	21054b6c18	volume: fail closed in admin gRPC gate when no whitelist is configured (#9440 ) Add Guard.IsAdminAuthorized, a fail-closed variant of IsWhiteListed, and use it to gate destructive volume admin RPCs. IsWhiteListed keeps its allow-all-when-empty semantics for HTTP compatibility. For TCP peers with an empty whitelist, off-host callers are rejected but loopback (127.0.0.0/8, ::1) is still trusted. A volume server commonly cohabits with the master/filer on a single host and in integration-test clusters; the loopback exception keeps cluster-internal admin traffic working without -whiteList while still locking out off-host attackers. Non-TCP peers (in-process / bufconn / unix-socket) bypass the host check entirely. When `weed server` runs master+volume+filer in a single process the master dials the volume server in-process and the peer address surfaces as "@", which has no parseable IP. Such a caller shares our OS process and cannot be spoofed by a remote attacker, so we treat it as trusted by construction. The gate also tolerates a nil guard (developmental / embedded path) and only enforces once a guard is wired up. UpdateWhiteList skips entries whose CIDR fails to parse so the IP-iteration path can no longer hit a nil *net.IPNet.	2026-05-12 12:35:27 -07:00
Chris Lu	69da20bdae	volume: gate FetchAndWriteNeedle behind admin auth and refuse internal endpoints (#9441 ) volume: require admin auth and refuse loopback endpoints in FetchAndWriteNeedle Gate the RPC behind checkGrpcAdminAuth for parity with the rest of the destructive volume-server RPCs, and reject cluster-internal remote S3 endpoints (loopback / link-local / IMDS / RFC 1918 / CGNAT) before dialing. Pin the validated address against DNS rebinding by routing the AWS SDK through an HTTP transport whose DialContext re-resolves the host and re-applies the deny list on every dial, so an endpoint that resolves to a public IP at validate-time and then flips to 127.0.0.1 at connect time is refused. Operators that legitimately fetch from private hosts can opt out with -volume.allowUntrustedRemoteEndpoints.	2026-05-12 10:11:20 -07:00
Chris Lu	5e8f99f40a	filer: require admin-signed JWT on the IAM gRPC service (#9442 ) Every IAM RPC (CreateUser, PutPolicy, CreateAccessKey, ...) now requires a Bearer token in the authorization metadata, signed with the filer write-signing key. The service refuses to register on a filer that has no jwt.filer_signing.key set, so the unauthenticated default is gone: operators who use these RPCs must configure the key and attach a token on every call. Bearer scheme matching is case-insensitive (RFC 6750), every handler nil-checks req before dereferencing it, and tests now cover the expired-token path.	2026-05-12 10:11:08 -07:00
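Case-insensitive Bearer scheme matching, as mentioned in the IAM gRPC commit above, can be sketched like this. bearerToken is a hypothetical helper, not the filer's actual parser, and JWT signature verification is out of scope here.

```go
package main

import (
	"fmt"
	"strings"
)

// bearerToken extracts the token from an authorization value such as
// "Bearer abc". The scheme is compared case-insensitively, so "bearer"
// and "BEARER" are accepted as well.
func bearerToken(header string) (string, bool) {
	scheme, token, found := strings.Cut(strings.TrimSpace(header), " ")
	if !found || !strings.EqualFold(scheme, "Bearer") {
		return "", false
	}
	token = strings.TrimSpace(token)
	if token == "" {
		return "", false
	}
	return token, true
}

func main() {
	for _, h := range []string{"Bearer abc", "bearer abc", "Basic abc", "Bearer"} {
		token, ok := bearerToken(h)
		fmt.Printf("%q -> token=%q ok=%v\n", h, token, ok)
	}
}
```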
Chris Lu	05d31a04b6	fix(s3tests): wire lifecycle worker for expiration suite (#9374 ) * fix(s3tests): wire lifecycle worker for expiration suite The upstream s3-tests `test_lifecycle_expiration` / `test_lifecyclev2_expiration` exercise the "set rule, wait, verify deletion" path. Phase 4 (#9367) intentionally stripped the PUT-time back-stamp, so pre-existing objects no longer pick up TtlSec on a freshly-applied rule. The s3tests CI bare-bones `weed -s3` had nothing left driving expiration. Three changes that work together: - Engine scales `Days` by `util.LifeCycleInterval`. Production keeps the 24h day; the `s3tests` build tag shrinks it to 10s so a `Days: 1` rule completes inside the suite's 30s polling window. Exported `DaysToDuration` so sibling-package tests pin to the same scale. - Scheduler/dispatcher tick defaults split into `_default` / `_s3tests` files. Production stays 5s/30s/5m; the test build runs at 500ms/2s/2s so deletions land within a couple ticks of becoming due. - s3tests.yml spawns `weed shell s3.lifecycle.run-shard -shards 0-15 -events 0 -runtime 1800s` alongside the s3 server in both the basic and SQL blocks; the shell command runs the full pipeline (reader + scheduler + dispatcher) for the duration of the suite. `test_lifecycle_expiration_versioning_enabled` is left out for now — versioned-bucket expiration via the worker still needs its own pass. Drive-by: bump `TestWorkerDefaultJobTypes` to 7 to match the registered handler count (`8b87ceb0d` updated `mini_plugin_test.go` for the s3_lifecycle plugin but missed this twin test). Two retention-gate engine tests `t.Skip` under the s3tests build because they rely on absolute lookback-vs-retention math the day-rescale collapses; the prod build still covers them. 
* review: harden lifecycle worker spawn + assert handler identity - Workflow: aliveness check on the backgrounded `weed shell` (a bad command exits in <1s and the suite would otherwise just opaque-timeout); move worker/server teardown into a `trap cleanup EXIT` so failure paths still print the worker log and reap the data dir. - worker_test: check the actual job-type set by name, not just the count. * fix(shell): keep s3.lifecycle.run-shard alive when no rules exist yet The s3-tests CI runs the worker BEFORE any test creates a bucket, so LoadCompileInputs returns empty and the shell command was bailing out with "no buckets with enabled lifecycle rules found" within ~1s. The aliveness check then fired exit 1 before tox ever started. Two changes: - Don't early-exit on empty inputs. Compile against the empty set, log a one-liner, and let the pipeline run normally — the meta-log subscription is already up, so events for buckets created later DO arrive; they just need the engine to know about them when they do. - Add `-refresh <duration>` (default 5m, 2s in s3tests CI) that periodically re-runs LoadCompileInputs + engine.Compile so rules added after startup land in the snapshot the dispatcher reads on its next tick. Production deployments keep the 5m default; only the CI workflow drops to 2s. Workflow passes `-refresh 2s` in both basic and SQL blocks. * fix(shell): backfill pre-rule entries via bootstrap walker The reader-driven path only sees meta-log events created AFTER its engine snapshot knows the rule. The s3-tests CI scenario PUTs objects first, then PUTs the lifecycle config, so by the time the engine refresh picks up the new bucket the object events have already been seen-and-dropped (BucketActionKeys returned empty for the bucket). Wire bootstrap.Walk into the shell command: - bucketBootstrapper tracks buckets seen so far. kickOffNew spawns one loop goroutine per fresh bucket. 
- Each goroutine re-walks the bucket every walkInterval (defaults to the same value as -refresh, i.e. 2s in s3tests CI, 5m in prod) and feeds each entry through bootstrap.Walk; due actions dispatch via a direct LifecycleDelete RPC. Not-yet-due entries are silently skipped and picked up on a later iteration once they age past their (rescaled or real) threshold. - LifecycleDelete is called with no expected_identity; the server-side identityMatches treats nil as "skip CAS", which is the right call for bootstrap (the bootstrap entry doesn't carry chunk fid / extended hash anyway). The dispatcher's pkg-private toProtoActionKind is duplicated in the shell file rather than exported, since the shape is six lines and the reverse import would pull a proto dep into the s3lifecycle root. * refactor(s3/lifecycle): hoist bucket bootstrapper into scheduler pkg The shell command got the backfill in the previous commit but the worker plugin (weed/worker/tasks/s3_lifecycle/handler.go) drives Scheduler.Run directly and missed it — same root cause: the reader-driven path only sees events created after the rule lands, so a daily cron picking up a freshly-PUT rule wouldn't expire any pre-rule object. Move the looping bucket walker into scheduler.BucketBootstrapper: - Scheduler.Run now constructs one and calls KickOffNew on every engine refresh. Per-bucket goroutines re-walk every BootstrapWalkInterval (defaults to RefreshInterval — 5m in prod, 2s under s3tests). - The shell command consumes the same struct instead of its own copy so the two paths can't drift in semantics. * refactor(s3/lifecycle): walk-once + schedule via event injection Previous per-bucket walker re-listed every WalkInterval forever. For a bucket with N objects under a long rule, the worker did O(N * runtime / walkInterval) listings even when nothing was newly due — way too much for production-scale buckets. 
New approach: walk each bucket exactly once on first sight, synthesize one reader.Event per existing entry, push it onto Pipeline.events. Router.Route builds a Match with DueTime=mtime+delay; future-due matches sit in the per-shard Schedule and fire when their DueTime arrives. Currently-due matches fire on the very next dispatch tick. Wiring: - dispatcher.Pipeline lifts its events channel into a struct field with sync.Once init, and exposes InjectEvent(ctx, ev). Reader no longer closes the channel — the dispatch goroutine exits on runCtx cancellation, which works the same as channel-close did. - scheduler.BucketBootstrapper drops the WalkInterval ticker. KickOffNew spawns one walker goroutine per fresh bucket; the goroutine lists, synthesizes events, then exits. - scheduler.Scheduler builds its pipelines up front and exposes a pipelineFanout (shard -> Pipeline) as the EventInjector, so a multi- worker scheduler routes each synthesized event to the pipeline that owns its shard. - Shell command's single-pipeline path passes pipeline.InjectEvent directly. Synthesized events carry TsNs=0; dispatcher.advance treats that as a no-op so the reader's persisted cursor isn't ratcheted past unprocessed meta-log events. Identity (HeadFid + ExtendedHash) is still computed from the real filer entry, so the server's identity-CAS catches an overwrite between bootstrap and dispatch. debug(s3tests): make lifecycle worker progress visible in CI logs The previous CI failure dumped an empty $LC_LOG even though the worker was running. Two reasons: 1. weed shell suppresses glog by default (logtostderr / alsologtostderr set to false). Pass `-debug` so the bootstrapper's V(0) lines reach stderr instead of disappearing into /tmp/weed..log. 2. cleanup used `kill -9` which skips Go's stdout flush. SIGTERM first with a 1s grace, then SIGKILL the holdout, then read the log. 
While here: bump the bootstrap walker's two informational logs to V(0) so the diagnosis from CI doesn't require -v=1 on the worker. fix(s3/lifecycle/dispatcher): refresh snap on every event Pipeline.Run captured snap at startup and only refreshed it on the dispatch tick. With bootstrap event injection, the walker pushes events seconds after engine.Compile sees the bucket — typically WITHIN the same dispatch interval. Routing against the cached (empty) snap then silently dropped every match because BucketActionKeys returned nil for the bucket-not-yet-in-snapshot case. Re-fetch on each event. Engine.Snapshot is an atomic.Pointer.Load, so the cost is negligible. The dispatch-tick branch keeps using a fresh local read for its own loop, so its semantics are unchanged.	2026-05-08 17:29:47 -07:00
Chris Lu	8b87ceb0d1	refactor(s3api): strip back-stamp from PutBucketLifecycleConfiguration (Phase 4) (#9367 ) * refactor(s3api): strip back-stamp from PutBucketLifecycleConfiguration The handler used to walk every existing entry under the rule's prefix and stamp entry.Attributes.TtlSec + the SeaweedFSExpiresS3 flag so that the filer's compaction filter would expire them. With the event-driven lifecycle worker live, that retroactive walk is redundant — the worker drives expiration off the meta-log and a one-time bootstrap scan, so a PUT lifecycle stays O(rules) instead of O(objects). New writes still inherit TTL from the filer.conf location entry above; that volume-routing path is unchanged here and will move to an explicit operator command later (Phase 11). Drops updateEntriesTTL + processDirectoryTTL + processTTLBatch + updateEntryTTL from filer_util.go. * fix(s3api): clear stale lifecycle TTL entries on PUT PutBucketLifecycleConfiguration only ever appended/updated filer.conf entries — it never cleared ones the operator removed, renamed-prefix on, disabled, retagged with a tag filter, or bucket-versioned out of the fast path. The stale day-TTL kept routing new writes (and would expire old ones if any landed under the prefix) after the policy was updated. Treat PUT as a full replacement: walk this bucket's existing day-TTL entries, clear them, then add fresh entries from the new rule set. * test(command): bump mini default plugin job-type count to 7 The s3_lifecycle plugin handler registered in #9362 is the seventh default; the test still asserted six. * fix(s3api): delete stale lifecycle PathConf instead of blanking Ttl Just clearing pathConf.Ttl leaves the rule's Collection, Replication, and VolumeGrowthCount in place, so new writes still match the stale prefix and inherit outdated routing/placement. Use fc.DeleteLocationConf so the lifecycle-owned PathConf goes away entirely. Same fix in DeleteBucketLifecycleHandler, which had the same bug.	
2026-05-08 11:03:03 -07:00
Chris Lu	c567da7164	feat(s3): register SeaweedS3LifecycleInternal gRPC service (#9359) Phase 2 added the LifecycleDelete handler on S3ApiServer but never registered it on a running gRPC server, so workers had no endpoint to dial. Embed UnimplementedSeaweedS3LifecycleInternalServer on S3ApiServer and register it on the s3 command's grpc server alongside SeaweedS3IamCacheServer.	2026-05-07 18:19:42 -07:00
Chris Lu	1c0e24f06a	fix(balance): don't move remote-tiered volumes; don't fatal on missing .idx (#9335 ) * fix(volume): don't fatal on missing .idx for remote-tiered volume A .vif left behind without its .idx (orphaned by a crashed move, partial copy, or hand-edit) would trip glog.Fatalf in checkIdxFile and take the whole volume server down on boot, killing every healthy volume on it too. For remote-tiered volumes treat it as a per-volume load error so the server can come up and the operator can clean up the stray .vif. Refs #9331. * fix(balance): skip remote-tiered volumes in admin balance detection The admin/worker balance detector had no equivalent of the shell-side guard ("does not move volume in remote storage" in command_volume_balance.go), so it scheduled moves on remote-tiered volumes. The "move" copies .idx/.vif to the destination and then calls Volume.Destroy on the source, which calls backendStorage.DeleteFile — deleting the remote object the destination's new .vif now points at. Populate HasRemoteCopy on the metrics emitted by both the admin maintenance scanner and the worker's master poll, then drop those volumes at the top of Detection. Fixes #9331. * Apply suggestion from @gemini-code-assist[bot] Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * fix(volume): keep remote data on volume-move-driven delete The on-source delete after a volume move (admin/worker balance and shell volume.move) ran Volume.Destroy with no way to opt out of the remote-object cleanup. Volume.Destroy unconditionally calls backendStorage.DeleteFile for remote-tiered volumes, so a successful move would copy .idx/.vif to the destination and then nuke the cloud object the destination's new .vif was already pointing at. Add VolumeDeleteRequest.keep_remote_data and plumb it through Store.DeleteVolume / DiskLocation.DeleteVolume / Volume.Destroy. 
The balance task and shell volume.move set it to true; the post-tier-upload cleanup of other replicas and the over-replication trim in volume.fix.replication also set it to true since the remote object is still referenced. Other real-delete callers keep the default. The delete-before-receive path in VolumeCopy also sets it: the inbound copy carries a .vif that may reference the same cloud object as the existing volume. Refs #9331. * test(storage): in-process remote-tier integration tests Cover the four operations the user is most likely to run against a cloud-tiered volume — balance/move, vacuum, EC encode, EC decode — by registering a local-disk-backed BackendStorage as the "remote" tier and exercising the real Volume / DiskLocation / EC encoder code paths. Locks in: - Destroy(keepRemoteData=true) preserves the remote object (move case) - Destroy(keepRemoteData=false) deletes it (real-delete case) - Vacuum/compact on a remote-tier volume never deletes the remote object - EC encode requires the local .dat (callers must download first) - EC encode + rebuild round-trips after a tier-down Tests run in-process and finish in under a second total — no cluster, binary, or external storage required. * fix(rust-volume): keep remote data on volume-move-driven delete Mirror the Go fix in seaweed-volume: plumb keep_remote_data through grpc volume_delete → Store.delete_volume → DiskLocation.delete_volume → Volume.destroy, and skip the s3-tier delete_file call when the flag is set. The pre-receive cleanup in volume_copy passes true for the same reason as the Go side: the inbound copy carries a .vif that may reference the same cloud object as the existing volume. The Rust loader already warns rather than fataling on a stray .vif without an .idx (volume.rs load_index_inmemory / load_index_redb), so no counterpart to the Go fatal-on-missing-idx fix is needed. Refs #9331. 
* fix(volume): preserve remote tier on IO-error eviction; fix EC test target Two review nits: - Store.MaybeAddVolumes' periodic cleanup pass deleted IO-errored volumes with keepRemoteData=false, so a transient local fault on a remote-tiered volume would also nuke the cloud object. Track the delete reason via a parallel slice and pass keepRemoteData=v.HasRemoteFile() for IO-error evictions; TTL-expired evictions still pass false. - TestRemoteTier_ECEncodeDecode_AfterDownload deleted shards 0..3 but called them "parity" — by the klauspost/reedsolomon convention shards 0..DataShardsCount-1 are data and DataShardsCount..TotalShardsCount-1 are parity. Switch the loop to delete the parity range so the intent matches the indices. --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-05-06 15:19:43 -07:00
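The eviction rule described in the last fix above (keep the remote object on IO-error evictions of remote-tiered volumes, delete it on TTL expiry) can be sketched as a small decision function. This is an illustrative sketch only: `deleteReason`, `reasonTTLExpired`, `reasonIOError`, and `keepRemoteDataFor` are hypothetical names, not identifiers from the SeaweedFS code.

```go
package main

import "fmt"

// deleteReason is a hypothetical stand-in for the "parallel slice" of
// delete reasons mentioned in the commit message.
type deleteReason int

const (
	reasonTTLExpired deleteReason = iota // volume outlived its TTL
	reasonIOError                        // volume hit a local IO fault
)

// keepRemoteDataFor returns the keepRemoteData value to pass on delete:
// IO-error evictions keep the cloud object when the volume has one,
// TTL-expired evictions always pass false.
func keepRemoteDataFor(reason deleteReason, hasRemoteFile bool) bool {
	return reason == reasonIOError && hasRemoteFile
}

func main() {
	fmt.Println(keepRemoteDataFor(reasonIOError, true))    // true
	fmt.Println(keepRemoteDataFor(reasonIOError, false))   // false
	fmt.Println(keepRemoteDataFor(reasonTTLExpired, true)) // false
}
```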
Chris Lu	6141222ab0	fix(test/s3/policy): allocate fresh admin port per subtest (#9332 ) * fix(test/s3/policy): allocate fresh admin port per subtest startMiniCluster ran weed mini in-process and explicitly assigned master/volume/filer/s3 ports allocated by MustAllocatePorts, but it left -admin.port and -admin.port.grpc unset, so each subtest reused the hardcoded defaults 23646 / 33646. The package's subtests run sequentially within the same go test process. The previous subtest's admin goroutine is still bound to 23646 by the time the next subtest spins up its own mini, so the new admin can never bind, mini.go's waitForAdminServerReady hits its 240-attempt cap, and glog.Fatalf kills the test binary. This has been the dominant cause of "admin server did not become ready" flakes across recent IAM PRs. Allocate two extra ports for admin and pass them through. The other subprocess-based tests (s3tables/) are not affected because each launches weed mini in a fresh OS process. fix(mini): make admin readiness wait context-aware waitForAdminServerReady polled for 240 attempts × 500ms regardless of whether the surrounding mini context was cancelled. When mini is run in-process from a test harness (test/s3/policy/...) and the test calls its cancel func, the leftover wait keeps spinning for the full two minutes and then glog.Fatalf's, terminating the entire test binary — including any sibling subtest that has since started its own mini. Thread the existing miniClientsCtx through the wait so a Stop / cancel returns context.Canceled immediately. The caller (startMiniAdminWithWorker) treats a context-cancelled outcome as a graceful shutdown signal and logs+returns instead of fataling.	2026-05-05 11:24:43 -07:00
Chris Lu	95560076e6	fix(mini): raise admin readiness timeout to 2 minutes (#9329) The 30-second ceiling on waitForAdminServerReady was too tight on busy CI runners. master + filer + volume + admin all start in parallel on a shared worker, and S3 Policy Shell Integration Tests has been flaking across multiple PRs with "admin server did not become ready... after 60 attempts" even though the server still comes up within a minute or two. Two minutes (240 attempts at 500ms) leaves headroom for runner contention without being absurd in a local-dev run.	2026-05-05 07:59:25 -07:00
Chris Lu	d605feb403	refactor(command): expand "~" in all path-style CLI flags (#9306 ) * refactor(command): expand "~" in all path-style CLI flags Many of weed's path-bearing flags (-s3.config, -s3.iam.config, -admin.dataDir, -webdav.cacheDir, -volume.dir.idx, TLS cert/key files, profile output paths, mount cache dirs, sftp key files, ...) were never run through util.ResolvePath, so a value like "~/iam.json" was used literally. Tilde only worked when the shell expanded it, which silently fails for the common -flag=~/path form (bash leaves the tilde literal in --opt=~/path). - Extend util.ResolvePath to also handle "~user" / "~user/rest", matching shell tilde expansion. Add unit tests. - Apply util.ResolvePath at the top of each shared start* function (s3, webdav, sftp) so mini/server/filer/standalone callers all inherit it; resolve at the few one-off use sites (mount cache dirs, volume idx folder, mini admin.dataDir, profile paths). - Drop the duplicate expandHomeDir helper from admin.go in favor of the now-equivalent util.ResolvePath. * fixup: handle comma-separated -dir flags for tilde expansion `weed mini -dir`, `weed server -dir`, and `weed volume -dir` accept comma-separated paths (`dir[,dir]...`). Calling util.ResolvePath on the whole string mishandled multi-folder values with tilde, e.g. "~/d1,~/d2" would resolve as if "d1,~/d2" were a single subpath. - Add util.ResolveCommaSeparatedPaths: split on ",", run each entry through ResolvePath, rejoin. Short-circuits when no "~" present. - Use it for miniDataFolders (mini.go), volumeDataFolders (server.go), and resolve each entry of v.folders in-place (volume.go) so all downstream consumers see resolved paths. - Add 7-case TestResolveCommaSeparatedPaths covering empty, single, multiple, and mixed inputs. * address PR review: metaFolder + Windows backslash - master.go: resolve m.metaFolder at the top of runMaster so util.FullPath(m.metaFolder) on the next line sees an expanded path. 
Drop the now-redundant ResolvePath in TestFolderWritable. - server.go: same treatment for masterOptions.metaFolder, paired with the existing cpu/mem profile resolves. Drop the redundant inner ResolvePath at TestFolderWritable. - file_util.go: ResolvePath now accepts filepath.Separator as a separator after the tilde, so "~\\data" works on Windows. Other platforms keep current behaviour (backslash stays literal because it is a valid filename character in usernames and paths). - file_util_test.go: add two cases using filepath.Separator that exercise the new code path on Windows and remain a no-op on Unix. address PR review: resolve "~" in remaining command path flags Comprehensive sweep of path-bearing flags across every weed subcommand, applying util.ResolvePath in-place at the top of each run* function so all downstream consumers see expanded paths. - webdav.go: resolve wo.cacheDir at the top of startWebDav so mini/server/filer/standalone callers all inherit it. - mount_std.go: cpu/mem profile paths. - filer_sync.go: cpu/mem profile paths. - mq_broker.go: cpu/mem profile paths. - benchmark.go: cpuprofile output path. - backup.go: -dir resolved once at runBackup; drop the duplicated inline ResolvePath in NewVolume calls. - compact.go: -dir resolved at runCompact; drop inline ResolvePath. - export.go: -dir and -o resolved at runExport; drop inline ResolvePath in LoadFromIdx and ScanVolumeFile. - download.go: -dir resolved at runDownload; drop inline. - update.go: -dir resolved at runUpdate so filepath.Join uses the expanded path; drop inline ResolvePath in TestFolderWritable. - scaffold.go: -output expanded before filepath.Join. - worker.go: -workingDir expanded before being passed to runtime. address PR review: resolve option-struct paths at run* entry points server.go:381 propagates s3Options.config to filerOptions.s3ConfigFile before startS3Server runs, which meant the filer-side code saw the unresolved tilde-prefixed pointer. 
Same pattern for webdavOptions and sftpOptions (and equivalent in mini.go / filer.go). The fix: hoist resolution from the shared start* functions up to the run* entry points, where every shared pointer is set up before any propagation happens. - s3.go, webdav.go, sftp.go: extract a resolvePaths() method on each Options struct that runs every path field through util.ResolvePath in-place. Idempotent. - runS3, runWebDav, runSftp: call the standalone struct's resolvePaths before starting metrics / loading security config. - runServer, runMini, runFiler: call resolvePaths on every embedded options struct, plus resolve loose flags (serverIamConfig, miniS3Config, miniIamConfig, miniMasterOptions.metaFolder, and filer's defaultLevelDbDirectory) so they're expanded before any pointer copy or use. - Drop the now-redundant inline ResolvePath at filer's defaultLevelDbDirectory composition. * address PR review: re-resolve mini -dir post-config, cover misc paths - mini.go: applyConfigFileOptions can overwrite -dir with a literal ~/data from mini.options. Re-resolve miniDataFolders after the config-file apply, alongside the other path resolves, so the mini filer no longer ends up with a literal ~/data/filerldb2. - benchmark.go: resolve b.idListFile (-list). - filer_sync.go: resolve syncOptions.aSecurity / .bSecurity (-a.security / -b.security) before LoadClientTLSFromFile. - filer_cat.go: resolve filerCat.output (-o) before os.OpenFile. - admin.go: drop trailing blank line at EOF (git diff --check). * address PR review: resolve -a.security/-b.security/-config before use Three follow-up fixes: - filer_sync.go: the -a.security / -b.security resolves were placed after LoadClientTLSFromFile / LoadHTTPClientFromFile were called, so weed filer.sync -a.security=~/a.toml still passed the literal tilde path. Hoist the resolves above the security-loading block so TLS clients see expanded paths. 
- filer_sync_verify.go: same flag pair was never resolved at all in the verify command; resolve at the top of runFilerSyncVerify. - filer_meta_backup.go: -config (the backup_filer.toml path) was passed directly to viper. Resolve at the top of runFilerMetaBackup. - mini.go: master.dir defaulted to the entire comma-joined miniDataFolders. With weed mini -dir=~/d1,~/d2 (or any multi-dir setup), TestFolderWritable then stat'd the joined string instead of a single directory. Default to the first entry via StringSplit to mirror the disk-space calculation a few lines below, and drop the now-redundant ResolvePath in TestFolderWritable.	2026-05-03 21:46:21 -07:00
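The comma-separated handling described in this commit (split on ",", resolve each entry, rejoin, short-circuit when no "~" is present) can be sketched as below. This is not the real util.ResolveCommaSeparatedPaths: `resolvePath` is a simplified, hypothetical stand-in that takes the home directory as a parameter and ignores the "~user" forms, so the example stays deterministic.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// resolvePath expands a leading "~" or "~/rest" against home.
// Hypothetical stand-in for util.ResolvePath; "~user" is left untouched.
func resolvePath(p, home string) string {
	if p == "~" {
		return home
	}
	if strings.HasPrefix(p, "~/") {
		return filepath.Join(home, p[2:])
	}
	return p
}

// resolveCommaSeparatedPaths resolves each comma-separated entry on its
// own, so "~/d1,~/d2" is not treated as one path under the home directory.
func resolveCommaSeparatedPaths(paths, home string) string {
	if !strings.Contains(paths, "~") {
		return paths // nothing to expand
	}
	parts := strings.Split(paths, ",")
	for i, p := range parts {
		parts[i] = resolvePath(p, home)
	}
	return strings.Join(parts, ",")
}

func main() {
	fmt.Println(resolveCommaSeparatedPaths("~/d1,~/d2,/data/d3", "/home/alice"))
}
```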
Chris Lu	f16353de0b	feat(mini): add -bucket flag to pre-create an S3 bucket on startup (#9302 ) * feat(mini): add -bucket flag to pre-create an S3 bucket on startup Lets users hand a pre-provisioned object store to clients/CI without a post-start `weed shell s3.bucket.create` step. The flag is a no-op when empty (default) and idempotent on subsequent starts. * mini: bound bucket-creation RPCs with a timeout off miniClientsCtx Address PR review feedback: derive the lookup/mkdir context from miniClientsCtx() so Ctrl+C cancels the bucket RPCs, and cap with a 5s timeout so a stalled filer cannot block the welcome message indefinitely. Also wrap the DoMkdir error for parity with the lookup path. * mini: fall back to S3_BUCKET env var for -bucket Mirrors the existing -s3.externalUrl / S3_EXTERNAL_URL pattern so container/Kubernetes deployments can pre-create the bucket via env without overriding the entrypoint command. * docs(readme): lead weed mini quick start with credentials + bucket Promote the one-line setup (env vars + bucket) so users get a ready-to-use S3 endpoint without hopping between sections to find credential and bucket setup. * mini: accept comma-separated -bucket list Lets a single startup pre-create multiple S3 buckets, e.g. -bucket=bucket1,bucket2 (or S3_BUCKET=bucket1,bucket2). Names are trimmed and deduped; per-bucket errors are logged and the loop continues so one bad name does not block the rest. * mini: add -tableBucket flag for pre-creating S3 Tables buckets Mirrors -bucket but creates S3 Tables (Iceberg) buckets via s3tables.Manager so users can hand the all-in-one binary a ready-to-use table catalog without a follow-up weed shell call. Comma-separated, env fallback to S3_TABLE_BUCKET, idempotent on restart, owned by the DefaultAccountID placeholder. * mini: use errors.Is for ErrNotFound check in bucket lookup Matches the rest of the codebase (~20 call sites in weed/s3api). 
The direct equality works today because LookupEntry returns ErrNotFound unwrapped, but errors.Is keeps the check correct if the error is ever wrapped.	2026-05-02 21:02:21 -07:00
Chris Lu	1f6f473995	refactor(worker): co-locate plugin handlers with their task packages (#9301 ) * refactor(worker): co-locate plugin handlers with their task packages Move every per-task plugin handler from weed/plugin/worker/ into the matching weed/worker/tasks/<name>/ package, so each task owns its detection, scheduling, execution, and plugin handler in one place. Step 0 (within pluginworker, no behavior change): extract shared helpers that previously lived inside individual handler files into dedicated files and export the ones now consumed across packages. - activity.go: BuildExecutorActivity, BuildDetectorActivity - config.go: ReadStringConfig/Double/Int64/Bytes/StringList, MapTaskPriority - interval.go: ShouldSkipDetectionByInterval - volume_state.go: VolumeState + consts, FilterMetricsByVolumeState/Location - collection_filter.go: CollectionFilterMode + consts - volume_metrics.go: export CollectVolumeMetricsFromMasters, MasterAddressCandidates, FetchVolumeList - testing_senders_test.go: shared test stubs Phase 1: move the per-task plugin handlers (and the iceberg subpackage) into their task packages. weed/plugin/worker/vacuum_handler.go -> weed/worker/tasks/vacuum/plugin_handler.go weed/plugin/worker/ec_balance_handler.go -> weed/worker/tasks/ec_balance/plugin_handler.go weed/plugin/worker/erasure_coding_handler.go -> weed/worker/tasks/erasure_coding/plugin_handler.go weed/plugin/worker/volume_balance_handler.go -> weed/worker/tasks/balance/plugin_handler.go weed/plugin/worker/iceberg/ -> weed/worker/tasks/iceberg/ weed/plugin/worker/handlers/handlers.go now blank-imports all five task subpackages so their init() registrations fire. weed/command/mini.go and the worker tests construct the handler with vacuum.DefaultMaxExecutionConcurrency (the constant moved with the vacuum handler). admin_script remains in weed/plugin/worker/ because there is no underlying weed/worker/tasks/admin_script/ package to merge with. 
* refactor(worker): update test/plugin_workers imports for moved handlers Three handler constructors moved out of pluginworker into their task packages — update the integration test files in test/plugin_workers/ to import from the new locations: pluginworker.NewVacuumHandler -> vacuum.NewVacuumHandler pluginworker.NewVolumeBalanceHandler -> balance.NewVolumeBalanceHandler pluginworker.NewErasureCodingHandler -> erasure_coding.NewErasureCodingHandler The pluginworker import is kept where the file still uses pluginworker.WorkerOptions / pluginworker.JobHandler. * refactor(worker): update test/s3tables iceberg import path The iceberg subpackage moved from weed/plugin/worker/iceberg/ to weed/worker/tasks/iceberg/. test/s3tables/maintenance/maintenance_integration_test.go still imported the old path, breaking S3 Tables / RisingWave / Trino / Spark / Iceberg-catalog / STS integration test builds. Mirrors the OSS-side fix needed by every job in the run that transitively imports test/s3tables/maintenance. * chore: gofmt PR-touched files The S3 Tables Format Check job runs `gofmt -l` over weed/s3api/s3tables and test/s3tables, then fails if anything is unformatted. Files this PR moved or modified had import-grouping and trailing-spacing issues introduced by perl-based renames; reformat them with gofmt -w. Touched files: test/plugin_workers/erasure_coding/{detection,execution}_test.go test/s3tables/maintenance/maintenance_integration_test.go weed/plugin/worker/handlers/handlers.go weed/worker/tasks/{balance,ec_balance,erasure_coding,vacuum}/plugin_handler.go refactor(worker): bounds-checked int conversions for plugin config values CodeQL flagged 18 go/incorrect-integer-conversion warnings on the moved plugin handler files: results of pluginworker.ReadInt64Config (which ultimately calls strconv.ParseInt with bit size 64) were being narrowed to int32/uint32/int without an upper-bound check, so a malicious or malformed admin/worker config value could overflow the target type. 
Add three helpers in weed/plugin/worker/config.go that wrap ReadInt64Config and clamp out-of-range values back to the caller's fallback: ReadInt32Config (math.MinInt32 .. math.MaxInt32) ReadUint32Config (0 .. math.MaxUint32) ReadIntConfig (math.MinInt32 .. math.MaxInt32, platform-portable) Update each flagged call site in the four moved task packages to use the bounds-checked helper. For protobuf uint32 fields (volume IDs) the variable type also becomes uint32, removing the trailing uint32(volumeID) casts and changing the "missing volume_id" check from `<= 0` to `== 0`. Touched files: weed/plugin/worker/config.go weed/worker/tasks/balance/plugin_handler.go weed/worker/tasks/erasure_coding/plugin_handler.go weed/worker/tasks/vacuum/plugin_handler.go * refactor(worker): use ReadIntConfig for clamped derive-worker-config helpers CodeQL still flagged three call sites where ReadInt64Config was being narrowed to int after a value-range clamp (max_concurrent_moves <= 50, batch_size <= 100, min_server_count >= 2). The clamp is correct but CodeQL's flow analysis didn't recognize the bound, so it flagged them as unbounded narrowing. Switch to ReadIntConfig (already int32-bounded by the helper) for those three sites, drop the now-redundant int64 intermediate variables. Also drops the now-unused `> math.MaxInt32` clamp in ec_balance.deriveECBalanceWorkerConfig (the helper covers it).	2026-05-02 18:03:13 -07:00
Jaehoon Kim	be451d22b5	feat(filer.sync): add -verifySync mode to filer.sync for cross-cluster file comparison (#9284 ) * Add -verifySync flag to filer.sync for cross-cluster file comparison Add a verification mode to filer.sync that compares entries between two filers without performing actual synchronization. Uses directory-level sorted merge of ListEntries to detect missing files, size mismatches, and ETag mismatches. Supports -isActivePassive for unidirectional check and -modifyTimeAgo to skip recently modified files during sync lag. * Add mtime annotation and JSON output to filer.sync -verifySync Add automatic mtime relation analysis for SIZE_MISMATCH and ETAG_MISMATCH diffs, and an NDJSON output mode for external tooling. mtime classification: - B_NEWER => "late_updates_skip_likely" hint. Surfaces the case where target has a stub entry whose mtime is ahead of source's real file, causing UpdateEntry's mtime guard in filersink to permanently skip the update. - A_NEWER => "sync_lag_or_event_miss" hint. - EQUAL => no hint (chunk-level issue suspected). Text output example: [SIZE_MISMATCH] /path (a=996, b=0, B newer +274d [late-updates skip likely]) Add -verifyJsonOutput flag. When set, emits one JSON object per line (NDJSON) for diffs and a final SUMMARY object, suitable for piping into external diagnostic pipelines. Concurrent writes from the directory worker pool are now serialized via outputMu to keep both text lines and JSON records atomic. * fix(filer.sync): use shared global semaphore in verifySync to bound goroutine explosion Replace the per-call local semaphore in compareDirectory with a single shared semaphore created in runVerifySync. The old per-level semaphore applied a limit of verifySyncConcurrency only within each directory level, allowing effective concurrency to grow as verifySyncConcurrency^depth on deep trees. 
The shared semaphore is held only for each directory's I/O phase (listEntries + merge) and released before recursing into subdirectories, so a parent never blocks waiting for children to acquire slots — which would deadlock once tree depth exceeds the semaphore capacity. Extract the capacity into a named constant (verifySyncConcurrency = 5) with a comment explaining the memory vs. performance trade-off. Add unit tests: - correctness: missing file, only-in-B, size mismatch, active-passive mode - concurrency bound: peak concurrent listings ≤ verifySyncConcurrency - no-deadlock: binary tree of depth 10 completes within timeout * fix(filer.sync): stream directory entries to prevent OOM on large directories Replace the listEntries helper (which accumulated all entries into a single []filer_pb.Entry slice) with an entryStream type that pages through the directory in the background and forwards entries one at a time through a buffered channel. Memory per directory comparison is now O(channel buffer size = 64) regardless of how many entries the directory contains. Key design points: - entryStream wraps a goroutine + buffered channel with a one-entry lookahead (peek/advance) so the two-pointer sorted merge in compareDirectory can work without buffering any full listing. - A child context (mergeCtx) is passed to both stream goroutines so they are cancelled promptly if compareDirectory returns early (e.g. on error); the ctx.Done() select arm in the callback prevents goroutine leaks when the consumer stops reading. - stream.err is written by the goroutine before close(ch), so it is safe to read after the channel is exhausted (Go memory model: channel close happens-before the zero-value receive). - countMissingRecursive is rewritten to use ReadDirAllEntries with a direct callback, eliminating its own slice allocation. - listEntries is removed; it is no longer called anywhere. 
* fix(filer.sync): address verifySync review findings Four real bugs found and fixed; one finding already resolved (shared semaphore was introduced in a prior commit). path.Join for child paths (filer_sync_verify.go) fmt.Sprintf("%s/%s", dir, name) produced "//name" when dir was "/". Replace all child-path concatenations with path.Join so root-level walks emit clean paths. cutoffTime check for ONLY_IN_B entries (filer_sync_verify.go) The B-only branch ignored -modifyTimeAgo, so files recently written to B were reported as ONLY_IN_B instead of being skipped. Mirror the A-side mtime guard: skip and increment skippedRecent when the entry is newer than cutoffTime. Summary emitted before error check (filer_sync_verify.go) A filer I/O error mid-walk still caused a SUMMARY record (or text summary) to be printed, making partial runs appear complete. Move the error check to before summary emission; on error, return immediately without printing any summary. Return false on verification failure (filer_sync.go) runVerifySync returned true (exit 0) even when diffs were found or the walk failed. Return false so the main binary sets exit status 1, consistent with how all other commands signal failure. * test(filer.sync): add missing verifySync test coverage Four new tests covering gaps identified during review: TestVerifySyncETagMismatch Verifies that two files with identical size but different Md5 checksums are counted as etagMismatch (not sizeMismatch). Exercises the second branch of compareEntries that was previously untested. TestVerifySyncCutoffTime (4 subtests) A-only recent — recent file skipped (skippedRecent++), not MISSING A-only old — old file reported as MISSING B-only recent — recent file skipped (skippedRecent++), not ONLY_IN_B B-only old — old file reported as ONLY_IN_B The B-only subtests specifically cover the cutoffTime fix added in the previous commit. 
TestVerifySyncRootPath Regression for the path.Join fix: walks from "/" and verifies that the child directory is reached and compared correctly (the old Sprintf produced "//data" which would silently produce wrong results). Asserts dirCount=2 and fileCount=1 to confirm the full tree is walked. * fix(filer.sync): use os.Exit(2) instead of return false on verify failure return false triggered weed.go's error handler which printed the full command usage — appropriate for invalid arguments, not for a completed verification that found differences. Use os.Exit(2) consistent with the existing pattern in filer_sync.go (lines 251, 293). * refactor(filer.sync.verify): split verify into its own command The verify mode is a one-shot batch operation with a fundamentally different lifecycle from the long-running sync subscriber, and most of filer.sync's flags (replication, metrics port, debug pprof, concurrency, etc.) do not apply to it. Extract it into a sibling command alongside filer.copy/filer.backup/filer.export rather than a flag mode on filer.sync. Also rename modifyTimeAgo to modifiedTimeAgo (grammatical) and drop the verifyJsonOutput prefix to plain jsonOutput now that the verify context is implicit in the command name. * fix(filer.sync.verify): address review comments - Bounded worker pool: cap subdirectory goroutines per level via a jobs channel and min(verifySyncConcurrency, len(subDirs)) workers instead of spawning one goroutine per child. Wide directories no longer park ~2KB per queued goroutine. - Don't gate recursion on a directory's mtime: a fresh child write bumps the parent mtime, but older files inside should still be reported as missing. Always recurse for missing-in-B directories and apply the cutoff per-file inside countMissingRecursive. - Apply -modifiedTimeAgo symmetrically: matched-name files now skip the comparison when EITHER side is recently modified, not just A. This restores lag tolerance when B was just rewritten. 
Adds tests for both new behaviors and a shared isTooRecent helper. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-04-29 12:33:53 -07:00
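The shared-semaphore pattern from the verifySync fixes (hold a slot only for a directory's I/O phase, release it before recursing) can be sketched on an in-memory tree. Everything here is illustrative: `node`, `walker`, and `run` are hypothetical, the "listing" is simulated by a counter, and the sketch still spawns one goroutine per child, which the later review fix replaced with a bounded worker pool.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const verifyConcurrency = 5 // same value as the constant named in the commit

type node struct{ children []*node }

// buildTree returns a complete binary tree with the given number of levels.
func buildTree(levels int) *node {
	n := &node{}
	if levels > 1 {
		n.children = []*node{buildTree(levels - 1), buildTree(levels - 1)}
	}
	return n
}

type walker struct {
	sem      chan struct{} // one semaphore shared by every level of the walk
	wg       sync.WaitGroup
	inFlight int64
	peak     int64
	visited  int64
}

func (w *walker) walk(n *node) {
	defer w.wg.Done()

	w.sem <- struct{}{} // acquire for the I/O phase only
	cur := atomic.AddInt64(&w.inFlight, 1)
	for { // record the peak number of concurrent "listings"
		p := atomic.LoadInt64(&w.peak)
		if cur <= p || atomic.CompareAndSwapInt64(&w.peak, p, cur) {
			break
		}
	}
	atomic.AddInt64(&w.visited, 1) // stands in for list + merge
	atomic.AddInt64(&w.inFlight, -1)
	<-w.sem // release BEFORE recursing, so parents never block children

	for _, c := range n.children {
		w.wg.Add(1)
		go w.walk(c)
	}
}

// run walks the tree and reports nodes visited and peak concurrency.
func run(root *node) (visited, peak int64) {
	w := &walker{sem: make(chan struct{}, verifyConcurrency)}
	w.wg.Add(1)
	go w.walk(root)
	w.wg.Wait()
	return atomic.LoadInt64(&w.visited), atomic.LoadInt64(&w.peak)
}

func main() {
	visited, peak := run(buildTree(10))
	fmt.Println(visited, peak <= verifyConcurrency) // 1023 true
}
```

Holding the slot across the recursion instead would make a parent wait on children that cannot acquire a slot, which is the deadlock the commit message describes for trees deeper than the semaphore capacity.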
Chris Lu	35fe3c801b	feat(nfs): UDP MOUNT v3 responder + real-Linux e2e mount harness (#9267 ) * feat(nfs): add UDP MOUNT v3 responder The upstream willscott/go-nfs library only serves the MOUNT protocol over TCP. Linux's mount.nfs and the in-kernel NFS client default mountproto to UDP in many configurations, so against a stock weed nfs deployment the kernel queries portmap for "MOUNT v3 UDP", gets port=0 ("not registered"), and either falls back inconsistently or surfaces EPROTONOSUPPORT — surfacing as the user-visible "requested NFS version or transport protocol is not supported" reported in #9263. The user has to add `mountproto=tcp` or `mountport=2049` to mount options to coerce TCP just for the MOUNT phase. Add a small UDP responder that speaks just enough of MOUNT v3 to handle the procedures the kernel actually invokes during mount setup and teardown: NULL, MNT, and UMNT. The wire layout for MNT mirrors handler.go's TCP path so both transports produce the same root filehandle and the same auth flavor list for the same export. Other v3 procedures (DUMP, EXPORT, UMNTALL) cleanly return PROC_UNAVAIL. This commit only adds the responder; portmap-advertise and Server.Start wire-up follow in subsequent commits so each step stays independently reviewable. References: RFC 1813 §5 (NFSv3/MOUNTv3), RFC 5531 (RPC). Existing constants and parseRPCCall / encodeAcceptedReply helpers from portmap.go are reused so behaviour stays consistent across both UDP listening goroutines. * feat(nfs): advertise UDP MOUNT v3 in the portmap responder The portmap responder advertised TCP-only entries because go-nfs only serves TCP, but with the new UDP MOUNT responder in place we can now honestly advertise MOUNT v3 over UDP as well. Linux clients whose default mountproto is UDP query portmap during mount setup; if the answer is "not registered" some kernels translate the result to EPROTONOSUPPORT instead of falling back to TCP, which is exactly the failure pattern reported in #9263. 
Add the entry, refresh the doc comment, and extend the existing GETPORT and DUMP unit tests so a regression that drops the entry shows up at unit-test granularity rather than only in an end-to-end mount. * feat(nfs): start UDP MOUNT v3 responder alongside the TCP NFS listener Plug the new mountUDPServer into Server.Start so it comes up on the same bind/port as the TCP NFS listener. Started before portmap so a portmap query that races a fast client never returns a UDP MOUNT entry the responder isn't actually answering, and shut down via the same defer chain so a portmap-or-listener startup failure doesn't leave the UDP responder dangling. The portmap startup log now reflects all three advertised entries (NFS v3 tcp, MOUNT v3 tcp, MOUNT v3 udp) so operators can confirm at a glance that the UDP MOUNT path is up. Verified end-to-end: built a Linux/arm64 binary, ran weed nfs in a container with -portmap.bind, and mounted from another container using both the user-reported failing setup from #9263 (vers=3 + tcp without mountport) and an explicit mountproto=udp to force the new code path. The trace `mount.nfs: trying ... prog 100005 vers 3 prot UDP port 2049` now leads to a successful mount instead of EPROTONOSUPPORT. * docs(nfs): note that the plain mount form works on UDP-default clients With UDP MOUNT v3 now served alongside TCP, the only path that ever required mountproto=tcp / mountport=2049 — clients whose default mountproto is UDP — works against the plain mount example. Update the startup mount hint and the `weed nfs` long help so users don't go hunting for a mount-option workaround that no longer applies. The "without -portmap.bind" branch is unchanged: that path still has to bypass portmap entirely because there is no portmap responder for the kernel to query. * test(nfs): add kernel-mount e2e tests under test/nfs The existing test/nfs/ harness boots a real master + volume + filer + weed nfs subprocess stack and drives it via go-nfs-client. 
That covers protocol behaviour from a Go client's perspective, but anything mis-coded once a real Linux kernel parses the wire bytes is invisible: both ends of the test use the same RPC library, so identical bugs round-trip cleanly. The two NFS issues hit recently were exactly that shape — NFSv4 mis-routed to v3 SETATTR (#9262) and missing UDP MOUNT v3 — and only surfaced in a real client. Add three end-to-end tests that mount the harness's running NFS server through the in-tree Linux client: - TestKernelMountV3TCP: NFSv3 + MOUNT v3 over TCP (baseline). - TestKernelMountV3MountProtoUDP: NFSv3 over TCP, MOUNT v3 over UDP only — regression test for the new UDP MOUNT v3 responder. - TestKernelMountV4RejectsCleanly: vers=4 against the v3-only server, asserting the kernel surfaces a protocol/version-level error rather than a generic "mount system call failed" — regression test for the PROG_MISMATCH path from #9262. The tests pass explicit port=/mountport= mount options so the kernel never queries portmap, which means the harness doesn't need to bind the privileged port 111 and won't collide with a system rpcbind on a shared CI runner. They t.Skip cleanly when the host isn't Linux, when mount.nfs isn't installed, or when the test process isn't running as root. Run locally with: cd test/nfs sudo go test -v -run TestKernelMount ./... CI wiring follows in the next commit. * ci(nfs): run kernel-mount e2e tests in nfs-tests workflow Wire the new TestKernelMount* tests from test/nfs into the existing NFS workflow: - Existing protocol-layer step now skips '^TestKernelMount' so a "skipped because not root" line doesn't appear on every run. - New "Install kernel NFS client" step pulls nfs-common (mount.nfs + helpers) and netbase (/etc/protocols, which mount.nfs's protocol- name lookups need to resolve `tcp`/`udp`). 
- New privileged step runs only the kernel-mount tests under sudo, preserving PATH and pointing GOMODCACHE/GOCACHE at the user's caches so the second `go test` invocation reuses already-built test binaries instead of redownloading modules under root. The summary block now lists the three kernel-mount cases explicitly so a regression on either of #9262 or this PR's UDP MOUNT change is traceable from the workflow run page.	2026-04-28 14:06:35 -07:00
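The skip conditions and explicit-port mount options described in the kernel-mount test commit above can be sketched roughly as follows. This is a minimal illustration and not the code under test/nfs: `kernelMountSkipReason` and `mountOptions` are hypothetical helper names, and the port 2049, the `127.0.0.1:/` export, and the `/mnt/test` target are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"
)

// kernelMountSkipReason (hypothetical helper) returns a non-empty reason when
// a kernel-mount test should be skipped, mirroring the three conditions named
// in the commit message: non-Linux host, missing mount.nfs, not running as root.
func kernelMountSkipReason(goos string, haveMountNFS bool, euid int) string {
	switch {
	case goos != "linux":
		return "kernel NFS mount tests require Linux"
	case !haveMountNFS:
		return "mount.nfs is not installed"
	case euid != 0:
		return "mounting requires root"
	}
	return ""
}

// mountOptions (hypothetical helper) builds an option string that pins both
// the NFS port and the MOUNT port, so the kernel has no reason to ask portmap.
// mountProto selects the transport for the MOUNT protocol only ("tcp" or "udp").
func mountOptions(port int, mountProto string) string {
	return fmt.Sprintf("vers=3,proto=tcp,port=%d,mountport=%d,mountproto=%s",
		port, port, mountProto)
}

func main() {
	_, err := exec.LookPath("mount.nfs")
	if reason := kernelMountSkipReason(runtime.GOOS, err == nil, os.Geteuid()); reason != "" {
		fmt.Println("skip:", reason)
		return
	}
	// Assumed port, export, and target; printed rather than executed.
	fmt.Println("mount -t nfs -o", mountOptions(2049, "udp"), "127.0.0.1:/ /mnt/test")
}
```

In a real test the skip reason would presumably be passed to `t.Skip`, and the option string handed to `mount` via `os/exec`; the sketch only prints the command so it stays safe to run anywhere.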
Chris Lu	735e94f6ba	mount: expose -fuse.maxBackground and -fuse.congestionThreshold flags (closes #9258) (#9268) * mount: expose `-fuse.maxBackground` flag (closes #9258) The Linux FUSE driver caps in-flight async requests via `/sys/fs/fuse/connections/<id>/max_background` (and a derived `congestion_threshold = 3/4 * max_background`). Heavy upload workloads need this raised, but the cap currently lives only in `/sys`, so it resets on reboot/remount. `weed mount` was hardcoding `MaxBackground: 128`. Promote it to a flag, default unchanged. Setting `-fuse.maxBackground=2048` reproduces the manual `echo 2048 > .../max_background` (and gives 1536 for congestion_threshold automatically) persistently across remounts. `congestion_threshold` is not exposed as a separate flag because go-fuse derives it as 3/4 of MaxBackground in InitOut and offers no hook to override; users wanting a different ratio can still write /sys/fs/fuse/connections/<id>/congestion_threshold post-mount. * mount: add `-fuse.congestionThreshold` flag, bump go-fuse to v2.9.3 go-fuse v2.9.3 exposes CongestionThreshold as a separate MountOption, so we can now let users override the kernel's default 3/4-of-max_background ratio at mount time instead of having to write /sys/fs/fuse/connections/<id>/congestion_threshold post-mount on every remount/reboot. Default 0 preserves existing behavior (kernel derives it as 3/4 * max_background). Non-zero is sent to the kernel verbatim; the kernel clamps it to max_background if higher.	2026-04-28 13:42:58 -07:00
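The relationship between the two FUSE flags in the commit above can be worked through with a small sketch. It only models the arithmetic the message describes (0 means "derive 3/4 of max_background", larger values are clamped); `effectiveCongestionThreshold` is a hypothetical name, and the real derivation and clamping happen in go-fuse and the kernel, not in code like this.

```go
package main

import "fmt"

// effectiveCongestionThreshold (hypothetical helper) models the value the
// kernel ends up using, as described in the commit message:
//   - 0 keeps the default of 3/4 * maxBackground
//   - a non-zero value is used as given, clamped to maxBackground
func effectiveCongestionThreshold(maxBackground, congestionThreshold int) int {
	if congestionThreshold == 0 {
		return maxBackground * 3 / 4
	}
	if congestionThreshold > maxBackground {
		return maxBackground
	}
	return congestionThreshold
}

func main() {
	fmt.Println(effectiveCongestionThreshold(128, 0))     // 96: previous hardcoded default
	fmt.Println(effectiveCongestionThreshold(2048, 0))    // 1536: matches the commit message
	fmt.Println(effectiveCongestionThreshold(2048, 4096)) // 2048: clamped to max_background
}
```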
Chris Lu	0fa0a56a5a	filer(mysql): TLS hostname/SNI knobs + MariaDB upsert documentation (#9260) * refactor(filer/mysql): set tls.Config per-instance via Connector instead of global registry Replace the use of `mysql.RegisterTLSConfig("mysql-tls", ...)` and the `&tls=mysql-tls` DSN suffix with a per-instance setup that assigns the `tls.Config` directly to `mysql.Config.TLS` and opens the database via `mysql.NewConnector` + `sql.OpenDB`. The driver's TLS-config registry is process-wide; if a second `MysqlStore` were ever initialized with different TLS settings (e.g., a filer plus a separately configured store) the second registration would silently overwrite the first. The connector pattern keeps the TLS configuration attached to the connector and avoids that global side effect. Behavior is otherwise unchanged: TLS is enabled when `enable_tls=true`, the same `ca_crt`/`client_crt`/`client_key` knobs are honored, and the TLS minimum version remains 1.2. * filer(mysql): use system root CAs when ca_crt is empty Previously, enabling `enable_tls=true` without setting `ca_crt` returned an unhelpful empty-path read error. Many managed MySQL/MariaDB providers serve certificates that chain to a public CA already in the host's trust store, so requiring an explicit CA bundle adds friction with no security benefit. Leave `RootCAs` unset when `ca_crt` is empty so Go's `tls.Config` falls back to the system trust store, matching the standard behavior of `mysql --ssl`. Existing setups with `ca_crt` configured are unaffected. Also wraps the CA read/parse errors with the file path for easier diagnosis. * filer(mysql): fail loudly when client_crt / client_key are unreadable The previous implementation called `tls.LoadX509KeyPair` and silently discarded any error, falling back to a non-mTLS connection. 
A typo or permissions problem in `client_crt` / `client_key` therefore appeared as a confusing server-side handshake error rather than as a config error, because the server was expecting a client cert that the filer never sent. Treat the keypair as required when either path is set, and surface the underlying load error with both filenames so the misconfiguration is obvious. The default (both paths empty) is unchanged: no client cert is sent. * filer(mysql): add tls_insecure_skip_verify and tls_server_name knobs When the filer connects to a MySQL/MariaDB cluster whose server certificate's SAN does not match the connection address (common with internal load balancers, IP-only connection strings, or self-signed cluster certs), the TLS handshake fails with `x509: certificate is valid for X, not Y`. There was previously no way to fix this short of reissuing the cert. Expose two new optional knobs on `[mysql]`: - `tls_server_name` overrides the SNI / cert hostname used for verification — the standard fix when the cert SAN is correct but the connection address is not. - `tls_insecure_skip_verify` disables verification entirely as an escape hatch for testing or for clusters with no usable SAN. Both default to off, so existing configurations continue to verify the server certificate against the connection address as before. * docs(scaffold/filer.toml): document mysql TLS knobs and MariaDB upsert override - Document the new `tls_insecure_skip_verify` and `tls_server_name` options. - Update the `ca_crt` comment to reflect that it is optional and that the system trust store is used when the path is empty (matches the runtime behavior in mysql_store.go). - Reword the client cert comments to make the mTLS pairing requirement explicit (both `client_crt` and `client_key` must be set together). - Add a commented-out MariaDB / MySQL 5.7 alternative for `upsertQuery`, noting that the default (`AS new` row alias) requires MySQL 8.0.19+. 
* filer(mysql): drop redundant blank import of go-sql-driver/mysql The package was imported twice: once with the `mysql` alias (used for `mysql.MySQLError`, `mysql.Config`, `mysql.NewConnector`, etc.) and once as `_` to register the driver. The named import already triggers `init()` and registers the driver, so the blank import is dead weight.	2026-04-28 01:29:41 -07:00


1493 Commits