seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-06-13 23:36:45 +03:00

Author	SHA1	Message	Date
Chris Lu	8cc10460b4	fix(remote): correct content and permissions when syncing/caching remote objects (#9879) * fix(remote): reject short reads when caching remote objects A short read from the remote (stale listing size, truncated or flaky response) was silently zero-padded: the S3 and Azure clients pre-size the buffer and discard the downloaded byte count, and the chunk is recorded with the requested size. The cached file then matched the expected size but its tail was NULL, and the entry was marked cached so it never re-fetched. Check the byte count against the requested size in both clients, and add a backend-agnostic guard in FetchAndWriteNeedle. The cache now fails loudly and the entry stays remote-only for a later retry. * fix(remote): match S3 default modes when syncing remote metadata Remote object listings carry no POSIX mode, so synced entries were created with a hardcoded 0644. Against a SeaweedFS remote, whose S3 layer writes objects as 0660 and auto-creates directories as 0771 (0660|0111), the mounted copy ended up 0644/0755 and the permissions visibly diverged from the source. Default to the S3 modes instead (files 0660, directories 0771). The filer derives parent-dir modes from the child as fileMode|0111, so fixing the file default also brings the directories into line. Directory mtimes still reflect sync time: S3 listings don't enumerate directories, so the remote's directory timestamps aren't available.	2026-06-08 13:55:53 -07:00
Chris Lu	9ede92a7cc	filer: replicate RECOMPUTE_LATEST pointer updates to peers (#9840 ) applyRecomputeLatest wrote the .versions latest-version pointer and the demoted prior version's stamp through UpdateEntry without a following NotifyUpdateEvent, so neither change entered the metadata log. Across filers the pointer then lived only on whichever filer ran the mutation, and ListObjects served by any other filer dropped those objects from a versioned bucket. Emit the events the way PATCH_EXTENDED already does, keeping a pre-update image for the notification diff.	2026-06-06 18:02:28 -07:00
Chris Lu	6bd0091c72	master: grow rack-spanning volumes once per DC, capped at copy_N (#9835 ) * master: grow rack-spanning volumes once per DC, capped at copy_N The periodic rack-aware growth scan grew once per rack. For rack-spanning replication (DiffRackCount > 0) a single logical volume already covers every rack the placement needs, so a crowded volume made every rack report should-grow and the scan created racks×step too many volumes: with "010" across two racks that is 2 racks x step 2 = 4 logical (8 physical) volumes. Plan one DC-wide grow for rack-spanning replication, and cap the per-event step at master.volume_growth.copy_N so lowering it reduces periodic growth. * master: distribute lastGrowCount evenly across uneven DCs The non-rack-spanning grow divisor used the current DC's rack count, so DCs with different rack counts each over-grew. Sum every rack up front and divide lastGrowCount by that global count instead.	2026-06-05 12:39:59 -07:00
Chris Lu	ab7be7867d	security: hot-reload JWT signing keys on SIGHUP (#9826 ) * security: reload JWT signing keys on SIGHUP Signing keys were read once in the server constructors and never refreshed. After a key rotation (Secret update, divergent reads) the in-memory key stayed stale and every request kept failing "wrong jwt" until the affected process was restarted. Add Guard.UpdateSigningKeys and call it from the master, volume and filer reload paths and the s3 reload hook, next to the existing whitelist refresh. Make the global chunk-read JWT cache reloadable via an atomic swap, and register the master's Reload with grace.OnReload -- it was never wired, so the master ignored SIGHUP entirely. Mirror the same refresh in the Rust volume server's SIGHUP handler. * security: swap signing keys behind an atomic pointer Addresses review feedback on the in-place key swap: SigningKey is a []byte, so reassigning the Guard fields while a request handler reads them is a data race that can tear the multi-word slice header and read out of bounds. Hold the four signing-key fields in an immutable signingConfig snapshot behind atomic.Pointer; UpdateSigningKeys swaps the whole pointer, so a reader sees either the old keys or the new ones. Reads go through new SigningKey/ExpiresAfterSec/ReadSigningKey/ReadExpiresAfterSec accessors. The Rust guard is already safe: every read and the SIGHUP write go through the shared RwLock<Guard>. * security: fold whitelist + auth state into the atomic snapshot Review follow-up. UpdateSigningKeys still wrote isWriteActive while the request path read it (and the whitelist maps) unsynchronized, so a SIGHUP under load could expose an inconsistent mix of activation bits and whitelist contents. Move all hot-reloadable Guard state -- keys, expirations, whitelist, and the activation flags -- into a single immutable guardState swapped behind one atomic.Pointer. 
The Update* methods take a small mutex to serialize the read-modify-write; readers stay lock-free. The concurrency test now also rotates the whitelist and probes IsWhiteListed under -race. Also read each signing key once per branch in the volume/filer JWT auth checks, so a reload landing mid-check can't take the allow-fast-path after auth was enabled or verify against a different key than the branch saw.	2026-06-04 22:26:08 -07:00
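The snapshot-swap pattern from ab7be7867d might look roughly like the sketch below. The names Guard, guardState and SigningKey come from the commit message; the field set, the constructor and the singular UpdateSigningKey method are assumptions made for illustration. It uses atomic.Pointer from sync/atomic (Go 1.19 and later).

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// guardState is an immutable snapshot. The fields here are illustrative;
// the real struct holds more (whitelist, activation flags, read keys).
type guardState struct {
	signingKey      []byte
	expiresAfterSec int
}

// Guard publishes its reloadable state behind a single atomic pointer.
type Guard struct {
	state    atomic.Pointer[guardState]
	updateMu sync.Mutex // serializes read-modify-write among updaters only
}

// NewGuard is a hypothetical constructor for this sketch.
func NewGuard(key string, expiresAfterSec int) *Guard {
	g := &Guard{}
	g.state.Store(&guardState{signingKey: []byte(key), expiresAfterSec: expiresAfterSec})
	return g
}

// SigningKey is lock-free: a reader sees either the old snapshot or the new
// one, never a half-written slice header. Callers must not mutate the result.
func (g *Guard) SigningKey() []byte { return g.state.Load().signingKey }

// UpdateSigningKey copies the current snapshot, changes one field, and swaps
// the whole pointer. The published snapshot itself is never modified.
func (g *Guard) UpdateSigningKey(key string) {
	g.updateMu.Lock()
	defer g.updateMu.Unlock()
	next := *g.state.Load()
	next.signingKey = []byte(key)
	g.state.Store(&next)
}

func main() {
	g := NewGuard("old", 10)
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				_ = g.SigningKey() // concurrent readers, no lock taken
			}
		}()
	}
	g.UpdateSigningKey("new") // simulated SIGHUP reload
	wg.Wait()
	fmt.Println(string(g.SigningKey()))
}
```

Running this under `go run -race` is a cheap way to confirm that readers and the reload path do not conflict.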
Chris Lu	df879e1ed7	filer: bound TraverseBfsMetadata memory by queuing directory paths (#9814 ) * filer: bound TraverseBfsMetadata memory by queuing directory paths The BFS enqueued every entry, so it held the whole subtree in memory including each file's chunk list. A filer serving a peer's first-time bootstrap traversal of a large tree could exhaust memory and get killed. Stream each entry as it is visited and queue only directory paths to descend into. Memory is now bounded by the number of directories rather than the entire tree, and the streamed output order is unchanged. * filer: match excluded prefixes on path-component boundaries Only treat an excluded prefix as a match when it ends at a path boundary, so excluding /a/b does not also drop a sibling like /a/bc. Short-circuit the trie walk on the first real match.	2026-06-03 10:28:42 -07:00
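The path-boundary rule in the second part of df879e1ed7 can be illustrated with a plain string check. The commit describes a trie walk; this sketch swaps in a simple prefix comparison to show the boundary condition only, and hasPathPrefix is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// hasPathPrefix reports whether prefix matches path on a path-component
// boundary: the match must end exactly at the end of path or at a '/'.
// Hypothetical helper for illustration; the filer itself walks a trie.
func hasPathPrefix(path, prefix string) bool {
	prefix = strings.TrimSuffix(prefix, "/")
	if !strings.HasPrefix(path, prefix) {
		return false
	}
	return len(path) == len(prefix) || path[len(prefix)] == '/'
}

func main() {
	fmt.Println(hasPathPrefix("/a/b", "/a/b"))   // the excluded directory itself
	fmt.Println(hasPathPrefix("/a/b/c", "/a/b")) // a descendant
	fmt.Println(hasPathPrefix("/a/bc", "/a/b"))  // a sibling that merely shares leading bytes
}
```

The first two calls match and the third does not, which is the behavior the commit describes for excluding /a/b without dropping /a/bc.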
Aleksey	e3e02d3364	[CheckDisk]: implement disk health detection (#9560 ) * [CheckDisk][GRPC]: implement MVP for disk health detection, added timeout for new grpc connections * fix(volume): build disk health check on every platform setDiskStatus only existed behind the statfs build tag, so disk.go failed to compile on windows, openbsd, solaris, netbsd and plan9. Move the timeout wrapper and failure tracking into the shared disk.go and have each platform's fillInDiskStatus return an error, so every platform gets the same protection from a stuck filesystem. Also restore the uint64(fs.Bavail) cast: Bavail is int64 on freebsd, so the unguarded multiply broke the freebsd build. * fix(volume): keep one outstanding statfs probe per disk A stuck statfs used to leave isChecking cleared by the timeout path, so the next check spawned another goroutine while the previous one was still blocked in the syscall, leaking one goroutine per minute on a hung disk. Clear the flag only when statfs returns and treat an overlapping check as a failure, so a hung filesystem keeps a single outstanding probe and still gets reported. * fix(volume): assume disk available until the first health check isDiskAvailable defaulted to false, and CollectHeartbeat skips locations that are not available. A freshly started volume server would therefore omit every volume from its first heartbeats until the async CheckDiskSpace ran, so the master could briefly treat all of them as missing. * fix(volume): label the disk error metric by data directory The new gauge tagged the series with IdxDirectory while every neighbouring resource gauge uses Directory, so the error series would not line up with them in dashboards. Also log the underlying error instead of a generic message. 
* test(volume): cover disk health success and repeated-failure paths * fix(volume): make a healthy disk the zero-value default Track the disk as isDiskUnavailable instead of isDiskAvailable so the safe state is the zero value, matching isDiskSpaceLow. CollectHeartbeat only skips a location once a check has actively marked it unavailable, so any DiskLocation built without running CheckDiskSpace (tests, future call sites) still reports its volumes instead of silently dropping them. * feat(disk): detect degraded disks using IO latency probes * feat(stats): introduce configurable disk I/O health probe with EWMA-based latency detection * feat(disk): replace EWMA with sliding window algorithm for disk health detection and added user-friendly options * feat(disk): improve disk health probing and recovery * feat(volume): configure disk health checks via volume.toml * fix(volume): Remove disk IO probe CLI options --------- Co-authored-by: ptukha <ptukha@tochka.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-02 09:02:05 -07:00
Neetika Mittal	45465e5a05	fix(master): notify clients after manual volume grow (#9656) Co-authored-by: Neetika Mittal <mneetika@users.noreply.github.com>	2026-06-01 20:33:37 -07:00
Chris Lu	2386fa550a	grpc: don't tear down the shared master connection on a caller's own timeout (#9775 ) A Canceled/DeadlineExceeded from the caller's per-request context was treated like a dead channel: it closed the shared cached ClientConn and cancelled every other in-flight RPC on it with "the client connection is closing". Under a burst of concurrent chunk assigns (e.g. a large S3 multipart upload) one slow assign hitting its 10s attempt timeout could poison the connection for all the rest, cascading into a flood of 500s. Thread the caller's context into shouldInvalidateConnection and only invalidate on Canceled/DeadlineExceeded while that context is still live, which isolates the genuine stale-channel signal (a peer restart behind a k8s Service VIP). To carry the context, add a ctx parameter to the existing WithGrpcClient, WithMasterClient, and WithMasterServerClient; the master assign and volume-lookup paths pass their per-attempt context and every other caller passes context.Background().	2026-06-01 15:11:02 -07:00
Chris Lu	1a19683ee6	filer: name the read-only path in the write rejection (#9773 ) * filer: name the read-only path in the write rejection The write path rejected creates under a read-only rule with a bare "read only", giving no hint which path was locked or why. Wrap the error with the matched location prefix and a quota hint so a FUSE mkdir or S3 put points straight at the offending bucket. * return the read-only reason over HTTP and drop any query string from the fallback prefix	2026-06-01 12:20:45 -07:00
Chris Lu	80dd3b2621	EC bitrot follow-ups: protect destination sidecar on optional copy; cap sidecar block_size (#9763 ) * fix(ec_bitrot): cap sidecar block_size in ValidateBitrotManifest A sidecar loaded from disk (or supplied via a backfill/peer RPC) could carry a huge power-of-two block_size that passed validation, then force a multi-GiB scratch-buffer allocation in scrub/verify. Add a shared MaxBitrotBlockSize (64 MiB) constant, enforce it as an upper bound in isPow2MultipleOf1MiB, and derive the volume flag cap from the same constant so they cannot drift. * fix(ec_bitrot): don't destroy a valid destination sidecar on an optional copy writeToFile opened the destination with O_TRUNC before knowing whether the source had the file, so an optional copy (ignoreSourceFileNotFound) from a source that lacks the .ecsum truncated and then removed a valid pre-existing destination sidecar. Stage the optional copy into a temp sibling and commit it with an atomic rename only when the source actually delivered the file; a missing source is now a no-op. Mandatory copies keep their in-place behavior.	2026-05-31 23:42:33 -07:00
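The block_size bound from 80dd3b2621 amounts to a range check plus a power-of-two test. The sketch below uses its own lowercase names; the commit's MaxBitrotBlockSize and isPow2MultipleOf1MiB may differ in signature and detail.

```go
package main

import "fmt"

const (
	miB          = 1 << 20
	maxBlockSize = 64 * miB // mirrors the 64 MiB cap named in the commit
)

// isValidBlockSize accepts powers of two in [1 MiB, 64 MiB]. Any power of two
// that is at least 1 MiB is automatically a multiple of 1 MiB.
// Hypothetical helper for illustration only.
func isValidBlockSize(n int64) bool {
	if n < miB || n > maxBlockSize {
		return false
	}
	return n&(n-1) == 0
}

func main() {
	fmt.Println(isValidBlockSize(16 * miB))  // the default block size
	fmt.Println(isValidBlockSize(3 * miB))   // not a power of two
	fmt.Println(isValidBlockSize(128 * miB)) // power of two, but above the cap
}
```

Checking the range before the bit test keeps zero and negative values out of the `n&(n-1)` expression.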
Chris Lu	9658f309d2	EC bitrot detection: per-shard checksum sidecars (#9761 ) * ec: add EC bitrot checksum protobuf EcBitrotProtection/EcShardChecksums/ChecksumAlgorithm sidecar messages, copy_ecsum_file and unsafe_ignore_sidecar fields, and a CHECKSUM scrub mode. * ec: bitrot checksum sidecar format, validation, and per-volume load Per-shard CRC32C block checksums in an optional <base>.ecsum sidecar with a self-integrity header; validation, rolling builder, backfill primitive, and EcVolume load on mount + removal on destroy. * ec: capture per-shard checksums at encode; verify-and-exclude on rebuild WriteEcFilesWithContext returns the protection computed inline during encoding. generateMissingEcFiles verifies present inputs against the sidecar, excludes corrupt ones, regenerates in place, and re-verifies; fail-closed unless unsafe_ignore_sidecar, removing all generated outputs on failure. * ec: read-only checksum scrub with Reed-Solomon arbiter ChecksumScrub verifies each local shard against the sidecar and reconstructs flagged shards from the clean shards so stale-sidecar false positives are not reported. Wired to the gRPC CHECKSUM mode and ec.scrub -mode checksum. * ec: server-side bitrot sidecar write, copy, cleanup, and opportunistic backfill Write .ecsum at fresh encode; propagate it with copy_ecsum_file (tolerant); remove it on full delete and decode; rebuild honors unsafe_ignore_sidecar and opportunistically backfills a sidecar when all shards are reachable. * ec: volume server bitrot config flags -ec.bitrotChecksum (default on) and -ec.bitrotBlockSizeMB (default 16). * fix(ec_bitrot): bound -ec.bitrotBlockSizeMB before the int64 multiply Validate the MiB value is in [1, 1024] before multiplying by 1 MiB, so a huge flag value cannot overflow int64 and slip past the power-of-two check, and a block size cannot collapse a sidecar to a few oversized blocks. 
* fix(ec_bitrot): distribute the .ecsum sidecar from the worker encode path The worker EC encode wrote the generation-0 sidecar locally but never added it to shardFiles, so DistributeEcShards never shipped it and the distributed holders came up unprotected. Append it to shardFiles and map the ecsum shard type to its extension in the sender so it travels with the shards. * fix(ec_bitrot): remove orphaned sidecars when the generation is gone Gate sidecar removal on existingShardCount==0 alone rather than also requiring a stray .ecx. A sidecar whose shards have all been deleted is orphaned and must be removed even when no .ecx remains, or it leaks. .ecx/.ecj/.vif removal stays gated on hasEcxFile as before. * fix(ec_bitrot): do not fold checksum blocks scanned into TotalFiles ChecksumScrub's first return is blocks scanned, not files. Discard it so the scrub response's TotalFiles (a needle/file count) is not inflated by the block count for CHECKSUM mode. * test(ec_bitrot): clean up generated .ecsum sidecars in removeGeneratedFiles * fix(ec_bitrot): reject an oversized sidecar payload before the uint32 cast The header stores payload_len as a uint32; bound the payload before the conversion so a pathological manifest cannot truncate the length field and corrupt the sidecar. A real manifest is a few KB, so this never trips. * fix(ec_bitrot): cap -ec.bitrotBlockSizeMB at 64 MiB The block size becomes the per-shard scratch buffer the scrub/backfill path allocates, so an over-large value (e.g. 1 GiB) is a memory hazard per concurrent scrub worker. Lower the upper bound from 1024 to 64 MiB. * fix(ec_bitrot): add -ecUnsafeIgnoreSidecar to weed tool fix -ecx The -ecx recovery path reconstructs missing shards via RebuildEcFilesWithContext, which fails closed on a malformed/stale .ecsum. Without an override flag an operator could not complete the rebuild without manually deleting the sidecar. Expose -ecUnsafeIgnoreSidecar (default false) and thread it through. 
* fix(ec_bitrot): bound sidecar payload with a direct int constant; drop readFull Guard len(payload) against a plain int constant (1 GiB) before the allocation instead of a uint64 MaxUint32 compare, so the allocation-size value is provably bounded (clears the CodeQL overflow alert) and the math import is no longer needed. Inline os.File.ReadAt with io.EOF handling in verifyShardFileBlocks and remove the now-redundant readFull helper (os.File.ReadAt fills the slice or errors). * test(ec_bitrot): use slices.Contains instead of a hand-rolled containsU32 * refactor(ec): fold the EcFiles WithContext variants into the base functions RebuildEcFiles now takes the ECContext directly (nil => derive from .vif as before) and WriteEcFiles takes it too (nil => default), removing the parallel RebuildEcFilesWithContext / WriteEcFilesWithContext names. Callers that had an explicit context drop the WithContext suffix; the default-context callers pass nil. No behavior change. refactor(ec): pass BackgroundECContext instead of nil to Write/RebuildEcFiles Add a non-nil BackgroundECContext placeholder (analogous to context.Background()) and have callers with no specific layout pass it instead of a nil ECContext. WriteEcFiles resolves a zero/background context to the default ratio and RebuildEcFiles resolves it from the .vif, so behavior is unchanged. fix(ec_bitrot): make BackgroundECContext a func; RebuildEcFiles fails closed on bad .vif - BackgroundECContext is now a function returning a fresh *ECContext, so callers cannot mutate a shared singleton or race on it (and it mirrors context.Background, which is also a function). - RebuildEcFiles now propagates the MaybeLoadVolumeInfo error: a present-but- unreadable .vif fails closed instead of silently rebuilding with the default ratio (which would corrupt a custom-ratio volume). Pass an explicit ctx to override.	2026-05-31 18:52:44 -07:00
Chris Lu	6b06fe5ec4	s3: commit a versioned PutObject and its latest pointer in one transaction (#9756 ) * s3: commit a versioned PutObject and its latest pointer in one transaction A versioned PutObject wrote the version file and flipped the .versions latest pointer in two separate routed transactions. Fold the RECOMPUTE_LATEST into the version file's PUT so both commit atomically under the object's per-path lock: the recompute, applied after the PUT in the same transaction, scans the directory and sees the new version. A crash can no longer leave the version present with a stale pointer. putToFiler now takes a putFinalize describing the finalize step — routed mutations folded into the PUT, or an afterCreate run under the object write lock off the ring. Suspended-versioning keeps its afterCreate-only form; multipart, copy, and delete-marker finalizes are unchanged. * s3: trim verbose finalize comments	2026-05-31 00:13:36 -07:00
Mohamed Chorfa	10c4ab3e33	s3, iam, volume, filer, master: add /healthz and /readyz health probes (#9738 ) Adds standard Kubernetes liveness/readiness endpoints to all HTTP servers that were missing them: - S3: adds /readyz (already had /healthz) - IAM: adds /healthz and /readyz (had none) - Volume: adds /readyz (already had /healthz) - Filer: adds /readyz on default and readonly mux - Master: adds /healthz and /readyz at root level (preserves existing /cluster/healthz) All endpoints reuse existing health handlers or return 200 OK as a minimal foundation. Future PRs can enhance /readyz with dependency checks without breaking the contract. Closes #9736 Co-authored-by: Mohamed Chorfa <mohamed.chorfa@thalesgroup.com>	2026-05-29 20:45:03 -07:00
Chris Lu	c9623007a2	fix(filer.sync): keep sync_offset fresh through filtered-event markers (#9733 ) On a read-only watched path the idle heartbeat keeps sync_offset fresh, but a busy source filer still emits a MaxUnsyncedEvents marker after many filtered events. The marker has a non-nil but empty EventNotification, so the client routed it to the event path, where it advanced no real watermark yet drove offsetFunc to republish the stale processed watermark — regressing the gauge between heartbeats and spiking the derived lag every time a filtered-event burst landed. Route the empty marker through OnIdleHeartbeat like the idle heartbeat so its fresh timestamp keeps the gauge current; it still advances the in-stream resume cursor.	2026-05-28 23:29:59 -07:00
Chris Lu	dfd05d14cb	refactor(filer): remove the inode->path index and the NFS gateway (#9724 ) * fix(filer): derive inodes by hash instead of a snowflake sequencer Compute the same inode the FUSE mount would: non-hard-linked entries hash path + crtime, hard links hash their shared HardLinkId so every link resolves to one inode. Removes the snowflake inodeSequencer and the SEAWEEDFS_FILER_SNOWFLAKE_ID knob; inodes are now deterministic across filers. * chore: remove the experimental NFS gateway The NFS frontend ('weed nfs') was the only consumer of the inode->path index. Remove the weed/server/nfs package, the command and its registration, the integration test harness, and the CI workflow; go mod tidy drops the willscott/go-nfs and go-nfs-client dependencies. * refactor(filer): drop the inode->path index With the NFS gateway gone, nothing reads it. A regular file's inode is a pure hash of its path and a hard link's is a hash of its shared HardLinkId -- both derivable on demand -- so the secondary KV index and its write/remove hooks are dead. Removes filer_inode_index.go and the recordInodeIndex hooks from the store wrapper.	2026-05-28 15:00:18 -07:00
Chris Lu	29eec2f111	master: timeout AllocateVolume/DeleteVolume and defer growRequest cleanup (#9698 ) * master: timeout AllocateVolume/DeleteVolume and defer growRequest cleanup The volume-grow goroutine clears the layout's growRequest flag only after ms.DoAutomaticVolumeGrow returns, and AllocateVolume / DeleteVolume were calling the volume-server RPC with context.Background(). A volume server that hung mid-call (heavy I/O, stuck lock, dead peer behind a stable VIP) would park the goroutine forever, leaving growRequest=true and silently blocking every subsequent automatic grow for that layout — Assign retries then drained their 30s budget with "context deadline exceeded" until the operator restarted the master. Bound both RPCs with a 5-minute deadline (creating/removing a volume is sub-second normally, generous for contended disks) and move the flag clear + filter delete into defers so a panic in DoAutomaticVolumeGrow doesn't strand the layout either. * allocate_volume: shorten timeout to 1m for faster recovery Volume create/delete is sub-second under normal conditions; 1 minute is generous even on a contended disk and clears the growRequest flag well before too many client Assigns drain their own retry budget. * trim comments	2026-05-26 16:26:21 -07:00
Chris Lu	77dcb20a74	writeJson: drop unused JSONP branch (#9686 ) * writeJson: drop unused JSONP branch No in-tree caller uses ?callback=. Always serve application/json with X-Content-Type-Options: nosniff. * seaweed-volume: drop unused JSONP branch Mirror Go: always serve application/json with X-Content-Type-Options: nosniff. * writeJson: drop unreachable StatusNotModified check bodyAllowedForStatus already returns early for 304. * test/volume_server: rename and rewrite JSONP test to assert callback is ignored CI: /status?callback=myFunc now returns plain application/json with X-Content-Type-Options: nosniff.	2026-05-26 01:05:07 -07:00
Chris Lu	85ca3cb757	filer: warm-up + fail-closed cooling for POSIX locks on owner (re)start (#9673 ) After a (re)start the owner defers would-be grants for posixLockWarmup while mounts re-assert, trusting only locally-visible conflicts, so it does not double-grant from empty state; a deferred grant is a retry for SetLkw and EAGAIN for non-blocking SetLk, never a spurious grant. Cooling now fail-closes: if the previous owner is unreachable during a ring change, defer rather than risk a double-grant. readyAt is atomic so the handler reads it without locking.	2026-05-25 13:14:05 -07:00
Chris Lu	a3c0baa9b0	filer: cooling-off dual-read for POSIX locks during ring changes (#9672 ) While the ring changed within the last snapshot interval, a fresh owner asks the key's previous owner (LockRing.PriorOwner) whether it still holds a conflicting lock before granting TRY_LOCK or answering GET_LK, so it does not double-grant before re-assertion rebuilds its local state. The probe is marked cooling_probe so the previous owner answers from local state without recursing. PriorOwner uses the snapshot's prebuilt ring rather than rebuilding a hash ring per call.	2026-05-25 12:34:15 -07:00
Chris Lu	f8caaa4464	mount,filer: re-assert POSIX locks via keepalive (ownership migration + restart) (#9668) * mount: renew POSIX lock leases via keepalive The mount tracks the inode keys it holds locks on and a background loop renews its session lease (KEEP_ALIVE) with each key's owner filer every 5s, within the filer's 15s TTL. A live mount is never reaped; a dead one stops renewing and owners reclaim its locks. Tracking is a superset: holds are added on grant and dropped only on owner release, so a still-held lock is never under-renewed. * mount,filer: re-assert held POSIX locks via keepalive The owner filer holds POSIX advisory locks as in-memory soft state, so a key's owner change (ring rebalance) or an owner restart lost or stranded them: the new or restarted owner was blind to existing holders and would double-grant. Make the keepalive carry the mount's held lock ranges per key. The mount mirrors its own granted locks (posixOwn), and each tick re-asserts them to the key's current owner, which rebuilds that session's locks from the assertion, self-healing after a takeover or restart. The owner arbitrates re-asserted locks against other sessions so it never double-grants; a lock that lost a migration race is reported, not forced. A bare keepalive (no ranges) still just renews.	2026-05-25 01:02:45 -07:00
Chris Lu	c97b69f8a4	filer: session lease + reaping for POSIX locks (#9666 ) * filer: session lease + reaping for POSIX locks A mount renews its session lease by keepalive (new KEEP_ALIVE op); the owner filer records last-seen per session and a background sweeper reaps the locks of leased sessions that stop renewing — a dead or partitioned mount. Only sessions that have renewed are leased, so this is inert until mounts run with -posixLock. * mount: route POSIX advisory locks to the owner filer (-posixLock) (#9665) mount: route POSIX advisory locks to the owner filer under -dlm With -dlm, GetLk/SetLk/SetLkw and the flush/release cleanup paths go to the inode's owner filer via the PosixLock RPC instead of the local table, so flock/fcntl are honored across mounts. Advisory locking rides the same switch as whole-file write coordination — and is therefore off under writeback cache, which implies single-writer. The mount calls its filer and relies on filer-side forwarding to reach the owner. Keys are the inode identity (HardLinkId else path); SetLkw is client-side polling with the FUSE cancel channel (no server wait queue); a per-mount session id namespaces owners; a local hint avoids a release RPC on every close. * mount,filer: bound posix-lock release RPCs and stop the reaper on shutdown The unlock/release RPCs run off the syscall path (close/flush) and used context.Background() with no deadline, so a slow or unreachable filer could hang close() indefinitely; bound them to 5s (they still aren't cancelled by an interrupt). The lease-reaping sweeper now selects on a stop channel that FilerServer.Shutdown closes, instead of looping for the process lifetime.	2026-05-25 00:00:59 -07:00
Chris Lu	fef49c2d75	filer: routed PosixLock RPC over the in-memory authority (#9664 ) * filer: in-memory POSIX lock authority (Manager) Concurrent multi-inode authority over the per-inode Set: a Set per opaque inode key (path, or hl:<HardLinkId>) plus a session->keys index so a dead mount's locks reap in O(locks held). Lock state stays in memory like the distributed lock manager's, off the replicated meta-log. TryLock/Unlock/ GetLk/ReleasePosixOwner/ReleaseFlockOwner/ReleaseSession; empty sets and stale index entries are pruned on release. * filer: routed PosixLock RPC over the in-memory authority Adds the PosixLock RPC (try/unlock/get_lk + the flush/release owner drops) that the owner filer answers from its in-memory Manager. The request key is the inode identity ring key; a non-owner filer forwards one hop (is_moved-bounded), mirroring ObjectTransaction, so the owner's table stays the single authority under a stale ring view. Strictly non-blocking; SetLkw polling lives in the mount.	2026-05-24 22:50:42 -07:00
Chris Lu	2a4923e7e8	ObjectTransaction: filer-side forwarding via route_key (#9659 ) A non-owner filer forwards the whole transaction to the ring owner of route_key, so the owner's per-path lock stays the single serialization point even when the caller's ring view is stale. is_moved bounds forwarding to one hop. The gateway stamps route_key on every routed builder via the shared objectRouteKey helper. Completes taking S3 object mutations off the distributed lock.	2026-05-24 14:21:06 -07:00
Chris Lu	1f0c366583	s3: route metadata-only self-copy off the distributed lock (#9638 ) A non-versioned metadata-only self-copy (CopyObject with source == destination and the REPLACE directive) is a read-modify-write of one entry, which is why it held the distributed lock. It now routes to the owner as a serialized PATCH_EXTENDED: the owner merges the new managed metadata (set the replacements, delete the dropped keys) onto a fresh read of the entry under its per-path lock, so a concurrent change to non-managed keys (legal hold, retention, version id) is preserved instead of clobbered, and bumps mtime. PATCH_EXTENDED gains touch_mtime for the mtime bump. Versioned and suspended self-copies create a new version (already routed via the copy finalize) and the no-owner bootstrap keep the lock.	2026-05-24 12:32:57 -07:00
Chris Lu	fa7056dc6f	s3: route object-lock version-specific deletes off the distributed lock (#9657 ) A version-specific DELETE (real version or the null version, including object-lock WORM-checked ones and governance-bypass) now runs as one routed transaction on the object's owner instead of holding the distributed lock. For a real version: recompute the .versions pointer excluding the version (repoint-before-delete, so a crash leaves a recoverable orphan rather than a dangling pointer), then delete the version file, under the object's per-path lock. The null version is the regular object entry, deleted directly (no pointer). Object-lock buckets gate the delete on the version's WORM guards evaluated on the owner: legal hold (always) + retention (while not elapsed). Governance bypass scopes the retention guard to COMPLIANCE mode, so the filer allows a governance-mode delete while still denying compliance and legal hold — the gateway never reads the version. Three primitives make this expressible: - ObjectTransaction.condition_key: evaluate the condition against a named entry (the version) while the lock stays on lock_key (the object). - Recompute.exclude_name: omit a child from the scan, to repoint before delete. - WriteCondition.Clause gate_key/gate_value: scope IF_EXTENDED_TIME_ELAPSED to a mode, expressing governance bypass without a gateway-side read.	2026-05-24 11:41:08 -07:00
Chris Lu	db954b5503	s3: route versioned PutObject finalize off the DLM (#9631 ) s3: route versioned PutObject finalize off the distributed lock A versioned write's finalize (flip the .versions pointer to the newest version, demote the prior latest) now runs as a single RECOMPUTE_LATEST ObjectTransaction on the object's owner filer, under its per-path lock, instead of the unserialized updateLatestVersionInDirectory. The version file is written first; the owner re-derives the pointer by scanning the directory. RECOMPUTE_LATEST gains size_to_key / mtime_to_key to cache the chosen version's size and mtime on the pointer, and demote_key / demote_value to stamp the displaced prior latest (NoncurrentSinceNs for lifecycle) when the pointer moves. Falls back to updateLatestVersionInDirectory when no owner is known yet.	2026-05-24 03:10:30 -07:00
Chris Lu	f037fc4dce	s3: dial the object lock's primary filer directly (#9626 ) * s3: dial the object lock's primary filer directly The S3 object write lock builds a fresh short-lived lock per write, each starting at the seed filer. When the seed isn't the key's hash-ring primary the filer forwards the request to the primary, and in multi-cluster setups that forward crosses clusters on every write. Give the lock client a view of the filer lock ring, fed by the master's LockRingUpdate broadcasts the gateway already receives, so it dials the primary directly. The view tracks filer membership by version; a stale view stays correct because the filer still forwards as a fallback. Also send the initial ring snapshot to S3 clients, not just filers. * s3: subscribe to lock-ring updates before starting the master loop The master delivers the initial LockRingUpdate once, on connect. Registering the callback after KeepConnectedToMaster started left a window where that first update could arrive before the handler was set and be dropped, delaying the ring view until the next membership change. Build the lock client and register the callback in the masters block before launching the loop; the filers block reuses that client (or creates a plain one when no masters are configured). * lock_manager: build the hash ring in a deterministic server order rebuildRing ranged over the server set (a map), whose iteration order is randomized per process. On a vnode hash collision the last writer into vnodeToServer wins, so two nodes holding the same server set could resolve the collision to different servers and disagree on the primary for keys near that slot. Now that the S3 gateway also computes PrimaryForKey, such a disagreement would route the same key to different filers and defeat per-path serialization. Iterate the servers in sorted order so the ring is identical on every node with the same set, regardless of discovery order. 
* lock_manager: skip redundant ring rebuilds, trim comments SetRing now ignores a non-zero version at or below the current one once a ring exists, so repeated LockRingUpdate broadcasts on reconnect no longer rebuild the ring. * s3: hold the lock-ring client on the server for route-by-key Store the object-write lock client on S3ApiServer so handlers can resolve a key's owner filer via PrimaryForKey.	2026-05-24 00:40:43 -07:00
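The deterministic-order fix described in the lock_manager note above can be sketched as follows. This is a minimal illustration, not the project's rebuildRing: `buildRing`, its parameters, and the injected `hash` function are hypothetical names chosen for the example.

```go
package main

import (
	"sort"
	"strconv"
)

// buildRing maps virtual-node hashes to servers (hypothetical helper).
// Servers are visited in sorted order, so when two vnodes hash to the same
// slot, the last writer is the same on every node holding the same set.
func buildRing(servers map[string]struct{}, vnodesPerServer int, hash func(string) uint32) map[uint32]string {
	names := make([]string, 0, len(servers))
	for name := range servers { // map iteration order is randomized
		names = append(names, name)
	}
	sort.Strings(names) // fix the order before writing into the ring

	ring := make(map[uint32]string, len(names)*vnodesPerServer)
	for _, name := range names {
		for i := 0; i < vnodesPerServer; i++ {
			slot := hash(name + "#" + strconv.Itoa(i))
			ring[slot] = name // on a collision, the later name in sorted order wins
		}
	}
	return ring
}
```

Passing a hash function that always returns the same slot forces a collision on every vnode; with sorted iteration the winner is always the server whose name sorts last, regardless of how the set was populated.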
Chris Lu	b4d2224e97	filer: let PATCH_EXTENDED replace Entry.content (#9654 ) * filer: let PATCH_EXTENDED replace Entry.content PATCH_EXTENDED merges extended attributes under the per-path lock, reading the entry fresh, so concurrent patches to different keys don't clobber each other. Some single-key state lives in Entry.content rather than an extended attribute (e.g. the S3 bucket metadata blob). Add set_content/content to the mutation so a patch can replace content the same way -- read fresh, set content, preserve the rest -- letting a content write and an extended-attribute write on the same entry serialize on the lock instead of racing whole-entry rewrites. * Update weed/server/filer_grpc_server.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * filer: test set_content FileSize sync; note chosen content-patch approach Cover the FileSize behavior of a set_content patch: a file's size follows the new content length (including when it shrinks), a directory's stays zero. Also document, in the bucket-config design, that extending PATCH_EXTENDED with set_content is the implemented path for content-backed config. --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-05-23 21:43:43 -07:00
Chris Lu	83195fc111	filer: reuse the caller's fetched entry in CreateEntry (#9645) CreateEntry starts with a FindEntry to load the current entry. A conditional CreateEntry already fetched that entry to evaluate the precondition under the per-path lock, so the create repeated the lookup. Add an existing *Entry parameter: when non-nil it is used as the current entry and the internal lookup is skipped; nil keeps the lookup. The gRPC CreateEntry handler passes the entry it fetched for the precondition, removing the redundant read while the lock is held. All other callers pass nil.	2026-05-23 21:40:41 -07:00
Chris Lu	091aad59dc	filer: add ObjectTransactionBatch for multi-key object writes (#9649) A multi-object delete spans many keys that route to different owner filers. The gateway groups keys by owner and sends one batch per owner; the filer applies each transaction under its own per-path lock, independent of the others. A failed transaction (precondition or mutation error) is reported in its own response without aborting the rest, matching S3 multi-object semantics where each key succeeds or fails on its own. There is no cross-key atomicity, which S3 batch delete does not require.	2026-05-23 21:09:02 -07:00
Chris Lu	e2203b2a0b	filer: add extended-attribute guard clauses for object-lock (#9648 ) Routing object-lock buckets off the distributed lock needs the retention and legal-hold check to run atomically with the write, under the per-path lock. Move just the comparison into the filer, not the S3 semantics: two generic clause kinds on an extended attribute. IF_EXTENDED_NOT_EQUAL blocks while extended[ext_key] equals ext_value (a legal hold). IF_EXTENDED_TIME_ELAPSED blocks while extended[ext_key], read as a unix- second deadline, is in the future against the filer's clock (retention); a malformed deadline fails safe. The caller composes these from the object-lock state and, for a governance bypass, simply omits the retention clause once the bypass is authorized -- the filer makes no authorization decision and keeps no S3 knowledge.	2026-05-23 19:38:08 -07:00
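The two guard clauses in the entry above can be illustrated with a small sketch. The type and function names are hypothetical, and the treatment of an absent attribute (no guard, so the write proceeds) is an assumption the commit message does not state.

```go
package main

import "strconv"

type clauseKind int

const (
	ifExtendedNotEqual    clauseKind = iota // blocks while extended[key] == value
	ifExtendedTimeElapsed                   // blocks while the deadline is in the future
)

// clause is a hypothetical in-memory form of one guard clause.
type clause struct {
	kind     clauseKind
	extKey   string
	extValue string // only used by ifExtendedNotEqual
}

// holds reports whether the write may proceed at time nowUnix (unix seconds).
func (c clause) holds(extended map[string][]byte, nowUnix int64) bool {
	raw, present := extended[c.extKey]
	switch c.kind {
	case ifExtendedNotEqual:
		// Assumption: an absent attribute counts as "not equal".
		return !present || string(raw) != c.extValue
	case ifExtendedTimeElapsed:
		if !present {
			return true // assumption: no deadline recorded, nothing to wait for
		}
		deadline, err := strconv.ParseInt(string(raw), 10, 64)
		if err != nil {
			return false // a malformed deadline fails safe: keep blocking
		}
		return deadline <= nowUnix
	}
	return false // unknown clause kinds block
}

// allHold combines clauses with logical AND.
func allHold(clauses []clause, extended map[string][]byte, nowUnix int64) bool {
	for _, c := range clauses {
		if !c.holds(extended, nowUnix) {
			return false
		}
	}
	return true
}
```

A governance bypass in this model is simply the caller leaving the retention clause out of the slice passed to `allHold`.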
Chris Lu	e71bac55e9	filer: add RECOMPUTE_LATEST mutation to ObjectTransaction (#9647 ) Deleting a specific version that happens to be the latest needs the new latest re-derived from the remaining versions, and that scan must run under the same lock as the delete. The gateway can't do it atomically across RPCs. Add a RECOMPUTE_LATEST mutation: it scans a directory under the transaction lock, picks the child that sorts last (descending) or first by name, copies the mapped extended keys from it into a pointer entry, and stores its name under name_to_key. An empty directory clears the pointer keys. The filer stays mechanical and S3-agnostic: the caller, which knows the versioning scheme, supplies the sort direction and the key mappings. A missing pointer entry is a no-op, so a replayed transaction is idempotent.	2026-05-23 18:29:46 -07:00
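A minimal sketch of the recompute step described above, using plain maps in place of filer entries. All names are hypothetical; clearing a mapped key when the chosen child lacks it is an assumption made for the example.

```go
package main

import "sort"

// child is a hypothetical stand-in for one directory entry.
type child struct {
	name     string
	extended map[string]string
}

// recomputeLatest re-derives the pointer keys from the directory listing.
// keyMap maps a child's extended key to the pointer key it is copied into.
func recomputeLatest(pointer map[string]string, children []child, descending bool, nameToKey string, keyMap map[string]string) {
	if pointer == nil {
		return // missing pointer entry: no-op, so a replay is harmless
	}
	if len(children) == 0 {
		// Empty directory: clear the pointer keys.
		delete(pointer, nameToKey)
		for _, dst := range keyMap {
			delete(pointer, dst)
		}
		return
	}
	sorted := append([]child(nil), children...) // do not reorder the caller's slice
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].name < sorted[j].name })
	chosen := sorted[0]
	if descending {
		chosen = sorted[len(sorted)-1]
	}
	pointer[nameToKey] = chosen.name
	for src, dst := range keyMap {
		if v, ok := chosen.extended[src]; ok {
			pointer[dst] = v
		} else {
			delete(pointer, dst) // assumption: drop stale values
		}
	}
}
```

Running the function twice with the same listing leaves the pointer unchanged, which is the idempotence property the commit message relies on.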
Chris Lu	bf022ca018	filer: add ObjectTransaction for atomic multi-entry object writes (#9646 ) A versioned object write touches several entries that must change together: the main object, a delete marker or version file, and the latest pointer on the .versions directory. Holding a distributed lock across separate RPCs to do this is what the per-path lock was meant to replace, but a single CreateEntry only covers one entry. Add ObjectTransaction: a request carries a lock_key (the object path), an optional WriteCondition, and an ordered list of mutations (PUT / DELETE / PATCH_EXTENDED). The filer holds the per-path lock on lock_key for the whole call, checks the condition against the entry at lock_key, then applies the mutations in order. Callers route the object's writes to its owner filer so the lock is authoritative across all of the object's entries. DELETE and PATCH of an absent entry are no-ops, so a replayed transaction is idempotent. PUT entries are metadata-scoped; data-bearing writes (chunks) are written before the transaction, as today.	2026-05-23 17:34:30 -07:00
Chris Lu	b18d3dc96c	filer: evaluate a write precondition in CreateEntry (#9650 ) Add an optional WriteCondition to CreateEntryRequest. When set, the filer evaluates it against the current entry while holding the per-path lock, so the check and the write are atomic on this filer, and returns PRECONDITION_FAILED when it does not hold. The caller must route the key's writes to the owner filer for the check to be authoritative. A condition is a list of clauses that all must hold (logical AND). One clause is the common case; several express what a single comparison cannot: an ETag set (If-Match / If-None-Match with multiple values), weak-ETag comparison, and compound conditions. ETag comparison mirrors the S3 gateway's precedence (stored Seaweed ETag attribute, then the Md5/chunk fallback) and follows RFC 7232 strong/weak rules, so results match without coupling the filer to S3 handling. Condition parsing and evaluation live in filer_grpc_server_condition.go.	2026-05-23 16:29:14 -07:00
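The strong and weak ETag comparison rules from RFC 7232 that the entry above refers to can be sketched like this. The helper names are hypothetical and the sketch ignores the `*` wildcard and the Md5/chunk fallback.

```go
package main

import "strings"

// splitETag separates the optional weak prefix (W/) from the opaque tag.
func splitETag(tag string) (opaque string, weak bool) {
	tag = strings.TrimSpace(tag)
	if strings.HasPrefix(tag, "W/") {
		return tag[2:], true
	}
	return tag, false
}

// etagEqual compares two entity tags. Strong comparison requires both tags
// to be strong and identical; weak comparison ignores the W/ prefix.
func etagEqual(a, b string, weakComparison bool) bool {
	aOpaque, aWeak := splitETag(a)
	bOpaque, bWeak := splitETag(b)
	if !weakComparison && (aWeak || bWeak) {
		return false
	}
	return aOpaque == bOpaque
}

// anyETagMatches evaluates a set of candidate tags against the stored one,
// as needed for If-Match / If-None-Match headers carrying several values.
func anyETagMatches(stored string, candidates []string, weakComparison bool) bool {
	for _, c := range candidates {
		if etagEqual(stored, c, weakComparison) {
			return true
		}
	}
	return false
}
```

If-Match uses the strong comparison and If-None-Match the weak one, so a caller would pass `false` or `true` for `weakComparison` accordingly.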
Chris Lu	bce76e6e21	filer: serialize same-path mutations with a per-path lock (#9639 ) CreateEntry is a FindEntry-then-write with no lock, so concurrent creates to the same path race: OExcl can admit two creators, and a conditional check-then-act has no atomicity. Add a per-path exclusive lock (util.LockTable, which evicts idle keys so it stays bounded) on the FilerServer and take it in CreateEntry, so the existence check and the write are atomic on this filer. This is the local serialization point that lets callers route a key's writes to its owner filer and drop the distributed lock for that key. AppendToEntry keeps its distributed lock for now; it can move to the per-path lock once its callers route to the owner.	2026-05-23 14:22:42 -07:00
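The idea of a bounded per-path lock that makes check-then-write atomic can be sketched as below. This is not util.LockTable; `pathLocks`, `createExclusive`, and `raceCreates` are hypothetical names, and eviction here is a simple reference count.

```go
package main

import "sync"

// pathLocks hands out one mutex per path and evicts idle entries.
type pathLocks struct {
	mu    sync.Mutex
	locks map[string]*pathLock
}

type pathLock struct {
	mu   sync.Mutex
	refs int // holders plus waiters; the entry is evicted at zero
}

func newPathLocks() *pathLocks {
	return &pathLocks{locks: make(map[string]*pathLock)}
}

// lock blocks until the caller owns path and returns the release function.
func (p *pathLocks) lock(path string) (unlock func()) {
	p.mu.Lock()
	l, ok := p.locks[path]
	if !ok {
		l = &pathLock{}
		p.locks[path] = l
	}
	l.refs++
	p.mu.Unlock()

	l.mu.Lock()
	return func() {
		l.mu.Unlock()
		p.mu.Lock()
		l.refs--
		if l.refs == 0 {
			delete(p.locks, path) // keep the table bounded
		}
		p.mu.Unlock()
	}
}

// createExclusive does the existence check and the write under one lock,
// so only one concurrent creator of the same path can succeed.
func createExclusive(p *pathLocks, store *sync.Map, path, value string) bool {
	unlock := p.lock(path)
	defer unlock()
	if _, exists := store.Load(path); exists {
		return false
	}
	store.Store(path, value)
	return true
}

// raceCreates runs n concurrent creators against one path and reports how
// many won and how many lock entries remain afterwards.
func raceCreates(n int) (winners, idleKeys int) {
	p := newPathLocks()
	var store sync.Map
	results := make(chan bool, n)
	for i := 0; i < n; i++ {
		go func() { results <- createExclusive(p, &store, "/buckets/b/obj", "v") }()
	}
	for i := 0; i < n; i++ {
		if <-results {
			winners++
		}
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	return winners, len(p.locks)
}
```

Calling `raceCreates(32)` should report exactly one winner and an empty lock table, since every creator releases its reference before reporting its result.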
Chris Lu	9021225591	master: accept volume-server Ping targets on follower masters (#9614) cluster.check asks every master to ping every volume server, but the Ping gate validated volume-server targets only against the local topology. Only the leader receives volume-server heartbeats, so a follower's topology is empty and every probe through it failed with "unknown ping target ... of type volumeServer". Fall back to the volume-server set the master learns over its own MasterClient subscription to the leader, the same source the filer gate already trusts. The anti-SSRF intent is preserved: Ping still only dials recognized cluster members.	2026-05-21 10:19:59 -07:00
Chris Lu	5af7d12f04	fix(filer.sync): keep sync_offset fresh while the source is read-only (#9589 ) * fix(filer.sync): keep sync_offset fresh while the source is read-only sync_offset holds the timestamp of the last replicated source event, so monitoring derives lag from now-sync_offset. A read-only source emits no metadata events, so the gauge froze at the last write and the derived lag grew without bound, making thresholds unusable. The source filer now sends an idle heartbeat carrying its current time while a subscriber is caught up to the buffer head. filer.sync uses it to advance the gauge, so now-sync_offset reflects real lag. Heartbeats are opt-in (client_supports_idle_heartbeat), are never written to the metadata log, and do not move the resume checkpoint, so a restart still resumes from the last real event. * fix(filer.sync): gate idle heartbeat on the read cursor, not SinceNs In metadata-chunks mode persisted entries replay as log file refs and never reach eachLogEntryFn, so lastSeenTsNs stays put and a caught-up subscriber with an old SinceNs would never get a heartbeat. Use the read cursor (lastReadTime), which advances in that mode too, max'd with lastSeenTsNs so the in-memory backlog-then-idle case still works while the cursor returned to the caller has not yet updated.	2026-05-20 11:26:37 -07:00
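How an idle heartbeat can advance the lag gauge without moving the resume checkpoint is sketched below. The struct and method names are hypothetical and stand in for the gauge and checkpoint handling in filer.sync.

```go
package main

// syncProgress tracks two separate positions (hypothetical model).
type syncProgress struct {
	syncOffsetNs int64 // gauge: monitoring derives lag as now - syncOffsetNs
	checkpointNs int64 // resume point: only real events move it
}

// onEvent records a replicated source event.
func (p *syncProgress) onEvent(eventTsNs int64) {
	p.checkpointNs = eventTsNs
	if eventTsNs > p.syncOffsetNs {
		p.syncOffsetNs = eventTsNs
	}
}

// onIdleHeartbeat records the source's current time while the subscriber is
// caught up. It refreshes the gauge but never touches the checkpoint.
func (p *syncProgress) onIdleHeartbeat(sourceNowNs int64) {
	if sourceNowNs > p.syncOffsetNs {
		p.syncOffsetNs = sourceNowNs
	}
}

// lagNs is what a monitoring rule would compute from the gauge.
func (p *syncProgress) lagNs(nowNs int64) int64 {
	return nowNs - p.syncOffsetNs
}
```

With a read-only source, only `onIdleHeartbeat` is called, so the derived lag stays small while a restart would still resume from the last real event.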
Chris Lu	77ac781bbd	fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers (#9568 ) * fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers When a volume server holds EC shards for the same vid across more than one disk, each DiskLocation registers its own EcVolume entry and Store.FindEcVolume returns whichever one it hits first. The shard-info RPC iterated only that single EcVolume's Shards, so the response missed every shard mounted on a sibling disk. The worker's verifyEcShardsBeforeDelete sums the per-server responses into a union bitmap and refuses to delete the source volume when the union falls short of dataShards+parityShards. On multi-disk destinations, the union was systematically under-counted and source deletion got blocked even though all shards were physically present and mounted. Walk every DiskLocation in the handler and emit the deduplicated union of all shards. The .ecx-backed fields (file counts, volume size) still come from a single EcVolume since every disk's entry opens the same .ecx via NewEcVolume's cross-disk fallback. Tests: - TestVolumeEcShardsInfo_AggregatesAcrossDisks unit test in weed/server/. - test/volume_server/grpc/ec_verify_multi_disk_test.go integration test drives the full generate -> mount -> redistribute -> restart -> reconcile path and asserts both VolumeEcShardsInfo and VerifyShardsAcrossServers + RequireFullShardSet (the production source-deletion gate) report all 14 shards. - ec_multi_disk_lifecycle_test.go tightened: replaces the "VolumeEcShardsInfo only sees one disk's EcVolume" workaround with a full-shard-set assertion. * review: use ShardBits bitmask + cap-pre-allocation for shard dedup	2026-05-19 14:58:56 -07:00
Chris Lu	68794fb94c	fix(ec_distribute): remove partial files on copy stream error (#9543 ) * fix(ec_distribute): remove partial files on copy stream error writeToFile opens the destination with O_TRUNC and streams into it. On a mid-stream receive / write / cancellation error it returned the failure but left the destination behind in whatever state had been written so far — typically 0 bytes when the source errored before sending any FileContent. VolumeEcShardsCopy distributes .ecx by calling doCopyFile, so this same stub-leaving behaviour produced the 0-byte .ecx files seen on EC encoding failures: the source claims a non-zero ModifiedTsNs (so the existing "source not found" cleanup doesn't fire), the stream then errors immediately, and the receiver ends up with a 0-byte .ecx that downstream code mistook for a valid empty index. Clean up the partial file on every error path that returns from the streaming loop (receive, write, and cancellation). Skip cleanup when isAppend=true so resumable appends keep their existing content. As defense in depth, VolumeEcShardsCopy also stats the .ecx after copy and removes / errors on a 0-byte result so the orchestrator can pick a different source. The Rust volume server has only the source side of CopyFile (no client-side stream-to-disk consumer) and no .ecx subsystem yet, so this fix has no Rust mirror. * fix(ec_distribute): close file before remove, fail fast on stat error Address review feedback: - writeToFile's mid-stream removeIncomplete called os.Remove while the destination file handle was still open. On Windows os.Remove fails while a handle is open, so the cleanup wouldn't run there. Wrap the handle close in a once-only helper, call it from removeIncomplete and from the existing "source not found" cleanup, and keep a deferred close as the safety net for the normal-return path. 
- VolumeEcShardsCopy's post-copy .ecx check silently passed when os.Stat returned an error: doCopyFile had reported success but if the file was already gone, unreadable, or somehow a directory, the orchestrator only learned at mount time with no useful context. Treat any non-nil stat error and any directory result as a copy failure here and surface it immediately.	2026-05-18 15:19:51 -07:00
Chris Lu	6b94701213	mini: quieter startup with a docker-compose-style progress board (#9524 ) * mini: quieter startup with a docker-compose-style progress board Replaces noisy startup/shutdown logs with a single in-place progress table on a TTY (or one line per state change off-TTY). Each component renders as `pending -> starting -> ready` during startup and `stopping -> stopped` during shutdown, with elapsed time on transition. Also folds in a few cleanups uncovered while making this readable: - route the admin.go startup prints through glog so quietMiniLogs() filters them under mini but standalone weed admin still shows them - generate a dev SSE-S3 KEK + passphrase on first run via WEED_S3_SSE_KEK and WEED_S3_SSE_KEK_PASSPHRASE env vars (viper.Set has a nested-key conflict between s3.sse.kek and s3.sse.kek.passphrase); persisted under the data folder so restarts reuse the same key - demote worker/master gRPC Recv 'context canceled' to V(1); those are the normal shutdown signal, not Errors/Warnings - drop the 'Optimized Settings' block and the 'credentials loaded from environment variables' message from the welcome banner - only show the credentials setup hints when no S3 identities exist (new s3api.HasAnyIdentity accessor backed by an atomic.Bool) - use S3_BUCKET in the credentials hint so it pairs with AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY - reorder running-services list to master / volume / filer / webdav / s3 / iceberg / admin * mini: refuse in-memory-only SSE-S3 dev keys; surface admin serve errors loadOrCreateMiniHexSecret returns "" when os.WriteFile fails, so SSE-S3 won't encrypt data under a KEK that the next restart can't reproduce (which would orphan whatever was written this run). The caller already treats "" as "skip setting WEED_S3_SSE_* env vars", so SSE-S3 and IAM just stay disabled for this run. startAdminServer's serve goroutine used to only log ListenAndServe failures, so a bind error left the caller blocked on ctx.Done() with no listener. 
Forward the error through a buffered channel and select on it alongside ctx.Done(). * ci(s3-proxy-signature): match weed mini's new progress-board ready line The readiness probe grepped for "S3 (gateway\|service).*(started\|ready)", which matched weed mini's old "S3 service is ready at ..." line. Mini now emits " S3 ready (Xs)" from its progress board, so the old pattern misses and the test timed out at the 30-second wait. Widen the alternation to also accept "S3\s+ready". The curl HEAD fallback already covers any remaining cases.	2026-05-17 19:13:09 -07:00
Chris Lu	2a41e76101	fix(ec): blanket-clean every destination over the full shard range (#9512 ) * fix(ec): blanket-clean every destination over the full shard range The previous cleanup pass walked t.sources only, with the shard ids the topology had reported at detection time. In the wild, a destination can end up with EC shards mounted that the topology snapshot didn't list — shards on a sibling disk that hadn't heartbeated, or shards left over from a concurrent attempt's mount step. FindEcVolume still returns true, so the next ReceiveFile trips the mounted-volume guard. Cleanup now unions t.sources (with ShardIds) and t.targets and issues unmount + delete over [0..totalShards-1] on each. Both RPCs are idempotent on missing shards, so the wider sweep is free. Two new tests cover the gap: shards mounted beyond what t.sources lists, and a target-only destination with no source row. * log(ec): include disk_id in EC unmount/delete/refusal log lines The current logs identify the volume and shard but leave disk_id off, which makes the cross-server cleanup story hard to follow when multiple disks of one server hold pieces of the same volume: UnmountEcShards 4121.1 -> add disk_id ec volume video-recordings_4121 shard delete [1 5] -> add per-loc disk_id volume server X:Y deletes ec shards from 4121 [...] -> add disk_id ReceiveFile: ec volume 4121 is mounted; refusing... -> add disk_ids ReceiveFile's refusal now names the disk_ids actually holding the mount so operators can see whether the next cleanup pass needs to target a sibling disk. Added Store.FindEcVolumeDiskIds / Store::find_ec_volume_disk_ids as the supporting primitive. Mirrored in seaweed-volume/src/ (unmount log in Store::unmount_ec_shard, heartbeat delete log in diff_ec_shard_delta_messages, refusal in the ReceiveFile handler). * test(ec): stub VolumeEcShardsUnmount/Delete on the fake volume server The plugin-worker EC tests boot a fake volume server that embeds UnimplementedVolumeServerServer. 
After the worker started calling VolumeEcShardsUnmount + VolumeEcShardsDelete pre-distribute, the default Unimplemented response surfaced as fourteen "method not implemented" errors and TestErasureCodingExecutionEncodesShards failed. Both RPCs are no-ops here — nothing on the fake server has mounted state or persisted shard files to remove.	2026-05-17 11:31:37 -07:00
Chris Lu	62821964dd	filer/iam-grpc: make admin Bearer auth opt-in (fixes #9509 ) (#9514 ) PR #9442 made the filer refuse to register the IAM gRPC service unless jwt.filer_signing.key was set in security.toml, which broke the admin UI Users/Groups/Policies pages for every deployment that ships without a security.toml — weed mini, plain Helm, vanilla weed filer. The Users tab returns Unimplemented and the page is unusable. Issues #9504, #9505 and #9509 all trace to this gap. The rest of the filer's gRPC surface is unauthenticated by default; treat IAM the same way. The service now always registers, and the auth gate is a no-op when no signing key is configured. When the key is set, every RPC still requires an admin-signed Bearer token, matching the post-#9442 behaviour. Operators who expose the filer gRPC port beyond a trusted network should set the key on both filer and admin. The admin client (IamGrpcStore.withIamClient) already skips attaching the authorization metadata when its key is empty, so no changes there.	2026-05-15 13:15:20 -07:00
Chris Lu	bfb2661fec	fix(tests): make 32-bit GOARCH tests build and run (#9507) fix(tests): make 32-bit GOARCH tests build and run (#9503) verifyTestFilerClient had bare int64 atomic counters after a map header, so atomic.AddInt64 panicked with "unaligned 64-bit atomic operation" on linux/386. Switch to atomic.Int64, which the stdlib guarantees is 8-byte aligned on all platforms. rpc_version_filter_test.go passed the untyped constant 0xdeadbeef to t.Errorf, where it default-promoted to int and overflowed 32-bit int. Bind it to a typed uint32 const used in both the comparison and the error message.	2026-05-14 20:55:37 -07:00
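Both 32-bit fixes in the entry above follow a general Go pattern, sketched here with hypothetical names (`callCounters`, `recordLookup`, `sentinel`) rather than the identifiers from the test files.

```go
package main

import "sync/atomic"

// callCounters is a hypothetical counter struct. atomic.Int64 is aligned to
// 64 bits automatically, whereas a bare int64 field that follows a
// pointer-sized field can be misaligned on 32-bit platforms, which makes
// atomic.AddInt64 panic there.
type callCounters struct {
	seen    map[string]bool // pointer-sized field ahead of the counters
	lookups atomic.Int64
	updates atomic.Int64
}

// sentinel is typed, so passing it to a variadic ...any parameter keeps it
// uint32. An untyped 0xdeadbeef would default to int, which does not fit
// where int is 32 bits wide.
const sentinel uint32 = 0xdeadbeef

// recordLookup increments the lookup counter and returns the new value.
func recordLookup(c *callCounters) int64 {
	return c.lookups.Add(1)
}
```

The same struct compiles and runs unchanged with GOARCH=386, because the alignment guarantee comes from the atomic.Int64 type itself rather than from field ordering.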
Chris Lu	3a8389cd68	fix(ec): verify full shard set before deleting source volume (#9490 ) (#9493 ) * fix(ec): verify full shard set before deleting source volume (#9490) Before this change, both the worker EC task and the shell ec.encode command would delete the source .dat as soon as MountEcShards returned — even if distribute/mount failed partway, leaving fewer than 14 shards in the cluster. The deletion was logged at V(2), so by the time someone noticed missing data the only trace was a 0-byte .dat synthesized by disk_location at next restart. - Worker path adds Step 6: poll VolumeEcShardsInfo on every destination, union the bitmaps, and refuse to call deleteOriginalVolume unless all TotalShardsCount distinct shard ids are observed. A failed gate leaves the source readonly so the next detection scan can retry. - Shell ec.encode adds the same gate after EcBalance, walking the master topology with collectEcNodeShardsInfo. - VolumeDelete RPC success and .dat/.idx unlinks now log at V(0) so any source destruction is traceable in default-verbosity production logs. The EC-balance-vs-in-flight-encode race is intentionally left for a follow-up; balance should refuse to move shards for a volume whose encode job is not in Completed state. * fix(ec): trim doc comments on the new shard-verification path Drop WHAT-describing godoc on freshly added helpers; keep only the WHY notes (query-error policy in VerifyShardsAcrossServers, the #9490 reference at the call sites). * fix(ec): drop issue-number anchors from new comments Issue references age poorly — the why behind each comment already stands on its own. * fix(ec): parametrize RequireFullShardSet on totalShards Take totalShards as an argument instead of reading the package-level TotalShardsCount constant. The OSS callers continue to pass 14, but the helper is now usable with any DataShards+ParityShards ratio. 
* test(plugin_workers): make fake volume server respond to VolumeEcShardsInfo The new pre-delete verification gate calls VolumeEcShardsInfo on every destination after mount, and the fake server's UnimplementedVolumeServer returns Unimplemented — the verifier read that as zero shards on every node and aborted source deletion. Build the response from recorded mount requests so the integration test exercises the gate end-to-end. * fix(rust/volume): log .dat/.idx unlink with size in remove_volume_files Mirror the Go-side change in weed/storage/volume_write.go: stat each file before removing and emit an info-level log for .dat/.idx so a destructive call is always traceable. The OSS Rust crate previously unlinked them silently. * fix(ec/decode): verify regenerated .dat before deleting EC shards After mountDecodedVolume succeeds, the previous code immediately unmounts and deletes every EC shard. A silent failure in generate or mount could leave the cluster with neither shards nor a valid normal volume. Probe ReadVolumeFileStatus on the target and refuse to proceed if dat or idx is 0 bytes. Also make the fake volume server's VolumeEcShardsInfo reflect whichever shard files exist on disk (seeded for tests as well as mounted via RPC), so the new gate can be exercised end-to-end. * fix(ec): address PR review nits in verification + fake server - Drop unused ServerShardInventory.Sizes field. - Skip shard ids >= MaxShardCount before bitmap Set so the ShardBits bound is explicit (Set already no-ops on overflow, this is for clarity). - Nil-guard the fake server's VolumeEcShardsInfo so a malformed call doesn't panic the test process.	2026-05-13 19:29:24 -07:00
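The pre-delete gate described in this entry, a union of per-server shard bitmaps checked against the expected total, can be sketched as follows. The function names and the `maxShardCount` bound are assumptions for illustration, not the project's helpers.

```go
package main

import "fmt"

// maxShardCount bounds the bitmap width (hypothetical constant).
const maxShardCount = 32

// unionShardBits folds the shard ids reported by each server into one
// bitmap; bit i is set when some server holds shard i.
func unionShardBits(perServer map[string][]int) uint32 {
	var union uint32
	for _, ids := range perServer {
		for _, id := range ids {
			if id < 0 || id >= maxShardCount {
				continue // out-of-range ids are ignored explicitly
			}
			union |= 1 << uint(id)
		}
	}
	return union
}

// requireFullShardSet returns an error naming the missing shard ids unless
// all totalShards distinct ids were observed.
func requireFullShardSet(union uint32, totalShards int) error {
	var missing []int
	for id := 0; id < totalShards; id++ {
		if union&(1<<uint(id)) == 0 {
			missing = append(missing, id)
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("missing shards %v of %d; keeping source volume", missing, totalShards)
	}
	return nil
}
```

A caller would delete the source volume only when `requireFullShardSet` returns nil; duplicate reports of the same shard from several servers do not inflate the count because the union is a bitmap.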
Chris Lu	d5c0a7b153	fix(ec): make multi-disk same-server EC reads work + full-lifecycle integration test (#9487 ) * fix(master): include GrpcPort in LookupEcVolume response LookupVolume already passes loc.GrpcPort through to the client; LookupEcVolume builds Location with only Url / PublicUrl / DataCenter, so callers fall back to ServerToGrpcAddress (httpPort + 10000). On any deployment where that convention does not hold — multi-disk integration tests, custom port layouts — EC reads dial the wrong port and quietly degrade to parity recovery. * fix(volume/ec): probe every DiskLocation when serving local shard reads reconcileEcShardsAcrossDisks (issue 9212) registers each .ec?? against the DiskLocation that physically owns it, so a multi-disk volume server can hold shards for the same vid in two separate ecVolumes — one per disk — with .ecx on whichever disk owned the original .dat. The read path only consulted the single EcVolume FindEcVolume picked, so requests for shards on the sibling disk fell through to errShardNotLocal and then to remote/loopback recovery. Walk all DiskLocations after the first probe in both readLocalEcShardInterval and the VolumeEcShardRead gRPC handler; the latter also covers the loopback that recoverOneRemoteEcShardInterval falls back to when a peer dial fails. * test(volume/ec): cover the multi-disk EC lifecycle end-to-end Two integration tests against a real volume server with two data dirs: TestEcLifecycleAcrossMultipleDisks drives encode -> mount -> HTTP read -> drop .dat -> stop -> redistribute shards across disks -> restart -> verify reconcileEcShardsAcrossDisks attached the orphan shards and reads still work -> blob delete -> stop -> drop a shard -> restart -> VolumeEcShardsRebuild pulls input from both disks -> reads still work. 
TestEcPartialShardsOnSiblingDiskCleanedUpOnRestart is the issue 9478 reproducer at the cluster level: seed a healthy .dat on disk 0, plant the on-disk footprint of an interrupted EC encode on disk 1, restart, and assert pruneIncompleteEcWithSiblingDat wipes disk 1 without touching disk 0. Framework gets RestartVolumeServer / StopVolumeServer helpers; the previous run's volume.log is rotated to volume.log.previous so a startup regression on the second run does not lose the first run's diagnostics. * review: trim verbose comments * review: drop racy fast-path, use locked findEcShard directly gemini-code-assist flagged the two-step lookup in readLocalEcShardInterval and VolumeEcShardRead: the first probe (ecVolume.FindEcVolumeShard) reads the EcVolume's Shards slice without holding ecVolumesLock, so a concurrent mount / unmount could race with it. findEcShard already walks every DiskLocation under the right lock, so the fast-path adds nothing but the race. Collapse both call sites to a single locked call. Also note in RestartVolumeServer why the log-rotation error is swallowed: absence on first call is benign; anything else surfaces in the next os.Create in startVolume.	2026-05-13 13:56:20 -07:00
Chris Lu	f51468cf73	Revert #9443 — heartbeat peer binding breaks hostname-based clusters (#9474 ) Revert "master: bind heartbeat claims to the connecting peer (#9443)" This reverts commit `f28c7ce6df`. The strict heartbeat-ip-vs-peer match in authorizeHeartbeatPeer rejects every hostname-based deployment. In docker-compose / k8s the volume server is started with -ip=<service-name> and the gRPC peer surfaces as the container/pod IP, so the two never match and every heartbeat fails with `heartbeat ip "volume" does not match peer "172.18.0.3"`. The master therefore never learns about any volume, growth fails, and fio writes against the mount return EIO. After the #9440 revert merged (`43a8c4fdc`), the e2e workflow is still failing for this reason; see https://github.com/seaweedfs/seaweedfs/actions/runs/25767265775 . Reverting to unblock e2e. A narrower re-do should accept the heartbeat when heartbeat.Ip resolves (DNS) to the peer address, so the spoof hardening can return without breaking hostname-based clusters.	2026-05-12 18:22:21 -07:00
Chris Lu	43a8c4fdca	Revert #9440 — volume admin fail-closed gate breaks multi-host clusters (#9472 ) * Revert "volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)" This reverts commit `21054b6c18`. The fail-closed gate broke any multi-host cluster: in compose / k8s / remote-host deployments the master's IP isn't loopback, so every master->volume admin RPC (AllocateVolume, BatchDelete, EC reroute, vacuum, scrub, ...) is rejected with PermissionDenied unless the operator manually configures -whiteList. The e2e workflow has been failing since `10cc06333` with `not authorized: 172.18.0.2` on AllocateVolume; downstream symptom is fio fsync EIO because zero volumes can be grown. The gate's intent was to lock down destructive admin tooling, but the same RPCs are the master's normal mechanism for growing and managing volumes. Reverting to restore cluster-internal operation; a narrower re-do should distinguish operator/admin callers from the master peer (e.g. trust IPs resolved from -master) before going back in. * security: skip invalid CIDR in UpdateWhiteList so IsWhiteListed can't panic The revert in the previous commit also rolled back an unrelated bug fix that lived inside #9440: UpdateWhiteList logged on net.ParseCIDR error but did not continue, so the nil *net.IPNet was stored in whiteListCIDR and IsWhiteListed would panic dereferencing cidrnet.Contains(remote) on the next gRPC admin check. Restore the continue. Orthogonal to the fail-closed semantics this PR is reverting.	2026-05-12 16:00:44 -07:00
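The restored `continue` in UpdateWhiteList guards against storing a nil network. A minimal sketch of the pattern, with hypothetical function names and a simplified whitelist layout:

```go
package main

import (
	"net"
	"strings"
)

// parseWhiteList splits entries into exact IPs and CIDR ranges. An entry
// that fails net.ParseCIDR is skipped, so no nil *net.IPNet is ever stored.
func parseWhiteList(entries []string) (ips map[string]struct{}, cidrs []*net.IPNet) {
	ips = make(map[string]struct{})
	for _, entry := range entries {
		if strings.Contains(entry, "/") {
			_, ipnet, err := net.ParseCIDR(entry)
			if err != nil {
				continue // skip the bad entry instead of appending nil
			}
			cidrs = append(cidrs, ipnet)
			continue
		}
		ips[entry] = struct{}{}
	}
	return ips, cidrs
}

// isWhiteListed checks a remote address against both sets.
func isWhiteListed(ips map[string]struct{}, cidrs []*net.IPNet, remote string) bool {
	if _, ok := ips[remote]; ok {
		return true
	}
	ip := net.ParseIP(remote)
	if ip == nil {
		return false
	}
	for _, ipnet := range cidrs {
		if ipnet.Contains(ip) {
			return true
		}
	}
	return false
}
```

Without the `continue`, the nil result of a failed parse would land in the slice and the later `Contains` call would dereference it.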
Chris Lu	f28c7ce6df	master: bind heartbeat claims to the connecting peer (#9443 ) SendHeartbeat used to accept whatever Ip/Port/Volumes the caller put on the wire. Three changes tighten that: - Reject heartbeats whose Ip does not match the gRPC peer's source address. Loopback peers are still trusted; operators behind a proxy can opt out with -master.allowUntrustedHeartbeat. - Track which (ip, port) first claimed a volume id or an ec shard slot and drop foreign re-claims. Non-EC volume claims are bounded by the replica copy count so legitimate replicas still register. EC ownership is keyed by (vid, shard_id) so the same vid can legitimately be split across many peers as long as their EcIndexBits are disjoint; rejected bits are cleared from the bitmap and the parallel ShardSizes array is compacted in lock-step. - Maintain reverse indexes owner -> volumes and owner -> ec shard slots so disconnect cleanup is O(M) in what that peer held rather than O(N) over the whole map. Bindings are also released when a heartbeat reports that the peer no longer holds an id, either via explicit Deleted{Volumes,EcShards} entries or by omitting it from a full snapshot. Without this, a planned rebalance that moved a vid or an ec shard from peer A to peer B would leave B's heartbeats permanently filtered out until A disconnected, breaking ec encode/decode flows that delete shards on the source as soon as the move completes. The (vid -> owners) binding still does not track which replica slot each peer occupies, so the first N claims under the copy count win; strict per-slot mapping is a follow-up.	2026-05-12 15:38:52 -07:00
Chris Lu	10cc06333b	cluster: restrict Ping RPC to known peers of the requested type (#9445 ) Ping previously dialled whatever host:port the caller asked for. Gate each server's Ping handler on cluster membership: masters check the topology, registered cluster nodes, and configured master peers; volume servers only accept their seed/current masters; filers accept tracked peer filers, the master-learned volume server set, and configured masters. Use address-indexed peer lookups to keep Ping target validation O(1): - topology maintains a pb.ServerAddress -> *DataNode index alongside the dc/rack/node tree, kept in sync from doLinkChildNode and UnlinkChildNode plus the ip/port-rewrite branch in GetOrCreateDataNode. GetTopology now returns nil on a detached subtree instead of panicking, so the linkage hooks can no-op safely. - vid_map tracks a refcount per volume-server address so hasVolumeServer answers without scanning every vid location. The add path skips empty-address entries the same way the delete path already does, so a zero-value Location cannot leak a permanent serverRefCount[""] bucket. - masters reuse a cached master-address set from MasterClient instead of walking the configured peer slice on every request. - volume servers compare against a pre-built seed-master set and protect currentMaster reads/writes with an RWMutex, fixing the data race with the heartbeat goroutine. The seed slice is copied on construction so external mutation cannot desync it from the frozen lookup set. - cluster.check drops the direct volume-to-volume sweep; volume servers no longer carry a peer-volume list, and the note next to the dropped probe is reworded to make clear that direct volume-to-volume reachability is intentionally not validated by this command. Update the volume-server integration tests that drove Ping through the new admission gate: success-path coverage now targets the master peer (the only type a volume server tracks), and the unknown/unreachable path asserts the InvalidArgument the gate now returns instead of the old downstream dial error. Mirror the same admission gate in the Rust volume server crate: a seed-master HashSet built once at startup plus a tokio RwLock over the heartbeat-tracked current master, both consulted in is_known_ping_target on every Ping, with InvalidArgument returned for any target that isn't a recognised master.	2026-05-12 13:00:52 -07:00
Chris Lu	21054b6c18	volume: fail closed in admin gRPC gate when no whitelist is configured (#9440 ) Add Guard.IsAdminAuthorized, a fail-closed variant of IsWhiteListed, and use it to gate destructive volume admin RPCs. IsWhiteListed keeps its allow-all-when-empty semantics for HTTP compatibility. For TCP peers with an empty whitelist, off-host callers are rejected but loopback (127.0.0.0/8, ::1) is still trusted. A volume server commonly cohabits with the master/filer on a single host and in integration-test clusters; the loopback exception keeps cluster-internal admin traffic working without -whiteList while still locking out off-host attackers. Non-TCP peers (in-process / bufconn / unix-socket) bypass the host check entirely. When `weed server` runs master+volume+filer in a single process the master dials the volume server in-process and the peer address surfaces as "@", which has no parseable IP. Such a caller shares our OS process and cannot be spoofed by a remote attacker, so we treat it as trusted by construction. The gate also tolerates a nil guard (developmental / embedded path) and only enforces once a guard is wired up. UpdateWhiteList skips entries whose CIDR fails to parse so the IP-iteration path can no longer hit a nil *net.IPNet.	2026-05-12 12:35:27 -07:00
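
Note on the UpdateWhiteList fix mentioned in 21054b6c18 and restored in 43a8c4fdca: the commit messages describe a whitelist update that logged a net.ParseCIDR error but did not skip the entry, so a nil *net.IPNet was stored and the later Contains call could panic. The sketch below illustrates that pattern only. It is not the SeaweedFS code: parseWhiteList and isWhiteListed are hypothetical helpers, and the only real API calls are net.ParseCIDR, net.ParseIP, and (*net.IPNet).Contains from the Go standard library.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// parseWhiteList is a hypothetical helper (not the SeaweedFS implementation).
// It splits entries into exact hosts and CIDR ranges, skipping any CIDR that
// fails to parse so a nil *net.IPNet is never stored.
func parseWhiteList(entries []string) (map[string]struct{}, []*net.IPNet) {
	hosts := make(map[string]struct{})
	var cidrs []*net.IPNet
	for _, entry := range entries {
		if strings.Contains(entry, "/") {
			_, cidr, err := net.ParseCIDR(entry)
			if err != nil {
				fmt.Printf("skipping invalid CIDR %q: %v\n", entry, err)
				continue // the step the commit restores: do not store the nil result
			}
			cidrs = append(cidrs, cidr)
			continue
		}
		hosts[entry] = struct{}{}
	}
	return hosts, cidrs
}

// isWhiteListed is a hypothetical check: exact host match first, then CIDR ranges.
func isWhiteListed(hosts map[string]struct{}, cidrs []*net.IPNet, host string) bool {
	if _, ok := hosts[host]; ok {
		return true
	}
	remote := net.ParseIP(host)
	if remote == nil {
		return false
	}
	for _, cidr := range cidrs {
		// Safe to call: every stored entry parsed successfully.
		if cidr.Contains(remote) {
			return true
		}
	}
	return false
}

func main() {
	hosts, cidrs := parseWhiteList([]string{"10.0.0.0/8", "not-a-cidr/99", "192.168.1.5"})
	fmt.Println(isWhiteListed(hosts, cidrs, "10.1.2.3"))
	fmt.Println(isWhiteListed(hosts, cidrs, "192.168.1.5"))
	fmt.Println(isWhiteListed(hosts, cidrs, "172.18.0.2"))
}
```

If the continue after the parse error is dropped and the nil result is appended anyway, the loop in isWhiteListed dereferences a nil *net.IPNet on the first lookup that reaches it, which is the failure mode the commit message describes.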

1925 Commits