* fix(remote): reject short reads when caching remote objects
A short read from the remote (stale listing size, truncated or flaky
response) was silently zero-padded: the S3 and Azure clients pre-size
the buffer and discard the downloaded byte count, and the chunk is
recorded with the requested size. The cached file then matched the
expected size but its tail was zero-filled, and the entry was marked cached
so it never re-fetched.
Check the byte count against the requested size in both clients, and
add a backend-agnostic guard in FetchAndWriteNeedle. The cache now
fails loudly and the entry stays remote-only for a later retry.
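
A minimal sketch of the byte-count guard described above, assuming the
client exposes the downloaded byte count. checkFullRead and ErrShortRead
are hypothetical names for illustration, not the identifiers used in the
patch.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrShortRead is a hypothetical sentinel error for this sketch.
var ErrShortRead = errors.New("short read from remote")

// checkFullRead rejects a download whose byte count differs from the
// requested size, instead of letting a pre-sized buffer keep a zero tail.
func checkFullRead(got, want int64) error {
	if got != want {
		return fmt.Errorf("%w: got %d bytes, want %d", ErrShortRead, got, want)
	}
	return nil
}

func main() {
	fmt.Println(checkFullRead(4096, 4096)) // full read: no error
	fmt.Println(checkFullRead(1024, 4096)) // short read: error, caller must not mark the entry cached
}
```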
* fix(remote): match S3 default modes when syncing remote metadata
Remote object listings carry no POSIX mode, so synced entries were
created with a hardcoded 0644. Against a SeaweedFS remote, whose S3
layer writes objects as 0660 and auto-creates directories as 0771
(0660|0111), the mounted copy ended up 0644/0755 and the permissions
visibly diverged from the source.
Default to the S3 modes instead (files 0660, directories 0771). The
filer derives parent-dir modes from the child as fileMode|0111, so
fixing the file default also brings the directories into line.
Directory mtimes still reflect sync time: S3 listings don't enumerate
directories, so the remote's directory timestamps aren't available.
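
The mode arithmetic above can be checked with a short sketch.
parentDirMode and defaultRemoteFileMode are hypothetical names; only the
fileMode|0111 rule and the 0660/0771 values come from this message.

```go
package main

import (
	"fmt"
	"os"
)

// defaultRemoteFileMode is the file default this change adopts (0660).
const defaultRemoteFileMode os.FileMode = 0o660

// parentDirMode derives a directory mode from a child file mode by adding
// the execute (search) bit for user, group and other.
func parentDirMode(fileMode os.FileMode) os.FileMode {
	return fileMode | 0o111
}

func main() {
	fmt.Printf("%04o\n", uint32(parentDirMode(defaultRemoteFileMode))) // 0771
	fmt.Printf("%04o\n", uint32(parentDirMode(0o644)))                 // 0755, the old result
}
```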
applyRecomputeLatest wrote the .versions latest-version pointer and the
demoted prior version's stamp through UpdateEntry without a following
NotifyUpdateEvent, so neither change entered the metadata log. Across
filers the pointer then lived only on whichever filer ran the mutation,
and ListObjects served by any other filer dropped those objects from a
versioned bucket. Emit the events the way PATCH_EXTENDED already does,
keeping a pre-update image for the notification diff.
* master: grow rack-spanning volumes once per DC, capped at copy_N
The periodic rack-aware growth scan grew once per rack. For rack-spanning
replication (DiffRackCount > 0) a single logical volume already covers every
rack the placement needs, so a crowded volume made every rack report
should-grow and the scan created racks×step too many volumes: with "010"
across two racks that is 2 racks × step 2 = 4 logical (8 physical) volumes.
Plan one DC-wide grow for rack-spanning replication, and cap the per-event
step at master.volume_growth.copy_N so lowering it reduces periodic growth.
* master: distribute lastGrowCount evenly across uneven DCs
The non-rack-spanning grow divisor used the current DC's rack count, so DCs
with different rack counts each over-grew. Sum every rack up front and divide
lastGrowCount by that global count instead.
* security: reload JWT signing keys on SIGHUP
Signing keys were read once in the server constructors and never
refreshed. After a key rotation (Secret update, divergent reads) the
in-memory key stayed stale and every request kept failing "wrong jwt"
until the affected process was restarted.
Add Guard.UpdateSigningKeys and call it from the master, volume and
filer reload paths and the s3 reload hook, next to the existing
whitelist refresh. Make the global chunk-read JWT cache reloadable via
an atomic swap, and register the master's Reload with grace.OnReload --
it was never wired, so the master ignored SIGHUP entirely.
Mirror the same refresh in the Rust volume server's SIGHUP handler.
* security: swap signing keys behind an atomic pointer
Addresses review feedback on the in-place key swap: SigningKey is a
[]byte, so reassigning the Guard fields while a request handler reads
them is a data race that can tear the multi-word slice header and read
out of bounds.
Hold the four signing-key fields in an immutable signingConfig snapshot
behind atomic.Pointer; UpdateSigningKeys swaps the whole pointer, so a
reader sees either the old keys or the new ones. Reads go through new
SigningKey/ExpiresAfterSec/ReadSigningKey/ReadExpiresAfterSec accessors.
The Rust guard is already safe: every read and the SIGHUP write go
through the shared RwLock<Guard>.
* security: fold whitelist + auth state into the atomic snapshot
Review follow-up. UpdateSigningKeys still wrote isWriteActive while the
request path read it (and the whitelist maps) unsynchronized, so a SIGHUP
under load could expose an inconsistent mix of activation bits and
whitelist contents.
Move all hot-reloadable Guard state -- keys, expirations, whitelist, and
the activation flags -- into a single immutable guardState swapped behind
one atomic.Pointer. The Update* methods take a small mutex to serialize
the read-modify-write; readers stay lock-free. The concurrency test now
also rotates the whitelist and probes IsWhiteListed under -race.
Also read each signing key once per branch in the volume/filer JWT auth
checks, so a reload landing mid-check can't take the allow-fast-path
after auth was enabled or verify against a different key than the branch
saw.
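
A minimal sketch of the snapshot pattern from the three security commits
above, assuming Go 1.19+ for atomic.Pointer. The field set and method
names here are simplified illustrations, not the actual Guard API.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// guardState is an immutable snapshot: it is never mutated after it has
// been published. The fields are illustrative only.
type guardState struct {
	signingKey    []byte
	isWriteActive bool
}

// Guard publishes its reloadable state behind a single atomic pointer.
type Guard struct {
	mu    sync.Mutex // serializes writers; readers never take it
	state atomic.Pointer[guardState]
}

func NewGuard(key []byte) *Guard {
	g := &Guard{}
	g.UpdateSigningKey(key)
	return g
}

// UpdateSigningKey copies the current snapshot, changes the key, and
// publishes the result in one pointer swap.
func (g *Guard) UpdateSigningKey(key []byte) {
	g.mu.Lock()
	defer g.mu.Unlock()
	next := &guardState{}
	if prev := g.state.Load(); prev != nil {
		*next = *prev // carry over the fields this update does not touch
	}
	next.signingKey = append([]byte(nil), key...)
	next.isWriteActive = len(key) > 0
	g.state.Store(next)
}

// Snapshot returns the current state. A request handler loads it once and
// reads every field from that one snapshot.
func (g *Guard) Snapshot() *guardState { return g.state.Load() }

func main() {
	g := NewGuard([]byte("old"))
	s := g.Snapshot() // taken before the reload
	g.UpdateSigningKey([]byte("new"))
	fmt.Println(string(s.signingKey), string(g.Snapshot().signingKey)) // old new
}
```

A handler that keeps using s after the reload still sees a consistent
old key and old activation flag, which is the property the per-branch
single read relies on.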
* filer: bound TraverseBfsMetadata memory by queuing directory paths
The BFS enqueued every entry, so it held the whole subtree in memory
including each file's chunk list. A filer serving a peer's first-time
bootstrap traversal of a large tree could exhaust memory and get killed.
Stream each entry as it is visited and queue only directory paths to
descend into. Memory is now bounded by the number of directories rather
than the entire tree, and the streamed output order is unchanged.
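
The traversal shape can be sketched against an in-memory tree. The
listing and entry types and the visit callback are assumptions made for
this example; the real traversal reads from the filer store.

```go
package main

import (
	"fmt"
	"path"
)

// entry is a simplified stand-in for a filer entry.
type entry struct {
	name  string
	isDir bool
}

// listing maps a directory path to its children (hypothetical store).
type listing map[string][]entry

// traverseBfs streams every entry to visit as soon as it is listed and
// queues only directory paths, so the queue never holds file entries.
func traverseBfs(tree listing, root string, visit func(fullPath string, e entry)) {
	queue := []string{root}
	for len(queue) > 0 {
		dir := queue[0]
		queue = queue[1:]
		for _, e := range tree[dir] {
			full := path.Join(dir, e.name)
			visit(full, e)
			if e.isDir {
				queue = append(queue, full)
			}
		}
	}
}

func main() {
	tree := listing{
		"/":  {{name: "a", isDir: true}, {name: "f1"}},
		"/a": {{name: "f2"}},
	}
	traverseBfs(tree, "/", func(fullPath string, e entry) {
		fmt.Println(fullPath)
	})
}
```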
* filer: match excluded prefixes on path-component boundaries
Only treat an excluded prefix as a match when it ends at a path
boundary, so excluding /a/b does not also drop a sibling like /a/bc.
Short-circuit the trie walk on the first real match.
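
The boundary rule can be illustrated with a plain string comparison. This
is a sketch of the rule only, not the trie walk the filer uses, and
matchesExcludedPrefix is a hypothetical name. It assumes prefixes carry
no trailing slash.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesExcludedPrefix reports whether path equals prefix or lies beneath
// it, so the match must end exactly at a '/' or at the end of path.
func matchesExcludedPrefix(path, prefix string) bool {
	if prefix == "" || !strings.HasPrefix(path, prefix) {
		return false
	}
	return len(path) == len(prefix) || path[len(prefix)] == '/'
}

func main() {
	fmt.Println(matchesExcludedPrefix("/a/b", "/a/b"))   // true: the path itself
	fmt.Println(matchesExcludedPrefix("/a/b/c", "/a/b")) // true: a descendant
	fmt.Println(matchesExcludedPrefix("/a/bc", "/a/b"))  // false: a sibling
}
```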
* [CheckDisk][GRPC]: implement MVP for disk health detection; add timeout for new gRPC connections
* fix(volume): build disk health check on every platform
setDiskStatus only existed behind the statfs build tag, so disk.go failed
to compile on windows, openbsd, solaris, netbsd and plan9. Move the timeout
wrapper and failure tracking into the shared disk.go and have each platform's
fillInDiskStatus return an error, so every platform gets the same protection
from a stuck filesystem.
Also restore the uint64(fs.Bavail) cast: Bavail is int64 on freebsd, so the
unguarded multiply broke the freebsd build.
* fix(volume): keep one outstanding statfs probe per disk
A stuck statfs used to leave isChecking cleared by the timeout path, so the
next check spawned another goroutine while the previous one was still blocked
in the syscall, leaking one goroutine per minute on a hung disk. Clear the
flag only when statfs returns and treat an overlapping check as a failure, so
a hung filesystem keeps a single outstanding probe and still gets reported.
* fix(volume): assume disk available until the first health check
isDiskAvailable defaulted to false, and CollectHeartbeat skips locations that
are not available. A freshly started volume server would therefore omit every
volume from its first heartbeats until the async CheckDiskSpace ran, so the
master could briefly treat all of them as missing.
* fix(volume): label the disk error metric by data directory
The new gauge tagged the series with IdxDirectory while every neighbouring
resource gauge uses Directory, so the error series would not line up with them
in dashboards. Also log the underlying error instead of a generic message.
* test(volume): cover disk health success and repeated-failure paths
* fix(volume): make a healthy disk the zero-value default
Track the disk as isDiskUnavailable instead of isDiskAvailable so the safe
state is the zero value, matching isDiskSpaceLow. CollectHeartbeat only skips a
location once a check has actively marked it unavailable, so any DiskLocation
built without running CheckDiskSpace (tests, future call sites) still reports
its volumes instead of silently dropping them.
* feat(disk): detect degraded disks using IO latency probes
* feat(stats): introduce configurable disk I/O health probe with EWMA-based latency detection
* feat(disk): replace EWMA with sliding window algorithm for disk health detection and added user-friendly options
* feat(disk): improve disk health probing and recovery
* feat(volume): configure disk health checks via volume.toml
* fix(volume): Remove disk IO probe CLI options
---------
Co-authored-by: ptukha <ptukha@tochka.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
A Canceled/DeadlineExceeded from the caller's per-request context was
treated like a dead channel: it closed the shared cached ClientConn and
cancelled every other in-flight RPC on it with "the client connection is
closing". Under a burst of concurrent chunk assigns (e.g. a large S3
multipart upload) one slow assign hitting its 10s attempt timeout could
poison the connection for all the rest, cascading into a flood of 500s.
Thread the caller's context into shouldInvalidateConnection and only
invalidate on Canceled/DeadlineExceeded while that context is still live,
which isolates the genuine stale-channel signal (a peer restart behind a
k8s Service VIP). To carry the context, add a ctx parameter to the
existing WithGrpcClient, WithMasterClient, and WithMasterServerClient; the
master assign and volume-lookup paths pass their per-attempt context and
every other caller passes context.Background().
* filer: name the read-only path in the write rejection
The write path rejected creates under a read-only rule with a bare
"read only", giving no hint which path was locked or why. Wrap the
error with the matched location prefix and a quota hint so a FUSE
mkdir or S3 put points straight at the offending bucket.
* return the read-only reason over HTTP and drop any query string from the fallback prefix
* fix(ec_bitrot): cap sidecar block_size in ValidateBitrotManifest
A sidecar loaded from disk (or supplied via a backfill/peer RPC) could carry a
huge power-of-two block_size that passed validation, then force a multi-GiB
scratch-buffer allocation in scrub/verify. Add a shared MaxBitrotBlockSize
(64 MiB) constant, enforce it as an upper bound in isPow2MultipleOf1MiB, and
derive the volume flag cap from the same constant so they cannot drift.
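
A sketch of a bounded block-size check under the constraints named above
(a power of two, a multiple of 1 MiB, at most 64 MiB). validBlockSize is
a hypothetical stand-in, and its body is an assumption about what such a
check could look like, not the code of isPow2MultipleOf1MiB.

```go
package main

import "fmt"

const (
	mib = 1 << 20
	// MaxBitrotBlockSize mirrors the 64 MiB cap named in this change.
	MaxBitrotBlockSize = 64 * mib
)

// validBlockSize accepts only powers of two in [1 MiB, 64 MiB]. A power of
// two that is at least 1 MiB is always a multiple of 1 MiB.
func validBlockSize(n uint64) bool {
	if n < mib || n > MaxBitrotBlockSize {
		return false
	}
	return n&(n-1) == 0
}

func main() {
	fmt.Println(validBlockSize(16 * mib))  // true
	fmt.Println(validBlockSize(64 * mib))  // true
	fmt.Println(validBlockSize(128 * mib)) // false: above the cap
	fmt.Println(validBlockSize(3 * mib))   // false: not a power of two
}
```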
* fix(ec_bitrot): don't destroy a valid destination sidecar on an optional copy
writeToFile opened the destination with O_TRUNC before knowing whether the
source had the file, so an optional copy (ignoreSourceFileNotFound) from a source
that lacks the .ecsum truncated and then removed a valid pre-existing destination
sidecar. Stage the optional copy into a temp sibling and commit it with an atomic
rename only when the source actually delivered the file; a missing source is now
a no-op. Mandatory copies keep their in-place behavior.
* ec: add EC bitrot checksum protobuf
EcBitrotProtection/EcShardChecksums/ChecksumAlgorithm sidecar messages,
copy_ecsum_file and unsafe_ignore_sidecar fields, and a CHECKSUM scrub mode.
* ec: bitrot checksum sidecar format, validation, and per-volume load
Per-shard CRC32C block checksums in an optional <base>.ecsum sidecar with a
self-integrity header; validation, rolling builder, backfill primitive, and
EcVolume load on mount + removal on destroy.
* ec: capture per-shard checksums at encode; verify-and-exclude on rebuild
WriteEcFilesWithContext returns the protection computed inline during encoding.
generateMissingEcFiles verifies present inputs against the sidecar, excludes
corrupt ones, regenerates in place, and re-verifies; fail-closed unless
unsafe_ignore_sidecar, removing all generated outputs on failure.
* ec: read-only checksum scrub with Reed-Solomon arbiter
ChecksumScrub verifies each local shard against the sidecar and reconstructs
flagged shards from the clean shards so stale-sidecar false positives are not
reported. Wired to the gRPC CHECKSUM mode and ec.scrub -mode checksum.
* ec: server-side bitrot sidecar write, copy, cleanup, and opportunistic backfill
Write .ecsum at fresh encode; propagate it with copy_ecsum_file (tolerant);
remove it on full delete and decode; rebuild honors unsafe_ignore_sidecar and
opportunistically backfills a sidecar when all shards are reachable.
* ec: volume server bitrot config flags
-ec.bitrotChecksum (default on) and -ec.bitrotBlockSizeMB (default 16).
* fix(ec_bitrot): bound -ec.bitrotBlockSizeMB before the int64 multiply
Validate the MiB value is in [1, 1024] before multiplying by 1 MiB, so a huge
flag value cannot overflow int64 and slip past the power-of-two check, and a
block size cannot collapse a sidecar to a few oversized blocks.
* fix(ec_bitrot): distribute the .ecsum sidecar from the worker encode path
The worker EC encode wrote the generation-0 sidecar locally but never added it
to shardFiles, so DistributeEcShards never shipped it and the distributed
holders came up unprotected. Append it to shardFiles and map the ecsum shard
type to its extension in the sender so it travels with the shards.
* fix(ec_bitrot): remove orphaned sidecars when the generation is gone
Gate sidecar removal on existingShardCount==0 alone rather than also requiring a
stray .ecx. A sidecar whose shards have all been deleted is orphaned and must be
removed even when no .ecx remains, or it leaks. .ecx/.ecj/.vif removal stays
gated on hasEcxFile as before.
* fix(ec_bitrot): do not fold checksum blocks scanned into TotalFiles
ChecksumScrub's first return is blocks scanned, not files. Discard it so the
scrub response's TotalFiles (a needle/file count) is not inflated by the block
count for CHECKSUM mode.
* test(ec_bitrot): clean up generated .ecsum sidecars in removeGeneratedFiles
* fix(ec_bitrot): reject an oversized sidecar payload before the uint32 cast
The header stores payload_len as a uint32; bound the payload before the
conversion so a pathological manifest cannot truncate the length field and
corrupt the sidecar. A real manifest is a few KB, so this never trips.
* fix(ec_bitrot): cap -ec.bitrotBlockSizeMB at 64 MiB
The block size becomes the per-shard scratch buffer the scrub/backfill path
allocates, so an over-large value (e.g. 1 GiB) is a memory hazard per concurrent
scrub worker. Lower the upper bound from 1024 to 64 MiB.
* fix(ec_bitrot): add -ecUnsafeIgnoreSidecar to weed tool fix -ecx
The -ecx recovery path reconstructs missing shards via RebuildEcFilesWithContext,
which fails closed on a malformed/stale .ecsum. Without an override flag an
operator could not complete the rebuild without manually deleting the sidecar.
Expose -ecUnsafeIgnoreSidecar (default false) and thread it through.
* fix(ec_bitrot): bound sidecar payload with a direct int constant; drop readFull
Guard len(payload) against a plain int constant (1 GiB) before the allocation
instead of a uint64 MaxUint32 compare, so the allocation-size value is provably
bounded (clears the CodeQL overflow alert) and the math import is no longer
needed. Inline os.File.ReadAt with io.EOF handling in verifyShardFileBlocks and
remove the now-redundant readFull helper (os.File.ReadAt fills the slice or
errors).
* test(ec_bitrot): use slices.Contains instead of a hand-rolled containsU32
* refactor(ec): fold the EcFiles WithContext variants into the base functions
RebuildEcFiles now takes the *ECContext directly (nil => derive from .vif as
before) and WriteEcFiles takes it too (nil => default), removing the parallel
RebuildEcFilesWithContext / WriteEcFilesWithContext names. Callers that had an
explicit context drop the WithContext suffix; the default-context callers pass
nil. No behavior change.
* refactor(ec): pass BackgroundECContext instead of nil to Write/RebuildEcFiles
Add a non-nil BackgroundECContext placeholder (analogous to context.Background())
and have callers with no specific layout pass it instead of a nil *ECContext.
WriteEcFiles resolves a zero/background context to the default ratio and
RebuildEcFiles resolves it from the .vif, so behavior is unchanged.
* fix(ec_bitrot): make BackgroundECContext a func; RebuildEcFiles fails closed on bad .vif
- BackgroundECContext is now a function returning a fresh *ECContext, so callers
cannot mutate a shared singleton or race on it (and it mirrors context.Background,
which is also a function).
- RebuildEcFiles now propagates the MaybeLoadVolumeInfo error: a present-but-
unreadable .vif fails closed instead of silently rebuilding with the default
ratio (which would corrupt a custom-ratio volume). Pass an explicit ctx to override.
* s3: commit a versioned PutObject and its latest pointer in one transaction
A versioned PutObject wrote the version file and flipped the .versions
latest pointer in two separate routed transactions. Fold the
RECOMPUTE_LATEST into the version file's PUT so both commit atomically
under the object's per-path lock: the recompute, applied after the PUT in
the same transaction, scans the directory and sees the new version. A
crash can no longer leave the version present with a stale pointer.
putToFiler now takes a putFinalize describing the finalize step — routed
mutations folded into the PUT, or an afterCreate run under the object
write lock off the ring. Suspended-versioning keeps its afterCreate-only
form; multipart, copy, and delete-marker finalizes are unchanged.
* s3: trim verbose finalize comments
Adds standard Kubernetes liveness/readiness endpoints to all HTTP
servers that were missing them:
- S3: adds /readyz (already had /healthz)
- IAM: adds /healthz and /readyz (had none)
- Volume: adds /readyz (already had /healthz)
- Filer: adds /readyz on default and readonly mux
- Master: adds /healthz and /readyz at root level
(preserves existing /cluster/healthz)
All endpoints reuse existing health handlers or return 200 OK as a
minimal foundation. Future PRs can enhance /readyz with dependency
checks without breaking the contract.
Closes #9736
Co-authored-by: Mohamed Chorfa <mohamed.chorfa@thalesgroup.com>
On a read-only watched path the idle heartbeat keeps sync_offset fresh,
but a busy source filer still emits a MaxUnsyncedEvents marker after many
filtered events. The marker has a non-nil but empty EventNotification, so
the client routed it to the event path, where it advanced no real
watermark yet drove offsetFunc to republish the stale processed
watermark — regressing the gauge between heartbeats and spiking the
derived lag every time a filtered-event burst landed.
Route the empty marker through OnIdleHeartbeat like the idle heartbeat so
its fresh timestamp keeps the gauge current; it still advances the
in-stream resume cursor.
* fix(filer): derive inodes by hash instead of a snowflake sequencer
Compute the same inode the FUSE mount would: non-hard-linked entries hash path + crtime, hard links hash their shared HardLinkId so every link resolves to one inode. Removes the snowflake inodeSequencer and the SEAWEEDFS_FILER_SNOWFLAKE_ID knob; inodes are now deterministic across filers.
* chore: remove the experimental NFS gateway
The NFS frontend ('weed nfs') was the only consumer of the inode->path index. Remove the weed/server/nfs package, the command and its registration, the integration test harness, and the CI workflow; go mod tidy drops the willscott/go-nfs and go-nfs-client dependencies.
* refactor(filer): drop the inode->path index
With the NFS gateway gone, nothing reads it. A regular file's inode is a pure hash of its path and a hard link's is a hash of its shared HardLinkId -- both derivable on demand -- so the secondary KV index and its write/remove hooks are dead. Removes filer_inode_index.go and the recordInodeIndex hooks from the store wrapper.
* master: timeout AllocateVolume/DeleteVolume and defer growRequest cleanup
The volume-grow goroutine clears the layout's growRequest flag only after
ms.DoAutomaticVolumeGrow returns, and AllocateVolume / DeleteVolume were
calling the volume-server RPC with context.Background(). A volume server
that hung mid-call (heavy I/O, stuck lock, dead peer behind a stable VIP)
would park the goroutine forever, leaving growRequest=true and silently
blocking every subsequent automatic grow for that layout — Assign retries
then drained their 30s budget with "context deadline exceeded" until the
operator restarted the master.
Bound both RPCs with a 5-minute deadline (creating/removing a volume is
sub-second normally, generous for contended disks) and move the flag
clear + filter delete into defers so a panic in DoAutomaticVolumeGrow
doesn't strand the layout either.
* allocate_volume: shorten timeout to 1m for faster recovery
Volume create/delete is sub-second under normal conditions; 1 minute is
generous even on a contended disk and clears the growRequest flag well
before too many client Assigns drain their own retry budget.
* trim comments
* writeJson: drop unused JSONP branch
No in-tree caller uses ?callback=. Always serve application/json
with X-Content-Type-Options: nosniff.
* seaweed-volume: drop unused JSONP branch
Mirror Go: always serve application/json with
X-Content-Type-Options: nosniff.
* writeJson: drop unreachable StatusNotModified check
bodyAllowedForStatus already returns early for 304.
* test/volume_server: rename and rewrite JSONP test to assert callback is ignored
CI: /status?callback=myFunc now returns plain application/json
with X-Content-Type-Options: nosniff.
After a (re)start the owner defers would-be grants for posixLockWarmup
while mounts re-assert, trusting only locally-visible conflicts, so it
does not double-grant from empty state; a deferred grant is a retry for
SetLkw and EAGAIN for non-blocking SetLk, never a spurious grant. Cooling
now fails closed: if the previous owner is unreachable during a ring
change, defer rather than risk a double-grant. readyAt is atomic so the
handler reads it without locking.
While the ring changed within the last snapshot interval, a fresh owner
asks the key's previous owner (LockRing.PriorOwner) whether it still
holds a conflicting lock before granting TRY_LOCK or answering GET_LK, so
it does not double-grant before re-assertion rebuilds its local state.
The probe is marked cooling_probe so the previous owner answers from
local state without recursing. PriorOwner uses the snapshot's prebuilt
ring rather than rebuilding a hash ring per call.
* mount: renew POSIX lock leases via keepalive
The mount tracks the inode keys it holds locks on and a background loop
renews its session lease (KEEP_ALIVE) with each key's owner filer every
5s, within the filer's 15s TTL. A live mount is never reaped; a dead one
stops renewing and owners reclaim its locks. Tracking is a superset:
holds are added on grant and dropped only on owner release, so a still-held
lock is never under-renewed.
* mount,filer: re-assert held POSIX locks via keepalive
The owner filer holds POSIX advisory locks as in-memory soft state, so a key's
owner change (ring rebalance) or an owner restart lost or stranded them: the new
or restarted owner was blind to existing holders and would double-grant.
Make the keepalive carry the mount's held lock ranges per key. The mount mirrors
its own granted locks (posixOwn), and each tick re-asserts them to the key's
current owner, which rebuilds that session's locks from the assertion, self-healing
after a takeover or restart. The owner arbitrates re-asserted locks
against other sessions so it never double-grants; a lock that lost a migration
race is reported, not forced. A bare keepalive (no ranges) still just renews.
* filer: session lease + reaping for POSIX locks
A mount renews its session lease by keepalive (new KEEP_ALIVE op); the
owner filer records last-seen per session and a background sweeper reaps
the locks of leased sessions that stop renewing — a dead or partitioned
mount. Only sessions that have renewed are leased, so this is inert until
mounts run with -posixLock.
* mount: route POSIX advisory locks to the owner filer (-posixLock) (#9665)
mount: route POSIX advisory locks to the owner filer under -dlm
With -dlm, GetLk/SetLk/SetLkw and the flush/release cleanup paths go to
the inode's owner filer via the PosixLock RPC instead of the local table,
so flock/fcntl are honored across mounts. Advisory locking rides the same
switch as whole-file write coordination — and is therefore off under
writeback cache, which implies single-writer. The mount calls its filer
and relies on filer-side forwarding to reach the owner. Keys are the inode
identity (HardLinkId else path); SetLkw is client-side polling with the
FUSE cancel channel (no server wait queue); a per-mount session id
namespaces owners; a local hint avoids a release RPC on every close.
* mount,filer: bound posix-lock release RPCs and stop the reaper on shutdown
The unlock/release RPCs run off the syscall path (close/flush) and used
context.Background() with no deadline, so a slow or unreachable filer could
hang close() indefinitely; bound them to 5s (they still aren't cancelled by
an interrupt). The lease-reaping sweeper now selects on a stop channel that
FilerServer.Shutdown closes, instead of looping for the process lifetime.
* filer: in-memory POSIX lock authority (Manager)
Concurrent multi-inode authority over the per-inode Set: a Set per opaque
inode key (path, or hl:<HardLinkId>) plus a session->keys index so a dead
mount's locks reap in O(locks held). Lock state stays in memory like the
distributed lock manager's, off the replicated meta-log. TryLock/Unlock/
GetLk/ReleasePosixOwner/ReleaseFlockOwner/ReleaseSession; empty sets and
stale index entries are pruned on release.
* filer: routed PosixLock RPC over the in-memory authority
Adds the PosixLock RPC (try/unlock/get_lk + the flush/release owner
drops) that the owner filer answers from its in-memory Manager. The
request key is the inode identity ring key; a non-owner filer forwards
one hop (is_moved-bounded), mirroring ObjectTransaction, so the owner's
table stays the single authority under a stale ring view. Strictly
non-blocking; SetLkw polling lives in the mount.
A non-owner filer forwards the whole transaction to the ring owner of route_key, so the owner's per-path lock stays the single serialization point even when the caller's ring view is stale. is_moved bounds forwarding to one hop. The gateway stamps route_key on every routed builder via the shared objectRouteKey helper. Completes taking S3 object mutations off the distributed lock.
A non-versioned metadata-only self-copy (CopyObject with source == destination
and the REPLACE directive) is a read-modify-write of one entry, which is why it
held the distributed lock. It now routes to the owner as a serialized
PATCH_EXTENDED: the owner merges the new managed metadata (set the replacements,
delete the dropped keys) onto a fresh read of the entry under its per-path lock,
so a concurrent change to non-managed keys (legal hold, retention, version id) is
preserved instead of clobbered, and bumps mtime.
PATCH_EXTENDED gains touch_mtime for the mtime bump. Versioned and suspended
self-copies create a new version (already routed via the copy finalize) and the
no-owner bootstrap keep the lock.
A version-specific DELETE (real version or the null version, including
object-lock WORM-checked ones and governance-bypass) now runs as one routed
transaction on the object's owner instead of holding the distributed lock.
For a real version: recompute the .versions pointer excluding the version
(repoint-before-delete, so a crash leaves a recoverable orphan rather than a
dangling pointer), then delete the version file, under the object's per-path lock.
The null version is the regular object entry, deleted directly (no pointer).
Object-lock buckets gate the delete on the version's WORM guards evaluated on the
owner: legal hold (always) + retention (while not elapsed). Governance bypass
scopes the retention guard to COMPLIANCE mode, so the filer allows a
governance-mode delete while still denying compliance and legal hold — the
gateway never reads the version.
Three primitives make this expressible:
- ObjectTransaction.condition_key: evaluate the condition against a named entry
(the version) while the lock stays on lock_key (the object).
- Recompute.exclude_name: omit a child from the scan, to repoint before delete.
- WriteCondition.Clause gate_key/gate_value: scope IF_EXTENDED_TIME_ELAPSED to a
mode, expressing governance bypass without a gateway-side read.
s3: route versioned PutObject finalize off the distributed lock
A versioned write's finalize (flip the .versions pointer to the newest version,
demote the prior latest) now runs as a single RECOMPUTE_LATEST ObjectTransaction
on the object's owner filer, under its per-path lock, instead of the unserialized
updateLatestVersionInDirectory. The version file is written first; the owner
re-derives the pointer by scanning the directory.
RECOMPUTE_LATEST gains size_to_key / mtime_to_key to cache the chosen version's
size and mtime on the pointer, and demote_key / demote_value to stamp the
displaced prior latest (NoncurrentSinceNs for lifecycle) when the pointer moves.
Falls back to updateLatestVersionInDirectory when no owner is known yet.
* s3: dial the object lock's primary filer directly
The S3 object write lock builds a fresh short-lived lock per write, each
starting at the seed filer. When the seed isn't the key's hash-ring primary
the filer forwards the request to the primary, and in multi-cluster setups
that forward crosses clusters on every write.
Give the lock client a view of the filer lock ring, fed by the master's
LockRingUpdate broadcasts the gateway already receives, so it dials the
primary directly. The view tracks filer membership by version; a stale view
stays correct because the filer still forwards as a fallback.
Also send the initial ring snapshot to S3 clients, not just filers.
* s3: subscribe to lock-ring updates before starting the master loop
The master delivers the initial LockRingUpdate once, on connect. Registering the
callback after KeepConnectedToMaster started left a window where that first
update could arrive before the handler was set and be dropped, delaying the ring
view until the next membership change. Build the lock client and register the
callback in the masters block before launching the loop; the filers block reuses
that client (or creates a plain one when no masters are configured).
* lock_manager: build the hash ring in a deterministic server order
rebuildRing ranged over the server set (a map), whose iteration order is
randomized per process. On a vnode hash collision the last writer into
vnodeToServer wins, so two nodes holding the same server set could resolve the
collision to different servers and disagree on the primary for keys near that
slot. Now that the S3 gateway also computes PrimaryForKey, such a disagreement
would route the same key to different filers and defeat per-path serialization.
Iterate the servers in sorted order so the ring is identical on every node with
the same set, regardless of discovery order.
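
A small sketch of why sorted iteration makes collisions deterministic.
buildRing, the CRC32 vnode hash and the tiny slot count are assumptions
chosen to force collisions in a demo; they are not the lock manager's
actual hashing.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

// buildRing assigns vnodes per server into a slot map. Servers are visited
// in sorted order, so when two vnodes collide on a slot the same server
// wins on every node, whatever the map iteration order was.
func buildRing(servers map[string]struct{}, vnodes int, slots uint32) map[uint32]string {
	names := make([]string, 0, len(servers))
	for s := range servers {
		names = append(names, s)
	}
	sort.Strings(names)

	ring := make(map[uint32]string)
	for _, s := range names {
		for i := 0; i < vnodes; i++ {
			slot := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s#%d", s, i))) % slots
			ring[slot] = s // last writer wins; the order is now fixed
		}
	}
	return ring
}

func main() {
	servers := map[string]struct{}{
		"filer-a:8888": {}, "filer-b:8888": {}, "filer-c:8888": {},
	}
	// 12 vnodes into 8 slots guarantees collisions in this demo.
	fmt.Println(buildRing(servers, 4, 8))
}
```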
* lock_manager: skip redundant ring rebuilds, trim comments
SetRing now ignores a non-zero version at or below the current one once a ring
exists, so repeated LockRingUpdate broadcasts on reconnect no longer rebuild the
ring.
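
The version check can be sketched as follows. ringView and its fields are
hypothetical. Only the rule comes from this message: once a ring exists,
a non-zero version at or below the current one is ignored.

```go
package main

import "fmt"

// ringView is a simplified stand-in for the lock ring state.
type ringView struct {
	version int64
	servers []string
	built   bool
}

// setRing reports whether the ring was rebuilt. A repeated or older
// non-zero version is skipped once a ring exists; version 0 (unversioned)
// always rebuilds.
func (r *ringView) setRing(version int64, servers []string) bool {
	if r.built && version != 0 && version <= r.version {
		return false
	}
	r.version = version
	r.servers = append([]string(nil), servers...)
	r.built = true
	return true
}

func main() {
	r := &ringView{}
	fmt.Println(r.setRing(3, []string{"a", "b"}))      // true: first ring
	fmt.Println(r.setRing(3, []string{"a", "b"}))      // false: repeated broadcast
	fmt.Println(r.setRing(2, []string{"a"}))           // false: older version
	fmt.Println(r.setRing(4, []string{"a", "b", "c"})) // true: newer version
}
```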
* s3: hold the lock-ring client on the server for route-by-key
Store the object-write lock client on S3ApiServer so handlers can resolve a
key's owner filer via PrimaryForKey.
* filer: let PATCH_EXTENDED replace Entry.content
PATCH_EXTENDED merges extended attributes under the per-path lock, reading the
entry fresh, so concurrent patches to different keys don't clobber each other.
Some single-key state lives in Entry.content rather than an extended attribute
(e.g. the S3 bucket metadata blob). Add set_content/content to the mutation so a
patch can replace content the same way -- read fresh, set content, preserve the
rest -- letting a content write and an extended-attribute write on the same
entry serialize on the lock instead of racing whole-entry rewrites.
* Update weed/server/filer_grpc_server.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* filer: test set_content FileSize sync; note chosen content-patch approach
Cover the FileSize behavior of a set_content patch: a file's size follows the
new content length (including when it shrinks), a directory's stays zero. Also
document, in the bucket-config design, that extending PATCH_EXTENDED with
set_content is the implemented path for content-backed config.
CreateEntry starts with a FindEntry to load the current entry. A conditional
CreateEntry already fetched that entry to evaluate the precondition under the
per-path lock, so the create repeated the lookup.
Add an existing *Entry parameter: when non-nil it is used as the current entry
and the internal lookup is skipped; nil keeps the lookup. The gRPC CreateEntry
handler passes the entry it fetched for the precondition, removing the redundant
read while the lock is held. All other callers pass nil.
A multi-object delete spans many keys that route to different owner filers. The
gateway groups keys by owner and sends one batch per owner; the filer applies
each transaction under its own per-path lock, independent of the others.
A failed transaction (precondition or mutation error) is reported in its own
response without aborting the rest, matching S3 multi-object semantics where
each key succeeds or fails on its own. There is no cross-key atomicity, which S3
batch delete does not require.
Routing object-lock buckets off the distributed lock needs the retention and
legal-hold check to run atomically with the write, under the per-path lock. Move
just the comparison into the filer, not the S3 semantics: two generic clause
kinds on an extended attribute.
IF_EXTENDED_NOT_EQUAL blocks while extended[ext_key] equals ext_value (a legal
hold). IF_EXTENDED_TIME_ELAPSED blocks while extended[ext_key], read as a unix-
second deadline, is in the future against the filer's clock (retention); a
malformed deadline fails safe. The caller composes these from the object-lock
state and, for a governance bypass, simply omits the retention clause once the
bypass is authorized -- the filer makes no authorization decision and keeps no
S3 knowledge.
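The two clause kinds can be sketched as plain comparisons. This is an illustration under assumptions, not the filer's code: the type names, the `legal-hold` / `retain-until` keys, and the choice that a missing attribute does not block are all hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

type clauseKind int

const (
	ifExtendedNotEqual clauseKind = iota
	ifExtendedTimeElapsed
)

type clause struct {
	kind     clauseKind
	extKey   string
	extValue string // used by ifExtendedNotEqual only
}

// holds reports whether one clause lets the write proceed at nowUnix.
func (c clause) holds(extended map[string][]byte, nowUnix int64) bool {
	raw, present := extended[c.extKey]
	switch c.kind {
	case ifExtendedNotEqual:
		return !present || string(raw) != c.extValue
	case ifExtendedTimeElapsed:
		if !present {
			return true // assumption: no stored deadline means nothing to wait for
		}
		deadline, err := strconv.ParseInt(string(raw), 10, 64)
		if err != nil {
			return false // malformed deadline fails safe: keep blocking
		}
		return deadline <= nowUnix
	}
	return false // unknown clause kinds block
}

// mayWrite ANDs all clauses, as a condition does.
func mayWrite(extended map[string][]byte, clauses []clause, nowUnix int64) bool {
	for _, c := range clauses {
		if !c.holds(extended, nowUnix) {
			return false
		}
	}
	return true
}

func main() {
	entry := map[string][]byte{"legal-hold": []byte("ON"), "retain-until": []byte("2000")}
	hold := clause{kind: ifExtendedNotEqual, extKey: "legal-hold", extValue: "ON"}
	retention := clause{kind: ifExtendedTimeElapsed, extKey: "retain-until"}
	fmt.Println(mayWrite(entry, []clause{hold, retention}, 3000)) // false: legal hold is on
	fmt.Println(mayWrite(entry, []clause{retention}, 1000))       // false: deadline in the future
	fmt.Println(mayWrite(entry, []clause{retention}, 3000))       // true: deadline has passed
}
```

A governance bypass corresponds to the caller leaving `retention` out of the clause list; the evaluator itself never learns why.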
Deleting a specific version that happens to be the latest needs the new latest
re-derived from the remaining versions, and that scan must run under the same
lock as the delete. The gateway can't do it atomically across RPCs.
Add a RECOMPUTE_LATEST mutation: it scans a directory under the transaction
lock, picks the child whose name sorts last (descending) or first (ascending),
copies the mapped extended keys from it into a pointer entry, and stores its
name under name_to_key. An empty directory clears the pointer keys. The filer
stays mechanical and S3-agnostic: the caller, which knows the versioning scheme,
supplies the sort direction and the key mappings. A missing pointer entry is a
no-op, so a replayed transaction is idempotent.
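The mechanical nature of the mutation can be shown with in-memory maps. This is a sketch only: `child`, `recomputeLatest`, and the direction of `keyMap` (child key to pointer key) are assumptions for illustration, and locking is left out.

```go
package main

import (
	"fmt"
	"sort"
)

type child struct {
	name     string
	extended map[string][]byte
}

// recomputeLatest rewrites the pointer's extended attributes from the child
// that sorts last (descending) or first by name. keyMap maps a child key to
// the pointer key it is copied to; nameKey receives the chosen child's name.
func recomputeLatest(pointer map[string][]byte, children []child, descending bool, keyMap map[string]string, nameKey string) {
	if pointer == nil {
		return // missing pointer entry: no-op, so a replay is harmless
	}
	if len(children) == 0 {
		delete(pointer, nameKey)
		for _, dst := range keyMap {
			delete(pointer, dst)
		}
		return
	}
	sorted := append([]child(nil), children...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].name < sorted[j].name })
	chosen := sorted[0]
	if descending {
		chosen = sorted[len(sorted)-1]
	}
	pointer[nameKey] = []byte(chosen.name)
	for src, dst := range keyMap {
		if v, ok := chosen.extended[src]; ok {
			pointer[dst] = v
		} else {
			delete(pointer, dst)
		}
	}
}

func main() {
	pointer := map[string][]byte{}
	versions := []child{
		{name: "v_0001", extended: map[string][]byte{"etag": []byte("a")}},
		{name: "v_0003", extended: map[string][]byte{"etag": []byte("c")}},
		{name: "v_0002", extended: map[string][]byte{"etag": []byte("b")}},
	}
	recomputeLatest(pointer, versions, true, map[string]string{"etag": "latest-etag"}, "latest-name")
	fmt.Println(string(pointer["latest-name"]), string(pointer["latest-etag"])) // v_0003 c
}
```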
A versioned object write touches several entries that must change together: the
main object, a delete marker or version file, and the latest pointer on the
.versions directory. Holding a distributed lock across separate RPCs to do this
is what the per-path lock was meant to replace, but a single CreateEntry only
covers one entry.
Add ObjectTransaction: a request carries a lock_key (the object path), an
optional WriteCondition, and an ordered list of mutations (PUT / DELETE /
PATCH_EXTENDED). The filer holds the per-path lock on lock_key for the whole
call, checks the condition against the entry at lock_key, then applies the
mutations in order. Callers route the object's writes to its owner filer so the
lock is authoritative across all of the object's entries.
DELETE and PATCH of an absent entry are no-ops, so a replayed transaction is
idempotent. PUT entries are metadata-scoped; data-bearing writes (chunks) are
written before the transaction, as today.
Add an optional WriteCondition to CreateEntryRequest. When set, the filer
evaluates it against the current entry while holding the per-path lock, so the
check and the write are atomic on this filer, and returns PRECONDITION_FAILED
when it does not hold. The caller must route the key's writes to the owner filer
for the check to be authoritative.
A condition is a list of clauses that all must hold (logical AND). One clause is
the common case; several express what a single comparison cannot: an ETag set
(If-Match / If-None-Match with multiple values), weak-ETag comparison, and
compound conditions. ETag comparison mirrors the S3 gateway's precedence (stored
Seaweed ETag attribute, then the Md5/chunk fallback) and follows RFC 7232
strong/weak rules, so results match without coupling the filer to S3 handling.
Condition parsing and evaluation live in filer_grpc_server_condition.go.
CreateEntry is a FindEntry-then-write with no lock, so concurrent creates to the
same path race: OExcl can admit two creators, and a conditional check-then-act
has no atomicity. Add a per-path exclusive lock (util.LockTable, which evicts
idle keys so it stays bounded) on the FilerServer and take it in CreateEntry, so
the existence check and the write are atomic on this filer.
This is the local serialization point that lets callers route a key's writes to
its owner filer and drop the distributed lock for that key. AppendToEntry keeps
its distributed lock for now; it can move to the per-path lock once its callers
route to the owner.
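A keyed lock that drops idle keys is enough to make check-then-write atomic per path. The sketch below is a stand-in for illustration and does not reproduce util.LockTable's API; `lockTable`, `store`, and `createExclusive` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

type pathLock struct {
	mu   sync.Mutex
	refs int
}

// lockTable hands out one mutex per path and evicts it once nobody holds or
// waits on it, so the table stays bounded.
type lockTable struct {
	mu    sync.Mutex
	locks map[string]*pathLock
}

func newLockTable() *lockTable { return &lockTable{locks: make(map[string]*pathLock)} }

func (t *lockTable) acquire(path string) (release func()) {
	t.mu.Lock()
	l := t.locks[path]
	if l == nil {
		l = &pathLock{}
		t.locks[path] = l
	}
	l.refs++
	t.mu.Unlock()

	l.mu.Lock()
	return func() {
		l.mu.Unlock()
		t.mu.Lock()
		if l.refs--; l.refs == 0 {
			delete(t.locks, path)
		}
		t.mu.Unlock()
	}
}

type store struct {
	mu      sync.Mutex // protects the map only, not the check-then-write
	entries map[string]string
	locks   *lockTable
}

// createExclusive mimics an OExcl create: find, then write, under the path lock.
func (s *store) createExclusive(path, value string) bool {
	release := s.locks.acquire(path)
	defer release()

	s.mu.Lock()
	_, exists := s.entries[path]
	s.mu.Unlock()
	if exists {
		return false
	}
	s.mu.Lock()
	s.entries[path] = value
	s.mu.Unlock()
	return true
}

// raceCreates runs n concurrent exclusive creates on one path.
func raceCreates(n int) (winners, idleLocks int) {
	s := &store{entries: make(map[string]string), locks: newLockTable()}
	var wg sync.WaitGroup
	var mu sync.Mutex
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if s.createExclusive("/buckets/b/key", fmt.Sprint(id)) {
				mu.Lock()
				winners++
				mu.Unlock()
			}
		}(i)
	}
	wg.Wait()
	return winners, len(s.locks.locks)
}

func main() {
	winners, idle := raceCreates(32)
	fmt.Println("winners:", winners, "locks left:", idle)
}
```

Without the `acquire` call, two goroutines can both observe "not found" before either writes, which is the OExcl race described above.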
cluster.check asks every master to ping every volume server, but the
Ping gate validated volume-server targets only against the local
topology. Only the leader receives volume-server heartbeats, so a
follower's topology is empty and every probe through it failed with
"unknown ping target ... of type volumeServer".
Fall back to the volume-server set the master learns over its own
MasterClient subscription to the leader, the same source the filer gate
already trusts. The anti-SSRF intent is preserved: Ping still only dials
recognized cluster members.
* fix(filer.sync): keep sync_offset fresh while the source is read-only
sync_offset holds the timestamp of the last replicated source event, so
monitoring derives lag from now-sync_offset. A read-only source emits no
metadata events, so the gauge froze at the last write and the derived lag
grew without bound, making thresholds unusable.
The source filer now sends an idle heartbeat carrying its current time
while a subscriber is caught up to the buffer head. filer.sync uses it to
advance the gauge, so now-sync_offset reflects real lag. Heartbeats are
opt-in (client_supports_idle_heartbeat), are never written to the metadata
log, and do not move the resume checkpoint, so a restart still resumes
from the last real event.
* fix(filer.sync): gate idle heartbeat on the read cursor, not SinceNs
In metadata-chunks mode persisted entries replay as log file refs and
never reach eachLogEntryFn, so lastSeenTsNs stays put and a caught-up
subscriber with an old SinceNs would never get a heartbeat. Use the
read cursor (lastReadTime), which advances in that mode too, max'd with
lastSeenTsNs so the in-memory backlog-then-idle case still works while
the cursor returned to the caller has not yet updated.
* fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers
When a volume server holds EC shards for the same vid across more than
one disk, each DiskLocation registers its own EcVolume entry and
Store.FindEcVolume returns whichever one it hits first. The shard-info
RPC iterated only that single EcVolume's Shards, so the response missed
every shard mounted on a sibling disk.
The worker's verifyEcShardsBeforeDelete sums the per-server responses
into a union bitmap and refuses to delete the source volume when the
union falls short of dataShards+parityShards. On multi-disk
destinations, the union was systematically under-counted and source
deletion got blocked even though all shards were physically present and
mounted.
Walk every DiskLocation in the handler and emit the deduplicated union
of all shards. The .ecx-backed fields (file counts, volume size) still
come from a single EcVolume since every disk's entry opens the same
.ecx via NewEcVolume's cross-disk fallback.
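The aggregation amounts to OR-ing per-disk shard ids into one bitmap and checking it against the expected total. A minimal sketch, with `shardBits`, `unionShards`, and `requireFullShardSet` as illustrative lowercase stand-ins rather than the real helpers:

```go
package main

import (
	"fmt"
	"math/bits"
)

type shardBits uint32

// unionShards folds the shard ids of every disk into one bitmap; duplicates
// collapse naturally and out-of-range ids are skipped.
func unionShards(disks [][]uint32) shardBits {
	var all shardBits
	for _, ids := range disks {
		for _, id := range ids {
			if id >= 32 {
				continue
			}
			all |= shardBits(1) << id
		}
	}
	return all
}

func (b shardBits) count() int { return bits.OnesCount32(uint32(b)) }

// requireFullShardSet fails unless ids 0..totalShards-1 are all present.
func requireFullShardSet(b shardBits, totalShards int) error {
	for id := 0; id < totalShards; id++ {
		if b&(shardBits(1)<<uint(id)) == 0 {
			return fmt.Errorf("shard %d missing (%d of %d present)", id, b.count(), totalShards)
		}
	}
	return nil
}

func main() {
	disk0 := []uint32{0, 1, 2, 3, 4, 5, 6}
	disk1 := []uint32{6, 7, 8, 9, 10, 11, 12, 13}
	fmt.Println(requireFullShardSet(unionShards([][]uint32{disk0}), 14))        // reports shard 7 missing
	fmt.Println(requireFullShardSet(unionShards([][]uint32{disk0, disk1}), 14)) // <nil>
}
```

Looking at `disk0` alone reproduces the under-count: the gate sees 7 of 14 shards and refuses the source deletion.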
Tests:
- TestVolumeEcShardsInfo_AggregatesAcrossDisks unit test in
weed/server/.
- test/volume_server/grpc/ec_verify_multi_disk_test.go integration test
drives the full generate -> mount -> redistribute -> restart ->
reconcile path and asserts both VolumeEcShardsInfo and
VerifyShardsAcrossServers + RequireFullShardSet (the production
source-deletion gate) report all 14 shards.
- ec_multi_disk_lifecycle_test.go tightened: replaces the
"VolumeEcShardsInfo only sees one disk's EcVolume" workaround with a
full-shard-set assertion.
* review: use ShardBits bitmask + cap-pre-allocation for shard dedup
* fix(ec_distribute): remove partial files on copy stream error
writeToFile opens the destination with O_TRUNC and streams into it. On
a mid-stream receive / write / cancellation error it returned the
failure but left the destination behind in whatever state had been
written so far, typically 0 bytes when the source errored before
sending any FileContent. VolumeEcShardsCopy distributes .ecx by
calling doCopyFile, so the same behaviour produced the 0-byte .ecx
files seen on EC encoding failures. The source claims a non-zero
ModifiedTsNs (so the existing "source not found" cleanup doesn't
fire), the stream then errors immediately, and the receiver ends up
with a 0-byte .ecx that downstream code mistook for a valid empty
index.
Clean up the partial file on every error path that returns from the
streaming loop (receive, write, and cancellation). Skip cleanup when
isAppend=true so resumable appends keep their existing content. As
defense in depth, VolumeEcShardsCopy also stats the .ecx after copy
and removes / errors on a 0-byte result so the orchestrator can pick
a different source.
The Rust volume server has only the source side of CopyFile (no
client-side stream-to-disk consumer) and no .ecx subsystem yet, so
this fix has no Rust mirror.
* fix(ec_distribute): close file before remove, fail fast on stat error
Address review feedback:
- writeToFile's mid-stream removeIncomplete called os.Remove while the
destination file handle was still open. On Windows os.Remove fails
while a handle is open, so the cleanup would fail there. Wrap the
handle close in a once-only helper, call it from removeIncomplete
and from the existing "source not found" cleanup, and keep a deferred
close as the safety net for the normal-return path.
- VolumeEcShardsCopy's post-copy .ecx check silently passed when
os.Stat returned an error: doCopyFile had reported success but if
the file was already gone, unreadable, or somehow a directory, the
orchestrator only learned at mount time with no useful context.
Treat any non-nil stat error and any directory result as a copy
failure here and surface it immediately.
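The close-before-remove ordering can be sketched with a `sync.Once` around the close. This is an illustration under assumptions, not the actual writeToFile: `copyToFile`, its `next` callback, and the `demo` helper are hypothetical, and append mode is omitted.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sync"
)

// copyToFile streams chunks from next into path; next returns io.EOF when done.
func copyToFile(path string, next func() ([]byte, error)) error {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
	if err != nil {
		return err
	}
	var once sync.Once
	var closeErr error
	closeFile := func() error {
		once.Do(func() { closeErr = f.Close() })
		return closeErr
	}
	defer closeFile() // safety net; a no-op if already closed

	removeIncomplete := func() {
		closeFile() // close first: Windows refuses to remove an open file
		os.Remove(path)
	}
	for {
		chunk, rerr := next()
		if errors.Is(rerr, io.EOF) {
			return closeFile()
		}
		if rerr != nil {
			removeIncomplete()
			return rerr
		}
		if _, werr := f.Write(chunk); werr != nil {
			removeIncomplete()
			return werr
		}
	}
}

// demo reports whether a file survives a failed copy and what a good copy wrote.
func demo() (leftover bool, content string, err error) {
	dir, err := os.MkdirTemp("", "copy-demo")
	if err != nil {
		return false, "", err
	}
	defer os.RemoveAll(dir)

	bad := filepath.Join(dir, "bad.ecx")
	copyToFile(bad, func() ([]byte, error) { return nil, errors.New("stream reset") })
	_, statErr := os.Stat(bad)
	leftover = statErr == nil

	good := filepath.Join(dir, "good.ecx")
	chunks := [][]byte{[]byte("abc"), []byte("def")}
	err = copyToFile(good, func() ([]byte, error) {
		if len(chunks) == 0 {
			return nil, io.EOF
		}
		c := chunks[0]
		chunks = chunks[1:]
		return c, nil
	})
	if err != nil {
		return leftover, "", err
	}
	data, err := os.ReadFile(good)
	return leftover, string(data), err
}

func main() {
	fmt.Println(demo())
}
```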
* mini: quieter startup with a docker-compose-style progress board
Replaces noisy startup/shutdown logs with a single in-place progress
table on a TTY (or one line per state change off-TTY). Each component
renders as `pending -> starting -> ready` during startup and
`stopping -> stopped` during shutdown, with elapsed time on transition.
Also folds in a few cleanups uncovered while making this readable:
- route the admin.go startup prints through glog so quietMiniLogs()
filters them under mini but standalone weed admin still shows them
- generate a dev SSE-S3 KEK + passphrase on first run via WEED_S3_SSE_KEK
and WEED_S3_SSE_KEK_PASSPHRASE env vars (viper.Set has a nested-key
conflict between s3.sse.kek and s3.sse.kek.passphrase); persisted under
the data folder so restarts reuse the same key
- demote worker/master gRPC Recv 'context canceled' to V(1); those are
the normal shutdown signal, not Errors/Warnings
- drop the 'Optimized Settings' block and the 'credentials loaded from
environment variables' message from the welcome banner
- only show the credentials setup hints when no S3 identities exist
(new s3api.HasAnyIdentity accessor backed by an atomic.Bool)
- use S3_BUCKET in the credentials hint so it pairs with
AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
- reorder running-services list to master / volume / filer / webdav /
s3 / iceberg / admin
* mini: refuse in-memory-only SSE-S3 dev keys; surface admin serve errors
loadOrCreateMiniHexSecret returns "" when os.WriteFile fails, so SSE-S3
won't encrypt data under a KEK that the next restart can't reproduce
(which would orphan whatever was written this run). The caller already
treats "" as "skip setting WEED_S3_SSE_* env vars", so SSE-S3 and IAM
just stay disabled for this run.
startAdminServer's serve goroutine used to only log ListenAndServe
failures, so a bind error left the caller blocked on ctx.Done() with
no listener. Forward the error through a buffered channel and select
on it alongside ctx.Done().
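The channel-plus-select shape looks roughly like this. It is a generic sketch, not startAdminServer itself: `serveUntilDone` and the simulated failure are assumptions, with `serve` standing in for a ListenAndServe call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// serveUntilDone runs serve in a goroutine and returns as soon as either it
// fails or ctx is cancelled.
func serveUntilDone(ctx context.Context, serve func() error) error {
	errCh := make(chan error, 1) // buffered: the goroutine never blocks on send
	go func() {
		if err := serve(); err != nil {
			errCh <- err
		}
	}()
	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		return nil
	}
}

// simulateBindFailure shows a listener error reaching the caller without
// waiting for the context to end.
func simulateBindFailure() error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return serveUntilDone(ctx, func() error {
		return errors.New("simulated bind failure")
	})
}

func main() {
	fmt.Println(simulateBindFailure())
}
```

The buffer of one matters: if the caller has already returned on `ctx.Done()`, the goroutine can still deposit its error and exit instead of leaking.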
* ci(s3-proxy-signature): match weed mini's new progress-board ready line
The readiness probe grepped for "S3 (gateway|service).*(started|ready)",
which matched weed mini's old "S3 service is ready at ..." line. Mini
now emits " S3 ready (Xs)" from its progress board, so the
old pattern missed and the test timed out at the 30-second wait.
Widen the alternation to also accept "S3\s+ready". The curl HEAD
fallback already covers any remaining cases.
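The widened alternation can be checked against both line shapes. The workflow itself uses grep; the Go rendering below is only a way to exercise the pattern, and the combined expression, the address, and the "(2s)" timing are illustrative assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// readyLine combines the old pattern with the new "S3\s+ready" alternative.
var readyLine = regexp.MustCompile(`S3 (gateway|service).*(started|ready)|S3\s+ready`)

func isReady(line string) bool { return readyLine.MatchString(line) }

func main() {
	fmt.Println(isReady("S3 service is ready at 127.0.0.1:8333")) // old-style line
	fmt.Println(isReady("  S3    ready (2s)"))                    // progress-board line
	fmt.Println(isReady("  S3    starting"))                      // not ready yet
}
```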
* fix(ec): blanket-clean every destination over the full shard range
The previous cleanup pass walked t.sources only, with the shard ids the
topology had reported at detection time. In the wild, a destination can
end up with EC shards mounted that the topology snapshot didn't list —
shards on a sibling disk that hadn't heartbeated, or shards left over
from a concurrent attempt's mount step. FindEcVolume still returns
true, so the next ReceiveFile trips the mounted-volume guard.
Cleanup now unions t.sources (with ShardIds) and t.targets and issues
unmount + delete over [0..totalShards-1] on each. Both RPCs are
idempotent on missing shards, so the wider sweep is free.
Two new tests cover the gap: shards mounted beyond what t.sources
lists, and a target-only destination with no source row.
* log(ec): include disk_id in EC unmount/delete/refusal log lines
The current logs identify the volume and shard but leave disk_id off,
which makes the cross-server cleanup story hard to follow when
multiple disks of one server hold pieces of the same volume:
UnmountEcShards 4121.1 -> add disk_id
ec volume video-recordings_4121 shard delete [1 5] -> add per-loc disk_id
volume server X:Y deletes ec shards from 4121 [...] -> add disk_id
ReceiveFile: ec volume 4121 is mounted; refusing... -> add disk_ids
ReceiveFile's refusal now names the disk_ids actually holding the
mount so operators can see whether the next cleanup pass needs to
target a sibling disk. Added Store.FindEcVolumeDiskIds /
Store::find_ec_volume_disk_ids as the supporting primitive.
Mirrored in seaweed-volume/src/ (unmount log in Store::unmount_ec_shard,
heartbeat delete log in diff_ec_shard_delta_messages, refusal in the
ReceiveFile handler).
* test(ec): stub VolumeEcShardsUnmount/Delete on the fake volume server
The plugin-worker EC tests boot a fake volume server that embeds
UnimplementedVolumeServerServer. After the worker started calling
VolumeEcShardsUnmount + VolumeEcShardsDelete pre-distribute, the
default Unimplemented response surfaced as fourteen "method not
implemented" errors and TestErasureCodingExecutionEncodesShards
failed. Both RPCs are no-ops here — nothing on the fake server has
mounted state or persisted shard files to remove.
PR #9442 made the filer refuse to register the IAM gRPC service unless
jwt.filer_signing.key was set in security.toml, which broke the admin
UI Users/Groups/Policies pages for every deployment that ships without
a security.toml: weed mini, plain Helm, vanilla weed filer. The Users
tab returned Unimplemented and the page was unusable. Issues #9504,
#9505 and #9509 all trace to this gap.
The rest of the filer's gRPC surface is unauthenticated by default;
treat IAM the same way. The service now always registers, and the
auth gate is a no-op when no signing key is configured. When the key
is set, every RPC still requires an admin-signed Bearer token, matching
the post-#9442 behaviour. Operators who expose the filer gRPC port
beyond a trusted network should set the key on both filer and admin.
The admin client (IamGrpcStore.withIamClient) already skips attaching
the authorization metadata when its key is empty, so no changes there.
fix(tests): make 32-bit GOARCH tests build and run (#9503)
verifyTestFilerClient had bare int64 atomic counters after a map header,
so atomic.AddInt64 panicked with "unaligned 64-bit atomic operation" on
linux/386. Switch to atomic.Int64, which the stdlib guarantees is
8-byte aligned on all platforms.
rpc_version_filter_test.go passed the untyped constant 0xdeadbeef to
t.Errorf, where it default-promoted to int and overflowed 32-bit int.
Bind it to a typed uint32 const used in both the comparison and the
error message.
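Both fixes are small enough to show side by side. The struct and helper names below are hypothetical; only `atomic.Int64` (Go 1.19+) and the typed constant reflect what the message describes.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fakeFilerClient places its counter after smaller fields. A bare int64 here
// could land on a 4-byte boundary on 32-bit targets; atomic.Int64 cannot.
type fakeFilerClient struct {
	seen  map[string]bool
	ready bool
	calls atomic.Int64
}

// countCalls increments the counter from n goroutines and returns the total.
func countCalls(n int) int64 {
	c := &fakeFilerClient{seen: map[string]bool{}}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.calls.Add(1)
		}()
	}
	wg.Wait()
	return c.calls.Load()
}

// sentinel is typed, so it never default-promotes to a 32-bit int.
const sentinel uint32 = 0xdeadbeef

func describeSentinel(got uint32) string {
	if got != sentinel {
		return fmt.Sprintf("got %#x, want %#x", got, sentinel)
	}
	return "ok"
}

func main() {
	fmt.Println(countCalls(100))     // 100
	fmt.Println(describeSentinel(1)) // got 0x1, want 0xdeadbeef
}
```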
* fix(ec): verify full shard set before deleting source volume (#9490)
Before this change, both the worker EC task and the shell ec.encode
command would delete the source .dat as soon as MountEcShards returned —
even if distribute/mount failed partway, leaving fewer than 14 shards
in the cluster. The deletion was logged at V(2), so by the time someone
noticed missing data the only trace was a 0-byte .dat synthesized by
disk_location at next restart.
- Worker path adds Step 6: poll VolumeEcShardsInfo on every destination,
union the bitmaps, and refuse to call deleteOriginalVolume unless all
TotalShardsCount distinct shard ids are observed. A failed gate leaves
the source readonly so the next detection scan can retry.
- Shell ec.encode adds the same gate after EcBalance, walking the master
topology with collectEcNodeShardsInfo.
- VolumeDelete RPC success and .dat/.idx unlinks now log at V(0) so any
source destruction is traceable in default-verbosity production logs.
The EC-balance-vs-in-flight-encode race is intentionally left for a
follow-up; balance should refuse to move shards for a volume whose
encode job is not in Completed state.
* fix(ec): trim doc comments on the new shard-verification path
Drop WHAT-describing godoc on freshly added helpers; keep only the WHY
notes (query-error policy in VerifyShardsAcrossServers, the #9490
reference at the call sites).
* fix(ec): drop issue-number anchors from new comments
Issue references age poorly — the why behind each comment already
stands on its own.
* fix(ec): parametrize RequireFullShardSet on totalShards
Take totalShards as an argument instead of reading the package-level
TotalShardsCount constant. The OSS callers continue to pass 14, but the
helper is now usable with any DataShards+ParityShards ratio.
* test(plugin_workers): make fake volume server respond to VolumeEcShardsInfo
The new pre-delete verification gate calls VolumeEcShardsInfo on every
destination after mount, and the fake server's UnimplementedVolumeServer
returns Unimplemented — the verifier read that as zero shards on every
node and aborted source deletion. Build the response from recorded
mount requests so the integration test exercises the gate end-to-end.
* fix(rust/volume): log .dat/.idx unlink with size in remove_volume_files
Mirror the Go-side change in weed/storage/volume_write.go: stat each
file before removing and emit an info-level log for .dat/.idx so a
destructive call is always traceable. The OSS Rust crate previously
unlinked them silently.
* fix(ec/decode): verify regenerated .dat before deleting EC shards
After mountDecodedVolume succeeded, the previous code immediately
unmounted and deleted every EC shard. A silent failure in generate or
mount could leave the cluster with neither shards nor a valid normal
volume. Probe ReadVolumeFileStatus on the target and refuse to proceed
if dat or idx is 0 bytes.
Also make the fake volume server's VolumeEcShardsInfo reflect whichever
shard files exist on disk (seeded for tests as well as mounted via
RPC), so the new gate can be exercised end-to-end.
* fix(ec): address PR review nits in verification + fake server
- Drop unused ServerShardInventory.Sizes field.
- Skip shard ids >= MaxShardCount before bitmap Set so the ShardBits
bound is explicit (Set already no-ops on overflow, this is for
clarity).
- Nil-guard the fake server's VolumeEcShardsInfo so a malformed call
doesn't panic the test process.
* fix(master): include GrpcPort in LookupEcVolume response
LookupVolume already passes loc.GrpcPort through to the client; LookupEcVolume
built Location with only Url / PublicUrl / DataCenter, so callers fell back to
ServerToGrpcAddress (httpPort + 10000). On any deployment where that
convention does not hold (multi-disk integration tests, custom port layouts),
EC reads dialed the wrong port and quietly degraded to parity recovery.
Populate GrpcPort in the LookupEcVolume Location as well.
* fix(volume/ec): probe every DiskLocation when serving local shard reads
reconcileEcShardsAcrossDisks (issue 9212) registers each .ec?? against the
DiskLocation that physically owns it, so a multi-disk volume server can hold
shards for the same vid in two separate ecVolumes — one per disk — with .ecx
on whichever disk owned the original .dat. The read path only consulted the
single EcVolume FindEcVolume picked, so requests for shards on the sibling
disk fell through to errShardNotLocal and then to remote/loopback recovery.
Walk all DiskLocations after the first probe in both readLocalEcShardInterval
and the VolumeEcShardRead gRPC handler; the latter also covers the loopback
that recoverOneRemoteEcShardInterval falls back to when a peer dial fails.
* test(volume/ec): cover the multi-disk EC lifecycle end-to-end
Two integration tests against a real volume server with two data dirs:
TestEcLifecycleAcrossMultipleDisks drives encode -> mount -> HTTP read ->
drop .dat -> stop -> redistribute shards across disks -> restart -> verify
reconcileEcShardsAcrossDisks attached the orphan shards and reads still
work -> blob delete -> stop -> drop a shard -> restart -> VolumeEcShardsRebuild
pulls input from both disks -> reads still work.
TestEcPartialShardsOnSiblingDiskCleanedUpOnRestart is the issue 9478
reproducer at the cluster level: seed a healthy .dat on disk 0, plant the
on-disk footprint of an interrupted EC encode on disk 1, restart, and assert
pruneIncompleteEcWithSiblingDat wipes disk 1 without touching disk 0.
Framework gets RestartVolumeServer / StopVolumeServer helpers; the previous
run's volume.log is rotated to volume.log.previous so a startup regression on
the second run does not lose the first run's diagnostics.
* review: trim verbose comments
* review: drop racy fast-path, use locked findEcShard directly
gemini-code-assist flagged the two-step lookup in readLocalEcShardInterval
and VolumeEcShardRead: the first probe (ecVolume.FindEcVolumeShard) reads
the EcVolume's Shards slice without holding ecVolumesLock, so a concurrent
mount / unmount could race with it. findEcShard already walks every
DiskLocation under the right lock, so the fast-path adds nothing but the
race. Collapse both call sites to a single locked call.
Also note in RestartVolumeServer why the log-rotation error is swallowed:
absence on first call is benign; anything else surfaces in the next
os.Create in startVolume.
Revert "master: bind heartbeat claims to the connecting peer (#9443)"
This reverts commit f28c7ce6df.
The strict heartbeat-ip-vs-peer match in authorizeHeartbeatPeer rejects
every hostname-based deployment. In docker-compose / k8s the volume
server is started with -ip=<service-name> and the gRPC peer surfaces
as the container/pod IP, so the two never match and every heartbeat
fails with `heartbeat ip "volume" does not match peer "172.18.0.3"`.
The master therefore never learns about any volume, growth fails, and
fio writes against the mount return EIO.
After the #9440 revert merged (43a8c4fdc), the e2e workflow is still
failing for this reason; see
https://github.com/seaweedfs/seaweedfs/actions/runs/25767265775 .
Reverting to unblock e2e. A narrower re-do should accept the heartbeat
when heartbeat.Ip resolves (DNS) to the peer address, so the spoof
hardening can return without breaking hostname-based clusters.
* Revert "volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)"
This reverts commit 21054b6c18.
The fail-closed gate broke any multi-host cluster: in compose / k8s /
remote-host deployments the master's IP isn't loopback, so every
master->volume admin RPC (AllocateVolume, BatchDelete, EC reroute,
vacuum, scrub, ...) is rejected with PermissionDenied unless the
operator manually configures -whiteList. The e2e workflow has been
failing since 10cc06333 with `not authorized: 172.18.0.2` on
AllocateVolume; downstream symptom is fio fsync EIO because zero
volumes can be grown.
The gate's intent was to lock down destructive admin tooling, but the
same RPCs are the master's normal mechanism for growing and managing
volumes. Reverting to restore cluster-internal operation; a narrower
re-do should distinguish operator/admin callers from the master peer
(e.g. trust IPs resolved from -master) before going back in.
* security: skip invalid CIDR in UpdateWhiteList so IsWhiteListed can't panic
The revert in the previous commit also rolled back an unrelated bug fix
that lived inside #9440: UpdateWhiteList logged on net.ParseCIDR error
but did not continue, so the nil *net.IPNet was stored in whiteListCIDR
and IsWhiteListed would panic dereferencing cidrnet.Contains(remote) on
the next gRPC admin check.
Restore the continue. Orthogonal to the fail-closed semantics this PR
is reverting.
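The role of the `continue` is easiest to see in a stripped-down whitelist. This sketch is not the Guard implementation: `whiteList`, `newWhiteList`, and `contains` are hypothetical, and the allow-all-when-empty rule is left out.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"strings"
)

type whiteList struct {
	ips   map[string]struct{}
	cidrs []*net.IPNet
}

func newWhiteList(entries []string) *whiteList {
	w := &whiteList{ips: make(map[string]struct{})}
	for _, e := range entries {
		if !strings.Contains(e, "/") {
			w.ips[e] = struct{}{}
			continue
		}
		_, network, err := net.ParseCIDR(e)
		if err != nil {
			log.Printf("skipping invalid CIDR %q: %v", e, err)
			continue // without this, a nil *net.IPNet would be stored
		}
		w.cidrs = append(w.cidrs, network)
	}
	return w
}

// contains would panic on network.Contains if a nil entry had been stored.
func (w *whiteList) contains(host string) bool {
	if _, ok := w.ips[host]; ok {
		return true
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return false
	}
	for _, network := range w.cidrs {
		if network.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	w := newWhiteList([]string{"10.0.0.0/8", "10.0.0.0/99", "192.168.1.5"})
	fmt.Println(w.contains("10.1.2.3"), w.contains("192.168.1.5"), w.contains("8.8.8.8"))
}
```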
SendHeartbeat used to accept whatever Ip/Port/Volumes the caller put on
the wire. Three changes tighten that:
- Reject heartbeats whose Ip does not match the gRPC peer's source
address. Loopback peers are still trusted; operators behind a proxy
can opt out with -master.allowUntrustedHeartbeat.
- Track which (ip, port) first claimed a volume id or an ec shard slot
and drop foreign re-claims. Non-EC volume claims are bounded by the
replica copy count so legitimate replicas still register. EC
ownership is keyed by (vid, shard_id) so the same vid can legitimately
be split across many peers as long as their EcIndexBits are disjoint;
rejected bits are cleared from the bitmap and the parallel ShardSizes
array is compacted in lock-step.
- Maintain reverse indexes owner -> volumes and owner -> ec shard slots
so disconnect cleanup is O(M) in what that peer held rather than O(N)
over the whole map.
Bindings are also released when a heartbeat reports that the peer no
longer holds an id, either via explicit Deleted{Volumes,EcShards}
entries or by omitting it from a full snapshot. Without this, a planned
rebalance that moved a vid or an ec shard from peer A to peer B would
leave B's heartbeats permanently filtered out until A disconnected,
breaking ec encode/decode flows that delete shards on the source as
soon as the move completes.
The (vid -> owners) binding still does not track which replica slot
each peer occupies, so the first N claims under the copy count win;
strict per-slot mapping is a follow-up.
Ping previously dialled whatever host:port the caller asked for. Gate
each server's Ping handler on cluster membership: masters check the
topology, registered cluster nodes, and configured master peers; volume
servers only accept their seed/current masters; filers accept tracked
peer filers, the master-learned volume server set, and configured
masters.
Use address-indexed peer lookups to keep Ping target validation O(1):
- topology maintains a pb.ServerAddress -> *DataNode index alongside
the dc/rack/node tree, kept in sync from doLinkChildNode and
UnlinkChildNode plus the ip/port-rewrite branch in
GetOrCreateDataNode. GetTopology now returns nil on a detached
subtree instead of panicking, so the linkage hooks can no-op safely.
- vid_map tracks a refcount per volume-server address so
hasVolumeServer answers without scanning every vid location. The
add path skips empty-address entries the same way the delete path
already does, so a zero-value Location cannot leak a permanent
serverRefCount[""] bucket.
- masters reuse a cached master-address set from MasterClient instead
of walking the configured peer slice on every request.
- volume servers compare against a pre-built seed-master set and
protect currentMaster reads/writes with an RWMutex, fixing the
data race with the heartbeat goroutine. The seed slice is copied
on construction so external mutation cannot desync it from the
frozen lookup set.
- cluster.check drops the direct volume-to-volume sweep; volume
servers no longer carry a peer-volume list, and the note next to
the dropped probe is reworded to make clear that direct
volume-to-volume reachability is intentionally not validated by
this command.
Update the volume-server integration tests that drove Ping through the
new admission gate: success-path coverage now targets the master peer
(the only type a volume server tracks), and the unknown/unreachable
path asserts the InvalidArgument the gate now returns instead of the
old downstream dial error.
Mirror the same admission gate in the Rust volume server crate: a
seed-master HashSet built once at startup plus a tokio RwLock over the
heartbeat-tracked current master, both consulted in is_known_ping_target
on every Ping, with InvalidArgument returned for any target that isn't
a recognised master.
Add Guard.IsAdminAuthorized, a fail-closed variant of IsWhiteListed, and use
it to gate destructive volume admin RPCs. IsWhiteListed keeps its
allow-all-when-empty semantics for HTTP compatibility.
For TCP peers with an empty whitelist, off-host callers are rejected but
loopback (127.0.0.0/8, ::1) is still trusted. A volume server commonly
cohabits with the master/filer on a single host and in integration-test
clusters; the loopback exception keeps cluster-internal admin traffic
working without -whiteList while still locking out off-host attackers.
Non-TCP peers (in-process / bufconn / unix-socket) bypass the host check
entirely. When `weed server` runs master+volume+filer in a single process
the master dials the volume server in-process and the peer address surfaces
as "@", which has no parseable IP. Such a caller shares our OS process and
cannot be spoofed by a remote attacker, so we treat it as trusted by
construction.
The gate also tolerates a nil guard (development / embedded path) and only
enforces once a guard is wired up. UpdateWhiteList skips entries whose CIDR
fails to parse so the IP-iteration path can no longer hit a nil *net.IPNet.
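The decision order described above can be condensed into a few lines. This is a sketch under assumptions, not Guard.IsAdminAuthorized itself: the function signature, the helper constructors, and the rule that a non-empty whitelist alone decides TCP peers are all hypothetical.

```go
package main

import (
	"fmt"
	"net"
)

// isAdminAuthorized sketches the fail-closed gate for one peer address.
func isAdminAuthorized(peer net.Addr, whitelist map[string]struct{}) bool {
	tcp, ok := peer.(*net.TCPAddr)
	if !ok {
		return true // in-process / bufconn / unix socket: trusted by construction
	}
	if len(whitelist) == 0 {
		return tcp.IP.IsLoopback() // empty whitelist: only same-host callers pass
	}
	_, found := whitelist[tcp.IP.String()]
	return found // assumption: a configured whitelist decides on its own
}

// tcpPeer and unixPeer build example peer addresses for the demo.
func tcpPeer(ip string) net.Addr { return &net.TCPAddr{IP: net.ParseIP(ip), Port: 12345} }
func unixPeer() net.Addr         { return &net.UnixAddr{Name: "@", Net: "unix"} }

func main() {
	fmt.Println(isAdminAuthorized(tcpPeer("127.0.0.1"), nil))  // true: loopback
	fmt.Println(isAdminAuthorized(tcpPeer("172.18.0.2"), nil)) // false: off-host, no whitelist
	fmt.Println(isAdminAuthorized(unixPeer(), nil))            // true: non-TCP peer
}
```

The second case is the one a multi-host cluster hits when the master calls a volume server with no whitelist configured.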