Distributed POSIX Locks
weed mount supports cross-mount POSIX advisory locks — flock(2) and
fcntl(F_SETLK/F_SETLKW) — so a lock taken on one mount is honored by every
other mount of the same cluster. The feature is opt-in: pass -dlm to weed mount. Without it, locks remain per-mount only (the historical behavior).
This builds directly on Filer Operation Serialization: the same route-by-key layer is reused, with a different authority (an in-memory POSIX lock table on the owner filer) and a different RPC.
Why route, not replicate
POSIX advisory locks are transient coordination, not durable data. Replicating them through the metadata log would add write churn to every advisory lock operation, and failover would still race the application's expectations.
The chosen shape — owner filer per inode + in-memory authority + client-side polling for blocking acquires + session leases for dead-client cleanup — is the established pattern for shared-store advisory locking. SeaweedFS already has the routing layer (the lock ring built for the DLM and reused by ObjectTransaction), so adding POSIX semantics on top is a small addition.
At a glance
    app on mount A                          app on mount B
         │                                       │
         │ flock(fd, LOCK_EX)                    │ flock(fd, LOCK_EX)
         ▼                                       ▼
    weed mount A ─── PosixLock RPC ───► filer X (owner of this inode)
                                           │
                                           ▼ posixlock.Manager
                                        in-memory Set per inode
                                           ▲
    weed mount B ─── PosixLock RPC ───► filer Y → forward (is_moved) → filer X
Every mount picks the filer it talks to (the -filer= argument). That filer
checks the lock ring, and if it is not the owner of this key, forwards the
RPC one hop to the owner. The owner's in-memory posixlock.Manager is the
single source of truth.
Blocking acquires (F_SETLKW) are implemented as client-side polling: the
mount re-sends the non-blocking try with bounded backoff until it succeeds or
the syscall is cancelled. There is no server-side wait queue.
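The polling behavior can be sketched roughly as follows. This is a minimal illustration, not the mount's actual code: `pollAcquire`, `tryLock`, `errInterrupted`, and the 5ms starting delay are hypothetical names and values chosen for the example; only the 200ms cap comes from this page.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	initialBackoff = 5 * time.Millisecond   // assumed starting delay for this sketch
	maxBackoff     = 200 * time.Millisecond // cap, matching posixLockMaxBackoff
)

// errInterrupted stands in for the EINTR returned when the syscall is cancelled.
var errInterrupted = errors.New("interrupted")

// pollAcquire retries a non-blocking try until it is granted or cancel fires.
// tryLock is a hypothetical callback wrapping one TRY_LOCK round trip.
func pollAcquire(cancel <-chan struct{}, tryLock func() (bool, error)) error {
	backoff := initialBackoff
	for {
		granted, err := tryLock()
		if err != nil {
			return err
		}
		if granted {
			return nil
		}
		// Wait for the backoff to elapse, unless the caller gives up first.
		select {
		case <-cancel:
			return errInterrupted
		case <-time.After(backoff):
		}
		// Double the delay, but never beyond the cap.
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	attempts := 0
	err := pollAcquire(make(chan struct{}), func() (bool, error) {
		attempts++
		return attempts == 3, nil // pretend the holder unlocks before try 3
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Because the wait lives entirely in the client, a cancelled waiter leaves nothing behind on the filer to clean up.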
The key
The mount converts a FUSE inode to a cluster-stable lock identity in
posixLockKeyForInode (weed/mount/weedfs_posix_lock_routed.go):
| Entry kind | Lock key |
|---|---|
| Regular file | "s3.fuse.lock:" + path |
| Hardlinked file | "s3.fuse.lock:hl:" + hex(HardLinkId) |
POSIX locks are inode-scoped, not name-scoped. Using the HardLinkId for
linked files makes every name for the same inode share one lock table, which
matches what local POSIX gives you and means rename does not move locks.
The FUSE NodeId is mount-local (AsInode = hash(path) + time), so it cannot
be used cross-mount.
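As an illustration of the key table above, a minimal sketch of the derivation might look like this. The function name `lockKey` and its signature are assumptions for the example (the real logic lives in posixLockKeyForInode); the key formats are taken from the table.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

const lockKeyPrefix = "s3.fuse.lock:"

// lockKey returns the path-based key for a regular file, or the
// HardLinkId-based key when hardLinkID is non-empty.
// (Hypothetical helper; it only mirrors the table above.)
func lockKey(path string, hardLinkID []byte) string {
	if len(hardLinkID) > 0 {
		return lockKeyPrefix + "hl:" + hex.EncodeToString(hardLinkID)
	}
	return lockKeyPrefix + path
}

func main() {
	fmt.Println(lockKey("/data/a.txt", nil))
	// Two names for the same hardlinked inode map to one key.
	fmt.Println(lockKey("/data/b.txt", []byte{0x01, 0xab}))
	fmt.Println(lockKey("/data/c.txt", []byte{0x01, 0xab}))
}
```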
Authority — posixlock.Manager
The owner filer keeps the lock state in
weed/filer/posixlock/manager.go:
- Set: a per-inode collection of byte-range Ranges with Type (Read/Write), Sid (session id, unique per mount), Owner (the application's lock owner value), and IsFlock (flock and fcntl live in separate namespaces per POSIX).
- TryLock / Unlock: non-blocking acquire and release.
- GetLk: query without acquiring.
- ReleasePosixOwner / ReleaseFlockOwner: drop all of an owner's locks on a key (close-on-fd semantics).
- Reassert: rebuild the lock set from a list a holder sends after an ownership change or restart (used by KEEP_ALIVE).
- Renew / ReapExpired: session-lease bookkeeping.
There is exactly one Manager per filer. Lock state is never written to disk; it survives only as long as the owner filer keeps running, and the re-assertion mechanism described below rebuilds it when ownership changes.
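To make the conflict rules concrete, here is a simplified sketch of a per-inode lock set. The types `lockRange` and `lockSet` are hypothetical stand-ins, not the actual posixlock.Manager API, and the sketch skips the merging and splitting of one owner's ranges that a full implementation would need.

```go
package main

import "fmt"

type lockType int

const (
	readLock lockType = iota
	writeLock
)

// lockRange is a simplified stand-in for a Range; End is inclusive.
type lockRange struct {
	Start, End uint64
	Type       lockType
	Sid, Owner uint64
	IsFlock    bool
}

// conflicts reports whether a held range blocks a wanted range.
func conflicts(held, want lockRange) bool {
	if held.IsFlock != want.IsFlock {
		return false // flock and fcntl live in separate namespaces
	}
	if held.Sid == want.Sid && held.Owner == want.Owner {
		return false // an owner never conflicts with itself
	}
	if held.End < want.Start || want.End < held.Start {
		return false // disjoint byte ranges
	}
	return held.Type == writeLock || want.Type == writeLock
}

type lockSet struct{ held []lockRange }

// tryLock grants want if nothing held conflicts; otherwise it returns
// a copy of the first conflicting range.
func (s *lockSet) tryLock(want lockRange) (bool, *lockRange) {
	for i := range s.held {
		if conflicts(s.held[i], want) {
			c := s.held[i]
			return false, &c
		}
	}
	s.held = append(s.held, want)
	return true, nil
}

func main() {
	var s lockSet
	ok, _ := s.tryLock(lockRange{Start: 0, End: 99, Type: writeLock, Sid: 1, Owner: 7})
	fmt.Println("first writer granted:", ok)
	ok, c := s.tryLock(lockRange{Start: 50, End: 149, Type: readLock, Sid: 2, Owner: 7})
	fmt.Println("overlapping reader granted:", ok, "blocked by sid:", c.Sid)
}
```

Note that the same Owner value on two different sessions still conflicts, which is why the Sid is part of the owner identity.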
The RPC
Defined in weed/pb/filer.proto
as PosixLockRequest / PosixLockResponse. The op enum:
| Op | What it does |
|---|---|
| TRY_LOCK | Non-blocking acquire of one range |
| UNLOCK | Release one range |
| GET_LK | Query; returns the first conflicting range, if any |
| RELEASE_POSIX_OWNER | Release all fcntl ranges owned by (Sid, Owner) on this key |
| RELEASE_FLOCK_OWNER | Release all flock ranges owned by (Sid, Owner) on this key |
| KEEP_ALIVE | Renew the session lease; if locks is non-empty, re-assert held locks |
A non-blocking grant returns granted = true. A conflict returns
has_conflict = true and the offending range as conflict. The handler
lives in
weed/server/filer_grpc_server_posix_lock.go.
Mount-side flow
Routed POSIX locking lives in
weed/mount/weedfs_posix_lock_routed.go.
The mount runs with -dlm, which sets wfs.lockClient; the FUSE handlers
in weed/mount/weedfs_file_lock.go
route through the new path when wfs.crossMountLocks() is true.
- Session id (Sid): random 64-bit value per mount; it namespaces lock owners so the same FUSE Owner value on two mounts never aliases.
- SetLk (non-blocking): one TRY_LOCK RPC; map EWOULDBLOCK to/from granted=false.
- SetLkw (blocking): posixPollAcquire loops TRY_LOCK with exponential backoff bounded to posixLockMaxBackoff = 200ms; the syscall cancellation (FUSE INT) translates to EINTR.
- GetLk: single GET_LK RPC.
- flush / release: POSIX requires that closing any fd to an inode drops the calling process's fcntl locks on that inode. The mount tracks posixLockHint (a per-inode set of owners we have taken locks for) so it can fire RELEASE_POSIX_OWNER / RELEASE_FLOCK_OWNER on close without an RPC on every close to a file we never locked.
- Keepalive: loopRenewPosixLeases (posixKeepaliveInterval = 5s) sends KEEP_ALIVE per held key. The payload carries the held locks, so a filer that just took over ownership (or just restarted) gets the holder's state pushed to it; see "Re-assertion" below.
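The close-path optimization described for flush / release can be pictured with a small sketch. `lockHints`, `noteLocked`, and `takeOnClose` are hypothetical names; the sketch also assumes the hint is forgotten once the release RPC has been decided, which this page does not spell out.

```go
package main

import (
	"fmt"
	"sync"
)

// lockHints remembers, per inode, which lock owners this mount has taken
// locks for. It is an illustrative stand-in for the posixLockHint idea.
type lockHints struct {
	mu      sync.Mutex
	byInode map[uint64]map[uint64]struct{}
}

// noteLocked records that owner took a lock on inode.
func (h *lockHints) noteLocked(inode, owner uint64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.byInode == nil {
		h.byInode = make(map[uint64]map[uint64]struct{})
	}
	owners := h.byInode[inode]
	if owners == nil {
		owners = make(map[uint64]struct{})
		h.byInode[inode] = owners
	}
	owners[owner] = struct{}{}
}

// takeOnClose reports whether a release RPC is needed for (inode, owner)
// and forgets the hint, so later closes by that owner cost nothing.
func (h *lockHints) takeOnClose(inode, owner uint64) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	owners := h.byInode[inode]
	if _, ok := owners[owner]; !ok {
		return false
	}
	delete(owners, owner)
	if len(owners) == 0 {
		delete(h.byInode, inode)
	}
	return true
}

func main() {
	var hints lockHints
	hints.noteLocked(42, 7)
	fmt.Println("close by locker needs RPC:", hints.takeOnClose(42, 7))
	fmt.Println("close of unlocked file needs RPC:", hints.takeOnClose(43, 7))
}
```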
Sessions, leases, reaping
Every mount has a 64-bit Sid. Every lock the mount takes carries that
Sid. The owner filer remembers, per Sid, the last time it saw a
KEEP_ALIVE.
startPosixLockSweeper (weed/server/filer_grpc_server_posix_lock.go)
runs on every filer:
- posixLockSessionTTL = 15s: sessions silent longer than this are reaped.
- posixLockSweepInterval = 5s: how often each filer checks.
When a session is reaped, all of its locks across every key on this filer
are released. This is how a kill -9'd mount stops blocking other mounts —
nothing else does, because the kernel cannot tell the cluster that the FD
holding a flock just went away.
Sessions that never call KEEP_ALIVE are never tracked (no resource cost),
so the sweeper is inert on a cluster without -dlm mounts.
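The lease bookkeeping can be sketched as below. `sessionTable`, `renew`, and `reapExpired` are illustrative names rather than the filer's real types; the 15s TTL is the value given above, which works out to three 5s keepalive intervals.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

const sessionTTL = 15 * time.Second // matches posixLockSessionTTL

// epoch is a fixed clock origin so the demo is deterministic.
var epoch = time.Unix(0, 0)

// sessionTable tracks the last KEEP_ALIVE time per session id.
type sessionTable struct {
	mu       sync.Mutex
	lastSeen map[uint64]time.Time
}

// renew records a keepalive; sessions that never renew are never tracked.
func (t *sessionTable) renew(sid uint64, now time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.lastSeen == nil {
		t.lastSeen = make(map[uint64]time.Time)
	}
	t.lastSeen[sid] = now
}

// reapExpired removes sessions silent for longer than sessionTTL and returns
// their ids (sorted) so the caller can release every lock they held.
func (t *sessionTable) reapExpired(now time.Time) []uint64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	var reaped []uint64
	for sid, seen := range t.lastSeen {
		if now.Sub(seen) > sessionTTL {
			reaped = append(reaped, sid)
			delete(t.lastSeen, sid)
		}
	}
	sort.Slice(reaped, func(i, j int) bool { return reaped[i] < reaped[j] })
	return reaped
}

func main() {
	var table sessionTable
	table.renew(1, epoch)                     // mount 1 then goes silent
	table.renew(2, epoch.Add(10*time.Second)) // mount 2 is still alive
	fmt.Println("reaped:", table.reapExpired(epoch.Add(16*time.Second)))
}
```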
Ring changes — re-assertion + cooling + warm-up
The cooling-probe and warm-up machinery described in [[Filer Operation Serialization]] applies here directly. The POSIX lock layer adds one piece on top:
- Re-assertion via KEEP_ALIVE. When a mount's keepalive fires, it sends every held lock on that key in the request payload, not just a bare renew. The owner's posixlock.Manager.Reassert rebuilds the lock set from that payload and reports any range it could not reassert (a real loss of lock to a different session). After at most one keepalive interval following a ring change, every new owner has been told about every lock its keys carry, and the warm-up window (10s) is sized to cover that re-assertion round trip even under load.
The combined picture across an ownership change:
- Master broadcasts the new filer set; every filer applies the new snapshot to its LockRing (with the old snapshot retained for LockRing.snapshotInterval).
- New owner sees PosixLock RPCs for keys it now owns. Its in-memory posixlock.Manager has no entries for those keys yet.
- For each request, the new owner asks the prior owner (via PriorOwner(key)) whether it sees a conflict, bounded by posixCoolingProbeTimeout = 2s. The prior owner replies from its in-memory state with cooling_probe=true, so the answer is local and definitive.
- Within posixKeepaliveInterval = 5s, every mount holding a lock on one of the migrated keys re-asserts via KEEP_ALIVE. The new owner's state for that key is now correct.
- The cooling-snapshot ages out (typically snapshotInterval after the ring update). After that, the new owner trusts its local state unconditionally.
A filer that just started is in its own warm-up window
(posixLockWarmup = 10s); during that window it fail-closes on
acquire requests whose "no conflict" answer it cannot verify yet, so
restarts cannot create double grants either.
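One way to picture how an owner might combine local state, the cooling probe, and the warm-up window is the decision sketch below. It is an interpretation of the description above, not the handler's code: `decideAcquire`, `verdict`, and `errProbeTimeout` are hypothetical, and treating a failed or timed-out probe as fail-closed is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

type verdict string

const (
	granted    verdict = "granted"
	conflict   verdict = "conflict"
	failClosed verdict = "fail-closed"
)

// errProbeTimeout stands in for a cooling probe that exceeded its bound.
var errProbeTimeout = errors.New("cooling probe timed out")

// decideAcquire sketches the owner's choice for one acquire request.
// priorProbe is nil when no retained snapshot names a prior owner.
func decideAcquire(localConflict, inWarmup bool, priorProbe func() (bool, error)) verdict {
	if localConflict {
		return conflict // local state already knows a conflicting holder
	}
	if priorProbe != nil {
		priorConflict, err := priorProbe()
		if err != nil {
			return failClosed // assumption: an unanswered probe cannot verify "no conflict"
		}
		if priorConflict {
			return conflict
		}
		return granted // prior owner answered definitively
	}
	if inWarmup {
		return failClosed // just restarted, holders may not have re-asserted yet
	}
	return granted
}

func main() {
	noConflict := func() (bool, error) { return false, nil }
	timedOut := func() (bool, error) { return false, errProbeTimeout }
	fmt.Println(decideAcquire(false, false, nil))        // steady state
	fmt.Println(decideAcquire(false, false, noConflict)) // cooling, prior owner agrees
	fmt.Println(decideAcquire(false, false, timedOut))   // cooling, probe failed
	fmt.Println(decideAcquire(false, true, nil))         // warm-up after restart
}
```

A fail-closed answer is safe for blocking acquires because the mount keeps polling, so the request succeeds once re-assertion has caught up.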
Platform notes
POSIX advisory locks are only forwarded to the FUSE server on Linux. The
macFUSE kernel module handles flock in the kernel, per mount, and does
not forward SETLK opcodes to the userspace filesystem at all — so two
weed mount instances on the same macOS machine cannot coordinate flocks
even with -dlm. This is a macFUSE behavior, not a SeaweedFS one. The
routed lock path itself works the same on macOS; there is just nothing for
it to handle.
The integration test
test/fuse_dlm/posix_lock_ring_test.go
skips on non-Linux for this reason.
Configuration
To enable cross-mount POSIX locks:
weed mount -dir=/mnt/sw -filer=filer1:8888 -dlm
-dlm enables the routed POSIX locks (and the whole-file write lock that
predates this work; see FUSE Mount). It is opt-in because:
- flock / fcntl calls now make an RPC instead of touching a process-local table, which is meaningful for applications that lock on every write.
- Cluster operators that do not need cross-mount advisory locks are not affected; the default keeps the per-mount table.
No filer-side flag is needed. Every filer registers itself in the lock
ring as part of joining the cluster, and the sweeper is inert until a
-dlm mount calls KEEP_ALIVE.
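From the application's side nothing changes: it takes an ordinary flock on a file under the mount. A minimal Go sketch for Linux, using the standard `syscall.Flock`, is shown below; `withExclusiveLock` is a hypothetical helper and the path is whatever file the caller passes in.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// withExclusiveLock opens (or creates) path, takes a blocking exclusive
// flock, runs fn, and releases the lock. On a -dlm mount the lock is
// coordinated across mounts; otherwise it is local to this mount.
func withExclusiveLock(path string, fn func() error) error {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		return err
	}
	defer f.Close() // closing the descriptor also drops the flock
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	return fn()
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: flockdemo <path-under-mount>")
		return
	}
	path := os.Args[1]
	err := withExclusiveLock(path, func() error {
		fmt.Println("holding exclusive lock on", path)
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "lock failed:", err)
		os.Exit(1)
	}
}
```

Running the same program against the same file from two -dlm mounts should serialize the two critical sections; without -dlm each mount only sees its own locks.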
What is not done by this layer
- Mandatory locks (Linux chmod g+s,o-x mandatory mode): advisory only, as POSIX recommends.
- Process-tree inheritance of fcntl owners: handled by the kernel and the application, not the filer.
- Replication of lock state: locks are kept in memory and rebuilt by re-assertion. A filer crash with a -dlm mount holding locks on its keys means those keys are unprotected for the cooling window plus the re-assertion round trip; the design accepts that as the cost of keeping the lock path off the metadata log.
See also
- Filer Operation Serialization — the routing, per-path lock, and ring-change machinery this page builds on.
- FUSE Mount
- POSIX Compliance
- S3 Object Lock and Retention — the unrelated S3 mechanism (object metadata, not advisory locks).