P2P reading in weed mount
When a fleet of GPU hosts loads the same model file through weed mount, every client pulls bytes from the volume tier. Even with fs.distributeChunks spreading the chunks evenly across every volume server, the volume tier's total NIC bandwidth caps how fast the read burst can complete. Peer chunk sharing lets mounts fetch chunks from each other instead, so after the first wave seeds a handful of mounts, subsequent reads fan out across the whole fleet.
This feature is opt-in via -peer.enable=true on weed mount. The filer side is always on: the registry's idle cost is negligible, so there is no flag to toggle it. When the mount flag is off, which is the default for weed mount, reads behave exactly as they did before this feature existed.
The design is documented in design-weed-mount-peer-chunk-sharing.md. This page is the operator-facing summary.
How it works at a glance
┌──────────────────┐                    ┌──────────────────┐
│      Filer       │                    │  Volume servers  │
│  mount registry  │                    │                  │
└────────▲─────────┘                    └────────▲─────────┘
         │ register / list                       │ read chunks
         │                                       │
┌────────┴─────────┐   read from peer   ┌────────┴─────────┐
│    weed mount    │   ◄────────────►   │    weed mount    │
└──────────────────┘                    └──────────────────┘
Chunk read sequence
sequenceDiagram
participant K as Kernel (FUSE read)
participant R as Mount (reader)
participant O as Mount (HRW owner)
participant H as Mount (holder)
participant V as Volume server
K->>R: read(chunk fid)
R->>O: ChunkLookup(fid)
alt holders known
O-->>R: [holder, ...]
R->>H: FetchChunk(fid)
H-->>R: chunk bytes
R->>R: verify MD5 vs filer ETag
R-->>K: bytes
else no holders / any failure
R->>V: HTTP chunk read
V-->>R: chunk bytes
R-->>K: bytes
R->>O: ChunkAnnounce(fid, self)
end
- Tier 1: filer registry. Each filer holds a tiny in-memory map of which mounts are alive. Mounts broadcast MountRegister to every configured filer and merge every filer's MountList response by peer_addr (the newest last_seen_ns wins). That way two mounts pointing at different filers still see each other even though no filer-to-filer sync exists.
- Tier 2: mount-hosted chunk directory. The mount fleet itself shards a fid → holders directory via rendezvous hashing (HRW) on the registered mount list. Each owner mount holds directory entries only for the fids that HRW assigns to it.
- One gRPC port. ChunkAnnounce, ChunkLookup, and the chunk-byte stream (FetchChunk) all go through a single mount-to-mount gRPC service. There is no separate HTTP peer-serve port.
- Peer-serve path. On the read path, each mount asks the HRW owner for holders and picks the best peer by locality (same rack > same DC > elsewhere, LRU tiebreak within each bucket); that peer server-streams the chunk bytes back. The reader verifies the MD5 of the full assembled buffer against the ETag before returning bytes to the kernel; any failure falls through cleanly to the volume tier.
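The Tier 2 sharding can be pictured with a minimal rendezvous-hashing sketch. This is an illustration only, not the mount's actual code: the function names hrwScore and hrwOwner, the choice of FNV-1a, and the tie-break rule are all assumptions made for the example.

```go
package main

import "hash/fnv"

// hrwScore hashes the (peer address, fid) pair into a 64-bit score.
// ASSUMPTION: FNV-1a is used here for brevity; the real hash may differ.
func hrwScore(peerAddr, fid string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(peerAddr))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") hash differently
	h.Write([]byte(fid))
	return h.Sum64()
}

// hrwOwner returns the peer with the highest score for fid, or "" when the
// peer list is empty. Ties break toward the lexicographically smaller address
// so the result does not depend on the order of the input list.
func hrwOwner(peers []string, fid string) string {
	owner := ""
	var best uint64
	for i, p := range peers {
		s := hrwScore(p, fid)
		if i == 0 || s > best || (s == best && p < owner) {
			owner, best = p, s
		}
	}
	return owner
}
```

Every mount that sees the same registered mount list computes the same owner for a given fid without any coordination, and removing a mount only reassigns the fids that mount owned.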
Enabling the feature
On the filer
The filer accepts MountRegister / MountList RPCs unconditionally — the registry is a tiny in-memory map with negligible idle cost, so there is no flag to disable it. Nothing to configure on the filer side.
weed filer -master=master1:9333,master2:9333,master3:9333 \
-ip=filer1 -port=8888
On each mount
weed mount -filer=filer1:8888,filer2:8888 \
-dir=/mnt/seaweedfs \
-peer.enable=true \
-peer.listen=:18080 \
-peer.advertise=10.0.0.5:18080 \
-peer.dataCenter=dc-east \
-peer.rack=rack-a
The mount will:
- Register with every configured filer's mount registry and heartbeat every 30 s.
- Listen on :18080 (gRPC) for ChunkAnnounce / ChunkLookup / FetchChunk.
- On reads, try peer mounts before the volume tier. On any failure (owner unreachable, holder has since evicted, ETag mismatch, etc.) it falls through transparently to the volume path.
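The "peer first, volume tier on any failure" behaviour in the last bullet can be sketched as follows. The names fetchFunc, readChunk, and errNoHolder are hypothetical and exist only for this example; the real read path has more stages (owner lookup, holder selection, verification).

```go
package main

import "errors"

// fetchFunc is a hypothetical signature for "get the bytes of one chunk".
type fetchFunc func(fid string) ([]byte, error)

// errNoHolder stands in for any peer-side failure in this sketch.
var errNoHolder = errors.New("no holder for chunk")

// readChunk tries the peer path first and falls through to the volume tier
// on any peer-side error. It also reports which path served the bytes.
func readChunk(fid string, fromPeer, fromVolume fetchFunc) (data []byte, servedByPeer bool, err error) {
	if fromPeer != nil {
		if data, err = fromPeer(fid); err == nil {
			return data, true, nil
		}
		// Owner unreachable, holder evicted, ETag mismatch: all handled alike.
	}
	data, err = fromVolume(fid)
	return data, false, err
}
```

Passing nil for fromPeer models a mount started without -peer.enable: the call degenerates to a plain volume read.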
-peer.advertise is optional; when set, mounts receive that address from the filer's MountList instead of whatever the bind string would resolve to. It is required when -peer.listen uses a wildcard host (":18080", "0.0.0.0:18080", "[::]:18080") and auto-detection can't find a reachable IP — in that case the mount will fail to start rather than advertise an unusable loopback address.
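The advertise rule above can be sketched with the standard net package. The helper names isWildcardListen and resolveAdvertise, and the detectedIP parameter standing in for the auto-detection result, are assumptions for illustration rather than the mount's real functions.

```go
package main

import (
	"errors"
	"net"
)

// isWildcardListen reports whether a listen address such as ":18080",
// "0.0.0.0:18080" or "[::]:18080" binds every interface, in which case the
// bind string alone does not name an address that peers can dial.
func isWildcardListen(listen string) (bool, error) {
	host, _, err := net.SplitHostPort(listen)
	if err != nil {
		return false, err
	}
	if host == "" {
		return true, nil
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsUnspecified(), nil
}

// resolveAdvertise picks the address to publish to the filer registry.
// detectedIP is the result of a hypothetical auto-detection step and is ""
// when no reachable IP was found.
func resolveAdvertise(listen, advertise, detectedIP string) (string, error) {
	if advertise != "" {
		return advertise, nil // an explicit -peer.advertise always wins
	}
	wildcard, err := isWildcardListen(listen)
	if err != nil {
		return "", err
	}
	if !wildcard {
		return listen, nil
	}
	if detectedIP == "" {
		return "", errors.New("wildcard listen address and no reachable IP: set -peer.advertise")
	}
	_, port, _ := net.SplitHostPort(listen) // already validated above
	return net.JoinHostPort(detectedIP, port), nil
}
```

Returning an error in the last case mirrors the documented behaviour of refusing to start rather than advertising an unusable address.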
Turning it off
Set -peer.enable=false (or omit the flag) on a rolling restart. The feature disables cleanly — the mount stops serving peers, stops registering with filers, and reads take the same path they did before.
Flag reference
weed mount
| Flag | Default | Meaning |
|---|---|---|
| -peer.enable | false | Opt-in master switch. |
| -peer.listen | :18080 | Bind address for the peer gRPC server (directory RPCs + chunk streaming). |
| -peer.advertise | (auto-detect) | Externally reachable host:port other mounts use to reach this one. Required with a wildcard -peer.listen when auto-detect fails. |
| -peer.dataCenter | "" | Data-center locality label advertised to peers. |
| -peer.rack | "" | Rack locality label (finer than DC). |
weed filer
No flags. The mount registry is always on.
When it helps
- Many mount clients reading the same large file in overlapping windows — LLM model loading across a GPU fleet is the canonical case.
- Fleets where inter-host bandwidth (10/25/100 GbE between GPU hosts) is abundant but the volume-tier aggregate NIC bandwidth is the bottleneck.
- Workloads with long-lived content access: chunks that stay cached on one mount remain servable to new arrivals for as long as they sit in that mount's on-disk cache.
When it does not help
- Single-mount workloads: no peers to share with.
- Short-lived mount processes whose caches evict quickly: the TTL window for being discoverable is short.
- Writes: chunks are not shared peer-to-peer during writes; writes always go to volume servers. Peer sharing is read-only.
Operational notes
Port and firewall
Each mount binds -peer.listen for its gRPC server. That port must be reachable by every other mount in the same cluster (it carries directory RPCs and the chunk-byte stream). On Kubernetes, this typically means a ClusterIP service or hostNetwork pods; on bare metal, open the port on the inter-host network only.
Authentication
The peer gRPC service reuses the same transport credentials the mount already uses for talking to the filer. When security.toml configures gRPC TLS, peer connections use it too — no separate credential.
Integrity verification
Every fetched peer response is verified end-to-end by MD5 against FileChunk.ETag from the filer entry before its bytes are handed to the kernel. On a mismatch the mount discards the bytes and falls through to the volume tier. Because of this check, peer mounts can be treated as untrusted sources: corrupted or stale bytes from a peer are rejected before they reach the kernel.
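A minimal sketch of that check, using only the Go standard library. The function name verifyChunk is hypothetical, and the ETag encoding is an assumption: the example treats it as a hex MD5 digest, optionally wrapped in double quotes, which may not match how FileChunk.ETag is actually encoded.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"strings"
)

// verifyChunk reports whether data hashes to the expected MD5.
// ASSUMPTION: etag is a hex MD5 digest (any case), possibly wrapped in
// double quotes; the real FileChunk.ETag encoding may differ.
func verifyChunk(data []byte, etag string) bool {
	want := strings.ToLower(strings.Trim(etag, `"`))
	sum := md5.Sum(data)
	return hex.EncodeToString(sum[:]) == want
}
```

A caller would run this on the fully assembled buffer and treat a false result like any other peer failure, that is, discard the bytes and read from the volume tier.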
Cache and TTL
Directory entries on owner mounts expire after 300 s (5 min) without a renewing ChunkAnnounce. Holder mounts re-announce fids they still hold roughly once per 270 s. The TTL is tuned for the desynchronized loader pattern where chunks stay cached for hours. Neither interval is configurable yet; a later configuration flag may let bursty fleets that want faster eviction shorten them.
No explicit retraction is sent on eviction. A stale directory entry points the reader at a holder whose FetchChunk call returns gRPC NOT_FOUND, and the caller falls through to the volume tier. The cost is a single wasted RTT, with no correctness impact.
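The expiry rule can be sketched as a small in-memory directory. The type directory, its methods, and the use of plain Unix seconds for time are assumptions made to keep the example short; only the 300 s TTL comes from the text above.

```go
package main

import "sort"

// entryTTLSeconds is the 300 s directory TTL described in the text.
const entryTTLSeconds = 300

// directory maps fid -> holder address -> Unix second of the last announce.
type directory map[string]map[string]int64

// announce records (or renews) a holder for fid.
func (d directory) announce(fid, holder string, nowSec int64) {
	if d[fid] == nil {
		d[fid] = make(map[string]int64)
	}
	d[fid][holder] = nowSec
}

// lookup returns the holders announced within the TTL, freshest first, and
// drops expired entries as a side effect.
func (d directory) lookup(fid string, nowSec int64) []string {
	holders := d[fid]
	var live []string
	for h, seen := range holders {
		if nowSec-seen >= entryTTLSeconds {
			delete(holders, h) // deleting during range is safe in Go
			continue
		}
		live = append(live, h)
	}
	if len(holders) == 0 {
		delete(d, fid)
	}
	sort.Slice(live, func(i, j int) bool {
		si, sj := holders[live[i]], holders[live[j]]
		if si != sj {
			return si > sj // more recent announce first
		}
		return live[i] < live[j]
	})
	return live
}
```

With a 270 s re-announce interval, a holder that still caches the chunk always renews its entry about 30 s before the TTL would drop it.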
Locality-aware peer selection
When the owner returns multiple holders, the fetcher re-ranks them client-side:
1. Same rack (same -peer.rack AND same -peer.dataCenter): best.
2. Same DC, different rack: next best.
3. Cross-DC or unknown labels: last.
Within each bucket the server's LRU order is preserved (freshest holder first). Unlabeled peers end up in bucket 3 — always give every mount at least -peer.dataCenter if you want meaningful locality ranking.
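The re-ranking can be sketched as a stable sort on the bucket number, which keeps the owner's order inside each bucket. The peer struct and the functions bucket and rankHolders are hypothetical. How partially labeled peers are bucketed (for example a matching data center with an empty rack) is an assumption of this sketch.

```go
package main

import "sort"

// peer is a hypothetical view of a holder as returned by the owner.
type peer struct {
	Addr, DataCenter, Rack string
}

// bucket returns 1 for same rack, 2 for same DC, 3 for everything else.
// ASSUMPTION: an empty data-center label on either side means bucket 3, and
// an empty rack label never counts as "same rack".
func bucket(self, p peer) int {
	if self.DataCenter == "" || p.DataCenter == "" || self.DataCenter != p.DataCenter {
		return 3
	}
	if self.Rack != "" && self.Rack == p.Rack {
		return 1
	}
	return 2
}

// rankHolders returns a copy of holders ordered by bucket. The sort is
// stable, so the incoming (freshest-first) order survives within a bucket.
func rankHolders(self peer, holders []peer) []peer {
	out := append([]peer(nil), holders...)
	sort.SliceStable(out, func(i, j int) bool {
		return bucket(self, out[i]) < bucket(self, out[j])
	})
	return out
}
```

A stable sort is the key design choice here: it layers locality on top of the owner's ordering instead of replacing it.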
Multi-filer deployments
The registrar broadcasts MountRegister to every filer listed in -filer= and merges every filer's MountList response. That's what lets two mounts pointing at different filers find each other even though the filer registries are in-memory per-filer with no cross-filer sync. An unreachable filer is tolerated; the mount keeps running as long as at least one filer succeeded on the last heartbeat.
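The merge step can be sketched as a map keyed by peer address where the newest timestamp wins. The struct mountInfo and the function mergeMountLists are hypothetical; the field names are Go-style renderings of the peer_addr and last_seen_ns fields mentioned earlier on this page.

```go
package main

import "sort"

// mountInfo is a hypothetical, trimmed-down registry entry.
type mountInfo struct {
	PeerAddr   string
	LastSeenNs int64
}

// mergeMountLists folds the MountList responses of several filers into one
// view keyed by PeerAddr, keeping the entry with the newest LastSeenNs.
// A nil response (for example from an unreachable filer) contributes nothing.
func mergeMountLists(responses ...[]mountInfo) []mountInfo {
	newest := make(map[string]mountInfo)
	for _, resp := range responses {
		for _, m := range resp {
			if cur, ok := newest[m.PeerAddr]; !ok || m.LastSeenNs > cur.LastSeenNs {
				newest[m.PeerAddr] = m
			}
		}
	}
	merged := make([]mountInfo, 0, len(newest))
	for _, m := range newest {
		merged = append(merged, m)
	}
	// Sort by address so every mount derives the same ordered list.
	sort.Slice(merged, func(i, j int) bool { return merged[i].PeerAddr < merged[j].PeerAddr })
	return merged
}
```

Because each mount registers with every filer it knows about, two mounts with overlapping -filer= lists end up with the same merged view even though the filers never exchange registry state.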
Disabling in an emergency
If peer sharing misbehaves in production, the kill switch is a rolling restart with -peer.enable=false on the mounts. Because the read path falls through to the volume tier on every failure mode, you should not observe read errors during the restart — just a gradual transfer of load back to the volume servers.
Limitations
- No write-path announce: mounts don't advertise chunks they just uploaded. Same-host write-then-read still works (local cache), but cross-mount discovery of freshly-written chunks waits until another mount reads them first.
- Chunk manifests: supported transparently — the fetcher resolves manifests to leaf chunks before the HRW lookup, so large files using manifest indirection participate in peer sharing at the leaf level.
- No metrics port on weed mount: internal counters exist but the mount command does not currently expose a Prometheus endpoint. Observability lives in the glog stream (V(2) for warnings, V(4) for per-read success/failure).
End-to-end CI
test/fuse_p2p/ contains a FUSE-backed integration test that brings up a cluster of 3 mounts with -peer.enable, writes a file through one, and verifies a second mount can satisfy the read from the first mount's chunk cache. The .github/workflows/fuse-p2p-integration.yml workflow runs it on every pull request touching mount or filer peer code.
See also
- Distributing AI Model Files for Multi-GPU Loading: the placement side of the same problem; run fs.distributeChunks after upload before letting peer sharing do its thing.
- FUSE Mount: baseline mount configuration.
- weed shell: operator console.
- Full design: design-weed-mount-peer-chunk-sharing.md in the SeaweedFS repo root.