Table of Contents
- Iceberg Table Maintenance
- Introduction
- Prerequisites
- Enabling Maintenance
- Operations
- 1. Compact Data Files
- How It Works
- Before / After: Unpartitioned Table
- Before / After: Partitioned Table
- Before / After: Oversized Bin (Split)
- Delete Handling
- Skip Conditions
- 2. Expire Snapshots
- 3. Remove Orphans
- 4. Rewrite Manifests
- Configuration Reference
- Execution Order
- Commit Protocol
- Detection
- Monitoring
- Limitations
- See Also
Iceberg Table Maintenance
Introduction
SeaweedFS includes an automated Iceberg table maintenance worker that keeps S3 Table Buckets healthy. Over time, Iceberg tables accumulate small data files, stale snapshots, orphaned files, and fragmented manifests. Left unchecked, this degrades query performance and wastes storage.
The maintenance worker runs four operations in the recommended Iceberg order:
- Compact -- merge small Parquet data files into larger ones
- Expire Snapshots -- remove old snapshots and their unreferenced files
- Remove Orphans -- delete files on disk that no snapshot references
- Rewrite Manifests -- consolidate many small manifest files into fewer large ones
The worker integrates with the Worker framework: an admin server schedules detection scans, the worker proposes tables that need attention, and the admin assigns execution jobs.
Prerequisites
- A running SeaweedFS cluster with S3 Table Bucket support enabled
- At least one `weed worker` instance with `iceberg_maintenance` in its job types
- Tables created via the S3 Tables API or the Iceberg REST Catalog
Enabling Maintenance
Iceberg maintenance is disabled by default. Enable it through the admin UI or API:
- Navigate to Admin UI > Job Types > Iceberg Maintenance
- Toggle Enabled to on
- Configure detection interval, scope filters, and worker thresholds
Or start a worker with the job type included (it is in the default set):
# Default: includes iceberg_maintenance along with other job types
weed worker -admin=localhost:23646
# Explicitly specify job types
weed worker -admin=localhost:23646 -jobType=iceberg_maintenance
Operations
1. Compact Data Files
Compaction merges many small Parquet files within the same partition into fewer, larger files. This reduces the number of file opens during queries and improves I/O throughput.
How It Works
- Read the current snapshot's manifest list; separate data manifests from delete manifests
- Group small data files (below `target_file_size_mb`) by partition spec ID + partition key so files from different specs are never mixed
- Filter groups to those with at least `min_input_files` entries
- If `apply_deletes` is enabled and delete manifests exist, collect position deletes and equality deletes from all delete manifest entries
- For each group (bin), read all source Parquet files, filter out deleted rows (position deletes via binary search, equality deletes via hash set lookup), and merge remaining rows into a single output file
- Write one manifest per partition spec with ADDED entries (merged files), DELETED entries (originals), and EXISTING entries (untouched files)
- Carry forward delete manifests if any non-compacted data files remain; drop them if all data files were compacted (deletes fully consumed)
- Commit a new snapshot via optimistic concurrency
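The grouping and filtering steps above can be sketched as follows. This is a minimal illustration rather than the worker's actual code: the `DataFile` type and the `plan_bins` function are hypothetical names, and the partition key is modeled as a plain tuple.

```python
from collections import defaultdict
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file entry read from a manifest."""
    path: str
    size_bytes: int
    spec_id: int
    partition: tuple  # e.g. (("region", "us"),)


def plan_bins(files, target_file_size_mb=256, min_input_files=5):
    """Group small files by (spec ID, partition key); keep groups with enough inputs."""
    groups = defaultdict(list)
    for f in files:
        # Only files below the target size are compaction candidates.
        if f.size_bytes < target_file_size_mb * MB:
            groups[(f.spec_id, f.partition)].append(f)
    # Groups with fewer than min_input_files candidates are left untouched.
    return {key: group for key, group in groups.items() if len(group) >= min_input_files}
```

Keying the groups on the spec ID as well as the partition key is what keeps files written under different partition specs in separate bins.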
Before / After: Unpartitioned Table
Before -- 6 small files totaling 53 MB:
data/
file-001.parquet (10 MB)
file-002.parquet (5 MB)
file-003.parquet (8 MB)
file-004.parquet (12 MB)
file-005.parquet (3 MB)
file-006.parquet (15 MB)
metadata/
v1.metadata.json
snap-100.avro <- manifest list (1 snapshot)
manifest-100.avro <- manifest (6 entries)
After -- 1 merged file:
data/
file-001..006.parquet <- still on disk until orphan removal
compact-100-<ts>-0.parquet (53 MB, new merged file)
metadata/
v1.metadata.json
v2-<nonce>.metadata.json <- new metadata
snap-<ts>.avro <- new manifest list (2 snapshots)
compact-<ts>.avro <- new manifest:
1 ADDED (merged file)
6 DELETED (originals)
The original data files remain on disk. They become unreferenced after expire_snapshots drops the old snapshot, and are cleaned up by remove_orphans.
Before / After: Partitioned Table
Before -- region=us has 5 small files, region=eu has only 2:
data/
region=us/part-001.parquet (20 MB)
region=us/part-002.parquet (15 MB)
region=us/part-003.parquet (10 MB)
region=us/part-004.parquet (25 MB)
region=us/part-005.parquet (18 MB)
region=eu/part-006.parquet (5 MB)
region=eu/part-007.parquet (8 MB)
After -- only region=us is compacted (5 >= min_input_files), region=eu is untouched (2 < 5):
data/
region=us/part-001..005.parquet <- marked DELETED in manifest
region=eu/part-006.parquet <- EXISTING (carried forward)
region=eu/part-007.parquet <- EXISTING (carried forward)
compact-100-<ts>-0.parquet <- merged US data (88 MB)
Before / After: Oversized Bin (Split)
When the total size of a partition group exceeds the target file size (target_file_size_mb), it is split into sub-bins:
Before -- 20 small files in one partition, totaling 600 MB (target = 256 MB):
After -- split into multiple output files:
compact-...-0.parquet (~250 MB, files 1-8)
compact-...-1.parquet (~240 MB, files 9-15)
(files 16-20 left alone if fewer than min_input_files)
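One way to picture the split is a greedy sequential packing, sketched below. The packing strategy and the `split_bin` name are assumptions made for illustration; the worker may pack files differently, so the exact file-to-bin assignment can differ from the listing above.

```python
def split_bin(sizes_mb, target_mb=256, min_input_files=5):
    """Greedy sequential packing (assumed strategy, not confirmed by the source).

    A sub-bin is closed just before adding the next file would push it past
    the target size. Sub-bins with too few inputs are dropped, so their
    files are left alone.
    """
    sub_bins, current, total = [], [], 0
    for size in sizes_mb:
        if current and total + size > target_mb:
            sub_bins.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        sub_bins.append(current)
    return [b for b in sub_bins if len(b) >= min_input_files]
```

With twenty 30 MB files and a 256 MB target, this sketch yields two sub-bins of eight files each; the remaining four files fall below `min_input_files` and are not rewritten.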
Delete Handling
When apply_deletes is enabled (the default), compaction applies both position deletes and equality deletes during the merge:
- Position deletes: the delete file contains `file_path` + `pos` columns indicating specific rows to remove. Paths are normalized so absolute S3 URLs and relative paths match correctly.
- Equality deletes: the delete file specifies equality field IDs and column values. Rows matching those values are filtered out. Different delete files may use different equality columns; deletes are grouped by field ID set.
After compaction, delete manifests whose referenced data files were all compacted are dropped (deletes fully consumed). If any data files remain uncompacted, delete manifests are carried forward.
Set apply_deletes=false to revert to the previous behavior of skipping tables with delete manifests.
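The two delete mechanisms can be illustrated with a small in-memory sketch. Everything here is hypothetical: rows are plain dicts, equality deletes are keyed by column names instead of Iceberg field IDs, and `normalize_path` is only one plausible normalization rule.

```python
from bisect import bisect_left


def normalize_path(path):
    """Assumed rule: strip a 'scheme://bucket/' prefix and any leading slashes."""
    if "://" in path:
        rest = path.split("://", 1)[1]
        path = rest.split("/", 1)[1] if "/" in rest else ""
    return path.lstrip("/")


def is_position_deleted(sorted_positions, pos):
    """Binary search for pos in a sorted list of deleted row positions."""
    i = bisect_left(sorted_positions, pos)
    return i < len(sorted_positions) and sorted_positions[i] == pos


def apply_deletes(file_path, rows, position_deletes, equality_deletes):
    """Return the rows of one data file that survive both kinds of deletes.

    position_deletes: {file_path: [pos, ...]}
    equality_deletes: {(column, ...): {(value, ...), ...}}, grouped by column set
    """
    by_path = {}
    for path, positions in position_deletes.items():
        by_path.setdefault(normalize_path(path), []).extend(positions)
    deleted_positions = sorted(by_path.get(normalize_path(file_path), []))

    kept = []
    for pos, row in enumerate(rows):
        if is_position_deleted(deleted_positions, pos):
            continue
        # Hash-set lookup per equality column set.
        if any(tuple(row[c] for c in cols) in values
               for cols, values in equality_deletes.items()):
            continue
        kept.append(row)
    return kept
```

Grouping equality deletes by their column set means each row needs one set lookup per distinct column set, regardless of how many delete files contributed values.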
Skip Conditions
| Condition | Result |
|---|---|
| All files >= `target_file_size_mb` | "no files eligible for compaction" |
| No partition has >= `min_input_files` small files | "no files eligible for compaction" |
| Delete manifests present and `apply_deletes=false` | "compaction skipped: delete manifests present and apply_deletes is disabled" |
| No current snapshot | "no current snapshot" |
2. Expire Snapshots
Removes old snapshots from table metadata and deletes files that are no longer referenced by any remaining snapshot.
How It Works
- Sort snapshots by timestamp (newest first)
- The current snapshot is always kept
- Keep the newest `max_snapshots_to_keep` snapshots regardless of age
- Among the rest, expire those older than `snapshot_retention_hours`
- Snapshots within the retention window are kept even if they exceed the count
- Commit new metadata with expired snapshots removed
- Delete files exclusively referenced by expired snapshots (best-effort)
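The selection rules above can be sketched as a small function. The name `select_expired` and the `(snapshot_id, timestamp_ms)` tuples are illustrative assumptions; the sketch also assumes the current snapshot counts toward the `max_snapshots_to_keep` budget.

```python
HOUR_MS = 3600 * 1000


def select_expired(snapshots, current_id, now_ms,
                   max_snapshots_to_keep=5, snapshot_retention_hours=168):
    """Return the IDs of snapshots to expire, newest first.

    snapshots: list of (snapshot_id, timestamp_ms) tuples.
    """
    cutoff = now_ms - snapshot_retention_hours * HOUR_MS
    ordered = sorted(snapshots, key=lambda s: s[1], reverse=True)
    expired = []
    for rank, (snap_id, ts) in enumerate(ordered):
        if snap_id == current_id:
            continue  # the current snapshot is always kept
        if rank < max_snapshots_to_keep:
            continue  # within the count budget, regardless of age
        if ts < cutoff:
            expired.append(snap_id)  # beyond the budget and past retention
    return expired
```

A snapshot is expired only when both conditions hold: it falls outside the newest `max_snapshots_to_keep` and it is older than the retention window.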
Before / After
Before -- table with 8 snapshots, max_snapshots_to_keep=5, snapshot_retention_hours=168 (7 days):
Snapshots (newest first):
snap-8 (current, 1 hour ago) <- always kept
snap-7 (2 days ago) <- kept (within budget)
snap-6 (3 days ago) <- kept (within budget)
snap-5 (5 days ago) <- kept (within budget)
snap-4 (6 days ago) <- kept (within budget, count=5)
snap-3 (8 days ago) <- EXPIRED (beyond budget + retention)
snap-2 (10 days ago) <- EXPIRED
snap-1 (14 days ago) <- EXPIRED
After:
Snapshots:
snap-8 (current)
snap-7
snap-6
snap-5
snap-4
Deleted: manifest lists, manifests, and data files
exclusively referenced by snap-1, snap-2, snap-3
Result: "expired 3 snapshot(s), deleted 12 unreferenced file(s)"
3. Remove Orphans
Deletes files in the table's metadata/ and data/ directories that are not referenced by any snapshot. This catches files left behind by failed writes, interrupted compactions, or other incomplete operations.
How It Works
- Collect all file paths referenced by all snapshots (manifest lists, manifests, data files) plus the active metadata file and previous metadata log entries
- Walk the `metadata/` and `data/` directories on the filer
- For each file not in the reference set:
  - Skip if the file is newer than `orphan_older_than_hours` (safety window)
  - Skip if the file has no attributes (unknown age)
  - Delete the file
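The per-file decision can be sketched as below. The `find_orphans` name and the `{path: mtime}` listing are hypothetical; a missing modification time is modeled as `None`.

```python
def find_orphans(listing, referenced, now_s, orphan_older_than_hours=72):
    """Return the sorted paths that would be deleted.

    listing: {path: modification time in seconds, or None if unknown}
    referenced: set of paths reachable from snapshots or metadata
    """
    cutoff = now_s - orphan_older_than_hours * 3600
    to_delete = []
    for path, mtime in listing.items():
        if path in referenced:
            continue
        if mtime is None:
            continue  # unknown age: never delete
        if mtime > cutoff:
            continue  # inside the safety window, may belong to an in-flight write
        to_delete.append(path)
    return sorted(to_delete)
```

The safety window exists because a writer may have uploaded a file that its commit has not yet referenced; deleting it immediately would corrupt that write.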
Before / After
Before -- table directory has leftover files from a failed compaction and an old metadata version:
metadata/
v1.metadata.json <- referenced (previous metadata log)
v2-<nonce>.metadata.json <- referenced (current)
snap-100.avro <- referenced
compact-failed.avro <- NOT referenced, 4 days old
v0.metadata.json <- NOT referenced, 10 days old
data/
file-001.parquet <- referenced
compact-orphan.parquet <- NOT referenced, 5 days old
temp-upload.parquet <- NOT referenced, 1 hour old
After (with orphan_older_than_hours=72):
Deleted:
metadata/compact-failed.avro (4 days > 72h)
metadata/v0.metadata.json (10 days > 72h)
data/compact-orphan.parquet (5 days > 72h)
Kept:
data/temp-upload.parquet (1 hour < 72h safety window)
Result: "removed 3 orphan file(s)"
4. Rewrite Manifests
Consolidates many small manifest files into fewer large ones. When a table accumulates many manifests (from frequent small commits), query engines must open each one to plan scans. Rewriting reduces this overhead.
How It Works
- Read all data manifests from the current snapshot's manifest list
- Skip if manifest count is below `min_manifests_to_rewrite`
- Group manifest entries by partition spec ID (for spec-evolved tables, one output manifest per spec)
- Write merged manifest(s) containing all entries
- Carry forward any delete manifests unchanged
- Write a new manifest list and commit a new snapshot
Before / After
Before -- 12 small manifests from frequent commits:
metadata/
manifest-001.avro (50 entries)
manifest-002.avro (10 entries)
manifest-003.avro (5 entries)
...
manifest-012.avro (8 entries)
snap-500.avro <- manifest list pointing to 12 manifests
After -- 1 merged manifest:
metadata/
merged-<ts>-spec0-<ts>.avro (all entries combined)
snap-<ts>-<ts>.avro <- new manifest list (1 manifest)
Result: "rewrote 12 manifests into 1 (320 entries)"
Skip Conditions
| Condition | Result |
|---|---|
| Data manifest count < `min_manifests_to_rewrite` | "only N data manifests, below threshold of M" |
| No current snapshot | "no current snapshot" |
| No data entries | "no data entries to rewrite" |
Configuration Reference
Admin Config (Scope Filters)
These control which tables the detection scan considers. Set via the Admin UI under the Iceberg Maintenance job type.
| Parameter | Default | Description |
|---|---|---|
| `bucket_filter` | (blank = all) | Comma-separated wildcard patterns for table bucket names (`*` and `?` supported) |
| `namespace_filter` | (blank = all) | Comma-separated wildcard patterns for namespaces |
| `table_filter` | (blank = all) | Comma-separated wildcard patterns for table names |
Examples:
bucket_filter: prod-*, staging-*
namespace_filter: analytics, events-*
table_filter: clicks, orders-*
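To make the pattern semantics concrete, here is a rough sketch of how a single filter value could be evaluated. It uses Python's `fnmatch`, which also interprets `[...]` character classes (not mentioned above), so treat it as an approximation; `matches_filter` is a hypothetical name.

```python
from fnmatch import fnmatchcase


def matches_filter(name, filter_value):
    """True if name matches any comma-separated wildcard pattern.

    A blank filter matches everything, mirroring the "(blank = all)" default.
    """
    patterns = [p.strip() for p in filter_value.split(",") if p.strip()]
    if not patterns:
        return True
    return any(fnmatchcase(name, pattern) for pattern in patterns)
```

For example, `matches_filter("prod-eu", "prod-*, staging-*")` is true, while a bucket named `dev` would be skipped by the same filter.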
Worker Config (Thresholds)
These control how maintenance operations behave. Set via the Admin UI or passed as worker config values.
| Parameter | Default | Description |
|---|---|---|
| `snapshot_retention_hours` | 168 (7 days) | Expire snapshots older than this |
| `max_snapshots_to_keep` | 5 | Always keep at least this many newest snapshots |
| `orphan_older_than_hours` | 72 (3 days) | Safety window: only delete orphans older than this |
| `target_file_size_mb` | 256 (MB) | Files smaller than this are compaction candidates |
| `min_input_files` | 5 | Minimum small files in a partition to trigger compaction |
| `apply_deletes` | true | When true, compaction applies position and equality deletes to data files. When false, tables with delete manifests are skipped |
| `min_manifests_to_rewrite` | 5 | Minimum manifests before rewriting is triggered |
| `max_commit_retries` | 5 | Max optimistic concurrency retries on version conflict |
| `operations` | all | Comma-separated list of operations, or all |
Operations Selection
The operations parameter controls which operations run. The selected operations always execute in canonical order, regardless of the order in which they are listed:
compact -> expire_snapshots -> remove_orphans -> rewrite_manifests
Examples:
operations: all # run everything (default)
operations: expire_snapshots,remove_orphans # snapshots + cleanup only
operations: compact # compaction only
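A minimal sketch of how such a value could be resolved into an ordered list is shown below. The `resolve_operations` name is hypothetical, and rejecting unknown names with an error is an assumption; the worker may handle them differently.

```python
CANONICAL_ORDER = ["compact", "expire_snapshots", "remove_orphans", "rewrite_manifests"]


def resolve_operations(value):
    """Turn the comma-separated operations value into a canonically ordered list."""
    requested = {part.strip() for part in value.split(",") if part.strip()}
    if "all" in requested:
        return list(CANONICAL_ORDER)
    unknown = requested - set(CANONICAL_ORDER)
    if unknown:
        # Assumed behavior: fail loudly instead of silently ignoring typos.
        raise ValueError(f"unknown operations: {sorted(unknown)}")
    # Iterate the canonical list so the caller's ordering is irrelevant.
    return [op for op in CANONICAL_ORDER if op in requested]
```

Listing `remove_orphans,expire_snapshots` therefore still runs snapshot expiration before orphan removal.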
Admin Runtime Defaults
| Setting | Default | Description |
|---|---|---|
| Enabled | false | Must be explicitly enabled |
| Detection interval | 3600s (1 hour) | How often the worker scans for tables needing maintenance |
| Detection timeout | 300s | Max time for a detection scan |
| Max jobs per detection | 100 | Cap on proposals per scan |
| Global execution concurrency | 4 | Cluster-wide parallel maintenance jobs |
| Per-worker execution concurrency | 2 | Parallel jobs per worker instance |
| Job max runtime | 3600s (1 hour) | Timeout for a single maintenance job |
| Retry limit | 1 | Scheduler-level retries for failed jobs |
Execution Order
Operations execute in a fixed order that follows Iceberg best practices:
compact -> expire_snapshots -> remove_orphans -> rewrite_manifests
This order matters because:
- Compact creates a new snapshot with merged files and marks originals as deleted
- Expire Snapshots drops old snapshots, making the original (now-deleted) files unreferenced
- Remove Orphans cleans up unreferenced files from disk (including compaction leftovers)
- Rewrite Manifests consolidates the manifests accumulated from compaction and other commits
If one operation fails, subsequent operations still attempt to run. The job reports partial success with per-operation results.
Commit Protocol
All metadata-modifying operations (compact, expire_snapshots, rewrite_manifests) use an optimistic concurrency protocol:
- Read current metadata from the table entry's xattr
- Plan the changes (build new manifests, manifest lists, etc.)
- Write new metadata file with a unique name (`v<N>-<timestamp>.metadata.json`)
- Compare-and-swap the table entry's xattr: verify the stored `metadataVersion` matches what was read, then update atomically
- On version conflict, clean up the staged metadata file and retry with exponential backoff (50ms, 100ms, 200ms, ... capped at 5s)
Each operation also includes a stale-plan guard: if the current snapshot has changed since planning began, the operation aborts rather than committing against outdated state.
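The retry loop can be sketched against an in-memory stand-in for the table entry. `TableEntry`, `VersionConflict`, and `commit_with_retry` are hypothetical names, and the stale-plan guard is omitted; only the backoff values (50ms doubling, capped at 5s) come from the description above.

```python
import time


def backoff_delays(max_retries, base_ms=50, cap_ms=5000):
    """Delay before each retry in milliseconds: 50, 100, 200, ... capped at 5000."""
    return [min(base_ms * 2 ** i, cap_ms) for i in range(max_retries)]


class VersionConflict(Exception):
    pass


class TableEntry:
    """In-memory stand-in for the table entry's xattr (illustrative only)."""

    def __init__(self):
        self.metadata_version = 1
        self.metadata_file = "v1.metadata.json"

    def compare_and_swap(self, expected_version, new_file):
        if self.metadata_version != expected_version:
            raise VersionConflict()
        self.metadata_version += 1
        self.metadata_file = new_file


def commit_with_retry(entry, plan, max_commit_retries=5, sleep=time.sleep):
    """plan(version) stages a new metadata file and returns its name."""
    delays = backoff_delays(max_commit_retries)
    for attempt in range(max_commit_retries + 1):
        version = entry.metadata_version      # read current metadata version
        new_file = plan(version)              # plan changes and stage the file
        try:
            entry.compare_and_swap(version, new_file)
            return new_file
        except VersionConflict:
            # A real worker would delete the staged metadata file here.
            if attempt == max_commit_retries:
                raise
            sleep(delays[attempt] / 1000)
```

Re-reading the version at the top of every attempt is what makes the retry safe: each new plan is built against the state that the next compare-and-swap will check.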
Detection
The detection scan runs periodically (default: hourly) and evaluates each table using metadata-only heuristics (no manifest reading):
- Snapshot count exceeds `max_snapshots_to_keep`
- Any snapshot is older than `snapshot_retention_hours`
Tables matching either condition are proposed as maintenance jobs. The actual operations then perform precise filtering (e.g., expire_snapshots checks both conditions together before expiring).
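Expressed as code, the detection heuristic is a simple OR of the two conditions. The function name and the list-of-timestamps input are assumptions for illustration.

```python
def needs_maintenance(snapshot_timestamps_ms, now_ms,
                      max_snapshots_to_keep=5, snapshot_retention_hours=168):
    """Metadata-only check: propose the table if either condition holds."""
    cutoff = now_ms - snapshot_retention_hours * 3600 * 1000
    too_many = len(snapshot_timestamps_ms) > max_snapshots_to_keep
    too_old = any(ts < cutoff for ts in snapshot_timestamps_ms)
    return too_many or too_old
```

Because detection uses OR while expiration requires both conditions, a proposed table may end up with nothing to expire; the heuristic trades precision for a cheap scan.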
Monitoring
Maintenance jobs report progress and results through the worker framework:
- Activity events: scan start, scan complete (with table count), per-operation start/complete
- Progress updates: percentage based on completed operations, with per-bin granularity during compaction
- Job result summary: per-operation outcomes joined by semicolons
- Structured metrics: per-operation metrics in `OutputValues` with dot-prefixed keys
Example result summary:
compact: compacted 15 files into 3 (across 3 bins);
expire_snapshots: expired 2 snapshot(s), deleted 8 unreferenced file(s);
remove_orphans: removed 4 orphan file(s);
rewrite_manifests: rewrote 7 manifests into 1 (120 entries)
Structured Metrics
Each operation returns structured metrics in the job's OutputValues map, keyed with the operation name as prefix:
| Key | Description |
|---|---|
| `compact.files_merged` | Number of input files merged |
| `compact.files_written` | Number of output files produced |
| `compact.bins` | Number of compaction bins processed |
| `compact.duration_ms` | Time spent on compaction |
| `expire_snapshots.snapshots_expired` | Number of snapshots removed |
| `expire_snapshots.files_deleted` | Unreferenced files cleaned up |
| `expire_snapshots.duration_ms` | Time spent on expiration |
| `remove_orphans.orphans_removed` | Number of orphan files deleted |
| `remove_orphans.duration_ms` | Time spent on orphan removal |
| `rewrite_manifests.manifests_rewritten` | Number of data manifests consolidated |
| `rewrite_manifests.entries_total` | Total manifest entries in rewritten output |
| `rewrite_manifests.duration_ms` | Time spent on manifest rewriting |
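A consumer of these metrics might regroup the flat keys by operation, as in the sketch below. It assumes the `OutputValues` map has already been decoded into a plain dict; `group_metrics` is a hypothetical helper, not part of the worker framework.

```python
def group_metrics(output_values):
    """Regroup dot-prefixed keys into {operation: {metric: value}}."""
    grouped = {}
    for key, value in output_values.items():
        operation, _, metric = key.partition(".")
        if not metric:
            continue  # ignore keys without an operation prefix
        grouped.setdefault(operation, {})[metric] = value
    return grouped
```

For instance, `{"compact.bins": 3, "compact.files_merged": 15}` becomes `{"compact": {"bins": 3, "files_merged": 15}}`.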
Limitations
- Client-side CAS: the compare-and-swap on metadata version is enforced client-side. Concurrent maintenance on the same table should be avoided (the deduplication key prevents this under normal scheduling).
- Single filer per job: detection tries each filer address in the cluster context until one connects, but each execution job uses the single filer address recorded in its proposal.
See Also
- S3 Table Bucket -- Table bucket concepts and file layout
- S3 Table Bucket Commands -- CLI examples for creating and managing table buckets
- SeaweedFS Iceberg Catalog -- Using the Iceberg REST Catalog with Spark, Trino, Dremio, and other engines
- S3 Tables Security -- IAM permissions for table buckets
- Worker -- Worker framework overview