Table of Contents

Iceberg Table Maintenance

1. Compact Data Files

How It Works
Before / After: Unpartitioned Table
Before / After: Partitioned Table
Before / After: Oversized Bin (Split)
Delete Handling
Skip Conditions

2. Expire Snapshots

How It Works
Before / After

3. Remove Orphans

How It Works
Before / After

4. Rewrite Manifests

How It Works
Before / After
Skip Conditions

Configuration Reference

Admin Config (Scope Filters)
Worker Config (Thresholds)
Operations Selection
Admin Runtime Defaults

Execution Order
Commit Protocol
Detection
Monitoring

Structured Metrics

Limitations
See Also

Iceberg Table Maintenance

Introduction

SeaweedFS includes an automated Iceberg table maintenance worker that keeps S3 Table Buckets healthy. Over time, Iceberg tables accumulate small data files, stale snapshots, orphaned files, and fragmented manifests. Left unchecked, this degrades query performance and wastes storage.

The maintenance worker runs four operations in the recommended Iceberg order:

Compact -- merge small Parquet data files into larger ones
Expire Snapshots -- remove old snapshots and their unreferenced files
Remove Orphans -- delete files on disk that no snapshot references
Rewrite Manifests -- consolidate many small manifest files into fewer large ones

The worker integrates with the Worker framework: an admin server schedules detection scans, the worker proposes tables that need attention, and the admin assigns execution jobs.

Prerequisites

A running SeaweedFS cluster with S3 Table Bucket support enabled
At least one weed worker instance with iceberg_maintenance in its job types
Tables created via the S3 Tables API or the Iceberg REST Catalog

Enabling Maintenance

Iceberg maintenance is disabled by default. Enable it through the admin UI or API:

Navigate to Admin UI > Job Types > Iceberg Maintenance
Toggle Enabled to on
Configure detection interval, scope filters, and worker thresholds

A worker must also be running with the job type included (it is in the default set):

# Default: includes iceberg_maintenance along with other job types
weed worker -admin=localhost:23646

# Explicitly specify job types
weed worker -admin=localhost:23646 -jobType=iceberg_maintenance

Operations

1. Compact Data Files

Compaction merges many small Parquet files within the same partition into fewer, larger files. This reduces the number of file opens during queries and improves I/O throughput.

How It Works

Read the current snapshot's manifest list; separate data manifests from delete manifests
Group small data files (below target_file_size_mb) by partition spec ID + partition key so files from different specs are never mixed
Filter groups to those with at least min_input_files entries
If apply_deletes is enabled and delete manifests exist, collect position deletes and equality deletes from all delete manifest entries
For each group (bin), read all source Parquet files, filter out deleted rows (position deletes via binary search, equality deletes via hash set lookup), and merge remaining rows into a single output file
Write one manifest per partition spec with ADDED entries (merged files), DELETED entries (originals), and EXISTING entries (untouched files)
Carry forward delete manifests if any non-compacted data files remain; drop them if all data files were compacted (deletes fully consumed)
Commit a new snapshot via optimistic concurrency
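The grouping and filtering steps above can be sketched as follows. This is a minimal illustration, not the worker's actual code: the `DataFile` record and the `plan_compaction_groups` function are hypothetical names, and the sample data mirrors the partitioned example later in this section.

```python
from collections import defaultdict
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass(frozen=True)
class DataFile:
    # Hypothetical record; real manifest entries carry more fields.
    path: str
    size_bytes: int
    spec_id: int
    partition: tuple  # partition key values, e.g. ("us",)


def plan_compaction_groups(files, target_file_size_mb=256, min_input_files=5):
    """Group small files by (spec ID, partition key) and keep large-enough groups."""
    target_bytes = target_file_size_mb * MB
    groups = defaultdict(list)
    for f in files:
        if f.size_bytes < target_bytes:  # only small files are candidates
            groups[(f.spec_id, f.partition)].append(f)
    # Drop groups that have too few small files to justify a rewrite.
    return {key: group for key, group in groups.items()
            if len(group) >= min_input_files}


us_sizes_mb = [20, 15, 10, 25, 18]
eu_sizes_mb = [5, 8]
files = [DataFile(f"region=us/part-{i:03d}.parquet", s * MB, 0, ("us",))
         for i, s in enumerate(us_sizes_mb, 1)]
files += [DataFile(f"region=eu/part-{i:03d}.parquet", s * MB, 0, ("eu",))
          for i, s in enumerate(eu_sizes_mb, 6)]

plan = plan_compaction_groups(files)
for (spec_id, partition), group in plan.items():
    total_mb = sum(f.size_bytes for f in group) // MB
    print(spec_id, partition, len(group), "files,", total_mb, "MB")
```

Keying the groups on the spec ID as well as the partition key is what keeps files written under different partition specs from ever landing in the same bin.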

Before / After: Unpartitioned Table

Before -- 6 small files totaling 53 MB:

data/
  file-001.parquet   (10 MB)
  file-002.parquet    (5 MB)
  file-003.parquet    (8 MB)
  file-004.parquet   (12 MB)
  file-005.parquet    (3 MB)
  file-006.parquet   (15 MB)
metadata/
  v1.metadata.json
  snap-100.avro           <- manifest list (1 snapshot)
  manifest-100.avro       <- manifest (6 entries)

After -- 1 merged file:

data/
  file-001..006.parquet   <- still on disk until orphan removal
  compact-100-<ts>-0.parquet  (53 MB, new merged file)
metadata/
  v1.metadata.json
  v2-<nonce>.metadata.json    <- new metadata
  snap-<ts>.avro              <- new manifest list (2 snapshots)
  compact-<ts>.avro           <- new manifest:
                                   1 ADDED   (merged file)
                                   6 DELETED (originals)

The original data files remain on disk. They become unreferenced after expire_snapshots drops the old snapshot, and are cleaned up by remove_orphans.

Before / After: Partitioned Table

Before -- region=us has 5 small files, region=eu has only 2:

data/
  region=us/part-001.parquet  (20 MB)
  region=us/part-002.parquet  (15 MB)
  region=us/part-003.parquet  (10 MB)
  region=us/part-004.parquet  (25 MB)
  region=us/part-005.parquet  (18 MB)
  region=eu/part-006.parquet   (5 MB)
  region=eu/part-007.parquet   (8 MB)

After -- only region=us is compacted (5 >= min_input_files), region=eu is untouched (2 < 5):

data/
  region=us/part-001..005.parquet  <- marked DELETED in manifest
  region=eu/part-006.parquet       <- EXISTING (carried forward)
  region=eu/part-007.parquet       <- EXISTING (carried forward)
  compact-100-<ts>-0.parquet       <- merged US data (88 MB)

Before / After: Oversized Bin (Split)

When a partition group's total size exceeds the target file size (target_file_size_mb), it is split into sub-bins:

Before -- 20 small files in one partition, totaling 600 MB (target = 256 MB):

After -- split into multiple output files:

compact-...-0.parquet  (~250 MB, files 1-8)
compact-...-1.parquet  (~240 MB, files 9-15)
(files 16-20 form a further bin only if they number at least min_input_files; otherwise they are left alone)
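One way to picture the split is a sequential packing pass. The packing strategy the worker actually uses is not described here, so the sketch below assumes a simple greedy rule: close the current bin when the next file would push it past the target. `split_into_bins` is a hypothetical name, and sizes are given in MB for brevity.

```python
def split_into_bins(sizes_mb, target_mb=256, min_input_files=5):
    """Greedy sequential packing (an assumed strategy, for illustration only).

    Returns lists of file indexes; bins with too few files are dropped,
    which leaves those files uncompacted.
    """
    bins, current, current_size = [], [], 0
    for index, size in enumerate(sizes_mb):
        if current and current_size + size > target_mb:
            bins.append(current)  # close the bin before it overflows
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        bins.append(current)
    return [b for b in bins if len(b) >= min_input_files]


# 20 files of 30 MB each (600 MB total) with a 256 MB target.
bins = split_into_bins([30] * 20)
for number, members in enumerate(bins):
    print(f"bin {number}: files {members[0] + 1}-{members[-1] + 1}, "
          f"{30 * len(members)} MB")
```

With these uniform sizes the first two bins hold eight files each, and the remaining four files fall below min_input_files and are left alone.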

Delete Handling

When apply_deletes is enabled (the default), compaction applies both position deletes and equality deletes during the merge:

Position deletes: the delete file contains file_path + pos columns indicating specific rows to remove. Paths are normalized so absolute S3 URLs and relative paths match correctly.
Equality deletes: the delete file specifies equality field IDs and column values. Rows matching those values are filtered out. Different delete files may use different equality columns — deletes are grouped by field ID set.

After compaction, delete manifests whose referenced data files were all compacted are dropped (deletes fully consumed). If any data files remain uncompacted, delete manifests are carried forward.

Set apply_deletes=false to revert to the previous behavior of skipping tables with delete manifests.
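The two lookup styles can be sketched on in-memory rows. This is a simplified illustration under several assumptions: rows are plain dicts, equality deletes are keyed by column names rather than field IDs, path normalization is omitted, and sequence-number rules are ignored. The function names are hypothetical.

```python
from bisect import bisect_left


def is_position_deleted(sorted_positions, row_pos):
    """Binary search for a row ordinal in the sorted position-delete list."""
    i = bisect_left(sorted_positions, row_pos)
    return i < len(sorted_positions) and sorted_positions[i] == row_pos


def is_equality_deleted(row, equality_deletes):
    """Hash-set lookup, one set per group of equality columns."""
    for columns, deleted_values in equality_deletes.items():
        if tuple(row[c] for c in columns) in deleted_values:
            return True
    return False


def filter_rows(rows, deleted_positions, equality_deletes):
    positions = sorted(deleted_positions)
    return [row for pos, row in enumerate(rows)
            if not is_position_deleted(positions, pos)
            and not is_equality_deleted(row, equality_deletes)]


rows = [
    {"id": 1, "region": "us"},
    {"id": 2, "region": "eu"},
    {"id": 3, "region": "us"},
    {"id": 4, "region": "eu"},
]
kept = filter_rows(
    rows,
    deleted_positions=[2],               # position delete: third row of this file
    equality_deletes={("id",): {(4,)}},  # equality delete: id = 4
)
print([row["id"] for row in kept])
```

Grouping the equality deletes by their column set means each group needs only one hash set, however many delete files contributed to it.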

Skip Conditions

Condition	Result
All files >= `target_file_size_mb`	`"no files eligible for compaction"`
No partition has >= `min_input_files` small files	`"no files eligible for compaction"`
Delete manifests present and `apply_deletes=false`	`"compaction skipped: delete manifests present and apply_deletes is disabled"`
No current snapshot	`"no current snapshot"`

2. Expire Snapshots

Removes old snapshots from table metadata and deletes files that are no longer referenced by any remaining snapshot.

How It Works

Sort snapshots by timestamp (newest first)
The current snapshot is always kept
Keep the newest max_snapshots_to_keep snapshots regardless of age
Among the rest, expire those older than snapshot_retention_hours
Snapshots within the retention window are kept even if they exceed the count
Commit new metadata with expired snapshots removed
Delete files exclusively referenced by expired snapshots (best-effort)
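The selection rules can be expressed as a small function. The sketch below is illustrative only: `select_expired` is a hypothetical name, snapshots are represented as (id, age in hours) pairs, and a snapshot exactly at the retention boundary is assumed to be kept. The sample data matches the example that follows.

```python
def select_expired(snapshots, current_id,
                   max_snapshots_to_keep=5, snapshot_retention_hours=168):
    """Return the IDs of snapshots to expire, newest first."""
    newest_first = sorted(snapshots, key=lambda s: s[1])  # smallest age first
    expired = []
    for rank, (snapshot_id, age_hours) in enumerate(newest_first):
        if snapshot_id == current_id:
            continue  # the current snapshot is always kept
        if rank < max_snapshots_to_keep:
            continue  # within the count budget, regardless of age
        if age_hours <= snapshot_retention_hours:
            continue  # still inside the retention window
        expired.append(snapshot_id)
    return expired


snapshots = [
    ("snap-8", 1), ("snap-7", 48), ("snap-6", 72), ("snap-5", 120),
    ("snap-4", 144), ("snap-3", 192), ("snap-2", 240), ("snap-1", 336),
]
expired = select_expired(snapshots, current_id="snap-8")
print(expired)
```

A snapshot is expired only when it fails both tests: it is outside the newest max_snapshots_to_keep and older than snapshot_retention_hours.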

Before / After

Before -- table with 8 snapshots, max_snapshots_to_keep=5, snapshot_retention_hours=168 (7 days):

Snapshots (newest first):
  snap-8  (current, 1 hour ago)    <- always kept
  snap-7  (2 days ago)             <- kept (within budget)
  snap-6  (3 days ago)             <- kept (within budget)
  snap-5  (5 days ago)             <- kept (within budget)
  snap-4  (6 days ago)             <- kept (within budget, count=5)
  snap-3  (8 days ago)             <- EXPIRED (beyond budget + retention)
  snap-2  (10 days ago)            <- EXPIRED
  snap-1  (14 days ago)            <- EXPIRED

After:

Snapshots:
  snap-8  (current)
  snap-7
  snap-6
  snap-5
  snap-4

Deleted: manifest lists, manifests, and data files
         exclusively referenced by snap-1, snap-2, snap-3

Result: "expired 3 snapshot(s), deleted 12 unreferenced file(s)"

3. Remove Orphans

Deletes files in the table's metadata/ and data/ directories that are not referenced by any snapshot. This catches files left behind by failed writes, interrupted compactions, or other incomplete operations.

How It Works

Collect all file paths referenced by all snapshots (manifest lists, manifests, data files) plus the active metadata file and previous metadata log entries
Walk the metadata/ and data/ directories on the filer
For each file not in the reference set:
- Skip if the file is newer than orphan_older_than_hours (safety window)
- Skip if the file has no attributes (unknown age)
- Delete the file
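A minimal sketch of the per-file decision, using the file names and ages from the example below. `find_orphans` is a hypothetical name, ages are passed in as hours (with `None` standing in for a file without attributes), and the ages given to referenced files are placeholders because referenced files are never considered.

```python
def find_orphans(file_ages_hours, referenced, orphan_older_than_hours=72):
    """Return paths that are unreferenced and older than the safety window."""
    to_delete = []
    for path, age_hours in file_ages_hours.items():
        if path in referenced:
            continue  # still reachable from table metadata
        if age_hours is None:
            continue  # no attributes: age unknown, so keep the file
        if age_hours <= orphan_older_than_hours:
            continue  # inside the safety window, may belong to an in-flight write
        to_delete.append(path)
    return sorted(to_delete)


file_ages_hours = {
    "metadata/v1.metadata.json": 300,         # referenced, age is a placeholder
    "metadata/v2-<nonce>.metadata.json": 2,   # referenced, age is a placeholder
    "metadata/snap-100.avro": 2,              # referenced, age is a placeholder
    "metadata/compact-failed.avro": 96,       # 4 days
    "metadata/v0.metadata.json": 240,         # 10 days
    "data/file-001.parquet": 300,             # referenced, age is a placeholder
    "data/compact-orphan.parquet": 120,       # 5 days
    "data/temp-upload.parquet": 1,            # 1 hour
}
referenced = {
    "metadata/v1.metadata.json",
    "metadata/v2-<nonce>.metadata.json",
    "metadata/snap-100.avro",
    "data/file-001.parquet",
}
orphans = find_orphans(file_ages_hours, referenced)
print(orphans)
```

The safety window exists so that files written by a commit still in progress are not deleted before that commit has a chance to reference them.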

Before / After

Before -- table directory has leftover files from a failed compaction and an old metadata version:

metadata/
  v1.metadata.json         <- referenced (previous metadata log)
  v2-<nonce>.metadata.json <- referenced (current)
  snap-100.avro            <- referenced
  compact-failed.avro      <- NOT referenced, 4 days old
  v0.metadata.json         <- NOT referenced, 10 days old
data/
  file-001.parquet         <- referenced
  compact-orphan.parquet   <- NOT referenced, 5 days old
  temp-upload.parquet      <- NOT referenced, 1 hour old

After (with orphan_older_than_hours=72):

Deleted:
  metadata/compact-failed.avro    (4 days > 72h)
  metadata/v0.metadata.json       (10 days > 72h)
  data/compact-orphan.parquet     (5 days > 72h)

Kept:
  data/temp-upload.parquet        (1 hour < 72h safety window)

Result: "removed 3 orphan file(s)"

4. Rewrite Manifests

Consolidates many small manifest files into fewer large ones. When a table accumulates many manifests (from frequent small commits), query engines must open each one to plan scans. Rewriting reduces this overhead.

How It Works

Read all data manifests from the current snapshot's manifest list
Skip if the data manifest count is below min_manifests_to_rewrite
Group manifest entries by partition spec ID (for spec-evolved tables, one output manifest per spec)
Write merged manifest(s) containing all entries
Carry forward any delete manifests unchanged
Write a new manifest list and commit a new snapshot
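The planning part of these steps can be sketched as below. The manifest representation (dicts with `content`, `spec_id`, and `entries` keys) and the function name `plan_manifest_rewrite` are assumptions made for illustration; real manifests are Avro files.

```python
from collections import defaultdict


def plan_manifest_rewrite(manifests, min_manifests_to_rewrite=5):
    """Merge data-manifest entries per partition spec; carry deletes forward."""
    data = [m for m in manifests if m["content"] == "data"]
    deletes = [m for m in manifests if m["content"] == "deletes"]
    if len(data) < min_manifests_to_rewrite:
        return None  # below threshold: nothing to do

    merged = defaultdict(list)
    for manifest in data:
        merged[manifest["spec_id"]].extend(manifest["entries"])
    # One output manifest per spec ID; delete manifests are untouched.
    return {"merged_by_spec": dict(merged), "carried_forward": deletes}


manifests = [{"content": "data", "spec_id": 0, "entries": [f"a{i}", f"b{i}"]}
             for i in range(6)]
manifests.append({"content": "data", "spec_id": 1, "entries": ["c0"]})
manifests.append({"content": "deletes", "spec_id": 0, "entries": ["d0"]})

rewrite_plan = plan_manifest_rewrite(manifests)
for spec_id, entries in rewrite_plan["merged_by_spec"].items():
    print(f"spec {spec_id}: {len(entries)} entries")
```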

Before / After

Before -- 12 small manifests from frequent commits:

metadata/
  manifest-001.avro   (50 entries)
  manifest-002.avro   (10 entries)
  manifest-003.avro    (5 entries)
  ...
  manifest-012.avro    (8 entries)
  snap-500.avro        <- manifest list pointing to 12 manifests

After -- 1 merged manifest:

metadata/
  merged-<ts>-spec0-<ts>.avro  (all entries combined)
  snap-<ts>-<ts>.avro          <- new manifest list (1 manifest)

Result: "rewrote 12 manifests into 1 (320 entries)"

Skip Conditions

Condition	Result
Data manifest count < `min_manifests_to_rewrite`	`"only N data manifests, below threshold of M"`
No current snapshot	`"no current snapshot"`
No data entries	`"no data entries to rewrite"`

Configuration Reference

Admin Config (Scope Filters)

These control which tables the detection scan considers. Set via the Admin UI under the Iceberg Maintenance job type.

Parameter	Default	Description
`bucket_filter`	(blank = all)	Comma-separated wildcard patterns for table bucket names (`*` and `?` supported)
`namespace_filter`	(blank = all)	Comma-separated wildcard patterns for namespaces
`table_filter`	(blank = all)	Comma-separated wildcard patterns for table names

Examples:

bucket_filter:    prod-*, staging-*
namespace_filter: analytics, events-*
table_filter:     clicks, orders-*
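To reason about which tables a set of filters selects, the matching can be approximated with Python's `fnmatch` module. This is only an approximation of the worker's matcher: it assumes case-sensitive matching and that the three filters are combined with AND, and `fnmatch` additionally understands `[...]` character classes, which this page does not mention. The function names are hypothetical.

```python
from fnmatch import fnmatchcase


def matches_filter(name, filter_value):
    """True if name matches any comma-separated pattern; blank matches all."""
    patterns = [p.strip() for p in filter_value.split(",") if p.strip()]
    if not patterns:
        return True
    return any(fnmatchcase(name, pattern) for pattern in patterns)


def table_in_scope(bucket, namespace, table,
                   bucket_filter="", namespace_filter="", table_filter=""):
    # Assumption: a table must pass all three filters to be considered.
    return (matches_filter(bucket, bucket_filter)
            and matches_filter(namespace, namespace_filter)
            and matches_filter(table, table_filter))


filters = {
    "bucket_filter": "prod-*, staging-*",
    "namespace_filter": "analytics, events-*",
    "table_filter": "clicks, orders-*",
}
print(table_in_scope("prod-eu", "analytics", "orders-2024", **filters))
print(table_in_scope("dev-eu", "analytics", "orders-2024", **filters))
```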

Worker Config (Thresholds)

These control how maintenance operations behave. Set via the Admin UI or passed as worker config values.

Parameter	Default	Description
`snapshot_retention_hours`	`168` (7 days)	Expire snapshots older than this
`max_snapshots_to_keep`	`5`	Always keep at least this many newest snapshots
`orphan_older_than_hours`	`72` (3 days)	Safety window: only delete orphans older than this
`target_file_size_mb`	`256` (MB)	Files smaller than this are compaction candidates
`min_input_files`	`5`	Minimum small files in a partition to trigger compaction
`apply_deletes`	`true`	When true, compaction applies position and equality deletes to data files. When false, tables with delete manifests are skipped
`min_manifests_to_rewrite`	`5`	Minimum manifests before rewriting is triggered
`max_commit_retries`	`5`	Max optimistic concurrency retries on version conflict
`operations`	`all`	Comma-separated list of operations, or `all`

Operations Selection

The operations parameter controls which operations run. The selected operations always execute in canonical order, regardless of the order in which they are listed:

compact -> expire_snapshots -> remove_orphans -> rewrite_manifests

Examples:

operations: all                              # run everything (default)
operations: expire_snapshots,remove_orphans  # snapshots + cleanup only
operations: compact                          # compaction only
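The effect of the canonical ordering can be sketched as a small normalization step. `resolve_operations` is a hypothetical name, and two behaviors are assumptions rather than documented facts: an empty value is treated like `all`, and an unknown operation name raises an error.

```python
CANONICAL_ORDER = ["compact", "expire_snapshots",
                   "remove_orphans", "rewrite_manifests"]


def resolve_operations(value):
    """Turn the comma-separated setting into a canonically ordered list."""
    requested = {part.strip() for part in value.split(",") if part.strip()}
    if not requested or "all" in requested:
        return list(CANONICAL_ORDER)  # assumption: blank behaves like "all"
    unknown = requested - set(CANONICAL_ORDER)
    if unknown:
        raise ValueError(f"unknown operation(s): {sorted(unknown)}")
    # Iterate the canonical list so the caller's ordering is ignored.
    return [op for op in CANONICAL_ORDER if op in requested]


print(resolve_operations("remove_orphans, expire_snapshots"))
print(resolve_operations("all"))
```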

Admin Runtime Defaults

Setting	Default	Description
Enabled	`false`	Must be explicitly enabled
Detection interval	`3600s` (1 hour)	How often the worker scans for tables needing maintenance
Detection timeout	`300s`	Max time for a detection scan
Max jobs per detection	`100`	Cap on proposals per scan
Global execution concurrency	`4`	Cluster-wide parallel maintenance jobs
Per-worker execution concurrency	`2`	Parallel jobs per worker instance
Job max runtime	`3600s` (1 hour)	Timeout for a single maintenance job
Retry limit	`1`	Scheduler-level retries for failed jobs

Execution Order

Operations execute in a fixed order that follows Iceberg best practices:

compact -> expire_snapshots -> remove_orphans -> rewrite_manifests

This order matters because:

Compact creates a new snapshot with merged files and marks originals as deleted
Expire Snapshots drops old snapshots, making the original (now-deleted) files unreferenced
Remove Orphans cleans up unreferenced files from disk (including compaction leftovers)
Rewrite Manifests consolidates the manifests accumulated from compaction and other commits

If one operation fails, subsequent operations still attempt to run. The job reports partial success with per-operation results.

Commit Protocol

All metadata-modifying operations (compact, expire_snapshots, rewrite_manifests) use an optimistic concurrency protocol:

Read current metadata from the table entry's xattr
Plan the changes (build new manifests, manifest lists, etc.)
Write new metadata file with a unique name (v<N>-<timestamp>.metadata.json)
Compare-and-swap the table entry's xattr: verify the stored metadataVersion matches what was read, then update atomically
On version conflict, clean up the staged metadata file and retry with exponential backoff (50ms, 100ms, 200ms, ... capped at 5s)

Each operation also includes a stale-plan guard: if the current snapshot has changed since planning began, the operation aborts rather than committing against outdated state.
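The retry loop can be sketched against an in-memory stand-in for the table entry. Everything named here is hypothetical (`commit_with_retry`, `CommitConflict`, the callback signatures), the stale-plan guard is left out, and it is an assumption that max_commit_retries counts retries after the first attempt. The backoff values follow the sequence given above.

```python
import time


class CommitConflict(Exception):
    """Raised when every attempt hit a version conflict."""


def backoff_ms(retry, base_ms=50, cap_ms=5000):
    """50ms, 100ms, 200ms, ... capped at 5s."""
    return min(base_ms * 2 ** retry, cap_ms)


def commit_with_retry(read_version, stage, compare_and_swap, cleanup,
                      max_commit_retries=5, sleep=time.sleep):
    """Return the number of retries needed, or raise CommitConflict."""
    for attempt in range(max_commit_retries + 1):
        version = read_version()           # 1. read current metadata version
        staged = stage(version)            # 2-3. plan and write staged metadata
        if compare_and_swap(version, staged):
            return attempt                 # 4. swap succeeded
        cleanup(staged)                    # 5. conflict: drop the staged file
        if attempt < max_commit_retries:
            sleep(backoff_ms(attempt) / 1000)
    raise CommitConflict(f"gave up after {max_commit_retries} retries")


# In-memory demo: a simulated concurrent writer wins the first race.
store = {"version": 1}
interference = [True]
cleaned, slept = [], []


def read_version():
    return store["version"]


def stage(version):
    return f"v{version + 1}-staged.metadata.json"


def compare_and_swap(expected, staged):
    if interference and interference.pop():
        store["version"] += 1              # someone else committed first
    if store["version"] != expected:
        return False
    store["version"] = expected + 1
    return True


retries = commit_with_retry(read_version, stage, compare_and_swap,
                            cleaned.append, sleep=slept.append)
print(retries, store["version"], cleaned, slept)
```

Re-reading the version at the top of every attempt matters: retrying with the version from the first read would fail the comparison on every subsequent attempt.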

Detection

The detection scan runs periodically (default: hourly) and evaluates each table using metadata-only heuristics (no manifest reading):

Snapshot count exceeds max_snapshots_to_keep
Any snapshot is older than snapshot_retention_hours

Tables matching either condition are proposed as maintenance jobs. The actual operations then perform precise filtering (e.g., expire_snapshots checks both conditions together before expiring).
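The heuristic amounts to an OR of two cheap checks on snapshot metadata. A minimal sketch, with `needs_maintenance` as a hypothetical name and snapshot ages supplied in hours:

```python
def needs_maintenance(snapshot_ages_hours,
                      max_snapshots_to_keep=5, snapshot_retention_hours=168):
    """Metadata-only check: propose the table if either condition holds."""
    too_many = len(snapshot_ages_hours) > max_snapshots_to_keep
    too_old = any(age > snapshot_retention_hours
                  for age in snapshot_ages_hours)
    return too_many or too_old


print(needs_maintenance([1, 24, 48]))         # few and recent snapshots
print(needs_maintenance([1, 2, 3, 4, 5, 6]))  # more than max_snapshots_to_keep
print(needs_maintenance([1, 200]))            # one snapshot past retention
```

Because detection uses OR while expiration requires both conditions, a proposed table may turn out to have nothing to expire once the job runs.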

Monitoring

Maintenance jobs report progress and results through the worker framework:

Activity events: scan start, scan complete (with table count), per-operation start/complete
Progress updates: percentage based on completed operations, with per-bin granularity during compaction
Job result summary: per-operation outcomes joined by semicolons
Structured metrics: per-operation metrics in OutputValues with dot-prefixed keys

Example result summary:

compact: compacted 15 files into 3 (across 3 bins);
expire_snapshots: expired 2 snapshot(s), deleted 8 unreferenced file(s);
remove_orphans: removed 4 orphan file(s);
rewrite_manifests: rewrote 7 manifests into 1 (120 entries)

Structured Metrics

Each operation returns structured metrics in the job's OutputValues map, keyed with the operation name as prefix:

Key	Description
`compact.files_merged`	Number of input files merged
`compact.files_written`	Number of output files produced
`compact.bins`	Number of compaction bins processed
`compact.duration_ms`	Time spent on compaction
`expire_snapshots.snapshots_expired`	Number of snapshots removed
`expire_snapshots.files_deleted`	Unreferenced files cleaned up
`expire_snapshots.duration_ms`	Time spent on expiration
`remove_orphans.orphans_removed`	Number of orphan files deleted
`remove_orphans.duration_ms`	Time spent on orphan removal
`rewrite_manifests.manifests_rewritten`	Number of data manifests consolidated
`rewrite_manifests.entries_total`	Total manifest entries in rewritten output
`rewrite_manifests.duration_ms`	Time spent on manifest rewriting
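A consumer of these metrics might regroup the flat keys by operation. The sketch below assumes OutputValues can be read as a flat mapping from key to value. The value types are an assumption, and the numbers are taken from the example result summary above. `group_metrics` is a hypothetical helper.

```python
def group_metrics(output_values):
    """Split 'operation.metric' keys into a nested {operation: {metric: value}}."""
    grouped = {}
    for key, value in output_values.items():
        operation, _, metric = key.partition(".")
        grouped.setdefault(operation, {})[metric] = value
    return grouped


output_values = {
    "compact.files_merged": 15,
    "compact.files_written": 3,
    "compact.bins": 3,
    "expire_snapshots.snapshots_expired": 2,
    "expire_snapshots.files_deleted": 8,
    "remove_orphans.orphans_removed": 4,
    "rewrite_manifests.manifests_rewritten": 7,
    "rewrite_manifests.entries_total": 120,
}
metrics = group_metrics(output_values)
for operation, values in metrics.items():
    print(operation, values)
```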

Limitations

Client-side CAS: the compare-and-swap on metadata version is enforced client-side. Concurrent maintenance on the same table should be avoided (the deduplication key prevents this under normal scheduling).
Single filer per job: detection tries each filer address in the cluster context until one connects, but each execution job uses the single filer address recorded in its proposal.
