S3 Lifecycle — Troubleshooting
Incident-response playbook for the S3 lifecycle worker. For monitoring background, see S3-Lifecycle-Monitoring.
Stuck cursor
Symptom: s3_lifecycle_cursor_min_ts_ns{shard=N} is not advancing. Heartbeat shows cursor_lag_max growing unbounded.
Cause: The cursor advance is gated on every match from the event dispatching successfully (DONE, NOOP_RESOLVED, or SKIPPED_OBJECT_LOCK). Any unresolved outcome (RETRY_LATER, BLOCKED, transport error after in-run retries) halts the run for that shard and persists the cursor at the last fully-processed event. Head-of-line blocking is intentional — it surfaces a real problem rather than silently retrying forever.
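The gating rule above can be illustrated with a small sketch. It is not the worker's actual code: the `Event` type, its fields, and `advance_cursor` are hypothetical names, and only the outcome strings come from this page.

```python
from dataclasses import dataclass
from typing import List

# Outcomes this page lists as successfully dispatched.
RESOLVED = {"DONE", "NOOP_RESOLVED", "SKIPPED_OBJECT_LOCK"}


@dataclass
class Event:
    ts_ns: int    # event timestamp in nanoseconds (hypothetical field)
    outcome: str  # dispatch result for this event (hypothetical field)


def advance_cursor(cursor_ts_ns: int, events: List[Event]) -> int:
    """Return the cursor after one run: it stops before the first unresolved event."""
    for ev in sorted(events, key=lambda e: e.ts_ns):
        if ev.outcome not in RESOLVED:
            break  # head-of-line blocking: later events are not considered
        cursor_ts_ns = max(cursor_ts_ns, ev.ts_ns)
    return cursor_ts_ns
```

With events at timestamps 10 (DONE), 20 (RETRY_LATER), and 30 (DONE), the cursor stays at 10 even though the third event would have succeeded, which is why a single bad event shows up as growing lag for the whole shard.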
Diagnostics:
# Which shard is stuck?
time() * 1e9 - s3_lifecycle_cursor_min_ts_ns
# What outcomes are being returned?
sum by (outcome) (rate(s3_lifecycle_dispatch_total[5m]))
Look at the worker log for the offending event. The dispatcher logs at glog.V(1):
daily_run: RETRY_LATER on <bucket>/<key> EXPIRATION_DAYS
daily_run: BLOCKED on <bucket>/<key> NONCURRENT_DAYS
daily_run: transport error on <bucket>/<key> ABORT_MPU: <err>
Mitigations:
| Outcome | Root cause | Action |
|---|---|---|
| RETRY_LATER (high rate) | Filer or rate limiter is throttling | Lower cluster_deletes_per_second to give the filer headroom, or scale filer capacity |
| BLOCKED FATAL_EVENT_ERROR | A malformed event the server refuses to dispatch | Check log for the specific reason. May need a code fix; file an issue with the log line |
| BLOCKED SKIPPED_OBJECT_LOCK | Object is locked (legal hold, retention) | Wait for lock to expire, or remove lock manually. Cursor advances normally; this isn't stuck. |
| RPC_ERROR (sustained) | Transport / network issue | Check S3 server health and filer reachability |
The worker doesn't auto-skip past a stuck event. If you've verified the event is malformed and want to skip it: first pause the s3_lifecycle job in the admin UI so the worker isn't running mid-edit, then edit the cursor file directly (/etc/s3/lifecycle/daily-cursors/shard-NN.json), advancing ts_ns past the bad event's TsNs. Resume the job. Editing while the worker is active would race with the worker's own save and either lose your change or overwrite the persisted progress.
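As a rough illustration of the edit itself, the sketch below rewrites a cursor document so that ts_ns sits just past a known bad event. The helper name `skip_past_event` is hypothetical, and treating "past" as the bad event's TsNs plus one nanosecond is an assumption, not something this page specifies. It only transforms JSON text; fetching and writing back the cursor file on the filer is left out.

```python
import json


def skip_past_event(cursor_json: str, bad_ts_ns: int) -> str:
    """Return cursor JSON with ts_ns moved past bad_ts_ns, never backwards."""
    cursor = json.loads(cursor_json)
    # Assumption: "past" means strictly greater than the bad event's TsNs.
    cursor["ts_ns"] = max(cursor.get("ts_ns", 0), bad_ts_ns + 1)
    # All other fields (version, shard_id, hashes, last_walked_ns) are kept as-is.
    return json.dumps(cursor, indent=2)
```

Taking the maximum guards against accidentally rewinding a cursor that has already moved beyond the event in question. Run any such edit only while the job is paused, as described above.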
Walker stuck (no progress on walker-only rules)
Symptom: s3_lifecycle_daily_run_last_walked_ns{shard=N} is not advancing. Rules like Expiration.Date, ExpiredObjectDeleteMarker, NewerNoncurrent aren't firing on objects that should be due.
Causes:
- walker_interval_minutes is too long for your invocation cadence. Worker runs once per day but interval is set to 48h.
- Walker is hitting an error mid-walk (filer listing failure). Look for recovery walk: or steady walk: errors in the heartbeat's errors=N count.
- The bucket has only walker-bound rules and the empty-replay branch's throttle hasn't elapsed.
Diagnostics:
# Walker age per shard
(time() * 1e9 - s3_lifecycle_daily_run_last_walked_ns) / 1e9
Check worker config: walker_interval_minutes should be ≤ the daily worker schedule interval.
Mitigation: lower walker_interval_minutes. Setting 0 temporarily forces every pass to walk.
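To see why a 48h interval combined with a daily schedule delays walker-only rules, here is a minimal sketch of an interval throttle. The function `should_walk` is hypothetical and the real worker's throttle logic may differ; the sketch only assumes that 0 disables the throttle and that a shard that has never walked is always due.

```python
NS_PER_MINUTE = 60 * 1_000_000_000


def should_walk(now_ns: int, last_walked_ns: int, walker_interval_minutes: int) -> bool:
    """Decide whether this pass should run the walker (illustrative only)."""
    if walker_interval_minutes <= 0:
        return True  # 0 forces every pass to walk
    if last_walked_ns == 0:
        return True  # never walked steady-state
    return now_ns - last_walked_ns >= walker_interval_minutes * NS_PER_MINUTE
```

Under these assumptions, a worker invoked once per day with an interval of 2880 minutes walks only on every other pass, so walker-only rules can appear to lag by an extra day.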
Test PUT a file with a 1-day rule, didn't expire
The S3 API rejects Expiration.Days < 1, so the smallest "expire after N days" you can configure is 1 day. The worker runs once per day by default. Object PUT + 1-day rule + waiting one day is the minimum scenario.
For testing, the in-repo integration suite uses a trick: backdate the entry's Mtime via filer_pb.UpdateEntry to 30+ days ago. See test/s3/lifecycle/ for the pattern.
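The reason backdating works can be shown with a simplified age check. This is a sketch under the assumption that an object is due once its mtime is at least Expiration.Days old; the worker's exact rounding is not described on this page, and `is_due` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def is_due(mtime: datetime, expiration_days: int, now: datetime) -> bool:
    """Return True if an object with this mtime is old enough to expire."""
    if expiration_days < 1:
        # Mirrors the API constraint that Expiration.Days < 1 is rejected.
        raise ValueError("Expiration.Days must be >= 1")
    return now - mtime >= timedelta(days=expiration_days)


# A freshly PUT object is not due under a 1-day rule; a backdated one is.
now = datetime(2024, 5, 13, tzinfo=timezone.utc)
fresh = is_due(now - timedelta(hours=1), 1, now)
backdated = is_due(now - timedelta(days=30), 1, now)
```

Here `fresh` is False and `backdated` is True, which matches the behavior the integration suite relies on when it rewrites Mtime.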
For ad-hoc verification, invoke the worker manually:
weed shell -master <host:http_port.grpc_port>
> s3.lifecycle.run-shard -shards 0-15 -s3 <s3-host:port> -refresh 1s -runtime 30s
This runs the same code path as the scheduled worker, but driven from your shell rather than the admin scheduler.
All shards report errors=16 every pass
Symptom: Heartbeat consistently shows status=error shards=16 errors=16 duration=Ns.
Common causes:
- Filer unreachable. Subscription fails on every shard. Check filer health and gRPC connectivity.
- passCtx timeout from -refresh loop. If -refresh is less than the pass cap, the timeout fires before the drain completes. This is now treated as "clean end-of-pass" and should not show as errors=N. If you see this on a build before #9481, upgrade.
- Bucket walker is timing out. Big bucket, walker hits ctx deadline. Increase max_runtime_minutes.
Some objects expired, others didn't (same rule)
Symptom: Two objects matching the same rule with the same age — one is deleted, the other isn't.
Common causes:
- The non-deleted object's mtime is wrong. Check the entry's mtime — it might be more recent than you expect (e.g., a recent metadata update bumped it).
- The objects are on different shards and one shard has a stuck cursor while the other doesn't. Check s3_lifecycle_cursor_min_ts_ns per shard.
- Filter doesn't match what you think. A prefix-only filter requires the object key to start with that prefix; a tag filter requires the matching tag. Verify the object with aws s3api head-object, and list its tags with aws s3api get-object-tagging (head-object reports only a tag count).
- Object lock or retention. The dispatcher returns SKIPPED_OBJECT_LOCK for protected objects. Check s3_lifecycle_dispatch_total{outcome="SKIPPED_OBJECT_LOCK"}.
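The filter semantics in the list above can be checked offline with a small sketch. `filter_matches` is a hypothetical helper that follows the standard S3 rule that a key must start with the prefix and carry every required tag with the same value; it does not call the server.

```python
from typing import Dict, Optional


def filter_matches(
    key: str,
    tags: Dict[str, str],
    prefix: str = "",
    required_tags: Optional[Dict[str, str]] = None,
) -> bool:
    """Return True if an object key and its tags satisfy a lifecycle filter."""
    if not key.startswith(prefix):
        return False  # the prefix must match from the first character of the key
    for name, value in (required_tags or {}).items():
        if tags.get(name) != value:
            return False  # every required tag must be present with the same value
    return True
```

A common surprise is that the prefix logs/ does not match the key data/logs/a.txt, because the match is anchored at the start of the key rather than anywhere inside it.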
How to read the cursor files
Cursors live at /etc/s3/lifecycle/daily-cursors/shard-NN.json on the filer. Read with the filer's read API or weed shell:
weed shell -master <host:http_port.grpc_port>
> fs.cat /etc/s3/lifecycle/daily-cursors/shard-00.json
Schema:
{
"version": 1,
"shard_id": 0,
"ts_ns": 1715600000000000000,
"rule_set_hash": "<base64 32 bytes>",
"promoted_hash": "<base64 32 bytes>",
"last_walked_ns": 1715620000000000000
}
- ts_ns == 0 means "no replay progress": either cold start or a bucket whose rules are all walker-only.
- last_walked_ns == 0 (or absent) means "never walked steady-state". Next pass will walk.
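For quick triage, a cursor document can be summarized with a few lines of Python. This is a sketch, not a shipped tool: `describe_cursor` and the keys of its result are hypothetical, while the input field names follow the schema above.

```python
import json


def describe_cursor(cursor_json: str, now_ns: int) -> dict:
    """Summarize a cursor file: replay progress, walker state, and ages in seconds."""
    c = json.loads(cursor_json)
    ts_ns = c.get("ts_ns", 0)
    last_walked_ns = c.get("last_walked_ns", 0)  # may be absent
    return {
        "shard_id": c.get("shard_id"),
        "has_replay_progress": ts_ns != 0,
        "cursor_lag_seconds": (now_ns - ts_ns) / 1e9 if ts_ns else None,
        "walked_steady_state": last_walked_ns != 0,
        "walker_age_seconds": (now_ns - last_walked_ns) / 1e9 if last_walked_ns else None,
    }
```

Pass the text printed by fs.cat as `cursor_json` and the current time in nanoseconds as `now_ns`. The two ages correspond to the lag and walker-age expressions used in the diagnostics sections.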
Manually editing the cursor is supported as an escape hatch but obviously breaks the invariant that "everything before persisted.TsNs has been processed under the same rules." Use sparingly.
Resetting a shard
If a shard's cursor is corrupted or wedged in an unrecoverable state:
weed shell -master <host:http_port.grpc_port>
> fs.rm /etc/s3/lifecycle/daily-cursors/shard-07.json
Next pass treats the shard as cold start: recovery walker fires over RecoveryView(snap), then the cursor seeds at runNow - maxTTL. This re-replays a maxTTL-wide window of meta-log events. Identity-CAS on the server side makes redundant deletes no-ops, so re-replay is safe.
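The seed arithmetic is simple enough to sketch. The assumption here, not confirmed by this page, is that maxTTL is the largest day-based TTL across the active rules; `cold_start_seed` is a hypothetical name.

```python
from typing import List

NS_PER_DAY = 24 * 60 * 60 * 1_000_000_000


def cold_start_seed(run_now_ns: int, rule_ttl_days: List[int]) -> int:
    """Return the cursor seed for a cold start: runNow minus the widest TTL."""
    # Assumption: maxTTL is the maximum TTL in days over all rules (0 if none).
    max_ttl_ns = max(rule_ttl_days, default=0) * NS_PER_DAY
    return run_now_ns - max_ttl_ns
```

With rules of 1, 7, and 30 days, the reset shard replays roughly the last 30 days of meta-log events, so expect the first pass after a reset to take longer than a steady-state pass.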
Suspending the worker
The admin UI's plugin scheduler allows pausing the s3_lifecycle job type. The worker won't be invoked while paused; cursors are preserved, and on resume the next pass picks up at the persisted state.
For a single-bucket "stop deleting from this bucket" without pausing the worker:
aws --endpoint $S3_ENDPOINT s3api delete-bucket-lifecycle --bucket my-bucket
The worker reads the empty rule set on the next pass; rsh == [32]byte{} causes the empty-replay branch to run only the walker (which has nothing to walk for an empty rule set) and exit. No deletes.
Reverting
If something goes very wrong, note that the streaming worker (the previous design) is no longer in the codebase, so reverting means downgrading the binary. The cursor format has a version field: a cursor file with a version greater than 1 would fail loudly when loaded by an older binary that only knows version 1. Currently version 1 is the only version.
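The fail-loud behavior on an unknown cursor version can be pictured as follows. This is an illustrative sketch in Python, not the worker's loader; `load_cursor` and `MAX_KNOWN_VERSION` are hypothetical names.

```python
import json

MAX_KNOWN_VERSION = 1  # the only version that exists today, per this page


def load_cursor(cursor_json: str) -> dict:
    """Parse a cursor file, refusing versions newer than this binary understands."""
    cursor = json.loads(cursor_json)
    version = cursor.get("version", 0)
    if version > MAX_KNOWN_VERSION:
        # Fail loudly instead of guessing at fields from a newer format.
        raise ValueError(f"unsupported cursor version {version}")
    return cursor
```

If a future release bumps the version and you later downgrade, resetting the affected shards (see "Resetting a shard") is one way to get past such an error, at the cost of a re-replay.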
For escape-hatch operations (delete all cursors, suspend worker globally), prefer pausing via the admin UI over force-killing the worker process; the worker exits cleanly between passes.