S3 Lifecycle — Monitoring
This page lists the Prometheus signals the worker exposes and how to read the heartbeat log line. For incident response, see S3-Lifecycle-Troubleshooting.
Prometheus metrics
All metrics are defined in weed/stats/metrics.go under the s3_lifecycle subsystem.
Per-shard gauges
| Metric | Labels | What |
|---|---|---|
| s3_lifecycle_cursor_min_ts_ns | shard | UnixNano of the last meta-log event on this shard for which all matches dispatched successfully |
| s3_lifecycle_daily_run_last_walked_ns | shard | UnixNano of the most recent successful walker fire |
Derived queries:
# Per-shard replay lag in seconds
(time() * 1e9 - s3_lifecycle_cursor_min_ts_ns) / 1e9
# Per-shard walker freshness in seconds
(time() * 1e9 - s3_lifecycle_daily_run_last_walked_ns) / 1e9
# Worst-shard lag across the cluster
max(time() * 1e9 - s3_lifecycle_cursor_min_ts_ns) / 1e9
Zero values mean "not started yet" — distinct from "0s caught up". The heartbeat line uses cold as the marker for that state.
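The zero-means-cold rule is easy to get wrong in dashboards and scripts, so here is a minimal Python sketch of it. This is an illustration, not the worker's code: the function name `lag_seconds` and the choice of `None` for the cold state are assumptions made for this example.

```python
import time
from typing import Optional

NANOS_PER_SECOND = 1_000_000_000


def lag_seconds(gauge_ns: int, now_ns: Optional[int] = None) -> Optional[float]:
    """Convert a UnixNano gauge value into seconds of lag.

    Returns None when the gauge is still zero (the shard has not started
    yet, i.e. "cold"), so callers cannot confuse it with "0s caught up".
    """
    if gauge_ns <= 0:
        return None
    if now_ns is None:
        now_ns = time.time_ns()
    return (now_ns - gauge_ns) / NANOS_PER_SECOND
```

Passing `now_ns` explicitly keeps the function deterministic in tests. In production the default wall-clock reading is usually what you want.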
Counters
| Metric | Labels | What |
|---|---|---|
| s3_lifecycle_dispatch_total | bucket, kind, outcome | Per-bucket dispatch counter, partitioned by action kind and server outcome |
| s3_lifecycle_daily_run_events_scanned_total | shard | Meta-log events drainShardEvents processed |
| s3_lifecycle_bootstrap_dispatch_total | bucket, kind | Walker dispatch counter |
| s3_lifecycle_metadata_only_total | bucket, rule_hash | Successful deletes that took the metadata-only path |
outcome values: DONE, NOOP_RESOLVED, SKIPPED_OBJECT_LOCK, RETRY_LATER, BLOCKED, LIFECYCLE_DELETE_OUTCOME_UNSPECIFIED, RPC_ERROR. The first three are success outcomes that advance the cursor; the others halt the run. LIFECYCLE_DELETE_OUTCOME_UNSPECIFIED is the proto zero-value — a healthy worker / server pair should never emit it; a non-zero count there indicates an internal error or a version mismatch between worker and server.
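The split between cursor-advancing and run-halting outcomes can be sketched in Python as below. The outcome names come from the list above. The function `summarize_outcomes` and its input shape (a mapping of outcome label to count, for example from summing s3_lifecycle_dispatch_total by outcome) are assumptions made for this example.

```python
from typing import Dict, List, Union

# Outcomes that count as success and advance the cursor.
SUCCESS_OUTCOMES = frozenset({"DONE", "NOOP_RESOLVED", "SKIPPED_OBJECT_LOCK"})

# The proto zero-value; a healthy worker/server pair should never emit it.
UNSPECIFIED = "LIFECYCLE_DELETE_OUTCOME_UNSPECIFIED"

# Outcomes that halt the run.
HALTING_OUTCOMES = frozenset({"RETRY_LATER", "BLOCKED", UNSPECIFIED, "RPC_ERROR"})


def summarize_outcomes(counts: Dict[str, int]) -> Dict[str, Union[int, List[str]]]:
    """Group per-outcome counts into success, halting, and unspecified totals.

    Labels outside the documented set are collected under "unknown"
    instead of being silently counted as either success or failure.
    """
    success = 0
    halting = 0
    unspecified = 0
    unknown: List[str] = []
    for outcome, count in counts.items():
        if outcome in SUCCESS_OUTCOMES:
            success += count
        elif outcome in HALTING_OUTCOMES:
            halting += count
            if outcome == UNSPECIFIED:
                unspecified += count
        else:
            unknown.append(outcome)
    return {
        "success": success,
        "halting": halting,
        "unspecified": unspecified,
        "unknown": sorted(unknown),
    }
```

A non-zero `unspecified` total is the signal described above: an internal error or a version mismatch between worker and server.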
Histograms
| Metric | What |
|---|---|
| s3_lifecycle_daily_run_shard_duration_seconds{shard} | Wall-clock per shard pass. p95 climbing toward max_runtime_minutes means the shard is brushing its budget. |
| s3_lifecycle_dispatch_limiter_wait_seconds | Time spent waiting on the cluster rate limiter before issuing LifecycleDelete. Near-zero = cap not binding; long-tail at 1/rate = cap is the active throttle. |
Heartbeat log line
Emitted at the end of every dailyrun.Run invocation, at glog.V(0) (default verbosity):
daily_run: status=ok shards=16 errors=0 duration=7s cursor_lag_max=2m walked_max_age=3m
Tokens are space-separated key=value pairs, suitable for grep or log-aggregator filtering. The token set is stable across versions:
| Token | Meaning |
|---|---|
| status=ok or status=error | Whether any shard returned an error |
| shards=N | Number of shards processed this pass |
| errors=N | Per-shard error count |
| duration=Ns | Wall-clock for the whole pass |
| cursor_lag_max=... | Worst per-shard replay lag, or cold if no shard has a persisted cursor yet |
| walked_max_age=... | Worst per-shard walker age, or cold if no shard has walked yet |
A healthy production heartbeat looks like:
daily_run: status=ok shards=16 errors=0 duration=12.3s cursor_lag_max=45s walked_max_age=58m
Read it as: 16 shards finished cleanly in 12 seconds; the worst-case replay lag is 45 seconds behind real-time; the oldest walker fire on any shard is 58 minutes ago (so walker_interval_minutes=60 is roughly honored).
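For scripts that consume the heartbeat, a parser along these lines may be a useful starting point. It is a sketch under assumptions: the page does not specify the exact duration formatting, so the code assumes Go-style strings such as `7s`, `12.3s`, `58m`, or `1h2m3s`, as in the examples above. The names `parse_duration` and `parse_heartbeat` are hypothetical and not part of SeaweedFS.

```python
import re
from typing import Dict, Optional, Union

# Seconds per unit, for the units assumed to appear in heartbeat durations.
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")


def parse_duration(text: str) -> Optional[float]:
    """Return seconds for '7s', '58m', '1h2m3s'; return None for 'cold'."""
    if text == "cold":
        return None
    total = 0.0
    pos = 0
    for match in _DURATION_PART.finditer(text):
        if match.start() != pos:
            raise ValueError(f"unrecognized duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"unrecognized duration: {text!r}")
    return total


def parse_heartbeat(line: str) -> Dict[str, Union[str, int, float, None]]:
    """Parse a daily_run heartbeat line, ignoring any log prefix before it."""
    marker = "daily_run:"
    start = line.find(marker)
    if start < 0:
        raise ValueError("not a daily_run heartbeat line")
    tokens = line[start + len(marker):].split()
    fields = dict(tok.split("=", 1) for tok in tokens if "=" in tok)
    return {
        "status": fields["status"],
        "shards": int(fields["shards"]),
        "errors": int(fields["errors"]),
        "duration_s": parse_duration(fields["duration"]),
        "cursor_lag_max_s": parse_duration(fields["cursor_lag_max"]),
        "walked_max_age_s": parse_duration(fields["walked_max_age"]),
    }
```

The cold marker maps to `None` rather than `0.0`, which preserves the distinction between "not started yet" and "caught up".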
Anti-patterns to alert on
| Pattern | Meaning | What to do |
|---|---|---|
| cursor_lag_max grows unbounded | Stuck cursor; head-of-line blocking on some shard | See Troubleshooting → Stuck cursor |
| walked_max_age exceeds walker_interval_minutes × 2 | Walker isn't firing as configured | Check errors=N in heartbeat and s3_lifecycle_dispatch_total{outcome="RPC_ERROR"} |
| errors=16 (all shards) on every pass | Filer is unreachable or returning errors | Check filer health |
| s3_lifecycle_dispatch_total{outcome="RETRY_LATER"} rising fast | Server rate-limited or filer overloaded | Lower cluster_deletes_per_second or add capacity |
| s3_lifecycle_dispatch_total{outcome="BLOCKED"} non-zero | Programmatic event content error | Check worker logs for FATAL_EVENT_ERROR |
| duration=Ns ramping up across passes | Walker is firing too often | Set walker_interval_minutes |
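Two of the heartbeat-based patterns above can be checked on a single pass, as in the following Python sketch. The function name `heartbeat_warnings` and its parameters are assumptions made for this example. A single-pass check is only a hint, because the table describes patterns that persist across passes ("on every pass", "grows unbounded"), so real alerting should look at a window of heartbeats.

```python
from typing import List, Optional


def heartbeat_warnings(
    shards: int,
    errors: int,
    walked_max_age_s: Optional[float],
    walker_interval_minutes: float,
) -> List[str]:
    """Return warnings for one heartbeat, given already-parsed values.

    walked_max_age_s is None when the heartbeat reported "cold".
    """
    warnings: List[str] = []
    # Every shard failing suggests the filer is unreachable or erroring.
    if shards > 0 and errors == shards:
        warnings.append("all shards errored: check filer health")
    # A walker more than twice its interval overdue is not firing as configured.
    overdue_after_s = 2 * walker_interval_minutes * 60
    if walked_max_age_s is not None and walked_max_age_s > overdue_after_s:
        warnings.append("walker overdue: walked_max_age exceeds 2x walker_interval_minutes")
    return warnings
```

For the healthy example above (16 shards, 0 errors, walked_max_age=58m with a 60-minute interval), the function returns an empty list.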
Suggested alerts
# `> 0` filters out shards whose gauge is still the proto zero (never
# started); without it, every fresh-install heartbeat triggers the alert.
- alert: S3LifecycleCursorLagHigh
  # `by (shard)` keeps the shard label so the summary template can render it.
  expr: max by (shard) (time() * 1e9 - (s3_lifecycle_cursor_min_ts_ns > 0)) / 1e9 > 3600
  for: 30m
  annotations:
    summary: "S3 lifecycle replay lag > 1h on shard {{ $labels.shard }}"
    runbook: https://github.com/seaweedfs/seaweedfs/wiki/S3-Lifecycle-Troubleshooting#stuck-cursor
- alert: S3LifecycleWalkerStuck
  expr: max(time() * 1e9 - (s3_lifecycle_daily_run_last_walked_ns > 0)) / 1e9 > 86400
  for: 1h
  annotations:
    summary: "S3 lifecycle walker hasn't run in > 24h"
    runbook: https://github.com/seaweedfs/seaweedfs/wiki/S3-Lifecycle-Troubleshooting#walker-stuck
- alert: S3LifecycleDispatchFailures
  expr: |
    rate(s3_lifecycle_dispatch_total{outcome=~"RETRY_LATER|BLOCKED|RPC_ERROR"}[5m]) > 0.1
  for: 15m
  annotations:
    summary: "S3 lifecycle delete failure rate > 0.1/s"
Adjust thresholds to your cluster's normal levels — these are starting points.