diff --git a/Worker.md b/Worker.md
new file mode 100644
index 0000000..8d95857
--- /dev/null
+++ b/Worker.md
@@ -0,0 +1,283 @@
+# This is still a work in progress!
+
+# Weed Worker
+
+The `weed worker` command starts a maintenance worker that connects to an admin server to process cluster maintenance tasks.
+
+## Overview
+
+Workers are distributed maintenance agents that connect to the admin server to process maintenance tasks such as:
+- **Vacuum**: Reclaim disk space occupied by deleted files
+- **Erasure Coding**: Convert volumes to erasure-coded format for storage efficiency
+- **Remote Upload**: Upload volumes to remote/cloud storage
+- **Replication**: Fix replication issues and maintain data consistency
+- **Balance**: Redistribute volumes across volume servers for load balancing
+
+Workers automatically register with the admin server and receive tasks based on their capabilities and current load.
+
+## Usage
+
+```bash
+weed worker [options]
+```
+
+## Options
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `-admin` | localhost:23646 | Admin server address |
+| `-capabilities` | vacuum,erasure_coding,balance | Comma-separated list of task types this worker can handle |
+| `-maxConcurrent` | 2 | Maximum number of concurrent tasks |
+| `-heartbeat` | 30s | Heartbeat interval to admin server |
+| `-taskInterval` | 5s | Task request interval |
+
+## Examples
+
+### Basic Usage
+
+```bash
+# Start worker connecting to local admin server
+weed worker -admin=localhost:23646
+
+# Connect to remote admin server
+weed worker -admin=admin.example.com:23646
+
+# Start worker with custom admin server and port
+weed worker -admin=192.168.1.100:8080
+```
+
+### Capability Configuration
+
+```bash
+# Worker that only handles vacuum tasks
+weed worker -admin=localhost:23646 -capabilities=vacuum
+
+# Worker that handles vacuum and replication tasks
+weed worker -admin=localhost:23646 -capabilities=vacuum,replication
+
+# Worker with all capabilities
+weed worker -admin=localhost:23646 -capabilities=vacuum,ec,remote,replication,balance
+
+# Worker using capability aliases
+weed worker -admin=localhost:23646 -capabilities=vacuum,ec,remote,replication
+```
+
+### Performance Tuning
+
+```bash
+# High-performance worker with more concurrent tasks
+weed worker -admin=localhost:23646 -maxConcurrent=8
+
+# More frequent task requests for busy clusters
+weed worker -admin=localhost:23646 -taskInterval=2s
+
+# Custom heartbeat interval
+weed worker -admin=localhost:23646 -heartbeat=10s
+```
+
+## Task Capabilities
+
+Workers can be configured to handle specific types of maintenance tasks:
+
+### Available Task Types
+
+| Capability | Description |
+|------------|-------------|
+| `vacuum` | Reclaim disk space occupied by deleted files |
+| `erasure_coding` | Convert volumes to erasure-coded format |
+| `balance` | Redistribute volumes for load balancing |
+
+## Worker Architecture
+
+### Worker Lifecycle
+
+1. **Registration**: Worker connects to the admin server via gRPC
+2. **Capabilities**: Worker reports its capabilities to the admin server
+3. **Task Request**: Worker periodically requests tasks from the admin server
+4. **Task Execution**: Worker processes assigned tasks
+5. **Heartbeat**: Worker sends periodic heartbeats to the admin server
+6. **Graceful Shutdown**: Worker completes current tasks before stopping
+
+### Connection Details
+
+- **Protocol**: gRPC connection to admin server
+- **Port**: Admin HTTP port + 10000 (e.g., admin on 23646 → gRPC on 33646)
+- **Security**: Supports TLS using `[grpc.worker]` configuration
+- **Fallback**: Falls back to an insecure connection if TLS is unavailable
+
+## Configuration
+
+### Security Configuration
+
+Workers read TLS configuration from `security.toml`:
+
+```toml
+[grpc.worker]
+cert = "/etc/ssl/worker.crt"
+key = "/etc/ssl/worker.key"
+ca = "/etc/ssl/ca.crt"
+```
+
+### Worker Identification
+
+- **Worker ID**: Automatically generated unique identifier
+- **Address**: Worker's network address (auto-detected)
+- **Capabilities**: Reported task capabilities
+- **Status**: Current worker status (active, idle, busy)
+
+## Task Processing
+
+### Concurrent Task Handling
+
+- **Max Concurrent**: Configurable via `-maxConcurrent` (default: 2)
+- **Task Queue**: Workers maintain internal task queues
+- **Load Balancing**: Admin distributes tasks based on worker load
+- **Task Completion**: Workers report task completion status
+
+### Task Request Cycle
+
+1. Worker requests tasks from admin server
+2. Admin assigns tasks based on worker capabilities and load
+3. Worker processes tasks concurrently
+4. Worker reports task completion/failure
+5. Cycle repeats based on `-taskInterval`
+
+## Monitoring and Status
+
+### Worker Status
+
+Workers report the following status information:
+- **Worker ID**: Unique identifier
+- **Current Load**: Number of active tasks
+- **Capabilities**: Supported task types
+- **Last Heartbeat**: Timestamp of last heartbeat
+- **Tasks Completed**: Total completed tasks
+- **Tasks Failed**: Total failed tasks
+- **Uptime**: Worker uptime duration
+
+### Health Monitoring
+
+- **Heartbeat**: Periodic heartbeat to admin server
+- **Task Timeout**: Tasks have configurable timeouts
+- **Error Reporting**: Failed tasks are reported to admin
+- **Automatic Retry**: Failed tasks may be retried
+
+## Best Practices
+
+### Deployment
+
+1. **Multiple Workers**: Deploy multiple workers for redundancy
+2. **Capability Specialization**: Consider specialized workers for specific tasks
+3. **Resource Allocation**: Ensure adequate CPU and memory for concurrent tasks
+4. **Network Connectivity**: Ensure reliable connection to admin server
+
+### Performance
+
+1. **Concurrent Tasks**: Tune `-maxConcurrent` based on available resources
+2. **Task Interval**: Adjust `-taskInterval` based on cluster activity
+3. **Heartbeat Frequency**: Balance responsiveness against overhead
+4. **Resource Monitoring**: Monitor worker resource usage
+
+### Security
+
+1. **TLS Configuration**: Use TLS for production deployments
+2. **Network Security**: Secure communication between workers and admin
+3. **Access Control**: Limit worker deployment to trusted systems
+4. **Certificate Management**: Manage and rotate TLS certificates
+
+## Troubleshooting
+
+### Common Issues
+
+1. **Cannot connect to admin server**:
+   - Verify admin server address and port
+   - Check network connectivity
+   - Ensure admin server is running
+   - Verify gRPC port (admin HTTP port + 10000)
+
+2. **No tasks received**:
+   - Check that worker capabilities match available tasks
+   - Verify worker registration with admin
+   - Check admin server logs for task assignment
+   - Ensure worker is not overloaded
+
+3. **TLS connection failures**:
+   - Verify `security.toml` configuration
+   - Check certificate paths and permissions
+   - Ensure certificates are valid
+   - Check certificate compatibility
+
+4. **Task execution failures**:
+   - Check worker logs for error details
+   - Verify worker has necessary permissions
+   - Check disk space and resources
+   - Ensure target volumes are accessible
+
+### Debug Information
+
+Enable debug logging:
+
+```bash
+# Run with verbose logging (global flags such as -v go before the subcommand)
+weed -v=4 worker -admin=localhost:23646
+```
+
+### Worker Logs
+
+Workers log important events:
+- Connection status to admin server
+- Task assignments and completion
+- Error conditions and failures
+- Heartbeat and health information
+
+## Task-Specific Information
+
+### Vacuum Tasks
+
+- **Purpose**: Reclaim disk space from deleted files
+- **Requirements**: Access to volume servers
+- **Duration**: Varies based on volume size and deleted data
+- **Impact**: Temporary increase in I/O during vacuum process
+
+### Erasure Coding Tasks
+
+- **Purpose**: Convert volumes to erasure-coded format
+- **Requirements**: Multiple volume servers for redundancy
+- **Duration**: Long-running, depends on volume size
+- **Impact**: Reduces storage requirements but increases complexity
+
+### Remote Upload Tasks
+
+- **Purpose**: Upload volumes to remote/cloud storage
+- **Requirements**: Cloud storage credentials and connectivity
+- **Duration**: Depends on volume size and upload bandwidth
+- **Impact**: Enables tiered storage and backup strategies
+
+### Replication Tasks
+
+- **Purpose**: Fix replication consistency issues
+- **Requirements**: Access to master and volume servers
+- **Duration**: Usually quick; depends on the replication factor
+- **Impact**: Ensures data consistency and availability
+
+### Balance Tasks
+
+- **Purpose**: Redistribute volumes across volume servers
+- **Requirements**: Multiple volume servers
+- **Duration**: Depends on data movement requirements
+- **Impact**: Improves cluster load distribution
+
+## Related Commands
+
+- [`weed admin`](Weed-Admin.md): Start admin server that manages workers
+- [`weed master`](https://github.com/seaweedfs/seaweedfs/wiki/Master-Server): Start master servers
+- [`weed volume`](https://github.com/seaweedfs/seaweedfs/wiki/Volume-Server): Start volume servers
+- [`weed scaffold`](https://github.com/seaweedfs/seaweedfs/wiki/Scaffold): Generate configuration files
+
+## See Also
+
+- [SeaweedFS Architecture](https://github.com/seaweedfs/seaweedfs/wiki/SeaweedFS-Architecture)
+- [Maintenance Operations](https://github.com/seaweedfs/seaweedfs/wiki/Maintenance)
+- [Security Configuration](https://github.com/seaweedfs/seaweedfs/wiki/Security-Configuration)
+- [Erasure Coding](https://github.com/seaweedfs/seaweedfs/wiki/Erasure-Coding)
+- [Remote Storage](https://github.com/seaweedfs/seaweedfs/wiki/Remote-Storage)
\ No newline at end of file