mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-06-13 23:36:45 +03:00
Chris Lu edited this page 2026-05-03 23:48:41 -07:00
Spark Iceberg Integration
Apache Spark connects to SeaweedFS Iceberg tables using the Iceberg Spark runtime with the rest catalog type and SigV4 authentication.
Prerequisites
- Spark 3.5+ with Iceberg packages
- SeaweedFS started as shown in Setup below
Required packages:
- org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 (match your Spark version)
- org.apache.iceberg:iceberg-aws-bundle:1.5.2
Setup
Start weed mini with credentials and a pre-created table bucket via environment variables:
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export S3_TABLE_BUCKET=my-table-bucket
weed mini -dir ~/data
This brings up the Iceberg REST Catalog on http://localhost:8181, the S3 endpoint on http://localhost:8333, an admin S3 identity using the AWS env vars (used as Spark's SigV4 credentials below), and the table bucket my-table-bucket pre-created.
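Before configuring Spark, it can help to confirm that both endpoints accept connections. The following is a minimal sketch using only the Python standard library; the helper name `endpoint_reachable` is hypothetical, and it checks only that a TCP connection can be opened, not that the catalog or S3 API answers correctly.

```python
import socket
from urllib.parse import urlparse


def endpoint_reachable(url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    # Fall back to the scheme's default port when the URL has none.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Default ports from the setup above; adjust if weed mini was started differently.
    endpoints = {
        "Iceberg REST Catalog": "http://localhost:8181",
        "S3 endpoint": "http://localhost:8333",
    }
    for name, url in endpoints.items():
        state = "reachable" if endpoint_reachable(url) else "NOT reachable"
        print(f"{name} at {url}: {state}")
```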
Configuration
PySpark
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("SeaweedFS Iceberg")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,"
            "org.apache.iceberg:iceberg-aws-bundle:1.5.2")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.iceberg",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg.type", "rest")
    .config("spark.sql.catalog.iceberg.uri", "http://localhost:8181")
    # SigV4 authentication
    .config("spark.sql.catalog.iceberg.rest.auth.type", "sigv4")
    .config("spark.sql.catalog.iceberg.rest.sigv4-enabled", "true")
    .config("spark.sql.catalog.iceberg.rest.signing-name", "s3")
    .config("spark.sql.catalog.iceberg.rest.access-key-id", "AKIAIOSFODNN7EXAMPLE")
    .config("spark.sql.catalog.iceberg.rest.secret-access-key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
    # S3 FileIO
    .config("spark.sql.catalog.iceberg.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.iceberg.s3.endpoint", "http://localhost:8333")
    .config("spark.sql.catalog.iceberg.s3.region", "us-east-1")
    .config("spark.sql.catalog.iceberg.s3.access-key", "AKIAIOSFODNN7EXAMPLE")
    .config("spark.sql.catalog.iceberg.s3.secret-key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
    .config("spark.sql.catalog.iceberg.s3.path-style-access", "true")
    .getOrCreate()
)
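The access key and secret appear twice in the configuration above, once for the REST catalog and once for S3 FileIO. One way to avoid repeating them is to build the settings from a single set of parameters. The sketch below is an illustration, not part of SeaweedFS or Iceberg: the function name `seaweedfs_catalog_conf` and its parameters are hypothetical, while the property keys are the ones used on this page. With `sigv4=False` it omits the credential keys, mirroring the Anonymous Access snippet.

```python
def seaweedfs_catalog_conf(
    catalog="iceberg",
    rest_uri="http://localhost:8181",
    s3_endpoint="http://localhost:8333",
    region="us-east-1",
    access_key=None,
    secret_key=None,
    sigv4=True,
):
    """Return a dict of Spark settings for a SeaweedFS Iceberg REST catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    conf = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": rest_uri,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": s3_endpoint,
        f"{prefix}.s3.path-style-access": "true",
        f"{prefix}.rest.sigv4-enabled": "true" if sigv4 else "false",
    }
    if sigv4:
        if not access_key or not secret_key:
            raise ValueError("access_key and secret_key are required when sigv4=True")
        conf.update({
            # The same credentials sign REST catalog calls and S3 data access.
            f"{prefix}.rest.auth.type": "sigv4",
            f"{prefix}.rest.signing-name": "s3",
            f"{prefix}.rest.access-key-id": access_key,
            f"{prefix}.rest.secret-access-key": secret_key,
            f"{prefix}.s3.region": region,
            f"{prefix}.s3.access-key": access_key,
            f"{prefix}.s3.secret-key": secret_key,
        })
    return conf


# Usage with PySpark (assumes pyspark is installed and spark.jars.packages
# and spark.sql.extensions are set as in the configuration above):
#   import os
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.appName("SeaweedFS Iceberg")
#   settings = seaweedfs_catalog_conf(
#       access_key=os.environ["AWS_ACCESS_KEY_ID"],
#       secret_key=os.environ["AWS_SECRET_ACCESS_KEY"],
#   )
#   for key, value in settings.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```

Reading the keys from `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` keeps the Spark job aligned with the identity that `weed mini` was started with.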
spark-sql CLI
spark-sql \
  --packages "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.apache.iceberg:iceberg-aws-bundle:1.5.2" \
  --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
  --conf "spark.sql.defaultCatalog=iceberg" \
  --conf "spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.iceberg.type=rest" \
  --conf "spark.sql.catalog.iceberg.uri=http://localhost:8181" \
  --conf "spark.sql.catalog.iceberg.io-impl=org.apache.iceberg.aws.s3.S3FileIO" \
  --conf "spark.sql.catalog.iceberg.s3.endpoint=http://localhost:8333" \
  --conf "spark.sql.catalog.iceberg.s3.path-style-access=true" \
  --conf "spark.sql.catalog.iceberg.s3.access-key=AKIAIOSFODNN7EXAMPLE" \
  --conf "spark.sql.catalog.iceberg.s3.secret-key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  --conf "spark.sql.catalog.iceberg.rest.sigv4-enabled=true" \
  --conf "spark.sql.catalog.iceberg.rest.signing-name=s3"
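When the same settings are needed in scripts, the long command line can be generated from a dictionary instead of being maintained by hand. This is a small sketch under that assumption; `spark_sql_command` is a hypothetical helper, and the settings shown are a subset of those in the command above.

```python
import shlex


def spark_sql_command(packages, conf):
    """Build a spark-sql argument list from package coordinates and settings."""
    argv = ["spark-sql", "--packages", ",".join(packages)]
    for key, value in conf.items():
        # Each setting becomes one "--conf key=value" pair.
        argv += ["--conf", f"{key}={value}"]
    return argv


packages = [
    "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2",
    "org.apache.iceberg:iceberg-aws-bundle:1.5.2",
]
conf = {
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.defaultCatalog": "iceberg",
    "spark.sql.catalog.iceberg": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.iceberg.type": "rest",
    "spark.sql.catalog.iceberg.uri": "http://localhost:8181",
}

argv = spark_sql_command(packages, conf)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

The list form can be passed directly to `subprocess.run(argv)`, which avoids shell quoting issues with the secret key.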
Example SQL
Namespace and Table Operations
-- Create a namespace
CREATE NAMESPACE iceberg.my_namespace;

-- Create a table
CREATE TABLE iceberg.my_namespace.users (
    id INT,
    name STRING,
    age INT
) USING iceberg;

-- Insert data
INSERT INTO iceberg.my_namespace.users VALUES
    (1, 'Alice', 30),
    (2, 'Bob', 25),
    (3, 'Charlie', 35);

-- Query
SELECT * FROM iceberg.my_namespace.users;
SELECT COUNT(*) FROM iceberg.my_namespace.users;

-- Update and delete
UPDATE iceberg.my_namespace.users SET age = 31 WHERE id = 1;
DELETE FROM iceberg.my_namespace.users WHERE id = 3;
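The same statements can be issued from PySpark. `spark.sql` takes one statement per call, so a script is usually held as a list and run in order. The sketch below assumes `spark` is a session configured as in the Configuration section; `run_statements` and `USERS_EXAMPLE` are hypothetical names, and `IF NOT EXISTS` is added here so that the script can be re-run.

```python
USERS_EXAMPLE = [
    "CREATE NAMESPACE IF NOT EXISTS iceberg.my_namespace",
    """CREATE TABLE IF NOT EXISTS iceberg.my_namespace.users (
        id INT,
        name STRING,
        age INT
    ) USING iceberg""",
    """INSERT INTO iceberg.my_namespace.users VALUES
        (1, 'Alice', 30),
        (2, 'Bob', 25),
        (3, 'Charlie', 35)""",
    "UPDATE iceberg.my_namespace.users SET age = 31 WHERE id = 1",
    "DELETE FROM iceberg.my_namespace.users WHERE id = 3",
]


def run_statements(spark, statements):
    """Run each SQL statement in order and return the last result."""
    result = None
    for statement in statements:
        # spark.sql accepts a single statement, without a trailing semicolon.
        result = spark.sql(statement)
    return result


# Usage (assumes an existing SparkSession named spark):
#   run_statements(spark, USERS_EXAMPLE)
#   spark.sql("SELECT * FROM iceberg.my_namespace.users").show()
```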
Multi-Level Namespaces
CREATE NAMESPACE iceberg.analytics;
CREATE NAMESPACE iceberg.analytics.web;

CREATE TABLE iceberg.analytics.web.pageviews (
    id INT,
    url STRING,
    ts TIMESTAMP
) USING iceberg;
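The example above creates the parent namespace before the child. For deeper hierarchies, the statements can be generated in that same parent-first order. This is a sketch only; `namespace_statements` is a hypothetical helper, and `IF NOT EXISTS` is added so that existing levels are skipped.

```python
def namespace_statements(catalog, namespace):
    """Return CREATE NAMESPACE statements for every level, parent first."""
    parts = namespace.split(".")
    return [
        f"CREATE NAMESPACE IF NOT EXISTS {catalog}.{'.'.join(parts[:depth])}"
        for depth in range(1, len(parts) + 1)
    ]


for statement in namespace_statements("iceberg", "analytics.web"):
    print(statement)
```

Each returned string can be passed to `spark.sql` in order.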
Time Travel
-- Query a table at a specific point in time
-- (use a timestamp at or after the table's first snapshot; earlier timestamps fail)
SELECT COUNT(*) FROM iceberg.my_namespace.users
TIMESTAMP AS OF '2024-01-15 10:30:00';
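Besides `TIMESTAMP AS OF`, Iceberg's Spark integration also accepts `VERSION AS OF` with a snapshot ID, and exposes a `snapshots` metadata table that lists the available IDs and commit times. The sketch below only builds the query strings; the helper names are hypothetical, and it assumes the metadata table is reachable through this catalog in the usual way.

```python
def snapshots_query(table):
    """Query listing a table's snapshots, oldest first."""
    return (
        f"SELECT committed_at, snapshot_id, operation "
        f"FROM {table}.snapshots ORDER BY committed_at"
    )


def time_travel_query(table, *, timestamp=None, snapshot_id=None, select="COUNT(*)"):
    """Build a time-travel SELECT using exactly one of timestamp or snapshot_id."""
    if (timestamp is None) == (snapshot_id is None):
        raise ValueError("pass exactly one of timestamp or snapshot_id")
    if timestamp is not None:
        clause = f"TIMESTAMP AS OF '{timestamp}'"
    else:
        # int() rejects anything that is not a numeric snapshot ID.
        clause = f"VERSION AS OF {int(snapshot_id)}"
    return f"SELECT {select} FROM {table} {clause}"


# Usage (assumes an existing SparkSession named spark):
#   rows = spark.sql(snapshots_query("iceberg.my_namespace.users")).collect()
#   first_id = rows[0]["snapshot_id"]
#   spark.sql(time_travel_query("iceberg.my_namespace.users",
#                               snapshot_id=first_id)).show()
print(time_travel_query("iceberg.my_namespace.users",
                        timestamp="2024-01-15 10:30:00"))
```

Picking a snapshot ID from the metadata table avoids guessing a timestamp that predates the table.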
Anonymous Access
When SeaweedFS runs without IAM, disable SigV4:
spark = (SparkSession.builder
    .config("spark.sql.catalog.iceberg.type", "rest")
    .config("spark.sql.catalog.iceberg.uri", "http://localhost:8181")
    .config("spark.sql.catalog.iceberg.rest.sigv4-enabled", "false")
    .config("spark.sql.catalog.iceberg.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.iceberg.s3.endpoint", "http://localhost:8333")
    .config("spark.sql.catalog.iceberg.s3.path-style-access", "true")
    # ... other standard configs
    .getOrCreate()
)
See Also
- SeaweedFS Iceberg Catalog - Architecture and concepts
- S3 Table Bucket - Managing table buckets
- run Spark on SeaweedFS - Spark with SeaweedFS HDFS connector (non-Iceberg)