Table of Contents

Spark Iceberg Integration

PySpark
spark-sql CLI

Example SQL

Namespace and Table Operations
Multi-Level Namespaces
Time Travel

Anonymous Access
See Also

Spark Iceberg Integration

Apache Spark connects to SeaweedFS Iceberg tables using the Iceberg Spark runtime with the rest catalog type and SigV4 authentication.

Prerequisites

Spark 3.5+ with Iceberg packages
SeaweedFS started as shown in Setup below

Required packages:

org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 (the 3.5 and 2.12 suffixes must match your Spark and Scala versions)
org.apache.iceberg:iceberg-aws-bundle:1.5.2
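
Because the runtime artifact name encodes both the Spark and the Scala version, it can help to assemble the coordinates in one place. The following is a minimal sketch; the function name iceberg_packages and its parameters are illustrative and not part of Iceberg or SeaweedFS, and it does not check that a given version combination is actually published.

```python
def iceberg_packages(spark_minor="3.5", scala_binary="2.12", iceberg_version="1.5.2"):
    """Return the comma-separated Maven coordinates for spark.jars.packages.

    Hypothetical helper: the defaults mirror the versions listed above.
    """
    runtime = (
        "org.apache.iceberg:iceberg-spark-runtime-"
        f"{spark_minor}_{scala_binary}:{iceberg_version}"
    )
    aws_bundle = f"org.apache.iceberg:iceberg-aws-bundle:{iceberg_version}"
    return ",".join([runtime, aws_bundle])


print(iceberg_packages())
```

The returned string can be passed as the value of spark.jars.packages or to the --packages flag.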

Setup

Start weed mini with credentials and a pre-created table bucket via environment variables:

export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export S3_TABLE_BUCKET=my-table-bucket

weed mini -dir ~/data

This starts the Iceberg REST Catalog on http://localhost:8181 and the S3 endpoint on http://localhost:8333, creates an admin S3 identity from the AWS environment variables (Spark uses the same keys as its SigV4 credentials below), and pre-creates the table bucket my-table-bucket.
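
Before configuring Spark, it may be worth confirming that both ports accept connections. The sketch below only tests TCP reachability on the two ports named above; it does not verify credentials or that the table bucket exists. The helper names port_open and check_seaweedfs are hypothetical.

```python
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_seaweedfs(host="localhost"):
    """Report whether the REST catalog and S3 ports accept connections."""
    return {
        "rest_catalog_8181": port_open(host, 8181),
        "s3_endpoint_8333": port_open(host, 8333),
    }
```

Calling check_seaweedfs() after weed mini has started should report True for both entries if the defaults above are in use.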

Configuration

PySpark

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("SeaweedFS Iceberg")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,"
            "org.apache.iceberg:iceberg-aws-bundle:1.5.2")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.iceberg",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg.type", "rest")
    .config("spark.sql.catalog.iceberg.uri", "http://localhost:8181")
    # SigV4 authentication
    .config("spark.sql.catalog.iceberg.rest.auth.type", "sigv4")
    .config("spark.sql.catalog.iceberg.rest.sigv4-enabled", "true")
    .config("spark.sql.catalog.iceberg.rest.signing-name", "s3")
    .config("spark.sql.catalog.iceberg.rest.access-key-id", "AKIAIOSFODNN7EXAMPLE")
    .config("spark.sql.catalog.iceberg.rest.secret-access-key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
    # S3 FileIO
    .config("spark.sql.catalog.iceberg.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.iceberg.s3.endpoint", "http://localhost:8333")
    .config("spark.sql.catalog.iceberg.s3.region", "us-east-1")
    .config("spark.sql.catalog.iceberg.s3.access-key", "AKIAIOSFODNN7EXAMPLE")
    .config("spark.sql.catalog.iceberg.s3.secret-key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
    .config("spark.sql.catalog.iceberg.s3.path-style-access", "true")
    .getOrCreate()
)
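
The configuration above repeats the access key and secret key in several places. One way to avoid hardcoding them is to read the same environment variables used in Setup. This is a sketch only: seaweedfs_catalog_conf and build_session are hypothetical helper names, and the property keys are copied unchanged from the example above.

```python
import os


def seaweedfs_catalog_conf(catalog="iceberg",
                           rest_uri="http://localhost:8181",
                           s3_endpoint="http://localhost:8333",
                           env=None):
    """Build the catalog settings shown above, keyed by full Spark conf name.

    Hypothetical helper: credentials come from AWS_ACCESS_KEY_ID and
    AWS_SECRET_ACCESS_KEY; a KeyError is raised if either is missing.
    """
    env = os.environ if env is None else env
    access_key = env["AWS_ACCESS_KEY_ID"]
    secret_key = env["AWS_SECRET_ACCESS_KEY"]
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": rest_uri,
        # SigV4 authentication
        f"{prefix}.rest.auth.type": "sigv4",
        f"{prefix}.rest.sigv4-enabled": "true",
        f"{prefix}.rest.signing-name": "s3",
        f"{prefix}.rest.access-key-id": access_key,
        f"{prefix}.rest.secret-access-key": secret_key,
        # S3 FileIO
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": s3_endpoint,
        f"{prefix}.s3.region": "us-east-1",
        f"{prefix}.s3.access-key": access_key,
        f"{prefix}.s3.secret-key": secret_key,
        f"{prefix}.s3.path-style-access": "true",
    }


def build_session(conf, app_name="SeaweedFS Iceberg"):
    """Apply every entry of conf to a SparkSession builder and start the session."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

The dictionary covers only the catalog properties; spark.jars.packages and spark.sql.extensions still need to be added to it before calling build_session.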

spark-sql CLI

spark-sql \
    --packages "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.apache.iceberg:iceberg-aws-bundle:1.5.2" \
    --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
    --conf "spark.sql.defaultCatalog=iceberg" \
    --conf "spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog" \
    --conf "spark.sql.catalog.iceberg.type=rest" \
    --conf "spark.sql.catalog.iceberg.uri=http://localhost:8181" \
    --conf "spark.sql.catalog.iceberg.io-impl=org.apache.iceberg.aws.s3.S3FileIO" \
    --conf "spark.sql.catalog.iceberg.s3.endpoint=http://localhost:8333" \
    --conf "spark.sql.catalog.iceberg.s3.path-style-access=true" \
    --conf "spark.sql.catalog.iceberg.s3.access-key=AKIAIOSFODNN7EXAMPLE" \
    --conf "spark.sql.catalog.iceberg.s3.secret-key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
    --conf "spark.sql.catalog.iceberg.rest.sigv4-enabled=true" \
    --conf "spark.sql.catalog.iceberg.rest.signing-name=s3"
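
If the same settings are maintained for both PySpark and the CLI, the command line can be generated from a dictionary of Spark properties instead of being written by hand. A minimal sketch, assuming Python 3.8+ for shlex.join; spark_sql_command is a hypothetical helper name.

```python
import shlex


def spark_sql_command(conf, packages):
    """Render a spark-sql invocation from Spark conf entries and package coordinates.

    conf maps full property names to values; packages is a list of Maven
    coordinates. The result is a single shell-quoted command string.
    """
    argv = ["spark-sql", "--packages", ",".join(packages)]
    for key, value in conf.items():
        argv += ["--conf", f"{key}={value}"]
    return shlex.join(argv)


print(spark_sql_command(
    {"spark.sql.defaultCatalog": "iceberg",
     "spark.sql.catalog.iceberg.type": "rest"},
    ["org.apache.iceberg:iceberg-aws-bundle:1.5.2"],
))
```

Note that any credentials rendered this way appear in the process list and shell history, just as they do in the hand-written command.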

Example SQL

Namespace and Table Operations

-- Create a namespace
CREATE NAMESPACE iceberg.my_namespace;

-- Create a table
CREATE TABLE iceberg.my_namespace.users (
    id INT,
    name STRING,
    age INT
) USING iceberg;

-- Insert data
INSERT INTO iceberg.my_namespace.users VALUES
    (1, 'Alice', 30),
    (2, 'Bob', 25),
    (3, 'Charlie', 35);

-- Query
SELECT * FROM iceberg.my_namespace.users;
SELECT COUNT(*) FROM iceberg.my_namespace.users;

-- Update and delete
UPDATE iceberg.my_namespace.users SET age = 31 WHERE id = 1;
DELETE FROM iceberg.my_namespace.users WHERE id = 3;
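
The same table can also be used from PySpark through the DataFrame API. The sketch below assumes a session configured as in the PySpark section and the users table created above; append_users and users_older_than are hypothetical helper names.

```python
USERS_TABLE = "iceberg.my_namespace.users"
USERS_SCHEMA = "id INT, name STRING, age INT"


def append_users(spark, rows):
    """Append (id, name, age) tuples to the users table via DataFrame.writeTo."""
    df = spark.createDataFrame(rows, schema=USERS_SCHEMA)
    df.writeTo(USERS_TABLE).append()


def users_older_than(spark, min_age):
    """Return the rows whose age exceeds min_age as a list of Row objects."""
    return spark.table(USERS_TABLE).where(f"age > {int(min_age)}").collect()
```

For example, append_users(spark, [(4, "Dana", 41)]) followed by users_older_than(spark, 30) would append one row and then read back the matching rows.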

Multi-Level Namespaces

CREATE NAMESPACE iceberg.analytics;
CREATE NAMESPACE iceberg.analytics.web;

CREATE TABLE iceberg.analytics.web.pageviews (
    id INT,
    url STRING,
    ts TIMESTAMP
) USING iceberg;
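
The example above creates the parent namespace before the child. For deeper hierarchies, the statements can be generated level by level. This sketch assumes that each parent must exist before its child, as the ordering above suggests; namespace_statements is a hypothetical helper name, and IF NOT EXISTS is added so that rerunning the statements is harmless.

```python
def namespace_statements(namespace, catalog="iceberg"):
    """Return one CREATE NAMESPACE statement per level, outermost first."""
    parts = namespace.split(".")
    return [
        f"CREATE NAMESPACE IF NOT EXISTS {catalog}.{'.'.join(parts[:depth])}"
        for depth in range(1, len(parts) + 1)
    ]


for statement in namespace_statements("analytics.web"):
    print(statement)
```

Each returned statement can be passed to spark.sql() in order.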

Time Travel

-- Query a table as of a specific point in time.
-- The timestamp must not be earlier than the table's first snapshot.
SELECT COUNT(*) FROM iceberg.my_namespace.users
    TIMESTAMP AS OF '2024-01-15 10:30:00';
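
From PySpark, time travel is also available through Iceberg's as-of-timestamp read option, which takes epoch milliseconds, and the table's snapshots metadata table lists the commit times that can be targeted. The sketch below is illustrative: the helper names are hypothetical, the input string is interpreted as UTC (the SQL form uses the session time zone instead), and loading by a catalog-qualified name through format("iceberg") is assumed to work with the catalog configured above.

```python
from datetime import datetime, timezone


def as_of_millis(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a UTC wall-clock string to epoch milliseconds."""
    parsed = datetime.strptime(timestamp, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


def read_as_of(spark, table, timestamp):
    """Load the table as it was at the given UTC timestamp."""
    return (spark.read
            .option("as-of-timestamp", str(as_of_millis(timestamp)))
            .format("iceberg")
            .load(table))


def list_snapshots(spark, table):
    """Return the table's snapshots, oldest first, as a DataFrame."""
    return spark.sql(
        f"SELECT committed_at, snapshot_id, operation "
        f"FROM {table}.snapshots ORDER BY committed_at"
    )
```

Running list_snapshots(spark, "iceberg.my_namespace.users").show() first is a simple way to pick a timestamp that falls within the table's history.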

Anonymous Access

When SeaweedFS runs without IAM, disable SigV4:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.sql.catalog.iceberg",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg.type", "rest")
    .config("spark.sql.catalog.iceberg.uri", "http://localhost:8181")
    # No request signing when the server runs without IAM
    .config("spark.sql.catalog.iceberg.rest.sigv4-enabled", "false")
    .config("spark.sql.catalog.iceberg.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.iceberg.s3.endpoint", "http://localhost:8333")
    .config("spark.sql.catalog.iceberg.s3.path-style-access", "true")
    # ... other standard configs (spark.jars.packages, spark.sql.extensions)
    .getOrCreate()
)
