From 5a56420296abb58ca39be1be8cabd8691ada4d1c Mon Sep 17 00:00:00 2001
From: Chris Lu
Date: Sun, 3 May 2026 23:31:55 -0700
Subject: [PATCH] Add Doris Iceberg Integration page

---
 Doris-Iceberg-Integration.md | 115 +++++++++++++++++++++++++++++++++++
 Home.md                      |   1 +
 SeaweedFS-Iceberg-Catalog.md |   5 +-
 _Sidebar.md                  |   1 +
 4 files changed, 120 insertions(+), 2 deletions(-)
 create mode 100644 Doris-Iceberg-Integration.md

diff --git a/Doris-Iceberg-Integration.md b/Doris-Iceberg-Integration.md
new file mode 100644
index 0000000..419dc7f
--- /dev/null
+++ b/Doris-Iceberg-Integration.md
@@ -0,0 +1,115 @@
+# Apache Doris Iceberg Integration
+
+Apache Doris connects to SeaweedFS Iceberg tables using an external catalog of `type=iceberg` and `iceberg.catalog.type=rest`. Authentication to the REST catalog uses OAuth2 client credentials via the standard Iceberg `credential` property; data files are read directly from S3 using `s3.access_key` / `s3.secret_key`.
+
+This page reflects the integration verified by the [Apache Doris all-in-one 2.1.0 catalog test](https://github.com/seaweedfs/seaweedfs/tree/master/test/s3tables/catalog_doris).
+
+## Prerequisites
+
+- Apache Doris 2.1.0 or later (the all-in-one Docker image works for local testing)
+- A MySQL-protocol client (`mysql` CLI or any Go/Java/Python MySQL driver) — Doris speaks MySQL on port `9030`
+- SeaweedFS started as shown in [Setup](#setup) below
+
+## Setup
+
+Start `weed mini` with credentials and a pre-created table bucket via environment variables:
+
+```bash
+export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
+export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
+export S3_TABLE_BUCKET=my-table-bucket
+
+weed mini -dir ~/data
+```
+
+This brings up the Iceberg REST Catalog on `http://localhost:8181`, the S3 endpoint on `http://localhost:8333`, an admin S3 identity using the AWS env vars (used as Doris's `credential` and `s3.*` keys below), and the table bucket `my-table-bucket` pre-created.
+
+If Doris runs in a container and SeaweedFS runs on the host, use `host.docker.internal` (with `--add-host host.docker.internal:host-gateway` on Linux) in the URLs below.
+
+## Configuration
+
+Doris external catalogs are registered with `CREATE CATALOG`. Connect with any MySQL client and run:
+
+```sql
+CREATE CATALOG iceberg_catalog PROPERTIES (
+  "type" = "iceberg",
+  "iceberg.catalog.type" = "rest",
+  "uri" = "http://localhost:8181",
+  "warehouse" = "s3://my-table-bucket",
+  "credential" = "AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
+  "s3.endpoint" = "http://localhost:8333",
+  "s3.access_key" = "AKIAIOSFODNN7EXAMPLE",
+  "s3.secret_key" = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
+  "s3.region" = "us-east-1",
+  "use_path_style" = "true"
+);
+```
+
+Key settings:
+
+- `uri` points at the SeaweedFS Iceberg REST Catalog (default `:8181`).
+- `warehouse` is `s3://` followed by the table bucket name (`s3://my-table-bucket` in this example). SeaweedFS maps it to the bucket of that name.
+- `credential` is the Iceberg-standard OAuth2 client credentials in `client_id:client_secret` form. Doris's REST client exchanges this for a bearer token at `/v1/oauth/tokens` using the `client_credentials` grant.
+- `s3.endpoint`, `s3.access_key`, `s3.secret_key`, `s3.region` are used by the BE workers to read parquet files directly from S3.
+- `use_path_style=true` is required for SeaweedFS's path-style S3.
+
+If you change tables outside of Doris (e.g. via Spark, Trino, or PyIceberg), refresh the catalog so the new metadata is picked up:
+
+```sql
+REFRESH CATALOG iceberg_catalog;
+```
+
+## Example SQL
+
+### Browse the catalog
+
+```sql
+-- List all catalogs registered on this Doris cluster
+SHOW CATALOGS;
+
+-- Switch into the Iceberg catalog and list namespaces
+SWITCH iceberg_catalog;
+SHOW DATABASES;
+
+-- List tables in a namespace
+SHOW TABLES FROM iceberg_catalog.my_namespace;
+```
+
+### Read tables
+
+```sql
+-- Three-part identifier: catalog.namespace.table
+SELECT * FROM iceberg_catalog.my_namespace.events;
+SELECT COUNT(*) FROM iceberg_catalog.my_namespace.events;
+
+-- Identifiers with hyphens or special characters need backticks
+SELECT * FROM iceberg_catalog.`my-ns`.`my-table`;
+```
+
+The integration test exercises catalog discovery (`SHOW CATALOGS` / `SHOW DATABASES` / `SHOW TABLES`), schema parsing (column-name projection on `id, label`), `SELECT COUNT(*)` against an empty table, and reading three rows that were written by a PyIceberg writer before Doris connected. This validates both the metadata path and the parquet read path.
+
+Write paths from Doris (`CREATE TABLE`, `INSERT INTO`) against an Iceberg REST catalog are not exercised by the SeaweedFS test suite. Treat Doris primarily as a reader against tables produced by Spark, Trino, or other writers.
+
+## Anonymous Access
+
+When SeaweedFS runs without IAM (e.g. `weed mini` with no `-s3.config`), the REST catalog accepts unsigned requests. Drop `credential` from the catalog properties and leave the `s3.*` keys set — SeaweedFS accepts any value when IAM is disabled:
+
+```sql
+CREATE CATALOG iceberg_catalog PROPERTIES (
+  "type" = "iceberg",
+  "iceberg.catalog.type" = "rest",
+  "uri" = "http://localhost:8181",
+  "warehouse" = "s3://my-table-bucket",
+  "s3.endpoint" = "http://localhost:8333",
+  "s3.access_key" = "any",
+  "s3.secret_key" = "any",
+  "s3.region" = "us-east-1",
+  "use_path_style" = "true"
+);
+```
+
+## See Also
+
+- [[SeaweedFS Iceberg Catalog]] - Architecture and concepts
+- [[S3 Table Bucket]] - Managing table buckets
+- [Doris integration test](https://github.com/seaweedfs/seaweedfs/tree/master/test/s3tables/catalog_doris) - End-to-end reference
diff --git a/Home.md b/Home.md
index 9a35d99..96af793 100644
--- a/Home.md
+++ b/Home.md
@@ -16,6 +16,7 @@ SeaweedFS stands out for its high performance, scalability, and flexibility. It
 - Customizable tiered storage that intelligently places data based on activity, moving less active data to cheaper cloud storage.
 - Elastic scalability, easily expanding capacity by adding volume servers.
 - A robust, high-performance, S3-compatible object store that can serve as an in-house alternative to HDFS.
+- A built-in **Iceberg REST Catalog** that turns SeaweedFS into a self-contained lakehouse: Spark, Trino, Dremio, DuckDB, RisingWave, and Apache Doris can query Iceberg tables directly, with no external metastore (see [[S3 Table Bucket]]).
 
 The system is designed for high availability and durability, with features like:
 
diff --git a/SeaweedFS-Iceberg-Catalog.md b/SeaweedFS-Iceberg-Catalog.md
index 36c83ee..0d26079 100644
--- a/SeaweedFS-Iceberg-Catalog.md
+++ b/SeaweedFS-Iceberg-Catalog.md
@@ -8,7 +8,7 @@ The SeaweedFS S3 Tables feature implements the **Iceberg REST Catalog API**. Thi
 
 - **Iceberg REST Catalog**: Available on a dedicated port (default `8181`)
 - **S3 Data Access**: Available on the S3 port (default `8333`)
-- **Authentication**: SigV4 (Spark, Trino, RisingWave), OAuth2 (DuckDB), or unsigned REST + S3 access keys (Dremio)
+- **Authentication**: SigV4 (Spark, Trino, RisingWave), OAuth2 (DuckDB, Doris), or unsigned REST + S3 access keys (Dremio)
 
 ## Catalog and Bucket Relationship
 
@@ -50,6 +50,7 @@ See the integration guide for your engine below.
 | **Trino** | SigV4 | [[Trino Iceberg Integration]] |
 | **Dremio** | S3 access keys (REST source) | [[Dremio Iceberg Integration]] |
 | **DuckDB** | OAuth2 | [[DuckDB Iceberg Integration]] |
+| **Apache Doris** | OAuth2 | [[Doris Iceberg Integration]] |
 | **RisingWave** | SigV4 | [[RisingWave Iceberg Integration]] |
 | **Lakekeeper** | STS + SigV4 | [[Lakekeeper Iceberg Integration]] |
 
@@ -80,7 +81,7 @@ SeaweedFS supports two authentication methods for the Iceberg REST Catalog:
 
 **SigV4 (Spark, Trino, RisingWave)** — Clients sign each request using AWS Signature Version 4. This is the standard method used by most Iceberg-compatible engines.
 
-**OAuth2 (DuckDB)** — Clients exchange S3 credentials for a bearer token via `POST /v1/oauth/tokens` using the `client_credentials` grant type. The S3 access key is used as `client_id` and the secret key as `client_secret`.
+**OAuth2 (DuckDB, Doris)** — Clients exchange S3 credentials for a bearer token via `POST /v1/oauth/tokens` using the `client_credentials` grant type. The S3 access key is used as `client_id` and the secret key as `client_secret`.
 
 ### Authorization (IAM)
 Permissions are managed via **S3 Bucket Policies** applied to the Table Bucket.
diff --git a/_Sidebar.md b/_Sidebar.md
index 5b452ac..b003c75 100644
--- a/_Sidebar.md
+++ b/_Sidebar.md
@@ -105,6 +105,7 @@
 * [[Trino Iceberg Integration]]
 * [[Dremio Iceberg Integration]]
 * [[DuckDB Iceberg Integration]]
+* [[Doris Iceberg Integration]]
 * [[RisingWave Iceberg Integration]]
 * [[Lakekeeper Iceberg Integration]]
 