Commit c491be8: refine iceberg integration doc (#563)
1 parent: bb20b21

1 file changed: 166 additions, 99 deletions
## Overview

Timeplus natively supports the [Apache Iceberg](https://iceberg.apache.org/) open table format, a high-performance, reliable storage format for large-scale analytics. This integration allows Timeplus users to **stream data directly to Iceberg** and **query Iceberg tables efficiently**, all implemented purely in **C++**, without any Java dependencies.

### Supported Catalogs and Storage

The Iceberg REST Catalog integration works with common cloud and open-source backends, including:

- **Amazon S3** for object storage
- **[AWS Glue Iceberg REST endpoint](https://docs.aws.amazon.com/glue/latest/dg/connect-glu-iceberg-rest.html)**
- **[Apache Gravitino Iceberg REST Server](https://gravitino.apache.org/docs/0.8.0-incubating/iceberg-rest-service)**

### Key Features

| Feature | Description |
|---------|-------------|
| **Native C++ Integration** | Fully implemented in C++; no Java runtime required. |
| **REST Catalog Support** | Works with any Iceberg REST Catalog implementation. |
| **Stream-to-Iceberg Writes** | Continuously write streaming data into Iceberg tables. |
| **Direct Reads from Iceberg** | Query Iceberg tables natively using Timeplus SQL. |
| **Cloud Ready** | Optimized for S3 and compatible object storage systems. |

:::info
Data compaction is **not yet supported** in the current Timeplus Iceberg integration.
:::
## Create an Iceberg Database

You can create an **Iceberg database** in Timeplus using the `CREATE DATABASE` statement with the `type = 'iceberg'` setting.

### Syntax

```sql
CREATE DATABASE <database_name>
SETTINGS
    type = 'iceberg',
    catalog_uri = '<catalog_uri>',
    catalog_type = 'rest',
    warehouse = '<warehouse_path>',
    storage_endpoint = '<s3_endpoint>',
    rest_catalog_sigv4_enabled = <true|false>,
    rest_catalog_signing_region = '<region>',
    rest_catalog_signing_name = '<service_name>',
    use_environment_credentials = <true|false>,
    credential = '<username:password>',
    catalog_credential = '<username:password>',
    storage_credential = '<username:password>';
```

### Settings

- `type`: Must be set to `'iceberg'` to indicate an Iceberg database.
- `catalog_uri`: The URI of the Iceberg catalog (e.g., AWS Glue, Gravitino, or another REST catalog endpoint).
- `catalog_type`: The catalog type. Currently, only `'rest'` is supported in Timeplus.
- `warehouse`: The path or identifier of the Iceberg warehouse where table data is stored (e.g., an S3 path).
- `storage_endpoint`: The S3-compatible endpoint where data files are stored. For AWS S3, use `https://<bucket>.s3.<region>.amazonaws.com`.
- `rest_catalog_sigv4_enabled`: Enables [AWS SigV4](https://docs.aws.amazon.com/general/latest/gr/signing_aws_api_requests.html) authentication for secure catalog communication.
- `rest_catalog_signing_region`: The AWS region used for SigV4 signing (e.g., `us-west-2`).
- `rest_catalog_signing_name`: The service name used in SigV4 signing (typically `glue` or `s3`).
- `use_environment_credentials`: Defaults to `true`. When enabled, Timeplus uses environment-based credentials, such as an IAM role assigned to the EC2 instance or the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables. Set this to `false` when using local MinIO or a public S3 bucket.
- `credential`: A unified credential in `username:password` format (for example, AWS access key and secret key). Used for both catalog and storage if they share the same authentication.
- `catalog_credential`: Optional. Use when the catalog requires credentials different from the storage layer.
- `storage_credential`: Optional. Use when the storage backend (e.g., S3 or MinIO) requires separate credentials.
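Putting several of these settings together, here is a minimal sketch of a local development setup against MinIO. This is illustrative only, not one of the validated examples: the catalog URI, warehouse path, storage endpoint, and `minioadmin` credentials are placeholder assumptions for a typical local REST catalog plus MinIO deployment.

```sql
-- Hypothetical local setup: a REST catalog on port 8181 and MinIO on port 9000.
-- All endpoints and credentials below are placeholders.
CREATE DATABASE iceberg_local
SETTINGS
    type = 'iceberg',
    catalog_type = 'rest',
    catalog_uri = 'http://127.0.0.1:8181',
    warehouse = 's3://warehouse/',
    storage_endpoint = 'http://127.0.0.1:9000/warehouse',
    use_environment_credentials = false,  -- required for local MinIO
    credential = 'minioadmin:minioadmin';
```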

### Example: AWS Glue REST Catalog

```sql
CREATE DATABASE demo
SETTINGS
    type = 'iceberg',
    catalog_type = 'rest',
    catalog_uri = 'https://glue.us-west-2.amazonaws.com/iceberg',
    warehouse = 'aws-12-id',
    storage_endpoint = 'https://the-bucket.s3.us-west-2.amazonaws.com',
    rest_catalog_sigv4_enabled = true,
    rest_catalog_signing_region = 'us-west-2',
    rest_catalog_signing_name = 'glue';
```

**Explanation**:
- This example connects Timeplus to an **AWS Glue REST-based Iceberg catalog**.
- `rest_catalog_sigv4_enabled = true` ensures secure communication using AWS SigV4 signing.
- The `warehouse` value identifies the Iceberg warehouse managed by Glue.
- The `storage_endpoint` points to the S3 bucket where the Iceberg table data resides.

### Example: AWS S3 Table REST Catalog

```sql
CREATE DATABASE demo
SETTINGS
    type = 'iceberg',
    catalog_type = 'rest',
    catalog_uri = 'https://glue.us-west-2.amazonaws.com/iceberg',
    warehouse = 'aws-12-id:s3tablescatalog/bucket-name',
    rest_catalog_sigv4_enabled = true,
    rest_catalog_signing_region = 'us-west-2',
    rest_catalog_signing_name = 'glue';
```

**Explanation**:
- This example configures an **AWS S3 Table REST Catalog** for Iceberg in Timeplus.
- The `warehouse` setting specifies the Glue catalog ID and the S3 table bucket location.
- `rest_catalog_sigv4_enabled = true` enables secure communication with AWS using SigV4 signing.
- To **create new Iceberg tables** directly from Timeplus, you can also set `storage_credential='https://s3tables.us-west-2.amazonaws.com/bucket-name'`.

### Example: Apache Gravitino REST Catalog

```sql
CREATE DATABASE demo
SETTINGS
    type = 'iceberg',
    catalog_type = 'rest',
    catalog_uri = 'http://127.0.0.1:9001/iceberg/',
    warehouse = 's3://mybucket/demo/gravitino1',
    storage_endpoint = 'https://the-bucket.s3.us-west-2.amazonaws.com';
```

**Explanation**:
- This example connects Timeplus to an **Apache Gravitino Iceberg REST Catalog**.
- The `catalog_uri` points to the running Gravitino REST service.
- The `warehouse` specifies the S3 path where Iceberg tables are stored.
- The `storage_endpoint` defines the S3-compatible storage endpoint.

**Sample Gravitino Configuration**

Below is an example configuration file for Gravitino Iceberg REST Server 0.9.0:
92127

93128
```properties
94129
# conf/gravitino-iceberg-rest-server.conf
95130
gravitino.iceberg-rest.catalog-backend = memory
96131

97132
gravitino.iceberg-rest.warehouse = s3://mybucket/demo/gravitino1
98-
gravitino.iceberg-rest.io-impl= org.apache.iceberg.aws.s3.S3FileIO
133+
gravitino.iceberg-rest.io-impl = org.apache.iceberg.aws.s3.S3FileIO
99134

100135
gravitino.iceberg-rest.s3-endpoint = https://s3.us-west-2.amazonaws.com
101136
gravitino.iceberg-rest.s3-region = us-west-2
@@ -104,13 +139,25 @@ gravitino.iceberg-rest.s3-access-key-id = theaccesskeyid
104139
gravitino.iceberg-rest.s3-secret-access-key = thesecretaccesskey
105140
```
106141

107-
## Creating and Writing to an Iceberg Table {#create_table}
142+
**Explanation**:
143+
- `catalog-backend=memory` stores catalog metadata in-memory (useful for testing).
144+
- `warehouse` and `io-impl` specify where and how data is stored in S3.
145+
- `s3-endpoint` and `s3-region` define the AWS region and endpoint.
146+
- `credential-provider-type` and the access keys provide authentication for S3 access.
## Create Iceberg Stream

```sql
-- Create an Iceberg stream under the Iceberg database
CREATE STREAM <iceberg_database>.<stream_name> (
    -- column definitions
);
```

**Example**:

```sql
CREATE STREAM demo.transformed (
    timestamp datetime64,
    org_id string,
    float_value float
    -- additional columns elided in this excerpt
);
```

After creating an Iceberg database in Timeplus, you can list existing tables or create new ones directly via SQL.

## Writing to Iceberg

You can insert data directly via an `INSERT INTO` SQL statement, or continuously write to Iceberg streams using materialized views:

**Example**:

```sql
CREATE MATERIALIZED VIEW sink_to_iceberg_mv INTO demo.transformed AS
SELECT
    now() AS timestamp,
    org_id,
    float_value,
    length(array_of_records.a_num) AS array_length,
    array_max(array_of_records.a_num) AS max_num,
    array_min(array_of_records.a_num) AS min_num
FROM msk_stream_read
SETTINGS s3_min_upload_file_size = 1024;
```

This example continuously writes transformed data from a streaming source (`msk_stream_read`) into an Iceberg table.
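For ad-hoc writes, a plain `INSERT INTO` works as well. A minimal sketch, assuming the `demo.transformed` stream defined earlier; the specific values (and the numeric types they imply) are illustrative assumptions:

```sql
-- One-off insert into the Iceberg-backed stream; values are placeholders
INSERT INTO demo.transformed (timestamp, org_id, float_value, array_length, max_num, min_num)
VALUES (now(), 'org-1', 3.14, 2, 42, 7);
```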
## Reading from Iceberg

You can query Iceberg data in Timeplus using standard SQL syntax:

```sql
SELECT ... FROM <iceberg_database>.<iceberg_stream>;
```

:::info
Iceberg streams in Timeplus behave like static tables: queries return the full result set and then terminate.

For large tables, it's recommended to include a `LIMIT` clause to avoid excessive data loading.

In future releases, **continuous streaming query support** for Iceberg streams will be added, allowing real-time incremental reads from Iceberg data.
:::

**Example**:

```sql
SELECT count() FROM iceberg_db.table_name;
```

This query is optimized to return the row count of the specified Iceberg table with minimal scanning of metadata and data files.
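Since a query over an Iceberg stream loads its full result set before terminating, bounding the scan is a good habit when exploring. For example, using the `demo.transformed` stream from earlier:

```sql
-- Bound the result set when exploring a large Iceberg table
SELECT * FROM demo.transformed LIMIT 100;
```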

### Using SparkSQL

You can also use **SparkSQL** to validate or analyze Iceberg data created by Timeplus.
Depending on your catalog setup, use one of the following configurations:

**For AWS Glue REST Catalog**:

```bash
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.7.1,org.apache.iceberg:iceberg-aws-bundle:1.7.1,software.amazon.awssdk:bundle:2.30.2,software.amazon.awssdk:url-connection-client:2.30.2 \
    --conf spark.sql.defaultCatalog=spark_catalog \
    --conf spark.sql.catalog.spark_catalog.type=rest \
    --conf spark.sql.catalog.spark_catalog.uri=https://glue.us-west-2.amazonaws.com/iceberg \
    --conf spark.sql.catalog.spark_catalog.warehouse=$AWS_12_ID \
    --conf spark.sql.catalog.spark_catalog.rest.sigv4-enabled=true \
    --conf spark.sql.catalog.spark_catalog.rest.signing-name=glue \
    --conf spark.sql.catalog.spark_catalog.rest.signing-region=us-west-2
```

**For Apache Gravitino REST Catalog**:

```bash
spark-sql -v --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-aws-bundle:1.8.1,software.amazon.awssdk:bundle:2.30.2,software.amazon.awssdk:url-connection-client:2.30.2 \
    --conf spark.sql.defaultCatalog=spark_catalog \
    --conf spark.sql.catalog.spark_catalog.type=rest \
    --conf spark.sql.catalog.spark_catalog.uri=http://127.0.0.1:9001/iceberg/ \
    --conf spark.sql.catalog.spark_catalog.warehouse=s3://mybucket/demo/gravitino1
```
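Once a SparkSQL session starts with either configuration, you can cross-check the data written by Timeplus. A sketch, assuming the `demo.transformed` table created earlier:

```sql
-- Run inside the spark-sql session: list tables, then verify the row count
SHOW TABLES IN demo;
SELECT count(*) FROM demo.transformed;
```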

## Dropping Iceberg Database

To remove an Iceberg database from Timeplus:

```sql
DROP DATABASE <iceberg_database> CASCADE;
```

:::info
This command deletes metadata within Timeplus, **but does not remove data** from the Iceberg catalog or the underlying S3 storage.
:::
