83 changes: 65 additions & 18 deletions docs/content/hdfs_parquet.html.md.erb
@@ -35,7 +35,7 @@ Ensure that you have met the PXF Hadoop [Prerequisites](access_hdfs.html#hadoop_

## <a id="datatype_map"></a>Data Type Mapping

To read and write Parquet primitive data types in Greenplum Database, map Parquet data values to Greenplum Database columns of the same type.
To read and write Parquet primitive data types in Apache Cloudberry, map Parquet data values to Apache Cloudberry columns of the same type.

Parquet supports a small set of primitive data types, and uses metadata annotations to extend the data types that it supports. These annotations specify how to interpret the primitive type. For example, Parquet stores both `INTEGER` and `DATE` types as the `INT32` primitive type. An annotation identifies the original type as a `DATE`.
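
For example, in the minimal sketch below the `sale_date` column is populated from Parquet `INT32` values that carry the `DATE` annotation. The table name and HDFS path are illustrative, and the external table syntax is described later in this topic:

``` sql
-- Illustrative only: PXF reads a Parquet INT32 value annotated as DATE
-- and surfaces it as the Apache Cloudberry date type.
CREATE EXTERNAL TABLE parquet_dates_example (id int, sale_date date)
  LOCATION ('pxf://data/pxf_examples/dates?PROFILE=hdfs:parquet')
  FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```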

@@ -45,7 +45,7 @@ Parquet supports a small set of primitive data types, and uses metadata annotati

PXF uses the following data type mapping when reading Parquet data:

| Parquet Physical Type | Parquet Logical Type | PXF/Greenplum Data Type |
| Parquet Physical Type | Parquet Logical Type | PXF/Cloudberry Data Type |
|-------------------|---------------|--------------------------|
| boolean | -- | Boolean |
| binary \(byte\_array\) | -- | Bytea |
@@ -67,7 +67,7 @@ PXF uses the following data type mapping when reading Parquet data:

PXF can read a Parquet `LIST` nested type when it represents a one-dimensional array of certain Parquet types. The supported mappings follow:

| Parquet Data Type | PXF/Greenplum Data Type |
| Parquet Data Type | PXF/Cloudberry Data Type |
|-------------------|-------------------------|
| list of \<boolean> | Boolean[] |
| list of \<binary> | Bytea[] |
@@ -90,7 +90,7 @@ PXF can read a Parquet `LIST` nested type when it represents a one-dimensional a

PXF uses the following data type mapping when writing Parquet data:

| PXF/Greenplum Data Type | Parquet Physical Type | Parquet Logical Type |
| PXF/Cloudberry Data Type | Parquet Physical Type | Parquet Logical Type |
|-------------------|---------------|--------------------------|
| Bigint | int64 | -- |
| Boolean | boolean | -- |
@@ -114,7 +114,7 @@ PXF uses the following data type mapping when writing Parquet data:

PXF can write a one-dimensional `LIST` of certain Parquet data types. The supported mappings follow:

| PXF/Greenplum Data Type | Parquet Data Type |
| PXF/Cloudberry Data Type | Parquet Data Type |
|-------------------|--------------------------|
| Bigint[] | list of \<int64> |
| Boolean[] | list of \<boolean> |
@@ -149,7 +149,7 @@ When you provide the Parquet schema file to PXF, you must specify the absolute p

The PXF HDFS connector `hdfs:parquet` profile supports reading and writing HDFS data in Parquet-format. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified.

Use the following syntax to create a Greenplum Database external table that references an HDFS directory:
Use the following syntax to create an Apache Cloudberry external table that references an HDFS directory:

``` sql
CREATE [WRITABLE] EXTERNAL TABLE <table_name>
@@ -160,7 +160,7 @@ FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
```

The specific keywords and values used in the Greenplum Database [CREATE EXTERNAL TABLE](https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/ref_guide-sql_commands-CREATE_EXTERNAL_TABLE.html) command are described in the table below.
The specific keywords and values used in the Apache Cloudberry [CREATE EXTERNAL TABLE](https://cloudberry.apache.org/docs/sql-stmts/create-external-table/) command are described in the table below.

| Keyword | Value |
|-------|-------------------------------------|
@@ -169,10 +169,36 @@ The specific keywords and values used in the Greenplum Database [CREATE EXTERNAL
| SERVER=\<server_name\> | The named server configuration that PXF uses to access the data. PXF uses the `default` server if not specified. |
| \<custom&#8209;option\> | \<custom-option\>s are described below.|
| FORMAT 'CUSTOM' | Use `FORMAT` '`CUSTOM`' with `(FORMATTER='pxfwritable_export')` (write) or `(FORMATTER='pxfwritable_import')` (read). |
| DISTRIBUTED BY | If you want to load data from an existing Greenplum Database table into the writable external table, consider specifying the same distribution policy or `<column_name>` on both tables. Doing so will avoid extra motion of data between segments on the load operation. |
| DISTRIBUTED BY | If you want to load data from an existing Apache Cloudberry table into the writable external table, consider specifying the same distribution policy or `<column_name>` on both tables. Doing so will avoid extra motion of data between segments on the load operation. |


## <a id="profile_cfdw"></a>Creating the Foreign Table

The PXF `hdfs_pxf_fdw` foreign data wrapper supports reading and writing Parquet-formatted HDFS files. When you insert records into a foreign table, the block(s) of data that you insert are written to one file per segment in the directory that you specify in the `resource` option.

Use the following syntax to create an Apache Cloudberry foreign table that references an HDFS file or directory:

``` sql
CREATE SERVER <foreign_server> FOREIGN DATA WRAPPER hdfs_pxf_fdw;
CREATE USER MAPPING FOR <user_name> SERVER <foreign_server>;

CREATE FOREIGN TABLE [ IF NOT EXISTS ] <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
SERVER <foreign_server>
OPTIONS ( resource '<path-to-hdfs-file>', format 'parquet' [, <custom-option> '<value>' [, ...] ]);
```

The specific keywords and values used in the Apache Cloudberry [CREATE FOREIGN TABLE](https://cloudberry.apache.org/docs/sql-stmts/create-foreign-table) command are described below.

| Keyword | Value |
|-------|-------------------------------------|
| \<foreign_server\> | The named server configuration that PXF uses to access the data. You can override credentials in the `CREATE SERVER` statement as described in [Overriding the S3 Server Configuration for Foreign Tables](access_s3.html#s3_override_fdw). |
| resource \<path&#8209;to&#8209;hdfs&#8209;file\> | The path to the directory in the HDFS data store. When the `<server_name>` configuration includes a [`pxf.fs.basePath`](cfg_server.html#pxf-fs-basepath) property setting, PXF considers \<path&#8209;to&#8209;hdfs&#8209;file\> to be relative to the base path specified. Otherwise, PXF considers it to be an absolute path. \<path&#8209;to&#8209;hdfs&#8209;file\> must not specify a relative path nor include the dollar sign (`$`) character. |
| format | The file format; specify `'parquet'` for Parquet-formatted data. |
| \<custom-option\> | \<custom-option\>s are described below. |

<a id="customopts"></a>
The PXF `hdfs:parquet` profile supports the following read option. You specify this option in the `CREATE EXTERNAL TABLE` `LOCATION` clause:
The PXF `hdfs:parquet` profile supports the following read option:

| Read Option | Value Description |
|-------|-------------------------------------|
@@ -188,7 +214,7 @@ The PXF `hdfs:parquet` profile supports encoding- and compression-related write
| ENABLE\_DICTIONARY | A boolean value that specifies whether or not to enable dictionary encoding. The default value is `true`; dictionary encoding is enabled when PXF writes Parquet files. |
| DICTIONARY\_PAGE\_SIZE | When dictionary encoding is enabled, there is a single dictionary page per column, per row group. `DICTIONARY_PAGE_SIZE` is similar to `PAGE_SIZE`, but for the dictionary. The default dictionary page size is `1 * 1024 * 1024` bytes. |
| PARQUET_VERSION | The Parquet version; PXF supports the values `v1` and `v2` for this option. The default Parquet version is `v1`. |
| SCHEMA | The absolute path to the Parquet schema file on the Greenplum host or on HDFS. |
| SCHEMA | The absolute path to the Parquet schema file on the Cloudberry PXF host or on HDFS. |

**Note**: You must explicitly specify `uncompressed` if you do not want PXF to compress the data.
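
For example, a writable external table that overrides the default write options might look like the following sketch. The table name, column list, and HDFS path are illustrative; the options are passed as `&<option>=<value>` pairs in the `LOCATION` URI:

``` sql
-- Illustrative sketch: write Parquet v2 files with dictionary encoding turned off.
CREATE WRITABLE EXTERNAL TABLE pxf_parquet_write_opts (id int, descr text)
  LOCATION ('pxf://data/pxf_examples/parquet_v2?PROFILE=hdfs:parquet&PARQUET_VERSION=v2&ENABLE_DICTIONARY=false')
  FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```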

@@ -208,12 +234,29 @@ This example utilizes the data schema introduced in [Example: Reading Text Data

In this example, you create a Parquet-format writable external table that uses the default PXF server to reference Parquet-format data in HDFS, insert some data into the table, and then create a readable external table to read the data.

1. Use the `hdfs:parquet` profile to create a writable external table. For example:
1. Apache Cloudberry does not support both reading from and writing to the same external table. Create two tables, one for writing and one for reading, that reference the same HDFS directory:

``` sql
postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_tbl_parquet (location text, month text, number_of_orders int, item_quantity_per_order int[], total_sales double precision)
LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');

postgres=# CREATE EXTERNAL TABLE read_pxf_parquet(location text, month text, number_of_orders int, item_quantity_per_order int[], total_sales double precision)
LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```

Or create a single foreign table that supports both read and write operations:

``` sql
testdb=# CREATE SERVER example_parquet FOREIGN DATA WRAPPER hdfs_pxf_fdw;
testdb=# CREATE USER MAPPING FOR CURRENT_USER SERVER example_parquet;
testdb=# CREATE FOREIGN TABLE pxf_tbl_parquet(location text, month text, number_of_orders int, item_quantity_per_order int[], total_sales double precision)
SERVER example_parquet
OPTIONS (
resource 'data/pxf_examples/pxf_parquet',
format 'parquet'
);
```

2. Write a few records to the `pxf_parquet` HDFS directory by inserting directly into the `pxf_tbl_parquet` table. For example:
@@ -223,20 +266,24 @@ In this example, you create a Parquet-format writable external table that uses t
postgres=# INSERT INTO pxf_tbl_parquet VALUES ( 'Cleveland', 'Oct', 3812, '{3333,7777}', 96645.37 );
```

3. Recall that Greenplum Database does not support directly querying a writable external table. To read the data in `pxf_parquet`, create a readable external Greenplum Database referencing this HDFS directory:
3. Query the readable external table `read_pxf_parquet`:

``` sql
postgres=# CREATE EXTERNAL TABLE read_pxf_parquet(location text, month text, number_of_orders int, item_quantity_per_order int[], total_sales double precision)
LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
postgres=# SELECT * FROM read_pxf_parquet ORDER BY total_sales;
```
``` pre
location | month | number_of_orders | item_quantity_per_order | total_sales
-----------+-------+------------------+-------------------------+-------------
Frankfurt | Mar | 777 | {1,11,111} | 3956.98
Cleveland | Oct | 3812 | {3333,7777} | 96645.4
(2 rows)
```

4. Query the readable external table `read_pxf_parquet`:
Or query the same foreign table `pxf_tbl_parquet`:

``` sql
postgres=# SELECT * FROM read_pxf_parquet ORDER BY total_sales;
postgres=# SELECT * FROM pxf_tbl_parquet ORDER BY total_sales;
```

``` pre
location | month | number_of_orders | item_quantity_per_order | total_sales
-----------+-------+------------------+-------------------------+-------------
Expand Down
58 changes: 49 additions & 9 deletions docs/content/objstore_parquet.html.md.erb
@@ -32,7 +32,7 @@ Ensure that you have met the PXF Object Store [Prerequisites](access_objstore.ht

## <a id="datatype_map"></a>Data Type Mapping

Refer to [Data Type Mapping](hdfs_parquet.html#datatype_map) in the PXF HDFS Parquet documentation for a description of the mapping between Greenplum Database and Parquet data types.
Refer to [Data Type Mapping](hdfs_parquet.html#datatype_map) in the PXF HDFS Parquet documentation for a description of the mapping between Apache Cloudberry and Parquet data types.

## <a id="profile_cet"></a>Creating the External Table

@@ -47,7 +47,7 @@ The PXF `<objstore>:parquet` profiles support reading and writing data in Parque
| S3 | s3 |


Use the following syntax to create a Greenplum Database external table that references an HDFS directory. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified.
Use the following syntax to create an Apache Cloudberry external table that references an object store directory. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified.

``` sql
CREATE [WRITABLE] EXTERNAL TABLE <table_name>
@@ -58,7 +58,7 @@ FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
```

The specific keywords and values used in the Greenplum Database [CREATE EXTERNAL TABLE](https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/ref_guide-sql_commands-CREATE_EXTERNAL_TABLE.html) command are described in the table below.
The specific keywords and values used in the Apache Cloudberry [CREATE EXTERNAL TABLE](https://cloudberry.apache.org/docs/sql-stmts/create-external-table/) command are described in the table below.

| Keyword | Value |
|-------|-------------------------------------|
@@ -67,30 +67,70 @@ The specific keywords and values used in the Greenplum Database [CREATE EXTERNAL
| SERVER=\<server_name\> | The named server configuration that PXF uses to access the data. |
| \<custom&#8209;option\>=\<value\> | Parquet-specific custom options are described in the [PXF HDFS Parquet documentation](hdfs_parquet.html#customopts). |
| FORMAT 'CUSTOM' | Use `FORMAT` '`CUSTOM`' with `(FORMATTER='pxfwritable_export')` (write) or `(FORMATTER='pxfwritable_import')` (read). |
| DISTRIBUTED BY | If you want to load data from an existing Greenplum Database table into the writable external table, consider specifying the same distribution policy or `<column_name>` on both tables. Doing so will avoid extra motion of data between segments on the load operation. |
| DISTRIBUTED BY | If you want to load data from an existing Apache Cloudberry table into the writable external table, consider specifying the same distribution policy or `<column_name>` on both tables. Doing so will avoid extra motion of data between segments on the load operation. |

If you are accessing an S3 object store:

- You can provide S3 credentials via custom options in the `CREATE EXTERNAL TABLE` command as described in [Overriding the S3 Server Configuration for External Tables DDL](access_s3.html#s3_override_ext).
- If you are reading Parquet data from S3, you can direct PXF to use the S3 Select Amazon service to retrieve the data. Refer to [Using the Amazon S3 Select Service](access_s3.html#s3_select) for more information about the PXF custom option used for this purpose.
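
For example, a readable external table that requests S3 Select processing might look like the following sketch. The `S3_SELECT` option name and value are assumptions based on the S3 Select documentation referenced above, and the bucket path is illustrative:

``` sql
-- Illustrative sketch; the S3_SELECT setting is an assumption (see the S3 Select docs).
CREATE EXTERNAL TABLE pxf_parquet_s3_select (location text, total_sales double precision)
  LOCATION ('pxf://BUCKET/pxf_examples/pxf_parquet?PROFILE=s3:parquet&SERVER=s3srvcfg&S3_SELECT=ON')
  FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```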

## <a id="profile_cfdw"></a>Creating the Foreign Table

Use one of the following foreign data wrappers with `format 'parquet'`.

| Object Store | Foreign Data Wrapper |
|-------|-------------------------------------|
| Azure Blob Storage | wasbs_pxf_fdw |
| Azure Data Lake Storage Gen2 | abfss_pxf_fdw |
| Google Cloud Storage | gs_pxf_fdw |
| MinIO | s3_pxf_fdw |
| S3 | s3_pxf_fdw |

The following syntax creates an Apache Cloudberry foreign table that references a Parquet-format file:

``` sql
CREATE SERVER <foreign_server> FOREIGN DATA WRAPPER <store>_pxf_fdw;
CREATE USER MAPPING FOR <user_name> SERVER <foreign_server>;

CREATE FOREIGN TABLE [ IF NOT EXISTS ] <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
SERVER <foreign_server>
OPTIONS ( resource '<path-to-file>', format 'parquet' [, <custom-option> '<value>' [, ...] ]);
```

| Keyword | Value |
|-------|-------------------------------------|
| \<foreign_server\> | The named server configuration that PXF uses to access the data. You can override credentials in the `CREATE SERVER` statement as described in [Overriding the S3 Server Configuration for Foreign Tables](access_s3.html#s3_override_fdw). |
| resource \<path&#8209;to&#8209;file\> | The path to the directory or file in the object store. When the `<server_name>` configuration includes a [`pxf.fs.basePath`](cfg_server.html#pxf-fs-basepath) property setting, PXF considers \<path&#8209;to&#8209;file\> to be relative to the base path specified. Otherwise, PXF considers it to be an absolute path. \<path&#8209;to&#8209;file\> must not specify a relative path nor include the dollar sign (`$`) character. |
| format 'parquet' | The file format; specify `'parquet'` for Parquet-formatted data. |
| \<custom&#8209;option\>=\<value\> | Parquet-specific custom options are described in the [PXF HDFS Parquet documentation](hdfs_parquet.html#customopts). |


## <a id="example"></a> Example

Refer to the [Example](hdfs_parquet.html#parquet_write) in the PXF HDFS Parquet documentation for a Parquet write/read example. Modifications that you must make to run the example with an object store include:

- Using the `CREATE WRITABLE EXTERNAL TABLE` syntax and `LOCATION` keywords and settings described above for the writable external table. For example, if your server name is `s3srvcfg`:
- Using the `CREATE WRITABLE EXTERNAL TABLE` syntax and `LOCATION` keywords and settings described above for the writable and readable external tables. For example, if your server name is `s3srvcfg`:

``` sql
CREATE WRITABLE EXTERNAL TABLE pxf_tbl_parquet_s3 (location text, month text, number_of_orders int, item_quantity_per_order int[], total_sales double precision)
LOCATION ('pxf://BUCKET/pxf_examples/pxf_parquet?PROFILE=s3:parquet&SERVER=s3srvcfg')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```

- Using the `CREATE EXTERNAL TABLE` syntax and `LOCATION` keywords and settings described above for the readable external table. For example, if your server name is `s3srvcfg`:

``` sql
CREATE EXTERNAL TABLE read_pxf_parquet_s3(location text, month text, number_of_orders int, item_quantity_per_order int[], total_sales double precision)
LOCATION ('pxf://BUCKET/pxf_examples/pxf_parquet?PROFILE=s3:parquet&SERVER=s3srvcfg')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
- Using the `CREATE FOREIGN TABLE` syntax and settings described above for the foreign table. For example, if your server name is `s3srvcfg`:

``` sql
CREATE SERVER s3srvcfg FOREIGN DATA WRAPPER s3_pxf_fdw;
CREATE USER MAPPING FOR CURRENT_USER SERVER s3srvcfg;

CREATE FOREIGN TABLE pxf_parquet_s3 (location text, month text, number_of_orders int, item_quantity_per_order int[], total_sales double precision)
SERVER s3srvcfg
OPTIONS (
resource 'BUCKET/pxf_examples/pxf_parquet',
format 'parquet'
);
```
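
A brief usage sketch for the foreign table defined above; the sample values mirror the data used in the HDFS Parquet example:

``` sql
INSERT INTO pxf_parquet_s3 VALUES ( 'Frankfurt', 'Mar', 777, '{1,11,111}', 3956.98 );
SELECT * FROM pxf_parquet_s3 ORDER BY total_sales;
```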