
[Bug] Flink write with Iceberg metadata; Spark read scans all data without partition pruning #7108

@blackflash997997

Description


Search before asking

  • I searched in the issues and found nothing similar.

Paimon version

Flink catalog:

CREATE CATALOG paimon_catalog WITH (
    'type' = 'paimon',
    'metastore' = 'hive',
    'uri' = 'thrift://xxx:9083',
    'warehouse' = 'jfs://poc-jfs/user/hive/lakehouse_paimon',
    'table-default.metadata.iceberg.storage'='hive-catalog',
    'table-default.metadata.iceberg.uri'='thrift://xxx:9083'
);

USE CATALOG paimon_catalog;
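
For reference, here is a hypothetical sketch of the partitioned table involved (the actual schema is not shown in this report; only the dt partition key and the metadata.iceberg.* options inherited from the catalog's table-default.* settings matter):

-- Hypothetical schema; the real table has more columns.
CREATE TABLE my_database.paimon_table (
    `TIME` STRING,
    payload STRING,   -- placeholder for the real columns
    dt STRING
) PARTITIONED BY (dt);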

I'm writing data from Kafka with the following SQL:

insert into my_database.paimon_table
select *,
DATE_FORMAT(SYSTEMDATE,'yyyyMMdd') -- this is a partition field : dt
from kafka
where `TIME` IS NOT NULL
;

Then I query with spark-sql:

spark-sql  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog --conf spark.sql.catalog.spark_catalog.type=hive

select * from my_database.paimon_table where dt=20251231

I found that Spark scans all data to locate the dt=20251231 rows instead of pruning to the matching partition.
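
A quick way to verify is a plain Spark SQL EXPLAIN (output omitted here; with pruning working, the scan node should be restricted by the dt filter rather than enumerating all files):

-- dt is a string column, so the literal is quoted here to rule out an
-- implicit cast on dt blocking filter pushdown.
EXPLAIN SELECT * FROM my_database.paimon_table WHERE dt = '20251231';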

I also found that describe formatted my_database.paimon_table; does not show the # Metadata Columns section that appears when the Paimon table is created through spark-sql.

As shown below, there is no # Metadata Columns section, which causes filtering on the partition field to fail to prune:

......
dt                      string                  from deserializer   
                                                                    
# Detailed Table Information                                                
Catalog                 spark_catalog                               
Database                paimon_flink1                               
Table                   zvos_flink_14_append2                       
Owner                   zoomspace                                   
Created Time            Thu Jan 22 17:40:42 CST 2026                        
Last Access             Thu Jan 22 17:40:42 CST 2026                        
Created By              Spark 2.2 or prior                          
Type                    MANAGED                                     
Provider                hive                                        
Comment                                                             
Table Properties        [metadata.iceberg.storage=hive-catalog, metadata.iceberg.uri=thrift://xxxx:9083, metadata_location=jfs://poc-jfs/user/hive/lakehouse_paimon/iceberg/paimon_flink1/zvos_flink_14_append2/metadata/v3.metadata.json, partition=dt, previous_metadata_location=jfs://poc-jfs/user/hive/lakehouse_paimon/iceberg/paimon_flink1/zvos_flink_14_append2/metadata/v2.metadata.json, storage_handler=org.apache.paimon.hive.PaimonStorageHandler, table_type=PAIMON, transient_lastDdlTime=1769074842]                         
Statistics              87763321 bytes                              
Location                jfs://poc-jfs/user/hive/lakehouse_paimon/paimon_flink1.db/zvos_flink_14_append2                     
Serde Library           org.apache.paimon.hive.PaimonSerDe                          
InputFormat             org.apache.paimon.hive.mapred.PaimonInputFormat                     
OutputFormat            org.apache.paimon.hive.mapred.PaimonOutputFormat                            
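
To check whether the Paimon-generated Iceberg metadata carries the partition spec at all, Iceberg's standard partitions metadata table can be queried from Spark (assuming the same table as above):

-- One row per dt value is expected; a missing or empty partition column
-- would mean the Iceberg metadata was written with an unpartitioned spec.
SELECT * FROM my_database.paimon_table.partitions;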

Compute Engine

Paimon version: 1.4.1 snapshot
Write: Flink 1.20.1 on YARN, JuiceFS filesystem
Read: Spark 3.5.2, Iceberg 1.6.1

Minimal reproduce step

Use Flink to write to a partitioned table with Iceberg metadata enabled,
then filter from Spark with where partition-key=xxxx (a metadata check sketch follows below).
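
As a diagnostic sketch (not part of the repro itself), Iceberg's files metadata table shows the partition tuple recorded for each data file:

-- A null or empty partition struct here would indicate that the writer
-- did not attach partition values to the Iceberg manifests.
SELECT file_path, partition FROM my_database.paimon_table.files;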

What doesn't meet your expectations?

Filtering with where partition-key=xxxx should scan only the matching partition path, not all data.

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
