Data types produce larger than necessary parquet files

An OED portfolio that is created with "native pandas" functions can become much larger when loaded into ods_tools due to inefficient datatypes.
- Some columns that typically have a limited number of values are set to strings or integers instead of categoricals: `OccupancyCode`, `ConstructionCode`, `Basement`, `FoundationType`...
- Some columns that typically have many different values are set to categorical: `LocNumber` (and `FlexiLocXYZ` although this will depend on the values)...
- Nullable datatypes such as `Int64Dtype()` take more space than a plain `int64`.

In the data I was working with, the parquet file went from 3GB initially to 61GB after loading into ods_tools (it has 86 million rows so perhaps not typical).

The snippet below demonstrates how we can reverse this inflation of data size by changing some datatypes. It assumes that the size of a dataframe in memory correlates with the size in a parquet file which may or may not be correct. This is with ods_tools 4.0.1. It may be beneficial to change some of the default datatypes to reduce file size.

```python
In [70]: loc_cat = oed_exposure.location.dataframe.astype({"OccupancyCode": "category", "ConstructionCode": "category", "LocNumber": "str", "FlexiLocOriginalId": "str", "NumberOfBuildings": "int64", "Basement": "category", "FoundationType": "cate
       ⋮ gory", "NumberOfStoreys": "int32"})

In [72]: pd.concat(
    ...:     {
    ...:         "oasis_memory_megabytes": oed_exposure.location.dataframe.memory_usage(deep=True) / 1024 / 1024,
    ...:         "oasis_dtype": oed_exposure.location.dataframe.dtypes,
    ...:         "updated_memory_megabytes": loc_cat.memory_usage(deep=True) / 1024 / 1024,
    ...:         "updated_dtype": loc_cat.dtypes,
    ...:     },
    ...:     axis="columns",
    ...: ).sort_values("oasis_memory_megabytes", ascending=False)

Out[72]: 
                    oasis_memory_megabytes oasis_dtype  updated_memory_megabytes updated_dtype
LocNumber                      7066.309267    category               4673.594064        object
FlexiLocOriginalId             7066.309260    category               4673.594057        object
OccupancyCode                   739.609016       Int64                 82.179143      category
ConstructionCode                739.609016       Int64                 82.178891      category
NumberOfBuildings               739.609016       Int64                657.430237         int64
Latitude                        657.430237     float64                657.430237       float64
FirstFloorHeight                657.430237     float64                657.430237       float64
BuildingTIV                     657.430237     float64                657.430237       float64
Longitude                       657.430237     float64                657.430237       float64
FoundationType                  410.893898       Int32                 82.178872      category
NumberOfStoreys                 410.893898       Int32                328.715118         int32
Basement                        410.893898       Int32                 82.178872      category
GeogName1                       334.046868    category                334.046868      category
PostalCode                      166.024951    category                166.024951      category
AreaName                         82.181823    category                 82.181823      category
LocUserDef2                      82.179667    category                 82.179667      category
GeogName2                        82.179589    category                 82.179589      category
OEDVersion                       82.178994    category                 82.178994      category
LocPerilsCovered                 82.178987    category                 82.178987      category
GeogScheme1                      82.178979    category                 82.178979      category
LocCurrency                      82.178979    category                 82.178979      category
GeogScheme2                      82.178979    category                 82.178979      category
CountryCode                      82.178978    category                 82.178978      category
PortNumber                       82.178977    category                 82.178977      category
AccNumber                        82.178977    category                 82.178977      category
FloorsOccupied                   82.178929    category                 82.178929      category
Index                             0.000126         NaN                  0.000126           NaN

In [74]: oed_exposure.location.dataframe.agg("nunique")
Out[74]: 
OccupancyCode                9
AreaName                    36
GeogName1                57952
Longitude             86170650
Latitude              86170667
LocUserDef2                 11
BuildingTIV                  1
PortNumber                   1
AccNumber                    1
CountryCode                  1
LocPerilsCovered             1
LocCurrency                  1
GeogScheme1                  1
GeogScheme2                  1
FirstFloorHeight             1
NumberOfBuildings            1
NumberOfStoreys              1
FloorsOccupied               1
FoundationType               1
OEDVersion                   1
Basement                     1
ConstructionCode             1
LocNumber             86170696
FlexiLocOriginalId    86170696
GeogName2                   11
PostalCode               22592
dtype: int64
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data types produce larger than necessary parquet files #190

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Data types produce larger than necessary parquet files #190

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions