An OED portfolio created with "native pandas" functions can become much larger when loaded into ods_tools, due to inefficient datatypes:
- Some columns that typically have a limited number of values are set to strings or integers instead of categoricals: OccupancyCode, ConstructionCode, Basement, FoundationType...
- Some columns that typically have many different values are set to categorical: LocNumber (and FlexiLocXYZ, although this will depend on the values)...
- Nullable datatypes such as Int64Dtype() take more space than a plain int64 (see the sketch after this list).
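As a rough illustration of the per-value cost of each choice, the sketch below (synthetic data, not the portfolio above) compares the three dtypes on a low-cardinality column:

import pandas as pd

n = 1_000_000
codes = pd.Series([1050, 1100, 1150] * (n // 3))  # low-cardinality column

# nullable Int64: 8 bytes per value plus a 1-byte validity mask
print(codes.astype("Int64").memory_usage(deep=True))
# plain int64: 8 bytes per value, no mask
print(codes.astype("int64").memory_usage(deep=True))
# category with 3 categories: int8 codes, ~1 byte per value
print(codes.astype("category").memory_usage(deep=True))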
In the data I was working with, the parquet file grew from 3GB to 61GB after loading into ods_tools (it has 86 million rows, so perhaps not typical).
The snippet below demonstrates how we can reverse this inflation by changing some datatypes. It assumes that the size of a dataframe in memory correlates with its size in a parquet file, which may or may not be correct. This is with ods_tools 4.0.1. It may be beneficial to change some of the default datatypes to reduce file size.
In [70]: loc_cat = oed_exposure.location.dataframe.astype(
    ...:     {
    ...:         "OccupancyCode": "category",
    ...:         "ConstructionCode": "category",
    ...:         "LocNumber": "str",
    ...:         "FlexiLocOriginalId": "str",
    ...:         "NumberOfBuildings": "int64",
    ...:         "Basement": "category",
    ...:         "FoundationType": "category",
    ...:         "NumberOfStoreys": "int32",
    ...:     }
    ...: )
In [72]: pd.concat(
...: {
...: "oasis_memory_megabytes": oed_exposure.location.dataframe.memory_usage(deep=True) / 1024 / 1024,
...: "oasis_dtype": oed_exposure.location.dataframe.dtypes,
...: "updated_memory_megabytes": loc_cat.memory_usage(deep=True) / 1024 / 1024,
...: "updated_dtype": loc_cat.dtypes,
...: },
...: axis="columns",
...: ).sort_values("oasis_memory_megabytes", ascending=False)
Out[72]:
oasis_memory_megabytes oasis_dtype updated_memory_megabytes updated_dtype
LocNumber 7066.309267 category 4673.594064 object
FlexiLocOriginalId 7066.309260 category 4673.594057 object
OccupancyCode 739.609016 Int64 82.179143 category
ConstructionCode 739.609016 Int64 82.178891 category
NumberOfBuildings 739.609016 Int64 657.430237 int64
Latitude 657.430237 float64 657.430237 float64
FirstFloorHeight 657.430237 float64 657.430237 float64
BuildingTIV 657.430237 float64 657.430237 float64
Longitude 657.430237 float64 657.430237 float64
FoundationType 410.893898 Int32 82.178872 category
NumberOfStoreys 410.893898 Int32 328.715118 int32
Basement 410.893898 Int32 82.178872 category
GeogName1 334.046868 category 334.046868 category
PostalCode 166.024951 category 166.024951 category
AreaName 82.181823 category 82.181823 category
LocUserDef2 82.179667 category 82.179667 category
GeogName2 82.179589 category 82.179589 category
OEDVersion 82.178994 category 82.178994 category
LocPerilsCovered 82.178987 category 82.178987 category
GeogScheme1 82.178979 category 82.178979 category
LocCurrency 82.178979 category 82.178979 category
GeogScheme2 82.178979 category 82.178979 category
CountryCode 82.178978 category 82.178978 category
PortNumber 82.178977 category 82.178977 category
AccNumber 82.178977 category 82.178977 category
FloorsOccupied 82.178929 category 82.178929 category
Index 0.000126 NaN 0.000126 NaN
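The column cardinalities below explain these choices: OccupancyCode, ConstructionCode, Basement and FoundationType have very few distinct values across 86 million rows, so a categorical stores one small integer code per row, while LocNumber and FlexiLocOriginalId are unique per row, so a categorical dtype only adds overhead on top of a plain string column.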
In [74]: oed_exposure.location.dataframe.agg("nunique")
Out[74]:
OccupancyCode 9
AreaName 36
GeogName1 57952
Longitude 86170650
Latitude 86170667
LocUserDef2 11
BuildingTIV 1
PortNumber 1
AccNumber 1
CountryCode 1
LocPerilsCovered 1
LocCurrency 1
GeogScheme1 1
GeogScheme2 1
FirstFloorHeight 1
NumberOfBuildings 1
NumberOfStoreys 1
FloorsOccupied 1
FoundationType 1
OEDVersion 1
Basement 1
ConstructionCode 1
LocNumber 86170696
FlexiLocOriginalId 86170696
GeogName2 11
PostalCode 22592
dtype: int64
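To test the in-memory/on-disk assumption directly rather than relying on memory_usage, both frames can be written out and their file sizes compared; a minimal sketch (the file names here are placeholders):

import os

oed_exposure.location.dataframe.to_parquet("location_oasis.parquet")
loc_cat.to_parquet("location_updated.parquet")

for path in ("location_oasis.parquet", "location_updated.parquet"):
    print(path, round(os.path.getsize(path) / 1024**3, 2), "GB")

Note that parquet applies its own encodings (e.g. dictionary encoding for strings), so the on-disk savings will not match the in-memory numbers exactly.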