
find_column_type takes too long for large dataset #336

@gladysteh99

Description


Trying to parse strings to timestamps takes very long when the dataframe is large with many rows, for string columns that are not actually in any timestamp format.
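A minimal sketch of the pattern behind the slowdown (this mimics the try/except cast used during type inference; `is_timestamp_column` is an illustrative name, not the actual locopy implementation):

```python
import pandas as pd

# A large string column that contains no timestamps at all.
s = pd.Series(["hello"] * 100_000, dtype="object")

def is_timestamp_column(series):
    # Hypothetical stand-in for the cast attempt inside find_column_type:
    # try to parse the whole column as datetimes, fall back on failure.
    try:
        pd.to_datetime(series)
        return True
    except (ValueError, TypeError):
        return False

print(is_timestamp_column(s))  # False
```

The attempted cast is wasted work for every plain-string column, and on a dataframe with many rows and many such columns it adds up.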

Possible solutions:

  1. Leave it to the user to parse explicitly in their original dataframe before uploading to Snowflake. Technically, Polars users already have a clear distinction, since polars has separate string and temporal dtypes, unlike the object dtype in pandas, which can hold pretty much anything.
  2. Use a sample of the records instead of the full dataframe.
  3. Stop trying to cast "str" columns to any other datatype. Technically, an object dtype shouldn't exist in a polars DataFrame, and even if it did, our current validate_X_object helper functions for polars do not cover it, as polars simply has no functions for an object type. The polars helpers only cast from string to date/float, which differs from the behaviour for a pandas dataframe.
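Solution 2 could look roughly like the sketch below (`find_column_type_sampled` is an illustrative name and a deliberately simplified inference loop, not locopy's actual code path):

```python
import pandas as pd

def find_column_type_sampled(df, n_sample=1000, seed=0):
    # Hypothetical sketch of solution 2: infer column types from a random
    # sample of rows instead of scanning the entire frame.
    sample = df.sample(n=min(n_sample, len(df)), random_state=seed)
    types = {}
    for col in sample.columns:
        try:
            # Same cast attempt as before, but only on the sampled rows.
            pd.to_datetime(sample[col])
            types[col] = "timestamp"
        except (ValueError, TypeError):
            types[col] = "varchar"
    return types
```

Sampling bounds the cost of the cast attempts, at the risk of misclassifying a column whose sampled rows happen to all parse.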

Justification for 3.
Small example:

import numpy as np
import pandas as pd

arr = np.array([1, "hello", 3.14, [1, 2, 3], {"a": 1}], dtype=object)
date = pd.date_range(start='1/1/2018', end='1/05/2018')

pd_df = pd.DataFrame({"mixed": arr, "date": date})
pd_df["date_str"] = pd_df["date"].dt.strftime('%Y-%m-%d').astype('string')
pd_df["date_object"] = pd_df["date"].dt.strftime('%Y-%m-%d')
pd_df["string"] = ["a", "b", "c", "d", "e"]
pd_df.dtypes

gives

mixed                  object
date           datetime64[ns]
date_str       string[python]
date_object            object
string                 object
dtype: object
import locopy
from locopy.utility import find_column_type

find_column_type(pd_df, "snowflake")

outputs:

OrderedDict([('mixed', 'varchar'),
             ('date', 'timestamp'),
             ('date_str', 'varchar'),
             ('date_object', 'date'),
             ('string', 'varchar')])

On the other hand, for polars (note the difference in "date_str"):

import polars as pl

pl_df = pl.from_pandas(pd_df[["date", "date_str", "date_object", "string"]])
# add the object column explicitly; from_pandas errors out on an object dtype
pl_df = pl_df.with_columns(mixed=arr)
find_column_type(pl_df, "snowflake")

outputs:

OrderedDict([('date', 'timestamp'),
             ('date_str', 'date'),
             ('date_object', 'date'),
             ('string', 'varchar'),
             ('mixed', 'varchar')])

polars also automatically tries to infer datatypes; e.g. pl_df["date_object"] is automatically changed to "string":

shape: (5, 5)

date                | date_str     | date_object  | string | mixed
datetime[ns]        | str          | str          | str    | object
--------------------|--------------|--------------|--------|----------
2018-01-01 00:00:00 | "2018-01-01" | "2018-01-01" | "a"    | 1
2018-01-02 00:00:00 | "2018-01-02" | "2018-01-02" | "b"    | hello
2018-01-03 00:00:00 | "2018-01-03" | "2018-01-03" | "c"    | 3.14
2018-01-04 00:00:00 | "2018-01-04" | "2018-01-04" | "d"    | [1, 2, 3]
2018-01-05 00:00:00 | "2018-01-05" | "2018-01-05" | "e"    | {'a': 1}

Due to these discrepancies between polars and pandas, I think the best option is to pass all string columns through as 'VARCHAR' without trying to cast them (the same treatment any pandas string column gets).

Labels: bug