
find_column_type takes too long for large dataset #336

@gladysteh99

Description


Trying to parse strings to timestamps takes very long when the dataframe is large with many rows, for string columns that are not actually in any timestamp format.
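A minimal sketch of the pattern behind the slowdown (this mimics the try/except cast used during type inference; `is_timestamp_column` is an illustrative name, not the actual locopy implementation):

```python
import pandas as pd

# A large string column that contains no timestamps at all.
s = pd.Series(["hello"] * 100_000, dtype="object")

def is_timestamp_column(series):
    # Hypothetical stand-in for the cast attempt inside find_column_type:
    # try to parse the whole column as datetimes, fall back on failure.
    try:
        pd.to_datetime(series)
        return True
    except (ValueError, TypeError):
        return False

print(is_timestamp_column(s))  # False
```

The attempted cast is wasted work for every plain-string column, and on a dataframe with many rows and many such columns it adds up.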

Possible solutions:

  1. Leave it to the user to parse explicitly in their original dataframe before uploading to Snowflake. Technically, Polars users already have a clear distinction, since polars has separate string and temporal dtypes, unlike the object dtype in pandas, which can hold pretty much anything.
  2. Use a sample of the records instead of the full dataframe.
  3. Stop trying to cast "str" columns to any other datatype. Technically, an object dtype shouldn't exist in a polars DataFrame, and even if it did, our current validate_X_object helper functions for polars do not cover it, as polars simply has no functions for an object type. The polars helpers only cast from string to date/float, which differs from the behaviour for a pandas dataframe.
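Solution 2 could look roughly like the sketch below (`find_column_type_sampled` is an illustrative name and a deliberately simplified inference loop, not locopy's actual code path):

```python
import pandas as pd

def find_column_type_sampled(df, n_sample=1000, seed=0):
    # Hypothetical sketch of solution 2: infer column types from a random
    # sample of rows instead of scanning the entire frame.
    sample = df.sample(n=min(n_sample, len(df)), random_state=seed)
    types = {}
    for col in sample.columns:
        try:
            # Same cast attempt as before, but only on the sampled rows.
            pd.to_datetime(sample[col])
            types[col] = "timestamp"
        except (ValueError, TypeError):
            types[col] = "varchar"
    return types
```

Sampling bounds the cost of the cast attempts, at the risk of misclassifying a column whose sampled rows happen to all parse.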

Justification for 3.
Small example:

import numpy as np
import pandas as pd

arr = np.array([1, "hello", 3.14, [1, 2, 3], {"a": 1}], dtype=object)
date = pd.date_range(start='1/1/2018', end='1/05/2018')

pd_df = pd.DataFrame({"mixed": arr, "date": date})
pd_df["date_str"] = pd_df["date"].dt.strftime('%Y-%m-%d').astype('string')
pd_df["date_object"] = pd_df["date"].dt.strftime('%Y-%m-%d')
pd_df["string"] = ["a", "b", "c", "d", "e"]
pd_df.dtypes

gives

mixed                  object
date           datetime64[ns]
date_str       string[python]
date_object            object
string                 object
dtype: object
import locopy
from locopy.utility import find_column_type

find_column_type(pd_df, "snowflake")

outputs:

OrderedDict([('mixed', 'varchar'),
             ('date', 'timestamp'),
             ('date_str', 'varchar'),
             ('date_object', 'date'),
             ('string', 'varchar')])

On the other hand, for polars (note the difference in "date_str"):

import polars as pl

pl_df = pl.from_pandas(pd_df[["date", "date_str", "date_object", "string"]])
# add the object column explicitly; from_pandas errors out on an object dtype
pl_df = pl_df.with_columns(mixed=arr)
find_column_type(pl_df, "snowflake")

outputs:

OrderedDict([('date', 'timestamp'),
             ('date_str', 'date'),
             ('date_object', 'date'),
             ('string', 'varchar'),
             ('mixed', 'varchar')])

polars also automatically tries to infer datatypes; e.g. pl_df["date_object"] is automatically changed to "string":

shape: (5, 5)

date                | date_str     | date_object  | string | mixed
datetime[ns]        | str          | str          | str    | object
--------------------|--------------|--------------|--------|----------
2018-01-01 00:00:00 | "2018-01-01" | "2018-01-01" | "a"    | 1
2018-01-02 00:00:00 | "2018-01-02" | "2018-01-02" | "b"    | hello
2018-01-03 00:00:00 | "2018-01-03" | "2018-01-03" | "c"    | 3.14
2018-01-04 00:00:00 | "2018-01-04" | "2018-01-04" | "d"    | [1, 2, 3]
2018-01-05 00:00:00 | "2018-01-05" | "2018-01-05" | "e"    | {'a': 1}

Due to these discrepancies between polars and pandas, I think the best option is to pass all string columns through as 'VARCHAR' without trying to cast them (the same treatment any pandas string column gets).

Labels: bug