migrate from csv files to sqlite databases for downstream use in queries #120

rfl-urbaniak wants to merge 11 commits into main

Conversation
Have been running into this issue apparently.
Still issues with what isort does between CI and locally, despite the version numbers being the same. Other people have faced this; I'm looking for a solution, but slowly considering using black w/o isort, at least till switching to isort 6.0.
Niklewa left a comment:
The SQL database's poor performance might be because we're treating it like regular DataFrames. I think using it differently, focusing on fetching only the necessary columns in each code section, would suit SQL databases better. Comparing performance by listing all features (one of our tests) isn't really fair.
Here are some actions that might improve performance:
- Using SQL queries directly on the database instead of loading everything into a pandas DataFrame, so that we only fetch the data needed for each task (and optimize the queries themselves); see the sketch after this comment.
- Restructuring the database to add indexes on the variables we filter on most often.
Other options to consider:
- In-memory databases like Redis or Memcached
- Adding a caching layer to the SQL setup
- Columnar databases such as Amazon Redshift or Apache Cassandra
I'm not sure whether these actions will make SQL perform better or leave performance roughly the same, and the same applies if we try new methods. However, I think these points are worth considering, especially as our dataset grows.
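A minimal sketch of what the column-selective approach (plus an index) could look like. The file name `counties.db`, table name `income_wide`, and columns `GeoFIPS` / `median_income` are hypothetical, chosen only to illustrate the idea:

```python
import sqlite3

import pandas as pd

DB_PATH = "counties.db"  # hypothetical database file

with sqlite3.connect(DB_PATH) as conn:
    # Index the column we filter by most often (GeoFIPS is a placeholder).
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_income_wide_geofips ON income_wide (GeoFIPS)"
    )

    # Instead of reading the whole table into a DataFrame and slicing it in pandas,
    # fetch only the columns and rows the current task needs.
    needed = pd.read_sql_query(
        "SELECT GeoFIPS, median_income FROM income_wide WHERE GeoFIPS IN (?, ?)",
        conn,
        params=("01001", "01003"),
    )
```

The same pattern would apply to the msa-level database; the point is that the database only pays off when the query, rather than pandas, does the column and row selection.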
```python
self.wide[feature] = pd.read_csv(file_path)
file_path = os.path.join(self.data_path, f"{feature}_{table_suffix}.csv")
df = pd.read_csv(file_path)
if table_suffix == "wide":
```
Something like this could be faster (not relying on multiple elifs), though I'm not sure the change in performance would be meaningful:
```python
suffix_map = {
    "wide": self.wide,
    "std_wide": self.std_wide,
    "long": self.long,
    "std_long": self.std_long,
}
if table_suffix in suffix_map:
    suffix_map[table_suffix][feature] = df
else:
    raise ValueError(
        "Invalid table suffix. Please choose 'wide', 'std_wide', 'long', or 'std_long'."
    )
```
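For context, a sketch of how that lookup could slot into the csv-loading method from the diff above; the class and method names here are assumptions rather than the PR's actual code:

```python
import os

import pandas as pd


class DataGrabberCSV:  # assumed name, following the PR's "...CSV" renaming
    def __init__(self, data_path):
        self.data_path = data_path
        self.wide, self.std_wide, self.long, self.std_long = {}, {}, {}, {}

    def _get_features(self, features, table_suffix):
        # A dict lookup replaces the chain of elifs from the original diff.
        suffix_map = {
            "wide": self.wide,
            "std_wide": self.std_wide,
            "long": self.long,
            "std_long": self.std_long,
        }
        if table_suffix not in suffix_map:
            raise ValueError(
                "Invalid table suffix. Please choose 'wide', 'std_wide', "
                "'long', or 'std_long'."
            )
        for feature in features:
            file_path = os.path.join(self.data_path, f"{feature}_{table_suffix}.csv")
            suffix_map[table_suffix][feature] = pd.read_csv(file_path)
```

Whether the dict lookup is measurably faster than four elifs is doubtful; the main gain is that adding a new suffix only touches one place.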
All csv data are now in two `.db` files for the two levels of analysis (counties and msa). Prior to deployment of the dbs to the polis server, these live locally. As the db files are now too large to store on GitHub, the user needs to run `csv_to_db_pipeline.py` before the first use to generate the db locally.
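For orientation, a minimal sketch of the kind of csv-to-SQLite conversion such a pipeline performs; the folder, file, and table names are hypothetical, and the actual `csv_to_db_pipeline.py` may differ:

```python
import glob
import os
import sqlite3

import pandas as pd

CSV_DIR = "data/counties"    # hypothetical folder of per-feature csv files
DB_PATH = "data/counties.db"  # hypothetical output database

with sqlite3.connect(DB_PATH) as conn:
    for csv_path in sorted(glob.glob(os.path.join(CSV_DIR, "*.csv"))):
        table_name = os.path.splitext(os.path.basename(csv_path))[0]
        df = pd.read_csv(csv_path)
        # One table per csv file; "replace" keeps re-runs idempotent.
        df.to_sql(table_name, conn, if_exists="replace", index=False)
```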
The original `DataGrabber` classes have been refactored, renamed to `...CSV`, and decoupled from downstream use. The new `DataGrabberDB` class has been introduced and passed on to function as the generic `DataGrabber` and `MSADataGrabber`.
Additional tests for `DataGrabberDB` have been introduced in `test_data_grabber_sql`. Additionally, `DataGrabberDB` under the generic alias passes all the tests that the original `DataGrabber` did.

`generate_sql.ipynb` (docs/experimental) contains performance tests for both approaches. At least in the current setting the original method is faster. The main culprit seems to be:

This is not too surprising, after some reflection, as illustrated by this comment from ChatGPT:
As the ultimate tests of the switch to DB would involve data updates and model retraining, I leave the original `.csv` files and classes in place until those events. Keep in mind they are now not needed for queries to work correctly (they are needed to generate the `.db` files and for some of the tests).
The new `pytest` release leads to incompatibilities that might be worth investigating later. For now, I fixed the `pytest` version to `7.4.3` in `setup.py`.

Some cleaning scripts have been moved to a subfolder, which required a small refactoring of import statements in the generic data cleaning pipeline scripts.

Incorrect indentation in a `DataGrabber` test has been fixed.
It turns out that `isort` with `--profile black` on the runner still behaves as if the profile were not set. I checked versions between the local install and the one on the runner; the version numbers are the same. More people have similar issues, so I suspended `isort` and decided to trust `black`, at least till a stable `isort 6.0` gets out.

Inference tests succeed with `pyro-ppl==1.8.5` but fail with `pyro-ppl==1.9`. For now, I fixed the version number in `setup.py`, but will think about investigating this deeper.
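For reference, a sketch of how the two pins could look in `setup.py`; the package name, layout, and remaining dependencies are placeholders, and whether `pytest` sits in `install_requires` or a test extra is also an assumption:

```python
# setup.py (excerpt) -- only the two pinned versions come from this PR;
# the package name and the other dependencies are placeholders.
from setuptools import find_packages, setup

setup(
    name="cities",  # placeholder
    packages=find_packages(),
    install_requires=[
        "pyro-ppl==1.8.5",
        # ... other runtime dependencies omitted
    ],
    extras_require={
        "test": [
            "pytest==7.4.3",
            # ... other test dependencies omitted
        ],
    },
)
```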