Conversation
fanneiOZ
left a comment
There was a problem hiding this comment.
Review
By following requirements, this exercise work doesn't meet them.
Clean design and sound software engineering practices.
Structured repo so can be used as working repo in future.
Production-ready solution.
Schema designing need more critical thinking. For junior level, this is okay. for mid, acceptable, but for senior, sorry to say it's not there yet.
| DBHOST=localhost | ||
| DBUSER=demouser | ||
| DBPASSWORD=demopassword | ||
| DATABASE=financial_database | ||
| TABLE=financial |
There was a problem hiding this comment.
ETL script refer to .env file but none found in repository. I add this one my self
| cols = '`,`'.join([str(i) for i in list_col]) | ||
|
|
||
| ''' CONNECT TO THE DATABASE ''' | ||
| connection = pymysql.connect( |
There was a problem hiding this comment.
This implies that we need MySQL running on default port 3306 only. No flexibility for port connection
| USE financial_database; | ||
|
|
||
| DROP TABLE IF EXISTS financial; | ||
| CREATE TABLE financial ( financial_id int unsigned not null auto_increment, segment varchar(128), country varchar(128), product varchar(128), discount_band varchar(128), units_sold float, manufacturing_price int, sale_price int, gross_sales float, discounts float, sales float, cogs float, profit float, date datetime, month_number int, month_name varchar(128), year int, primary key (financial_id) ); No newline at end of file |
There was a problem hiding this comment.
About schema design
month_name varchar(128)
We would never have any english month name exceed 128 length.
product varchar(128)
Excel has 32,767 characters limit per cells. These will easily lead to truncation issue if data in xlsx is change.
manufacturing_price int, sale_price int
Price values as Integer ???
date datetime
Why date time type if data will never specify the time?
Mostly Top set 128-char limit for every varchar fields. Some won't need that long, Some might exceed. So in point of foreseeing / scalable design, this design doesn't meet this scenario.
| @@ -0,0 +1,9 @@ | |||
| INSERT INTO mysql.user (User,Host,authentication_string,ssl_cipher,x509_issuer,x509_subject) VALUES('demouser','localhost',PASSWORD('demopassword'),'','',''); | |||
There was a problem hiding this comment.
PASSWORD() is compatible only for MySQL 5 or less. As the repository doesn't specific the version of MySQL (they released version 8) already. This simply breaks the execution.
| @@ -0,0 +1,9 @@ | |||
| INSERT INTO mysql.user (User,Host,authentication_string,ssl_cipher,x509_issuer,x509_subject) VALUES('demouser','localhost',PASSWORD('demopassword'),'','',''); | |||
There was a problem hiding this comment.
There's CREATE USER for simply user creation, but Top chose to use INSERT into user table. This is not wrong but it looks like he thinks too complex for simply requirement but too simply for detailed requirement. IMO 🤔
| for i, row in df['Sheet1'].iterrows(): | ||
| sql = 'INSERT INTO `'+table+'` (`'+cols +'`) VALUES ('+'%s,'*(len(row)-1)+'%s)' | ||
| # row represents the data points in DataFrame | ||
| cursor.execute(sql, tuple(row)) |
There was a problem hiding this comment.
If there's bulk statement compatibility or cursor.executemany, why opt to INSERT one by one??
| load_dotenv() # Load .env | ||
| dbhost = os.getenv('DBHOST', "") | ||
| dbuser = os.getenv('DBUSER', "") | ||
| dbpassword = os.getenv('DBPASSWORD', "") | ||
| database = os.getenv('DATABASE', "") | ||
| table = os.getenv('TABLE', "") |
There was a problem hiding this comment.
Talking in developer world, this one needs encapsulation. It's okay to do this for single script, but what in real world, we never have only single magic script. Every scripts will have to change this part if connection have to be changed.
| load_dotenv() # Load .env | ||
| dbhost = os.getenv('DBHOST', "") | ||
| dbuser = os.getenv('DBUSER', "") | ||
| dbpassword = os.getenv('DBPASSWORD', "") | ||
| database = os.getenv('DATABASE', "") | ||
| table = os.getenv('TABLE', "") |
There was a problem hiding this comment.
Environment variable for single table name ??
| @@ -0,0 +1,58 @@ | |||
| from dotenv import load_dotenv | |||
There was a problem hiding this comment.
About ETL Script
This script is fixed to one task. It will require some work to reuse this for other task.
| cursor.execute(sql, tuple(row)) | ||
|
|
||
| # The connection is not autocommitted by default, so we must commit to save our changes | ||
| connection.commit() |
There was a problem hiding this comment.
Txn will be commit but if this one failed it will never be rollback.
No description provided.