Hive ql code for 2022 A share market stock analytic cs476 Ruiming Shen Stock Project:
This project is divided into 5 parts.
- Data ingestion
- etl_code(Data cleaning)
- profiling_code(year data analysis)
- ana_code(Month data analysis)
- screenshot of all the output for every step on dataproc.
Must following this order for all the numbers and results.
There are two csv files for this project which is generated by Wind APP. To get the data, please log into this app with a valid account, and follow the 3 pictures instruction on creating a new csv file for intended stocks.(These 3 pictures are in data_ingestion directory) For intended stocks, type in stock ID or select the whole market if needed, columns included can also be freely edited. The first one I created is the 2022 whole market A-share shenzhen market stocks'daily data, including daily return, daily high, daily low, openprice, closeprice, date(YYYY/MM/DD) NOTE THAT Month is two digits when it is 12, 11,10. one digits when it is 1-9. This information is useful for part 4. The second one combines 31 green electricity stocks in this market. In this project we want to find relationship between these green electricity stocks and the whole shenzhen Market. I will compare daily return, daily closing price. Average of closing price, daily return in 1 year round and per month. Also I create a table to get difference between daily open and daily close. This is for get the comparison of volatility for green electricity stocks and the whole market. So we can use stocks price range, volatility, and daily return to evaluate green electricity stocks' performance. In the data_ingestion part I only upload one csv file which is green electricity stock file, the whole market stock one is too big to upload on GitHub. However, this green stock file can provide enough insight for what my cleaning code and other code are doing.
the first part is data ingestion: it creates directories on HDFS so hive table I created can have a place to store on hdfs and get data loaded from hdfs. Also tells how to access hive using personal info. Specific path for NYU students.
the second part is data cleaning: the original data contains chinese character and some null values. The chinese character is inside every column because the whole sheet is generated by each stocks' individual sheet so the first row of each stocks are chinese characters which are the column names of these individual sheets. The data is not clean so we need to clean it first. First, for loading the table, I set up all the columns to string. Then I use etl_code to get the cleaned version of both green stocks and whole market. The new versions are called filteredGreen and filteredAll. In 3,4,5 we will use these filtered version instead of the original one.
the third part is year data analysis, in this part, we will use one year data, which is the filtered version of these two datasets. I use hive command to get the mean, max, min of the whole market and green stocks. Then I want to get the difference between open and close price. This difference is only deisgned for one year part. Because one month' volatility is not a good index to use. So I will only use one year difference to examine green stocks' volatility. I create two diffrence table first, and get the average of difference in both side. Then I can know whether the volatility of market is higher or the green stocks' is higher.
The fourth part mainly focuses on month data. For getting month data, I divided these two filtered table into 12 pieces. Create 12 tables for each filtered table. So I created 24 tables in total. For every monthly table, I need the average return for each of month table for comparision so I can see in which month green stocks have better performance which month it has poor performance.