@@ -62,29 +62,46 @@ Data loading is a functionality separate from data analysis, so firstly
 let's decouple the data loading part into a separate component (function).
 
 > ## Exercise: Decouple Data Loading from Data Analysis
-> Separate out the data loading functionality from `analyse_data()` into a new function
-> `load_inflammation_data()` that returns a list of 2D NumPy arrays with inflammation data
+>
+> Modify `compute_data.py` to separate out the data loading functionality from `analyse_data()` into a new function
+> `load_inflammation_data()`, that returns a list of 2D NumPy arrays with inflammation data
 > loaded from all inflammation CSV files found in a specified directory path.
+> Then, change your `analyse_data()` function to make use of this new function instead.
+>
 >> ## Solution
+>>
 >> The new function `load_inflammation_data()` that reads all the inflammation data into the
 >> format needed for the analysis could look something like:
+>>
 >> ```python
 >> def load_inflammation_data(dir_path):
->>   data_file_paths = glob.glob(os.path.join(dir_path, 'inflammation*.csv'))
->>   if len(data_file_paths) == 0:
->>     raise ValueError(f"No inflammation CSV files found in path {dir_path}")
->>   data = map(models.load_csv, data_file_paths) # load inflammation data from each CSV file
->>   return list(data) # return the list of 2D NumPy arrays with inflammation data
+>>     data_file_paths = glob.glob(os.path.join(dir_path, 'inflammation*.csv'))
+>>     if len(data_file_paths) == 0:
+>>         raise ValueError(f"No inflammation CSV files found in path {dir_path}")
+>>     data = map(models.load_csv, data_file_paths)  # Load inflammation data from each CSV file
+>>     return list(data)  # Return the list of 2D NumPy arrays with inflammation data
 >> ```
->> This function can now be used in the analysis as follows:
+>>
+>> The modified `analyse_data()` function could then look like:
+>>
 >> ```python
 >> def analyse_data(data_dir):
->>   data = load_inflammation_data(data_dir)
->>   daily_standard_deviation = compute_standard_deviation_by_data(data)
->>   ...
+>>     data = load_inflammation_data(data_dir)
+>>
+>>     means_by_day = map(models.daily_mean, data)
+>>     means_by_day_matrix = np.stack(list(means_by_day))
+>>
+>>     daily_standard_deviation = np.std(means_by_day_matrix, axis=0)
+>>
+>>     graph_data = {
+>>         'standard deviation by day': daily_standard_deviation,
+>>     }
+>>     views.visualize(graph_data)
 >> ```
+>>
 >> The code is now easier to follow since we do not need to understand the data loading part
 >> to understand the statistical analysis part, and vice versa.
+>> In most cases, functions work best when they are short!
 > {: .solution}
 {: .challenge}
 
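As an aside, the decoupled pair above can be tried end to end in a small self-contained sketch. Here `load_csv` is a stand-in for the lesson's `models.load_csv`, the daily-mean and standard-deviation steps are inlined with plain NumPy instead of calling the lesson's helpers, and the CSV files are generated into a temporary directory; treat it as an illustration of the pattern rather than the lesson's exact code:

```python
import glob
import os
import tempfile

import numpy as np


def load_csv(filename):
    """Stand-in for the lesson's models.load_csv: read a numeric CSV file."""
    return np.loadtxt(filename, delimiter=',')


def load_inflammation_data(dir_path):
    """Load all inflammation CSV files found in dir_path into a list of 2D arrays."""
    data_file_paths = glob.glob(os.path.join(dir_path, 'inflammation*.csv'))
    if len(data_file_paths) == 0:
        raise ValueError(f"No inflammation CSV files found in path {dir_path}")
    return [load_csv(path) for path in data_file_paths]


def analyse_data(data_dir):
    """Standard deviation of the per-day (per-column) means across all files."""
    data = load_inflammation_data(data_dir)
    means_by_day_matrix = np.stack([np.mean(d, axis=0) for d in data])
    return np.std(means_by_day_matrix, axis=0)


# Generate two small 2-patients x 3-days files and run the analysis on them.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2):
        np.savetxt(os.path.join(tmp, f'inflammation-{i}.csv'),
                   np.arange(6).reshape(2, 3) + i, delimiter=',')
    result = analyse_data(tmp)

print(result)  # → [0.5 0.5 0.5]
```

With the loading logic behind its own function, the analysis body reads as pure computation, which is the point of the refactoring in this change.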
@@ -185,13 +202,12 @@ In addition, implementation of the method `get_area()` is hidden too (abstractio
 >> At the end of this exercise, the code in the `analyse_data()` function should look like:
 >> ```python
 >> def analyse_data(data_source):
->>   data = data_source.load_inflammation_data()
->>   daily_standard_deviation = compute_standard_deviation_by_data(data)
->>   ...
+>>     data = data_source.load_inflammation_data()
+>>     ...
 >> ```
 >> The controller code should look like:
 >> ```python
->> data_source = CSVDataSource(os.path.dirname(InFiles[0]))
+>> data_source = CSVDataSource(os.path.dirname(infiles[0]))
 >> analyse_data(data_source)
 >> ```
 > {: .solution}
@@ -200,33 +216,32 @@ In addition, implementation of the method `get_area()` is hidden too (abstractio
 >>
 >> ```python
 >> class CSVDataSource:
->>   """
->>   Loads all the inflammation CSV files within a specified directory.
->>   """
->>   def __init__(self, dir_path):
->>     self.dir_path = dir_path
+>>     """
+>>     Loads all the inflammation CSV files within a specified directory.
+>>     """
+>>     def __init__(self, dir_path):
+>>         self.dir_path = dir_path
 >>
->>   def load_inflammation_data(self):
->>     data_file_paths = glob.glob(os.path.join(self.dir_path, 'inflammation*.csv'))
->>     if len(data_file_paths) == 0:
->>       raise ValueError(f"No inflammation CSV files found in path {self.dir_path}")
->>     data = map(models.load_csv, data_file_paths)
->>     return list(data)
+>>     def load_inflammation_data(self):
+>>         data_file_paths = glob.glob(os.path.join(self.dir_path, 'inflammation*.csv'))
+>>         if len(data_file_paths) == 0:
+>>             raise ValueError(f"No inflammation CSV files found in path {self.dir_path}")
+>>         data = map(models.load_csv, data_file_paths)
+>>         return list(data)
 >> ```
 >> In the controller, we create an instance of CSVDataSource and pass it
 >> into the statistical analysis function.
 >>
 >> ```python
->> data_source = CSVDataSource(os.path.dirname(InFiles[0]))
+>> data_source = CSVDataSource(os.path.dirname(infiles[0]))
 >> analyse_data(data_source)
 >> ```
 >> The `analyse_data()` function is modified to receive any data source object (that implements
 >> the `load_inflammation_data()` method) as a parameter.
 >> ```python
 >> def analyse_data(data_source):
->>   data = data_source.load_inflammation_data()
->>   daily_standard_deviation = compute_standard_deviation_by_data(data)
->>   ...
+>>     data = data_source.load_inflammation_data()
+>>     ...
 >> ```
 >> We have now fully decoupled the reading of the data from the statistical analysis and
 >> the analysis is not fixed to reading from a directory of CSV files. Indeed, we can pass various
@@ -364,11 +379,11 @@ data sources with no extra work.
 >> Additionally, in the controller we will need to select an appropriate DataSource instance to
 >> provide to the analysis:
 >> ```python
->> _, extension = os.path.splitext(InFiles[0])
+>> _, extension = os.path.splitext(infiles[0])
 >> if extension == '.json':
->>     data_source = JSONDataSource(os.path.dirname(InFiles[0]))
+>>     data_source = JSONDataSource(os.path.dirname(infiles[0]))
 >> elif extension == '.csv':
->>     data_source = CSVDataSource(os.path.dirname(InFiles[0]))
+>>     data_source = CSVDataSource(os.path.dirname(infiles[0]))
 >> else:
 >>     raise ValueError(f'Unsupported data file format: {extension}')
 >> analyse_data(data_source)
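Because `analyse_data()` depends only on the `load_inflammation_data()` method, any object that provides it can serve as a data source, including a hypothetical in-memory one, which is handy in tests. The sketch below is illustrative: `FakeDataSource` is not part of the lesson, and the analysis body is inlined with plain NumPy rather than the lesson's helpers:

```python
import numpy as np


class FakeDataSource:
    """A hypothetical in-memory data source (e.g. for tests).

    It only needs to provide load_inflammation_data(), the same interface
    that CSVDataSource and JSONDataSource implement.
    """
    def __init__(self, data):
        self.data = data

    def load_inflammation_data(self):
        return self.data


def analyse_data(data_source):
    """Simplified analysis body: std of the per-day means across data sets."""
    data = data_source.load_inflammation_data()
    means_by_day_matrix = np.stack([np.mean(d, axis=0) for d in data])
    return np.std(means_by_day_matrix, axis=0)


source = FakeDataSource([np.zeros((2, 3)), np.ones((2, 3))])
result = analyse_data(source)
print(result)  # → [0.5 0.5 0.5]
```

This is the payoff of the dependency injection introduced in this change: the analysis never needs to know whether its data came from CSV files, JSON files, or a list built in memory.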