Commit 0330e5f

Update pandas.rst ("finishing up")
1 parent 13f03b4 commit 0330e5f

1 file changed: docs/day3/pandas.rst
Lines changed: 81 additions & 32 deletions
@@ -321,13 +321,16 @@ In most reader functions, including ``index_col=0`` sets the first column as the

.. challenge::

   Open your preferred IDE and load the provided file ``exoplanets_5250_EarthUnits_fixed.csv`` into DataFrame ``df`` (optionally, you can set the first column to be the index). Then, save ``df`` to a text (.txt) file with a tab (``\t``) separator.

   .. solution:: Solution
      :class: dropdown

      .. code-block:: python

         import pandas as pd

         df = pd.read_csv('exoplanets_5250_EarthUnits_fixed.csv', index_col=0)
         df.to_csv('./docs/day3/exoplanets_5250_EarthUnits.txt', sep='\t', index=True)

**Creating DataFrames in Python.** Building a DataFrame or Series from scratch is also easy. Lists and arrays can be converted directly to Series and DataFrames, respectively.
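As a quick sketch of these direct conversions (the values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# A list converts directly to a Series (1D, with an auto-generated integer index)
s = pd.Series([0.5, 1.2, 3.4])

# A 2D array converts directly to a DataFrame (2D, with optional row/column labels)
df = pd.DataFrame(np.zeros((2, 3)), index=['r1', 'r2'], columns=['a', 'b', 'c'])

print(s)
print(df)
```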
@@ -338,14 +341,17 @@ In most reader functions, including ``index_col=0`` sets the first column as the

.. challenge::

   In your preferred IDE or at the command line, create a DataFrame with 4 rows labeled ``['w','x','y','z']`` and 3 columns labeled ``['a','b','c']``. The contents of the cells are up to you. Print the result and verify that it has the right shape.

   .. solution:: Solution
      :class: dropdown

      .. jupyter-execute::

         import numpy as np
         import pandas as pd

         df = pd.DataFrame(np.arange(1, 13).reshape((4, 3)), index=['w','x','y','z'], columns=['a','b','c'])
         print(df)

It is also possible (and occasionally necessary) to convert DataFrames and Series to NumPy arrays, dictionaries, record arrays, or strings with the methods ``.to_numpy()``, ``.to_dict()``, ``.to_records()``, and ``.to_string()``, respectively.
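A brief sketch of those four conversions, using a toy DataFrame invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]}, index=['x', 'y'])

arr = df.to_numpy()    # 2D NumPy array; row/column labels are dropped
d   = df.to_dict()     # nested dict: {column -> {row label -> value}}
rec = df.to_records()  # NumPy record array; the index becomes a named field
txt = df.to_string()   # plain-text rendering of the whole table

print(arr.shape)
print(d['a'])
```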

@@ -466,18 +472,26 @@ Efficient Data Types

- When an order is asserted, it becomes possible to use ``.min()`` and ``.max()`` on the categories.
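For instance (a minimal sketch with an invented column):

```python
import pandas as pd

# An ordered Categorical: categories listed smallest to largest
sizes = pd.Series(pd.Categorical(
    ['medium', 'small', 'large', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True,
))

# .min()/.max() respect the asserted category order, not alphabetical order
print(sizes.min())  # small
print(sizes.max())  # large
```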

.. challenge::

   Take the exoplanets DataFrame that you loaded before and print the memory usage with ``deep=True``. Then, convert the ``'planet_type'`` column to datatype ``category``. Print the memory usage again and compare the changed column's estimated size in memory to the original.

   .. solution:: Solution
      :class: dropdown

      .. jupyter-execute::

         import pandas as pd
         import numpy as np

         df = pd.read_csv('./docs/day3/exoplanets_5250_EarthUnits_fixed.csv', index_col=0)
         print("Before:\n", df['planet_type'].memory_usage(deep=True))
         # Convert planet_type to Categorical
         ptypes = df['planet_type'].astype('category')
         print("After:\n", ptypes.memory_usage(deep=True))
         # Assert an order (coincidentally, alphabetical order is also reverse mass order)
         ptypes = ptypes.cat.reorder_categories(ptypes.cat.categories[::-1], ordered=True)
         print(ptypes)

Numerical data can be recast as categorical by binning it with ``pd.cut()`` or ``pd.qcut()``, and these bins can be used to create GroupBy objects. Bins created like this are automatically assumed to be in ascending order. However, some mathematical operations may no longer work on the results.
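As a sketch of binning with ``pd.qcut()``, using random stand-in data rather than the exoplanets table:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
masses = pd.Series(rng.uniform(0.1, 300.0, size=99))  # made-up values

# qcut makes 3 bins with roughly equal counts; pd.cut would instead make
# equal-width bins over the value range
bins = pd.qcut(masses, q=3, labels=['light', 'medium', 'heavy'])

# The resulting Categorical is ordered and can serve as the key of a GroupBy
print(masses.groupby(bins, observed=True).mean())
```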

@@ -508,7 +522,7 @@ This example shows the difference in memory usage between a 1000x1000 identity m

Numba (JIT Compilation)
^^^^^^^^^^^^^^^^^^^^^^^

If the Numba module is installed, setting ``engine="numba"`` in most built-in functions can boost performance if the function has to be run multiple times over several columns, particularly if you can also set ``engine_kwargs={"parallel": True}``. Numba uses Just-In-Time (JIT) compilation to compile pure Python code to a machine-optimized form at runtime, automatically incorporating multi-threading across available CPU cores if parallelism is enabled. Types of functions that this works for include:

* Statistical functions like ``mean()``, ``median()``, and ``std()``, which can be applied to the whole data set or to rolling windows.
* Complex functions, or user-defined functions decorated with ``@jit``, applied via ``.agg()``, ``.transform()``, ``.map()``, or ``.apply()``.
@@ -519,18 +533,18 @@ Parallel function evaluation occurs column-wise, so **performance will be booste

Since JIT-compiled functions are parallelized column-wise, make sure that the number of threads allocated to your interactive session or Slurm script and the number of threads passed to Numba are both equal to the number of columns you want to process in parallel. Assuming you have imported Numba as ``numba``, tell Numba how many threads to use with ``numba.set_num_threads(ncols)``, where ``ncols`` is the number of columns to apply the function to in parallel.

Here is a (somewhat scientifically nonsensical) example using a DataFrame of various COVID-19 statistics across Italy's 21 administrative regions over the first 2 years of the pandemic. Columns 6-15 are statistics for which a rolling mean might make sense (if we normalized by regional population).

.. jupyter-execute::

   import numpy as np
   import pandas as pd

   df = pd.read_csv('./docs/day3/covid19_italy_region.csv', index_col=0)
   import numba  # on Cosmos, this requires a conda environment with Numba installed
   numba.set_num_threads(10)
   stuff = df.iloc[:, 6:]
   %timeit stuff.rolling(630).mean()  # 30-day rolling average (630 rows = 21 regions x 30 days)
   %timeit stuff.rolling(630).mean(engine='numba', engine_kwargs={"parallel": True})

.. tip::

@@ -589,8 +603,43 @@ While loaded, chunks can be indexed and manipulated like full-sized DataFrames.

Workflows that can be applied to chunks can also be used to aggregate over multiple files, so it may also be worth breaking a single out-of-memory file into logical subsections that individually fit in memory. `The Pandas documentation on chunking chooses this method of demonstration <https://pandas.pydata.org/docs/user_guide/scale.html#use-chunking>`__ rather than showing how to iterate over chunks loaded from an individual file.

The following example uses the table ``global_disaster_response_2018-2024.csv``, which is not out-of-memory for a typical HPC cluster but is fairly large. The data were not in any particular order, but there are 50000 rows spread fairly evenly over the time period, so this example uses chunks of 5000 rows.

.. jupyter-execute::

   import pandas as pd
   import numpy as np

   loss_sum = 0
   for chunk in pd.read_csv('./docs/day3/global_disaster_response_2018-2024.csv',
                            chunksize=5000):
       loss_sum += chunk['economic_loss_usd'].sum()
   print('total loss over all disasters in this database: $', np.round(loss_sum / 10**9, 2), 'billion USD')

.. caution::

   Chunking with Pandas alone works only when no coordination is required between chunks. Functions that apply independently to every row are ideal. Some aggregate statistics can be calculated if care is taken to ensure either that all chunks are of identical size or that different-sized chunks are reweighted appropriately. However, if your data have natural groupings where group membership is not known by position a priori, or where each group is itself larger than memory, you may be better off using Dask or other libraries.
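For example, one way to reweight correctly is to accumulate a running sum and count instead of averaging per-chunk means. A self-contained sketch, using a small in-memory CSV as a stand-in for a large file:

```python
import io
import pandas as pd

# Stand-in for a large file; the last chunk is deliberately shorter
csv_text = "x\n" + "\n".join(str(i) for i in range(10))

total, count = 0.0, 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):  # chunks of 4, 4, 2 rows
    total += chunk['x'].sum()
    count += len(chunk)

print(total / count)  # 4.5, the exact global mean of 0..9
# Naively averaging the three chunk means (1.5, 5.5, 8.5) would give ~5.17,
# because the short final chunk would get too much weight
```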

.. challenge::

   Use chunks of 10000 rows to accumulate a sum over the ``'casualties'`` column of the ``global_disaster_response_2018-2024.csv`` file.

   .. solution:: Solution
      :class: dropdown

      .. code-block:: python

         import pandas as pd

         cas_sum = 0
         for chunk in pd.read_csv('./docs/day3/global_disaster_response_2018-2024.csv',
                                  chunksize=10000):
             cas_sum += chunk['casualties'].sum()

.. keypoints::

   - Pandas lets you construct list- or table-like data structures with mixed data types, the contents of which can be indexed by arbitrary row and column labels.
   - The main data structures are Series (1D) and DataFrames (2D). Each column of a DataFrame is a Series.
   - Data are selected primarily using ``.loc[]`` and ``.iloc[]``, unless you're grabbing whole columns (then the syntax is dict-like).
   - Hundreds of attributes and methods can be called on Pandas data structures to inspect, clean, organize, combine, and apply functions to them, including nearly all NumPy ufuncs (universal functions).
   - ``Categorical`` and ``SparseDtype`` datatypes can help you reduce the memory footprint of your data.
   - Most Pandas methods that apply a function can be sped up by multithreading with Numba if they are applied over multiple columns; just set ``engine="numba"`` and ``engine_kwargs={"parallel": True}``.
   - Pandas includes a built-in function to convert categorical data columns to dummy variables for machine-learning input.
   - Several Pandas reader/writer functions support chunking, i.e., loading subsets of data files that would otherwise not fit in memory.
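The dummy-variable keypoint refers to ``pd.get_dummies()``; a minimal sketch with an invented column:

```python
import pandas as pd

df = pd.DataFrame({'planet_type': ['Gas Giant', 'Terrestrial', 'Gas Giant']})

# One indicator column per category, suitable as machine-learning input
dummies = pd.get_dummies(df['planet_type'])
print(dummies)
```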
