.. challenge::

   Open your preferred IDE and load the provided file ``exoplanets_5250_EarthUnits_fixed.csv`` into DataFrame ``df`` (optionally, you can set the first column to be the indices). Then, save ``df`` to a text (.txt) file with a tab (``\t``) separator.
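A minimal sketch of that round trip, using a small inline table as a stand-in since the course CSV may not be on your machine (the column names here are illustrative):

```python
import pandas as pd
from io import StringIO

# Stand-in for the course CSV; the real exoplanets file may not be present here
csv_text = "name,mass,radius\nKepler-22b,9.1,2.4\nGJ 1214b,8.2,2.7\n"
df = pd.read_csv(StringIO(csv_text), index_col=0)  # first column becomes the index

# Save to a tab-separated text file
df.to_csv("exoplanets_tab.txt", sep="\t")
```

With the real file, the same two calls apply, just with the ``exoplanets_5250_EarthUnits_fixed.csv`` path in place of the ``StringIO`` object.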
**Creating DataFrames in Python.** Building a DataFrame or Series from scratch is also easy. Lists and arrays can be converted directly to Series and DataFrames, respectively.
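For instance, a minimal sketch of those direct conversions (the values are arbitrary):

```python
import numpy as np
import pandas as pd

s = pd.Series([0.5, 1.0, 1.5])   # a list becomes a Series
df = pd.DataFrame(np.eye(3))     # a 2D array becomes a DataFrame
```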
.. challenge::

   In your preferred IDE or at the command line, create a DataFrame with 4 rows labeled ``['w','x','y','z']`` and 3 columns labeled ``['a','b','c']``. The contents of the cells are up to you. Print the result and verify that it has the right shape.
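One possible solution sketch (the cell contents are arbitrary, as the challenge allows):

```python
import numpy as np
import pandas as pd

# Any cell contents work; a reshaped range is convenient
df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=['w', 'x', 'y', 'z'],
                  columns=['a', 'b', 'c'])
print(df)
print(df.shape)  # (4, 3)
```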
It is also possible (and occasionally necessary) to convert DataFrames and Series to NumPy arrays, dictionaries, record arrays, or strings with the methods ``.to_numpy()``, ``.to_dict()``, ``.to_records()``, and ``.to_string()``, respectively.
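A quick illustration of these converters on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
arr = df.to_numpy()     # 2D NumPy array (upcast to a common dtype, float64 here)
dct = df.to_dict()      # {'a': {0: 1, 1: 2}, 'b': {0: 3.0, 1: 4.0}}
rec = df.to_records()   # NumPy record array; the index becomes a field
txt = df.to_string()    # plain-text rendering of the table
```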
Efficient Data Types
--------------------

Take the exoplanets DataFrame that you loaded before and print the memory usage with ``deep=True``. Then, convert the ``'planet_type'`` column to datatype ``category``. Print the memory usage again and compare the changed column's estimated size in memory to the original.

Numerical data can be recast as categorical by binning it with ``pd.cut()`` or ``pd.qcut()``, and these bins can be used to create GroupBy objects. Bins created like this are automatically assumed to be in ascending order. However, some mathematical operations may no longer work on the results.
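A sketch of both ideas, using a synthetic stand-in for the ``'planet_type'`` column (the labels and bin edges here are hypothetical):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a column of repeated labels like 'planet_type'
rng = np.random.default_rng(1)
types = pd.Series(rng.choice(['Gas Giant', 'Super Earth', 'Terrestrial'],
                             size=10_000))

as_object = types.memory_usage(deep=True)
as_category = types.astype('category').memory_usage(deep=True)
print(as_object, as_category)  # the category version is far smaller

# Binning numeric data with pd.cut() yields ordered categories
masses = pd.Series(rng.uniform(0.1, 300.0, size=1_000))
bins = pd.cut(masses, bins=[0, 2, 10, 300],
              labels=['small', 'medium', 'large'])
print(bins.dtype)
```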
Numba (JIT Compilation)
^^^^^^^^^^^^^^^^^^^^^^^

If the Numba module is installed, setting ``engine="numba"`` in most built-in functions can boost performance if the function has to be run multiple times over several columns, particularly if you can set ``engine_kwargs={"parallel": True}``. Numba uses Just-In-Time (JIT) compilation to compile pure Python code to a machine-optimized form at runtime, automatically incorporating multi-threading across available CPU cores if parallelism is enabled. Types of functions that this works for include:
* Statistical functions like ``mean()``, ``median()``, and ``std()``, which can be applied to the whole data set or to rolling windows.
* Complex functions, or user-defined functions decorated with ``@jit``, applied via ``.agg()``, ``.transform()``, ``.map()``, or ``.apply()``.
Since JIT-compiled functions are parallelized column-wise, make sure that the number of threads allocated for any interactive session or slurm script and the number of threads passed to Numba are all equal to the number of columns you want to process in parallel. Assuming you have imported Numba as ``numba``, the way to tell Numba the number of threads to use is ``numba.set_num_threads(ncols)``, where ``ncols`` is the number of columns to apply the function to in parallel.

Here is a (somewhat scientifically nonsensical) example using a DataFrame of various COVID19 statistics across Italy's 21 administrative regions over the first 2 years of the pandemic. Columns 6-15 are statistics that a rolling mean might make sense for (if we normalized by Region population).
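Since the COVID19 table itself is not reproduced here, the following sketch applies the same pattern to a random stand-in DataFrame; the Numba path is guarded so the code still runs if Numba is not installed:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the statistics columns: 100 rows x 5 numeric columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 1000, size=(100, 5)),
                  columns=[f"stat_{i}" for i in range(5)])

try:
    import numba
    numba.set_num_threads(df.shape[1])  # one thread per column
    roll = df.rolling(7).mean(engine="numba",
                              engine_kwargs={"parallel": True})
except ImportError:
    # Fall back to the default Cython engine if Numba is unavailable
    roll = df.rolling(7).mean()

print(roll.shape)  # same shape as df; the first 6 rows are NaN
```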
While loaded, chunks can be indexed and manipulated like full-sized DataFrames.

Workflows that can be applied to chunks can also be used to aggregate over multiple files, so it may also be worth breaking a single out-of-memory file into logical subsections that individually fit in memory. `The Pandas documentation on chunking chooses this method of demonstration <https://pandas.pydata.org/docs/user_guide/scale.html#use-chunking>`__ rather than showing how to iterate over chunks loaded from an individual file.

The following example uses the table ``global_disaster_response_2018-2024.csv``, which is not out-of-memory for a typical HPC cluster but is fairly large. The data were not in any particular order, but there are 50000 rows spread fairly evenly over the time period, so this example uses chunks of 5000 rows.

.. jupyter-execute::

   import numpy as np
   import pandas as pd

   loss_sum = 0
   for chunk in pd.read_csv('./docs/day3/global_disaster_response_2018-2024.csv',
                            chunksize=5000):
       loss_sum += chunk['economic_loss_usd'].sum()
   print('total loss over all disasters in this database: $',
         np.round(loss_sum / 10**9, 2), 'billion USD')
.. caution::

   Chunking with Pandas alone works only when no coordination is required between chunks. Functions that apply independently to every row are ideal. Some aggregate statistics can be calculated if care is taken to make sure that either all chunks are of identical size or that different-sized chunks are reweighted appropriately. However, if your data have natural groupings where group membership is not known by position a priori, or where each group is itself larger than memory, you may be better off using Dask or other libraries.
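As a sketch of the reweighting point: order-free accumulators (a running sum and a running count) give the exact global mean even when the last chunk is smaller than the rest. The tiny inline table below stands in for a real file:

```python
import pandas as pd
from io import StringIO

# Ten rows of a single column x = 0..9; a real case would pass a file path
csv_text = "x\n" + "\n".join(str(i) for i in range(10))

total, count = 0.0, 0
for chunk in pd.read_csv(StringIO(csv_text), chunksize=4):  # chunks of 4, 4, 2
    total += chunk['x'].sum()
    count += len(chunk)

print(total / count)  # 4.5 -- correct mean despite unequal chunk sizes
```

Averaging the three per-chunk means directly would be wrong here, because the last chunk holds only two rows.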
.. challenge::

   Use chunks of 10000 rows to accumulate a sum over the ``'casualties'`` column of the ``global_disaster_response_2018-2024.csv`` file.

.. solution:: Solution
   :class: dropdown

   .. code-block:: python

      cas_sum = 0
      for chunk in pd.read_csv('./docs/day3/global_disaster_response_2018-2024.csv',
                               chunksize=10000):
          cas_sum += chunk['casualties'].sum()
+
.. keypoints::
637
+
638
+
- Pandas lets you construct list- or table-like data structures with mixed data types, the contents of which can be indexed by arbitrary row and column labels
639
+
- The main data structures are Series (1D) and DataFrames (2D). Each column of a DataFrame is a Series.
640
+
- Data is selected primarily using ``.loc[]`` and ``.iloc[]``, unless you're grabbing whole columns (then the syntax is dict-like).
641
+
- There are hundreds of attributes and methods that can be called on Pandas data structures to inspect, clean, organize, combine, and apply functions to them, including nearly all NumPy ufuncs (universal functions).
642
+
- ``Categorical`` and ``SparseDtype`` datatypes can help you reduce the memory footprint of your data.
643
+
- Most Pandas methods that apply a function can be sped up by multithreading with Numba, if they are applied over multiple columns. Just set ``engine=numba`` and ``engine_kwargs={"parallel": True}`` in the kwargs.
644
+
- Pandas includes a built-in function to convert categorical data columns to dummy variables for Machine Learning input.
645
+
- Several Pandas reader/writer functions support chunking, i.e., loading subsets of data files that would otherwise not fit in memory.
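The dummy-variable conversion mentioned in the keypoints is ``pd.get_dummies()``; a minimal illustration on a hypothetical categorical column:

```python
import pandas as pd

# One-hot encode a categorical column for ML input
df = pd.DataFrame({'planet_type': ['Gas Giant', 'Terrestrial', 'Gas Giant']})
dummies = pd.get_dummies(df['planet_type'])
print(dummies)  # one boolean column per category
```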