.. _io:

.. currentmodule:: pandas

===============================
IO tools (text, CSV, HDF5, ...)
===============================

The pandas I/O API is a set of top level ``reader`` functions accessed like
:func:`pandas.read_csv` that generally return a pandas object. The corresponding
``writer`` functions are object methods that are accessed like
:meth:`DataFrame.to_csv`. Below is a table containing available ``readers`` and
``writers``.

.. csv-table::
    :header: "Format Type", "Data Description", "Reader", "Writer"
    :widths: 30, 100, 60, 60

    text,`CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`__, :ref:`read_csv<io.read_csv_table>`, :ref:`to_csv<io.store_in_csv>`
    text,Fixed-Width Text File, :ref:`read_fwf<io.fwf_reader>`, NA
    text,`JSON <https://www.json.org/>`__, :ref:`read_json<io.json_reader>`, :ref:`to_json<io.json_writer>`
    text,`HTML <https://en.wikipedia.org/wiki/HTML>`__, :ref:`read_html<io.html>`, :ref:`to_html<io.html>`
    text,`LaTeX <https://en.wikipedia.org/wiki/LaTeX>`__, NA, :ref:`Styler.to_latex<io.latex>`
    text,`XML <https://www.w3.org/standards/xml/core>`__, :ref:`read_xml<io.xml>`, :ref:`to_xml<io.xml>`
    text,Local clipboard, :ref:`read_clipboard<io.clipboard>`, :ref:`to_clipboard<io.clipboard>`
    binary,`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__, :ref:`read_excel<io.excel_reader>`, :ref:`to_excel<io.excel_writer>`
    binary,`OpenDocument <http://opendocumentformat.org>`__, :ref:`read_excel<io.ods>`, NA
    binary,`Python Pickle Format <https://docs.python.org/3/library/pickle.html>`__, :ref:`read_pickle<io.pickle>`, :ref:`to_pickle<io.pickle>`
    SQL,`SQL <https://en.wikipedia.org/wiki/SQL>`__, :ref:`read_sql<io.sql>`, :ref:`to_sql<io.sql>`

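As a quick orientation, every ``reader`` has a matching ``writer``, so data can
make a round trip. A minimal sketch (``data.csv`` is a placeholder file name):

.. code-block:: python

    import pandas as pd

    df = pd.read_csv("data.csv")  # top-level reader function
    df.to_csv("data_out.csv", index=False)  # writer method on the DataFrame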

.. _io.google_colab:

Google Colab
^^^^^^^^^^^^

Google Colab provides several ways to load data into pandas DataFrames:

**Upload files directly**

.. code-block:: python

    import pandas as pd
    from google.colab import files

    uploaded = files.upload()  # opens a browser file picker
    df = pd.read_csv("your_file.csv")

**Mount Google Drive**

.. code-block:: python

    from google.colab import drive

    drive.mount("/content/drive")
    df = pd.read_csv("/content/drive/MyDrive/your_file.csv")

**URLs work normally**

.. code-block:: python

    df = pd.read_csv("https://example.com/data.csv")

**Save/download**

.. code-block:: python

    from google.colab import files

    df.to_csv("/content/drive/MyDrive/output.csv", index=False)  # persist to Drive
    df.to_csv("output.csv", index=False)  # or write locally ...
    files.download("output.csv")  # ... and download it through the browser

Files in ``/content/`` are temporary and are lost when the runtime is recycled;
use Drive for persistence.

See Colab's `official I/O notebook <https://colab.research.google.com/notebooks/io.ipynb>`_.
:ref:`Here <io.perf>` is an informal performance comparison for some of these IO methods.

.. note::

   For examples that use the ``StringIO`` class, make sure you import it
   with ``from io import StringIO`` for Python 3.

.. _io.read_csv_table:

CSV & text files
----------------

The workhorse function for reading text files (a.k.a. flat files) is
:func:`read_csv`. See the :ref:`cookbook<cookbook.csv>` for some advanced strategies.

Parsing options
***************

:func:`read_csv` accepts the following common arguments:

Basic
+++++

filepath_or_buffer : various
  Either a path to a file (a :class:`python:str`, :class:`python:pathlib.Path`),
  URL (including http, ftp, and S3 locations), or any object with a ``read()``
  method (such as an open file handle or :class:`~python:io.StringIO`).
sep : str, defaults to ``','`` for :func:`read_csv`, ``\t`` for :func:`read_table`
  Delimiter to use (see the sketch after this list).
delimiter : str, default ``None``
  Alternative argument name for sep.

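A minimal sketch of the arguments above, using illustrative inline data and a
non-default separator:

.. code-block:: python

    import pandas as pd
    from io import StringIO

    data = "a\tb\n1\t2"
    pd.read_csv(StringIO(data), sep="\t")  # tab-separated input
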

Column and index locations and names
++++++++++++++++++++++++++++++++++++

header : int or list of ints, default ``'infer'``
  Row number(s) to use as the column names, and the start of the
  data. Default behavior is to infer the column names: if no names are
  passed the behavior is identical to ``header=0`` and column names are
  inferred from the first line of the file, if column names are passed
  explicitly then the behavior is identical to
  ``header=None``. Explicitly pass ``header=0`` to be able to replace
  existing names.

  The header can be a list of ints that specify row locations
  for a MultiIndex on the columns e.g. ``[0,1,3]``. Intervening rows
  that are not specified will be skipped (e.g. 2 in this example is
  skipped).
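
  For instance, a sketch of a two-row header producing ``MultiIndex`` columns
  (the inline data and labels here are illustrative):

  .. code-block:: python

     import pandas as pd
     from io import StringIO

     data = "a,a,b\nx,y,z\n1,2,3"
     pd.read_csv(StringIO(data), header=[0, 1])  # columns: (a, x), (a, y), (b, z)
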
index_col : int, str, sequence of int / str, or False, optional, default ``None``
  Column(s) to use as the row labels of the ``DataFrame``, either given as
  string name or column index. If a sequence of int / str is given, a
  MultiIndex is used.

  .. note::
     ``index_col=False`` can be used to force pandas to *not* use the first
     column as the index, e.g. when you have a malformed file with delimiters at
     the end of each line.

  The default value of ``None`` instructs pandas to guess. If the number of
  fields in the column header row is equal to the number of fields in the body
  of the data file, then a default index is used. If it is larger, then
  the first columns are used as index so that the remaining number of fields in
  the body are equal to the number of fields in the header.

  The first row after the header is used to determine the number of columns,
  which will go into the index. If the subsequent rows contain fewer columns
  than the first row, they are filled with ``NaN``.

  This can be avoided through ``usecols``. This ensures that the columns are
  taken as is and the trailing data are ignored.
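
  For instance, a sketch of this guessing behavior on a malformed file with
  trailing delimiters, and ``index_col=False`` to override it:

  .. code-block:: python

     import pandas as pd
     from io import StringIO

     data = "a,b,c\n4,apple,bat,\n8,orange,cow,"
     pd.read_csv(StringIO(data))                   # first column becomes the index
     pd.read_csv(StringIO(data), index_col=False)  # keep a default integer index
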
usecols : list-like or callable, default ``None``
  Return a subset of the columns. If list-like, all elements must either
  be positional (i.e. integer indices into the document columns) or strings
  that correspond to column names provided either by the user in ``names`` or
  inferred from the document header row(s). If ``names`` are given, the document
  header row(s) are not taken into account. For example, a valid list-like
  ``usecols`` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``.

  Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``. To
  instantiate a DataFrame from ``data`` with element order preserved use
  ``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns
  in ``['foo', 'bar']`` order or
  ``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]`` for
  ``['bar', 'foo']`` order.

  If callable, the callable function will be evaluated against the column names,
  returning names where the callable function evaluates to True:

  .. ipython:: python

     import pandas as pd
     from io import StringIO

     data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"
     pd.read_csv(StringIO(data))
     pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["COL1", "COL3"])

  Using this parameter results in much faster parsing time and lower memory usage
  when using the c engine. The Python engine loads the data first before deciding
  which columns to drop.

General parsing configuration
+++++++++++++++++++++++++++++

dtype : Type name or dict of column -> type, default ``None``
  Data type for data or columns. E.g. ``{'a': np.float64, 'b': np.int32, 'c': 'Int64'}``
  Use ``str`` or ``object`` together with suitable ``na_values`` settings to preserve
  and not interpret dtype. If converters are specified, they will be applied
  INSTEAD of dtype conversion. Specify a ``collections.defaultdict`` as input where
  the default determines the dtype of the columns which are not explicitly
  listed.

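  A minimal sketch (inline data; the column names are illustrative):

  .. code-block:: python

     import pandas as pd
     from io import StringIO

     data = "a,b\n1,2\n3,4"
     df = pd.read_csv(StringIO(data), dtype={"a": "float64", "b": "Int64"})
     df.dtypes  # a -> float64, b -> Int64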

dtype_backend : {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames
  Which dtype_backend to use, e.g. whether a DataFrame should have NumPy
  arrays. Nullable dtypes are used for all dtypes that have a nullable
  implementation when ``"numpy_nullable"`` is set, and pyarrow is used for all
  dtypes if ``"pyarrow"`` is set.

  The dtype_backends are still experimental.

  .. versionadded:: 2.0

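  For example, a sketch requesting the nullable-dtype backend (the ``"pyarrow"``
  backend additionally requires ``pyarrow`` to be installed):

  .. code-block:: python

     import pandas as pd
     from io import StringIO

     data = "a,b\n1,\n3,4"
     pd.read_csv(StringIO(data), dtype_backend="numpy_nullable").dtypes  # Int64, Int64
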
engine : {``'c'``, ``'python'``, ``'pyarrow'``}
  Parser engine to use. The C and pyarrow engines are faster, while the python engine
  is currently more feature-complete. Multithreading is currently only supported by
  the pyarrow engine.
true_values : list, default ``None``
  Values to consider as ``True``.
false_values : list, default ``None``
  Values to consider as ``False``.
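
  A minimal sketch; in this illustrative data both columns convert to boolean
  because every value is covered by ``true_values``/``false_values``:

  .. code-block:: python

     import pandas as pd
     from io import StringIO

     data = "a,b\nYES,NO\nYES,YES"
     pd.read_csv(StringIO(data), true_values=["YES"], false_values=["NO"])
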
skipinitialspace : boolean, default ``False``
  Skip spaces after delimiter.
skiprows : list-like or integer, default ``None``
  Line numbers to skip (0-indexed) or number of lines to skip (int) at the start
  of the file.
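
  For instance, a sketch that skips an illustrative metadata line at the top of
  a file:

  .. code-block:: python

     import pandas as pd
     from io import StringIO

     data = "generated by sensor 7\na,b\n1,2"
     pd.read_csv(StringIO(data), skiprows=1)  # the 'a,b' line becomes the header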