# pandas: Reading & Writing Data

Ha Khanh Nguyen (hknguyen)

## 1. Reading & Writing Data in Text Format

• pandas features a number of functions for reading tabular data as a DataFrame object.
• We have used read_csv(), which is likely the one you will use most often.

• The main purpose of these functions is to convert text data into a DataFrame.
• The optional arguments for these functions may fall into a few categories:
• Indexing: treat one or more columns as the index of the returned DataFrame, and decide whether to get the column names from the file, from the user, or not at all.
• Type inference & data conversion: includes user-defined value conversions and custom lists of missing value markers.
• Datetime parsing: combining date & time information spread over multiple columns into a single column in the result.
• Iterating: support for iterating over chunks of very large files.
• Unclean data issues: skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.

### 1.1 read_csv()

• The read_csv() function has many (mostly optional) arguments.
• See the pandas documentation for read_csv() for the full list.
• The only required argument for read_csv() is the file path.
• Often, the file might not have a header row. Since the default is to read the first row of the file as the header, we need to set the header argument to None:
• To specify the column names, assign a list of column names to the names argument:
• Suppose we want to use the Variety column as the index of the returned DataFrame: pass it to the index_col argument.
• We can also skip rows of the input file with the skiprows argument:
• Be careful with files containing a header!
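The options above can be sketched together on a small inline example. The data below is a hypothetical stand-in for a headerless CSV file (the variety names and column names are made up for illustration); io.StringIO lets us treat the string as a file:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file without a header row.
raw = "Setosa,5.1,3.5\nVersicolor,7.0,3.2\nVirginica,6.3,3.3\n"

# header=None: don't treat the first row of the file as column names
df = pd.read_csv(io.StringIO(raw), header=None)

# names=: supply the column names ourselves
df = pd.read_csv(io.StringIO(raw),
                 names=["Variety", "SepalLength", "SepalWidth"])

# index_col=: use the Variety column as the index of the returned DataFrame
df = pd.read_csv(io.StringIO(raw),
                 names=["Variety", "SepalLength", "SepalWidth"],
                 index_col="Variety")
print(df.index.tolist())   # → ['Setosa', 'Versicolor', 'Virginica']

# skiprows=: skip specific rows of the input (here, the second line)
df2 = pd.read_csv(io.StringIO(raw), header=None, skiprows=[1])
print(len(df2))            # → 2
```

With a real file you would pass the file path instead of the StringIO buffer; everything else stays the same.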

### 1.2 read_table()

• The read_table() function is very similar to read_csv(). The only big difference is that the default delimiter is the tab character \t instead of the comma ,.
• It is also used to read files with some abnormal delimiters (later).
• The arguments discussed in 1.1 also work the same for read_table().
• Now consider the case where the table might not have a fixed delimiter, using whitespace or some other pattern to separate fields.

Notes: I used the Unix (terminal/shell) command cat here to print the raw content of the ex.txt file on Mac. You can do the same thing using the type command on Windows.

• The fields (columns) are separated by a variable amount of whitespace!
• In this case, we need to use a regular expression (which we will learn in depth next week) as the delimiter for read_table().
• \s+ is the regular expression for one or more whitespace characters. The + sign signals that there may be more than one.
• Also note that pandas used the first column as the index of the df DataFrame! How did it know?
• Since there was one fewer column name than the number of data columns, read_table() infers that the first column should be the DataFrame's index in this special case.
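The behavior above can be sketched with an inline stand-in for ex.txt (the contents below are hypothetical, but shaped the same way: three column names over four data columns, separated by varying runs of spaces):

```python
import io
import pandas as pd

# Hypothetical whitespace-separated content: the header row has one
# fewer name than the data rows have fields.
raw = """A    B   C
one    1   2   3
two    4   5   6
three  7   8   9
"""

# sep=r"\s+" treats each run of one or more whitespace characters
# as a single delimiter
df = pd.read_table(io.StringIO(raw), sep=r"\s+")

# Because there is one fewer column name than data columns, pandas
# uses the first column as the index.
print(df.index.tolist())    # → ['one', 'two', 'three']
print(df.columns.tolist())  # → ['A', 'B', 'C']
```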

### 1.3 Handling missing values

• Missing data is usually either not present (empty strings) or marked by some sentinel value.
• By default, pandas uses a set of commonly occurring sentinels, such as NA and NULL:
• The isnull() method returns a DataFrame of the same shape as the original, filled with Boolean values: True where a value is missing.
• The na_values option of the read_csv() function can take a list or set of strings to treat as missing values.
• This is very useful when missing data is recorded in a specific way: -1 instead of NULL, "NA" as a string, etc.
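A minimal sketch of na_values and isnull(), using a hypothetical inline table where missing values were recorded as -1, "NA", and "NULL":

```python
import io
import pandas as pd

# Hypothetical data with three different missing-value sentinels
raw = "a,b,c\n1,NA,3\n-1,5,6\n7,8,NULL\n"

# na_values= lists the extra strings to treat as missing
df = pd.read_csv(io.StringIO(raw), na_values=["-1", "NA", "NULL"])

# isnull() marks each missing cell as True
mask = df.isnull()
print(mask["a"].tolist())  # → [False, True, False]
```

Note that "NA" and "NULL" are already in pandas' default set of sentinels; only "-1" actually extends it here.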

## 2. Writing Data to Text Format

• Data can also be exported to a delimited format.
• Using the to_csv() method, we can write the data out to a comma-separated file:
• A lot of the time, we actually don't want to output the index column (since often it's just row numbers).
• By default, the missing values are coded as empty strings.
• We can also denote them by some other chosen sentinel value:
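A short sketch of these to_csv() options, writing to an in-memory buffer instead of a file (the small DataFrame is hypothetical):

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", None]})

# index=False suppresses the index column;
# missing values become empty strings by default
buf = io.StringIO()
df.to_csv(buf, index=False)

# na_rep= writes missing values as a chosen sentinel instead
buf2 = io.StringIO()
df.to_csv(buf2, index=False, na_rep="NULL")
print(buf2.getvalue())
```

Passing a file path instead of the StringIO buffer writes the same content to disk.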

## 3. JSON Data

• JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications.
• It is a much more free-form data format than a tabular text format like CSV.
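When JSON data is record-oriented, pandas can still load it into a DataFrame with read_json(). A minimal sketch, using a hypothetical inline JSON document:

```python
import io
import pandas as pd

# A small hypothetical JSON array of records
raw = '[{"name": "Ann", "age": 30}, {"name": "Bob", "age": 25}]'

# read_json converts the JSON records into a DataFrame,
# one row per object and one column per key
df = pd.read_json(io.StringIO(raw))
print(df.shape)  # → (2, 2)
```

More deeply nested JSON is where the "free-form" nature shows: it may need reshaping before it fits a flat table.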

These lecture notes reference material from Chapter 6 of Wes McKinney's Python for Data Analysis, 2nd Edition.