DataFrames
Basics
A DataFrame is the star of the pandas package; many of our pandas guides are simply building blocks for understanding DataFrames.
The standard practice for DataFrames is reading a file and saving it to a variable, taking a glimpse at its contents, and then using a wide variety of methods to manipulate the data toward whatever goal you have.
Viewing Data
Since we have another page devoted to reading .csv data, we’ll start by creating a simple DataFrame for analysis:
import pandas as pd
fruits = pd.DataFrame([[1, 'pear'], [2, 'apple'], [3, 'orange']], columns=['count', 'fruit'])
head()
The head function lets us look at the first n rows of the DataFrame, with the default being 5. head is one of the best ways to get a feeling for the data you’ll be analyzing. In our case, fruits has fewer than 5 rows, so the whole DataFrame will be displayed.
Unlike R’s |
print(fruits.head())
   count   fruit
0      1    pear
1      2   apple
2      3  orange
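To make the n argument concrete, here is a minimal sketch that rebuilds the same fruits DataFrame and asks for only the first two rows. The variable name first_two is ours, chosen purely for illustration.

```python
import pandas as pd

# Rebuild the example DataFrame used in this section.
fruits = pd.DataFrame([[1, 'pear'], [2, 'apple'], [3, 'orange']], columns=['count', 'fruit'])

# Pass n explicitly to limit the preview to the first two rows.
first_two = fruits.head(2)
print(first_two)

# Asking for more rows than exist simply returns every row.
print(len(fruits.head(10)))
```

On a real dataset, a small n keeps the preview short while still showing the column names and a few typical values.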
shape
In most cases, your DataFrame will be more than 5 rows long (otherwise it wouldn’t be very useful), and sometimes there are so many columns that standard output will not include all of them. We can use the shape attribute to see how many rows and columns we have:
print(fruits.shape)
(3, 2)
Functions (methods) have parentheses after the name, while attributes do not. There isn’t a great mnemonic for knowing which are attributes and which are functions, but there is always the official documentation (or simply trial and error).
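The difference between an attribute and a function can also be checked in code. The sketch below unpacks shape into two variables (n_rows and n_cols are our own names) and uses Python's built-in callable() as one possible way to tell the two apart.

```python
import pandas as pd

fruits = pd.DataFrame([[1, 'pear'], [2, 'apple'], [3, 'orange']], columns=['count', 'fruit'])

# shape is an attribute: a plain (rows, columns) tuple, read without parentheses.
n_rows, n_cols = fruits.shape
print(n_rows, n_cols)

# head is a function (method): it has to be called with parentheses.
preview = fruits.head()

# callable() reports whether something can be called like a function.
print(callable(fruits.head))   # True
print(callable(fruits.shape))  # False
```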
Manipulating Data
Generally speaking, data manipulation happens with functions and not attributes, given the fact that most manipulations require parameters to specify what is being changed or how to change the data.
rename()
What if we read some data and the column names aren’t in our desired format, or are just plain unhelpful? Fortunately, we can use rename() to replace the names of certain columns, as defined by the columns dictionary parameter.
fruits = fruits.rename(columns={'fruit': 'groceries'})
Alternatively, we can pass the important inplace=True argument to make the change directly to the DataFrame instead of assigning the result back to a variable:
fruits.rename(columns={'fruit': 'groceries'}, inplace=True)
print(fruits.columns)
Index(['count', 'groceries'], dtype='object')
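To see the two styles of rename() side by side, the following sketch starts from a fresh copy of fruits each time. The variable name renamed is hypothetical and only there to show that the original is left untouched when inplace is not used.

```python
import pandas as pd

fruits = pd.DataFrame([[1, 'pear'], [2, 'apple'], [3, 'orange']], columns=['count', 'fruit'])

# Without inplace=True, rename returns a new DataFrame and leaves fruits alone.
renamed = fruits.rename(columns={'fruit': 'groceries'})
print(list(fruits.columns))   # ['count', 'fruit']
print(list(renamed.columns))  # ['count', 'groceries']

# With inplace=True, fruits itself is modified, so no assignment is needed.
fruits.rename(columns={'fruit': 'groceries'}, inplace=True)
print(list(fruits.columns))   # ['count', 'groceries']
```

Either style works; the point is to pick one per call rather than combining assignment with inplace=True.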
Keep in mind that |
DataFrame() Constructor
It’s a common occurrence in DataFrame tutorials to use a dictionary to create a DataFrame. When this is done, the keys become column names and the values become the entries in each column. The top example on the pandas DataFrame documentation is the following:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
print(df)
   col1  col2
0     1     3
1     2     4
Very easy! One benefit is that the constructor is adaptable: it can take many kinds of multi-dimensional, array-like objects and convert them to a DataFrame. Our chosen example is a list of dictionaries:
list_of_dicts = []
list_of_dicts.append({'columnA': 1, 'columnB': 2})
list_of_dicts.append({'columnB': 4, 'columnA': 1})
list_of_dicts.append({'columnA': 3, 'columnB': 1, 'columnC': 'hello'})
construction = pd.DataFrame(list_of_dicts)
print(construction.head())
   columnA  columnB columnC
0        1        2     NaN
1        1        4     NaN
2        3        1   hello
As you can see, the constructor adapted our data with little difficulty. Instead of throwing errors when the new columnC was introduced, it filled the other rows with NaN, and it was able to handle the dictionary entries being in different orders.
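The same behavior can be checked programmatically rather than by eye. This sketch rebuilds the list of dictionaries (under the name records, which is ours) and uses isna() to find the cells that the constructor filled in.

```python
import pandas as pd

# Same data as above, written as a list literal.
records = [
    {'columnA': 1, 'columnB': 2},
    {'columnB': 4, 'columnA': 1},
    {'columnA': 3, 'columnB': 1, 'columnC': 'hello'},
]
construction = pd.DataFrame(records)

# Keys are matched by name, not position, so the second row is still A=1, B=4.
print(construction.loc[1, 'columnA'], construction.loc[1, 'columnB'])

# isna() marks the cells that were filled in for missing keys.
print(construction['columnC'].isna().tolist())  # [True, True, False]
```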