import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pydataset import data
The pandas library is used to deal with structured data stored in tables. You might acquire the structured data from CSV files, TSV files, SQL database tables, or spreadsheets. You can also create pandas DataFrames from scratch. A DataFrame can be thought of as a dictionary-like container for Series. Below, I will go over ways both to create DataFrame objects and to read data into a pandas DataFrame.
We saw that the pandas Series object is a one-dimensional, labeled array; the pandas DataFrame object is a two-dimensional, labeled data structure with columns of the same or different data types. A DataFrame is like a sequence of Series objects aligned by the same index. There are three main components that make up a pandas DataFrame: the index, the columns, and the data.
Let's create a DataFrame, so we can examine these components below.
The pandas DataFrame constructor with its default arguments:
pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Does this look familiar?
shopping_cart = {
"tax": .08,
"items": [
{
"title": "orange juice",
"price": 3.99,
"quantity": 1
},
{
"title": "rice",
"price": 1.99,
"quantity": 3
},
{
"title": "beans",
"price": 0.99,
"quantity": 3
},
{
"title": "chili sauce",
"price": 2.99,
"quantity": 1
},
{
"title": "chocolate",
"price": 0.75,
"quantity": 9
}
]
}
shopping_cart is a dictionary with two keys, tax and items, but the value for items happens to be a list of dictionaries, so depending on what I want from my data, I may need to unpack the items column. I can do that!
df = pd.DataFrame(shopping_cart)
df
I can unpack the values of the items column by applying pd.Series() to the items column. This returns a new DataFrame.
df['items'].apply(pd.Series)
I can unpack the items column and combine the original and new DataFrames using the pd.concat() function if I want the tax column and the new unpacked columns in the same DataFrame. I don't need the original items column anymore, so I can drop that column at the same time using the .drop() method.
pd.concat([df1, df2], axis=0, join='outer')
# I can transform my two DataFrames and then concatenate or do it all at once like below.
big_df = pd.concat([df.drop(['items'], axis=1), df['items'].apply(pd.Series)], axis=1)
big_df
# `items` IS already a list of dictionaries, so if I don't need tax, I can do this.
items = shopping_cart['items']
items
# I can pass my list of dictionaries as the data argument to `pd.DataFrame`.
cart_items = pd.DataFrame(items)
cart_items
type(cart_items)
Now that I have a pandas DataFrame to work with, I can look at the components of the DataFrame object using the .index, .columns, and .values attributes.
# Here is my default `RangeIndex` object.
cart_items.index
# My column labels are also an index object.
cart_items.columns
# My values are a two-dimensional NumPy array.
cart_items.values
fam = {'name':['Milla', 'Steve', 'Faith', 'Declan'],
'signs':['Virgo', 'Gemini', 'Aquarius', 'Aries'],
'age': [14, 44, 34, 7]}
fam
type(fam)
fam_df = pd.DataFrame(fam, index =['kane_1', 'kane_2', 'kane_3', 'kane_4'])
fam_df
fam_df.info()
I can use the pandas pd.read_sql() function to read the results of a SQL query into a pandas DataFrame.
df = pd.read_sql(sql_query, connection_url)
To do this, I will need to import host, password, and user from my env file.
from env import host, password, user
I can create a handy function to build the connection_url argument for my function. My function will take in a string for the name of the database, and the user, host, and password parameters have default arguments set to match the imports from my env file.
def get_connection(db, user=user, host=host, password=password):
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'
I need to write a valid SQL query.
sql_query = 'SELECT * FROM employees'
Putting it all together, I will end up with something like this:
def get_employees_data():
    return pd.read_sql(sql_query, get_connection('employees'))
Call my function and assign my data to the variable employees.
# This is commented out because after reading in data, I wrote it to a CSV file.
# employees = get_employees_data()
When I import a large dataset using a SQL query, it only takes three or four rounds of restarting my kernel and waiting minutes for the query to run before I decide I need to stop and write my new DataFrame to a CSV file I can access instantly.
I only need to run this code once to create a new CSV file in my current directory, and then I can read in my data as shown in the next section. Once I have my CSV file, I can comment out the code I was using to read in my data and write it to a CSV file.
df.to_csv('new_file_name.csv')
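As a sketch of how this caching pattern could be automated (the helper name and file name here are hypothetical, and it assumes the get_connection function defined above), I could check for the CSV before querying the database:
import os

def get_employees_data_cached(file_name='employees_df.csv'):
    # Hypothetical helper: read the cached CSV if it exists; otherwise query SQL and cache the result.
    if os.path.exists(file_name):
        return pd.read_csv(file_name, index_col=0)
    df = pd.read_sql('SELECT * FROM employees', get_connection('employees'))
    df.to_csv(file_name)
    return df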
# This is commented out because I only needed to create my CSV file once.
# employees.to_csv('employees_df.csv')
# Create my DataFrame reading from my own CSV file; way faster now.
employees = pd.read_csv('employees_df.csv', index_col=0)
employees.head()
I can use the pandas pd.read_csv() function to read the data from a CSV file into a pandas DataFrame.
df = pd.read_csv('your_data.csv', index_col=0)
student_df = pd.read_csv('interesting_data.csv')
student_df.head()
I have to import the following:
from pydataset import data
This code snippet will show me the information doc on the dataset.
data(df_string_name, show_doc=True)
This code snippet will load the dataset for use as a pandas DataFrame.
df = data(df_string_name)
There are 757 available datasets using pydataset. Running the following code snippet in a cell will return a DataFrame with all of your options.
data()
colors = data('HairEyeColor')
colors.head()
This is a super cool pandas function that allows you to create a DataFrame from data that you have just copied to your clipboard. Depending on the format of the data you copy, you may need to tweak some parameters to get your DataFrame exactly as you want. Instead of going into a bunch of detail here, check out this simple Medium article on how to use this function.
Already have a header? This works.
pd.read_clipboard()
No header on the data you copied? No problem, you can create your own with this.
pd.read_clipboard(header=None, names=desired_column_names)
A DataFrame can be composed of columns of different data types, it allows for indexing and subsetting of data, and comes with many useful attributes and methods adding an enormous array of functionality to this pandas object. One of the most impressive features of the pandas library is that it offers vectorized functions optimized to operate on Series and DataFrame objects. I explored what this means for the pandas Series object in the Series Review Notebook; this notebook will dig into what this means for a DataFrame object.
I'll explore all of these awesome features in depth throughout the rest of this notebook.
Most of the methods and attributes covered in my pandas Series review notebook also belong to pandas DataFrame objects. They may function slightly differently, have additional parameters and/or functionality, or return different objects, but you will find most of them useful whether working with Series or DataFrames. In this notebook, I will expand upon some of the methods and attributes previously covered, but for the most part my goal is to add to my existing toolbox.
# There are 433 methods and attributes of a pandas Series.
series_attribute_methods = set(dir(pd.Series))
len(series_attribute_methods)
# There are 430 methods and attributes of a pandas DataFrame.
df_attribute_methods = set(dir(pd.DataFrame))
len(df_attribute_methods)
# There are 377 methods and attributes that belong to BOTH pandas Series AND DataFrames.
len(series_attribute_methods & df_attribute_methods)
Like the pandas Series object, the pandas DataFrame object supports both position- and label-based indexing using the indexing operator [].
I will demonstrate concrete examples of indexing using the indexing operator [] alone and with the .loc and .iloc attributes below.
[]
I can use a list of columns from a DataFrame with the indexing operator (bracket notation) and return a subset of my original DataFrame.
# Peek at student_df
student_df.head(1)
student_df.shape
# Create a df that is a subset of the original student_df.
student_subset = student_df[['Country', 'Region']]
student_subset.head()
# I can choose columns and rows like this.
student_df[['Country', 'Region']][10:20]
# I can create a subset using a boolean Series; I will use this to subset my original df.
female_bool_series = student_df.Gender == 'Female'
female_bool_series.head()
# I use my boolean Series to select only observations where `Gender == Female`.
female_subset = student_df[female_bool_series]
female_subset.head()
.loc[]
I can use the .loc attribute to select specific rows AND columns by index label. My index label can be a number, but it can also be a string label. This offers a lot of flexibility! The .loc attribute's indexing is inclusive and uses index labels, not positions.
df.loc[row_indexer, column_indexer]
# I want the rows from start through label 5 and columns from Gender through Armspan_cm labels.
loc_subset = student_df.loc[:5, 'Gender': 'Armspan_cm']
loc_subset
loc_subset.shape
# Create a bool Series to select a subset of ambidextrous students using `.loc`
ambi_bool = student_df.Handed == 'Ambidextrous'
ambi_bool.head()
# Create subset of ambidextrous students using my bool Series
student_df[ambi_bool]
# I can select only students who are ambidextrous using my bool Series AND select specific columns using labels.
student_df.loc[ambi_bool, ['Gender', 'ClassGrade' ]]
Use a lambda function with the .loc attribute to subset my data.
# By popular request, lambda functions!! This is nice, nothing scary here, haha!
student_df.loc[lambda df: df['Region'] == 'TX']
.iloc[]
I can use the .iloc attribute to select specific rows and columns by index position; it is not used with boolean Series. It takes in integers representing index position and is NOT inclusive.
df.iloc[row_indexer, column_indexer]
student_df.iloc[90:, :3 ]
.set_index()
This method allows me to set my DataFrame index using an existing column. I can also pass a Series or Index object the same length as my DataFrame as an argument to create a new index. This will not change my original DataFrame because the default is inplace=False.
df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
# My original fam_df.
fam_df.head(1)
# I can set the `name` column to be my new index.
fam_df.set_index('name')
# Create a new index using a pandas Index object of the same length.
fam_df.set_index(pd.Index([1, 2, 3, 4]))
.reset_index()
This method will come in handy a lot as we move into methodologies, but for now I'll at least introduce it. This method does not change your original DataFrame unless you pass inplace=True; otherwise, just reassign or assign the copy.
df.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
# This resets the fam_df index to the default RangeIndex and adds my original index as a column.
fam_df.reset_index()
# I reset my index and rename the original index column if I want to keep it.
fam_df.reset_index().rename(columns={'index': 'id'})
# If I don't want the original index as a column, I can set `drop=True`.
fam_df.reset_index(drop=True)
.groupby()
This powerful method allows you to group your data by one or more columns and apply any type of function to each group, returning the calculations in a Series or DataFrame.
I can use a single grouping column, a single aggregating column, and a single aggregating function.
df.groupby('grouping_column').agg_column.agg_func()
student_df.head(1)
# Use a `groupby()` to calculate the average age by Gender; return a df by using [['double_brackets']].
student_df.groupby('Gender')[['Ageyears']].mean()
I can use a list of grouping columns and a single aggregating function.
df.groupby(['list', 'of', 'grouping', 'columns']).agg_func()
# Perform a multi-level groupby aggregation.
student_df.groupby(['Gender', 'Handed']).size()
.agg()
Chaining the .agg() method with a .groupby() provides more flexibility when I'm aggregating.
I can use a list of grouping columns and perform more than one function on my aggregating column.
df.groupby(['list', 'of', 'grouping', 'columns']).agg_column.agg_func(['func', 'other_func'])
student_df.groupby(['Gender', 'Handed']).Ageyears.agg(['mean', 'median'])
I can use a list of grouping columns and pass a dictionary to the .agg() method to use different aggregating functions on different columns.
student_df.groupby(['Gender', 'Handed']).agg({'Ageyears': 'mean', 'Text_Messages_Sent_Yesterday': 'median'})
.concat()
This function takes in a list or dictionary of Series or DataFrame objects and joins them along a particular axis, row or column.
When concatenating all Series along the index (axis=0), a Series is returned.
When objs contains at least one DataFrame, a DataFrame is returned.
When concatenating Series or DataFrames along the columns (axis=1), a DataFrame is returned.
pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None,
levels=None, names=None, verify_integrity=False, copy=True)
Combine two DataFrame objects with identical columns:
fam_df
# Create a list of dictionaries to be new rows in a DataFrame concatenated to my original fam_df.
new = [{'name': 'Penny', 'signs': 'Libra', 'age': '0'},
{'name': 'Betty', 'signs': 'Libra', 'age': '1'},
{'name': 'Pris', 'signs': 'Scorpio', 'age': '2'}]
# Create new_df using my list of dictionaries above.
new_df = pd.DataFrame(new, index=['kane_5', 'kane_6', 'kane_7'])
new_df
# Concatenate my new_df to my original fam_df; the default, `axis=0`, will stack these dfs.
fam_df = pd.concat([fam_df, new_df])
fam_df
Combine two DataFrame objects with identical index labels:
new_cols_df = pd.DataFrame({'eyes': ['brown', 'brown', 'blue', 'brown', 'amber', 'brown', 'hazel'],
'hair': ['brown', 'black', 'blonde', 'red', 'red', 'black', 'red']},
index=['kane_1', 'kane_2', 'kane_3','kane_4', 'kane_5', 'kane_6', 'kane_7'])
new_cols_df
fam_df = pd.concat([fam_df, new_cols_df], axis=1, sort=False)
fam_df
df.merge()
This method is similar to a SQL join. Here's a cool read making a comparison between the two, if you're interested.
left_df.merge(right_df, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
How does changing the argument to the how parameter change my resulting DataFrame? (See the sketch after this list.)
Type of merge to be performed:
how='left': use only keys from the left frame, similar to a SQL left outer join; preserves key order.
how='right': use only keys from the right frame, similar to a SQL right outer join; preserves key order.
how='outer': use the union of keys from both frames, similar to a SQL full outer join; sorts keys lexicographically.
how='inner': use the intersection of keys from both frames, similar to a SQL inner join; preserves the order of the left keys.
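To see how the how argument changes the result, here is a minimal sketch with two tiny, hypothetical DataFrames:
# Hypothetical frames: `left` has keys 1-3, `right` has keys 2-4.
left = pd.DataFrame({'key': [1, 2, 3], 'l_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'r_val': ['x', 'y', 'z']})
# An inner merge keeps only keys 2 and 3; a left merge keeps 1, 2, 3 with NaN in r_val for key 1.
left.merge(right, on='key', how='inner')
left.merge(right, on='key', how='left')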
Merge on a label or list
The default argument for the on parameter is None, so if I am not merging on the indexes of the Series or DataFrame objects I'm joining, the merge defaults to the intersection of the columns in both objects. Otherwise, I can pass the column name(s) or index level to join on.
# Read in some data from a CSV file to create my `titles` DataFrame.
titles = pd.read_csv('titles.csv', index_col=0)
titles.head(2)
# Peek at columns in table I want to merge with.
employees.head(2)
# Merge employees and titles DataFrames on `emp_no` column.
all_emp_titles = employees.merge(titles, on='emp_no')
all_emp_titles.head()
.drop()
The .drop() method allows me to drop rows or columns from a DataFrame. I can specify a single label or a list of labels to drop; the default for the axis parameter is axis=0 (rows).
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
If I want to drop columns, I can adjust the axis parameter or, simpler, just pass a single label or list of labels as an argument to the columns parameter. Again, the default is inplace=False, so to keep the change I have to either save the resulting DataFrame to a new variable or reassign it to my original variable; reassigning WILL replace my original data, so be careful.
If I reassign my new DataFrame, thus changing my original, and then try to rerun that code, I WILL get an error because the column(s) is already gone. Don't be surprised when you hit this error; you will.
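If I want a rerun-safe version of that drop, one option (my own assumption here, using the errors parameter from the signature above) looks like this:
# With errors='ignore', dropping a column that is already gone does not raise a KeyError.
student_df.drop(columns=['Country', 'Region'], errors='ignore').head(3)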
student_df.head(1)
# Country and Region are gone in the resulting df copy; I did not reassign, so temporarily.
student_df.drop(columns=['Country', 'Region']).head(3)
# Drop rows by index label; here I grab the labels at index positions 2 and 5.
student_df.drop(index = student_df.index[[2, 5]]).head()
.rename()
I can rename my columns using this method, but unless I'm changing column names as part of method chaining or I don't want to change my original DataFrame, I will rename my columns using the .columns attribute. It's good to know how to use both, so you can use the best one for your context.
df.rename(columns={'original_name_1': 'new_name_1', 'original_name_2': 'new_name_2'})
OR
df.columns = ['new_name_1', 'new_name_2', 'new_name_3']
# If I want to keep these changes, I need to reassign.
student_df.rename(columns={'DataYear': 'year', 'ClassGrade': 'grade'}).head(1)
fam_df
# Use `.rename()` method as a part of method chaining. It's a good thing.
rename_fam_df = fam_df.drop(columns='signs').rename(columns={'name': 'family_member'})
rename_fam_df
.columns
Here's a fast way to rename a lot of columns by accessing and reassigning to the .columns attribute.
df.columns = ['new_col_name', 'new_col_name', 'new_col_name'...]
employees.head(1)
# Print out a list of my original column names to steal for my code below.
employees.columns.to_list()
# Copy and paste list and change what I want. Then, reassign new list to df.columns
employees.columns = ['emp_no', 'birth', 'first', 'last', 'gender',
'hire_date']
employees.head(1)
.assign()
The .assign() method is used to create new columns from existing columns in my DataFrame. I can also create new columns using bracket notation, so I'll demonstrate both ways here, and as always, my specific context will most likely dictate which method I employ in the wild.
If I'm making one new column and I want to add it to my original DataFrame, I will usually just use bracket notation, but when I want to create multiple columns, I'm method chaining, or I don't want to change my existing DataFrame, .assign() is the right tool.
df.assign(new_column = some_expression_or_calculation,
another_new_column = some_other_expression_if_I_want)
fam_df
# I can make more than one new column at once! I have to assign or reassign this to keep it.
fam_df.assign(shout = fam_df.name.str.upper(),
name_letter_count = fam_df.name.apply(len),
sign_letter_count = fam_df.signs.apply(len))
# My original DataFrame is unchanged because I did not reassign above.
fam_df.head(2)
# I can use bracket notation to add one column at a time to my original DataFrame.
fam_df['shout'] = fam_df.name.str.upper()
fam_df
# I'm changing my original DataFrame each time I create a new column using bracket notation.
fam_df['name_letter_count'] = fam_df.name.apply(len)
fam_df
np.where()
This is a really useful NumPy function that I find myself reaching for all the time! I like to use it to create a new column in my DataFrame with values based on another column in my DataFrame. I can base my new column values on whether the values in the existing column meet my condition. There is a df.where() pandas method that is very similar, but it does not function in the same way; don't get confused by this.
np.where(condition, this_where_True, this_where_False)
fam_df.info()
# I need to convert my age column to be integers if I want to use it as such.
fam_df['age'] = fam_df.age.astype(int)
# Create a new column with values based on a conditional test on another column.
fam_df['age_group'] = np.where(fam_df.age > 18, 'adult', 'minor')
fam_df.head()
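For contrast, here is a quick sketch of the pandas .where() method mentioned above; unlike np.where(), it keeps the original values where the condition is True and only replaces the rest (the fill value of 0 here is just a hypothetical placeholder).
# Keep ages over 18 as-is; replace every other age with 0.
fam_df.age.where(fam_df.age > 18, other=0)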
.sort_values()
This is a very useful method of both pandas Series and DataFrames. When I call this method on a DataFrame, I have to pass an argument to the by parameter to specify which column(s) to sort my DataFrame by. I can pass a string value (a column name) or a list of column names as the argument to the by parameter, and I can also pass a single boolean value or a list of boolean values to the ascending parameter to curate my sort.
The default arguments can be seen below...
df.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
# Here I am sorting by first name in ascending order and last name in descending order!
employees.sort_values(by=['first', 'last'], ascending=[True, False]).head()
.sort_index()
DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)
Just as I can sort by the values in my Series or DataFrame, I can also sort by the index. I also use this method all the time; keep it in mind.
fam_df.head(3)
# I can reverse the order of my fam_df by the index if I want.
fam_df.sort_index(ascending=False)
# I can also sort my DataFrame columns by setting `axis=1`.
fam_df.sort_index(axis=1)
.T
I can access this property of my DataFrame to transpose it, swapping the rows and columns.
fam_df.T
.pivot_table()
This pandas function allows me to create a spreadsheet-style pivot table as a DataFrame. I'll demonstrate a very simple pivot here, but as we deal with more complex data, pivots can do a lot more.
pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
# View original DataFrame.
cart_items
# Create a new column calculating price and quantity for use below.
cart_items['total'] = cart_items.price * cart_items.quantity
cart_items
# A simple pivot table setting only `values` and `columns`.
pd.pivot_table(cart_items, values=['quantity', 'price', 'total'], columns='title')
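As a sketch of a slightly fuller pivot (assuming the student_df columns used earlier in this notebook), I can also pass index and aggfunc arguments:
# Average age for each Gender/Handed combination in a spreadsheet-style table.
pd.pivot_table(student_df, values='Ageyears', index='Gender', columns='Handed', aggfunc='mean')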
.crosstab()
This function basically creates a frequency table.
pd.crosstab(index_series, col_series)
# Here I'm reading in a CSV file I created to create my DataFrame.
dept_titles = pd.read_csv('dept_titles.csv', index_col=0)
dept_titles.head()
# Create a frequency table of titles by department
all_titles_crosstab = pd.crosstab(dept_titles.dept_name, dept_titles.title)
all_titles_crosstab
Make it easier to read your DataFrame by using the style.background_gradient() method of the pandas DataFrame! You can find great advice about choosing the right colormap for your purpose here.
By default, colors are computed based on the row values in each column.
df.style.background_gradient(cmap='PuBu', axis=0)
If you want to compute the colors based on the entire matrix, set axis=None.
I'll show a couple of useful built-in styling methods here, and if you want to read more about styling, check out this styling guide.
Here, I want colors computed based on the column values in each row (departments), so I set axis=1.
all_titles_crosstab.style.background_gradient(cmap='PuBuGn', axis=1)
Maybe you just want to find the max number in each row (department name) quickly.
all_titles_crosstab.style.highlight_max(axis=1)
I can chain methods onto my Series or DataFrame as long as each method returns a Series or DataFrame object.
series.method().method().method()
dataframe.method().method().method()
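For example, here is a minimal sketch of a chain built from methods covered above (assuming the student_df columns used earlier); each call returns a DataFrame, so the next call chains right on.
# Rename a column, group, aggregate, and sort, all in one chained expression.
(student_df
 .rename(columns={'Ageyears': 'age'})
 .groupby('Gender')[['age']]
 .mean()
 .sort_values(by='age'))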
.pipe()
I can also chain functions. The .pipe() method is like .apply() for an entire DataFrame; I can chain functions together to make a pipeline where the output from running one function serves as the input to the next function.
df.pipe(function).pipe(another_function)
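Here is a minimal sketch of such a pipeline; the helper functions are hypothetical and exist only to illustrate the pattern.
# Hypothetical helpers: each one takes a DataFrame and returns a DataFrame.
def shout_names(df):
    return df.assign(shout=df.name.str.upper())

def drop_signs(df):
    return df.drop(columns='signs')

# The output of each function is piped in as the input to the next.
fam_df.pipe(shout_names).pipe(drop_signs)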