STAT 19000: Project 7 — Spring 2021
Motivation: There is one pretty major topic that we have yet to explore in Python: functions! Writing functions is a key part of writing efficient code. A function lets us package steps we have already written and reuse them as often as needed. If you find yourself repeating the same code over and over, a function may be a good way to cut down on the number of lines.
Context: We are taking a small hiatus from our pandas- and numpy-focused series to learn about and write our own functions in Python!
Scope: python, functions, pandas
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/yelp/data/parquet
Questions
Question 1
You’ve been given a path to a folder for a dataset. Explore the files. Give a brief description of the files and what each file contains.
Take a look at the size of each of the files. If you are interested in experimenting, try using |
- Python code used to solve the problem.
- Output from running your code.
- The name of each dataset and a brief summary of each dataset. No more than 1-2 sentences about each dataset.
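One possible way to list the files and their sizes from Python is sketched below. The helper name list_file_sizes is hypothetical (it is not part of the project), and the sketch assumes the folder is readable from your session. Entries that are directories rather than single files are skipped.

```python
from pathlib import Path


def list_file_sizes(folder: str) -> dict:
    """Map each file name directly inside `folder` to its size in megabytes."""
    sizes = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            # st_size is reported in bytes; divide to get megabytes
            sizes[path.name] = path.stat().st_size / 1_000_000
    return sizes
```

Calling list_file_sizes("/class/datamine/data/yelp/data/parquet") should then return a dictionary you can print or loop over.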
Question 2
Read the businesses.parquet file into a pandas DataFrame called businesses. Take a look at the hours and attributes columns. If you look closely, you’ll observe that both columns contain a lot more than a single feature. In fact, the attributes column contains 39 features and the hours column contains 7!
len(businesses.loc[:, "attributes"].iloc[0].keys()) # 39
len(businesses.loc[:, "hours"].iloc[0].keys()) # 7
Let’s start by writing a simple function. Create a function called has_attributes that takes a business_id as an argument, and returns True if the business has any attributes and False otherwise. Test it with the following code:
print(has_attributes('f9NumwFMBDn751xgFiRbNA')) # True
print(has_attributes('XNoUzKckATkOD1hP6vghZg')) # False
print(has_attributes('Yzvjg0SayhoZgCljUJRF9Q')) # True
print(has_attributes('7uYJJpwORUbCirC1mz8n9Q')) # False
While this function tells you whether a single business has any attributes, to apply the same check to the entire attributes column/Series you would just use the notna method:
businesses.loc[:, "attributes"].notna()
Make sure your return value is of type
|
- Python code used to solve the problem.
- Output from running the provided "test" code.
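To make the idea concrete, here is a minimal sketch on a toy DataFrame rather than the real data. The IDs and attribute values are made up, and the function body is only one possible approach: it assumes a global DataFrame called businesses and treats a missing (None) attributes value as "no attributes".

```python
import pandas as pd

# Toy stand-in for the real data; the IDs and values here are made up.
businesses = pd.DataFrame(
    {
        "business_id": ["a1", "b2"],
        "attributes": [{"WiFi": "free"}, None],
    }
)


def has_attributes(business_id: str) -> bool:
    """Return True if the business with this ID has a non-missing attributes value."""
    # Select the attributes value(s) for the matching row(s)
    matches = businesses.loc[businesses["business_id"] == business_id, "attributes"]
    # notna() flags non-missing entries; bool() converts numpy.bool_ to a plain bool
    return bool(matches.notna().any())


print(has_attributes("a1"))  # True
print(has_attributes("b2"))  # False
```

Wrapping the result in bool() is a deliberate choice here: pandas returns a numpy boolean, which prints the same way but is a different type from Python's built-in bool.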
Question 3
Take a look at the attributes of the first business:
businesses.loc[:, "attributes"].iloc[0]
What is the type of the value? Let’s assume the company you work for gets data formatted like businesses each week, but your boss wants the 39 features in attributes and the 7 features in hours to become their own columns. Write a function called fix_businesses_data that accepts an argument called data_path (of type str) that is a full path to a parquet file that is in the exact same format as businesses.parquet. In addition to the data_path argument, fix_businesses_data should accept another argument called output_dir (of type str). output_dir should contain the path where you want your "fixed" parquet file to output. fix_businesses_data should return None.
The result of your function should be a new file called new_businesses.parquet saved in the output_dir. The data in this file should no longer contain either the attributes or hours columns. Instead, each row should contain 39+7 new columns. Test your function out:
from pathlib import Path
my_username = "kamstut" # replace "kamstut" with YOUR username
fix_businesses_data(data_path="/class/datamine/data/yelp/data/parquet/businesses.parquet", output_dir=f"/scratch/scholar/{my_username}")
# see if output exists
p = Path(f"/scratch/scholar/{my_username}").glob('**/*')
files = [x for x in p if x.is_file()]
print(files)
Make sure that either |
from pathlib import Path

import pandas as pd


def fix_businesses_data(data_path: str, output_dir: str) -> None:
    """
    fix_businesses_data accepts a parquet file that contains data in a specific format.
    fix_businesses_data "explodes" the attributes and hours columns into 39+7=46 new
    columns.

    Args:
        data_path (str): Full path to a file in the same format as businesses.parquet.
        output_dir (str): Path to a directory where new_businesses.parquet should be output.
    """
    # read in original parquet file
    businesses = pd.read_parquet(data_path)

    # unnest the attributes column

    # unnest the hours column

    # output new file
    businesses.to_parquet(str(Path(output_dir).joinpath("new_businesses.parquet")))

    return None
Check out the code below, notice how using |
from pathlib import Path
print(Path("/class/datamine/data/").joinpath("my_file.txt"))
print(Path("/class/datamine/data").joinpath("my_file.txt"))
You can test out your function on |
If we were using R and the |
This Stack Overflow post should be very useful! Specifically, run this code and take a look at the output:
|
Notice that some rows have json, and others have None:
businesses.loc[0, "attributes"] # has json
businesses.loc[2, "attributes"] # has None
This method allows us to handle both cases. If the row has json, it converts the values; if it has None, it just fills each column with a value of None.
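As an illustration of handling both cases, the sketch below expands a toy column that mixes dicts and None. This is one possible technique (building a DataFrame from a list of records), not necessarily the one used in the linked post, and the keys and values are made up. Note that with this approach the missing entries show up as NaN rather than None.

```python
import pandas as pd

# Toy column mixing dicts and None; keys and values are made up.
attributes = pd.Series([{"WiFi": "free", "Parking": "lot"}, None, {"WiFi": "no"}])

# Replace anything that is not a dict with an empty dict so every row
# still produces a record, then build one column per key.
records = [value if isinstance(value, dict) else {} for value in attributes]
expanded = pd.DataFrame(records, index=attributes.index)

print(expanded)
```

Passing index=attributes.index keeps the new columns aligned with the original rows, which matters once the result is combined with the rest of the data.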
Here is an example that shows you how to concatenate (combine) dataframes. |
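For reference, a minimal sketch of combining two DataFrames side by side with pd.concat is shown here. The frames left and right and their contents are made up for illustration; the sketch assumes both frames share the same index.

```python
import pandas as pd

# Two toy frames with the same (default) index; values are made up.
left = pd.DataFrame({"business_id": ["a1", "b2"], "stars": [4.5, 3.0]})
right = pd.DataFrame({"WiFi": ["free", None], "Parking": ["lot", None]})

# axis=1 places the frames side by side, aligning rows on the index
combined = pd.concat([left, right], axis=1)

print(combined.columns.tolist())  # ['business_id', 'stars', 'WiFi', 'Parking']
```

Because the rows are matched by index label rather than by position, frames with different indexes would produce extra rows filled with missing values.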
- Python code used to solve the problem.
- Output from running your code.
Question 4
That’s a pretty powerful function, and could definitely be useful. What if, instead of working on just our specifically formatted parquet file, we wrote a function that worked for any pandas DataFrame? Write a function called unnest that accepts a pandas DataFrame as an argument (let’s call this argument myDF), and a list of columns (let’s call this argument columns), and returns a DataFrame where the provided columns are unnested.
You may write |
The following should work:
|
- Python code used to solve the problem.
- Output from running the provided code.
Question 5
Try out the code below. If a provided column isn’t already nested, the column name is ruined and the data is changed. If the column doesn’t already exist, a KeyError is thrown. Modify our function from question (4) to skip unnesting if the column doesn’t exist. In addition, modify the function from question (4) to skip the column if the column isn’t nested. Let’s consider a column nested if the value of the column is a dict, and not nested otherwise.
businesses = pd.read_parquet("/class/datamine/data/yelp/data/parquet/businesses.parquet")
new_businesses_df = unnest(businesses, ["doesntexist",]) # KeyError
new_businesses_df = unnest(businesses, ["postal_code",]) # not nested
To test your code, run the following. The result should be a DataFrame where attributes has been unnested, and that is it.
businesses = pd.read_parquet("/class/datamine/data/yelp/data/parquet/businesses.parquet")
results = unnest(businesses, ["doesntexist", "postal_code", "attributes"])
results.shape # (209393, 39)
results.head()
To see if a variable is a
|
- Python code used to solve the problem.
- Output from running the provided code.
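The two checks described in this question can be sketched as a small helper. The name is_nested_column is hypothetical, and the rule it applies (a column counts as nested if any non-missing value is a dict) is an assumption about how to interpret "nested" when a column also contains None.

```python
import pandas as pd


def is_nested_column(df: pd.DataFrame, column: str) -> bool:
    """Return True if `column` exists in `df` and holds at least one dict."""
    if column not in df.columns:
        return False
    # Look only at non-missing values, since nested columns may contain None
    values = df[column].dropna()
    return any(isinstance(value, dict) for value in values)


# Toy data; the values are made up.
toy = pd.DataFrame(
    {
        "postal_code": ["47906", "47907"],
        "attributes": [None, {"WiFi": "free"}],
    }
)

print(is_nested_column(toy, "doesntexist"))  # False
print(is_nested_column(toy, "postal_code"))  # False
print(is_nested_column(toy, "attributes"))   # True
```

A function like unnest could call such a helper for each requested column and simply continue to the next one when it returns False.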