STAT 29000: Project 12 — Spring 2022
Motivation: As we mentioned before, data wrangling is a big part in any data driven project. "Data Scientists spend up to 80% of the time on data cleaning and 20 percent of their time on actual data analysis." Therefore, it is worth to spend some time mastering how to best tidy up our data.
Context: We are continuing to practice using various tidyverse
packages, in order to wrangle data.
Scope: R, tidyverse, ggplot
Dataset(s)
The following questions will use the following dataset(s):
-
/anvil/projects/tdm/data/consumer_complaints/complaints.csv
Questions
Question 1
As usual, we will start by loading tidyverse
package (using the read_csv
function) and reading the processed data complaints.csv
into a tibble called dat
.
The first step in many projects is to define the problem statement and goal. That is, to understand why the project is important and what are our desired delivarables. For this project and the next, we will assume that we want to improve consumer satisfaction, and to achieve that we will provide them with data-driven tips and plots to make informed decisions.
What is the type of data in the date_sent_to_company
and date_received
columns? Do you think that these columns are in a good format to calculate the wait time between receiving and sending the complaint to the company? No need to overthink anything — from your perspective, how simple/complicated would be the steps to calculate the number of days between received and sent to the company, if the data remaining in the current format?
The Also, try to keep using the pipes ( |
-
Code used to solve this problem.
-
Output from running the code.
-
1 sentence answering the type of columns
date_sent_to_company
anddate_received
. -
1-2 sentences commenting on if you think it would be easy to calculate calculate the wait time between receiving and sending the complaint to the company if the type of data remained as is.
Question 2
Tidyverse has a few "data type"-specific packages. For example, the package lubridate
is fantastic when dealing with dates, and the package stringr
was made to help when dealing with text (chr
). Although lubridate
is a part of the tidyverse
universe, it does not get loaded when we run library(tidyverse)
.
Begin this question by loading the lubridate
package. Use the appropriate function to convert the columns refering to dates to a date format.
There are lots of really great (and "official") cheat sheets for Here are the cheat sheets. Here is the |
Try to solve this question using the mutate_at
function.
You will notice a pattern within the |
Take a look at the functions: |
If you are using
|
-
Code used to solve this problem.
-
Output from running the code.
-
Result from running
glimpse(dat)
.
Question 3
Add a new column called wait_time
that is the difference between date_sent_to_company
and date_received
in days.
You may want to use the argument |
Remember that |
-
Code used to solve this problem.
-
Output from running the code.
Question 4
Frequently, we need to perform many data wrangling tasks: changing data types, performing operations more easily, summarizing the data, and pivoting to be enable plotting certain types of plots.
So far, we spent most of the project doing just that! Now that we have the wait_time
in a format that allows us to plot it, let’s start creating some tips and plots to increase costumer satisfaction. One thing we may want to do is give the user information on the wait_time
based on how the complaint was submitted (submitted_via
). Compare wait_time
values by how they are submitted.
Keep in mind that we want to present this information in a way that would be helpful to costumers. For example, if you summarized the data, you could present the information as a tip and include the summarized |
Be sure to explain your reasoning for each step of your analysis. If you are summarizing, why did you pick this method, and why are you summarizing the way you are (for example, are you using the average time, the median time, the maximum time, the mean(wait_time) + 3*std(wait_time)
)? You may also want to create 3 categories of wait_time
(small, medium, high) and do a table
between the categorical wait time and submission types. Why are you presenting the information the way you are?
Figuring out how to present the information to help someone make a decision is an important step in any project! You may very well be presenting to someone that is not as familiar with data science/statistics/computer science as you are. |
If you are creating categorical wait time, take a look at the |
One example could be: The plot below shows the average time it takes for the company to receive your complaint after you sent it based on _how_ you sent it. Note that, on average, it takes XX days to get a response if you submitted via YY. Alternatively, it takes, on avaerage, twice as long to receive a response if you submit a complain via ZZ. Be sure to keep this in mind when submitting a complaint. |
-
Code used to solve this problem.
-
Output from running the code.
-
1-2 sentences explaning your reasoning for how you presented the information.
-
Information for costumer to make decision (plot, tip, etc).
Question 5
Note that we have a column called timely_response
in our dat
. It may or may not (in reality) be related to wait_time
, however, we would expect it to be. What would you expect to see? Compare wait_time
to timely_response
using any technique you’d like. You can use the same idea/technique from question 4, or you can pick something else entirely different. It is completely up to you!
Would this information be relevant to include in a tip or dashboard for a costumer to make their decision? Why or why not? Would you combine this information with the one for wait_time
? If so, how?
Sometimes there are many ways to present similar pieces of information, and we must decide what we believe makes most sense, and what will be most helpful when making a decision.
-
Code used to solve this problem.
-
Output from running the code.
-
1-2 sentences comparing
wait_time
for timely and not timely responses. -
1-2 sentences explaining whether you would include this information for costumers, and why or why not? If so, how would you include it?
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |