STAT 19000: Project 5 — Fall 2020
Motivation: As briefly mentioned in project 4, R differs from many other programming languages in that you will typically want to avoid for loops and instead use vectorized functions and the apply suite. In this project we will demonstrate some basic vectorized operations and show why they are often preferable to loops.
Context: While it was important to stop and learn about looping and if/else statements, in this project, we will explore the R way of doing things.
Scope: r, data.frames, recycling, factors, if/else, for
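As a small illustration of the vectorized style and of recycling, the sketch below converts a made-up vector of two-digit years to four digits, once with a for loop and once with a single vectorized expression. The variable names are only for illustration.

```r
# Made-up two-digit years, similar in spirit to the YEAR column discussed below.
years <- c(75, 76, 77)

# Loop version: fill a pre-allocated vector one element at a time.
full_loop <- numeric(length(years))
for (i in seq_along(years)) {
  full_loop[i] <- years[i] + 1900
}

# Vectorized version: the single value 1900 is recycled across every element.
full_vec <- years + 1900

# Both approaches give the same result; the vectorized one is shorter.
identical(full_loop, full_vec)
```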
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/fars
To get more information on the dataset, see here.
Questions
Question 1
The fars dataset contains a series of folders labeled by year. Each year folder contains (at least) the files ACCIDENT.CSV, PERSON.CSV, and VEHICLE.CSV. If you take a peek at any ACCIDENT.CSV file in any year, you’ll notice that the column YEAR only contains the last two digits of the year. Add a new YEAR column that contains the full year. Use the rbind function to create a data.frame called accidents that combines the ACCIDENT.CSV files from the years 1975 through 1981 (inclusive) into one big dataset. After creating that accidents data frame, change the values in the YEAR column from two digits to four digits (i.e., paste a 19 onto each year value).
Here is a video to walk you through the method of solving Question 1.
Here is another video, using two functions you have not (yet) learned, namely, lapply and do.call. You do not need to understand these yet. It is just a glimpse of some powerful functions to come later in the course!
- R code used to solve the problem/comments explaining what the code does.
- The result of unique(accidents$YEAR).
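The read, rbind, and paste steps can be sketched on toy data as follows. This is a minimal sketch, not the course solution: it writes three tiny stand-in files to a temporary folder that mimics the year/ACCIDENT.CSV layout, and the names base_dir, toy, acc75, acc76, and acc77 are made up for illustration. On Scholar you would read from /class/datamine/data/fars instead.

```r
# Build a toy copy of the folder layout: one folder per year, each with ACCIDENT.CSV.
base_dir <- file.path(tempdir(), "fars_demo")
for (yr in 1975:1977) {
  year_dir <- file.path(base_dir, as.character(yr))
  dir.create(year_dir, recursive = TRUE, showWarnings = FALSE)
  # Toy rows: YEAR holds only the last two digits, as described above.
  toy <- data.frame(STATE = c(18, 17), YEAR = yr - 1900)
  write.csv(toy, file.path(year_dir, "ACCIDENT.CSV"), row.names = FALSE)
}

# Read each year's file into its own data frame.
acc75 <- read.csv(file.path(base_dir, "1975", "ACCIDENT.CSV"))
acc76 <- read.csv(file.path(base_dir, "1976", "ACCIDENT.CSV"))
acc77 <- read.csv(file.path(base_dir, "1977", "ACCIDENT.CSV"))

# rbind stacks the rows; it expects the data frames to share column names.
accidents <- rbind(acc75, acc76, acc77)

# paste0 is vectorized: the string "19" is recycled across the whole column.
accidents$YEAR <- as.numeric(paste0("19", accidents$YEAR))

unique(accidents$YEAR)
```

No loop is needed for the conversion itself, because paste0 works on the entire YEAR column at once.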
Question 2
Using the new accidents data frame that you created in (1), how many accidents involved 1 or more drunk drivers and a school bus?
Look at the variables |
Here is a video about a related problem with 3 fatalities (instead of considering drunk drivers).
- R code used to solve the problem/comments explaining what the code does.
- The result/answer itself.
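Counting rows that satisfy two conditions is a natural fit for logical indexing. The sketch below uses a toy data frame; the column names DRUNK_DR and SCH_BUS, and the coding 1 = school bus involved, are assumptions made for illustration, so check the dataset documentation for the actual variables and codes.

```r
# Toy data; DRUNK_DR and SCH_BUS are hypothetical column names.
accidents <- data.frame(
  YEAR     = c(1975, 1975, 1976, 1977, 1977),
  DRUNK_DR = c(0, 1, 2, 0, 1),  # assumed: number of drunk drivers
  SCH_BUS  = c(1, 1, 0, 0, 1)   # assumed: 1 means a school bus was involved
)

# Each comparison is vectorized and returns one TRUE/FALSE per row;
# & combines the two logical vectors element by element.
qualifying <- accidents$DRUNK_DR >= 1 & accidents$SCH_BUS == 1

# TRUE counts as 1 when summed, so this is the number of qualifying rows.
sum(qualifying)  # 2 for this toy data
```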
Question 3
Again using the accidents data frame: For accidents involving 1 or more drunk drivers and a school bus, how many happened in each of the 7 years? Which year had the largest number of these types of accidents?
Here is a video about the related problem with 3 fatalities (instead of considering drunk drivers), tabulated according to year.
- R code used to solve the problem/comments explaining what the code does.
- The results.
- Which year had the most qualifying accidents.
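One way to tabulate qualifying rows by year is to subset the YEAR column with a logical vector and pass the result to table. As before, this is a sketch on toy data, and DRUNK_DR and SCH_BUS are hypothetical column names with an assumed coding.

```r
# Toy data; DRUNK_DR and SCH_BUS are hypothetical column names.
accidents <- data.frame(
  YEAR     = c(1975, 1975, 1976, 1976, 1976, 1977),
  DRUNK_DR = c(1, 0, 1, 2, 1, 0),  # assumed: number of drunk drivers
  SCH_BUS  = c(1, 1, 1, 1, 0, 1)   # assumed: 1 means a school bus was involved
)

qualifying <- accidents$DRUNK_DR >= 1 & accidents$SCH_BUS == 1

# table counts how often each year appears among the qualifying rows.
by_year <- table(accidents$YEAR[qualifying])
by_year

# which.max returns the position of the largest count; names() gives the year.
names(which.max(by_year))
```

Years with no qualifying rows (1977 in this toy data) do not appear in the table at all.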
Question 4
Again using the accidents data frame: Calculate the mean number of motorists involved in an accident (variable PERSON) with i drunk drivers, where i takes the values from 0 through 6.
It is OK that there are no accidents involving just 5 drunk drivers.
You can use either a |
Here is a video about the related problem with 3 fatalities (instead of considering drunk drivers). We calculate the mean number of fatalities for accidents with i drunk drivers, where i takes the values from 0 through 6.
- R code used to solve the problem/comments explaining what the code does.
- The output from running your code.
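A grouped mean like this can be computed in one call with tapply, which is part of the apply suite. The sketch below uses toy data; PERSON is the variable named in the question, while DRUNK_DR is a hypothetical column name for the number of drunk drivers.

```r
# Toy data; DRUNK_DR is a hypothetical column name.
accidents <- data.frame(
  DRUNK_DR = c(0, 0, 1, 1, 2),
  PERSON   = c(2, 4, 3, 5, 6)
)

# tapply splits PERSON into groups defined by DRUNK_DR and applies mean to each group.
mean_persons <- tapply(accidents$PERSON, accidents$DRUNK_DR, mean)
mean_persons
```

A value of DRUNK_DR that never occurs in the data simply has no entry in the result, which is why a missing group is not a problem.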
Question 5
Again using the accidents data frame: We have a theory that there are more accidents in cold weather months for Indiana and states around Indiana. For this question, only consider the data for which STATE is one of these: Indiana (18), Illinois (17), Ohio (39), or Michigan (26). Create a barplot that shows the number of accidents by STATE and by month (MONTH) simultaneously. Which months have the most accidents? Are you surprised by these results? Explain why or why not.
We guide students through the methodology for Question 5 in this video. We also add a legend, in case students want to distinguish which stacked barplot goes with each of the four States.
- R code used to solve the problem/comments explaining what the code does.
- The output (plot) from running your code.
- 1-2 sentences explaining which month(s) have the most accidents and whether or not this surprises you.
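The filter-then-plot approach can be sketched on toy data as follows. The rows are made up; the state codes are the ones listed in the question, and the name midwest is only illustrative.

```r
# Toy data; state code 5 is included to show that the filter drops it.
accidents <- data.frame(
  STATE = c(18, 18, 17, 39, 26, 5, 17, 18),
  MONTH = c(1, 12, 1, 7, 12, 1, 7, 1)
)

# %in% is vectorized: it tests every STATE value against the four codes at once.
midwest <- accidents[accidents$STATE %in% c(18, 17, 39, 26), ]

# A two-way table: one row per state, one column per month.
counts <- table(midwest$STATE, midwest$MONTH)

# Given a matrix-like table, barplot draws one stacked bar per column (month),
# with one segment per row (state).
barplot(counts, xlab = "Month", ylab = "Number of accidents")
```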
OPTIONAL QUESTION
Spruce up your plot from (5). Do any of the following:
- Add vibrant (and preferably colorblind friendly) colors to your plot
- Add a title
- Add a legend
- Add month names or abbreviations instead of numbers
Here is a resource to get you started.
- R code used to solve the problem/comments explaining what the code does.
- The output (plot) from running your code.
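The four suggestions above can be combined in a single barplot call. This is a sketch on toy data: the hex codes are four colors from the Okabe-Ito colorblind-friendly palette, chosen here as one possible option, and the title text and the name state_colors are made up for illustration.

```r
# Toy data restricted to the four state codes from Question 5.
accidents <- data.frame(
  STATE = c(18, 18, 17, 39, 26, 17, 18),
  MONTH = c(1, 12, 1, 7, 12, 7, 1)
)

# Turn MONTH into a factor with all 12 levels, labeled with R's built-in
# month.abb constant, so every month gets a bar even if it has no rows.
accidents$MONTH <- factor(accidents$MONTH, levels = 1:12, labels = month.abb)

# Rows of the table are sorted by state code: 17, 18, 26, 39.
counts <- table(accidents$STATE, accidents$MONTH)

# Four colors from the Okabe-Ito palette, one per state.
state_colors <- c("#E69F00", "#56B4E9", "#009E73", "#CC79A7")

barplot(
  counts,
  col = state_colors,
  main = "Accidents by month and state (toy data)",
  xlab = "Month",
  ylab = "Number of accidents",
  # Legend labels follow the row order of the table (17, 18, 26, 39).
  legend.text = c("Illinois", "Indiana", "Michigan", "Ohio"),
  args.legend = list(x = "topright")
)
```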