STAT 19000: Project 15 — Fall 2020
Motivation: Some people say it takes 20 hours to learn a skill, some say 10,000 hours. What is certain is it definitely takes time. In this project we will explore an interesting dataset and exercise some of the skills learned this semester.
Context: This is the final project of the semester. We sincerely hope that you’ve learned something, and that we’ve provided you with first hand experience digging through data.
Scope: r
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/donorschoose/
Questions
Question 1
Read the data /class/datamine/data/donorschoose/Projects.csv
into a data.frame called projects
. Make sure you use the function you learned in Project 13 (fread
) from the data.table
package to read the data. Don’t forget to then convert the data.table
into a data.frame
. Let’s do an initial exploration of this data. What types of projects (Project.Type
) are there? How many resource categories (Project.Resource.Category
) are there?
-
R code used to solve the question.
-
1-2 sentences containing the project’s types and how many resource categories are in the dataset.
Question 2
Create two new variables in projects
, the number of days a project lasted and the number of days until the project was fully funded. Name those variables project_duration
and time_until_funded
, respectively. To calculate them use the project’s posted date (Project.Posted.Date
), expiration date (Project.Expiration.Date
), and fully funded date (Project.Fully.Funded.Date
). What are the shortest and longest times until a project is fully funded? For consistency check, see if we have any negative project’s duration. If so, how many?
You may find the argument |
Be sure to pay attention to the order of operations of |
Note that if you used the |
It is not required that you use |
-
R code used to solve the question.
-
Shortest and longest times until a project is fully funded.
-
1-2 sentences answering whether we have if we have negative project’s duration, and if so how many.
Question 3
As you noted in question (2) there may be some project’s with negative duration time. As we may have some concerns for the data regarding these projects, filter the projects
data to exclude the projects with negative duration, and call this filtered data selected_projects
. With that filtered data, make a dotchart
for mean time until the project is fully funded (time_until_funded
) for the various resource categories (Project.Resource.Category
). Make sure to comment on your results. Are they surprising? Could there be another variable influencing this result? If so, name at least one.
You will first need to average time until fully funded for the different categories before making your plot. |
To make your |
-
R code used to solve the question.
-
Resulting dotchart.
-
1-2 sentences commenting on your plot. Make sure to mention whether you are surprised or not by the results. Don’t forget to add if you think there could be more factors influencing your answer, and if so, be sure to give examples.
Question 4
Read /class/datamine/data/donorschoose/Schools.csv
into a data.frame called schools
. Combine selected_projects
and schools
by School.ID
keeping only School.ID`s present in both datasets. Name the combined data.frame `selected_projects
. Use the newly combined data to determine the percentage of already fully funded projects (Project.Current.Status
) for schools in West Lafayette, IN. In addition, determine the state (School.State
) with the highest number of projects. Be sure to specify the number of projects this state has.
West Lafayette, IN zip codes are 47906 and 47907. |
-
R code used to solve the question.
-
1-2 sentences answering the percentage of already fully funded projects for schools in West Lafayette, IN, the state with the highest number of projects, and the number of projects this state has.
Question 5
Using the combined selected_projects
data, get the school(s) (School.Name
), city/cities (School.City
) and state(s) (School.State
) for the teacher with the highest percentage of fully funded projects (Project.Current.Status
).
There are many ways to solve this problem. For example, one option to get the teacher’s ID is to create a variable indicating whether or not the project is fully funded and use |
Note that each row in the data corresponds to a unique project ID. |
Once you have the teacher’s ID, consider filtering |
To get only certain columns when subetting, you may find the argument |
-
R code used to solve the question.
-
Output of your code containing school(s), city(s) and state(s) of the selected teacher.