From the previous data exploration and data analysis, we know that there exists some interesting relationship between bus delay and weather, location, and a variety of reasons. Here, we want to explore the dataset deeper to predict the number of daily delays in Manhattan as our outcome and month, weekday, daily precipitation, daily snow depth, and minimum temperature as predictors.
Our interested dependent variable is the number of delays across three school years: the number of delays in Manhattan. Based on the graph below, the daily delay across three consecutive school year have declined.
daily_df = df %>%
filter(school_year!="2021-2022",
boro =="Manhattan") %>%
group_by(occur_date,school_year) %>%
summarize(n=n()) %>%
ggplot(aes(x=occur_date, y=n,fill=school_year))+
geom_bar(stat='identity')+
facet_grid(~school_year)+
labs(
title = "Number of Daily Delay",
x = "Date of Delay",
y = "Count"
)
daily_df
Here’s the predictors in our model:
·month: month of the incident occurred;
·day: weekday of the incident occurred;
·prcp:precipitation (tenths of mm)
·snwd: snow depth (mm)
·tmin Minimum temperature (tenths of degrees C)
We propose 3 models for prediction:
1. Linear Model of daily_delay ~ month + day + snwd + tmin
2. Linear Model of daily_delay ~ month + day + snwd + tmin+ tmin*day,assuming that there are interaction
3. Linear Model of daily_delay ~ month + day + prcp
Model_1
daily_delay=df %>%
group_by(occur_date) %>%
mutate(daily_delay=n())
lm_1=lm(daily_delay ~ month + day + snwd + tmin,data=daily_delay)
lm_1 %>% broom::tidy() %>% knitr::kable(digits = 2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 131.71 | 1.15 | 114.35 | 0 |
monthDecember | 73.71 | 1.27 | 58.06 | 0 |
monthFebruary | 71.29 | 1.29 | 55.31 | 0 |
monthJanuary | 56.93 | 1.28 | 44.57 | 0 |
monthJune | 30.11 | 1.54 | 19.58 | 0 |
monthMarch | 53.46 | 1.25 | 42.80 | 0 |
monthMay | 38.77 | 1.31 | 29.65 | 0 |
monthNovember | 94.69 | 1.18 | 80.57 | 0 |
monthOctober | 110.83 | 1.16 | 95.51 | 0 |
monthSeptember | 106.81 | 1.33 | 80.12 | 0 |
dayMonday | 14.99 | 0.76 | 19.69 | 0 |
dayThursday | 2.90 | 0.75 | 3.89 | 0 |
dayTuesday | 9.36 | 0.75 | 12.47 | 0 |
dayWednesday | 4.08 | 0.74 | 5.48 | 0 |
snwd | -0.51 | 0.01 | -44.45 | 0 |
tmin | -3.51 | 0.06 | -61.90 | 0 |
Model_2
daily_delay=df %>%
group_by(occur_date) %>%
mutate(daily_delay=n())
lm_2=lm(daily_delay ~ month + day+ snwd + tmin+tmin*day,data=daily_delay)
lm_2%>% broom::tidy() %>% knitr::kable(digits = 2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 125.22 | 1.21 | 103.61 | 0 |
monthDecember | 72.58 | 1.26 | 57.44 | 0 |
monthFebruary | 71.68 | 1.28 | 55.87 | 0 |
monthJanuary | 56.12 | 1.27 | 44.14 | 0 |
monthJune | 30.30 | 1.54 | 19.73 | 0 |
monthMarch | 52.43 | 1.24 | 42.18 | 0 |
monthMay | 38.03 | 1.30 | 29.22 | 0 |
monthNovember | 94.20 | 1.17 | 80.56 | 0 |
monthOctober | 111.27 | 1.15 | 96.36 | 0 |
monthSeptember | 106.22 | 1.33 | 79.94 | 0 |
dayMonday | 24.35 | 1.02 | 23.87 | 0 |
dayThursday | 6.81 | 0.95 | 7.18 | 0 |
dayTuesday | 24.06 | 0.96 | 25.01 | 0 |
dayWednesday | 11.61 | 0.94 | 12.34 | 0 |
snwd | -0.51 | 0.01 | -44.77 | 0 |
tmin | -2.45 | 0.08 | -29.42 | 0 |
dayMonday:tmin | -1.42 | 0.10 | -14.03 | 0 |
dayThursday:tmin | -0.58 | 0.09 | -6.33 | 0 |
dayTuesday:tmin | -2.30 | 0.09 | -24.26 | 0 |
dayWednesday:tmin | -1.16 | 0.09 | -12.76 | 0 |
Model_3
daily_delay=df %>%
group_by(occur_date) %>%
mutate(daily_delay=n())
lm_3=lm(daily_delay ~ month + day+ prcp,data=daily_delay)
lm_3%>% broom::tidy() %>% knitr::kable(digits = 2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 104.95 | 1.13 | 93.02 | 0 |
monthDecember | 95.89 | 1.26 | 76.40 | 0 |
monthFebruary | 85.93 | 1.27 | 67.76 | 0 |
monthJanuary | 90.24 | 1.21 | 74.78 | 0 |
monthJune | -8.15 | 1.47 | -5.54 | 0 |
monthMarch | 69.26 | 1.26 | 54.98 | 0 |
monthMay | 19.62 | 1.32 | 14.84 | 0 |
monthNovember | 101.28 | 1.21 | 83.53 | 0 |
monthOctober | 95.53 | 1.18 | 81.12 | 0 |
monthSeptember | 69.04 | 1.24 | 55.53 | 0 |
dayMonday | 13.89 | 0.79 | 17.62 | 0 |
dayThursday | 3.72 | 0.77 | 4.81 | 0 |
dayTuesday | 8.62 | 0.78 | 11.05 | 0 |
dayWednesday | 5.19 | 0.78 | 6.70 | 0 |
prcp | 0.06 | 0.00 | 19.14 | 0 |
cv_df=df %>%
sample_n(size = 100) %>%
group_by(occur_date) %>%
mutate(daily_delay=n()) %>%
modelr::crossv_mc(n = 100) %>%
mutate_if(is.character, as.factor) %>%
mutate(train = map(train, as.tibble),
test = map(test, as.tibble)) %>%
mutate(lm_1 = map(.x = train,~ lm(daily_delay ~ month + day + snwd + tmin,data = .x)),
lm_2 =map(.x = train,~ lm(daily_delay ~ month + day+ snwd + tmin+tmin*day,data = .x)),
lm_3 =map(.x = train,~ lm(daily_delay ~ month + day+ prcp, data = .x)))
cv=cv_df %>%
select(starts_with("lm")) %>%
pivot_longer(
starts_with("lm"),
names_to = "model",
values_to = "data") %>%
mutate(data = map(data,broom::tidy))
The predictors we chose above are statistically significant.However, we notice that interaction terms are not contributing to model predictability. We would say lm_3 will be better than the other two models because its root means square error seems to be smaller. Therefore, lm_3 could be useful for predicting the total number of bus delays in Manhattan.
cv_df %>%
mutate(
rmse_lm1 = map2_dbl(lm_1, test, ~rmse(model = .x, data = .y)),
rmse_lm2 = map2_dbl(lm_2, test, ~rmse(model = .x, data = .y)),
rmse_lm3 = map2_dbl(lm_3, test, ~rmse(model = .x, data = .y))) %>%
select(starts_with("rmse")) %>%
pivot_longer(
everything(),
names_to = "model",
values_to = "rmse",
names_prefix = "rmse_") %>%
mutate(model = fct_inorder(model)) %>%
ggplot(aes(x = model, y = rmse)) + geom_violin()
Month, rainfall, snow depth, and day are good predictors for anticipating the number of daily delays in Manhattan.