Regression Analysis

Predicting Model for Daily Delay Count

From the previous data exploration and data analysis, we know that there exists some interesting relationship between bus delay and weather, location, and a variety of reasons. Here, we want to explore the dataset deeper to predict the number of daily delays in Manhattan as our outcome and month, weekday, daily precipitation, daily snow depth, and minimum temperature as predictors.

Our interested dependent variable is the number of delays across three school years: the number of delays in Manhattan. Based on the graph below, the daily delay across three consecutive school year have declined.

daily_df = df %>% 
  filter(school_year!="2021-2022",
         boro =="Manhattan") %>% 
   group_by(occur_date,school_year) %>% 
   summarize(n=n()) %>% 
    ggplot(aes(x=occur_date, y=n,fill=school_year))+
    geom_bar(stat='identity')+
    facet_grid(~school_year)+
      labs(
        title = "Number of Daily Delay",
        x = "Date of Delay",
        y = "Count"
      )
daily_df

Here’s the predictors in our model:

·month: month of the incident occurred;

·day: weekday of the incident occurred;

·prcp:precipitation (tenths of mm)

·snwd: snow depth (mm)

·tmin Minimum temperature (tenths of degrees C)

Fitting Model

We propose 3 models for prediction:

1. Linear Model of daily_delay ~ month + day + snwd + tmin

2. Linear Model of daily_delay ~ month + day + snwd + tmin+ tmin*day,assuming that there are interaction

3. Linear Model of daily_delay ~ month + day + prcp

Model_1

daily_delay=df %>% 
  group_by(occur_date) %>% 
  mutate(daily_delay=n()) 
  
lm_1=lm(daily_delay ~ month + day  + snwd + tmin,data=daily_delay)
lm_1 %>% broom::tidy() %>% knitr::kable(digits = 2)

term	estimate	std.error	statistic
(Intercept)	131.71	1.15	114.35
monthDecember	73.71	1.27	58.06
monthFebruary	71.29	1.29	55.31
monthJanuary	56.93	1.28	44.57
monthJune	30.11	1.54	19.58
monthMarch	53.46	1.25	42.80
monthMay	38.77	1.31	29.65
monthNovember	94.69	1.18	80.57
monthOctober	110.83	1.16	95.51
monthSeptember	106.81	1.33	80.12
dayMonday	14.99	0.76	19.69
dayThursday	2.90	0.75	3.89
dayTuesday	9.36	0.75	12.47
dayWednesday	4.08	0.74	5.48
snwd	-0.51	0.01	-44.45
tmin	-3.51	0.06	-61.90

Model_2

daily_delay=df %>% 
  group_by(occur_date) %>% 
  mutate(daily_delay=n()) 
lm_2=lm(daily_delay ~ month + day+ snwd + tmin+tmin*day,data=daily_delay)
lm_2%>% broom::tidy() %>% knitr::kable(digits = 2)

term	estimate	std.error	statistic
(Intercept)	125.22	1.21	103.61
monthDecember	72.58	1.26	57.44
monthFebruary	71.68	1.28	55.87
monthJanuary	56.12	1.27	44.14
monthJune	30.30	1.54	19.73
monthMarch	52.43	1.24	42.18
monthMay	38.03	1.30	29.22
monthNovember	94.20	1.17	80.56
monthOctober	111.27	1.15	96.36
monthSeptember	106.22	1.33	79.94
dayMonday	24.35	1.02	23.87
dayThursday	6.81	0.95	7.18
dayTuesday	24.06	0.96	25.01
dayWednesday	11.61	0.94	12.34
snwd	-0.51	0.01	-44.77
tmin	-2.45	0.08	-29.42
dayMonday:tmin	-1.42	0.10	-14.03
dayThursday:tmin	-0.58	0.09	-6.33
dayTuesday:tmin	-2.30	0.09	-24.26
dayWednesday:tmin	-1.16	0.09	-12.76

Model_3

daily_delay=df %>% 
  group_by(occur_date) %>% 
  mutate(daily_delay=n()) 
lm_3=lm(daily_delay ~ month + day+ prcp,data=daily_delay)
lm_3%>% broom::tidy() %>% knitr::kable(digits = 2)

term	estimate	std.error	statistic
(Intercept)	104.95	1.13	93.02
monthDecember	95.89	1.26	76.40
monthFebruary	85.93	1.27	67.76
monthJanuary	90.24	1.21	74.78
monthJune	-8.15	1.47	-5.54
monthMarch	69.26	1.26	54.98
monthMay	19.62	1.32	14.84
monthNovember	101.28	1.21	83.53
monthOctober	95.53	1.18	81.12
monthSeptember	69.04	1.24	55.53
dayMonday	13.89	0.79	17.62
dayThursday	3.72	0.77	4.81
dayTuesday	8.62	0.78	11.05
dayWednesday	5.19	0.78	6.70
prcp	0.06	0.00	19.14

cv_df=df %>% 
  sample_n(size = 100) %>% 
  group_by(occur_date) %>% 
  mutate(daily_delay=n()) %>% 
  modelr::crossv_mc(n = 100) %>%
  mutate_if(is.character, as.factor) %>% 
  mutate(train = map(train, as.tibble), 
         test = map(test, as.tibble)) %>% 
  mutate(lm_1 = map(.x = train,~ lm(daily_delay ~ month + day  + snwd + tmin,data = .x)),
         lm_2 =map(.x = train,~ lm(daily_delay ~ month + day+ snwd + tmin+tmin*day,data = .x)),
         lm_3 =map(.x = train,~ lm(daily_delay ~ month + day+ prcp, data = .x)))

cv=cv_df %>% 
  select(starts_with("lm")) %>% 
  pivot_longer(
    starts_with("lm"),
    names_to = "model",
    values_to = "data") %>% 
  mutate(data = map(data,broom::tidy))

Cross Validation

The predictors we chose above are statistically significant.However, we notice that interaction terms are not contributing to model predictability. We would say lm_3 will be better than the other two models because its root means square error seems to be smaller. Therefore, lm_3 could be useful for predicting the total number of bus delays in Manhattan.

cv_df %>% 
  mutate(
    rmse_lm1 = map2_dbl(lm_1, test, ~rmse(model = .x, data = .y)),
    rmse_lm2 = map2_dbl(lm_2, test, ~rmse(model = .x, data = .y)),
    rmse_lm3 = map2_dbl(lm_3, test, ~rmse(model = .x, data = .y))) %>% 
  select(starts_with("rmse")) %>% 
  pivot_longer(
    everything(),
    names_to = "model", 
    values_to = "rmse",
    names_prefix = "rmse_") %>% 
  mutate(model = fct_inorder(model)) %>% 
  ggplot(aes(x = model, y = rmse)) + geom_violin()

Model Conclusion

Month, rainfall, snow depth, and day are good predictors for anticipating the number of daily delays in Manhattan.