Looking at the reasons for school bus delays, we see that heavy traffic is by far the main reason and all other reasons play a minor part. This leads us to ask whether the distribution of delay reasons is the same across the 2018-2021 school years. Although our data include Fall 2021-2022, the counts for that partial year are not comparable with the other school years, so we only look at three school years: 2018-2019, 2019-2020, and 2020-2021.
To answer this question, we conducted a chi-square test, with:
\(H_0\) : The distribution of delay reasons is the same across the school years.
\(H_1\) : The distribution of delay reasons is not the same across all the school years.
## # A tibble: 1 x 4
## statistic p.value parameter method
## <dbl> <dbl> <int> <chr>
## 1 214. 1.90e-35 18 Pearson's Chi-squared test
reason | 2018-2019 | 2019-2020 | 2020-2021 |
---|---|---|---|
Accident | 214 | 144 | 66 |
Delayed by School | 64 | 28 | 9 |
Flat Tire | 428 | 208 | 88 |
Heavy Traffic | 32426 | 14859 | 5780 |
Late return from Field Trip | 87 | 40 | 1 |
Mechanical Problem | 1752 | 932 | 421 |
Other | 1351 | 699 | 287 |
Problem Run | 223 | 217 | 1 |
Weather Conditions | 202 | 71 | 46 |
Won`t Start | 456 | 179 | 69 |
The chi-square test above rejects the null hypothesis at the 1% significance level (p = 1.90e-35). The delay reasons therefore have a different distribution across the school years, even though they look about the same in our bar plots. Heavy traffic remains the leading reason, but its share of all delays in the table declines over the years, from about 87.2% in 2018-2019 to 85.5% in 2019-2020 and 85.4% in 2020-2021.
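The code for this first test is not shown in the report, so the following is only a minimal sketch of how the result could be reproduced from the table. It assumes the table holds exactly the counts that were tested; the names `delay_counts`, `years_test`, and `reason_share` are hypothetical.

```r
# Counts copied from the table above: rows are delay reasons, columns are school years.
delay_counts <- matrix(
  c(  214,   144,   66,
       64,    28,    9,
      428,   208,   88,
    32426, 14859, 5780,
       87,    40,    1,
     1752,   932,  421,
     1351,   699,  287,
      223,   217,    1,
      202,    71,   46,
      456,   179,   69),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    reason = c("Accident", "Delayed by School", "Flat Tire", "Heavy Traffic",
               "Late return from Field Trip", "Mechanical Problem", "Other",
               "Problem Run", "Weather Conditions", "Won`t Start"),
    school_year = c("2018-2019", "2019-2020", "2020-2021")
  )
)

# Pearson chi-square test of homogeneity across the three school years.
years_test <- chisq.test(delay_counts)
years_test$parameter  # degrees of freedom: (10 - 1) * (3 - 1)
years_test$p.value

# Share of each reason within a school year (each column sums to 1).
reason_share <- prop.table(delay_counts, margin = 2)
round(reason_share["Heavy Traffic", ], 3)
```

Looking at `reason_share` row by row is a quick way to see which reasons drive the rejection, since the test itself only says that at least one share changes between years.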
Because the distribution of delay reasons varies from year to year, we focus on a single year: 2019-2020, the nearest pre-pandemic school year. We choose it because we assume the post-pandemic situation should be similar to the nearest pre-pandemic situation.
Looking at the plot of the distribution of delay reasons, we believe that the reasons do not occur with equal frequency. We now verify this assumption with a chi-square test.
\(H_0\) : The delay reasons occur with equal frequency within the 2019-2020 school year.
\(H_1\) : The delay reasons do not all occur with equal frequency within the 2019-2020 school year.
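library(tidyverse)  # dplyr verbs and forcats::fct_reorder

# Count delays by reason for the 2019-2020 school year
df_2 <- df %>%
  janitor::clean_names() %>%
  filter(school_year == "2019-2020") %>%
  select(year, school_year, reason) %>%
  na.omit() %>%
  group_by(reason, school_year) %>%
  # drop the grouping so that reason can be recoded across all rows
  summarize(frequency = n(), .groups = "drop") %>%
  mutate(
    reason = as.factor(reason),
    reason = fct_reorder(reason, frequency, .desc = TRUE)
  ) %>%
  select(reason, frequency)

# Chi-square goodness-of-fit test on the counts (equal probabilities under H0)
chisq.test(df_2$frequency) %>%
  broom::tidy()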
## # A tibble: 1 x 4
## statistic p.value parameter method
## <dbl> <dbl> <dbl> <chr>
## 1 110549. 0 9 Chi-squared test for given probabilities
#print the table
knitr::kable(df_2,caption = "Distribution of Delay Reasons in 2019-2020")
reason | frequency |
---|---|
Accident | 144 |
Delayed by School | 28 |
Flat Tire | 208 |
Heavy Traffic | 14859 |
Late return from Field Trip | 40 |
Mechanical Problem | 932 |
Other | 699 |
Problem Run | 217 |
Weather Conditions | 71 |
Won`t Start | 179 |
The chi-square test result shows that we should reject the null hypothesis: the delay reasons do not occur with equal frequency in the 2019-2020 school year.
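To make the size of the statistic less mysterious, here is a minimal sketch that recomputes it by hand from the 2019-2020 table. The names `observed`, `expected`, `statistic`, and `critical` are hypothetical, and the 0.05 level used for the critical value is an assumption, since the text does not state a level for this test.

```r
# Counts copied from the 2019-2020 table above, in table order.
observed <- c(144, 28, 208, 14859, 40, 932, 699, 217, 71, 179)

# Under H0 each of the 10 reasons is equally likely,
# so every expected count is the total divided by 10.
expected <- rep(sum(observed) / length(observed), length(observed))

# Pearson statistic: sum of (observed - expected)^2 / expected.
statistic <- sum((observed - expected)^2 / expected)

# Critical value at the 0.05 level with 10 - 1 = 9 degrees of freedom.
critical <- qchisq(0.95, df = length(observed) - 1)

c(statistic = statistic, critical = critical)
```

Almost all of the statistic comes from the Heavy Traffic cell, whose observed count is many times larger than the expected count under equal frequencies.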
The analysis above shows that delays differ from year to year. Because the current school year is not over and its data are incomplete, we consider last year's situation instead. In the past two years, the number of delays has decreased compared with previous years, possibly because of Covid-19 or because of measures taken by the New York City government. The 2019-2020 data are therefore a valuable reference.
From data exploration, we noticed that delay counts vary among months. The table shows that the number of delays in October exceeds that in March by almost 3000, so we hypothesize that delay counts are not homogeneous across months. Because of Covid-19, schools were closed from April to June and there were no school bus services, so we do not take these three months into consideration.
month | frequency |
---|---|
January | 2615 |
February | 1788 |
March | 916 |
September | 2685 |
October | 3836 |
November | 3031 |
December | 2506 |
We use a chi-squared test for homogeneity across months.
\(H_0\) : delay counts do not differ between months.
\(H_1\) : delay counts differ for at least two months.
# data.matrix() converts every column of month_df, including the month labels, to numbers
month_df <-
  month_df %>%
  data.matrix()
chisq.test(month_df)
##
## Pearson's Chi-squared test
##
## data: month_df
## X-squared = 10, df = 6, p-value = 0.1
According to the chi-square test result above and the critical value \(\chi^2_{0.95,\,6} = 12.592\), we fail to reject the null hypothesis at the 0.05 significance level, so this test gives no evidence of a difference in delay counts between months. This counter-intuitive result may be caused by the lockdown from April to June of 2020: we have no information on these three months, so we are only testing the months from September through March.
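For readers reproducing this step, it helps to know that `data.matrix()` keeps the month column and turns its labels into integer codes. `chisq.test()` then receives a 7 x 2 matrix and runs a Pearson test on that table, which is why the output reports df = 6 = (7 - 1)(2 - 1). The sketch below is an alternative, not the code used for the result above: it runs a goodness-of-fit test on the counts alone. The names `month_counts`, `month_df_sketch`, and `month_test` are hypothetical, and the two-column layout of the original `month_df` is an assumption inferred from the table and the reported degrees of freedom.

```r
# Counts copied from the month table above.
month_counts <- c(January = 2615, February = 1788, March = 916,
                  September = 2685, October = 3836, November = 3031,
                  December = 2506)

# Assumed shape of the original data frame: one label column, one count column.
month_df_sketch <- data.frame(month = names(month_counts),
                              frequency = unname(month_counts))

# data.matrix() converts the labels to integer codes, giving 7 rows and 2 columns.
dim(data.matrix(month_df_sketch))

# Goodness-of-fit test on the counts only:
# H0 says delays are spread evenly over the 7 months.
month_test <- chisq.test(month_counts)
month_test$statistic

# 0.05-level critical value with 7 - 1 = 6 degrees of freedom.
qchisq(0.95, df = month_test$parameter)
```

On the counts in the table, this statistic is far above the 12.592 critical value, so the conclusion depends on which form of the test is applied.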
The test above found no evidence of a difference in delay counts between months. But what about different weekdays? I remember that I was always late for school on Mondays. Was that just me, or were there more school bus delays on Mondays? To find out, we use a chi-squared test to see whether delay counts are homogeneous across weekdays.
day | frequency |
---|---|
Monday | 13126 |
Tuesday | 13406 |
Wednesday | 13970 |
Thursday | 13829 |
Friday | 13444 |
We use a chi-squared test for homogeneity across weekdays.
\(H_0\) : delay counts do not differ between weekdays.
\(H_1\) : delay counts differ for at least two weekdays.
# data.matrix() converts every column of weekday_df, including the day labels, to numbers
weekday_df <-
  weekday_df %>%
  data.matrix()
chisq.test(weekday_df)
##
## Pearson's Chi-squared test
##
## data: weekday_df
## X-squared = 3, df = 4, p-value = 0.5
According to the chi-square test result above and the critical value \(\chi^2_{0.95,\,4} = 9.488\), we fail to reject the null hypothesis at the 0.05 significance level: this test shows no statistically significant difference among weekdays. Unfortunately for my excuse, the delay counts look homogeneous across weekdays. It seems the real reason I was always late on Mondays was that I still missed my weekend.
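As a cross-check on the weekday result, the sketch below runs a goodness-of-fit test on the five counts alone. It is an alternative formulation, not the code that produced the output above: there, `data.matrix()` passes the day labels to `chisq.test()` as a second numeric column, so the test is run on a 5 x 2 table (df = 4 = (5 - 1)(2 - 1)). The names `weekday_counts` and `weekday_test` are hypothetical.

```r
# Counts copied from the weekday table above.
weekday_counts <- c(Monday = 13126, Tuesday = 13406, Wednesday = 13970,
                    Thursday = 13829, Friday = 13444)

# Goodness-of-fit test: H0 says delays are spread evenly over the 5 weekdays.
weekday_test <- chisq.test(weekday_counts)
weekday_test$statistic

# 0.05-level critical value with 5 - 1 = 4 degrees of freedom.
qchisq(0.95, df = weekday_test$parameter)

# Signed deviation of each weekday from the expected count.
weekday_counts - weekday_test$expected
```

On these counts the statistic is about 34.4, above the 9.488 critical value, and the deviations show Monday with the fewest delays of the five days rather than the most.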