Hypothesis test on daily delay counts over school years

Chi-Square Test of Homogeneity of Delay Reasons across School Years

When looking at the reasons for school bus delays, we see that heavy traffic is the main reason and all other reasons play a minor part. This leads us to ask whether the distribution of school bus delay reasons is the same across the 2018-2021 school years. Although the data include Fall of the 2021-2022 school year, its counts are not comparable with those of the complete school years, so we only look at three school years: 2018-2019, 2019-2020, and 2020-2021.

To answer this question, we conducted a chi-square test, with:

\(H_0\) : The distribution of delay reasons is the same across the school years.

\(H_1\) : The distribution of delay reasons is not the same across all school years.

## # A tibble: 1 x 4
##   statistic  p.value parameter method                    
##       <dbl>    <dbl>     <int> <chr>                     
## 1      214. 1.90e-35        18 Pearson's Chi-squared test
Distribution of Delay Reasons across the Years

| reason | 2018-2019 | 2019-2020 | 2020-2021 |
|---|---|---|---|
| Accident | 214 | 144 | 66 |
| Delayed by School | 64 | 28 | 9 |
| Flat Tire | 428 | 208 | 88 |
| Heavy Traffic | 32426 | 14859 | 5780 |
| Late return from Field Trip | 87 | 40 | 1 |
| Mechanical Problem | 1752 | 932 | 421 |
| Other | 1351 | 699 | 287 |
| Problem Run | 223 | 217 | 1 |
| Weather Conditions | 202 | 71 | 46 |
| Won`t Start | 456 | 179 | 69 |

We can see from the chi-square test above that we reject the null hypothesis at the 1% significance level. This means that the distribution of delay reasons differs across the school years, even though the distributions look about the same in our bar plots. Our guess is that, although heavy traffic remains the leading reason, the percentage of delays attributed to heavy traffic is actually decreasing over the years.
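The guess about heavy traffic can be checked directly from the table. The following is a minimal sketch, assuming the counts are typed in by hand from the table above; the object names `delay_counts`, `homogeneity_test`, and `heavy_share` are illustrative and not part of the original analysis. It reruns the test of homogeneity and computes the share of delays attributed to heavy traffic in each school year.

```r
# Counts copied from the table above: rows are reasons, columns are school years
delay_counts <- matrix(
  c(  214,   144,   66,
       64,    28,    9,
      428,   208,   88,
    32426, 14859, 5780,
       87,    40,    1,
     1752,   932,  421,
     1351,   699,  287,
      223,   217,    1,
      202,    71,   46,
      456,   179,   69),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    reason = c("Accident", "Delayed by School", "Flat Tire", "Heavy Traffic",
               "Late return from Field Trip", "Mechanical Problem", "Other",
               "Problem Run", "Weather Conditions", "Won`t Start"),
    school_year = c("2018-2019", "2019-2020", "2020-2021")
  )
)

# Pearson chi-square test of homogeneity; df = (10 - 1) * (3 - 1) = 18
homogeneity_test <- chisq.test(delay_counts)
homogeneity_test

# Share of each year's delays attributed to heavy traffic
heavy_share <- delay_counts["Heavy Traffic", ] / colSums(delay_counts)
round(heavy_share, 3)
```

On these counts the heavy-traffic share falls from about 87.2% in 2018-2019 to about 85.5% in 2019-2020 and 85.4% in 2020-2021, which is in line with the guess.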

Chi-Square Goodness-of-Fit Test of Delay Reasons in 2019-2020

Given that the distribution of delay reasons varies from year to year, we decided to look at the most recent pre-pandemic school year. We focus on the 2019-2020 school year because we assume the post-pandemic situation will be similar to the most recent pre-pandemic situation.

Looking at the plot of the distribution of delay reasons, we believe that the different reasons do not occur with equal frequency. We now verify this assumption using a chi-square test.

\(H_0\) : All delay reasons are equally frequent within the 2019-2020 school year.

\(H_1\) : Not all delay reasons are equally frequent within the 2019-2020 school year.

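# load the packages used below (pipe, dplyr verbs, fct_reorder)
library(tidyverse)

# count the delays by reason for the 2019-2020 school year
df_2 = df %>% 
  janitor::clean_names() %>% 
  filter(school_year == "2019-2020") %>%
  select(year, school_year, reason) %>% 
  na.omit() %>% 
  group_by(reason, school_year) %>% 
  summarize(frequency = n(), .groups = "drop") %>%
  mutate(
    reason = as.factor(reason),
    reason = fct_reorder(reason, frequency, .desc = TRUE)
  ) %>% 
  select(reason, frequency)

# chi-square goodness-of-fit test against equal frequencies
chisq.test(df_2$frequency) %>% 
  broom::tidy()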
## # A tibble: 1 x 4
##   statistic p.value parameter method                                  
##       <dbl>   <dbl>     <dbl> <chr>                                   
## 1   110549.       0         9 Chi-squared test for given probabilities
#print the table
knitr::kable(df_2,caption = "Distribution of Delay Reasons in 2019-2020")
Distribution of Delay Reasons in 2019-2020

| reason | frequency |
|---|---|
| Accident | 144 |
| Delayed by School | 28 |
| Flat Tire | 208 |
| Heavy Traffic | 14859 |
| Late return from Field Trip | 40 |
| Mechanical Problem | 932 |
| Other | 699 |
| Problem Run | 217 |
| Weather Conditions | 71 |
| Won`t Start | 179 |

The chi-square test result shows that we should reject the null hypothesis: the delay reasons are not equally frequent in the 2019-2020 school year.
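To make the size of the statistic less mysterious, it can be recomputed by hand. This is a minimal sketch, assuming the counts are copied from the 2019-2020 table above; `reason_counts`, `expected`, and `gof_statistic` are illustrative names. Under \(H_0\) each of the 10 reasons has the same expected count, and the statistic is the sum of squared deviations divided by that expected count.

```r
# 2019-2020 counts copied from the table above
reason_counts <- c(
  "Accident" = 144, "Delayed by School" = 28, "Flat Tire" = 208,
  "Heavy Traffic" = 14859, "Late return from Field Trip" = 40,
  "Mechanical Problem" = 932, "Other" = 699, "Problem Run" = 217,
  "Weather Conditions" = 71, "Won`t Start" = 179
)

# Under H0 every reason is equally likely: expected count = total / 10
expected <- sum(reason_counts) / length(reason_counts)

# Pearson statistic: sum of (observed - expected)^2 / expected
gof_statistic <- sum((reason_counts - expected)^2 / expected)
gof_statistic

# The built-in goodness-of-fit test returns the same statistic
builtin_statistic <- unname(chisq.test(reason_counts)$statistic)
builtin_statistic
```

Almost all of the statistic comes from the heavy-traffic cell, whose count of 14859 is far above the expected 1737.7.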

Hypothesis test on monthly delay counts

We can see from the above analysis that delays differ from year to year. Since the current school year is not over and its data are incomplete, we consider a completed school year instead. In the past two years, the number of delays has decreased compared with previous years. This may be due to Covid-19, or to measures taken by the New York City government. The 2019-2020 data are therefore a valuable reference.

Chi-squared test between monthly delay counts and different months

From data exploration, we noticed that delay counts vary among months. The table shows that October has almost 3,000 more delays than March, so we hypothesize that delay counts are not homogeneous across months. Because of Covid-19, schools were closed from April to June, so there was no school bus service; we do not take these three months into consideration.

Delay counts in months

| month | frequency |
|---|---|
| January | 2615 |
| February | 1788 |
| March | 916 |
| September | 2685 |
| October | 3836 |
| November | 3031 |
| December | 2506 |

We use a chi-squared test for homogeneity across months.

\(H_0\) : There is no difference in delay counts between months.

\(H_1\) : At least two months have different delay counts.

# Test the counts alone. data.matrix(month_df) keeps the month column, so
# chisq.test() would treat it as a second column of counts in a 7 x 2 table
# instead of testing the seven counts against equal expected frequencies.
chisq.test(month_df$frequency)
## 
##  Chi-squared test for given probabilities
## 
## data:  month_df$frequency
## X-squared = 2066, df = 6, p-value <2e-16

According to the chi-square test result above and the \(\chi^2\) critical value (= 12.592), we reject the null hypothesis at the 0.05 significance level and conclude that delay counts differ between months, which agrees with what we saw in the data exploration. Because of the lockdown from April to June of 2020, we have no information on those three months, so we are only testing the months from September to March.
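The critical value quoted above comes from the chi-square distribution with 7 - 1 = 6 degrees of freedom. A minimal sketch using base R's `qchisq()` and `pchisq()`; the variable names are illustrative:

```r
alpha <- 0.05          # significance level used in the text
n_months <- 7          # months with data: September to March
df_months <- n_months - 1

# Upper 5% point of the chi-square distribution with 6 degrees of freedom
critical_value <- qchisq(1 - alpha, df = df_months)
round(critical_value, 3)

# Sanity check: the upper-tail probability at the critical value is alpha
p_at_critical <- pchisq(critical_value, df = df_months, lower.tail = FALSE)
p_at_critical
```

A test statistic larger than this critical value corresponds to a p-value below 0.05.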

Hypothesis test on daily delay counts

Chi-squared test between daily delay counts and different weekdays

The previous test looked at delay counts by month. But what about different weekdays? I remember that I was always late for school on Mondays. Was it a personal problem, or were there more school bus delays on Mondays? To find out the real reason, we use a chi-squared test to see whether delay counts are homogeneous across weekdays.

Daily delay counts

| day | frequency |
|---|---|
| Monday | 13126 |
| Tuesday | 13406 |
| Wednesday | 13970 |
| Thursday | 13829 |
| Friday | 13444 |

We use a chi-squared test for homogeneity across weekdays.

\(H_0\) : There is no difference in delay counts between weekdays.

\(H_1\) : At least two weekdays have different delay counts.

# Test the counts alone. data.matrix(weekday_df) keeps the day column, so
# chisq.test() would treat it as a second column of counts in a 5 x 2 table
# instead of testing the five counts against equal expected frequencies.
chisq.test(weekday_df$frequency)
## 
##  Chi-squared test for given probabilities
## 
## data:  weekday_df$frequency
## X-squared = 34, df = 4, p-value = 6e-07

According to the chi-square test result above and the \(\chi^2\) critical value (= 9.488), we reject the null hypothesis at the 0.05 significance level and conclude that delay counts differ among weekdays. Monday, however, has the fewest delays of the five weekdays, so school bus delays cannot explain my Monday lateness. It seems that the reason I was always late on Mondays was that I still missed my weekend.
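To see where Monday sits relative to the other weekdays, the counts can be compared with the count expected under \(H_0\). This is a minimal sketch, assuming the counts are copied from the table above; `weekday_counts`, `expected_count`, and `deviation` are illustrative names.

```r
# Weekday counts copied from the table above
weekday_counts <- c(Monday = 13126, Tuesday = 13406, Wednesday = 13970,
                    Thursday = 13829, Friday = 13444)

# Under H0 each weekday has the same expected count: total / 5
expected_count <- sum(weekday_counts) / length(weekday_counts)
expected_count

# Observed minus expected for each weekday
deviation <- weekday_counts - expected_count
deviation

# Weekdays with the fewest and the most delays
fewest_day <- names(which.min(weekday_counts))
most_day <- names(which.max(weekday_counts))
c(fewest = fewest_day, most = most_day)

# Critical value for 5 - 1 = 4 degrees of freedom at the 0.05 level
critical_value_weekday <- qchisq(0.95, df = length(weekday_counts) - 1)
round(critical_value_weekday, 3)
```

On these counts Monday falls below the expected 13555 delays and Wednesday sits furthest above it.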