Hypothesis test

Hypothesis test on daily delay counts over school years

Chi-Square Testing Homogeneity between School Years and Delay Reasons

While looking at the reason for school bus delays, we see that heavy traffic is the main reason and all other reasons play a minor part. This leads us to ask the question, whether the distribution of school bus delay reasons is the same across the 2018-2021 school years. Though we include data in 2021-2022 Fall, yet the counts are not comparable with other school years. So we only look at the 3 school years including, 2018-2019, 2019-2020, 2020-2021.

To answer this question, we conducted a chi-square test, with:

\(H_0\) : The distribution of delay reasons are same across the school years

\(H_1\) : The distribution of delay reasons aren’t all the same across the school years.

## # A tibble: 1 x 4
##   statistic  p.value parameter method                    
##       <dbl>    <dbl>     <int> <chr>                     
## 1      214. 1.90e-35        18 Pearson's Chi-squared test

Distribution of Delay Reasons across the Years
school_year	2018-2019	2019-2020	2020-2021
Accident	214	144	66
Delayed by School	64	28	9
Flat Tire	428	208	88
Heavy Traffic	32426	14859	5780
Late return from Field Trip	87	40	1
Mechanical Problem	1752	932	421
Other	1351	699	287
Problem Run	223	217	1
Weather Conditions	202	71	46
Won`t Start	456	179	69

We can see from the Chi-square test above, we reject the null hypothesis at 99% significant level. This means that the reasons for delay have a different distribution across the school years, though they seem about the same in our bar plots. Our guess is that though the leading reason remains to be heavy traffic, the percentage of heavy traffic being the reason is actually decreasing over the years.

Chi-Square Testing Homogeneity between Delay Reasons in 2019-2020

Given that the distribution of delay reasons varies from year to year, we decide to look at the nearest pre-pandemic school year. The reason we focus on the 2019-2020 school year is that we assume the post-pandemic situation should be similar to the nearest pre-pandemic situation.

Looking at the plot in the distribution of delay reasons, we believe that the frequency of different reasons should not be the same. And now we want to verify our assumption using a chi-square test.

\(H_0\) : The distribution of delay reasons are same within 2019-2020 school years

\(H_1\) : The distribution of delay reasons aren’t all the same within 2019-2020 the school years.

#creating a dataframe
df_2 = df %>% 
   janitor::clean_names() %>% 
  filter(school_year =="2019-2020") %>%
  select(year,school_year,reason) %>% 
  na.omit() %>% 
  group_by(reason,school_year) %>% 
  summarize(frequency=n())%>%
  mutate(reason = as.factor(reason),
    reason = fct_reorder(reason,frequency,.desc = TRUE)
    ) %>% 
  select(reason,frequency)
  

#chi-square test
 chisq.test(df_2[,-1]) %>% 
  broom::tidy()

## # A tibble: 1 x 4
##   statistic p.value parameter method                                  
##       <dbl>   <dbl>     <dbl> <chr>                                   
## 1   110549.       0         9 Chi-squared test for given probabilities

#print the table
knitr::kable(df_2,caption = "Distribution of Delay Reasons in 2019-2020")

Distribution of Delay Reasons in 2019-2020
reason	frequency
Accident	144
Delayed by School	28
Flat Tire	208
Heavy Traffic	14859
Late return from Field Trip	40
Mechanical Problem	932
Other	699
Problem Run	217
Weather Conditions	71
Won`t Start	179

We can see from the chi-square test result we should reject the null hypothesis, the frequency of delay reasons is not equally distributed in the 2019-2020 school year.

Hypothesis test on monthly delay counts

We can see from the above analysis that the delays of each year is different. Since this year is not over and the data is not complete, we will consider last year’s situation. In the past two years, the number has decreased compared with the previous years. This may be due to the Covid-19, or the New York City government has taken certain measures. So data in 2019- 2020 is of a great reference value.

Chi-squared test between monthly delay counts and different months

From data exploration, we have noticed that the delay counts vary among months. We see from the table that the number of delays in October is different from that in March by almost 3000, and thus, we propose the hypothesis that there is no homogeneity in delay counts in each month. Because of Covid-19, schools were closed from April to June so there were no school bus services. We do not take these three months into consideration.

Delay counts in months
month	frequency
January	2615
February	1788
March	916
September	2685
October	3836
November	3031
December	2506

We use Chi-squared test for homogeneity of months.

\(H_0\) : there’s no difference of delay counts between months.

\(H_1\) : at least two delays counts of months are not equal.

month_df =
  month_df %>% 
  data.matrix()

chisq.test(month_df)

## 
##  Pearson's Chi-squared test
## 
## data:  month_df
## X-squared = 10, df = 6, p-value = 0.1

According to above chi-square test result and the x critical value ( = 12.592), We fail to reject the null hypothesis and conclude that there’s no difference in delay counts between months. at 0.05 significant level. The counter-intuitive result may be caused by the lockdown from April to May of 2020.We don’t have any information on these 2 months. Therefore we are only testing the “winter month”.

Hypothesis test on daily delay counts

Chi-squared test between daily delay counts and different weekdays

We know there’s no difference in delay counts between months. But what about different weekdays? I remembered that I was always late for school on Monday. Is it my personal problem? Maybe because of more school bus delays on Monday? To find out the real reason, we use Chi-squared test to see whether there is homogeneity in delay counts for each weekday.

Daily delay counts
day	frequency
Monday	13126
Tuesday	13406
Wednesday	13970
Thursday	13829
Friday	13444

We use Chi-squared test for homogeneity of weekdays.

\(H_0\) : there’s no difference of delay counts for each weekdays.

\(H_1\) : at least two delays counts of weekdays are not equal.

weekday_df =
  weekday_df %>% 
  data.matrix()

chisq.test(weekday_df)

## 
##  Pearson's Chi-squared test
## 
## data:  weekday_df
## X-squared = 3, df = 4, p-value = 0.5

According to above chi-square test result and the x critical value ( = 9.488). We fail to reject the null hypothesis and conclude that there’s no statistical difference among weekdays at 0.05 significant level. Unfortunately, there is homogeneity in the delay counts for each weekday. It seems that the reason that I was always late on Monday was I still missed my weekend.