The goal of this analysis is to understand which factors contributed to school bus delays in Manhattan from Fall 2018 to Fall 2021. We hypothesized that heavy traffic and weather conditions are important contributors. Because fall and winter weather is changeable, rain and snow can worsen road conditions and cause more accidents, which in turn makes heavy traffic, and therefore bus delays, more likely. We use data reported in real time by school bus vendors operating in the field. The dataset records the following reasons for delay: accident, delayed by school, flat tire, heavy traffic, late return from field trip, mechanical problem, problem run, weather conditions, won't start, and other.
As the graphs below show, the leading reason for bus delays was heavy traffic, which accounts for roughly 87% of all delays. The 2018-2019 school year had the most delays, with more than 30,000. Mechanical problems are the second most common reason, accounting for roughly 5% of all delays. Overall, the number of delays fell in each subsequent school year.
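# Load packages (read_csv, dplyr verbs, forcats, ggplot2, and plotly)
library(tidyverse)
library(plotly)
# Import data
df = read_csv("data/clean_data.csv")
# Plot data: keep Manhattan rows for the three school years and drop NA rows
plot_df = df %>%
  filter(boro == "Manhattan",
         year %in% c(2018, 2019, 2020, 2021),
         school_year != "2021-2022") %>%
  select(year, school_year, reason, day, month, prcp:tmin) %>%
  na.omit()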
The table below shows the number of delays for specific factors in each school year.
plot_df %>%
  filter(school_year != "2021-2022") %>%
  group_by(reason, school_year) %>%
  summarize(frequency = n()) %>%
  pivot_wider(
    names_from = school_year,
    values_from = frequency) %>%
  knitr::kable(digits = 2)
reason | 2018-2019 | 2019-2020 | 2020-2021 |
---|---|---|---|
Accident | 214 | 144 | 66 |
Delayed by School | 64 | 28 | 9 |
Flat Tire | 428 | 208 | 88 |
Heavy Traffic | 32426 | 14859 | 5780 |
Late return from Field Trip | 87 | 40 | 1 |
Mechanical Problem | 1752 | 932 | 421 |
Other | 1351 | 699 | 287 |
Problem Run | 223 | 217 | 1 |
Weather Conditions | 202 | 71 | 46 |
Won`t Start | 456 | 179 | 69 |
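The shares quoted in the text can be checked directly from the counts in the table above. The following is a minimal base R sketch, not part of the original analysis: the matrix `counts` is typed in by hand from the table, and the names `year_totals`, `overall_share`, and `yearly_share` are introduced here for illustration only.

```r
# Delay counts copied from the table above (rows: reason, columns: school year).
counts <- matrix(
  c(214, 144, 66,
    64, 28, 9,
    428, 208, 88,
    32426, 14859, 5780,
    87, 40, 1,
    1752, 932, 421,
    1351, 699, 287,
    223, 217, 1,
    202, 71, 46,
    456, 179, 69),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("Accident", "Delayed by School", "Flat Tire", "Heavy Traffic",
      "Late return from Field Trip", "Mechanical Problem", "Other",
      "Problem Run", "Weather Conditions", "Won`t Start"),
    c("2018-2019", "2019-2020", "2020-2021")
  )
)

# Total number of delays in each school year.
year_totals <- colSums(counts)

# Share of each reason across all three school years, in percent.
overall_share <- round(100 * rowSums(counts) / sum(counts), 1)

# Share of each reason within each school year, in percent.
yearly_share <- round(100 * prop.table(counts, margin = 2), 1)

print(year_totals)
print(sort(overall_share, decreasing = TRUE))
print(yearly_share)
```

Computing the overall share from raw counts, rather than from per-year proportions that were already rounded, avoids accumulating rounding error when the years are combined.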
The stacked bar graph below shows the number of delays in each school year.
plot_2 = plot_df %>%
  mutate(reason = fct_infreq(reason)) %>%
  ggplot(aes(x = school_year, fill = reason)) +
  geom_bar() +
  labs(title = "The Contributing Factors for Bus Delay in each School Year")
ggplotly(plot_2)
The pie chart below shows the proportion of the specific factors across three years.
plot_3 = plot_df %>%
  filter(school_year != "2021-2022") %>%
  group_by(school_year, reason) %>%
  summarize(frequency = n()) %>%
  # summarize() already dropped the reason grouping; the data stay grouped by
  # school_year, so each proportion below is a share within its school year
  ungroup(reason) %>%
  mutate(school_year = factor(school_year),
         school_year = fct_relevel(school_year, "2018-2019", "2019-2020", "2020-2021"),
         proportion = round(frequency / sum(frequency), 2),
         reason = factor(reason),
         reason = fct_reorder(reason, proportion, .desc = TRUE))
plot_3 %>%
  plot_ly(labels = ~reason, values = ~proportion, type = 'pie') %>%
  layout(title = 'Top Reasons for Bus Delay')
Based on the charts above, heavy traffic accounts for the majority of delays in Manhattan, so we focus on the relationship between heavy traffic and other possible factors. The graph below shows no clear association between heavy traffic and heavy rain, so weather may not be the main cause of bus delays. Interestingly, heavy traffic is most frequent at around 100 mm of rainfall, which is considered light or moderate rain.
plot_6 = plot_df %>%
  filter(school_year != "2021-2022",
         # %in% tests membership; == would recycle the vector and drop rows
         reason %in% c("Heavy Traffic", "Mechanical Problem", "Other"),
         # keep rainy days only
         prcp != 0) %>%
  group_by(school_year, reason, prcp) %>%
  summarize(count = n()) %>%
  ggplot(aes(x = prcp, y = count, color = school_year)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(
    title = "The Distribution for Rain across the School Years",
    x = "Precipitation (mm)",
    y = "Number of delays")
ggplotly(plot_6)
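When a filter has to keep several delay reasons at once, the choice between `==` and `%in%` matters. The sketch below uses a small hypothetical `reason` vector (not taken from the dataset) to show how `==` recycles the comparison vector position by position, while `%in%` tests membership.

```r
# Hypothetical reasons, for illustration only.
reason <- c("Heavy Traffic", "Other", "Mechanical Problem",
            "Heavy Traffic", "Other", "Mechanical Problem")
keep <- c("Heavy Traffic", "Mechanical Problem", "Other")

# `==` compares element by element and recycles `keep`, so a row matches
# only when its reason happens to line up with the recycled position.
recycled <- reason == keep

# `%in%` asks whether each reason appears anywhere in `keep`.
member <- reason %in% keep

print(sum(recycled))  # rows kept by the positional comparison
print(sum(member))    # rows kept by the membership test
```

Every element of `reason` is one of the three wanted values, yet the positional comparison keeps only some of them, which is why a filter written with `==` can silently drop rows.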
Interestingly, snow depth and the number of heavy-traffic delays are inversely related in the 2018-2019 school year, whereas there appears to be no association between them in 2020-2021. We could not analyze heavy traffic and snow depth in 2019-2020 because that school year had too few snowy days.
plot_6 = plot_df %>%
  # keep heavy-traffic delays on days with snow on the ground
  filter(reason == "Heavy Traffic",
         snwd != 0) %>%
  group_by(school_year, snwd) %>%
  summarize(freq = n()) %>%
  ggplot(aes(x = snwd, y = freq, color = school_year)) +
  geom_point(alpha = 0.4) +
  geom_smooth(se = FALSE) +
  labs(
    title = "The Scatterplot for Snow Depth across The School Years",
    x = "Snow depth (mm)",
    y = "Number of delays")
ggplotly(plot_6)
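The visual impression of an inverse relationship could be backed by a rank correlation per school year. This is a sketch under stated assumptions rather than the analysis actually performed: the data frame `toy` holds made-up snow depths and delay counts, and Spearman's coefficient is our choice of measure, not one the report names.

```r
# Hypothetical snow depths (mm) and heavy-traffic delay counts; not real data.
toy <- data.frame(
  school_year = rep(c("2018-2019", "2020-2021"), each = 5),
  snwd = rep(c(25, 50, 75, 100, 150), times = 2),
  freq = c(40, 31, 22, 15, 9,    # steadily decreasing with snow depth
           12, 15, 11, 14, 12)   # no clear trend
)

# Spearman rank correlation between snow depth and delay count, per school year.
rho <- sapply(
  split(toy, toy$school_year),
  function(d) cor(d$snwd, d$freq, method = "spearman")
)

print(round(rho, 2))
```

A coefficient near -1 indicates a consistently decreasing relationship, while a value near 0 indicates no monotone association. With real data, the same call could be applied to the `school_year`, `snwd`, and `freq` columns produced by the summarizing step above.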
The graph below shows that heavy traffic occurs most often on Thursdays in all three school years and is least likely on Mondays in 2018-2019 and 2019-2020. Because of COVID-19, people spent more time at home or in quarantine, which may explain why the number of heavy-traffic delays declined substantially. It may also explain why the delays became evenly distributed across the weekdays.
plot_df %>%
  filter(reason == "Heavy Traffic") %>%
  group_by(school_year, day) %>%
  summarise(n = n()) %>%
  # scale each count by the divisor chosen for its school year
  mutate(amount = case_when(school_year == "2020-2021" ~ n / 50,
                            school_year != "2020-2021" ~ n / 52)) %>%
  plot_ly(x = ~day, y = ~amount, type = 'scatter', mode = 'lines', color = ~school_year) %>%
  layout(
    title = 'Heavy Traffic Per Weekday Across School Years',
    xaxis = list(
      type = 'category',
      title = 'Weekday'),
    yaxis = list(
      title = 'Count of Heavy Traffic per Weekday'))
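The plotted value is a scaled count rather than a raw one. The sketch below reproduces that arithmetic on hypothetical Thursday totals. We assume the divisors 52 and 50 stand for the approximate number of weeks counted in each school year; the source does not state this, and the totals in `thursday` are invented for illustration.

```r
# Hypothetical Thursday totals of heavy-traffic delays per school year.
thursday <- data.frame(
  school_year = c("2018-2019", "2019-2020", "2020-2021"),
  n = c(7800, 3640, 1250)
)

# Divisors as in the code above: 50 for 2020-2021, 52 otherwise
# (assumed here to be week counts).
divisor <- ifelse(thursday$school_year == "2020-2021", 50, 52)

# Average number of heavy-traffic delays per Thursday.
thursday$amount <- thursday$n / divisor

print(thursday)
```

Dividing by the number of weeks puts the three school years on a comparable per-weekday scale, so a year with fewer weeks of service is not penalized in the comparison.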