if (!require("tidyverse")) install.packages("tidyverse")
# Let's start with Fare Revenue
library(tidyverse)
<- readxl::read_xlsx("2022_fare_revenue.xlsx") |>
FARES select(
-`State/Parent NTD ID`,
-`Reporter Type`,
-`Reporting Module`,
-`TOS`,
-`Passenger Paid Fares`,
-`Organization Paid Fares`
|>
) filter(`Expense Type` == "Funds Earned During Period") |>
select(-`Expense Type`) |>
group_by(
`NTD ID`, # Sum over different `TOS` for the same `Mode`
`Agency Name`, # These are direct operated and sub-contracted
`Mode`
|> # of the same transit modality
) # Not a big effect in most munis (significant DO
# tends to get rid of sub-contractors), but we'll sum
# to unify different passenger experiences
summarize(`Total Fares` = sum(`Total Fares`)) |>
ungroup()
# Next, expenses
<- readr::read_csv("2022_expenses.csv") |>
EXPENSES select(
`NTD ID`,
`Agency`,
`Total`,
`Mode`
|>
) mutate(`NTD ID` = as.integer(`NTD ID`)) |>
rename(Expenses = Total) |>
group_by(`NTD ID`, `Mode`) |>
summarize(Expenses = sum(Expenses)) |>
ungroup()
<- inner_join(FARES, EXPENSES, join_by(`NTD ID`, `Mode`))
FINANCIALS
# Monthly Transit Numbers
library(tidyverse)
<- readxl::read_xlsx("ridership.xlsx", sheet = "UPT") |>
TRIPS filter(`Mode/Type of Service Status` == "Active") |>
select(
-`Legacy NTD ID`,
-`Reporter Type`,
-`Mode/Type of Service Status`,
-`UACE CD`,
-`TOS`
|>
) pivot_longer(-c(`NTD ID`:`3 Mode`),
names_to = "month",
values_to = "UPT"
|>
) drop_na() |>
mutate(month = my(month)) # Parse _m_onth _y_ear date specs
<- readxl::read_xlsx("ridership.xlsx", sheet = "VRM") |>
MILES filter(`Mode/Type of Service Status` == "Active") |>
select(
-`Legacy NTD ID`,
-`Reporter Type`,
-`Mode/Type of Service Status`,
-`UACE CD`,
-`TOS`
|>
) pivot_longer(-c(`NTD ID`:`3 Mode`),
names_to = "month",
values_to = "VRM"
|>
) drop_na() |>
group_by(
`NTD ID`, `Agency`, `UZA Name`,
`Mode`, `3 Mode`, month
|>
) summarize(VRM = sum(VRM)) |>
ungroup() |>
mutate(month = my(month)) # Parse _m_onth _y_ear date specs
<- inner_join(TRIPS, MILES) |>
USAGE mutate(`NTD ID` = as.integer(`NTD ID`))
# End of data ingestion and setup
Mini-Project #01: Fiscal Characteristics of Major US Public Transit Systems
By Chris Liu
Introduction
The public transit systems found nationwide in the United States play an important role in mobilizing people in their daily lives. This analysis will give an overview of how the transit systems perform relative to another by examining key metrics such as farebox revenues, total number of trips, total vehicle miles traveled and total revenues and expenses by source.
This analysis will examine information from 2022 that utilizes data from fare revenue, monthly ridership, and operating expense reports. By evaluating the metrics mentioned earlier, this analysis will identify trends, expose common challenges, and offer insight into transit performances. Various transit performance metrics will be analyzed to gain a general overview of the data. Based on these findings, transit system efficiency will be evaluated on a comparative basis. The working data sets are provided by the Federal Transit Administration. The latter half of the analysis will examine metrics that can define a transit system as efficient. The evaluation of efficiency will depend on the interpretation of what makes something efficient.
Analysis
Preparing, Cleaning & Loading the Dataset
The relevant data sets used in the analysis can be found here:
The first step in the analysis is to ingest the relevant data tables and prepare them for data analysis using R. The following code will clean and join the tables into relevant dataframes used in the analysis. The output will create the following dataframes named: FARES
, EXPENSES
, FINANCIALS
, TRIPS
, MILES
, and USAGE
.
Here, a summary table of USAGE
is created to get an introductory visualization of the table that will be used for analysis.
if (!require("DT")) install.packages("DT")
library(DT)
# Initialize a table to begin analysis
sample_n(USAGE, 1000) |>
mutate(month = as.character(month)) |>
::datatable() DT
Table 1: Quick Overview of USAGE
Transforming Data Table
The analysis will be initially conducted using the dataframe USAGE
. Some of the provided labels are cumbersome to work with in R. It is doable, but we can make our lives easier by renaming them. The first task at hand is to rename the column UZA Name
to metro_area
. The following code will show how that is done.
<- USAGE |>
USAGE rename(metro_area = `UZA Name`)
Each transportation Mode
is represented by a two letter code, for example HR = Heavy Rail. The two letter codes aren’t meaningful to us, as it’s impossible to guess what they are. The first thing I did to clean this portion up was to find all the codes used by the Federal Transit Administration. Running the following will give us the list of codes found in the data set.
library(dplyr)
<- USAGE |>
unique_modes distinct(Mode)
print(unique_modes)
# A tibble: 18 × 1
Mode
<chr>
1 DR
2 FB
3 MB
4 SR
5 TB
6 VP
7 CB
8 RB
9 LR
10 YR
11 MG
12 CR
13 AR
14 TR
15 HR
16 IP
17 PB
18 CC
Once all the codes have been identified, the meanings can be found in the National Transit Database glossary. Using the mutate
function, all the codes can be changed into meaningful definitions.
<- USAGE |>
USAGE mutate(Mode = case_when(
== "DR" ~ "Demand Response",
Mode == "FB" ~ "Ferryboat",
Mode == "MB" ~ "Bus",
Mode == "SR" ~ "Streetcar Rail",
Mode == "TB" ~ "Trolleybus",
Mode == "VP" ~ "Vanpool",
Mode == "CB" ~ "Commuter Bus",
Mode == "RB" ~ "Bus Rapid Transit",
Mode == "LR" ~ "Light Rail",
Mode == "YR" ~ "Hybrid Rail",
Mode == "MG" ~ "Monorail and Automated Guideway modes",
Mode == "CR" ~ "Commuter Rail",
Mode == "AR" ~ "Alaska Railroad",
Mode == "TR" ~ "Aerial Tramways",
Mode == "HR" ~ "Heavy Rail",
Mode == "IP" ~ "Inclined Plane",
Mode == "PB" ~ "Publico",
Mode == "CC" ~ "Cable Car",
Mode TRUE ~ "Unknown"
))
Now that the data has been cleaned up to a new extent, let’s create a new summary table from USAGE
. To make the outputted table as clean as possible, I’ve opted to get rid of irrelevant columns and change the acronyms to display meaningful words. UPT
stands for unlinked passenger trips and VRM
stands for vehicle revenue miles.
datatable(
sample_n(USAGE, 1000) |>
mutate(month = as.character(month)) |>
select(-`NTD ID`, -`3 Mode`) |> # exclude ntd id and 3 mode in visual table
rename(
`Metro Area` = metro_area, # rename for table output to look cleaner
`Unlinked Passenger Trips` = UPT, # rename acronym in visual table
`Vehicle Revenue Miles` = VRM # rename acronym in visual table
) )
Table 2: Cleaned Up Version of USAGE
<- sum(is.na(USAGE))
na_count print(na_count)
[1] 0
This code returns 0, which lets me know there are no missing values in the data I’m working with. With this reassurance, I won’t be using the na.rm=TRUE
statement in any of my code. However, if the opposite were true instead, I would make sure to use the above statement while utilizing aggregate functions.
Initial Metrics of Interest
Now that there is a clean table to work with, some questions of interest about the data can be explored. The following questions will explore the use of the following functions filter
, group_by
, summarize
, and arrange
.
- What transit agency had the most total VRM in this sample?
- What transit mode had the most total VRM in this sample?
- How many trips were taken on the NYC Subway (Heavy Rail) in May 2024?
- How much did NYC subway ridership fall between April 2019 and April 2020?
To find the transit agency with the most total VRM from the sample, I need to group the data based on the
Agency
and its respectiveVRM
total. It turns out the MTA New York City Transit has reign over total VRM among the agencies with 10,832,855,350 miles.<- USAGE |> most_vrm_agency group_by(Agency) |> summarize(total_vrm = sum(VRM)) |> arrange(desc(total_vrm)) print(most_vrm_agency)
# A tibble: 677 × 2 Agency total_vrm <chr> <dbl> 1 MTA New York City Transit 1.08e10 2 New Jersey Transit Corporation 5.65e 9 3 Los Angeles County Metropolitan Transportation Authority 4.35e 9 4 Washington Metropolitan Area Transit Authority 2.82e 9 5 Chicago Transit Authority 2.81e 9 6 Southeastern Pennsylvania Transportation Authority 2.67e 9 7 Massachusetts Bay Transportation Authority 2.38e 9 8 Pace, the Suburban Bus Division of the Regional Transportation Aut… 2.38e 9 9 Metropolitan Transit Authority of Harris County, Texas 2.27e 9 10 Denver Regional Transportation District 1.99e 9 # ℹ 667 more rows
Alternative code block regardingslice_head
:I purposely chose not to include a
slice_head
function to get a comparative overview of the data. Here, the MTA had an overwhelming total over the other agencies, which was an interesting finding. I stuck with the same philosophy throughout most of this analysis since I was interested in comparing the sheer numbers as well, not just the specific metric I was inquiring about.<- USAGE |> most_vrm_agency1 group_by(Agency) |> summarize(total_vrm = sum(VRM)) |> arrange(desc(total_vrm)) |> slice_head(n = 1) print(most_vrm_agency1)
# A tibble: 1 × 2 Agency total_vrm <chr> <dbl> 1 MTA New York City Transit 10832855350
To find the transit mode with the most total VRM from the sample, I need to group the data based on the
Mode
and its respectiveVRM
total. By a large margin of 49,444,494,088 miles, the bus(MB)Mode
had the most total VRM from the sample.<- USAGE |> most_vrm_mode group_by(Mode) |> summarize(total_vrm = sum(VRM)) |> arrange(desc(total_vrm)) print(most_vrm_mode)
# A tibble: 18 × 2 Mode total_vrm <chr> <dbl> 1 Bus 49444494088 2 Demand Response 17955073508 3 Heavy Rail 14620362107 4 Commuter Rail 6970644241 5 Vanpool 3015783362 6 Light Rail 2090094714 7 Commuter Bus 1380948975 8 Publico 1021270808 9 Trolleybus 236840288 10 Bus Rapid Transit 118425283 11 Ferryboat 65589783 12 Streetcar Rail 63389725 13 Monorail and Automated Guideway modes 37879729 14 Hybrid Rail 37787608 15 Alaska Railroad 13833261 16 Cable Car 7386019 17 Inclined Plane 705904 18 Aerial Tramways 292860
To find how many trips were taken on the NYC Subway in May 2024, there were multiple criteria to consider here. A filter needs to be used in order to address the transit Mode, month, and Agency. In this case, I made the assumption that the NYC Subway is only operated by the MTA New York City Transit. In May 2024, there were a total of 180,000,000 (1.80e8) trips taken.
<- USAGE |> nyc_subway_trips filter( == "MTA New York City Transit", Agency == "Heavy Rail", Mode >= as.Date("2024-05-01") & month <= as.Date("2024-05-31") month )print(nyc_subway_trips)
# A tibble: 1 × 8 `NTD ID` Agency metro_area Mode `3 Mode` month UPT VRM <int> <chr> <chr> <chr> <chr> <date> <dbl> <dbl> 1 20008 MTA New York City… New York-… Heav… Rail 2024-05-01 1.80e8 3.00e7
Information Regarding themonth
ColumnAfter going through this question at hand, I was able to identify that the
month
is stored as just the first of the month each year. I confirmed that there is only one entry per month for each respective agency, metro area, and mode. Moving forward with any month filters, I don’t have to worry about including the full month as it will always only be in the format of the first of the month in each year. This understanding will be seen in later examples.To find the ridership difference between April 2019 and April 2020, I need to find the amount of trips taken in each month-year and subtract from one another. Again, I made the assumption that the NYC Subway is only operated by the MTA New York City Transit. In the following code, you can see the difference when filtering the
month
as referenced in the call-out above! The ridership fell by 211,969,660 trips between April 2019 and April 2020. I interpreted this metric as the difference of trips between each respective month-year, not the total difference in between.<- USAGE |> ride_fall filter(Mode == "Heavy Rail") |> filter(Agency == "MTA New York City Transit") |> # this is the agency that runs nyc subway filter(month %in% c(as.Date("2019-04-01"), as.Date("2020-04-01"))) |> group_by(month) |> summarize(total_rides = sum(UPT)) |> summarize(difference = total_rides[month == as.Date("2020-04-01")] - == as.Date("2019-04-01")]) total_rides[month print(ride_fall)
# A tibble: 1 × 1 difference <dbl> 1 -211969660
Additional Metrics of Interest
Asides from the metrics explored above, there are a variety of other questions that can be asked from the data. In this section, I will explore three other areas of interest. The data is not limited to the following questions discussed, there are a multitude of statistics that can be uncovered. For the following questions I asked, I am trying to utilize as many R
functions as possible.
- What
UZA Name
/metro_area
had the mostUPT
in January 2022? - What month and year had the most
UPT
through thebus (MB)
in the entire sample? - What is the average amount of trips taken in the New York–Jersey City–Newark, NY–NJ area based on the season from 2018 to 2022?
The areas of interest require filtering a date, grouping by a variable, and aggregating a variable. In January 2022, the New York City–Jersey City–Newark, NY–NJ area recorded the most trips with 173,719,501 trips. The second-ranked area had only 26,158,306 trips, accounting for just 15% of the total for New York–New Jersey.
library(dplyr) <- USAGE |> popular_area filter(month %in% c(as.Date("2022-01-01"))) |> group_by(metro_area) |> summarize(total_trips = sum(UPT)) |> arrange(desc(total_trips)) print(popular_area)
# A tibble: 295 × 2 metro_area total_trips <chr> <dbl> 1 New York--Jersey City--Newark, NY--NJ 173719501 2 Los Angeles--Long Beach--Anaheim, CA 26158306 3 Chicago, IL--IN 16569817 4 San Francisco--Oakland, CA 13571654 5 Boston, MA--NH 13220711 6 Philadelphia, PA--NJ--DE--MD 12972351 7 Washington--Arlington, DC--VA--MD 11229936 8 Miami--Fort Lauderdale, FL 8318327 9 Seattle--Tacoma, WA 8250602 10 San Diego, CA 4602265 # ℹ 285 more rows
This metric references the second question from the previous question. Now we’re taking a look at history, seeing exactly when this mode peaked. The data tells us that this happened in October 2018, with 478,806,384 trips. Evidently, the lower ranked
months
had similar values too. This shows strong consistency for the bus transit mode across the US.<- USAGE |> bus_trips filter(Mode == "Bus") |> group_by(month) |> summarize(total_bus_trips = sum(UPT)) |> arrange(desc(total_bus_trips)) print(bus_trips)
# A tibble: 271 × 2 month total_bus_trips <date> <dbl> 1 2008-10-01 478806384 2 2014-10-01 457089165 3 2013-10-01 456214396 4 2007-10-01 455193568 5 2008-09-01 454077576 6 2006-10-01 450496480 7 2006-03-01 450386143 8 2012-10-01 448572088 9 2008-05-01 442961523 10 2009-10-01 441007281 # ℹ 261 more rows
The code required for this was a challenge, but it utilized functions already explored earlier and putting them together intricately. The months were assigned to a season within a
case_when
function within amutate
function. Additionally a filter, group_by, summarize, arrange, and mean function were used as well. From 2018 to 2022, the average amount of trips taken in NY-NJ was:Season Average UPT Fall 4,960,514 Winter 4,772,907 Summer 4,609,142 Spring 4,508,331
Table 3: Average UPT by Season, NY-NJ, 2018-2022
<- USAGE |> seasonal_variation filter(metro_area == "New York--Jersey City--Newark, NY--NJ") |> filter(month >= as.Date("2018-01-01") & month <= as.Date("2022-12-01")) |> mutate( month_num = as.numeric(format(month, "%m")), # Extract the month as a number from the date column season = case_when( # Use case_when to categorize into seasons %in% c(12, 1, 2) ~ "Winter", month_num %in% c(3, 4, 5) ~ "Spring", month_num %in% c(6, 7, 8) ~ "Summer", month_num %in% c(9, 10, 11) ~ "Fall", month_num TRUE ~ "Unknown" )|> ) group_by(season) |> summarize(avg_trips = mean(UPT)) |> arrange(desc(avg_trips)) print(seasonal_variation)
This concludes the first half of the analysis. A variety of transit metric data was unearthed. A better understanding of the R functions were explored through data analysis. Now that preliminary data has been identified, we can move forward to the next half of the analysis. The fare data available to use is from 2022. In order to do a deeper analysis, the USAGE
table will need to be converted to a 2022 version in order to join the fare data information together. Once we have a combined table, we can uncover what farebox recovery looked like in 2022.
Farebox Recovery
The first task at hand is to extract only the 2022 information from USAGE. The parameters of interest kept are NTD ID
, Agency
, metro_area
, Mode
, UPT
, VRM
. Normally, filtering
just the year and selecting
the parameters would be straightforward. However, the UPT
and VRM
need to be aggregated for the new joined table. Additionally, the mutate
function is used to convert NTD ID
to a double
type in order to match the same type as the NT ID
in the FINANCIALS
table we will be joining to later. The new table is called USAGE_2022_ANNUAL
. For the farebox recovery analysis, the sample will focus solely on major transit systems, which is defined as those with 400,000 UPT per annum. For this definition of major transit systems, the total UPT per Agency was considered (the mode is not taken into consideration, just the Agency as a whole itself).
# Calculate UPT per agency per year to only consider agencies with UPT of 400,000 or more per year
<- USAGE |>
agencies_400k_upt mutate(Year = year(month)) |> # Extract year from month
group_by(`NTD ID`, Agency, Year) |> # Group by agency and year
summarize(Total_UPT = sum(UPT, na.rm = TRUE), .groups = 'drop') |> # Summarize total UPT per Agency, aggregating the UPT across all modes
filter(Total_UPT >= 400000) |> # Keep agencies with total UPT >= 400,000 annum
ungroup() |> # Ungroup to prepare for next operation
distinct(Agency) # Get distinct agencies
# Filter the 2022 data for only those agencies
<- USAGE |>
USAGE_2022_ANNUAL filter(year(month) == 2022) |> # Only data from 2022
filter(Agency %in% agencies_400k_upt$Agency) |> # Filter agencies that meet avg UPT condition
group_by(`NTD ID`, Agency, metro_area, Mode) |> # Group by relevant columns
summarize(
UPT = sum(UPT), # Sum UPT for 2022
VRM = sum(VRM) # Sum VRM for 2022
|>
) ungroup() |>
mutate(`NTD ID` = as.double(`NTD ID`)) # Convert NTD ID to double for joining
print(USAGE_2022_ANNUAL) # Output the filtered table
# A tibble: 1,023 × 6
`NTD ID` Agency metro_area Mode UPT VRM
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1 King County Seattle--… Bus 5.40e7 6.16e7
2 1 King County Seattle--… Dema… 6.63e5 1.29e7
3 1 King County Seattle--… Ferr… 4.00e5 5.12e4
4 1 King County Seattle--… Stre… 1.12e6 1.80e5
5 1 King County Seattle--… Trol… 9.58e6 2.64e6
6 1 King County Seattle--… Vanp… 7.03e5 4.41e6
7 2 Spokane Transit Authority Spokane, … Bus 6.60e6 6.49e6
8 2 Spokane Transit Authority Spokane, … Dema… 3.10e5 4.04e6
9 2 Spokane Transit Authority Spokane, … Vanp… 9.06e4 9.06e5
10 3 Pierce County Transportation Benefit… Seattle--… Bus 4.95e6 4.23e6
# ℹ 1,013 more rows
Before we can join USAGE_2022_ANNUAL
onto FINANCIALS
, we need to revisit the mode conversion we did earlier in the data cleaning. The FINANCIALS
table follows the same format with the mode being a code as seen earlier.
<- FINANCIALS |>
FINANCIALS mutate(Mode = case_when(
== "DR" ~ "Demand Response",
Mode == "FB" ~ "Ferryboat",
Mode == "MB" ~ "Bus",
Mode == "SR" ~ "Streetcar Rail",
Mode == "TB" ~ "Trolleybus",
Mode == "VP" ~ "Vanpool",
Mode == "CB" ~ "Commuter Bus",
Mode == "RB" ~ "Bus Rapid Transit",
Mode == "LR" ~ "Light Rail",
Mode == "YR" ~ "Hybrid Rail",
Mode == "MG" ~ "Monorail and Automated Guideway modes",
Mode == "CR" ~ "Commuter Rail",
Mode == "AR" ~ "Alaska Railroad",
Mode == "TR" ~ "Aerial Tramways",
Mode == "HR" ~ "Heavy Rail",
Mode == "IP" ~ "Inclined Plane",
Mode == "PB" ~ "Publico",
Mode == "CC" ~ "Cable Car",
Mode TRUE ~ "Unknown"
))
Finally, we can join the USAGE_2022_ANNUAL
and FINANCIALS
tables together. USAGE_AND_FINANCIALS
will be used to conduct the final analysis on farebox recovery in 2022. An innjer_join
is used since some values were dropped when filtering out for only major transit systems. In order to join the data properly, an inner_join
matches the values that are only present in the USAGE_2022_ANNUAL
table.
<- inner_join(
USAGE_AND_FINANCIALS
USAGE_2022_ANNUAL,
FINANCIALS,join_by(`NTD ID`, Mode)
)
datatable(
sample_n(USAGE_AND_FINANCIALS, 1000) |>
select(-`Agency Name`) |> # exclude extra agency name column from financials table
rename(
`Metro Area` = metro_area, # rename for table output to look cleaner
`Unlinked Passenger Trips` = UPT, # rename acronym in visual table
`Vehicle Revenue Miles` = VRM # rename acronym in visual table
) )
Table 4: Visual of USAGE_AND_FINANCIALS
- Which transit system (agency and mode) had the most UPT in 2022?
- Which transit system (agency and mode) had the highest farebox recovery, defined as the highest ratio of Total Fares to Expenses?
- Which transit system (agency and mode) has the lowest expenses per UPT?
- Which transit system (agency and mode) has the highest total fares per UPT?
- Which transit system (agency and mode) has the lowest expenses per VRM?
- Which transit system (agency and mode) has the highest total fares per VRM?
In 2022, the MTA New York City Transit had the most UPT via the heavy rail (subway). There was a total of 1,793,073,801 trips taken. This result is not suprising given the sheer population size of the NYC tri-state area, as well as the vast amount of public transportation accessibility throughout the city.
<- USAGE_AND_FINANCIALS |> most_UPT_2022 group_by(Agency, Mode) |> summarize(most_UPT = UPT, .groups = 'drop') |> arrange(desc(most_UPT)) print(most_UPT_2022)
# A tibble: 1,016 × 3 Agency Mode most_UPT <chr> <chr> <dbl> 1 MTA New York City Transit Heavy Rail 1.79e9 2 MTA New York City Transit Bus 4.59e8 3 Los Angeles County Metropolitan Transportation Authority Bus 1.94e8 4 Chicago Transit Authority Bus 1.40e8 5 New Jersey Transit Corporation Bus 1.13e8 6 Chicago Transit Authority Heavy Rail 1.04e8 7 MTA Bus Company Bus 1.00e8 8 Washington Metropolitan Area Transit Authority Heavy Rail 9.84e7 9 Southeastern Pennsylvania Transportation Authority Bus 9.66e7 10 Washington Metropolitan Area Transit Authority Bus 8.99e7 # ℹ 1,006 more rows
In 2022, the County of Miami-Dade via vanpool had the highest farebox recovery with a ratio of 1.67. I found this result intriguing as vanpool would not have been my first assumption for this metric. Living in Brooklyn, the concept of vanpooling is unfamiliar to me. However, this did give me insight to how the rest of the country can greatly differ depending on the area of interest.
<- USAGE_AND_FINANCIALS |> highest_farebox group_by(Agency, Mode) |> summarize(total_fares = `Total Fares`, total_expenses = Expenses, .groups = 'drop') |> mutate(recovery = total_fares / total_expenses) |> arrange(desc(recovery)) print(highest_farebox)
# A tibble: 1,016 × 5 Agency Mode total_fares total_expenses recovery <chr> <chr> <dbl> <dbl> <dbl> 1 County of Miami-Dade Vanp… 1987879 1191874 1.67 2 Yuma County Intergovernmental Publ… Vanp… 411216 279585 1.47 3 Port Imperial Ferry Corporation Ferr… 33443241 23417248 1.43 4 Hyannis Harbor Tours, Inc. Ferr… 25972659 18383764 1.41 5 Trans-Bridge Lines, Inc. Comm… 11325199 8495611 1.33 6 Chattanooga Area Regional Transpor… Incl… 3005198 2290714 1.31 7 Municipality of Anchorage Vanp… 1400709 1105911 1.27 8 Regional Transportation Commission… Vanp… 3561776 2876745 1.24 9 Fort Worth Transportation Authority Vanp… 1410877 1141477 1.24 10 Hampton Jitney, Inc. Comm… 21539188 17957368 1.20 # ℹ 1,006 more rows
In 2022, North Carolina State University via bus had the lowest expenses per trip with a ratio of 1.18. More insight into the expenses of this transportation mode could provide better context to the performance. A university funded transportation system could potentially have lower operation costs compared to a metropolitan transit system due to less logistical hurdles to overcome.
<- USAGE_AND_FINANCIALS |> low_expense_UPT group_by(Agency, Mode) |> summarize(total_expenses = Expenses, total_UPT = UPT, .groups = 'drop') |> mutate(lowestUPT = total_expenses/total_UPT) |> arrange(lowestUPT) print(low_expense_UPT)
# A tibble: 1,016 × 5 Agency Mode total_expenses total_UPT lowestUPT <chr> <chr> <dbl> <dbl> <dbl> 1 North Carolina State University Bus 2727412 2313091 1.18 2 Anaheim Transportation Network Bus 9751600 7635011 1.28 3 Valley Metro Rail, Inc. Stre… 542700 364150 1.49 4 University of Iowa Bus 3751241 2437750 1.54 5 Chatham Area Transit Authority Ferr… 935249 582988 1.60 6 Texas State University Bus 4825081 2348943 2.05 7 South Florida Regional Transportati… Bus 731643 322155 2.27 8 University of Georgia Bus 6267845 2714941 2.31 9 Hillsborough Area Regional Transit … Stre… 2780595 1137177 2.45 10 University of Michigan Parking and … Bus 11990864 4754836 2.52 # ℹ 1,006 more rows
In 2022, the Altoona Metro Transit via demand rail had the highest total fares per trip with a ratio of 660. Note that the ratio is absurdly large, there are indications as to why this number is so high. The output shows that there were only a total of 26 unlinked passenger trips for this mode of transportation. The lack of total trips is something to consider when answering this question. I would postulate setting a minimum number of unlinked passenger trips when asking this question if I wanted to explore the data more in depth.
<- USAGE_AND_FINANCIALS |> high_fare_UPT group_by(Agency, Mode) |> summarize(total_fares = `Total Fares`, total_UPT = UPT, .groups = 'drop') |> mutate(fare_UPT_ratio = total_fares/total_UPT) |> arrange(desc(fare_UPT_ratio)) print(high_fare_UPT)
# A tibble: 1,016 × 5 Agency Mode total_fares total_UPT fare_UPT_ratio <chr> <chr> <dbl> <dbl> <dbl> 1 Altoona Metro Transit Dema… 17163 26 660. 2 Central Pennsylvania Transportati… Dema… 14084946 280455 50.2 3 Hampton Jitney, Inc. Comm… 21539188 521577 41.3 4 County of Placer Comm… 40847 1054 38.8 5 Lane Transit District Dema… 10724805 314974 34.0 6 Pennsylvania Department of Transp… Comm… 14580664 452034 32.3 7 Hyannis Harbor Tours, Inc. Ferr… 25972659 878728 29.6 8 Trans-Bridge Lines, Inc. Comm… 11325199 403646 28.1 9 SeaStreak, LLC Ferr… 16584600 750392 22.1 10 Cambria County Transit Authority Dema… 520554 25831 20.2 # ℹ 1,006 more rows
In 2022, the VIA Metropolitan Transit via vanpool had the lowest expenses per vehicle revenue mile with a ratio of 0.37. Once again, the transportation mode of vanpool has been a key finding in the metrics observed. Somethings that could be further explored are the characteristics of the areas vanpools are popular in.
<- USAGE_AND_FINANCIALS |> low_expense_VRM group_by(Agency, Mode) |> summarize(total_expenses = Expenses, total_VRM = VRM, .groups = 'drop') |> mutate(lowestVRM = total_expenses/total_VRM) |> arrange(lowestVRM) print(low_expense_VRM)
# A tibble: 1,016 × 5 Agency Mode total_expenses total_VRM lowestVRM <chr> <chr> <dbl> <dbl> <dbl> 1 VIA Metropolitan Transit Vanp… 1298365 3505579 0.370 2 County of Miami-Dade Vanp… 1191874 3091052 0.386 3 County of Volusia Vanp… 87487 222484 0.393 4 Corpus Christi Regional Transportat… Vanp… 433951 1006399 0.431 5 Metropolitan Transportation Commiss… Vanp… 5491767 12341055 0.445 6 Central Midlands Regional Transport… Vanp… 195326 438557 0.445 7 Fort Worth Transportation Authority Vanp… 1141477 2372285 0.481 8 San Joaquin Council Vanp… 4629125 9297516 0.498 9 Salem Area Mass Transit District Vanp… 238952 468018 0.511 10 San Diego Association of Governments Vanp… 5264624 9740828 0.540 # ℹ 1,006 more rows
In 2022, the Chicago Water Taxi (Wendella) via Ferry had the highest total fares per vehicle revenue mile with a ratio of 237. Like the fourth farebox recovery metric, this data point also seems skewed because of the lack of vehicle revenue miles involved. I would suggest setting a minimum amount of vehicle revenue miles as well to gauge a better understanding of which transit system boasts the highest total fare per vehicle revenue mile.
<- USAGE_AND_FINANCIALS |> high_fare_VRM group_by(Agency, Mode) |> summarize(total_fares = `Total Fares`, total_VRM = VRM, .groups = 'drop') |> mutate(fare_VRM_ratio = total_fares/total_VRM) |> arrange(desc(fare_VRM_ratio)) print(high_fare_VRM)
# A tibble: 1,016 × 5 Agency Mode total_fares total_VRM fare_VRM_ratio <chr> <chr> <dbl> <dbl> <dbl> 1 Chicago Water Taxi (Wendella) Ferr… 142473 600 237. 2 Altoona Metro Transit Dema… 17163 75 229. 3 Jacksonville Transportation Autho… Ferr… 1432549 9084 158. 4 Chattanooga Area Regional Transpo… Incl… 3005198 20128 149. 5 Hyannis Harbor Tours, Inc. Ferr… 25972659 188694 138. 6 SeaStreak, LLC Ferr… 16584600 143935 115. 7 Cape May Lewes Ferry Ferr… 6663334 71640 93.0 8 Woods Hole, Martha's Vineyard and… Ferr… 33424462 364574 91.7 9 Washington State Ferries Ferr… 57644277 738094 78.1 10 County of Pierce Ferr… 2979914 44548 66.9 # ℹ 1,006 more rows
Conclusion
This analysis was ultimately inspired by how farebox recovery rates vary by transit system across the nation. The definition of efficiency I would like to use is the transit system with the highest farebox recovery rate. From this sample, the County of Miami-Dade vanpool transit system is the most efficient. The fares made on the transit system trips exceed the operation cost per trip. Not only does the return cover the operation costs, it also exceeds it slightly. If profit is not the goal for the transit system, the surplus revenue can be reinvested into infrastructure to ensure smooth operations. Continuously enhancing the rider experience can help sustain the system over the long term, with the aim of increasing ride share participation over time.
There are other data points that can be analyzed and incorporated into this analysis for further exploration. Some areas of interest I would explore are comparing trips taken to the total population the area serves. This can give a better idea about what percentage of the population is utilizing the public transportation system. Trends about public transportation usage based on the population available can highlight how popular public transit is depending on an area. Another area of interest I would like to explore is the carbon emission reduction provided by transit systems. A transit system’s financial stability could be easily offset by environmental impact depending on the mode of transportation. There are multitudes of other data points that can be extrapolated to explore how efficient a transit system is within the scope of defining what efficiency is.