Mini-Project #01: Fiscal Characteristics of Major US Public Transit Systems

By Chris Liu

Introduction

The public transit systems found nationwide in the United States play an important role in mobilizing people in their daily lives. This analysis will give an overview of how the transit systems perform relative to another by examining key metrics such as farebox revenues, total number of trips, total vehicle miles traveled and total revenues and expenses by source.

This analysis will examine information from 2022 that utilizes data from fare revenue, monthly ridership, and operating expense reports. By evaluating the metrics mentioned earlier, this analysis will identify trends, expose common challenges, and offer insight into transit performances. Various transit performance metrics will be analyzed to gain a general overview of the data. Based on these findings, transit system efficiency will be evaluated on a comparative basis. The working data sets are provided by the Federal Transit Administration. The latter half of the analysis will examine metrics that can define a transit system as efficient. The evaluation of efficiency will depend on the interpretation of what makes something efficient.

Analysis

Preparing, Cleaning & Loading the Dataset

The relevant data sets used in the analysis can be found here:

The first step in the analysis is to ingest the relevant data tables and prepare them for data analysis using R. The following code will clean and join the tables into relevant dataframes used in the analysis. The output will create the following dataframes named: FARES, EXPENSES, FINANCIALS, TRIPS, MILES, and USAGE.

if (!require("tidyverse")) install.packages("tidyverse")

# Let's start with Fare Revenue
library(tidyverse)
FARES <- readxl::read_xlsx("2022_fare_revenue.xlsx") |>
  select(
    -`State/Parent NTD ID`,
    -`Reporter Type`,
    -`Reporting Module`,
    -`TOS`,
    -`Passenger Paid Fares`,
    -`Organization Paid Fares`
  ) |>
  filter(`Expense Type` == "Funds Earned During Period") |>
  select(-`Expense Type`) |>
  group_by(
    `NTD ID`, # Sum over different `TOS` for the same `Mode`
    `Agency Name`, # These are direct operated and sub-contracted
    `Mode`
  ) |> # of the same transit modality
  # Not a big effect in most munis (significant DO
  # tends to get rid of sub-contractors), but we'll sum
  # to unify different passenger experiences
  summarize(`Total Fares` = sum(`Total Fares`)) |>
  ungroup()

# Next, expenses
EXPENSES <- readr::read_csv("2022_expenses.csv") |>
  select(
    `NTD ID`,
    `Agency`,
    `Total`,
    `Mode`
  ) |>
  mutate(`NTD ID` = as.integer(`NTD ID`)) |>
  rename(Expenses = Total) |>
  group_by(`NTD ID`, `Mode`) |>
  summarize(Expenses = sum(Expenses)) |>
  ungroup()

FINANCIALS <- inner_join(FARES, EXPENSES, join_by(`NTD ID`, `Mode`))

# Monthly Transit Numbers
library(tidyverse)
TRIPS <- readxl::read_xlsx("ridership.xlsx", sheet = "UPT") |>
  filter(`Mode/Type of Service Status` == "Active") |>
  select(
    -`Legacy NTD ID`,
    -`Reporter Type`,
    -`Mode/Type of Service Status`,
    -`UACE CD`,
    -`TOS`
  ) |>
  pivot_longer(-c(`NTD ID`:`3 Mode`),
    names_to = "month",
    values_to = "UPT"
  ) |>
  drop_na() |>
  mutate(month = my(month)) # Parse _m_onth _y_ear date specs
MILES <- readxl::read_xlsx("ridership.xlsx", sheet = "VRM") |>
  filter(`Mode/Type of Service Status` == "Active") |>
  select(
    -`Legacy NTD ID`,
    -`Reporter Type`,
    -`Mode/Type of Service Status`,
    -`UACE CD`,
    -`TOS`
  ) |>
  pivot_longer(-c(`NTD ID`:`3 Mode`),
    names_to = "month",
    values_to = "VRM"
  ) |>
  drop_na() |>
  group_by(
    `NTD ID`, `Agency`, `UZA Name`,
    `Mode`, `3 Mode`, month
  ) |>
  summarize(VRM = sum(VRM)) |>
  ungroup() |>
  mutate(month = my(month)) # Parse _m_onth _y_ear date specs

USAGE <- inner_join(TRIPS, MILES) |>
  mutate(`NTD ID` = as.integer(`NTD ID`))

# End of data ingestion and setup

Here, a summary table of USAGE is created to get an introductory visualization of the table that will be used for analysis.

if (!require("DT")) install.packages("DT")
library(DT)

# Initialize a table to begin analysis
sample_n(USAGE, 1000) |>
  mutate(month = as.character(month)) |>
  DT::datatable()

Table 1: Quick Overview of USAGE

Transforming Data Table

The analysis will be initially conducted using the dataframe USAGE. Some of the provided labels are cumbersome to work with in R. It is doable, but we can make our lives easier by renaming them. The first task at hand is to rename the column UZA Name to metro_area. The following code will show how that is done.

USAGE <- USAGE |>
  rename(metro_area = `UZA Name`)

Each transportation Mode is represented by a two letter code, for example HR = Heavy Rail. The two letter codes aren’t meaningful to us, as it’s impossible to guess what they are. The first thing I did to clean this portion up was to find all the codes used by the Federal Transit Administration. Running the following will give us the list of codes found in the data set.

library(dplyr)
unique_modes <- USAGE |>
  distinct(Mode)
print(unique_modes)
# A tibble: 18 × 1
   Mode 
   <chr>
 1 DR   
 2 FB   
 3 MB   
 4 SR   
 5 TB   
 6 VP   
 7 CB   
 8 RB   
 9 LR   
10 YR   
11 MG   
12 CR   
13 AR   
14 TR   
15 HR   
16 IP   
17 PB   
18 CC   

Once all the codes have been identified, the meanings can be found in the National Transit Database glossary. Using the mutate function, all the codes can be changed into meaningful definitions.

USAGE <- USAGE |>
  mutate(Mode = case_when(
    Mode == "DR" ~ "Demand Response",
    Mode == "FB" ~ "Ferryboat",
    Mode == "MB" ~ "Bus",
    Mode == "SR" ~ "Streetcar Rail",
    Mode == "TB" ~ "Trolleybus",
    Mode == "VP" ~ "Vanpool",
    Mode == "CB" ~ "Commuter Bus",
    Mode == "RB" ~ "Bus Rapid Transit",
    Mode == "LR" ~ "Light Rail",
    Mode == "YR" ~ "Hybrid Rail",
    Mode == "MG" ~ "Monorail and Automated Guideway modes",
    Mode == "CR" ~ "Commuter Rail",
    Mode == "AR" ~ "Alaska Railroad",
    Mode == "TR" ~ "Aerial Tramways",
    Mode == "HR" ~ "Heavy Rail",
    Mode == "IP" ~ "Inclined Plane",
    Mode == "PB" ~ "Publico",
    Mode == "CC" ~ "Cable Car",
    TRUE ~ "Unknown"
  ))

Now that the data has been cleaned up to a new extent, let’s create a new summary table from USAGE. To make the outputted table as clean as possible, I’ve opted to get rid of irrelevant columns and change the acronyms to display meaningful words. UPT stands for unlinked passenger trips and VRM stands for vehicle revenue miles.

datatable(
  sample_n(USAGE, 1000) |>
    mutate(month = as.character(month)) |>
    select(-`NTD ID`, -`3 Mode`) |> # exclude ntd id and 3 mode in visual table
    rename(
      `Metro Area` = metro_area, # rename for table output to look cleaner
      `Unlinked Passenger Trips` = UPT, # rename acronym in visual table
      `Vehicle Revenue Miles` = VRM # rename acronym in visual table
    )
)

Table 2: Cleaned Up Version of USAGE

For my own sanity, I checked if the table had any NA values I needed to consider.
na_count <- sum(is.na(USAGE))
print(na_count)
[1] 0

This code returns 0, which lets me know there are no missing values in the data I’m working with. With this reassurance, I won’t be using the na.rm=TRUE statement in any of my code. However, if the opposite were true instead, I would make sure to use the above statement while utilizing aggregate functions.

Initial Metrics of Interest

Now that there is a clean table to work with, some questions of interest about the data can be explored. The following questions will explore the use of the following functions filter, group_by, summarize, and arrange.

The first set of metrics of interest are:
  1. What transit agency had the most total VRM in this sample?
  2. What transit mode had the most total VRM in this sample?
  3. How many trips were taken on the NYC Subway (Heavy Rail) in May 2024?
  4. How much did NYC subway ridership fall between April 2019 and April 2020?
  1. To find the transit agency with the most total VRM from the sample, I need to group the data based on the Agency and its respective VRM total. It turns out the MTA New York City Transit has reign over total VRM among the agencies with 10,832,855,350 miles.

    most_vrm_agency <- USAGE |>
      group_by(Agency) |>
      summarize(total_vrm = sum(VRM)) |>
      arrange(desc(total_vrm))
    print(most_vrm_agency)
    # A tibble: 677 × 2
       Agency                                                              total_vrm
       <chr>                                                                   <dbl>
     1 MTA New York City Transit                                             1.08e10
     2 New Jersey Transit Corporation                                        5.65e 9
     3 Los Angeles County Metropolitan Transportation Authority              4.35e 9
     4 Washington Metropolitan Area Transit Authority                        2.82e 9
     5 Chicago Transit Authority                                             2.81e 9
     6 Southeastern Pennsylvania Transportation Authority                    2.67e 9
     7 Massachusetts Bay Transportation Authority                            2.38e 9
     8 Pace, the Suburban Bus Division of the Regional Transportation Aut…   2.38e 9
     9 Metropolitan Transit Authority of Harris County, Texas                2.27e 9
    10 Denver Regional Transportation District                               1.99e 9
    # ℹ 667 more rows

    I purposely chose not to include a slice_head function to get a comparative overview of the data. Here, the MTA had an overwhelming total over the other agencies, which was an interesting finding. I stuck with the same philosophy throughout most of this analysis since I was interested in comparing the sheer numbers as well, not just the specific metric I was inquiring about.

    most_vrm_agency1 <- USAGE |>
      group_by(Agency) |>
      summarize(total_vrm = sum(VRM)) |>
      arrange(desc(total_vrm)) |>
      slice_head(n = 1)
    print(most_vrm_agency1)
    # A tibble: 1 × 2
      Agency                      total_vrm
      <chr>                           <dbl>
    1 MTA New York City Transit 10832855350
  2. To find the transit mode with the most total VRM from the sample, I need to group the data based on the Mode and its respective VRM total. By a large margin of 49,444,494,088 miles, the bus(MB) Mode had the most total VRM from the sample.

    most_vrm_mode <- USAGE |>
      group_by(Mode) |>
      summarize(total_vrm = sum(VRM)) |>
      arrange(desc(total_vrm))
    print(most_vrm_mode)
    # A tibble: 18 × 2
       Mode                                    total_vrm
       <chr>                                       <dbl>
     1 Bus                                   49444494088
     2 Demand Response                       17955073508
     3 Heavy Rail                            14620362107
     4 Commuter Rail                          6970644241
     5 Vanpool                                3015783362
     6 Light Rail                             2090094714
     7 Commuter Bus                           1380948975
     8 Publico                                1021270808
     9 Trolleybus                              236840288
    10 Bus Rapid Transit                       118425283
    11 Ferryboat                                65589783
    12 Streetcar Rail                           63389725
    13 Monorail and Automated Guideway modes    37879729
    14 Hybrid Rail                              37787608
    15 Alaska Railroad                          13833261
    16 Cable Car                                 7386019
    17 Inclined Plane                             705904
    18 Aerial Tramways                            292860
  3. To find how many trips were taken on the NYC Subway in May 2024, there were multiple criteria to consider here. A filter needs to be used in order to address the transit Mode, month, and Agency. In this case, I made the assumption that the NYC Subway is only operated by the MTA New York City Transit. In May 2024, there were a total of 180,000,000 (1.80e8) trips taken.

    nyc_subway_trips <- USAGE |>
      filter(
        Agency == "MTA New York City Transit",
        Mode == "Heavy Rail",
        month >= as.Date("2024-05-01") & month <= as.Date("2024-05-31")
      )
    print(nyc_subway_trips)
    # A tibble: 1 × 8
      `NTD ID` Agency             metro_area Mode  `3 Mode` month         UPT    VRM
         <int> <chr>              <chr>      <chr> <chr>    <date>      <dbl>  <dbl>
    1    20008 MTA New York City… New York-… Heav… Rail     2024-05-01 1.80e8 3.00e7
    Information Regarding the month Column

    After going through this question at hand, I was able to identify that the month is stored as just the first of the month each year. I confirmed that there is only one entry per month for each respective agency, metro area, and mode. Moving forward with any month filters, I don’t have to worry about including the full month as it will always only be in the format of the first of the month in each year. This understanding will be seen in later examples.

  4. To find the ridership difference between April 2019 and April 2020, I need to find the amount of trips taken in each month-year and subtract from one another. Again, I made the assumption that the NYC Subway is only operated by the MTA New York City Transit. In the following code, you can see the difference when filtering the month as referenced in the call-out above! The ridership fell by 211,969,660 trips between April 2019 and April 2020. I interpreted this metric as the difference of trips between each respective month-year, not the total difference in between.

    ride_fall <- USAGE |>
     filter(Mode == "Heavy Rail") |>
     filter(Agency == "MTA New York City Transit") |> # this is the agency that runs nyc subway
     filter(month %in% c(as.Date("2019-04-01"), as.Date("2020-04-01"))) |>
     group_by(month) |>
     summarize(total_rides = sum(UPT)) |>
     summarize(difference = total_rides[month == as.Date("2020-04-01")] -
       total_rides[month == as.Date("2019-04-01")])
    print(ride_fall)
    # A tibble: 1 × 1
      difference
           <dbl>
    1 -211969660

Additional Metrics of Interest

Asides from the metrics explored above, there are a variety of other questions that can be asked from the data. In this section, I will explore three other areas of interest. The data is not limited to the following questions discussed, there are a multitude of statistics that can be uncovered. For the following questions I asked, I am trying to utilize as many R functions as possible.

Additional metrics of interest are:
  1. What UZA Name / metro_area had the most UPT in January 2022?
  2. What month and year had the most UPT through the bus (MB) in the entire sample?
  3. What is the average amount of trips taken in the New York–Jersey City–Newark, NY–NJ area based on the season from 2018 to 2022?
  1. The areas of interest require filtering a date, grouping by a variable, and aggregating a variable. In January 2022, the New York City–Jersey City–Newark, NY–NJ area recorded the most trips with 173,719,501 trips. The second-ranked area had only 26,158,306 trips, accounting for just 15% of the total for New York–New Jersey.

    library(dplyr)
    popular_area <- USAGE |>
     filter(month %in% c(as.Date("2022-01-01"))) |>
     group_by(metro_area) |>
     summarize(total_trips = sum(UPT)) |>
     arrange(desc(total_trips))
    print(popular_area)
    # A tibble: 295 × 2
       metro_area                            total_trips
       <chr>                                       <dbl>
     1 New York--Jersey City--Newark, NY--NJ   173719501
     2 Los Angeles--Long Beach--Anaheim, CA     26158306
     3 Chicago, IL--IN                          16569817
     4 San Francisco--Oakland, CA               13571654
     5 Boston, MA--NH                           13220711
     6 Philadelphia, PA--NJ--DE--MD             12972351
     7 Washington--Arlington, DC--VA--MD        11229936
     8 Miami--Fort Lauderdale, FL                8318327
     9 Seattle--Tacoma, WA                       8250602
    10 San Diego, CA                             4602265
    # ℹ 285 more rows
  2. This metric references the second question from the previous question. Now we’re taking a look at history, seeing exactly when this mode peaked. The data tells us that this happened in October 2018, with 478,806,384 trips. Evidently, the lower ranked months had similar values too. This shows strong consistency for the bus transit mode across the US.

    bus_trips <- USAGE |>
     filter(Mode == "Bus") |>
     group_by(month) |>
     summarize(total_bus_trips = sum(UPT)) |>
     arrange(desc(total_bus_trips))
    print(bus_trips)
    # A tibble: 271 × 2
       month      total_bus_trips
       <date>               <dbl>
     1 2008-10-01       478806384
     2 2014-10-01       457089165
     3 2013-10-01       456214396
     4 2007-10-01       455193568
     5 2008-09-01       454077576
     6 2006-10-01       450496480
     7 2006-03-01       450386143
     8 2012-10-01       448572088
     9 2008-05-01       442961523
    10 2009-10-01       441007281
    # ℹ 261 more rows
  3. The code required for this was a challenge, but it utilized functions already explored earlier and putting them together intricately. The months were assigned to a season within a case_when function within a mutate function. Additionally a filter, group_by, summarize, arrange, and mean function were used as well. From 2018 to 2022, the average amount of trips taken in NY-NJ was:

    Season Average UPT
    Fall 4,960,514
    Winter 4,772,907
    Summer 4,609,142
    Spring 4,508,331

    Table 3: Average UPT by Season, NY-NJ, 2018-2022

    seasonal_variation <- USAGE |>
      filter(metro_area == "New York--Jersey City--Newark, NY--NJ") |>
      filter(month >= as.Date("2018-01-01") & month <= as.Date("2022-12-01")) |>
      mutate(
        month_num = as.numeric(format(month, "%m")), # Extract the month as a number from the date column
        season = case_when( # Use case_when to categorize into seasons
          month_num %in% c(12, 1, 2) ~ "Winter",
          month_num %in% c(3, 4, 5) ~ "Spring",
          month_num %in% c(6, 7, 8) ~ "Summer",
          month_num %in% c(9, 10, 11) ~ "Fall",
          TRUE ~ "Unknown"
        )
      ) |>
      group_by(season) |>
      summarize(avg_trips = mean(UPT)) |>
      arrange(desc(avg_trips))
    print(seasonal_variation)

This concludes the first half of the analysis. A variety of transit metric data was unearthed. A better understanding of the R functions were explored through data analysis. Now that preliminary data has been identified, we can move forward to the next half of the analysis. The fare data available to use is from 2022. In order to do a deeper analysis, the USAGE table will need to be converted to a 2022 version in order to join the fare data information together. Once we have a combined table, we can uncover what farebox recovery looked like in 2022.

Farebox Recovery

The first task at hand is to extract only the 2022 information from USAGE. The parameters of interest kept are NTD ID, Agency, metro_area, Mode, UPT, VRM. Normally, filtering just the year and selecting the parameters would be straightforward. However, the UPT and VRM need to be aggregated for the new joined table. Additionally, the mutate function is used to convert NTD ID to a double type in order to match the same type as the NT ID in the FINANCIALS table we will be joining to later. The new table is called USAGE_2022_ANNUAL. For the farebox recovery analysis, the sample will focus solely on major transit systems, which is defined as those with 400,000 UPT per annum. For this definition of major transit systems, the total UPT per Agency was considered (the mode is not taken into consideration, just the Agency as a whole itself).

# Calculate UPT per agency per year to only consider agencies with UPT of 400,000 or more per year
agencies_400k_upt <- USAGE |>
  mutate(Year = year(month)) |> # Extract year from month
  group_by(`NTD ID`, Agency, Year) |> # Group by agency and year
  summarize(Total_UPT = sum(UPT, na.rm = TRUE), .groups = 'drop') |> # Summarize total UPT per Agency, aggregating the UPT across all modes
  filter(Total_UPT >= 400000) |> # Keep agencies with total UPT >= 400,000 annum
  ungroup() |> # Ungroup to prepare for next operation
  distinct(Agency) # Get distinct agencies

# Filter the 2022 data for only those agencies
USAGE_2022_ANNUAL <- USAGE |>
  filter(year(month) == 2022) |> # Only data from 2022
  filter(Agency %in% agencies_400k_upt$Agency) |> # Filter agencies that meet avg UPT condition
  group_by(`NTD ID`, Agency, metro_area, Mode) |> # Group by relevant columns
  summarize(
    UPT = sum(UPT), # Sum UPT for 2022
    VRM = sum(VRM) # Sum VRM for 2022
  ) |>
  ungroup() |>
  mutate(`NTD ID` = as.double(`NTD ID`)) # Convert NTD ID to double for joining

print(USAGE_2022_ANNUAL) # Output the filtered table
# A tibble: 1,023 × 6
   `NTD ID` Agency                                metro_area Mode     UPT    VRM
      <dbl> <chr>                                 <chr>      <chr>  <dbl>  <dbl>
 1        1 King County                           Seattle--… Bus   5.40e7 6.16e7
 2        1 King County                           Seattle--… Dema… 6.63e5 1.29e7
 3        1 King County                           Seattle--… Ferr… 4.00e5 5.12e4
 4        1 King County                           Seattle--… Stre… 1.12e6 1.80e5
 5        1 King County                           Seattle--… Trol… 9.58e6 2.64e6
 6        1 King County                           Seattle--… Vanp… 7.03e5 4.41e6
 7        2 Spokane Transit Authority             Spokane, … Bus   6.60e6 6.49e6
 8        2 Spokane Transit Authority             Spokane, … Dema… 3.10e5 4.04e6
 9        2 Spokane Transit Authority             Spokane, … Vanp… 9.06e4 9.06e5
10        3 Pierce County Transportation Benefit… Seattle--… Bus   4.95e6 4.23e6
# ℹ 1,013 more rows
Additional transformation is required:

Before we can join USAGE_2022_ANNUAL onto FINANCIALS, we need to revisit the mode conversion we did earlier in the data cleaning. The FINANCIALS table follows the same format with the mode being a code as seen earlier.

FINANCIALS <- FINANCIALS |>
  mutate(Mode = case_when(
    Mode == "DR" ~ "Demand Response",
    Mode == "FB" ~ "Ferryboat",
    Mode == "MB" ~ "Bus",
    Mode == "SR" ~ "Streetcar Rail",
    Mode == "TB" ~ "Trolleybus",
    Mode == "VP" ~ "Vanpool",
    Mode == "CB" ~ "Commuter Bus",
    Mode == "RB" ~ "Bus Rapid Transit",
    Mode == "LR" ~ "Light Rail",
    Mode == "YR" ~ "Hybrid Rail",
    Mode == "MG" ~ "Monorail and Automated Guideway modes",
    Mode == "CR" ~ "Commuter Rail",
    Mode == "AR" ~ "Alaska Railroad",
    Mode == "TR" ~ "Aerial Tramways",
    Mode == "HR" ~ "Heavy Rail",
    Mode == "IP" ~ "Inclined Plane",
    Mode == "PB" ~ "Publico",
    Mode == "CC" ~ "Cable Car",
    TRUE ~ "Unknown"
  ))

Finally, we can join the USAGE_2022_ANNUAL and FINANCIALS tables together. USAGE_AND_FINANCIALS will be used to conduct the final analysis on farebox recovery in 2022. An innjer_join is used since some values were dropped when filtering out for only major transit systems. In order to join the data properly, an inner_join matches the values that are only present in the USAGE_2022_ANNUAL table.

USAGE_AND_FINANCIALS <- inner_join(
  USAGE_2022_ANNUAL,
  FINANCIALS,
  join_by(`NTD ID`, Mode)
  )

datatable( 
  sample_n(USAGE_AND_FINANCIALS, 1000) |>
    select(-`Agency Name`) |> # exclude extra agency name column from financials table
    rename(
      `Metro Area` = metro_area, # rename for table output to look cleaner
      `Unlinked Passenger Trips` = UPT, # rename acronym in visual table
      `Vehicle Revenue Miles` = VRM # rename acronym in visual table
          )
)

Table 4: Visual of USAGE_AND_FINANCIALS

Farebox recovery metrics to be analyzed:
  1. Which transit system (agency and mode) had the most UPT in 2022?
  2. Which transit system (agency and mode) had the highest farebox recovery, defined as the highest ratio of Total Fares to Expenses?
  3. Which transit system (agency and mode) has the lowest expenses per UPT?
  4. Which transit system (agency and mode) has the highest total fares per UPT?
  5. Which transit system (agency and mode) has the lowest expenses per VRM?
  6. Which transit system (agency and mode) has the highest total fares per VRM?
  1. In 2022, the MTA New York City Transit had the most UPT via the heavy rail (subway). There was a total of 1,793,073,801 trips taken. This result is not suprising given the sheer population size of the NYC tri-state area, as well as the vast amount of public transportation accessibility throughout the city.

    most_UPT_2022 <- USAGE_AND_FINANCIALS |>
      group_by(Agency, Mode) |>
      summarize(most_UPT = UPT, .groups = 'drop') |>
      arrange(desc(most_UPT)) 
    print(most_UPT_2022)
    # A tibble: 1,016 × 3
       Agency                                                   Mode        most_UPT
       <chr>                                                    <chr>          <dbl>
     1 MTA New York City Transit                                Heavy Rail    1.79e9
     2 MTA New York City Transit                                Bus           4.59e8
     3 Los Angeles County Metropolitan Transportation Authority Bus           1.94e8
     4 Chicago Transit Authority                                Bus           1.40e8
     5 New Jersey Transit Corporation                           Bus           1.13e8
     6 Chicago Transit Authority                                Heavy Rail    1.04e8
     7 MTA Bus Company                                          Bus           1.00e8
     8 Washington Metropolitan Area Transit Authority           Heavy Rail    9.84e7
     9 Southeastern Pennsylvania Transportation Authority       Bus           9.66e7
    10 Washington Metropolitan Area Transit Authority           Bus           8.99e7
    # ℹ 1,006 more rows
  2. In 2022, the County of Miami-Dade via vanpool had the highest farebox recovery with a ratio of 1.67. I found this result intriguing as vanpool would not have been my first assumption for this metric. Living in Brooklyn, the concept of vanpooling is unfamiliar to me. However, this did give me insight to how the rest of the country can greatly differ depending on the area of interest.

    highest_farebox <- USAGE_AND_FINANCIALS |>
      group_by(Agency, Mode) |>
      summarize(total_fares = `Total Fares`, total_expenses = Expenses, .groups = 'drop') |>
      mutate(recovery = total_fares / total_expenses) |>
      arrange(desc(recovery))
    print(highest_farebox)
    # A tibble: 1,016 × 5
       Agency                              Mode  total_fares total_expenses recovery
       <chr>                               <chr>       <dbl>          <dbl>    <dbl>
     1 County of Miami-Dade                Vanp…     1987879        1191874     1.67
     2 Yuma County Intergovernmental Publ… Vanp…      411216         279585     1.47
     3 Port Imperial Ferry Corporation     Ferr…    33443241       23417248     1.43
     4 Hyannis Harbor Tours, Inc.          Ferr…    25972659       18383764     1.41
     5 Trans-Bridge Lines, Inc.            Comm…    11325199        8495611     1.33
     6 Chattanooga Area Regional Transpor… Incl…     3005198        2290714     1.31
     7 Municipality of Anchorage           Vanp…     1400709        1105911     1.27
     8 Regional Transportation Commission… Vanp…     3561776        2876745     1.24
     9 Fort Worth Transportation Authority Vanp…     1410877        1141477     1.24
    10 Hampton Jitney, Inc.                Comm…    21539188       17957368     1.20
    # ℹ 1,006 more rows
  3. In 2022, North Carolina State University via bus had the lowest expenses per trip with a ratio of 1.18. More insight into the expenses of this transportation mode could provide better context to the performance. A university funded transportation system could potentially have lower operation costs compared to a metropolitan transit system due to less logistical hurdles to overcome.

    low_expense_UPT <- USAGE_AND_FINANCIALS |>
      group_by(Agency, Mode) |>
      summarize(total_expenses = Expenses, total_UPT = UPT, .groups = 'drop') |>
      mutate(lowestUPT = total_expenses/total_UPT) |>
      arrange(lowestUPT)
    print(low_expense_UPT)
    # A tibble: 1,016 × 5
       Agency                               Mode  total_expenses total_UPT lowestUPT
       <chr>                                <chr>          <dbl>     <dbl>     <dbl>
     1 North Carolina State University      Bus          2727412   2313091      1.18
     2 Anaheim Transportation Network       Bus          9751600   7635011      1.28
     3 Valley Metro Rail, Inc.              Stre…         542700    364150      1.49
     4 University of Iowa                   Bus          3751241   2437750      1.54
     5 Chatham Area Transit Authority       Ferr…         935249    582988      1.60
     6 Texas State University               Bus          4825081   2348943      2.05
     7 South Florida Regional Transportati… Bus           731643    322155      2.27
     8 University of Georgia                Bus          6267845   2714941      2.31
     9 Hillsborough Area Regional Transit … Stre…        2780595   1137177      2.45
    10 University of Michigan Parking and … Bus         11990864   4754836      2.52
    # ℹ 1,006 more rows
  4. In 2022, the Altoona Metro Transit via demand rail had the highest total fares per trip with a ratio of 660. Note that the ratio is absurdly large, there are indications as to why this number is so high. The output shows that there were only a total of 26 unlinked passenger trips for this mode of transportation. The lack of total trips is something to consider when answering this question. I would postulate setting a minimum number of unlinked passenger trips when asking this question if I wanted to explore the data more in depth.

    high_fare_UPT <- USAGE_AND_FINANCIALS |>
      group_by(Agency, Mode) |>
      summarize(total_fares = `Total Fares`, total_UPT = UPT, .groups = 'drop') |>
      mutate(fare_UPT_ratio = total_fares/total_UPT) |>
      arrange(desc(fare_UPT_ratio))
    print(high_fare_UPT)
    # A tibble: 1,016 × 5
       Agency                             Mode  total_fares total_UPT fare_UPT_ratio
       <chr>                              <chr>       <dbl>     <dbl>          <dbl>
     1 Altoona Metro Transit              Dema…       17163        26          660. 
     2 Central Pennsylvania Transportati… Dema…    14084946    280455           50.2
     3 Hampton Jitney, Inc.               Comm…    21539188    521577           41.3
     4 County of Placer                   Comm…       40847      1054           38.8
     5 Lane Transit District              Dema…    10724805    314974           34.0
     6 Pennsylvania Department of Transp… Comm…    14580664    452034           32.3
     7 Hyannis Harbor Tours, Inc.         Ferr…    25972659    878728           29.6
     8 Trans-Bridge Lines, Inc.           Comm…    11325199    403646           28.1
     9 SeaStreak, LLC                     Ferr…    16584600    750392           22.1
    10 Cambria County Transit Authority   Dema…      520554     25831           20.2
    # ℹ 1,006 more rows
  5. In 2022, the VIA Metropolitan Transit via vanpool had the lowest expenses per vehicle revenue mile with a ratio of 0.37. Once again, the transportation mode of vanpool has been a key finding in the metrics observed. Somethings that could be further explored are the characteristics of the areas vanpools are popular in.

    low_expense_VRM <- USAGE_AND_FINANCIALS |>
      group_by(Agency, Mode) |>
      summarize(total_expenses = Expenses, total_VRM = VRM, .groups = 'drop') |>
      mutate(lowestVRM = total_expenses/total_VRM) |>
      arrange(lowestVRM)
    print(low_expense_VRM)
    # A tibble: 1,016 × 5
       Agency                               Mode  total_expenses total_VRM lowestVRM
       <chr>                                <chr>          <dbl>     <dbl>     <dbl>
     1 VIA Metropolitan Transit             Vanp…        1298365   3505579     0.370
     2 County of Miami-Dade                 Vanp…        1191874   3091052     0.386
     3 County of Volusia                    Vanp…          87487    222484     0.393
     4 Corpus Christi Regional Transportat… Vanp…         433951   1006399     0.431
     5 Metropolitan Transportation Commiss… Vanp…        5491767  12341055     0.445
     6 Central Midlands Regional Transport… Vanp…         195326    438557     0.445
     7 Fort Worth Transportation Authority  Vanp…        1141477   2372285     0.481
     8 San Joaquin Council                  Vanp…        4629125   9297516     0.498
     9 Salem Area Mass Transit District     Vanp…         238952    468018     0.511
    10 San Diego Association of Governments Vanp…        5264624   9740828     0.540
    # ℹ 1,006 more rows
  6. In 2022, the Chicago Water Taxi (Wendella) via Ferry had the highest total fares per vehicle revenue mile with a ratio of 237. Like the fourth farebox recovery metric, this data point also seems skewed because of the lack of vehicle revenue miles involved. I would suggest setting a minimum amount of vehicle revenue miles as well to gauge a better understanding of which transit system boasts the highest total fare per vehicle revenue mile.

    high_fare_VRM <- USAGE_AND_FINANCIALS |>
      group_by(Agency, Mode) |>
      summarize(total_fares = `Total Fares`, total_VRM = VRM, .groups = 'drop') |>
      mutate(fare_VRM_ratio = total_fares/total_VRM) |>
      arrange(desc(fare_VRM_ratio))
    print(high_fare_VRM)
    # A tibble: 1,016 × 5
       Agency                             Mode  total_fares total_VRM fare_VRM_ratio
       <chr>                              <chr>       <dbl>     <dbl>          <dbl>
     1 Chicago Water Taxi (Wendella)      Ferr…      142473       600          237. 
     2 Altoona Metro Transit              Dema…       17163        75          229. 
     3 Jacksonville Transportation Autho… Ferr…     1432549      9084          158. 
     4 Chattanooga Area Regional Transpo… Incl…     3005198     20128          149. 
     5 Hyannis Harbor Tours, Inc.         Ferr…    25972659    188694          138. 
     6 SeaStreak, LLC                     Ferr…    16584600    143935          115. 
     7 Cape May Lewes Ferry               Ferr…     6663334     71640           93.0
     8 Woods Hole, Martha's Vineyard and… Ferr…    33424462    364574           91.7
     9 Washington State Ferries           Ferr…    57644277    738094           78.1
    10 County of Pierce                   Ferr…     2979914     44548           66.9
    # ℹ 1,006 more rows

Conclusion

This analysis was ultimately inspired by how farebox recovery rates vary by transit system across the nation. The definition of efficiency I would like to use is the transit system with the highest farebox recovery rate. From this sample, the County of Miami-Dade vanpool transit system is the most efficient. The fares made on the transit system trips exceed the operation cost per trip. Not only does the return cover the operation costs, it also exceeds it slightly. If profit is not the goal for the transit system, the surplus revenue can be reinvested into infrastructure to ensure smooth operations. Continuously enhancing the rider experience can help sustain the system over the long term, with the aim of increasing ride share participation over time.

There are other data points that can be analyzed and incorporated into this analysis for further exploration. Some areas of interest I would explore are comparing trips taken to the total population the area serves. This can give a better idea about what percentage of the population is utilizing the public transportation system. Trends about public transportation usage based on the population available can highlight how popular public transit is depending on an area. Another area of interest I would like to explore is the carbon emission reduction provided by transit systems. A transit system’s financial stability could be easily offset by environmental impact depending on the mode of transportation. There are multitudes of other data points that can be extrapolated to explore how efficient a transit system is within the scope of defining what efficiency is.