Problems With GTFS
What is GTFS?
The General Transit Feed Specification (GTFS) is a specification which “defines a common format for public transportation schedules and associated geographic information. GTFS “feeds” allow public transit agencies to publish their transit data and developers to write applications that consume that data in an interoperable way”. More specifically, GTFS is a set of files that give all the details about a transit system. You have the list of stations, the list of routes, the list of trips along each route, the stop times at each station along a trip and a calendar that details what days each route is active.
The Good of GTFS
The best part of GTFS is that it is a standard (of sorts) that is easily implementable for most (if not all) transit systems. So, anyone who’s interested can use the files to show the transit details for any transit system that has published the data. Some systems put the data out publicly. Some require you to sign up before getting the data, but it’s freely available in a standardized format. This has done a great job of making this data available to a wider set of people and, hopefully, making it easier to take public transportation.
The Problems with GTFS
There are a number of problems with GTFS. Some are inherent in the specification and some are part of the implementation from various agencies. So, let’s take a look at the different problems.
How many different ways can you specify what time a bus/light rail/train/etc. stops at a station? According to GTFS, there are two. Either you can use the stop_times file, which lists the trip, the station and the arrival/departure times, or you can use the frequencies file. The frequencies file takes the data from the stop_times file, ignores the actual times, and instead lists a frequency for each trip (every 15 minutes, hourly, etc.) over a time window. The stop_times file is then only used to fill in the time difference between the different stops. I still have no idea why anyone would use the frequencies file. You still need to fill out all (or most of) the data in the stop_times file, but you also need to fill out the frequencies file. And when you’re trying to read the data, you always have to read the stop_times file, and then check whether the frequencies file has an entry for that trip as well. It makes the implementation a lot more complex for very little gain.
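To make the extra work concrete, here is a minimal sketch of what a reader has to do for every trip. It assumes the rows have already been parsed into dicts keyed by the GTFS column names (stop_sequence, stop_id, departure_time, start_time, end_time, headway_secs). The function names are made up for illustration, and treating end_time as exclusive is an assumption of this sketch.

```python
def to_seconds(hhmmss):
    """Convert a GTFS time string such as '25:10:00' to seconds after midnight."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def expand_trip(stop_times, frequency=None):
    """Return one list of (stop_id, departure_seconds) per concrete run of a trip.

    stop_times: stop_times rows for a single trip.
    frequency:  the matching frequencies row for that trip, or None.
    """
    rows = sorted(stop_times, key=lambda row: int(row["stop_sequence"]))
    departures = [to_seconds(row["departure_time"]) for row in rows]

    if frequency is None:
        # Plain case: the times in stop_times are the schedule.
        return [[(row["stop_id"], t) for row, t in zip(rows, departures)]]

    # Frequency case: only the offsets between stops are kept from stop_times.
    offsets = [t - departures[0] for t in departures]
    start = to_seconds(frequency["start_time"])
    end = to_seconds(frequency["end_time"])
    headway = int(frequency["headway_secs"])

    runs = []
    for first_departure in range(start, end, headway):
        runs.append(
            [(row["stop_id"], first_departure + off) for row, off in zip(rows, offsets)]
        )
    return runs
```

Even in this stripped-down form, every consumer needs both code paths, which is the complexity being complained about above.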
What stations are on a route? According to GTFS, the actual list of stops on a route (and the order of those stops) is only available in the stop_times file. This means that if you want to know just what stops are on a route, you need to look at the trips on each route and then look at the stop_times for each trip to get the list of stations on a route. However, there’s a catch, as always. Each stop on a trip has a stop_sequence value in the stop_times file. This is the enumeration of the order of stations on that trip. BUT, since a single trip on a route might not hit every station, you can’t just look at one trip to get the list of stations. Let’s look at an example with three stations (A, B, C) and three trips:
Trip 1: A, C
Trip 2: A, B, C
Trip 3: A, C
So, we have three trips over three stations, and station B is only used on one of the three trips. The stop sequence values for A are (1, 1, 1), for B (2) and for C (2, 3, 2). The only trip that uses all three stations is Trip 2, but nothing in the feed tells you that. So, you need to average the stop sequence of each station over the trips it appears on to determine the correct list of stations (and their order). The averages of A ((1+1+1)/3), B (2/1) and C ((2+3+2)/3) come out to A (1), B (2), C (2 1/3, or about 2.33). This tells us that the correct list of stations (in the correct order) is A, B, C. But it’s a pain to do. You have to look at all the trips and all the stations and do calculations to determine this basic piece of data.
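The averaging described above can be sketched in a few lines. This assumes the stop_times rows for one route have already been grouped by trip into (stop_id, stop_sequence) pairs. The function name and the input shape are made up for illustration.

```python
from collections import defaultdict


def route_stop_order(trips):
    """Order the stops of a route by their average stop_sequence.

    trips: dict mapping trip_id to a list of (stop_id, stop_sequence) pairs.
    """
    sequences = defaultdict(list)
    for stops in trips.values():
        for stop_id, stop_sequence in stops:
            sequences[stop_id].append(stop_sequence)

    # Average each stop's sequence over the trips it actually appears on.
    averages = {stop: sum(seqs) / len(seqs) for stop, seqs in sequences.items()}
    return sorted(averages, key=averages.get)


trips = {
    "Trip 1": [("A", 1), ("C", 2)],
    "Trip 2": [("A", 1), ("B", 2), ("C", 3)],
    "Trip 3": [("A", 1), ("C", 2)],
}
print(route_stop_order(trips))  # ['A', 'B', 'C']
```

Keep in mind that this is a heuristic. Two stops can end up with the same average, in which case this sketch just keeps them in the order they were first seen.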
What days does each route run? Most, if not all, transit systems have different schedules on different days of the week. Some might have a weekday and a weekend schedule, others might run some routes only on specific days. The way to determine this in GTFS is the calendar file. It has a start and end date and a list of days of the week to mark which ones are active. This ties into the trip data to be able to say which trips take place on which days. But, as always it seems, GTFS leaves an out for a different way of specifying things. The calendar_dates file is generally used for holidays or out of service days. You can specify which days are exempt from the general schedule and what schedule to use instead. This lets transit systems say that even though Christmas is on Tuesday, the buses are running on a Sunday schedule. But, you can also put every single day into the calendar_dates file and completely ignore the calendar file. So, instead of saying, here’s the Monday schedule, you say here’s the schedule for December 31 2012 and here’s the schedule for Jan 1 2013 and so on for every single day. This turns a 3-5 line calendar file into a calendar_dates file with hundreds of lines.
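Here is a rough sketch of how a reader ends up combining the two files for a given day. It assumes both files have been parsed into lists of dicts keyed by the GTFS column names, with exception_type 1 meaning "service added" and 2 meaning "service removed". The function names and the service ids in the example are made up.

```python
from datetime import date

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def parse_date(yyyymmdd):
    """Parse a GTFS date such as '20121225'."""
    return date(int(yyyymmdd[:4]), int(yyyymmdd[4:6]), int(yyyymmdd[6:8]))


def active_services(day, calendar_rows, calendar_date_rows):
    """Return the set of service_ids running on the given day."""
    active = set()
    weekday = WEEKDAYS[day.weekday()]  # date.weekday(): Monday is 0

    # Regular weekly pattern from the calendar file.
    for row in calendar_rows:
        in_range = parse_date(row["start_date"]) <= day <= parse_date(row["end_date"])
        if in_range and row[weekday] == "1":
            active.add(row["service_id"])

    # Exceptions from the calendar_dates file override the weekly pattern.
    for row in calendar_date_rows:
        if parse_date(row["date"]) != day:
            continue
        if row["exception_type"] == "1":
            active.add(row["service_id"])
        elif row["exception_type"] == "2":
            active.discard(row["service_id"])
    return active
```

For the Christmas example, the feed would carry two calendar_dates rows for 20121225: one removing the weekday service and one adding the Sunday service. A feed that lists every single day in calendar_dates still works with this code, since calendar_rows can simply be empty.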
Implicit promises. Generally, if you see a field marked as an id (such as trip_id or route_id), it’s an implicit promise that it will be an integer field. Too many transit agencies start throwing letters and special characters into these fields, which should be integers.
Short duration. Generally, most transit schedules don’t change that often. Yet these agencies feel the need to send out files which only list the schedule for the next three months. And, often, they don’t send out the updated files until right before the three months are up (and too often they don’t send out the updates until AFTER the three months are over). What’s the problem with putting out a year of data at a time, or even a couple of years? If you know something is going to change, then put it in the files. The problem comes when the end date in the calendar expires and the new files haven’t been sent out yet. Then, according to GTFS, there isn’t any service going on. An application that follows the specification will show that there is no service on those days, an outcome that isn’t good for anyone.
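Until agencies publish longer feeds, about the best an application can do is notice that a feed is about to run out. A minimal sketch, assuming the calendar and calendar_dates files have been parsed into lists of dicts keyed by the GTFS column names (the function names are made up):

```python
from datetime import date


def feed_last_service_day(calendar_rows, calendar_date_rows):
    """Return the last date with any scheduled service, or None if there is none."""
    ends = [row["end_date"] for row in calendar_rows]
    # Dates added through calendar_dates (exception_type 1) also count as service.
    ends += [row["date"] for row in calendar_date_rows
             if row["exception_type"] == "1"]
    if not ends:
        return None
    latest = max(ends)  # YYYYMMDD strings sort in date order
    return date(int(latest[:4]), int(latest[4:6]), int(latest[6:8]))


def days_until_expiry(today, calendar_rows, calendar_date_rows):
    """Days of service left in the feed; negative once it has expired."""
    last_day = feed_last_service_day(calendar_rows, calendar_date_rows)
    if last_day is None:
        return None
    return (last_day - today).days
```

A negative result means the application is already in the "no service" situation described above and should probably warn the user rather than show an empty schedule.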
I don’t want to make it sound like I’m ungrateful for this bounty of free data being available for developers and users across the world. But we shouldn’t be willing to settle for good enough when there are clear ways to fix these issues. Standardize solutions, standardize fields (with types) and encourage longer duration files. These are simple things that will make GTFS more useful and easier to use for everyone.