Tracking Amtrak 188
How a quirky data source let us get ahead of the story
Al Jazeera America published an analysis of data related to the crash of Amtrak 188 this week, using historical data that was technically publicly available. There’s nothing super surprising about that (we are a news operation, after all), except that we published the speed the train was traveling (now widely known to be twice the speed limit) at roughly the same time as the major news outlets that first broke the story. (In the end, they just barely beat us, a differential we’re likely to explain away as “CMS difficulties.”) We’re a much smaller organization, we don’t have a veteran transportation reporter with years of sources to pull from, and Amtrak wasn’t calling us back. Even so, we were able to follow up with a day-two story comparing the train that crashed to hundreds of previous trips this year, to see how common speeding through this junction is, even though the historical data we used wasn’t readily available after the crash.
The lesson for us was that there’s often more than one way to get the same information. With advance planning, small organizations can overcome limited resources and still report stories quickly.
How We Did It
Last year, a friend tipped me off to an interesting new tool from Amtrak that published real-time train data, though without offering any data archives. After a little bit of tinkering, I wrote a script, now named “Amtracker,” that pulled the data on a five-minute cron job, which seemed to be how often the data was being refreshed. It was very low-fi, and it simply put the files into an S3 bucket.
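For readers who want to set up something similar, here is a minimal sketch of that kind of archiver in Python. It is not the actual Amtracker code. The fetch and store callables are hypothetical stand-ins (for an HTTP request to the feed and an S3 upload, respectively), and the key layout is an assumption chosen so that files sort by time.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def archive_key(fetched_at: datetime, prefix: str = "amtrak") -> str:
    """Build a time-sortable object key, e.g. amtrak/2015/05/12/2122.json."""
    return f"{prefix}/{fetched_at:%Y/%m/%d/%H%M}.json"


def snapshot(fetch: Callable[[], bytes],
             store: Callable[[str, bytes], None],
             now: Optional[datetime] = None) -> str:
    """Fetch the live feed once and store the raw bytes under a dated key.

    `fetch` and `store` are injected (hypothetical) so the archiving logic
    does not depend on any particular HTTP or S3 client.
    """
    now = now or datetime.now(timezone.utc)
    key = archive_key(now)
    store(key, fetch())  # keep the untouched payload; parse it later
    return key
```

Storing the raw response, rather than a parsed subset, means a question nobody thought of at collection time can still be answered later. Running the script every five minutes is then a one-line crontab entry with a */5 * * * * schedule.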
At that point, I knew the data set was interesting, but I didn’t have a specific project in mind. The data includes elements like how many minutes the train is delayed, its speed, origin, destination, and others—so there was a lot to work with. I considered a few projects—analyzing speed and curve data (though determining the threshold of “sharp curve” would be tricky), a train-delay predictor, or a “data sonification,” wherein constantly changing train positions could represent shifting string lengths on an instrument and play different notes—but all involved significant effort without an immediate news hook.
Meanwhile, the script continued to run.
It ran for a while.
I knew it was working because (1) I got the S3 bill every month, and (2) I set up a bot to tweet whenever it ran, similar to how Mockingjay is set up to report its status via Twitter.
Then, Amtrak 188 derailed on a sharp curve in Philadelphia, and the data became singularly newsworthy.
Our team had a quick discussion and decided the most useful thing we could make immediately was an annotated map of the train’s trajectory and its final recorded speed (in this data set) of 106.22 mph. Alex Newman, a colleague here on the multimedia team, put everything into a quick CartoDB map with annotations of the various speed limits. The data for 188’s last broadcast looks like this:
{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [
      -75.09967,
      40.00172
    ]
  },
  "properties": {
    "ID": "408569",
    "TrainNum": "188",
    "Aliases": "",
    "OrigSchDep": "5/12/2015 7:10:00 PM",
    "OriginTZ": "E",
    "TrainState": "Active",
    "Velocity": "106.216995239258",
    "RouteName": "Northeast Regional",
    "CMSID": "1241245668597",
    "OrigCode": "WAS",
    "DestCode": "NYP",
    "EventCode": "TRE",
    "EventDT": "",
    "LastValTS": "5/12/2015 9:22:49 PM",
    "Heading": "E",
    "EventSchAr": "",
    "EventSchDp": "",
    "Station1": "{\"code\":\"WAS\",\"tz\":\"E\",\"bus\":false,\"schdep\":\"05/12/2015 19:10:00\",\"schcmnt\":\"\",\"autoarr\":false,\"autodep\":false,\"postdep\":\"05/12/2015 19:15:00\",\"postcmnt\":\"5 MI LATE\"}",
    "Station2": "{\"code\":\"NCR\",\"tz\":\"E\",\"bus\":false,\"schdep\":\"05/12/2015 19:22:00\",\"schcmnt\":\"\",\"autoarr\":false,\"autodep\":false,\"postarr\":\"05/12/2015 19:25:00\",\"postdep\":\"05/12/2015 19:28:00\",\"postcmnt\":\"6 MI LATE\"}",
    "Station3": "{\"code\":\"BWI\",\"tz\":\"E\",\"bus\":false,\"schdep\":\"05/12/2015 19:37:00\",\"schcmnt\":\"\",\"autoarr\":false,\"autodep\":false,\"postarr\":\"05/12/2015 19:39:00\",\"postdep\":\"05/12/2015 19:42:00\",\"postcmnt\":\"5 MI LATE\"}",
    "Station4": "{\"code\":\"BAL\",\"tz\":\"E\",\"bus\":false,\"scharr\":\"05/12/2015 19:52:00\",\"schdep\":\"05/12/2015 19:54:00\",\"schcmnt\":\"\",\"autoarr\":false,\"autodep\":false,\"postarr\":\"05/12/2015 19:53:00\",\"postdep\":\"05/12/2015 19:55:00\",\"postcmnt\":\"1 MI LATE\"}",
    "Station5": "{\"code\":\"ABE\",\"tz\":\"E\",\"bus\":false,\"schdep\":\"05/12/2015 20:16:00\",\"schcmnt\":\"\",\"autoarr\":false,\"autodep\":false,\"postarr\":\"05/12/2015 20:13:00\",\"postdep\":\"05/12/2015 20:17:00\",\"postcmnt\":\"1 MI LATE\"}",
    "Station6": "{\"code\":\"WIL\",\"tz\":\"E\",\"bus\":false,\"schdep\":\"05/12/2015 20:43:00\",\"schcmnt\":\"\",\"autoarr\":false,\"autodep\":false,\"postarr\":\"05/12/2015 20:46:00\",\"postdep\":\"05/12/2015 20:48:00\",\"postcmnt\":\"5 MI LATE\"}",
    "Station7": "{\"code\":\"PHL\",\"tz\":\"E\",\"bus\":false,\"scharr\":\"05/12/2015 21:07:00\",\"schdep\":\"05/12/2015 21:10:00\",\"schcmnt\":\"\",\"autoarr\":false,\"autodep\":false,\"postarr\":\"05/12/2015 21:06:00\",\"postdep\":\"05/12/2015 21:10:00\",\"postcmnt\":\"ON TIME\"}",
    "Station8": "{\"code\":\"TRE\",\"tz\":\"E\",\"bus\":false,\"schdep\":\"05/12/2015 21:37:00\",\"schcmnt\":\"\",\"autoarr\":false,\"autodep\":true,\"estarr\":\"05/12/2015 21:37:00\",\"estdep\":\"05/12/2015 21:38:00\",\"estarrcmnt\":\"ON TIME\",\"estdepcmnt\":\"01 MI LATE\"}",
    "Station9": "{\"code\":\"MET\",\"tz\":\"E\",\"bus\":false,\"schdep\":\"05/12/2015 21:59:00\",\"schcmnt\":\"\",\"autoarr\":true,\"autodep\":true,\"estarr\":\"05/12/2015 21:58:00\",\"estdep\":\"05/12/2015 21:59:00\",\"estarrcmnt\":\"01 MI EARLY\",\"estdepcmnt\":\"ON TIME\"}",
    "Station10": "{\"code\":\"NWK\",\"tz\":\"E\",\"bus\":false,\"schdep\":\"05/12/2015 22:14:00\",\"schcmnt\":\"\",\"autoarr\":true,\"autodep\":true,\"estarr\":\"05/12/2015 22:11:00\",\"estdep\":\"05/12/2015 22:12:00\",\"estarrcmnt\":\"03 MI EARLY\",\"estdepcmnt\":\"02 MI EARLY\"}",
    "Station11": "{\"code\":\"NYP\",\"tz\":\"E\",\"bus\":false,\"scharr\":\"05/12/2015 22:34:00\",\"schcmnt\":\"\",\"autoarr\":true,\"autodep\":false,\"estarr\":\"05/12/2015 22:25:00\",\"estarrcmnt\":\"09 MI EARLY\"}",
    "StatusMsg": "",
    "gx_id": "408569"
  }
}
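Two quirks of this record are worth noting before doing anything with it: every value, including the velocity, arrives as a string, and each StationN field is itself a JSON document encoded inside a string. A rough Python sketch of unpacking it follows. The sample is a trimmed copy of the record above (two stations, a few fields each), and parse_feature is a hypothetical helper, not part of any Amtrak tooling.

```python
import json
import re

# Trimmed copy of the record above; only a few fields are kept.
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-75.09967, 40.00172]},
    "properties": {
        "TrainNum": "188",
        "Velocity": "106.216995239258",
        "Station7": '{"code":"PHL","tz":"E","postdep":"05/12/2015 21:10:00","postcmnt":"ON TIME"}',
        "Station8": '{"code":"TRE","tz":"E","estarr":"05/12/2015 21:37:00","estarrcmnt":"ON TIME"}',
        "StatusMsg": "",
    },
}

STATION_KEY = re.compile(r"^Station(\d+)$")


def parse_feature(feature: dict) -> dict:
    """Convert string-typed fields into numbers and decode nested station JSON."""
    props = feature["properties"]
    lon, lat = feature["geometry"]["coordinates"]  # GeoJSON order: lon, lat
    numbered = []
    for key, value in props.items():
        match = STATION_KEY.match(key)
        if match and value:
            # Each station value is a JSON string that needs a second decode.
            numbered.append((int(match.group(1)), json.loads(value)))
    stations = [station for _, station in sorted(numbered, key=lambda pair: pair[0])]
    return {
        "train": props["TrainNum"],
        "mph": float(props["Velocity"]),
        "lon": lon,
        "lat": lat,
        "stops": [station["code"] for station in stations],
    }


parsed = parse_feature(feature)
print(parsed["train"], round(parsed["mph"], 2), parsed["stops"])
```

Sorting on the numeric suffix matters because a plain string sort would put Station10 ahead of Station2.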
We published both a detail view and the full track so that readers could explore the rest of the trajectory. At that point, we didn’t know whether the train had exceeded the speed limit anywhere else on its journey, but we felt the full data was newsworthy material.
Concurrently, we wanted to do an analysis of how train 188’s speed into the curve compared to other trains historically. I didn’t realize that the Amtrak data endpoint only returns 100 results per page when I began scraping the data, though, so I had only been collecting the first page of results until a few months ago when I corrected the error (#dataheartbreak, I know). I pulled the comprehensive data from S3, which took a while both because it was 20 GB and because S3 kept stalling to < 2 kbps download speeds. After going home to get a computer with other Amazon credentials and switching download clients, I had all the data.
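The pagination mistake is an easy one to guard against. The sketch below, in Python, keeps requesting pages until one comes back short. It assumes only what is stated above, a page size of 100. How the real endpoint expects a page to be requested is not covered here, so fetch_page is a hypothetical callable that takes a zero-based page number and returns that page's records.

```python
from typing import Callable, Iterator, List

PAGE_SIZE = 100  # results per page, as described in the text


def fetch_all(fetch_page: Callable[[int], List[dict]],
              page_size: int = PAGE_SIZE) -> Iterator[dict]:
    """Yield every record, stopping after the first page shorter than page_size."""
    page = 0
    while True:
        records = fetch_page(page)
        yield from records
        if len(records) < page_size:
            return  # a short (or empty) page means nothing is left
        page += 1
```

If the total happens to be an exact multiple of the page size, the loop makes one extra request and stops on the empty page it gets back.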
A histogram seemed to be the clearest way to display the information. I filtered the data first by train line (Northeast Regional), then by whether each reading fell inside the curve (tested with Turf.js against a GeoJSON feature we drew with geojson.io), and finally by northbound trains only, to ensure an apples-to-apples comparison.
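The three filters and the binning step can be sketched in a few lines of Python. This is an illustration and not the published analysis code: a hand-written ray-casting test stands in for Turf.js, the polygon passed in would be the outer ring of whatever feature you draw, and treating the headings N, NE, and E as "northbound" is an assumption made for this example (the record for 188 above reports a heading of E at the curve).

```python
from collections import Counter


def point_in_ring(lon, lat, ring):
    """Ray-casting test against a closed ring of [lon, lat] pairs."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > lat) != (y2 > lat):  # edge straddles the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def curve_speeds(features, curve_ring, route="Northeast Regional",
                 northbound_headings=("N", "NE", "E")):
    """Return speeds (mph) for readings on the route, northbound, inside the curve."""
    speeds = []
    for feature in features:
        props = feature["properties"]
        lon, lat = feature["geometry"]["coordinates"]
        if props["RouteName"] != route:
            continue
        if props["Heading"] not in northbound_headings:  # assumed definition
            continue
        if not point_in_ring(lon, lat, curve_ring):
            continue
        speeds.append(float(props["Velocity"]))
    return speeds


def histogram(speeds, bin_width=10):
    """Count readings per bin, keyed by each bin's lower edge in mph."""
    counts = Counter(int(speed // bin_width) * bin_width for speed in speeds)
    return dict(sorted(counts.items()))
```

Each reading is counted separately here. Whether to count every reading or only one per trip (for instance, the fastest inside the curve) is a design choice this sketch leaves open.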
The code wasn’t super complicated, but I went over it with a friend before publication to ensure I didn’t make any silly mistakes.
We published the results on May 13, the day after the crash. Again, we were very fortunate to have a head start with this particular data set—but despite that advantage, it’s important to remember that smaller newsrooms with even just a handful of data journalists can advance a big story with a little elbow grease. We hope it can serve as another example of Scott Klein’s 2014 code scoop prediction:
You may feel like leaving programming to the professionals. But your next great story is locked away inside a data set. Why let somebody else get it first?
Credits
Michael Keller
Michael Keller is a reporter and developer on the Al Jazeera America Interactive Multimedia Team where he alternates between the phone and Sublime Text 2. He is also the co-founder of csv soundsystem, a New York City-based hacker collective and datathon dreamteam.