Publishing transport data
iRail has come a long way. Not that long in absolute terms, as we have existed for barely two years, but we have moved from being one student with a legal problem to a community of the most talented people I have ever met. We have been given great opportunities, from speaking at CeBIT, OGDCamp, TEDxUHasselt and Re:Publica to talking with policy makers about how to publish transport data. The latter has always been a very difficult question, and so far this is the answer we have come up with:
There’s a lot to publishing transport data in the open. Daniel Dietrich once told me that open transport data is probably the most interesting kind of open data, and he’s totally right. You can publish statistics, you can publish dynamic data (as most transport companies are doing today) and you can publish real-time data. But it doesn’t stop at publishing data: the feedback you can get from the crowd is immense. You can even work together on your data as a common resource. But let’s start at the beginning.
Static statistical data
Thanks to Freedom of Information legislation, you have the right to know and to request information held by your government. Surprisingly, not many public transport agencies publish raw data about punctuality, disturbances, and so on. The cost for a public transport company to publish this data is almost nonexistent: any organisation needs to collect this data for internal use anyway, and since the data never changes once published, it can be released as raw data dumps. A simple link on a data portal should allow anyone to consult the dataset at any time.
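Publishing such a statistical dump really is this cheap. A minimal sketch, assuming hypothetical monthly punctuality figures (the field names are illustrative, not an official schema):

```python
import csv

# Hypothetical monthly punctuality figures; the numbers and field names
# are made-up illustrations, not real statistics.
records = [
    {"month": "2012-01", "trains_run": 98213, "on_time_pct": 89.4},
    {"month": "2012-02", "trains_run": 91102, "on_time_pct": 92.1},
]

# One static dump, written once; a link on a data portal is all that is
# needed to publish it.
with open("punctuality.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["month", "trains_run", "on_time_pct"])
    writer.writeheader()
    writer.writerows(records)
```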
Dynamic transport data
Data is dynamic when it needs periodic updates. A dataset may change only slightly over a month (e.g. a list of all bus stations) or it may change weekly (e.g. a transport company’s timetables). Such data can be published as static dumps, at the risk of people using an outdated copy, or through an Application Programming Interface (API) or web service. At iRail we have developed The DataTank to create such a web service in no time.
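To make the idea concrete, here is a minimal sketch of such a web service: an HTTP endpoint serving a dataset as JSON. The station list and URL path are illustrative assumptions; The DataTank generates this kind of endpoint from a configured data source instead of hard-coded data.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory dataset of bus/train stations.
STATIONS = [
    {"id": 1, "name": "Gent-Sint-Pieters"},
    {"id": 2, "name": "Brussel-Centraal"},
]

class StationHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the current version of the dataset; consumers always get
        # the latest data instead of a possibly stale dump.
        if self.path == "/stations":
            body = json.dumps(STATIONS).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), StationHandler).serve_forever()
```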
For dynamic transport data there are a couple of standards, none of them fully satisfactory. For publishing static data dumps the most widely used format seems to be GTFS (General Transit Feed Specification), which describes a schema of CSV files. Using GTFS you can integrate your transport service with Google Maps, Mapnificent, OpenTripPlanner and many more services. The format has some common pitfalls, though: international trains, for instance, are hard to track when they cross from one railway network to another.
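Because GTFS is plain CSV, consuming it takes only a few lines. A sketch parsing a fragment of a stops.txt file (stop IDs and coordinates below are illustrative, not taken from a real feed):

```python
import csv
import io

# A minimal GTFS stops.txt fragment. GTFS files are plain CSV with
# column names fixed by the specification; the values here are made up.
stops_txt = """stop_id,stop_name,stop_lat,stop_lon
8892007,Gent-Sint-Pieters,51.0357,3.7105
8813003,Brussel-Centraal,50.8453,4.3571
"""

# Index the stops by stop_id so a trip planner can look them up quickly.
stops = {row["stop_id"]: row for row in csv.DictReader(io.StringIO(stops_txt))}
```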
There are also some dynamic data standards, such as SIRI or BISON. None of them seems to satisfy transit companies, however: when they do publish an open API, they choose to implement their own specification (api.ns.nl, …). But they are not to blame; we too have our own specification at data.iRail.be and api.iRail.be.
Real-time transport data
Let’s get one common misconception out of the way: a RESTful, SOAP or plain-old-XML web service can never be real-time. Real-time data means immediately informing all copies of the dataset about a certain change in the complete dataset, or informing all subscribers about a certain event, with almost no delay. This calls for a publish-subscribe architecture (pubsub), where a subscriber connects to a publisher, which then informs that subscriber whenever an event occurs. You can think of this architecture as a chat box. When you start a conversation with a certain vehicle, you subscribe to it. You say, for instance: “Hi train 314, can you please keep me up to date on your possible delays from now on? Thanks.” After a while, train 314 may tell you that it has some technical difficulties because it hit a llama. Or someone else on the train may inform the servers that the train has hit a llama, and once confirmed, the train tells all its subscribers.
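The conversation with train 314 can be sketched as a toy in-process pubsub hub. Class and method names are illustrative; a real deployment would push events over the network (e.g. WebSockets) rather than call functions directly.

```python
from collections import defaultdict

class Hub:
    """A toy publish-subscribe hub: subscribers register a callback per
    vehicle, and published events are pushed to them with no polling."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, vehicle_id, callback):
        # "Hi train 314, keep me up to date" -- register interest.
        self.subscribers[vehicle_id].append(callback)

    def publish(self, vehicle_id, event):
        # The vehicle (or a confirmed crowd report) pushes the event
        # to every subscriber immediately.
        for callback in self.subscribers[vehicle_id]:
            callback(vehicle_id, event)

hub = Hub()
received = []
hub.subscribe("train-314", lambda vehicle, event: received.append((vehicle, event)))
hub.publish("train-314", "delayed: hit a llama")
```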
The best example of this is http://42transit.com/, a user interface where you can subscribe to real-time information from the Dutch railway company.
Now we have three interfaces giving information about the same things; in other words, these things have different representations. To identify these things, so that we can always be sure we are talking about the same object, we can use a Uniform Resource Identifier (URI), for example http://railwaycompanyA.com/train/314/. If this URI is always used to refer to this train, then after creating indexes we can fetch all data concerning this train without any problem. An index of all resources could be returned when you point your browser at this URI. This is a great way to make sure all the different systems inside your organisation can request the same information.
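The point of the shared URI is that the three interfaces can be merged without ambiguity. A sketch, assuming a hypothetical base URL and made-up data for the static, dynamic and real-time representations:

```python
# One URI per resource is the shared key across all representations.
# The base URL and all data below are illustrative assumptions.
BASE = "http://railwaycompanyA.com"

def train_uri(number):
    """Mint the one URI that always identifies this train."""
    return "%s/train/%d/" % (BASE, number)

# Three interfaces, each indexed by the same URI.
static_data   = {train_uri(314): {"operator": "railwaycompanyA"}}
dynamic_data  = {train_uri(314): {"scheduled_departure": "08:05"}}
realtime_data = {train_uri(314): {"delay_minutes": 7}}

# Because the key is identical everywhere, fetching all data about the
# train is a simple merge.
merged = {}
for source in (static_data, dynamic_data, realtime_data):
    for uri, fields in source.items():
        merged.setdefault(uri, {}).update(fields)
```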
Let’s not stop here. Once we have our URIs pointing to things, we can categorise all our things. These categories need a description, for instance: a train has a location, can hold a certain number of people, consists of different carriages, and so forth. This description can itself be expressed by combining different URIs. Such a description is called an ontology.
And let’s not stop here either: once all our things are described by ontologies, we can link things from this transport company to things from other transport companies, and thus query data from different companies as if they were one.
The technology for these rather abstract concepts has been in place for a while: RDF.
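RDF boils down to subject-predicate-object triples whose terms are URIs. A sketch of the cross-company linking, using the real owl:sameAs predicate to state that two URIs identify the same thing; the company URIs and the platforms predicate are illustrative assumptions:

```python
# owl:sameAs is the standard OWL predicate for "these two URIs identify
# the same resource". The other URIs below are made-up examples.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

triples = [
    # Company A's station and company B's halt are the same place.
    ("http://railwaycompanyA.com/station/ghent/", SAME_AS,
     "http://railwaycompanyB.com/halt/9001/"),
    # A fact published only by company B.
    ("http://railwaycompanyB.com/halt/9001/",
     "http://example.org/ns#platforms", "12"),
]

def facts_about(uri):
    """Collect facts about a resource, following owl:sameAs links, so we
    can query both companies' data as if it were one dataset."""
    aliases = {uri}
    for s, p, o in triples:
        if p == SAME_AS and (s in aliases or o in aliases):
            aliases.update({s, o})
    return [(s, p, o) for s, p, o in triples
            if s in aliases and p != SAME_AS]
```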
Open Data is not expensive
A second misconception is that open data is expensive. Everything an open data policy requires (structured data, identification, ontologies, metadata, and so on) is something every organisation should have internally anyway. And if a company doesn’t have all of this in place yet, that doesn’t mean it can’t have an open data policy. As Bart Rosseau from the city of Ghent taught us at “V-ICT-OR shops IT“, an open data policy is the perfect tool to get your internal data structure in place. Use it to challenge your organisation. When you can say, and prove, that you have a 5-star data structure, you have something worth more than a certificate, for free.
Open data engagement
Until now we haven’t really spoken about open data itself. So far we have talked about structuring data to enter the 21st century, but we haven’t talked about licences, getting data feedback, replying to demand-driven requests, documentation, working together on data as a common resource and, in general, how to get added value from an open data policy as such. But that’s something for a next blog post.