Public Transport Line Metadata

KPublicTransport gives us access to real-time departure and journey information for many public transport systems. However, the presentation of the result isn’t ideal yet, as we are missing access to the characteristic symbols/icons and colors often associated with public transport lines.

Line Metadata

There are several pieces of information about public transport lines that are relevant for display:

The name of the line. That’s often a simple number or letter, rarely also a more complex name.
The icon of the line. That’s usually a fairly basic colored shape with the line name or number on it. Those tend to be used on all local signage, so showing those icons in our apps significantly helps with guiding you to the right place.
The name, icon and color for the product type or mode of the line, say a tram, subway or rapid transit service. Logos for those tend to be prevalent in local signage as well, e.g. the “M” for subways in France, or the white “S” on a green circle for rapid transit trains in Germany.

Query results from online backends for departures or journeys typically contain the line name, and sometimes a color (without any guarantees of whether this is a line color or a product color). Line or product logos are not provided by any backend we have so far.

So we need a separate data source for this. However, because line names and numbers are usually very short and simple, they are far from unique. Matching them against another data source therefore requires a geographic bounding box for each line, so we can distinguish between, say, “U1” in Berlin and “U1” in Hamburg.
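
To illustrate why the bounding box matters, here is a minimal sketch of a name-plus-location lookup. The `Line` type, the `find_line` function and the coordinates are made up for illustration (the boxes are rough hand-picked values, not real OSM data), and this is not how KPublicTransport stores or searches its data.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    name: str
    city: str  # only here to make the example readable
    # Bounding box in degrees: (min_lon, min_lat, max_lon, max_lat)
    bbox: tuple[float, float, float, float]


# Hypothetical entries with rough, hand-picked boxes.
LINES = [
    Line("U1", "Berlin", (13.0, 52.3, 13.8, 52.7)),
    Line("U1", "Hamburg", (9.7, 53.4, 10.3, 53.8)),
]


def find_line(name: str, lat: float, lon: float) -> Line | None:
    """Return the first line with a matching name whose box contains the point."""
    for line in LINES:
        min_lon, min_lat, max_lon, max_lat = line.bbox
        if line.name == name and min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            return line
    return None
```

Looking up “U1” near 52.52°N 13.40°E finds the Berlin entry, while the same name near 53.55°N 9.99°E finds the Hamburg one. The name alone could not tell them apart.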

Data Sources

The data we are looking for is available in our two favorite sources, Wikidata and OpenStreetMap. We need both here: bounding boxes are only available via OSM, and logos are only available via Wikidata. Line colors and information about the type of service are available on both. The corresponding entities are linked by each having a property referring to the unique identifier in the other database.

What we are looking for in the OSM data are relations of type route_master, and their member relations of type route. A route_master in OSM is what KPublicTransport calls a “line”, i.e. a set of routes that operate under the same name. In the common simple case a line consists of two routes, each being a sequence of stops in one direction.

The Wikidata items linked to an OSM route_master are ideally subclasses of transport line or railway line, and have a “part of” relation to an item describing the product or mode of transport.

Obtaining Data

The first attempt at getting the OSM data used an Overpass query, following Julius Tens’ previous work on this. This works great on relatively small areas such as a single city, but failed to scale to worldwide coverage. Even with various optimizations, dynamically adjusting the queried areas and distributing the work across five servers, it would have taken many hours, if not a few days, to complete a single run.

Instead we are now using a full local copy of the OSM database. That’s still a multi-hour download initially and needs 150+GB of SSD storage to meaningfully work with it, but it can be kept up to date fairly efficiently thanks to incremental updates, and any further experiments on that data then don’t incur cost on common infrastructure.

There are various ways to work with the full OSM dataset locally. I ended up with a fairly crude but simple approach:

Use osmfilter to run basic filters on the full file, massively reducing the dataset to the elements we are interested in. This can take 15-30 minutes (largely I/O bound), but the result is a subset small enough to fit into RAM entirely and to make any further processing almost instantaneous.
Use osmconvert to add bounding box annotations, and subsequently drop all depending ways and nodes with osmfilter, to simplify processing later on.
Convert the binary file we have at this point to OSM XML using osmconvert, which can then be fed into the code that had already been written for the Overpass approach as-is.
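
As a rough idea of what consuming that OSM XML could look like, here is a small Python sketch that extracts route_master relations from an embedded sample. The sample data, the `parse_lines` function and the Wikidata ID are made up, and the `bBox` tag with its "min lon, min lat, max lon, max lat" order is an assumption about what the bounding box annotation looks like. The actual processing code is not shown in this post and may work differently.

```python
import xml.etree.ElementTree as ET

# Made-up sample; tag keys follow common OSM conventions for route_master relations.
SAMPLE = """<osm version="0.6">
  <relation id="1">
    <member type="relation" ref="2" role=""/>
    <member type="relation" ref="3" role=""/>
    <tag k="type" v="route_master"/>
    <tag k="route_master" v="subway"/>
    <tag k="ref" v="U1"/>
    <tag k="colour" v="#55A822"/>
    <tag k="wikidata" v="Q99999999"/>
    <tag k="bBox" v="13.30,52.49,13.45,52.51"/>
  </relation>
  <relation id="2">
    <tag k="type" v="route"/>
    <tag k="route" v="subway"/>
    <tag k="ref" v="U1"/>
  </relation>
</osm>
"""


def parse_lines(xml_text):
    """Collect name, mode, color, Wikidata ID, bbox and routes per route_master."""
    lines = []
    for rel in ET.fromstring(xml_text).iter("relation"):
        tags = {t.get("k"): t.get("v") for t in rel.findall("tag")}
        if tags.get("type") != "route_master":
            continue  # plain routes are only referenced as members here
        bbox = None
        if "bBox" in tags:  # assumed annotation format, see above
            bbox = tuple(float(v) for v in tags["bBox"].split(","))
        lines.append({
            "name": tags.get("ref"),
            "mode": tags.get("route_master"),
            "color": tags.get("colour"),
            "wikidata": tags.get("wikidata"),
            "bbox": bbox,
            "routes": [int(m.get("ref")) for m in rel.findall("member")
                       if m.get("type") == "relation"],
        })
    return lines
```

Entries without a name, a bounding box or a Wikidata link would be candidates for the “incomplete elements” that get sorted out in the next step.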

After sorting out all incomplete elements and merging things belonging to the same line we end up with fewer than 2,000 Wikidata items to retrieve, which can be done easily with the basic Wikidata API, rather than using the much more expensive SPARQL queries. The same API also gives us the image metadata and license information for the few hundred line and product logo candidates.
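
A sketch of how such a retrieval could be batched, assuming the `wbgetentities` action of the Wikidata API and its usual limit of 50 IDs per request (which would mean at most 40 requests for 2,000 items). The helper names are made up, no network access happens here, and whether the actual tooling uses this exact API call is an assumption on my side. P154 is the Wikidata property for a logo image.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def batched_urls(ids, batch_size=50):
    """Build one wbgetentities URL per batch of item IDs."""
    urls = []
    for i in range(0, len(ids), batch_size):
        params = {
            "action": "wbgetentities",
            "ids": "|".join(ids[i:i + batch_size]),
            "props": "claims",
            "format": "json",
        }
        urls.append(f"{API}?{urlencode(params)}")
    return urls


def logo_file_name(entity):
    """Return the file name of the first logo image (P154) claim, if any."""
    for claim in entity.get("claims", {}).get("P154", []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            return snak["datavalue"]["value"]
    return None
```

The `entity` argument is expected to be one entry of the `entities` object in the JSON response. Line colors and the “part of” relation could be read from the claims in the same way.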

The entire process now completes in a bit less than an hour if your local OSM copy needs updating, and in about a minute otherwise. That is good enough for the many experimental runs needed until this produced satisfying results.

Storing Data

The above should already explain why on-demand online access isn’t an option for most of this, and we instead want to ship this information with the KPublicTransport library, not to mention the privacy advantages of this.

In the first version we are looking at about 2,500 tram, subway and rapid transit lines, but ideally we want something that can handle ten times that when also considering buses and regional as well as long-distance train lines.

Information about the lines fits into ~25kB, excluding the logos, so that part is easy. Including all possible logos would add up to a large amount of data though; even limiting this to files no larger than 10kB each would still give us ~5MB. What we therefore ended up doing is storing only the file names (~11kB) and downloading the files on first use from Wikidata.

That leaves the lookup of the metadata from a given name and geo coordinate. As storing bounding boxes per line and doing a sequential search isn’t going to scale, we need a spatial index for this. After a few different attempts this ended up being a Z-order curve-based quadtree. That might sound complicated but is a surprisingly simple approach: By interleaving the bits of integer representations of latitude and longitude you get a 1D representation of the 2D space (a Z-order curve). Besides being storable in simple data structures, this has interesting properties such as bit-shifting corresponding to switching depths in a quadtree. With that we get a very fast lookup, at just ~13kB of extra space cost. That is only marginally higher than storing bounding boxes for all lines for sequential search (assuming similar tricks for compact storage rather than the raw 16 bytes per line, in which case this would actually be much larger).
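
The bit interleaving is easier to see in code. The following is a minimal sketch of the idea, not the implementation used in KPublicTransport: the 16 bits per coordinate, the bit order (longitude in the even bits) and all function names are assumptions chosen for illustration.

```python
def quantize(value, min_value, max_value, bits=16):
    """Map a coordinate to an integer in [0, 2**bits - 1]."""
    scaled = (value - min_value) / (max_value - min_value)
    return min(int(scaled * (1 << bits)), (1 << bits) - 1)


def interleave(x, y, bits=16):
    """Put the bits of x into even and the bits of y into odd positions."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def z_index(lat, lon, bits=16):
    """Position of a coordinate on the Z-order curve."""
    x = quantize(lon, -180.0, 180.0, bits)
    y = quantize(lat, -90.0, 90.0, bits)
    return interleave(x, y, bits)


def tile_at_depth(z, depth, bits=16):
    """Quadtree tile containing z at the given depth (0 is the whole world)."""
    return z >> (2 * (bits - depth))
```

Each quadtree level consumes two bits of the index, so going one level up is a right shift by two. With these parameters, two points in central Berlin end up in the same tile at depth 8, while a point in Hamburg only shares a tile with them at coarser depths such as 4. A lookup can therefore sort tiles by their index and find the one covering a coordinate with a binary search.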

To put all those numbers in perspective, the JSON configuration files for the backend services included in the KPublicTransport library as QResource total about 140kB, so I think this is an acceptable result :)

Integration in KDE Itinerary

KDE Itinerary is already making use of this in a few places, such as the transfer elements, the alternative journey search and the local transport departure schedules at points of your journey.

Transfer element in the KDE Itinerary timeline showing official subway and rapid transit line icons.

You can help!

To make this work, we depend on information available on Wikidata and OpenStreetMap. Reviewing the data there for public transport lines around your home (or other locations of interest to you) is a good way to help. I’ll write a separate post about how to do this in more detail, as this one is already getting too long again.