Updating Marble's OSM Data Server

Recently I wrote about options for getting OSM indoor map data for KDE Itinerary’s work-in-progress indoor map feature for train stations and airports. The most flexible option mentioned there was using Marble’s OSM data tiles (which I had slightly misleadingly called “vector tiles”, a term in the OSM world usually referring to a different data format with much more application-specific pre-processing applied). Here’s an update on how this topic has progressed since.

Understanding the current setup

The first challenge was figuring out how the current system works, and specifically how the data currently served by maps.kde.org has been generated. Not everyone involved in the creation of that system is still available anymore, so this required a bit of code and system archaeology. Thanks to the help from Torsten, Ben and Nicolás with that!

So far it appears that:

The current data has been pre-generated once on a no longer existing system. Which input data and generation parameters were used is lost.
The tools used for generation are in the Marble repository and after minor fixes still work.
The code in the Marble repository for on-demand data tile generation apparently never got deployed on the existing system.

There’s no memory of how long the generation took back then. The best-case estimate based on measurements on current hardware and after applying some of the optimizations mentioned below would suggest a full world-wide dataset would take about 30 days on an eight core machine.

Far from ideal, as I was hoping to achieve an update latency of about two weeks, not even mentioning the huge energy cost.

Geometry reassembly

Before looking into a more efficient way to do this, I wanted to make sure though we could solve the geometry reassembly issues mentioned in the previous post, as that would be a major blocker.

That turned out to be surprisingly simple in the end, a small mistake when converting between different coordinate formats caused a precision loss from the OSM base resolution of 100 nanodegree to about 6 microdegree. That’s the difference between a centimeter and half a meter, ie. enough to noticeably distort indoor building structures (commit).

Screenshots comparing the geometry of pillars and rooms before and after fixing a precision loss issue in Marble's OSM data tiles. — Pillars and room shapes in Paris Gare de Lyon before and after fixing the coordinate precision loss issue.

On-demand tile generation

With a full-scale pre-generation off the table, the obvious alternative would be on-demand generation of requested tiles. For this we can actually take quite some inspiration from how the OSM raster tiles are generated.

The key elements there are mod_tile, an Apache extension for serving map tiles and managing a cache of those, and Tirex, a map tile generation scheduler. This setup isn’t limited to raster tiles, nor to any specific tile generator.

OSM’s own statistics show that even on their much much wider used setup high zoom level tiles are only actually needed for a tiny fraction of the world’s surface, so there’s a lot of resources to be saved this way.

Besides having to write a bit of glue code to interface Marble’s tile generator with Tirex, this however means that the generation of a single tile (or rather a batch of 8x8 tiles, the smallest unit Tirex works with) has to be fast enough for on-demand generation, as well as having a reasonably restrained memory consumption.

Optimizing tile generation

With the existing system, that wasn’t really the case though, due to two major costs: Loading of the input data, and processing/serialization of the resulting tile.

Input Data

To generate a data tile we need to load the raw data for the region covered by the tile, and ideally not much more than that. The full dataset is about 60GB in compact binary storage, without any indexing, and far from evenly spread across the earth’s surface, so even just loading the data of an urban area is non-trivial, let alone finding the right subset in the first place.

The previous approach used a recursive split of the full database, into 2¹⁰ x 2¹⁰ input tiles. That can be done easily using the osmconvert tool, and gives us a crude spatial index. This is however a fairly resource-intensive process with no support for incremental updates, and it still gives us 2⁸ times too much data to load for a tile batch on the highest zoom level.

A proper spatially indexed database with efficient incremental update support would be ideal instead. Several options for this exist, unfortunately the most common ones apply application-specific transformations and thus cannot reproduce the original raw OSM data again.

Attending the virtualized SOTM2020 conference earlier this month however made me discover OSMExpress (osmx), which does offer exactly that. Initial tests look very promising, but there’s of course a price to pay for this as well, in the form of needing 600+GB of disk space.

Processing and Output

Once we have the raw data loaded, it’s reduced in level of detail (on the lower zoom levels), clipped to the tile boundaries and eventually written back out to its binary serialization format.

While there are a few things in there with interesting algorithmic complexity, like clipping of non-trivial polygons, this is mainly an area for technical optimizations. That is, avoiding temporary allocations, avoiding detaches in implicitly shared types, using more efficient file formats and plugging a few memory leaks (see e.g. merge requests 2, 5 or 6).

With all that we are already reaching the needed performance and memory bounds, and there’s probably another 30-50% to be gained by bypassing more of Marble’s internal data structures. Those do provide ABI stability and abstraction over different data sources, features we don’t need here but that nevertheless incur a cost.

Outlook

So far this setup has been successfully tested on a subset of the OSM data (about 10%, covering a high data density area in central Europe). The results are very promising, both regarding performance and regarding improvements in the data quality.

After getting a bigger SSD a full-scale test is now under way, if that doesn’t expose any blockers I hope we get this deployed on maps.kde.org in the not too distant future :) There’s still a few more details to fix in the data processing to retain all information KDE Itinerary would need, but that’s then fairly easy to do once the infrastructure is in place.