I have been looking into extending the coordinate-based timezone lookup system we have in KItinerary, in order to ready it for being moved to KDE Frameworks. The first step, however, is finding suitable boundary data files for this, in addition to the timezone ones.

Country Subdivision Boundaries

The existing code uses the IANA timezone boundaries to implement not only timezone lookup but also an almost complete country lookup. There are, however, limitations with this, and there are use cases for which an even higher level of detail would be desirable:

  • IANA timezone areas don't strictly nest within country areas; one notable exception is Asia/Bangkok, which is also used in northern Vietnam.
  • Country subdivisions are essential parts of an address in some countries (e.g. USA).
  • Public holidays can depend on which subdivision you are in (e.g. Germany).
  • Subdivisions allow better guesses of the local language in multi-lingual countries (e.g. Belgium, Switzerland).

So, ideally we’d want a coordinate lookup not only for the timezone but also for the country subdivision (ISO 3166-2); the country (ISO 3166-1) then follows from that reliably in all cases.
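
Deriving the country from the subdivision is then a pure string operation, since an ISO 3166-2 code is by definition the ISO 3166-1 alpha-2 code, a hyphen, and a local suffix. For example (a standalone sketch, not code from KItinerary):

    def country_from_subdivision(subdivision_code):
        """Derive the ISO 3166-1 alpha-2 country code from an
        ISO 3166-2 subdivision code, e.g. "DE-BY" -> "DE"."""
        country, sep, _ = subdivision_code.partition("-")
        if not sep:
            raise ValueError("not an ISO 3166-2 code: " + subdivision_code)
        return country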

Boundary Polygons

The Z-order curve used by the existing timezone lookup code is generated by a QGIS script and needs either a Shapefile or a GeoJSON file with the boundary polygons as input.
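
For context, a Z-order (or Morton) index interleaves the bits of the quantized latitude and longitude, so that nearby coordinates tend to end up with nearby indexes. A minimal sketch of the encoding, with a made-up 16-bit resolution per axis (the actual resolution used by the lookup code may differ):

    def z_order_index(lat, lon, bits=16):
        """Map a coordinate to a one-dimensional Z-order index by
        interleaving the bits of the quantized latitude/longitude."""
        # Quantize each axis to an unsigned integer of the given width.
        x = int((lon + 180.0) / 360.0 * ((1 << bits) - 1))
        y = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
        z = 0
        for i in range(bits):
            z |= ((x >> i) & 1) << (2 * i)      # even bits: longitude
            z |= ((y >> i) & 1) << (2 * i + 1)  # odd bits: latitude
        return z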

We’d also want something that matches the resolution of the timezone boundaries as closely as possible. This isn’t so much about a high level of precision (we need to reduce that considerably for efficient storage anyway), but about avoiding artifacts along the borders caused by differing levels of detail. And it goes without saying that we also need this as proper Open Data under a suitable license.
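
Reducing the precision for storage could, for instance, be done with Douglas-Peucker simplification; a sketch using shapely, with an arbitrary tolerance value rather than whatever the actual pipeline uses:

    import json
    from shapely.geometry import shape

    def simplified_boundaries(geojson_path, tolerance=0.01):
        """Yield boundary polygons with reduced detail. The tolerance
        (in degrees) is a made-up value for illustration."""
        with open(geojson_path) as f:
            features = json.load(f)["features"]
        for feature in features:
            geom = shape(feature["geometry"])
            # preserve_topology avoids introducing self-intersections
            yield feature["properties"], geom.simplify(tolerance, preserve_topology=True)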

Unfortunately with those requirements it gets hard to find ready-made datasets for this, so DIY to the rescue!

Building Boundary Polygons

We don’t have to start from scratch; the timezone boundary builder is a good starting point. Our problem is even considerably easier: we don’t have to puzzle together our polygons from various pieces of OSM data, we just need to extract the right boundaries from OSM.

Its approach of using Overpass to retrieve the needed data unfortunately didn’t work reliably for me, repeatedly resulting in timeouts and throttling. So I ended up working with a full offline OSM copy again instead.
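
For illustration, fetching all subdivision boundaries through Overpass would look roughly like this (a sketch; the timezone boundary builder's actual queries are more elaborate). A single global query of this kind is exactly the sort of request that runs into those limits:

    import requests

    # All administrative boundary relations carrying an ISO3166-2 tag,
    # with full geometry. Heavy enough to hit server-side timeouts.
    QUERY = """
    [out:json][timeout:900];
    relation["boundary"="administrative"]["ISO3166-2"];
    out geom;
    """

    response = requests.post("https://overpass-api.de/api/interpreter",
                             data={"data": QUERY})
    response.raise_for_status()
    boundaries = response.json()["elements"]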

The code for this is here. It generates Shapefile and GeoJSON files of both ISO 3166-1 and ISO 3166-2 boundaries from OSM data and performs a few sanity checks on them.
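
The sanity checks can be fairly basic; a sketch of the kind of thing that helps, using GDAL's Python bindings (the "code" field name is an assumption, not necessarily what the generator emits):

    import re
    from osgeo import ogr

    ISO_CODE = re.compile(r"^[A-Z]{2}(-[A-Z0-9]{1,3})?$")

    def check_boundaries(path):
        """Warn about features with a malformed ISO 3166 code or an
        invalid geometry in a generated Shapefile/GeoJSON file."""
        ds = ogr.Open(path)
        for feature in ds.GetLayer(0):
            code = feature.GetField("code")  # hypothetical field name
            if not code or not ISO_CODE.match(code):
                print("malformed ISO code:", code)
            geom = feature.GetGeometryRef()
            if geom is None or not geom.IsValid():
                print("invalid geometry for", code)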

ISO 3166-1 (green) and ISO 3166-2 (blue) areas shown in QGIS.

Result

You can find the current versions of the generated data here; the same license as for the OSM data applies.

While the country boundaries look reasonably complete and correct, you can spot a few issues in the subdivision data. These can, for example, have one of the following causes:

  • There are gaps in the boundaries in the OSM data. The GDAL tools used for generating the GeoJSON and Shapefile output silently discard such boundaries. Examples: IR-18, IR-30, KR-42.
  • Other forms of continuity issues in the border line. This case can be much harder to spot, as it is not necessarily visually distinguishable from a correct boundary polygon on the OSM website. It is, however, also silently discarded by the GDAL tools (one way to detect such silently dropped entries is sketched after this list). Example: BR-PR.
  • The ISO 3166-2 tag is missing on the corresponding OSM boundary. Example: CN-ZJ (already fixed).
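
To detect such silently dropped entries, one can compare the codes that made it into the output against a reference list of ISO 3166-2 codes; a sketch using pycountry as that reference (the "code" field name is again an assumption):

    from osgeo import ogr
    import pycountry

    def missing_subdivisions(path, country):
        """Return ISO 3166-2 codes pycountry knows for a country that
        are absent from the generated file, e.g. because GDAL silently
        dropped a broken boundary polygon."""
        ds = ogr.Open(path)
        present = {f.GetField("code") for f in ds.GetLayer(0)}
        expected = {s.code for s in pycountry.subdivisions
                    if s.country_code == country}
        return sorted(expected - present)

    print(missing_subdivisions("iso3166-2.geojson", "IR"))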

Fixing those issues upstream in the OSM data would be by far the easiest way to complete the boundary data extracts. This is probably best done by people on the ground in the respective countries though; messing with a sensitive subject like land borders from halfway around the world doesn’t seem like a good idea. The good news, however, is that a lot of the issues found just two weeks ago have already been fixed.

Outlook

Even if not perfect yet, this gives us enough data to continue exploring efficient coordinate-based lookup. While I’m quite sure the Z-order curve approach will technically hold up, I’m not sure how well it scales with the ten-fold increase in the number of distinct areas. We still want to keep the result small enough for shipping in mobile apps, after all.
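
As a rough back-of-envelope for one part of that, assuming the table stores one area index per Z-order leaf cell (my assumption about the layout, not necessarily how the actual table works), the index width itself only grows logarithmically; the bigger unknown is how much the number of leaf cells grows with the much longer total border length:

    import math

    # Order-of-magnitude area counts, not measured values.
    for label, areas in [("timezones", 450), ("subdivisions", 4500)]:
        print(label, "->", math.ceil(math.log2(areas)), "bits per table entry")
    # timezones -> 9 bits, subdivisions -> 13 bits per entry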