KDE Itinerary - Data Extraction

After the overview of KDE’s travel assistant components we are going to look at one part in particular here, the booking data extraction. The convenience and usefulness of the overall system depends on being fed with accurate and complete data of when and where you are going to travel, ideally fully automatically.

The data we are interested in is essentially everything you’d want to see on a unified itinerary for a trip. Flight and hotel bookings probably come to mind first, but there are also event tickets, restaurant reservations, rental car bookings, bus tickets, etc.

The primary source of that information is, as for the commercial alternatives, incoming email. However, we want to run this locally, under the user’s control, so the entry point for us is the email client. My email client is KMail, so that’s what we have a plug-in for, but there is nothing in the KItinerary library that’s specific to that (or Akonadi); integration with other email clients is very much possible.

So, given those emails, how do we extract interesting information from there?

Structured Annotations

Google, facing the same question, defined a way for travel providers and agents to annotate their emails with structured data representing the relevant information in a machine-readable way. That’s an elegant solution, but it requires a unique market position to get significant buy-in. This work eventually became part of schema.org and thus a free standard, so it is something we can also benefit from.

Technically, these are JSON-LD or Microdata annotations inside the HTML email, just slightly complicated by the fact that we also have to deal with some “creative” JSON or XML encodings…

In general this is the easiest to extract and the most complete and accurate form of data we can get. Whether a travel provider or agency uses it seems to vary by region, and it seems to be more common with newer companies. Searching the HTML source code for schema.org is usually a quick way to check whether such annotations are present.

See https://schema.org/FlightReservation for an example of what this data might look like.
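To make this a bit more concrete, here is a minimal Python sketch of how JSON-LD annotations could be pulled out of an HTML email body. It is not how KItinerary itself does it; the function and class names as well as the sample email are made up for illustration. It also only handles well-formed JSON-LD and ignores Microdata and the “creative” encodings mentioned above.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw content of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_reservations(html):
    """Hypothetical helper: returns all JSON-LD objects whose @type ends in 'Reservation'."""
    collector = JsonLdCollector()
    collector.feed(html)
    reservations = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # real-world emails would need more lenient handling here
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and str(item.get("@type", "")).endswith("Reservation"):
                reservations.append(item)
    return reservations


# Made-up sample email body, loosely following the schema.org FlightReservation example.
SAMPLE_HTML = """
<html><body>
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "FlightReservation",
  "reservationNumber": "XYZ123",
  "reservationFor": {
    "@type": "Flight",
    "flightNumber": "1234",
    "airline": {"@type": "Airline", "iataCode": "XX"},
    "departureAirport": {"@type": "Airport", "iataCode": "TXL"},
    "arrivalAirport": {"@type": "Airport", "iataCode": "CDG"},
    "departureTime": "2018-09-01T10:15:00+02:00"
  }
}
</script>
<p>Thank you for your booking.</p>
</body></html>
"""

if __name__ == "__main__":
    for reservation in extract_reservations(SAMPLE_HTML):
        print(reservation["@type"], reservation["reservationNumber"])
```

Silently skipping blocks that fail to parse keeps the sketch short; in practice those are exactly the cases that need extra care.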

Barcodes

Another useful source of structured data is the barcodes on airline boarding passes. The convenient part here is that international air travel is globally standardized by the IATA. The downside is that the air travel backend systems are fairly old, which shows in things like limitations to 6-bit ASCII character sets and incomplete year information in date values.

While the data in such barcodes doesn’t contain enough information to compile a proper itinerary (departure and arrival times are missing, for example), it still gives us a useful skeleton to work with. We then only have to fill a few gaps from other data sources in what is otherwise fairly reliable data.
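As an illustration of both the fixed-width layout and the missing year, here is a small Python sketch that splits the mandatory part of a single-leg boarding pass barcode into fields and then guesses the full date. The field widths reflect my understanding of the IATA bar coded boarding pass layout and should be checked against the specification. The function names, the sample data, and the "first matching date on or after a reference date" heuristic are assumptions for this example, not KItinerary's actual logic.

```python
from datetime import date, timedelta

# (field name, width) for the mandatory fields of a single-leg boarding pass.
# Widths are an assumption based on the publicly described layout.
BCBP_FIELDS = [
    ("format_code", 1),
    ("number_of_legs", 1),
    ("passenger_name", 20),
    ("eticket_indicator", 1),
    ("booking_reference", 7),
    ("from_airport", 3),
    ("to_airport", 3),
    ("carrier", 3),
    ("flight_number", 5),
    ("day_of_year", 3),
    ("compartment", 1),
    ("seat", 4),
    ("checkin_sequence", 5),
    ("passenger_status", 1),
]
MANDATORY_LENGTH = sum(width for _, width in BCBP_FIELDS)


def parse_boarding_pass(raw):
    """Split the fixed-width mandatory section into a dict of stripped strings."""
    if len(raw) < MANDATORY_LENGTH or raw[0] != "M":
        raise ValueError("not a recognizable boarding pass code")
    fields = {}
    pos = 0
    for name, width in BCBP_FIELDS:
        fields[name] = raw[pos:pos + width].strip()
        pos += width
    return fields


def resolve_flight_date(day_of_year, reference):
    """Return the first date on or after `reference` with the given day of year.

    `reference` would typically be something like the date the email was received.
    """
    for year in (reference.year, reference.year + 1):
        candidate = date(year, 1, 1) + timedelta(days=day_of_year - 1)
        # Day 366 only exists in leap years, so check that we stayed in `year`.
        if candidate.year == year and candidate >= reference:
            return candidate
    raise ValueError("could not resolve day of year %d" % day_of_year)


# Made-up sample data, assembled field by field to keep the padding visible.
SAMPLE = (
    "M1" + "DESMARAIS/LUC".ljust(20) + "E" + "ABC123".ljust(7)
    + "YUL" + "FRA" + "AC " + "0834 " + "226" + "F" + "001A" + "0025 " + "1" + "00"
)

if __name__ == "__main__":
    leg = parse_boarding_pass(SAMPLE)
    flight_date = resolve_flight_date(int(leg["day_of_year"]), date(2018, 8, 1))
    print(leg["carrier"], leg["flight_number"], leg["from_airport"], leg["to_airport"], flight_date)
```

The result has a carrier, flight number, airports and a date, but no times, which is exactly the skeleton that then needs to be completed from other sources.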

Barcodes are also sometimes used on train tickets, but this isn’t globally standardized. We have two variants implemented so far: the one used by SNCF (a fairly straightforward plain-text encoding), and UIC 918.3, which is used by Deutsche Bahn and a few others. UIC 918.3 is a fairly complex compressed binary encoding, but unfortunately contains little information of interest to us so far.

In order to decode barcodes, KItinerary needs to be compiled with zxing. The extractor then looks for known barcodes in PDF attachments and Apple Wallet pass files.

Unfortunately, most of this isn’t covered by open standards, or at least isn’t openly documented. So far we have been able to work around that with a bit of creative search engine usage, finding excerpts of the relevant documents in public tenders or EU regulation documents, for example. So no glorious reverse engineering work here so far.

We do have a few samples from Finnish and Italian train companies with binary-only barcodes that we haven’t managed to decode yet, in case you are into solving binary puzzles or happen to know where to find documentation about those codes.

Unstructured Data

Finally, there are cases where none of the above structured data sources are available (or where those provide incomplete data). For this we have a fallback system in KItinerary that consists of provider-specific scripts. Such a script typically consists of a bunch of regular expressions or XPath queries to obtain whatever data we are interested in.

This is very powerful: the data quality we can achieve that way is only limited by what’s in the source data. At the same time, it is also very error-prone, and requires manual work that scales with the number of providers we want to support.
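To give a rough idea of what such a provider-specific extractor boils down to, here is a Python sketch using a handful of regular expressions on a plain-text confirmation email. This only illustrates the principle and is not the KItinerary script API; the provider, the email text, the field names and the function name are all invented.

```python
import re
from datetime import datetime

# Invented confirmation email from a hypothetical hotel provider.
SAMPLE_EMAIL = """\
Dear Jane Doe,

your booking at Hotel Example is confirmed.

Booking number: 4711-ABC
Check-in: 01.09.2018
Check-out: 03.09.2018
"""

# One pattern per field we want; each captures the value in group 1.
PATTERNS = {
    "booking_number": re.compile(r"^Booking number:\s*(\S+)", re.MULTILINE),
    "checkin": re.compile(r"^Check-in:\s*(\d{2}\.\d{2}\.\d{4})", re.MULTILINE),
    "checkout": re.compile(r"^Check-out:\s*(\d{2}\.\d{2}\.\d{4})", re.MULTILINE),
}


def extract_hotel_booking(text):
    """Return a dict with the extracted fields, or None if any field is missing."""
    result = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            return None  # the email does not look like what this extractor expects
        result[key] = match.group(1)
    # Normalize the provider's date format to ISO 8601.
    for key in ("checkin", "checkout"):
        result[key] = datetime.strptime(result[key], "%d.%m.%Y").date().isoformat()
    return result


if __name__ == "__main__":
    print(extract_hotel_booking(SAMPLE_EMAIL))
```

The fragility is easy to see: as soon as the provider changes the wording or the date format, the patterns stop matching and the extractor needs to be adjusted by hand.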

How to add custom extractor scripts is a larger topic; I’ll try to cover that in future posts.

Contribute

There are easy ways to help here, for example by filling in the provider matrix in the wiki, so we know which formats are used where, and for which providers we still need custom extractors.

Another way to contribute is sending us more test data. That is, any email or file containing relevant information that we should be able to extract automatically, preferably in its original form, i.e. forwarded as an attachment rather than inline, so email headers stay untouched.

Obviously this contradicts the privacy concerns motivating all this, and while it’s quite easy to strip out personal information from plain text content while retaining a valid test case, it can be fairly complicated for binary content such as PDF files. So, please be aware of that before sharing test data. There’s probably also enough to say about how to sanitize test data to justify its own post.