Following the look at how KDE Itinerary does data extraction, this post will cover custom data extractors in a bit more detail. Custom extractors are needed where we are unable to obtain the information we are interested in from structured annotations, or where we need to add information to incomplete structured data (such as boarding pass barcodes).

General Workflow

Data extraction is usually performed on incoming emails, and roughly follows these steps:

  1. Find all suitable extractors for a given document. Typically this is done based on the context, not the content; practically all existing extractors do it by wildcard matching on the From header of the email. This keeps the step very fast and avoids expensive processing of the content where it is not needed.

  2. If a matching custom extractor has been found, it is called with the input document as its argument and returns whatever data it could find as a JSON-LD structure following the schema.org ontology. Custom extractors are simple JavaScript files executed in a QJSEngine.

  3. The resulting JSON-LD structures are then normalized, augmented with information from Wikidata, and merged with data from e.g. structured annotations or barcodes. Elements that are still invalid at the end of this processing are discarded. This gives custom extractors some flexibility regarding the completeness of their results: there is no need to extract information that can be deduced by other means; see the post-processing docs for details.
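
To give an idea of what step 2 produces, a result for a hotel booking could look like the following JSON-LD sketch. Only the @type values and property names come from the schema.org ontology, everything else is placeholder data, and missing pieces (such as an address) are acceptable since post-processing deduces what it can:

{
    "@type": "LodgingReservation",
    "reservationNumber": "XYZ123456",
    "checkinTime": "2018-08-10T15:00:00+02:00",
    "reservationFor": {
        "@type": "LodgingBusiness",
        "name": "Example Hostel"
    }
}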

Extractor Metadata

To create a custom extractor you first need to define which documents it applies to. This is done in the extractor metadata, a simple JSON file specifying where the code for your extractor can be found and when to trigger it.

{
    "type": "text",
    "filter": [ { "header": "From", "match": "@aohostels.com" } ],
    "script": "aohostels.js"
}

This is fairly straightforward: it tells the engine to run the script aohostels.js on the plain text parts of emails received from that vendor.

The KItinerary::Extractor documentation explains what else can be specified here.
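
To illustrate one of those additional options, a PDF-based extractor with a non-default entry point might hypothetically be declared as in the sketch below; the "pdf" type value, the vendor domain and the "function" key are assumptions on my part, so check the documentation for the actual set of supported keys and values.

{
    "type": "pdf",
    "filter": [ { "header": "From", "match": "@example-airline.com" } ],
    "script": "exampleairline.js",
    "function": "parseTicket"
}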

Extractor Scripts

The interesting part is of course the extractor script itself. The entry point is the function named in the metadata (main is the default). Its argument depends on the type: for plain text that is just a string, for HTML, PDF or Apple Pass files you get an object representing that type. This function is expected to return a JSON object (or an array of JSON objects) containing the extracted data, encoded as JSON-LD following the schema.org ontology.

The following (slightly simplified) excerpt from the extractor for the booking confirmations of the Akademy accommodation shows the common patterns of a simple text-based extractor:

function main(text) {
    // create JSON-LD objects
    var res = JsonLd.newObject("LodgingReservation");

    // use regular expressions to find relevant content
    var bookingRef = text.match(/Booking no\.\s+([A-Z0-9-]+)\s+/);
    res.reservationNumber = bookingRef[1];

    // convert textual/localized date/time values
    var arrivalDate = text.match(/Arrival\s+(\d{1,2}\.\d{1,2}\.\d{4})\s+/);
    res.checkinTime = JsonLd.toDateTime(arrivalDate[1], "dd.MM.yyyy", "en");

    // use Google maps links to obtain geo coordinates
    res.reservationFor = JsonLd.newObject("LodgingBusiness");
    res.reservationFor.geo = JsonLd.newObject("GeoCoordinates");
    var geo = text.match(/google.com\/maps\/place\/([0-9\.]+),([0-9\.]+)>/);
    res.reservationFor.geo.latitude = geo[1];
    res.reservationFor.geo.longitude = geo[2];

    ...

    return res;
}

The full source code is here. Next to it you’ll also find more elaborate examples covering multi-leg trips, support for multiple localized variants, other document types, multi-column layouts, and documents that don’t specify the year in any of the mentioned dates.
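
For multi-leg trips, for instance, the extractor returns one reservation object per leg in an array. A minimal sketch of that pattern could look as follows; the regular expression, the date format and the selected fields are made up for illustration, only the JsonLd.newObject and JsonLd.toDateTime helpers are the same ones used above:

function main(text) {
    var reservations = [];
    // one match per leg: departure station name, date and time
    var leg = /Departure:\s+(.+?)\s+(\d{2}\.\d{2}\.\d{4}) (\d{2}:\d{2})/g;
    var m;
    while ((m = leg.exec(text)) !== null) {
        var res = JsonLd.newObject("TrainReservation");
        res.reservationFor = JsonLd.newObject("TrainTrip");
        res.reservationFor.departureStation = JsonLd.newObject("TrainStation");
        res.reservationFor.departureStation.name = m[1];
        res.reservationFor.departureTime = JsonLd.toDateTime(m[2] + " " + m[3], "dd.MM.yyyy hh:mm", "en");
        reservations.push(res);
    }
    return reservations;
}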

The KItinerary::ExtractorEngine documentation describes the API exposed to extractor scripts, both for working with more complex input types and for performing common operations like locale-specific date/time parsing or barcode decoding.
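
Locale-aware date/time parsing in particular is something almost every extractor needs, as booking documents rarely contain machine-readable timestamps. The JsonLd.toDateTime helper seen in the example above takes the raw string, a format string and a locale; a date from a German-language confirmation could for example be converted like this (the input string is made up):

// localized date string as it might appear in a German confirmation mail
var checkin = JsonLd.toDateTime("15. August 2018 14:30", "d. MMMM yyyy hh:mm", "de");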

Development Tools

To make custom extractor development easier, we have an interactive data inspection and test tool called KItinerary Workbench. With it you can inspect the input data as seen by the extractor script, that is, look at the extracted text and images of a PDF file or the DOM tree of an HTML document, or decode barcodes in various formats.

Most importantly though, you can re-run extractor scripts right after editing them, without recompiling or restarting anything. See the documentation on how to set that up.

[Screenshot: KItinerary Workbench]

Get Started!

This post should contain enough pointers to get started with custom extractor development; more details can be found in the KItinerary documentation. If you have questions, join us on the KDE PIM mailing list or in the #kontact channel on Freenode.