Michael Maclean

Using JQ with SQLite

A while ago I was doing some work with data that I was loading into SQLite. I had been using jq to do some exploratory work to get different parts of the data out, but those queries didn’t directly translate to the JSON operators that exist in SQLite. I wondered if it might be possible to add an extension to SQLite to add operators that speak jq’s syntax.

It turns out that it’s relatively easy. I opted to use Go to write this addon, as there is both a solid native Go implementation of jq and a Go library for writing SQLite addons, and I didn’t fancy writing it in C. The extension is available on GitHub.

Installing

The instructions are a little different between macOS and Linux.

On macOS, run make, then you can load the resulting extension into sqlite3 using .load sqlite_jq.dylib (or perhaps .load sqlite_jq.so, depending where you got your compiler from).

On Linux, run make to build, though you will then have to place the extension somewhere on LD_LIBRARY_PATH. Alternatively, for testing, you can set this directly:

export LD_LIBRARY_PATH=$PWD:LD_LIBRARY_PATH

Then, you can load the resulting extension with .load sqlite_jq.

Overview

It provides a scalar function, jq(), which lets you extract values from a JSON field using a jq expression:

select jq('{"a": "xyz"}', '.a');
-- returns "xyz"

There is also a table-valued function called jq_each() that can be used to create a table of values from JSON fields.

select * from jq_each('{"rows": ["hello", "world"]}', '.rows[]');
-- returns two rows, "hello" and "world"

Worked example

For this example, I’m going to use Jonty’s airline routes data. It’s a large JSON file, which looks like this:

{
    "AAA": {
        "city_name": "Anaa",
        "continent": "OC",
        "country": "French Polynesia",
        "country_code": "PF",
        "display_name": "Anaa, Anaa Airport (AAA), French Polynesia",
        "elevation": 23,
        "iata": "AAA",
        "icao": "NTGA",
        "latitude": "-17.355648",
        "longitude": "-145.50913",
        "name": "Anaa Airport",
        "routes": [
            {
                "carriers": [
                    "Air Tahiti"
                ],
                "iata": "FAC",
                "km": 76,
                "min": 25
            }
        ],
        "timezone": "Pacific/Tahiti"
    },
    // ...
}

It’s a single large object where the keys are the IATA codes for each airport. To start with, I’m going to create a table with a single column, and import each of these sub-objects into that table, one per row.

create table raw_data (
    raw text
);

Then I’ll insert the rows directly from the file:

insert into raw_data select * from jq_each(readfile('airline_routes.json'), '.[]');

Now, I can use these functions to extract the data into other tables. Let’s create a table for the airports.

create table airports (
    iata char(3) primary key,
    icao char(4),
    latitude float,
    longitude float,
    name varchar(255),
    timezone varchar(255)
);

I can then use the jq() function to load data into this table.

insert into airports select
    jq(raw, '.iata'),
    jq(raw, '.icao'),
    jq(raw, '.latitude'),
    jq(raw, '.longitude'),
    jq(raw, '.name'),
    jq(raw, '.timezone')
from raw_data;

This can, of course, be done just as easily with SQLite’s own json_extract() function, with a slightly different syntax, or in newer releases of SQLite using the -> or ->> operators.

Now, we’ll extract the route data, into another table.

create table routes (
    origin_iata char(3),
    destination_iata char(3),
    distance int,
    time int,
    carrier varchar(255)
);

And we can use a much longer jq expression to extract the route and carrier information. I’m using json_extract() here to demonstrate that the two extensions interoperate.

insert into routes select
    jq(raw_data.raw, '.iata') as origin_iata,
    json_extract(routes.value, '$[0]') as dest_iata,
    json_extract(routes.value, '$[1]') as distance,
    json_extract(routes.value, '$[2]') as time,
    json_extract(routes.value, '$[3]') as carrier
from raw_data,
jq_each(raw_data.raw, '.routes[] | . as $route | .carriers[] | [ $route.iata, $route.km, $route.min, . ]') as routes;

Now we can use the data to find all of the routes from Glasgow Airport.

select origin.iata, origin.name, destination.iata, destination.name, routes.* 
from airports origin
left join routes on routes.origin_iata = origin.iata
left join airports as destination on routes.destination_iata = destination.iata
where origin.iata = 'GLA';

-- GLA|Glasgow|LHR|Heathrow|GLA|LHR|555|90|British Airways
-- GLA|Glasgow|DUB|Dublin|GLA|DUB|295|60|Aer Lingus
-- GLA|Glasgow|DUB|Dublin|GLA|DUB|295|60|Ryanair
-- GLA|Glasgow|LGW|Gatwick|GLA|LGW|595|95|easyJet
-- ...

The complete script can be seen here.

To-do

It’s very small, and I’ve got some work to do to add tests now that I understand how everything fits together. There’s probably some optimisation work to be done too, as this does do quite a lot of translation back and forth from JSON behind the scenes. I’m not sure if some form of caching might be appropriate, but I’ll look into it.

There’s also a question about whether this is a good idea at all, but I’m happy with it.