A while ago I was doing some work with data that I was loading into SQLite. I had been using jq to do some exploratory work to get different parts of the data out, but those queries didn’t directly translate to the JSON operators that exist in SQLite. I wondered if it might be possible to write an extension for SQLite that adds operators speaking jq’s syntax.
It turns out that it’s relatively easy. I opted to use Go to write the extension, as there is both a solid native Go implementation of jq and a Go library for writing SQLite extensions, and I didn’t fancy writing it in C. The extension is available on GitHub.
Installing
The instructions are a little different between macOS and Linux.
On macOS, run make, then you can load the resulting extension into sqlite3 using .load sqlite_jq.dylib (or perhaps .load sqlite_jq.so, depending on where you got your compiler from).
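A minimal session might look like this (a sketch, assuming the build leaves sqlite_jq.dylib in the current directory):

$ make
$ sqlite3
sqlite> .load sqlite_jq.dylib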
On Linux, run make to build, though you will then have to place the extension somewhere on LD_LIBRARY_PATH. Alternatively, for testing, you can set this directly:
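export LD_LIBRARY_PATH=$PWD   # assuming you're in the directory containing sqlite_jq.so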
Then, you can load the resulting extension with .load sqlite_jq.
Overview
The extension provides a scalar function, jq(), which lets you extract values from a JSON field using a jq expression:
select jq('{"a": "xyz"}', '.a');
-- returns "xyz"
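The expression is evaluated by a full jq implementation, so pipes and builtins should work too. As a sketch (assuming the extension passes the expression straight through to the jq engine):

select jq('{"a": {"b": [1, 2, 3]}}', '.a.b | length');
-- returns 3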
There is also a table-valued function called jq_each() that can be used to create a table of values from JSON fields.
select * from jq_each('{"rows": ["hello", "world"]}', '.rows[]');
-- returns two rows, "hello" and "world"
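Each result lands in a column called value, which we’ll lean on later when joining against jq_each(). And again, any jq expression should do; a sketch:

select value from jq_each('{"a": 1, "b": 2}', 'to_entries[] | .key');
-- returns two rows, "a" and "b"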
Worked example
For this example, I’m going to use Jonty’s airline routes data. It’s a large JSON file; abridged, and with illustrative values, it looks something like this:
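{
  "GLA": {
    "iata": "GLA",
    "icao": "EGPF",
    "name": "Glasgow",
    "latitude": 55.87,
    "longitude": -4.43,
    "timezone": "Europe/London",
    "routes": [
      {
        "iata": "LHR",
        "km": 555,
        "min": 90,
        "carriers": ["British Airways"]
      }
    ]
  }
}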
It’s a single large object where the keys are the IATA codes for each airport. To start with, I’m going to create a table with a single column, and import each of these sub-objects into that table, one per row.
create table raw_data (
  raw text
);
Then I’ll insert the rows directly from the file:
insert into raw_data select * from jq_each(readfile('airline_routes.json'), '.[]');
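A quick sanity check that the import worked (there should be one row per top-level key):

select count(*) from raw_data;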
Now, I can use these functions to extract the data into other tables. Let’s create a table for the airports.
create table airports (
  iata char(3) primary key,
  icao char(4),
  latitude float,
  longitude float,
  name varchar(255),
  timezone varchar(255)
);
I can then use the jq() function to load data into this table.
insert into airports select
  jq(raw, '.iata'),
  jq(raw, '.icao'),
  jq(raw, '.latitude'),
  jq(raw, '.longitude'),
  jq(raw, '.name'),
  jq(raw, '.timezone')
from raw_data;
This can, of course, be done just as easily with SQLite’s own json_extract() function, with a slightly different syntax, or in newer releases of SQLite using the -> or ->> operators.
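For comparison, the same insert against the built-in JSON support might look like this (a sketch using the ->> operator, available since SQLite 3.38):

insert into airports select
  raw ->> '$.iata',
  raw ->> '$.icao',
  raw ->> '$.latitude',
  raw ->> '$.longitude',
  raw ->> '$.name',
  raw ->> '$.timezone'
from raw_data;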
Now, we’ll extract the route data into another table.
create table routes (
  origin_iata char(3),
  destination_iata char(3),
  distance int,
  time int,
  carrier varchar(255)
);
And we can use a much longer jq expression to extract the route and carrier information. I’m using json_extract() here to demonstrate that the two extensions interoperate.
insert into routes select
  jq(raw_data.raw, '.iata') as origin_iata,
  json_extract(routes.value, '$[0]') as destination_iata,
  json_extract(routes.value, '$[1]') as distance,
  json_extract(routes.value, '$[2]') as time,
  json_extract(routes.value, '$[3]') as carrier
from raw_data,
  jq_each(raw_data.raw, '.routes[] | . as $route | .carriers[] | [ $route.iata, $route.km, $route.min, . ]') as routes;
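To make the pipeline less opaque, here’s what it emits for a cut-down document (values lifted from the Glasgow–Dublin rows below): each carrier becomes its own [destination, km, min, carrier] array, one row apiece.

select value from jq_each(
  '{"routes": [{"iata": "DUB", "km": 295, "min": 60, "carriers": ["Aer Lingus", "Ryanair"]}]}',
  '.routes[] | . as $route | .carriers[] | [ $route.iata, $route.km, $route.min, . ]'
);
-- returns two rows:
-- ["DUB",295,60,"Aer Lingus"]
-- ["DUB",295,60,"Ryanair"]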
Now we can use the data to find all of the routes from Glasgow Airport.
select origin.iata, origin.name, destination.iata, destination.name, routes.*
from airports origin
left join routes on routes.origin_iata = origin.iata
left join airports as destination on routes.destination_iata = destination.iata
where origin.iata = 'GLA';
-- GLA|Glasgow|LHR|Heathrow|GLA|LHR|555|90|British Airways
-- GLA|Glasgow|DUB|Dublin|GLA|DUB|295|60|Aer Lingus
-- GLA|Glasgow|DUB|Dublin|GLA|DUB|295|60|Ryanair
-- GLA|Glasgow|LGW|Gatwick|GLA|LGW|595|95|easyJet
-- ...
The complete script can be seen here.
To-do
The extension is very small, and I’ve got some work to do to add tests now that I understand how everything fits together. There’s probably some optimisation work to be done too, as it does quite a lot of translating back and forth from JSON behind the scenes. I’m not sure if some form of caching might be appropriate, but I’ll look into it.
There’s also a question about whether this is a good idea at all, but I’m happy with it.