Michael Maclean

Wrangling JSON in PostgreSQL

I’ve been using SQLite a bit in some of my own projects recently, and I think it’s a great solution for small self-hosted applications, avoiding the need to run a full-blown RDBMS where it’s not really warranted. Xe’s post about JSON in SQLite talks a lot about how to use SQLite to handle JSON data and load it into other tables without having to have your application do it, which is great.

On the other hand, in 2020 I spent a lot of time with a team demolishing an old enterprise system at work, with a fairly convoluted database structure. I learned a bit about PostgreSQL’s JSON functions in the process. I’ve been meaning to write up what I learned for some time now before I banished forgot it entirely, so Xe’s post was a good prompt.

It turns out that similar capabilities exist in both databases, so I’m going to translate Xe’s post more or less directly into PostgreSQL equivalents. My site doesn’t have a JSONFeed endpoint (yet, perhaps Zola will acquire one, one day), so I’m going to borrow Xe’s feed and Python script as the basis for what I do here. I’m using PostgreSQL 13.5 here. There are some neat features in 14, like array subscripting for JSON column types and things like that, which are worth reading up on.

Setting up the tables

SQLite uses text data types to store JSON in, and has operators which can work on that. However, PostgreSQL has a separate JSON datatype. In fact, it has two–json and jsonb1. The json type stores an exact copy of the input text, while jsonb transforms it into a binary form that is faster for many operations as well as being indexable. I’ve only ever used jsonb, and I’ll continue to do that here.

Because PostgreSQL is a relational database server, there is some work around setting up a new database and its permissions that I’m going to skip past. There are plenty of tutorials for getting this up and running already elsewhere. Because I’m a Nix fan I used nix-shell to install PostgreSQL, as described in a previous post. For this example I’m going to assume the database is local, and the database, username and password are all mydb.

Let’s create a table, and throw some data into it. The key differences are the column types I’m using (timestamp and jsonb), and the slightly different syntax required for the default timestamp.

# schema.sql
CREATE TABLE IF NOT EXISTS jsonfeed_raw
  ( feed_url     TEXT  PRIMARY KEY
  , scrape_date  TIMESTAMP  NOT NULL  DEFAULT NOW()
  , raw          JSONB  NOT NULL
  );

There are a couple of Python libraries for communicating with PostgreSQL, but the one I’m most familar with is the dubiously-named psycopg2.

#!/usr/bin/env nix-shell
#! nix-shell -p python39 python39Packages.psycopg2 -i python

import psycopg2
import urllib.request

con = psycopg2.connect(
    host='localhost',
    dbname='mydb',
    user='mydb',
    password='mydb',
)

def get_feed(feed_url):
    req = urllib.request.Request(feed_url, headers={"User-Agent": "mgdm/feedfetch"})
    with urllib.request.urlopen(req) as response:
        cur = con.cursor()
        body = response.read().decode('utf-8')
        cur.execute("""
           INSERT INTO jsonfeed_raw
             (feed_url, raw)
           VALUES
             (%s, %s)
        """, (feed_url, body))
        con.commit()
        print("got feed %s" % (feed_url))

get_feed("https://christine.website/blog.json")

We can create the table from the schema above with the psql command:

$ psql -U mydb -f schema.sql mydb

Then use the Python script to load the data:

$ python jsonfeedfetch.py

JSON operators

PostgreSQL has a few operators that can be used for accessing data held in JSON columns. They look a little different from the ones in SQLite. The -> operator returns a JSONB type, which can be queried further–we’ll see this in action later. However, the ->> operator will return a text representation of whatever value it retrieves. Note that the names of the JSON fields need to be quoted, otherwise PostgreSQL will think that you are referring to a column.

mydb=> select raw->'title' as title,  pg_typeof(raw->'title') from jsonfeed_raw;
    title    | pg_typeof
-------------+-----------
 "Xe's Blog" | jsonb
(1 row)

mydb=> select raw->>'title' as title,  pg_typeof(raw->>'title') from jsonfeed_raw;
   title   | pg_typeof
-----------+-----------
 Xe's Blog | text
(1 row)

You can chain these operators, and also use integers to select array elements, so you can use something like this to select the title of the first post in the feed:

mydb=> select raw->'items'->0->'title' as post_title from jsonfeed_raw;
          post_title           
-------------------------------
 "A Tool to Aid Forgetfulness"
(1 row)

Rather than using this syntax, it’s possible to use other operators to go more directly to the data you need. The query above could also be expressed as one of the following:

-- Using a list of object keys and array indices
select raw #> '{items,0,title}' as post_title from jsonfeed_raw;

-- Using JSONPath
select jsonb_path_query(raw, '$.items[0].title') from jsonfeed_raw;

Let’s go on and create the metadata table and insert the data into it. I only had to make one change to the syntax here, to the default value on the updated_at column:

CREATE TABLE IF NOT EXISTS jsonfeed_metadata
  ( feed_url      TEXT  PRIMARY KEY
  , title         TEXT  NOT NULL
  , description   TEXT
  , home_page_url TEXT
  , updated_at    TEXT  NOT NULL  DEFAULT NOW()
  );

Populating the second table looks fairly similar, although note the different operators inside the SELECT:

INSERT INTO jsonfeed_metadata
            ( feed_url
            , title
            , description
            , home_page_url
            , updated_at
            )
SELECT jsonfeed_raw.feed_url AS feed_url
     , raw->>'title' AS title
     , raw->>'description' AS description
     , raw->>'home_page_url' AS home_page_url
     , NOW() AS updated_at
FROM jsonfeed_raw;

Triggers

Triggers are a little more complex in PostgreSQL. It’s not currently possible to write triggers in plain SQL2. Instead we have to write a function in another language and call that. The easiest choice in this case is to use PL/pgSQL, which is a PostgreSQL-specific superset of SQL. That’s going to look something like this:

DROP TRIGGER IF EXISTS jsonfeed_raw_ins ON jsonfeed_raw;
DROP FUNCTION IF EXISTS jsonfeed_update;

CREATE FUNCTION jsonfeed_update() RETURNS trigger AS $jsonfeed_update$
  BEGIN
    INSERT INTO jsonfeed_metadata
                ( feed_url
                , title
                , description
                , home_page_url
                )
    VALUES ( NEW.feed_url
            , NEW.raw->'title'
            , NEW.raw->'title'
            , NEW.raw->'home_page_url'
            )
    ON CONFLICT (feed_url) DO
       UPDATE SET
             title         = NEW.raw->'title'
           , description   = NEW.raw->'description'
           , home_page_url = NEW.raw->'home_page_url';
    RETURN NULL; -- Ignored because this is an AFTER trigger
  END;
$jsonfeed_update$ LANGUAGE plpgsql;

CREATE TRIGGER jsonfeed_raw_ins
AFTER INSERT OR UPDATE OR DELETE ON jsonfeed_raw
    FOR EACH ROW EXECUTE FUNCTION jsonfeed_update();

This is a little more verbose, but does the same thing. Note the slightly different ON CONFLICT syntax, where we have to specify a column name unlike in the SQLite version. Speaking of conflicts, let’s also update the Python script so that it does an update on a conflict in the jsonfeed_raw table:

        cur.execute("""
           INSERT INTO jsonfeed_raw
             (feed_url, raw)
           VALUES
             (%s, %s)

          ON CONFLICT (feed_url) DO
            UPDATE SET raw = %s
        """, (feed_url, body, body))

Now, if we delete everything from each of the tables and run the script again, we should see data in jsonfeed_metadata.

mydb=> select * from jsonfeed_metadata;
              feed_url               |    title    |                        description                         |        home_page_url        |          updated_at           
-------------------------------------+-------------+------------------------------------------------------------+-----------------------------+-------------------------------
 https://christine.website/blog.json | "Xe's Blog" | "My blog posts and rants about various technology things." | "https://christine.website" | 2022-01-12 22:15:36.920008+00
(1 row)

mydb=>

Extracting rows from the JSON

Like the json_each function in SQLite, PostgreSQL can unpack arrays of items in JSON and use those to create more rows using a function called jsonb_array_elements. Let’s create a table to put blog posts into:

CREATE TABLE IF NOT EXISTS jsonfeed_posts
  ( url             TEXT  PRIMARY KEY
  , feed_url        TEXT  NOT NULL
  , title           TEXT  NOT NULL
  , date_published  TIMESTAMP  NOT NULL
  );

We can populate this table as follows:

INSERT INTO jsonfeed_posts
            ( url
            , feed_url
            , title
            , date_published
            )
SELECT
  items->>'url' AS url
, raw->>'feed_url' AS feed_url
, items->>'title' AS title
, (items->>'date_published')::timestamp AS date_published
FROM
  jsonfeed_raw 
, jsonb_array_elements(jsonfeed_raw.raw->'items') as items;

We can do the same thing here as we did before, and turn this into a trigger. Again, we’ll need to use a slightly different language to do it, but it’s not going to be particularly difficult.

DROP TRIGGER IF EXISTS jsonfeed_raw_ins_posts ON jsonfeed_raw;
DROP FUNCTION IF EXISTS jsonfeed_update_posts;

CREATE FUNCTION jsonfeed_update_posts() RETURNS trigger AS $jsonfeed_update_posts$
  BEGIN
    INSERT INTO jsonfeed_posts
                ( url
                , feed_url
                , title
                , date_published
                )
    SELECT
      items->>'url' AS url
    , raw->>'feed_url' AS feed_url
    , items->>'title' AS title
    , (items->>'date_published')::timestamp AS date_published
    FROM
      jsonfeed_raw 
    , jsonb_array_elements(jsonfeed_raw.raw->'items') as items
    ON CONFLICT (url) DO
      UPDATE SET title = EXCLUDED.title
               , date_published = EXCLUDED.date_published;
    RETURN NULL;
  END;
$jsonfeed_update_posts$ LANGUAGE plpgsql;

CREATE TRIGGER jsonfeed_raw_ins_posts
AFTER INSERT OR UPDATE OR DELETE ON jsonfeed_raw
    FOR EACH ROW EXECUTE FUNCTION jsonfeed_update_posts();

Next time you run jsonfeedfetch.py, you should have a fully-populated jsonfeed_posts table containing all of the posts from the feed.

1

There are also operators that can work on JSON stored in text fields, but I’ve not used those at all.