Michael Maclean

htmlq, revisited

A couple of years ago I wrote about a tool I’d created called htmlq. At the time, I’d been doing a bit of web scraping, and while that’s absolutely doable using Python and BeautifulSoup glued together with a shell script, paying Python’s startup time on every iteration made it a bit slow. I ended up writing a bit more code so I could do what I wanted at an acceptable pace. Don’t get me wrong: both Python and BeautifulSoup are great tools, but sometimes I just want to be able to glue things together in an iterative way with curl and bash. I’m a fan of Rust, and knowing that the Mozilla Servo browser components were available, I looked to use those to make a tool to solve my problem (and hopefully learn something along the way).

Last Monday I woke up to find that the tool I’d written was at the top of Hacker News, where it had been submitted by Jason Bosco. It remained near the top for much of the day. This was great! However, I’d not really thought much about htmlq since I first wrote it. It turns out that a couple of people had been sending PRs and making suggestions while I’d not been paying attention, and I’d not spotted the notifications about them. I can only apologise for the oversight, but between some PRs that had been waiting for a little while and some more that have arrived since, it’s in better shape.

I’ve released a new version to crates.io. Nikolay Murha kindly added it to Homebrew. Chris Dickinson sent a PR last year which used GitHub Actions to create binary builds, which unfortunately sat around for a while before I got round to merging it. Several people have also contributed to the documentation and to tweaking the defaults. It’s been really nice to have people do these things!

At the moment I’m working on adding a couple of features that were requested, including one to convert relative links to absolute ones either using a <base> specified in the document or one supplied on the command line. I’m aiming to put a bit more testing around it too, and improve the pretty printing of the output.
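To give a rough idea of what the link conversion involves, here is a minimal sketch in Python using only the standard library. It is not htmlq's code (htmlq is written in Rust and the feature isn't finished), and the names `LinkCollector`, `absolutize_links` and `cli_base` are made up for illustration. It also assumes that a base supplied on the command line wins over a <base> in the document, which is a guess on my part rather than a decided behaviour.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the first <base href> and every <a href> in a document."""

    def __init__(self):
        super().__init__()
        self.base_href = None  # href of the first <base> element, if any
        self.links = []        # href values of <a> elements, in document order

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if href is None:
            return
        if tag == "base" and self.base_href is None:
            self.base_href = href
        elif tag == "a":
            self.links.append(href)


def absolutize_links(html, cli_base=None):
    """Return the document's link targets resolved against a base URL.

    Assumption for this sketch: a base given on the command line
    (cli_base) takes precedence over a <base> found in the document.
    With no base at all, links are returned unchanged.
    """
    collector = LinkCollector()
    collector.feed(html)
    base = cli_base or collector.base_href
    if base is None:
        return collector.links
    # urljoin leaves already-absolute URLs alone and resolves the rest.
    return [urljoin(base, link) for link in collector.links]


if __name__ == "__main__":
    page = (
        '<html><head><base href="https://example.com/docs/"></head><body>'
        '<a href="intro.html">Intro</a>'
        '<a href="/about">About</a>'
        '<a href="https://other.example/z">Elsewhere</a>'
        "</body></html>"
    )
    for url in absolutize_links(page):
        print(url)
```

Links that are already absolute pass through untouched, relative paths resolve against the base's directory, and paths starting with a slash resolve against the base's host.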

If anyone’s got any more suggestions, I’d be happy to hear about them.