Continuing my journey into learning Rust, I have been looking for small, useful projects to build rather than doing exercises. In the meantime I was doing some web scraping using Python and BeautifulSoup, which is great but can be a little slow. I was aware of Servo, the browser engine written in Rust at Mozilla, and had the idea of reusing parts of it to make a small CLI tool to extract data from HTML documents.
The obvious choice for the query language was CSS selectors. While XPath works fine for XML, the more specific selector language seemed like a better fit for what I wanted to do here. I wanted the CLI to be fairly simple and fast.
Servo's HTML parsing library is called html5ever. I initially looked at using this directly and integrating it with the CSS selector parsing engine, but it turns out that someone else has already done most of the work I needed and wrapped both in a library called Kuchiki. Most of the hard work is actually done there. Parsing the HTML document becomes as simple as something like this:
let document = kuchiki::parse_html()
    .from_utf8()
    .read_from(&mut input)
    .unwrap();
I'm very grateful to Kuchiki's authors for doing most of the hard work!
For example, if I wanted to get the HTML for the section with ID posts on this blog, I could do:
curl -s https://mgdm.net | htmlq '#posts'
Or to find the targets of all the links in <a> elements on a page:
curl -s https://www.rust-lang.org/ | htmlq -a href a
The -a option takes the name of an attribute to return, and the final a is the CSS selector. Elements that do not have the matching attribute are ignored.
The -t option returns just the text content, ignoring the rest of the markup. There's also a -p option, which attempts to pretty-print the HTML, but this is a bit of a work in progress.
It's now on crates.io and the code is on GitHub. You should be able to install it on Unix-like systems with cargo install htmlq. I'd be happy to hear any feedback people may have, and I hope that it might be useful.