Search!

Tuesday 3rd January 2023

Something that bugged me for the entire duration that this site was just a pile of static files was that I didn’t have a good solution for search. The previous incarnation of zerosleeps did have a search form on every blog page, but it just sent the user off to DuckDuckGo and performed a “site:zerosleeps.com” search there.

The results were never great, presumably because DuckDuckGo had no reason to make a decent index of a tiny site like this. Plus it was an external dependency which I’m never a fan of.

Well that changes today: blog posts are now searchable entirely in-house. I’m using PostgreSQL’s full-text search functionality, which is made a little easier by Django’s support for it in django.contrib.postgres. Just blog posts at the moment, but I want to get the reading log in there as well and have one global search.

Post titles carry more weight than the post body, and the results are sorted by whatever rank PostgreSQL comes up with. I’m also using the “headline” function to show the most relevant snippets. Behind the scenes it looks a little something like this:

vector = SearchVector("title", weight="A") + SearchVector("body")
return (
    Post.objects.annotate(
        search=vector,
        rank=SearchRank(vector, query),
        headline=SearchHeadline(
            Concat("title", Value(" "), "body"), query, max_fragments=2
        ),
    )
    .filter(search=query)
    .order_by("-rank")
)

It’s not perfect but it’s absolutely good enough. Better than good enough. Makes me glad I chose PostgreSQL over my other choice - SQLite - as well. SQLite does have full-text search built-in, but from what I can tell it involves creating virtual tables and keeping them up-to-date with the real content. Seems messy. This solution instead boils down to just one (slightly verbose and repetitive but who cares) SQL query which contains:

SELECT
    ts_rank(
        (
            setweight(to_tsvector(COALESCE("blog_post"."title", '')), 'A')
            ||
            to_tsvector(COALESCE("blog_post"."body", ''))
        ),
        plainto_tsquery('search term')
    ) AS "rank",
    ts_headline(
        CONCAT("blog_post"."title", ' ', "blog_post"."body"),
        plainto_tsquery('search term'),
        'MaxFragments=2'
    ) AS "headline"
FROM
    "blog_post"
WHERE
    (
        setweight(
            to_tsvector(COALESCE("blog_post"."title", '')), 'A')
            ||
            to_tsvector(COALESCE("blog_post"."body", ''))
    ) @@ (plainto_tsquery('search term'))
ORDER BY "rank" DESC;