Migrating the blog to soupault

Last update:

Tags: web, self-hosted

I've migrated my blog to soupault. I hope it didn't break too many links or anything else; if you spot a problem, let me know. If you are interested in the details of the migration, read on. Note, however, that this post is rather full of idle musings on blogs, the universe, and everything. If a list of pages sorted by date is all you want, read this post instead.

Now, assuming you are actually interested… you may remember my old blog made with Pelican. You may wonder what’s wrong with Pelican, and the answer is, well, nothing. It wasn’t an “I’m done with it” migration. Definitely not like last time with Cyrus IMAP. It’s just a mix of wanting to test soupault in a new role, and wanting to keep everything in one place.

Initially, I planned to keep the blog as is because soupault is (proudly) not a blog generator. It has no content model of its own, other than “everything is a page”. It also doesn’t create any page files on its own, even autogenerated indices are inserted into existing pages.

While I was working on the first version of soupault this summer, my goal for the public release was to be able to configure it to build my own website. The question I kept asking was “could I use it for my website if it was someone else’s project?” Thus, when I discovered missing bits, I didn’t hack it specially for my site, but added features that could support that out of the box. Quite often I first wrote a non-working configuration, then worked on the code to make that configuration valid.

The big idea was to make a tool that can automate management of formerly hand-written websites. My own website was powered by a custom-made generator, but I wanted to help Web 1.0 enthusiasts spend more time making cool things than doing tedious HTML editing. That demanded removing the limitations of classic website generators, such as Markdown processor lock-ins.1 It didn’t become popular with Web 1.0 people, but attracted a few people who like to invent highly customized workflows.

While it has no content model of its own, it can extract metadata from HTML (remember microformats?) and either generate HTML from it or dump it to JSON (or both). So, it supports a “bring your own content model” approach.

A simple blog is easy to make with built-in functionality. There’s no shame in it, I guess: these days many tech blogs don’t have any features of classic blogs. Back in the “golden age of blogging”, there was a canonical feature set for a blog engine: tag cloud, calendar, and so on. I don’t see calendar views very often now, and I think the main reason is that these days blogs are more about content than about people. LiveJournal and its kin were first of all social networks, and people would browse a person’s blog and wonder what she might have been up to last January. If you are browsing the blog of a project maintainer, you couldn’t care less what she did last January, though you likely want to browse posts by tag/category like release, CVE, etc.

I’m sure that whoever is reading my blog is reading the blog of an open source developer who loves OCaml and networks, and whether I’m Daniil Baturin, or a dog, or a general AI is irrelevant. So, the goals for this blog were archives by tag and an Atom feed.

Pelican uses the classic MTF approach. MTF stands for Markdown, Templates, Front matter—an industry standard abbreviation I just made up.

Markdown is the easy part: soupault can pipe a page through any external program before further processing. I like cmark, so all I needed was to associate it with the .md extension:

[preprocessors]
  md = cmark

Now, what to do with the front matter? A typical post looked like this:

Title: Yet another dmbaturin's blog iteration
Date: 2018-02-20 21:33:30
Category: Misc

I could write a script to convert that to HTML, but since I don’t have that many posts, and I needed to edit them anyway to fix internal links and so on, I just converted them all by hand, to something like this:

# This is a post

<time>1970-01-01</time>
<tags>foo, bar, baz</tags>

You may ask what on earth is going on here, and you’d be right: none of that stuff has any meaning yet, but soon I’ll give it a meaning, one thing at a time.
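(If I had wanted to script the conversion instead, a rough sketch like the one below would have done it. This is hypothetical code rather than anything I actually ran; it assumes Pelican-style Title/Date/Tags headers separated from the body by a blank line.)

#!/usr/bin/env python3
# Hypothetical front matter converter -- a sketch, not a script I actually used.
# Assumes Pelican-style "Key: value" headers separated from the body by a blank line.

import sys

def convert(text):
    header, _, body = text.partition("\n\n")
    meta = {}
    for line in header.splitlines():
        key, _, value = line.partition(":")
        meta[key.strip().lower()] = value.strip()

    lines = []
    if "title" in meta:
        lines.append("# {}".format(meta["title"]))
    if "date" in meta:
        # Keep only the date part, like the <time> elements in the converted posts
        lines.append("<time>{}</time>".format(meta["date"].split()[0]))
    if "tags" in meta:
        lines.append("<tags>{}</tags>".format(meta["tags"]))
    return "\n".join(lines) + "\n\n" + body

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        print(convert(f.read()))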

First of all, I’ve put all old posts in site/blog. Since soupault never changes page names, I’ve renamed them so that they match the URLs. E.g. that first post was content/yet-another-dmbaturins-blog.md in the original Pelican setup, and I saved it in site/blog/yet-another-dmbaturins-blog-iteration.md. Keeping the essential part of the URL intact simplifies redirection, but we’ll get to it later.

Now, how does all that become data for the blog index? Well, the first <h1> in the page is soupault’s default source for the page title. The first <time> element is the default for the page modification date. The remaining unusual thing is the fake, invalid <tags> element. You can tell soupault to extract any custom fields by their CSS selectors, and I did just that.

Here are the metadata extraction settings:

[index]
  index = true
  use_default_view = false
  dump_json = "index.json"

  exclude_path_regex = ["blog/index.html", "blog/tag/(.*)"]

  index_date_selector = ["#revision", "time"]
  index_excerpt_selector = ["#summary", "p"]

  newest_entries_first = true

[index.custom_fields]
  category = { selector = "#category", default = "Misc" }
  tags = { selector = "tags" }

With exclude_path_regex, I prevent soupault from trying to index non-content pages, namely blog/index.html and anything in blog/tag/. We also deviate from the defaults in the field mapping settings: index_excerpt_selector = ["#summary", "p"] means that it will first look for an element with id="summary", and if that’s not found in the page, it will take the first paragraph as the page excerpt. That allows displaying a paragraph other than the first in /blog, which is useful if you want to add an intro, or a disclaimer, or something else.
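For example, a post could mark the desired excerpt explicitly with markup like this (a hypothetical snippet, just to illustrate the #summary selector), and that paragraph would be used regardless of where it appears in the page:

<p id="summary">A one-paragraph summary written specifically for the /blog index.</p>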

In the custom fields settings, I say that the content of the first element matching CSS selector tags (which is any <tags> element) should be saved in a metadata field named tags.

The category field is irrelevant here; it’s used for the Notes section index. I could use it for the blog posts, but I don’t want to.

A complete autogenerated JSON metadata entry will look like this:

  {
    "nav_path": [
      "blog"
    ],
    "page_file": "site/blog/yet-another-dmbaturins-blog-iteration.md",
    "url": "/blog/yet-another-dmbaturins-blog-iteration",
    "title": "Yet another dmbaturin's blog iteration",
    "date": "2018-02-20",
    "author": null,
    "excerpt": "Looks like my blogging cycle is two years in and two years out.[SNIP]"
    "category": "Misc",
    "tags": "misc, personal"
  }

There is support for extracting multiple elements and saving them in a list, but comma-separated tags are easier to write than <tag>misc</tag> <tag>personal</tag>, so I’ll split those strings into individual tags elsewhere.

Soupault supports multiple index “views” so that it’s easy to present the same data in different ways. For the /blog section, I did this:

[index.views.blog]
  index_selector = "#blog-index"
  index_item_template = """
    <h2><a href="{{url}}">{{title}}</a></h2>
    <p><strong>Last update:</strong> {{date}}.</p>
    <tags>{{tags}}</tags>
    <p>{{{excerpt}}}</p>
    <a href="{{url}}">Read more</a>
    <hr>
  """

Wait, but the mysterious <tags> element is still there in the index. How does it become a list of hyperlinked tags? I’ve made a custom Lua plugin that converts it to real HTML. Here’s the plugin:

Plugin.require_version("1.9")

base_path = config["base_path"]
if not base_path then
  base_path = "/tag"
end

function make_tag_links(tags_elem)
  tags_string = String.trim(HTML.strip_tags(tags_elem))
  if not tags_string then
    return nil
  end

  -- Split the tags string like "foo, bar" into tags
  tags = Regex.split(tags_string, ",")

  -- Generate <a href="/tag/$tag">$tag</a> elements for each tag
  local count = size(tags)
  local index = 1
  prev_elem = nil
  tag_links = {}
  while (index <= count) do
    tag = String.trim(tags[index])

    tag_link = HTML.create_element("a", tag)
    HTML.set_attribute(tag_link, "href", format("%s/%s", base_path, tag))

    tag_links[index] = tostring(tag_link)

    index = index + 1
  end

  -- Home-made join(", ", list)...
  -- I should add it to the plugin API
  index = 1
  links_html = ""
  while (index <= count) do
    links_html = links_html .. tag_links[index]
    if (index < count) then
      links_html = links_html .. ", "
    end
    index = index + 1
  end

  links_html = format("<p><strong>Tags:</strong> <span id=\"tags\">%s</span></p>", links_html)

  HTML.insert_after(tags_elem, HTML.parse(links_html))
  HTML.delete_element(tags_elem)
end

tags_elems = HTML.select(page, "tags")
index = 1
count = size(tags_elems)
while (index <= count) do
  make_tag_links(tags_elems[index])
  index = index + 1
end

Lua isn’t a very concise language, and Lua 2.5 was even less concise, but the idea is clear: split at commas, make links, insert right next to the former <tags> element, then remove that element.

I discover missing functions in the soupault plugin API all the time; in this case, I had added Regex.split but didn’t think of String.join(sep, strings). I’ll add it in 1.11, but for now I just faked it in Lua, no big deal.

The plugin is in plugins/blog-tags.lua. Since 1.10, soupault discovers plugins automatically, so when I saved it to that file, it became available as a blog-tags widget.

All that’s left is to set up that widget to run on all pages in blog/. In soupault.conf, I’ve done this:

[widgets.process-blog-tags]
  path_regex = "blog/(.*)"
  widget = "blog-tags"
  base_path = "/blog/tag"

OK, but where do the pages in blog/tag/* come from? That’s where the “bring your own content model” approach comes into play. I made a Python script that generates those archive pages from index.json:

#!/usr/bin/env python3

import os
import sys
import json

import pystache

template = """
<h2><a href="{{url}}">{{{title}}}</a></h2>
<p><strong>Last update:</strong> {{date}}.</p>
<tags>{{tags}}</tags>
<p>{{{excerpt}}}</p>
<a href="{{url}}">Read more</a>
<hr>
"""

base_path = sys.argv[1]

try:
    index_file = sys.argv[2]
    with open(index_file) as f:
        index_entries = json.load(f)
except:
    print("Could not read index file!")
    sys.exit(1)

renderer = pystache.Renderer()

# Create a dict of entries grouped by tag
entries = {}
for e in index_entries:
    tags_string = e["tags"]
    if tags_string:
        tag_strings = filter(lambda s: s != "",  map(lambda s: s.strip(), tags_string.split(",")))
    else:
        continue

    for t in tag_strings:
        if not (t in entries):
            entries[t] = []
        entries[t].append(e)

# Now for each tag, create an HTML page with all entries that have that tag
for t in entries:
    page_file = os.path.join(base_path, t) + '.html'
    with open(page_file, 'w') as f:
        print("""<h1>Posts tagged &ldquo;{}&rdquo;</h1>""".format(t), file=f)
        for e in entries[t]:
            print(renderer.render(template, e), file=f)

Nothing fancy, as you can see. Not very pretty, but it’s under 100 lines and does the job.

So does soupault run twice, first to generate those archive pages in site/blog/tag/* and then to build the final website? Sort of. Specifically for this kind of use case, I’ve added an --index-only option that makes soupault stop after generating index.json. Thus the makefile target looks like this:

.PHONY: site
site:
	$(SOUPAULT) --index-only
	scripts/blog-archive.py $(SITE_DIR)/blog/tag $(INDEX_FILE)
	$(SOUPAULT)
	scripts/json2feed.py index.json > $(BUILD_DIR)/blog/atom.xml

The Atom feed is generated from the same JSON data by another script, which is even simpler.
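To give an idea of it, here is a minimal sketch of the same approach (hypothetical, not my actual json2feed.py; the feed title and base URL below are placeholders):

#!/usr/bin/env python3
# A sketch of generating an Atom feed from soupault's index.json.
# Not the real json2feed.py; the feed metadata below is a placeholder.

import sys
import json
from xml.sax.saxutils import escape

FEED_HEADER = """<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <link href="https://www.example.com/blog/atom.xml" rel="self"/>
  <id>https://www.example.com/blog/</id>
  <updated>{updated}</updated>
"""

ENTRY = """  <entry>
    <title>{title}</title>
    <link href="https://www.example.com{url}"/>
    <id>https://www.example.com{url}</id>
    <updated>{date}T00:00:00Z</updated>
    <summary type="html">{excerpt}</summary>
  </entry>
"""

with open(sys.argv[1]) as f:
    entries = json.load(f)

# index.json is already sorted newest first (newest_entries_first = true)
updated = (entries[0].get("date") or "1970-01-01") + "T00:00:00Z"
print(FEED_HEADER.format(updated=updated), end="")
for e in entries:
    print(ENTRY.format(title=escape(e["title"]),
                       url=e["url"],
                       date=e.get("date") or "1970-01-01",
                       excerpt=escape(e.get("excerpt") or "")),
          end="")
print("</feed>")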

One last thing: redirects. I’ve added a bunch of RedirectMatch directives to the Apache HTTPD config of the original virtual host:

RedirectMatch "^/tag/(.*)\.html" "https://baturin.org/blog/tag/$1"
RedirectMatch "^/category/programming-languages.html" "https://baturin.org/blog/tag/programming"
RedirectMatch "^/category/development.html" "https://baturin.org/blog/tag/programming"
RedirectMatch "^/category/networking.html" "https://baturin.org/blog/tag/networking"
RedirectMatch "^/category/misc.html" "https://baturin.org/blog/tag/misc"
RedirectMatch "^/category/system-administration.html" "https://baturin.org/blog/tag/servers"
RedirectMatch "^/(.*).html" "https://baturin.org/blog/$1"
RedirectMatch "^/feeds/atom.xml" "https://baturin.org/blog/atom.xml"
RedirectMatch "^/(.*)" "https://baturin.org/blog/"

It doesn’t cover everything, but I think it prevents link rot to the largest extent that is practical. Old links to my posts will continue to work, and search engines won’t be confused, or so I hope.

That’s the migration process. Was it worth it? Well, it took an evening, and the external scripts only use mature libraries that shouldn’t break any time soon, so I think this setup is as low-maintenance as Pelican, and it does everything I want it to do. Except per-tag feeds, but that won’t take long either.

Should you do something like this to your blog? It’s up to you, but if you do, I hope this post will help.

1. Syntax highlighting, footnotes, etc. are often features of Markdown libraries, and generators have no insight into the page content. Some, like Hugo, offer different processor options with different features.