Blog Archiving

13 Aug 2016

The first place I posted journal entries online was the now-defunct Asian community site Asian Avenue. During college, Xanga became really big and I started posting there. Once I graduated and started getting some money in, I moved my blog to self-hosted Wordpress on a shared Dreamhost instance.

Well after static site generators like Jekyll became popular, I decided to restart my blog using Jekyll, hosted on GitHub Pages. This has worked out really well for me and has helped reduce my monthly hosting costs.

The problem is what to do with my old blog hosted on Wordpress. Although it’s hosted on the cheapest Digital Ocean droplet available, I don’t make any new entries and it requires running mysql, which sometimes went down during higher traffic periods. (I set up supervisord to restart the process automatically when this problem occurs.)

I’ve wanted to archive it for a while, since it gets fairly little traffic and its content can be served statically. I had three primary goals for the archive:

  • To reduce my hosting costs (either real or in terms of resources used)
  • To limit breaking existing links
  • To limit content errors

Given these goals, I’ve tried a number of approaches over several years. One was to identify the most visited pages and manually migrate them to the new blog, then set up redirects to those pages using the Quick Page/Post Redirect Plugin.

This has worked fine but has not allowed me to reduce the Wordpress site’s footprint. I’ve also tried various import methods over the years, attempting to convert the existing content to run on Jekyll. I’ve tried both the official Wordpress importer available in the jekyll-import gem and the exitwp tool. Both work decently, but each has quirks in the import process.

The content imported by jekyll-import has the following problems:

  1. Some characters are HTML entity encoded unnecessarily, breaking the markup. There’s a GitHub issue reporting this problem.
  2. The import introduces <br /> tags where they weren’t in the original markup. This causes an issue especially with code examples.
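
The stray <br /> problem can be patched up after the import. Here is a minimal sketch, assuming the code examples live inside <pre> blocks; the helper name strip_br_in_pre is hypothetical and is not part of jekyll-import:

```ruby
# Hypothetical post-import cleanup (not part of jekyll-import).
# Inside every <pre>...</pre> block, turn injected <br /> tags back into
# plain newlines. Markup outside of <pre> blocks is left untouched.
def strip_br_in_pre(html)
  html.gsub(%r{(<pre\b[^>]*>)(.*?)(</pre>)}m) do
    open_tag, body, close_tag = Regexp.last_match.captures
    # A <br /> followed by a newline collapses into a single newline.
    open_tag + body.gsub(%r{<br\s*/?>\n?}, "\n") + close_tag
  end
end
```

It only touches <pre> blocks, so any <br /> tags that were deliberately placed in ordinary paragraphs survive.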

The content imported by exitwp had the following problems:

  1. It attempts to convert each post to Markdown, which broke the markup for linked images.

Both tools have an additional drawback: Wordpress shortcodes are (understandably) ignored. Some examples of shortcodes include [code], [embed], and [gallery].

It’s fairly straightforward to write code to translate [code] shortcode tags into Markdown-style code fences. Here’s some Ruby code I wrote to do that:

def replace_code(line)
  # Opening tag, e.g. [code] or [code lang="ruby"]; group 1 is the language, if any.
  start_re = /\[code(?:[^\]]*?\s+lang\s*=\s*"(\w+)")?[^\]]*\]/
  end_re = /\[\/code\]/

  line
    .gsub(end_re, '```')
    .gsub(start_re) do |_|
      match = Regexp.last_match
      # match[1] is nil when the tag has no lang attribute.
      '```' + match[1].to_s
    end
end

To replace the [embed] shortcode, I used a Jekyll plugin that simply takes the passed URL and writes the iframe HTML for the YouTube link.
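
The same idea can be sketched in plain Ruby, without Jekyll. This is not the plugin mentioned above: the helper names youtube_id and replace_embed are made up for illustration, the iframe dimensions are arbitrary, and it assumes the shortcode wraps a single YouTube URL.

```ruby
require 'uri'

# Hypothetical helper: extract the video ID from a YouTube URL.
# Returns nil when the URL is not a recognised YouTube link.
def youtube_id(url)
  uri = URI.parse(url)
  id = case uri.host
       when 'youtu.be'
         uri.path.delete_prefix('/')
       when 'youtube.com', 'www.youtube.com'
         URI.decode_www_form(uri.query.to_s).to_h['v']
       end
  id.nil? || id.empty? ? nil : id
rescue URI::InvalidURIError
  nil
end

# Replace [embed]URL[/embed] with an iframe.
# URLs that are not recognised fall back to a plain link.
def replace_embed(text)
  text.gsub(%r{\[embed[^\]]*\](.*?)\[/embed\]}) do
    url = Regexp.last_match(1).strip
    id = youtube_id(url)
    if id
      %(<iframe width="560" height="315" src="https://www.youtube.com/embed/#{id}" frameborder="0" allowfullscreen></iframe>)
    else
      %(<a href="#{url}">#{url}</a>)
    end
  end
end
```

Falling back to a link means an unexpected URL still shows up in the archived post instead of silently disappearing.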

Translating the gallery code is more difficult. One of the problems is that although I can get a copy of wp-content, the directory where Wordpress uploads post media, the directory tree is organized by year and month (this may be configurable). I’m not sure how to determine which images go with which post. There’s probably a mapping in the database somewhere, but it’s just more to do.
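
For what it’s worth, Wordpress stores each uploaded file as an attachment row in wp_posts, with post_parent pointing at the post it was uploaded to, and the gallery shortcode can also list attachment IDs explicitly via ids="...". A rough sketch of how an export of those rows could be used follows. The Attachment struct, its field names, and the /wp-content/uploads/ URL prefix are all assumptions for illustration, not something I actually ran against my database:

```ruby
# One record per attachment, as it might come out of a database export.
# Field names are illustrative: id and post_parent mirror wp_posts columns,
# file is the path relative to the uploads directory.
Attachment = Struct.new(:id, :post_parent, :file)

# Pick the attachments a gallery shortcode refers to: the explicit ids="..."
# list when present, otherwise everything attached to the given post.
def gallery_images(shortcode, post_id, attachments)
  ids = shortcode[/\bids\s*=\s*"([\d,\s]+)"/, 1]
  if ids
    by_id = attachments.map { |a| [a.id, a] }.to_h
    ids.split(',').map { |s| by_id[s.strip.to_i] }.compact
  else
    attachments.select { |a| a.post_parent == post_id }
  end
end

# Render the gallery as one Markdown image per line.
def gallery_markdown(shortcode, post_id, attachments)
  gallery_images(shortcode, post_id, attachments)
    .map { |a| "![](/wp-content/uploads/#{a.file})" }
    .join("\n")
end
```

IDs that have no matching attachment are skipped rather than raising, which seems like the safer default for an archive.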

Finally, I figured I should just find a way to save the static content of Wordpress as is, instead of trying to convert it to Jekyll. I tried the WP Static HTML Output plugin, and it worked really well. After installing, you can go into a special menu to download a static version of the site, including JavaScript, CSS, and indexes by tag and category.

The unfortunate part is that making changes to layout is more difficult because the layout is embedded into each page. Still, with a backup of the Wordpress site, one could alter the layout and re-export the site.

There was one small problem with the output of the static site generator: some weirdly encoded tags were added to it. These were easy to strip out using sed, and less trouble than filtering the jekyll-import and exitwp output. Here’s an example (LC_ALL=C makes sed on OS X treat the input as raw bytes, which avoids its “illegal byte sequence” error on files that are not valid in the current locale):

find . -path ./.git -prune -o -type f -name '*.html' \
  -exec env LC_ALL=C sed -i '' -e \
  "s#&lt;/p&gt;##g; s#&lt;pre&gt;&lt;code&gt;##g" {} \;

Since I wanted the archive to have its own domain and I was already using GitHub Pages for my current blog, I deployed these static files on GitLab Pages instead and created a CNAME for the new site. This required adding a simple .gitlab-ci.yml file describing how to deploy the site:

# .gitlab-ci.yml
pages:
  stage: deploy
  script:
  - echo 'Nothing to do...'
  artifacts:
    paths:
    - public
  only:
  - master

Finally, I set up a redirect from the old site using mod_rewrite that forwards all traffic to the new subdomain, simply rewriting the old URL to match the new format.
