Cleaning up UTF-8 character entities when exporting from WordPress to Jekyll

I’ve been experimenting with converting this blog to Jekyll or another static blog generator. I’m sticking with Jekyll at the moment due to its ease of use and its plugin environment. The main idea behind this is to reduce resource consumption and hopefully also speed up the delivery of the blog. In fact, there is a static version of the blog available right now, even though it’s kinda pre-alpha and not always up to date. The Jekyll version also doesn’t have the comments set up yet, nor does it have a theme I like, so it’s still very much work in slow progress.

To export the contents from WordPress to Jekyll I use the surprisingly named WordPress to Jekyll exporter plugin. This plugin dumps the whole WordPress data, including pictures, into a zip file in a format that is mostly Markdown grokked by Jekyll. It doesn’t convert all the links to Markdown, so the generated files need some manual cleanup. One problem I keep running into is that the exporter writes certain UTF-8 characters out as numeric character entities. Unfortunately, when processing the data with Jekyll afterwards, those entities get turned into strings that are displayed as is rather than as the characters they stand for. Please note I’m not complaining about this functionality; I’d rather have this information preserved so I can rework it later on. So I wrote a script to help with this task.
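To give an idea of the kind of cleanup involved, here is a minimal Python sketch. It is not the script from the full post, just an illustration under a few assumptions: the entities are numeric references in decimal or hexadecimal form, the exported posts are UTF-8 encoded `.md` files, and the function names `replace_numeric_refs` and `clean_posts` are made up for this example. Named entities such as `&amp;` are deliberately left alone.

```python
import re
from pathlib import Path

# Matches decimal (&#NNNN;) and hexadecimal (&#xHHHH;) numeric character references.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def replace_numeric_refs(text: str) -> str:
    """Replace numeric character references with the characters they encode."""

    def _sub(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave anything that is not a valid, encodable code point untouched.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return NUMERIC_REF.sub(_sub, text)


def clean_posts(directory: str) -> int:
    """Rewrite every .md file under `directory` in place; return files changed.

    Assumption: the exported posts are UTF-8 encoded Markdown files.
    """
    changed = 0
    for path in Path(directory).rglob("*.md"):
        original = path.read_text(encoding="utf-8")
        cleaned = replace_numeric_refs(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed
```

Calling `clean_posts("_posts")` on an unpacked export would rewrite the affected files in place, so it is worth running it on a copy or inside a version-controlled checkout first.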


Improving my blogging workflow using Emacs (of course)

I try not to post too many metablogging posts. Other people do it better and I’m trying to focus on journalling what I learn as a software engineer and manager, not what tools I use for blogging. However, after losing another post to WordPress’s built-in editor, I decided Something Must Be Done. I think this is only the second post I lost, but it’s a fairly regular occurrence for a journalist friend of mine and I really don’t have that much time to retype blog entries that ended up in Bit Nirvana.

My first attempt was to resurrect the weblogger-mode setup I used a while ago, but after switching the admin interface on my WordPress install to https, I couldn’t quite get it to work again. Plus it was a bit of a half-hearted attempt as I never quite warmed to this mode in the first place. It’s actually quite odd as I tend to use Gnus semi-regularly and the interface is very similar, but it never quite clicked for me for writing blog posts.

If I blogged exclusively on Windows, I’d just use Windows Live Writer, but as I switch between Windows, OS X, Linux and FreeBSD depending on which machine I’m on, Windows-only software just isn’t going to cut it.

As everybody raves about org-mode (which I admittedly have never used), I decided to give org2blog a chance. It’s probably not the smartest idea to try to learn too many new tools at the same time, but at least Emacs doesn’t occasionally eat my scribblings. Plus, I’ve started using Jekyll for another one of my experimental blogs, so using org-mode and having the ability to publish to a Jekyll blog is also very useful.

So far I’ve got the basics up and running and the main blog configured. I’m using visual-line-mode to do automatic line wrapping and will now have to set up flyspell on the machines that haven’t got it installed yet so I can have basic spell checking.

So far, the basic workflow I’m planning is:

  • Sketch the post(s) and write the drafts in Emacs in the comfort of my local machine
  • Publish them as drafts to my standalone WordPress install
  • Do the final editing and spill chucking in WordPress
  • Ignore or heed the recommendations from the WordPress SEO plugin. That’ll be mostly ignore, then
  • Schedule the final publishing on the WordPress admin console

Hopefully that should work better than the “log into WordPress and start typing” approach I’ve used so far.