Cleaning up UTF-8 character entities when exporting from WordPress to Jekyll

I’ve been experimenting with converting this blog to Jekyll or another static blog generator. I’m sticking with Jekyll at the moment because of its ease of use and its plugin ecosystem. The main idea is to reduce resource consumption and, hopefully, to speed up delivery of the blog. In fact, there is a static version of the blog available right now, even though it’s kinda pre-alpha and not always up to date. The Jekyll version doesn’t have comments set up yet, nor does it have a theme I like, so it’s still very much a work in slow progress.

To export the contents from WordPress to Jekyll, I use the surprisingly named WordPress to Jekyll exporter plugin. This plugin dumps all of the WordPress data, including pictures, into a zip file in a format that is mostly Markdown grokked by Jekyll. It doesn’t convert all the links to Markdown, so the generated files need some manual cleanup. One problem I keep running into is that the exporter writes out certain UTF-8 characters as entities containing their numerical code. Unfortunately, when Jekyll processes the data afterwards, those entities get turned into strings that are displayed as is instead of as the characters they stand for. Please note that I’m not complaining about this behaviour; I’d rather have this information preserved so I can rework it later on. So I wrote a script to help with this task.
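A cleanup along these lines can be sketched in Python. This is a minimal sketch rather than the script mentioned above, and it rests on a few assumptions: that the leftover entities are HTML numeric character references such as `&#8217;` or `&#x2019;`, that the exported posts are UTF-8 encoded `.md` files, and that rewriting them in place is acceptable. The function names `decode_numeric_refs`, `clean_file`, and `clean_tree` are hypothetical and chosen only for illustration.

```python
import re
from pathlib import Path

# Assumption: the entities look like decimal (&#8217;) or hexadecimal
# (&#x2019;) numeric character references. Named entities such as &amp;
# are deliberately left alone.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_refs(text: str) -> str:
    """Replace numeric character references with the characters they encode."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Surrogate code points cannot be written out as UTF-8, so keep them.
        if 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid Unicode code point: leave the reference untouched.
            return match.group(0)

    return NUMERIC_REF.sub(replace, text)


def clean_file(path: Path) -> bool:
    """Rewrite one file in place; return True if its contents changed."""
    original = path.read_text(encoding="utf-8")
    cleaned = decode_numeric_refs(original)
    if cleaned != original:
        path.write_text(cleaned, encoding="utf-8")
        return True
    return False


def clean_tree(root: Path) -> list:
    """Clean every .md file below root and return the paths that changed."""
    return [post for post in sorted(root.rglob("*.md")) if clean_file(post)]
```

Calling `clean_tree(Path("_posts"))` on an unpacked export would then rewrite the affected posts, where `_posts` is only an assumed directory name. Restricting the pattern to numeric references is a deliberate choice in this sketch: it leaves named entities like `&amp;` and `&lt;` intact, which matters if any posts contain code samples or literal HTML.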
