Cleaning up UTF-8 character entities when exporting from WordPress to Jekyll

August 27, 2017

Timo Geusch

I’ve been experimenting with converting this blog to Jekyll or another static blog generator. I’m sticking with Jekyll at the moment because of its ease of use and its plugin environment. The main idea is to reduce resource consumption and hopefully also speed up delivery of the blog. In fact, there is a static version of the blog available right now, even though it’s kinda pre-alpha and not always up to date. The Jekyll version doesn’t have comments set up yet, nor does it have a theme I like, so it’s still very much a work in slow progress.

To export the contents from WordPress to Jekyll I use the surprisingly named WordPress to Jekyll exporter plugin. This plugin dumps the whole WordPress data, including pictures, into a zip file in a format that is mostly Markdown that Jekyll groks. It doesn’t convert all the links to Markdown, so the generated files need some manual cleanup. One problem I keep running into is that the exporter dumps out certain UTF-8 characters as numeric character entities, such as &#8217; for a right single quote. Unfortunately, when Jekyll processes the data afterwards, those entities end up as strings that are displayed as is. Please note I’m not complaining about this functionality; I’d rather have this information preserved so I can rework it later on. So I wrote a script to help with this task.

The bash script iterates over all the .md files in the current directory and cleans up the common UTF-8 entity codes that tend to appear in my blog posts. I then rely on Jekyll’s markdown processors to “do the right thing” when it processes the cleaned up markdown.

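for file in *.md
do
    # Keep the original around until both sed passes have run
    mv "$file" "$file.old"
    # First pass (single-quoted script): multiplication sign, en dash, ellipsis, double quotes
    sed 's/&#215;/x/g; s/&#8211;/-/g; s/&#8230;/.../g; s/&#8220;/"/g; s/&#8221;/"/g; s/&#8243;/"/g' "$file.old" > "tmp.$file"
    # Second pass (double-quoted script): left and right single quotes
    sed "s/&#8216;/'/g; s/&#8217;/'/g" "tmp.$file" > "$file"
    rm "tmp.$file" "$file.old"
done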
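Since the script only handles the entity codes that tend to show up in my posts, it can help to check which codes a given export actually contains before extending the sed list. The following is a minimal sketch of such a check, not part of the original script; `list_entities` is a hypothetical helper name, and it assumes a grep that supports the `-o` and `-h` options.

```shell
#!/bin/sh
# Hypothetical helper: count every numeric entity (such as &#8217;)
# found in the files passed as arguments, most frequent first.
list_entities() {
    # -o prints each match on its own line, -h suppresses file names
    grep -oh '&#[0-9][0-9]*;' "$@" | sort | uniq -c | sort -rn
}
```

Running `list_entities *.md` in the export directory would then print one line per entity code with its count, which makes it easier to spot codes the cleanup script doesn’t cover yet.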

There is definitely some scope for improvement in the script, but so far it helps with the cleanup and automates an otherwise tedious part of the process. Unfortunately, the double invocation of sed appears to be necessary on my platform (OS X) to ensure that it converts single and double quotes properly. The main problem is that the replacements contain both single and double quotes, so the script has to juggle the shell quoting around the sed commands accordingly: the first pass is wrapped in single quotes, the second in double quotes. However, even with the double invocation, the runtime of the script is still acceptable, at least with the number of Markdown files in my blog.
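One possible improvement is to pass several `-e` expressions to a single sed call, each with whichever shell quoting suits its replacement. The sketch below assumes that the sed in use accepts multiple `-e` options with `;`-separated commands; I have not verified that it behaves identically on every sed, including the one that ships with OS X. `clean_entities` is a hypothetical helper name, not something from the original script.

```shell
#!/bin/sh
# Hypothetical helper: read Markdown on stdin, write the cleaned text to stdout.
clean_entities() {
    # The single-quoted expressions hold the replacements that contain
    # double quotes; the double-quoted expression holds the replacements
    # that contain single quotes.
    sed -e 's/&#215;/x/g; s/&#8211;/-/g; s/&#8230;/.../g' \
        -e 's/&#8220;/"/g; s/&#8221;/"/g; s/&#8243;/"/g' \
        -e "s/&#8216;/'/g; s/&#8217;/'/g"
}

# Possible use inside the loop (left commented out so that sourcing
# this file never rewrites anything by accident):
# for file in *.md; do
#     clean_entities < "$file" > "$file.tmp" && mv "$file.tmp" "$file"
# done
```

Because the helper works as a filter on stdin and stdout, it can be tried on a single line with `printf` before letting it loose on the whole export.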
