Pygmentising Hakyll's Syntax Highlighting

Posted on 2023-01-21  ·  last modified: 2023-01-23  ·  7 min read

By default, Hakyll uses pandoc to generate syntax highlighting for all kinds of different programming languages. However, even in simple examples the html this produces is unsatisfactory. Thankfully, the two programs are almost infinitely customisable, and changing pretty much any setting doesn’t usually involve a lot of work—this is no exception. Using pygmentize as an example, I will show you how you can swap out pandoc’s native syntax highlighting with pretty much any third party tool that can output html.

The problem

Pandoc uses the skylighting library to generate syntax highlighting for a given block of code. Skylighting, in turn, uses kde xml syntax definitions for the respective tokenisers. However, even for simple examples I don’t agree with the html this generates. Consider the following Haskell code block.
fibs :: [Integer]
fibs = 0 : scanl' (+) 1 fibs

Pandoc would generate something like the following:
<div class="sourceCode" id="cb1">
  <pre class="sourceCode haskell">
    <code class="sourceCode haskell">
      <span id="cb1-1">
        <a href="#cb1-1" aria-hidden="true" tabindex="-1"> </a>
        <span class="ot">    fibs :: </span> [<span class="dt">Integer </span>]
      </span>
      <span id="cb1-2">
        <a href="#cb1-2" aria-hidden="true" tabindex="-1"> </a>    fibs
        <span class="ot">= </span>
        <span class="dv">0 </span> <span class="op">: </span>
        scanl&#39; (<span class="op">+ </span>) <span class="dv">1
        </span> fibs
      </span>
    </code>
  </pre>
</div>

One can already see a few things wrong with this: (i) in the type signature, the name of the list is smushed together with the separating double colon (worse: it’s just in the “other” syntax class), (ii) in the actual definition, fibs isn’t assigned any class at all, and (iii) the assignment operator is also in the “other” class, instead of something related to it being a built-in operator! As one can imagine, this only gets worse as snippets get more complicated.

These kinds of issues, combined with the fact that certain languages—like Emacs Lisp—don’t have any syntax definitions at all, annoyed me enough to look for an alternative way to highlight code on this website.
(All of this work for a mostly greyscale theme!)
There are of course many options to choose from; I went with pygmentize, solely because I already had it installed. All that’s left is to tell pandoc and Hakyll to make use of it. As mentioned, this thankfully doesn’t turn out to be very difficult!

Playing with pygmentize

Having never used pygmentize as a command line utility (so far, my only interaction with the program had been through the excellent minted LaTeX package), I expected this to take some work, possibly involving Python (shudder), but all of the necessary pieces are already present in the cli. First up, the -f option specifies the formatter to use, which decides the shape of the output.
$ pygmentize -L formatter | grep html
* html:
    Format tokens as HTML 4 ``<span>`` tags within a ``<pre>`` tag, wrapped
    in a ``<div>`` tag. The ``<div>``'s CSS class can be set by the `cssclass`
    option. (filenames *.html, *.htm)

We can test how this highlighting looks straight away; executing
$ printf "fibs :: [Integer]\nfibs = 0 : scanl' (+) 1 fibs\n" \
    | pygmentize -l haskell -f html

produces an html output along the lines of
<div class="highlight">
  <pre>
    <span> </span>
    <span class="nf">fibs </span> <span class="w"> </span>
    <span class="ow">:: </span> <span class="w"> </span>
    <span class="p">[</span>
      <span class="kt">Integer </span> <span class="p">] </span>
    <span class="nf">fibs </span> <span class="w"> </span>
    <span class="ow">= </span> <span class="w"> </span>
    <span class="mi">0 </span> <span class="w"> </span>
    <span class="kt">: </span> <span class="w"> </span>
    <span class="n">scanl&#39; </span> <span class="w"> </span>
    <span class="p">(</span>
      <span class="o">+ </span> <span class="p">) </span>
    <span class="w"> </span>
      <span class="mi">1 </span> <span class="w"> </span>
    <span class="n">fibs </span>
  </pre>
</div>

This looks much better! The class names are rather cryptic, but pygmentize can also emit nicely annotated css styles for its supported colour schemes. For example, the following is a small excerpt of the output for the emacs style:
$ pygmentize -S emacs -f html

.nf { color: #00A000 }                    /* Name.Function */
.ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
.kt { color: #00BB00; font-weight: bold } /* Keyword.Type */
.w { color: #bbbbbb }                     /* Text.Whitespace */
.c { color: #008800; font-style: italic } /* Comment */

You can redirect this into a pygments.css file, link to it (e.g., from your default.html template), and be on your way. The annotation also makes it very easy to change that file after the fact, in case pygmentize does not have the theme that you want.

Integration

The idea of what we want to do is quite simple: for every code block in a given post, shell out to pygmentize, and use its output to replace the block, somehow making sure pandoc doesn’t touch it afterwards. Let’s solve this step by step.
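
The "shell out" step on its own can be sketched as follows. This is a minimal sketch rather than the code used later in this post: the names runHighlighter and pygmentizeHtml are made up for illustration, and reporting failure as a Left value is just one possible design. It only needs the process package.

```haskell
import System.Exit (ExitCode (..))
import System.Process (readProcessWithExitCode)

-- | Run an external highlighter, feeding it the code on stdin.
-- Yields 'Left' with the program's stderr on a non-zero exit code,
-- and 'Right' with its stdout otherwise.
runHighlighter :: FilePath -> [String] -> String -> IO (Either String String)
runHighlighter exe args input = do
  (code, out, err) <- readProcessWithExitCode exe args input
  pure $ case code of
    ExitSuccess   -> Right out
    ExitFailure _ -> Left err

-- | Hypothetical convenience wrapper around pygmentize's html formatter;
-- the first argument is the language, the second the code to highlight.
pygmentizeHtml :: String -> String -> IO (Either String String)
pygmentizeHtml lang = runHighlighter "pygmentize" ["-l", lang, "-f", "html"]
```

Keeping the executable and its arguments as parameters makes it easy to try the same plumbing with a different highlighter later on.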

Pandoc

Pandoc has an aptly named Pandoc type, which represents the internal structure of a document.
data Pandoc = Pandoc Meta [Block]

We’ll neglect the metadata for now and just look at the Blocks; specifically, we want to zoom in on two constructors that will give us everything we need:
data Block
  -- Lots of other constructors omitted
  = CodeBlock Attr Text   -- ^ Code block (literal) with attributes
  | RawBlock Format Text  -- ^ Raw block

-- | Attributes: identifier, classes, key-value pairs
type Attr = (Text, [Text], [(Text, Text)])

-- | Formats for raw blocks
newtype Format = Format Text

To get a feeling for how these CodeBlocks look, again consider our fibs example from above. By default, the corresponding CodeBlock for this would look something like
CodeBlock ("", ["haskell"], [])
          "fibs :: [Integer]\nfibs = 0 : scanl' (+) 1 fibs"

Importantly, the language (if any) is the first element of the classes list inside Attr.
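
To make that concrete, here is a small sketch of pulling the language out of a block. The helper name codeBlockLang is invented for this illustration and is not part of pandoc.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Maybe (listToMaybe)
import Data.Text (Text)
import Text.Pandoc.Definition (Block (..))

-- | The language of a code block, if it has one: the first entry of
-- the classes list in its attributes. Any other block yields 'Nothing'.
codeBlockLang :: Block -> Maybe Text
codeBlockLang (CodeBlock (_, classes, _) _) = listToMaybe classes
codeBlockLang _                             = Nothing
```

Applied to the fibs block above this gives Just "haskell", whereas a code block without any classes gives Nothing.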

A strategy begins to form: look for all occurrences of a CodeBlock in the Pandoc value, and replace each with a RawBlock "html" such that it isn’t touched anymore. Doing so will not pose very many challenges: pandoc has really great capabilities for walking its ast in order to facilitate exactly these kinds of changes. Unsurprisingly, the Walkable class presides over all things walkable; an abbreviated definition looks like this:
class Walkable a b where
  -- | @walk f x@ walks the structure @x@ (bottom up) and replaces every
  -- occurrence of an @a@ with the result of applying @f@ to it.
  walk  :: (a -> a) -> b -> b
  walk f = runIdentity . walkM (return . f)

  -- | A monadic version of 'walk'.
  walkM :: (Monad m, Applicative m, Functor m) => (a -> m a) -> b -> m b
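
Before involving any effects, it may help to see the pure walk in action. The following is only a sketch: tagPlain and tagPlainBlocks are names made up here, and the class "text" is an arbitrary choice.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.Pandoc.Definition (Block (..), Pandoc (..))
import Text.Pandoc.Walk (walk)

-- | Give every code block without any classes the class "text";
-- all other blocks are returned unchanged.
tagPlain :: Block -> Block
tagPlain (CodeBlock (ident, [], kvs) body) = CodeBlock (ident, ["text"], kvs) body
tagPlain block                             = block

-- | Apply 'tagPlain' to every block of a document, nested ones included.
tagPlainBlocks :: Pandoc -> Pandoc
tagPlainBlocks = walk tagPlain
```

Because walk traverses the whole structure, a code block hidden inside, say, a block quote is rewritten as well; no manual recursion is needed.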

Specifically, as we’ll need to shell out to an external program, let us restrict our attention to the more general walkM function here. There is an instance
instance Walkable Block Pandoc

which will be all that we need. The necessary code now just materialises in front of our eyes:
-- {-# LANGUAGE BlockArguments    #-}
-- {-# LANGUAGE LambdaCase        #-}
-- {-# LANGUAGE OverloadedStrings #-}
-- {-# LANGUAGE ViewPatterns      #-}
--
-- import Data.Maybe (fromMaybe, listToMaybe)
-- import qualified Data.Text as T
-- import Hakyll
-- import System.Process (readProcess)
-- import Text.Pandoc.Definition (Block (CodeBlock, RawBlock), Pandoc)
-- import Text.Pandoc.Walk (walk, walkM)

pygmentsHighlight :: Pandoc -> Compiler Pandoc
pygmentsHighlight = walkM \case
  CodeBlock (_, (T.unpack -> lang) : _, _) (T.unpack -> body) ->
    RawBlock "html" . T.pack <$> unsafeCompiler (callPygs lang body)
  block -> pure block
 where
  -- Feed the code block's body to pygmentize on stdin; return its html output.
  callPygs :: String -> String -> IO String
  callPygs lang = readProcess "pygmentize" ["-l", lang, "-f", "html"]

Notice how a priori this would have type Pandoc -> IO Pandoc, but since we want to use it from Hakyll I’ve already inserted a call to unsafeCompiler in the correct place.
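
If you want to experiment outside of Hakyll, one option is to abstract over the highlighting action. This is a sketch under that assumption and not what my site uses; highlightWith and pygmentsHighlightIO are hypothetical names.

```haskell
{-# LANGUAGE LambdaCase        #-}
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import System.Process (readProcess)
import Text.Pandoc.Definition (Block (..), Pandoc (..))
import Text.Pandoc.Walk (walkM)

-- | Replace every code block that names a language with raw html
-- produced by the supplied action (language first, code second).
-- Blocks without a language are left alone.
highlightWith :: Monad m => (String -> String -> m String) -> Pandoc -> m Pandoc
highlightWith highlight = walkM $ \case
  CodeBlock (_, lang : _, _) body ->
    RawBlock "html" . T.pack <$> highlight (T.unpack lang) (T.unpack body)
  block -> pure block

-- | The plain IO instantiation: shell out to pygmentize.
pygmentsHighlightIO :: Pandoc -> IO Pandoc
pygmentsHighlightIO = highlightWith $ \lang ->
  readProcess "pygmentize" ["-l", lang, "-f", "html"]
```

The Hakyll variant is then the same call with the action wrapped in unsafeCompiler, and passing a pure stub instead lets you check the traversal without pygmentize installed.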

Further, the above code checks whether the block has an explicit language attached to it and, if not, leaves it alone; this was suggested by LSLeary on Reddit. If you want to have a single div class for every code block—say, for some custom css—then you can replace
  CodeBlock (_, (T.unpack -> lang) : _, _) (T.unpack -> body) ->
    RawBlock "html" . T.pack <$> unsafeCompiler (callPygs lang body)

with
  CodeBlock (_, listToMaybe -> mbLang, _) (T.unpack -> body) -> do
    let lang = T.unpack (fromMaybe "text" mbLang)
    RawBlock "html" . T.pack <$> unsafeCompiler (callPygs lang body)

Hakyll

Thankfully, integrating pygmentsHighlight into Hakyll is not very complicated either. In addition to the normal pandocCompiler or pandocCompilerWith functions that you are probably already using, there is also pandocCompilerWithTransformM:
pandocCompilerWithTransformM
  :: ReaderOptions
  -> WriterOptions
  -> (Pandoc -> Compiler Pandoc)
  -> Compiler (Item String)

Basically, in addition to reader and writer options, it also takes a monadic transformation of pandoc’s ast and builds an appropriate Compiler from that.
-- import Hakyll
-- import Text.Pandoc.Options

myPandocCompiler :: Compiler (Item String)
myPandocCompiler =
  pandocCompilerWithTransformM
    defaultHakyllReaderOptions
    defaultHakyllWriterOptions
    pygmentsHighlight

The myPandocCompiler function can now be used like any other compiler; for example:
main :: IO ()
main = hakyll do
  -- …
  match "posts/**.md" do
    route (setExtension "html")
    compile $ myPandocCompiler
          >>= loadAndApplyTemplate "templates/default.html" defaultContext
          >>= relativizeUrls
  -- …

For a full working example, see my configuration.
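
The stylesheet from earlier could also be produced by Hakyll itself instead of being redirected into a file by hand. The following is a sketch of one way to do that, not something taken from my configuration: the rule name pygmentsCss and the output path css/pygments.css are assumptions, and emacs is simply the style used in the example above.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Hakyll
import System.Process (readProcess)

-- | Generate the css for one pygments style as part of the site build.
-- The output path and the style name are placeholders; adjust to taste.
pygmentsCss :: Rules ()
pygmentsCss = create ["css/pygments.css"] $ do
  route idRoute
  compile $
    unsafeCompiler (readProcess "pygmentize" ["-S", "emacs", "-f", "html"] "")
      >>= makeItem
```

Such a rule would sit next to the match blocks inside the hakyll call, with the template linking to the generated file.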

Conclusion

That’s it! To my eyes, syntax highlighting looks much better now, and on the way I—and perhaps you as well—even learned a little bit about how pandoc internally represents its documents. Time well spent. As I said in the beginning, in principle one could swap out pygmentize for any other syntax highlighter that can produce html. However, for me these results are good enough that I will probably not try out every tool under the sun, chasing that ever present epsilon of highlighting cases which I still don’t agree with—at least for now.
  • Vaibhav Sagar has written a fantastic post outlining how one can use ghc itself to generate highlighting for Haskell code using the ghc-syntax-highlighter library. Seeing how there are a lot of language extensions that pygmentize does not highlight correctly, this seems well worth it!