A new look into the HTML 5 tokenizer specification

Posted by Romain

Over the past year or so, the HTML 5 specification has been an unfriendly but necessary reference for me (and for us). It is the only place that really explains how an HTML 5 document gets tokenized, a necessary step before parsing.

If you're doing research related to XSS or HTML contexts and have never looked at this document, I suggest you dive into it. It is mostly the key to finding something like the script data double escaped state, as described by Jon.

However, if you have taken a serious look at it, I'm sure you had one of these reactions: Where am I now? How did I end up here? So, to make our lives easier, I created a small visualization page for the spec. I mostly scraped the tokenizer spec and generated a graph from it.

The result is a self-contained HTML document that helps you navigate the tokenization specification: you can click on states, and it remembers where you came from. It's really just to make our lives easier:

[Image: HTML 5 grammar preview]

This document is available here: HTML5 tokenization visualization.
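
To give an idea of the "click on a state and remember where you came from" behaviour, here is a minimal Python sketch. It is not the code behind the visualization page: the `StateNavigator` class and the tiny `GRAPH` are hypothetical, and the state names follow the current WHATWG tokenizer section, which may differ from the version of the spec that was scraped.

```python
# Hypothetical, hand-written subset of the tokenizer graph:
# each state maps to the set of states it can switch to.
GRAPH = {
    "Data state": {"Tag open state", "Character reference state"},
    "Tag open state": {"Tag name state", "End tag open state"},
    "Tag name state": {
        "Data state",
        "Before attribute name state",
        "Self-closing start tag state",
    },
}


class StateNavigator:
    """Walks the state graph and keeps a trail of visited states."""

    def __init__(self, graph, start):
        self.graph = graph
        self.current = start
        self.history = []  # states we came from, oldest first

    def go(self, target):
        """Follow a transition from the current state to `target`."""
        if target not in self.graph.get(self.current, ()):
            raise ValueError(
                "no transition from %r to %r" % (self.current, target)
            )
        self.history.append(self.current)
        self.current = target

    def back(self):
        """Return to the previous state, if there is one."""
        if self.history:
            self.current = self.history.pop()
        return self.current
```

Calling `go("Tag open state")` and then `go("Tag name state")` from the data state leaves both earlier states in `history`, so `back()` retraces the path one step at a time.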

If you're interested in how the data was obtained, we published the script that generates a JSON or DOT file on GitHub: security/html5-tokenizer-extraction at master in the coverity/security repository.