Rust macros in html5ever

Keegan McAllister

November 6, 2014


Available at kmcallister.github.io

The Servo project

Servo is an experimental browser engine from Mozilla Research

Developed by a dozen Mozilla employees + hundreds of others

Layout code is all-new and written in Rust

Acid1 cake

Acid2 cake

Bootstrapping

C libs replaced in Rust:

Faster, cleaner, safer, more correct

Future: JS engine, rasterizer?

HTML parsing

What is the HTML syntax? Depends who you ask!

Which one is relevant for real browsers and content?

TURN DOWN FOR WHATWG

Parsing rules

A start tag whose tag name is "nobr"

Reconstruct the active formatting elements, if any.

If the stack of open elements has a nobr element in scope, then this is a parse error; run the adoption agency algorithm for the tag name "nobr", then once again reconstruct the active formatting elements, if any.

Insert an HTML element for the token. Push onto the list of active formatting elements that element.

Parsing rules

When the steps below require the UA to generate implied end tags, then, while the current node is a dd element, a dt element, an li element, …

When the steps below require the UA to generate all implied end tags thoroughly, then, while the current node is a caption element, a colgroup element, a dd element, …

whyyyyyy

The upside: Any crap HTML (even 1996 GeoCities) will parse the same in every modern browser

<kmc> should I be scared when the WHATWG spec says "for historical reasons"? because I feel like that phrase already applies to the entire document

<Ms2ger> Correct

<Ms2ger> That just means "for historical reasons we dislike particularly"

html5ever

html5ever is Servo's new HTML parser, written mostly by me over the course of about 7 months

We now have 8 contributors and several users!

Fast, safe, generic, native UTF-8

Rust and C APIs available

Macros in html5ever

Factor the problem into:

Bonus: code looks like the spec!

Tokenizer rule

12.2.4.1 Data state

Consume the next input character:

U+0026 AMPERSAND (&)
Switch to the character reference in data state.
U+003C LESS-THAN SIGN (<)
Switch to the tag open state.
U+0000 NULL
Parse error. Emit the current input character as a character token.
EOF
Emit an end-of-file token.
Anything else
Emit the current input character as a character token.

Tokenizer code

match self.state {
    states::Data => loop { match get_char!(self) {
        '&'  => go!(self: consume_char_ref),
        '<'  => go!(self: to TagOpen),
        '\0' => go!(self: error; emit '\0'),
        c    => go!(self: emit c),
    }},

In Hubbub this is about 150 lines of C.

In html5lib it's 20 lines of Python.

Incremental parsing

// Feed more input
fn feed(&mut self, input: String);

// Get next character, if available
fn get_char(&mut self) -> Option<char>;

// true => made progress
fn step(&mut self) -> bool;

fn run(&mut self) {
    while self.step() { }
}
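
A minimal sketch of the interface above in current Rust, assuming a toy tokenizer that just copies characters to an output string. The `Tokenizer` struct, its `buffer` and `emitted` fields, and the `&str` argument to `feed` are illustrative assumptions, not html5ever's actual types.

```rust
use std::collections::VecDeque;

/// Hypothetical toy tokenizer: it only copies characters to `emitted`,
/// but it accepts input in chunks and stops cleanly when the buffer is empty.
struct Tokenizer {
    buffer: VecDeque<char>,
    emitted: String,
}

impl Tokenizer {
    fn new() -> Self {
        Tokenizer { buffer: VecDeque::new(), emitted: String::new() }
    }

    // Feed more input
    fn feed(&mut self, input: &str) {
        self.buffer.extend(input.chars());
    }

    // Get next character, if available
    fn get_char(&mut self) -> Option<char> {
        self.buffer.pop_front()
    }

    // true => made progress
    fn step(&mut self) -> bool {
        match self.get_char() {
            Some(c) => {
                self.emitted.push(c);
                true
            }
            // Out of input: report "no progress" so `run` returns
            // and the caller can feed another chunk later.
            None => false,
        }
    }

    fn run(&mut self) {
        while self.step() {}
    }
}
```

Calling `feed` and then `run` again picks up where the previous `run` stopped, which is the point of the `bool` return from `step`.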

get_char!

macro_rules! unwrap_or_return (
    ($opt:expr, $retval:expr) => (
        match $opt {
            None => return $retval,
            Some(x) => x,
        }
    )
)

macro_rules! get_char (
    ($me:expr) => (
        unwrap_or_return!($me.get_char(), false)
    )
)
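
The slide uses 2014 macro syntax. A sketch of the same two macros in current `macro_rules!` syntax, with a hypothetical `Input` type (not from html5ever) to show the early return out of `step`:

```rust
macro_rules! unwrap_or_return {
    ($opt:expr, $retval:expr) => {
        match $opt {
            None => return $retval,
            Some(x) => x,
        }
    };
}

macro_rules! get_char {
    ($me:expr) => {
        unwrap_or_return!($me.get_char(), false)
    };
}

/// Hypothetical input source, for illustration only.
struct Input {
    chars: Vec<char>,
    pos: usize,
    seen: Vec<char>,
}

impl Input {
    fn get_char(&mut self) -> Option<char> {
        let c = self.chars.get(self.pos).copied();
        if c.is_some() {
            self.pos += 1;
        }
        c
    }

    fn step(&mut self) -> bool {
        // Returns `false` from `step` itself when no character is available.
        let c = get_char!(self);
        self.seen.push(c);
        true
    }
}
```

The `return` inside the macro leaves the function that invoked it, so the "out of input" case never clutters the state-machine code.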

Tokenizer actions

macro_rules! shorthand (

    ($me:expr: emit $c:expr)
        => ( $me.emit_char($c); );

    // 22 more of these
)

Allows for compact sequencing:

go!(self: error; create_doctype; force_quirks;
          emit_doctype; to Data)

go! is used over 200 times.

Sequencing

A pattern like $($cmd:tt)* ; $($rest:tt)* is ambiguous :(

macro_rules! go (
  ($me:expr: $a:tt                 ; $($rest:tt)* )
    => ({ shorthand!($me: $a);       go!($me: $($rest)*); });

  ($me:expr: $a:tt $b:tt           ; $($rest:tt)* )
    => ({ shorthand!($me: $a $b);    go!($me: $($rest)*); });

  ($me:expr: $a:tt $b:tt $c:tt     ; $($rest:tt)* )
    => ({ shorthand!($me: $a $b $c); go!($me: $($rest)*); });

Sequencing (continued)

  // Can only come at the end
  ($me:expr: to $s:ident)
    => ({ $me.state = states::$s; return true; });

  // Base cases
  ($me:expr: $($cmd:tt)+ )
    => ( shorthand!($me: $($cmd)+); );

  ($me:expr: ) => (());
)
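
A cut-down sketch of the same sequencing trick in current Rust. Assumptions: only two shorthand actions (`emit` and `error`), commands of at most two tokens, a two-variant `State` enum, and a hypothetical `Tok` struct. Current `macro_rules!` does not allow `:` after an `expr` fragment, so this sketch matches `self` with `$me:ident` instead.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum State { Data, TagOpen }

macro_rules! shorthand {
    ($me:ident: emit $c:expr) => { $me.emit_char($c) };
    ($me:ident: error) => { $me.parse_error() };
}

macro_rules! go {
    // One-token command, then more commands.
    ($me:ident: $a:tt ; $($rest:tt)*) =>
        {{ shorthand!($me: $a); go!($me: $($rest)*); }};
    // Two-token command, then more commands.
    ($me:ident: $a:tt $b:tt ; $($rest:tt)*) =>
        {{ shorthand!($me: $a $b); go!($me: $($rest)*); }};
    // Can only come at the end: switch state and finish the step.
    ($me:ident: to $s:ident) =>
        {{ $me.state = State::$s; return true; }};
    // Base cases
    ($me:ident: $($cmd:tt)+) => {{ shorthand!($me: $($cmd)+); }};
    ($me:ident: ) => {{}};
}

/// Hypothetical tokenizer with just enough state for the demo.
struct Tok {
    state: State,
    input: Vec<char>,
    pos: usize,
    out: String,
    errors: usize,
}

impl Tok {
    fn new(s: &str) -> Self {
        Tok { state: State::Data, input: s.chars().collect(), pos: 0,
              out: String::new(), errors: 0 }
    }
    fn emit_char(&mut self, c: char) { self.out.push(c); }
    fn parse_error(&mut self) { self.errors += 1; }
    fn get_char(&mut self) -> Option<char> {
        let c = self.input.get(self.pos).copied();
        if c.is_some() { self.pos += 1; }
        c
    }

    // true => made progress
    fn step(&mut self) -> bool {
        match self.state {
            State::Data => loop {
                match self.get_char() {
                    None => return false,
                    Some('<') => go!(self: to TagOpen),
                    Some('\0') => go!(self: error; emit '\0'),
                    Some(c) => go!(self: emit c),
                }
            },
            // Other states are omitted from this sketch.
            State::TagOpen => false,
        }
    }
}
```

Because every command is split off as raw `tt` tokens before `shorthand!` parses it, the rules never need the ambiguous `$($cmd:tt)* ; $($rest:tt)*` pattern.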

Procedural macros

We're already stretching the limits of macro_rules! and we haven't touched tree construction…

Procedural macros run arbitrary Rust code at compile time, using rustc's plugin infrastructure

See doc.rust-lang.org/guide-plugin.html

Named characters

&lt; parses as "<"

&ContourIntegral; parses as "∮"

WHATWG publishes about 2,000 of these as JSON

pub static NAMED_ENTITIES: PhfMap<&'static str, [u32, ..2]>
    = named_entities!("data/entities.json");
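
To show the shape of the data only, here is a sketch that swaps the compile-time `PhfMap` for an ordinary runtime `HashMap` with two hand-written entries. The key format (name without the leading `&`) and the use of `0` for "no second code point" are assumptions made for this illustration.

```rust
use std::collections::HashMap;

/// Runtime stand-in for the generated table: name -> up to two code points.
fn named_entities() -> HashMap<&'static str, [u32; 2]> {
    let mut m = HashMap::new();
    m.insert("lt;", [0x3C, 0]);                 // "<"
    m.insert("ContourIntegral;", [0x222E, 0]);  // "∮"
    m
}

/// Expand a named character reference into its characters, if known.
fn expand(name: &str) -> Option<String> {
    let map = named_entities();
    let codes = map.get(name)?;
    Some(
        codes
            .iter()
            .filter(|&&c| c != 0) // assumed: 0 marks an unused slot
            .filter_map(|&c| char::from_u32(c))
            .collect(),
    )
}
```

The real table is built once by the compiler, so lookups pay no construction or collision cost at runtime; this sketch rebuilds the map on every call purely to stay short.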

The named_entities! macro

let map: HashMap<String, [u32, ..2]> = ...;

let toks: Vec<_> = map.into_iter().flat_map(
    |(k, [c0, c1])| {
        let k = k.as_slice();
        (quote_tokens!(&mut *cx,
            $k => [$c0, $c1],
        )).into_iter()
    }
).collect();

MacExpr::new(quote_expr!(&mut *cx,
    phf_map!($toks)
))

Perfect hashing

We use another procedural macro, from sfackler's rust-phf library, to generate a perfect hash table at compile time.

phf_map!(k => v,
         k => v,
         ...)

Tree builder actions

Tree builder has its own rules, less regular in form than the tokenizer.

Instead of match + go!, we'll need a procedural macro.

match_token!

match mode {
    InBody => match_token!(token { 
        tag @ </a> </b> </big> </code> </em> </font>
              </i> </nobr> </s> </small> </strike>
              </strong> </tt> </u> => {
            self.adoption_agency(tag.name);
            Done
        }

        tag @ <h1> <h2> <h3> <h4> <h5> <h6> => {
            self.close_p_element_in_button_scope();
            if self.current_node_in(heading_tag) {
                // ...

Custom syntax trees

struct Tag {
    kind: TagKind,
    name: Option<TagName>,  // None for wild
}

/// Left-hand side of a pattern-match arm.
enum LHS {
    Pat(P<ast::Pat>),
    Tags(Vec<Spanned<Tag>>),
}
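
A sketch of how a `Tag` pattern with a wildcard name could be tested against a concrete tag. The `matches` method, the `TagKind` variants, and the use of `&'static str` in place of `TagName` are hypothetical; `match_token!` itself generates match arms at compile time rather than calling a function like this.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum TagKind { StartTag, EndTag }

struct Tag {
    kind: TagKind,
    name: Option<&'static str>, // None for wild
}

impl Tag {
    /// A pattern matches when the kinds agree and the name is either
    /// wild (`None`) or equal to the concrete tag's name.
    fn matches(&self, kind: TagKind, name: &str) -> bool {
        self.kind == kind && self.name.map_or(true, |n| n == name)
    }
}
```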

Source code spans

In syntax::codemap you will find

pub struct Span {
    pub lo: BytePos,
    pub hi: BytePos,

    /// Macro expansion context
    pub expn_info: Option<P<ExpnInfo>>
}

pub struct Spanned<T> {
    pub node: T,
    pub span: Span,
}

Tracking spans

use syntax::codemap::{Span, Spanned, spanned};
use syntax::parse::parser::Parser;

fn parse_spanned_ident(parser: &mut Parser)
    -> Spanned<Ident>
{
    let lo = parser.span.lo;
    let ident = parser.parse_ident();
    let hi = parser.last_span.hi;
    spanned(lo, hi, ident)
}
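
The same lo/hi bookkeeping, sketched without the compiler's `syntax` crate. The `Parser`, `Span`, and `Spanned` types here are simplified stand-ins (plain byte offsets, ASCII identifiers only), not rustc's.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span { lo: usize, hi: usize }

#[derive(Debug, PartialEq)]
struct Spanned<T> { node: T, span: Span }

/// Hypothetical parser over a source string, tracking a byte position.
struct Parser<'a> { src: &'a str, pos: usize }

impl<'a> Parser<'a> {
    /// Skip spaces, read one ASCII identifier, and record where it was found.
    fn parse_spanned_ident(&mut self) -> Option<Spanned<&'a str>> {
        let src: &'a str = self.src;
        let bytes = src.as_bytes();
        while self.pos < bytes.len() && bytes[self.pos] == b' ' {
            self.pos += 1;
        }
        let lo = self.pos; // start of the identifier
        while self.pos < bytes.len()
            && (bytes[self.pos].is_ascii_alphanumeric() || bytes[self.pos] == b'_')
        {
            self.pos += 1;
        }
        let hi = self.pos; // one past the last byte consumed
        if lo == hi {
            return None;
        }
        Some(Spanned { node: &src[lo..hi], span: Span { lo, hi } })
    }
}
```

Recording `lo` before parsing and `hi` after is what lets a later error message point at exactly the offending tokens.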

Throwing compiler errors

macro_rules! bail (
    ($cx:expr, $span:expr, $msg:expr) => ({
        $cx.span_err($span, $msg);
        return ::syntax::ext::base::DummyResult::any($span);
    })
)

macro_rules! bail_if (
    ($e:expr, $cx:expr, $span:expr, $msg:expr) => (
        if $e { bail!($cx, $span, $msg) }
    )
)
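
A sketch of the same pattern in current Rust, outside the compiler. The `Ctx` type, the `usize` stand-in for a span, the `count_arms` function, and returning `None` in place of `DummyResult::any` are all assumptions made for illustration.

```rust
/// Hypothetical error sink standing in for the macro expansion context.
struct Ctx { errors: Vec<(usize, String)> }

impl Ctx {
    fn span_err(&mut self, span: usize, msg: &str) {
        self.errors.push((span, msg.to_string()));
    }
}

macro_rules! bail {
    ($cx:expr, $span:expr, $msg:expr) => {{
        $cx.span_err($span, $msg);
        return None; // assumed: the caller returns Option<_>
    }};
}

macro_rules! bail_if {
    ($e:expr, $cx:expr, $span:expr, $msg:expr) => {
        if $e { bail!($cx, $span, $msg) }
    };
}

/// Example validation step: report an error and stop if there are no arms.
fn count_arms(cx: &mut Ctx, span: usize, arms: &[&str]) -> Option<usize> {
    bail_if!(arms.is_empty(), cx, span, "expected at least one arm");
    Some(arms.len())
}
```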

Validating macro input

match (lhs.node, rhs.node) {
    (Pat(pat), Expr(expr)) => {
        bail_if!(!wildcards.is_empty(), cx, lhs.span,
            "ordinary patterns may not appear after \
             wildcard tags");

This check guarantees that arms keep the semantics of in-order matching

src/tree_builder/rules.rs:100:17: 100:48
  error: ordinary patterns may not appear after wildcard tags
src/tree_builder/rules.rs:100
  CharacterTokens(NotSplit, text) => SplitWhitespace(text),
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
error: aborting due to previous error
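
The ordering rule behind that error can be sketched as a standalone check. The `Lhs` type below is a simplified stand-in (strings instead of AST nodes, `None` for a wildcard tag), and `validate` is a hypothetical helper that returns the index of the first offending arm.

```rust
/// Simplified left-hand side of a match arm.
enum Lhs {
    Pat(&'static str),
    Tags(Vec<Option<&'static str>>), // None = wildcard tag
}

/// Reject any ordinary pattern that appears after a wildcard tag.
fn validate(arms: &[Lhs]) -> Result<(), usize> {
    let mut seen_wildcard = false;
    for (i, lhs) in arms.iter().enumerate() {
        match lhs {
            Lhs::Pat(_) if seen_wildcard => return Err(i),
            Lhs::Pat(_) => {}
            Lhs::Tags(tags) => {
                if tags.iter().any(|t| t.is_none()) {
                    seen_wildcard = true;
                }
            }
        }
    }
    Ok(())
}
```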

My favorite rule in the spec

An end tag whose tag name is "sarcasm"
Take a deep breath, then act as described in the "any other end tag" entry below.