Keegan McAllister
November 6, 2014
Available at kmcallister.github.io
Servo is an experimental browser engine from Mozilla Research
Developed by a dozen Mozilla employees + hundreds of others
Layout code is all-new and written in Rust
C libs replaced in Rust:
Faster, cleaner, safer, more correct
Future: JS engine, rasterizer?
What is the HTML syntax? Depends who you ask!
Which one is relevant for real browsers and content?
- A start tag whose tag name is "nobr"
Reconstruct the active formatting elements, if any.
If the stack of open elements has a nobr element in scope, then this is a parse error; run the adoption agency algorithm for the tag name "nobr", then once again reconstruct the active formatting elements, if any.
Insert an HTML element for the token. Push onto the list of active formatting elements that element.
When the steps below require the UA to generate implied end tags, then, while the current node is a dd element, a dt element, an li element, …
When the steps below require the UA to generate all implied end tags thoroughly, then, while the current node is a caption element, a colgroup element, a dd element, …
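A rough sketch of how such a rule might look in code, assuming the stack of open elements is a plain `Vec<String>` of tag names. The function name and constants are hypothetical, and the lists hold only the names visible on the slide; the spec's lists are longer.

```rust
/// Only the names shown on the slide; the spec's list continues.
pub const IMPLIED_PARTIAL: &[&str] = &["dd", "dt", "li"];

/// Likewise partial: the "thoroughly" variant as far as the slide shows it.
pub const IMPLIED_THOROUGH_PARTIAL: &[&str] = &["caption", "colgroup", "dd"];

/// Pop elements off the stack while the current node (the top of the
/// stack) is one of `names`.
pub fn generate_implied_end_tags(open_elements: &mut Vec<String>, names: &[&str]) {
    while open_elements
        .last()
        .map_or(false, |top| names.contains(&top.as_str()))
    {
        open_elements.pop();
    }
}
```

Passing the list as a parameter lets one loop serve both the ordinary and the "thoroughly" variants.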
The upside: Any crap HTML (even 1996 GeoCities) will parse the same in every modern browser
<kmc> should I be scared when the WHATWG spec says "for historical reasons"? because I feel like that phrase already applies to the entire document
<Ms2ger> Correct
<Ms2ger> That just means "for historical reasons we dislike particularly"
html5ever is Servo's new HTML parser, written mostly by me over the course of about 7 months
We now have 8 contributors and several users!
Fast, safe, generic, native UTF-8
Rust and C APIs available
Factor the problem into:
Bonus: code looks like the spec!
12.2.4.1 Data state
Consume the next input character:
- U+0026 AMPERSAND (&)
- Switch to the character reference in data state.
- U+003C LESS-THAN SIGN (<)
- Switch to the tag open state.
- U+0000 NULL
- Parse error. Emit the current input character as a character token.
- EOF
- Emit an end-of-file token.
- Anything else
- Emit the current input character as a character token.
    match self.state {
        states::Data => loop {
            match get_char!(self) {
                '&'  => go!(self: consume_char_ref),
                '<'  => go!(self: to TagOpen),
                '\0' => go!(self: error; emit '\0'),
                c    => go!(self: emit c),
            }
        },
In Hubbub this is about 150 lines of C.
In html5lib it's 20 lines of Python.
    // Feed more input
    fn feed(&mut self, input: String);

    // Get next character, if available
    fn get_char(&mut self) -> Option<char>;

    // true => made progress
    fn step(&mut self) -> bool;

    fn run(&mut self) {
        while self.step() { }
    }
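A minimal sketch of that interface in present-day Rust, with the Data state written out by hand instead of via macros. The `Token` and `State` enums, the `VecDeque` buffer, and the choice to stop in unimplemented states are assumptions for illustration, not html5ever's actual types. EOF handling is left out.

```rust
use std::collections::VecDeque;

/// Toy token type (hypothetical, far smaller than a real tokenizer's).
#[derive(Debug, PartialEq)]
pub enum Token {
    Char(char),
    ParseError,
}

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum State {
    Data,
    TagOpen,
    CharRefInData,
}

pub struct Tokenizer {
    input: VecDeque<char>,
    pub state: State,
    pub tokens: Vec<Token>,
}

impl Tokenizer {
    pub fn new() -> Self {
        Tokenizer { input: VecDeque::new(), state: State::Data, tokens: Vec::new() }
    }

    /// Feed more input.
    pub fn feed(&mut self, input: &str) {
        self.input.extend(input.chars());
    }

    /// Get the next character, if available.
    fn get_char(&mut self) -> Option<char> {
        self.input.pop_front()
    }

    /// Returns true if progress was made, false if we ran out of input
    /// (or reached a state this sketch does not implement).
    pub fn step(&mut self) -> bool {
        match self.state {
            State::Data => loop {
                let c = match self.get_char() {
                    Some(c) => c,
                    None => return false,
                };
                match c {
                    '&' => { self.state = State::CharRefInData; return true; }
                    '<' => { self.state = State::TagOpen; return true; }
                    '\0' => {
                        self.tokens.push(Token::ParseError);
                        self.tokens.push(Token::Char('\0'));
                    }
                    c => self.tokens.push(Token::Char(c)),
                }
            },
            // Other states are omitted here.
            State::TagOpen | State::CharRefInData => false,
        }
    }

    pub fn run(&mut self) {
        while self.step() {}
    }
}
```

Because `step` returns `false` when input runs out, a caller can `feed` another chunk and call `run` again to resume.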
get_char!
    macro_rules! unwrap_or_return (
        ($opt:expr, $retval:expr) => (
            match $opt {
                None => return $retval,
                Some(x) => x,
            }
        )
    )

    macro_rules! get_char (
        ($me:expr) => (
            unwrap_or_return!($me.get_char(), false)
        )
    )
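The same early-return trick in current `macro_rules!` syntax, as a sketch. `Input` and `take_pair` are hypothetical stand-ins so the macros have something to act on.

```rust
macro_rules! unwrap_or_return {
    ($opt:expr, $retval:expr) => {
        match $opt {
            None => return $retval,
            Some(x) => x,
        }
    };
}

macro_rules! get_char {
    ($me:expr) => {
        unwrap_or_return!($me.get_char(), false)
    };
}

/// Hypothetical character source.
pub struct Input {
    chars: Vec<char>,
    pos: usize,
}

impl Input {
    pub fn new(s: &str) -> Self {
        Input { chars: s.chars().collect(), pos: 0 }
    }

    fn get_char(&mut self) -> Option<char> {
        let c = self.chars.get(self.pos).copied();
        if c.is_some() {
            self.pos += 1;
        }
        c
    }
}

/// Reads two characters; returns false from the middle of the function
/// as soon as input runs out, without pushing a partial pair.
pub fn take_pair(input: &mut Input, out: &mut Vec<(char, char)>) -> bool {
    let a = get_char!(input);
    let b = get_char!(input);
    out.push((a, b));
    true
}
```

The `return` inside the macro exits the function that invokes it, which is what lets `get_char!` hide the "out of input" path at every call site.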
    macro_rules! shorthand (
        ($me:expr: emit $c:expr) => ( $me.emit_char($c); );
        // 22 more of these
    )
Allows for compact sequencing:
go!(self: error; create_doctype; force_quirks; emit_doctype; to Data)
go! is used over 200 times.
A pattern like $($cmd:tt)* ; $($rest:tt)* is ambiguous :(
    macro_rules! go (
        ($me:expr: $a:tt ; $($rest:tt)* ) => ({
            shorthand!($me: $a);
            go!($me: $($rest)*);
        });
        ($me:expr: $a:tt $b:tt ; $($rest:tt)* ) => ({
            shorthand!($me: $a $b);
            go!($me: $($rest)*);
        });
        ($me:expr: $a:tt $b:tt $c:tt ; $($rest:tt)* ) => ({
            shorthand!($me: $a $b $c);
            go!($me: $($rest)*);
        });

        // Can only come at the end
        ($me:expr: to $s:ident) => ({
            $me.state = states::$s;
            return true;
        });

        // Base cases
        ($me:expr: $($cmd:tt)+ ) => ( shorthand!($me: $($cmd)+); );
        ($me:expr: ) => (());
    )
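A cut-down version of this command-sequencing macro that compiles on current Rust, as a sketch. Two departures from the slide: `$me` is matched as `ident` because today's `macro_rules!` does not allow `:` after an `expr` fragment, and only one- and two-token commands are handled. `Tok`, its methods, and the two-variant `State` are hypothetical.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum State {
    Data,
    TagOpen,
}

pub struct Tok {
    pub state: State,
    pub log: Vec<String>,
}

macro_rules! shorthand {
    ($me:ident : emit $c:expr) => { $me.emit_char($c) };
    ($me:ident : error) => { $me.parse_error() };
}

macro_rules! go {
    // A one- or two-token command, then `;`, then more commands.
    ($me:ident : $a:tt ; $($rest:tt)*) => ({
        shorthand!($me: $a);
        go!($me: $($rest)*);
    });
    ($me:ident : $a:tt $b:tt ; $($rest:tt)*) => ({
        shorthand!($me: $a $b);
        go!($me: $($rest)*);
    });
    // Can only come at the end: switch state and report progress.
    ($me:ident : to $s:ident) => ({
        $me.state = State::$s;
        return true;
    });
    // Base cases.
    ($me:ident : $($cmd:tt)+) => ({ shorthand!($me: $($cmd)+); });
    ($me:ident : ) => (());
}

impl Tok {
    fn emit_char(&mut self, c: char) {
        self.log.push(format!("emit {}", c));
    }

    fn parse_error(&mut self) {
        self.log.push("error".to_string());
    }

    /// Handles one character; returns true only if a `to` command ran.
    pub fn step(&mut self, c: char) -> bool {
        match c {
            '<' => go!(self: error; emit '<'; to TagOpen),
            c => go!(self: emit c),
        }
        false
    }
}
```

Each rule peels a fixed number of tokens off the front before the `;`, which sidesteps the ambiguity of a `$($cmd:tt)* ; $($rest:tt)*` pattern.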
We're already stretching the limits of macro_rules! and we haven't touched tree construction…
Procedural macros run arbitrary Rust code at compile time, using rustc's plugin infrastructure
See doc.rust-lang.org/guide-plugin.html
&lt; parses as "<"
&oint; parses as "∮"
WHATWG publishes about 2,000 of these as JSON
pub static NAMED_ENTITIES: PhfMap<&'static str, [u32, ..2]> = named_entities!("data/entities.json");
named_entities! macro

    let map: HashMap<String, [u32, ..2]> = ...;

    let toks: Vec<_> = map.into_iter().flat_map(
        |(k, [c0, c1])| {
            let k = k.as_slice();
            (quote_tokens!(&mut *cx,
                $k => [$c0, $c1],
            )).into_iter()
        }
    ).collect();

    MacExpr::new(quote_expr!(&mut *cx,
        phf_map!($toks)
    ))
We use another procedural macro, from sfackler's rust-phf library, to generate a perfect hash table at compile time.
phf_map!(k => v, k => v, ...)
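To show the shape of the table and a lookup, here is a runtime stand-in built with `std::collections::HashMap` rather than a compile-time perfect hash. The three entries, the key format (name without the leading `&`), and the convention that `0` marks an unused second code point are assumptions for illustration.

```rust
use std::collections::HashMap;

/// A tiny hand-written subset; the real table is generated from JSON.
pub fn named_entities() -> HashMap<&'static str, [u32; 2]> {
    let mut m = HashMap::new();
    m.insert("lt;", [0x3C, 0]);
    m.insert("amp;", [0x26, 0]);
    // Some entities expand to two code points.
    m.insert("NotEqualTilde;", [0x2242, 0x0338]);
    m
}

/// Expand a named entity to its replacement text, if the name is known.
pub fn expand(table: &HashMap<&'static str, [u32; 2]>, name: &str) -> Option<String> {
    let code_points = table.get(name)?;
    let mut out = String::new();
    for &cp in code_points.iter() {
        if cp != 0 {
            out.push(char::from_u32(cp)?);
        }
    }
    Some(out)
}
```

A perfect hash built at compile time gives the same lookup interface without the startup cost of filling the map.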
Tree builder has its own rules, less regular in form than the tokenizer.
Instead of match + go!, we'll need a procedural macro.
match_token!
    match mode {
        InBody => match_token!(token {
            tag @ </a> </b> </big> </code> </em> </font> </i> </nobr>
                  </s> </small> </strike> </strong> </tt> </u> => {
                self.adoption_agency(tag.name);
                Done
            }

            tag @ <h1> <h2> <h3> <h4> <h5> <h6> => {
                self.close_p_element_in_button_scope();
                if self.current_node_in(heading_tag) {
                    // ...
    struct Tag {
        kind: TagKind,
        name: Option<TagName>,  // None for wild
    }

    /// Left-hand side of a pattern-match arm.
    enum LHS {
        Pat(P<ast::Pat>),
        Tags(Vec<Spanned<Tag>>),
    }
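For intuition, here is roughly what the two arms above could look like if written by hand as an ordinary `match`. This is a guess at the flavor of the lowering, not the macro's actual expansion; `Token`, `Tag`, `TagKind`, and `Action` are simplified hypothetical types.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum TagKind {
    StartTag,
    EndTag,
}

#[derive(Debug, PartialEq, Clone)]
pub struct Tag {
    pub kind: TagKind,
    pub name: String,
}

#[derive(Debug, PartialEq, Clone)]
pub enum Token {
    TagToken(Tag),
    CharacterTokens(String),
}

/// What the tree builder would do next (stand-in for the real actions).
#[derive(Debug, PartialEq)]
pub enum Action {
    AdoptionAgency(String),
    Heading(String),
    Other,
}

pub fn in_body(token: Token) -> Action {
    match token {
        // tag @ </a> </b> </big> ... </u>
        Token::TagToken(Tag { kind: TagKind::EndTag, name })
            if matches!(
                name.as_str(),
                "a" | "b" | "big" | "code" | "em" | "font" | "i" | "nobr"
                    | "s" | "small" | "strike" | "strong" | "tt" | "u"
            ) =>
        {
            Action::AdoptionAgency(name)
        }
        // tag @ <h1> ... <h6>
        Token::TagToken(Tag { kind: TagKind::StartTag, name })
            if matches!(name.as_str(), "h1" | "h2" | "h3" | "h4" | "h5" | "h6") =>
        {
            Action::Heading(name)
        }
        _ => Action::Other,
    }
}
```

Writing every rule this way is noisy, which is the motivation for the tag-literal syntax.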
In syntax::codemap you will find

    pub struct Span {
        pub lo: BytePos,
        pub hi: BytePos,
        /// Macro expansion context
        pub expn_info: Option<P<ExpnInfo>>
    }

    pub struct Spanned<T> {
        pub node: T,
        pub span: Span,
    }
    use syntax::codemap::{Span, Spanned, spanned};
    use syntax::parse::parser::Parser;

    fn parse_spanned_ident(parser: &mut Parser) -> Spanned<Ident> {
        let lo = parser.span.lo;
        let ident = parser.parse_ident();
        let hi = parser.last_span.hi;
        spanned(lo, hi, ident)
    }
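The same idea without compiler internals, as a self-contained sketch: record the byte offset before and after parsing, and attach the pair to the result. The `Span` and `Spanned` types here are local simplifications, and the string-based `parse_spanned_ident` is a hypothetical stand-in for the `Parser` method calls.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Span {
    pub lo: usize,
    pub hi: usize,
}

#[derive(Debug, PartialEq)]
pub struct Spanned<T> {
    pub node: T,
    pub span: Span,
}

/// Parse an identifier (alphanumerics and `_`) starting at byte offset
/// `start`, returning it together with the byte range it occupied.
pub fn parse_spanned_ident(src: &str, start: usize) -> Option<Spanned<&str>> {
    let rest = src.get(start..)?;
    let len = rest
        .char_indices()
        .find(|&(_, c)| !(c.is_alphanumeric() || c == '_'))
        .map(|(i, _)| i)
        .unwrap_or(rest.len());
    if len == 0 {
        return None;
    }
    Some(Spanned {
        node: &rest[..len],
        span: Span { lo: start, hi: start + len },
    })
}
```

Keeping the span next to the node is what later lets an error message point at the exact source range.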
    macro_rules! bail (
        ($cx:expr, $span:expr, $msg:expr) => ({
            $cx.span_err($span, $msg);
            return ::syntax::ext::base::DummyResult::any($span);
        })
    )

    macro_rules! bail_if (
        ($e:expr, $cx:expr, $span:expr, $msg:expr) => (
            if $e { bail!($cx, $span, $msg) }
        )
    )
    match (lhs.node, rhs.node) {
        (Pat(pat), Expr(expr)) => {
            bail_if!(!wildcards.is_empty(), cx, lhs.span,
                "ordinary patterns may not appear after \
                 wildcard tags");
Do this to guarantee the semantics of in-order matching
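The ordering check can be sketched on its own, outside the compiler. `Lhs` below is a simplified hypothetical form of the arm left-hand side, and the function returns a `Result` with the arm index instead of reporting through a span.

```rust
/// Simplified left-hand side of a match arm.
#[derive(Debug, Clone, PartialEq)]
pub enum Lhs {
    /// An ordinary Rust pattern, kept as source text.
    Pat(String),
    /// A list of tag patterns; `wildcard` is true if any tag is wild.
    Tags { names: Vec<String>, wildcard: bool },
}

/// Reject any ordinary pattern that appears after a wildcard tag arm.
pub fn check_arm_order(arms: &[Lhs]) -> Result<(), String> {
    let mut seen_wildcard = false;
    for (i, arm) in arms.iter().enumerate() {
        match arm {
            Lhs::Tags { wildcard, .. } => seen_wildcard |= *wildcard,
            Lhs::Pat(_) if seen_wildcard => {
                return Err(format!(
                    "arm {}: ordinary patterns may not appear after wildcard tags",
                    i
                ));
            }
            Lhs::Pat(_) => {}
        }
    }
    Ok(())
}
```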
    src/tree_builder/rules.rs:100:17: 100:48 error: ordinary patterns may not appear after wildcard tags
    src/tree_builder/rules.rs:100                 CharacterTokens(NotSplit, text) => SplitWhitespace(text),
                                                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    error: aborting due to previous error
- An end tag whose tag name is "sarcasm"
- Take a deep breath, then act as described in the "any other end tag" entry below.