Lexing and parsing

Post on 28-Jun-2015


Beginner's guide to lexing and parsing for PHP developers, given at ZendCon 2014

Transcript of Lexing and parsing

LEXING AND PARSING: THE BEGINNER’S GUIDE

WHY ARE WE DOING THIS?

• bbcode

• html

• xml

• programming language

BUT I CAN JUST REGEX

• sometimes you can

• sometimes you can’t

• is your html well formed? (view source some time)

• it depends!!
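A minimal sketch of the "sometimes you can't" case, in Python rather than PHP, using a made-up bbcode string. A regex handles the flat input but pairs the wrong tags once they nest; a simple depth counter (a first step toward a real parser) handles both. The sample strings and the `is_balanced` helper are illustrative assumptions, not part of the talk.

```python
import re

# Hypothetical bbcode snippets, chosen only for illustration.
flat = "[b]bold[/b] plain"
nested = "[b]outer [b]inner[/b] tail[/b]"

# A non-greedy regex pairs each opener with the nearest closer.
BOLD = re.compile(r"\[b\](.*?)\[/b\]")

flat_matches = BOLD.findall(flat)      # works for flat input
nested_matches = BOLD.findall(nested)  # outer opener gets paired with the inner closer


def is_balanced(text: str) -> bool:
    """Check [b]...[/b] nesting with a depth counter instead of one regex."""
    depth = 0
    for tag in re.findall(r"\[/?b\]", text):
        depth += -1 if tag.startswith("[/") else 1
        if depth < 0:              # a closer with no matching opener
            return False
    return depth == 0
```

Here `nested_matches` comes back as `['outer [b]inner']`, which is not what the author of the markup meant.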

CHOMSKY HIERARCHY

COMPUTER SCIENCE: WE LIKE ACRONYMS AND WEIRD WORDS

ENGLISH IS HARD!

• tokenizer

• scanner

• lexer

• parser

• lexical analyzer

• syntactic analyzer

• formal grammar

LEXICAL ANALYSIS: BREAK DOWN INPUT INTO A SEQUENCE OF TOKENS

LEXING

SCANNING

• Finite State Machine

• Finds Lexemes

• Might backtrack

FINITE STATE MACHINE

EVALUATOR

• looks at lexeme to get value

• lexeme + value = token
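The scanner/evaluator split can be sketched roughly as follows. This is Python rather than PHP, and the state names, token kinds and the `scan`/`evaluate` functions are all invented for illustration. The scanner is a tiny hand-written finite state machine that finds lexemes; the evaluator then attaches a value to each lexeme to make a token.

```python
def scan(source):
    """Scanner: a small finite state machine that yields (kind, lexeme) pairs."""
    state, start, i = "START", 0, 0
    while i <= len(source):
        ch = source[i] if i < len(source) else ""   # "" marks end of input
        if state == "START":
            if ch == "":
                break
            if ch.isspace():
                i += 1
            elif ch.isdigit():
                state, start = "NUMBER", i
                i += 1
            elif ch == "$":
                state, start = "VARIABLE", i
                i += 1
            else:
                yield ("CHAR", ch)
                i += 1
        elif state == "NUMBER":
            if ch.isdigit():
                i += 1
            else:
                yield ("NUMBER", source[start:i])
                state = "START"      # re-examine ch from the START state
        elif state == "VARIABLE":
            if ch.isalnum() or ch == "_":
                i += 1
            else:
                yield ("VARIABLE", source[start:i])
                state = "START"


def evaluate(kind, lexeme):
    """Evaluator: look at the lexeme to get its value; the pair is the token."""
    value = int(lexeme) if kind == "NUMBER" else lexeme
    return (kind, value)


tokens = [evaluate(kind, lexeme) for kind, lexeme in scan("$y = 5;")]
```

Note that the number token carries the integer `5`, not the string `"5"`: that conversion is the evaluator's whole job.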

LEXING PHP - $y = 5;

• $y

• array[309, ‘$y’, 1],

• =

• =

• 5

• array[305, 5, 1]

• 309 == T_VARIABLE

• 305 == T_LNUMBER
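For comparison only: Python ships a standard-library tokenizer that plays a role loosely similar to PHP's token_get_all. The sketch below is an analogue, not the PHP API; it lexes the Python statement `y = 5` and shows the same idea of a numeric token type that maps to a readable name, plus the lexeme and line number.

```python
import io
import token
import tokenize

source = "y = 5\n"

# Each row mirrors the [type, text, line] shape from the slide,
# with the numeric type already translated to its name.
rows = [
    (token.tok_name[tok.type], tok.string, tok.start[0])
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]
```

The first three rows are the name, the operator and the number; the tokenizer then adds its own end-of-line and end-of-input markers.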

LEXER GENERATORS: DO NOT WRITE THIS BY HAND

Famous

• lex

• flex

• re2c

• ANTLR

• DFASTAR

• jflex

• jlex

• quex

PHP generators

• https://github.com/oliverheins/PHPSimpleLexYacc

• lex syntax

• https://github.com/pear/PHP_LexerGenerator

• re2c syntax

• https://github.com/wez/JLexPHP

• jlex syntax

• token_get_all (see php-parser)

• parse_ini_file/string (combined with parser)

RE2C

IN PHP LAND

SYNTACTIC ANALYSIS: CONSTRUCTING SOMETHING BASED ON A GRAMMAR

PARSING

THE PARSING PROCESS

• Tokens come in

• Magic

• Data structure comes out

• parse tree

• AST

GRAMMAR (FORMAL OF COURSE)

• “Brave men run in my family.”

• I can't recommend this book too highly.

• Prostitutes Appeal to Pope

• I had had my car for four years before I ever learned to drive it.

TYPES OF PARSERS

• Top Down

• Recursive Descent

• LL (left to right, leftmost derivation)

• Earley parser

• Bottom Up

• Precedence parser

• Operator-precedence parser

• Simple precedence parser

• BC (bounded context) parsing

• LR parser (Left-to-right, Rightmost derivation)

• Simple LR (SLR) parser

• LALR parser

• Canonical LR (LR(1)) parser

• GLR parser

• CYK parser

• Recursive ascent parser
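Of the styles listed, recursive descent is the easiest to write by hand, so here is a small sketch of one in Python (not PHP). The grammar, the `Parser` class and the tuple-based tree format are assumptions made up for this example: each grammar rule becomes one method, and the methods call each other top down.

```python
import re


def tokenize(text):
    """Split an arithmetic expression into number and operator tokens."""
    return re.findall(r"\d+|[-+*/()]", text)


class Parser:
    """Recursive descent: one method per grammar rule.

    expr   := term (('+' | '-') term)*
    term   := factor (('*' | '/') factor)*
    factor := NUMBER | '(' expr ')'
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            node = (self.advance(), node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            node = (self.advance(), node, self.factor())
        return node

    def factor(self):
        tok = self.advance()
        if tok == "(":
            node = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return int(tok)


tree = Parser(tokenize("1 + 2 * (3 - 4)")).expr()
```

The sketch does not check for leftover tokens after the expression; a real parser would.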

SENTENCE DIAGRAMMING

• People who live in glass houses shouldn't throw stones.

PARSE TREE

TOP DOWN VS. BOTTOM UP PARSING

PARSE TREES

• Constituency-based parse trees

• Dependency-based parse trees

AST

• Not everything appears

• additional information may be applied

• can “improve” tree nodes

• PHP is getting one!
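To make "not everything appears" concrete, here is a rough Python sketch that turns a parse tree into an AST. The grammar, the nested-tuple tree format and the `to_ast` helper are hypothetical, and the helper only handles single-child wrapper rules and one binary operator per rule.

```python
# Parse tree for "(1 + 2)" under a hypothetical grammar:
#   expr := term ('+' term)*
#   term := NUMBER | '(' expr ')'
# Each node is (rule_name, children...); leaves are plain strings.
parse_tree = (
    "expr",
    ("term",
        "(",
        ("expr", ("term", "1"), "+", ("term", "2")),
        ")"),
)

PUNCTUATION = {"(", ")"}


def to_ast(node):
    """Drop punctuation and collapse wrapper rules that add no information."""
    if isinstance(node, str):
        return int(node) if node.isdigit() else node
    children = [to_ast(child) for child in node[1:] if child not in PUNCTUATION]
    if len(children) == 1:
        return children[0]
    left, op, right = children      # lift the operator up to the node itself
    return (op, left, right)


ast = to_ast(parse_tree)
```

The parentheses and the intermediate `expr`/`term` wrappers are gone, and the leaves have been "improved" from strings to integers.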

LALR(K)

• Look ahead prevents “ambiguous” parsing

• I have one token, what token comes next?
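A loose illustration of the lookahead idea, in Python. This is a hand-written check, not an LALR table, and the two-rule grammar and the `classify_statement` function are invented for the example. Both rules start with a name, so the current token alone cannot tell them apart; peeking at the next token can.

```python
def classify_statement(tokens):
    """Pick a grammar rule by looking one token past the current one.

    statement := NAME '=' NUMBER      (assignment)
               | NAME '(' ')'         (call)
    """
    current = tokens[0]
    lookahead = tokens[1] if len(tokens) > 1 else None
    if not current.isidentifier():
        raise SyntaxError(f"expected a name, got {current!r}")
    if lookahead == "=":
        return "assignment"
    if lookahead == "(":
        return "call"
    raise SyntaxError(f"unexpected token after {current!r}: {lookahead!r}")
```

For example, `classify_statement(["y", "=", "5"])` picks the assignment rule and `classify_statement(["f", "(", ")"])` picks the call rule.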

PARSER GENERATORS

Famous

• bison

• bison

• bison

• bison

• yacc

• lemon

• ANTLR

PHP versions

• https://github.com/wez/lemon-php

• https://github.com/pear/PHP_ParserGenerator

• lemon

• https://github.com/scato/phpeg

• peg (peg.js)

• https://github.com/jakubkulhan/pacc

• yacc

BISON

• Generates LALR (or GLR) parsers

• Code in C, C++ or Java

• reentrant with %define api.pure set

• used by ALL THE THINGS

• PHP

• Ruby

• PostgreSQL

• Go

BISON IN C

LEMON

• Generates LALR(1) parser

• reentrant AND thread safe

• non-terminal destructor (leak avoidance)

• push parsing (the tokenizer calls the parser)

• sqlite

PHP LEMON

REENTRANT VS THREAD SAFE

• Process

• Thread

• Locking

• Scope

• Reentrant
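A small Python sketch of why reentrancy matters for lexers and parsers. The function and class names are made up. The first reader keeps its position in shared module-level state, so a second scan started before the first finishes corrupts it; the second keeps all state on the instance.

```python
# Non-reentrant: the scan position lives in module-level state, so starting
# a second scan (nested or concurrent) clobbers the first one.
_position = 0


def next_char_global(text):
    global _position
    ch = text[_position] if _position < len(text) else None
    _position += 1
    return ch


class Scanner:
    """Reentrant: every scan carries its own state, nothing is shared."""

    def __init__(self, text):
        self.text = text
        self.position = 0

    def next_char(self):
        ch = self.text[self.position] if self.position < len(self.text) else None
        self.position += 1
        return ch


# Interleave two scans. The global version leaks its position across inputs.
first_global = next_char_global("ab")    # reads "a", position becomes 1
second_global = next_char_global("xy")   # reads "y", not "x"

a, b = Scanner("ab"), Scanner("xy")
first_local = a.next_char()              # reads "a"
second_local = b.next_char()             # reads "x"
```

Reentrant is not the same as thread safe: two threads can each use their own `Scanner` safely, but sharing a single `Scanner` between threads would still need locking.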

COMPILE IT

• transform a programming language into a lower-level language the computer can run (machine code or bytecode)

INTERPRET IT

• directly executes programming language
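The difference can be sketched in a few lines of Python. The tuple-based AST, the instruction names and the tiny stack machine are all assumptions for illustration. `interpret` walks the tree and produces a result immediately; `compile_to_stack_code` produces instructions that something else (`run`) executes later.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

# AST as nested tuples (op, left, right); leaves are ints. Hypothetical format.
ast = ("+", 1, ("*", 2, 3))


def interpret(node):
    """Interpret: walk the tree and compute the result directly."""
    if isinstance(node, int):
        return node
    op, left, right = node
    return OPS[op](interpret(left), interpret(right))


def compile_to_stack_code(node):
    """Compile: translate the tree into instructions for a tiny stack machine."""
    if isinstance(node, int):
        return [("PUSH", node)]
    op, left, right = node
    return compile_to_stack_code(left) + compile_to_stack_code(right) + [("OP", op)]


def run(code):
    """The 'machine' that executes compiled instructions."""
    stack = []
    for instr, arg in code:
        if instr == "PUSH":
            stack.append(arg)
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(OPS[arg](left, right))
    return stack.pop()


code = compile_to_stack_code(ast)
```

Both paths give the same answer for the same tree; the compiled form can be stored and run again without re-parsing.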

PROFIT

UNDER THE HOOD: WHAT USES THIS STUFF?

PHP: re2c + Bison + these crazy opcodes…

LALR(1) WRITTEN BY HAND: How - pythonic

HHVM: Flex and Bison and JIT – OH MY!

SQLITE: Lemon is tasty!

WRITING PARSERS AND LEXERS: THEORIES OF CODING

STEP 1: THINK SMALL

• Writing a general purpose parser is hard – that’s why you use PHP

• Writing a single purpose parser is much easier

• markup text (markdown)

• configuration or definition files (behat/gherkin syntax)

• complex validation (addresses in multiple formats)

STEP 2: SEPARATE AND UNOPTIMIZED

• premature optimization yada yada

• combine after it’s ready to be used (or not at all if you’ll need to change it later)

• lexer and parser each have unique, well defined goals

• the ability to potentially switch parser styles later will help you!

STEP 3: LEXER

• the lexer's job is to recognize tokens

• it can do this via a giant switch statement of doom

• or maybe a giant loop

• or maybe a list of goto statements

• or maybe a complex class with methods

• …. or you can just use a generator

LET’S BREAK THAT DOWN

1. Define a token format

2. Define grammar format (what are we looking for?)

3. Go over the input data (usually a string) and make matches

1. compare or regex or ctype_* or however it makes sense

4. Keep track of your current state

5. Have an output format – AST, tree, whatever
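The five steps above can be sketched as a table-driven lexer. This is Python rather than PHP, for a made-up bbcode subset; the `Token` type, the pattern table and the `lex` function are illustrative assumptions. Comments map each part back to the numbered steps.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):                    # 1. the token format
    kind: str
    text: str
    line: int


# 2. what we are looking for (a hypothetical bbcode subset)
PATTERNS = [
    ("OPEN", re.compile(r"\[([a-z]+)\]")),
    ("CLOSE", re.compile(r"\[/([a-z]+)\]")),
    ("TEXT", re.compile(r"[^\[]+|\[")),     # plain text, or a stray "["
]


def lex(source):
    pos, line = 0, 1                        # 4. current state
    tokens = []
    while pos < len(source):                # 3. go over the input, making matches
        for kind, pattern in PATTERNS:
            match = pattern.match(source, pos)
            if match:
                text = match.group(1) if kind != "TEXT" else match.group(0)
                tokens.append(Token(kind, text, line))
                line += match.group(0).count("\n")
                pos = match.end()
                break
    return tokens                           # 5. output: a flat list of tokens
```

The `TEXT` pattern always matches at least one character, so the loop always advances. Order in the table matters: tags are tried before plain text.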

STEP 4: PARSER

• Loop over our tokens

• Look at the values and decide what to do
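A minimal sketch of that loop, in Python, for bbcode-style tokens. The `(kind, text)` token format and the dict-based tree are assumptions for this example. The parser walks the tokens once and uses a stack of open tags to decide where each piece goes.

```python
def parse(tokens):
    """Turn a flat token list into a nested tree using a stack of open tags.

    Tokens are (kind, text) pairs, a hypothetical format where kind is
    "OPEN", "CLOSE", or "TEXT".
    """
    root = {"tag": "root", "children": []}
    stack = [root]
    for kind, text in tokens:                 # loop over our tokens
        if kind == "TEXT":                    # look at the value, decide what to do
            stack[-1]["children"].append(text)
        elif kind == "OPEN":
            node = {"tag": text, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif kind == "CLOSE":
            if len(stack) == 1 or stack[-1]["tag"] != text:
                raise SyntaxError(f"unexpected [/{text}]")
            stack.pop()
    if len(stack) > 1:
        raise SyntaxError(f"unclosed [{stack[-1]['tag']}]")
    return root
```

Mismatched or unclosed tags raise an error here; a forgiving parser for user-submitted markup might instead emit them as plain text.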

STEP 5: DO SOMETHING WITH IT!

1. Compile – write out to something that can be run (html)

2. Interpret – run through another program to get output (templates to html)

3. Analyze – run through to analyze the data inside (code analysis/sniffer tools)

4. Validate – check for proper “spelling and grammar”

5. ???

6. PROFIT
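As a rough example of options 1 and 4, here is a Python sketch that validates a small bbcode tree and then writes it out as HTML. The dict-based tree format, the `TAG_MAP` table and both function names are hypothetical.

```python
import html

# Hypothetical mapping from bbcode tags to HTML elements.
TAG_MAP = {"b": "strong", "i": "em"}


def validate(node):
    """Validate: report tags the renderer does not know about."""
    problems = []
    for child in node["children"]:
        if isinstance(child, dict):
            if child["tag"] not in TAG_MAP:
                problems.append(f"unknown tag [{child['tag']}]")
            problems.extend(validate(child))
    return problems


def to_html(node):
    """Compile: write the tree out as an HTML string."""
    parts = []
    for child in node["children"]:
        if isinstance(child, str):
            parts.append(html.escape(child))   # text is escaped, tags are not
        else:
            element = TAG_MAP[child["tag"]]
            parts.append(f"<{element}>{to_html(child)}</{element}>")
    return "".join(parts)


tree = {"tag": "root", "children": [
    {"tag": "b", "children": ["1 < 2 ", {"tag": "i", "children": ["really"]}]},
]}

problems = validate(tree)
output = to_html(tree)
```

Because the renderer works from the tree rather than the raw string, escaping user text and rejecting unknown tags each happen in exactly one place.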

“If you’re not sure how to do a job – ask!”

- silly poster on my laundry room wall

CONTACT ME

• auroraeosrose@gmail.com

• auroraeosrose – freenode.net #phpmentoring #phpwomen

• Twitter - @auroraeosrose

• http://github.com/auroraeosrose