About Tokens and Lexemes

description

A parser is an integral part of building a domain-specific language or a file format reader, such as our example use case: the iCal format. This session covers the general concepts of tokenizing and parsing into a data structure, and goes into depth on how to keep the memory footprint and runtime low with the help of a stream tokenizer.

Transcript of About Tokens and Lexemes

About tokens and lexemes

Ben Scholzen, Game Developer, Gameforge Productions GmbH

What we'll cover

  • Definition of a compiler, tokenizer and parser
  • Basic structure of a tokenizer and a parser
  • Where to optimize things for PHP

What about parser generators?

  • They are evil!
      • PHP_LexerGenerator, PHP_ParserGenerator, lemon-PHP
      • Create lots of function calls, like lemon parsers in C
      • Perform poorly
      • Will eat up all your memory
  • Conclusion
      • Don't use them!

Let's get started

What a compiler is and how it works

  • Acts as frontend for the application
  • Converts human-readable data into machine-readable data
  • Consists of two components:
      • The lexer:
          • Is a finite-state machine
          • Reads the input stream
          • Cleans up the input data
          • Creates a list of tokens
      • The parser:
          • Gets tokens from the tokenizer
          • Converts them into a data structure

What a compiler is and how it works

[Diagram of the lexer and parser pipeline; labels: Lexer, Parser, Tokens, Document, Stream, Structure]

Sounds great, but where do I need it?

  • Formatting languages
      • BB-Code
      • Wiki-Codes
  • Description languages
      • iCalendar / vCalendar
      • XML
  • Even programming languages
      • JavaScript
      • PHP
  • Anything else you want your program to understand

The lexer (or tokenizer)

What are tokens?

  • Categorized block of text
      • Token type
      • Corresponding block of text (lexeme)
  • List of tokens represents an entire document
  • Example in PHP: $value = 5 * 7 ;

How the tokenizer works

  • Define possible states of the lexer
  • Tokenize the input in a loop
      • Scan with preg_match()
          • strtok() is mostly too simple
          • Reading char-by-char is too slow
          • Use the offset parameter
          • Use the \G assertion (^ won't work, because it still refers to the start of the whole string)
      • Always store the current position
      • Use either a switch statement or a structured array
  • Return the tokens

What we can optimize

  • Use little memory
      • Always read only part of the document into memory
          • Via fopen() and fgets()
          • Requires previous knowledge about when tokens end
      • Offer a method for the parser to get a partial bunch of tokens
  • Speed up execution time
      • Avoid internal function calls where possible
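The points on these two slides can be combined into one rough sketch. It is not the code from the talk: it assumes PHP 7.1 or later, uses a flat pattern array instead of explicit lexer states, and assumes that no token spans a line break, which is the kind of previous knowledge about token boundaries mentioned above. The function name tokenizeStream() and the three token patterns are made up for illustration.

```php
<?php
/**
 * Hypothetical stream tokenizer for input such as "$value = 5 * 7 ;".
 * Assumption: a token never spans a line break, so every line read by
 * fgets() can be tokenized on its own.
 */
function tokenizeStream($handle): Generator
{
    // Structured array: token type => pattern.
    // \G anchors the match at the offset passed to preg_match().
    $patterns = [
        'variable' => '/\G\$[A-Za-z_][A-Za-z0-9_]*/',
        'number'   => '/\G[0-9]+/',
        'operator' => '/\G[=*+\-\/;]/',
    ];

    // Only one line of the document is held in memory at a time.
    while (($line = fgets($handle)) !== false) {
        $offset = 0;                 // current position within the line
        $length = strlen($line);

        while ($offset < $length) {
            // Skip whitespace between tokens.
            if (preg_match('/\G\s+/', $line, $match, 0, $offset)) {
                $offset += strlen($match[0]);
                continue;
            }

            foreach ($patterns as $type => $pattern) {
                if (preg_match($pattern, $line, $match, 0, $offset)) {
                    yield [$type, $match[0]];   // token type and lexeme
                    $offset += strlen($match[0]);
                    continue 2;                 // back to the inner while loop
                }
            }

            throw new RuntimeException("Unexpected input at offset $offset");
        }
    }
}

// Usage: feed the example statement through an in-memory stream.
$handle = fopen('php://memory', 'r+');
fwrite($handle, "\$value = 5 * 7 ;\n");
rewind($handle);

$tokens = iterator_to_array(tokenizeStream($handle), false);
fclose($handle);
```

Because the function is a generator, a parser can pull tokens one at a time with foreach instead of waiting for the complete list, so neither the whole document nor the whole token list has to sit in memory.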

Going into practice

The beginning

  • Use little memory
      • Via fopen() and fread()
      • Requires previous knowledge about when tokens end
      • Offer a method for the parser to get a partial bunch of tokens
  • Speed up execution time
      • Avoid internal function calls where possible

Throwing in a file

Preparing stuff

Base state

Operator state

Value state

Rounding it up

Some actual testing

And what we get

    array(6) {
      [0]=>
      array(2) {
        [0]=>
        string(8) "variable"
        [1]=>
        string(6) "$value"
      }
      [1]=>
      array(2) {
        [0]=>
        string(8) "operator"
        [1]=>
        string(1) "="
      }
      [2]=>
      array(2) {
        [0]=>
        string(6) "number"
        [1]=>
        string(1) "5"
      }
      [3]=>
      array(2) {
        [0]=>
        string(8) "operator"
        [1]=>
        string(1) "*"
      }
      [4]=>
      array(2) {
        [0]=>
        string(6) "number"
        [1]=>
        string(1) "7"
      }
      [5]=>
      array(2) {
        [0]=>
        string(8) "operator"
        [1]=>
        string(1) ";"
      }
    }

The parser

So we have a bunch of tokens, what now?

  • Loop through the tokens and analyze them
  • Create an object-oriented tree structure or interpret
  • Avoid non-tail recursion
      • Use tail recursion (trampoline) instead
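A trampoline can be sketched roughly as follows. This is an illustration, not code from the talk: the names trampoline() and multiplyNumbers() are hypothetical, the toy interpreter does nothing more than multiply the number tokens it sees, and PHP 7.1 or later is assumed. Each step returns either the final result or a closure for the next step, and a plain loop invokes those closures one after another, so the call stack does not grow with the number of tokens.

```php
<?php
/**
 * Hypothetical helper: run a step and keep calling the result
 * for as long as it is another closure.
 */
function trampoline(callable $step)
{
    $result = $step();
    while ($result instanceof Closure) {
        $result = $result();
    }
    return $result;
}

/**
 * Toy interpreter written in tail-recursive style: it multiplies all
 * "number" tokens and ignores everything else. Instead of calling itself
 * directly, it returns a closure that performs the next step.
 */
function multiplyNumbers(array $tokens, int $index = 0, int $product = 1)
{
    if ($index >= count($tokens)) {
        return $product;             // final result, the trampoline stops here
    }

    [$type, $lexeme] = $tokens[$index];
    if ($type === 'number') {
        $product *= (int) $lexeme;
    }

    return function () use ($tokens, $index, $product) {
        return multiplyNumbers($tokens, $index + 1, $product);
    };
}

// The token list for "$value = 5 * 7 ;".
$tokens = [
    ['variable', '$value'],
    ['operator', '='],
    ['number', '5'],
    ['operator', '*'],
    ['number', '7'],
    ['operator', ';'],
];

$result = trampoline(function () use ($tokens) {
    return multiplyNumbers($tokens);
});
// $result is 35
```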

Tail recursion via a trampoline saves you from hitting the stack limit.

That's it!

Summary

Questions?

Where to go from here

  • Wikipedia:
      • http://en.wikipedia.org/wiki/Compiler
      • http://en.wikipedia.org/wiki/Parsing
  • About tail-recursion in PHP: http://www.alternateinterior.com/2006/09/tail-recursion-in-php.html
  • My blog: http://www.dasprids.de
  • Rate this talk: http://joind.in/635
  • Follow me on twitter: http://www.twitter.com/dasprid

Thank you!