A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of...

27
A simple library for regular expressions Claude Marché June 20, 2002

Transcript of A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of...

Page 1: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

A simplelibrary for regularexpressions

ClaudeMarché

June20,2002

Page 2: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

Contents

1 Intr oduction 3

2 Library documentation 42.1 ModuleRegexp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1.1 Theregexp datatypeandits constructors. . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1.2 Regexp matchingby runtimeinterpretationof regexp . . . . . . . . . . . . . . . . . . . . 52.1.3 Syntaxof regularexpressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.1.4 Compilationof regularexpressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.1.5 Matchingandsearchingfunctions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3 Documentationof implementation 83.1 ModuleHashcons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.2 ModuleRegular expr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.2.1 Theregexp datatypeandits constructors. . . . . . . . . . . . . . . . . . . . . . . . . . . 103.2.2 Simpleregexp operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.2.3 Regexp matchingby runtimeinterpretationof regexp . . . . . . . . . . . . . . . . . . . . 11

3.3 Implementationof moduleRegular expr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.3.1 regexp datatypeandsimplifying constructors . . . . . . . . . . . . . . . . . . . . . . . . 113.3.2 extendedregexp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.3.3 Regexp matchby run-timeinterpretationof regexp . . . . . . . . . . . . . . . . . . . . . 13

3.4 ModuleRegexp syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183.5 Implementationof moduleRegexp syntax. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193.6 ModuleAutomata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.6.1 Automata,compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193.6.2 Executionof automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193.6.3 Matchingandsearchingfunctions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.7 Implementationof moduleAutomata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203.7.1 Thetypeof automata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203.7.2 Executionof automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203.7.3 Searchingfunctions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213.7.4 Compilationof aregexp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223.7.5 Outputfunctions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2

Page 3: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

Chapter 1

Intr oduction

Thislibrary implementssimpleuseof regularexpressions.It providesdirectdefinitionsof regularexpressionsformthe usualconstructions,or definition of suchexpressionsform a syntacticmannerusingGNU regexp syntax. Itprovidesasimplecompilationof regexp into adeterministicautomaton,anduseof suchanautomatonfor matchingandsearching.

3

Page 4: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

Chapter 2

Library documentation

This is thedocumentationfor useof thelibrary. It providesall functionsin onemodule.

2.1 Module Regexp

2.1.1 The regexpdatatypeand its constructors

Thetypeof regularexpressionsis abstract.Regularexpressionsmaybebuilt from thefollowing constructors:

� emptyis theregexp thatdenotesno wordat all.

� epsilonis theregexp thatdenotestheemptyword.

� charc returnsa regexp thatdenotesonly thesingle-characterwordc.

� char interv a b returnsa regexp thatdenotesany single-characterwordbelongingto charinterval a � b.

� string str denotesthestringstr itself.

� stare wheree is a regexp, denotestheKleeneiterationof e, thatis all thewordsmadeof concatenationofzero,oneor morewordsof e.

� alt e1 e2 returnsa regexp for theunionof languagesof e1 ande2.

� seqe1 e2 returnsa regexp for theconcatenationof languagesof e1 ande2.

� opt e returnsa regexp for thesetof wordsof e andtheemptyword.

� somee denotesall thewordsmadeof concatenationof oneor morewordsof e.

type regexp

val empty � regexp

val epsilon � regexp

val char: char � regexp

val char interv � char � char � regexp

val string : string � regexp

val star � regexp � regexp

val alt � regexp � regexp � regexp

val seq � regexp � regexp � regexp

val opt � regexp � regexp

val some � regexp � regexp

4

Page 5: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

char denotesthecharactercharfor all non-specialchars\ char denotesthecharactercharfor specialcharacters. , \ , * , +, ?, [ and]. denotesany single-characterword[ set] denotesany single-characterword belongingto set. Intervals may be

givenasin [a-z] .[^ set] denotesany single-characterword notbelongingto set.regexp* denotestheKleenestarof regexpregexp+ denotesany concatenationof oneor morewordsof regexpregexp? denotestheemptyword or any word denotedby regexpregexp � | regexp� denotesany wordsin regexp � or in regexp�regexp � regexp� denotesany contecatenationof awordof regexp � andawordof regexp�( regexp) parentheses,denotesthesamewordsasregexp.

Figure2.1: Syntaxof regularexpressions

2.1.2 Regexpmatching by runtime interpr etation of regexp

match string r s returnstrueif thestrings is in thelanguagedenotedby r. This functionis by no meansefficient,but it is enoughif you justneedto matchoncea simpleregexp againsta string.

If you have a complicatedregexp, or if you’re going to matchthe sameregexp repeatedlyagainstseveralstrings,we recommendto usecompilationof regexp providedby moduleAutomata.

val match string � regexp � string � bool

2.1.3 Syntax of regular expressions

This functionoffersa way of building regularexpressionssyntactically, following moreor lessthe GNU regexpsyntaxasin egrep.This is summarizedin thetableof figure2.1.

Moreover, the postfix operators* , + and? have priority over the concatenation,itself having priority overalternationwith |

from string str builds a regexp by interpretingsyntacticallythe stringstr. Thesyntaxmustfollow the tableabove. It raisesexceptionInvalid argument"from string" if thesyntaxis not correct.

val from string � string � regexp

2.1.4 Compilation of regular expressions

type compiled regexp

val compile � regexp � compiled regexp

2.1.5 Matching and searching functions

searchforward e str possearchin thestringstr, startingatpositionposawordthatis in thelanguageof r. Returnsa pair

�b � e� whereb is positionof thefirst charmatched,ande is thepositionfollowing thepositionof the last

charmatched.RaisesNot found of no matchingword is found.Notice:evenif theregularexpressionacceptstheemptyword,thisfunctionwill neverreturn

�b � e� with e � b.

In otherwords,this functionalwayssearchfor non-emptywordsin thelanguageof r.Unpredictableresultsmayoccurif pos � .

val searchforward � compiled regexp � string � int � int � int

split stringsr s extract from strings thesubwords(of maximalsize)thatarein the languageof r. For examplesplit strings

�compile

�from string "[0-9]+" � � "12+3*45" returns["12" ;"3" ;"45" ].

5

Page 6: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

split delim a s splits string s into pieces delimited by r. For examplesplit strings

�compile

�char’:’ ��� "marche:G6H3a656h6g56:534:180:Claude Marche:/hom e/marche:/ bin/bash"

returns[ "marche" ; "G6H3a656h6g56" ; "534" ; "180" ; "ClaudeMarche" ; "/home/marche" ; "/bin/bash" ].

val split strings � compiled regexp � string � string list

val split delim � compiled regexp � string � string list

6

Page 7: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

Index of Identifiers

alt, 4char, 4char interv, 4compile, 5compiled regexp (type), 5, 5, 6empty, 4epsilon, 4from string, 5match string, 5opt, 4regexp (type), 4, 4, 5Regexp (module), 4searchforward, 5seq, 4some, 4split delim, 6split strings, 6star, 4string, 4

7

Page 8: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

Chapter 3

Documentationof implementation

This part describethe implementationof the library. It providesseveral moduleswhich dependseachotherasshown by thegraphbelow.

8

Page 9: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

Tests

AutomataRegexp_syntax

Regular_expr

Inttagmap InttagsetRegexp_lexer

Ptset

Hashcons

Regexp_parser

Regexp

Gen_lock

Classical_automata

3.1 Module Hashcons

* hashcons:hashtablesfor hashconsingCopyright(C) 2000Jean-ChristopheFILLIATREThis library is free software;you can redistribute it and/ormodify it underthe termsof the GNU Library

GeneralPublicLicenseversion2, aspublishedby theFreeSoftwareFoundation.This library is distributedin thehopethatit will beuseful,but WITHOUT ANY WARRANTY; without even

theimpliedwarrantyof MERCHANTABILITY or FITNESSFORA PARTICULAR PURPOSE.

9

Page 10: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

SeetheGNU Library GeneralPublicLicenseversion2 for moredetails(enclosedin thefile LGPL).Hashtablesfor hashconsing.Hashconsedvaluesareof thefollowingtypehash consed. Thefield tagcontains

a uniqueinteger (for valueshashconsedwith thesametable). Thefield hkey containsthehashkey of thevalue(without modulo)for possibleusein otherhashtables(andinternallywhenhashconsingtablesareresized).Thefield node� containsthe valueitself �type � hash consed ���

hkey � int ;tag � int ;node �����Functorialinterface.

module type HashedType �sig

type tval equal � t � t � boolval hash � t � int

end

module type S �sig

type keytype tval create � int � tval hashcons � t � key � key hash consed

end

module Make�H � HashedType��� �

S with type key � H.t �

3.2 Module Regular expr

Thismoduledefinestheregularexpressions,andprovidessomesimplemanipulationof them.

3.2.1 The regexpdatatypeand its constructors

Thetypeof regularexpressionsis abstract.Regularexpressionsmaybebuilt from thefollowing constructors:

� emptyis theregexp thatdenotesno wordat all.

� epsilonis theregexp thatdenotestheemptyword.

� charc returnsa regexp thatdenotesonly thesingle-characterwordc.

� charss returnsa regexp thatdenotesany single-characterwordbelongingto setof charss.

� string str denotesthestringstr itself.

� stare wheree is a regexp, denotestheKleeneiterationof e, thatis all thewordsmadeof concatenationofzero,oneor morewordsof e.

� alt e1 e2 returnsa regexp for theunionof languagesof e1 ande2.

� seqe1 e2 returnsa regexp for theconcatenationof languagesof e1 ande2.

� opt e returnsa regexp for thesetof wordsof e andtheemptyword.

� somee denotesall thewordsmadeof concatenationof oneor morewordsof e.

10

Page 11: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

type regexp

val uniq tag � regexp � int

val empty � regexp

val epsilon � regexp

val char: char � regexp

val char interv � char � char � regexp

val string : string � regexp

val star � regexp � regexp

val alt � regexp � regexp � regexp

val seq � regexp � regexp � regexp

val opt � regexp � regexp

val some � regexp � regexp

3.2.2 Simple regexpoperations

Thefollowing threefunctionsprovide somesimpleoperationson regularexpressions:

� nullabler returnstrue if regexp r acceptstheemptyword.

� residualr c returnstheregexp r’ denotingthelanguageof words � suchthatc � is in thelanguageof r.

� firstcharsr returnsthesetof charactersthatmayoccurat thebeginningof wordsin thelanguageof e.

val nullable � regexp � bool

val residual � regexp � int � regexp

val firstchars � regexp � �int � int � regexp � list

3.2.3 Regexpmatching by runtime interpr etation of regexp

match string r s returnstrueif thestrings is in thelanguagedenotedby r. This functionis by no meansefficient,but it is enoughif you justneedto matchoncea simpleregexp againsta string.

If you have a complicatedregexp, or if you’re going to matchthe sameregexp repeatedlyagainstseveralstrings,we recommendto usecompilationof regexp providedby moduleAutomata.

val match string � regexp � string � bool

pretty-printer

val fprint � Format.formatter � regexp � unit

val print � regexp � unit

3.3 Implementation of moduleRegular expr

3.3.1 regexpdatatypeand simplifying constructors

open Hashcons

type regexp � regexp structHashcons.hash consed

11

Page 12: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

and regexp struct �| Empty| Epsilon| Char interv of int � int| String of string

���lengthat least2

� �| Starof regexp| Alt of regexp structPtset.t| Seqof regexp � regexp

let uniq tag r � r.tag

let regexp eq�r1 � r2 ���

match�r1 � r2 � with

| Empty� Empty � true| Epsilon� Epsilon � true| Alt s1 � Alt s2 � Ptset.equalqs1 s2| Starr1 � Starr2 � r1 � r2| Seq

�s11� s12��� Seq

�s21� s22��� s11 � s21 � s12 � s22

| String s1 � String s2 � s1 � s2| Char interv

�a1 � b1 ��� Char interv

�a2 � b2 ��� a1 � a2 � b1 � b2

| � false

module Hash � Hashcons.Make�struct type t � regexp struct

let equal � regexp eqlet hash � Hashtbl.hashend �

let hash consing table � Hash.create� �"!let hash cons � Hash.hashconshash consing table

let empty � hash consEmpty

let epsilon � hash consEpsilon

let stare �match e.nodewith

| Empty # Epsilon � epsilon| Star � e| � hash cons

�Stare�

let alt e1 e2 �match e1.node� e2.nodewith

| Empty� � e2| � Empty � e1| Alt

�l1 �$� Alt

�l2 ��� hash cons

�Alt

�Ptset.union l1 l2 � �

| Alt�l1 �$� � hash cons

�Alt

�Ptset.adde2 l1 ���

| � Alt�l2 ��� hash cons

�Alt

�Ptset.adde1 l2 ���

| �if e1 � e2 then e1else hash cons

�Alt

�Ptset.adde1

�Ptset.singletone2��� �

let seqe1 e2 �match e1.node� e2.nodewith

| Empty� � empty| � Empty � empty| Epsilon� � e2| � Epsilon � e1| � hash cons

�Seq

�e1 � e2���

let charc � let c � Char.codec in hash cons�Char interv

�c � c ���

12

Page 13: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

let char interv a b � hash cons�Char interv

�Char.codea � Char.codeb ���

let string s �if s � ""then

epsilonelse

if String.lengths �&%then let c � String.gets in charcelse hash cons

�String s �

3.3.2 extendedregexp

let somee � �seqe

�stare� �

let opt e � �alt e epsilon�

3.3.3 Regexpmatch by run-time interpr etation of regexp

let rec nullabler �match r.nodewith

| Empty � false| Epsilon � true| Char interv

�n1 � n2 ��� false

| String � false�'�

cannotbe""� �

| Stare � true| Alt

�l ��� Ptset.exists nullablel

| Seq�e1 � e2��� nullablee1 � nullablee2

let rec residualr c �match r.nodewith

| Empty � empty| Epsilon � empty| Char interv

�a � b ��� if a ( c � c ( b then epsilonelse empty

| String s � �'�s cannotbe""

� �if c � Char.code

�String.gets "�

then string�String.subs % �

pred�String.length s �����

else empty| Stare � seq

�residuale c � r

| Alt�l ���Ptset.fold�

fun e accu � alt�residuale c � accu�

lempty

| Seq�e1 � e2���

if nullable�e1�

then alt�seq

�residuale1 c � e2� � residuale2 c �

else seq�residuale1 c � e2

13

Page 14: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

let match string r s �let e � ref r infor i �) to pred

�String.length s � do

e �*� residual + e �Char.code

�String.gets i ���

done;nullable + e

firstcharsr returnsthesetof charactersthatmaystarta word in thelanguageof r

let adda b r l � if a , b then l else�a � b � r �-�.� l

let rec inserta b r l �match l with

| /'01� / � a � b � r �20|�a1 � b1 � r1 �-�.� rest �

if b a1then���

a <= b < a1<= b1� ��

a � b � r �3�.� lelse

if b ( b1then

if a ( a1then���

a <= a1<= b <= b1� �

adda�a1 45%6� r

� �a1 � b � alt r r1 �-�.� � add

�b 78%9� b1 r1 rest���

else���a1< a <= b <= b1

� ��a1 � a 45% � r1 �3�.� � a � b � alt r r1 �3�.� � add

�b 7:%9� b1 r1 rest�

elseif a ( a1then���

a <= a1<= b1 < b� �

adda�a1 45%6� r

� �a1 � b1 � alt r r1 �3�.� � insert

�b1 78%6� b r rest���

elseif a ( b1then�'�

a1< a <= b1 < b� ��

a1 � a 4:% � r1 �;�<� � a � b1 � alt r r1 �3�<� � insert�b1 78%9� b r rest�

else�'�a1<= b1 < a <= b

� ��a1 � b1 � r1 �-�.� � inserta b r rest�

let insert list l1 l2 �List.fold right�

fun�a � b � r � accu � inserta b r accu�

l1l2

14

Page 15: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

let rec firstcharsr �match r.nodewith

| Empty � /�0| Epsilon � /'0| Char interv

�a � b ��� / � a � b � epsilon�20

| String s �let c � Char.code

�String.gets "� in

/ � c � c � string�String.subs % � pred

�String.lengths ����� �=0

| Stare �let l � firstcharse inList.map

�fun

�a � b � res��� �

a � b � seqresr ��� l| Alt

�s ���Ptset.fold�

fun e accu � insert list�firstcharse� accu�

s/�0

| Seq�e1 � e2���

if nullablee1then

let l1 � firstcharse1 and l2 � firstcharse2 ininsert list�

List.map�fun

�a � b � res��� �

a � b � seqrese2� � l1 �l2

elselet l1 � firstcharse1 inList.map

�fun

�a � b � res��� �

a � b � seqrese2��� l1

let �let r � seq

�star

�alt

�char ’a’ � � char ’b’ ��� � � char ’c’ � in

firstcharsr

open Format

let rec fprint fmt r �match r.nodewith

| Empty � fprintf fmt "(empty)"| Epsilon � fprintf fmt "(epsilon)"| Char interv

�a � b ��� fprintf fmt "[’%c’-’%c’]"

�Char.chr a � � Char.chr b �

| String s � fprintf fmt "\"%s\"" s| Stare � fprintf fmt "(%a)*" fprint e| Alt

�l ���fprintf fmt "(" ;Ptset.iter

�fun e � fprintf fmt "|%a" fprint e� l >

fprintf fmt ")"| Seq

�e1 � e2��� fprintf fmt "(%a %a)" fprint e1 fprint e2

let print � fprint std formatter

Module Regexp parser (Yacc)

Header

open Regular expr

15

Page 16: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

Token declarations

%token <char> ?A@CBED%token <Regular expr.regexp> ?A@CBCDEFHGJI%token FKIHBED5BELMI�NMLPOAFRQSOTG3FKIVU6WYXZWYN[GMXCNTBED\?AL;W3F]GMNTBCDRG^WM_%start `�a'b adcfe gfh=i"`�h%type <Regular expr.regexp> `�a'b adcfe g$h=i"`$h%left BELMI%left ?-WMX1?ABjI:?A@EBCDk?A@CBEDEF]GJI8WMNMGMXENTBCD%nonassoc FKIHBCDRNMLPOAFRQlOmG3FKIVU6WMX

Grammar rules

`�a'b adcfe g$hni `$h3�<�*�#o` a2b apcpeqG^WM_

{ $1 }

`�a'b adcfer�.�*�#H?A@EBCD

{ char$1 }#H?A@EBCDEF]GJI

{ $1 }#o` a2b apcpesFKIHBCD

{ star tu% }#o` a2b apcpeqNML^OVF

{ some tv% }#o` a2b apcpe�QlOmG3FKIVU6WMX

{ opt tv% }#o` a2b apcpeqBCLMIw`�a'b adcfe

{ alt tv%xt y }#o` a2b apcpe�`�a'b adcfe %prec ?-WYXz?ABJI

{ seq tu%xt � }#HWMNMGMXENTBCD{` a'b"apcpe�?AL-W3F]G[NTBCD

{ $2 }

Module Regexp lexer (Lex)

{

open Regular expr

open Regexp parser

open Lexing

16

Page 17: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

let add inter a b l �let rec add rec a b l �

match l with| /�0|� / � a � b �20|�a1 � b1 �-�.� r �

if b a1then

�a � b �-�.� l

elseif b1 athen

�a1 � b1 �;�<� � add rec a b r �

else�'�intervalsa � b anda1 � b1 intersect

� �add rec

�min a a1� � max b b1 � r

inif a , b then l else add rec a b l

let complementl �let rec comp rec a l �

match l with| /�0|�

if a Z� �"} then / � a �p�"� �"�20 else /�0|�a1 � b1 �-�.� r �

if a a1 then�a � a1 4:%9�3�.� � comp rec

�b1 78%9� r � else comp rec

�b1 78%9� r

incomp rec l

let interv a b � char interv�Char.chr a� � Char.chr b �

let rec make charsetl �match l with

| /�0|� empty|�a � b �-�.� r � alt

�interv a b � � make charsetr �

}

17

Page 18: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

rule h=~9� a9��� parse| ’\\’

{ CHAR�lexeme char lexbuf %6� }

| ’.’{ CHARSET

�interv ^�"� �"� }

| ’*’{ STAR }

| ’+’{ PLUS }

| ’?’{ QUESTION}

| ’|’{ ALT }

| ’(’{ OPENPAR }

| ’)’{ CLOSEPAR }

| "[ˆ"{ CHARSET

�make charset

�complement

�charsetlexbuf � ��� }

| ’[’{ CHARSET

�make charset

�charsetlexbuf ��� }

|{ CHAR

�lexeme char lexbuf � }

| eof{ EOF }

and �p�ui"`fgfa6hs� parse| ’]’

{ /�0 }| ’\\’

{ let c � Char.code�lexeme char lexbuf %9� in

add inter c c�charsetlexbuf � }

| [ˆ ’\\’ ] ’-’{ let c1 � Char.code

�lexeme char lexbuf �

and c2 � Char.code�lexeme char lexbuf � �

inadd inter c1 c2

�charsetlexbuf � }

|{ let c � Char.code

�lexeme char lexbuf "� in

add inter c c�charsetlexbuf � }

3.4 Module Regexp syntax

This moduleoffers a way of building regular expressionssyntactically, following moreor lessthe GNU regexpsyntaxasin egrep.This is summarizedin thetableof figure3.1.

Moreover, the postfix operators* , + and? have priority over the concatenation,itself having priority overalternationwith |

from string str builds a regexp by interpretingsyntacticallythe stringstr. Thesyntaxmustfollow the tableabove. It raisesexceptionInvalid argument"from string" if thesyntaxis not correct.

val from string � string � Regular expr.regexp

18

Page 19: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

char denotesthecharactercharfor all non-specialchars\ char denotesthecharacterchar for specialcharacters. , \ , * , +, ?, | , [ , ] ,

( and). denotesany single-characterword[ set] denotesany single-characterword belongingto set. Intervals may be

givenasin [a-z] .[^ set] denotesany single-characterword notbelongingto set.regexp* denotestheKleenestarof regexpregexp+ denotesany concatenationof oneor morewordsof regexpregexp? denotestheemptyword or any word denotedby regexpregexp � | regexp� denotesany wordsin regexp � or in regexp�regexp � regexp� denotesany contecatenationof awordof regexp � andawordof regexp�( regexp) parentheses,denotesthesamewordsasregexp.

Figure3.1: Syntaxof regularexpressions

3.5 Implementation of moduleRegexp syntax

let from string s �try

let b � Lexing.from string s inRegexp parser.regexp startRegexp lexer.tokenb

withParsing.Parse error � invalid arg "from string"

3.6 Module Automata

This moduleprovidescompilationof a regexp into an automaton.It thenprovidessomefunctionsfor matchingandsearching.

3.6.1 Automata, compilation

Automataconsideredherearedeterministic.Thetypeof automatais abstract.compiler returnsanautomatonthatrecognizesthelanguageof r.

type automaton

val compile � Regular expr.regexp � automaton

3.6.2 Execution of automata

exec automatonauto str pos executesthe automatonauto on string str startingat position pos. Returnsthemaximalpositionp suchthat the substringof str from positionspos (included)to p (excluded)is acceptablebytheautomaton,-1 if no suchpositionexists.

Unpredictableresultsmayoccurif pos � .val exec automaton � automaton � string � int � int

3.6.3 Matching and searching functions

searchforwarda str possearchin thestringstr, startingatpositionposawordthatis in thelanguageof automatona. Returnsa pair

�b � e� whereb is positionof thefirst charmatched,ande is thepositionfollowing theposition

of thelastcharmatched.

19

Page 20: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

RaisesNot found of no matchingword is found.Notice: even if theautomatonacceptstheemptyword, this functionwill never return

�b � e� with e � b. In

otherwords,this functionalwayssearchfor non-emptywordsin thelanguageof automatona.Unpredictableresultsmayoccurif pos � .

val searchforward � automaton � string � int � int � int

split stringsa s extract from strings thesubwords(of maximalsize)thatarein the languageof a. For examplesplit strings

�compile

�from string "[0-9]+" � � "12+3*45" returns["12" ;"3" ;"45" ].

split delim a s splits string s into pieces delimited by a. For examplesplit strings

�compile

�char’:’ ��� "marche:G6H3a656h6g56:534:180:Claude Marche:/hom e/marche:/ bin/bash"

returns[ "marche" ; "G6H3a656h6g56" ; "534" ; "180" ; "ClaudeMarche" ; "/home/marche" ; "/bin/bash" ].

val split strings � automaton � string � string list

val split delim � automaton � string � string list

to dot a f exportstheautomatona to thefile f in DOT format.

val to dot � automaton � string � unit

3.7 Implementation of moduleAutomata

Compilationof regexp into anautomaton

open Regular expr

3.7.1 The type of automata

Automataconsideredherearedeterministic.Thestatesof theseautomataarealwaysrepresentedasnaturalnumbers,theinitial statealwaysbeing0.An automatonis thenmadeof a transitiontable,giving for eachstate � a sparsearraythatmapscharacters

codesto states; anda tableof acceptingstates.

type automaton �{

auto trans � �int � int array� array;

auto accept � bool array;}

3.7.2 Execution of automata

exec automatonauto str pos executesthe automatonauto on string str startingat position pos. Returnsthemaximalpositionp suchthat the substringof str from positionspos (included)to p (excluded)is acceptablebytheautomaton,-1 if no suchpositionexists.

20

Page 21: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

let exec automatonautos pos �let state � ref 0and last acceptpos �

ref�if auto.auto accept� � � then poselse -1�

and i � ref posand l � String.lengthsintry

while !i l dolet

�offset � m ��� auto.auto trans� � + state� in

let index � Char.code�String.gets + i �Y4 offset in

if index ��� index � Array.length m then raiseExit >state �*� m � � index ��>if !state � 4R% then raiseExit >incr i >if auto.auto accept� � + state� then last acceptpos �*�&+ i >

done;!last acceptpos>

withExit � + last acceptpos>

3.7.3 Searching functions

searchforwarda str possearchin thestringstr, startingatpositionposawordthatis in thelanguageof automatona. Returnsa pair

�b � e� whereb is positionof thefirst charmatched,ande is thepositionfollowing theposition

of thelastcharmatched.RaisesNot found of no matchingword is found.Notice: even if theautomatonacceptstheemptyword, this functionwill never return

�b � e� with e � b. In

otherwords,this functionalwayssearchfor non-emptywordsin thelanguageof automatona.Unpredictableresultsmayoccurif pos � .

let rec searchforward autos pos �if pos � String.length sthen raiseNot foundelse

let n � exec automatonautos pos inif n , pos then pos� n else searchforward autos

�succpos�

split stringsa s extractfrom strings thesubwords(of maximalsize)thatarein thelanguageof a

let split stringsautoline �let rec loop pos �

trylet b � e � searchforward autoline pos inlet id � String.sub line b

�e 4 b � in

id �.� � loop e �with Not found � /�0

inloop

21

Page 22: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

let split delim autoline �let rec loop pos �

trylet b � e � searchforward autoline pos inlet id � String.sub line pos

�b 4 pos� in

id �.� � loop e �with Not found �

/ String.sub line pos�String.lengthline 4 pos�=0

inloop

3.7.4 Compilation of a regexp

compiler returnsanautomatonthatrecognizesthelanguageof r.

module IntMap �Inttagmap.Make

�struct type t � int let tag x � x end �

module IntSet �Inttagset.Make

�struct type t � int let tag x � x end �

let rec computemax b l �match l with

| /'01� b|� � x � �;�<� r � computemax x l

module HashRegexp �Hashtbl.Make

�struct

type t � Regular expr.regexplet equal � � ���let hash � Regular expr.uniq tag

end �

let compiler �

we have a hashtableto avoid severalcompilationof thesameregexp

let hashtable � HashRegexp.create� �"!

transtableis the transitiontableto fill, andacceptableis the tableof acceptingstates.transtablemapsany stateinto a CharMap� t , which itself mapscharactersto states.

and transtable � ref IntMap.emptyand acceptable � ref IntSet.emptyand next state � ref 0in

loop r fills thetablesfor theregexp r, andreturntheinitial stateof theresultingautomaton.

22

Page 23: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

let rec loop r �try

HashRegexp.find hashtablerwith

Not found ����generatea new state

� �let init ��+ next stateand next chars � Regular expr.firstcharsrinincr next state>���

fill thehashtablebeforerecursion� �

HashRegexp.add hashtabler init >���fill thesetof acceptablestates

� �if nullabler then acceptable �*� IntSet.add init + acceptable>���

computethemapfrom charsto statesfor thenew state� �

let t � build sparsearraynext charsin���addit to thetransitiontable

� �transtable � � IntMap.addinit t + transtable>���

returnthenew state� �

init

and build sparsearraynext chars �match next charswith

| /�0|� � u�6/�#.# 0n�|�a � b � �-�.� r �

let mini � aand maxi � List.fold left

�fun

� � x � ��� x � b rinlet t � Array.create

�maxi 4 mini 7w%9� � 4J%6� in

List.iter�fun

�a � b � r ���

let s � loop r infor i � a to b do

t � � i 4 mini ��� sdone �

next chars>�mini � t �

inlet � loop r in

we thenfill thearraysdefiningtheautomaton

{auto trans � Array.init + next state

�fun i � IntMap.find i + transtable�$>

auto accept � Array.init + next state�fun i � IntSet.memi + acceptable�$>

}

3.7.5 Output functions

to dot a f exportstheautomatona to thefile f in DOT format.

open Printf

module CharSet �Set.Make

�struct type t � char let compare

�x � char� � y � char��� comparex y end �

let no chars � CharSet.empty

23

Page 24: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

let rec char interval a b �if a , b then no charselse CharSet.add

�Char.chr a � � char interval

�succa � b �

let all chars � char interval C� �"�let complement � CharSet.diff all chars

let intervals s �let rec interv � function

| i �3/'01� List.rev i| /'0�� n �.� l � interv

� / � n � n �20�� l �|�mi � ma�x�.� i as is � n �.� l �

if Char.coden � succ�Char.codema� then

interv� �

mi � n �;�<� i � l �else

interv� �

n � n �;�<� is � l �ininterv

� /�0�� CharSet.elementss�let output label cout s �

let char= function| ’"’ � "\\\""| ’\\’ � "#92"| c �

let n � Char.codec inif n ,�y"��� n &%6�"! then String.make% c else sprintf "#%d" n

inlet output interv

�mi � ma���

if mi � ma thenfprintf cout "%s "

�charmi �

else if Char.codemi � pred�Char.codema� then

fprintf cout "%s %s "�charmi � � charma�

elsefprintf cout "%s-%s "

�charmi � � charma�

inlet is � intervalss inlet ics � intervals

�complements � in

if List.length is List.length ics thenList.iter output interv is

else beginfprintf cout "[ˆ" ; List.iter output interv ics > fprintf cout "]"

end

let output transitionscout i�n � m ���

let rev m � ref IntMap.emptyinfor k �� to Array.length m 4�% do

let j � m � � k � inlet s � try IntMap.find j + rev m with Not found � CharSet.empty inrev m � � IntMap.addj

�CharSet.add

�Char.chr

�k 7 n � � s�3+ rev m

done;IntMap.iter�

fun j s �fprintf cout " %d -> %d [ label = \"" i j >output label cout s >fprintf cout "\" ];\n" �

!rev m

24

Page 25: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

let to dot a f �let cout � open out f infprintf cout "digraph finite state machine {

/* rankdir=LR; */orientation=land;node [shape = doublecircle];" ;

Array.iteri�fun i b � if b then fprintf cout "%d " i � a.auto accept>

fprintf cout ";\n node [shape = ci rcle];\n" ;Array.iteri

�output transitionscout� a.auto trans>

fprintf cout "}\n" ;close out cout

25

Page 26: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

Index of Identifiers

add, 14, 12,14,22,24add inter, 16, 18all chars, 24, 24alt, 11, 12, 13,14,15,16,17Alt , 11, 12ALT (camlyacctoken), 16, 16Automata(module), 19, 20automaton(type), 19, 20, 19,20auto accept(field), 20, 20,23,25auto trans(field), 20, 20,23,25char, 11, 12CHAR (camlyacctoken), 16, 16charset(camllex parsingrule), 18CharSet(module), 23, 23,24CHARSET(camlyacctoken), 16, 16char interv, 11, 13, 17Char interv, 11, 12,13char interval, 24, 24CLOSEPAR (camlyacctoken), 16, 16compare, 23, 23compile, 19, 22complement, 17, 24, 17,24computemax, 22, 22create, 10, 12,22,23empty, 11, 12, 12,13,17,22,23,24Empty, 11, 12EOF (camlyacctoken), 16, 16epsilon, 11, 12, 12,13,14Epsilon, 11, 12equal, 10, 12, 22exec automaton, 19, 20, 21firstchars, 11, 14, 14,15,22fprint, 11, 15, 15from string, 18, 19, 19hash, 10, 12, 22, 12Hash(module), 12, 12hashcons, 10, 12Hashcons(module), 9, 11,12HashedType (sig), 10, 10HashRegexp (module), 22, 22hash cons, 12, 12,13hash consed(type), 10, 10,11hash consing table, 12, 12hkey (field), 10insert, 14, 14

insert list , 14, 14interv, 17, 17intervals, 24, 24IntMap (module), 22, 22,23,24IntSet(module), 22, 22,23key (type), 10, 10Make(module), 10, 12,22,23make charset, 17, 17match string, 11, 13node(field), 10, 12,13,14,15no chars, 23, 24nullable, 11, 13, 13,14,22OPENPAR (camlyacctoken), 16, 16opt, 11, 13, 16output label, 24, 24output transitions, 24, 25PLUS (camlyacctoken), 16, 16print, 11, 15QUESTION(camlyacctoken), 16, 16regexp (type), 11, 11,16,18,19,22regexp (camlyaccnon-terminal), 16, 16regexp eq, 12, 12Regexp lexer (module), 16, 19Regexp parser(module), 15, 16,19regexp start (camlyaccnon-terminal), 16, 16regexp struct (type), 11, 11,12Regexp syntax(module), 18, 19Regular expr (module), 10, 11, 15,16,18,19,20,

22residual, 11, 13, 13S (sig), 10, 10searchforward, 20, 21, 21seq, 11, 12, 13,14,15,16Seq, 11, 12some, 11, 13, 16split delim, 20, 21split strings, 20, 21star, 11, 12, 13,15,16Star, 11, 12STAR (camlyacctoken), 16, 16string, 11, 13String, 11, 13t (type), 10, 12, 16, 22, 23, 10,11tag, 22tag (field), 10, 12

26

Page 27: A simple library for regular expressions - lri.frmarche/regexp/regexp.pdf · 2.1.4 Compilation of regular expressions type compiled regexp ... string, 4 7. Chapter 3 ... and provides

token(camllex parsingrule), 17to dot, 20, 25uniq tag, 11, 12, 22

27