## Thursday, 1 May 2008

### Partial Derivative Regular Expression Pattern Matching

`Regular expression pattern matching as found in domain-specificlanguages such as XDuce, CDuce, XHaskell is a highly usefulconcept but is not commonly supported by mainstream languages.Below I give an elementary Haskell implementation of regular expressionpattern matching based on (partial) derivatives of regular expressions.This work is part of the XHaskell project, carriedout in collaboration with my PhD student Kenny Lu Zhuo.Thanks to Burak Emir for helpful discussions(and suggesting the name partial derivate regexp pattern matching).This post is a literate Haskell file (converted to html thanksto Malcolm Wallace's hscolour).----------------- Motivation Our goal is to understand and implement pattern matching involvingregular expressions a la XDuce, CDuce and XHaskell.Here's a simple example that illustrates what's going on.f :: (Space | Text)* -> Text*f () = ()f (x : Text+, y : (Space | Text)*) = (x, f y)f (x : Space*, y : (Space | Text)*) = f yThe above function removes all (white) space from a text.We assume that Text refers to some alpha-numeric character.We expect thatf "   Hello Bye" yields "HelloBye"Here's what's happening during the evaluation of f "   Hello Bye".We match the input "   Hello Bye" against each pattern clause(from top to bottom)The first clause cleary doesn't apply, so does not the second clausebecaues the pattern (x : Text+, y : (Space | Text)*) matches onlyinput that starts with some Text.The third pattern (x : Space*, y : (Space | Text)*) applies.The subpattern x : Space* binds any number of spaces. We assumethat we greedily consumes spaces. Hence, the pattern match(x : Space*, y : (Space | Text)*) against "   Hello Bye" yields "HelloBye"yields x holds "   "        (the three leading spaces) y holds "Hello Bye"  (the rest)(Note: We view here a string as a concenation of Space and Text)Having established the binding of pattern variables, we canexecute the body of the pattern clause. This leads tof "Hello Bye"Again, we check which pattern matches the input "Hello Bye".This time the second pattern applies and we find that x holds "Hello"  (all non-empty text) y holds " Bye"   (the rest)The subsequent call isf " Bye"It should be clear now how the rest of the computation progresses.We keep all text but throw away all spaces.Hence,f "   Hello Bye"evaluates to"HelloBye"The main difference to pattern matching a la ML and Haskell is thatwe can NOT perform the pattern match by comparing the structureof the pattern against the structure of the incoming value.The reason is that a pattern like(x : A*)can possibly match a number of structurally different values such as ""           (also known as epsilon) "A" "AA"  and so onThe challenge is that we need to take into account the semanticmeaning of regular expressions when performing the pattern match.Here's another tricky example.Matching the input "AB" against the pattern (x:A*, y:((A,B)* | B))yields the following pattern bindings (x,""), (y,"AB")and (x,"A"), (y,"B")This shows that regular expression pattern matchingis indeterministic (unless we fix a specific pattern matching policy).------------------------------------------------- Formalizing the Pattern Matching RelationWe first need to fix the language of regular expressions and patterns.r ::= r + r         -- choice                    -- later, we often write (r | r) instead of (r + r)   |  (r,r)         -- concatenation   |  r*            -- repetition   |  epsilon       -- empty word   |  A | B | ...   -- alphabet symbols/lettersp ::= (x : r)       -- r is a regular expression   |  (p,p)         -- pairs   |  (p + p)       -- choiceThe pattern matching relation w <| p ~> envpronounced"word w matches pattern p and yiels the environment env"(Note: words are strings, sequences of letters/characters)Here are the rules describing all valid pattern matching relations.We write L(r) to denote the language describe by a regular expression.       w in L(r)(Var)  -------------------------       w <| (x :: r) ~> [(x,w)]       w = w1 w2       w1 <| p1 ~> env1       w2 <| p2 ~> env2           (Pair) -----------------------------       w <| (p1,p2) ~> env1 ++ env2          w <| p1 ~> env1(Choice1) ------------------------          w <| p1 + p2 ~> env1          w <| p2 ~> env2(Choice1) ------------------------          w <| p1 + p2 ~> env2How to implement  w <| p ~> envHow to cope with indeterminism?For example, consider(1)AA <| (x : A*, y : A*)~> [(x,epsilon), (y,AA)]~> [(x,A), (y,A)]~> [(x,AA), (y,epsilon)]A sample derivation for the first case     epsilon in L(A*)    ----------------------------------    epsilon <| (x:A*) ~> [(x,epsilon)]     AA in L(A*)    -----------------------    AA <| (y:A*) ~> [(y,AA)]   ------------------------------------------------   AA <| (x : A*, y : A*) ~> [(x,epsilon), (y,AA)](2)Recall the earlier example.AB <| (x: A*, y : ((A,B)* | B)) ~> [(x,A), (y,B)] ~> [(x,epsilon), (y,AB)]Before we develop an algorithm for theregular expression pattern matching problemlet's consider a simpler problem:How to check that w is in L(r), pronounced as"Word w matches the regular expression r"(This is also known as the word problem)We solve the word problem by applying the derivative algorithm: w matches r iff  case w = epsilon: isEmpty r   isEmpty yields True if the r is empty, otherwise False   The definition is straightforward. See below.  case w = (l,v):  v matches r\l where r\l is the derivative of r wrt l We obtain r\l by take away all "leading" l's. Semantically, we can explain the derivative operation as follows:    L(r\l) = { w | lw in L(r) } Below we define a function which computes an expression r\l representing the derivative of r wrt l. We can build the derivative r\l by composition of its partial derivatives. For example,  (r1 | r2)\l  = r1\l  |  r2\l  We call r1\l the partial derivative of (r1 | r2)\l.  Therefore, we often refer to derivatives as  partial derivatives.Here's the implementation in Haskell.The regular expression language, parameterized over some alphabet> data RE a where>   Phi :: RE a                     -- empty language>   Empty :: RE a                   -- empty word>   L :: a -> RE a                  -- single letter taken from alphabet a>   Choice :: RE a -> RE a -> RE a  -- r1 + r2>   Seq :: RE a -> RE a -> RE a     -- (r1,r2)>   Star :: RE a -> RE a            -- r*>  deriving EqSome combinators> re_star r = Star r> re_empty = Empty> re_letter l = L l> re_choice r1 r2 = Choice r1 r2> re_pair r1 r2 = Seq r1 r2A word is a list of alphabet letters> type Word a = [a]Pretty printing of regular expressions> instance Show a => Show (RE a) where>     show Phi = "{}">     show Empty = "<>">     show (L c) = show c>     show (Choice r1 r2) = ("(" ++ show r1 ++ "|" ++ show r2 ++ ")")>     show (Seq r1 r2) = ("<" ++ show r1 ++ "," ++ show r2 ++ ">")>     show (Star r) = (show r ++ "*")Matching words against regular expressions> matchWord :: Eq a => RE a -> Word a -> Bool> matchWord r [] = isEmpty r> matchWord r (l:w) = matchWord (partDeriv r l) w> isEmpty :: RE a -> Bool> isEmpty Phi = False> isEmpty Empty = True> isEmpty (Choice r1 r2) = (isEmpty r1) || (isEmpty r2)> isEmpty (Seq r1 r2) = (isEmpty r1) && (isEmpty r2)> isEmpty (Star r) = True> isEmpty (L _) = False> partDeriv :: Eq a => RE a -> a -> RE a> partDeriv Phi _ = Phi> partDeriv Empty _ = Phi> partDeriv (L l1) l2 = if l1 == l2  then Empty else Phi> partDeriv (Choice r1 r2) l = Choice (partDeriv r1 l) (partDeriv r2 l)> partDeriv (Seq r1 r2) l =>     if isEmpty r1>     then Choice (Seq (partDeriv r1 l) r2) (partDeriv r2 l)>     else Seq (partDeriv r1 l) r2> partDeriv (this@(Star r)) l =>    Seq (partDeriv r l) thisMost of the cases are straightforward.In case of (r1, r2) \ l we need to take care of the case that r1 is empty!Here's an example   partDeriv (A|B)* A=  ( partDeriv (A|B) A , (A|B)* )=  ( (partDeriv A A) | (partDeriv B A), (A|B)*)=  ( () | Phi, (A|B)*)effectively= (A|B)*The above also explains why we need Phi.some examples> r1 = (Choice (Star (Seq (L 'A') (L 'B'))) (L 'B')) -- ((A,B)*|B)> r2 = Seq (Star (L 'A')) (Star (L 'A'))             -- (A*,A*)> r3 = Star (Choice (L 'A') (L 'B'))                 -- (A|B)*> w1 = "AABBAAA"> m1 = matchWord r1 w1> m2 = matchWord r2 w1> m3 = matchWord r3 w1Let's get back to our problem of implementingthe pattern matching relation w <| p ~> env---------------------------------------------------- Towards a regexp pattern matching algorithmThe key idea: We apply the idea of derivatives to pattern matching(l,w) <| p ~> env iff w <| pdPat p l ~> env(l,w) matches the pattern p and yields environment Enviff w matches the (partial) pattern derivative p wrt lHow is pdPat defined?The easy case first.pdPat (p1 | p2) l = (pdPat p1 l) | (pdPat p2 l)What aboutpdPat (x:r) l = ?We can't simply replace ? by (x: r\l).Here building the derivative r\r of r wrt l means thatwe have consumed the input symbol l. But we mustrecord somewhere that we have consumed l.The easiest method is to record with eachvariable pattern the sequence of symbolswhich have been consumed so far by that variable pattern.So,the variable pattern<w> x : rsays that we have consumed so far w, hence,any valid pattern binding will look like (x,w...),but we yet need to consume any word v whichis accepted by r (ie v in L(r)).The derivative definition for variable patternsis then straightforward.pdPat l  (<w> x : r)  = <w,l> x : r\lHere's an example.AB |> (x : (A|B)*) ~> envOf course env = [(x,AB)]In some intermediate steps we findpdPat (<> x : (A|B)*) A =   <A> x : (A|B)*        -- shortcut see earlier calculation       -- and *remember* that we consumed ApdPat (<A> x : (A|B)*) B = <AB> x : (A|B)*What about pair patterns?pdPat (p1,p2) l = if isEmpty (strip p1)                 then (pdPat p1 l, p2) | (??, pdPat p2 l)                 else (pdPat p1 l, p2)stip simply extracts the regular expression.The question is what to replace ?? with ?If p1 is empty this means that all further matching will helping in p2.But we simply can't throw away p1 because we record the variablebinding in the pattern itself.We also can't simply keep p1, we must indicate that the matchinguing p1 is finished.The idea is to replace ?? by the pattern where wemake all regular expresssions empty (if possibly).For example, consider the pattern (<AB> x : (A|B)*, <> y : C*)where  x : (A|B)*  has already consumed ABand C is the remaining input.We expect pdPat (<AB> x : (A|B)*, <> y : C*) C=    (<ABC> x : Phi, <y> y : C*) | (<AB> x : (), <C> y : C*) (A|B)* \ C = Phi which shows that from (<ABC> x : Phi, <y> y : C*) we can't obtain any valid pattern binding. But (A|B)* is empty. Hence, we can stop matching using the subpattern <AB> x : (A|B)* which we indicate by replacing (A|B)* with () (the empty sequence). In general, to stop matching uinsg p1 we need to make the pattern p1 empty by replacing all regular expressions which contain () by () and all others we replace by Phi.Here's the implementation of pattern derivatives in Haskell.The pattern language, we assume that patterns are linear> data Pat a where>  PVar :: Int -> Word a -> RE a-> Pat a>  PPair :: Pat a -> Pat a -> Pat a>  PChoice :: Pat a -> Pat a -> Pat a  --  PVar var w r  -- variables var are represented by Ints  -- w represents the part we have already seen  -- r represents the remaining part we yet have to matchCombinators> pat_var n r = PVar n [] r> pat_pair p1 p2 = PPair p1 p2> pat_choice p1 p2 = PChoice p1 p2pretty printing> instance Show a => Show (Pat a) where>   show (PVar i w r) = ("x"++show i ++ "::" ++ show r)>   show (PPair p1 p2) = ("<" ++ show p1 ++ "," ++ show p2 ++ ">")>   show (PChoice p1 p2) = "(" ++ show p1 ++ "|" ++ show p2 ++ ")"(partial) derivative of patterns> pdPat :: Eq a => Pat a -> a -> Pat a> pdPat (PVar x w r) l = PVar x (w ++ [l]) (partDeriv r l)> pdPat (PPair p1 p2) l =>   if (isEmpty (strip p1))>   then PChoice (PPair (pdPat p1 l) p2) (PPair (mkEmpPat p1) (pdPat p2 l))>   else PPair (pdPat p1 l) p2> pdPat (PChoice p1 p2) l =>   PChoice (pdPat p1 l) (pdPat p2 l)> strip :: Pat a -> RE a> strip (PVar _ w r) = r> strip (PPair p1 p2) = Seq (strip p1) (strip p2)> strip (PChoice p1 p2) = Choice (strip p1) (strip p2)replace all (<w> x : r) by (<w> x: <>) if isEmpty rotherwise yield (<w> x: Phi)> mkEmpPat :: Pat a -> Pat a> mkEmpPat (PVar x w r)>  | isEmpty r = PVar x w Empty>  | otherwise = PVar x w Phi> mkEmpPat (PPair p1 p2) = PPair (mkEmpPat p1) (mkEmpPat p2)> mkEmpPat (PChoice p1 p2) = PChoice (mkEmpPat p1) (mkEmpPat p2)An example to see the derivation operation on patterns in action(<> x : A*, <> y : A*) and input AA-->A  (<A> x : A*, <> y : A*) | (<> x : (), <A> y : A*)-->A                                                     |   (<AA> x : A*, <> y : A*)  |    (<A> x : Phi, <A> y : A*) | (<A> x : (), <A> y : A*)  |  | (<> x : (), <AA> y : A*)Observation: We not only calculate one possible matching,            we calculate all matchings!(No performance penalty thanks to lazy evaluation)The binding of pattern variables to words> type Env a = [(Int,Word a)]> patMatch :: Eq a => Pat a -> Word a -> [Env a]> patMatch p (l:w) =>    patMatch (pdPat p l) w> patMatch (PVar x w r) [] =>    if isEmpty r then [[(x,w)]] else []> patMatch (PChoice p1 p2) [] =>   (patMatch p1 []) ++ (patMatch p2 [])           >             -- indet choice> patMatch (PPair p1 p2) [] =>    (patMatch p1 []) `combine` (patMatch p2 [])   >           -- build all possible combinations>  where>    combine xss yss = [ xs ++ ys |  xs <- xss, ys <- yss]> longPatMatch :: Eq a => Pat a -> Word a -> Maybe (Env a)> longPatMatch p w =>      first (patMatch p w)>   where>     first (env:_) = return env>     first _ = Nothing> shortPatMatch :: Eq a => Pat a -> Word a -> Maybe (Env a)> shortPatMatch p w =>      last (patMatch p w)>   where>      last [env] = return env>      last (_:xs) = last xs>      last _  = Nothing`