Thursday 1 May 2008

Partial Derivative Regular Expression Pattern Matching



Regular expression pattern matching as found in domain-specific
languages such as XDuce, CDuce, XHaskell is a highly useful
concept but is not commonly supported by mainstream languages.
Below I give an elementary Haskell implementation of regular expression
pattern matching based on (partial) derivatives of regular expressions.


This work is part of the XHaskell project, carried
out in collaboration with my PhD student Kenny Lu Zhuo.
Thanks to Burak Emir for helpful discussions
(and suggesting the name partial derivate regexp pattern matching).

This post is a literate Haskell file (converted to html thanks
to Malcolm Wallace's hscolour).

---------------
-- Motivation

Our goal is to understand and implement pattern matching involving
regular expressions a la XDuce, CDuce and XHaskell.
Here's a simple example that illustrates what's going on.

f :: (Space | Text)* -> Text*
f () = ()
f (x : Text+, y : (Space | Text)*) = (x, f y)
f (x : Space*, y : (Space | Text)*) = f y

The above function removes all (white) space from a text.
We assume that Text refers to some alpha-numeric character.

We expect that

f " Hello Bye" yields "HelloBye"

Here's what's happening during the evaluation of f " Hello Bye".

We match the input " Hello Bye" against each pattern clause
(from top to bottom)

The first clause cleary doesn't apply, so does not the second clause
becaues the pattern (x : Text+, y : (Space | Text)*) matches only
input that starts with some Text.
The third pattern (x : Space*, y : (Space | Text)*) applies.
The subpattern x : Space* binds any number of spaces. We assume
that we greedily consumes spaces. Hence, the pattern match
(x : Space*, y : (Space | Text)*) against " Hello Bye" yields "HelloBye"
yields

x holds " " (the three leading spaces)
y holds "Hello Bye" (the rest)

(Note: We view here a string as a concenation of Space and Text)

Having established the binding of pattern variables, we can
execute the body of the pattern clause. This leads to

f "Hello Bye"

Again, we check which pattern matches the input "Hello Bye".
This time the second pattern applies and we find that

x holds "Hello" (all non-empty text)
y holds " Bye" (the rest)

The subsequent call is

f " Bye"

It should be clear now how the rest of the computation progresses.
We keep all text but throw away all spaces.
Hence,

f " Hello Bye"

evaluates to

"HelloBye"


The main difference to pattern matching a la ML and Haskell is that
we can NOT perform the pattern match by comparing the structure
of the pattern against the structure of the incoming value.
The reason is that a pattern like

(x : A*)

can possibly match a number of structurally different values such as

"" (also known as epsilon)

"A"

"AA"

and so on


The challenge is that we need to take into account the semantic
meaning of regular expressions when performing the pattern match.

Here's another tricky example.
Matching the input "AB" against the pattern (x:A*, y:((A,B)* | B))
yields the following pattern bindings

(x,""), (y,"AB")

and

(x,"A"), (y,"B")

This shows that regular expression pattern matching
is indeterministic (unless we fix a specific pattern matching policy).


-----------------------------------------------
-- Formalizing the Pattern Matching Relation


We first need to fix the language of regular expressions and patterns.


r ::= r + r -- choice
-- later, we often write (r | r) instead of (r + r)
| (r,r) -- concatenation
| r* -- repetition
| epsilon -- empty word
| A | B | ... -- alphabet symbols/letters

p ::= (x : r) -- r is a regular expression
| (p,p) -- pairs
| (p + p) -- choice



The pattern matching relation

w <| p ~> env

pronounced

"word w matches pattern p and yiels the environment env"

(Note: words are strings, sequences of letters/characters)


Here are the rules describing all valid pattern matching relations.
We write L(r) to denote the language describe by a regular expression.


w in L(r)
(Var) -------------------------
w <| (x :: r) ~> [(x,w)]


w = w1 w2
w1 <| p1 ~> env1
w2 <| p2 ~> env2
(Pair) -----------------------------
w <| (p1,p2) ~> env1 ++ env2


w <| p1 ~> env1
(Choice1) ------------------------
w <| p1 + p2 ~> env1


w <| p2 ~> env2
(Choice1) ------------------------
w <| p1 + p2 ~> env2


How to implement w <| p ~> env
How to cope with indeterminism?

For example, consider

(1)

AA <| (x : A*, y : A*)

~> [(x,epsilon), (y,AA)]

~> [(x,A), (y,A)]

~> [(x,AA), (y,epsilon)]


A sample derivation for the first case


epsilon in L(A*)
----------------------------------
epsilon <| (x:A*) ~> [(x,epsilon)]


AA in L(A*)
-----------------------
AA <| (y:A*) ~> [(y,AA)]

------------------------------------------------
AA <| (x : A*, y : A*) ~> [(x,epsilon), (y,AA)]

(2)

Recall the earlier example.

AB <| (x: A*, y : ((A,B)* | B))

~> [(x,A), (y,B)]

~> [(x,epsilon), (y,AB)]


Before we develop an algorithm for the
regular expression pattern matching problem
let's consider a simpler problem:

How to check that w is in L(r), pronounced as

"Word w matches the regular expression r"

(This is also known as the word problem)

We solve the word problem by applying the derivative algorithm:

w matches r iff

case w = epsilon: isEmpty r

isEmpty yields True if the r is empty, otherwise False
The definition is straightforward. See below.

case w = (l,v): v matches r\l

where r\l is the derivative of r wrt l


We obtain r\l by take away all "leading" l's.
Semantically, we can explain the derivative
operation as follows:

L(r\l) = { w | lw in L(r) }

Below we define a function which computes an
expression r\l representing the derivative of
r wrt l.

We can build the derivative r\l by composition
of its partial derivatives.
For example,

(r1 | r2)\l = r1\l | r2\l

We call r1\l the partial derivative of (r1 | r2)\l.
Therefore, we often refer to derivatives as
partial derivatives.

Here's the implementation in Haskell.

The regular expression language, parameterized over some alphabet

> data RE a where
> Phi :: RE a -- empty language
> Empty :: RE a -- empty word
> L :: a -> RE a -- single letter taken from alphabet a
> Choice :: RE a -> RE a -> RE a -- r1 + r2
> Seq :: RE a -> RE a -> RE a -- (r1,r2)
> Star :: RE a -> RE a -- r*
> deriving Eq

Some combinators

> re_star r = Star r
> re_empty = Empty
> re_letter l = L l
> re_choice r1 r2 = Choice r1 r2
> re_pair r1 r2 = Seq r1 r2

A word is a list of alphabet letters

> type Word a = [a]

Pretty printing of regular expressions

> instance Show a => Show (RE a) where
> show Phi = "{}"
> show Empty = "<>"
> show (L c) = show c
> show (Choice r1 r2) = ("(" ++ show r1 ++ "|" ++ show r2 ++ ")")
> show (Seq r1 r2) = ("<" ++ show r1 ++ "," ++ show r2 ++ ">")
> show (Star r) = (show r ++ "*")


Matching words against regular expressions

> matchWord :: Eq a => RE a -> Word a -> Bool
> matchWord r [] = isEmpty r
> matchWord r (l:w) = matchWord (partDeriv r l) w

> isEmpty :: RE a -> Bool
> isEmpty Phi = False
> isEmpty Empty = True
> isEmpty (Choice r1 r2) = (isEmpty r1) || (isEmpty r2)
> isEmpty (Seq r1 r2) = (isEmpty r1) && (isEmpty r2)
> isEmpty (Star r) = True
> isEmpty (L _) = False

> partDeriv :: Eq a => RE a -> a -> RE a
> partDeriv Phi _ = Phi
> partDeriv Empty _ = Phi
> partDeriv (L l1) l2 = if l1 == l2 then Empty else Phi
> partDeriv (Choice r1 r2) l = Choice (partDeriv r1 l) (partDeriv r2 l)
> partDeriv (Seq r1 r2) l =
> if isEmpty r1
> then Choice (Seq (partDeriv r1 l) r2) (partDeriv r2 l)
> else Seq (partDeriv r1 l) r2
> partDeriv (this@(Star r)) l =
> Seq (partDeriv r l) this


Most of the cases are straightforward.
In case of (r1, r2) \ l we need to take care of the case that r1 is empty!

Here's an example

partDeriv (A|B)* A
= ( partDeriv (A|B) A , (A|B)* )
= ( (partDeriv A A) | (partDeriv B A), (A|B)*)
= ( () | Phi, (A|B)*)

effectively

= (A|B)*

The above also explains why we need Phi.


some examples

> r1 = (Choice (Star (Seq (L 'A') (L 'B'))) (L 'B')) -- ((A,B)*|B)
> r2 = Seq (Star (L 'A')) (Star (L 'A')) -- (A*,A*)
> r3 = Star (Choice (L 'A') (L 'B')) -- (A|B)*

> w1 = "AABBAAA"
> m1 = matchWord r1 w1
> m2 = matchWord r2 w1
> m3 = matchWord r3 w1


Let's get back to our problem of implementing
the pattern matching relation

w <| p ~> env


--------------------------------------------------
-- Towards a regexp pattern matching algorithm


The key idea: We apply the idea of derivatives to pattern matching


(l,w) <| p ~> env iff w <| pdPat p l ~> env

(l,w) matches the pattern p and yields environment Env
iff w matches the (partial) pattern derivative p wrt l

How is pdPat defined?

The easy case first.

pdPat (p1 | p2) l = (pdPat p1 l) | (pdPat p2 l)

What about

pdPat (x:r) l = ?

We can't simply replace ? by (x: r\l).
Here building the derivative r\r of r wrt l means that
we have consumed the input symbol l. But we must
record somewhere that we have consumed l.

The easiest method is to record with each
variable pattern the sequence of symbols
which have been consumed so far by that variable pattern.

So,

the variable pattern

<w> x : r

says that we have consumed so far w, hence,
any valid pattern binding will look like (x,w...),
but we yet need to consume any word v which
is accepted by r (ie v in L(r)).

The derivative definition for variable patterns
is then straightforward.

pdPat l (<w> x : r) = <w,l> x : r\l

Here's an example.

AB |> (x : (A|B)*) ~> env

Of course env = [(x,AB)]

In some intermediate steps we find

pdPat (<> x : (A|B)*) A = <A> x : (A|B)*
-- shortcut see earlier calculation
-- and *remember* that we consumed A

pdPat (<A> x : (A|B)*) B = <AB> x : (A|B)*


What about pair patterns?


pdPat (p1,p2) l = if isEmpty (strip p1)
then (pdPat p1 l, p2) | (??, pdPat p2 l)
else (pdPat p1 l, p2)

stip simply extracts the regular expression.
The question is what to replace ?? with ?

If p1 is empty this means that all further matching will helping in p2.
But we simply can't throw away p1 because we record the variable
binding in the pattern itself.
We also can't simply keep p1, we must indicate that the matching
uing p1 is finished.

The idea is to replace ?? by the pattern where we
make all regular expresssions empty (if possibly).

For example, consider the pattern

(<AB> x : (A|B)*, <> y : C*)

where x : (A|B)* has already consumed AB

and C is the remaining input.

We expect

pdPat (<AB> x : (A|B)*, <> y : C*) C
= (<ABC> x : Phi, <y> y : C*) | (<AB> x : (), <C> y : C*)

(A|B)* \ C = Phi

which shows that from (<ABC> x : Phi, <y> y : C*)
we can't obtain any valid pattern binding.

But (A|B)* is empty. Hence, we can stop matching
using the subpattern <AB> x : (A|B)* which we indicate
by replacing (A|B)* with () (the empty sequence).

In general, to stop matching uinsg p1 we need to make
the pattern p1 empty by replacing all regular expressions
which contain () by () and all others we replace by Phi.

Here's the implementation of pattern derivatives in Haskell.

The pattern language, we assume that patterns are linear


> data Pat a where
> PVar :: Int -> Word a -> RE a-> Pat a
> PPair :: Pat a -> Pat a -> Pat a
> PChoice :: Pat a -> Pat a -> Pat a

-- PVar var w r
-- variables var are represented by Ints
-- w represents the part we have already seen
-- r represents the remaining part we yet have to match


Combinators

> pat_var n r = PVar n [] r
> pat_pair p1 p2 = PPair p1 p2
> pat_choice p1 p2 = PChoice p1 p2

pretty printing

> instance Show a => Show (Pat a) where
> show (PVar i w r) = ("x"++show i ++ "::" ++ show r)
> show (PPair p1 p2) = ("<" ++ show p1 ++ "," ++ show p2 ++ ">")
> show (PChoice p1 p2) = "(" ++ show p1 ++ "|" ++ show p2 ++ ")"


(partial) derivative of patterns

> pdPat :: Eq a => Pat a -> a -> Pat a
> pdPat (PVar x w r) l = PVar x (w ++ [l]) (partDeriv r l)
> pdPat (PPair p1 p2) l =
> if (isEmpty (strip p1))
> then PChoice (PPair (pdPat p1 l) p2) (PPair (mkEmpPat p1) (pdPat p2 l))
> else PPair (pdPat p1 l) p2
> pdPat (PChoice p1 p2) l =
> PChoice (pdPat p1 l) (pdPat p2 l)

> strip :: Pat a -> RE a
> strip (PVar _ w r) = r
> strip (PPair p1 p2) = Seq (strip p1) (strip p2)
> strip (PChoice p1 p2) = Choice (strip p1) (strip p2)

replace all (<w> x : r) by (<w> x: <>) if isEmpty r
otherwise yield (<w> x: Phi)

> mkEmpPat :: Pat a -> Pat a
> mkEmpPat (PVar x w r)
> | isEmpty r = PVar x w Empty
> | otherwise = PVar x w Phi
> mkEmpPat (PPair p1 p2) = PPair (mkEmpPat p1) (mkEmpPat p2)
> mkEmpPat (PChoice p1 p2) = PChoice (mkEmpPat p1) (mkEmpPat p2)

An example to see the derivation operation on patterns in action

(<> x : A*, <> y : A*) and input AA

-->A (<A> x : A*, <> y : A*) | (<> x : (), <A> y : A*)

-->A
|
(<AA> x : A*, <> y : A*) | (<A> x : Phi, <A> y : A*)
| (<A> x : (), <A> y : A*) | | (<> x : (), <AA> y : A*)


Observation: We not only calculate one possible matching,
we calculate all matchings!

(No performance penalty thanks to lazy evaluation)


The binding of pattern variables to words

> type Env a = [(Int,Word a)]

> patMatch :: Eq a => Pat a -> Word a -> [Env a]
> patMatch p (l:w) =
> patMatch (pdPat p l) w
> patMatch (PVar x w r) [] =
> if isEmpty r then [[(x,w)]] else []
> patMatch (PChoice p1 p2) [] =
> (patMatch p1 []) ++ (patMatch p2 [])
> -- indet choice
> patMatch (PPair p1 p2) [] =
> (patMatch p1 []) `combine` (patMatch p2 [])
> -- build all possible combinations
> where
> combine xss yss = [ xs ++ ys | xs <- xss, ys <- yss]

> longPatMatch :: Eq a => Pat a -> Word a -> Maybe (Env a)
> longPatMatch p w =
> first (patMatch p w)
> where
> first (env:_) = return env
> first _ = Nothing


> shortPatMatch :: Eq a => Pat a -> Word a -> Maybe (Env a)
> shortPatMatch p w =
> last (patMatch p w)
> where
> last [env] = return env
> last (_:xs) = last xs
> last _ = Nothing