Wednesday, 5 November 2008

Playing with regular expressions: Intersection

We implement symbolic intersection of regular expressions
by adapting Antimirov's regular expression rewriting algorithm.
By symbolic we mean that regular expressions are not transformed
into automata; instead, we rewrite regular expressions using partial derivatives.

The algorithm (highlights):


r1 `intersect` r2 =

  let [l1,...,ln] = lettersOf (r1 | r2)
  in

     (l1, partDeriv r1 l1 `intersect` partDeriv r2 l1)
   | ...
   | (ln, partDeriv r1 ln `intersect` partDeriv r2 ln)


The idea is that each regular expression is turned into a
'disjunction' of clauses where each clause starts with
a letter, followed by the partial derivative of the regular
expression wrt that letter.
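
For example, using the notation of the pretty printer defined below
(<r1,r2> for sequencing, <> for the empty word), the expression
('A'|'B')* decomposes into

   ('A', <<>,('A'|'B')*>) | ('B', <<>,('A'|'B')*>)

since partDeriv ('A'|'B')* 'A' yields [<<>,('A'|'B')*>], and similarly for 'B'.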

You probably know Brzozowski's derivative operation,
which can be characterized by


derivative r l = { w | l w in L(r) }


that is, the derivative of r with respect to the letter l is the set of
words w such that l followed by w belongs to the language of r.


If we used derivatives here, the algorithm above
would not terminate. For example,


deriv (Star (L 'A')) 'A'
--> <<>,'A'*>

deriv (deriv (Star (L 'A')) 'A') 'A'
--> (<{},'A'*>|<<>,'A'*>)


and so on: without simplification the derivatives keep growing.
The nice property of partial derivatives is that they are 'finite':
each regular expression has only finitely many (iterated) partial
derivatives. For example,


partDeriv (Star (L 'A')) 'A'

--> <<>,'A'*>

partDeriv (partDeriv (Star (L 'A')) 'A') 'A'
--> <<>,'A'*>


Please check the paper

Valentin M. Antimirov: Partial Derivatives of Regular Expressions and Finite Automaton Constructions.
Theor. Comput. Sci. 155(2): 291-319 (1996)

for more details on derivatives, partial derivatives and their connection
(roughly, derivatives correspond to DFA states and partial derivatives correspond to NFA states).


There's another source of non-termination which we explain via
the following example.


A* `intersect` A*

--> (A, A* `intersect` A*)


where we simplify partDeriv A* 'A', which yields <<>,A*>, to the equivalent A*.

The point to note is that the proof tree is infinite. The standard approach
is to use co-induction, which then yields a finite (recursive) proof.

OK, here's finally the Haskell implementation:


> {-# LANGUAGE GADTs #-}

We're just using GADT syntax to write 'ordinary' algebraic data types.

> module RegExpIntersection where

> import Data.List (nub)


> data RE a where
>   Phi :: RE a                      -- empty language
>   Empty :: RE a                    -- empty word
>   L :: a -> RE a                   -- single letter taken from alphabet a
>   Choice :: RE a -> RE a -> RE a   -- r1 + r2
>   Seq :: RE a -> RE a -> RE a      -- (r1,r2)
>   Star :: RE a -> RE a             -- r*
>   Var :: Int -> RE a               -- first-order variables to represent regular equations
>   deriving Eq

A word is a list of alphabet letters

> type Word a = [a]

Pretty printing of regular expressions

> instance Show a => Show (RE a) where
>   show Phi = "{}"
>   show Empty = "<>"
>   show (L c) = show c
>   show (Choice r1 r2) = "(" ++ show r1 ++ "|" ++ show r2 ++ ")"
>   show (Seq r1 r2) = "<" ++ show r1 ++ "," ++ show r2 ++ ">"
>   show (Star r) = show r ++ "*"

Turning a list of regular expressions into a regular expression by using | (choice)

> resToRE :: [RE a] -> RE a
> resToRE (r:res) = foldl Choice r res
> resToRE [] = Phi
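
For example, resToRE [L 'A', L 'B'] yields Choice (L 'A') (L 'B'),
pretty printed as ('A'|'B'), and resToRE [] yields Phi.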


Computing the set of letters of a regular expression

> sigmaRE :: Eq a => RE a -> [a]
> sigmaRE (L l) = [l]
> sigmaRE (Seq r1 r2) = nub ((sigmaRE r1) ++ (sigmaRE r2))
> sigmaRE (Choice r1 r2) = nub ((sigmaRE r1) ++ (sigmaRE r2))
> sigmaRE (Star r) = sigmaRE r
> sigmaRE Phi = []
> sigmaRE Empty = []
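
For example, sigmaRE (Seq (Star (L 'A')) (L 'B')) yields ['A','B']
(which GHCi prints as "AB").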


Testing if a regular expression accepts the empty word

> isEmpty :: RE a -> Bool
> isEmpty Phi = False
> isEmpty Empty = True
> isEmpty (Choice r1 r2) = (isEmpty r1) || (isEmpty r2)
> isEmpty (Seq r1 r2) = (isEmpty r1) && (isEmpty r2)
> isEmpty (Star r) = True
> isEmpty (L _) = False

Testing if a regular expression denotes the empty language (matches nothing)

> isPhi :: RE a -> Bool
> isPhi Phi = True
> isPhi Empty = False
> isPhi (Choice r1 r2) = (isPhi r1) && (isPhi r2)
> isPhi (Seq r1 r2) = (isPhi r1) || (isPhi r2)
> isPhi (Star r) = False
> isPhi (L _) = False
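
Note that the two predicates differ: isEmpty asks whether the empty word
is accepted, isPhi asks whether the denoted language is empty. For example,

   isEmpty (Star Phi)               --> True
   isPhi   (Star Phi)               --> False
   isPhi   (Seq Phi (Star (L 'A'))) --> True
   isEmpty (Seq Phi (Star (L 'A'))) --> False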


Brzozowski's derivative operation
deriv r l denotes the regular expression obtained from r by "removing a leading l"
(not needed for the intersection algorithm; it is only included to illustrate
the difference from the upcoming partial derivative operation)

> deriv :: Eq a => RE a -> a -> RE a
> deriv Phi _ = Phi
> deriv Empty _ = Phi
> deriv (L l1) l2
>   | l1 == l2 = Empty
>   | otherwise = Phi
> deriv (Choice r1 r2) l =
>   Choice (deriv r1 l) (deriv r2 l)
> deriv (Seq r1 r2) l =
>   if isEmpty r1
>   then Choice (Seq (deriv r1 l) r2) (deriv r2 l)
>   else Seq (deriv r1 l) r2
> deriv (this@(Star r)) l =
>   Seq (deriv r l) this
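
As an aside (this helper is not part of the code above, just a sketch built
from deriv and isEmpty), derivatives give the textbook word-matching test:
differentiate by each letter of the word and check whether the remaining
expression accepts the empty word.

   matchDeriv :: Eq a => RE a -> [a] -> Bool
   matchDeriv r w = isEmpty (foldl deriv r w)

For instance, matchDeriv (Star (L 'A')) "AA" is True and
matchDeriv (Star (L 'A')) "AB" is False.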



A variant of Antimirov's partial derivative operation

Antimirov's definition has partDeriv (Star (L 'A')) 'A' yield ['A'*],
whereas our version yields [<<>,'A'*>].
The difference is not essential here.

> partDeriv :: Eq a => RE a -> a -> [RE a]
> partDeriv Phi l = []
> partDeriv Empty l = []
> partDeriv (L l') l
>   | l == l' = [Empty]
>   | otherwise = []
> partDeriv (Choice r1 r2) l = nub ((partDeriv r1 l) ++ (partDeriv r2 l))
> partDeriv (Seq r1 r2) l
>   | isEmpty r1 =
>       let s1 = [ (Seq r1' r2) | r1' <- partDeriv r1 l ]
>           s2 = partDeriv r2 l
>       in nub (s1 ++ s2)
>   | otherwise = [ (Seq r1' r2) | r1' <- partDeriv r1 l ]
> partDeriv (Star r) l = [ (Seq r' (Star r)) | r' <- partDeriv r l ]
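
The finiteness claim from above can be made concrete by collecting the set
of all iterated partial derivatives of a regular expression; these are
essentially the NFA states of Antimirov's construction. A quick sketch
(not part of the code used below), assuming the functions defined so far:

   -- collect all regular expressions reachable from r via partDeriv
   allPartDerivs :: Eq a => RE a -> [RE a]
   allPartDerivs r = go [r] [r]
     where
       ls = sigmaRE r
       go [] seen = seen
       go (x:xs) seen =
         let new = nub [ p | l <- ls, p <- partDeriv x l, not (p `elem` seen) ]
         in go (new ++ xs) (seen ++ new)

For example, allPartDerivs (Star (L 'A')) should yield just the two
expressions 'A'* and <<>,'A'*>.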


Here, finally, is the partial-derivative-based intersection algorithm.


> type Env a = [((RE a, RE a), RE a)]


> intersectRE :: Eq a => RE a -> RE a -> RE a
> intersectRE r1 r2 = intersectC [] r1 r2

> intersectC :: Eq a => Env a -> RE a -> RE a -> RE a
> intersectC env r1 r2
>   | r1 == Phi || r2 == Phi = Phi
>   | r1 == Empty || r2 == Empty = Empty
>   | otherwise =
>       case lookup (r1,r2) env of
>         Just r -> r
>         Nothing ->
>           let letters = sigmaRE (r1 `Choice` r2)
>               env' = ((r1,r2),r):env
>               r1l l = resToRE $ partDeriv r1 l
>               r2l l = resToRE $ partDeriv r2 l
>               r' = resToRE $ map (\l -> Seq (L l) (intersectC env' (r1l l) (r2l l))) letters
>               r = if (isEmpty r1) && (isEmpty r2)
>                   then Choice r' Empty
>                   else r'
>           in r
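
For inputs that don't require a recursive proof, intersectC terminates. For
example, intersecting two distinct letters should give (semantically) the
empty language:

*RegExpIntersection> intersectRE (L 'A') (L 'B')
(<'A',{}>|<'B',{}>)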

The problem with this version is that we use recursion in Haskell to model
recursive proofs. For example, try

intersectRE rAstar rAstar

The result is an infinite (cyclic) regular expression, so printing it never
terminates.




Converting a regular equation into a regular expression

We assume that the variable x appears (if at all) at positions of the form (r, Var x).
convert2 traverses the regular expression and yields (r1,r2),
where the invariant is that r1 is the 'loop' part and r2 is the base case,
so the equation x = r is solved (via Arden's lemma) as x = (r1)* r2, which is what convert builds.

> convert :: Int -> RE a -> RE a
> convert x r = let (r1,r2) = convert2 x r
>               in Seq (Star r1) r2

> convert2 :: Int -> RE a -> (RE a, RE a)
> convert2 x Empty = (Empty, Empty)
> convert2 x (Var y)
>   | x == y = (Empty, Empty)
>   | otherwise = (Empty, Var y)   -- can this happen?
> convert2 x (r@(Seq l r1))
>   | contains x r1 = let (r2,r3) = convert2 x r1
>                     in (Seq l r2, r3)
>   | otherwise = (Empty, r)
> convert2 x (Choice r1 r2) = let (r1', r1'') = convert2 x r1
>                                 (r2', r2'') = convert2 x r2
>                             in (Choice r1' r2', Choice r1'' r2'')

> contains :: Int -> RE a -> Bool
> contains x (Var y) = x == y
> contains x (Seq r1 r2) = contains x r1 || contains x r2
> contains x (Star r) = contains x r
> contains x (Choice r1 r2) = contains x r1 || contains x r2
> contains x _ = False
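
For example, the regular equation x1 = ('A', x1) | <> (represented as
Choice (Seq (L 'A') (Var 1)) Empty) has the solution 'A'* by Arden's lemma,
and convert computes a verbose version of it:

   convert 1 (Choice (Seq (L 'A') (Var 1)) Empty)
   --> <(<'A',<>>|<>)*,(<>|<>)>

which is semantically equivalent to 'A'*.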


We use first-order syntax to represent recursive proofs/regular expressions.
We then use convert to actually build the non-recursive regular expression.


> intersectRE2 :: Eq a => RE a -> RE a -> RE a
> intersectRE2 r1 r2 = intersectC2 1 [] r1 r2

> intersectC2 :: Eq a => Int -> Env a -> RE a -> RE a -> RE a
> intersectC2 cnt env r1 r2
>   | r1 == Phi || r2 == Phi = Phi
>   | r1 == Empty || r2 == Empty = Empty
>   | otherwise =
>       case lookup (r1,r2) env of
>         Just r -> r
>         Nothing ->
>           let letters = sigmaRE (r1 `Choice` r2)
>               env' = ((r1,r2),Var cnt):env
>               r1l l = resToRE $ partDeriv r1 l
>               r2l l = resToRE $ partDeriv r2 l
>               r' = resToRE $ map (\l -> Seq (L l) (intersectC2 (cnt+1) env' (r1l l) (r2l l))) letters
>               r = if (isEmpty r1) && (isEmpty r2)
>                   then Choice r' Empty
>                   else r'
>           in convert cnt r


For testing purposes, it's handy to have a function which tests for (semantic)
equality among regular expressions (again written using partial derivatives).


> type EnvEq a = [(RE a, RE a)]

> eqRE :: Eq a => RE a -> RE a -> Bool
> eqRE r1 r2 = eqREC [] r1 r2

> eqREC :: Eq a => EnvEq a -> RE a -> RE a -> Bool
> eqREC env r1 r2
>   | isEmpty r1 && (not (isEmpty r2)) = False
>   | isPhi r1 && (not (isPhi r2)) = False
>   | otherwise =
>       if elem (r1,r2) env
>       then True
>       else let letters = sigmaRE (r1 `Choice` r2)
>                env' = (r1,r2):env
>                r1l l = resToRE $ partDeriv r1 l
>                r2l l = resToRE $ partDeriv r2 l
>            in and $ map (\l -> eqREC env' (r1l l) (r2l l)) letters
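
For example, eqRE confirms the familiar unfolding law r* = <> | (r, r*):

   eqRE (Star (L 'A')) (Choice Empty (Seq (L 'A') (Star (L 'A'))))
   --> True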

examples

> rA = L 'A'
> rAstar = Star rA
> rB = L 'B'
> rBstar = Star rB
> rAorBstar = Star (Choice rA rB)

Some sample runs (using intersectRE2, intersectRE won't terminate for interesting cases)

*RegExpIntersection> intersectRE2 rAstar rBstar
<((<>|<>)|<>)*,((<'A',{}>|<'B',{}>)|<>)>
*RegExpIntersection> eqRE Empty (intersectRE2 rAstar rBstar)
True
*RegExpIntersection> intersectRE2 rAorBstar rBstar
<((<>|<>)|<>)*,((<'A',{}>|<'B',<((<>|<'B',<>>)|<>)*,((<'A',{}>|<>)|<>)>>)|<>)>
*RegExpIntersection> eqRE rBstar (intersectRE2 rAorBstar rBstar)
True
*RegExpIntersection> intersectRE2 rAorBstar rAorBstar
<((<>|<>)|<>)*,((<'A',<((<'A',<>>|<'B',<>>)|<>)*,((<>|<>)|<>)>>|<'B',<((<'A',<>>|<'B',<>>)|<>)*,((<>|<>)|<>)>>)|<>)>
*RegExpIntersection> eqRE rAorBstar (intersectRE2 rAorBstar rAorBstar)
True
*RegExpIntersection> eqRE rAstar (intersectRE2 rAorBstar rAorBstar)
False
*RegExpIntersection> eqRE rBstar (intersectRE2 rAorBstar rAorBstar)
False


The regular expressions generated by the convert function are 'very verbose'.
There's room for lots of improvement.

15 comments:

Anonymous said...

Thanks for the interesting post and code, as well as references to partial derivatives!

Some questions: 1. when you said "to use co-induction which then yields a finite (recursive) proof," do you mean the use of convert?

2. Before the definition of convert you said "variables x appears .. at position (r,Var x)," do you mean x appears only at right most positions in the (now a mixture of regular and context-free) grammar? What constraints do we assume on the input of convert, so that we can always convert a regular expression r, possibly with x in it, to r1*r2?

Thanks!

Pseudonym said...

Here's my version from 2003. I didn't bother implementing intersection or set difference at the time.

Set difference turns out to be slightly more useful in practice than intersection. One example is in C-style comments; they are most naturally defined as "/*", followed by any string not containing "*/", followed by "*/".

Martin Sulzmann said...

Thanks for the comment Andrew. I had a quick scan through your code. You follow the 'traditional' approach by converting regular expressions to DFAs?

The point of my code is that we can implement standard regular expression operations by rewriting them (without having to go via an explicit DFA construction).

My post is a response to
http://www.iis.sinica.edu.tw/~scm/tag/regular-expression/
I think I've screwed up some of my blog's settings, so Shin couldn't leave a comment.

Pseudonym said...

Yes, this code was intended to implement DFAs, rather than just compute set intersection.

One reason why I like your method better is that the standard algorithm sometimes ends up with "infinite failure". This regular expression, for example:

(a|b)* - b*(ab*)*

recognises the empty set. The DFA produced looks like this:

q0 -> a q0 | b q0

where q0 is not a final state. This does indeed accept the empty set, but it only does so after consuming the entire string.

Obviously, this behaviour is undesirable in practice. When we wrote the lexer generator for Mercury, we solved the problem by making the subsequent DFA minimisation step do the dirty work. (You put a known "empty set" state with no transitions to or from it into the DFA, then perform state minimisation, then any state which ends up in the same partition as that special state must also be an "empty set" state.)

However, that always sat badly with me. DFA minimisation should be an optional optimisation step, rather than required for correctness.

Martin Sulzmann said...

I see.

Antimirov shows that his partial derivative construction leads to fairly 'optimal' DFAs.

Here are some references:

Valentin M. Antimirov and Peter D. Mosses: Rewriting Extended Regular Expressions.

Valentin M. Antimirov: Partial Derivatives of Regular Expressions and Finite Automaton Constructions

Pseudonym said...

Antimirov shows that his partial derivative construction leads to fairly 'optimal' DFAs.

Which makes it all the more wrong to require DFA minimisation for correctness. :-)

I suspect that non-minimal DFAs may be more common in lexer generators, where you're matching against multiple regular expressions in parallel. I have no justification for this, though.

Martin Sulzmann said...

My PhD student Kenny Lu's forthcoming thesis contains more applications of partial derivatives in the context of regular expression pattern matching. I'll summarize the main results in a blog post, once the thesis is through the entire reviewing process.

Gowtham said...

It is a very nice post, and very helpful. I have some questions on how to use the operations/comparisons over regular expressions. I would not say that my question is totally related to this post, but I'm sure you will have good ideas.

In brief, if I have multiple regular expressions to match, can I establish some relationship among them and skip some of the regular expressions using that relationship, in order to do the matching very fast? I guess a partial ordering does some ordering, but I could not find good information about it on the net.

Martin Sulzmann said...

"multiple regular expressions to match"

Can you please elaborate, with a short example?

Are you saying you'd like to reuse the partial results of a previous match?

Gowtham said...

Thanks for the reply. The problem is something like this, in brief: suppose I have lots and lots of regular expressions to match against a set of documents, and I want to do this very fast.
Eg: match R1, R2, ..., Rn over files F1, F2, ..., Fm.
As far as I can tell, we can make two kinds of improvements:
1. Have a regex matching engine that is optimized for the kind of regexes I pass. Don't worry about this for now.
2. Optimize the regexes by removing or merging some of them, thereby reducing the total time taken to match all of them over the set of documents.

So, with respect to the second point (filtering regexes): can I use some kind of partial ordering, or otherwise find related regexes, so that I can discard some of them? Eg: if R1 is related to R2 in a certain way and R1 does not match, disregard R2. Something along these lines.


Martin Sulzmann said...

If R1 doesn't match a string s,
i.e. s not in L(R1) and
R2 <= R1, then R2 doesn't match
the string either.
Is that what you're looking for?

Gowtham said...

Yes, I am looking for something like that. Where can I read more about how to relate two regular expressions? Any existing libraries or examples which can do this?

Also I have other questions/thoughts/ideas. But for now the above thing is more important and any direction in this aspect is much appreciated. Thanks

Dan said...

Thanks for the interesting code and references! I've been playing with it a bit, and I think I've identified a couple of bugs, one seemingly trivial, the other I'm not sure how to fix.

The simple one is in the base cases for intersect. The line "r1 == Empty || r2 == Empty = Empty" implies that the intersection of <> with anything other than {} is <>, but it seems that the intersection of <> and 'a' should be {}, as no input matches both expressions. Another manifestation of this is that the intersection of 'a' and <'a','b'> returns 'a', rather than {}. A simple fix is to test the non-Empty argument with isEmpty, returning Phi if it doesn't match the empty string.

The not-so-simple one involves differences in the expression following a kleene star. The intersection of <'a'*,'b'> and <'a'*,'c'> should be {}, but intersect returns (modulo simplification) <'a','a'*>.

I can see that taking partial derivatives of the original expressions with respect to 'a' leads to a recursive call with the same arguments, and that replacing that result with a variable somehow throws away the distinction between the expressions, but I don't understand the approach well enough to see a solution.

Martin Sulzmann said...

Thanks for your comments Dan.

Correct. The case "r1 == Empty || r2 == Empty = Empty" needs to be fixed as you suggest.

The second problem arises because of a bug in the convert function.
Conversion of the regular equation
1 = (a,1) yields a*, which is obviously incorrect.

The base case in convert2 needs to be (Empty,Phi) instead of (Empty,Empty). There's also a missing case dealing with letters.
I'll update the code on hackage in the next few days.