Comments on Martin Sulzmann's Blog: Playing with regular expressions: Intersection

Thanks for your comments Dan. Correct. The case "...

2009-05-29T14:57:31.533-07:00

Thanks for your comments Dan.

Correct. The case "r1 == Empty || r2 == Empty = Empty" is wrong and should be fixed as you suggest.

The second problem arises because of a bug in the convert function.
Conversion of the regular equation
1 = (a,1) yields a* which is obviously incorrect.

The base case in convert2 needs to be (Empty,Phi) instead of (Empty,Empty). The case dealing with letters is also missing.
I'll update the code on hackage in the next few days.
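To spell out why a* is the wrong answer, here is a short derivation using Arden's rule. The notation is an assumption for illustration: X stands for the variable 1, \varphi for Phi and \varepsilon for Empty. An equation with no base case has the empty language as its least solution, and a* only appears once Empty is added as an alternative, which seems consistent with the base case needing Phi rather than Empty.

```latex
X = aX + \varphi \;\Longrightarrow\; X = a^{*}\varphi = \varphi,
\qquad\text{whereas}\qquad
X = aX + \varepsilon \;\Longrightarrow\; X = a^{*}\varepsilon = a^{*}.
```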

Thanks for the interesting code and references! I...

2009-05-28T13:26:20.599-07:00

Thanks for the interesting code and references! I've been playing with it a bit, and I think I've identified a couple of bugs, one seemingly trivial, the other I'm not sure how to fix.

The simple one is in the base cases for intersect. The line "r1 == Empty || r2 == Empty = Empty" implies that the intersection of <> with anything other than {} is <>, but it seems that the intersection of <> and 'a' should be {}, as no input matches both expressions. Another manifestation of this is that the intersection of 'a' and <'a','b'> returns 'a', rather than {}. A simple fix is to test the non-Empty argument with isEmpty, returning Phi if it doesn't match the empty string.
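A minimal sketch of that fix, assuming a datatype shaped like the one in the post. The constructor names follow the ones used in this thread, but `intersectBase` is a hypothetical helper that covers only the base cases, and `isEmpty` here is my own stand-in for the post's function of that name.

```haskell
-- Hypothetical datatype, modelled on the constructor names in this thread.
data RE = Phi            -- {} : matches nothing
        | Empty          -- <> : matches only the empty string
        | L Char
        | Choice RE RE
        | Seq RE RE
        | Star RE
        deriving (Eq, Show)

-- Does r accept the empty string? (Stand-in for the post's isEmpty.)
isEmpty :: RE -> Bool
isEmpty Phi          = False
isEmpty Empty        = True
isEmpty (L _)        = False
isEmpty (Choice r s) = isEmpty r || isEmpty s
isEmpty (Seq r s)    = isEmpty r && isEmpty s
isEmpty (Star _)     = True

-- Base cases only; Nothing marks the inputs where a full intersect
-- would have to recurse.
intersectBase :: RE -> RE -> Maybe RE
intersectBase r1 r2
  | r1 == Phi || r2 == Phi = Just Phi
  | r1 == Empty            = Just (if isEmpty r2 then Empty else Phi)
  | r2 == Empty            = Just (if isEmpty r1 then Empty else Phi)
  | otherwise              = Nothing
```

With this, intersecting <> with 'a' gives Phi, while intersecting <> with 'a'* still gives Empty.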

The not-so-simple one involves differences in the expression following a Kleene star. The intersection of <'a'*,'b'> and <'a'*,'c'> should be {}, but intersect returns (modulo simplification) <'a','a'*>.

I can see that taking partial derivatives of the original expressions with respect to 'a' leads to a recursive call with the same arguments, and that replacing that result with a variable somehow throws away the distinction between the expressions, but I don't understand the approach well enough to see a solution.
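For what it is worth, the expected answer can be checked independently. The sketch below is not the post's intersect and does not build a result expression. It only decides whether two expressions share a string, by exploring pairs of Antimirov-style partial derivatives and remembering the pairs already visited. The datatype and all function names (`pd`, `mkSeq`, `letters`, `intersects`) are assumptions made for illustration.

```haskell
import Data.List (nub)

-- Hypothetical datatype, modelled on the constructor names in this thread.
data RE = Phi | Empty | L Char | Choice RE RE | Seq RE RE | Star RE
  deriving (Eq, Show)

-- Does r accept the empty string?
nullable :: RE -> Bool
nullable Phi          = False
nullable Empty        = True
nullable (L _)        = False
nullable (Choice r s) = nullable r || nullable s
nullable (Seq r s)    = nullable r && nullable s
nullable (Star _)     = True

-- Sequencing that drops a leading Empty, to keep derivatives small.
mkSeq :: RE -> RE -> RE
mkSeq Empty s = s
mkSeq r s     = Seq r s

-- Partial derivatives of an expression with respect to the letter c.
pd :: Char -> RE -> [RE]
pd _ Phi          = []
pd _ Empty        = []
pd c (L d)        = [Empty | c == d]
pd c (Choice r s) = nub (pd c r ++ pd c s)
pd c (Seq r s)    = nub ([mkSeq r' s | r' <- pd c r]
                         ++ (if nullable r then pd c s else []))
pd c (Star r)     = [mkSeq r' (Star r) | r' <- pd c r]

-- Letters occurring in an expression.
letters :: RE -> [Char]
letters (L c)        = [c]
letters (Choice r s) = nub (letters r ++ letters s)
letters (Seq r s)    = nub (letters r ++ letters s)
letters (Star r)     = letters r
letters _            = []

-- True if some string is matched by both r1 and r2.
-- 'seen' records visited pairs, so a pair that derives to itself
-- (as <'a'*,'b'> and <'a'*,'c'> do on 'a') is explored only once.
intersects :: RE -> RE -> Bool
intersects r1 r2 = go [] [(r1, r2)]
  where
    sigma = nub (letters r1 ++ letters r2)
    go _ [] = False
    go seen (p@(a, b) : rest)
      | p `elem` seen            = go seen rest
      | nullable a && nullable b = True
      | otherwise                =
          go (p : seen)
             (rest ++ [(a', b') | c <- sigma, a' <- pd c a, b' <- pd c b])
```

On the two expressions above, the only reachable pair is the starting pair itself, and its components are not both nullable, so the search ends with no common string found.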

Yes, I am looking for something like that. Where c...

2009-05-06T23:50:00.000-07:00

Yes, I am looking for something like that. Where can I read more about how to relate two regular expressions? Any existing libraries or examples which can do this?

Also I have other questions/thoughts/ideas. But for now the above thing is more important and any direction in this aspect is much appreciated. Thanks

If R1 doesn't match a string s, i.e. s not in ...

2009-05-06T22:40:00.000-07:00

If R1 doesn't match a string s,
i.e. s not in L(R1) and
R2 <= R1, then R2 doesn't match
the string either.
That's what you're looking for?
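As a rough sketch of how that property could be used for skipping, the function below runs the matcher only when no pattern that has already failed is known to contain the current one. Everything here is hypothetical: `pruneMatch` is an illustrative name, `leq` is assumed to be some decision procedure for R2 <= R1, and the toy instance uses literal prefixes in place of real regular expressions.

```haskell
import Data.List (isPrefixOf)

-- Result for each pattern against one input, skipping the matcher whenever
-- an already-failed pattern is known to contain the current one.
-- Assumptions: leq r2 r1 decides that L(r2) is a subset of L(r1);
-- matches is any matcher for a single pattern.
pruneMatch :: (p -> p -> Bool) -> (p -> String -> Bool) -> [p] -> String -> [(p, Bool)]
pruneMatch leq matches ps s = go [] ps
  where
    go _ [] = []
    go failed (r : rs)
      | any (r `leq`) failed = (r, False) : go failed rs        -- skipped
      | matches r s          = (r, True)  : go failed rs
      | otherwise            = (r, False) : go (r : failed) rs

-- Toy instance: the pattern p stands for the regex "p followed by anything".
prefixMatches :: String -> String -> Bool
prefixMatches p s = p `isPrefixOf` s

-- "abc followed by anything" is contained in "ab followed by anything".
prefixLeq :: String -> String -> Bool
prefixLeq r2 r1 = r1 `isPrefixOf` r2
```

Skipping only happens when the larger pattern is tried first, so in practice the patterns would need to be ordered with the most general ones at the front.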

Thanks for reply. So the problem is something like...

2009-05-06T13:09:00.000-07:00

Thanks for the reply. In brief, the problem is this: I have a very large number of regular expressions to match against a set of documents, and I want to do it very fast.
Eg: Match R1, R2....Rn over files F1, F2....Fm
As far as I can tell, there are two kinds of improvement:
1. Use a regex matching engine that is optimized for the kind of regexes I pass. Don't worry about this for now.
2. Optimize the regexes themselves by removing or merging some of them, thereby reducing the total time taken to match all of them over the set of docs.

With respect to the second point (filtering regexes), I would like to establish some kind of partial ordering, or find related regexes, so that I can discard some of them. Eg: if R1 is related to R2 in a suitable way, then I can disregard R2 whenever R1 does not match. Something along these lines.

Thanks for reply. So the problem is something like...

2009-05-05T20:15:00.000-07:00

"multiple regular expressions to match" Can you p...

2009-05-05T14:53:00.000-07:00

"multiple regular expressions to match"

Can you please elaborate, perhaps with a short example?

Are you saying you'd like to reuse the partial results of a previous match?

It is a very nice post. Lot helpful. I have some q...

2009-05-05T14:38:00.000-07:00

It is a very nice post, and very helpful. I have some questions on how to use operations and comparisons over regular expressions. My question is not entirely related to this post, but I am sure people here will have great ideas.

In brief: if I have multiple regular expressions to match and want to do it very fast, can I establish some relationship among them and use it to skip some of the regular expressions? I guess a partial ordering would give such a relationship, but I could not find good information about it on the net.

My PhD student Kenny Lu's forthcoming thesis conta...

2008-11-24T14:29:00.000-08:00

My PhD student Kenny Lu's forthcoming thesis contains more applications of partial derivatives in the context of regular expression pattern matching. I'll summarize the main results in a blog post, once the thesis is through the entire reviewing process.

Antimirov shows that his partial derivative constr...

2008-11-23T16:05:00.000-08:00

Antimirov shows that his partial derivative construction leads to fairly 'optimal' DFAs.

Which makes it all the more wrong to require DFA minimisation for correctness. :-)

I suspect that non-minimal DFAs may be more common in lexer generators, where you're matching against multiple regular expressions in parallel. I have no justification for this, though.

I see. Antimirov shows that his partial derivative ...

2008-11-22T00:26:00.000-08:00

I see.

Antimirov shows that his partial derivative construction leads to fairly 'optimal' DFAs.

Here are some references:

Valentin M. Antimirov and Peter D. Mosses: Rewriting Extended Regular Expressions

Valentin M. Antimirov: Partial Derivatives of Regular Expressions and Finite Automaton Constructions

Yes, this code was intended to implement DFAs, rat...

2008-11-21T21:48:00.000-08:00

Yes, this code was intended to implement DFAs, rather than just compute set intersection.

One reason why I like your method better is that the standard algorithm sometimes ends up with "infinite failure". This regular expression, for example:

(a|b)* - b*(ab*)*

recognises the empty set. The DFA produced looks like this:

q0 -> a q0 | b q0

where q0 is not a final state. This does indeed accept the empty set, but it only does so after consuming the entire string.

Obviously, this behaviour is undesirable in practice. When we wrote the lexer generator for Mercury, we solved the problem by making the subsequent DFA minimisation step do the dirty work. (You put a known "empty set" state with no transitions to or from it into the DFA, then perform state minimisation, then any state which ends up in the same partition as that special state must also be an "empty set" state.)

However, that always sat badly with me. DFA minimisation should be an optional optimisation step, rather than required for correctness.
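One alternative that avoids minimisation is a plain reachability check: a state is an "empty set" state exactly when no final state can be reached from it. The sketch below is only an illustration of that idea, not what the Mercury lexer generator does. The `DFA` record and the names `liveStates` and `isDead` are hypothetical.

```haskell
import Data.List (nub)

type State = Int

-- A DFA as a transition list plus its final states (hypothetical representation).
data DFA = DFA
  { transitions :: [(State, Char, State)]
  , finals      :: [State]
  }

-- States from which some final state is reachable, computed by walking
-- backwards from the final states until nothing new is added.
liveStates :: DFA -> [State]
liveStates dfa = go (nub (finals dfa))
  where
    go live
      | null new  = live
      | otherwise = go (live ++ new)
      where
        new = nub [p | (p, _, q) <- transitions dfa, q `elem` live, p `notElem` live]

-- A state is an "empty set" state if it cannot reach any final state.
isDead :: DFA -> State -> Bool
isDead dfa q = q `notElem` liveStates dfa
```

For the one-state DFA above (q0 loops on a and b, with no final states), `isDead` reports q0 as dead, so a matcher could reject before consuming any input.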

Thanks for the comment Andrew. I had a quick scan ...

2008-11-20T00:16:00.000-08:00

Thanks for the comment Andrew. I had a quick scan through your code. You follow the 'traditional' approach by converting regular expressions to DFAs?

The point of my code is that we can implement standard regular expression operations by rewriting them (without having to go via an explicit DFA construction).

My post is a response to
http://www.iis.sinica.edu.tw/~scm/tag/regular-expression/
I think I've screwed up some of my blog's settings, so Shin couldn't leave a comment.

Here's my version from 2003. I didn't bother impl...

2008-11-19T20:40:00.000-08:00

Here's my version from 2003. I didn't bother implementing intersection or set difference at the time.

Set difference turns out to be slightly more useful in practice than intersection. One example is in C-style comments; they are most naturally defined as "/*", followed by any string not containing "*/", followed by "*/".
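To make the comment example concrete, here is a small sketch that adds a set difference constructor to a Brzozowski-derivative matcher. It is not the 2003 code mentioned above. The datatype, the `Any` and `Diff` constructors, and the names `deriv`, `matches`, `str` and `cComment` are all assumptions for illustration, and no simplification is done, so it is only meant for short inputs.

```haskell
-- Hypothetical datatype with a wildcard letter and set difference.
data RE = Phi | Empty | L Char | Any
        | Choice RE RE | Seq RE RE | Star RE
        | Diff RE RE               -- strings matched by the first but not the second
  deriving (Eq, Show)

-- Does r accept the empty string?
nullable :: RE -> Bool
nullable Phi          = False
nullable Empty        = True
nullable (L _)        = False
nullable Any          = False
nullable (Choice r s) = nullable r || nullable s
nullable (Seq r s)    = nullable r && nullable s
nullable (Star _)     = True
nullable (Diff r s)   = nullable r && not (nullable s)

-- Brzozowski derivative of an expression with respect to the letter c.
deriv :: Char -> RE -> RE
deriv _ Phi          = Phi
deriv _ Empty        = Phi
deriv c (L d)        = if c == d then Empty else Phi
deriv _ Any          = Empty
deriv c (Choice r s) = Choice (deriv c r) (deriv c s)
deriv c (Seq r s)
  | nullable r       = Choice (Seq (deriv c r) s) (deriv c s)
  | otherwise        = Seq (deriv c r) s
deriv c (Star r)     = Seq (deriv c r) (Star r)
deriv c (Diff r s)   = Diff (deriv c r) (deriv c s)

-- Match a whole string by taking derivatives letter by letter.
matches :: RE -> String -> Bool
matches r = nullable . foldl (flip deriv) r

-- The expression matching exactly the given literal string.
str :: String -> RE
str = foldr (Seq . L) Empty

-- "/*", then any string not containing "*/", then "*/".
cComment :: RE
cComment = Seq (str "/*") (Seq body (str "*/"))
  where
    anything = Star Any
    body     = Diff anything (Seq anything (Seq (str "*/") anything))
```

Here `matches cComment "/* hi */"` holds, while `"/* a */ b */"` is rejected because the text between the delimiters would have to contain "*/".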

Thanks for the interesting post and code, as well ...

2008-11-08T19:33:00.000-08:00

Thanks for the interesting post and code, as well as references to partial derivatives!

Some questions: 1. when you said "to use co-induction which then yields a finite (recursive) proof," do you mean the use of convert?

2. Before the definition of convert you said "variables x appears .. at position (r,Var x)," do you mean x appears only at the rightmost positions in the (now a mixture of regular and context-free) grammar? What constraints do we assume on the input of convert, so that we can always convert a regular expression r, possibly with x in it, to r1*r2?

Thanks!