$ \def\Vec#1{\mathbf{#1}} \def\vt#1{\Vec{v}_{#1}(t)} \def\v#1{\Vec{v}_{#1}} \def\vx#1{\Vec{x}_{#1}} \def\av{\bar{\Vec{v}}} \def\vdel{\Vec{\Delta}} $

Harald Kirsch

about this blog
2024-08-18

A Practical Use for Regular Language Intersection

Long story short: Monqjfa now has an intersection operator.

What is "Regular Language Intersection"?

A regular language, simply put, is the set of strings you can match with a given regular expression. The regular expressions you find in programming languages or command line programs, though, are often somewhat enhanced with tricks to extend the theoretical setup somewhat.

There is a correspondence between piecing together two regular expressions and combining the respective sets of strings. If we call $S(r)$ the set of strings matched by a regular expression $r$, then the regular expression x|y corresponds to the set union $S(x)\cup S(y)$. Similarly, xy corresponds to the strings $\{XY \mid X\in S(x) \wedge Y\in S(y)\}$.

Something normally not provided, is the set intersection. There is no regular expression syntax I know of (Java, Tcl, Python, grep, awk, sed, Javascript) which provides the respective operator, say &, such that x&y corresponds to the set of strings $S(x)\cap S(y)$.

But, hey, this surely is pure chichi from Theoretical Computer Science. Who needs this?

Well, recently I stumbled over a use case. Not that everyone out there is in dire need. This is from a pet project of mine, but it was fun.

Some Background about NFA/DFA

For use in string matching, a regular expression is transformed into a finite automaton. They come in two flavors, non-deterministic (NFA) and (DFA) deterministic. Most libraries or programs in use are based on the non-deterministic ones. They are more flexible to allow additional, practically useful tricks and are fast enough after many years of optimization. On the other hand, matching with a DFA is theoretically faster, because where in an NFA the algorithm may need to try several possible parallel branches, (that's the "non"-determinism), traversing a DFA's graph is deterministic and either matches or not. There is no "try out this path ... oh, no, try the other one", etc.

My younger self was curious what it takes to write a regular expression matcher using a DFA. Out came "monqjfa", which you can still find on github. What was it missing: the intersection operation.

Modernized monqjfa and the ooo program

Yet recently I decided to modernize the thing and set it up on codeberg. As a proof of concept, I wrote a grep like utility called Oooo.java for lack of a better name. The script to call it got the even more ingenious name ooo. Rather than partitioning the input by lines, as grep does, you may provide a split=regex on the command line to define what constitutes a "line". And you can provide several matching and transformation rules, for example "r=[0-9]+-><num>{}</num> where the {} is replaced by the match.

The algorithm could first search for a split match and then, in a second, third and more steps, search for the other regular expressions in the string terminated by the split match. But no, this is not how you want to do things if you have a DFA. Instead, all regular expressions, split included, are combined into one big DFA, with their expansion templates stored in the stop states. Then just one matching step is used each time, the longest match is found, the stored template is interpreted and the next match can be started on the rest of the input. No two or more passes over the lines are necessary.

Well, well. Now we want to find import.*foo. This looks completely harmless and everyone has done this with grep. What is the problem with a single DFA which basically encodes a regular expression like \n|import.*foo, where \n is the split regex to match the end of a line? What happens is, that if one line contains import but not foo, the matching treats the next \n as just another character of the .* in import.*foo and keeps gobbling characters until it eventually finds the foo.

Some regular expressions engines, for a reason like this, treat the dot-character not as "match every character", but rather as "match every character except line end". But since ooo's split allows just any odd regular expression, this does not work here.

The Intersection Operator

There we are at the use case for the intersection operator. And for good measure, the set complement operator too. How's that?

What we really would like to encode with a.*b in the above example is:

Match any string starting with a and ending with b except if it contains a match of the split regular expression.

Translated into set operations the "match ... except ... contains" can be translated into a set theoretic description as follows:

And here is how you would write this in the new monqjfa on codeberg:

NfaBuilder.matchRegex("a.*b&(\n)^")

Monqjfa understands the intersection operator & and, given any regular expression r, the caret operator r^ is a shortcut for (.*r.*)~ where ~ provides the complement, such that r^ means "any string not containing a match of r.

With the program ooo every rule is combined with the "does not contain a match of the split regex" automatically, so there you don't have to deal with the details. But now you know how it works internally.

And you may have fun experimenting with NFA, DFA operations. The Java class monq.jfa.FaToDot writes out any NFA or DFA in graphviz format so that the dot program can be used to view the graph of an automaton.

As an example, the graph below was generated with

java -cp monq-3.3.0.jar monq.jfa.FaToDot -dfa 'a.*b&X^' \
   | dot -Tsvg -Grankdir=LR >dfa.svg

This is what we talked above, except that the \n was replaced by an X for better readability in the graph.

The rectangular state on the left is the start state. With nothing but ab you immediately end up in the red stop state. In between is the elliptical state with self loops on many characters: \u000..W, Y..a and c..\uffff. This is for the .* but note what is missing: the X. There is nothing for it there, so it cannot appear between a and b.