A regular language, simply put, is the set of strings you can match with a given regular expression. The regular expressions you find in programming languages or command line programs, though, are often somewhat enhanced with tricks to extend the theoretical setup somewhat.
There is a correspondence between piecing together two regular
expressions and combining the respective sets of strings. If we call
$S(r)$ the set of strings matched by a regular expression $r$, then
the regular expression x|y
corresponds to the set union
$S(x)\cup S(y)$. Similarly, xy
corresponds to the strings
$\{XY \mid X\in S(x) \wedge Y\in S(y)\}$.
Something normally not provided, is the set intersection. There is
no regular expression syntax I know of (Java, Tcl, Python, grep,
awk, sed, Javascript) which provides the respective operator,
say &
, such that x&y
corresponds to
the set of strings $S(x)\cap S(y)$.
But, hey, this surely is pure chichi from Theoretical Computer Science. Who needs this?
Well, recently I stumbled over a use case. Not that everyone out there is in dire need. This is from a pet project of mine, but it was fun.
For use in string matching, a regular expression is transformed into a finite automaton. They come in two flavors, non-deterministic (NFA) and (DFA) deterministic. Most libraries or programs in use are based on the non-deterministic ones. They are more flexible to allow additional, practically useful tricks and are fast enough after many years of optimization. On the other hand, matching with a DFA is theoretically faster, because where in an NFA the algorithm may need to try several possible parallel branches, (that's the "non"-determinism), traversing a DFA's graph is deterministic and either matches or not. There is no "try out this path ... oh, no, try the other one", etc.
My younger self was curious what it takes to write a regular expression matcher using a DFA. Out came "monqjfa", which you can still find on github. What was it missing: the intersection operation.
Yet recently I decided to modernize the thing and set it up on
codeberg. As a proof of concept, I wrote
a grep
like utility
called Oooo.java for lack of a better name. The
script to call it got the even more ingenious
name ooo
. Rather than partitioning the input by
lines, as grep
does, you may provide
a split=regex
on the command line to define what
constitutes a "line". And you can provide several matching and
transformation rules, for
example "r=[0-9]+-><num>{}</num>
where
the {}
is replaced by the match.
The algorithm could first search for a split
match and
then, in a second, third and more steps, search for the other
regular expressions in the string terminated by
the split
match. But no, this is not how you want to do
things if you have a DFA. Instead, all regular
expressions, split
included, are combined into one big
DFA, with their expansion templates stored in the stop states. Then
just one matching step is used each time, the longest match is
found, the stored template is interpreted and the next match can be
started on the rest of the input. No two or more passes over the
lines are necessary.
Well, well. Now we want to
find import.*foo
. This looks completely harmless and
everyone has done this with grep
.
What is the problem with a single DFA which basically encodes a
regular expression like \n|import.*foo
,
where \n
is the split
regex to match the
end of a line? What happens is, that if one line
contains import
but not foo
, the matching
treats the next \n
as just another character of
the .*
in import.*foo
and keeps gobbling
characters until it eventually finds the foo
.
Some regular expressions engines, for a reason like this, treat the
dot-character not as "match every character", but rather as "match
every character except line end". But
since ooo
's split
allows just any odd
regular expression, this does not work here.
There we are at the use case for the intersection operator. And for good measure, the set complement operator too. How's that?
What we really would like to encode with a.*b
in the
above example is:
Match any string starting witha
and ending withb
except if it contains a match of thesplit
regular expression.
Translated into set operations the "match ... except ... contains" can be translated into a set theoretic description as follows:
And here is how you would write this in the new monqjfa on codeberg:
NfaBuilder.matchRegex("a.*b&(\n)^")
Monqjfa understands the intersection operator &
and, given any regular expression r
, the caret
operator r^
is a shortcut for (.*r.*)~
where ~
provides the complement, such
that r^
means "any string not containing a match
of r
.
With the program ooo
every rule is combined with the
"does not contain a match of the split regex" automatically, so
there you don't have to deal with the details. But now you know how
it works internally.
And you may have fun experimenting with NFA, DFA operations. The
Java class monq.jfa.FaToDot
writes out any NFA or DFA
in graphviz
format so that the dot
program can be used to view the
graph of an automaton.
As an example, the graph below was generated with
java -cp monq-3.3.0.jar monq.jfa.FaToDot -dfa 'a.*b&X^' \ | dot -Tsvg -Grankdir=LR >dfa.svg
This is what we talked above, except that the \n
was
replaced by an X
for better readability in the
graph.
The rectangular state on the left is the start state. With nothing
but ab
you immediately end up in the red stop state. In
between is the elliptical state with self loops on many characters:
\u000..W
, Y..a
and c..\uffff
. This is for the .*
but note
what is missing: the X
. There is nothing for it there,
so it cannot appear between a
and b
.
Dies ist kein richtiges Rezept, da wir das jedesmal nach Belieben etwas variieren und ich die Zutatenmengen nie wirklich abgewogen habe.
Pane Carasau ist ein traditionelles Flatenbrot aus Sardinien. Wir bekommen es manchmal beim italienischen Lebensmittelhändler um die Ecke oder bei unserem Bioshop. Frisch gekauft ist es sehr knusprig, in dem Auflauf wird es schön flauschig.
Die Mengen sind Schätzungen.
Previous entries in my Java Wishlist series:
I hope I don't need to cite the
billion dollar mistake or mention that Typescript has
explicit null
typing?
Neither do I need to mention that there are various annotation
frameworks which allow to annotate a field or parameter as to
whether it may be null
or not. Though having half a
gazillion frameworks suggests that you want to use none of them for
compatibility reasons.
The consensus, at least in the "strong typing" community, seems to
be that it is a good thing to explicitly declare whether a value may
be null
or not.
Luckily we have a draft JEP, "Null-Restricted Value Class Types", which will provide a partial solution. For value classes.
In very short terms: an object of a value class "lacks identity",
meaning (my paraphrasing) the compiler may create byte-for-byte
copies if it seems useful, like creating an object copy on the stack
rather than using an object reference. Much like
an int
, as opposed to Integer
.
The syntax for non-null
values, so far, is an
exclamation mark suffix on the type:
Complex! ipi = new Complex(0, Math.PI);
I wish we could have this for every type, not only for value types.