JavaScript/TypeScript has no checked exceptions and this bothers me.
As far as I know, Java introduced checked exceptions, whereby a method signature advertises that a method or constructor may throw an exception of a specific type. For example, the write method of an OutputStream is declared as:
public void write(int b) throws IOException { ... }
When another method calls it, it must do one of two things:
1. use a catch(IOException e) {...} clause to catch and handle it:
try { out.write(0xa); } catch (IOException e) { /* do something reasonable */ }
2. or itself declare throws IOException:
void writeStuff(...) throws IOException { out.write(0xa); }
In the first case it is blatantly obvious that calling write() either works or does not, and the latter case is handled in some way by the catch clause. The second case can be somewhat less obvious if the body of writeStuff gets longer, as it is not clear from which call the declared IOException originates.
Java also provides unchecked exceptions, where neither a catch clause nor a throws declaration is required. Nevertheless, a catch clause may be written somewhere up the call stack, after carefully reading the documentation and/or inspecting the signatures of called methods down the stack in all their breadth to figure out which unchecked exception may pop out.
Unchecked exceptions are all that JavaScript provides. Any function may throw anything at any time, and up the call hierarchy you need to do as described above: put in some hard work with the docs and your IDE to figure out whether you had better catch something or not.
A third approach is for a method or function to return either the normal result or the "exception". In TypeScript this could be written as:
function f(input: SomeInputSource): string | IOException { ... }
In contrast to throwing exceptions, this requires even more explicit action at the caller. It is similar to a checked exception, yet without the automatic re-throw we get in Java when the caller itself declares throws IOException. This is what Rust opted for.
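To make this concrete, here is a minimal TypeScript sketch of what the explicit handling at the caller could look like. The types SomeInputSource and IOException are hypothetical, only there to mirror the signature above:

class IOException extends Error {}
interface SomeInputSource { read(): string; }

// Returns either the normal result or the "other" one, never throws.
function f(input: SomeInputSource): string | IOException {
  try {
    return input.read();
  } catch (e) {
    return new IOException(String(e));
  }
}

function caller(src: SomeInputSource): void {
  const result = f(src);
  if (result instanceof IOException) {
    // No automatic re-throw as with a Java throws clause:
    // the caller must deal with the "other" result right here.
    console.error("could not read:", result.message);
    return;
  }
  // TypeScript narrows result to string in this branch.
  console.log(result.toUpperCase());
}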
So basically there are three ways to deal with what is customarily called "errors" or "exceptions": checked exceptions, unchecked exceptions, and returning the "other" result as a plain value.
There is an ongoing discussion about whether to use checked or unchecked exceptions, just do an Internet search. From the wording above you may have guessed that I am not a fan of unchecked exceptions. The arguments have all been made.
Or not? I think Rust comes close, but let me explain.
The Result<R,E> construct used in Rust is a great step in the right direction, as it puts the "normal" and the "other" result nearly on par. To improve the base for discussion and guide the thinking I suggest:
Do not call the "other" result type "error" or "exception".
I cannot currently come up with a better name than "other". Nearly 50 years of training nudge me into calling it a lot of things, but all with a negative connotation. But the other result should not be seen as bad, exceptional, erroneous, unsuccessful, failed or whatever. Opening a file with a name for which there is no file is normal business. Why treat it differently than the lookup of a key in a map? The key may be mapped to a value or not. Both are possible, neither is good or bad, they are on par and should be treated accordingly.
The same holds for requesting data from an HTTP server. The resource was changed, the server is down, the network is down, the data format is different from what the software at this point expects. There are a gazillion reasons why we might not get the nicely formatted JSON describing the customer address, yet all those gazillion reasons are "exceptional"? Seriously? Maybe getting the nicely formatted JSON should be called the exception. :-)
As another, slightly more tricky example consider the compilation of a regular expression into an internal structure. Two differing use cases can be considered (Java example):
Pattern num = Pattern.compile("[0-9]+");
Pattern grep = Pattern.compile(System.getProperty("pattern"));
The first example uses a pattern literal string provided by the developer. Being forced to handle an alternative result, like a description of a potential syntax error, would be extremely cumbersome. Ideally the compiler would verify the pattern already (as many IDEs do today), but the second best solution in this case, as an exception from my avoidance of unchecked exceptions, would be an unchecked exception or, as in Rust, a panic!.
In the second example, the regular expression is obtained from an external source, like provided by a user, so it is normal business that the string provided is not a regular expression pattern. Rather than just returning, say, null as in the case of a key lookup in a map, it is quite natural to return a description of why the string is not a pattern, customarily called a syntax error.
Java opted for the former, the unchecked exception, so when trying to compile non-literal strings, the developer is not reminded by the compiler: "hey, the string may not be a pattern". The developer must waste precious brain capacity to remember that this may be the case and add the respective handling voluntarily.
But I think the two examples above show that both use cases are relevant, and in such a case I see only one way out: there should actually be two pattern compile functions, one with the "other" result and one doing the "unchecked exception" or "panic" thing.
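As a sketch of what I mean, not tied to Java's Pattern API and with made-up names, the two variants could look like this in TypeScript (using the built-in RegExp constructor for illustration):

// Hypothetical names, only for illustration.
class PatternSyntaxError {
  constructor(public readonly message: string) {}
}

// For patterns from external sources: a bad pattern is a normal "other" result.
function compilePattern(source: string): RegExp | PatternSyntaxError {
  try {
    return new RegExp(source);
  } catch (e) {
    return new PatternSyntaxError(String(e));
  }
}

// For developer-provided literals: a bad pattern is a programming error, so panic.
function compilePatternOrPanic(source: string): RegExp {
  const compiled = compilePattern(source);
  if (compiled instanceof PatternSyntaxError) {
    throw new Error(compiled.message);
  }
  return compiled;
}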
I think the debate about checked or unchecked exceptions would be over quickly if we stopped calling checked exceptions "exceptions" at all. As soon as we accept that those are just a second type of result, on par with what we call the main result so far, it should become much easier to recognize whether we really are in an "unchecked exception"/"panic" use case or not.
Personally I'll want to start using the Rust approach in TypeScript too, in particular as union types make this so easy. The explicit handling on each level in the call hierarchy may look slightly cumbersome. Yet: the more explicit your code, the less you have to remember or reconstruct when you come back to it in a year's time.
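For illustration, one possible encoding of such a Result-like union in TypeScript, not an established library type and with field names of my own choosing, could be:

// One possible encoding of Rust's Result as a discriminated union.
type Result<R, O> =
  | { kind: "value"; value: R }
  | { kind: "other"; other: O };

function lookup(db: Map<string, number>, key: string): Result<number, string> {
  const value = db.get(key);
  return value === undefined
    ? { kind: "other", other: `no entry for ${key}` }
    : { kind: "value", value };
}

const r = lookup(new Map([["Alice", 42]]), "Bob");
if (r.kind === "value") {
  console.log(r.value + 1);
} else {
  console.log("the other result:", r.other); // handled explicitly, on par with the value
}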
And for quality software, its function must be so blatantly obvious that everybody thinks: wow, this is trivial code, even a monkey could've programmed this.
If you're a junior developer and are proud of "wow, I wrote this complex code and I understand it in every detail", make sure to change employers so you won't have to look at this code again some time later, because you will hate yourself.
When working on my Java library for building up and matching text with finite automata, I came across a design problem with regard to the API for building up the finite automata (FA).
The class used to build up FA is called Nfa
in monq for
Nondeterministic Finite Automaton, because that's what it represents. Building
up larger automata from smaller ones amounts to applying binary operations to two automata, for example:
Nfa nfa1 = ...
Nfa nfa2 = ...
Nfa nfa3 = sequence(nfa1, nfa2);
This looks much like your typical binary operation:
String a = "hello";
String b = " world!";
String c = a + b;
From primitive types and String objects we expect that
1. a and b are not changed, and
2. later changes to a or b do not change c.
This is completely natural to expect and "good behaviour" for API operations on objects. So it would be great to have this also for the sequence operation for Nfa objects described above.
BUT: I want to be able to create extremely huge Nfa objects, think of encoding writing variants of dictionary words into regular expressions and combining all of them into one Nfa.
The safest path to guarantee 1. and 2. above is to make Nfa immutable. This, however, would obviously require making full copies of the parameters of the example sequence operation. As mentioned above, this would get extremely tough on memory. Further, it would make some operations on the Nfa objects slow due to the copying, operations which are otherwise next to trivial.
Not all objects are immutable, and for good reasons. We have StringBuilder or HashMap which are of course changed when adding elements. Yet the statements
Map m1 = ...
Map m2 = ...
m1.putAll(m2);
do not change m2, only m1, and later changes to m2 are not reflected in m1. This is, however, only possible because most of the skeleton structure of m2 is actually copied or reproduced in m1 by putAll.
I consider this too expensive for combining Nfa objects. In most operations, a time and space efficient way to combine two FA is to just create one or more links between their object graphs. But as a result, both 1. and 2. above are violated: both input graphs may be changed, and later fumbling with one of the input objects actually fumbles with the combined object's graph too. Weird!
I see three ways out, one of them not actually possible with Java.
Let's assume we go weird, wiring object graphs in place, and let's extend the example from above as:
Nfa nfa1 = ...
Nfa nfa2 = ...
Nfa nfa3 = sequence(nfa1, nfa2);
nfa1.invert();
This would potentially either fail or at least result in quite unexpected results for two reasons:
1. The invert() operation is applied to an Nfa which meanwhile became part of a bigger automaton. Consistency conditions for a stand-alone automaton may no longer hold and the operation may either fail with an exception or, even worse, do some nonsense.
2. Since nfa1 is part of nfa3, the inversion operation will change nfa3 too, quite unexpectedly.

A regular language, simply put, is the set of strings you can match with a given regular expression. The regular expressions you find in programming languages or command line programs, though, are often somewhat enhanced with tricks to extend the theoretical setup.
There is a correspondence between piecing together two regular
expressions and combining the respective sets of strings. If we call
$S(r)$ the set of strings matched by a regular expression $r$, then
the regular expression x|y
corresponds to the set union
$S(x)\cup S(y)$. Similarly, xy
corresponds to the strings
$\{XY \mid X\in S(x) \wedge Y\in S(y)\}$.
Something normally not provided is the set intersection. There is no regular expression syntax I know of (Java, Tcl, Python, grep, awk, sed, JavaScript) which provides the respective operator, say &, such that x&y corresponds to the set of strings $S(x)\cap S(y)$.
But, hey, this surely is pure chichi from Theoretical Computer Science. Who needs this?
Well, recently I stumbled over a use case. Not that everyone out there is in dire need. This is from a pet project of mine, but it was fun.
For use in string matching, a regular expression is transformed into a finite automaton. They come in two flavors, non-deterministic (NFA) and deterministic (DFA). Most libraries or programs in use are based on the non-deterministic ones. They are more flexible, allow additional, practically useful tricks and are fast enough after many years of optimization. On the other hand, matching with a DFA is theoretically faster, because where in an NFA the algorithm may need to try several possible parallel branches (that's the "non"-determinism), traversing a DFA's graph is deterministic and either matches or not. There is no "try out this path ... oh, no, try the other one", etc.
My younger self was curious what it takes to write a regular expression matcher using a DFA. Out came "monqjfa", which you can still find on github. What was it missing? The intersection operation.
Yet recently I decided to modernize the thing and set it up on codeberg. As a proof of concept, I wrote a grep-like utility called Oooo.java, for lack of a better name. The script to call it got the even more ingenious name ooo. Rather than partitioning the input by lines, as grep does, you may provide a split=regex on the command line to define what constitutes a "line". And you can provide several matching and transformation rules, for example "r=[0-9]+-><num>{}</num>", where the {} is replaced by the match.
The algorithm could first search for a split
match and
then, in a second, third and more steps, search for the other
regular expressions in the string terminated by
the split
match. But no, this is not how you want to do
things if you have a DFA. Instead, all regular
expressions, split
included, are combined into one big
DFA, with their expansion templates stored in the stop states. Then
just one matching step is used each time, the longest match is
found, the stored template is interpreted and the next match can be
started on the rest of the input. No two or more passes over the
lines are necessary.
Well, well. Now we want to
find import.*foo
. This looks completely harmless and
everyone has done this with grep
.
What is the problem with a single DFA which basically encodes a
regular expression like \n|import.*foo
,
where \n
is the split
regex to match the
end of a line? What happens is that if one line
contains import
but not foo
, the matching
treats the next \n
as just another character of
the .*
in import.*foo
and keeps gobbling
characters until it eventually finds the foo
.
Some regular expression engines, for a reason like this, treat the dot-character not as "match every character" but rather as "match every character except line end". But since ooo's split allows just any odd regular expression, this does not work here.
There we are at the use case for the intersection operator. And for good measure, the set complement operator too. How's that?
What we really would like to encode with a.*b in the above example is: match any string starting with a and ending with b, except if it contains a match of the split regular expression.
The "match ... except ... contains" can be translated into a set-theoretic description as follows:
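One way to spell this out in the notation from above (my reading of it): write $p$ for a.*b, $s$ for the split regular expression, and $A$ for the set of all strings, i.e. what .* matches. Then the set we are after is $S(p) \cap \overline{A\,S(s)\,A}$, where $A\,S(s)\,A$ contains exactly the strings with a match of $s$ somewhere inside, and the bar denotes the set complement.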
And here is how you would write this in the new monqjfa on codeberg:
NfaBuilder.matchRegex("a.*b&(\n)^")
Monqjfa understands the intersection operator & and, given any regular expression r, the caret operator r^ is a shortcut for (.*r.*)~, where ~ provides the complement, such that r^ means "any string not containing a match of r".
With the program ooo
every rule is combined with the
"does not contain a match of the split regex" automatically, so
there you don't have to deal with the details. But now you know how
it works internally.
And you may have fun experimenting with NFA, DFA operations. The
Java class monq.jfa.FaToDot
writes out any NFA or DFA
in graphviz
format so that the dot
program can be used to view the
graph of an automaton.
As an example, the graph below was generated with
java -cp monq-3.3.0.jar monq.jfa.FaToDot -dfa 'a.*b&X^' \
  | dot -Tsvg -Grankdir=LR >dfa.svg
This is what we talked about above, except that the \n was replaced by an X for better readability in the graph.
The rectangular state on the left is the start state. With nothing but ab you immediately end up in the red stop state. In between is the elliptical state with self loops on many characters: \u0000..W, Y..a and c..\uffff. This is for the .*, but note what is missing: the X. There is nothing for it there, so it cannot appear between a and b.