2025-06-30

Checked Exceptions are not Exceptional

JavaScript/TypeScript has no checked exceptions and this bothers me.

What are "checked exceptions"

As far as I know, Java introduced checked exceptions whereby a method signature advertises that a method or constructor may throw an exception of a specific type. For example the write method of an OutputStream is declared as:

  public void write(int b) throws IOException {
  ...
  }

When another method calls it, it must do one of two things:

Use a catch(IOException e) {...} clause to catch and handle it, or

  try {
    out.write(0xa);
  } catch (IOException e) {
    // do something reasonable
  }

declare itself also as throws IOException.

  void writeStuff(...) throws IOException {
    out.write(0xa);    
  }

In the first case it is blatantly obvious that calling write() either works or not and this other case is handled in some way by the catch clause.

The second case can be somewhat less obvious if the body of writeStuff gets longer as it is not clear from which call the declared IOexception originates.

Alternatives

Java also provides unchecked exceptions where neither a catch clause nor a throws declaration is required. Nevertheless a catch clause may be written somewhere up the call stack after carefully reading the documentation and/or inspecting the signatures of called methods down the stack in all their breadth to figure out which unchecked exception may pop out.

The unchecked exceptions are what JavaScript solely provides. Any function may throw stuff any time and upwards the call hierarchy you need to do as described above: do some hard work on docs and with your IDEs to figure out whether to better catch something or not.

A third approach is for a method or function to return either the normal result or the "exception". In TypeScript this could be written as:

  function f(in: SomeIputSource): string | IOException {
  ...
  }

In contrast to throwing exceptions, this requires even more explicit action at the caller. It is similar to a checked exception, yet without the automatic re-throw we have if the caller declares throws IOException, as is possible in Java. This is what Rust opted for.

So basically there are three ways to deal with what is customarily called "errors" or "exceptions":

Unchecked exceptions as a machine for surprises up the call stack and, for extra fun, at runtime in production. Coding the caller of a function throwing unchecked exceptions has the "benefit" that the developer does not need to think about it. The tester or the customer will tell if an exception is thrown.
Checked exceptions which allow automatic re-throw such that the caller merely declares throwing the same exception but otherwise no extra control structure is necessary — except, of course ultimately somewhere up the call stack, but not in each function along the stack.
Declaring the method to return either a "normal" result or an "exception". In this case the immediate caller must provide some control structure to handle either one or the other type of result.

There is an ongoing discussion about whether or not to use checked or unchecked exceptions, just do an Internet search. From the wording above you may have guessed that I am not fan of unchecked exceptions. The arguments are all made.

Or not? I think Rust comes close, but let me explain.

These are neihther "Exceptions" nor "Errors"

The Result<R,E> construct used in Rust is a great step the the right direction as it puts the "normal" and the "other" result nearly on par. To improve the base for discussion and guide the thinking I suggest:

Do not call the "other" result type "error" or "exception".

I cannot come up with a better name currently than "other". Nearly 50 years of training nudge me into calling it a lot of things, but all with a negative connotation. But the other result should not be seen as bad, exceptional, error, unsuccesful, failed or whatever. Opening a file with a name for which there is no file is normal business. Why treat it differently than the lookup of a key in a map. It may mapped to a value or not. Both is possible, neither is good or bad, they are on par and should be treated respectively.

The same holds for requesting data from an HTTP server. The resource was changed, the server is down, the network is down, the data format is different from what the software at this point expects. There are gazillion reasons why we might not get the nicely formatted JSON describing the customer address, yet all those gazillion reasons are "exceptional"? Seriously? Maybe getting the nicely formatted JSON should be called the exception. :-)

As another, slightly more tricky example consider the compilation of a regular expression into an internal structure. Two differing use cases can be considered (Java example):

Pattern num = Pattern.compile("[0-9]+");
Pattern grep = Pattern.compile(System.getProperty("pattern"));

The first example uses a pattern literal string provided by the developer. Being forced to handle an alternative result, like a description of a potential syntax error would be extremely cumbersome. Ideally the compiler (as many IDEs do today) would verify the pattern already, but the second best solution in this case, as an exception from my avoidance of unchecked exceptions, would be an unchecked exception or, as in Rust, a panic!.

In the second example, the regular expression is obtained from an external source, like provided by a user, so it is normal business that the string provided is not a regular expression pattern. Rather than just returning, say, null as in the case of a key lookup in a map, it is quite natural to return a description of why the string is not a pattern, customarily called a syntax error.

Java opted for the former, so when trying to compile non-literal strings, the developer is not reminded by the compiler: "hey, the string may not be a pattern". The developer must waste precious brain capacity to remember that this may be the case and add the respective handling voluntarily.

But I think the two examples above show that both use cases are relevant and in such a case I see only one way out: there should actually be two pattern compile function. One with "other" result and one doing the "unchecked exception" or "panic" thing.

Conclusion

I think the debate about checked or unchecked exceptions would be over quickly if we stop calling checked exceptions "exceptions" at all. As soon as we accept that those are just a second type of result, on par with what we call the main result so far, it should become much more easy to recognize whether we really are in an "unchecked exception"/"panic" use case or not.

Personally I'll want to start using the Rust approach in TypeScript too, in particular as union types make this so easy. The explicit handling on each level in the call hierarchy may look slightly cumbersome. Yet: the more explict your code, the less you have to remember or reconstruct when you come back to it in a year's time.

And for quality software, its function must be so blatantly obvious that everybody thinks: wow, this is trivial code, even a monkey could've programmed this.

If your're a junior developer and are proud of "wow, I wrote this complex code and I understand it in every detail", make sure to change employer to avoid looking at this code in some time, because you will hate yourself.

2025-01-16

Building Elaborate Data Structures in Java

What's the problem?

When working on my Java library for building up and matching text with finite automata, I came across a design problem with regard to the API for building up the finite automata (FA).

An automaton is a an object graph.
Larger automata are created by combining smaller ones into larger ones.
Often small automata are added to a much larger automaton, meaning the combination operation has parameters of unequal size.
We also have unary operations which transform a given FA (object graph) to give it a different matching behaviour.

The class used to build up FA is called Nfa in monq for Nondeterministic Finite Automaton, because that's what it represents. Building up larger automata from smaller ones amounts to applying binary operations to two automata, for example.

  Nfa nfa1 = ...
  Nfa nfa2 = ...
  Nfa nfa3 = sequence(nfa1, nfa2);

This looks much like your typical binary operation:

  String a = "hello";
  String b = " world!";
  String c = a + b;

From primitive types and String objects we expect that

after these three operations, the values of a and b are not changed, and
later changes to a or b do not change c.

This is completely natural to expect and "good behaviour" for API operations on objects. So it would be great to have this also for the sequence operation for Nfa objects described above.

BUT: I want to be able to create extremely huge Nfa, think of encoding writing variants of dictionary words into regular expressions and combining all of them into one Nfa.

The Options

Go fully immutable

The safest path to guarantee 1. and 2. above is to let implement Nfa immutable. This, however, would obviously require to make full copies of the parameters of the example sequence operation. As mentioned above, would get extremely tough on memory. Further, it would make some operations on the Nfa objects slow, due to the copying which are otherwise next to trivial.

Go weird

Not all objects are immutable, and for good reasons. We have StringBuilder or HashMap which are of course changed when adding elements. Yet the statements

  Map m1 = ...
  Map m2 = ...
  m1.addAll(m2);

do not change m2, only m1 and later changes to m2 are not reflected in m1. This is, however, only possible, because most of the skeletton structure of m2 is actually copied or reproduced in m1 by addAll.

I consider this too expensive for combining Nfa. In most operations, a time and space efficient way to combine two FA is to just create one or more link between their object graphs. But as a result, both 1. and 2. above are violated: Both input graphs may be changed and later fumbling with one of the input objects actually fumbles with the combined object's graph too. Weird!

I see three ways out, one of them not actually possible with Java.

Full weirdness, kind of

Lets assume we go weird, wiring object graphs in place and lets extend the example from above as:

  Nfa nfa1 = ...
  Nfa nfa2 = ...
  Nfa nfa3 = sequence(nfa1, nfa2);
  nfa1.invert();

This would potentially either fail or at least result in quite unexpected results for two reasons:

The invert operation would act on an Nfa which meanwhile became part of a bigger automaton. Consistency conditions for a stand-alone automaton may no longer hold and the operation may either fail with an exception or, even worse, do some nonsense.
Since nfa1 is part of nfa3, the inversion operation will change nfa3 too, quite unexpectedly.

2024-08-18

A Practical Use for Regular Language Intersection

Long story short: Monqjfa now has an intersection operator.

What is "Regular Language Intersection"?

A regular language, simply put, is the set of strings you can match with a given regular expression. The regular expressions you find in programming languages or command line programs, though, are often somewhat enhanced with tricks to extend the theoretical setup somewhat.

There is a correspondence between piecing together two regular expressions and combining the respective sets of strings. If we call $S(r)$ the set of strings matched by a regular expression $r$, then the regular expression x|y corresponds to the set union $S(x)\cup S(y)$. Similarly, xy corresponds to the strings $\{XY \mid X\in S(x) \wedge Y\in S(y)\}$.

Something normally not provided, is the set intersection. There is no regular expression syntax I know of (Java, Tcl, Python, grep, awk, sed, Javascript) which provides the respective operator, say &, such that x&y corresponds to the set of strings $S(x)\cap S(y)$.

But, hey, this surely is pure chichi from Theoretical Computer Science. Who needs this?

Well, recently I stumbled over a use case. Not that everyone out there is in dire need. This is from a pet project of mine, but it was fun.

Some Background about NFA/DFA

For use in string matching, a regular expression is transformed into a finite automaton. They come in two flavors, non-deterministic (NFA) and (DFA) deterministic. Most libraries or programs in use are based on the non-deterministic ones. They are more flexible to allow additional, practically useful tricks and are fast enough after many years of optimization. On the other hand, matching with a DFA is theoretically faster, because where in an NFA the algorithm may need to try several possible parallel branches, (that's the "non"-determinism), traversing a DFA's graph is deterministic and either matches or not. There is no "try out this path ... oh, no, try the other one", etc.

My younger self was curious what it takes to write a regular expression matcher using a DFA. Out came "monqjfa", which you can still find on github. What was it missing: the intersection operation.

Modernized monqjfa and the ooo program

Yet recently I decided to modernize the thing and set it up on codeberg. As a proof of concept, I wrote a grep like utility called Oooo.java for lack of a better name. The script to call it got the even more ingenious name ooo. Rather than partitioning the input by lines, as grep does, you may provide a split=regex on the command line to define what constitutes a "line". And you can provide several matching and transformation rules, for example "r=[0-9]+-><num>{}</num> where the {} is replaced by the match.

The algorithm could first search for a split match and then, in a second, third and more steps, search for the other regular expressions in the string terminated by the split match. But no, this is not how you want to do things if you have a DFA. Instead, all regular expressions, split included, are combined into one big DFA, with their expansion templates stored in the stop states. Then just one matching step is used each time, the longest match is found, the stored template is interpreted and the next match can be started on the rest of the input. No two or more passes over the lines are necessary.

Well, well. Now we want to find import.*foo. This looks completely harmless and everyone has done this with grep. What is the problem with a single DFA which basically encodes a regular expression like \n|import.*foo, where \n is the split regex to match the end of a line? What happens is, that if one line contains import but not foo, the matching treats the next \n as just another character of the .* in import.*foo and keeps gobbling characters until it eventually finds the foo.

Some regular expressions engines, for a reason like this, treat the dot-character not as "match every character", but rather as "match every character except line end". But since ooo's split allows just any odd regular expression, this does not work here.

The Intersection Operator

There we are at the use case for the intersection operator. And for good measure, the set complement operator too. How's that?

What we really would like to encode with a.*b in the above example is:

Match any string starting with a and ending with b except if it contains a match of the split regular expression.

Translated into set operations the "match ... except ... contains" can be translated into a set theoretic description as follows:

$S(\text{.*\\n.*})$ is the set of strings containing at least one newline.
But we don't want to match any of those, rather the complement: $S(\text{.*\\n.*})^c$.
But not just the complement, rather any string in this complement and matching our primary regular expression: $S(\text{a.*b})\cap S(\text{.*\\n.*})^c$.

And here is how you would write this in the new monqjfa on codeberg:

NfaBuilder.matchRegex("a.*b&(\n)^")

Monqjfa understands the intersection operator & and, given any regular expression r, the caret operator r^ is a shortcut for (.*r.*)~ where ~ provides the complement, such that r^ means "any string not containing a match of r.

With the program ooo every rule is combined with the "does not contain a match of the split regex" automatically, so there you don't have to deal with the details. But now you know how it works internally.

And you may have fun experimenting with NFA, DFA operations. The Java class monq.jfa.FaToDot writes out any NFA or DFA in graphviz format so that the dot program can be used to view the graph of an automaton.

As an example, the graph below was generated with

java -cp monq-3.3.0.jar monq.jfa.FaToDot -dfa 'a.*b&X^' \
   | dot -Tsvg -Grankdir=LR >dfa.svg

This is what we talked above, except that the \n was replaced by an X for better readability in the graph.

The rectangular state on the left is the start state. With nothing but ab you immediately end up in the red stop state. In between is the elliptical state with self loops on many characters: \u000..W, Y..a and c..\uffff. This is for the .* but note what is missing: the X. There is nothing for it there, so it cannot appear between a and b.

Harald Kirsch

Pages

Checked Exceptions are not Exceptional

What are "checked exceptions"

Alternatives

These are neihther "Exceptions" nor "Errors"

Conclusion

Building Elaborate Data Structures in Java

What's the problem?

The Options

Go fully immutable

Go weird

Full weirdness, kind of

A Practical Use for Regular Language Intersection

What is "Regular Language Intersection"?

Some Background about NFA/DFA

Modernized monqjfa and the ooo program

The Intersection Operator

Harald Kirsch

Pages button svg

Checked Exceptions are not Exceptional

What are "checked exceptions"

Alternatives

These are neihther "Exceptions" nor "Errors"

Conclusion

Building Elaborate Data Structures in Java

What's the problem?

The Options

Go fully immutable

Go weird

Full weirdness, kind of

A Practical Use for Regular Language Intersection

What is "Regular Language Intersection"?

Some Background about NFA/DFA

Modernized monqjfa and the ooo program

The Intersection Operator

Pages