Is Your Programming Language Data-Oriented?

Konopka Kodes Blog
Jun 2, 2021

For many years, there was a strict hierarchy in my mind: Static languages like Java and C++ are cumbersome and verbose, and require me to write lots of redundant type names and boilerplate code, while dynamic languages such as Python are cool and awesome and productive because defining a new integer variable is as easy as writing “x = 1”.

However, I’ve come to realize that a static language like Kotlin can rival Python in terms of concise code and development speed, and that Python’s faster development cycle is actually due to another reason. The reason some languages are painful to work with is that they are not designed to be data-oriented. In this article, I will demonstrate this using Java.

Java is not data-first

What does the following Java code print?

int[] a = {1, 2, 3};
System.out.println(a);

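It prints nothing helpful ([I@2ff4acf0): the array’s type tag and hash code rather than its contents. To see the elements, you have to know to call Arrays.toString(a) instead. A similar thing happens when you print an ArrayList of objects whose class doesn’t override toString(): every element shows up as ClassName@hash.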

How do you define a literal map in Java? Until Java 9, your best option was:

Map<String, String> myMap = new HashMap<String, String>();
myMap.put("a", "b");
myMap.put("c", "d");

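Java 9 allows you to use Map.of(…), but this is done hackily with function overloads, so it only works for up to 10 entries; beyond that you need Map.ofEntries(…) with every pair wrapped in Map.entry(…), and the resulting maps are immutable either way. As of 2021, there is still no literal syntax for this in Java. Witness the full travesty here: https://stackoverflow.com/a/6802502/2111778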

Contrast this with Python’s a = [1, 2, 3] and my_map = {"a": "b", "c": "d"}, which have reasonable definitions and string representations.
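For comparison, here is a small sketch of the Python side when actually run. The variable names follow the snippets above; the nested dictionary is only an illustration.

```python
# Literals for the two most common containers.
a = [1, 2, 3]
my_map = {"a": "b", "c": "d"}

# print() shows the data itself, not a type tag and a hash code.
print(a)       # [1, 2, 3]
print(my_map)  # {'a': 'b', 'c': 'd'}

# Nested structures stay readable as well.
nested = {"ids": a, "meta": my_map}
print(nested)  # {'ids': [1, 2, 3], 'meta': {'a': 'b', 'c': 'd'}}
```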

Java is designed with an object-oriented mindset, so most of your data will actually be hidden behind indirect references.

Lists and Maps in Java

Java’s list type is List<T>, which defines a list that only accepts objects of a type T of your choice.* However, you cannot instantiate it directly, only implementations of it: List<T> is an interface for different list types which support basic list operations (adding to the back, indexing at any position, etc.). The idea is that you can keep this “List contract” and swap out implementations at any time.

          List<T>
         /       \
ArrayList<T>   LinkedList<T>

Similarly for Map<K, V> and Set<E>:

                Map<K, V>
          /         |         \
HashMap<K, V>  TreeMap<K, V>  LinkedHashMap<K, V>

This seems logical, so what is the issue here? Any working programmer can tell you that 99% of the time, you want to use an ArrayList<T> and a HashMap<K, V>/HashSet<E>. In fact, that is essentially what Kotlin’s and Python’s built-in list and map types are. Java’s way makes no sense: For new programmers, it increases the risk of choosing the wrong implementation. For experienced programmers, it adds unnecessary line noise. Just make the right thing the default.

Conclusion: Java hides useful classes deep in its package hierarchy under obscure names. Even the most basic print command is buried under System.out.println! In contrast, Python libraries usually put commonly-used functions at top-level rather than hiding them inside a huge module hierarchy. (The only exception I can think of is xml.etree.ElementTree, but is that really a surprise considering it’s XML? 😉)

* Little footnote: Remember T? The type T has to derive from Object, so it isn’t possible to use primitive types: you cannot have a List<int>, for example, only lists of objects. This introduces an additional layer of indirection and data hiding, because every int now has to be wrapped in a holder object (java.lang.Integer) before it can go into a list. (There is an ugly workaround called auto-(un)boxing, which converts between int and Integer for you, but in my experience it only works half the time: comparing two Integer objects with == compares references rather than values, and unboxing a null throws a NullPointerException.)

Testing

Writing tests in Java is a colossal pain.

In a past job I worked on a Java microservice which had 5k lines of business code, and 10k lines of test code. Most tests weren’t complicated, often just one positive and one negative case, but Java’s lack of true data classes, refusal to overload operators, and other data hiding tendencies made setting up even simple business objects incredibly cumbersome.
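To make “true data classes” concrete, here is a rough Python sketch of what they buy you in a test. Customer and Order are hypothetical stand-ins for business objects, not classes from the service described above.

```python
from dataclasses import dataclass, field


@dataclass
class Customer:  # hypothetical business object
    name: str
    country: str = "DE"


@dataclass
class Order:  # hypothetical business object
    customer: Customer
    items: list = field(default_factory=list)


# Setting up a nested object is a single expression.
order = Order(Customer("Ada"), ["book", "pen"])

# The generated __eq__ compares field by field, so one assert
# covers the whole object tree.
assert order == Order(Customer("Ada", "DE"), ["book", "pen"])

# The generated __repr__ prints the data itself.
print(order)
```

The decorator generates the constructor, equality, and string representation, which is exactly the code that otherwise has to be written or asserted on by hand.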

In Java, testing works through dependency injection: Rather than a piece of code getting a resource like a file or the current time directly, that code is instead placed in its own separate class which receives TimeProvider and DateProvider members through its constructor. Since all of your class’s interactions with the outside world now go through these proxy objects, you can “simply” pass in a bunch of mock objects when testing that class.

Of course, this massively blows up your code size. At one job I was asked to factor out the code if (timestamp == null) timestamp = now(); into its own class, which after adding imports, members, and constructors resulted in a file 30 lines long. Setting up the object graph of all the mock objects needed to write just one test for just one method would often take over 50 lines, and stepping down the hierarchy to assert on every sub-object took dozens more, so you can imagine how we ended up at 10k lines of tests.
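The shape of that refactoring can be sketched roughly as follows. It is written in Python only to keep it short; the Java version additionally needs imports, field declarations, and visibility modifiers. TimeProvider is named after the collaborator mentioned above, while TimestampDefaulter and FixedTimeProvider are hypothetical names made up for this sketch.

```python
from datetime import datetime, timezone


class TimeProvider:  # proxy object for "the current time"
    def now(self):
        return datetime.now(timezone.utc)


class TimestampDefaulter:  # hypothetical class wrapping the one-line default
    def __init__(self, time_provider):
        # The dependency is injected through the constructor.
        self._time_provider = time_provider

    def fill(self, timestamp):
        if timestamp is None:
            timestamp = self._time_provider.now()
        return timestamp


class FixedTimeProvider:  # hand-written mock, used only in tests
    def __init__(self, fixed):
        self._fixed = fixed

    def now(self):
        return self._fixed


# The test wires the mock in by hand before it can assert anything.
fixed = datetime(2021, 6, 2, tzinfo=timezone.utc)
defaulter = TimestampDefaulter(FixedTimeProvider(fixed))
assert defaulter.fill(None) == fixed
```

One line of logic ends up surrounded by three classes, and every further dependency adds another provider and another mock.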

In Python, you can patch most functions simply by name, be they your own or library functions (sadly, some functions implemented in C are harder to patch). This means you don’t have to contort yourself into writing code in an “injectable” way, and you don’t need to create tons of mock objects during testing.
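A minimal sketch of patching by name, using unittest.mock from the standard library. The function fill_timestamp is a hypothetical Python rendering of the one-line timestamp default mentioned earlier, and the fixed value is arbitrary.

```python
import time
from unittest import mock


def fill_timestamp(timestamp=None):
    # The default stays inline; no provider class is needed.
    if timestamp is None:
        timestamp = time.time()
    return timestamp


# Replace the library function by its dotted name for the duration
# of the block; the original is restored when the block exits.
with mock.patch("time.time", return_value=1622592000.0):
    patched = fill_timestamp()

assert patched == 1622592000.0
```

This works because fill_timestamp looks up time.time at call time. Attributes of some built-in types cannot be replaced this way, which is the limitation mentioned above.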

Private fields

In the Java community, it is considered good practice to declare the fields of a class as private and access them using getters/setters rather than directly (or deny access altogether). The idea is to allow hiding the internal representation of the field and allow an unchanging public class API. This may be useful when writing a library, but makes no sense when writing an application, where you would want to use an updated and improved variable name everywhere, and IDEs are plenty capable of changing it everywhere at once.

At a previous job, I had to work with an XML parsing library that contained a bug: After 10 parsing errors, it would stop outputting any more errors, to prevent denial-of-service attacks. This was intended to be per-call, but a bug was causing this limit to be for the program lifetime. Many expensive team hours went into investigating this issue, but since the field was private static we were unable to account for this bug by manually resetting the count.

(By the way, XML is another excellent case study of data hiding by drowning data among line noise and wasting programmers’ brain cycles: https://blog.codinghorror.com/xml-the-angle-bracket-tax/)

Do not hide your data inside of private fields. In Python, all fields and functions are public. Fields and functions not intended for outside use start with a single underscore (_) by convention. Having “emergency access” to private fields is useful when debugging and in the rare case where there is no better workaround.
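As a sketch of what such emergency access looks like, here is a made-up class that mimics the bug described above. ErrorReporter, _reported, and _LIMIT are hypothetical names; this is not the actual XML library.

```python
class ErrorReporter:  # hypothetical stand-in for the buggy library class
    _reported = 0     # class-level counter, "private" by convention only
    _LIMIT = 10

    def report(self, message):
        # Bug: the counter lives on the class, so it is never reset per call.
        if ErrorReporter._reported >= ErrorReporter._LIMIT:
            return False  # silently dropped once the limit is reached
        ErrorReporter._reported += 1
        return True


reporter = ErrorReporter()
results = [reporter.report(f"error {i}") for i in range(12)]
assert results == [True] * 10 + [False] * 2

# Emergency access: the underscore is a warning, not a lock,
# so the application can reset the counter itself.
ErrorReporter._reported = 0
assert reporter.report("reported again")
```

The workaround is ugly and may break on the next library update, but it is available when nothing better is.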

Conclusion

Java actively hides data from you. There is a saving grace: Among Java developers, debugger use is very common, which helps mitigate this issue a bit. You can set a breakpoint and rather than printing out data, you can go to the active variables pane and click-click-click your way deep into your object hierarchy to somehow get to your data. I’d much rather keep coding in data-oriented languages though!

Konopka Kodes Blog

25/M software engineer from Düsseldorf, Germany. Developer of Mundraub Navigator (Android app) and Jangine (chess engine).