
Are Rust, C++ and WASM the new tools for Data Engineering?

· 3 min read

Traditional data engineering tools are suffering in performance and scalability. JVM-based tools are becoming outdated, while newer languages are gaining popularity. Will Rust and WASM replace the current JVM-based data engineering stack?


I started my career as a C/C++ developer 20 years ago, working on network protocols and embedded systems. Over time, I moved more and more into the data space and my level of abstraction moved up the stack, with obviously less control over what is going on under the hood. When you go from C/C++ to Java, everything seems rosy in the beginning, but soon, when you start struggling with memory allocation, garbage collection and similar things, you realise that you are losing the power you had in your hands during your old C/C++ days. The advantage of the JVM, though, is pluggability.

If you design your software well, you can allow pretty much any customisation to be plugged in at a binary level, just by adding a new JAR to your classpath. Where things get trickier, however, is scriptability. In many situations you want your software to be scriptable using languages like JavaScript. It is possible, but the level of integration between scripting languages and the JVM is not that great, and performance is often poor. Think, for example, of the Spark and Python integration. That required a bridge like Py4J to make it work, but at a huge performance cost. Things have got better with support for formats like Arrow, but I remember the first version of PySpark was pretty crappy and almost unusable.

However, I have a feeling things are starting to change. People are realising that maybe the JVM is not the best option for building data-intensive applications. But what's the alternative? Rust has recently become very popular, partly thanks to the support of the blockchain community, and developers have started to realise that it can be used to build large, scalable systems. And where do we need scalability today? Data! We have to handle more and more data, and the current tooling is clearly not scaling up. A case in point: Databricks rewrote Apache Spark's execution engine in C++, with huge benefits in terms of performance and scalability. At the same time, several startups are taking a similar direction. Look at Redpanda, which implements a much leaner version of Kafka entirely in C++. Many companies are following, and will follow, this trend.

But how do we allow pluggability in these systems? Meet WASM, the new kid on the block. WASM is fundamentally a portable, low-level bytecode format that integrates seamlessly with C++ and Rust. The beauty of it is that WASM can be generated from multiple languages like C, C++, AssemblyScript (a variation of TypeScript), Rust, Kotlin and others. You can even compile a full Python interpreter to WASM and host the execution of a Python script! As more and more languages support compilation to WASM or LLVM, the possibilities are endless. Now I think you understand where I'm going with this! By bridging high-performance languages like C++ or Rust with WASM, we get the best of both worlds: performance, scalability and pluggability. I truly believe in this new pattern, and that is the reason why at Dozer we are building the next generation Data APIs stack entirely using Rust and WASM. Stay tuned!
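To make the plugin idea concrete, here is a minimal sketch of what such a WASM plugin could look like in Rust. The function name, signature and doubling logic are purely illustrative (not from any specific product): the point is that an exported `extern "C"` function compiled to a `wasm32` target becomes a symbol a host engine can load with a runtime like Wasmtime and call on each record.

```rust
// A minimal, hypothetical WASM plugin written in Rust.
// Build for WASM with: cargo build --target wasm32-unknown-unknown --release
//
// `#[no_mangle]` plus `extern "C"` give the function a stable exported
// symbol, so a WASM host embedded in a C++ or Rust engine can look it
// up by name and invoke it.
#[no_mangle]
pub extern "C" fn transform(value: i64) -> i64 {
    // Illustrative record-level transformation: double the input.
    value * 2
}
```

The same source compiles natively too, which makes plugins easy to unit-test before shipping them as `.wasm` modules.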