While it may have looked like a quiet summer for Kamu after releasing the Open Data Fabric, a lot of work has actually been going on in our secret repos and hidden branches. After we solidified our architectural direction in ODF, it was time to re-assess our technology stack to put ourselves on track for a steady pace of improvements over the next few years.
The Old Stack
In the early days we placed a big bet on Apache Spark as our primary data transformation engine. In fact, we expected most of our code to run as an application inside the Spark framework, so it made sense for us to pick the Scala language to integrate with Spark easily. This also meant we could tap into decades' worth of progress in data-related technologies across the JVM ecosystem.
While it served the purpose of our MVP well, the Scala+JVM data ecosystem presented us with a number of significant challenges:
- Binary size bloat - We had to dismiss the idea of distributing `kamu-cli` as an embedded Spark application pretty early, simply because a framework like Spark, even when compiled, is over 230 MiB. That size would have been a major adoption blocker for us. It is quite hard to believe that 230 MiB of binary code is justified for manipulating something as simple as structured/tabular data.
- Dependency hell - An application like `kamu-cli` interacts with many protocols and formats to achieve its goal, so we had a large number of dependencies: HTTP/FTP client libraries, Parquet readers, codecs for different compression formats, SQL connectors, UI libs, etc. Trying to combine all these dependencies into a single binary that, on top of everything, links with Spark and therefore thousands of its transitive dependencies basically guaranteed that our project was in a constant state of incompatible dependency versions (a problem known as "dependency hell"). Combined with the previous problem, this forced us to isolate the coordinator application from Spark completely and use Docker to simplify the distribution of our version of Spark. Even that didn't solve the dependency problem entirely, since connecting to Spark's SQL/Thrift server required a number of very old libraries with many transitive dependencies of their own.
- Latency - Even after moving Spark into Docker, the `kamu-cli` binary was about 90 MiB, and it took the JVM over a second to start executing the app. This may not sound significant (especially compared to the >15s startup time of Spark), but it resulted in a very poor user experience for our CLI. Responsiveness of tools is extremely important.
- Performance - We think that modern data frameworks have drifted quite far from the hardware due to the abstraction layers of the JVM and the like. The lengths these projects go to in order to gain more performance (e.g. off-heap memory management, JIT compilation, GC optimizations) are essentially fights against the self-imposed constraints of the JVM. Without getting too deep into this topic, our belief is that in the long term we will see data science moving much closer "to the metal" than it is now, to fully utilize the potential of modern CPUs/GPUs.
Many of these problems led us to the idea of separating data processing Engines (like Spark and Flink) from the Coordinator binary, letting the coordinator be small, nimble, fast, and user-friendly, while the complexity of modern data frameworks remains well hidden inside Docker images.
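To make the idea concrete, here is a minimal sketch of what this separation could look like in Rust. This is not Kamu's actual implementation: the `Engine` trait, the `SparkEngine` struct, the image tag, and the request path are all hypothetical, purely for illustration. The point is that the coordinator talks to engines through a narrow interface and merely launches their Docker containers, so none of the engine's dependencies leak into the coordinator binary:

```rust
use std::process::Command;

// Hypothetical interface: the coordinator only knows how to hand an
// engine a transformation request and wait for the result.
trait Engine {
    fn execute_transform(&self, request_path: &str) -> std::io::Result<()>;
}

// Hypothetical implementation: all of Spark's complexity stays inside
// its Docker image; the coordinator just starts a container.
struct SparkEngine {
    image: String,
}

impl Engine for SparkEngine {
    fn execute_transform(&self, request_path: &str) -> std::io::Result<()> {
        let status = Command::new("docker")
            .arg("run")
            .arg("--rm")
            .arg("-v")
            // Mount the request file into the container (illustrative path).
            .arg(format!("{}:/request.yaml", request_path))
            .arg(&self.image)
            .status()?;
        if status.success() {
            Ok(())
        } else {
            Err(std::io::Error::new(
                std::io::ErrorKind::Other,
                "engine container exited with an error",
            ))
        }
    }
}

fn main() -> std::io::Result<()> {
    // The coordinator stays a small native binary; the heavyweight
    // engine lives entirely inside its container image (tag is made up).
    let engine = SparkEngine {
        image: "example/engine-spark:latest".to_string(),
    };
    engine.execute_transform("/tmp/transform-request.yaml")
}
```

A nice side effect of this design is that engines become swappable: as long as a new engine speaks the same interface, the coordinator doesn't care what runs inside the container.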
The New Stack
While we hope that frameworks will also take a turn towards better decomposition, shedding VM layers, and gaining more "mechanical sympathy", we decided not to wait for that ourselves. We picked Rust, the most promising systems programming language available today, and are placing a bet that it will have a great future in data science.
We are happy to announce that `kamu-cli` version 0.30.0 is a full rewrite of the app in Rust. We believe this leaves us with a very strong foundation to evolve our product for many years to come.
This has already resulted in:
- Much smaller binary (only 4 MiB!)
- Blazing fast responsiveness
- Surprisingly, even better portability (wasn't that one of the main selling points of the JVM?)
- No signs of dependency hell
This concludes our quick progress update. I'll leave you with a quick demo of our slick new UI:
Managing data has never been this pleasant!