Every company I meet today has a data platform. And if they don’t have one, they want one. The problem is that building and maintaining a data platform is not trivial. First, multiple tools need to be integrated: Airflow, Spark, Presto, Kafka, Flink, Snowflake, and potentially many more. More importantly, a dedicated engineering team must be set up to maintain them and make sure everything runs smoothly. And what usually happens is that, after months and months of accumulating data, the cost of running such infrastructure outweighs the benefit.
So the question is: do you really need a data platform?
Get back to the basics
Let’s take the example of a mid-size company embarking on the adventure of building a data platform. Generally, they do it for two purposes:
- Data analytics: Being able to generate analytical dashboards from historical data
- AI and advanced use cases such as real-time user personalisation
Typically, for the first use case you’d set up a Snowflake or Databricks instance and dump all your data there. But wait! Do you really need it? Very likely you will not have petabytes of data to manage. How about something leaner?
If you are familiar with the data space, you have probably heard recently about tools like Pola.rs, DataFusion or DuckDB! If you have not, they are small and highly efficient OLAP query engines that achieve impressive performance. They are so efficient because their authors made the decision to go back to basics. Forget about distributed data processing frameworks like Apache Spark with their inefficient network shuffling! Forget about 20-year-old languages like Java or Scala (with all the GC problems they bring along)! Embrace simplicity using lower-level languages like C/C++ or, even better, Rust, and squeeze every CPU cycle to get as much performance as possible.
So, it’s pretty trivial to dump all your data from your OLTP databases into an S3 bucket, bring up multiple ad-hoc instances of DuckDB, Pola.rs or DataFusion, run all your OLAP queries, and shut everything down, all for a negligible TCO. Multiple companies have realised the potential of such an approach and are building what I call “poor man’s data platforms” around these tools. MotherDuck is doing this with DuckDB, for example.
How about real-time?
While this lean approach is easy to achieve for batch workloads, it is not so trivial once we start addressing more complex use cases like AI or real-time personalisation. Real-time is a lot harder, and in many scenarios the goal of real-time use cases is not just producing analytical dashboards, but the full integration of the data with customer-facing applications, enabling another level of interactivity. The simplest example is probably user personalisation: data from multiple sources needs to be combined, an ML model might be applied, and, in some scenarios, data should be updated based on user behaviour. All this in real time!
Achieving this today is not trivial. Some companies have given up on handling all this in real time, because it’s simply too complex and expensive. Think, for instance, of how reverse ETL and personalisation APIs are really implemented today in most cases: everything is still batch! Data is pulled from your sources using tools like Airbyte or Fivetran and loaded into your Snowflake or Databricks. Then, every day or hour, you run your dbt jobs, which extract the data you need, run your ML models, and load the results into some cache or low-latency database for serving. Companies are trying to come up with solutions to simplify the process, but everything is still batch!
If you want something more than this, it is definitely possible, but it is complex! You need an entire infrastructure capable of handling real-time data (e.g. Kafka), a stream processing engine (e.g. Spark Streaming, Flink, Kafka Streams), one or more low-latency data stores chosen to match the query patterns of your application (e.g. Redis, Aerospike, Elasticsearch), an API layer and, most importantly, a data engineering team capable of putting all these pieces together!
Enter the Data Apps world
So, is there a way to achieve the same simplicity of DuckDB or Pola.rs for something like this? Probably yes, and the answer is data apps. What are data apps? There is no formal definition yet, but the way I like to describe a data app is:
A self-contained monolith application that is capable of efficiently serving data and, at the same time, reacting to data changes in real time and performing complex operations such as joins, aggregations, ML predictions, notifications, and more.
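The definition above can be illustrated with a deliberately tiny, stdlib-only Python sketch. Everything here is hypothetical (event shapes, names, the idea of a per-user spend view); it only shows the two halves of the definition in one process: an ingest path that reacts to change events and a serve path for a user-facing application.

```python
from collections import defaultdict

class DataApp:
    """Toy data app: one monolith that both ingests changes and serves reads."""

    def __init__(self):
        # A materialised view kept up to date incrementally: user -> total spend.
        self.totals = defaultdict(float)

    def on_event(self, event):
        # React to a data change in real time (think: a CDC record
        # arriving from an OLTP database).
        self.totals[event["user"]] += event["amount"]

    def serve(self, user):
        # Low-latency read path for the customer-facing application.
        return self.totals.get(user, 0.0)

app = DataApp()
for e in [{"user": "a", "amount": 10.0}, {"user": "a", "amount": 5.0}]:
    app.on_event(e)
print(app.serve("a"))  # 15.0
```

A real data app would replace the dict with incremental joins, aggregations and ML inference over actual source streams, but the shape is the same: ingest, derive, serve, all in one self-contained backend.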
The definition is intentionally generic. But, fundamentally, I see data apps as the bridge between source systems and user-facing applications, enabling a high level of data interactivity and actionability.
Forget about streams, caches, pipelines, etc! Just put a data app backend between the source systems and the user application and magic can happen!
Some of these ideas have been pioneered by a very successful tool called Streamlit: a Python framework allowing data scientists to quickly prototype data apps. While Streamlit is a beautiful and powerful tool, it has not yet unlocked the full potential of data apps, especially when an entire backend ecosystem has to be connected.
The software engineer’s perspective
As a full-stack engineer, I want the superpowers of a full data engineering team!
The bottom line
If all this is possible, it means all the complexity of a typical data platform with a lambda or kappa architecture is gone. Batch workloads can be handled by tools like DuckDB, and real-time workloads can be handled by a set of real-time data apps spread across the organization, sitting between the source systems and the users.
The philosophy behind all this is what led us to create Dozer: a real-time data app backend targeted specifically at full-stack and frontend engineers. Our mission is to give data superpowers to the full-stack developer!