Have you ever found yourself re-inventing the wheel as a data engineer to ultimately build some APIs? In this article, I want to go through the motivations that led us to build something like Dozer.
A few years ago, we implemented a unified API layer at DBS Bank in Singapore to serve all mobile banking users across all regions, languages and products with personalisation features. The project cost several million dollars, involved more than 50 people, and took more than two years.
You must be wondering, "Wait! That's a lot of people to build some APIs for a mobile app?" 60% of the effort went into building the core infrastructure and plumbing needed to move and transform data (ETL tools), cache and index data (caching layer), and build APIs on top of the cache. Because this was a bank, consistency was paramount. The project was massive in scale and scope. I spoke about it here in this video:
As a data engineer, I have found myself reinventing the wheel and solving the exact same problem several times throughout my career. Most often this involved integrating and maintaining several tools that are over-equipped for serving data efficiently to consumer-facing applications.
I spoke about this problem with my co-founder Vivek, and it turned out he had also had to solve the same problem several times in different contexts. We started iterating on the concept and had an "Aha!" moment when we came across Bulldozer, an internal platform developed by Netflix to solve the same problem.
Since then, the idea has evolved a lot, and far more sophisticated tools are now available in the market, but our core vision has remained the same, and we started building Dozer together.
We envisioned an open-source and extensible alternative that lets developers go from data sources to APIs seamlessly without giving up on performance.
TL;DR: A plug-and-play data cache that instantly gives you blazing-fast APIs
We took a horizontal and opinionated approach that cuts across different product categories. We ended up building parts of a streaming database, an ETL tool, a cache with search capabilities, and on-the-fly API generation.
Dozer hypercharges your existing databases, data warehouses and data lakes with easy-to-integrate search, analytical and real-time capabilities.
Read more about our Architecture here
That's a tall claim to make, and we write this article with respect for the amazing software and tools that inspired us. We want to use the rest of the article to walk through how we arrived at our solution.
Why not Hasura, Flink, ElasticSearch, and Airbyte?
All of the above are great tools, and they are packed with features. But as a data engineer, I still have to integrate many of them to put together an end-to-end solution.
Why not use a streaming database or a warehouse? Streaming databases solve a lot of fundamental database problems that I don't always care about. What I care about is getting updates in real time and being able to serve APIs.
Why not use tools such as Hasura and GraphQL on top of your databases? This looks like a great solution at first, and Hasura is obviously very successful software. However, we did not want to rely on the underlying SQL engine of the data sources for queries.
We also personally prefer gRPC over GraphQL for low latency. And we are more interested in data and read APIs than in an end-to-end API orchestration platform.
If you are interested, we have a comparison page where we answer some of the questions most commonly asked by early investors.
We wanted developers to have full power and control over building data apps all the way from a data source to APIs. Let's discuss some of the points below.
- Single binary as opposed to a multitude of tools
- Blazing Fast APIs
- Well documented data contracts
- Horizontal scalability & Cloud compatibility
- Transform using SQL
- gRPC and REST
The concept of data contracts has recently become popular. In a distributed environment where responsibilities are split between domains, data contracts play a big role in managing dependencies and operating apps with confidence.
Dozer automatically generates Protobuf definitions and statically typed support. Even with REST, Dozer generates OpenAPI documentation. This gives developers a statically typed experience that helps them avoid mistakes.
Correctness and Consistency
When managing several tools, especially with poor data quality, it is easy to make mistakes that compromise data consistency.
Combine across sources
In many scenarios, data is distributed across several microservices and business units. Creating an API that combines and aggregates data across sources requires a complex engineering build. Dozer lets you join and aggregate across different sources in real time, enabling powerful features such as customer personalization.
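To make the idea of joining and aggregating across sources concrete, here is a minimal sketch in Python. It is not Dozer's API or implementation. The class `JoinedSpendView`, its methods, and the two example sources (a customer service and a transaction service) are hypothetical names chosen for illustration. The sketch keeps a running aggregate that is updated on every incoming event, so a read never has to query the original sources.

```python
from collections import defaultdict


class JoinedSpendView:
    """Hypothetical view: total spend per customer, joined with the customer name."""

    def __init__(self):
        self.names = {}                   # customer_id -> name (from source A)
        self.totals = defaultdict(float)  # customer_id -> running sum (from source B)

    def on_customer(self, customer_id, name):
        # Event from the (assumed) customer service: insert or update a name.
        self.names[customer_id] = name

    def on_transaction(self, customer_id, amount):
        # Event from the (assumed) transaction service: update the aggregate in place.
        self.totals[customer_id] += amount

    def read(self, customer_id):
        # Serve the joined row from local state only; None if the customer is unknown.
        if customer_id not in self.names:
            return None
        return {
            "customer_id": customer_id,
            "name": self.names[customer_id],
            "total_spend": self.totals.get(customer_id, 0.0),
        }


view = JoinedSpendView()
view.on_customer(1, "Asha")
view.on_transaction(1, 40)
view.on_transaction(1, 60)
print(view.read(1))
```

The design point is that the join and the aggregation happen at write time, as events arrive, so reads are a simple lookup.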
Data latency is a measure of the time from when data is generated to when it is queryable. In a distributed context, it is very difficult to manage data latency SLAs, especially considering end-to-end from sources to APIs.
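The definition above can be written down as a small calculation. The following Python sketch assumes each record carries two timestamps, `generated_at` and `queryable_at`. Both field names, the `Event` type and the `meets_sla` helper are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    generated_at: float  # seconds: when the source produced the record
    queryable_at: float  # seconds: when the record became visible through the API


def data_latency(event):
    # Data latency as defined above: time from generation to queryability.
    return event.queryable_at - event.generated_at


def worst_latency(events):
    # The slowest record end to end, from source to API.
    return max(data_latency(e) for e in events)


def meets_sla(events, sla_seconds):
    # A simple SLA check: every record must be queryable within the bound.
    return worst_latency(events) <= sla_seconds


events = [Event(10.0, 10.25), Event(20.0, 21.5)]
print(worst_latency(events), meets_sla(events, 2.0))
```

In a pipeline built from several tools, `queryable_at` is only known at the very last hop, which is part of why end-to-end latency SLAs are hard to manage.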
After data processing, data stores such as ElasticSearch are typically used to power data APIs. Vivek has personally solved many similar problems with ElasticSearch deployments. Dozer offers gRPC and streaming out of cached data with a subset of ElasticSearch features. One of our design goals is to solve the most commonly used patterns very well and to integrate them deeply into the main product.
Scalability is a huge challenge when it involves a complex deployment of data pipelines + streaming database + API Servers + Caches. Each of them presents a layer of scalability problems.
Dozer can be run as a single process for simple applications or can be run in a distributed fashion where writing and reading are decoupled. This is a cost-effective approach where reading has a very low overhead and can be scaled on demand. Dozer API servers can be deployed on serverless platforms such as AWS Lambda.
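One way to picture decoupled writing and reading is a single append-only change log that any number of read replicas replay at their own pace. The Python sketch below illustrates that general pattern under stated assumptions. It does not describe how Dozer implements it, and `ChangeLog` and `ReadReplica` are hypothetical names.

```python
class ChangeLog:
    """Hypothetical append-only log owned by the single writer."""

    def __init__(self):
        self.entries = []

    def append(self, key, value):
        self.entries.append((key, value))


class ReadReplica:
    """Hypothetical reader that builds its own cache by replaying the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # position of the next log entry to apply
        self.cache = {}

    def catch_up(self):
        # Apply only the entries this replica has not seen yet.
        for key, value in self.log.entries[self.offset:]:
            self.cache[key] = value
        self.offset = len(self.log.entries)

    def get(self, key):
        # Reads touch only the local cache, never the writer.
        return self.cache.get(key)


log = ChangeLog()
log.append("user:1", {"tier": "gold"})

replica_a = ReadReplica(log)
replica_b = ReadReplica(log)  # a replica added later to handle more read traffic
replica_a.catch_up()
replica_b.catch_up()
print(replica_a.get("user:1"), replica_b.get("user:1"))
```

Because replicas keep no write-side state beyond an offset, adding or removing them on demand is cheap in this sketch, which mirrors the cost argument made above.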
There are a multitude of other problems that we haven't touched upon, such as:
- Data Privacy
- Observability & Lineage etc
We will cover some of these topics in separate articles.
Open Source & Extensible
Dozer is an open-source-first company, and the platform is designed to be extensible. Developers can build connectors to new data sources and add new transformations. Dozer is built in Rust, which is known for its performance and safety. WASM support will be added soon.
We will also talk about our Rust experience in detail in a separate article.
Dozer is still at a very early stage and under active development. Please fork us on GitHub and join our Discord channel.
You can find samples in our docs portal.