Skip to main content

Like AirByte + ElasticSearch, but 15x faster

· 6 min read
Matteo

AirByte and ElasticSearch are often used together to extract and serve data more efficiently. However, this might not be the best solution.

TL;DR:

IngestionLookupComplex Query*
AirByte+ElasticSearch~480s~24ms~250ms**
Dozer~30s~1.3ms~1.3ms**

*Complex Query refers to queries with aggregations and joins
**Complex queries performance cannot be directly compared. Read below for more details.


AirByte and ElasticSearch are often used together to offload read traffic from relational databases like PostgreSQL or transform raw data from JSON or CSV files into APIs. Below is a typical setup:

Airbyte+ElasticSearch setup

AirByte periodically pulls data from databases or files and populates ElasticSearch through its REST APIs. After data is cached and indexed, it can be easily consumed by microservices or front-end applications through ElasticSearch’s APIs.

Let’s give it a try with the NY Taxi Dataset

I decided to experiment with this setup and the NY Taxi Dataset. The initial goal was to populate ElasticSearch with ~14 million rows, loading data from a compressed parquet file of ~350 MB.

I tried multiple times, but the operation failed continuously, due to JVM memory constraints. I tried to play with the JVM's -Xms and -Xmx parameters (2g, 4g, 8g) but without any success; the sync job was always failing with a 137 exit code. I coudn't increase the JVM heap any further, as I was running both AirByte and ElasticSearch docker images in the same machine.

AirByte failure

Eventually, I decided to downsize the data file and opted for a smaller one. This time I used a ~2.5 million rows parquet file. This time, synchronization ran smoothly, taking about 8 minutes.

AirByte success

Let’s test it with Dozer

One of the use cases where Dozer thrives is exactly the one I described above. Dozer embeds a data movement engine, a reactive SQL transformation engine, and a caching engine in a single Rust binary. It might seem too much at first, but this design allows us to achieve performances that are an order of magnitude better.

Dozer loaded and indexed the same ~2.5 million records in less than 30 seconds. Results were cached in its internal database and immediately available through gRPC and REST APIs.

Dozer ingestion

Dozer could ingest and index all records at a rate ranging between 50k and 100k records/sec.

How about querying?

Another equally important aspect is query performance. Once all data was ingested I ran a simple record lookup on ElasticSearch and Dozer.

On average, ElasticSearch’s REST APIs latency was around 24ms.

Dozer offers two types of APIs: REST and gRPC. For performance reasons, we obviously always recommend using the latter.

Using the gRPC endpoint, the latency was about 1.3 ms, almost 20 times faster than ElasticSearch.

I also wanted to see how Dozer would perform under stress. I queried Dozer’s cache 100k times, with a concurrency of 100. These are the results:

Dozer query latency

Dozer still performed pretty nicely, with an average response time below 6ms.

How about complex queries?

This is where an apple-to-apple comparison between AirByte+ElasticSearch and Dozer is not really possible. Let’s see why.

The AirByte+ElasticSearch solution allows you to move and serve data very quickly, but if transformations, aggregations, or joins are required, the solution is not straightforward.

We can perform these operations as part of the ElasticSearch query, but this would dramatically degrade the query performance and put a lot of pressure on ElasticSearch.

Alternatively, we can dump all data into an intermediate storage (i.e a Data Warehouse like Snowflake), perform all operations there in a batch fashion, and then export it to ElasticSearch. This option will result in something like this:

Airbyte, Snowflake and Elastic

I’d say this is the most common approach, used by many companies nowadays.

In contrast, Dozer does not require this intermediate step. Instead, it allows you to perform transformations, aggregations, and joins while data is in transit. This approach has multiple advantages:

  • Data in the cache is always fresh: transformations are executed in real-time as events happen. There is no need to wait for batch transformations to complete before exporting data to the caching layer.

  • Guaranteed query performance: data is always pre-materialized. This always guarantees predictable query latency.

  • No intermediate data warehouse: as data is always transformed in real-time, there is no additional cost for any external data warehouse.

Dozer’s real-time transformation engine

Going back to our previous NY Taxi dataset example, suppose we want to calculate the MIN and MAX trip time between location IDs and join the main trips dataset with a reference zones file. A SQL query to achieve it would look like this:

SELECT 
puz.Zone as pickup_location,
doz.Zone as dropout_location,
MIN(trips.trip_distance) as min_trip_distance,
MAX(trips.trip_distance) as max_trip_distance
FROM trips
INNER JOIN zones puz ON trips.PULocationID = puz.LocationID
INNER JOIN zones doz ON trips.DOLocationID = doz.LocationID
GROUP BY puz.Zone, doz.Zone;

First I tested a similar query on ElasticSearch; latency was exceeding 250ms.

With Dozer, these transformations can be performed in real-time in a reactive fashion. As raw data flows in, the query is incrementally executed and results are materialized in the cache.

With Dozer, running the above query on the entire stream of 2.5m records and materializing the results into the cache takes less than 5 seconds.

Because of this approach, Dozer is not required to join or aggregate data on the fly when data is queried, resulting in a predictable query latency of 1.3ms like before.

Conclusion

Both solutions have advantages ad disadvantages. However, if the goal is to quickly combine and transform multiple sources and create an efficient Data API layer, we believe Dozer offers several advantages, both in terms of cost and ease of use.

In some scenarios, users might still need a data warehouse for post-fact analytics. In those cases, Dozer can source data from data warehouses such as Snowflake or Deltalake.

All tests were run on a MacBook Pro equipped with the M1 Pro processor, 16g of RAM and Dozer 0.1.13. The 15m records dataset is available here, while the 2.5m records dataset is available here.

Dozer is fully open source and available from our GitHub