Skip to main content

Dozer performance analysis using a MySQL database

· 6 min read
Sagar

In this blog post, we embark on a journey into the realm of data analytics using a vast IMDb dataset. Leveraging the power of Dozer, we will conduct a series of experiments to unveil meaningful insights hidden within the world of movies and TV shows. Our experiments will involve SQL queries, data transformations, and performance analysis.

What is Dozer?

Dozer is a powerful open-source data API backend that simplifies the process of ingesting realtime data from various sources and serving it through REST and gRPC endpoints. It supports a wide range of data sources, including databases, and object storage services etc. You can learn more about Dozer from its official documentation and its GitHub repository.

Prerequisites

Before we begin, make sure you have the following:

  • MySQL server up and running.
  • Dozer installed. For more installation instructions, visit the Dozer documentation.
  • Basic knowledge of SQL for writing data transformation queries.
  • For the complete guide, visit here.

Project Structure

The IMDB-Analytics sample is available in dozer-samples repository.

The dataset used here is taken from CMU 15-445/645 Coursework which also has many interesting ideas for SQL queries.

Schema

This sample is divided into four experiments which describe different configurations, configuration 1 runs dozer directly from source to cache, whereas configuration 2,3,4 perform the same task using different type of SQL queries.

Step 1: Understand the dataset

The dataset has six tables: akas , crew , episodes , people , ratings , title, with roughly 17 million data records in total.

TableNo of Rows
akas4_947_919
crew7_665_208
episodes157_869
people1_782_303
ratings188_159
title1_375_462

Steps to load the data into MySQL can be found here

Step 2: Running Dozer

For running Dozer with configuration 2 i.e. directly from source to cache run the following command.

dozer clean -c exp2-config.yaml
dozer build -c exp2-config.yaml
dozer run -c exp2-config.yaml

For running with any other configuration file, simply replace the configuration file name in the above command.

This will start Dozer and it will begin ingesting data from the specified MySQL database and populating the cache. You can see the progress of the execution from the console. The results of the SQL queries will be available at the specified endpoints.

Step 3: Querying the Dozer APIs

Dozer generates automatc REST and gRPC APIs based on the endpoint configuration provided in the dozer config. We can now query the Dozer endpoints to get the results of our SQL queries. You can query the cache using gRPC or REST.

Dozer API provides the option to add query such as order by, limit and many more... to improve the visibility of the data on the endpoints.

The query described can be passed to the REST endpoints GET:localhost:8080/endpoint1 to order the data in descending order w.r.t titles.

{
"$order_by":{"titles":"desc"}
}

Performance Analysis

Hardware Specifications

CPUCoresMemory
Ryzen7 4800H816GB

It is also important to note, the experiments are being run on a NVMe SSD which offers a higher read speed than conventional SATA storage systems.

Direct Configuration

Insights

Insights

  • Roughly took 7 mins to process all the records.
  • Pipeline latency is very low (~0.0002s) as there is no transformation involved.
Start TimeEnd TimeElapsed
11:51:23 AM11:58:04 PM~ 7 mins

It is interesting to note here that about 17 million records were ingested and outputted on the endpoints in about 7 minutes. That roughly equates to 40 thousand records completely processed per second!

Double JOINs, CTEs and JOINs, Sub queires

The goal is to get the count of action titles any actor/actress has acted in, and then ordering by titles to find the hero/heroine with most action titles!

Double JOIN

 select p.name, c.category, count(1) as titles
into out_table
from people p
join crew c on p.person_id = c.person_id
join titles t on t.title_id = c.title_id
where (c.category = 'actor' or c.category = 'actress')
and t.genres like '%Action%'
group by p.name,c.category;

CTE & JOINs

 with acting_crew as (
select person_id, category
from crew
where category = 'actor' or category = 'actress'
),

action_titles as (
select title_id
from titles
where genres like '%Action%'
)

select p.name, r.category, count(1) as titles
into out_table
from people p
join acting_crew r on p.person_id = r.person_id
join action_titles a on a.title_id = r.title_id
group by p.name,r.category;

Sub Queries

 SELECT p.name, r.category, COUNT(1) AS titles
INTO out_table
FROM (
SELECT person_id, category, title_id
FROM crew
WHERE category = 'actor' OR category = 'actress'
) r
JOIN (
SELECT title_id
FROM titles
WHERE genres LIKE '%Action%'
) a ON a.title_id = r.title_id
JOIN people p ON p.person_id = r.person_id
GROUP BY p.name, r.category;

The statistics for all the experiments mentioned are quite similar since Dozer SQL Engine automatically optimizes the queries to improve upon processing speed. Statistics shown here are for CTEs and JOINs, for viewing all the statistics visit here

Source

Insights

Stores

Insights

  • Roughly took 2 mins to process all the records.
  • Took less time than direct configuration since,
    • less data was ingested as well as given out.
    • excellent pipleline latency even with two JOIN operations.
  • Pipeline latency stayed under 0.25s.
Start TimeEnd TimeElapsed
11:04:40 PM11:06:34 PM~ 2 mins

Conclusion

As you can see, Dozer makes it easy to ingest real-time data from an MySQL database and expose it as queryable APIs. With just a simple configuration, you can write powerful SQL queries that will give out meaningful data analysis, and instantly get low-latency data APIs. This makes Dozer a powerful tool for quickly building data products.

For more information and examples, check out the Dozer GitHub repository and dozer-samples repository. Happy coding, Happy Data APIng! 🚀👩‍💻👨‍💻