Skip to main content

Mastering Dozer SQL Queries: A Comprehensive Overview

· 9 min read
Sagar

Welcome to our comprehensive guide on using Dozer SQL, where we explore the versatility of Dozer SQL queries you can perform with this powerful tool. In this guide, we'll be working with two datasets from NYC's TLC Trip Record Data. The guide is divided into sections, ranging from simple filtering and aggregation to more advanced topics like JOINs, Common Table Expressions (CTEs), subqueries, UNION operations, and window functions such as TTL, Tumble etc. So, let's dive in and explore various Dozer SQL queries!

What is Dozer?

Dozer is a powerful open-source data API backend that simplifies the process of ingesting realtime data from various sources and serving it through REST and gRPC endpoints. It supports a wide range of data sources, including databases, and object storage services etc. You can learn more about Dozer from its official documentation and its GitHub repository.

Prerequisites

Before we begin, make sure you have the following:

  • Dozer installed. For more installation instructions, visit the Dozer documentation.
  • Basic knowledge of SQL for writing data transformation queries.
  • For the complete guide, visit here.

Basic query structure

Since Dozer fetches data from sources, and puts it out at the endpoints therefore it is crucial for every Dozer SQL statement to have a top-level SELECT clause as well as atleast one FROM clause.

These are pretty standard rules for a normal retrieve statement in SQL, however, Dozer SQL also requires the use of an INTO clause. This will specify the name of the table that the endpoints will use to fetch the data from.

Hence, the basic statement structure is,

 SELECT A INTO C FROM B;

The datatypes and casting compatible with Dozer SQL are described in the documentation for datatypes and casting.

Dozer SQL also supports primitive scalar function described in documentation for scalar functions.

Contents

The guide is broadly divided into 7 sections, with each section describing how to use a particular function or clause.

For directly jumping into executing Dozer SQL queries follow,

Downloading and understand the dataset

We will use two tables i.e. trips and taxi_zone_lookup throughout this guide. Both of these can be joined over LocationID which is a primary key in taxi_zone_lookup and PULocationID,DOLocationID from trips reference this.

trips

table 1

taxi_zone_lookup

table 2

Steps to load the data into MySQL can be found here

1. Filtering

Let us write a query to calculate the tips and trip_miles for rides that cover more than 100 miles, along with an additional filter which ensures that the result will be for hvfhs_license_num ending with 03 only.

For this we can use the WHERE clause.

SQL Query and Structure

  SELECT tips, trip_miles
INTO table1
FROM trips
WHERE trip_miles > 100
AND hvfhs_license_num LIKE '%03';

filtering_graph

To checkout the configuration file click here, additionally to run the configuration file checkout the instructions.

2. Aggregation

Let us see few aggregations operation possible with Dozer. For more information on aggregation operations, head over to the aggregation documentation

Let us write two queries,

  • Query to find the average_tips rounded upto two decimal places as well as the total trips made for each Pickup location.

  • Query to find the maximum tipped Pickup location as well as the minimum tipped Pickup location.

SQL Query and Structure

Query 1

  SELECT PULocationID, ROUND(AVG(tips),2) as avg_tips, COUNT(1) as trip_count
INTO table1
FROM trips
GROUP BY PULocationID;

Query 2

  SELECT MAX_VALUE(tips,PULocationID) as max_tipping_location, MIN_VALUE(tips,PULocationID) as min_tipping_location
INTO table2
FROM trips;

aggregation_graph

To checkout the configuration file click here, additionally to run the configuration file checkout the instructions.

3. JOIN

Let's see how to JOIN two sources with Dozer. The two tables i.e. taxi_zone_lookup and trips can be JOINed over the LocationID.

Let us write a query to calculate the average tips for rides taken from a particular Zone. Afterwards we will order this to find the zones where the customers have higher chances of tipping a good amount. Additionally a HAVING clause is also added to filter out the results.

SQL Query and Structure

  SELECT zp.Zone, ROUND(AVG(t.tips),2) as avg_tips, COUNT(1) as trip_count
INTO table1
FROM trips t
JOIN taxi_zone_lookup zp ON t.PULocationID = zp.LocationID
GROUP BY zp.Zone
HAVING ROUND(AVG(t.tips),2) > 1.50;

join_graph

To checkout the configuration file click here, additionally to run the configuration file checkout the instructions.

4. Common Table Expressions (CTEs)

CTE functions potentially improve readability, as well as give us the option to reuse queries.

Let us write a query to calculate the average tips as well as number of pickups in high frequency zones. Notice how we can have both the CTEs i.e. good_tip_pickups & high_frequency_pickups in one CTE as well, but in this example let us go ahead with two CTEs to describe how to write multiple CTE functions.

SQL Query and Structure

 WITH good_tip_pickups AS (
SELECT PULocationID, avg(tips) as avg_tip
FROM trips
GROUP BY PULocationID
HAVING avg(tips) > 1.00
),
high_frequency_pickups AS (
SELECT PULocationID, count(1) as trip_count
FROM trips
GROUP BY PULocationID
HAVING count(1) > 100
)
SELECT Zone, avg_tip, trip_count
INTO table1
FROM taxi_zone_lookup tzl
JOIN good_tip_pickups gtp ON tzl.LocationID = gtp.PULocationID
JOIN high_frequency_pickups hfp ON tzl.LocationID = hfp.PULocationID;

cte_graph

To checkout the configuration file click here, additionally to run the configuration file checkout the instructions.

5. Sub queries

Let us see how to write nested sub queries.

Let us write a query to calculate the average tips for rides taken from a particular Zone. The outermost query JOINs the taxi_zone_lookup with a nested subquery hfz which does the job of filtering and grouping the trips table. hfz contains another sub query which ensures prior filtering of trips with license plates that end with 3.

SQL Query and Structure

SELECT tzl.Zone, hfz.avg_tips, hfz.trip_count
INTO table1
FROM taxi_zone_lookup tzl
JOIN (
SELECT PULocationID, COUNT(1) AS trip_count, AVG(tips) AS avg_tips
FROM (
SELECT PULocationID, tips
FROM trips
WHERE hvfhs_license_num LIKE '%3'
)
GROUP BY PULocationID
HAVING COUNT(1) > 100 AND AVG(tips) > 1.00
) hfz ON tzl.LocationID = hfz.PULocationID

squery_graph

To checkout the configuration file click here, additionally to run the configuration file checkout the instructions.

6. UNION

Let us write a query to calculate the average tips and total trips for a particular Zone. Here we are considering the ride to be taken in a Zone if either of the Pickup Location or Dropoff Location match the Zone, which requries the use of UNION. The UNION is taken inside a CTE all_trips which is then joined with taxi_zone_lookup.

SQL Query and Structure

WITH all_trips AS (
SELECT PULocationID as LocationID, SUM(tips) AS sum_tip, COUNT(1) AS total_count
FROM trips
WHERE PULocationID != DOLocationID
GROUP BY PULocationID
HAVING COUNT(1) > 100 AND AVG(tips) > 1.00

UNION

SELECT DOLocationID as LocationID, SUM(tips) AS sum_tip, COUNT(1) AS total_count
FROM trips
GROUP BY DOLocationID
HAVING COUNT(1) > 100 AND AVG(tips) > 1.00
)
SELECT tzl.Zone, SUM(sum_tip)/SUM(total_count) AS avg_tips, SUM(total_count) AS trip_count
INTO table1
FROM taxi_zone_lookup tzl
JOIN all_trips at ON tzl.LocationID = at.LocationID
GROUP BY tzl.Zone;

union_graph

To checkout the configuration file click here, additionally to run the configuration file checkout the instructions.

7. Window functions

Let us see how to use the TUMBLE and HOP functions using Dozer SQL. These functions are often useful with real time analytics over vast streams of incoming data. To read more about window functions read the documentation.

Here we describe two queries,

  • Query to calculate the sum of tips obtained for a particular Pickup location over a 5 minutes window.

  • Query to calculate the sum of tips obtained for a particular Pickup location over a 5 minutes window but the windows overlap by 2 minutes. i.e. the 5 minutes is divided into,

    • 2 minutes overlapping with past window
    • 1 minute non overlapping
    • 2 minutes overlapping with next window

SQL Query and Structure

Query 1

  SELECT t.PULocationID as location, SUM(t.tips) AS total_tips, t.window_start as start, t.window_end AS end
INTO table1
FROM TUMBLE(trips, pickup_datetime, '5 MINUTES') t
GROUP BY t.PULocationID, t.window_start, t.window_end;

Query 2

  SELECT t.PULocationID as location, SUM(t.tips) AS total_tips, t.window_start as start, t.window_end AS end
INTO table2
FROM HOP(trips, pickup_datetime, '2 MINUTE', '5 MINUTES') t
GROUP BY t.PULocationID, t.window_start, t.window_end;

wfunctions_graph

To checkout the configuration file click here, additionally to run the configuration file checkout the instructions.

8. TTL function

The TTL function provides a way to manage the memory usage in Dozer, particularly when dealing with vast streams of data. By setting up a TTL, it ensures that only relevant (or recent) data is held in memory, providing a balance between data retention and memory efficiency. TTL is based on the record's timestamp, ensuring that data eviction is contextually relevant.

To read more about window functions read the documentation.

Here we describe two queries that will only use fresh data obtained over a 5 minute window,

  • Query to calculate the sum of tips obtained for a particular Pickup location grouped over a 2 minutes window.

  • Query to calculate the sum of tips obtained for a particular Pickup location grouped over a 3 minutes window but the windows overlap by 1 minutes. i.e. the 3 minutes is divided into,

    • 1 minutes overlapping with past window
    • 1 minute non overlapping
    • 1 minutes overlapping with next window

SQL Query and Structure

Query 1

  SELECT t.PULocationID as location, SUM(t.tips) AS total_tips, t.window_start as start, t.window_end AS end
INTO table1
FROM TTL(TUMBLE(trips, pickup_datetime, '2 MINUTES'), pickup_datetime, '5 MINUTES') t
GROUP BY t.PULocationID, t.window_start, t.window_end;

Query 2

  SELECT t.PULocationID as location, SUM(t.tips) AS total_tips, t.window_start as start, t.window_end AS end
INTO table2
FROM TTL(HOP(trips, pickup_datetime, '1 MINUTE', '3 MINUTES'), pickup_datetime, '5 MINUTES') t
GROUP BY t.PULocationID, t.window_start, t.window_end;

ttl_graph

To checkout the configuration file click here, additionally to run the configuration file checkout the instructions.

Conclusion

As you can see, Dozer SQL lets you write powerful SQL queries that will give out meaningful data analysis, and instantly get low-latency data APIs. This makes Dozer a powerful tool for quickly building data products.

For more information and examples, check out the Dozer GitHub repository and dozer-samples repository. Happy coding, Happy Data APIng! 🚀👩‍💻👨‍💻