
Real-Time Data Ingestion from AWS S3 using Dozer: A Comprehensive Tutorial

7 min read
Abhishek

In this tutorial, we will guide you through the process of setting up Dozer to ingest real-time data from an AWS S3 bucket and expose it as queryable APIs. This tutorial is designed to be a comprehensive guide, providing you with all the necessary steps, from setting up your environment to running and querying your data from AWS S3 with Dozer. We will be using a sample stock trading dataset, but the principles can be applied to any dataset stored in an AWS S3 bucket.

What is Dozer?

Dozer is a powerful open-source data API backend that simplifies the process of ingesting real-time data from various sources and serving it through REST and gRPC endpoints. It supports a wide range of data sources, including databases and object storage services. You can learn more about Dozer from its official documentation and its GitHub repository.

Why use Dozer?

Dozer is designed to simplify the process of ingesting data from various sources and serving it through APIs. Through a single configuration file, you can create the entire pipeline needed for APIs that perform complex data transformations in real time. For this example we will be using the Dozer S3 connector, which allows us to continuously monitor and ingest data from S3 buckets and serve it through APIs. With a streaming SQL engine and cache built entirely in Rust, you can build APIs that update not in hours or minutes but in seconds!

Prerequisites

Before we begin, make sure you have the following:

  • An AWS account with access to S3 services.
  • The AWS CLI installed and configured with your AWS credentials (a quick sanity check is sketched just after this list).
  • Python installed for running the data generation script.
  • Dozer installed. You can install Dozer using the following command: cargo install --git https://github.com/getdozer/dozer dozer-cli --locked. For more installation instructions, visit the Dozer documentation.
  • Basic knowledge of SQL for writing data transformation queries.
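
If you want to confirm that your credentials and bucket access are in order before moving on, a quick check along the following lines can help. This is a minimal sketch, assuming boto3 is installed (pip install boto3); the bucket name is a placeholder, not part of the tutorial.

# check_aws_setup.py -- quick sanity check of AWS credentials and bucket access
# Assumes boto3 is installed; the bucket name below is a placeholder.
import boto3
from botocore.exceptions import ClientError

BUCKET_NAME = "my-dozer-stocks-bucket"  # replace with your own bucket

# Verify that credentials are picked up (from `aws configure`, env vars, etc.)
identity = boto3.client("sts").get_caller_identity()
print(f"Authenticated as: {identity['Arn']}")

# Verify that the bucket exists and is reachable with these credentials
s3 = boto3.client("s3")
try:
    s3.head_bucket(Bucket=BUCKET_NAME)
    print(f"Bucket '{BUCKET_NAME}' is accessible.")
except ClientError as err:
    print(f"Cannot access bucket '{BUCKET_NAME}': {err}")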

Project Structure

The AWS S3 connector sample is available in the dozer-samples repository.

Our project consists of the following:

  • A Python script for generating and uploading data to an S3 bucket.
  • A Dozer configuration file (YAML) that defines the data sources, transformations, and APIs.

Step 1: Generate and Upload Data to S3

If you don't have a dataset ready, you can use the provided Python script to generate a dataset and upload it to an S3 bucket. The script generates a dataset of stock trading data with fields such as date, ticker, open price, high price, low price, close price, and volume. It then uploads this dataset to a specified S3 bucket.

Screenshot: Python script to generate and upload data to S3
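
The provided script in the dozer-samples repository does the heavy lifting; the sketch below only shows the general shape of such a script. It is a hedged, minimal version: the field values, record count, and the bucket and key names are illustrative assumptions, and it assumes pandas and boto3 are available.

# generate_and_upload.py -- illustrative sketch of the data generation step
# Field names match the dataset described below; BUCKET_NAME and KEY are
# placeholders, not values from the original sample script.
import random
import datetime
import boto3
import pandas as pd

BUCKET_NAME = "my-dozer-stocks-bucket"   # replace with your own bucket
KEY = "stocks/stocks.csv"                # matches the `path: stocks` folder in the config
TICKERS = ["AAPL", "GOOG", "MSFT", "AMZN"]

rows = []
for i in range(2_000_000):               # roughly 2 million records, as in the tutorial
    date = datetime.date(2025, 1, 1) + datetime.timedelta(days=i % 31)
    open_price = round(random.uniform(100, 1500), 2)
    close = round(open_price * random.uniform(0.98, 1.02), 2)
    rows.append({
        "Date": date.isoformat(),
        "Ticker": random.choice(TICKERS),
        "Open": open_price,
        "High": round(max(open_price, close) * 1.01, 2),
        "Low": round(min(open_price, close) * 0.99, 2),
        "Close": close,
        "Volume": random.randint(1_000, 100_000),
    })

# Write the CSV locally, then upload it to the S3 bucket Dozer will watch
pd.DataFrame(rows).to_csv("stocks.csv", index=False)
boto3.client("s3").upload_file("stocks.csv", BUCKET_NAME, KEY)
print(f"Uploaded stocks.csv to s3://{BUCKET_NAME}/{KEY}")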

Understanding the Dataset

Our dataset consists of stock trading data, with each record representing a trade. Each record includes the date of the trade, the ticker symbol of the stock, the opening price, the highest price, the lowest price, the closing price, and the volume of the trade. The dataset contains approximately 2 million records, providing a substantial amount of data for our analysis.

Here is a sample of the dataset:

Date,Ticker,Open,High,Low,Close,Volume
2025-01-02,AAPL,150.00,152.00,148.00,150.00,5000
2025-01-02,GOOG,1200.00,1210.00,1190.00,1200.00,2000

If you already have some dataset in your S3 bucket, you can skip this step and proceed to the next one.

Step 2: Configure Dozer

The next step is to configure Dozer. Dozer uses a single YAML configuration file to define the data sources, the tables to ingest, the SQL queries to run on the ingested data, and the endpoints that serve the results.

Here's an example of a Dozer configuration file:

connections:
  - config: !S3Storage
      details:
        access_key_id: {{YOUR_ACCESS_KEY}}
        secret_access_key: {{YOUR_SECRET_KEY}}
        region: {{YOUR_REGION}}
        bucket_name: {{YOUR_BUCKET_NAME}}
      tables:
        - !Table
          name: stocks
          config: !CSV
            path: stocks # path to files or folder inside a bucket
            extension: .csv
    name: s3

sql: |
  -- Ticker Analysis
  SELECT Ticker, AVG(Close) AS average_close_price, SUM(Volume) AS total_volume
  INTO ticker_analysis
  FROM stocks
  WHERE Date >= '2025-01-01' AND Date < '2025-02-01'
  GROUP BY Ticker;

  -- Daily Analysis
  SELECT Date, AVG(Close) AS average_close_price, SUM(Volume) AS total_volume
  INTO daily_analysis
  FROM stocks
  GROUP BY Date;

  -- Highest Daily Close Price
  SELECT Date, MAX(Close) AS highest_close_price
  INTO highest_daily_close
  FROM stocks
  GROUP BY Date;

  -- Lowest Daily Close Price
  SELECT Date, MIN(Close) AS lowest_close_price
  INTO lowest_daily_close
  FROM stocks
  GROUP BY Date;

sources:
  - name: stocks
    table_name: stocks
    connection: !Ref s3
    columns:

endpoints:
  - name: ticker_analysis
    path: /analysis/ticker
    table_name: ticker_analysis

  - name: daily_analysis
    path: /analysis/daily
    table_name: daily_analysis

  - name: highest_daily_close
    path: /analysis/highest_daily_close
    table_name: highest_daily_close

  - name: lowest_daily_close
    path: /analysis/lowest_daily_close
    table_name: lowest_daily_close

telemetry:
  metrics: !Prometheus # You can check the dozer metrics at http://localhost:9000

In this configuration file, we define a connection to our AWS S3 bucket and specify the tables we want to ingest. We also define several SQL queries that will be run on the ingested data and the endpoints where the results of these queries will be available.

Step 3: Running Dozer

Once the configuration file is set up, we can start Dozer by running the following command in the terminal:

dozer -c dozer-config.yaml

This will start Dozer and it will begin ingesting data from the specified AWS S3 bucket and populating the cache. You can see the progress of the execution from the console. The results of the SQL queries will be available at the specified endpoints.

Screenshot: Dozer console log while ingesting data from AWS S3

Step 4: Querying the Dozer APIs

Dozer automatically generates REST and gRPC APIs based on the endpoint configuration provided in the Dozer config. We can now query these endpoints to get the results of our SQL queries, using either gRPC or REST. For example, to get the ticker analysis, we can send a GET request to the /analysis/ticker endpoint. Here are some example queries:

gRPC

Screenshot: Dozer gRPC query for AWS S3 ingested data

REST

Screenshot: Dozer REST endpoint query for AWS S3 ingested data

This will return the average closing price and total volume for each ticker for the specified date range.
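
As a concrete illustration, the snippet below hits the ticker analysis endpoint over REST. This is a minimal sketch that assumes the requests library, Dozer's default REST port of 8080, and that a plain GET on the endpoint path returns the cached records as JSON; check the console output for the actual host and port in your setup.

# query_ticker_analysis.py -- illustrative REST query against the Dozer cache
# Assumes Dozer's default REST port (8080) and the /analysis/ticker path
# defined in the endpoints section of the config above.
import requests

response = requests.get("http://localhost:8080/analysis/ticker")
response.raise_for_status()

for record in response.json():
    print(record)  # e.g. average_close_price and total_volume per Ticker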

Step 5: Append New Data & Query

In the second part of this tutorial, we will demonstrate how the append mode of object storage works. Without interrupting the Dozer process, we will drop another file with the same schema into our S3 bucket. Dozer, being designed for real-time data ingestion, will automatically detect the newly added data files in the bucket and start ingesting them. This means you don't need to change any configuration when data arrives on a recurring basis and new files keep landing in the bucket.

Screenshot: uploading a new file to S3 to test Dozer's append mode

In this tutorial, we initially ingested 2 million records from the first file. When we drop a new file containing approximately 50k records into the S3 bucket, Dozer automatically starts ingesting this new data. You can follow the progress in the Dozer console log and then query the new data.
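
To try this yourself, uploading a second file to the watched folder is enough; a hedged sketch is below. It assumes the same placeholder bucket name as before, that the new CSV uses the same schema, and that it lands under the stocks/ prefix the connector is monitoring; the file names are illustrative.

# append_new_file.py -- upload a second CSV with the same schema to the watched folder
# Bucket and file names are placeholders; Dozer picks up the new object under
# the `stocks/` prefix without any configuration change.
import boto3

BUCKET_NAME = "my-dozer-stocks-bucket"   # replace with your own bucket

boto3.client("s3").upload_file("stocks_february.csv", BUCKET_NAME, "stocks/stocks_february.csv")
print("New file uploaded; watch the Dozer console for the additional records.")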

Please note that the ingestion time may vary based on several factors such as the size of the file, the number of records, the complexity of the data, and the resources available to Dozer. However, Dozer lets you query the data while it is still being ingested, so you can start getting insights from your data almost immediately.

Conclusion

As you can see, Dozer makes it easy to ingest real-time data from an S3 bucket and expose it as queryable APIs. With just a simple configuration, you can connect any data source, combine them in real-time, and instantly get low-latency data APIs. This makes Dozer a powerful tool for quickly building data products.

For more information and examples, check out the Dozer GitHub repository and dozer-samples repository. Happy coding, Happy Data APIng! 🚀👩‍💻👨‍💻