What are the challenges of implementing efficient and scalable data APIs?
Thanks to the adoption of cloud data warehouse platforms like Snowflake and Databricks, organizations are producing more and more data. This data is later processed by data analysts to extract insights, or by data scientists to build predictive models that support business decisions. Data analysts generally use tools like dbt to write SQL transformations, while data scientists prefer a Python stack or AutoML tools like DataRobot or H2O. In both cases, the results are written back to the same data warehouse for easier accessibility.
To consume this data, companies started by building analytical dashboards, which play an important role in monitoring the health of the business and helping drive strategic decisions. More recently, companies have begun to realize the value of this data in other contexts. Reverse ETL tools like Hightouch or Census, for instance, unlock its value in operational use cases by making insights or predictions available in cloud SaaS applications. This is very useful, for example, to improve the efficiency of an email marketing campaign.
Use cases, however, are not just limited to internal consumption. In multiple scenarios, it's extremely useful to expose this data directly to the end-user as part of the product experience. Think of the fintech industry, for example, where companies need to make this data readily available from the user's mobile app in order to improve their product's UX.
This seems like an easy task, but in reality it can require a lot of work from a diverse group of people. Let's see why. Data warehouses like Snowflake or Databricks are designed specifically for analytical workloads, which means they are not suited to low-latency querying and point lookups. Yet these are the typical requirements of a microservice serving customer-facing applications or a mobile app; fast response times are a prerequisite for a good user experience. For this reason, data sitting in the warehouse needs to be moved to a different type of storage that can offer these capabilities. During this process, the data must be properly prepared and indexed, and an API layer must be created in front of it so that product engineers can build their applications on top. This whole process is quite challenging and requires a lot of data engineering work.
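To make the access-pattern mismatch concrete, here is a minimal sketch contrasting a warehouse-style scan with a pre-indexed point lookup. It uses a plain in-memory dict as a stand-in for a key-value serving store; the dataset and all names (`customers`, `ltv`) are hypothetical.

```python
# Rows as they might arrive from a warehouse export: tabular records.
customers = [
    {"id": "c1", "name": "Ada", "ltv": 1200.0},
    {"id": "c2", "name": "Grace", "ltv": 980.0},
]

# A warehouse-style lookup scans every row: O(n) per request.
def lookup_scan(rows, customer_id):
    return next((r for r in rows if r["id"] == customer_id), None)

# A serving store pre-indexes the data by its access key: O(1) per request.
index = {r["id"]: r for r in customers}

def lookup_indexed(customer_id):
    return index.get(customer_id)
```

The two lookups return the same data; the point is that a serving layer pays the indexing cost once at load time, instead of per request.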
Let's look at some challenges in detail:
- Data Models: Microservice and front-end developers are used to working with hierarchical data models (like JSON or Protobuf), while data analysts and data scientists are more comfortable with tabular data. To better fit API use cases, mechanisms should be in place to automatically denormalize and transform data from tabular to hierarchical representations.
- Data Integrity: In some situations, moving data incrementally is fine, but in others an entire dataset must be replaced with a new version. In those cases, it is important to apply an "all-or-nothing" pattern that prevents old and new data from mixing during deployment.
- Seamless to the Consumer: Once a new version of the data is deployed, consumers should automatically start reading from it, with no manual cut-over.
- Easy Rollbacks: When wrong data gets deployed, it must be possible to roll back to an older version with minimal effort to avoid disrupting user-facing functionality.
- Fine-Grained Observability and RCA: Wrong data can still end up being served to the user for any number of reasons. In those situations, it is essential to have an observability tool capable of tracking each API request and tracing it back to the source data.
- Low Latency: How data is represented and indexed depends very much on the consumption pattern: sometimes data is looked up by a primary key, sometimes by multiple secondary keys, sometimes by geographic location, and so on. The storage layer that sits behind the APIs must satisfy these kinds of lookups very efficiently and at extremely low latency.
- Auto-Scaling: APIs need to handle traffic spikes efficiently, which is generally achieved with auto-scaling. That is easy when a stateless API server needs to scale, but much harder when the APIs and the storage need to scale together.
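The data-model challenge above can be sketched in a few lines: flat, warehouse-style rows (one row per customer-order pair) are grouped into one nested, JSON-style document per customer. The row shape and field names are hypothetical examples, not a real schema.

```python
# Flat, warehouse-style rows: one row per (customer, order) pair.
rows = [
    {"customer_id": "c1", "customer_name": "Ada",   "order_id": "o1", "amount": 30.0},
    {"customer_id": "c1", "customer_name": "Ada",   "order_id": "o2", "amount": 45.0},
    {"customer_id": "c2", "customer_name": "Grace", "order_id": "o3", "amount": 12.5},
]

def to_hierarchical(rows):
    """Denormalize flat rows into one nested document per customer."""
    docs = {}
    for r in rows:
        # Create the parent document on first sight of this customer.
        doc = docs.setdefault(
            r["customer_id"],
            {"id": r["customer_id"], "name": r["customer_name"], "orders": []},
        )
        # Attach the child record under the parent.
        doc["orders"].append({"id": r["order_id"], "amount": r["amount"]})
    return list(docs.values())
```

Each resulting document is self-contained, so an API can serve it in a single lookup without joining at request time.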
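One common way to meet the data-integrity, seamless-cutover, and rollback requirements above at the same time is to keep immutable dataset versions behind a single atomic pointer. The following is a minimal in-memory sketch of that idea, with hypothetical names; a real system would flip something like a table alias or a storage symlink instead of a Python variable.

```python
versions = {}          # version id -> fully built, immutable dataset
active_version = None  # the single pointer that all consumers read through

def deploy(version_id, dataset):
    """Build the new version fully, then flip the pointer in one step."""
    global active_version
    versions[version_id] = dataset   # old and new never mix: writes land here...
    active_version = version_id      # ...and only this flip is visible to readers

def rollback(version_id):
    """Old versions are kept around, so rollback is just a re-point."""
    global active_version
    if version_id not in versions:
        raise KeyError(version_id)
    active_version = version_id

def read(key):
    """Consumers always read through the active pointer, so cut-over is seamless."""
    return versions[active_version].get(key)
```

Because readers never see a half-written version and previous versions stay addressable, this one mechanism gives all-or-nothing deployment, automatic cut-over, and cheap rollback.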
All the challenges described above are what we are solving with Dozer. We aim to automate the data extraction and preparation process so that data can be served efficiently through APIs. Stay tuned for more!