Event-based architectures have been gaining popularity for some time. With increased adoption has come a flood of options for aggregating and analyzing events. Which databases are optimized for ingesting streaming events and analyzing them in real time? The answer is complex, nuanced and heavily dependent on the precise problem being solved.
This post is intended to help anyone seeking to make a selection from a difficult to understand landscape. We’ll start by evaluating three options for running real-time analytics on AWS Kinesis event streams. This analysis of Kinesis analytics is by no means exhaustive, but I hope it’s useful as a quick overview of popular options, their ideal use cases and associated tradeoffs.
About Using Event Data
Events are messages that are sent by a system to notify operators or other systems about a change in its domain. Events are commonly used by systems in the following ways:
- Reacting to changes in other systems; e.g. when a payment is completed, send the user a receipt.
- Recording changes that can then be used to recompute state as needed, e.g. a transaction log.
- Supporting separation of data access (read/write) mechanisms like CQRS.
- Aiding in the understanding and analysis of the current and past state of a system.
I’ll focus on the use of events to help understand, analyze and diagnose problems using various OLAP databases and AWS Kinesis data streams.
AWS Kinesis
Kinesis is Amazon’s solution for collecting and processing streaming data in real time. It’s a fully managed service within the Amazon Web Services (AWS) cloud, which obviates the need to manage infrastructure. Kinesis is modeled after Apache Kafka: both are general-purpose publish/subscribe messaging services, both are horizontally scalable, and both are high performance. The primary difference between the two solutions is configurability and administration. Kafka is far more configurable on vectors like retention, performance and auto-scaling, but in turn requires a large team and weeks of setup. Teams looking to reduce operational burden often find a good fit in Kinesis, saving their engineering teams time on setup and maintenance. Additionally, for teams developing primarily in the AWS ecosystem, Kinesis plays nicely with other AWS services. While this blog post won’t dive deeply into Kinesis’ capabilities, it’s worth quickly noting three:
- Kinesis Data Streams enable continuous capture of gigabytes of data per second from an enormous number of sources.
- Kinesis Data Firehose allows for easy ETL into AWS data stores and other OLAP databases for real-time Kinesis analytics.
- Kinesis Data Analytics allows teams to process streaming data in real-time. This tool is useful for partitioning data into time windows for SQL querying, but is not a full-blown OLAP database.
Building Events Analytics
More than ever, organizations are recognizing the value of, and necessity to, analyze events data in real time. Perhaps an ecommerce company would like to offer product recommendations based on in situ shopper behavior. Or, a construction company might need access to material logistics data in seconds. Such use cases require fundamental architectural changes. We’ve covered these topics in detail in Analytics on Kafka Event Streams Using Druid, Elasticsearch and Rockset, for events, and in 7 Reference Architectures for Real-Time Analytics, for other common real-time analytics use cases.
To abbreviate the analysis, I’ll be evaluating solutions using the following criteria:
- Batch vs. real-time analytics
- The availability of common features like joins, inserts/updates and rollups
- Requirements for data preparation
- Performance for selective vs. aggregate queries
Druid
Druid is a common, high-performance OLAP database; it provides a columnar data store that supports streaming sources (events) and fast queries. One of Druid’s most attractive characteristics is its ability to run analytics against enormous amounts of data. It’s most commonly found at huge enterprises, such as Walmart, Twitter and Alibaba.
Druid + Kinesis might be for you if:
- You need real-time access to petabytes of data and/or trillions of events.
- You have un-nested, predictable data.
- You’re using
GROUP BY
queries for aggregate analytics across many rows in a single table. - Your use case is network performance monitoring or clickstream analytics.
It might be time to look elsewhere if:
- Your events are deeply nested and you need to access them via SQL.
- Your data source doesn’t contain type-enforcement at the column level.
- You need to write SQL with complex joins across tables.
- Your team cannot afford the medium-to-high operational overhead required to set up Druid. Performance engineering requires significant effort even after setup.
- Your use case is ad hoc or drill down analyses of Kinesis events. These are typically difficult in Druid; it’s better suited for answering predefined questions.
- Your queries are selective (they return a small number of records). Druid does a full scan of your data instead of using indexes. This affects performance.
- You’re trying to run real-time queries on the HDFS partition.
- You need to backfill old data. All older segments are read-only and immutable. If events arrive late and have to update historical segments, those segments need to be rewritten.
Druid Kinesis Specifics
- Druid has built-in support for Kinesis ingestion, which you can read about in the Kinesis documentation. Note that this requires manual configuration and management.
- Setup tends to take a few hours once Druid is configured, but be sure to consider the high operational cost required to set up, maintain and tune Druid.
Druid Summary
Druid is ideal for real-time analytics on Kinesis streams if incoming data is highly predictable, teams can afford the considerable overhead, and complex SQL features like rollups and joins are not required. If you’re looking for something easy to use, quick to set up, and flexible, this is not the solution for you.
Elasticsearch
Elasticsearch is a search and analytics engine commonly used for ad hoc analysis on logs or text. It’s become more popular as an events-analytics database, but unlike the other products in this article, it’s a bit easier to pin down.
Elasticsearch + Kinesis might be for you if:
- You already know you need an inverted index for selective queries.
- Your use case is highly performant full text search or log analytics.
It might be time to look elsewhere if:
- You have high write rates. If new events are generated at more than 10s of megabytes per second, you might run into trouble.
- You’re looking to write OLAP queries in SQL.
- You need to query nested data.
- You need to join multiple tables within Elasticsearch or between Elasticsearch and another database.
- You’re looking for a general purpose OLAP database.
Elasticsearch Kinesis Specifics
Elasticsearch supports both Kinesis data streams and sending data directly to Firehose from the producer (which requires more configuration).
Elasticsearch Summary
Elasticsearch is a popular tool for achieving full-text search, especially for log analytics, but is less useful as a fully-featured analytics engine for events data.
Redshift
Amazon Redshift is a high performance, massively parallel processing (MPP) data warehouse designed for query latencies of second/minutes. It has one standout advantage over the other tools we’ve looked at so far: like Kinesis, it lives in the AWS ecosystem.
Redshift + Kinesis might be for you if:
- You need to execute complex aggregation queries across large datasets for low-concurrency workloads.
- You need to be able to join tables.
- Your use case is historical business intelligence (with low QPS) or log analytics.
It might be time to look elsewhere if:
- You’re looking to deliver sub-second query results for real-time analytics. Your workload requires traditional insertions/updates. Redshift has some limitations.
- You’re trying to build an application. At 50 queries across all queues, Redshift cannot handle many users querying simultaneously.
- You need to move data quickly from Kinesis to Redshift via Firehose. Latencies are tens of minutes at best.
- You’re especially cost sensitive. Redshift does not disaggregate compute and storage, which can have significant effects on cost. Make sure to do sufficient research on pricing.
Redshift Kinesis Specifics
Redshift Summary
An analytics solution leveraging both Redshift and Kinesis can be powerful given a modest number of users running analytical queries on relatively fresh data.
Rockset
You didn’t think you’d finish a Rockset blog post without hearing about Rockset, did you? I’ll do my best to evaluate it objectively! It turns out that Rockset is quite a good fit for querying both event streams and databases in real time. Developers can ingest events with read permissions in the cloud using our built-in connectors or directly by writing into Rockset using our JSON Write API.
Rockset + Kinesis might be for you if:
It might be time to look elsewhere if:
- Your use case primarily involves batch workloads, i.e. traditional, aggregated business intelligence.
- Your use case is log analytics or full-text search. There are better options discussed in this article!
- You need an on-prem solution.
Rockset Kinesis Specifics
Rockset is fully managed and has a built-in Kinesis integration, which helps prioritize developer leverage and reduce operational overhead. Ingest, storage and compute are all scaled automatically and there is little need for capacity planning, sharding or tuning. Check out our in-depth documentation to leverage Rockset’s Kinesis integration; the only work required is configuring AWS Firehose’s IAM policies.
Rockset Summary
Rockset works great for teams looking to run real-time analytics on Kinesis with extremely low overhead in many common use cases. The best way to learn about how Rockset fits into your current stack is to see Rockset in action. Create an integration with your Kinesis service and give it a spin.
If you’d like to chat with our team or schedule a demo, don’t hesitate to reach out. Head over to the Rockset homepage, enter your email, and we’ll be in touch shortly.
Rockset is the real-time analytics database in the cloud for modern data teams. Get faster analytics on fresher data, at lower costs, by exploiting indexing over brute-force scanning.