Industries across the globe produce a staggering amount of data, and it continues to multiply at an exponential rate. This big data often comes in the form of live streams, known as streaming data, which has become a critical part of modern enterprise data architecture. This live data can come from server logs, IoT sensors, and clickstream data from websites and apps. Tracking and analyzing this data has become integral to support a variety of business functions.
However, it’s tricky to work with streaming data for two reasons. First, you have to collect large amounts of data from streaming sources that generate events every minute. Second, in its raw form, streaming data lacks structure and schema, which makes it tricky to query with analytic tools.
Today, there’s an increasing need to process, parse, and structure streaming data before any proper analysis can be done on it. For instance, what happens when someone uses a ride-hailing app? The app uses real-time data of location tracking, traffic data, and pricing to provide the most suitable driver. It also estimates how much time it will take to reach the destination based on real-time and historical data. The entire process from the user’s end takes a few seconds. But what if the app fails to collect and process any of this data on time? There’s no value to the app if the data processing isn’t done in real time.
Traditionally, batch-oriented approaches are used for data processing. However, these approaches are unable to handle the vast streams of data generated in real time. To address these issues, many organizations are turning to stream processing architectures as an effective solution for processing vast amounts of incoming data and delivering real-time insights for end-users.
What is stream processing?
How does stream processing work?
Stream processing vs batch processing
Stream processing use cases
Striim: A unified stream processing and real-time data integration platform
What is stream processing?
Stream processing is a data processing paradigm that continuously collects and processes real-time or near real-time data. It can collect data streams from multiple sources and rapidly transform or structure this data, which can be used for different purposes. Examples of this type of real-time data include information from social media networks, e-commerce purchases, in-game player activity, and log files of web or mobile users.
As Alok Pareek mentions in his explanation of stream processing, the main characteristics of data stream processing include:
Data arrives as an ongoing stream of events
It requires high throughput processing
It requires low latency processing
Alok Pareek shares the basic characteristics of data stream processing
Stream processing can be stateless or stateful. State here is used to refer to the state of the data, that is, how does previous data affect the processing of current data. In a stateless stream, the processing of current events is independent of the previous ones. Suppose you are analyzing weblogs and you need to calculate how many visitors are viewing your page at any moment in time. Since the result of your preceding second doesn’t affect the current second’s outcome, it’s a stateless operation.
With stateful streams, there’s context as current and preceding events share their state. This context can help past events shape the processing of current events. For instance, a global brand would like to check the number of people buying a specific product every hour. Stateful stream processing can help to process the users who buy the product in real time. This data is then shared in a state, so it can be aggregated after one hour.
How does stream processing work?
Modern applications process two types of data: bounded and unbounded. Bounded data refers to a dataset of finite size — one where you can count the number of elements in the dataset easily. Bounded data has a known ending point. For instance, a bookstore wants to find the number of books sold at the end of the day. This data is bounded because a fixed number of books were sold throughout the day and sales operations ended for the day, which means it has a known ending point.
On the other hand, unbounded data refers to a dataset that is theoretically infinite in size. No matter how advanced modern information systems are, their hardware has a limited number of resources, especially when it comes to storage capacity and memory. It’s not economical or practical to handle unbounded data with traditional approaches.
Stream processing can use a number of techniques to process unbounded data. It partitions data streams by taking a current fragment so they can become fixed chunks of records that can be analyzed. Based on the use case, this current fragment can be from the last two minutes or the last hour, or even the last 200 events. This fragment is referred to as a window. You can use different techniques to window data and process the windowing outcomes.
Next, data manipulation is applied to data accumulated in a window. This can include the following:
Basic operations (e.g., filter)
Aggregate (e.g., sum, min, max)
This way, each window has a result value.
Stream processing architecture
A stream processing architecture can include the following components:
Stream processor: A stream producer (also known as a message broker) uses an API to fetch data from a producer — a data source that emits streams to the stream processor. The processor converts this data into a standard messaging format and streams this output regularly to a consumer.
Real-time ETL tools: Real-time ETL tools collect data from a stream processor to aggregate, transform, and structure it. These operations ensure that your data can be made ready for analysis.
Data analytics tool: Data analytics tools help analyze your streaming data after it’s aggregated and structured properly. For instance, if you need to send streaming events to applications without compromising on latency, then you can process and persist your streams into a cluster in Cassandra. You can set up an instance in Apache Kafka to send outputs of streams of changes to your apps for real-time decision-making.
Data storage: You can save your streaming data into a number of storage mediums. This can be a message broker, data warehouse, or data lake. For example, you can store your streaming data to Snowflake, which lets you perform real-time analytics with BI tools and dashboards.
Stream processing vs. batch processing
Batch processing is all about processing batches containing a large amount of data, which is usually data at rest. Stream processing instead works with continuous streams of data where there is no start or end point in time for the data. This data is then fed to a streaming analytics tool in real time to generate instant results.
Batch processing requires that the batch data is first loaded into a file system, a database, or any other storage medium before processing can initiate. This doesn’t mean that stream processing cannot deal with large amounts of data. However, batch processing is more practical and convenient if there’s no need for real-time analytics. It’s also easier to write code for batch processing. For example, a fitness-based product company goes through its overall revenues generated from multiple stores across the country. If it wants to look at the data at the end of the day, batch processing is good enough to meet its needs adequately.
Stream processing is better when you have to process data in motion and deliver analytics outcomes rapidly. For instance, the fitness company now wants to boost brand interest after airing a commercial. It can use stream processing to feed social media data into an analytics tool for real-time audience insights. This way, it can determine audience response and look into ways to amplify brand messaging in real time.
Stream processing use cases
The ability of stream processing architectures to analyze real-time data can have a major impact in several areas.
Stream processing architectures can be pivotal in discovering, alerting, and managing fraudulent activities. They go through time-series data to analyze user behavior and look for suspicious patterns. This data can be ingested through a data ingestion tool (e.g., Striim) and can include the following:
User identity (e.g., phone number)
Behavioral patterns (e.g., browsing patterns)
Location (e.g., shipping address)
Network and device (e.g., IP information, device model)
This data is then processed and analyzed to find hidden fraud patterns. For example, a retailer can process real-time streams to identify credit card fraud during the point of sale. To do this, it can correlate customers’ interactions with different channels and transactions. In this way, any transaction that’s unusual or inconsistent with a customer’s behavior (e.g., using a shipping address from a different country) can be reviewed instantly.
Accenture found that 91% of buyers are more likely to purchase from brands that offer personalized recommendations. Today, businesses need to go the extra mile and improve their customer experience by introducing workflows that automate personalization.
Personalization with batch processing has some limitations. Since it uses historical data, it fails to take advantage of data providing insights into a user’s real-time interactions that are happening at the very moment. In addition, it fails at hyper-personalization since it’s unable to use these real-time streams with customers’ existing data.
Let’s take a seller that deals in computer hardware. Their target market includes both office workers and gamers. With stream processing, the seller can process real-time data to determine which visitors are office workers that need hardware like printers and which are gamers who are more likely to be looking for graphic cards that can run the latest games.
Log analysis is one of the processes engineering teams use to identify bugs by reviewing computer-generated records (also known as logs).
In 2009, PayPal’s network infrastructure faced a technical issue, causing it to go offline for one hour. This downtime led to a loss of transactions worth $7.2 million. In such circumstances, engineering teams don’t have a lot of time; they have to find the root cause of failure quickly via log analysis. To do this, their methods of collecting, analyzing, and understanding data in real time are key to solving the issue. Stream processing architecture makes it a natural solution. Today, PayPal uses stream processing frameworks and recently processed 5.34 billion payments in the fourth quarter of 2021.
Stream processing can improve log analysis by collecting raw system logs, classifying their structure, converting them into a consistent and standardized format, and sending them to other systems.
Usually, logs contain basic information like operation performed, network address, and time. Stream processing can add meaning to this data by identifying log data related to remote/local operations, authentication, and system events. For instance, the original log stores user IP addresses but doesn’t tell their physical location. Stream processing can collect geolocation data to identify their location and add it to your systems.
Sensor-powered devices collect and send large amounts of data quickly, which is valuable to organizations. They can measure a wide variety of data, such as air quality, electricity, gases, time of flight, luminance, air pressure, humidity, temperature, and GPS. After this data is collected, it must be transmitted to remote servers where it can be processed. One of the challenges that occur during this process is the processing of millions of records sent by the device’s sensors every second. You might also need to perform different operations like filtering, aggregating, or discarding irrelevant data.
Stream processing can process data from sensors, which includes data integration from different sources and perform various actions like normalizing data and aggregating it. To transform this data into meaningful events, it can use a number of techniques, including:
Assessment: Storing all data from sensors isn’t practical since a lot of it isn’t relevant. Stream processing applications can standardize this data after collecting it and perform basic transformations to determine if any further processing is required. Irrelevant data is then discarded, saving processing bandwidth.
Aggregation: Aggregation involves performing a calculation on a set of values to return a single output. For instance, let’s say a handbag company wants to identify fraudulent gift card use by looking over its point-of-sale (POS) machine’s sensor data. It can set a condition that tells it when gift card redemptions cross the $1,000 limit within 15 minutes. It can use stream processing to aggregate metrics continuously by using a sliding time window to look for any suspicious pattern. A sliding time window is used to group records from a data stream over a specific period. A sliding window of a length of one minute and a sliding interval of 15 minutes will contain records that arrive in a one-minute window and are evaluated after every 15 minutes.
Correlation: With stream processing, you can connect to streams over a specific interval to determine how a series of events occurred. For instance, in our POS example, you can set a rule that condition x is followed by conditions y and z. This rule can include an alert that notifies the management as soon as gift card redemptions in one of the outlets are 300% more than the average of other outlets.
Striim: a unified stream processing and real-time data integration platform
Striim is a unified streaming and real-time data integration platform with over 150 connectors to data sources and targets. Striim gives users the best of both worlds: real-time views of streaming data plus real-time delivery to data targets (e.g. data warehouses) for larger-scale analysis and report-building. All of this is possible across hybrid and multi-cloud environments.
If you’re looking to improve your organization’s processing and management of streaming data, stream processing can be a good solution. However, you need to make sure that you have the right tools to implement stream processing effectively. Striim can be your go-to tool for ingesting, processing, and analyzing real-time data streams. As a unified data integration and streaming platform — with over 150 connectors to data sources and targets — Striim brings many capabilities under one roof.
Striim can perform various operations on data streams, such as filtering, masking, aggregation, and transformation. Furthermore, streaming data can be enriched with in-memory, refreshable caches of historical data. WAction Store, a fault-tolerant, distributed results store, maintains aggregate state. WAction Stores can be continuously queried with Tungsten Query Language (TQL), Striim’s own streaming SQL engine. TQL is 2-3x faster than Kafka’s KSQL and can help you to write queries more efficiently. Streaming data can also be visualized with custom dashboards (e.g. to detect cab-booking hotspots).
Execution time for different types of queries using Striim’s TQL vs KSQL