Saturday, September 23, 2023

StreamSets and the Path to Efficient Data Integration

The business value of data analytics is well understood. Insights drawn from data can yield impressive business results in almost every industry context imaginable. A significant challenge remains, however: data is seldom all in the right place or format. Data integration is essential for getting the most out of dispersed, heterogeneous data sources.

Some organizations have their data in OLTP databases, while others may have it in Kafka systems. OLTP databases contain transactional data related to day-to-day business operations, such as customer orders, inventory management, and financial transactions. On the other hand, Kafka systems are a popular messaging platform for streaming data, which includes log files, clickstream data, and machine-generated data from sensors and IoT devices. It’s important to use a data integration platform like StreamSets to aggregate data from different sources into one location because it enables businesses to analyze data more easily and make informed decisions. Without a data integration platform, data may be scattered across multiple systems, making it difficult and time-consuming to manage and analyze. In this article, PeerSpot members share examples of how they leveraged StreamSets to ingest data from similar sources, and describe the resulting outcomes they experienced. 

Data Integration Use Cases

PeerSpot members are using StreamSets for a wide range of data integration processes. For example, Abhishek K., a Technical Lead at Sopra Steria, a tech services company working with a healthcare provider, “designed a StreamSets pipeline to connect with relational database sources.” According to Abhishek K., “We did generate a schema from the source data loaded into Azure Data Lake Storage (ADLS)… This was one of our batch use cases. We really like StreamSets because it is intuitive and requires no coding from our end.”

StreamSets has allowed his team to solve their real-time streaming use cases. His company was, as he put it, “streaming data from source Kafka topic to Azure Event Hubs. The trigger-based streaming pipeline moved data when it appeared in a Kafka topic.” The streaming pipeline continuously streamed data from Kafka to Azure for further analysis.

“I personally love working on StreamSets,” Abhishek K. added. “It is part of my day-to-day activities. For a US healthcare service provider company, we designed a StreamSets pipeline to connect to relational database sources.” One of that client's batch use cases was generating a schema “from the source data loaded into Azure Data Lake Storage (ADLS).”

StreamSets helps his team get things right the first time, especially when it comes to data drift. He can use key features such as “data rules, matrix rules, or capabilities provided by StreamSets that we can set.”

If the source schema deviates, “StreamSets will automatically notify us or send alerts in automated fashion about what is going wrong.” Another benefit is how quickly StreamSets tracks changes or deviations in the source. “StreamSets also provides Change Data Capture (CDC). As soon as the source data is changed, it can capture that and update the details into the required destination.”
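To make the idea of schema drift alerts concrete, here is a minimal sketch in plain Python. It does not use or represent the StreamSets API; the function name `detect_drift` and the example field names are hypothetical, chosen only to illustrate comparing an incoming record against an expected schema and reporting deviations.

```python
from typing import Any, Dict, List


def detect_drift(expected: Dict[str, type], record: Dict[str, Any]) -> List[str]:
    """Return a sorted list of deviations between a record and the expected schema.

    Hypothetical helper for illustration only; not part of any StreamSets API.
    """
    alerts: List[str] = []
    # Check every expected field for absence or a changed type.
    for field, expected_type in expected.items():
        if field not in record:
            alerts.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            alerts.append(
                f"type change: {field} expected {expected_type.__name__}, got {actual}"
            )
    # Report fields that the expected schema does not know about.
    for field in record:
        if field not in expected:
            alerts.append(f"new field: {field}")
    return sorted(alerts)


# Assumed example schema: an order record with an integer id and a float amount.
expected_schema = {"order_id": int, "amount": float}
for alert in detect_drift(expected_schema, {"order_id": "1", "region": "EU"}):
    print(alert)
```

In a real pipeline, a non-empty list of alerts would typically trigger a notification rather than a print statement.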

Data ingestion is where Srinivasan S., a Senior Data Engineer at an energy/utilities company, uses the StreamSets platform. StreamSets enables him to “ingest data to a data lake,” as ingestion and data absorption are usually continuous. As he put it, “Each log and event is immediately stored when it hits the stream processor.”

Data integration is what stood out for Prateek A., a Technical Program Manager at a university. He said, “We are working on a very large data analytics project, in which we are integrating large data sets to a platform from multiple sources. We need to create data pipelines. We are using StreamSets for all the data integration activities – for creating the pipelines, monitoring them and smoothly running all the data processes.”

Karthik R., a Principal Engineer at Tata Consultancy, the tech services firm, is responsible for enterprise data. He found StreamSets to be a great solution for “connecting to enterprise data stores such as OLTP databases or messaging systems such as Kafka.” He needed to find the best solution to connect to streaming databases and streaming products. He explained, “This ability is important because most of our use cases in recent times are of a streaming nature. We have to deliver certain messages or data as per our SLA, and the combination of Kafka and StreamSets helps us meet those timelines.”

StreamSets has provided benefits to his company because, as he said, “I’m not sure what I would have used to achieve the same five years ago.” Karthik is enthusiastic that “the combination of Kafka and StreamSets has opened up a new world of opportunities to explore.”

He also appreciated StreamSets’ ability to provide orchestration for multiple jobs. He said, “For example, you can specify to let Job A run first, then Job B, and then Job C in an automated fashion. You don’t need any manual intervention. In one of my projects, I had a data hub from 10 different databases. It was all automated by using Kafka and StreamSets.”
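The sequential orchestration Karthik describes (Job A, then Job B, then Job C, with no manual intervention) can be sketched generically as follows. This is an illustrative sketch only, not how StreamSets implements orchestration; `run_in_order` and the job names are hypothetical, and the assumption that a failed job stops the chain is ours rather than the reviewer's.

```python
from typing import Callable, List, Tuple

# A job is a (name, callable) pair; the callable returns True on success.
Job = Tuple[str, Callable[[], bool]]


def run_in_order(jobs: List[Job]) -> List[str]:
    """Run jobs one after another and return the names of jobs that succeeded.

    Assumption for this sketch: the chain stops at the first job that fails,
    so later jobs never run on incomplete upstream data.
    """
    completed: List[str] = []
    for name, job in jobs:
        if not job():
            break
        completed.append(name)
    return completed


# Hypothetical jobs standing in for real pipeline runs.
pipeline = [
    ("Job A", lambda: True),
    ("Job B", lambda: True),
    ("Job C", lambda: True),
]
print(run_in_order(pipeline))
```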


StreamSets users have discovered that the solution delivers a number of business-facing benefits. According to a Senior Network Administrator at an energy/utilities company, for example, using StreamSets’ Data Collector has produced “better, up-to-date reports and has absolutely saved us money.”

This user went on to say that his team benefited from breaking down data silos. They are able to consume hundreds more terabytes of data across different streams than they could before they had StreamSets. He elaborated, saying, “Then we aggregate it together so that we can do reporting that is not just for that one silo of people but for a number of different people across the entire organization.”

This is a win-win for his organization. “It has had a positive effect, enabling us to save money,” he said. His company “spends money more effectively.” It has “more up-to-date data in reports as well as in auditing.” An added bonus is that “our safety processes are better too.”

The benefits of StreamSets carry across every aspect of his department as the software “has helped to scale our data operations. As a result, in addition to saving money and right-sizing, it’s helped our field operations and provided us with more management reporting.”

Sumesh G., a data professional at a tech vendor, said that thanks to StreamSets he now runs pipelines that scale horizontally “instead of using a static service to host the service.” He then commented, “This has improved efficiency and reduced our workload by around 85 percent. Initially, we started out with around 40 users. Now, there are 100 users. We have definitely scaled up, in terms of usage, with StreamSets.”

A centralized platform saves Sumesh G.’s company a great deal of time. He observed that StreamSets is “very intuitive and very effective, saving us a lot of resources with its built-in capabilities. No manual intervention is needed, and nobody needs to oversee it. It’s an ‘all-in-one’ deal for us. We are able to save 15 to 18 hours per week. Tasks that required three people can be done by StreamSets itself.”

As an example, his company can now pull thousands of records instantly, reducing the need, in this instance, to dedicate resources to complex coding. “That has also been a very big plus for us. We also use it to connect our Apache Kafka with data lakes and as a result, this connection too has become much more efficient and quicker,” he said.

The benefits of integrating StreamSets became apparent to Sumesh G. quickly. He said, “Within about three months we were able to see benefits from the system. We saw a lot of time being saved and about a 30 percent increase in our overall efficiency.”

Ramesh K., a Senior Software Developer at a tech vendor, also mentioned how the introduction of StreamSets has led to several benefits. He noted, “The efficiency of our entire process has increased a lot and we derive high value from it. The integration of data files from multiple sources is what makes it great software for us.” In their case, a 20 to 25% jump in efficiency has led to a return on investment driven by a 7 to 10% increase in revenue. The major driver of business value for Ramesh K. is that “there are minimal coding requirements. Any person, even someone with a non-technical background, can easily get accustomed to the software and start using it.”


Data integration and data pipelines may seem like esoteric areas of IT. However, given the power of analytics and related processes, they can be significant drivers of business value. As users of StreamSets explained in their PeerSpot reviews, the solution makes it easier to tackle complex data integration challenges. The results include gains in efficiency, cost and time savings, and increases in revenue.

The post StreamSets and the Path to Efficient Data Integration appeared first on StreamSets.
