Data is everywhere, in different formats and databases. Being able to integrate multiple, highly varied data sources is essential to running a business today. You have to be able to Extract-Transform-Load (ETL) the data from each source into a database suitable for data analysis, like a data warehouse. In this article, PeerSpot’s real users of StreamSets discuss how the platform helps them with these challenges.
PeerSpot members employ StreamSets in a variety of use cases. For instance, a Data Engineer at an energy company uses StreamSets to “transport our Oracle raw datasets up to Microsoft Azure and then into SQL databases there.”
Abhishek K., a Technical Lead at Sopra Steria, a tech services company, takes advantage of StreamSets to scale access to their data pipelines. “We can design and create data pipelines, loading on-prem data to the cloud.” For them, that means moving data from on-premises to Azure and Google Cloud Platform (GCP). He added, “From there, once data is loaded, the data scientist and data analyst teams use that data to generate patterns and insights.”
Abhishek K.’s team designed a StreamSets pipeline that enables a healthcare provider to connect to relational database sources in batch processing. They generated “schema from the source data loaded into Azure Data Lake Storage (ADLS) or any cloud, like S3 or GCP.”
StreamSets also provides Abhishek K.’s team with solutions for the real-time streaming challenges that arise in a trigger-based streaming pipeline. He said, “We were streaming data from source Kafka topic to Azure Event Hubs. This was a trigger-based streaming pipeline, which moved data when it appeared in a Kafka topic. Since this pipeline was a streaming pipeline, it was continuously streaming data from Kafka to Azure for further analysis.”
Ingesting data into a data lake is the case for Srinivasan S., a Senior Data Engineer at an energy company. Describing the toolset as “quite simple to use for anybody who has an ETL or BI background,” Srinivasan S. went on to explain that using StreamSets has improved their time to value because it reduces development time. He also added that StreamSets has “the ability to easily transform and ‘up-skill’ a person who has already worked on an ETL or BI background.”
StreamSets eliminated the need for his team to hire IT programming staff or those with skills specifically in Python, DataOps or DevOps. He shared, “In the market, it is easier to find people with ETL or BI [business intelligence] skills than people with hardcore DevOps or programming skills. That is the major benefit that we are getting out of moving to a GUI-based tool like StreamSets.”
Many organizations have been able to streamline their data operations by leveraging StreamSets. According to Srinivasan S., ETL with StreamSets is “Very simple for anybody who has already worked on an ETL tool set, either for your data ingestion, ETL pipeline or data lake requirements. The UI and concepts are very similar to how you develop your extraction pipeline.” He then stressed, “The data resilience feature is good enough for our ETL operations, even for our production pipelines at this stage.” His team does not need to spend time building a customized framework, “since what is available out-of-the-box is good enough for a production pipeline.”
A Data Engineer at a consultancy, who uses StreamSets for ingesting data into the cloud, found the tool’s design experience “is very good” when implementing batch streaming and ETL pipelines. He added, “StreamSets’ built-in data drift resilience plays a part in our ETL operations. We have some processors in StreamSets, and it will tell us what data has been changed and how data needs to be sent.”
Tata Consultancy, the global tech vendor, has a significant ETL operation, as one might imagine. Karthik R., one of their Principal Engineers, oversaw multiple jobs coming from various source systems, and any change in columns left the data out of sync. StreamSets solved this issue because of its embedded data drift feature. He said, “We don’t have to spend that much time taking care of the columns and making sure they are in sync. All this is taken care of. We don’t have to worry about it. It is a very helpful feature to have.”
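To make the idea of column drift concrete, here is a minimal Python sketch of the kind of check that otherwise has to be written by hand. It is not StreamSets code and does not use any StreamSets API; the function names `detect_drift` and `conform`, and the sample columns, are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from typing import Any


def detect_drift(expected_columns: set[str], record: dict[str, Any]) -> dict[str, set[str]]:
    """Compare one incoming record's fields with the columns the target expects."""
    incoming = set(record)
    return {
        "added": incoming - expected_columns,    # fields the source started sending
        "missing": expected_columns - incoming,  # fields the source stopped sending
    }


def conform(record: dict[str, Any], expected_columns: set[str]) -> dict[str, Any]:
    """Keep only the expected columns and fill any missing ones with None."""
    return {col: record.get(col) for col in sorted(expected_columns)}


# Hypothetical example: the source dropped "email" and added "phone".
expected = {"id", "name", "email"}
record = {"id": 7, "name": "Ada", "phone": "555-0100"}

drift = detect_drift(expected, record)
conformed = conform(record, expected)
```

In this sketch, `drift` reports `phone` as added and `email` as missing, and `conformed` holds only the expected columns. Keeping checks like this in sync across many jobs is the manual effort the reviewer describes no longer having to spend.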
Also, his staff don’t necessarily need to be coders. “You can just drag and drop.” He said, “Rather than learning each and every technology and building your data pipelines, you can just plug and play at a faster pace.” For added context, he mentioned, “The nature of coding is changing, and the number of technologies is changing. The range is so wide right now. Even if I know Java or Oracle, it may not be enough in today’s times because we might have databases in Teradata. We might have Snowflake or other different kinds of databases.”
StreamSets’ primary use case is integrating data for the modern data ecosystem. Customers use StreamSets to ingest data from a wide variety of data sources across the enterprise – including databases, applications, APIs, messaging systems, event hubs, cloud object stores, and legacy data warehouses – into modern data platforms such as data lakes, cloud warehouses and data clouds, and to sanitize, transform and conform data both in motion and in the platform.
Tata Consultancy’s Data Engineer favors StreamSets for data integration because it is “very effective in project delivery.” He explained, “At the end of June, I deployed all the integrations which I developed in StreamSets to production remit. The business users and customers are happy with the data flow optimizer from the SoPlex cloud. It all looks good.”
This user appreciates that ramping up is easy because no one on the team had to learn new technologies or how to use new tools. “Everything is in place and it comes as a package. They install everything. The package includes Python, Ruby, and others. I just need to configure the correct details in the pipeline and go ahead with my work.”
The resulting efficiency “has opened up a new world of opportunities to explore. I recently used orchestration wherein you can have multiple jobs and you can orchestrate them. For example, you can specify to let Job A run first, then Job B and then Job C in an automated fashion. You don’t need any manual intervention. In one of my projects, I had a data hub from 10 different databases. It was all automated by using Kafka and StreamSets.”
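The Job A, then Job B, then Job C sequence the reviewer describes can be sketched in a few lines of plain Python. This is only an illustration of sequential orchestration under assumed names (`run_in_order`, `make_job`); it does not represent how StreamSets implements orchestration or any of its APIs.

```python
from __future__ import annotations

from typing import Callable


def run_in_order(jobs: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run jobs one after another; stop at the first job that reports failure."""
    completed: list[str] = []
    for name, job in jobs:
        if not job():
            break  # later jobs depend on this one, so do not start them
        completed.append(name)
    return completed


run_log: list[str] = []


def make_job(name: str, succeeds: bool = True) -> Callable[[], bool]:
    """Build a stand-in job that records that it ran and returns its outcome."""
    def job() -> bool:
        run_log.append(name)
        return succeeds
    return job


# Hypothetical run: Job A, then Job B, then Job C, with no manual intervention.
completed = run_in_order([
    ("Job A", make_job("Job A")),
    ("Job B", make_job("Job B")),
    ("Job C", make_job("Job C")),
])
```

Stopping at the first failure is a design choice made for this sketch: when each job feeds the next, running Job C after Job B has failed would only propagate incomplete data.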
Going further, the Data Engineer revealed that StreamSets’ data drift resilience was able to reduce the time needed to fix data drift breakages. He said, “It has definitely saved around two to three weeks of development time. Previously, any kind of changes in our jobs used to require changing our code or table structure and doing some testing. It required at least two to three weeks of effort, which is now taken care of because of StreamSets.”
StreamSets’ reusable assets also helped to reduce his team’s workload. He put it this way: “We can use pipeline fragments across multiple projects, which saves development time.”
Data integration and ETL can be time-consuming and resource-intensive tasks. StreamSets provides a platform that reduces this burden on data engineering teams:
StreamSets Data Collector – Runs data ingestion pipelines (from any source to any destination) that perform record-based data transformations in streaming, CDC or batch modes.
StreamSets Transformer – Performs ETL, ELT and data transformations such as joins, aggregates, and unions directly on Apache Spark and Snowflake platforms, along with support for higher-order operations (UDFs and Scala).