Dataflow is the industry-leading platform that provides unified batch and streaming data processing capabilities, and supports a wide variety of analytics and machine learning use cases. It’s a fully managed service that comes with flexible development options (from Flex Templates and Notebooks to Apache Beam SDKs for Java, Python, and Go), and a rich set of built-in management tools. It seamlessly integrates not just with Google Cloud products like Pub/Sub, BigQuery, Vertex AI, Cloud Storage, Spanner, and Bigtable, but also with third-party services like Kafka and AWS S3, to best meet your analytics and machine learning needs.
Dataflow’s adoption has ramped up in the past couple of years, and numerous customers now rely on Dataflow for everything from small proofs of concept to large-scale production deployments. As customers are always trying to optimize their spend and do more with less, we naturally get questions about cost optimization for Dataflow.
In this post, we’ve put together a comprehensive list of actions you can take to help you understand and optimize your Dataflow costs and business outcomes, based on our real-world experiences and product knowledge. We’ll start by helping you understand your current and near-term costs. We’ll then share how you can continuously evaluate and optimize your Dataflow pipelines over time. Let’s dive in!
Understand your current and near-term costs
The first step in most cost optimization projects is to understand your current state. For Dataflow, this involves the ability to effectively:
Understand Dataflow’s cost components
Predict the cost of potential jobs
Monitor the cost of submitted jobs
Understanding Dataflow’s cost components
Your Dataflow job will have direct and indirect costs. Direct costs reflect resources consumed by your job, while indirect costs are incurred by the broader analytics/machine learning solution that your Dataflow pipeline enables. Examples of indirect costs include usage of different APIs invoked from your pipeline, such as:
BigQuery Storage Read and Write APIs
Cloud Storage APIs
BigQuery queries
Pub/Sub subscriptions and publishing, and
Network egress, if any
Direct and indirect costs influence the total cost of your Dataflow pipeline. Therefore, it’s important to develop an intuitive understanding of both components, and use that knowledge to adopt strategies and techniques that help you arrive at a truly optimized architecture and cost for your entire analytics or machine learning solution. For more information about Dataflow pricing, see the Dataflow pricing page.
Predict the cost of potential jobs
You can predict the cost of a potential Dataflow job by initially running the job on a small scale. Once the small job is successfully completed, you can use its results to extrapolate the resource consumption of your production pipeline. Plugging those estimates into the Google Cloud Pricing Calculator should give you the predicted cost of your production pipeline. For more details about predicting the cost of potential jobs, see this (old, but very applicable) blog post on predicting the cost of a Dataflow job.
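For example, with purely hypothetical numbers: if a test run over 1% of your data consumed 2 vCPU-hours, 7.5 GB-hours of worker memory, and 10 GB of shuffle, you might estimate roughly 200 vCPU-hours, 750 GB-hours of memory, and 1 TB of shuffle for the full dataset, and enter those figures into the calculator. Treat the result as a ballpark figure, since resource consumption rarely scales perfectly linearly.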
Monitor the cost of submitted jobs
You can monitor the overall cost of your Dataflow jobs using a few simple tools that Dataflow provides out of the box. Also, as you optimize your existing pipelines, possibly using recommendations from this blog post, you can monitor the impact of your changes on performance, cost, or other aspects of your pipeline that you care about. Handy techniques for monitoring your Dataflow pipelines include:
Use metrics within the Dataflow UI to monitor key aspects of your pipeline.
For CPU intensive pipelines, profile the pipeline to gain insight into how CPU resources are being used throughout your pipeline.
Gain a degree of real-time cost control by creating monitoring alerts on the Dataflow job metrics that ship out of the box and can serve as good proxies for the cost of your job. These alerts send real-time notifications once a metric associated with a running pipeline exceeds a predetermined threshold. We recommend that you only do this for your critical pipelines, so you can avoid notification fatigue.
Enable Billing Export into BigQuery, and perform deep, ad-hoc analyses of your Dataflow costs that help you understand not just the key drivers of your costs, but also how those drivers are trending over time.
Create a labeling taxonomy, and add labels to your Dataflow jobs that help facilitate cost attribution during the ad-hoc analyses of your Dataflow cost data using BigQuery. Check out this blog post for some great examples of how to do this.
Run your Dataflow jobs using a custom service account. While this is great from a security perspective, it also makes it easy to identify the APIs used by your Dataflow job. See the sketch following this list for how you might set both job labels and a custom service account in your pipeline options.
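As a rough illustration, here is a minimal sketch of how both of the last two points could look in a Java pipeline, assuming the Dataflow runner and a recent Beam SDK that exposes the labels and serviceAccount pipeline options; the label taxonomy, project, and service account email are hypothetical:

```java
import java.util.Map;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class LabeledJob {
  public static void main(String[] args) {
    // Expect --runner=DataflowRunner, --project, --region, and --tempLocation on the command line.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

    // Hypothetical labeling taxonomy: team, environment, and cost center.
    options.setLabels(Map.of(
        "team", "orders",
        "env", "prod",
        "cost-center", "cc-1234"));

    // Hypothetical dedicated worker service account for this pipeline.
    options.setServiceAccount("dataflow-orders@my-project.iam.gserviceaccount.com");

    Pipeline pipeline = Pipeline.create(options);
    // ... build your transforms here ...
    pipeline.run();
  }
}
```

Job labels set this way are attached to the job’s billing records, so once billing export is enabled you can group and filter your Dataflow costs by team, environment, or cost center directly in BigQuery.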
Optimize your Dataflow costs
Once you have a good understanding of your Dataflow costs, the next step is to explore opportunities for optimization. Topics to be considered during this phase of your cost optimization journey include:
Your optimization goals
Key factors driving the cost of your pipeline
Considerations for developing optimized batch and streaming pipelines
Let’s look at each of these topics in detail.
Goals
The goal of a cost optimization effort may seem obvious: “reduce my pipeline’s cost.” However, your business may have other priorities that have to be carefully considered and balanced with the cost reduction goal. From our conversations with customers, we have found that most Dataflow cost optimization programs have two main goals:
1. Reduce the pipeline’s cost
2. Continue meeting the service level agreements (SLAs) required by the business
Cost factors
Opportunities for optimizing the cost of your Dataflow pipeline will be based on the key factors driving your costs. While most of these factors will be identified through the process of understanding your current and near-term costs, we have identified a set of frequently recurring cost factors, which we have grouped into three buckets: Dataflow configuration, performance, and business requirements.
Dataflow configuration includes factors like:
Should your pipeline be streaming or batch?
Are you using Dataflow services like Streaming Engine or FlexRS?
Are you using a suitable machine type and disk size?
Should your pipeline be using GPUs?
Do you have the right number of initial and maximum workers?
Performance includes factors like:
Are your SDK versions (Java, Python, Go) up to date?
Are you using IO connectors efficiently? One of the strengths of Apache Beam is its large library of IO connectors for different storage and queueing systems. These connectors are already tuned for performance, but there are cases where you can trade performance for cost. For example, BigQueryIO supports several write methods, each with somewhat different performance and cost characteristics (see the BigQueryIO sketch after this list). For more details, see slides 19–22 of the Beam Summit session on cost optimization.
Are you using efficient coders? Coders determine the size of the data that needs to be saved to disk or transferred to a dedicated shuffle service between pipeline stages, and some coders are more compact than others. Metrics like total shuffle data processed and total streaming data processed can help you identify opportunities to use more efficient coders. As a general rule, check whether your pipeline is carrying data it doesn’t need: filter out unused columns as early as possible, and use efficient coders such as AvroCoder or RowCoder for what remains (see the coder sketch after this list). Also, remember that if stages of your pipeline are fused, the coders of the intermediate steps become irrelevant.
Do you have sufficient parallelism in your pipeline? The Execution details tab, and metrics like processing parallelism keys, can help you determine whether your pipeline code is taking full advantage of Apache Beam’s ability to do massively parallel processing (for more details, see parallelism). For example, if you have a transform that outputs many records for each input record (a “high fan-out” transform) and Dataflow’s fusion optimization fuses it with downstream steps, the parallelism of the pipeline can be suboptimal, and the pipeline may benefit from preventing fusion (see the fusion sketch after this list). Another area to watch for is “hot keys”; this blog post discusses them in great detail.
Are your custom transforms efficient? Job monitoring techniques like profiling your pipeline and viewing the relevant metrics can help you catch and correct inefficient custom transforms. For example, if your Java transform needs to check whether data matches a regular expression, compiling the pattern once in the setup method and matching against the precompiled pattern in the process element method is much more efficient than calling String.matches() in the process element method, which recompiles the pattern for every element (see the DoFn sketch after this list). Grouping elements before calling external services can also help you build optimally sized request batches for external API calls. Finally, transforms that perform multi-step operations (for example, calls to external APIs that require expensive setup and teardown of the API client) can often benefit from splitting those operations across the appropriate methods of the ParDo (for more details, see the ParDo life cycle).
Are you doing excessive logging? Logs are great. They help with debugging, and can significantly improve developer productivity. On the other hand, excessive logging can negatively impact your pipeline’s performance.
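To illustrate the BigQueryIO point above, here is a minimal BigQueryIO sketch in the Java SDK that explicitly selects a write method; the project, dataset, and table names are hypothetical, and the destination table is assumed to already exist:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class WriteMethodExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("CreateRows",
            Create.of(new TableRow().set("user_id", "u-1").set("amount", 9.99))
                .withCoder(TableRowJsonCoder.of()))
     .apply("WriteToBigQuery",
         BigQueryIO.writeTableRows()
             .to("my-project:my_dataset.orders")  // hypothetical, pre-existing destination table
             // FILE_LOADS (batch load jobs) is usually the most cost-effective choice for batch
             // pipelines; STORAGE_WRITE_API trades some cost for lower latency in streaming.
             .withMethod(Write.Method.FILE_LOADS)
             .withCreateDisposition(Write.CreateDisposition.CREATE_NEVER)
             .withWriteDisposition(Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```

Note that FILE_LOADS also needs a Cloud Storage temp location (for example, via --tempLocation) for staging the load files.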
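For the coder point above, a minimal coder sketch: define a slimmed-down value class for the fields your pipeline actually uses and let AvroCoder encode it (in newer Beam releases AvroCoder lives in the org.apache.beam.sdk.extensions.avro.coders package instead); the ClickEvent class is hypothetical:

```java
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.DefaultCoder;

// Hypothetical value class carrying only the fields downstream transforms need.
// A compact Avro encoding shrinks the bytes written to shuffle between stages.
@DefaultCoder(AvroCoder.class)
public class ClickEvent {
  public String userId;
  public long timestampMillis;
  public String pageId;

  public ClickEvent() {}  // no-arg constructor required by AvroCoder's reflection-based schema
}
```

You can also set the coder explicitly on an intermediate PCollection with setCoder(AvroCoder.of(ClickEvent.class)), and then compare the total shuffle data processed metric before and after the change.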
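For the parallelism point above, a minimal fusion sketch that breaks fusion after a high fan-out step, assuming an existing PCollection<String> named fileNames and a hypothetical readRecords helper:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

// Each input file name expands into many records (a high fan-out transform).
PCollection<String> records =
    fileNames
        .apply("ExpandFile", ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(@Element String fileName, OutputReceiver<String> out) {
            for (String record : readRecords(fileName)) {  // hypothetical helper
              out.output(record);
            }
          }
        }))
        // Breaking fusion here lets Dataflow redistribute the expanded records across all
        // workers instead of processing them on the worker that produced them.
        .apply("BreakFusion", Reshuffle.viaRandomKey());
```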
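And for the custom transform point above, a minimal DoFn sketch of the precompiled-pattern approach; the class name and regex usage are hypothetical:

```java
import java.util.regex.Pattern;
import org.apache.beam.sdk.transforms.DoFn;

// Compile the regular expression once per DoFn instance in @Setup, instead of calling
// String.matches() in @ProcessElement, which would recompile the pattern for every element.
public class FilterByPatternFn extends DoFn<String, String> {
  private final String regex;
  private transient Pattern pattern;

  public FilterByPatternFn(String regex) {
    this.regex = regex;
  }

  @Setup
  public void setup() {
    pattern = Pattern.compile(regex);
  }

  @ProcessElement
  public void processElement(@Element String line, OutputReceiver<String> out) {
    if (pattern.matcher(line).matches()) {
      out.output(line);
    }
  }
}
```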
Business requirements can also influence your pipeline’s design, and increase costs. Examples of such requirements include:
Low end-to-end ingest latency
Extremely high throughput
Processing late arriving data
Processing streaming data spikes
Considerations for developing optimized data pipelines
From our work with customers, we have compiled a set of guidelines that you can easily explore and implement to effectively optimize your Dataflow costs. Examples of these guidelines include:
Batch and streaming pipelines:
Consider keeping your Dataflow job in the same region as your IO source and destination services.
Consider using GPUs for specialized use cases like deep-learning inference.
Consider specialized machine types for memory- or compute-intensive workloads.
Consider setting the maximum number of workers for your Dataflow job (see the options sketch following this list).
Consider using custom containers to pre-install pipeline dependencies, and speed up worker startup time.
Where necessary, tune the memory of your worker VMs.
Reduce your number of test and staging pipelines, and remember to stop pipelines that you no longer need.
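As a rough illustration of the region, machine type, and maximum worker settings above, here is a minimal options sketch for the Java SDK, placed inside your pipeline’s main method; the region, machine type, and worker cap are hypothetical values you would tune for your own workload:

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

options.setRunner(DataflowRunner.class);
options.setRegion("us-central1");               // same region as the pipeline's sources and sinks
options.setWorkerMachineType("e2-standard-4");  // hypothetical machine type sized for the workload
options.setMaxNumWorkers(20);                   // cap autoscaling to bound the worst-case cost
```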
Batch pipelines:
Consider FlexRS for use cases that have start-time and run-time flexibility (see the sketch following this list).
Run fewer pipelines on larger datasets. Starting and stopping a pipeline incurs some cost, and processing a very small dataset in a batch pipeline can be inefficient.
Consider defining the maximum duration for your jobs, so you can eliminate runaway jobs, which keep running without delivering value to your business.
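For the FlexRS item above, a minimal sketch, assuming your Beam Java SDK version exposes the flexRSGoal option (equivalently, --flexRSGoal=COST_OPTIMIZED on the command line):

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions.FlexResourceSchedulingGoal;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

// Ask Dataflow to schedule this batch job with Flexible Resource Scheduling (FlexRS),
// trading a delayed start for discounted, preemptible-backed worker resources.
options.setFlexRSGoal(FlexResourceSchedulingGoal.COST_OPTIMIZED);
```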
Summary
In this post, we introduced you to a set of knowledge and techniques that can help you understand the current and near-term costs associated with your Dataflow pipelines. We also shared a series of considerations that can help you optimize your Dataflow pipelines, not just for today, but also as your business evolves over time. We shared lots of resources with you, and hope that you find them useful. We look forward to hearing about the savings and innovative solutions that you are able to deliver for your customers and your business.