Cloud Blog: Cut costs and boost efficiency with Dataflow’s new custom source reads

Source URL: https://cloud.google.com/blog/products/data-analytics/cut-costs-and-boost-efficiency-with-dataflows-new-source-reads/
Source: Cloud Blog
Title: Cut costs and boost efficiency with Dataflow’s new custom source reads

Feedly Summary: Scaling workloads often comes with a hefty price tag, especially in streaming environments, where latency is heavily scrutinized. It makes sense, then, to keep pipelines free of bottlenecks, because costs and latency grow with inefficiencies.
This is especially true for most modern autotuning strategies: whenever there are hot keys or hot workers bottlenecking processing and building up backlogs, data freshness suffers. Apache Kafka is an example of a streaming environment that can create hot spots in the pipeline. An autoscaler may try to compensate after the fact with additional compute units. However, this is not only costly, it’s also slow. An autoscaler only reacts after there’s a backlog of accumulated messages and incurs overhead spinning up new workers.
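The lag described above can be sketched with a toy simulation (illustrative only, not Dataflow code, with made-up rates): a reactive autoscaler only provisions a worker after a backlog is already visible, and pays a spin-up delay before the new capacity helps.

```python
# Toy model of reactive autoscaling: one worker falls behind the input
# rate, the backlog grows, and relief arrives only after a provisioning
# delay. All numbers are hypothetical.
def simulate_reactive_autoscaler(
    input_rate=30,        # MB/s arriving from the source
    per_worker_rate=20,   # MB/s one worker can process
    spinup_delay=5,       # seconds to provision a new worker
    steps=30,
):
    workers, backlog, pending_at = 1, 0.0, None
    history = []
    for t in range(steps):
        # A pending worker becomes useful only after the spin-up delay.
        if pending_at is not None and t >= pending_at:
            workers, pending_at = workers + 1, None
        backlog = max(0.0, backlog + input_rate - workers * per_worker_rate)
        # The autoscaler reacts only once a backlog has accumulated.
        if backlog > 0 and pending_at is None and workers < 4:
            pending_at = t + spinup_delay
        history.append(backlog)
    return history

history = simulate_reactive_autoscaler()
```

In this sketch the backlog peaks well above zero before the extra worker lands, which is exactly the data-freshness hit the autoscaler cannot avoid after the fact.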
To help, we recently introduced load balancing for Dataflow source reads. By distributing workloads more evenly and proactively relieving overwhelmed workers, load balancing lets pipelines push more data with fewer resources and at lower latency.
Real user gains
The following are production pipelines from top Dataflow customers that have benefited from the load balancing rollout.
Case 1 – Load balancing reduced user-worker scaling events and allowed the pipeline to perform better with fewer workers. (75% fewer workers resulted in a daily cost reduction of 64% in Google Compute Engine, and the backlog dropped from ~1 min to ~10 s.)

dataflow.googleapis.com/job/active_worker_instances

Work was offloaded to idle workers, producing a jump in 10th-percentile CPU utilization.

compute.googleapis.com/instance/cpu/utilization

Case 2 – Load balancing made it possible to scale with the input rate instead of being stuck at the maximum number of workers because of high backlog. (57% fewer workers at the same input rate resulted in a monthly cost reduction of 37% in Compute Engine, and peak backlog dropped from ~4.5 days to ~5.2 hours.)

Note: Each color represents a single worker and its assigned workload (this metric is not yet available externally).

Case 3 – Load balancing enabled a 27% increase in throughput and reduced backlog by ~1 day with the same number of workers.

dataflow.googleapis.com/job/estimated_bytes_produced_count

Load balancing at work
When a pipeline starts up, Dataflow doesn’t know in advance the amount of data coming in on any particular data source. In fact, it can change throughout the life of the pipeline. Therefore, when there are multiple topics involved, you may end up in the following situation:

If worker 1 is unable to keep up with the 30 MB/s load, you would need to bring up a third worker to handle topic 2. Load balancing offers a better solution: rebalance the work and let the pipeline keep up with just two workers.
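One way to picture the rebalance is a greedy longest-first assignment. This is a minimal sketch, not Dataflow's actual algorithm, and it assumes (hypothetically) that topic 2's 30 MB/s arrives as two independently readable 15 MB/s partitions alongside topic 1 at 10 MB/s:

```python
import heapq

def balance(loads_mb_s, num_workers):
    """Greedy longest-processing-time-first assignment: place each load,
    largest first, on the currently least-loaded worker."""
    heap = [(0.0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}
    for load in sorted(loads_mb_s, reverse=True):
        total, w = heapq.heappop(heap)
        assignment[w].append(load)
        heapq.heappush(heap, (total + load, w))
    return assignment

# Hypothetical partition loads: topic 2 as two 15 MB/s partitions,
# topic 1 as one 10 MB/s partition, spread over two workers.
assignment = balance([15, 15, 10], 2)
```

With the naive per-topic assignment one worker carries 30 MB/s while the other carries 10 MB/s; the greedy spread caps the busiest worker at 25 MB/s, so two workers suffice instead of three.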
With load balancing enabled, work is automatically and intelligently distributed by looking at the live input rate of each topic, preventing hot workers from bottlenecking the entire pipeline. This extends beyond unbalanced topics; it can also find per-key-level imbalances and redistribute keys among workers*, achieving balance at the core.
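The starred caveat matters here: a single key can't be split across workers, so there is a hard floor on how balanced the pipeline can get. A small illustrative calculation (with hypothetical MB/s rates) makes the bound concrete:

```python
# Keys are atomic units of work, so even a perfect balancer cannot push
# the busiest worker below the larger of (a) the hottest single key and
# (b) the average load per worker. Rates below are hypothetical.
def best_possible_max_load(key_rates, num_workers):
    return max(max(key_rates), sum(key_rates) / num_workers)

evenly_split = best_possible_max_load([10, 10, 10, 10], 2)  # average dominates
hot_key = best_possible_max_load([50, 5, 5, 5, 5], 2)       # hot key dominates
```

In the second case the 50 MB/s key pins one worker at 50 MB/s no matter how the remaining keys are shuffled, which is why hot keys can only be mitigated, not eliminated.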

On by default
We turned on custom source load balancing in Dataflow’s production environment across all regions in July. This is available to all customers by default for all Dataflow streaming engine pipelines. Get started with Dataflow and Google Cloud Managed Service for Apache Kafka right from the Google Cloud console. Stay tuned for future updates and learn more by contacting the Google Cloud sales team.
* Load balancing can't split an individual key in two, so hot keys can only be mitigated. If the work fundamentally can't be split well, load balancing can't fix the pipeline.

AI Summary and Description: Yes

**Summary:** The text discusses recent advancements in load balancing within Google Cloud’s Dataflow, addressing issues of efficiency and cost in streaming environments. This new feature helps reduce latency and operational costs by intelligently distributing workloads across workers, potentially benefiting users in cloud computing and infrastructure optimization.

**Detailed Description:**
The content focuses on the importance of enhancing performance in streaming environments, particularly with the introduction of load balancing in Google Cloud’s Dataflow. Here are the major points highlighted:

– **Cost and Latency Challenges:**
– Streaming workloads often cause issues with latency and inefficiencies, leading to increased costs.
– Traditional autoscaling strategies can react to bottlenecks only after they occur, which can be both slow and expensive.

– **Introduction of Load Balancing:**
– The new load balancing feature proactively manages workloads and reduces the burden on overwhelmed workers.
– This leads to lower resource usage and improved data throughput, which is crucial for real-time data processing.

– **Real-World Benefits:**
– **Case Studies:**
– **Case 1:** Load balancing resulted in a 75% reduction in worker count, leading to a daily cost reduction of 64% and a decrease in backlog from approximately one minute to ten seconds.
– **Case 2:** This feature allowed for dynamic scaling in response to input rates, achieving a 57% reduction in workers, with a monthly cost reduction of 37% and a drop in peak backlog from 4.5 days to 5.2 hours.
– **Case 3:** Demonstrated a 27% increase in throughput while maintaining the same number of workers and reducing backlog significantly.

– **Operational Mechanism:**
– Load balancing actively analyzes incoming data rates and intelligently redistributes workloads among available workers, preventing hotspots and optimizing resource use.
– It can address imbalances not just among different data sources but down to the key level, although it cannot split individual keys.

– **Default Activation:**
– This feature has been enabled by default for all Dataflow streaming engine pipelines since July, making it accessible to all customers.

Overall, the implementation of load balancing not only enhances the performance of Dataflow but also presents significant cost savings for companies leveraging cloud infrastructure for real-time data processing. This advancement is particularly relevant for professionals in cloud computing and infrastructure security, as it highlights the continuous innovation aimed at efficiency and cost-effectiveness in cloud service operation.