This page introduces the best practices for building and optimizing batch processing platforms with Google Kubernetes Engine (GKE), including best practices for:
- architecture
- Job management
- multi-tenancy
- security
- queueing
- storage
- performance
- cost efficiency
- monitoring
GKE provides a powerful framework for orchestrating batch workloads such as data processing, training machine learning models, running scientific simulations, and other high performance computing workloads.
These best practices are intended for platform administrators, cloud architects, and operations professionals interested in deploying batch workloads in GKE. The Reference Architecture for a Batch Processing Platform on GKE showcases many of the best practices discussed in this guide, and can be deployed in your own Google Cloud project.
How batch workloads work
A batch workload is a group of tasks that run to completion without user intervention. To define tasks, you use the Kubernetes Jobs resource. A batch platform receives the Jobs and queues them in the order they are received. The queue in the batch platform applies processing logic such as priority, quota, and allocable resources. By queueing and customizing the batch processing parameters, Kubernetes lets you optimize the use of available resources, minimize the idle time for scheduled Jobs, and maximize cost savings. The following diagram shows the GKE components that can be part of a batch platform.
Batch platform management
Traditionally, batch platforms have two main user personas, developers and platform administrators:
- A developer submits a Job specifying the program, the data to be processed, and requirements for the Job. Then, the developer receives confirmation of the Job submission and a unique identifier. Once the Job is complete, the developer receives a notification along with any output or results of the Job.
- A platform administrator manages and delivers an efficient and reliable batch processing platform to the developers.
A batch processing platform must meet the following requirements:
- The platform resources are properly provisioned to ensure that Jobs run with little to no user intervention required.
- The platform resources are configured according to the organization's security and observability best practices.
- The platform resources are used as efficiently as possible. In case of resource contention, the most important work gets done first.
Prepare the batch platform architecture in GKE
A GKE environment consists of nodes, which are Compute Engine virtual machines (VMs) grouped together to form a cluster.
The following table lists the key recommendations when planning and designing your batch platform architecture:
Recommendation | Resources
---|---
Select a GKE mode of operation | GKE offers two modes of operation: Autopilot and Standard. See the high-level comparison between Autopilot and Standard mode.
Choose the machine type for your nodes | GKE supports the Compute Engine VM series, including the general-purpose, compute-optimized, memory-optimized, and accelerator-optimized machine families. Each machine series is associated with one or more CPU platforms, such as Arm processors and x86 processors from Intel and AMD. Learn about the options that are currently available for your workload.
Use hardware accelerators for your nodes | You can also use hardware accelerators such as graphics processing units (GPUs) and Tensor Processing Units (TPUs) in GKE. Consider the GPU time-sharing strategy, which allows multiple containers to share time on the same physical GPU; this approach is useful for burstable, homogeneous GPU workloads with low requests. Use multi-instance GPUs to partition a single GPU into separately schedulable instances that are shared across multiple containers at the same time.
Enable cluster autoscaler on Standard clusters | GKE automatically resizes the number of nodes in a given node pool based on the demands of your workloads. You don't need to manually add or remove nodes or over-provision your node pools; instead, you specify a minimum and maximum size for the node pool and let cluster autoscaler adjust capacity to match your batch workloads. With Autopilot clusters, you don't need to worry about provisioning nodes or managing node pools because node pools are automatically provisioned through node auto-provisioning, and are automatically scaled to meet the requirements of your workloads.
Enroll your cluster in a release channel | GKE can automatically manage your cluster's version and upgrades. Based on your release adoption model, you can enroll your cluster in one of the available GKE release channels. To learn more, see How to choose the best release channel for your clusters.
Define a scope of maintenance to exclude for your cluster | With an upgrade scope exclusion window defined, GKE ensures that long-running batch workloads aren't disrupted by maintenance until they complete. To learn more, see Scope of maintenance to exclude.
Manage the Job lifecycle
In Kubernetes, you run your workloads in a set of Pods. A Pod is a group of one or more containers with shared storage and network resources, defined by a Kubernetes specification.
A Job creates one or more Pods and continues to retry execution of the Pods until a specified number of them successfully terminate. As Pods complete, the Job tracks the successful completions. When a specified number of successful completions is reached, the Job is complete.
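For example, the following manifest is a minimal sketch of a Job that runs three tasks to completion; the name, image, and command are placeholder assumptions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job-example   # hypothetical name
spec:
  completions: 3    # the Job is complete after three Pods succeed
  parallelism: 3    # run up to three Pods at the same time
  backoffLimit: 4   # retry failed Pods up to four times
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36   # placeholder image
        command: ["sh", "-c", "echo processing task && sleep 10"]
```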
The following table lists the key recommendations when designing and managing Jobs:
Recommendation | Resources |
---|---|
Select the Job completion mode | Specify the completion mode as `Indexed`. This configuration is useful when assigning a partition of the data to be processed based on the Pod's index; the Pods of a Job get an associated completion index. Deleting a Job cleans up the Pods it created, and suspending a Job deletes its active Pods until the Job is resumed. An Indexed Job sketch follows this table.
Set CronJobs for regular scheduled actions | Use CronJob for GKE to perform regular scheduled actions, such as backups, report generation, or scheduled training for machine learning models. |
Manage failures in a Job | Define the Kubernetes Pod failure policy and Pod backoff failure limit to handle retriable and non-retriable failures in a Job. This definition improves cluster resource consumption by avoiding unnecessary Pod retries and Job failures due to Pod disruptions. For example, you can configure your workload to handle preemption, API-initiated eviction, or taint-based eviction, where Pods that don't have a toleration for the `NoExecute` taint effect are evicted. A Pod failure policy appears in the sketch after this table.
Manage multiple Jobs as a unit | Use the JobSet API to manage multiple Jobs as a unit to address workload patterns, such as one driver (or coordinator) and multiple workers (for example, `MPIJob`), while setting Job defaults that are aligned with common patterns based on your use cases. For example, you can create an Indexed Job by default, create a headless Service for predictable fully qualified domain names (FQDNs) for Pods, and set the associated Pod failure policy. A JobSet sketch follows this table.
Extend run time for a Pod that doesn't tolerate restarts | Set the Kubernetes `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation to `false` in the Pod specification. The cluster autoscaler respects the eviction rules set on Pods; these restrictions can prevent a node from being deleted by the autoscaler if it contains a Pod with the `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation. To learn more, see Considering Pod scheduling and disruption.
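The following manifest is a sketch that combines several of the recommendations above: an Indexed Job with a Pod failure policy, a backoff failure limit, and the `safe-to-evict` annotation. The name, image, and exit code are illustrative assumptions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-job-example   # hypothetical name
spec:
  completionMode: Indexed   # each Pod gets an associated completion index
  completions: 5
  parallelism: 2
  backoffLimit: 6           # Pod backoff failure limit
  podFailurePolicy:
    rules:
    # Don't count disruptions (preemption, API-initiated eviction,
    # taint-based eviction) against the backoff limit.
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
    # Fail the whole Job immediately on a non-retriable exit code.
    - action: FailJob
      onExitCodes:
        containerName: worker
        operator: In
        values: [42]          # illustrative non-retriable exit code
  template:
    metadata:
      annotations:
        # Ask the cluster autoscaler not to evict this Pod while it runs.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      restartPolicy: Never    # required when using a Pod failure policy
      containers:
      - name: worker
        image: busybox:1.36   # placeholder image
        # The completion index selects this Pod's partition of the data.
        command: ["sh", "-c", "echo processing partition $JOB_COMPLETION_INDEX"]
```

Similarly, the driver-and-workers pattern could be expressed with the JobSet API, as in this sketch; it assumes the JobSet controller and its `jobset.x-k8s.io/v1alpha2` API are installed in the cluster, and all names are placeholders:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: driver-workers-example   # hypothetical name
spec:
  replicatedJobs:
  - name: driver
    replicas: 1
    template:                 # Job template for the coordinator
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: driver
              image: busybox:1.36   # placeholder image
              command: ["sh", "-c", "echo coordinating"]
  - name: workers
    replicas: 1
    template:                 # Job template for the worker pool
      spec:
        completions: 4
        parallelism: 4
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: busybox:1.36   # placeholder image
              command: ["sh", "-c", "echo working"]
```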
Manage multi-tenancy
GKE cluster multi-tenancy is an alternative to managing separate clusters for each user or workload: a single cluster's resources are shared by different users or workloads, referred to as tenants, within a single organization. The management of GKE resources might follow criteria such as tenant isolation, quotas and limit ranges, or cost allocation.
The following table lists the key recommendations when managing multi-tenancy:
Recommendation | Resources |
---|---|
Use namespaces to manage tenant isolation | You can separate each tenant and their Kubernetes resources into their own namespaces. |
Use policies to enforce tenant isolation | Define policies to restrict API access, set quotas, constrain resource usage, and restrict what containers are allowed to do. These policies are scoped by namespaces; a namespace and quota sketch follows this table. |
Set GKE cost allocation | Use GKE cost allocation to get insights about cluster resource requests for each tenant on a namespace basis. |
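As a sketch of these recommendations, the following manifests create a namespace for a hypothetical tenant and apply a ResourceQuota to it; the tenant name and quota values are placeholder assumptions:

```yaml
# Namespace that isolates one tenant's Kubernetes resources.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a              # hypothetical tenant name
---
# Quota that caps the total resources the tenant can request.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "100"     # illustrative limits
    requests.memory: 400Gi
    count/jobs.batch: "50"  # cap the number of Jobs in the namespace
```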
Control access to the batch platform
GKE allows you to finely tune the access permissions of the workloads running on the cluster.
The following table lists the key recommendations when managing access and security:
Recommendation | Resources |
---|---|
Set Workload Identity Federation for GKE | GKE allows workloads in your GKE cluster to impersonate Identity and Access Management (IAM) service accounts to access Google Cloud services. By using Workload Identity Federation for GKE, workloads can securely access secrets stored outside GKE. To learn more, see Workload Identity Federation for GKE and accessing secrets stored outside GKE. |
Set cluster network isolation | Configure the network isolation of your clusters so that both the control plane endpoint and worker nodes can have internal IP addresses. To learn more, see About customizing network isolation. |
Use Shielded GKE Nodes | Configure Shielded GKE Nodes to provide strong, verifiable node identity and integrity to increase the security of GKE nodes. |
Physical isolation | For security reasons, your workloads might need stronger isolation. Use node taints and workload tolerations to physically separate tenants onto dedicated node pools, ensuring that only the appropriate workloads are scheduled on those node pools. A taint and toleration sketch follows this table. |
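As a sketch of the physical isolation pattern, assume a tenant-dedicated node pool whose nodes carry a hypothetical `tenant=team-a` taint and label; a tenant Pod then needs a matching toleration and node selector:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tenant-workload-example   # hypothetical name
  namespace: team-a
spec:
  # Tolerate the taint assumed on the tenant's node pool
  # (for example, tenant=team-a:NoSchedule).
  tolerations:
  - key: "tenant"
    operator: "Equal"
    value: "team-a"
    effect: "NoSchedule"
  # Ensure the Pod lands only on the tenant's nodes.
  nodeSelector:
    tenant: team-a
  containers:
  - name: worker
    image: busybox:1.36   # placeholder image
    command: ["sh", "-c", "echo isolated workload"]
```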
Queueing and fair sharing
To control resource consumption, you can assign resource quota limits for each tenant, queue incoming Jobs, and process Jobs in the order they were received.
The following table lists the key recommendations when managing queueing and fair sharing among batch workloads:
Recommendation | Resources |
---|---|
Use Kueue | Kueue is a Kubernetes-native Job queueing system for batch, high performance computing, machine learning, and similar applications in a Kubernetes cluster. To help with fair sharing of cluster resources between its tenants, Kueue manages quotas and how Jobs consume them: it decides when a Job should wait, when a Job should be admitted to start (meaning that its Pods can be created), and when a Job should be preempted (meaning that its active Pods are deleted). To learn how to implement a Job queueing system, see Implement a Job queuing system with quota sharing between namespaces on GKE. To learn more about Kueue, see Kueue concepts. A Kueue sketch follows this table. |
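The following manifests are a minimal Kueue sketch, assuming Kueue and its `kueue.x-k8s.io/v1beta1` API are installed in the cluster: a ResourceFlavor, a ClusterQueue that holds the shared quota, and a per-tenant LocalQueue. Names and quota values are placeholder assumptions:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}     # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 100   # illustrative quota
      - name: "memory"
        nominalQuota: 400Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue        # hypothetical tenant queue
  namespace: team-a
spec:
  clusterQueue: cluster-queue
```

A Job is then submitted to a queue by adding the `kueue.x-k8s.io/queue-name: team-a-queue` label to its metadata; Kueue keeps the Job suspended until quota in the ClusterQueue becomes available.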
Storage, performance, and cost efficiency
The efficient use of your GKE compute and storage resources can reduce costs. One strategy is to right-size and configure your compute instances to align with your batch processing needs without sacrificing performance.
The following table lists the key recommendations when designing and managing storage and optimizing performance:
Recommendation | Resources |
---|---|
Use Compute Engine Persistent Disks | Compute Engine offers several Persistent Disk types, including standard, balanced, and SSD; choose the type and size that match your workload's throughput, IOPS, and latency requirements. |
Use network-attached storage | Use network-attached storage alongside Persistent Disk for optimal storage performance, such as Filestore for shared file storage across Pods and Cloud Storage for large-scale object data. |
Define Pub/Sub | Your batch workload can also read and write data from other Google Cloud services. For example, you can read input data from Pub/Sub and write the results to a data warehouse such as BigQuery, from which reports and dashboards are updated. |
Specify tuning parameters for your workload | Review the tuning options for your workload and set configurations, such as resource requests and limits, that match your batch processing requirements. |
Optimize your workload's networking and latency | GKE supports compact placement policy for node pools, which specifies that these nodes (and thus the workloads running on them) should be placed in closer physical proximity to each other within a zone. This is especially useful for tightly coupled, high performance workloads where low latency between the different processes comprising the workload is a major concern. To learn more, see compact placement. |
Use Spot VMs | Spot VMs are Compute Engine virtual machine (VM) instances that are priced lower than standard Compute Engine VMs and provide no guarantee of availability. We recommend that you use Spot VMs for fault-tolerant Jobs that tolerate preemption and retries; see the sketch after this table. |
Use image streaming | Use Image streaming to pull container images. GKE streams data from eligible images, which allows your workloads to initialize without waiting for the entire image to download, leading to significantly improved initialization times and better cost efficiency. |
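As a sketch, a fault-tolerant Job can target Spot VM nodes through the `cloud.google.com/gke-spot` node label; the name and image are placeholders, and the grace period is an illustrative value that reflects the short termination notice Spot VMs receive:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: spot-job-example   # hypothetical name
spec:
  backoffLimit: 10         # expect retries when Spot capacity is reclaimed
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"   # schedule only on Spot VM nodes
      terminationGracePeriodSeconds: 25     # finish cleanup within the notice window
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36   # placeholder image
        command: ["sh", "-c", "echo fault-tolerant work"]
```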
Monitoring
GKE is integrated with observability and logging tools that help you monitor the reliability and efficiency of your cluster. The following table lists the key recommendations when enabling and using GKE observability tools:
Recommendation | Resources |
---|---|
Use Prometheus | GKE is integrated with Google Cloud Observability. Customize the metrics that you want GKE to send to Cloud Logging and Cloud Monitoring. Google Cloud Managed Service for Prometheus is enabled for GKE clusters by default; we recommend that you use managed collection to eliminate the complexity of setting up and maintaining Prometheus servers. To learn more, see Managed Service for Prometheus. A PodMonitoring sketch follows this table. |
Use Cloud Monitoring dashboards | Use the Monitoring dashboards for GKE to see a high level overview of cluster and resource utilization and drill down and filter through various metrics and dimensions. To learn more, see Observing your GKE clusters. |
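For example, with managed collection enabled, a PodMonitoring resource is a sketch of how metrics could be scraped from batch Pods; the namespace, label, and port name are placeholder assumptions:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: batch-worker-metrics   # hypothetical name
  namespace: team-a
spec:
  selector:
    matchLabels:
      app: batch-worker        # hypothetical workload label
  endpoints:
  - port: metrics              # named container port that exposes metrics
    interval: 30s
```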
What's next
- Learn how to Deploy a batch system using Kueue
- See the Best practices for running cost-optimized Kubernetes applications on GKE