Data plane identity
Dataproc on GKE uses GKE workload identity to allow pods within the Dataproc on GKE cluster to act with the authority of the default Dataproc VM service account (data plane identity). Workload identity requires the following permissions to update IAM policies on the Google service account (GSA) used by your Dataproc on GKE virtual cluster:
compute.projects.get
iam.serviceAccounts.getIamPolicy
iam.serviceAccounts.setIamPolicy
GKE workload identity links the following Kubernetes service accounts (KSAs) to the Dataproc VM service account:

- agent KSA (interacts with the Dataproc control plane):
  serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/agent]
- spark-driver KSA (runs Spark drivers):
  serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/spark-driver]
- spark-executor KSA (runs Spark executors):
  serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/spark-executor]
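The three member strings above follow one fixed pattern: serviceAccount:PROJECT.svc.id.goog[NAMESPACE/KSA_NAME]. As a minimal sketch (the project and namespace values below are placeholders, not values from this page), they can be generated in a loop:

```shell
# Placeholder values for illustration only; substitute your own.
PROJECT="my-project"
DPGKE_NAMESPACE="my-dpgke-cluster"

# Build the workload identity member string for each of the three KSAs.
members=""
for ksa in agent spark-driver spark-executor; do
  members="${members}serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/${ksa}]
"
done
printf '%s' "$members"
```

The same pattern is what appears in the --member flags of the binding commands later on this page.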
Assign roles
Grant permissions to the Dataproc VM service account to allow the spark-driver and spark-executor KSAs to access project resources, data sources, data sinks, and any other services required by your workload.
Example:
The following commands assign roles to the default Dataproc VM service account to allow Spark workloads running on the Dataproc on GKE cluster to access Cloud Storage buckets and BigQuery datasets in the project. gcloud projects add-iam-policy-binding accepts one --role and one --member per invocation, so each role is granted with a separate command:

    gcloud projects add-iam-policy-binding \
        --role=roles/storage.objectAdmin \
        --member="serviceAccount:project-number[email protected]" \
        "${PROJECT}"

    gcloud projects add-iam-policy-binding \
        --role=roles/bigquery.dataEditor \
        --member="serviceAccount:project-number[email protected]" \
        "${PROJECT}"
Custom IAM configuration
Dataproc on GKE uses GKE workload identity to link the default Dataproc VM service account (data plane identity) to the three GKE service accounts (KSAs).
To create and use a different Google service account (GSA) to link to the KSAs:
Create the GSA (see Creating and managing service accounts).
gcloud CLI example:
    gcloud iam service-accounts create "dataproc-${USER}" \
        --description "Used by Dataproc on GKE workloads."

Notes:
- The example sets the GSA name as "dataproc-${USER}", but you can use a different name.
Set environment variables:

    PROJECT=project-id
    DPGKE_GSA="dataproc-${USER}@${PROJECT}.iam.gserviceaccount.com"
    DPGKE_NAMESPACE=GKE namespace

Notes:
- DPGKE_GSA: The examples set and use DPGKE_GSA as the name of the variable that contains the email address of your GSA. You can set and use a different variable name.
- DPGKE_NAMESPACE: The default GKE namespace is the name of your Dataproc on GKE cluster.
When you create the Dataproc on GKE cluster, add the following properties for Dataproc to use your GSA instead of the default GSA:
    --properties "dataproc:dataproc.gke.agent.google-service-account=${DPGKE_GSA}" \
    --properties "dataproc:dataproc.gke.spark.driver.google-service-account=${DPGKE_GSA}" \
    --properties "dataproc:dataproc.gke.spark.executor.google-service-account=${DPGKE_GSA}" \
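For context, a hedged sketch of how these properties might sit inside a full cluster-creation command follows. The cluster name, region, GKE cluster path, and pool settings are placeholders, and the other flags are assumptions about your setup; check gcloud dataproc clusters gke create --help for the flags your gcloud version supports:

```shell
# Sketch only: all values other than the --properties flags are placeholders.
gcloud dataproc clusters gke create "${DPGKE_CLUSTER}" \
    --region="${REGION}" \
    --gke-cluster="${GKE_CLUSTER}" \
    --spark-engine-version=latest \
    --pools="name=default,roles=default" \
    --properties "dataproc:dataproc.gke.agent.google-service-account=${DPGKE_GSA}" \
    --properties "dataproc:dataproc.gke.spark.driver.google-service-account=${DPGKE_GSA}" \
    --properties "dataproc:dataproc.gke.spark.executor.google-service-account=${DPGKE_GSA}"
```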
Run the following commands to assign necessary Workload Identity permissions to the service accounts:
- Assign your GSA the dataproc.worker role to allow it to act as the agent:

      gcloud projects add-iam-policy-binding \
          --role=roles/dataproc.worker \
          --member="serviceAccount:${DPGKE_GSA}" \
          "${PROJECT}"
- Assign the agent KSA the iam.workloadIdentityUser role to allow it to act as your GSA:

      gcloud iam service-accounts add-iam-policy-binding \
          --role=roles/iam.workloadIdentityUser \
          --member="serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/agent]" \
          "${DPGKE_GSA}"

- Grant the spark-driver KSA the iam.workloadIdentityUser role to allow it to act as your GSA:

      gcloud iam service-accounts add-iam-policy-binding \
          --role=roles/iam.workloadIdentityUser \
          --member="serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/spark-driver]" \
          "${DPGKE_GSA}"

- Grant the spark-executor KSA the iam.workloadIdentityUser role to allow it to act as your GSA:

      gcloud iam service-accounts add-iam-policy-binding \
          --role=roles/iam.workloadIdentityUser \
          --member="serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/spark-executor]" \
          "${DPGKE_GSA}"
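The three iam.workloadIdentityUser bindings differ only in the KSA name, so they can also be generated in one loop. The sketch below is a dry run that prints the commands rather than running them (all variable values are placeholders, not from this page):

```shell
# Dry run: generate the three identically shaped binding commands.
PROJECT="my-project"
DPGKE_NAMESPACE="my-dpgke-cluster"
DPGKE_GSA="dataproc-alice@${PROJECT}.iam.gserviceaccount.com"

bindings=""
for ksa in agent spark-driver spark-executor; do
  bindings="${bindings}gcloud iam service-accounts add-iam-policy-binding ${DPGKE_GSA} --role=roles/iam.workloadIdentityUser --member=serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/${ksa}]
"
done
printf '%s' "$bindings"
```

Reviewing the printed commands before executing them is a simple way to confirm the namespace and GSA email are correct.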