The DataprocFileOutputCommitter feature is an enhanced
version of the open source FileOutputCommitter. It
enables concurrent writes by Apache Spark jobs to an output location.
Limitations
The DataprocFileOutputCommitter feature supports Spark jobs run on
Dataproc Compute Engine clusters created with
the following image versions:
To use the DataprocFileOutputCommitter, set the following job properties when you submit a Spark job to the cluster:
spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory
spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false
Google Cloud CLI example:
gcloud dataproc jobs submit spark \
    --properties=spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory,spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
    --region=REGION \
    other args ...
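The --properties value is a single comma-separated list of key=value pairs, which is easy to mistype by hand. The following is a minimal Python sketch, not part of the Dataproc tooling, that assembles the same command line from the two properties named above. The helper names (properties_flag, build_submit_command) and the placeholder values us-central1 and my-cluster are assumptions made for illustration only.

```python
import shlex

# Property names and values are copied from the text above.
COMMITTER_PROPERTIES = {
    "spark.hadoop.mapreduce.outputcommitter.factory.class":
        "org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory",
    "spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs": "false",
}


def properties_flag(properties):
    """Join key=value pairs with commas, the format used by --properties."""
    pairs = ",".join(f"{key}={value}" for key, value in properties.items())
    return "--properties=" + pairs


def build_submit_command(region, extra_args=()):
    """Return the argument list for `gcloud dataproc jobs submit spark`.

    `extra_args` stands in for the "other args" of the example above;
    they are appended unchanged (hypothetical helper, for illustration).
    """
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        properties_flag(COMMITTER_PROPERTIES),
        f"--region={region}",
        *extra_args,
    ]


if __name__ == "__main__":
    # "us-central1" and "--cluster=my-cluster" are placeholder values.
    command = build_submit_command("us-central1", ["--cluster=my-cluster"])
    print(shlex.join(command))
```

The sketch only builds and prints the command; it does not run it. Passing the resulting list to subprocess.run would submit the job, assuming the Google Cloud CLI is installed and authenticated.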