The DataprocFileOutputCommitter feature is an enhanced
version of the open source FileOutputCommitter. It
enables concurrent writes by Apache Spark jobs to an output location.
Limitations
The DataprocFileOutputCommitter
feature supports Spark jobs run on
Dataproc Compute Engine clusters created with
the following image versions:
2.1 image versions 2.1.10 and higher
2.0 image versions 2.0.62 and higher
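As a rough illustration of the version limits above, the following shell sketch checks whether an image version string falls in a supported range. The function name supports_committer is hypothetical, and the sketch assumes a full three-part version such as 2.1.10, optionally followed by an OS suffix such as -debian11. It treats only the 2.0 and 2.1 image lines listed above as supported.

```shell
#!/bin/sh
# supports_committer VERSION
# Exits 0 when VERSION meets the minimums listed above, non-zero otherwise.
# Hypothetical helper; expects MAJOR.MINOR.PATCH with an optional "-suffix".
supports_committer() {
    version="${1%%-*}"       # drop any suffix, e.g. "-debian11"
    major="${version%%.*}"   # text before the first dot
    rest="${version#*.}"     # text after the first dot
    minor="${rest%%.*}"
    patch="${rest#*.}"
    case "${major}.${minor}" in
        2.1) [ "${patch}" -ge 10 ] ;;   # 2.1.10 and higher
        2.0) [ "${patch}" -ge 62 ] ;;   # 2.0.62 and higher
        *)   return 1 ;;                # image lines not listed above
    esac
}

# Example call with an illustrative version string.
if supports_committer "2.1.10-debian11"; then
    echo "supported"
else
    echo "not supported"
fi
```

A check like this can run before cluster creation scripts so that an unsupported image version is caught early.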
Use DataprocFileOutputCommitter
To use this feature:
Create a Dataproc on Compute Engine cluster using image version
2.1.10 or 2.0.62, or higher.
Set
spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory
and
spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false
as job properties when you submit a Spark job to the cluster.
- Google Cloud CLI example:
gcloud dataproc jobs submit spark \
    --properties=spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory,spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
    --region=REGION \
    other args ...
- Code example:
sc.hadoopConfiguration.set("spark.hadoop.mapreduce.outputcommitter.factory.class","org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory")
sc.hadoopConfiguration.set("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs","false")
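Because the --properties value in the CLI example is long, it can help to assemble it from named shell variables before submitting. The following is a minimal dry-run sketch, not a definitive script: the variable names are hypothetical, REGION remains a placeholder, and the command is only printed. Any remaining job arguments are passed through unchanged.

```shell
#!/bin/sh
# Dry-run sketch: prints the gcloud command instead of running it.
set -e

# The two job properties named in the steps above.
FACTORY_KEY="spark.hadoop.mapreduce.outputcommitter.factory.class"
FACTORY_VALUE="org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory"
MARKER_KEY="spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs"

# gcloud takes --properties as a comma-separated list of key=value pairs.
PROPERTIES="${FACTORY_KEY}=${FACTORY_VALUE},${MARKER_KEY}=false"

# REGION is a placeholder; export REGION beforehand to override it.
REGION="${REGION:-REGION}"

# Remaining job arguments are passed through via "$@".
# Remove the leading "echo" to submit the job for real.
echo gcloud dataproc jobs submit spark \
    --properties="${PROPERTIES}" \
    --region="${REGION}" \
    "$@"
```

Printing the command first makes it easy to confirm that both properties are present and correctly comma-separated before a job is submitted.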