Run Spark jobs with DataprocFileOutputCommitter

The DataprocFileOutputCommitter feature is an enhanced version of the open source FileOutputCommitter. It allows multiple Apache Spark jobs to write concurrently to the same output location.

Limitations

The DataprocFileOutputCommitter feature supports Spark jobs run on Dataproc on Compute Engine clusters created with the following image versions:

  • 2.1 image versions 2.1.10 and higher

  • 2.0 image versions 2.0.62 and higher

Use DataprocFileOutputCommitter

To use this feature:

  1. Create a Dataproc on Compute Engine cluster with image version 2.1.10 or higher, or 2.0.62 or higher.

  2. Set spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory and spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false as job properties when you submit a Spark job to the cluster. Setting mapreduce.fileoutputcommitter.marksuccessfuljobs to false disables creation of the _SUCCESS marker file in the output location.

    • Google Cloud CLI example:
    gcloud dataproc jobs submit spark \
        --properties=spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory,spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
        --region=REGION \
        other args ...
    
    • Code example (setting the properties in application code instead; note that keys set directly on hadoopConfiguration omit the spark.hadoop. prefix, which Spark only uses when copying job properties into the Hadoop configuration):
    sc.hadoopConfiguration.set("mapreduce.outputcommitter.factory.class","org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory")
    sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs","false")
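If every Spark job on a cluster should use the committer, the two properties can instead be set once in spark-defaults.conf rather than per job. A minimal sketch (the /etc/spark/conf path assumes the standard Spark configuration layout on Dataproc images):

```
# /etc/spark/conf/spark-defaults.conf
# Route all Spark output commits through the Dataproc committer factory.
spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory
# Disable the _SUCCESS marker file, which can conflict when several jobs
# write to the same output location.
spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false
```

Cluster-wide Spark defaults can also be supplied at cluster creation time through Dataproc cluster properties (the spark: prefix in --properties); see the Dataproc cluster properties documentation for the exact syntax.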