Run Spark jobs with DataprocFileOutputCommitter
The DataprocFileOutputCommitter feature is an enhanced
version of the open source FileOutputCommitter. It
enables concurrent writes by Apache Spark jobs to an output location.
Limitations
The DataprocFileOutputCommitter feature supports Spark jobs run on Dataproc Compute Engine clusters created with the following image versions:
- 2.1 image versions 2.1.10 and higher
- 2.0 image versions 2.0.62 and higher
Use DataprocFileOutputCommitter
To use this feature:
1. Create a Dataproc on Compute Engine cluster using image version 2.1.10 or higher, or 2.0.62 or higher.
2. Set spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory and spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false
as job properties when you submit a Spark job
to the cluster.
Google Cloud CLI example:
gcloud dataproc jobs submit spark \
--properties=spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory,spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
--region=REGION \
other args ...
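The `--properties` value in the command above is a single comma-separated list, which is easy to mistype on one long line. As a minimal sketch, the list can be assembled from two shell variables and the resulting command printed for review before it is run. The variable names (`FACTORY_PROP`, `MARKER_PROP`, `PROPERTIES`) are hypothetical, `REGION` is the same placeholder used above, and the remaining job arguments are deliberately left out.

```shell
#!/usr/bin/env bash
set -eu

# The two job properties named in the step above (variable names are illustrative).
FACTORY_PROP="spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory"
MARKER_PROP="spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false"

# gcloud takes --properties as one comma-separated list.
PROPERTIES="${FACTORY_PROP},${MARKER_PROP}"

# Placeholder: falls back to the literal text REGION when the variable is unset.
REGION="${REGION:-REGION}"

# Dry run: print the command instead of executing it.
# Cluster, class or jar, and job arguments are not shown here.
echo "gcloud dataproc jobs submit spark --properties=${PROPERTIES} --region=${REGION}"
```

Printing the command first makes it easier to confirm that both properties are present and separated by exactly one comma before submitting the job.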
Code example:
sc.hadoopConfiguration.set("spark.hadoop.mapreduce.outputcommitter.factory.class","org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory")
sc.hadoopConfiguration.set("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs","false")
Note: When you use the Dataproc file output committer, you must set spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false to avoid conflicts between success marker files created during concurrent writes. You can also set this property in spark-defaults.conf.
Last updated 2025-09-04 UTC.