Run Spark jobs with DataprocFileOutputCommitter
The DataprocFileOutputCommitter feature is an enhanced version of the open source FileOutputCommitter. It enables concurrent writes by Apache Spark jobs to an output location.
Limitations

The DataprocFileOutputCommitter feature supports Spark jobs run on Dataproc Compute Engine clusters created with the following image versions:

- 2.1 image versions 2.1.10 and higher

- 2.0 image versions 2.0.62 and higher
Use DataprocFileOutputCommitter

To use this feature:

1. Create a Dataproc on Compute Engine cluster using image version 2.1.10 or 2.0.62 or higher.

2. Set spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory and spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false as a job property when you submit a Spark job to the cluster.
Google Cloud CLI example:
gcloud dataproc jobs submit spark \
--properties=spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory,spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
--region=REGION \
other args ...
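The --properties flag takes a single comma-separated list of key=value pairs, which is easy to get wrong when assembling the command in a script. A minimal illustrative Python sketch that builds the flag from a dictionary (the helper itself is hypothetical, not part of the gcloud CLI; the two keys are the properties from the example above):

```python
# Build the value for gcloud's --properties flag from a dict of Spark
# properties. Illustrative helper only, not part of the gcloud CLI.
def properties_flag(props: dict[str, str]) -> str:
    # gcloud expects one comma-separated list of key=value pairs,
    # so values here must not themselves contain commas.
    return "--properties=" + ",".join(f"{k}={v}" for k, v in props.items())

committer_props = {
    "spark.hadoop.mapreduce.outputcommitter.factory.class":
        "org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory",
    "spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs": "false",
}

print(properties_flag(committer_props))
```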
Code example:

sc.hadoopConfiguration.set("spark.hadoop.mapreduce.outputcommitter.factory.class","org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory")
sc.hadoopConfiguration.set("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs","false")

Note: The Dataproc file output committer must set spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false to avoid conflicts between success marker files created during concurrent writes. You can also set this property in spark-defaults.conf.

Last updated: 2025-09-04 (UTC).
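Instead of passing the properties with each job, they can also be set cluster-wide in spark-defaults.conf. A minimal sketch of the relevant entries (the file path assumes the default Spark configuration layout on the cluster nodes; verify it for your image version):

```properties
# /etc/spark/conf/spark-defaults.conf (assumed default location)
spark.hadoop.mapreduce.outputcommitter.factory.class org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory
spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs false
```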