Run Spark jobs with DataprocFileOutputCommitter
The DataprocFileOutputCommitter feature is an enhanced version of the open source FileOutputCommitter. It enables concurrent writes by Apache Spark jobs to an output location.
Limitations
The DataprocFileOutputCommitter feature supports Spark jobs run on Dataproc Compute Engine clusters created with the following image versions:
- 2.1 image versions 2.1.10 and higher
- 2.0 image versions 2.0.62 and higher
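Use DataprocFileOutputCommitter
To use this feature:
1. Create a Dataproc on Compute Engine cluster using image version 2.1.10 or higher, or 2.0.62 or higher.
2. Set spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory and spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false as job properties when you submit a Spark job to the cluster.
Google Cloud CLI example: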
gcloud dataproc jobs submit spark \
--properties=spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory,spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
--region=REGION \
other args ...
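The command above leaves the remaining arguments elided. As a minimal sketch of what a fuller submission might look like, the script below assembles the two committer properties into the comma-separated string that `--properties` expects and prints the resulting command for review instead of running it. The region, cluster name, main class, and JAR path are hypothetical placeholders chosen for illustration; they do not come from this page.

```shell
#!/bin/sh
set -eu

# Hypothetical placeholder values (assumptions); replace with your own.
REGION="us-central1"
CLUSTER="example-cluster"
MAIN_CLASS="com.example.WriteJob"
JAR="gs://example-bucket/jars/write-job.jar"

# The two job properties named in step 2.
FACTORY_PROP="spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory"
MARKER_PROP="spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false"

# --properties takes a single comma-separated list of key=value pairs.
PROPERTIES="${FACTORY_PROP},${MARKER_PROP}"

# Dry run: print the command rather than submitting it.
# Remove the leading "echo" to actually submit the job.
echo gcloud dataproc jobs submit spark \
  --cluster="$CLUSTER" \
  --region="$REGION" \
  --class="$MAIN_CLASS" \
  --jars="$JAR" \
  --properties="$PROPERTIES"
```

Keeping each property in its own variable makes it harder to drop `marksuccessfuljobs=false` by accident when the property list is edited later.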
Code example:

```
sc.hadoopConfiguration.set("spark.hadoop.mapreduce.outputcommitter.factory.class","org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory")
sc.hadoopConfiguration.set("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs","false")
```

Note: When you use the Dataproc file output committer, you must set spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false to avoid conflicts between the success marker files created during concurrent writes. You can also set this property in spark-defaults.conf.

Last updated 2025-09-04 UTC.