Run Spark jobs with DataprocFileOutputCommitter
The DataprocFileOutputCommitter feature is an enhanced version of the open source FileOutputCommitter. It enables concurrent writes by Apache Spark jobs to an output location.
Limitations
The DataprocFileOutputCommitter feature supports Spark jobs run on Dataproc Compute Engine clusters created with the following image versions:
- 2.1 image versions 2.1.10 and higher
- 2.0 image versions 2.0.62 and higher
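The minimum image versions for this feature (2.1.10 for 2.1 images, 2.0.62 for 2.0 images) can be checked in a script before a job is submitted. The following is a minimal sketch, not part of Dataproc: the function name `supports_dataproc_committer` is hypothetical, it relies on `sort -V` from GNU coreutils, and it assumes that image series not listed on this page should be rejected.

```shell
# Hypothetical helper: succeeds (exit status 0) when a full image version
# such as "2.1.27-debian11" meets the minimum for its 2.0 or 2.1 series.
supports_dataproc_committer() {
  local version="${1%%-*}"   # drop an OS suffix such as "-debian11"
  local minimum
  case "$version" in
    2.0.*) minimum="2.0.62" ;;
    2.1.*) minimum="2.1.10" ;;
    *) return 1 ;;           # assumption: series not listed here are rejected
  esac
  # sort -V orders version strings numerically; the version qualifies when
  # the minimum sorts first (or the two are equal).
  [ "$(printf '%s\n%s\n' "$minimum" "$version" | sort -V | head -n 1)" = "$minimum" ]
}
```

For an existing cluster, the version string passed to the function can be read with `gcloud dataproc clusters describe CLUSTER_NAME --region=REGION --format="value(config.softwareConfig.imageVersion)"`, where CLUSTER_NAME and REGION are placeholders.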
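Use DataprocFileOutputCommitter
To use this feature:
1. Create a Dataproc on Compute Engine cluster that uses a 2.1 image at version 2.1.10 or higher, or a 2.0 image at version 2.0.62 or higher.
2. Set spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory and spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false as job properties when you submit a Spark job to the cluster.
Google Cloud CLI example: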
gcloud dataproc jobs submit spark \
--properties=spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory,spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
--region=REGION \
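other args ...
Code example:
// Keys set directly on the Hadoop configuration omit the spark.hadoop. prefix,
// which Spark strips only when it copies properties from SparkConf.
sc.hadoopConfiguration.set("mapreduce.outputcommitter.factory.class", "org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory")
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
Note: With the Dataproc file output committer, spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs must be set to false to avoid conflicts between the success marker files created during concurrent writes. You can also set this property in spark-defaults.conf.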
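Because the `--properties` value in the CLI example is long, it can help to assemble it separately and inspect it before submitting. This is a minimal sketch: the function name `committer_properties` is hypothetical, and the property names and values are the ones given on this page.

```shell
# Hypothetical helper: prints the two job properties as a single
# comma-separated value suitable for the --properties flag.
committer_properties() {
  local factory_key="spark.hadoop.mapreduce.outputcommitter.factory.class"
  local factory_class="org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory"
  local marker_key="spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs"
  printf '%s=%s,%s=false\n' "$factory_key" "$factory_class" "$marker_key"
}

# Print the assembled value for inspection before submitting a job.
committer_properties
```

The output can then be passed to the submit command as `--properties="$(committer_properties)"`, with `--region=REGION` and the remaining job arguments supplied as in the CLI example.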
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-09-04 UTC.