# Write to Pub/Sub Lite from Spark (streaming)
Write messages to a Pub/Sub Lite topic from a Spark cluster in streaming mode.
Explore further
---------------

For detailed documentation that includes this code sample, see the following:

- [Write Pub/Sub Lite messages by using Apache Spark](/pubsub/lite/docs/write-messages-apache-spark)
Code sample
-----------

### Python

To authenticate to Pub/Sub Lite, set up Application Default Credentials. For more information, see [Set up authentication for a local development environment](/docs/authentication/set-up-adc-local-dev-environment).
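For local development, one common way to set up Application Default Credentials is to run `gcloud auth application-default login`, which stores credentials that the client libraries discover automatically.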
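The write path also depends on the Pub/Sub Lite Spark connector being on the Spark classpath. As a minimal sketch, one way to declare it is through the `spark.jars.packages` setting when building the session; the session construction mirrors the sample below, and the version in the Maven coordinates is illustrative, so check the connector's releases for one that matches your Spark installation:

    from pyspark.sql import SparkSession

    # Illustrative connector version; pick a release that matches your
    # Spark installation. The coordinates name the Pub/Sub Lite Spark
    # connector artifact, and Spark resolves it from Maven at startup.
    spark = (
        SparkSession.builder.appName("write-app")
        .config(
            "spark.jars.packages",
            "com.google.cloud:pubsublite-spark-sql-streaming:1.0.0",
        )
        .getOrCreate()
    )

On a Dataproc cluster, the connector jar can instead be supplied with the `--jars` flag of `gcloud dataproc jobs submit pyspark`.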
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import array, create_map, col, lit, when
    from pyspark.sql.types import BinaryType, StringType
    import uuid

    # TODO(developer):
    # project_number = 11223344556677
    # location = "us-central1-a"
    # topic_id = "your-topic-id"

    spark = SparkSession.builder.appName("write-app").getOrCreate()

    # Create a RateStreamSource that generates consecutive numbers with timestamps:
    # |-- timestamp: timestamp (nullable = true)
    # |-- value: long (nullable = true)
    sdf = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

    # Transform the dataframe to match the required data fields and data types:
    # https://github.com/googleapis/java-pubsublite-spark#data-schema
    sdf = (
        sdf.withColumn("key", lit("example").cast(BinaryType()))
        .withColumn("data", col("value").cast(StringType()).cast(BinaryType()))
        .withColumnRenamed("timestamp", "event_timestamp")
        # Populate the attributes field. For example, an even value will
        # have {"key1", [b"even"]}.
        .withColumn(
            "attributes",
            create_map(
                lit("key1"),
                array(when(col("value") % 2 == 0, b"even").otherwise(b"odd")),
            ),
        )
        .drop("value")
    )

    # After the transformation, the schema of the dataframe should look like:
    # |-- key: binary (nullable = false)
    # |-- data: binary (nullable = true)
    # |-- event_timestamp: timestamp (nullable = true)
    # |-- attributes: map (nullable = false)
    # |    |-- key: string
    # |    |-- value: array (valueContainsNull = false)
    # |    |    |-- element: binary (containsNull = false)
    sdf.printSchema()

    query = (
        sdf.writeStream.format("pubsublite")
        .option(
            "pubsublite.topic",
            f"projects/{project_number}/locations/{location}/topics/{topic_id}",
        )
        # Required. Use a unique checkpoint location for each job.
        .option("checkpointLocation", "/tmp/app" + uuid.uuid4().hex)
        .outputMode("append")
        .trigger(processingTime="1 second")
        .start()
    )

    # Let the query run for up to 60 seconds, then stop it.
    query.awaitTermination(60)
    query.stop()

What's next
-----------

To search and filter code samples for other Google Cloud products, see the [Google Cloud sample browser](/docs/samples?product=pubsublite).