Using data sources and destinations

Cloud Dataflow SQL supports reading from Cloud Pub/Sub topics and BigQuery tables, and writing to BigQuery tables.

To use Cloud Dataflow SQL and add Cloud Dataflow sources, you must switch to the Cloud Dataflow SQL UI.

Pub/Sub

Reading from a Pub/Sub topic

The Cloud Dataflow SQL UI provides a way to find Cloud Pub/Sub data source objects for any project that you have access to. You don't need to remember the full topic names.

To add a topic as a Cloud Dataflow source, the topic must have a schema. For an example that walks through assigning a schema, see Assign a schema to your Cloud Pub/Sub topic.

Adding a Pub/Sub topic as a Cloud Dataflow source

To add a Cloud Pub/Sub topic to the Resources section in the left navigation panel, you must add the topic as a Cloud Dataflow source.

  1. In the left navigation panel, click the Add Data drop-down list and select Cloud Dataflow sources.

    Under ADD DATA, select Dataflow sources

  2. In the Add Cloud Dataflow source panel that opens on the right, choose Cloud Pub/Sub topics. In the search box, search for a topic name. Select the topic and click Add.

    The following screenshot shows a search for the transactions topic:

    Search for the transactions Pub/Sub topic

  3. The Resources section in the left navigation panel displays the Cloud Pub/Sub topics that you added. To see the list of topics, expand Cloud Dataflow sources and then expand Cloud Pub/Sub topics.

Viewing a topic's schema

To see the schema of a specific topic, expand Cloud Dataflow sources in the Resources section in the left navigation panel. Then, expand Cloud Pub/Sub topics. Clicking a topic shows information about the resource in the details panel, including the Schema. The schema includes the following fields:

  • event_timestamp: the field that the watermark tracks
  • attributes: the Cloud Pub/Sub message attributes
  • payload: the contents of the Cloud Pub/Sub message. Cloud Pub/Sub messages must be in JSON format, as described in the JSON streaming library reference page. For example, a message formatted as {"k1":"v1", "k2":"v2"} maps to a payload nested row with two string fields named k1 and k2. You must serialize every field that is in the schema, regardless of whether the field's value is null or non-null. For example, for a NULLABLE field, you must serialize the field in the JSON with a null value. If your JSON does not contain a field that is in the schema, Cloud Dataflow SQL returns an error.

The following screenshot shows the schema for the transactions topic.

View the transactions schema

Using a topic in a query

You can specify a Cloud Pub/Sub topic with a fully qualified, dot-separated list of identifiers that follow the Standard SQL lexical structure:

  pubsub.topic.project-id.topic-name

You must use backticks to enclose identifiers that contain characters other than letters, numbers, or underscores.

For example, the following string specifies the Cloud Pub/Sub topic daily.transactions from the project dataflow-sql.

  pubsub.topic.`dataflow-sql`.`daily.transactions`
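
For example, a minimal query over that topic might look like the following sketch. The tr alias and the amount payload field are illustrative assumptions; use the fields defined in your topic's schema.

  SELECT tr.event_timestamp, tr.payload.amount
  FROM pubsub.topic.`dataflow-sql`.`daily.transactions` AS tr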

You can also populate the query editor by selecting the topic in the Resources section in the left navigation panel. Expand Cloud Dataflow sources and then expand Cloud Pub/Sub topics. When you click a topic in the navigation panel, you can then click the Query Topic button on the right side of the details panel to populate the query box with a basic query for that topic.

The following screenshot shows the populated query in the query editor:

The Query topic button populates the query box

Use the topic's schema to write your SQL query. When you enter a query in the Cloud Dataflow SQL UI, the query validator verifies the query syntax. A green check mark icon is displayed if the query is valid. If the query is invalid, a red exclamation point icon is displayed. If your query syntax is invalid, clicking on the validator icon provides information about what you need to fix.

The following data enrichment query adds a sales_region field to a Cloud Pub/Sub stream of events (transactions). The sales regions come from a BigQuery table (us_state_salesregions) that maps states to sales regions.

Enter your query in the editor
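
For reference, a sketch of this enrichment query is shown below. The payload field names (amount, state) and the table columns (state_code, sales_region) are assumptions for illustration; substitute the field names from your own topic and table schemas.

  SELECT tr.payload.amount, tr.payload.state, sr.sales_region
  FROM pubsub.topic.`dataflow-sql`.transactions AS tr
    INNER JOIN bigquery.table.`dataflow-sql`.dataflow_sql_dataset.us_state_salesregions AS sr
    ON tr.payload.state = sr.state_code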

BigQuery

Reading from a BigQuery table

Adding a BigQuery table as a Cloud Dataflow source

You do not need to add your BigQuery table as a Cloud Dataflow source to add datasets and tables to the Resources section in the left navigation panel. You can expand pinned projects in the Resources section to view BigQuery datasets and tables. If your BigQuery table is in a project that is not listed in the Resources section, you can pin the project to add it to the list.

Viewing a table's schema

To see the schema of a specific table, find the project in the Resources section in the left navigation panel. Click the expand icon next to your project to display the datasets in that project. Then, click the expand icon next to a dataset to show the tables within that dataset. You can also use the search box to search for a specific dataset or table.

Clicking the table shows information about the resource in the details panel, including the Schema.

Using a table in a query

You can specify a BigQuery table with a fully qualified, dot-separated list of identifiers that follow the Standard SQL lexical structure:

  bigquery.table.project-id.my_dataset.my_table

You must use backticks to enclose identifiers that contain characters other than letters, numbers, or underscores.

For example, the following string specifies a BigQuery dataset dataflow_sql_dataset and table us_state_salesregions from the project dataflow-sql.

  bigquery.table.`dataflow-sql`.dataflow_sql_dataset.us_state_salesregions
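
For example, a query that reads this lookup table might look like the following sketch. The state_code and sales_region column names are assumptions based on the table's purpose of mapping states to sales regions.

  SELECT sr.state_code, sr.sales_region
  FROM bigquery.table.`dataflow-sql`.dataflow_sql_dataset.us_state_salesregions AS sr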

You can also populate the query editor with this string by selecting the table in the Resources section in the left navigation panel. Find your project in the Resources section, expand the project, and then expand the dataset that contains the table you want to use. When you click a table in the navigation panel, you can then click the Query Table button on the right side of the details panel to populate the query box with a basic query for that table.

The following screenshot shows the populated query in the query editor:

The Query table button populates the query box

Use the table's schema to write your SQL query. When you enter a query in the Cloud Dataflow SQL UI, the query validator verifies the query syntax. A green check mark icon is displayed if the query is valid. If the query is invalid, a red exclamation point icon is displayed. If your query syntax is invalid, clicking on the validator icon provides information about what you need to fix.

The following data enrichment query uses a BigQuery table (us_state_salesregions), which maps states to sales regions, to add a sales_region field to a Cloud Pub/Sub stream of events.

Enter your query in the editor

Writing to a BigQuery table

When you create a Cloud Dataflow job to run your SQL query, you must specify a destination BigQuery table for the results. Keep the following prerequisites and considerations in mind:

  • The destination dataset must exist before you click Create Cloud Dataflow job.
  • If the destination table does not exist, the job creates a new table with the specified name.
  • If the destination table already exists, the table must be empty.