Use Presto with Cloud Dataproc

Presto is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. Presto can query Hive, MySQL, Kafka and other data sources through connectors. This tutorial shows you how to:

  • Install the Presto service on a Cloud Dataproc cluster
  • Query public data from a Presto client installed on your local machine that communicates with a Presto service on your cluster
  • Run queries from a Java application that communicates with the Presto service on your cluster through the Presto Java JDBC driver.

Objectives

  1. Create a Cloud Dataproc cluster with Presto installed

  2. Prepare data. This tutorial uses the Chicago Taxi Trips public dataset, available in BigQuery.

    1. Extract the data from BigQuery
    2. Load the data into Cloud Storage as CSV files
    3. Transform data:
      1. Expose the data as a Hive external table to make the data queryable by Presto
      2. Convert the data from CSV format into Parquet format to make querying faster
  3. Send Presto CLI or application code queries using an SSH tunnel or Presto JDBC driver, respectively, to the Presto coordinator running on the cluster

  4. Check logs and monitor the Presto service through the Presto Web UI

Costs

This tutorial uses billable components of Google Cloud Platform, including Cloud Dataproc, Compute Engine, and Cloud Storage.

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Google Cloud Platform users may be eligible for a free trial.

Before you begin

If you haven't already done so, create a Google Cloud Platform project and a Cloud Storage bucket to hold the data used in this tutorial.

  1. Setting up your project

    1. Sign in to your Google Account.

      If you don't already have one, sign up for a new account.

    2. Select or create a GCP project.

      Go to the project selector page

    3. Make sure that billing is enabled for your Google Cloud Platform project. Learn how to enable billing.

    4. Enable the Cloud Dataproc, Compute Engine, Cloud Storage, and BigQuery APIs.

      Enable the APIs

    5. Install and initialize the Cloud SDK.

  2. Creating a Cloud Storage bucket in your project to hold the data used in this tutorial.

    1. In the GCP Console, go to the Cloud Storage Browser page.

      Go to the Cloud Storage Browser page

    2. Click Create bucket.
    3. In the Create bucket dialog, specify a globally unique bucket name, a default storage class, and a location for the bucket.
    4. Click Create.
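    Alternatively, you can create the bucket from the command line with gsutil. This is a minimal sketch; the us-west1 location is illustrative, so substitute your own bucket name and the region you plan to use for the cluster:
      gsutil mb -l us-west1 gs://bucket-name/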

Install the Presto client

Install the Presto client software on your local machine.

  1. Download the presto-cli-nnn-executable.jar to your local machine, rename it to presto-cli, and make it executable.
    chmod a+x presto-cli
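
    For example, the following sketch downloads the CLI jar from Maven Central and renames it, after which you run the chmod command shown above. The download URL is an assumption, and nnn stands for the Presto version you want to use:
      wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/nnn/presto-cli-nnn-executable.jar
      mv presto-cli-nnn-executable.jar presto-cli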
    

Create a Cloud Dataproc cluster

Optional Components Beta Update: You can now ask Cloud Dataproc to install Presto on your cluster without using an initialization action (see Optional Components→Presto). To do this, change the gcloud dataproc clusters create command, below, to gcloud beta dataproc clusters create, and replace the --initialization-actions flag with --optional-components=PRESTO and --image-version=1.3-deb9 flags. If using the optional component, the Presto server will run on port 8060.
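For example, with the optional component the cluster-creation command used in this tutorial would look like the following sketch; the environment variables are the ones set in the next step, and only the flags named in the note above change:

    gcloud beta dataproc clusters create presto-cluster \
        --project=${PROJECT} \
        --zone=${ZONE} \
        --num-workers=${WORKERS} \
        --scopes=cloud-platform \
        --optional-components=PRESTO \
        --image-version=1.3-deb9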

Create a Cloud Dataproc cluster using an initialization action to install the Presto service software on the cluster.

  1. Set environment variables:

    • PROJECT: your project ID
    • BUCKET_NAME: the name of the Cloud Storage bucket you created in Before you begin
    • REGION: region where the cluster used in this tutorial will be created, for example, "us-west1"
    • ZONE: zone within the cluster's region where the cluster used in this tutorial will be located, for example, "us-west1-a"
    • WORKERS: the number of workers; 3 to 5 workers are recommended for this tutorial
      export PROJECT=project-id
      export WORKERS=number
      export REGION=region
      export ZONE=zone
      export BUCKET_NAME=bucket-name
      
  2. Create a cluster using the default or a custom initialization script.

    Default

    1. Set a local INIT_ACTION variable to point to the location of the initialization script in Cloud Storage that will install the Presto service on a new Cloud Dataproc cluster.
      export INIT_ACTION=gs://dataproc-initialization-actions/presto/presto.sh
      
    2. Run the gcloud command-line tool on your local machine to create the cluster.
      gcloud dataproc clusters create presto-cluster \
          --project=${PROJECT} \
          --zone=${ZONE} \
          --num-workers=${WORKERS} \
          --scopes=cloud-platform \
          --initialization-actions=${INIT_ACTION}
      

      The Presto initialization action, which is located in the GitHub dataproc-initialization-actions repository, installs Presto on your cluster's master node at /presto-server-version.

    Custom

    1. View the GitHub Presto initialization script to determine the initialization settings to modify. The initialization script settings include the following:
      • Presto version (see the GitHub Presto initialization script for the currently set default version)
      • Presto HTTP port (default: 8080)
      • Connector (default: Apache Hive connector)
    2. Copy the Presto initialization action from Cloud Storage to your local machine.
      gsutil cp gs://dataproc-initialization-actions/presto/presto.sh local-path
      
    3. Modify the initialization script to customize the Presto configuration.
    4. Upload the custom Presto initialization action from your local machine to the Cloud Storage bucket you created in Before you begin.
      gsutil cp presto.sh gs://${BUCKET_NAME}/
      
    5. Set a local INIT_ACTION variable to point to the location of your custom script in Cloud Storage.
      export INIT_ACTION=gs://${BUCKET_NAME}/presto.sh
      
    6. Run the gcloud command-line tool on your local machine to create the cluster.
      gcloud dataproc clusters create presto-cluster \
          --project=${PROJECT} \
          --zone=${ZONE} \
          --num-workers=${WORKERS} \
          --scopes=cloud-platform \
          --initialization-actions=${INIT_ACTION}
      

  3. To monitor or troubleshoot the Presto installation on your cluster, SSH into cluster VMs, then view the initialization actions logs.
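    For example, a minimal sketch that prints the initialization action output log from the master node (the log file path is the typical Cloud Dataproc location and may differ on your cluster):
      gcloud compute ssh presto-cluster-m \
          --project=${PROJECT} \
          --zone=${ZONE} \
          --command="sudo cat /var/log/dataproc-initialization-script-0.log"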

Prepare data

Export the bigquery-public-data chicago_taxi_trips dataset to Cloud Storage as CSV files, then create a Hive external table to reference the data.

  1. On your local machine, run the following command to export the taxi data from BigQuery as CSV files without headers into the Cloud Storage bucket you created in Before you begin.
    bq --location=us extract --destination_format=CSV \
         --field_delimiter=',' --print_header=false \
           "bigquery-public-data:chicago_taxi_trips.taxi_trips" \
           gs://${BUCKET_NAME}/chicago_taxi_trips/csv/shard-*.csv
    
  2. Create Hive external tables that are backed by the CSV and Parquet files in your Cloud Storage bucket.
    1. Create the Hive external table chicago_taxi_trips_csv.
      gcloud dataproc jobs submit hive \
          --cluster presto-cluster \
          --execute "
              CREATE EXTERNAL TABLE chicago_taxi_trips_csv(
                unique_key   STRING,
                taxi_id  STRING,
                trip_start_timestamp  TIMESTAMP,
                trip_end_timestamp  TIMESTAMP,
                trip_seconds  INT,
                trip_miles   FLOAT,
                pickup_census_tract  INT,
                dropoff_census_tract  INT,
                pickup_community_area  INT,
                dropoff_community_area  INT,
                fare  FLOAT,
                tips  FLOAT,
                tolls  FLOAT,
                extras  FLOAT,
                trip_total  FLOAT,
                payment_type  STRING,
                company  STRING,
                pickup_latitude  FLOAT,
                pickup_longitude  FLOAT,
                pickup_location  STRING,
                dropoff_latitude  FLOAT,
                dropoff_longitude  FLOAT,
                dropoff_location  STRING)
              ROW FORMAT DELIMITED
              FIELDS TERMINATED BY ','
              STORED AS TEXTFILE
              location 'gs://${BUCKET_NAME}/chicago_taxi_trips/csv/';"
      
    2. Verify the creation of the Hive external table.
      gcloud dataproc jobs submit hive \
          --cluster presto-cluster \
          --execute "SELECT COUNT(*) FROM chicago_taxi_trips_csv;"
      
    3. Create another Hive external table chicago_taxi_trips_parquet with the same columns, but with data stored in Parquet format for better query performance.
      gcloud dataproc jobs submit hive \
          --cluster presto-cluster \
          --execute "
              CREATE EXTERNAL TABLE chicago_taxi_trips_parquet(
                unique_key   STRING,
                taxi_id  STRING,
                trip_start_timestamp  TIMESTAMP,
                trip_end_timestamp  TIMESTAMP,
                trip_seconds  INT,
                trip_miles   FLOAT,
                pickup_census_tract  INT,
                dropoff_census_tract  INT,
                pickup_community_area  INT,
                dropoff_community_area  INT,
                fare  FLOAT,
                tips  FLOAT,
                tolls  FLOAT,
                extras  FLOAT,
                trip_total  FLOAT,
                payment_type  STRING,
                company  STRING,
                pickup_latitude  FLOAT,
                pickup_longitude  FLOAT,
                pickup_location  STRING,
                dropoff_latitude  FLOAT,
                dropoff_longitude  FLOAT,
                dropoff_location  STRING)
              STORED AS PARQUET
              location 'gs://${BUCKET_NAME}/chicago_taxi_trips/parquet/';"
      
    4. Load the data from the Hive CSV table into the Hive Parquet table.
      gcloud dataproc jobs submit hive \
          --cluster presto-cluster \
          --execute "
              INSERT OVERWRITE TABLE chicago_taxi_trips_parquet
              SELECT * FROM chicago_taxi_trips_csv;"
      
    5. Verify that the data loaded correctly.
      gcloud dataproc jobs submit hive \
          --cluster presto-cluster \
          --execute "SELECT COUNT(*) FROM chicago_taxi_trips_parquet;"
      

Run queries

You can run queries locally from the Presto CLI or from an application.

Presto CLI queries

This section demonstrates how to query the Hive Parquet taxi dataset using the Presto CLI.

  1. Create an SSH tunnel. Run the following command to use dynamic port forwarding to create a localhost:1080 SOCKS proxy on your local machine that connects through an SSH tunnel to the master node of the Cloud Dataproc cluster. The terminal will appear to hang while the command runs; leave it open.

    gcloud compute ssh presto-cluster-m \
        --project=${PROJECT} \
        --zone=${ZONE} \
        -- -D 1080 -N

  2. In a new terminal window on your local machine, run the Presto CLI and connect to the Presto server running on port 8080 of the master node of the cluster.

    ./presto-cli \
        --server presto-cluster-m:8080 \
        --socks-proxy localhost:1080 \
        --catalog hive \
        --schema default
    

  3. At the presto:default> prompt, verify that Presto can find the Hive tables.

    show tables;
    
           Table
    ‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
    chicago_taxi_trips_csv
    chicago_taxi_trips_parquet
    (2 rows)
    

  4. Run queries from the presto:default> prompt, and compare the performance of querying Parquet versus CSV data.

    1. Parquet data query
      select count(*) from chicago_taxi_trips_parquet where trip_miles > 50;
      
       _col0
      --------
      117957
      (1 row)
      Query 20180928_171735_00006_2sz8c, FINISHED, 3 nodes
      Splits: 308 total, 308 done (100.00%)
      0:16 [113M rows, 297MB] [6.91M rows/s, 18.2MB/s]

    2. CSV data query
      select count(*) from chicago_taxi_trips_csv where trip_miles > 50;
      
       _col0
      --------
      117957
      (1 row)
      Query 20180928_171936_00009_2sz8c, FINISHED, 3 nodes
      Splits: 881 total, 881 done (100.00%)
      0:47 [113M rows, 41.5GB] [2.42M rows/s, 911MB/s]

Java application queries

To run queries from a Java application through the Presto Java JDBC driver:

  1. Download the Presto Java JDBC driver.

  2. Add a presto-jdbc dependency in Maven pom.xml.

    <dependency>
      <groupId>com.facebook.presto</groupId>
      <artifactId>presto-jdbc</artifactId>
      <version>0.206</version>
    </dependency>
    

Sample Java code

package dataproc.codelab.presto;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

public class PrestoQuery {
  // Connect to the Presto coordinator on the cluster master node through the
  // localhost:1080 SOCKS proxy created by the SSH tunnel (see "Run queries").
  private static final String URL = "jdbc:presto://presto-cluster-m:8080/hive/default";
  private static final String SOCKS_PROXY = "localhost:1080";
  private static final String USER = "user";
  private static final String QUERY =
      "select count(*) as count from chicago_taxi_trips_parquet where trip_miles > 50";

  public static void main(String[] args) {
    Properties properties = new Properties();
    properties.setProperty("user", USER);
    properties.setProperty("socksProxy", SOCKS_PROXY);
    // try-with-resources closes the connection and statement automatically.
    try (Connection connection = DriverManager.getConnection(URL, properties);
        Statement stmt = connection.createStatement()) {
      ResultSet rs = stmt.executeQuery(QUERY);
      while (rs.next()) {
        // The query aliases count(*) as "count", so read the result by that name.
        int count = rs.getInt("count");
        System.out.println("The number of long trips: " + count);
      }
    } catch (SQLException e) {
      e.printStackTrace();
    }
  }
}
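
With the SSH tunnel from the Presto CLI section open, one way to compile and run the sample is shown below. This is an illustrative sketch: it assumes the class above lives in a Maven project that declares the presto-jdbc dependency, and it relies on Maven's default exec-maven-plugin prefix resolution.

    mvn -q compile exec:java -Dexec.mainClass=dataproc.codelab.presto.PrestoQuery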

Logging and monitoring

Logging

By default, the Presto initialization action configures /var/presto/data as the Presto data directory on the cluster's master and worker nodes. The Presto logs are located at /var/presto/data/var/log/.

Web UI

By default, the Presto initialization action configures the Presto service to run on port 8080 of the cluster's master node. To connect to the Presto web interface on the master node:

  1. Create an SSH tunnel from your local machine to the cluster's master node, with a SOCKS proxy listening on local port 1080 (the same gcloud compute ssh command used in Presto CLI queries).

  2. Launch a Chrome browser with the socks5://localhost:1080 proxy.

    Linux

    /usr/bin/google-chrome \
        --proxy-server="socks5://localhost:1080" \
        --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \
        --user-data-dir=/tmp/presto-cluster-m
    

    macOS

    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
        --proxy-server="socks5://localhost:1080" \
        --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \
        --user-data-dir=/tmp/presto-cluster-m
    

  3. In the browser, go to:

    http://presto-cluster-m:8080
    

Monitoring

Presto exposes cluster runtime information through runtime tables. In a Presto session (at the presto:default> prompt), run the following query to view runtime table data:

select * from system.runtime.nodes;

Cleaning up

After you've finished the Use Presto with Cloud Dataproc tutorial, you can clean up the resources that you created on GCP so they won't take up quota and you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.

Deleting the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the GCP Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project you want to delete and click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Deleting the cluster

  • To delete your cluster:
    gcloud dataproc clusters delete --project=${PROJECT} presto-cluster
    

Deleting the bucket

  • To delete the Cloud Storage bucket you created in Before you begin, including the data files stored in the bucket:
    gsutil -m rm -r gs://${BUCKET_NAME}
    