Data Engineering on Google Cloud Platform

Course Description

This four-day instructor-led class provides participants a hands-on introduction to designing and building data processing systems on Google Cloud Platform.

Duration

4 days

Objectives

This course teaches participants the following skills:

  • Design and build data processing systems on Google Cloud Platform
  • Process batch and streaming data by implementing autoscaling data pipelines on Cloud Dataflow
  • Derive business insights from extremely large datasets using Google BigQuery
  • Train, evaluate, and predict with machine learning models using TensorFlow and Cloud ML
  • Leverage unstructured data using Spark and ML APIs on Cloud Dataproc
  • Enable instant insights from streaming data

Delivery Method

Instructor-led (classroom or online)

Audience

This class is intended for experienced developers who are responsible for managing big data transformations, including:

  • Extracting, loading, transforming, cleaning, and validating data
  • Designing pipelines and architectures for data processing
  • Creating and maintaining machine learning and statistical models
  • Querying datasets, visualizing query results, and creating reports

Prerequisites

To get the most out of this course, participants should have:

  • Completed Google Cloud Fundamentals: Big Data & Machine Learning OR have equivalent experience
  • Basic proficiency with a common query language such as SQL
  • Experience with data modeling and extract, transform, load (ETL) activities
  • Experience developing applications using a common programming language such as Python
  • Familiarity with Machine Learning and/or statistics

Course Outline

The course includes presentations, demonstrations, and hands-on labs.

Day 1: Serverless Data Analysis

  • What is BigQuery?
  • Advanced Capabilities.
  • Performance and pricing.
  • Lab: Queries and Functions (see the BigQuery sketch after this outline).
  • Lab: Load and Export data.
  • Introduction to Dataflow and capabilities.
  • Lab: Data pipeline.
  • Lab: MapReduce in Dataflow (see the pipeline sketch after this outline).
  • Lab: Side inputs.
  • Lab: Streaming.
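
A minimal sketch of the kind of query covered in the BigQuery module above, using the google-cloud-bigquery Python client. The public dataset is real; the project ID is a placeholder, and this is an illustration rather than the lab's actual code.

```python
# Hedged sketch: run an aggregate query against a BigQuery public dataset.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project
sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(sql).result():
    print(row.name, row.total)
```

And a minimal Apache Beam pipeline illustrating the MapReduce-in-Dataflow idea from the labs above. It runs locally on the DirectRunner; pointing it at the DataflowRunner (with project and staging options) would execute it on Cloud Dataflow.

```python
# Hedged sketch: word count as a map (split, pair) and reduce (combine) job.
import apache_beam as beam

with beam.Pipeline() as pipeline:  # DirectRunner by default
    (
        pipeline
        | "Read" >> beam.Create(["to be or not to be", "that is the question"])
        | "Split" >> beam.FlatMap(lambda line: line.split())  # map phase
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "GroupAndSum" >> beam.CombinePerKey(sum)            # reduce phase
        | "Print" >> beam.Map(print)
    )
```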

Day 2: Leveraging unstructured data

  • Introducing Google Cloud Dataproc.
  • Creating and managing clusters.
  • Defining master and worker nodes.
  • Leveraging custom machine types and preemptible worker nodes.
  • Creating clusters with the Web Console.
  • Scripting clusters with the CLI.
  • Using the Dataproc REST API.
  • Dataproc pricing.
  • Scaling and deleting clusters.
  • Lab: Creating Hadoop Clusters with Google Cloud Dataproc (see the cluster-creation sketch after this outline).
  • Controlling application versions.
  • Submitting jobs.
  • Accessing HDFS and Google Cloud Storage.
  • Hadoop.
  • Spark and PySpark.
  • Pig and Hive.
  • Logging and monitoring jobs.
  • Accessing master and worker nodes with SSH.
  • Working with PySpark REPL (command-line interpreter).
  • Lab: Running Hadoop and Spark Jobs with Dataproc.
  • Initialization actions.
  • Programming Jupyter/Datalab notebooks.
  • Accessing Google Cloud Storage.
  • Leveraging relational data with Google Cloud SQL.
  • Reading and writing streaming data with Google Cloud Bigtable.
  • Querying Data from Google BigQuery.
  • Making Google API Calls from notebooks.
  • Lab: Big Data Analysis with Dataproc.
  • Google’s Machine Learning APIs.
  • Common ML Use Cases.
  • Vision API (see the label-detection sketch after this outline).
  • Natural Language API.
  • Translate API.
  • Speech API.
  • Lab: Adding Machine Learning Capabilities to Big Data Analysis.
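
A minimal sketch of creating a Dataproc cluster programmatically, assuming the google-cloud-dataproc Python client; the project, region, and cluster names are placeholders. The same cluster could equally be created through the web console, the gcloud CLI, or the REST API, as covered above.

```python
# Hedged sketch: create a small Dataproc cluster with secondary workers
# (secondary workers are preemptible by default in Dataproc).
from google.cloud import dataproc_v1

project_id, region = "your-project-id", "us-central1"  # placeholders
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)
cluster = {
    "project_id": project_id,
    "cluster_name": "example-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        "secondary_worker_config": {"num_instances": 2},
    },
}
operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print("Cluster created:", operation.result().cluster_name)
```

And a similarly minimal call to one of the ML APIs listed above, the Vision API's label detection, assuming the google-cloud-vision client library; the Cloud Storage image path is a placeholder.

```python
# Hedged sketch: label detection with the Cloud Vision API.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image(
    source=vision.ImageSource(image_uri="gs://your-bucket/photo.jpg")
)
response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, round(label.score, 2))
```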

Day 3: Serverless Machine Learning

  • What is machine learning (ML)?
  • Effective ML: concepts, types.
  • Evaluating ML.
  • ML datasets: generalization.
  • Lab: Explore and create ML datasets.
  • Getting started with TensorFlow (see the training sketch after this outline).
  • Lab: Using tf.learn.
  • TensorFlow graphs and loops (with lab).
  • Lab: Using low-level TensorFlow with early stopping.
  • Monitoring ML training.
  • Lab: Charts and graphs of TensorFlow training.
  • Why Cloud ML?
  • Packaging up a TensorFlow model.
  • End-to-end training.
  • Lab: Run an ML model locally and in the cloud.
  • Creating good features.
  • Transforming inputs.
  • Synthetic features.
  • Preprocessing with Cloud ML.
  • Lab: Feature engineering.
  • Wide-and-deep models.
  • Image analysis.
  • Lab: Custom image classification with transfer learning.
  • Embeddings and sequences.
  • Recommendation systems.
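
The Day 3 labs use tf.learn, the Estimator-style API of TensorFlow 1.x. As a rough present-day analogue, the sketch below trains a one-weight regression model with a held-out validation split and early stopping, on synthetic data; it illustrates the ideas, not the lab code itself.

```python
# Hedged sketch: Keras training with a validation split and early stopping.
import numpy as np
import tensorflow as tf

# Synthetic data: y = 3x + noise (placeholder for a real dataset).
x = np.random.rand(1000, 1).astype("float32")
y = 3.0 * x + np.random.normal(scale=0.1, size=(1000, 1)).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer="sgd", loss="mse")
model.fit(
    x, y,
    validation_split=0.2,  # held-out split, echoing the generalization module
    epochs=50,
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)],
    verbose=0,
)
print("learned weight:", model.layers[0].kernel.numpy()[0, 0])  # ~3.0
```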

Day 4: Resilient streaming systems

  • What is Streaming Analytics?
  • Use-cases.
  • Batch vs. streaming (real-time).
  • Related terminology.
  • GCP products for highly available, resilient, high-throughput, real-time streaming analytics (review of Pub/Sub and Dataflow).
  • Lab: Set up the project, enable APIs, and set up storage.
  • Streaming architectures and considerations.
  • Choosing the right components.
  • Lab: Explore the dataset.
  • Windowing.
  • Streaming aggregation.
  • Events, triggers.
  • Lab: Create a reference architecture.
  • Topics and Subscriptions.
  • Publishing events into Pub/Sub.
  • Lab: Streaming data ingest into Pub/Sub (see the publish sketch after this outline).
  • Subscribing options: Push vs Pull.
  • Alerts.
  • Pipelines, PCollections and Transforms.
  • Windows, Events, and Triggers.
  • Aggregation statistics.
  • Streaming analytics with BigQuery.
  • Low-volume alerts.
  • Lab: Alerting scenario for anomalies.
  • Latency considerations.
  • Lab: Create streaming data processing pipelines with Dataflow (see the windowing sketch after this outline).
  • What is Bigtable?
  • Designing row keys.
  • Performance considerations.
  • Lab: High-volume event processing.
  • What is Google Data Studio?
  • From data to decisions.
  • Lab: Build a real-time dashboard to visualize processed data.
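
A minimal sketch of publishing an event into Pub/Sub, as in the streaming-ingest lab above, assuming the google-cloud-pubsub client library; project and topic names are placeholders.

```python
# Hedged sketch: publish one JSON event to a Pub/Sub topic.
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("your-project-id", "sensor-events")  # placeholders

event = {"sensor_id": "s-42", "temperature_c": 21.7}
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print("published message id:", future.result())
```

And a minimal streaming Apache Beam pipeline that reads those events, applies 60-second fixed windows, and computes a mean temperature per sensor, tying together the windowing and aggregation topics above. The topic name is a placeholder, and running it for real would require the DataflowRunner with project and staging options.

```python
# Hedged sketch: windowed streaming aggregation over Pub/Sub events.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(
            topic="projects/your-project-id/topics/sensor-events")
        | "Parse" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
        | "KeyBySensor" >> beam.Map(lambda e: (e["sensor_id"], e["temperature_c"]))
        | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 60-second windows
        | "MeanPerSensor" >> beam.combiners.Mean.PerKey()
        | "Print" >> beam.Map(print)
    )
```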