Professional Data Engineer
Current GA certification exam guide
A Professional Data Engineer makes data usable and valuable for others by collecting, transforming, and publishing data. This individual evaluates and selects products and services to meet business and regulatory requirements. A Professional Data Engineer creates and manages robust data processing systems. This includes the ability to design, build, deploy, monitor, maintain, and secure data processing workloads.
Registration for the Professional Data Engineer beta exam opened on September 19. Registration for the current GA exam is closed during the beta period and will reopen in mid-October. If you have already registered to take the current GA version of the exam during the beta period, you will still take that version.
Exam guides
Section 1: Designing data processing systems

1.1 Designing for security and compliance. Considerations include:
● Identity and Access Management (e.g., Cloud IAM and organization policies)
● Data security (encryption and key management)
● Privacy (e.g., personally identifiable information and Cloud Data Loss Prevention API; see the sketch after this list)
● Regional considerations (data sovereignty) for data access and storage
● Legal and regulatory compliance
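For the privacy bullet above, here is a minimal sketch of checking free text for personally identifiable information with the Cloud Data Loss Prevention API. It assumes the google-cloud-dlp Python client and a hypothetical project ID; the info types and likelihood threshold are illustrative only.

from google.cloud import dlp_v2

PROJECT_ID = "my-project"  # hypothetical project ID

dlp = dlp_v2.DlpServiceClient()
response = dlp.inspect_content(
    request={
        "parent": f"projects/{PROJECT_ID}",
        "inspect_config": {
            "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
            "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
        },
        "item": {"value": "Contact Alice at alice@example.com or 555-0100."},
    }
)

# Each finding reports which info type matched and how likely the match is.
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood.name)

In practice the same inspect configuration can also drive de-identification (masking or tokenizing) before data lands in analytics storage.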
1.2 Designing for reliability and fidelity. Considerations include:
● Preparing and cleaning data (e.g., Dataprep, Dataflow, and Cloud Data Fusion)
● Monitoring and orchestration of data pipelines
● Disaster recovery and fault tolerance
● Making decisions related to ACID (atomicity, consistency, isolation, and durability) compliance and availability
● Data validation

1.3 Designing for flexibility and portability. Considerations include:
● Mapping current and future business requirements to the architecture
● Designing for data and application portability (e.g., multi-cloud and data residency requirements)
● Data staging, cataloging, and discovery (data governance)

1.4 Designing data migrations. Considerations include:
● Analyzing current stakeholder needs, users, processes, and technologies and creating a plan to get to desired state
● Planning migration to Google Cloud (e.g., BigQuery Data Transfer Service, Database Migration Service, Transfer Appliance, Google Cloud networking, Datastream)
● Designing the migration validation strategy
● Designing the project, dataset, and table architecture to ensure proper data governance

Section 2: Ingesting and processing the data

2.1 Planning the data pipelines. Considerations include:
● Defining data sources and sinks
● Defining data transformation logic
● Networking fundamentals
● Data encryption

2.2 Building the pipelines. Considerations include:
● Data cleansing
● Identifying the services (e.g., Dataflow, Apache Beam, Dataproc, Cloud Data Fusion, BigQuery, Pub/Sub, Apache Spark, Hadoop ecosystem, and Apache Kafka)
● Transformations
  ○ Batch
  ○ Streaming (e.g., windowing, late arriving data; see the sketch after this list)
  ○ Language
  ○ Ad hoc data ingestion (one-time or automated pipeline)
● Data acquisition and import
● Integrating with new data sources
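The streaming sub-bullet above is the kind of topic a short pipeline makes concrete. The sketch below, assuming the apache-beam[gcp] SDK and a hypothetical Pub/Sub topic, applies one-minute fixed windows with a watermark trigger and five minutes of allowed lateness for late-arriving data; it is a minimal illustration, not a production Dataflow job.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterWatermark

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "KeyByPayload" >> beam.Map(lambda msg: (msg.decode("utf-8"), 1))
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),                  # one-minute windows
            trigger=AfterWatermark(),                 # fire when the watermark passes
            allowed_lateness=300,                     # accept data up to 5 minutes late
            accumulation_mode=AccumulationMode.DISCARDING,
        )
        | "CountPerKey" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )

Running the same pipeline on Dataflow only changes the pipeline options (runner, project, region); the transformation logic stays the same, which is the point of Beam's portability model.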
2.3 Deploying and operationalizing the pipelines. Considerations include:
● Job automation and orchestration (e.g., Cloud Composer and Workflows)
● CI/CD (Continuous Integration and Continuous Deployment)

Section 3: Storing the data

3.1 Selecting storage systems. Considerations include:
● Analyzing data access patterns
● Choosing managed services (e.g., Bigtable, Cloud Spanner, Cloud SQL, Cloud Storage, Firestore, Memorystore)
● Planning for storage costs and performance
● Lifecycle management of data (see the sketch after this list)
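For the lifecycle-management bullet above, here is a minimal sketch using the google-cloud-storage client and a hypothetical bucket name. Object lifecycle rules move aging data to colder storage classes and eventually delete it, which is one lever for balancing storage cost against access patterns.

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-analytics-bucket")  # hypothetical bucket

# Age objects into colder classes, then delete them after roughly five years.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=365)
bucket.add_lifecycle_delete_rule(age=1825)
bucket.patch()  # persists the updated lifecycle configuration

for rule in bucket.lifecycle_rules:
    print(rule)

The thresholds here are placeholders; real values should follow the data's retention and access requirements.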
3.2 Planning for using a data warehouse. Considerations include:
● Designing the data model
● Deciding the degree of data normalization
● Mapping business requirements
● Defining architecture to support data access patterns

3.3 Using a data lake. Considerations include:
● Managing the lake (configuring data discovery, access, and cost controls)
● Processing data
● Monitoring the data lake

3.4 Designing for a data mesh. Considerations include:
● Building a data mesh based on requirements by using Google Cloud tools (e.g., Dataplex, Data Catalog, BigQuery, Cloud Storage)
● Segmenting data for distributed team usage
● Building a federated governance model for distributed data systems

Section 4: Preparing and using data for analysis

4.1 Preparing data for visualization. Considerations include:
● Connecting to tools
● Precalculating fields
● BigQuery materialized views (view logic; see the sketch after this list)
● Determining granularity of time data
● Troubleshooting poorly performing queries
● Identity and Access Management (IAM) and Cloud Data Loss Prevention (Cloud DLP)
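The materialized-view bullet above pairs naturally with a small example. The sketch below, assuming the google-cloud-bigquery client and hypothetical project, dataset, and table names, precalculates a daily revenue aggregate so dashboards read the stored result instead of rescanning the base table.

from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE MATERIALIZED VIEW IF NOT EXISTS `my-project.sales.daily_revenue` AS
SELECT
  DATE(order_ts) AS order_date,   -- time granularity: one row per day
  SUM(amount)    AS revenue
FROM `my-project.sales.orders`
GROUP BY order_date
"""

client.query(ddl).result()  # runs the DDL job and waits for completion

Choosing the coarsest granularity the reports actually need (daily rather than per-second, for example) keeps the view small and its refreshes cheap.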
4.2 Sharing data. Considerations include:
● Defining rules to share data
● Publishing datasets
● Publishing reports and visualizations
● Analytics Hub

4.3 Exploring and analyzing data. Considerations include:
● Preparing data for feature engineering (training and serving machine learning models)
● Conducting data discovery

Section 5: Maintaining and automating data workloads

5.1 Optimizing resources. Considerations include:
● Minimizing costs per required business need for data (see the dry-run sketch after this list)
● Ensuring that enough resources are available for business-critical data processes
● Deciding between persistent or job-based data clusters (e.g., Dataproc)
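One concrete way to keep query spend aligned with the business need in the list above is a dry run. The sketch below, assuming the google-cloud-bigquery client and a hypothetical table, reports how many bytes a query would scan without actually running it or incurring cost.

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

job = client.query(
    "SELECT store_id, SUM(amount) FROM `my-project.sales.orders` GROUP BY store_id",
    job_config=job_config,
)

gib = job.total_bytes_processed / 1024 ** 3
print(f"Query would scan {gib:.2f} GiB")

The same estimate feeds capacity decisions: consistently large scans argue for partitioning, clustering, or reserved slots rather than on-demand pricing.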
5.2 Designing automation and repeatability. Considerations include:
● Creating directed acyclic graphs (DAGs) for Cloud Composer (see the sketch after this list)
● Scheduling jobs in a repeatable way
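For the DAG bullet above, here is a minimal Cloud Composer (Airflow 2) sketch with a hypothetical extract_fn callable and a daily schedule; it shows the DAG, task, and dependency structure plus retries, not a complete production workflow.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_fn(**context):
    # Hypothetical extraction step; a real DAG would call a source system here.
    print("extracting data for", context["ds"])


with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # repeatable daily schedule
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_fn)
    load = BashOperator(task_id="load", bash_command="echo 'load step'")

    extract >> load  # the extract task must succeed before the load task runs

Deployed to a Composer environment's DAGs bucket, the scheduler picks this up and runs it once per day, retrying failed tasks before surfacing an error.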
5.3 Organizing workloads based on business requirements. Considerations include:
● Flex, on-demand, and flat-rate slot pricing (index on flexibility or fixed capacity)
● Interactive or batch query jobs

5.4 Monitoring and troubleshooting processes. Considerations include:
● Observability of data processes (e.g., Cloud Monitoring, Cloud Logging, BigQuery admin panel)
● Monitoring planned usage
● Troubleshooting error messages, billing issues, and quotas
● Managing workloads, such as jobs, queries, and compute capacity (reservations)

5.5 Maintaining awareness of failures and mitigating impact. Considerations include:
● Designing systems for fault tolerance and managing restarts
● Running jobs in multiple regions or zones
● Preparing for data corruption and missing data
● Data replication and failover (e.g., Cloud SQL, Redis clusters)
Section 1: Designing data processing systems

1.1 Selecting the appropriate storage technologies. Considerations include:
● Mapping storage systems to business requirements
● Data modeling
● Trade-offs involving latency, throughput, and transactions
● Distributed systems
● Schema design

1.2 Designing data pipelines. Considerations include:
● Data publishing and visualization (e.g., BigQuery)
● Batch and streaming data (e.g., Dataflow, Dataproc, Apache Beam, Apache Spark and Hadoop ecosystem, Pub/Sub, Apache Kafka)
● Online (interactive) vs. batch predictions
● Job automation and orchestration (e.g., Cloud Composer)

1.3 Designing a data processing solution. Considerations include:
● Choice of infrastructure
● System availability and fault tolerance
● Use of distributed systems
● Capacity planning
● Hybrid cloud and edge computing
● Architecture options (e.g., message brokers, message queues, middleware, service-oriented architecture, serverless functions)
● At-least-once, in-order, and exactly-once event processing (see the sketch after this list)
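A small sketch helps separate these delivery guarantees. The code below, a minimal illustration with a hypothetical handle(event) function and an in-memory set of processed IDs, shows idempotent handling under at-least-once delivery: redeliveries carry the same ID, so the pipeline behaves as if each event were processed exactly once. Real systems would persist the IDs (for example, in a database keyed by message ID).

processed_ids = set()

def handle(event: dict) -> None:
    # Hypothetical business logic for one event.
    print("processing", event["id"])

def on_message(event: dict) -> None:
    # Skip anything already processed; duplicates from at-least-once
    # delivery become harmless no-ops.
    if event["id"] in processed_ids:
        return
    handle(event)
    processed_ids.add(event["id"])

# Simulated delivery with a duplicate of event a1.
for msg in [{"id": "a1"}, {"id": "a1"}, {"id": "b2"}]:
    on_message(msg)

Managed runners such as Dataflow use similar deduplication internally to provide effectively exactly-once processing; in-order processing additionally requires sequencing, typically per key.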
1.4 Migrating data warehousing and data processing. Considerations include:
● Awareness of current state and how to migrate a design to a future state
● Migrating from on-premises to cloud (Data Transfer Service, Transfer Appliance, Cloud Networking)
● Validating a migration

Section 2: Building and operationalizing data processing systems

2.1 Building and operationalizing storage systems. Considerations include:
● Effective use of managed services (Cloud Bigtable, Cloud Spanner, Cloud SQL, BigQuery, Cloud Storage, Datastore, Memorystore)
● Storage costs and performance
● Life cycle management of data

2.2 Building and operationalizing pipelines. Considerations include:
● Data cleansing
● Batch and streaming
● Transformation
● Data acquisition and import
● Integrating with new data sources

2.3 Building and operationalizing processing infrastructure. Considerations include:
● Provisioning resources
● Monitoring pipelines
● Adjusting pipelines
● Testing and quality control

Section 3: Operationalizing machine learning models

3.1 Leveraging pre-built ML models as a service. Considerations include:
● ML APIs (e.g., Vision API, Speech API; see the sketch after this list)
● Customizing ML APIs (e.g., AutoML Vision, AutoML text)
● Conversational experiences (e.g., Dialogflow)
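For the ML APIs bullet above, here is a minimal sketch of calling the pre-built Vision API, assuming the google-cloud-vision client and a hypothetical Cloud Storage image URI; no custom model training is involved.

from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image(source=vision.ImageSource(image_uri="gs://my-bucket/photo.jpg"))

response = client.label_detection(image=image)

# Each annotation is a predicted label with a confidence score.
for label in response.label_annotations:
    print(label.description, round(label.score, 2))

When the pre-built labels are not specific enough for the domain, the customization path in the AutoML bullet of the same list (training on your own labeled images or text) is the usual next step.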
3.2 Deploying an ML pipeline. Considerations include:
● Ingesting appropriate data
● Retraining of machine learning models (AI Platform Prediction and Training, BigQuery ML, Kubeflow, Spark ML)
● Continuous evaluation

3.3 Choosing the appropriate training and serving infrastructure. Considerations include:
● Distributed vs. single machine
● Use of edge compute
● Hardware accelerators (e.g., GPU, TPU)

3.4 Measuring, monitoring, and troubleshooting machine learning models. Considerations include:
● Machine learning terminology (e.g., features, labels, models, regression, classification, recommendation, supervised and unsupervised learning, evaluation metrics)
● Impact of dependencies of machine learning models
● Common sources of error (e.g., assumptions about data)

Section 4: Ensuring solution quality

4.1 Designing for security and compliance. Considerations include:
● Identity and access management (e.g., Cloud IAM)
● Data security (encryption, key management)
● Ensuring privacy (e.g., Data Loss Prevention API)
● Legal compliance (e.g., Health Insurance Portability and Accountability Act (HIPAA), Children's Online Privacy Protection Act (COPPA), FedRAMP, General Data Protection Regulation (GDPR))

4.2 Ensuring scalability and efficiency. Considerations include:
● Building and running test suites
● Pipeline monitoring (e.g., Cloud Monitoring)
● Assessing, troubleshooting, and improving data representations and data processing infrastructure
● Resizing and autoscaling resources

4.3 Ensuring reliability and fidelity. Considerations include:
● Performing data preparation and quality control (e.g., Dataprep)
● Verification and monitoring
● Planning, executing, and stress testing data recovery (fault tolerance, rerunning failed jobs, performing retrospective re-analysis)
● Choosing between ACID, idempotent, and eventually consistent requirements

4.4 Ensuring flexibility and portability. Considerations include:
● Mapping to current and future business requirements
● Designing for data and application portability (e.g., multicloud, data residency requirements)
● Data staging, cataloging, and discovery