Registration for the new Professional Data Engineer beta exam is now open! Beta exam candidates should review the beta exam guide.

Professional Data Engineer

Current GA certification exam guide

A Professional Data Engineer makes data usable and valuable for others by collecting, transforming, and publishing data. This individual evaluates and selects products and services to meet business and regulatory requirements. A Professional Data Engineer creates and manages robust data processing systems. This includes the ability to design, build, deploy, monitor, maintain, and secure data processing workloads.

Registration for the Professional Data Engineer beta exam opened on September 19. Registration for the current GA exam is closed during the beta period and will reopen in mid-October. If you have already registered to take the current GA version of the exam during the beta period, you will still take the current GA version of the exam.

Exam guides

Section 1: Designing data processing systems

1.1 Designing for security and compliance. Considerations include: 

    ●  Identity and Access Management (e.g., Cloud IAM and organization policies)

    ●  Data security (encryption and key management)

    ●  Privacy (e.g., personally identifiable information, and Cloud Data Loss Prevention API)

    ●  Regional considerations (data sovereignty) for data access and storage

    ●  Legal and regulatory compliance

1.2 Designing for reliability and fidelity. Considerations include:

    ●  Preparing and cleaning data (e.g., Dataprep, Dataflow, and Cloud Data Fusion)

    ●  Monitoring and orchestration of data pipelines

    ●  Disaster recovery and fault tolerance

    ●  Making decisions related to ACID (atomicity, consistency, isolation, and durability) compliance and availability

    ●  Data validation

1.3 Designing for flexibility and portability. Considerations include:

    ●  Mapping current and future business requirements to the architecture

    ●  Designing for data and application portability (e.g., multi-cloud and data residency requirements)

    ●  Data staging, cataloging, and discovery (data governance)

1.4 Designing data migrations. Considerations include:

    ●  Analyzing current stakeholder needs, users, processes, and technologies and creating a plan to get to desired state

    ●  Planning migration to Google Cloud (e.g., BigQuery Data Transfer Service, Database Migration Service, Transfer Appliance, Google Cloud networking, Datastream)

    ●  Designing the migration validation strategy

    ●  Designing the project, dataset, and table architecture to ensure proper data governance 

Section 2: Ingesting and processing the data

2.1 Planning the data pipelines. Considerations include:

    ●  Defining data sources and sinks

    ●  Defining data transformation logic

    ●  Networking fundamentals

    ●  Data encryption

2.2 Building the pipelines. Considerations include:

    ●  Data cleansing

    ●  Identifying the services (e.g., Dataflow, Apache Beam, Dataproc, Cloud Data Fusion, BigQuery, Pub/Sub, Apache Spark, Hadoop ecosystem, and Apache Kafka)

    ●  Transformations

        ○  Batch

        ○  Streaming (e.g., windowing, late arriving data)

        ○  Language

        ○  Ad hoc data ingestion (one-time or automated pipeline)

    ●  Data acquisition and import

    ●  Integrating with new data sources 
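
As an illustration of the windowing and late-arriving data item above, the following is a minimal sketch in plain Python, not the Apache Beam or Dataflow API. The window size, the allowed lateness, the function names, and the choice of "largest event time seen so far" as the watermark are all assumptions made for this example.

```python
from collections import defaultdict

WINDOW_SECONDS = 60    # hypothetical fixed window size
ALLOWED_LATENESS = 30  # hypothetical grace period after a window ends


def window_start(event_time):
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def assign_to_windows(events):
    """Sum values per fixed window, dropping events that arrive too late.

    events is an iterable of (event_time, value) pairs in arrival order.
    The watermark is assumed to be the largest event time seen so far.
    """
    sums = defaultdict(int)
    dropped = []
    watermark = 0
    for event_time, value in events:
        watermark = max(watermark, event_time)
        start = window_start(event_time)
        window_end = start + WINDOW_SECONDS
        if watermark >= window_end + ALLOWED_LATENESS:
            # The window closed and its grace period has passed.
            dropped.append((event_time, value))
        else:
            sums[start] += value
    return dict(sums), dropped


if __name__ == "__main__":
    arrivals = [(5, 1), (65, 2), (50, 3), (130, 4), (10, 5)]
    print(assign_to_windows(arrivals))
```

In this sketch the event at time 50 arrives out of order but still counts toward its window, because the watermark has not yet passed the window end plus the allowed lateness. The event at time 10 arrives after that point and is dropped.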

2.3 Deploying and operationalizing the pipelines. Considerations include:

    ●  Job automation and orchestration (e.g., Cloud Composer and Workflows)

    ●  CI/CD (Continuous Integration and Continuous Deployment)

Section 3: Storing the data

3.1 Selecting storage systems. Considerations include:

    ●  Analyzing data access patterns

    ●  Choosing managed services (e.g., Bigtable, Cloud Spanner, Cloud SQL, Cloud Storage, Firestore, Memorystore)

    ●  Planning for storage costs and performance

    ●  Lifecycle management of data

3.2 Planning for using a data warehouse. Considerations include:

    ●  Designing the data model

    ●  Deciding the degree of data normalization

    ●  Mapping business requirements

    ●  Defining architecture to support data access patterns

3.3 Using a data lake. Considerations include:

    ●  Managing the lake (configuring data discovery, access, and cost controls)

    ●  Processing data

    ●  Monitoring the data lake

3.4 Designing for a data mesh. Considerations include:

    ●  Building a data mesh based on requirements by using Google Cloud tools (e.g., Dataplex, Data Catalog, BigQuery, Cloud Storage)

    ●  Segmenting data for distributed team usage

    ●  Building a federated governance model for distributed data systems

Section 4: Preparing and using data for analysis

4.1 Preparing data for visualization. Considerations include:

    ●  Connecting to tools

    ●  Precalculating fields

    ●  BigQuery materialized views (view logic)

    ●  Determining granularity of time data

    ●  Troubleshooting poorly performing queries

    ●  Identity and Access Management (IAM) and Cloud Data Loss Prevention (Cloud DLP)

4.2 Sharing data. Considerations include:

    ●  Defining rules to share data

    ●  Publishing datasets

    ●  Publishing reports and visualizations

    ●  Analytics Hub

4.3 Exploring and analyzing data. Considerations include:

    ●  Preparing data for feature engineering (training and serving machine learning models)

    ●  Conducting data discovery

Section 5: Maintaining and automating data workloads

5.1 Optimizing resources. Considerations include:

    ●  Minimizing costs per required business need for data

    ●  Ensuring that enough resources are available for business-critical data processes

    ●  Deciding between persistent or job-based data clusters (e.g., Dataproc)

5.2 Designing automation and repeatability. Considerations include:

    ●  Creating directed acyclic graphs (DAGs) for Cloud Composer

    ●  Scheduling jobs in a repeatable way 
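
To make the idea of a directed acyclic graph concrete, here is a minimal sketch that uses only the Python standard library (`graphlib`, Python 3.9 or later). It is not the Cloud Composer or Apache Airflow API; the task names and the `execution_order` and `is_dag` helpers are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract": set(),
    "clean": {"extract"},
    "load": {"clean"},
    "report": {"load"},
}


def execution_order(dependencies):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def is_dag(dependencies):
    """Return True if the dependency graph has no cycles."""
    try:
        execution_order(dependencies)
        return True
    except CycleError:
        return False


if __name__ == "__main__":
    print(execution_order(PIPELINE))
    print(is_dag({"a": {"b"}, "b": {"a"}}))
```

A scheduler can run tasks only when such an ordering exists, which is why a dependency cycle (task a waits for b while b waits for a) is rejected.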

5.3 Organizing workloads based on business requirements. Considerations include:

    ●  Flex, on-demand, and flat-rate slot pricing (prioritizing flexibility or fixed capacity)

    ●  Interactive or batch query jobs

5.4 Monitoring and troubleshooting processes. Considerations include:

    ●  Observability of data processes (e.g., Cloud Monitoring, Cloud Logging, BigQuery admin panel)

    ●  Monitoring planned usage

    ●  Troubleshooting error messages, billing issues, and quotas

    ●  Managing workloads, such as jobs, queries, and compute capacity (reservations)

5.5 Maintaining awareness of failures and mitigating impact. Considerations include:

    ●  Designing systems for fault tolerance and managing restarts

    ●  Running jobs in multiple regions or zones

    ●  Preparing for data corruption and missing data

    ●  Data replication and failover (e.g., Cloud SQL, Redis clusters)

Section 1: Designing data processing systems

1.1 Selecting the appropriate storage technologies. Considerations include:

    ●  Mapping storage systems to business requirements

    ●  Data modeling

    ●  Trade-offs involving latency, throughput, and transactions

    ●  Distributed systems

    ●  Schema design

1.2 Designing data pipelines. Considerations include:

    ●  Data publishing and visualization (e.g., BigQuery)

    ●  Batch and streaming data (e.g., Dataflow, Dataproc, Apache Beam, Apache Spark and Hadoop ecosystem, Pub/Sub, Apache Kafka)

    ●  Online (interactive) vs. batch predictions

    ●  Job automation and orchestration (e.g., Cloud Composer)

1.3 Designing a data processing solution. Considerations include:

    ●  Choice of infrastructure

    ●  System availability and fault tolerance

    ●  Use of distributed systems

    ●  Capacity planning

    ●  Hybrid cloud and edge computing

    ●  Architecture options (e.g., message brokers, message queues, middleware, service-oriented architecture, serverless functions)

    ●  Event processing guarantees (e.g., at-least-once, in-order, and exactly-once)
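
One common way to relate at-least-once delivery to exactly-once results is to make the consumer idempotent. The sketch below is a plain-Python illustration under that assumption, not the behavior of Pub/Sub, Dataflow, or any specific service; the `IdempotentCounter` class and the message IDs are hypothetical.

```python
class IdempotentCounter:
    """Accumulate amounts while ignoring redelivered messages."""

    def __init__(self):
        self.total = 0
        self._seen_ids = set()  # IDs of messages already applied

    def handle(self, message_id, amount):
        """Apply a message once; return False for a duplicate delivery."""
        if message_id in self._seen_ids:
            return False
        self._seen_ids.add(message_id)
        self.total += amount
        return True


if __name__ == "__main__":
    counter = IdempotentCounter()
    # "m1" is delivered twice, as at-least-once delivery permits.
    for message_id, amount in [("m1", 10), ("m2", 5), ("m1", 10)]:
        counter.handle(message_id, amount)
    print(counter.total)
```

Because the second delivery of "m1" is skipped, the total reflects each message exactly once even though delivery was only at-least-once. A production system would also need durable storage for the seen IDs, which this sketch omits.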

1.4 Migrating data warehousing and data processing. Considerations include:

    ●  Awareness of current state and how to migrate a design to a future state

    ●  Migrating from on-premises to cloud (Data Transfer Service, Transfer Appliance, Cloud Networking)

    ●  Validating a migration

Section 2: Building and operationalizing data processing systems

2.1 Building and operationalizing storage systems. Considerations include:

    ●  Effective use of managed services (Cloud Bigtable, Cloud Spanner, Cloud SQL, BigQuery, Cloud Storage, Datastore, Memorystore)

    ●  Storage costs and performance

    ●  Life cycle management of data

2.2 Building and operationalizing pipelines. Considerations include:

    ●  Data cleansing

    ●  Batch and streaming

    ●  Transformation

    ●  Data acquisition and import

    ●  Integrating with new data sources

2.3 Building and operationalizing processing infrastructure. Considerations include:

    ●  Provisioning resources

    ●  Monitoring pipelines

    ●  Adjusting pipelines

    ●  Testing and quality control

Section 3: Operationalizing machine learning models

3.1 Leveraging pre-built ML models as a service. Considerations include:

    ●  ML APIs (e.g., Vision API, Speech API)

    ●  Customizing ML APIs (e.g., AutoML Vision, AutoML Text)

    ●  Conversational experiences (e.g., Dialogflow)

3.2 Deploying an ML pipeline. Considerations include:

    ●  Ingesting appropriate data

    ●  Retraining of machine learning models (AI Platform Prediction and Training, BigQuery ML, Kubeflow, Spark ML)

    ●  Continuous evaluation

3.3 Choosing the appropriate training and serving infrastructure. Considerations include:

    ●  Distributed vs. single machine

    ●  Use of edge compute

    ●  Hardware accelerators (e.g., GPU, TPU)

3.4 Measuring, monitoring, and troubleshooting machine learning models. Considerations include:

    ●  Machine learning terminology (e.g., features, labels, models, regression, classification, recommendation, supervised and unsupervised learning, evaluation metrics)

    ●  Impact of dependencies of machine learning models

    ●  Common sources of error (e.g., assumptions about data)

Section 4: Ensuring solution quality

4.1 Designing for security and compliance. Considerations include:

    ●  Identity and access management (e.g., Cloud IAM)

    ●  Data security (encryption, key management)

    ●  Ensuring privacy (e.g., Data Loss Prevention API)

    ●  Legal compliance (e.g., Health Insurance Portability and Accountability Act (HIPAA), Children's Online Privacy Protection Act (COPPA), FedRAMP, General Data Protection Regulation (GDPR))

4.2 Ensuring scalability and efficiency. Considerations include:

    ●  Building and running test suites

    ●  Pipeline monitoring (e.g., Cloud Monitoring)

    ●  Assessing, troubleshooting, and improving data representations and data processing infrastructure

    ●  Resizing and autoscaling resources

4.3 Ensuring reliability and fidelity. Considerations include:

    ●  Performing data preparation and quality control (e.g., Dataprep)

    ●  Verification and monitoring

    ●  Planning, executing, and stress testing data recovery (fault tolerance, rerunning failed jobs, performing retrospective re-analysis)

    ●  Choosing between ACID, idempotent, eventually consistent requirements

4.4 Ensuring flexibility and portability. Considerations include:

    ●  Mapping to current and future business requirements

    ●  Designing for data and application portability (e.g., multi-cloud and data residency requirements)

    ●  Data staging, cataloging, and discovery