Professional Data Engineer

Certification Exam Guide

A Professional Data Engineer enables data-driven decision making by collecting, transforming, and publishing data. A data engineer should be able to design, build, operationalize, secure, and monitor data processing systems with a particular emphasis on security and compliance; scalability and efficiency; reliability and fidelity; and flexibility and portability. A data engineer should also be able to leverage, deploy, and continuously train pre-existing machine learning models.

1. Designing data processing systems

    1.1 Selecting the appropriate storage technologies. Considerations include:

    • Mapping storage systems to business requirements
    • Data modeling
    • Tradeoffs involving latency, throughput, transactions
    • Distributed systems
    • Schema design (see the partitioning sketch after this list)
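
    Schema and partitioning choices drive the latency, throughput, and cost tradeoffs listed above. As a minimal, non-authoritative sketch, the following creates a day-partitioned BigQuery table with the google-cloud-bigquery Python client; the project, dataset, and field names are hypothetical:

        from google.cloud import bigquery

        client = bigquery.Client()  # uses application-default credentials

        # Hypothetical events table; partitioning on event_time limits the
        # data scanned per query, trading layout rigidity for lower query
        # latency and cost.
        schema = [
            bigquery.SchemaField("event_id", "STRING", mode="REQUIRED"),
            bigquery.SchemaField("event_time", "TIMESTAMP", mode="REQUIRED"),
            bigquery.SchemaField("payload", "STRING", mode="NULLABLE"),
        ]
        table = bigquery.Table("my-project.analytics.events", schema=schema)
        table.time_partitioning = bigquery.TimePartitioning(
            type_=bigquery.TimePartitioningType.DAY, field="event_time"
        )
        client.create_table(table)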

    1.2 Designing data pipelines. Considerations include:

    • Data publishing and visualization (e.g., BigQuery)
    • Batch and streaming data (e.g., Cloud Dataflow, Cloud Dataproc, Apache Beam, Apache Spark and Hadoop ecosystem, Cloud Pub/Sub, Apache Kafka; see the Beam sketch after this list)
    • Online (interactive) vs. batch predictions
    • Job automation and orchestration (e.g., Cloud Composer)
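
    A key Apache Beam property is that the same pipeline graph can run in batch on the local direct runner or on Cloud Dataflow. A minimal word-count-style sketch (bucket paths are hypothetical):

        import apache_beam as beam
        from apache_beam.options.pipeline_options import PipelineOptions

        # Runs in batch on the local DirectRunner by default; passing
        # runner="DataflowRunner" (plus project/region/temp_location)
        # executes the same graph on Cloud Dataflow.
        with beam.Pipeline(options=PipelineOptions()) as p:
            (
                p
                | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
                | "Split" >> beam.FlatMap(lambda line: line.split())
                | "Pair" >> beam.Map(lambda word: (word, 1))
                | "Count" >> beam.CombinePerKey(sum)
                | "Format" >> beam.MapTuple(lambda word, n: f"{word},{n}")
                | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts")
            )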

    1.3 Designing a data processing solution. Considerations include:

    • Choice of infrastructure
    • System availability and fault tolerance
    • Use of distributed systems
    • Capacity planning
    • Hybrid cloud and edge computing
    • Architecture options (e.g., message brokers, message queues, middleware, service-oriented architecture, serverless functions)
    • Event processing guarantees: at-least-once, in-order, exactly-once, etc. (see the idempotency sketch after this list)
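
    These guarantees matter in practice because systems such as Cloud Pub/Sub deliver at least once: consumers see duplicates and must deduplicate or be idempotent. A minimal sketch; the handler and the in-memory ID set are hypothetical stand-ins for real business logic and durable state:

        # At-least-once delivery means the same message can arrive twice;
        # tracking processed message IDs makes the consumer effectively
        # exactly-once from the application's point of view.
        processed_ids = set()  # in production, durable storage

        def apply_side_effects(payload: dict) -> None:
            print("processing", payload)  # hypothetical business logic

        def handle_message(message_id: str, payload: dict) -> None:
            if message_id in processed_ids:
                return  # duplicate delivery; safe to drop
            apply_side_effects(payload)
            processed_ids.add(message_id)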

    1.4 Migrating data warehousing and data processing. Considerations include:

    • Awareness of current state and how to migrate a design to a future state
    • Migrating from on-premises to cloud (Data Transfer Service, Transfer Appliance, Cloud Networking)
    • Validating a migration (see the row-count sketch after this list)
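
    Validation typically compares invariants such as row counts or checksums between source and destination. A minimal sketch against BigQuery (project, dataset, and table names hypothetical):

        from google.cloud import bigquery

        client = bigquery.Client()

        def row_count(table: str) -> int:
            # Row counts are a cheap first-pass check; per-column checksums
            # or sampled row comparisons give stronger guarantees.
            rows = client.query(f"SELECT COUNT(*) AS n FROM `{table}`").result()
            return next(iter(rows)).n

        source = row_count("my-project.legacy_dw.orders")
        target = row_count("my-project.warehouse.orders")
        assert source == target, f"row count mismatch: {source} != {target}"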

2. Building and operationalizing data processing systems

    2.1 Building and operationalizing storage systems. Considerations include:

    • Effective use of managed services (Cloud Bigtable, Cloud Spanner, Cloud SQL, BigQuery, Cloud Storage, Cloud Datastore, Cloud Memorystore)
    • Storage costs and performance
    • Lifecycle management of data (see the sketch after this list)
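
    Lifecycle rules are the main cost lever for Cloud Storage. A minimal sketch with the google-cloud-storage client (the bucket name is hypothetical):

        from google.cloud import storage

        client = storage.Client()
        bucket = client.get_bucket("my-archive-bucket")

        # Demote objects to colder, cheaper storage after 30 days and
        # delete them after a year.
        bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
        bucket.add_lifecycle_delete_rule(age=365)
        bucket.patch()  # persists the updated lifecycle configuration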

    2.2 Building and operationalizing pipelines. Considerations include:

    • Data cleansing (see the sketch after this list)
    • Batch and streaming
    • Transformation
    • Data acquisition and import
    • Integrating with new data sources
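
    Cleansing usually means normalizing or dropping records that violate expectations before load. A minimal sketch of a validation filter (the field names are hypothetical):

        from datetime import datetime
        from typing import Optional

        def clean_record(raw: dict) -> Optional[dict]:
            """Return a normalized record, or None if it should be dropped."""
            try:
                return {
                    "user_id": raw["user_id"].strip().lower(),
                    "amount": round(float(raw["amount"]), 2),
                    "ts": datetime.fromisoformat(raw["ts"]),
                }
            except (KeyError, ValueError):
                return None  # malformed; route to a dead-letter sink instead

        rows = [
            {"user_id": " A42 ", "amount": "19.999", "ts": "2024-01-01T00:00:00"},
            {"user_id": "B7"},  # missing fields, dropped by the filter
        ]
        cleaned = [r for r in map(clean_record, rows) if r is not None]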

    2.3 Building and operationalizing processing infrastructure. Considerations include:

    • Provisioning resources
    • Monitoring pipelines
    • Adjusting pipelines
    • Testing and quality control (see the sketch after this list)
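
    Transform logic can be unit tested in-process, without a cluster. A minimal sketch using Apache Beam's testing utilities; the transform under test is a toy:

        import apache_beam as beam
        from apache_beam.testing.test_pipeline import TestPipeline
        from apache_beam.testing.util import assert_that, equal_to

        def test_pairing():
            # Exercises the transform in-process with the direct runner.
            with TestPipeline() as p:
                output = (
                    p
                    | beam.Create(["a b a"])
                    | beam.FlatMap(lambda line: line.split())
                    | beam.Map(lambda w: (w, 1))
                )
                assert_that(output, equal_to([("a", 1), ("b", 1), ("a", 1)]))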

3. Operationalizing machine learning models

    3.1 Leveraging pre-built ML models as a service. Considerations include:

    • ML APIs (e.g., Vision API, Speech API; see the sketch after this list)
    • Customizing ML APIs (e.g., AutoML Vision, AutoML Natural Language)
    • Conversational experiences (e.g., Dialogflow)
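
    The pre-built APIs require no training data of your own. A minimal label-detection sketch with the google-cloud-vision client (the image URI is hypothetical):

        from google.cloud import vision

        client = vision.ImageAnnotatorClient()

        # Detect labels for an image already stored in Cloud Storage.
        image = vision.Image(
            source=vision.ImageSource(image_uri="gs://my-bucket/photo.jpg")
        )
        response = client.label_detection(image=image)
        for label in response.label_annotations:
            print(label.description, round(label.score, 2))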

    3.2 Deploying an ML pipeline. Considerations include:

    • Ingesting appropriate data
    • Retraining of machine learning models (Cloud Machine Learning Engine, BigQuery ML, Kubeflow, Spark ML; see the BigQuery ML sketch after this list)
    • Continuous evaluation
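
    BigQuery ML folds training into SQL, so scheduled retraining is just re-running a query (e.g., on a Cloud Composer cadence). A minimal sketch; the dataset, model, and column names are hypothetical:

        from google.cloud import bigquery

        client = bigquery.Client()

        # CREATE OR REPLACE retrains the model in place on each run.
        client.query("""
            CREATE OR REPLACE MODEL `my-project.analytics.churn_model`
            OPTIONS (model_type = 'logistic_reg',
                     input_label_cols = ['churned']) AS
            SELECT churned, tenure_months, monthly_spend
            FROM `my-project.analytics.customers`
        """).result()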

    3.3 Choosing the appropriate training and serving infrastructure. Considerations include:

    • Distributed vs. single machine (see the sketch after this list)
    • Use of edge compute
    • Hardware accelerators (e.g., GPU, TPU)
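
    As one concrete example of the distributed-vs.-single-machine choice, TensorFlow's MirroredStrategy replicates a model across local GPUs and falls back to CPU when none are present. A minimal sketch (the model is a toy):

        import tensorflow as tf

        # One strategy object decides whether training is replicated
        # across all local GPUs or runs on a single device.
        strategy = tf.distribute.MirroredStrategy()
        print("replicas in sync:", strategy.num_replicas_in_sync)

        with strategy.scope():
            model = tf.keras.Sequential(
                [tf.keras.layers.Dense(1, input_shape=(4,))]
            )
            model.compile(optimizer="adam", loss="mse")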

    3.4 Measuring, monitoring, and troubleshooting machine learning models. Considerations include:

    • Machine learning terminology (e.g., features, labels, models, regression, classification, recommendation, supervised and unsupervised learning, evaluation metrics; see the metrics sketch after this list)
    • Impact of dependencies of machine learning models
    • Common sources of error (e.g., assumptions about data)
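
    Evaluation metrics recur throughout this section. A minimal classification-metrics sketch with scikit-learn (the labels are toy data):

        from sklearn.metrics import accuracy_score, precision_score, recall_score

        # Toy labels: precision and recall diverge exactly when false
        # positives and false negatives are imbalanced.
        y_true = [1, 0, 1, 1, 0, 0, 1, 0]
        y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

        print("accuracy :", accuracy_score(y_true, y_pred))
        print("precision:", precision_score(y_true, y_pred))
        print("recall   :", recall_score(y_true, y_pred))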

4. Ensuring solution quality

    4.1 Designing for security and compliance. Considerations include:

    • Identity and access management (e.g., Cloud IAM; see the sketch after this list)
    • Data security (encryption, key management)
    • Ensuring privacy (e.g., Data Loss Prevention API)
    • Legal compliance (e.g., Health Insurance Portability and Accountability Act (HIPAA), Children's Online Privacy Protection Act (COPPA), FedRAMP, General Data Protection Regulation (GDPR))
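
    Cloud IAM expresses access as role bindings on resources. A minimal sketch granting a group read-only access to a bucket (the bucket and group names are hypothetical):

        from google.cloud import storage

        client = storage.Client()
        bucket = client.bucket("my-sensitive-bucket")

        # Append a binding granting a group read-only object access.
        policy = bucket.get_iam_policy(requested_policy_version=3)
        policy.bindings.append({
            "role": "roles/storage.objectViewer",
            "members": {"group:analysts@example.com"},
        })
        bucket.set_iam_policy(policy)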

    4.2 Ensuring scalability and efficiency. Considerations include:

    • Building and running test suites
    • Pipeline monitoring (e.g., Stackdriver)
    • Assessing, troubleshooting, and improving data representations and data processing infrastructure
    • Resizing and autoscaling resources (see the Dataflow options sketch after this list)
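
    On Cloud Dataflow, much of scaling is configuration rather than code. A minimal sketch of autoscaling pipeline options (the project, region, and bucket are hypothetical):

        from apache_beam.options.pipeline_options import PipelineOptions

        # THROUGHPUT_BASED autoscaling lets Dataflow add and remove
        # workers with the backlog, capped at max_num_workers.
        options = PipelineOptions(
            runner="DataflowRunner",
            project="my-project",
            region="us-central1",
            temp_location="gs://my-bucket/tmp",
            autoscaling_algorithm="THROUGHPUT_BASED",
            max_num_workers=20,
        )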

    4.3 Ensuring reliability and fidelity. Considerations include:

    • Performing data preparation and quality control (e.g., Cloud Dataprep)
    • Verification and monitoring
    • Planning, executing, and stress testing data recovery (fault tolerance, rerunning failed jobs, performing retrospective re-analysis; see the retry sketch after this list)
    • Choosing among ACID, idempotent, and eventually consistent requirements
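
    Rerunning failed jobs is only safe when the job is idempotent. A minimal retry-with-backoff sketch (the job callable is hypothetical):

        import time

        def run_with_retries(job, max_attempts: int = 3) -> None:
            # Safe only if `job` is idempotent: a rerun after a partial
            # failure must not double-apply side effects.
            for attempt in range(1, max_attempts + 1):
                try:
                    job()
                    return
                except Exception:
                    if attempt == max_attempts:
                        raise
                    time.sleep(2 ** attempt)  # exponential backoff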

    4.4 Ensuring flexibility and portability. Considerations include:

    • Mapping to current and future business requirements
    • Designing for data and application portability (e.g., multi-cloud, data residency requirements)
    • Data staging, cataloging, and discovery