Professional Data Engineer
Certification exam guide
A Professional Data Engineer enables data-driven decision-making by collecting, transforming, and publishing data. A data engineer should be able to design, build, operationalize, secure, and monitor data processing systems with a particular emphasis on security and compliance; scalability and efficiency; reliability and fidelity; and flexibility and portability. A data engineer should also be able to leverage, deploy, and continuously train pre-existing machine learning models.
Section 1: Designing data processing systems
1.1 Selecting the appropriate storage technologies. Considerations include:
● Mapping storage systems to business requirements
● Data modeling
● Trade-offs involving latency, throughput, transactions
● Distributed systems
● Schema design (see the sketch after this list)
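
For the schema design bullet, a minimal sketch of a denormalized BigQuery table with nested, repeated fields, using the google-cloud-bigquery client; the project, dataset, table, and field names are illustrative assumptions, not part of the exam guide.

    from google.cloud import bigquery

    # Hypothetical "orders" table: embedding line items as a nested, repeated
    # RECORD trades storage for join-free queries, a typical BigQuery
    # schema-design decision.
    schema = [
        bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("order_date", "DATE"),
        bigquery.SchemaField(
            "items", "RECORD", mode="REPEATED",
            fields=[
                bigquery.SchemaField("sku", "STRING"),
                bigquery.SchemaField("quantity", "INTEGER"),
            ],
        ),
    ]
    client = bigquery.Client()
    client.create_table(bigquery.Table("my-project.sales.orders", schema=schema))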

1.2 Designing data pipelines. Considerations include:
● Data publishing and visualization (e.g., BigQuery)
● Batch and streaming data (e.g., Dataflow, Dataproc, Apache Beam, Apache Spark and Hadoop ecosystem, Pub/Sub, Apache Kafka); see the Beam sketch after this list
● Online (interactive) vs. batch predictions
● Job automation and orchestration (e.g., Cloud Composer)
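
A minimal Apache Beam batch pipeline, as a sketch only: the Cloud Storage paths are placeholders. The same pipeline shape runs on Dataflow by selecting the DataflowRunner, and becomes a streaming pipeline by swapping the text source for a Pub/Sub read.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Runs locally on the DirectRunner by default; paths are illustrative.
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
            | "Parse" >> beam.Map(lambda line: line.split(","))
            | "KeyByUser" >> beam.Map(lambda cols: (cols[0], 1))
            | "CountPerUser" >> beam.CombinePerKey(sum)
            | "Format" >> beam.MapTuple(lambda user, n: f"{user},{n}")
            | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts")
        )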

1.3 Designing a data processing solution. Considerations include:
● Choice of infrastructure
● System availability and fault tolerance
● Use of distributed systems
● Capacity planning
● Hybrid cloud and edge computing
● Architecture options (e.g., message brokers, message queues, middleware, service-oriented architecture, serverless functions)
● At-least-once, in-order, and exactly-once event processing (see the sketch after this list)
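
For the event-processing bullet, a minimal sketch of effectively-once processing built from at-least-once delivery plus idempotent handling; the event shape and the in-memory marker store are illustrative assumptions (a real pipeline would persist markers durably and atomically with the side effect).

    # Deduplicate by a stable event ID so redeliveries are harmless.
    processed_ids = set()  # illustrative; production code needs a durable store

    def apply_side_effect(event):
        print("processing", event["id"])  # stand-in for the real work

    def handle(event):
        if event["id"] in processed_ids:  # duplicate from at-least-once delivery
            return
        apply_side_effect(event)
        processed_ids.add(event["id"])  # must commit atomically with the effect

    handle({"id": "evt-1"})
    handle({"id": "evt-1"})  # second delivery is skipped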

1.4 Migrating data warehousing and data processing. Considerations include:
● Awareness of current state and how to migrate a design to a future state
● Migrating from on-premises to cloud (Data Transfer Service, Transfer Appliance, Cloud Networking)
● Validating a migration (see the sketch after this list)
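
One way to validate a migration, sketched with the google-cloud-bigquery client: compare a row count from the migrated table against the count reported by the source system. The project, table, and expected count are illustrative.

    from google.cloud import bigquery

    client = bigquery.Client()
    row = next(iter(client.query(
        "SELECT COUNT(*) AS n FROM `my-project.sales.orders`"  # migrated table
    ).result()))
    source_count = 1_204_311  # figure taken from the on-premises source system
    assert row.n == source_count, f"row count mismatch: {row.n} != {source_count}"
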
Section 2: Building and operationalizing data processing systems
2.1 Building and operationalizing storage systems. Considerations include:
● Effective use of managed services (Cloud Bigtable, Cloud Spanner, Cloud SQL, BigQuery, Cloud Storage, Datastore, Memorystore)
● Storage costs and performance
● Life cycle management of data (see the sketch after this list)
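
A sketch of data life cycle management on Cloud Storage using the google-cloud-storage client; the bucket name and age thresholds are illustrative.

    from google.cloud import storage

    bucket = storage.Client().get_bucket("my-analytics-bucket")
    bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)   # cool after 30 days
    bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=365)  # archive after a year
    bucket.add_lifecycle_delete_rule(age=365 * 7)                     # delete after 7 years
    bucket.patch()  # persist the updated rules on the bucket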

2.2 Building and operationalizing pipelines. Considerations include:
● Data cleansing (see the sketch after this list)
● Batch and streaming
● Transformation
● Data acquisition and import
● Integrating with new data sources
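
A minimal data cleansing step, sketched with pandas; the column names and rules are illustrative assumptions, not prescribed by the exam guide.

    import pandas as pd

    def cleanse(df: pd.DataFrame) -> pd.DataFrame:
        df = df.drop_duplicates(subset=["order_id"])           # drop duplicate records
        df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")  # bad values -> NaN
        df = df.dropna(subset=["order_id", "quantity"])        # drop unusable rows
        df["country"] = df["country"].str.strip().str.upper()  # normalize text values
        return df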

2.3 Building and operationalizing processing infrastructure. Considerations include:
● Provisioning resources
● Monitoring pipelines
● Adjusting pipelines
● Testing and quality control

Section 3: Operationalizing machine learning models
3.1 Leveraging pre-built ML models as a service. Considerations include:
● ML APIs (e.g., Vision API, Speech API); see the sketch after this list
● Customizing ML APIs (e.g., AutoML Vision, AutoML text)
● Conversational experiences (e.g., Dialogflow)
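
Calling a pre-built ML API, sketched with the google-cloud-vision client; the image URI is a placeholder.

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    image = vision.Image()
    image.source.image_uri = "gs://my-bucket/photo.jpg"  # illustrative path
    response = client.label_detection(image=image)
    for label in response.label_annotations:
        print(label.description, label.score)  # e.g., "dog 0.97"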

3.2 Deploying an ML pipeline. Considerations include:
● Ingesting appropriate data
● Retraining of machine learning models (AI Platform Prediction and Training, BigQuery ML, Kubeflow, Spark ML); see the BigQuery ML sketch after this list
● Continuous evaluation
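
Training and batch-predicting with BigQuery ML, sketched through the google-cloud-bigquery client; the dataset, table, and column names are illustrative.

    from google.cloud import bigquery

    client = bigquery.Client()
    # Train (or retrain) a logistic regression model directly in SQL.
    client.query("""
        CREATE OR REPLACE MODEL `my_dataset.churn_model`
        OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
        SELECT tenure_months, monthly_spend, churned
        FROM `my_dataset.customers`
    """).result()  # block until training finishes

    # Batch prediction with ML.PREDICT.
    for row in client.query("""
        SELECT predicted_churned
        FROM ML.PREDICT(MODEL `my_dataset.churn_model`,
                        (SELECT tenure_months, monthly_spend FROM `my_dataset.customers`))
    """).result():
        print(row.predicted_churned)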

3.3 Choosing the appropriate training and serving infrastructure. Considerations include:
● Distributed vs. single machine (see the sketch after this list)
● Use of edge compute
● Hardware accelerators (e.g., GPU, TPU)
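
For distributed versus single-machine training, a sketch using TensorFlow's MirroredStrategy, which replicates a model across the GPUs of one machine (swap in MultiWorkerMirroredStrategy for multiple machines); the model itself is a placeholder.

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
    print("replicas in sync:", strategy.num_replicas_in_sync)

    with strategy.scope():  # variables created here are mirrored across replicas
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")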

3.4 Measuring, monitoring, and troubleshooting machine learning models. Considerations include:
● Machine learning terminology (e.g., features, labels, models, regression, classification, recommendation, supervised and unsupervised learning, evaluation metrics); see the metrics sketch after this list
● Impact of dependencies of machine learning models
● Common sources of error (e.g., assumptions about data)
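
The evaluation-metric terminology above, made concrete with scikit-learn on toy binary-classification labels (illustrative data, not from the exam guide).

    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 precision_score, recall_score)

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

    print("accuracy: ", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))  # correct share of predicted positives
    print("recall:   ", recall_score(y_true, y_pred))     # found share of actual positives
    print(confusion_matrix(y_true, y_pred))
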
Section 4: Ensuring solution quality
4.1 Designing for security and compliance. Considerations include:
● Identity and access management (e.g., Cloud IAM)
● Data security (encryption, key management)
● Ensuring privacy (e.g., Data Loss Prevention API); see the sketch after this list
● Legal compliance (e.g., Health Insurance Portability and Accountability Act (HIPAA), Children's Online Privacy Protection Act (COPPA), FedRAMP, General Data Protection Regulation (GDPR))
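
Inspecting text for sensitive data with the Data Loss Prevention API, following the shape of the client library's request pattern; the project ID and sample text are illustrative.

    from google.cloud import dlp_v2

    dlp = dlp_v2.DlpServiceClient()
    response = dlp.inspect_content(
        request={
            "parent": "projects/my-project",  # illustrative project ID
            "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
            "item": {"value": "Contact me at jane.doe@example.com"},
        }
    )
    for finding in response.result.findings:
        print(finding.info_type.name, finding.likelihood)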

4.2 Ensuring scalability and efficiency. Considerations include:
● Building and running test suites
● Pipeline monitoring (e.g., Cloud Monitoring)
● Assessing, troubleshooting, and improving data representations and data processing infrastructure
● Resizing and autoscaling resources (see the sketch after this list)
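
On Dataflow, autoscaling is mostly configuration; a sketch of Beam pipeline options with illustrative project, region, bucket, and worker cap.

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                      # illustrative
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        autoscaling_algorithm="THROUGHPUT_BASED",  # let Dataflow resize the worker pool
        max_num_workers=20,                        # upper bound on scale-out
    )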

4.3 Ensuring reliability and fidelity. Considerations include:
● Performing data preparation and quality control (e.g., Dataprep)
● Verification and monitoring
● Planning, executing, and stress testing data recovery (fault tolerance, rerunning failed jobs, performing retrospective re-analysis); see the retry sketch after this list
● Choosing between ACID, idempotent, and eventually consistent requirements
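
Rerunning failed jobs, sketched as retry with exponential backoff and jitter; the job callable and limits are illustrative, and rerunning is only safe when the job is idempotent (see the sketch under 1.3).

    import random
    import time

    def run_with_retries(job, max_attempts=5):
        for attempt in range(1, max_attempts + 1):
            try:
                return job()
            except Exception as exc:  # in practice, catch only retryable errors
                if attempt == max_attempts:
                    raise
                delay = min(60, 2 ** attempt) + random.random()  # backoff + jitter
                print(f"attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
                time.sleep(delay)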

4.4 Ensuring flexibility and portability. Considerations include:
● Mapping to current and future business requirements
● Designing for data and application portability (e.g., multicloud, data residency requirements)
● Data staging, cataloging, and discovery (see the sketch after this list)
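
Cataloging and discovery, sketched against the Data Catalog client library; the project ID and search query are illustrative assumptions.

    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    scope = datacatalog_v1.SearchCatalogRequest.Scope(
        include_project_ids=["my-project"]  # illustrative project
    )
    for result in client.search_catalog(request={"scope": scope, "query": "orders"}):
        print(result.relative_resource_name)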