Data Engineer

Certification Exam Guide

Sample Case Study

During the Data Engineer Certification exam, some of the questions may refer you to a case study that describes a fictitious business and solution concept. These case studies are intended to provide additional context to help you choose your answer(s). Review the sample case studies that may be used in the exam.

Job Role Description

A Google Certified Professional - Data Engineer enables data-driven decision-making by collecting, transforming, and visualizing data. The data engineer should be able to design, build, maintain, and troubleshoot data processing systems, with a particular emphasis on the security, reliability, fault tolerance, scalability, fidelity, and efficiency of such systems. The data engineer should also be able to analyze data to gain insight into business outcomes, build statistical models to support decision-making, and create machine learning models to automate and simplify key business processes.

Certification Exam Guide

Section 1: Designing data processing systems

1.1 Designing flexible data representations. Considerations include:

  • future advances in data technology
  • changes to business requirements
  • awareness of current state and how to migrate the design to a future state
  • data modeling
  • tradeoffs
  • distributed systems
  • schema design (see the schema sketch after this list)
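
As one illustration of schema design and its data modeling tradeoffs, the minimal sketch below declares a denormalized table with a nested, repeated field. It assumes the google-cloud-bigquery Python client; the project, dataset, and field names are hypothetical placeholders.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # hypothetical project ID

    # Denormalized design: line items are nested inside each order row,
    # trading extra storage for join-free queries.
    schema = [
        bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("order_ts", "TIMESTAMP", mode="REQUIRED"),
        bigquery.SchemaField(
            "line_items", "RECORD", mode="REPEATED",
            fields=[
                bigquery.SchemaField("sku", "STRING", mode="REQUIRED"),
                bigquery.SchemaField("quantity", "INTEGER", mode="NULLABLE"),
            ],
        ),
    ]

    table = bigquery.Table("my-project.sales.orders", schema=schema)
    client.create_table(table)  # raises if the table already exists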

1.2 Designing data pipelines (a minimal pipeline sketch follows the list below). Considerations include:

  • future advances in data technology
  • changes to business requirements
  • awareness of current state and how to migrate the design to a future state
  • data modeling
  • tradeoffs
  • system availability
  • distributed systems
  • schema design
  • common sources of error (e.g., identifying and removing selection bias)
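
A minimal sketch of such a pipeline, assuming the Apache Beam Python SDK with hypothetical bucket paths: it reads raw CSV rows, drops malformed records instead of failing, and aggregates per key. The same pipeline shape runs on Cloud Dataflow by switching the runner.

    import apache_beam as beam

    def parse_and_clean(line):
        # Skip malformed rows instead of crashing the whole job;
        # a simple guard against a common source of input error.
        try:
            _user_id, country, amount = line.split(",")
            return [(country.strip().upper(), float(amount))]
        except ValueError:
            return []

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/events.csv")
            | "ParseAndClean" >> beam.FlatMap(parse_and_clean)
            | "SumPerCountry" >> beam.CombinePerKey(sum)
            | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
            | "Write" >> beam.io.WriteToText("gs://my-bucket/output/totals")
        )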

1.3 Designing data processing infrastructure. Considerations include:

  • future advances in data technology
  • changes to business requirements
  • awareness of current state and how to migrate the design to a future state
  • data modeling
  • tradeoffs
  • system availability
  • distributed systems
  • schema design
  • capacity planning
  • different types of architectures: message brokers, message queues, middleware, service-oriented architecture (see the Pub/Sub sketch after this list)
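
As a sketch of the message-broker style, the snippet below publishes one record to a Cloud Pub/Sub topic with the google-cloud-pubsub Python client; the project and topic names are hypothetical.

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "events")  # hypothetical names

    # publish() is asynchronous; result() blocks until the broker
    # acknowledges the message and returns its server-assigned ID.
    future = publisher.publish(topic_path, data=b'{"event": "signup"}')
    print(future.result())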

Section 2: Building and maintaining data structures and databases

2.1 Building and maintaining flexible data representations.

2.2 Building and maintaining pipelines. Considerations include:

  • data cleansing (see the sketch after this list)
  • batch and streaming
  • transformation
  • acquiring and importing data
  • testing and quality control
  • connecting to new data sources
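
A small, concrete sketch of cleansing and transformation using pandas; the file and column names are hypothetical:

    import pandas as pd

    def clean_events(path: str) -> pd.DataFrame:
        df = pd.read_csv(path)                        # acquire and import
        df = df.drop_duplicates(subset=["event_id"])  # drop duplicate records
        df["country"] = df["country"].str.strip().str.upper()       # normalize casing
        df["amount"] = pd.to_numeric(df["amount"], errors="coerce") # bad values become NaN
        return df.dropna(subset=["amount"])           # drop rows that failed parsing

    cleaned = clean_events("events.csv")
    print(cleaned.describe())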

2.3 Building and maintaining processing infrastructure. Considerations include:

  • provisioning resources
  • monitoring pipelines
  • adjusting pipelines
  • testing and quality control

Section 3: Analyzing data and enabling machine learning

3.1 Analyzing data. Considerations include:

  • data collection and labeling
  • data visualization
  • dimensionality reduction (see the sketch after this list)
  • data cleaning/normalization
  • defining success metrics
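
For instance, normalization followed by dimensionality reduction might look like the sketch below, which uses scikit-learn on random placeholder data:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(seed=0)
    X = rng.normal(size=(200, 10))  # 200 samples, 10 features (placeholder data)

    X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X_scaled)            # project onto 2 principal components

    # The explained-variance ratio indicates how much signal the projection keeps.
    print(pca.explained_variance_ratio_)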

3.2 Machine learning. Considerations include:

  • feature selection/engineering
  • algorithm selection
  • debugging a model (see the sketch after this list)
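
A minimal sketch of fitting a model and reading one basic debugging signal, using scikit-learn on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    # A large gap between training and validation accuracy is one
    # classic signal of overfitting when debugging a model.
    print("train accuracy:", model.score(X_tr, y_tr))
    print("val accuracy:  ", model.score(X_val, y_val))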

3.3 Machine learning model deployment. Considerations include:

  • performance/cost optimization
  • online/dynamic learning

Section 4: Modeling business processes for analysis and optimization

4.1 Mapping business requirements to data representations. Considerations include:

  • working with business users
  • gathering business requirements

4.2 Optimizing data representations, data infrastructure performance and cost. Considerations include:

  • resizing and scaling resources
  • data cleansing
  • distributed systems
  • high-performance algorithms
  • common sources of error (e.g., identifying and removing selection bias)

Section 5: Ensuring reliability

5.1 Performing quality control. Considerations include:

  • verification
  • building and running test suites (see the sketch after this list)
  • pipeline monitoring
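
Pipeline test suites often assert invariants on output data. Below is a minimal pytest-style sketch; the loader and its columns are hypothetical stand-ins for real pipeline output:

    import pandas as pd

    def load_cleaned_events() -> pd.DataFrame:
        # Hypothetical stand-in for the pipeline output under test.
        return pd.DataFrame({"event_id": [1, 2, 3], "amount": [9.5, 3.0, 12.25]})

    def test_event_ids_are_unique():
        assert load_cleaned_events()["event_id"].is_unique

    def test_amounts_are_non_negative():
        assert (load_cleaned_events()["amount"] >= 0).all()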

5.2 Assessing, troubleshooting, and improving data representations and data processing infrastructure.

5.3 Recovering data. Considerations include:

  • planning (e.g., fault tolerance)
  • executing (e.g., rerunning failed jobs, performing retrospective re-analysis; see the retry sketch after this list)
  • stress testing data recovery plans and processes
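
One building block of such a plan is automatically rerunning failed jobs. The sketch below retries a hypothetical, idempotent job with exponential backoff:

    import time

    def run_with_retries(run_job, max_attempts=5, base_delay_s=2.0):
        """Rerun a failed job with exponential backoff between attempts."""
        for attempt in range(1, max_attempts + 1):
            try:
                return run_job()
            except Exception as exc:
                if attempt == max_attempts:
                    raise  # out of retries; surface the failure
                delay = base_delay_s * 2 ** (attempt - 1)
                print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
                time.sleep(delay)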

Section 6: Visualizing data and advocating policy

6.1 Building (or selecting) data visualization and reporting tools. Considerations include:

  • automation
  • decision support
  • data summarization (e.g., translation up the chain, fidelity, trackability, integrity; see the summarization sketch after this list)
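
As a small pandas sketch of summarization for reporting, with inline placeholder data: detail rows are rolled up to the level a report consumes, while the detail table is kept for fidelity and trackability.

    import pandas as pd

    detail = pd.DataFrame({
        "region": ["EMEA", "EMEA", "APAC"],
        "revenue": [120.0, 80.0, 200.0],
    })

    # Roll detail up for the report; retain the detail table so readers
    # further up the chain can still drill down to the source rows.
    summary = detail.groupby("region", as_index=False)["revenue"].sum()
    print(summary)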

6.2 Advocating policies and publishing data and reports.

Section 7: Designing for security and compliance

7.1 Designing secure data infrastructure and processes. Considerations include:

  • Identity and Access Management (IAM); see the IAM sketch after this list
  • data security
  • penetration testing
  • Separation of Duties (SoD)
  • security control
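
As a sketch of IAM in practice, the snippet below grants a group read-only access to a Cloud Storage bucket through the google-cloud-storage Python client; the bucket and group names are hypothetical.

    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-bucket")  # hypothetical bucket name

    # Fetch the current policy, append a binding, and write it back.
    # objectViewer is read-only, in keeping with least privilege.
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append({
        "role": "roles/storage.objectViewer",
        "members": {"group:analysts@example.com"},  # hypothetical group
    })
    bucket.set_iam_policy(policy)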

7.2 Designing for legal compliance. Considerations include:

  • legislation (e.g., the Health Insurance Portability and Accountability Act (HIPAA) and the Children’s Online Privacy Protection Act (COPPA))
  • audits