Google Cloud's data ingestion principles
Shane Glass
Developer Advocate
Erin Franz
Partner Manager, Google Cloud
Businesses around the globe are realizing the benefits of replacing legacy data silos with cloud-based enterprise data warehouses, including easier collaboration across business units and access to insights within their data that were previously unseen. However, bringing data from numerous disparate data sources into a single data warehouse requires you to develop pipelines that ingest data from these various sources into your enterprise data warehouse. Historically, this has meant that data engineering teams across the organization procure and implement various tools to do so. But this adds significant complexity to managing and maintaining all these pipelines and makes it much harder to effectively scale these efforts across the organization.

Developing enterprise-grade, cloud-native pipelines to bring data into your data warehouse can alleviate many of these challenges. But, if done incorrectly, these pipelines can present new challenges that your teams will have to spend their time and energy addressing.
Developing cloud-based data ingestion pipelines that replicate data from various sources into your cloud data warehouse can be a massive undertaking that requires a significant investment of staffing resources. A project of this size can seem overwhelming, and it can be difficult to identify where to begin. To help you start, we have defined the following principles for data pipeline planning. These principles are intended to help you answer key business questions about your effort and begin to build data pipelines that address your business and technical needs. Each section below details one principle and the factors your teams should consider as they begin developing their pipelines.
Principle 1: Clarify your objectives
The first principle to consider for pipeline development is clarify your objectives. This can be broadly defined as taking a holistic approach to pipeline development that encompasses requirements from several perspectives: technical teams, regulatory or policy requirements, desired outcomes, business goals, key timelines, available teams and their skill sets, and downstream data users. Clarifying your objectives means identifying and defining the requirements of each key stakeholder at the beginning of the process and continually checking development against them to ensure the pipelines you build meet those requirements.
This is done by first clearly defining the desired end state for each project in a way that addresses a demonstrated business need of downstream data users. Remember that data pipelines are almost always the means to accomplish your end state, rather than the end state itself. An example of an effectively defined end state is "enabling teams to gain a better understanding of our customers by providing access to our CRM data within our cloud data warehouse" rather than "move data from our CRM to our cloud data warehouse". This may seem like a merely semantic difference, but framing the problem in terms of business needs helps your teams make technical decisions that will best meet those needs.

After clearly defining the business problem you are trying to solve, you should facilitate requirement gathering from each stakeholder and use these requirements to guide the technical development and implementation of your ingestion pipelines. We recommend gathering stakeholders from each team, including downstream data users, prior to development to gather requirements for the technical implementation of the data pipeline. These requirements will include critical timelines, uptime requirements, data update frequency, data transformation, DevOps needs, and any security, policy, or regulatory requirements that a data pipeline must meet.
Principle 2: Build your team
The second principle to consider for pipeline development is build your team. This means ensuring you have the right people with the right skills available in the right places to develop, deploy, and maintain your data pipelines. After you have gathered your pipeline requirements, you can begin to develop a summary architecture that will be used to build and deploy your data pipelines. This architecture will help you identify the human talent you need to successfully build, deploy, and manage these pipelines, as well as any potential shortfalls that would require additional support from third-party partners or new team members.
Not only do you need to ensure you have the right people and skill sets available in aggregate, but these individuals need to be effectively structured to empower them to maximize their abilities. This means developing team structures that are optimized for each team's responsibilities and their ability to support adjacent teams as needed.
This also means developing processes that prevent blockers to technical development whenever possible, such as ensuring that teams have all of the permissions they need to move data from the original source to your cloud data warehouse without violating the concept of least privilege. Depending on your requirements and architecture, developers need access to the original data source in addition to the destination data warehouse. Examples include ensuring that developers can create or connect to a Salesforce Connected App, or that they have read access to specific Search Ads 360 data fields.
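As a concrete illustration of scoping warehouse-side access to only what a pipeline developer needs, the sketch below grants a developer write access to a single BigQuery staging dataset rather than project-wide permissions. It assumes the google-cloud-bigquery Python client; the project, dataset, and email address are placeholders.

```python
from google.cloud import bigquery

# Hypothetical project, dataset, and developer identity used for illustration.
client = bigquery.Client(project="my-project")
dataset = client.get_dataset("my-project.crm_staging")

# Grant the pipeline developer write access to the staging dataset only,
# rather than broad project-level permissions.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="WRITER",
        entity_type="userByEmail",
        entity_id="pipeline-dev@example.com",
    )
)
dataset.access_entries = entries
dataset = client.update_dataset(dataset, ["access_entries"])
```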
Principle 3: Minimize time to value
The third principle to consider for pipeline development is minimize time to value. This means considering the long-term maintenance burden of a data pipeline before developing and deploying it, in addition to being able to deploy a minimum viable pipeline as quickly as possible. Generally speaking, we recommend one overarching approach to building data pipelines that minimizes their maintenance burden: write as little code as possible. Functionally, this can be implemented by:
1. Leveraging interface-based data ingestion products whenever possible. These products minimize the amount of code that requires ongoing maintenance and empower users who aren't software developers to build data pipelines. They can also reduce development time for data pipelines, allowing them to be deployed and updated more quickly.
- Products like Google Data Transfer Service and Fivetran allow for managed data ingestion pipelines by any user to centralize data from SaaS applications, databases, file systems, and other tooling. With little to no code required, these managed services enable you to connect your data warehouse to your sources quickly and easily.
- For workloads managed by ETL developers and data engineers, tools like Google Cloud’s Data Fusion and Dataprep by Trifacta provide an easy-to-use visual interface for designing, managing and monitoring advanced pipelines with complex transformations.
2. Whenever interface-based products or data connectors are insufficient, use pre-existing code templates. Examples of this include templates available for Dataflow that allow users to define variables and run pipelines for common data ingestion use cases, and the Public Datasets pipeline architecture that our Datasets team uses for onboarding.
3. If neither of these options is sufficient, utilize managed services to deploy code for your pipelines. Managed services, such as Dataflow or Dataproc, eliminate the operational overhead of managing pipeline configuration by automatically scaling pipeline instances within predefined parameters. The sketches below illustrate both a fully managed transfer configuration and a minimal pipeline deployed on a managed service.
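To make the first option more concrete, the following sketch creates a managed ingestion pipeline with the BigQuery Data Transfer Service Python client. The project, dataset, connector ID, and parameter values are illustrative; the actual data source IDs and parameters depend on which connector you use.

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
parent = client.common_project_path("my-project")

# Connector IDs and parameter names vary by source; these values are illustrative.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="marketing",
    display_name="Daily Google Ads import",
    data_source_id="google_ads",
    params={"customer_id": "1234567890"},
    schedule="every 24 hours",
)

transfer_config = client.create_transfer_config(
    parent=parent, transfer_config=transfer_config
)
print(f"Created transfer config: {transfer_config.name}")
```

When code is unavoidable, a managed service can still run it for you. The sketch below is a minimal Apache Beam pipeline submitted to the Dataflow runner, which handles worker provisioning and scaling; the project, bucket, and table names are placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, bucket, and table names.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
)

def parse_row(line):
    """Split a CSV line into a dictionary matching the destination schema."""
    order_id, amount = line.split(",")
    return {"order_id": order_id, "amount": amount}

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadCsv" >> beam.io.ReadFromText(
            "gs://my-bucket/raw/orders-*.csv", skip_header_lines=1)
        | "ParseRow" >> beam.Map(parse_row)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:sales.orders_raw",
            schema="order_id:STRING,amount:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```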
Principle 4: Increase data trust and transparency
The fourth principle to consider for pipeline development is increase data trust and transparency. For the purposes of this document, we define this as the process of overseeing and managing data pipelines across all tools. Numerous data ingestion pipelines that each leverage different tools, or that are not developed under a coordinated management plan, can result in "tech sprawl", which significantly increases management overhead as the number of data pipelines grows. This becomes especially cumbersome if you are subject to service-level agreements, or to legal, regulatory, or policy requirements for overseeing data pipelines. The best strategy for dealing with tech sprawl is, by far, to prevent it by developing streamlined pipeline management processes that automate reporting. Although this could theoretically be achieved by building all of your data pipelines with a single cloud-based product, we do not recommend doing so because it prevents you from taking advantage of the features and cost optimizations that come with choosing the best product for each use case.
A monitoring service such as Google Cloud Monitoring or Splunk that automates the collection of metrics, events, and metadata from various products, including those hosted in on-premises and hybrid computing environments, can help you centralize reporting and monitoring of your data pipelines. A metadata management tool such as Google Cloud's Data Catalog or Informatica's Enterprise Data Catalog can help you better communicate the nuances of your data so users understand which data resources are best suited to a given use case. Together, these significantly reduce your pipelines' governance burden by eliminating manual reporting processes that often result in inaccuracies or lagging updates.
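One common way to centralize pipeline health reporting is to have every pipeline, regardless of the tool that runs it, publish a custom metric to Cloud Monitoring when it completes. The sketch below assumes the google-cloud-monitoring Python client; the project, metric type, and values are hypothetical placeholders.

```python
import time

from google.cloud import monitoring_v3

# Hypothetical project and custom metric; pipeline name and value are placeholders.
client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/pipelines/rows_ingested"
series.metric.labels["pipeline"] = "crm_daily_load"
series.resource.type = "global"

# Record a single data point for this pipeline run.
now = time.time()
seconds = int(now)
nanos = int((now - seconds) * 10**9)
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": seconds, "nanos": nanos}}
)
point = monitoring_v3.Point(
    {"interval": interval, "value": {"int64_value": 150000}}
)
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])
```

Dashboards and alerts can then be built on this one metric across every pipeline, rather than on tool-specific logs.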
Principle 5: Manage costs
The fifth principle to consider for pipeline development is manage costs. This encompasses both the cost of cloud resources and the staffing costs necessary to design, develop, deploy, and maintain those resources. We believe that your goal should not necessarily be to minimize cost, but rather to maximize the value of your investment. This means maximizing the impact of every dollar spent by minimizing waste in both cloud resource utilization and human time. There are several factors to consider when it comes to managing costs:
Use the right tool for the job - Different data ingestion pipelines will have different requirements for latency, uptime, transformations, and so on. Similarly, different data pipeline tools have different strengths and weaknesses. Choosing the right tool for each data pipeline helps your pipelines operate significantly more efficiently, which can reduce your overall cost and free up staffing time to focus on the most impactful projects.
Standardize resource labeling - Implement and utilize a consistent labeling schema across all tools and platforms to have the most comprehensive view of your organization's spending. One example is requiring all resources to be labeled by the cost center or team at time of creation. Consistent labeling allows you to monitor your spend across different teams and calculate the overall value of your cloud spending.
Implement cost controls - If available, leverage cost controls, such as BigQuery's maximum bytes billed setting, to prevent errors that result in unexpectedly large bills; a brief sketch appears below, after these cost factors.
Capture cloud spend - Capture your spend on all cloud resource utilization for internal analysis using a cloud data warehouse and a data visualization tool. Without it, you won't understand the context of changes in cloud spend and how they correlate with changes in business.
Make cost management everyone's job - Managing costs should be part of the responsibilities of everyone who can create or utilize cloud resources. To do this well, we recommend making cloud spend reporting more transparent internally and/or implementing chargebacks to internal cost centers based on utilization.
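To make the labeling and cost-control factors above more concrete, the sketch below uses the google-cloud-bigquery Python client to apply cost-center labels to a dataset and to cap the bytes a single query may scan. The project, dataset, table, and label values are placeholders for whatever schema your organization standardizes on.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Standardize resource labeling: tag the dataset with its cost center and team.
dataset = client.get_dataset("my-project.marketing_analytics")
dataset.labels = {"cost_center": "cc-1234", "team": "growth", "env": "prod"}
dataset = client.update_dataset(dataset, ["labels"])

# Implement cost controls: fail any query that would scan more than ~1 GB.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
query = """
    SELECT order_id, amount
    FROM `my-project.sales.orders`
    WHERE order_date = CURRENT_DATE()
"""
rows = client.query(query, job_config=job_config).result()
```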
Long-term, the increased granularity in cost reporting available within Google Cloud can help you better measure your key performance indicators. You can shift from cost-based reporting (e.g., "We spent $X on BigQuery storage last month") to value-based reporting (e.g., "It costs $X to serve customers who bring in $Y revenue").
To learn more about managing costs, check out Google Cloud's "Understanding the principles of cost optimization" white paper.
Principle 6: Leverage continually improving services
The sixth principle is leverage continually improving services. Cloud services are consistently improving their performance and stability, even if some of these improvements are not obvious to users. These improvements can help your pipelines run faster, cheaper, and more consistently over time. You can take advantage of the benefits of these improvements by:
Automating both your pipelines and pipeline management: Not only should data pipelines themselves be automated, but almost all aspects of managing them can also be automated, including pipeline and data lineage tracking, monitoring, cost management, scheduling, access management, and more. This reduces the long-term operational cost of each data pipeline, which can significantly improve its value proposition, and prevents manual configurations from negating the benefits of later product improvements.
Minimizing pipeline complexity whenever possible: While ingestion pipelines are relatively easy to develop using UI-based or managed services, they also require continued maintenance for as long as they are in use. The most easily maintained data ingestion pipelines are typically those that minimize complexity and leverage automatic optimization capabilities. Any transformation in a data ingestion pipeline is a manual optimization that may struggle to adapt or scale as the underlying services improve. You can minimize the need for such transformations by building ELT (extract, load, transform) pipelines rather than ETL (extract, transform, load) pipelines. This pushes transformations down to the data warehouse, which uses a query engine specifically optimized for transforming your data, rather than relying on manually configured pipelines. A minimal sketch of this ELT pattern follows below.
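As a minimal sketch of the ELT pattern described above (assuming the google-cloud-bigquery Python client, with placeholder project, bucket, table, and column names), the raw file is loaded into a staging table untouched and the transformation runs inside BigQuery's query engine:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Extract + Load: land the raw CSV in a staging table with no transformation.
load_job = client.load_table_from_uri(
    "gs://my-bucket/raw/orders-2024-01-01.csv",
    "my-project.staging.orders_raw",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    ),
)
load_job.result()

# Transform: push the work down to the warehouse's query engine.
transform_sql = """
    CREATE OR REPLACE TABLE `my-project.analytics.orders` AS
    SELECT
      order_id,
      CAST(amount AS NUMERIC) AS amount,
      DATE(order_ts) AS order_date
    FROM `my-project.staging.orders_raw`
"""
client.query(transform_sql).result()
```

Because the transformation is expressed as SQL in the warehouse rather than logic embedded in the pipeline, it automatically benefits as the warehouse's query engine improves.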
Next steps
If you're looking for more information about developing your cloud-based data platform, check out our Build a modern, unified analytics data platform whitepaper. You can also visit our data integration site to learn more and find ways to get started with your data integration journey.
Once you're ready to begin building your data ingestion pipelines, learn more about how Cloud Data Fusion and Fivetran can help you make sure your pipelines address these principles.