Migrating and integrating data into BigQuery using Informatica Intelligent Cloud Services

By Rajat Pandey, Principal Product Manager, and Eddy Widjaja, Solutions Architect at Informatica

This article provides guidance and best practices for migrating and integrating your hybrid datasets to BigQuery by using Informatica Intelligent Cloud Services (IICS).

Increased access to big data has generated new types of analysis and created new data-driven workloads. The amount of data involved often creates scalability challenges and increases operating costs for traditional on-premises Enterprise Data Warehouses (EDW). To address these challenges, companies are looking to migrate or extend their legacy systems to a cloud EDW like BigQuery.

By complementing an existing EDW, or migrating your entire data workload to BigQuery, you might be able to:

  • Lower operational overhead and costs.
  • Eliminate capital costs and expensive license renewals.
  • Ensure scalability and elastic capacity.
  • Take advantage of advanced capabilities, for example, machine learning.

With BigQuery, you don't need to worry about future workloads or lack of storage capacity. BigQuery's serverless NoOps architecture elastically scales under heavy demand, which can reduce query time. By eliminating infrastructure maintenance and realigning data resources, you can glean powerful insights at scale.

When moving data to BigQuery, you need to ensure data integrity between on-premises systems and BigQuery. Proven extract, transform, and load (ETL) procedures and reliable data quality and security services let you deliver, manage, govern, and synchronize your data with BigQuery. After you've put a process into place, transferring secure data to BigQuery helps you accelerate the migration of data workloads and helps to ensure data quality. Then, when your data is in BigQuery, you can take advantage of BigQuery's advanced analytic capabilities.

IICS is a reliable toolset for migrating on-premises workloads to BigQuery. The toolset has the following advantages:

  • It is built on a microservices architecture.
  • It offers automated data management and data integration.
  • It has robust, pre-built connectivity to hundreds of on-premises and cloud-based systems.

Informatica allows organizations to design petabyte-scale data integrations, capturing data from complex hybrid networks and enabling enterprise-wide analysis.

The following sections describe the recommended steps for using IICS to integrate and migrate data into BigQuery.

Migrating to BigQuery using IICS

IICS is a multi-tenant integration platform as a service, a suite of cloud services that enable you to develop, execute, and govern data integration flows. IICS allows connections to on-premises and cloud-based processes, services, applications, and data within individual organizations or across multiple organizations.

Built on Informatica's single Intelligent Data Platform, IICS supports:

  • All flavors of integration and data management—data, application, B2B, master data management, data quality, and data security.
  • All data latencies—real-time, batch, and streaming.
  • A rich set of traditional and new protocols.

IICS is a zero-code, metadata-driven rapid development platform. The underlying runtime engine is based on a modern microservices-based architecture that provides:

  • An easy interface with new cloud services.
  • Unified administration and monitoring.
  • Seamless orchestration of integration flows.

You can deploy the Cloud Secure Agent, a scaled-down data integration (DI) engine, on-premises behind a firewall or in Google Cloud Platform (GCP). This approach lets you integrate on-premises sources and applications with cloud applications, which helps prevent data proliferation and security challenges. Secure Agents can run in one or more clusters to support performance, scale, load balancing, and high availability. IICS can also manage Secure Agents as hosted agents.

The following diagram shows a typical IICS use case.

Architectural diagram of using IICS with BigQuery.

Integration options

When you use the BigQuery connector, you can load data by using streaming mode or batch mode. The following diagram illustrates how the Informatica Secure Agent moves data in each mode.

Architectural diagram of how the Secure Agent moves data.

For more information about Secure Agent and how to install it, see Installing Secure Agents.

Streaming mode

When the volume of data that you want to load is small (fewer than 10,000 rows), use streaming mode. In streaming mode, Secure Agent reads data from the data source and streams it directly into the BigQuery target. You can configure the number of rows that Secure Agent sends in one request.
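
IICS handles this for you, but it can help to see the equivalent operation against the BigQuery API. The following minimal Python sketch (not IICS code; the project, dataset, and table names are placeholders) streams a few rows with the google-cloud-bigquery client.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # hypothetical project ID
table_id = "my-project.my_dataset.customer"      # hypothetical target table

rows = [
    {"name": "Alice", "acctbal": 1200.50},
    {"name": "Bob", "acctbal": 87.25},
]

# Streaming inserts write rows in small batches and make them available
# for queries almost immediately, which suits small data volumes.
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")
```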

Batch mode

When you want to load large volumes of data in a cost-efficient manner, use batch mode. In batch mode, Secure Agent first stages the data from the data source in a Cloud Storage bucket, and then the data is loaded from the bucket into BigQuery. Additionally, if network throughput is limited, you can enable compression before the data is transferred to Cloud Storage. For more information about data compression and write modes, see the Google BigQuery connector documentation.
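
The Secure Agent performs these steps for you. For context, here is a minimal Python sketch (assumed names and paths, not IICS code) of the same bulk pattern: stage a compressed file in Cloud Storage, then run a BigQuery load job from that staging object.

```python
from google.cloud import bigquery, storage

project = "my-project"                        # hypothetical project ID
bucket_name = "my-staging-bucket"             # hypothetical staging bucket
blob_name = "staging/customer.csv.gz"         # hypothetical staging object
table_id = f"{project}.my_dataset.customer"   # hypothetical target table

# 1. Stage the (gzip-compressed) extract in Cloud Storage.
storage.Client(project=project).bucket(bucket_name).blob(blob_name) \
    .upload_from_filename("customer.csv.gz")

# 2. Load the staged file into BigQuery with a load job.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
load_job = bigquery.Client(project=project).load_table_from_uri(
    f"gs://{bucket_name}/{blob_name}", table_id, job_config=job_config
)
load_job.result()  # Wait for the load job to finish.
```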

Objectives

  • Configure BigQuery and Cloud Storage connections.
  • Integrate the data.
  • Define data flow logic using the mapping designer.

Configuring BigQuery and Cloud Storage connections

Informatica provides connectors to numerous enterprise systems. You can explore the catalog of available connectors.

  1. Log in to IICS and click Administrator.
  2. Click Add-On Connectors.

    Add-on Connectors in side panel.

  3. Select the BigQuery and Cloud Storage connectors.

Setting up a BigQuery connection

  1. To create a connection, select Connections, and click New Connection.
  2. In the Type drop-down list, click Google BigQuery, and enter values for the required fields as listed in the following table.

    Connection name: Name the connection. This name appears when you select the source or target connection in integration tasks and mappings.
    Description: A descriptive tag so that you can search for the connection later.
    Type: Select Google BigQuery.
    Runtime Environment: Select the Secure Agent that runs the tasks.
    Service Account ID: The service account ID for GCP.
    Service Account Key: The service account key. You can find this key in the JSON file that was downloaded when you created the service account (see the sketch after these steps).
    Project ID: The GCP project ID that you want to associate with the connector.
    Storage Path: The Cloud Storage path used for staging. Include the gs:// prefix in the path.
    Connection Mode: Leave the default value.
    Schema Definition File Path: A path where IICS stores the JSON schema that it generates.
    Dataset ID: The BigQuery dataset ID.

  3. To verify the connection, click Test Connection. The connection is verified when the message The test for this connection was successful appears.

    Successful test connection for BigQuery.

  4. When the connection is verified, click Save.
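
The Service Account ID, Service Account Key, and Project ID values come from the JSON key file that you download when you create the service account. As a quick orientation, the following Python sketch (the file path is a placeholder; the field names follow the standard GCP key file format) prints the fields that typically supply those connection properties.

```python
import json

# Hypothetical path to the JSON key file that was downloaded when the
# service account was created. Treat this file as a secret.
with open("service-account-key.json") as f:
    key = json.load(f)

# Fields that typically map to the IICS connection properties:
print("Service Account ID :", key["client_email"])
print("Project ID         :", key["project_id"])
print("Service Account Key:", key["private_key"][:40], "...")  # truncated for display
```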

Setting up a Cloud Storage connection

  1. From IICS, to create a connection, select Connections, and click New Connection.
  2. In the Type drop-down list, click Google Cloud Storage V2, and enter values for the required fields as listed in the following table.

    Connection name: Name the connection. This name appears when you select the source or target connection in integration tasks and mappings.
    Description: A descriptive tag so that you can search for the connection later.
    Type: Select Google Cloud Storage.
    Runtime Environment: Select the Secure Agent that runs the tasks.
    Service Account ID: The service account ID for GCP.
    Service Account Key: The service account key. You can find this key in the JSON file that was downloaded when you created the service account.
    Project ID: The GCP project ID that you want to associate with the connector.
  3. To verify the connection, click Test Connection. The connection is verified when the message The test for this connection was successful appears.

    Successful test connection for Cloud Storage.

Integrating the data

Informatica provides a six-step Synchronization Task wizard to create simple data integration tasks. The wizard lets you use expression and lookup transformations in the field-mapping step.

Define the data integration task

  1. Go to the Data Integration service, and then click New.
  2. Click Synchronization Task, and then click Create.

    Create synchronization task.

  3. Enter the Task Name.

  4. For the Task Operation, select Insert to insert into the target. Click Next.

    Define new task details.

Choose the source

  1. From the Connection drop-down list, click the data source. To define a new connection, click New.
  2. Select the Source Type.

    Select a source type.

    The source types are as follows:

    • Single represents a single table or object.
    • Multiple represents multiple tables or objects. When you select Multiple, you either choose an already-known relationship or establish the relationship manually. This approach performs better because the join is done in the source database or application.
    • Saved Query lets you use a custom query to select the source data.
  3. Select the Source Object.

  4. Click Next.

Choose the target

  1. From the Connection drop-down list, select Google BigQuery as the target. If your BigQuery target doesn't have the table that you want to insert into, click Create Target to create it. To view the first 10 records of the selected table, click Show Data Preview. Data preview is controlled by a user-level permission setting in IICS.

    Define the target details.

  2. Click Next.

Filter incoming source data

  1. To add a filter, click New. For example, you can filter to show only ACCTBAL values greater than $0.

    Example of a data filter.

  2. To close the Data filter window, click OK, and then click Next.

Map source fields to target fields

IICS automatically maps source fields to target fields by name, as shown in the following screenshot. Source fields are on the left, and target fields are on the right. To map a field manually, drag the source field onto the target field.

  1. In the Actions column, click fx to transform the field data by using an expression.

    Transform the field data.

  2. To perform a lookup, click the Search icon.

    A lookup is a search for a field from another data source. This field is used in the mapping.

  3. To clear the mapping for a field, click the Close icon.

    This example shows how to transform the Name field into proper case (a short sketch after these steps illustrates the effect). Numerous functions and operators are available for transforming incoming fields.

    Example transformation.

  4. In the Target column, click the Search icon next to Address to perform a lookup for this field. The fields in the following screenshot are blank because there is no lookup in this example.

    Blank lookup.

  5. After exploring the Field Lookup window, click Cancel, and then click Next.
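
To make the proper-case example concrete, the following Python sketch shows the effect that such an expression has on incoming values. It is illustrative only and does not use Informatica's expression language.

```python
def to_proper_case(name: str) -> str:
    # Capitalize the first letter of each word, as a proper-case
    # transformation would do to the incoming Name field.
    return " ".join(word.capitalize() for word in name.split())

print(to_proper_case("ACME CORPORATION"))  # -> "Acme Corporation"
```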

Schedule

In the scheduling step, you configure the following:

  • A schedule for running the task.
  • Email notification.
  • Pre- and post-processing commands.
  • Advanced properties for BigQuery.

To schedule a task:

  1. In the Schedule step, you can define a one-time or recurring schedule for executing the task. Click New.

    Scheduling a task.

  2. In the New Schedule window, schedule when and how often a job runs. Click OK.

    Defining a new schedule.

  3. Scroll to view the BigQuery Advanced target properties fields.

    When you run a task with a BigQuery target, advanced properties let you control the outcome.

    Target Dataset ID: Optional. Overrides the target dataset ID specified in the BigQuery connection.
    Target Table Name: Optional. Overrides the target table name specified in the synchronization task.
    Create Disposition: Determines behavior when the target table does not exist. Create never returns an error; Create if needed creates the table.
    Write Disposition: Determines how rows are inserted when the target table exists. Write append (the default) appends to the existing table. Write empty inserts only if the target table is empty; otherwise, the write operation returns an error. Write truncate deletes existing records before new records are inserted into the target table. Applies to Bulk write mode only. (The sketch after these steps shows the equivalent BigQuery load job settings.)
    Write Mode: Bulk mode (the default) inserts records into the target in bulk. Streaming mode writes row by row and lets you query data from the target without delay.
    Streaming Template Table Suffix: Suffix for the BigQuery templateSuffix parameter. The template table is used when streaming data into BigQuery, and lets you split a logical table into many smaller tables without writing complex code. Applies to Streaming write mode only.
    Rows per Streaming Request: Number of rows per streaming request. The default is 500 rows; the maximum row size is 10 MB. Applies to Streaming write mode only.
    Staging File Name: Overrides the default staging file name. The default name is Informatica_Stage_File_[unix timestamp].[data format: JSON, CSV]. Applies to Bulk write mode only.
    Data format of the staging file: Data format of the Cloud Storage staging file, either JSON (which supports nested and repeated fields) or CSV.
    Persist Staging File After Loading: Keeps the staging file in Cloud Storage after loading. By default, staging files are deleted. Applies to Bulk write mode only.
    Enable Staging File Compression: Compresses staging files to reduce storage cost and transfer time from the Secure Agent to Cloud Storage.
    Job Poll Interval in Seconds: Number of seconds between polls for the status of a job. The default is 10.
    Number of Threads for Uploading Staging File: Number of parallel threads that upload the staging file. If you increase this value from the default of 1, also increase the Secure Agent's DTM heap size (JVMOption3). Applies to Bulk write mode only.
    Local Stage File Directory: Directory where a local Secure Agent stores temporary files before writing to the Cloud Storage staging file. Applies to Bulk write mode only.
    Allow Quoted Newlines: Allows newline characters inside quoted values in a single record of a CSV file.
    Field Delimiter: The field delimiter for CSV files. The default is a comma (,).
    Allow Jagged Rows: Allows rows without trailing columns in a CSV file. By default, jagged rows are not allowed.
    Pre SQL: SQL statement to run before writing data to the target.
    Post SQL: SQL statement to run after writing data to the target.
    Pre SQL Configuration: SQL configuration for the pre SQL statement.
    Post SQL Configuration: SQL configuration for the post SQL statement.
    Success File Directory: Not applicable.
    Error File Directory: Not applicable.

    Advanced target properties.

    Standard mode selection.

  4. Click Finish, and then click Run to run the synchronization task.
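
The Create Disposition and Write Disposition options correspond to standard BigQuery load job settings. The following Python sketch (placeholder project, bucket, and table names; not code generated by IICS) expresses those same settings with the google-cloud-bigquery client.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    # "Create if needed" / "Create never" in the connector:
    create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED,
    # "Write append" / "Write empty" / "Write truncate" in the connector:
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
job = client.load_table_from_uri(
    "gs://my-staging-bucket/staging_file.json",   # hypothetical staging file
    "my-project.my_dataset.customer",             # hypothetical target table
    job_config=job_config,
)
job.result()  # Wait for the load to complete.
```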

Defining data flow logic using mapping

When you need complex transformations and the six-step wizard is insufficient, use mapping.

Mapping offers built-in transformations that aren't limited to just expressions and lookups. In addition, you can build mappings that have multiple data sources as well as multiple targets.

Available transformations include Joiner, Lookup, Aggregator, Filter, Sorter, and Expression, among others.

The following section explains how to build a more complex mapping, such as this finished example:

Example of a complex transformation.

This example mapping does the following:

  • Reads an opportunity object from Salesforce and an account table from SQL Server.
  • Joins the two using account ID as the key.
  • Does a lookup for contact information from Salesforce.
  • After the lookup, aggregates the number of opportunities by account and then inserts them into BigQuery.

To create a new mapping:

  1. Go to the Data Integration service and click New.
  2. Go to Mappings > Mapping, and then click Create.

    Creating a mapping.

  3. Drag the transformation from the panel to the design surface, and then set properties in the bottom panel.

    Drag transformation from panel to design surface

  4. When you're done defining your mapping, click Save. The mapping is validated automatically. If any transformations have errors, they appear in the Validation panel; otherwise, the message "Mapping is valid" appears.

    Validating mapping.

  5. Click the More icon, and select New Mapping Task.

    New mapping task in menu.

  6. In the New Mapping Task window, complete the following steps:

    1. Enter a name in the Task Name field.
    2. From the Runtime Environment drop-down list, select your environment.
    3. Click Next.

    Details of new mapping task.

  7. Choose a schedule if needed, and click Finish.

    Scheduling new mapping task.

  8. To run the mapping task, click Run. The mapping task is queued to run.

Data Catalog

If your organization uses Enterprise Data Catalog (EDC), you can integrate your IICS org with the EDC instance. You can perform a search against the catalog to discover assets. Then you can use those assets as sources in synchronization tasks and as sources, targets, and lookup objects in mapping.

To configure the integration, go to the Administrator service and enter the EDC credentials in the Integration with Informatica Enterprise Data Catalog properties.

Configuring the Data Catalog integration.

For more information, see Data Catalog discovery for sources.

Pushdown optimization

Pushdown optimization lets you run transformation logic in source or target databases. By using pushdown optimization, you can improve task performance.

When a mapping task is configured for pushdown optimization, the task converts the transformation logic to an SQL query. The task sends the query statement to the database to execute the query. The amount of transformation logic that can be pushed down depends on the database, transformation logic, and task configuration. The task processes all transformation logic that it cannot push down to a database.
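
For example, a mapping that filters and aggregates source rows might be pushed down to BigQuery as a single SQL statement instead of being processed row by row by the Secure Agent. The following Python sketch (hypothetical table and column names) runs that kind of query directly in BigQuery to illustrate the idea.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Filter and aggregation logic expressed as SQL and executed inside
# BigQuery, which is what pushdown optimization aims for.
sql = """
    SELECT account_id, COUNT(*) AS opportunity_count
    FROM `my-project.my_dataset.opportunity`
    WHERE acctbal > 0
    GROUP BY account_id
"""
for row in client.query(sql).result():
    print(row.account_id, row.opportunity_count)
```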

You configure pushdown optimization in the Advanced Session Properties of a mapping task. For more information, see Pushdown optimization.

Performance tuning

Performance is a critical aspect of cloud data warehouse use cases. Informatica provides a native partitioning capability that runs multiple threads in parallel to execute a job. The following list shows some ways to optimize session performance and improve ETL load time.

  • Partitioning. Use Informatica Partitioning with the BigQuery connector to enable parallel threads for data loading. These settings are available in the Synchronization schedule task step and the advanced Mapping tab. For more information, see Partitioning for BigQuery.
  • RECORD data type. IICS supports the RECORD data type, which is a hierarchical data type. You can read and write RECORD data structures by using the BigQuery connector, which can parse nested and repeated fields.

When you create the connection, you select a connection mode that determines whether the data pipeline reads the RECORD data type as a hierarchical or flattened structure. For more information, see the Connection Modes topic.
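
To show what a RECORD (nested and repeated) structure looks like on the BigQuery side, here is a minimal Python sketch that defines such a schema with the google-cloud-bigquery client; the project, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

# A nested, repeated RECORD field: each account row carries a list of
# contact sub-records, the kind of hierarchy the connector can handle.
schema = [
    bigquery.SchemaField("account_id", "STRING"),
    bigquery.SchemaField(
        "contacts", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("name", "STRING"),
            bigquery.SchemaField("email", "STRING"),
        ],
    ),
]

client = bigquery.Client(project="my-project")  # hypothetical project ID
table = bigquery.Table("my-project.my_dataset.account_contacts", schema=schema)
client.create_table(table, exists_ok=True)
```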
