By Rajat Pandey, Principal Product Manager, and Eddy Widjaja, Solutions Architect at Informatica
This article provides guidance and best practices for migrating and integrating your hybrid datasets to BigQuery by using Informatica Intelligent Cloud Services (IICS).
Increased access to big data has generated new types of analysis and created new data-driven workloads. The amount of data involved often creates scalability challenges and increases operating costs for a traditional on-premises enterprise data warehouse (EDW). To address these challenges, companies are looking to migrate or extend their legacy systems to a cloud EDW like BigQuery.
By complementing an existing EDW, or migrating your entire data workload to BigQuery, you might be able to:
- Lower operational overhead and costs.
- Eliminate capital costs and expensive license renewals.
- Ensure scalability and elastic capacity.
- Take advantage of advanced capabilities, for example, machine learning.
With BigQuery, you don't need to worry about future workloads or lack of storage capacity. BigQuery's serverless NoOps architecture elastically scales under heavy demand, which can reduce query time. By eliminating infrastructure maintenance and realigning data resources, you can glean powerful insights at scale.
When moving data to BigQuery, you need to ensure data integrity between on-premises systems and BigQuery. Proven extract, transform, and load (ETL) procedures and reliable data quality and security services let you deliver, manage, govern, and synchronize your data with BigQuery. After you've put a process into place, transferring secure data to BigQuery helps you accelerate the migration of data workloads and helps to ensure data quality. Then, when your data is in BigQuery, you can take advantage of BigQuery's advanced analytic capabilities.
IICS is a reliable toolset for migrating on-premises workloads to BigQuery. The toolset has the following advantages:
- It is built on a microservices architecture.
- It offers automated data management and data integration.
- It has robust, pre-built connectivity to hundreds of on-premises and cloud-based systems.
Informatica allows organizations to design petabyte-scale data integrations, capturing data from complex hybrid networks and enabling enterprise-wide analysis.
The following sections describe the recommended steps for using IICS to integrate and migrate data into BigQuery.
Migrating to BigQuery using IICS
IICS is a multi-tenant integration platform as a service, a suite of cloud services that enable you to develop, execute, and govern data integration flows. IICS allows connections to on-premises and cloud-based processes, services, applications, and data within individual organizations or across multiple organizations.
Built on Informatica's single Intelligent Data Platform, IICS supports:
- All flavors of integration and data management—data, application, B2B, master data management, data quality, and data security.
- All data latencies—real-time, batch, and streaming.
- A rich set of traditional and new protocols.
IICS is a zero-code, metadata-driven rapid development platform. The underlying runtime engine is based on a modern microservices-based architecture that provides:
- An easy interface with new cloud services.
- Unified administration and monitoring.
- Seamless orchestration of integration flows.
You can deploy the Secure Agent, a scaled-down data integration (DI) engine, on-premises behind a firewall or in Google Cloud. This approach lets you integrate on-premises sources and applications with cloud applications, which helps prevent data proliferation and security challenges. Secure Agents can run in one or more clusters to support performance, scale, load balancing, and high availability. Alternatively, IICS can manage Secure Agents for you as hosted agents.
The following diagram shows a typical IICS use case.
Integration options
When you use the BigQuery connector, you can load data into BigQuery by using streaming mode or batch mode. The following diagram illustrates how the Informatica Secure Agent moves data in these modes.
For more information about Secure Agent and how to install it, see Installing Secure Agents.
Streaming mode
When the volume of data that you want to load is small (less than 10,000 rows), use streaming mode. In streaming mode, the Secure Agent reads data directly from the data source and streams it into BigQuery. You can configure the number of rows that the Secure Agent sends in one request.
Batch mode
When you want to load large volumes of data in a cost-efficient manner, use batch mode (Bulk mode in the connector's write properties). In batch mode, the Secure Agent first stages the data from the data source in a Cloud Storage bucket. The data is then read from the bucket and loaded into BigQuery. Additionally, if network bandwidth is limited, you can enable compression before the staging files are transferred to Cloud Storage. For more information, see the Informatica documentation about data compression and write modes.
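To make the two modes concrete, the sketch below shows roughly what they correspond to in BigQuery terms: streaming mode maps to streaming inserts, and batch mode maps to staging a file in Cloud Storage and running a load job. This is an illustrative Python sketch using the google-cloud-bigquery and google-cloud-storage clients; the project, table, bucket, and file names are placeholders, and IICS performs the equivalent work for you through the Secure Agent.

```python
from google.cloud import bigquery, storage

client = bigquery.Client(project="my-project")   # placeholder project
table_id = "my-project.sales.orders"             # placeholder target table

# Streaming mode: rows are sent to BigQuery in small requests and are
# available for querying almost immediately.
rows = [{"order_id": 1, "amount": 42.5}, {"order_id": 2, "amount": 17.0}]
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(errors)

# Batch (bulk) mode: the data is first staged in a Cloud Storage bucket,
# then loaded into BigQuery with a single load job.
bucket = storage.Client(project="my-project").bucket("my-staging-bucket")
bucket.blob("stage/orders.csv").upload_from_filename("orders.csv")

load_job = client.load_table_from_uri(
    "gs://my-staging-bucket/stage/orders.csv",
    table_id,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
    ),
)
load_job.result()  # wait for the load job to finish
```

The trade-off is the same one the connector exposes: streaming makes rows queryable almost immediately, while a staged bulk load is more cost-efficient for large volumes.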
Objectives
- Configure BigQuery and Cloud Storage connections.
- Integrate the data.
- Define data flow logic using the mapping designer.
Configuring BigQuery and Cloud Storage connections
Informatica provides connectors to numerous enterprise systems. You can explore the catalog of available connectors.
1. Log in to IICS and click Administrator.
2. Click Add-On Connectors.
3. Select the BigQuery and Cloud Storage connectors.
Setting up a BigQuery connection
1. To create a connection, select Connections, and click New Connection.
2. In the Type drop-down list, click Google BigQuery, and enter values for the required fields as listed in the following table.
| Connection property name | Description |
| --- | --- |
| Connection name | Name the connection. This name appears when you select the source or target connection in integration tasks and mappings. |
| Description | Enter a descriptive tag so that you can search for the connection later. |
| Type | Select Google BigQuery. |
| Runtime Environment | Select the Secure Agent that you want to run the tasks. |
| Service Account ID | Enter the service account ID for Google Cloud. You can find this ID in the JSON file that was downloaded when you created the service account; it is the value of the `client_email` key. Copy and paste the value without the enclosing double quotes. |
| Service Account Key | Enter the service account key. You can find this key in the same JSON file; it is the value of the `private_key` key. Copy and paste the value without the enclosing double quotes. |
| Project ID | Enter the Google Cloud project ID that you want to associate with the connector. |
| Storage Path | Enter the Cloud Storage path to use for staging. Include the `gs://` prefix in the path. |
| Connection Mode | Leave the default value. |
| Schema Definition File Path | Specify a path where IICS stores the JSON schema that it generates. |
| Dataset ID | Enter the BigQuery dataset ID. |
3. To verify the connection, click Test Connection. The connection is verified when the message "The test for this connection was successful" appears.
4. When the connection is verified, click Save.
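If you want to double-check which values to paste into the Service Account ID and Service Account Key fields, the following minimal Python sketch reads them from the downloaded key file (the file name `sa-key.json` is a placeholder):

```python
import json

# Placeholder file name; use the JSON key file that was downloaded when
# you created the service account.
with open("sa-key.json") as key_file:
    key = json.load(key_file)

# Paste these values into the connection properties, without the
# enclosing double quotes.
print("Service Account ID: ", key["client_email"])
print("Service Account Key:", key["private_key"])
```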
Setting up a Cloud Storage connection
1. In IICS, to create a connection, select Connections, and click New Connection.
2. In the Type drop-down list, click Google Cloud Storage V2, and enter values for the required fields as listed in the following table.
| Connection property name | Description |
| --- | --- |
| Connection name | Name the connection. This name appears when you select the source or target connection in integration tasks and mappings. |
| Description | Enter a descriptive tag so that you can search for the connection later. |
| Type | Select Google Cloud Storage. |
| Runtime Environment | Select the Secure Agent that you want to run the tasks. |
| Service Account ID | Enter the service account ID for Google Cloud. You can find this ID in the JSON file that was downloaded when you created the service account; it is the value of the `client_email` key. Copy and paste the value without the enclosing double quotes. |
| Service Account Key | Enter the service account key. You can find this key in the same JSON file; it is the value of the `private_key` key. Copy and paste the value without the enclosing double quotes. |
| Project ID | Enter the Google Cloud project ID that you want to associate with the connector. |
3. To verify the connection, click Test Connection. The connection is verified when the message "The test for this connection was successful" appears.
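If the connection test fails, you can confirm outside of IICS that the service account can write to the staging bucket. The following is a minimal sketch using the google-cloud-storage Python client; the key file and bucket names are placeholders, and this check is not part of IICS itself:

```python
from google.cloud import storage

# Placeholder names; use your own key file and staging bucket.
client = storage.Client.from_service_account_json("sa-key.json")
bucket = client.bucket("my-iics-staging-bucket")

# Write and delete a small probe object to confirm that the service
# account has write access to the staging bucket.
blob = bucket.blob("iics-connection-probe.txt")
blob.upload_from_string("probe")
blob.delete()
print("The service account can write to the staging bucket.")
```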
Integrating the data
Informatica provides a six-step Synchronization Task wizard for creating simple data integration tasks. The wizard lets you use expression and lookup transformations in the field mapping step.
Define the data integration task
1. Go to the Data Integration service, and then click New.
2. Click Synchronization Task, and then click Create.
3. Enter the Task Name.
4. For the Task Operation, select Insert to insert into the target, and then click Next.
Choose the source
1. From the Connection drop-down list, select the data source. To define a new connection, click New.
2. Select the Source Type:
   - Single represents a single table or object.
   - Multiple represents multiple tables or objects. When you select Multiple, you either select an existing relationship or establish the relationship manually. Because the join is done in the source database or application, this method gives better performance.
   - Saved Query lets you use a custom query to select the source data.
3. Select the Source Object.
4. Click Next.
Choose the target
1. From the Connection drop-down list, select Google BigQuery as the target. If your BigQuery target doesn't have a table that you want to insert into, click Create Target to create the table. To view the first 10 records of the selected table, click Show Data Preview. Data preview is controlled by a user-level permission in IICS.
2. Click Next.
Filter incoming source data
1. To add a filter, click New. For example, you can filter to show only ACCTBAL values greater than $0.
2. To close the Data Filter window, click OK, and then click Next.
Map source fields to target fields
IICS automatically maps source fields to target fields by name, as shown in the following screenshot. Source fields are on the left, and Target fields are on the right. You can drag the Source field to the Target field.
In the Actions column, click fx to transform the field data by using an expression.
To perform a lookup, click the Search icon. A lookup is a search for a field from another data source; this field is then used in the mapping.
To clear the mapping for a field, click the Close icon.
This example shows how to transform the Name field into proper case. There are numerous functions and operators available for transforming incoming fields.
In the Target column, click the Search icon next to Address to perform a lookup for this field. The fields are blank in the following screenshot because there is no lookup in this example.
After exploring the Field Lookup window, click Cancel, and then click Next.
Schedule
In the scheduling step, you configure the following:
- A schedule for running the task.
- Email notification.
- Pre- and post-processing commands.
- Advanced properties for BigQuery.
To schedule a task:
1. In the Schedule step, define a one-time or recurring schedule for executing the task, and then click New.
2. In the New Schedule window, schedule when and how often a job runs, and then click OK.
3. Scroll to view the BigQuery Advanced target properties fields.
When you run a task with a BigQuery target, advanced properties let you control the outcome.
| Advanced property name | Description |
| --- | --- |
| Target Dataset ID | Optional field used to override the target dataset ID specified in the BigQuery connection. |
| Target Table Name | Optional field used to override the target table name specified in the synchronization task. |
| Create Disposition | Controls behavior when the target table does not exist: Create never returns an error; Create if needed creates the table. |
| Write Disposition | Defines how records are inserted when the target table exists. The default, Write append, appends to the existing table. Write empty inserts only if the target table is empty; otherwise, the write operation returns an error. Write truncate deletes existing records before the new records are inserted into the target table. Applies to Bulk write mode only. |
| Write Mode | The default, Bulk mode, inserts records into the target in bulk. Streaming mode writes row by row and lets you query data from the target without delay. |
| Streaming Template Table Suffix | Suffix name for the templateSuffix BigQuery parameter. The template table is used when streaming data into BigQuery. This option lets you split a logical table into many smaller tables without adding complex code. Applies to Streaming write mode only. |
| Rows per Streaming Request | Number of rows per streaming request. Default is 500 rows; maximum row size is 10 MB. Applies to Streaming write mode only. |
| Staging File Name | Overrides the default staging file name. The default file name is Informatica_Stage_File_[unix timestamp].[data format], where the data format is JSON or CSV. Applies to Bulk write mode only. |
| Data Format of the Staging File | Data format of the Cloud Storage staging files: JSON (which supports nested and repeated fields) or CSV. |
| Persist Staging File After Loading | Keeps the staging file in Cloud Storage after loading. By default, staging files are deleted. Applies to Bulk write mode only. |
| Enable Staging File Compression | Compresses staging files to reduce storage cost and transfer time from the Secure Agent to Cloud Storage. |
| Job Poll Interval in Seconds | Number of seconds between polls for the status of a job. Default is 10. |
| Number of Threads for Uploading Staging File | If you increase this value from the default of 1, also increase the Secure Agent's DTM JVMOption3 heap size. Applies to Bulk write mode only. |
| Local Stage File Directory | If you are using a local Secure Agent, the directory that stores temporary files before they are written to the Cloud Storage staging file. Applies to Bulk write mode only. |
| Allow Quoted Newlines | Allows newline characters inside quoted values in a single record in a CSV file. |
| Field Delimiter | Default field delimiter is a comma (,). |
| Allow Jagged Rows | Allows rows without trailing columns in a CSV file. By default, jagged rows are not allowed. |
| pre SQL | SQL statement to run before writing data to the target. |
| post SQL | SQL statement to run after writing data to the target. |
| pre SQL Configuration | SQL configuration for the pre SQL statement. |
| post SQL Configuration | SQL configuration for the post SQL statement. |
| Success File Directory | Not applicable. |
| Error File Directory | Not applicable. |

4. Click Finish, and then click Run to run the synchronization task.
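Several of these properties map closely to standard BigQuery load-job settings, which can help when reasoning about their effect. The following Python sketch shows rough equivalents for Create Disposition, Write Disposition, Field Delimiter, Allow Quoted Newlines, and Allow Jagged Rows; the project, staging file, and table names are placeholders, and in practice IICS issues the load job for you in Bulk write mode.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter=",",                  # Field Delimiter
    allow_quoted_newlines=True,           # Allow Quoted Newlines
    allow_jagged_rows=False,              # Allow Jagged Rows (disallowed by default)
    # Create Disposition: Create if needed; Write Disposition: Write append.
    create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://my-staging-bucket/Informatica_Stage_File_example.csv",  # placeholder staging file
    "my-project.sales.orders",                                    # placeholder target table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```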
Defining data flow logic using mapping
When you need complex transformations and the six-step wizard is insufficient, use mapping.
Mapping offers built-in transformations that aren't limited to just expressions and lookups. In addition, you can build mappings that have multiple data sources as well as multiple targets.
The following transformations are available:
- Source
- Target
- Aggregator
- Data Masking
- Expression
- Filter
- Hierarchy Builder
- Hierarchy Parser
- Java
- Joiner
- Lookup
- Mapplet
- Normalizer
- Rank
- Router
- Sequence
- Sorter
- SQL
- Structure Parser
- Union
- Web Services
The following section explains how to create a complex transformation, such as this finished example:
This example mapping does the following (a rough code sketch of the equivalent dataflow follows the list):
- Reads an opportunity object from Salesforce and an account table from SQL Server.
- Joins the two using account ID as the key.
- Does a lookup for contact information from Salesforce.
- After the lookup, aggregates the number of opportunities by account and then inserts them into BigQuery.
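For orientation only, the following Python sketch reproduces the join, aggregation, and BigQuery insert by using pandas and the google-cloud-bigquery client. The lookup step is omitted, the small DataFrames stand in for the Salesforce and SQL Server extracts, and all names are placeholders.

```python
import pandas as pd
from google.cloud import bigquery

# Placeholder data standing in for the Salesforce opportunity extract and
# the SQL Server account table.
opportunities = pd.DataFrame(
    {"opportunity_id": [1, 2, 3], "account_id": ["A1", "A1", "B2"], "amount": [100, 250, 75]}
)
accounts = pd.DataFrame({"account_id": ["A1", "B2"], "account_name": ["Acme", "Bravo"]})

# Joiner transformation: join the two sources on account ID.
joined = opportunities.merge(accounts, on="account_id")

# Aggregator transformation: count opportunities per account.
summary = joined.groupby(["account_id", "account_name"], as_index=False).agg(
    opportunity_count=("opportunity_id", "count")
)

# Target: insert the aggregated result into BigQuery (placeholder table name).
client = bigquery.Client(project="my-project")
client.load_table_from_dataframe(summary, "my-project.sales.opportunity_summary").result()
```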
To create a new mapping:
1. Go to the Data Integration service, and click New.
2. Go to Mappings > Mapping, and then click Create.
3. Drag each transformation from the panel to the design surface, and then set its properties in the bottom panel.
4. When you're done defining your mapping, click Save. The mapping is automatically validated. If any transformations have errors, they appear in the Validation panel; otherwise, the message "Mapping is valid" appears.
5. Click More, and select New Mapping Task.
6. In the New Mapping Task window, complete the following steps:
   - Enter a name in the Task Name field.
   - From the Runtime Environment drop-down list, select your environment.
   - Click Next.
7. Choose a schedule if needed, and click Finish.
8. To run the mapping task, click Run. The mapping task is queued to run.
Data Catalog
If your organization uses Enterprise Data Catalog (EDC), you can integrate your IICS org with the EDC instance. You can perform a search against the catalog to discover assets. Then you can use those assets as sources in synchronization tasks and as sources, targets, and lookup objects in mapping.
To configure the integration, go to the Administrator service and enter the EDC credentials in the Integration with Informatica Enterprise Data Catalog properties.
For more information, see Data Catalog discovery for sources.
Pushdown optimization
Pushdown optimization lets you run transformation logic in source or target databases. By using pushdown optimization, you can improve task performance.
When a mapping task is configured for pushdown optimization, the task converts the transformation logic to an SQL query. The task sends the query statement to the database to execute the query. The amount of transformation logic that can be pushed down depends on the database, transformation logic, and task configuration. The task processes all transformation logic that it cannot push down to a database.
You configure pushdown optimization in the Advanced Session Properties of the mapping task. For more information, see Pushdown optimization.
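As an illustration of the idea (not the SQL that Informatica generates), a mapping that filters and aggregates data from a BigQuery source could be pushed down as a single SQL statement that BigQuery executes instead of the Secure Agent. The table and column names in this Python sketch are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# A Filter plus an Aggregator, expressed as one statement that runs in
# BigQuery rather than in the Secure Agent.
sql = """
    SELECT account_id, COUNT(*) AS opportunity_count
    FROM `my-project.sales.opportunities`
    WHERE amount > 0
    GROUP BY account_id
"""
for row in client.query(sql).result():
    print(row.account_id, row.opportunity_count)
```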
Performance tuning
Performance is critical for cloud data warehouse use cases. Informatica offers a native partitioning capability that runs a job with multiple parallel threads. The following list shows some ways to optimize session performance and improve ETL load time.
- Partitioning. Use Informatica partitioning with the BigQuery connector to enable parallel threads for data loading. These settings are available in the synchronization task Schedule step and on the advanced Mapping tab. For more information, see Partitioning for BigQuery.
- RECORD data type. IICS supports the BigQuery RECORD data type, which is a hierarchical data type. You can read and write RECORD data structures by using the BigQuery connector, which provides parsing support for nested and repeated fields. When you create the connection, the connection mode that you select determines whether the data pipeline reads the RECORD data type as a hierarchical or flat structure. For more information, see the Connection Modes topic.
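For reference, a RECORD field in BigQuery is a nested, optionally repeated structure. The following Python sketch creates a table with a RECORD column by using the google-cloud-bigquery client; all names are placeholders, and this step is independent of IICS:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

schema = [
    bigquery.SchemaField("customer_id", "STRING"),
    # RECORD (hierarchical) field; REPEATED allows several addresses per row.
    bigquery.SchemaField(
        "addresses",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("city", "STRING"),
            bigquery.SchemaField("postal_code", "STRING"),
        ],
    ),
]

table = bigquery.Table("my-project.sales.customers", schema=schema)
client.create_table(table)  # creates the table with the nested column
```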
Appendix: Selecting SAP as a data source in Informatica Intelligent Cloud Services
This section provides guidance for integrating SAP data sources by using Informatica Intelligent Cloud Services (IICS).
The Informatica SAP connectivity user guide provides details on SAP connectivity.
Configuring SAP connection
1. Log in to IICS and click Administrator.
2. Click Add-On Connectors.
3. Select the SAP Table Reader connector.
Setting up an SAP Table connection
1. To create a connection, select Connections, and click New Connection.
2. In the Type drop-down list, click SAP Table Connector, and enter values for the required fields as listed in the following table.
| Connection property name | Description |
| --- | --- |
| Runtime Environment | Runtime environment that contains the Secure Agent that you want to use to access SAP tables. |
| Username | SAP user name with the appropriate user authorization. |
| Password | SAP password. |
| Client | SAP client number. |
| Language | Language code that corresponds to the SAP language. |
| Saprfc.ini Path | Local directory that contains the `sapnwrfc.ini` file. To write to SAP tables, use the following directory: `Informatica Secure Agent installation directory/main/bin/rdtm`. |
| Destination | DEST entry that you specified in the `sapnwrfc.ini` file for the SAP application server. The destination is case sensitive; use all uppercase letters for the destination. |
| Port Range | HTTP port range. The SAP Table connection uses the specified port numbers to connect to SAP tables by using the HTTP protocol. Specify valid numbers to prevent connection errors. Default: `10000-65535`. Enter a range within the default range, for example, `10000-20000`. When a range is outside the default range, the connection uses the default range. |
| Test Streaming | Tests the connection. When selected, the connection is tested by using both the RFC and HTTP protocols. When not selected, the connection is tested by using only the HTTP protocol. |
| Https Connection | When selected, connects to SAP through the HTTPS protocol. To connect to SAP through HTTPS, verify that an administrator has configured the machines that host the Secure Agent and the SAP system. |
| Keystore Location | The absolute path to the JKS keystore file. |
| Keystore Password | The destination password specified for the JKS file. |
| Private Key Password | The export password specified for the P12 file. |

3. To verify the connection, click Test Connection. The connection is verified when the message "The test for this connection was successful" appears.
4. When the connection is verified, click Save.
What's next
- Try Informatica Intelligent Cloud Services for BigQuery for yourself. Check out the Test Drive or Free Trial.
- Access the Informatica Intelligent Cloud Services website.
- Access the Cloud Integration community.
- To find resources on using Cloud Application Integration (the Informatica Cloud Real Time service), access the community.
- Learn more and share tips in the Cloud Developer community.
- Access the knowledge base. If you have questions, comments, or ideas about the knowledge base, contact the Informatica Knowledge Base team.
- Learn about Transitioning from data warehousing in Teradata to Google Cloud big data services.