Salesforce batch source

This page describes how to load data from Salesforce sObjects into Google Cloud with Cloud Data Fusion. The Salesforce batch source plugin reads sObjects from Salesforce. The sObjects are the Salesforce tables that you plan to pull from. Some examples of sObjects include opportunities, contacts, accounts, leads, and custom objects.

The Salesforce batch source plugin supports reading data with Salesforce Object Query Language (SOQL) queries and incremental or range date filters.

Before you begin

The Cloud Data Fusion Salesforce batch source uses the Salesforce Bulk API V1.
Set up Salesforce before deploying and using Cloud Data Fusion Salesforce plugin. For more information, see Create a Salesforce connected app for Cloud Data Fusion.

Configure the plugin

Go to the Cloud Data Fusion web interface and click Studio.
Check that Data Pipeline - Batch is selected (not Realtime).
In the Source menu, click Salesforce. The Salesforce node appears in your pipeline. If you don't see the Salesforce source on the Studio page, deploy the Salesforce plugins from the Cloud Data Fusion Hub.
To configure the source, go to the Salesforce node and click Properties.
Enter the following properties. For a complete list, see Properties.
1. Enter a Label for the Salesforce node—for example, Salesforce tables.
2. Enter a Reference name for the Salesforce source for lineage—for example, Salesforce tables.
3. Enter the connection details. You can set up a new, one-time connection, or an existing, reusable connection.
  New connection
  To add a one-time connection to Salesforce, follow these steps:
  1. Keep Use connection turned off.
  2. In the Connection section, enter the following information from the Salesforce account in these fields:
    
    Username
    
    Password
    
    Security token
    
    Consumer key
    
    Consumer secret
    
    To get the credentials from Salesforce, see Get properties from Salesforce.
  Reusable connection
  To reuse an existing connection, follow these steps:
  1. Turn on Use connection.
  2. Click Browse connections.
  3. Click the connection name.
    
    Note: For more information about adding, importing, and editing the connections that appear when you browse connections, see Manage connections.
  4. Optional: If a connection doesn't exist and you want to create a new reusable connection, click Add connection and refer to the steps in the New connection tab.
4. Enter the SObject name to load all the columns from the object.
5. Optional: If you select the sObject name, you can filter the data with the following fields:
  - Last modified after: Only include records that were last modified after a given time.
  - Last modified before: Only include records that were last modified earlier than a given time.
  - Duration: Only include records that were last modified within a time window of the specified size.
  - Offset: Only include records where the Last modified date is less than the logical start time of the pipeline, minus the given offset.
6. Optional: For supported sObjects, to improve the performance of a pipeline, turn on Enable PK chunking. For more information, see Improve performance with PK chunking.
7. Optional: Instead of specifying the sObject Name, you can enter a SOQL query, such as SELECT LastName from Contact. For more information, see SOQL queries for the Salesforce source.
8. To test connectivity, click Get schema. Cloud Data Fusion connects to Salesforce and pulls the schema for the listed table (technically, an sObject).

Properties

Property	Macro enabled	Required property	Description
Reference name	No	Yes	Used to uniquely identify this source for tasks, such as lineage, annotating metadata.
Use connection	No	No	Use an existing connection. If a connection is used, you don't need to provide the credentials.
Browse connections	Yes	No	Name of the connection to use.
Username	Yes	Yes	Salesforce username.
Password	Yes	Yes	Salesforce password.
Security token	Yes	No	Salesforce security token. If the password doesn't contain the security token Cloud Data Fusion appends the token before authenticating with Salesforce.
Consumer key	Yes	Yes	Application Consumer Key. This is also called the OAuth client ID. To obtain a consumer key, create a Salesforce connected application.
Consumer secret	Yes	Yes	Application Consumer Secret. This is also called the OAuth client secret. To obtain a client secret create a Salesforce connected application.
Login URL	Yes	Yes	Salesforce OAuth2 login URL. The default is `https://login.salesforce.com/services/oauth2/token`
Connection timeout	Yes	No	Maximum time, in milliseconds, to wait for connection initialization before it times out. Default is 30000 milliseconds.
Proxy URL	Yes	No	Proxy URL, which contain a protocol, address, and port.
SOQL	Yes	No	An SOQL query to fetch data into source. Examples: `SELECT Id, Name, BillingCity FROM Account` `SELECT Id FROM Contact WHERE Name LIKE 'A%' AND MailingCity = 'California'`
SObject name	Yes	No	Salesforce object name to read. If value is provided, the connector gets all fields for this object from Salesforce and generates a SOQL query, such as `select FIELD_1, FIELD_2 from ${sObjectName}`. Ignored if SOQL query is provided. There are also sObjects that aren't supported in the Salesforce Bulk API. Creating a job with an unsupported object causes the following error: `Entity is not supported by the Bulk API`. These objects also aren't supported by Einstein Analytics, which uses Bulk API for querying data. Cases when BULK API isn't used: When query length sends the query to Salesforce to receive the array of batch information, the Bulk API isn't used in one case. If the query is within the limit, it executes the original query. Otherwise, it switches to wide object logic. For example, it generates an ID query that only retrieves batch information for IDs that will be used later for retrieving data through the SOAP API.
Last modified after	Yes	No	Filter data to only include records where the system field, `LastModifiedDate`, is greater than or equal to the specified date. Use the Salesforce date format (see examples). If no value is provided, no lower-bound date is applied.
Last modified before	Yes	No	Filter data to only include records where the system field, `LastModifiedDate`, is less than the specified date. Use the Salesforce date format (see examples). Specifying this value with the Last modified after property lets you read data that was modified within a window of time. If no value is provided, no upper-bound date is applied.
Duration	Yes	No	Filter data to only read records that were last modified within a window of time of the specified size. For example, if the duration is `6 hours` and the pipeline runs at 9 AM, it will read data that was last updated from 3 AM (inclusive) to 9 AM (exclusive). Specify the duration with numbers and the following time units: `seconds` `minutes` `hours` `days` `months` `years` Several units can be specified, but each unit can only be used once. For example, `2 days, 1 hours, 30 minutes`. If a value is already specified for Last modified after or Last modified before, the duration is ignored.
Offset	Yes	No	Filter data to only read records where the system field, `LastModifiedDate`, is less than the logical start time of the pipeline, minus the given offset. For example, if duration is `6 hours`, the offset is `1 hours`, and the pipeline runs at 9 AM, data that was last modified between 2 AM (inclusive) and 8 AM (exclusive) is read. Specify the duration using numbers and the following units of time: `seconds` `minutes` `hours` `days` `months` `years` Several units can be specified, but each unit can only be used once. For example, `2 days, 1 hours, 30 minutes`. If a value is already specified for Last modified after or Last modified before, the offset is ignored.
SOQL operation type	No	No	Specify the query operation to run on the table. If a query is selected, only current records are returned. Selecting `queryAll` returns all current and deleted records. The default operation is `query`.
Enable PK chunking	Yes	No	Primary key (PK) chunking splits a query on large tables into pieces, or chunks, based on record IDs, or primary keys, of the queried records. Salesforce recommends that you enable PK chunking when querying tables with more than 10 million records, or when a bulk query constantly times out. For more information, see PK chunking. PK chunking only works with queries that don't include `SELECT` clauses, or conditions other than `WHERE`. Chunking is supported for custom objects and any Sharing and History tables that support standard objects.
Chunk size	Yes	No	Specify size of chunk. Maximum size is 250,000. Default size is 100,000.
SObject parent name	Yes	No	Parent of the Salesforce Object. This is used to enable chunking for history tables or shared objects.

Salesforce date format examples

Format syntax	Example
`YYYY-MM-DDThh:mm:ss+hh:mm`	`1999-01-01T23:01:01+01:00`
`YYYY-MM-DDThh:mm:ss-hh:mm`	`1999-01-01T23:01:01-08:00`
`YYYY-MM-DDThh:mm:ssZ`	`1999-01-01T23:01:01Z`

Data type mappings

The following table is a list of Salesforce data types with corresponding CDAP types.

Salesforce data type	CDAP schema data type
`_bool`	`boolean`
`_int`	`int`
`_long`	`long`
`_double`, `currency`, `percent`, `geolocation` (latitude), `geolocation` (longitude)	`double`
`date`	`date`
`datetime`	`timestamp` (microseconds)
`time`	`time` (microseconds)
`picklist`	`string`
`multipicklist`	`string`
`combobox`	`string`
`reference`	`string`
`base64`	`string`
`textarea`	`string`
`phone`	`string`
`id`	`string`
`url`	`string`
`email`	`string`
`encryptedstring`	`string`
`datacategorygroupreference`	`string`
`location`	`string`
`address`	`string`
`anyType`	`string`
`json`	`string`
`complexvalue`	`string`

Use cases

See the following use cases for the Salesforce batch source:

Best practices

For more information about improving performance in the Salesforce batch source, see the best practices.

Release notes

What's next

Work through a Salesforce plugin tutorial.