Reference documentation and code samples for the google-cloud-bigquery class Google::Cloud::Bigquery::External::DataSource.
DataSource
External::DataSource and its subclasses represent an external data source that can be queried directly, even though the data is not stored in BigQuery. Instead of loading or streaming the data, this object references the external data source.
The AVRO and Datastore Backup formats use DataSource. See CsvSource, JsonSource, SheetsSource, and BigtableSource for the other formats.
Inherits
- Object
Examples
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

avro_url = "gs://bucket/path/to/*.avro"
avro_table = bigquery.external avro_url do |avro|
  avro.autodetect = true
end

data = bigquery.query "SELECT * FROM my_ext_table", external: { my_ext_table: avro_table }

# Iterate over the first page of results
data.each do |row|
  puts row[:name]
end
# Retrieve the next page of results
data = data.next if data.next?
Hive partitioning options:
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

gcs_uri = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/*"
source_uri_prefix = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/"
external_data = bigquery.external gcs_uri, format: :parquet do |ext|
  ext.hive_partitioning_mode = :auto
  ext.hive_partitioning_require_partition_filter = true
  ext.hive_partitioning_source_uri_prefix = source_uri_prefix
end

external_data.hive_partitioning? #=> true
external_data.hive_partitioning_mode #=> "AUTO"
external_data.hive_partitioning_require_partition_filter? #=> true
external_data.hive_partitioning_source_uri_prefix #=> source_uri_prefix
Methods
#autodetect
def autodetect() -> Boolean
Indicates if the schema and format options are detected automatically.
- (Boolean)
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

csv_url = "gs://bucket/path/to/data.csv"
csv_table = bigquery.external csv_url do |csv|
  csv.autodetect = true
end

csv_table.autodetect #=> true
#autodetect=
def autodetect=(new_autodetect)
Set whether to detect schema and format options automatically. Any option specified explicitly will be honored.
- new_autodetect (Boolean) — New autodetect value
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

csv_url = "gs://bucket/path/to/data.csv"
csv_table = bigquery.external csv_url do |csv|
  csv.autodetect = true
end

csv_table.autodetect #=> true
#avro?
def avro?() -> Boolean
Whether the data format is "AVRO".
- (Boolean)
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

avro_url = "gs://bucket/path/to/*.avro"
avro_table = bigquery.external avro_url

avro_table.format #=> "AVRO"
avro_table.avro? #=> true
#backup?
def backup?() -> Boolean
Whether the data format is "DATASTORE_BACKUP".
- (Boolean)
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

backup_url = "gs://bucket/path/to/data.backup_info"
backup_table = bigquery.external backup_url

backup_table.format #=> "DATASTORE_BACKUP"
backup_table.backup? #=> true
#bigtable?
def bigtable?() -> Boolean
Whether the data format is "BIGTABLE".
- (Boolean)
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

bigtable_url = "https://googleapis.com/bigtable/projects/..."
bigtable_table = bigquery.external bigtable_url

bigtable_table.format #=> "BIGTABLE"
bigtable_table.bigtable? #=> true
#compression
def compression() -> String
The compression type of the data source. Possible values include "GZIP" and nil. The default value is nil. This setting is ignored for Google Cloud Bigtable, Google Cloud Datastore backups and Avro formats. Optional.
- (String)
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

csv_url = "gs://bucket/path/to/data.csv"
csv_table = bigquery.external csv_url do |csv|
  csv.compression = "GZIP"
end

csv_table.compression #=> "GZIP"
#compression=
def compression=(new_compression)
Set the compression type of the data source. Possible values include "GZIP" and nil. The default value is nil. This setting is ignored for Google Cloud Bigtable, Google Cloud Datastore backups and Avro formats. Optional.
- new_compression (String) — New compression value
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

csv_url = "gs://bucket/path/to/data.csv"
csv_table = bigquery.external csv_url do |csv|
  csv.compression = "GZIP"
end

csv_table.compression #=> "GZIP"
#csv?
def csv?() -> Boolean
Whether the data format is "CSV".
- (Boolean)
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

csv_url = "gs://bucket/path/to/data.csv"
csv_table = bigquery.external csv_url

csv_table.format #=> "CSV"
csv_table.csv? #=> true
#format
def format() -> String
The data format. For CSV files, specify "CSV". For Google Sheets, specify "GOOGLE_SHEETS". For newline-delimited JSON, specify "NEWLINE_DELIMITED_JSON". For Avro files, specify "AVRO". For Google Cloud Datastore backups, specify "DATASTORE_BACKUP". [Beta] For Google Cloud Bigtable, specify "BIGTABLE".
- (String)
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

csv_url = "gs://bucket/path/to/data.csv"
csv_table = bigquery.external csv_url

csv_table.format #=> "CSV"
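In the samples on this page, the reported format follows from the URL whenever no format: argument is passed, while ORC and Parquet sources pass format: explicitly. As a rough illustration of that mapping, the sketch below guesses a format string from a URL. The helper name guess_external_format and its matching rules are assumptions made for this example, inferred only from the sample URLs shown here; they are not the library's actual detection logic.

```ruby
# Hypothetical helper (not part of google-cloud-bigquery): guesses the
# external data format from a URL, using only the patterns that appear
# in the samples on this page. Returns nil when nothing matches.
def guess_external_format(url)
  case url
  when %r{\Ahttps://docs\.google\.com/spreadsheets/} then "GOOGLE_SHEETS"
  when %r{\Ahttps://googleapis\.com/bigtable/projects/} then "BIGTABLE"
  when /\.backup_info\z/ then "DATASTORE_BACKUP"
  when /\.avro\z/ then "AVRO"
  when /\.csv\z/ then "CSV"
  when /\.json\z/ then "NEWLINE_DELIMITED_JSON"
  end
end
```

A URL that matches none of the patterns, such as one ending in .parquet, yields nil in this sketch, which corresponds to the cases where the samples pass format: explicitly.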
#hive_partitioning?
def hive_partitioning?() -> Boolean
Checks if hive partitioning options are set.
Not all storage formats support hive partitioning. Requesting hive partitioning on an unsupported format will lead to an error. Currently supported types include: avro, csv, json, orc and parquet.
If your data is stored in ORC or Parquet on Cloud Storage, see Querying columnar formats on Cloud Storage.
- (Boolean) — true when hive partitioning options are set, or false otherwise.
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

gcs_uri = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/*"
source_uri_prefix = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/"
external_data = bigquery.external gcs_uri, format: :parquet do |ext|
  ext.hive_partitioning_mode = :auto
  ext.hive_partitioning_require_partition_filter = true
  ext.hive_partitioning_source_uri_prefix = source_uri_prefix
end

external_data.hive_partitioning? #=> true
external_data.hive_partitioning_mode #=> "AUTO"
external_data.hive_partitioning_require_partition_filter? #=> true
external_data.hive_partitioning_source_uri_prefix #=> source_uri_prefix
#hive_partitioning_mode
def hive_partitioning_mode() -> String, nil
The mode of hive partitioning to use when reading data. The following modes are supported:
- AUTO: automatically infer partition key name(s) and type(s).
- STRINGS: automatically infer partition key name(s). All types are interpreted as strings.
- CUSTOM: partition key schema is encoded in the source URI prefix.
- (String, nil) — The mode of hive partitioning, or nil if not set.
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

gcs_uri = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/*"
source_uri_prefix = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/"
external_data = bigquery.external gcs_uri, format: :parquet do |ext|
  ext.hive_partitioning_mode = :auto
  ext.hive_partitioning_require_partition_filter = true
  ext.hive_partitioning_source_uri_prefix = source_uri_prefix
end

external_data.hive_partitioning? #=> true
external_data.hive_partitioning_mode #=> "AUTO"
external_data.hive_partitioning_require_partition_filter? #=> true
external_data.hive_partitioning_source_uri_prefix #=> source_uri_prefix
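To make the difference between the AUTO and STRINGS modes concrete, here is a minimal pure-Ruby sketch that reads key=value path segments from a URI. The helper names partition_keys and infer_value, the sample path, and in particular the type-inference rules (integers and ISO dates only) are assumptions for illustration; this reference does not describe how BigQuery itself infers types, and CUSTOM mode is not modeled.

```ruby
require "date"

# Hypothetical helper: collects key=value path segments found after the
# given prefix. In :strings mode every value stays a String; in :auto
# mode a simplified, assumed type inference is applied.
def partition_keys(uri, prefix, mode: :strings)
  relative = uri.delete_prefix(prefix)
  pairs = relative.split("/")
                  .select { |segment| segment.include?("=") }
                  .map { |segment| segment.split("=", 2) }
  pairs.to_h do |key, value|
    [key, mode == :auto ? infer_value(value) : value]
  end
end

# Assumed inference rules for this sketch only: integers, then ISO 8601
# dates, otherwise the value is left as a String.
def infer_value(value)
  return Integer(value, 10) if value.match?(/\A-?\d+\z/)
  return Date.iso8601(value) if value.match?(/\A\d{4}-\d{2}-\d{2}\z/)

  value
end
```

For a path such as gs://bucket/path_to_table/dt=2019-01-01/country=BR/id=7/file.avro with the prefix gs://bucket/path_to_table, the :strings call returns three String values, while the :auto call returns a Date for dt and an Integer for id.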
#hive_partitioning_mode=
def hive_partitioning_mode=(mode)
Sets the mode of hive partitioning to use when reading data. The following modes are supported:
- auto: automatically infer partition key name(s) and type(s).
- strings: automatically infer partition key name(s). All types are interpreted as strings.
- custom: partition key schema is encoded in the source URI prefix.
Not all storage formats support hive partitioning. Requesting hive partitioning on an unsupported format will lead to an error. Currently supported types include: avro, csv, json, orc and parquet.
If your data is stored in ORC or Parquet on Cloud Storage, see Querying columnar formats on Cloud Storage.
See #format, #hive_partitioning_require_partition_filter= and #hive_partitioning_source_uri_prefix=.
- mode (String, Symbol) — The mode of hive partitioning to use when reading data.
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

gcs_uri = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/*"
source_uri_prefix = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/"
external_data = bigquery.external gcs_uri, format: :parquet do |ext|
  ext.hive_partitioning_mode = :auto
  ext.hive_partitioning_require_partition_filter = true
  ext.hive_partitioning_source_uri_prefix = source_uri_prefix
end

external_data.hive_partitioning? #=> true
external_data.hive_partitioning_mode #=> "AUTO"
external_data.hive_partitioning_require_partition_filter? #=> true
external_data.hive_partitioning_source_uri_prefix #=> source_uri_prefix
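The sample above assigns the Symbol :auto and reads back the String "AUTO", and the setter accepts either a String or a Symbol. One way to picture that conversion is the sketch below. The helper normalize_hive_mode is hypothetical, and the validation step is an added assumption: this reference does not say whether the library rejects unknown modes on assignment.

```ruby
# Hypothetical helper (not part of google-cloud-bigquery): converts a
# String or Symbol mode into the upper-case form the getter reports.
# Rejecting unknown modes here is an assumption of this sketch.
def normalize_hive_mode(mode)
  normalized = mode.to_s.upcase
  unless %w[AUTO STRINGS CUSTOM].include?(normalized)
    raise ArgumentError, "unsupported hive partitioning mode: #{mode.inspect}"
  end

  normalized
end
```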
#hive_partitioning_require_partition_filter=
def hive_partitioning_require_partition_filter=(require_partition_filter)
Sets whether queries over the table using this external data source must specify a partition filter that can be used for partition elimination.
See #format, #hive_partitioning_mode= and #hive_partitioning_source_uri_prefix=.
- require_partition_filter (Boolean) — true if a partition filter must be specified, false otherwise.
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

gcs_uri = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/*"
source_uri_prefix = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/"
external_data = bigquery.external gcs_uri, format: :parquet do |ext|
  ext.hive_partitioning_mode = :auto
  ext.hive_partitioning_require_partition_filter = true
  ext.hive_partitioning_source_uri_prefix = source_uri_prefix
end

external_data.hive_partitioning? #=> true
external_data.hive_partitioning_mode #=> "AUTO"
external_data.hive_partitioning_require_partition_filter? #=> true
external_data.hive_partitioning_source_uri_prefix #=> source_uri_prefix
#hive_partitioning_require_partition_filter?
def hive_partitioning_require_partition_filter?() -> Boolean
Whether queries over the table using this external data source must specify a partition filter that can be used for partition elimination. Note that this field should only be true when creating a permanent external table or querying a temporary external table.
- (Boolean) — true when queries over this table require a partition filter, or false otherwise.
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

gcs_uri = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/*"
source_uri_prefix = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/"
external_data = bigquery.external gcs_uri, format: :parquet do |ext|
  ext.hive_partitioning_mode = :auto
  ext.hive_partitioning_require_partition_filter = true
  ext.hive_partitioning_source_uri_prefix = source_uri_prefix
end

external_data.hive_partitioning? #=> true
external_data.hive_partitioning_mode #=> "AUTO"
external_data.hive_partitioning_require_partition_filter? #=> true
external_data.hive_partitioning_source_uri_prefix #=> source_uri_prefix
#hive_partitioning_source_uri_prefix
def hive_partitioning_source_uri_prefix() -> String, nil
The common prefix for all source URIs when hive partition detection is requested. The prefix must end immediately before the partition key encoding begins. For example, consider files following this data layout:
gs://bucket/path_to_table/dt=2019-01-01/country=BR/id=7/file.avro
gs://bucket/path_to_table/dt=2018-12-31/country=CA/id=3/file.avro
When hive partitioning is requested with either AUTO or STRINGS mode, the common prefix can be either of gs://bucket/path_to_table or gs://bucket/path_to_table/ (trailing slash does not matter).
- (String, nil) — The common prefix for all source URIs, or nil if not set.
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

gcs_uri = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/*"
source_uri_prefix = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/"
external_data = bigquery.external gcs_uri, format: :parquet do |ext|
  ext.hive_partitioning_mode = :auto
  ext.hive_partitioning_require_partition_filter = true
  ext.hive_partitioning_source_uri_prefix = source_uri_prefix
end

external_data.hive_partitioning? #=> true
external_data.hive_partitioning_mode #=> "AUTO"
external_data.hive_partitioning_require_partition_filter? #=> true
external_data.hive_partitioning_source_uri_prefix #=> source_uri_prefix
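The rule that the prefix must end immediately before the partition key encoding begins can be sketched in plain Ruby. The helper source_uri_prefix_for below is hypothetical and is not part of the library; it assumes a well-formed URI containing "://" and treats the first path segment containing "=" as the start of the partition encoding.

```ruby
# Hypothetical helper: returns everything before the first key=value
# path segment, without a trailing slash, or nil when the URI has no
# key=value segment. Assumes the URI contains "://".
def source_uri_prefix_for(uri)
  scheme, rest = uri.split("://", 2)
  segments = rest.split("/")
  first_partition = segments.index { |segment| segment.include?("=") }
  return nil if first_partition.nil?

  "#{scheme}://#{segments.first(first_partition).join('/')}"
end
```

Applied to either file in the data layout described for this method, the sketch returns gs://bucket/path_to_table.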
#hive_partitioning_source_uri_prefix=
def hive_partitioning_source_uri_prefix=(source_uri_prefix)
Sets the common prefix for all source URIs when hive partition detection is requested. The prefix must end immediately before the partition key encoding begins. For example, consider files following this data layout:
gs://bucket/path_to_table/dt=2019-01-01/country=BR/id=7/file.avro
gs://bucket/path_to_table/dt=2018-12-31/country=CA/id=3/file.avro
When hive partitioning is requested with either AUTO or STRINGS mode, the common prefix can be either of gs://bucket/path_to_table or gs://bucket/path_to_table/ (trailing slash does not matter).
See #format, #hive_partitioning_mode= and #hive_partitioning_require_partition_filter=.
- source_uri_prefix (String) — The common prefix for all source URIs.
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

gcs_uri = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/*"
source_uri_prefix = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/"
external_data = bigquery.external gcs_uri, format: :parquet do |ext|
  ext.hive_partitioning_mode = :auto
  ext.hive_partitioning_require_partition_filter = true
  ext.hive_partitioning_source_uri_prefix = source_uri_prefix
end

external_data.hive_partitioning? #=> true
external_data.hive_partitioning_mode #=> "AUTO"
external_data.hive_partitioning_require_partition_filter? #=> true
external_data.hive_partitioning_source_uri_prefix #=> source_uri_prefix
#ignore_unknown
def ignore_unknown() -> Boolean
Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The default value is false.
What counts as an extra value depends on the format: in CSV, trailing columns; in JSON, named values that don't match any column names. This setting is ignored for Google Cloud Bigtable, Google Cloud Datastore backups and Avro formats. Optional.
- (Boolean)
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

csv_url = "gs://bucket/path/to/data.csv"
csv_table = bigquery.external csv_url do |csv|
  csv.ignore_unknown = true
end

csv_table.ignore_unknown #=> true
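The interaction between ignore_unknown and the bad-record limit described above can be modeled in a few lines of plain Ruby. This is only an illustrative model for CSV-like rows, not BigQuery's implementation: the function name read_rows, the use of ArgumentError, and the choice to drop trailing columns are assumptions made for the sketch.

```ruby
# Illustrative model only: rows are arrays of values and column_count is
# the schema width. Rows with trailing columns are either trimmed
# (ignore_unknown: true) or counted as bad records.
def read_rows(rows, column_count:, ignore_unknown: false, max_bad_records: 0)
  good = []
  bad = 0
  rows.each do |row|
    if row.length <= column_count
      good << row
    elsif ignore_unknown
      good << row.first(column_count) # extra values are ignored
    else
      bad += 1 # record with extra columns is a bad record
    end
  end
  # Stand-in for the "invalid" error reported in the job result.
  if bad > max_bad_records
    raise ArgumentError, "invalid: #{bad} bad records exceed limit of #{max_bad_records}"
  end

  good
end
```

With the defaults, a single row that has a trailing column makes the call raise; passing ignore_unknown: true keeps the row with its extra value dropped, and raising max_bad_records instead skips the row.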
#ignore_unknown=
def ignore_unknown=(new_ignore_unknown)
Set whether BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The default value is false.
What counts as an extra value depends on the format: in CSV, trailing columns; in JSON, named values that don't match any column names. This setting is ignored for Google Cloud Bigtable, Google Cloud Datastore backups and Avro formats. Optional.
- new_ignore_unknown (Boolean) — New ignore_unknown value
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

csv_url = "gs://bucket/path/to/data.csv"
csv_table = bigquery.external csv_url do |csv|
  csv.ignore_unknown = true
end

csv_table.ignore_unknown #=> true
#json?
def json?() -> Boolean
Whether the data format is "NEWLINE_DELIMITED_JSON".
- (Boolean)
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

json_url = "gs://bucket/path/to/data.json"
json_table = bigquery.external json_url

json_table.format #=> "NEWLINE_DELIMITED_JSON"
json_table.json? #=> true
#max_bad_records
def max_bad_records() -> Integer
The maximum number of bad records that BigQuery can ignore when reading data. If the number of bad records exceeds this value, an invalid error is returned in the job result. The default value is 0, which requires that all records are valid. This setting is ignored for Google Cloud Bigtable, Google Cloud Datastore backups and Avro formats.
- (Integer)
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

csv_url = "gs://bucket/path/to/data.csv"
csv_table = bigquery.external csv_url do |csv|
  csv.max_bad_records = 10
end

csv_table.max_bad_records #=> 10
#max_bad_records=
def max_bad_records=(new_max_bad_records)
Set the maximum number of bad records that BigQuery can ignore when reading data. If the number of bad records exceeds this value, an invalid error is returned in the job result. The default value is 0, which requires that all records are valid. This setting is ignored for Google Cloud Bigtable, Google Cloud Datastore backups and Avro formats.
- new_max_bad_records (Integer) — New max_bad_records value
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

csv_url = "gs://bucket/path/to/data.csv"
csv_table = bigquery.external csv_url do |csv|
  csv.max_bad_records = 10
end

csv_table.max_bad_records #=> 10
#orc?
def orc?() -> Boolean
Whether the data format is "ORC".
- (Boolean)
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

gcs_uri = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/*"
source_uri_prefix = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/"
external_data = bigquery.external gcs_uri, format: :orc do |ext|
  ext.hive_partitioning_mode = :auto
  ext.hive_partitioning_source_uri_prefix = source_uri_prefix
end

external_data.format #=> "ORC"
external_data.orc? #=> true
#parquet?
def parquet?() -> Boolean
Whether the data format is "PARQUET".
- (Boolean)
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

gcs_uri = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/*"
source_uri_prefix = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/"
external_data = bigquery.external gcs_uri, format: :parquet do |ext|
  ext.hive_partitioning_mode = :auto
  ext.hive_partitioning_source_uri_prefix = source_uri_prefix
end

external_data.format #=> "PARQUET"
external_data.parquet? #=> true
#sheets?
def sheets?() -> Boolean
Whether the data format is "GOOGLE_SHEETS".
- (Boolean)
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

sheets_url = "https://docs.google.com/spreadsheets/d/1234567980"
sheets_table = bigquery.external sheets_url

sheets_table.format #=> "GOOGLE_SHEETS"
sheets_table.sheets? #=> true
#urls
def urls() -> Array<String>
The fully-qualified URIs that point to your data in Google Cloud. For Google Cloud Storage URIs: Each URI can contain one '*' wildcard character and it must come after the 'bucket' name. Size limits related to load jobs apply to external data sources. For Google Cloud Bigtable URIs: Exactly one URI can be specified and it has to be a fully specified and valid HTTPS URL for a Google Cloud Bigtable table. For Google Cloud Datastore backups, exactly one URI can be specified, and it must end with '.backup_info'. Also, the '*' wildcard character is not allowed.
- (Array<String>)
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

csv_url = "gs://bucket/path/to/data.csv"
csv_table = bigquery.external csv_url

csv_table.urls #=> ["gs://bucket/path/to/data.csv"]
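The URI rules described for this method can be checked locally before a query is sent. The following is a minimal sketch under stated assumptions: valid_external_urls? and valid_gcs_url? are hypothetical helpers, not library methods; the wildcard is taken to be '*'; and every format other than Datastore backups and Bigtable is assumed to use Cloud Storage URIs, so Google Sheets URLs are not covered.

```ruby
# Hypothetical validator for the URI rules described for #urls.
def valid_external_urls?(urls, format:)
  case format
  when "DATASTORE_BACKUP"
    # Exactly one URI, ending in .backup_info, with no wildcard.
    urls.length == 1 &&
      urls.first.end_with?(".backup_info") &&
      !urls.first.include?("*")
  when "BIGTABLE"
    # Exactly one URI, which must be an HTTPS URL.
    urls.length == 1 && urls.first.start_with?("https://")
  else
    # Assumed to be Cloud Storage URIs.
    urls.all? { |url| valid_gcs_url?(url) }
  end
end

# At most one '*' wildcard, and only after the bucket name.
def valid_gcs_url?(url)
  return false unless url.start_with?("gs://")

  bucket, object = url.delete_prefix("gs://").split("/", 2)
  return false if bucket.nil? || bucket.empty? || bucket.include?("*")

  object.to_s.count("*") <= 1
end
```

This only mirrors the textual rules; it does not confirm that the bucket, table, or backup actually exists.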