This page describes advanced methods for how to use BigQuery to analyze variants.
The data in this tutorial comes from the Illumina Platinum Genomes project.
The data was loaded into a BigQuery table that uses the
BigQuery variants schema.
The name of the table is platinum_genomes_deepvariant_variants_20180823
.
If your variant data is in a BigQuery table that uses the BigQuery variants schema, it's straightforward to apply the queries in this tutorial to your data. For information on how to load variant data into BigQuery, see the documentation on using the transform pipeline.
Objectives
This tutorial shows how to do the following:
- Get an overview of the genomics data.
- Find out how non-variant segments are represented.
- Find out how variant calls are represented.
- Find out how variant call quality filters are represented.
- Aggregate hierarchical columns.
- Condense queries.
- Count distinct rows.
- Group rows.
- Write user-defined functions.
This tutorial also shows how to find the following information:
- Number of rows in the table
- Number of variant calls
- Variants called for each sample
- Number of samples
- Variants per chromosome
- High quality variants per sample
Costs
In this document, you use the following billable components of Google Cloud:
- BigQuery
To generate a cost estimate based on your projected usage,
use the pricing calculator.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
- You should be familiar with the BigQuery variants schema.
Viewing the table schema and data
Access the table and view the schema
The Illumina Platinum Genomes platinum_genomes_deepvariant_variants_20180823
table is publicly available.
Variants and non-variants in the table
The Illumina Platinum Genomes data uses the gVCF format, which means that there are rows in the table that include non-variants. These non-variants are also known as "reference calls."
In the table, the non-variant segments are generally represented in the following ways:
- With a zero-length
alternate_bases
value - With the text string
<NON_REF>
as analternate_bases.alt
value - With the text string
<*>
as analternate_bases.alt
value
The way that non-variant segments are represented typically depends on the
variant caller that generated the source data. The variants in the
platinum_genomes_deepvariant_variants_20180823
table have been called using
DeepVariant, which uses the <*>
notation.
The following tables show some rows containing values that represent non-variant segments. The segments show the following information:
- A reference block of
10
bases on chromosome1
- The reference block starts at position
1000
- The reference base at position
1000
is anA
- The reference bases at the other positions of the block are not shown
In the following table, the alternate_bases
REPEATED RECORD
column contains
no values, meaning that it is an ARRAY
of length 0.
reference_name | start_position | end_position | reference_bases | alternate_bases.alt |
---|---|---|---|---|
1 | 1000 | 1010 | A |
In the following table, the alternate_bases
REPEATED RECORD
column is length 1, and it
contains the literal text string <*>
.
reference_name | start_position | end_position | reference_bases | alternate_bases.alt |
---|---|---|---|---|
1 | 1000 | 1010 | A | <*> |
The queries used in this guide use the representations in the preceding tables.
See the VCF specification for more information on representing non-variant positions in the genome.
Viewing the table data
To view the data in the platinum_genomes_deepvariant_variants_20180823
table,
complete the following steps:
View the table in the BigQuery page in the Google Cloud console.
Information about the table appears. The table contains 19.6 GB of data and has over 105,000,000 rows.
Click Preview to view some of the rows in the table.
Querying the table
After viewing the table schema and some of its rows, start issuing queries and analyzing data. Before continuing, ensure that you're familiar with the Standard SQL Query Syntax that BigQuery uses.
Counting total rows in the table
To view the number of rows in the table:
Go to the BigQuery page in the Google Cloud console.
Click Compose query.
Copy and paste the following query into the New Query text area:
#standardSQL SELECT COUNT(1) AS number_of_rows FROM `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823`
Click Run query. The query returns the following result:
Row number_of_rows 1 105923159
Counting variant calls in the table
Each row in the table has a genomic position that is either a variant or non-variant segment.
Each row also contains a call
column,
which is an ARRAY
of variant calls. Each call
column includes the
name
and other values, such as the genotype, quality columns, read
depth, and others typically found in a VCF file.
To count the number of variant calls, query the number of elements inside the
ARRAY
columns. You can do this in several ways which are shown next.
Each query returns the value 182,104,652, which means that there is an
average of 1.7 variant calls per row in the dataset.
Summing the lengths of call
arrays
Count the total number of variant calls across all samples by adding
the length of each call
array:
#standardSQL
SELECT
SUM(ARRAY_LENGTH(call)) AS number_of_calls
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823`
The query returns the following result:
Row | number_of_calls |
---|---|
1 | 182104652 |
JOIN
ing each row
Count the total number of variant calls across all samples
by using a JOIN
on each row with the call
column. The query uses the comma (,
) operator, which is a shorthand notation used for JOIN
.
The join to the call
column makes an implicit
UNNEST
operation on the call
column.
#standardSQL
SELECT
COUNT(call) AS number_of_calls
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call
The query returns the following result:
Row | number_of_calls |
---|---|
1 | 182104652 |
Counting name
in a call
column
A third way to count the total number of variant calls across all samples
is to count the name
values in the call
column. Each call
column must have a single name
value, so you can run the following
query:
#standardSQL
SELECT
COUNT(call.name) AS number_of_calls
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call call
The query returns the following result:
Row | number_of_calls |
---|---|
1 | 182104652 |
Counting variant and non-variant segments
To count the number of variant and non-variant segments in the table, first run a query to filter out the non-variant segments:
#standardSQL
SELECT
COUNT(1) AS number_of_real_variants
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call call
WHERE
EXISTS (SELECT 1
FROM UNNEST(v.alternate_bases) AS alt
WHERE
alt.alt NOT IN ("<NON_REF>", "<*>"))
The query returns the following result:
Row | number_of_real_variants |
---|---|
1 | 38549388 |
As shown in Counting variant calls, the total number of variant calls in the table is 182,104,652, so the result shows that the majority of rows in the table are non-variant segments.
As shown in the section on Variants and non-variants in the
table,
there are at least three ways to classify a variant row as a
non-variant segment. In the query above, the WHERE
clause includes rows
where the alternate_bases
column has a value that is a true variant, meaning
that it is not a special marker value such as <*>
or <NON_REF>
.
For each row in the table, a subquery is issued over the alternate_bases
column of that row, which returns the value 1
for each value of
alternate_bases
that is not <NON_REF
> or <*>
. The number of rows
that the subquery returns is the number of variant segments.
The following query shows how to get the count of non-variant segments:
#standardSQL
SELECT
COUNT(1) AS number_of_non_variants
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call call
WHERE
NOT EXISTS (SELECT 1
FROM UNNEST(v.alternate_bases) AS alt
WHERE
alt.alt NOT IN ("<NON_REF>", "<*>"))
The query returns the following result:
Row | number_of_non_variants |
---|---|
1 | 143555264 |
Adding the number of real variants (38,549,388) to the number of non-variant segments (143,555,264) equals the total number of variant calls.
Counting the variants called by each sample
After examining the top-level rows in the table, you can start querying for child rows. These rows include data such as the individual samples that have had calls made against the variants.
Each variant in the table has zero or more values for call.name
. A
particular call.name
value can appear in multiple rows.
To count the number of rows in which each call set appears, run the following query:
#standardSQL
SELECT
call.name AS call_name,
COUNT(call.name) AS call_count_for_call_set
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call
GROUP BY
call_name
ORDER BY
call_name
Running the query returns six rows. Each call_name
corresponds to a
sequenced individual human:
Row | call_name | call_count_for_call_set |
---|---|---|
1 | NA12877 | 31592135 |
2 | NA12878 | 28012646 |
3 | NA12889 | 31028550 |
4 | NA12890 | 30636087 |
5 | NA12891 | 33487348 |
6 | NA12892 | 27347886 |
Humans typically don't have the 30 million variants shown in the
values for call_count_for_call_set
. Filter out the non-variant segments
to count just the variant rows:
#standardSQL
SELECT
call.name AS call_name,
COUNT(call.name) AS call_count_for_call_set
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call
WHERE
EXISTS (SELECT 1
FROM UNNEST(v.alternate_bases) AS alt
WHERE
alt.alt NOT IN ("<NON_REF>", "<*>"))
GROUP BY
call_name
ORDER BY
call_name
The query returns the following result:
Row | call_name | call_count_for_call_set |
---|---|---|
1 | NA12877 | 6284275 |
2 | NA12878 | 6397315 |
3 | NA12889 | 6407532 |
4 | NA12890 | 6448600 |
5 | NA12891 | 6516669 |
6 | NA12892 | 6494997 |
The number of variants is now closer to 6 million, which is more typical for a human. Continue to the next section to filter true variants by genotype.
Filtering true variants by genotype
The variants in the table include no-calls, which are represented by a
genotype
value of -1. These variants are not considered true variants for
individuals, so you must filter them out. True variants can only include calls with
genotypes greater than zero. If a call includes only genotypes that are
no-calls (-1) or reference (0), then they are not true variants.
To filter the variants by genotype, run the following query:
#standardSQL
SELECT
call.name AS call_name,
COUNT(call.name) AS call_count_for_call_set
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call
WHERE
EXISTS (SELECT 1 FROM UNNEST(call.genotype) AS gt WHERE gt > 0)
AND NOT EXISTS (SELECT 1 FROM UNNEST(call.genotype) AS gt WHERE gt < 0)
GROUP BY
call_name
ORDER BY
call_name
The query returns the following result:
Row | call_name | call_count_for_call_set |
---|---|---|
1 | NA12877 | 4486610 |
2 | NA12878 | 4502017 |
3 | NA12889 | 4422706 |
4 | NA12890 | 4528725 |
5 | NA12891 | 4424094 |
6 | NA12892 | 4495753 |
Counting samples in the table
In Counting the variants called by each sample,
each query returned six rows with values for call_name
. To query
for and get the value for the number of rows, run the following
query:
#standardSQL
SELECT
COUNT(DISTINCT call.name) AS number_of_callsets
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call
The query returns the following result:
Row | number_of_callsets |
---|---|
1 | 6 |
Counting variants per chromosome
To count the number of variants per chromosome, run the following query. The query does the following:
- Counts all rows in which there is at least one variant call with at least one genotype greater than 0.
- Groups the variant rows by chromosome and counts each group.
#standardSQL
SELECT
reference_name,
COUNT(reference_name) AS number_of_variant_rows
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v
WHERE
EXISTS (SELECT 1
FROM UNNEST(v.call) AS call, UNNEST(call.genotype) AS gt
WHERE gt > 0)
GROUP BY
reference_name
ORDER BY
CASE
WHEN SAFE_CAST(REGEXP_REPLACE(reference_name, '^chr', '') AS INT64) < 10
THEN CONCAT('0', REGEXP_REPLACE(reference_name, '^chr', ''))
ELSE REGEXP_REPLACE(reference_name, '^chr', '')
END
Running the query returns the name of the chromosome (reference_name
) and the
number of variant rows for each chromosome:
Row | reference_name | number_of_variant_rows |
---|---|---|
1 | chr1 | 615000 |
2 | chr2 | 646401 |
3 | chr3 | 542315 |
4 | chr4 | 578600 |
5 | chr5 | 496202 |
... | ... | ... |
Counting high-quality variants per sample
Querying calls with multiple FILTER
values
The VCF specification describes the FILTER
column
that you can use to label variant calls of differing qualities.
The following query shows how to view the per-variant-call FILTER
values
for the dataset:
#standardSQL
SELECT
call_filter,
COUNT(call_filter) AS number_of_calls
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
v.call,
UNNEST(call.FILTER) AS call_filter
GROUP BY
call_filter
ORDER BY
number_of_calls
The query returns the following result:
Row | call_filter | number_of_calls |
---|---|---|
1 | RefCall | 11681534 |
2 | PASS | 26867854 |
The PASS
value signifies that a variant call is of a high quality.
FILTER
ing for high quality variant calls
When analyzing variants, you might want to filter out lower quality variants.
If the FILTER
column contains the value PASS
, it is likely that the
column doesn't contain any other values. You can verify this by running the
following query. The query also omits any calls that do not contain a PASS
value under FILTER
.
#standardSQL
SELECT
reference_name,
start_position,
end_position,
reference_bases,
call.name AS call_name,
(SELECT STRING_AGG(call_filter) FROM UNNEST(call.FILTER) AS call_filter) AS filters,
ARRAY_LENGTH(call.FILTER) AS filter_count
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call
WHERE
EXISTS (SELECT 1 FROM UNNEST(call.FILTER) AS call_filter WHERE call_filter = 'PASS')
AND ARRAY_LENGTH(call.FILTER) > 1
ORDER BY
filter_count DESC, reference_name, start_position, end_position, reference_bases, call_name
LIMIT
10
As expected, running the query returns zero results.
Counting all high quality calls for each sample
The following query shows how to count all calls (variants and non-variants)
for each call set, and omits any call with a non-PASS
filter:
#standardSQL
SELECT
call.name AS call_name,
COUNT(1) AS number_of_calls
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call
WHERE
NOT EXISTS (SELECT 1 FROM UNNEST(call.FILTER) AS call_filter WHERE call_filter != 'PASS')
GROUP BY
call_name
ORDER BY
call_name
The query returns the following result:
Row | call_name | number_of_calls |
---|---|---|
1 | NA12877 | 29795946 |
2 | NA12878 | 26118774 |
3 | NA12889 | 29044992 |
4 | NA12890 | 28717437 |
5 | NA12891 | 31395995 |
6 | NA12892 | 25349974 |
Counting all high quality true variant calls for each sample
The following query shows how to count all calls (variants and non-variants)
for each sample. It omits any call with a non-PASS
filter, and only includes
calls with at least one true variant, meaning that genotype
> 0:
#standardSQL
SELECT
call.name AS call_name,
COUNT(1) AS number_of_calls
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call
WHERE
NOT EXISTS (SELECT 1 FROM UNNEST(call.FILTER) AS call_filter WHERE call_filter != 'PASS')
AND EXISTS (SELECT 1 FROM UNNEST(call.genotype) as gt WHERE gt > 0)
GROUP BY
call_name
ORDER BY
call_name
The query returns the following result:
Row | call_name | number_of_calls |
---|---|---|
1 | NA12877 | 4486610 |
2 | NA12878 | 4502017 |
3 | NA12889 | 4422706 |
4 | NA12890 | 4528725 |
5 | NA12891 | 4424094 |
6 | NA12892 | 4495753 |
Best practices
Condensing queries
As your queries become more complex, it's important to keep them concise to ensure that their logic is correct and simple to follow.
The following example demonstrates how to start from a query that counts the number of variants per chromosome and, step by step, condense it using SQL syntax and user-defined functions.
As explained in the section on counting variants per chromosome, the query has the following requirements:
- Counts all rows in which there is at least one variant call with at least one genotype greater than 0.
- Groups the variant rows by chromosome and counts each group.
Writing this query can be complicated because, to complete the first task, you
need to look into an ARRAY
(genotype
) within an ARRAY
(call
) while
keeping the execution context of the query at the row level. You keep the
execution context of the query at the row level
because you want to produce a per-variant result, rather than a per-call
or per-genotype
result.
The UNNEST
function lets you query over an ARRAY
column as if the
column were a table. The function returns one row for each element of an
ARRAY
. It also doesn't change the query context. Use an UNNEST
function in an EXISTS
subquery in a WHERE
clause:
#standardSQL
SELECT
reference_name,
COUNT(reference_name) AS number_of_variant_rows
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v
WHERE
EXISTS (SELECT 1
FROM UNNEST(v.call) AS call
WHERE EXISTS (SELECT 1
FROM UNNEST(call.genotype) AS gt
WHERE gt > 0))
GROUP BY
reference_name
ORDER BY
reference_name
The query returns the same results as the example in counting variants per chromosome:
Row | reference_name | number_of_variant_rows |
---|---|---|
1 | chr1 | 615000 |
2 | chr10 | 396773 |
3 | chr11 | 391260 |
4 | chr12 | 382841 |
5 | chr13 | 298044 |
... | ... | ... |
You can make the query more concise by changing the EXISTS
clause into a
JOIN
of the call
column with the call.genotype
column. The comma operator
is a shorthand notation used for JOIN
.
#standardSQL
SELECT
reference_name,
COUNT(reference_name) AS number_of_variant_rows
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v
WHERE
EXISTS (SELECT 1
FROM UNNEST(v.call) AS call, UNNEST(call.genotype) AS gt
WHERE gt > 0)
GROUP BY
reference_name
ORDER BY
reference_name
The query works, and is concise, but it doesn't let you sort the output
in ascending numerical order of chromosomes (reference_name
)
because the values in reference_name
are string types, and each value
contains the prefix "chr."
To sort the output numerically, first remove the "chr" prefix from the
reference_name
column and give it the alias chromosome
:
#standardSQL
SELECT
REGEXP_REPLACE(reference_name, '^chr', '') AS chromosome,
COUNT(reference_name) AS number_of_variant_rows
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v
WHERE
EXISTS (SELECT 1
FROM UNNEST(v.call) AS call, UNNEST(call.genotype) AS gt
WHERE gt > 0)
GROUP BY
chromosome
ORDER BY
chromosome
The query uses the REGEXP_REPLACE
function to replace the "chr" prefix string with an empty string. The query then
changes the GROUP BY
and ORDER BY
functions to use the computed chromosome
alias. The output still sorts by string:
Row | chromosome | number_of_variant_rows |
---|---|---|
1 | 1 | 615000 |
2 | 10 | 396773 |
3 | 11 | 391260 |
4 | 12 | 382841 |
5 | 13 | 298044 |
... | ... | ... |
To instead sort the output numerically, cast the chromosome
column from
a string to an integer:
#standardSQL
SELECT
CAST(REGEXP_REPLACE(reference_name, '^chr', '') AS INT64) AS chromosome,
COUNT(reference_name) AS number_of_variant_rows
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v
WHERE
EXISTS (SELECT 1
FROM UNNEST(v.call) AS call, UNNEST(call.genotype) AS gt
WHERE gt > 0)
GROUP BY
chromosome
ORDER BY
chromosome
The query returns an error because not all chromosome names, such
as "X," "Y," and "M" are numeric. Use the
CASE
function to prepend a "0" to chromosomes 1 through 9 and remove the "chr"
prefix:
#standardSQL
SELECT
CASE
WHEN SAFE_CAST(REGEXP_REPLACE(reference_name, '^chr', '') AS INT64) < 10
THEN CONCAT('0', REGEXP_REPLACE(reference_name, '^chr', ''))
ELSE REGEXP_REPLACE(reference_name, '^chr', '')
END AS chromosome,
COUNT(reference_name) AS number_of_variant_rows
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v
WHERE
EXISTS (SELECT 1
FROM UNNEST(v.call) AS call, UNNEST(call.genotype) AS gt
WHERE gt > 0)
GROUP BY
chromosome
ORDER BY
chromosome
The query returns the correct output:
Row | chromosome | number_of_variant_rows |
---|---|---|
1 | 01 | 615000 |
2 | 02 | 646401 |
3 | 03 | 542315 |
4 | 04 | 578600 |
5 | 05 | 496202 |
... | ... | ... |
The query uses the SAFE_CAST
function, which returns NULL
for chromosomes X, Y, and M instead of returning
an error.
As a last improvement on the output, display the reference_name
column again
instead of setting it to the chromosome
alias. To do so, move the CASE
clause to the ORDER BY
function:
#standardSQL
SELECT
reference_name,
COUNT(reference_name) AS number_of_variant_rows
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v
WHERE
EXISTS (SELECT 1
FROM UNNEST(v.call) AS call, UNNEST(call.genotype) AS gt
WHERE gt > 0)
GROUP BY
reference_name
ORDER BY
CASE
WHEN SAFE_CAST(REGEXP_REPLACE(reference_name, '^chr', '') AS INT64) < 10
THEN CONCAT('0', REGEXP_REPLACE(reference_name, '^chr', ''))
ELSE REGEXP_REPLACE(reference_name, '^chr', '')
END
This final query is the same as the one shown in Counting variants per chromosome.
Writing user-defined functions
BigQuery supports user-defined functions. You can use user-defined functions to create a function using another SQL expression or another programming language, such as JavaScript.
The example in Condensing queries shows how to build a complex query, but the query is overly complex.
The following query demonstrates how to make the query more concise by moving
the CASE
logic into a function:
#standardSQL
CREATE TEMPORARY FUNCTION SortableChromosome(reference_name STRING)
RETURNS STRING AS (
-- Remove the leading "chr" (if any) in the reference_name
-- If the chromosome is 1 - 9, prepend a "0" since
-- "2" sorts after "10", but "02" sorts before "10".
CASE
WHEN SAFE_CAST(REGEXP_REPLACE(reference_name, '^chr', '') AS INT64) < 10
THEN CONCAT('0', REGEXP_REPLACE(reference_name, '^chr', ''))
ELSE REGEXP_REPLACE(reference_name, '^chr', '')
END
);
SELECT
reference_name,
COUNT(reference_name) AS number_of_variant_rows
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v
WHERE
EXISTS (SELECT 1
FROM UNNEST(v.call) AS call, UNNEST(call.genotype) AS gt
WHERE gt > 0)
GROUP BY
reference_name
ORDER BY SortableChromosome(reference_name)
The following query also demonstrates how to make the query more concise, but it uses a function defined in JavaScript:
#standardSQL
CREATE TEMPORARY FUNCTION SortableChromosome(reference_name STRING)
RETURNS STRING LANGUAGE js AS """
// Remove the leading "chr" (if any) in the reference_name
var chr = reference_name.replace(/^chr/, '');
// If the chromosome is 1 - 9, prepend a "0" since
// "2" sorts after "10", but "02" sorts before "10".
if (chr.length == 1 && '123456789'.indexOf(chr) >= 0) {
return '0' + chr;
}
return chr;
""";
SELECT
reference_name,
COUNT(reference_name) AS number_of_variant_rows
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v
WHERE
EXISTS (SELECT 1
FROM UNNEST(v.call) AS call, UNNEST(call.genotype) AS gt
WHERE gt > 0)
GROUP BY
reference_name
ORDER BY SortableChromosome(reference_name)
Both queries return the correct result, and their logic is more concise.
Improving query performance and reducing costs
BigQuery pricing is based on the number of bytes processed for a query. Query performance improves when the amount of data processed is reduced. BigQuery provides data on how many seconds have elapsed since a query started and how many bytes the query processed. See the BigQuery query plan explanation for information on optimizing your queries.
Some of the examples in this page, such as Counting the variant calls in a table, demonstrate multiple ways to write a query. To determine which method of querying is best for you, examine the duration of different queries and see how many bytes of data they process.
Clean up
After you finish the tutorial, you can clean up the resources that you created so that they stop using quota and incurring charges. The following sections describe how to delete or turn off these resources.
The easiest way to eliminate billing is to delete the project that you created for the tutorial.
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- Work through the other Cloud Life Sciences tutorials.
- Analyze variants in BigQuery using R, RMarkdown, or JavaScript.