1000 Cannabis Genomes Project

How to query public data sets using BigQuery

BigQuery is a fully managed data warehouse and analytics platform. Public datasets are available for you to analyze using SQL queries. You can access BigQuery public data sets using the web UI the command-line tool, or by making calls to the BigQuery REST API using a variety of client libraries such as Java, .NET, or Python.

To get started using a BigQuery public dataset, create or select a project. The first terabyte of data processed per month is free, so you can start querying public datasets without enabling billing. If you intend to go beyond the free tier, you should also enable billing.

  1. Sign in to your Google account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Cloud Platform project.

    Go to the Manage resources page

  3. Enable billing for your project.

    Enable billing

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, Enable the BigQuery API.

    Enable the API

Dataset overview

In October 2016, Phylos Bioscience released a genomic open dataset of approximately 850 strains of Cannabis via the Open Cannabis Project. In combination with other genomics datasets made available by Courtagen Life Sciences, Michigan State University, NCBI, Sunrise Medicinal, University of Calgary, University of Toronto, and Yunnan Academy of Agricultural Sciences, the total amount of publicly available data exceeds 1,000 samples taken from nearly as many unique strains.

These data were retrieved from the National Center for Biotechnology Information’s Sequence Read Archive (NCBI SRA), processed using the BWA aligner and FreeBayes variant caller, indexed with the Google Genomics API, and exported to BigQuery for analysis.

You can start exploring this data:

The data is also available via the Google Genomics API as dataset ID 918853309083001239, and an additional duplicated subset of only transcriptome data is available as dataset ID 94241232795910911.

Dataset content

All tables in the Cannabis Genomes Project dataset have a suffix like _201703. The suffix is referred to as [BUILD_DATE] in the descriptions below. The dataset is updated frequently as new releases become available.

The following tables are included in the Cannabis Genomes Project dataset:

  • Sample_info contains fields extracted for each SRA sample, including the SRA sample ID and other data that give indications about the type of sample. Sample types include: strain, library prep methods, and sequencing technology. See SRP008673 for an example of upstream sample data. SRP008673 is the University of Toronto sequencing of Cannabis Sativa subspecies Purple Kush.

  • MNPR01_reference_[BUILD_DATE] contains reference sequence names and lengths for the draft assembly of Cannabis Sativa subspecies Cannatonic produced by Phylos Bioscience. This table contains contig identifiers and their lengths.

  • MNPR01_[BUILD_DATE] contains variant calls for all included samples and types (genomic, transcriptomic) aligned to the MNPR01_reference_[BUILD_DATE] table. Samples can be found in the sample_info table. The MNPR01_[BUILD_DATE] table is exported using the Google Genomics BigQuery variants schema. This table is useful for general analysis of the Cannabis genome.

  • MNPR01_transcriptome_[BUILD_DATE] is similar to the MNPR01_[BUILD_DATE] table, but it includes only the subset transcriptomic samples. This table is useful for transcribed gene-level analysis of the Cannabis genome.

Sample queries

Here are some examples of SQL queries you can run on this data in BigQuery.

These samples use BigQuery’s support for standard SQL. Use the #standardSQL tag to let BigQuery know you want to use standard SQL. For more information about the #standardSQL prefix, see Setting a query prefix.

Which Cannabis samples are included in the variants table?

This query determines how many Cannabis samples are included in the variants table - MNPR01_[BUILD_DATE].

Web UI

Open the following query in the Web UI

#standardSQL
SELECT
  call.call_set_name
FROM
  `bigquery-public-data.genomics_cannabis.MNPR01_201703` v,
  v.call
GROUP BY
  call.call_set_name

Command-line

bq query --use_legacy_sql=false '
SELECT
  call.call_set_name
FROM
  `bigquery-public-data.genomics_cannabis.MNPR01_201703` v,
  v.call
GROUP BY
  call.call_set_name'

The results are shown here:

+------------------------+
| call_set_name          |
+------------------------+
| SRS1757953             |
| SRS1758390             |
| SRS1758527             |
| SRS1759434             |
| SRS1758481             |
| SRS1757957             |
| SRS1758274             |
| SRS266829              |
| SRS1758072             |
| SRS1758111             |
| ...                    |
+------------------------+

Which contigs in the MNPR01_reference_[BUILD_DATE] table have the highest density of variants?

This query displays the contigs with the highest variant_density. The results may indicate regions of especially intense artificial selection (genetic regions of interest for agricultural applications). The results may also indicate areas where DNA sequencing is troublesome and cannot produce high quality data.

Web UI

Open the following query in the Web UI

#standardSQL
SELECT
  *
FROM (
  SELECT
    reference_name,
    COUNT(reference_name) / r.length AS variant_density,
    COUNT(reference_name) AS variant_count,
    r.length AS reference_length
  FROM
    `bigquery-public-data.genomics_cannabis.MNPR01_201703` v,
    `bigquery-public-data.genomics_cannabis.MNPR01_reference_201703` r
  WHERE
    v.reference_name = r.name
    AND EXISTS (
    SELECT
      1
    FROM
      UNNEST(v.call) AS call
    WHERE
      EXISTS (
      SELECT
        1
      FROM
        UNNEST(call.genotype) AS gt
      WHERE
        gt > 0))
  GROUP BY
    reference_name,
    r.length ) AS d
ORDER BY
  variant_density DESC

Command-line

bq query --use_legacy_sql=false '
SELECT
  *
FROM (
  SELECT
    reference_name,
    COUNT(reference_name) / r.length AS variant_density,
    COUNT(reference_name) AS variant_count,
    r.length AS reference_length
  FROM
    `bigquery-public-data.genomics_cannabis.MNPR01_201703` v,
    `bigquery-public-data.genomics_cannabis.MNPR01_reference_201703` r
  WHERE
    v.reference_name = r.name
    AND EXISTS (
    SELECT
      1
    FROM
      UNNEST(v.call) AS call
    WHERE
      EXISTS (
      SELECT
        1
      FROM
        UNNEST(call.genotype) AS gt
      WHERE
        gt > 0))
  GROUP BY
    reference_name,
    r.length ) AS d
ORDER BY
  variant_density DESC'

The results are shown here:

+----------------------------------+---------------------+----------------+----------------+
| reference_name                   | variant_density     | variant_count  | variant_length |
+----------------------------------+---------------------+----------------+----------------+
| gi|1098480201|gb|MNPR01009187.1| | 0.675239755884917   | 2228           | 6882           |
| gi|1098476186|gb|MNPR01010508.1| | 0.62694838933148    | 1810           | 2887           |
| gi|1098480371|gb|MNPR01009127.1| | 0.417397454031117   | 2951           | 7070           |
| gi|1098475344|gb|MNPR01010796.1| | 0.3986960882647943  | 795            | 1994           |
| gi|1098483659|gb|MNPR01007984.1| | 0.393582490957989   | 4244           | 10783          |
| gi|1098476754|gb|MNPR01010320.1| | 0.379833858493268   | 1326           | 3491           |
| gi|1098475239|gb|MNPR01010833.1| | 0.3713974986405655  | 683            | 1839           |
| gi|1098480751|gb|MNPR01008988.1| | 0.361954206602768   | 2719           | 7512           |
| gi|1098475142|gb|MNPR01010866.1| | 0.3583381419503750  | 621            | 1733           |
| gi|1098489468|gb|MNPR01005118.1| | 0.35504343342191    | 8951           | 25211          |
| ...                                                                                      |
+----------------------------------+---------------------+----------------+----------------+

How many variants does each sample have at the THC Synthase gene (THCA1) locus?

Cannabis sativa has 2 copies of the THCA Synthase gene. One of the sequences was submitted to GenBank in Gene duplication and divergence affecting drug content in Cannabis sativa (2015) as KJ469378. If you BLAT this sequence against the MNPR01 Cannabis sativa subspecies Cannatonic reference genome, THCA1 is located on contig gi|1098492959|gb|MNPR01002882.1| at 12889-14527.

This query uses the location of THCA1 to determine how divergent Cannatonic is from the other strains in the transcriptome dataset.

Web UI

Open the following query in the Web UI

#standardSQL
SELECT
  i.Sample_Name_s AS sample_name,
  call.call_set_name AS call_set_name,
  COUNT(call.call_set_name) AS call_count_for_call_set
FROM
  `bigquery-public-data.genomics_cannabis.sample_info_201703` i,
  `bigquery-public-data.genomics_cannabis.MNPR01_201703` v,
  v.call
WHERE
  call.call_set_name = i.SRA_Sample_s
  AND reference_name = 'gi|1098492959|gb|MNPR01002882.1|'
  AND EXISTS (
  SELECT
    1
  FROM
    UNNEST(v.alternate_bases) AS alt
  WHERE
    alt NOT IN ("",
      "<*>"))
  AND v.dp >= 10
  AND v.start >= 12800
  AND v.end <= 14600
GROUP BY
  call_set_name,
  Sample_Name_s
ORDER BY
  call_set_name

Command-line

bq query --use_legacy_sql=false '
SELECT
  i.Sample_Name_s AS sample_name,
  call.call_set_name AS call_set_name,
  COUNT(call.call_set_name) AS call_count_for_call_set
FROM
  `bigquery-public-data.genomics_cannabis.sample_info_201703` i,
  `bigquery-public-data.genomics_cannabis.MNPR01_201703` v,
  v.call
WHERE
  call.call_set_name = i.SRA_Sample_s
  AND reference_name = 'gi|1098492959|gb|MNPR01002882.1|'
  AND EXISTS (
  SELECT
    1
  FROM
    UNNEST(v.alternate_bases) AS alt
  WHERE
    alt NOT IN ("<NON_REF>",
      "<*>"))
  AND v.dp >= 10
  AND v.start >= 12800
  AND v.`end` <= 14600
GROUP BY
  call_set_name,
  Sample_Name_s
ORDER BY
  call_set_name'

The results are shown here:

+---------------------+-----------------+--------------------------+
| sample_name         | call_set_name   | call_count_for_call_set  |
+---------------------+-----------------+--------------------------+
| Chemdog91           | SRS1098403      | 52                       |
| SourTsunami         | SRS1098404      | 4                        |
| BlueBerryEssence    | SRS1098405      | 82                       |
| Black84             | SRS1098406      | 4                        |
| BlueDream           | SRS1098407      | 80                       |
| WZ_CBD              | SRS1098430      | 4                        |
| WiFi                | SRS1098431      | 4                        |
| AlaskanIce-2130     | SRS1099975      | 4                        |
| Black84-2130        | SRS1099976      | 6                        |
| WiFi-2130           | SRS1099977      | 2                        |
| ...                                                              |
+---------------------+-----------------+--------------------------+

About the data

Dataset Source: Open Cannabis Project

Category: Genomics

Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://www.ncbi.nlm.nih.gov/home/about/policies.shtml - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Update Frequency: Updated as additional data is released to GenBank

View in BigQuery: Go to 1000 Cannabis Genomes Project data

Monitor your resources on the go

Get the Google Cloud Console app to help you manage your projects.

Send feedback about...