Reference Genomes

Reference genomes such as GRCh37, GRCh37lite, GRCh38, hg19, hs37d5, and b37 are available on Google Cloud.

Dataset access

Cloud Storage folders

The following files are available in the genomics-public-data Cloud Storage bucket:

About the dataset

Dataset source:

  • GRCh37: Genome Reference Consortium Human Build 37 includes data from the following files:

    For more information on GRCh37 data, see the GRCh37 NCBI paper and the FTP README.

  • GRCh37lite: GRCh37lite is a subset of the full GRCh37 reference set plus the human mitochondrial genome reference sequence in one file:

    For more information on GRCh37lite data, see the FTP README.

  • GRCh38: Genome Reference Consortium Human Build 38 includes data from the following files:

    For more information on GRCh38 data, see the GRCh38 NCBI paper and the FTP README.

  • Verily's GRCh38: On the autosomes, Verily's GRCh38 reference genome is fully compatible with any b38 genome. It has the following features:

    • Excludes all patch sequences
    • Omits alternate haplotype chromosomes
    • Includes decoy sequences
    • Masks out duplicate copies of centromeric regions

    The base assembly is GRCh38_no_alt_plus_hs38d1, which was created specifically for analysis. Its rationale and exact genomic modifications are documented in its README file.

    Verily applied the following modifications to the base assembly:

    • Reference segment names are prefixed with "chr", because many of the additional data files are provided by GENCODE, which uses the "chr" naming convention.

    • All 74 extended IUPAC codes are converted to the alphabetically first matching base (for example, R, which stands for A or G, becomes A), as recommended in the VCF 4.3 specification.

    • This release of the genome reference is named GRCh38_Verily_v1.

  • hg19: The February 2009 assembly of the human genome. It is similar to GRCh37 but has a different mitochondrial sequence and additional alternate haplotype assemblies. The hg19 data is hosted on the UCSC FTP site.

    For more information on hg19 data, see the FTP README.

  • hs37d5: Includes data from GRCh37, the rCRS mitochondrial sequence, Human herpesvirus 4 type 1, and the concatenated decoy sequences. The data is in one file, hs37d5.fa.gz, hosted on the EBI FTP site.

    For more information on hs37d5 data, see the FTP README.

  • b37: The b37 reference genome, which is included with some versions of the GATK software, contains data from GRCh37, the rCRS mitochondrial sequence, and Human herpesvirus 4 type 1. The b37 dataset is hosted on the Broad Institute FTP site.

    For more information on b37 data, see the GATK FAQs.
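
The two modifications that Verily applied to the GRCh38 base assembly (the "chr" prefix on reference segment names and the reduction of extended IUPAC codes) can be illustrated with a short Python sketch. This is not Verily's actual tooling: the helper names add_chr_prefix, reduce_iupac, and normalize_fasta are hypothetical, and the sketch assumes a plain prefix rule with no special-case renaming and that N is left unchanged.

```python
# Illustrative sketch only. The function names below are hypothetical and are
# not part of any Verily or Google tool.

# Extended IUPAC nucleotide codes mapped to the alphabetically first base
# they can represent (for example, R means A or G, so R becomes A).
IUPAC_FIRST_BASE = {
    "R": "A",  # A or G
    "Y": "C",  # C or T
    "S": "C",  # C or G
    "W": "A",  # A or T
    "K": "G",  # G or T
    "M": "A",  # A or C
    "B": "C",  # C, G, or T
    "D": "A",  # A, G, or T
    "H": "A",  # A, C, or T
    "V": "A",  # A, C, or G
}

# Translation table covering upper- and lowercase (soft-masked) codes.
_CODES = "".join(IUPAC_FIRST_BASE.keys())
_BASES = "".join(IUPAC_FIRST_BASE.values())
_TRANSLATION = str.maketrans(_CODES + _CODES.lower(), _BASES + _BASES.lower())


def add_chr_prefix(name: str) -> str:
    """Prefix a reference segment name with "chr" unless it already has it."""
    return name if name.startswith("chr") else "chr" + name


def reduce_iupac(sequence: str) -> str:
    """Replace extended IUPAC codes with a concrete base; A, C, G, T, N pass through."""
    return sequence.translate(_TRANSLATION)


def normalize_fasta(lines):
    """Yield FASTA lines with renamed headers and reduced sequence lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # The segment name is the first token; keep any description text.
            name, _, description = line[1:].partition(" ")
            header = ">" + add_chr_prefix(name)
            yield header + (" " + description if description else "")
        else:
            yield reduce_iupac(line)


if __name__ == "__main__":
    example = [">1 test contig", "ACGTRYNN", "acgtmk"]
    for out_line in normalize_fasta(example):
        print(out_line)
```

The sketch preserves letter case so that soft-masked (lowercase) regions stay soft-masked after conversion.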

Use: These datasets are publicly available for anyone to use under the terms provided by the dataset sources (https://www.ncbi.nlm.nih.gov/, https://cse.ucsc.edu/, http://www.internationalgenome.org/data, https://www.broadinstitute.org/) and are provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the datasets.