Understanding the BigQuery DICOM schema

This page explains the schema of the BigQuery table that is created when exporting DICOM metadata to BigQuery.

Terminology

To understand the schema and its components, you should be familiar with the DICOM terminology. In particular, this page uses several terms found in 3.10 DICOM Data Structures and Encoding Definitions.

Overview

The Cloud Healthcare API automatically generates the BigQuery schema using the data you are exporting as well as the DICOM dictionary. The schema will only contain columns for DICOM data elements that exist in the metadata. The only exception to this is the Person Name VR.

When exporting DICOM metadata, the Cloud Healthcare API attempts to export all data elements in the metadata. For information on what happens if an issue arises, see Conflicting and mismatching types.

Standard and private data elements

DICOM provides standard data elements that conform to a predefined specification. For a list of these data elements, see Registry of DICOM Data Elements.

In cases where you must communicate data that doesn't conform to the standard elements, you can use private data elements.

Standard data elements

The following behaviors apply to standard data elements. For private data element behavior, see Private data elements.

Column names

Columns in the generated BigQuery schema are named according to the keyword of the data element. For example, if the DICOM metadata contains a data element whose keyword is InstanceCreationDate, then the generated schema will have a corresponding column named InstanceCreationData.

Standard DICOM data element behavior

The following table shows a list of Value Representations (VRs) and their abbreviations. For any data element exported to BigQuery that contains one of these VRs, the data element will use the BigQuery data type found under "Data type":

VR Data type
  • Application Entity (AE)
  • Age String (AS)
  • Code String (CS)
  • Long String (LO)
  • Long Text (LT)
  • Short String (SH)
  • Short Text (ST)
  • Unlimited Characters (UC)
  • Unique Identifier (UI/UID)
  • Universal Resource Identifier (UR) or Universal Resource Locator (URI/URL)
  • Unlimited Text (UT)
String
Date (DA) Date
Time (TM) Time
Date Time (DT) Timestamp
  • Decimal String (DS)
  • Integer String (IS)
String
Person Name (PN) Struct (Record)
  • Floating Point Single (FL)
  • Floating Point Double (FD)
Floating point
  • Attribute Tag (AT)
  • Signed Long (SL)
  • Signed Short (SS)
  • Unsigned Long (UL)
  • Unsigned Short (US)
Integer
Sequence of Items (SQ) Struct (Record)

Nullable and repeated modes

Depending on the Value Multiplicity (VM) value of a data element, its BigQuery column will have one of two modes: NULLABLE or REPEATED.

If a data element has a VM value of 1, which indicates that the data element is unique, the data element will use the NULLABLE mode. For any other VM value, the data element will use the REPEATED mode.

For example, as shown in the Registry of DICOM Data Elements, the SOPInstanceUID keyword has a VM value of 1. As a result, when it is exported to BigQuery, its mode will be NULLABLE, and its representation in the table will look like so (when represented as JSON):

"SOPInstanceUID": "0.0.000.000000.0.000.0000.0000000.0000.0000000000.000",

Conversely, the ImageType keyword has a VM value of 2-n. As a result, when it is exported to BigQuery, its mode will be REPEATED, and its representation in the table will look like so (when represented as JSON):

"ImageType": [
  "ORIGINAL",
  "PRIMARY",
  "OTHER",
  "..."
],

Excluded VRs

Binary data is not exported to the generated BigQuery table, so data elements containing the following VRs are not exported:

  • Other Byte (OB)
  • Other Double (OD)
  • Other Float (OF)
  • Other Long (OL)
  • Other Word (OW)
  • Unknown (UN)

Person Name VR

Each column in the BigQuery schema with a Person Name (PN) VR always contains three sub-columns, regardless of whether the sub-columns contain any data. The three sub-columns are:

  • Alphabetic
  • Ideographic
  • Phonetic

Each of these sub-columns have their own five sub-columns:

  • FamilyName
  • GivenName
  • MiddleName
  • NamePrefix
  • NameSuffix

For example, consider the public tag "OperatorsName (0008,1070)," which has a VR of Person Name (PN). Suppose that the value of OperatorsName is "Darcy Smith." The schema will contain an OperatorsName column containing all of the sub-columns above, but only Alphabetic.FamilyName (Smith) and Alphabetic.GivenName (Darcy) will contain values.

Private data elements

Some clinical implementations might require you to store custom data that does not fit within the structure of public data elements. As an alternative, you can use private data elements.

Private data elements with a VR of SQ (Sequence of Items) have the same behavior as standard data elements. These are called private data sequences.

Private data elements that do not have a VR of SQ are nested under a column called OtherElements and are converted to strings. These private data elements are called non-sequence private data. To query non-sequence private data elements, your query must search within the OtherElements column of the element.

The OtherElements column contains two sub-columns, "Data" and "Tag." The Data column is the string representation of the private data element's value. It is always type REPEATED. The Tag column uses the format "Tag_HEX" where HEX is a hex string of the tag number.

Conflicting and mismatching types

If a type conflict occurs, such as when a public tag is used with an incorrect type, the public tag is treated as though it were a private tag. This means that the value of the data element is nested under a column called OtherElements and is converted to a string.

For example, say that the DICOM metadata contains a tag with:

  • A tag number "(4010,1017)"
  • A VR of SL (Signed Long)
  • A value of 32

(4010,1017) is same tag number as "Mass," which is a public tag name in the DICOM specification that has a VR of FL. The export operation expects a data element with the tag number of "(4010,1017)" to be the "Mass" public tag name with a VR of FL, and so expects to convert its value to a float (as shown in the table in Standard DICOM data element behavior

A type conflict occurs because any tags with a VR of SL will use the integer data type. The tag is therefore converted to a private tag and added to the OtherElements column.

If a non-sequence public tag name is used for sequence data, a type mismatch occurs. As a result, the sequence is treated as though it were a private data element. Instead of using the public tag name as the column name in the BigQuery schema, the public tag name's hex number is used. The hex number is of type string.

Examples: Querying public and private data elements

Consider the following snippet of a schema represented as JSON. The schema was created after exporting DICOM data to BigQuery.

[
  ...
  {
    "name": "SOPInstanceUID",
    "type": "STRING"
  },
  {
    "fields": [
      {
        "fields": [
          {
            "mode": "REQUIRED",
            "name": "Tag",
            "type": "STRING"
          },
          {
            "mode": "REPEATED",
            "name": "Data",
            "type": "STRING"
          }
        ],
        "mode": "REPEATED",
        "name": "OtherElements",
        "type": "RECORD"
      }
    ],
    "mode": "REPEATED",
    "name": "Tag_12345678",
    "type": "RECORD"
  }
  ...
]

The following sample shows how to query for the SOPInstanceUID public data element. To access the value of the column, run the following query:

#standardSQL
SELECT
  SOPInstanceUID
FROM
  `PROJECT_ID.DATASET_ID.TABLE_ID`

Running the query returns something similar to the following:

[
  ...
  {
    "SOPInstanceUID": "0.0.000.000000.0.000.0000.0000000.0000.0000000000.000"
  },
  ...
]

The following sample shows how to query for non-sequence private data. Run the following query against the OtherElements column which is inside of the Tag_12345678 column. Note the use of the UNNEST operator, which is required because you are querying a RECORD.

#standardSQL
SELECT
  Tag_12345678.OtherElements AS OtherElements
FROM
  `PROJECT_ID.DATASET_ID.TABLE_ID`,
  UNNEST(Tag_12345678) AS Tag_12345678

Running the query returns something similar to the following, depending on the amount and type of data in the Tag_12345678.OtherElements column:

[
  {
    "OtherElements": [
      {
        "Tag": "Tag_12345678",
        "Data": [
          "DATA"
        ]
      }
    ]
  },
  {
    "OtherElements": [
      {
        "Tag": "Tag_12345678",
        "Data": [
          "DATA"
        ]
      }
    ]
  },
  {
    "OtherElements": [
      {
        "Tag": "Tag_12345678",
        "Data": [
          "DATA"
        ]
      }
    ]
  }
]

What's next

Learn more about BigQuery standard SQL operations and view examples.

Hat Ihnen diese Seite weitergeholfen? Teilen Sie uns Ihr Feedback mit:

Feedback geben zu...

Cloud Healthcare API