This page explains the schema of the BigQuery table that is created when exporting DICOM metadata to BigQuery.
Terminology
To understand the schema and its components, familiarize yourself with the DICOM terminology. In particular, this page uses several terms found in 3.10 DICOM Data Structures and Encoding Definitions.
Overview
The Cloud Healthcare API automatically generates the BigQuery schema using the data you are exporting and the DICOM dictionary. The schema only contains columns for DICOM data elements that exist in the metadata. The only exception is the Person Name VR.
When exporting DICOM metadata, the Cloud Healthcare API attempts to export all data elements in the metadata. For information on what happens if an issue arises, see Conflicting and mismatching types.
Standard and private data elements
DICOM provides standard data elements that conform to a predefined specification. For a list of these data elements, see Registry of DICOM Data Elements.
In cases where you must communicate data that doesn't conform to the standard elements, you can use private data elements.
Standard data elements
The following behaviors apply to standard data elements. For private data element behavior, see Private data elements.
Column names
Columns in the generated BigQuery schema are named according to the keyword of the data element. For example, if the DICOM metadata contains a data element whose keyword is InstanceCreationDate, then the generated schema has a corresponding column named InstanceCreationDate.
Standard DICOM data element behavior
The following table shows a list of Value Representations (VRs) and their abbreviations. For any data element exported to BigQuery that contains one of these VRs, the data element uses the BigQuery data type found under "Data type":
VR | Data type |
---|---|
|
String |
Date (DA) | Date |
Time (TM) | Time |
Date Time (DT) | Timestamp |
|
String |
Person Name (PN) | Struct (Record) |
|
Floating point |
|
Integer |
Sequence of Items (SQ) | Struct (Record) |
Nullable and repeated modes
Depending on the Value Multiplicity (VM) value of a data element, its
BigQuery column has one of two
modes: NULLABLE
or REPEATED
.
If a data element has a VM value of 1, which indicates that the data element
is unique, the data element uses the NULLABLE
mode. For any other VM value,
the data element uses the REPEATED
mode.
For example, as shown in the Registry of DICOM Data Elements,
the SOPInstanceUID keyword has a VM value of 1. As a result,
when it is exported to BigQuery, its mode is NULLABLE
,
and its representation in the table looks like the following (when represented as
JSON):
"SOPInstanceUID": "0.0.000.000000.0.000.0000.0000000.0000.0000000000.000",
Conversely, the ImageType keyword has a VM value of 2-n.
As a result, when it is exported to BigQuery, its mode is
REPEATED
, and its representation in the table looks like the following (when
represented as JSON):
"ImageType": [
"ORIGINAL",
"PRIMARY",
"OTHER",
"..."
],
Excluded VRs
Binary and long-form data is not exported to the generated BigQuery
table, so data elements containing the following VRs are not exported. Instead,
the following VRs are included in a separate column (called
DroppedTags.TagName
) in the destination BigQuery table.
- Other Double (OD)
- Other Float (OF)
- Other Long (OL)
- Other Byte (OB)
- Other Word (OW)
- Unknown (UN)
- Sequence (SQ) tags containing more than approximately 1 MiB of data
- Attribute (AT), Floating Point Double (FD), Floating Point Single (FL),
Unsigned Long (UL), or Unsigned Short (US), if the Value Multiplicity (VM) is
greater than 512.
- For legacy reasons, tags of instances already ingested into the
Cloud Healthcare API might be included in the
DroppedTags.TagName
column if the Value Multiplicity is greater than 64.
- For legacy reasons, tags of instances already ingested into the
Cloud Healthcare API might be included in the
Person Name VR
Each column in the BigQuery schema with a Person Name (PN) VR always contains three subcolumns, regardless of whether the subcolumns contain any data. The three subcolumns are:
- Alphabetic
- Ideographic
- Phonetic
Each of the three subcolumns has its own five subcolumns:
- FamilyName
- GivenName
- MiddleName
- NamePrefix
- NameSuffix
For example, consider the public tag "OperatorsName (0008,1070)," which has a VR of Person Name (PN). Suppose that the value of OperatorsName is "Darcy Smith." The schema will contain an OperatorsName column containing the subcolumns listed previously, but only Alphabetic.FamilyName (Smith) and Alphabetic.GivenName (Darcy) will contain values.
Private data elements
Some clinical implementations might require you to store custom data that does not fit within the structure of public data elements. As an alternative, you can use private data elements.
Private data elements with a VR of SQ (Sequence of Items) have the same behavior as standard data elements. Private data elements with a VR of SQ are called private data sequences.
Private data elements that do not have a VR of SQ are nested under a column
called OtherElements
and are converted to strings. These private data
elements are called non-sequence private data. To
query non-sequence private data elements, your query must search within the
OtherElements
column of the element.
The OtherElements
column contains two subcolumns, "Data" and "Tag." The
Data column is the string representation of the private data element's value.
It is always type REPEATED
. The Tag column uses the format
"Tag_HEX" where HEX is a hex string of the tag number.
LastUpdated
and Type
columns
The LastUpdated
and Type
columns are added to the BigQuery
table created when you export DICOM metadata. These columns are not standard
or private data elements, and they do not correspond to the Registry of
DICOM Data Elements.
The behavior of these columns is as follows:
- The
LastUpdated
column contains a timestamp value that shows when the DICOM instance was inserted into or deleted from the DICOM store. - The
Type
column contains a string that shows what type of operation occurred. The possible values areCREATE
orDELETE
.
Conflicting and mismatching types
If a type conflict occurs, such as when a public tag is used with an incorrect
type, the public tag is treated as though it were a private tag. The value of
the data element is
nested under a column called OtherElements
and the value is converted to a string.
For example, say that the DICOM metadata contains a tag with:
- A tag number "(4010,1017)"
- A VR of SL (Signed Long)
- A value of 32
(4010,1017) is same tag number as "Mass," which is a public tag name in the DICOM specification that has a VR of FL. The export operation expects a data element with the tag number of "(4010,1017)" to be the "Mass" public tag name with a VR of FL. Therefore, the export operation expects to convert the data element's value to a float (as shown in the table in Standard DICOM data element behavior
A type conflict occurs because any tags with a VR of SL use the integer
data type. The tag is therefore converted to a private tag and added to the
OtherElements
column.
If a non-sequence public tag name is used for sequence data, a type mismatch occurs. As a result, the sequence is treated as though it were a private data element. Instead of using the public tag name as the column name in the BigQuery schema, the public tag name's hex number is used. The hex number is of type string.
Examples: Querying public and private data elements
Consider the following snippet of a schema represented as JSON. The schema was created after exporting DICOM data to BigQuery.
[
...
{
"name": "SOPInstanceUID",
"type": "STRING"
},
{
"fields": [
{
"fields": [
{
"mode": "REQUIRED",
"name": "Tag",
"type": "STRING"
},
{
"mode": "REPEATED",
"name": "Data",
"type": "STRING"
}
],
"mode": "REPEATED",
"name": "OtherElements",
"type": "RECORD"
}
],
"mode": "REPEATED",
"name": "Tag_12345678",
"type": "RECORD"
}
...
]
The following sample shows how to query for the SOPInstanceUID
public data element.
To access the value of the column, run the following query:
#standardSQL SELECT SOPInstanceUID FROM `PROJECT_ID.DATASET_ID.TABLE_ID`
Running the query returns an output similar to the following:
[ ... { "SOPInstanceUID": "0.0.000.000000.0.000.0000.0000000.0000.0000000000.000" }, ... ]
The following sample shows how to query for non-sequence private data.
Run the following query against the OtherElements
column which is
inside the Tag_12345678
column. Note the use of the UNNEST
operator, which is required because you are querying a RECORD
.
#standardSQL SELECT Tag_12345678.OtherElements AS OtherElements FROM `PROJECT_ID.DATASET_ID.TABLE_ID`, UNNEST(Tag_12345678) AS Tag_12345678
Running the query returns an output similar to the following, depending
on the amount and type of data in the Tag_12345678.OtherElements
column:
[ { "OtherElements": [ { "Tag": "Tag_12345678", "Data": [ "DATA" ] } ] }, { "OtherElements": [ { "Tag": "Tag_12345678", "Data": [ "DATA" ] } ] }, { "OtherElements": [ { "Tag": "Tag_12345678", "Data": [ "DATA" ] } ] } ]
What's next
Learn more about BigQuery standard SQL operations and view examples.