NIH Chest X-ray dataset

The NIH Chest X-ray dataset consists of over 100,000 de-identified chest x-ray images, provided in PNG format.

The data is provided by the NIH Clinical Center and is available through the NIH download site: https://nihcc.app.box.com/v/ChestXray-NIHCC

You can also access the data via Google Cloud (GCP), as described in Google Cloud data access.

License and attribution

There are no restrictions on the use of the NIH chest x-ray images. However, the dataset has the following attribution requirements:

  • Provide a link to the NIH download site: https://nihcc.app.box.com/v/ChestXray-NIHCC

  • Include a citation to the CVPR 2017 paper:

    Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, Ronald Summers, ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases, IEEE CVPR, pp. 3462-3471, 2017

  • Acknowledge that the NIH Clinical Center is the data provider

Google Cloud data access

You can access the NIH chest x-ray images through Cloud Storage, BigQuery, or the Cloud Healthcare API.

Cloud Storage

The NIH chest x-ray data is available in the following Cloud Storage bucket:

gs://gcs-public-data--healthcare-nih-chest-xray


The bucket contains the original PNG files as well as DICOM instances, at the following paths:

PNG (provided by NIH):

gs://gcs-public-data--healthcare-nih-chest-xray/png/FILENAME.png

DICOM (provided by Google):

gs://gcs-public-data--healthcare-nih-chest-xray/dicom/FILENAME.dcm

The Cloud Storage bucket uses the "Requester Pays" model for billing. Your Google Cloud project will be billed for the charges associated with accessing the NIH data. For more information, see Requester Pays.

BigQuery

The NIH chest x-ray data is available in the chc-nih-chest-xray Google Cloud project in BigQuery.


For information about accessing public data in BigQuery, see BigQuery public datasets.
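
The following Python sketch uses the BigQuery client library to discover the datasets and tables exposed by the chc-nih-chest-xray project, assuming the public project grants list access; it does not assume a particular table name, and the billing project ID is a placeholder.

from google.cloud import bigquery

# List operations and queries are billed to your own project.
client = bigquery.Client(project="my-billing-project")  # placeholder project ID

# Discover the datasets and tables in the public project.
for dataset in client.list_datasets(project="chc-nih-chest-xray"):
    print(dataset.dataset_id)
    for table in client.list_tables(dataset.reference):
        print("  " + table.table_id)

Once you know the table name, you can query it with client.query() as you would any other BigQuery public dataset.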

Cloud Healthcare API

The NIH chest x-ray data is available in the following DICOM store in the Cloud Healthcare API:

Project: chc-nih-chest-xray
Dataset: nih-chest-xray
DICOM store: nih-chest-xray

To request access to the NIH chest x-ray dataset, complete this form.


For more information, see the DICOM overview and Using the DICOMweb Standard.
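
Assuming you have been granted access through the form above, the following Python sketch issues a DICOMweb search-for-studies request against the store listed above using application default credentials. The us-central1 location is an assumption and may differ for this store.

import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

# The location (us-central1) is an assumption; adjust it if needed.
dicomweb_url = (
    "https://healthcare.googleapis.com/v1"
    "/projects/chc-nih-chest-xray/locations/us-central1"
    "/datasets/nih-chest-xray/dicomStores/nih-chest-xray"
    "/dicomWeb/studies"
)

# Search for studies; the response is a JSON array of DICOM metadata.
response = session.get(
    dicomweb_url, headers={"Accept": "application/dicom+json"}
)
response.raise_for_status()
print(response.json()[0])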

Data viewers

You can also use the viewers that are integrated with the Cloud Healthcare API:

eUnity: https://demo.eunity.app

IMS CloudVue: https://cloudvue.imstsvc.com

Additional labels

Additional labels for the NIH chest x-ray data are available in the following Cloud Storage bucket:

gs://gcs-public-data--healthcare-nih-chest-xray-labels


For more details on these labels, see our paper in Radiology.

How these labels were created

The final labels for each image were assigned through an adjudicated review process. Each image was first reviewed independently by three radiologists. For the test set, the radiologists for each image were selected at random from a cohort of 11 radiologists certified by the American Board of Radiology. For the validation set, the three radiologists were selected from a cohort of 13 individuals, including board-certified radiologists and radiology residents.

If all readers were in agreement after the initial review, that label became final. Images with label disagreements were returned for additional rounds of review. Anonymous labels and any notes from previous rounds were available to the readers during each subsequent round. Adjudication proceeded until consensus was reached, up to a maximum of five rounds. For the small number of images on which consensus was not reached, the majority-vote label was used.

The only information available at the time of review was patient age and image view (AP vs. PA); no additional clinical information was available. For nodule/mass and pneumothorax, the possible labels were present, absent, or "hedge" (i.e., uncertain whether present or absent). For opacity and fracture, the possible labels were only present or absent.

How to use these labels

In the CSV titled individual_readers.csv, each row contains the labels assigned by a single reader to a single image for each of the four conditions. This means that each image ID and the corresponding adjudication result are repeated across multiple rows (one row per reader). A reader ID is also provided and is stable across images, so labels from the same reader can be linked. A cell value of YES means "present," NO means "absent," and HEDGE means "uncertain."
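
For example, the following pandas sketch reads the per-reader CSV and maps the YES/NO/HEDGE values to a numeric encoding. The condition column name used here is an assumption; check the actual CSV header and adjust it.

import pandas as pd

readers = pd.read_csv("individual_readers.csv")
print(readers.columns.tolist())  # confirm the actual column names first

# Assumed condition column name; replace with the real header value.
CONDITION = "pneumothorax"

# Map the label strings to numbers, leaving HEDGE as a missing value.
value_map = {"YES": 1, "NO": 0, "HEDGE": None}
readers[CONDITION + "_numeric"] = readers[CONDITION].map(value_map)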

In the CSVs titled validation_labels.csv and test_labels.csv, the metadata provided as part of the NIH chest x-ray dataset has been augmented with four columns, one for the adjudicated label of each of the four conditions: fracture, pneumothorax, airspace opacity, and nodule/mass. There are 1,962 unique image IDs in the test set and 2,412 unique image IDs in the validation set, for a total of 4,374 images with adjudicated labels. Only YES and NO appear in the adjudication label columns. If a column value is missing, the image was not included in the adjudicated image set.
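
A small pandas sketch can verify the split sizes described above. The "Image Index" column name follows the NIH metadata convention but is an assumption here; adjust it if the CSV header differs.

import pandas as pd

val = pd.read_csv("validation_labels.csv")
test = pd.read_csv("test_labels.csv")

# "Image Index" is assumed from the NIH metadata convention.
print(val["Image Index"].nunique())   # expected: 2,412 unique validation image IDs
print(test["Image Index"].nunique())  # expected: 1,962 unique test image IDs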

When using these labels, please include the following citation:

Anna Majkowska, Sid Mittal, David F. Steiner, Joshua J. Reicher, Scott Mayer McKinney, Gavin E. Duggan, Krish Eswaran, Po-Hsuan Cameron Chen, Yun Liu, Sreenivasa Raju Kalidindi, Alexander Ding, Greg S. Corrado, Daniel Tse, Shravya Shetty, Chest Radiograph Interpretation Using Deep Learning Models: Assessment Using Radiologist Adjudicated Reference Standards and Population-Adjusted Evaluation, Radiology, 2019.

For more information on licensing and attribution for the NIH chest x-ray dataset, see the License and attribution section above.

Why use these labels

Using a single reader or a majority-vote approach across multiple readers can lead to errors or inconsistencies in the resulting labels that are used for model development and evaluation. This may in turn lead to less reliable estimates of model performance.

For example, if only one out of three readers correctly detects a challenging finding, that reader will be overruled under a majority-vote approach. In that case, not only would the model be limited in its ability to detect similar findings (which would be absent from the training data), but the evaluation results would also fail to reflect these errors (an incorrect reference standard), falsely inflating the model's accuracy. Expert adjudication is a more rigorous approach that can lead to higher-quality model development and evaluation.
