The NIH Chest X-ray dataset consists of 100,000 de-identified chest x-ray images in PNG format.
The data is provided by the NIH Clinical Center and is available through the NIH download site: https://nihcc.app.box.com/v/ChestXray-NIHCC
You can also access the data via Google Cloud (GCP), as described in Google Cloud data access.
License and attribution
There are no restrictions on the use of the NIH chest x-ray images. However, the dataset has the following attribution requirements:
Provide a link to the NIH download site: https://nihcc.app.box.com/v/ChestXray-NIHCC
Include a citation to the CVPR 2017 paper:
Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, Ronald Summers, ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases, IEEE CVPR, pp. 3462-3471, 2017
Acknowledge that the NIH Clinical Center is the data provider
Google Cloud data access
You can access the NIH chest x-ray images through Cloud Storage, BigQuery, or the Cloud Healthcare API.
Cloud Storage
The NIH chest x-ray data is available in the following Cloud Storage bucket:
The bucket contains the original PNG files as well as DICOM instances, at the following paths:
PNG (provided by NIH):
DICOM (provided by Google):
The Cloud Storage bucket uses the "Requester Pays" model for billing. Your Google Cloud project will be billed for the charges associated with accessing the NIH data. For more information, see Requester Pays.
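As a minimal sketch of Requester Pays access with the `google-cloud-storage` Python client: the `user_project` argument names the project to bill for the request. The bucket and object names below are placeholders, not the actual paths, which are listed above.

```python
def download_png(bucket_name: str, blob_path: str, billing_project: str) -> bytes:
    """Download one image from a Requester Pays bucket.

    bucket_name and blob_path are placeholders; substitute the NIH bucket
    and a PNG path from the listing above. billing_project is the Google
    Cloud project that will be charged for the access.
    """
    # Imported inside the function so the sketch stays importable even
    # without the client library installed.
    from google.cloud import storage

    client = storage.Client(project=billing_project)
    # user_project tells Cloud Storage which project to bill (Requester Pays).
    bucket = client.bucket(bucket_name, user_project=billing_project)
    return bucket.blob(blob_path).download_as_bytes()
```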
BigQuery
The NIH chest x-ray data is available in the Google Cloud project in BigQuery.
For information about accessing public data in BigQuery, see BigQuery public datasets.
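A query against the public metadata table might look like the following sketch, using the `google-cloud-bigquery` Python client. The fully qualified table name is a placeholder, and the `view_position` column name is an assumption for illustration; check the actual table schema in BigQuery.

```python
def count_images_by_view(billing_project: str, table: str):
    """Run a small aggregation over the public metadata table.

    `table` is a placeholder for the fully qualified BigQuery table name of
    the NIH chest x-ray metadata; query costs are billed to billing_project.
    """
    # Imported inside the function so the sketch stays importable even
    # without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    query = f"""
        SELECT view_position, COUNT(*) AS n
        FROM `{table}`
        GROUP BY view_position
        ORDER BY n DESC
    """
    return list(client.query(query).result())
```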
Cloud Healthcare API
The NIH chest x-ray data is available in the following DICOM store hierarchy in Cloud Healthcare API:
To request access to the NIH chest x-ray dataset, complete this form.
You can also use the viewers that are integrated with the Cloud Healthcare API:
IMS CloudVue: https://cloudvue.imstsvc.com
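Instances in the DICOM store can be fetched over the Cloud Healthcare API's DICOMweb interface. The sketch below shows a RetrieveStudy request; `base_url` is a placeholder for the store's full `dicomWeb` path (of the form `https://healthcare.googleapis.com/v1/projects/.../locations/.../datasets/.../dicomStores/.../dicomWeb`), and the access token must come from your own credentials.

```python
def retrieve_study(base_url: str, study_uid: str, access_token: str) -> bytes:
    """Fetch a DICOM study via the DICOMweb RetrieveStudy transaction.

    base_url is a placeholder for the DICOM store's dicomWeb path;
    study_uid is the Study Instance UID to retrieve.
    """
    # Imported inside the function so the sketch stays importable even
    # without the requests library installed.
    import requests

    resp = requests.get(
        f"{base_url}/studies/{study_uid}",
        headers={
            "Authorization": f"Bearer {access_token}",
            # Ask for the study as multipart DICOM, any transfer syntax.
            "Accept": 'multipart/related; type="application/dicom"; transfer-syntax=*',
        },
    )
    resp.raise_for_status()
    return resp.content
```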
Additional labels for the NIH chest x-ray data are available in the following Cloud Storage bucket:
For more details on these labels, see our paper in Radiology.
How these Labels Were Created
The final labels for each image were assigned via adjudicated review by three radiologists. Each image was first reviewed independently by three radiologists. For the test set, the three radiologists were selected at random for each image from a cohort of 11 radiologists certified by the American Board of Radiology. For the validation set, the three radiologists were selected from a cohort of 13 individuals, including board-certified radiologists and radiology residents.
If all readers were in agreement after the initial review, that label became final. Images with label disagreements were returned for additional review; anonymous labels and any notes from the previous rounds were available during each iterative review. Adjudication proceeded until consensus, up to a maximum of 5 rounds. For the small number of images for which consensus was not reached, the majority-vote label was used.
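The stopping rule described above can be sketched as follows. This is an illustrative re-implementation, not the tooling actually used for the review: it stops at the first unanimous round and falls back to a majority vote of the last round if consensus is never reached.

```python
from collections import Counter

MAX_ROUNDS = 5  # maximum number of adjudication rounds, per the text above


def adjudicate(rounds):
    """Resolve a final label from successive review rounds.

    `rounds` is a list of per-round label lists, one label per reader,
    e.g. [["YES", "NO", "YES"], ["YES", "YES", "YES"]]. The first
    unanimous round decides the label; otherwise the majority vote of
    the final round is used.
    """
    for labels in rounds[:MAX_ROUNDS]:
        if len(set(labels)) == 1:  # unanimous -> label is final
            return labels[0]
    # No consensus within MAX_ROUNDS: fall back to majority vote.
    return Counter(rounds[-1]).most_common(1)[0][0]
```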
Information available at the time of review included only patient age and image view (AP vs. PA). Additional clinical information was not available. For nodule/mass and pneumothorax, the possible labels were: present, absent, or "hedge" (i.e., uncertain if present or absent). For opacity and fracture, the possible label values were only present or absent.
How to Use these Labels
In the CSV titled individual_readers.csv, each row corresponds to the label for each of the four conditions provided by a single reader for a single image. This means that each image ID and the corresponding adjudication result are repeated across multiple rows (one row per reader). The reader ID is provided for stable linking across images. A cell value of YES means "present," NO means "absent," and HEDGE means "uncertain."
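One way to gather each reader's label for a given image and condition is sketched below with the standard-library csv module. The sample data and its column names are assumptions for illustration; check the header of the real individual_readers.csv.

```python
import csv
import io

# Tiny synthetic stand-in for individual_readers.csv. Column names here
# are assumed for illustration, not taken from the real file.
SAMPLE = """\
image_id,reader_id,fracture,pneumothorax,airspace_opacity,nodule_mass
img_001.png,r1,NO,NO,YES,NO
img_001.png,r2,NO,HEDGE,YES,NO
img_001.png,r3,NO,NO,YES,NO
"""


def reader_labels(csv_text, image_id, condition):
    """Collect each reader's label (YES / NO / HEDGE) for one condition."""
    return {
        row["reader_id"]: row[condition]
        for row in csv.DictReader(io.StringIO(csv_text))
        if row["image_id"] == image_id
    }


labels = reader_labels(SAMPLE, "img_001.png", "pneumothorax")
```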
In the CSVs titled
the metadata provided as part of the NIH chest x-ray dataset has been augmented with four columns, one for the adjudicated label for each of the four conditions: fracture, pneumothorax, airspace opacity, and nodule/mass. There are 1,962 unique image IDs in the test set and 2,412 unique image IDs in the validation set, for a total of 4,374 images with adjudicated labels. Only YES and NO appear in the adjudication label columns. If a column value is missing, the image was not included in the adjudicated image set.
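Since a missing value means the image was not adjudicated, selecting the adjudicated subset amounts to keeping rows where all four adjudication columns are non-empty. A sketch, again with assumed column names and synthetic data:

```python
import csv
import io

# Synthetic stand-in for the augmented metadata CSVs. The real files append
# one adjudicated-label column per condition; column names are assumed here.
SAMPLE = """\
image_id,fracture,pneumothorax,airspace_opacity,nodule_mass
a.png,NO,NO,YES,NO
b.png,,,,
c.png,YES,NO,NO,NO
"""

CONDITIONS = ("fracture", "pneumothorax", "airspace_opacity", "nodule_mass")


def adjudicated_rows(csv_text, conditions=CONDITIONS):
    """Keep only images in the adjudicated set, i.e. rows where every
    adjudication column is non-empty (empty means not adjudicated)."""
    return [
        row
        for row in csv.DictReader(io.StringIO(csv_text))
        if all(row[c] for c in conditions)
    ]


rows = adjudicated_rows(SAMPLE)
```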
When using these labels, please include the following citation:
Anna Majkowska, Sid Mittal, David F. Steiner, Joshua J. Reicher, Scott Mayer McKinney, Gavin E. Duggan, Krish Eswaran, PoHsuan Cameron Chen, Yun Liu, Sreenivasa Raju Kalidindi, Alexander Ding, Greg S. Corrado, Daniel Tse, Shravya Shetty, Chest Radiograph Interpretation Using Deep Learning Models: Assessment Using Radiologist Adjudicated Reference Standards and Population-Adjusted Evaluation, Radiology, 2019.
For more information on the License and Attribution of the NIH chest x-ray dataset, see the License and attribution section above.
Why to Use these Labels
Using a single reader or a majority-vote approach across multiple readers can introduce errors or inconsistencies into the labels used for model development and evaluation, which in turn may yield less reliable estimates of model performance.
For example, if only one of three readers correctly detects a challenging finding, the majority vote will overrule that reading. In that case, not only would the model be limited in its ability to detect similar findings (they would be absent from the training data), but the evaluation results would also fail to reflect these errors (an incorrect reference standard), falsely inflating the model's accuracy. Expert adjudication is a more rigorous approach that can lead to higher-quality model development and evaluation.