Electronic Health Records

Large datasets of clinical notes for natural language processing.

Introduction

AI models in healthcare depend on access to high-quality Electronic Health Record (EHR) datasets. These datasets hold detailed patient information, including medical histories, treatment plans, medication records, and diagnostic data, which researchers and developers use to build more accurate algorithms for patient care and clinical outcomes. The datasets listed below are among the largest EHR and health-claims resources used in research. Each was selected for its size, the diversity of its population, and its relevance to current research needs in healthcare AI.

Datasets

MIMIC-III (Medical Information Mart for Intensive Care III)
Description: A publicly available dataset containing de-identified health data from over 40,000 critical care patients. It includes demographics, vital signs, laboratory results, medications, and more.
Size: Over 40,000 patients

MIMIC-IV (Medical Information Mart for Intensive Care IV)
Description: An updated version of the MIMIC-III dataset, containing comprehensive data from ICU patients, including new features and enhanced data quality.
Size: Over 70,000 patients

eICU Collaborative Research Database
Description: A multi-center intensive care unit (ICU) database that includes detailed patient data from over 200,000 admissions to ICUs across the United States.
Size: Over 200,000 ICU admissions

CPRD (Clinical Practice Research Datalink)
Description: A UK-based dataset providing anonymized longitudinal medical records from primary care. It includes data on diagnoses, treatments, and health outcomes.
Size: Over 11 million patients

MarketScan Commercial Claims and Encounters Database
Description: A comprehensive dataset containing de-identified patient data from private sector health plans, including demographics, diagnoses, procedures, and prescription drug use.
Size: Over 250 million patient records

UK Biobank
Description: A large-scale biomedical database containing in-depth genetic and health information from half a million UK participants, including EHR data.
Size: 500,000 participants

All of Us Research Program
Description: An extensive EHR dataset as part of a research program aimed at gathering health data from diverse populations across the United States to advance precision medicine.
Size: Over 1 million participants (target)

OptumLabs Data Warehouse
Description: A comprehensive dataset providing de-identified EHR data from a large, diverse patient population, including medical and prescription drug claims.
Size: Over 200 million lives covered

Kaiser Permanente HealthConnect
Description: One of the largest private sector EHR systems in the world, containing data from millions of Kaiser Permanente members across multiple regions.
Size: Over 11.7 million members

Veterans Affairs (VA) EHR Database
Description: A comprehensive EHR system for U.S. veterans, including detailed health records from VA medical facilities across the country.
Size: Over 9 million veterans
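
For readers who want to compare the datasets listed above programmatically, the following is a minimal Python sketch that turns the size strings into numbers and ranks the entries. The `CATALOG` layout and the helper names `parse_size` and `rank_by_size` are illustrative assumptions and are not part of any dataset's official tooling. The names are shortened and the size strings are copied from the list above.

```python
import re

# (name, size string) pairs copied from the list above.
# The tuple layout is an assumption made for this sketch.
CATALOG = [
    ("MIMIC-III", "Over 40,000 patients"),
    ("MIMIC-IV", "Over 70,000 patients"),
    ("eICU Collaborative Research Database", "Over 200,000 ICU admissions"),
    ("CPRD", "Over 11 million patients"),
    ("MarketScan", "Over 250 million patient records"),
    ("UK Biobank", "500,000 participants"),
    ("All of Us Research Program", "Over 1 million participants (target)"),
    ("OptumLabs Data Warehouse", "Over 200 million lives covered"),
    ("Kaiser Permanente HealthConnect", "Over 11.7 million members"),
    ("Veterans Affairs (VA) EHR Database", "Over 9 million veterans"),
]

# First number in the string, optionally followed by the word "million".
_SIZE = re.compile(r"([\d,.]+)\s*(million)?", re.IGNORECASE)


def parse_size(text):
    """Return the first count found in a size string as an integer."""
    match = _SIZE.search(text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    value = float(match.group(1).replace(",", ""))
    if match.group(2):
        value *= 1_000_000
    return round(value)


def rank_by_size(catalog):
    """Return (name, count) pairs sorted from largest to smallest count."""
    pairs = ((name, parse_size(size)) for name, size in catalog)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, count in rank_by_size(CATALOG):
        print(f"{count:>12,}  {name}")
```

The counts are stated in different units (patients, ICU admissions, patient records, covered lives, and an enrollment target), so the resulting order is only a rough guide to relative scale, not a like-for-like comparison.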