Genomics

Large-scale DNA datasets for building machine learning models.

Introduction

Genomics is changing how we understand health and disease by revealing genetic variation and its consequences. Genomics datasets are essential for developing AI models that analyze complex genetic data, identify patterns, and make predictions. These datasets contain DNA sequences, gene expression data, and associated clinical annotations. With them, researchers and developers can build algorithms that advance genomic research and improve healthcare outcomes. Below is a curated list of some of the largest and most comprehensive genomics datasets available, selected for their size, diversity, and relevance to current research needs.
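As a rough illustration of how a DNA sequence from any of these datasets might be turned into numeric model input, the sketch below one-hot encodes a short sequence. The function name `one_hot_encode`, the A/C/G/T column order, and the choice to map unknown symbols such as N to an all-zero row are assumptions made for this example, not conventions prescribed by the datasets listed here.

```python
# Hypothetical helper for illustration: one-hot encode a DNA string.
# Column order (A, C, G, T) is an assumption; any fixed order works.
BASES = "ACGT"


def one_hot_encode(sequence):
    """Return one row per base, with 1.0 in the column of that base.

    Symbols outside A/C/G/T (for example the ambiguity code N) become
    an all-zero row, which is one common but not universal choice.
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for symbol in sequence.upper():
        row = [0.0] * len(BASES)
        position = index.get(symbol)
        if position is not None:
            row[position] = 1.0
        encoded.append(row)
    return encoded


# Encode a toy sequence and print each base next to its vector.
example = one_hot_encode("ACGTN")
for base, row in zip("ACGTN", example):
    print(base, row)
```

A real pipeline would typically read sequences from FASTA or VCF files and use an array library for the encoded matrix. Plain lists are used here only to keep the sketch dependency-free.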

Datasets

The Cancer Genome Atlas (TCGA)
Description: A comprehensive dataset comprising genomic, epigenomic, transcriptomic, and proteomic data from multiple cancer types.
Size: Over 11,000 patients across 33 cancer types

1000 Genomes Project
Description: A detailed catalog of human genetic variation, including deep sequencing of 2,504 individuals from 26 populations.
Size: 2,504 whole genomes

Genotype-Tissue Expression (GTEx) Project
Description: A dataset that provides comprehensive data on gene expression and regulation across multiple human tissues.
Size: Over 20,000 tissue samples from 900 individuals

UK Biobank
Description: A large-scale biomedical database containing in-depth genetic and health information from half a million UK participants.
Size: 500,000 participants

Genome Aggregation Database (gnomAD)
Description: A resource that aggregates and harmonizes exome and genome sequencing data from multiple large-scale sequencing projects.
Size: Over 140,000 exomes and 15,000 genomes

All of Us Research Program
Description: An extensive dataset aimed at gathering genetic data from diverse populations across the United States to advance precision medicine.
Size: Over 1 million participants (target)

Human Genome Diversity Project (HGDP)
Description: A dataset providing detailed genomic data from diverse human populations, used to study human genetic diversity.
Size: Over 1,000 individuals from 51 populations

International Cancer Genome Consortium (ICGC)
Description: A dataset that includes comprehensive genomic data from multiple cancer types, aimed at understanding cancer genomics.
Size: Over 25,000 cancer genomes

Personal Genome Project (PGP)
Description: A public resource of human genomic, environmental, and trait data from individuals who have consented to public data release.
Size: Thousands of whole genomes

ENCODE (Encyclopedia of DNA Elements) Project
Description: A project aimed at identifying all functional elements in the human genome, providing a comprehensive resource of genomic and epigenomic data.
Size: Thousands of datasets