Abstract
Motivation: Genome-wide association studies (GWAS) have transformed human genetics by identifying tens of thousands of trait-associated variants, enabling applications from drug discovery to polygenic risk prediction. These advances depend critically on open sharing of GWAS summary statistics. However, the lack of a standardized format complicates downstream analyses, requiring extensive dataset-specific "munging" before analysis can proceed.

Results: Here we present tidyGWAS, an R package that streamlines this process by cleanly separating data validation and harmonization from quality control. tidyGWAS uses curated data to repair and harmonize variant identifiers across genome builds, imputes missing columns when possible, and validates summary statistics with minimal filters. Outputs are saved as partitioned parquet files, optimized for high-throughput analysis via the arrow package. Benchmarked against existing tools, tidyGWAS is up to 6.5× faster and substantially more memory efficient. Additionally, we implement a fixed-effects meta-analysis directly on tidyGWAS output, achieving up to a 10× speedup over existing software. tidyGWAS simplifies and accelerates statistical genetic workflows, improving reproducibility and scalability for large-scale genetic analyses.

Availability and implementation: The package, reference data, and Docker containers are freely available for broad adoption.
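For readers unfamiliar with the term, a fixed-effects meta-analysis of GWAS summary statistics is commonly computed by inverse-variance weighting of per-study effect sizes. The snippet below is a minimal sketch of that general technique for a single variant, not tidyGWAS's implementation: the function name `fixed_effects_meta` and its arguments are hypothetical, and it assumes the effect sizes have already been harmonized to the same effect allele.

```python
import math


def fixed_effects_meta(betas, ses):
    """Inverse-variance-weighted fixed-effects estimate for one variant.

    betas: per-study effect sizes, aligned to the same effect allele (assumed).
    ses:   per-study standard errors, all strictly positive.
    Returns (pooled beta, pooled SE, z-score, two-sided p-value).
    """
    if not betas or len(betas) != len(ses):
        raise ValueError("betas and ses must be non-empty and of equal length")
    if any(se <= 0 for se in ses):
        raise ValueError("standard errors must be positive")

    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / (se * se) for se in ses]
    total_weight = sum(weights)

    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    # Two-sided p-value from the standard normal distribution.
    z = pooled_beta / pooled_se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p


if __name__ == "__main__":
    # Two hypothetical studies reporting the same variant.
    print(fixed_effects_meta([0.10, 0.20], [0.05, 0.05]))
```

Because the calculation is independent for each variant, it maps naturally onto columnar, partitioned data of the kind the abstract describes, although how tidyGWAS organizes this computation is not stated here.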
| Original language | English |
|---|---|
| Article number | vbaf262 |
| Journal | Bioinformatics Advances |
| Volume | 5 |
| Issue number | 1 |
| DOIs | |
| Publication status | Published - 2025 |
| Title | TidyGWAS: A scalable approach for standardized cleaning of genome-wide association study summary statistics |