Fast principal component analysis of large-scale genome-wide data.

Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years and PCA of large datasets has become a time cons...

Bibliographic Details
Main authors: Gad Abraham, Michael Inouye
Format: article
Language: EN
Published: Public Library of Science (PLoS) 2014
Subjects:
R
Q
Online access: https://doaj.org/article/7c52eb15209244efa849360d96256343
id oai:doaj.org-article:7c52eb15209244efa849360d96256343
record_format dspace
spelling oai:doaj.org-article:7c52eb15209244efa849360d96256343
2021-11-18T08:24:13Z
Fast principal component analysis of large-scale genome-wide data.
1932-6203
10.1371/journal.pone.0093766
https://doaj.org/article/7c52eb15209244efa849360d96256343
2014-01-01T00:00:00Z
https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/24718290/?tool=EBI
https://doaj.org/toc/1932-6203
Gad Abraham
Michael Inouye
Public Library of Science (PLoS)
article
Medicine
R
Science
Q
EN
PLoS ONE, Vol 9, Iss 4, p e93766 (2014)
institution DOAJ
collection DOAJ
language EN
topic Medicine
R
Science
Q
description Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years and PCA of large datasets has become a time-consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy in extracting the top principal components compared with existing tools, in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential, as traditional approaches will not scale adequately. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.
format article
author Gad Abraham
Michael Inouye
title Fast principal component analysis of large-scale genome-wide data.
publisher Public Library of Science (PLoS)
publishDate 2014
url https://doaj.org/article/7c52eb15209244efa849360d96256343