We Share Science: Rui Jin
Gene Pathway Selection Using Group Lasso Regularized Logistic Regression
Single nucleotide polymorphisms (SNPs) are the most frequent type of genetic variation between individuals and represent a promising tool for finding genetic determinants of complex diseases. SNPs are commonly used in genome-wide association studies (GWAS), where the goal is to find SNPs associated with a disease. However, the key to understanding the mechanisms underlying a disease is not the significant SNPs themselves but their effect on protein structure or their impact on functional sites at the protein or DNA level. Hence, the functional consequences of SNPs are better appreciated when the evaluation is performed at the biological system level, for instance by determining their effect in the context of gene sets or signaling pathways. In this study, we propose a new method to uncover the association between disease status and SNPs that are grouped into gene pathways. Our method uses overlapping group lasso regularized logistic regression to model the joint effects of SNPs and to select key explanatory SNPs at the group level. We use resampling techniques to rank the pathways, and then the SNPs within each pathway, in order of significance. Our ranking approach accommodates overlapping groups. We use data collected as part of the Study of Addiction: Genetics and Environment (SAGE) to demonstrate our methodology. The overarching goal of the data analysis is to identify novel genetic factors that contribute to addiction through a large-scale genome-wide association study of DSM-IV alcohol-dependent cases and non-dependent, unrelated control subjects of European and African American descent.
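
To make the idea concrete, the following is a minimal sketch of how pathway selection with an overlapping group lasso and resampling-based ranking might look. It is not the authors' implementation. It assumes that overlaps are handled by duplicating the columns of SNPs shared between pathways, that the penalty on each pathway is weighted by the square root of its size, that the model is fit by proximal gradient descent, and that pathways are ranked by how often they are selected across bootstrap resamples. The abstract does not specify any of these choices, and all function and parameter names (`expand_groups`, `fit_group_lasso_logistic`, `rank_groups`, `lam`, `n_boot`) are hypothetical.

```python
import numpy as np


def expand_groups(X, groups):
    """Duplicate columns so that overlapping groups become disjoint blocks."""
    cols = np.concatenate([np.asarray(g, dtype=int) for g in groups])
    bounds = np.cumsum([0] + [len(g) for g in groups])
    blocks = [slice(bounds[i], bounds[i + 1]) for i in range(len(groups))]
    return X[:, cols], blocks


def fit_group_lasso_logistic(X, y, groups, lam, n_iter=300):
    """Group lasso logistic regression by proximal gradient descent.

    X: (n, p) genotype matrix; y: (n,) array of 0/1 labels;
    groups: list of column-index lists (may overlap).
    Returns the Euclidean norm of each group's coefficient block.
    """
    Z, blocks = expand_groups(X, groups)
    n = Z.shape[0]
    # Step size 1/L, where L bounds the curvature of the mean logistic loss
    # (intercept column included in the bound).
    A = np.hstack([Z, np.ones((n, 1))])
    step = 4.0 * n / (np.linalg.norm(A, 2) ** 2)
    w = np.zeros(Z.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
        w = w - step * (Z.T @ (p - y) / n)
        b = b - step * np.mean(p - y)
        # Block soft-thresholding: shrink each group, zeroing weak ones.
        for blk in blocks:
            norm = np.linalg.norm(w[blk])
            thresh = step * lam * np.sqrt(blk.stop - blk.start)
            w[blk] = 0.0 if norm <= thresh else w[blk] * (1 - thresh / norm)
    return np.array([np.linalg.norm(w[blk]) for blk in blocks])


def rank_groups(X, y, groups, lam, n_boot=50, seed=0):
    """Rank groups by selection frequency over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    freq = np.zeros(len(groups))
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample subjects with replacement
        norms = fit_group_lasso_logistic(X[idx], y[idx], groups, lam)
        freq += norms > 0
    freq /= n_boot
    # Most frequently selected group first; ties keep their original order.
    return np.argsort(-freq, kind="stable"), freq
```

Calling `rank_groups(X, y, groups, lam)` with a genotype matrix coded 0/1/2, a 0/1 case-control vector, and one list of SNP column indices per pathway returns the pathway ordering together with the selection frequencies. The same resampling loop could be extended to record per-SNP coefficients in order to rank SNPs within each selected pathway.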