Abstract: |
For many human diseases, differential expression (DE) analysis is an important tool for revealing differences in gene expression profiles (GEPs) between patient and control groups. It can yield novel insights into the genes and pathways involved and may help identify drug targets and therapeutics. However, a fundamental knowledge gap remains in DE analysis: whether disease-associated GEP changes in tissue samples arise from changes in cellular composition or from GEP changes within specific cell types. It would be far more informative to study gene expression in specific cell types, that is, to identify cell-intrinsic differentially expressed genes (CI-DEGs). For many complex biological mixtures, such as brain tissue, however, exhaustive knowledge of the individual cell types and their specific markers is lacking. Although single-cell RNA sequencing (RNAseq) data can be used directly or serve as a reference, such approaches remain costly, cumbersome, and limited in sample size. In contrast, computational tools can leverage widely available large-scale bulk tissue RNAseq data sets. As a step prior to DE analysis, bulk tissue GEP data can be deconvolved into cell-type-specific GEPs and the cellular composition of each tissue sample. The basic mathematical model of complete deconvolution is nonnegative matrix factorization (NMF), which is also a major machine learning algorithm used in spectral unmixing in analytical chemistry, remote sensing, image processing, and topic mining. NMF is a well-known ill-posed problem and its solution is generally not separable, so applying it directly poses great challenges for the interpretability of biological data. Based on the geometric properties of the GEPs of potential marker genes, we propose a geometric structure guided NMF model for which the weak identifiability conditions of NMF are partially satisfied. 
Computational algorithms for the resulting non-convex optimization problem are developed in the framework of the alternating direction method of multipliers (ADMM). Our preliminary simulations on synthetic and biological data show improved solution separability. |
|
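
For orientation, the plain NMF baseline that complete deconvolution starts from might be sketched as below. This is not the geometric structure guided model or the ADMM solver described in the abstract; it is a minimal illustration using the standard Lee-Seung multiplicative updates for the Frobenius loss. The names `nmf_deconvolve`, `X`, `S`, and `A`, the genes-by-samples layout, and the final normalization into proportions are all assumptions made for the example.

```python
import numpy as np


def nmf_deconvolve(X, k, n_iter=500, eps=1e-9, seed=0):
    """Approximate X (genes x samples) as S @ A with S >= 0 and A >= 0.

    S (genes x k) plays the role of cell-type-specific expression profiles,
    A (k x samples) the role of per-sample cell-type abundances.
    All names and the layout are illustrative assumptions.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape

    # Strictly positive random start so the multiplicative updates are defined.
    S = rng.random((n_genes, k)) + eps
    A = rng.random((k, n_samples)) + eps

    for _ in range(n_iter):
        # Lee-Seung multiplicative updates for ||X - S A||_F^2.
        A *= (S.T @ X) / (S.T @ S @ A + eps)
        S *= (X @ A.T) / (S @ A @ A.T + eps)

    # Fix the scaling ambiguity: give each profile unit sum and move the
    # scale into the matching row of A. The product S @ A is unchanged.
    scale = S.sum(axis=0)
    S = S / scale
    A = A * scale[:, None]

    # Per-sample proportions (each column sums to 1). This is a
    # post-processing convenience, not part of the factorization itself.
    proportions = A / A.sum(axis=0, keepdims=True)
    return S, A, proportions


if __name__ == "__main__":
    # Synthetic mixture: 3 hypothetical cell types, 20 genes, 10 samples.
    rng = np.random.default_rng(1)
    S_true = rng.random((20, 3))
    A_true = rng.random((3, 10))
    X = S_true @ A_true

    S, A, P = nmf_deconvolve(X, k=3, n_iter=1000)
    rel_err = np.linalg.norm(X - S @ A) / np.linalg.norm(X)
    print(f"relative reconstruction error: {rel_err:.4f}")
```

Even when the reconstruction error is small, the recovered `S` and `A` need not match the true profiles and abundances, because many nonnegative factor pairs can reproduce the same `X`. That non-uniqueness is the identifiability problem the abstract refers to.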