Special Session 45: 

Principal Component Analysis In The Space of Phylogenetic Trees

Ruriko Yoshida
Naval Postgraduate School
USA
Co-Author(s): Tom Nye, Xiaoxian Tang, and Grady Weyenberg
Abstract:
Principal component analysis is a popular method of reducing high-dimensional data to a low-dimensional representation that preserves much of the sample's structure. However, the space of all phylogenetic trees on a fixed set of species does not form a Euclidean vector space, and methods adapted to tree-space are needed. Previous work introduced the notion of a principal geodesic in this space, analogous to the first principal component. Here we propose a geometric object for tree-space similar to the $k$th order principal component in Euclidean space: the locus of the weighted Fr\'echet mean of $k+1$ vertex trees when the weights vary over the $k$-simplex. We establish some basic properties of these objects, in particular that they have dimension $k$. We propose algorithms for projection onto these surfaces and for finding the principal locus associated with a sample of trees. Simulation studies demonstrate that these algorithms perform well, and analysis of two empirical data sets, containing Apicomplexa and African coelacanth genomes respectively, reveals important structure from the second-order principal components.
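
For concreteness, the locus of weighted Fr\'echet means can be written out as below. This is a sketch under the standard definition of a weighted Fr\'echet mean; the symbols $\mathcal{T}$ (tree-space), $d$ (the geodesic distance on it), $v_0,\dots,v_k$ (the vertex trees), $w$, $\Delta^k$, $\mu$ and $\Pi$ are notation assumed here for illustration and do not come from the abstract.

\[
\Delta^k = \Big\{ w \in \mathbb{R}^{k+1} : w_i \ge 0,\ \sum_{i=0}^{k} w_i = 1 \Big\},
\qquad
\mu(w) = \operatorname*{arg\,min}_{x \in \mathcal{T}} \sum_{i=0}^{k} w_i \, d(x, v_i)^2 ,
\]
\[
\Pi(v_0,\dots,v_k) = \big\{ \mu(w) : w \in \Delta^k \big\}.
\]

In Euclidean space with $d(x,y) = \|x - y\|$, the minimizer is $\mu(w) = \sum_{i} w_i v_i$, so $\Pi$ reduces to the convex hull of the vertices, a $k$-dimensional simplex when the $v_i$ are affinely independent. This is the sense in which the locus plays the role of a $k$th order principal component.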