Predictive Models from Ultra-High-Dimensional Longitudinal Data

(supported in part by the Frymoyer Chair in IST held by Vasant Honavar at Penn State University and the Sudha Murty Distinguished Visiting Chair in Neurocomputing and Data Science held by Vasant Honavar at the Indian Institute of Science)

Longitudinal data, sometimes also called panel data, i.e., collections of repeated observations of a set of individuals drawn from a larger population over a period of time, often at irregularly spaced time points that differ across individuals, are common across a broad range of applications, including the health sciences, social sciences, learning sciences, and economics, among others. Such data can be used to uncover the relationship between the time-varying patterns of measured variables (or features) and a particular outcome variable (or outcome) of interest, e.g., a stock market crash, disease onset, or health risk.

Longitudinal data exhibit longitudinal correlation (LC), i.e., correlation across observations of the same individual taken at different time points. In addition, observations across individuals may be correlated because of shared traits (e.g., demographic characteristics), leading to clustered correlation (CC); many data sets exhibit both. Under such circumstances, the observations (whether within or across individuals) are no longer independent and identically distributed (i.i.d.). Ignoring either kind of correlation can lead to incorrect parameter estimates, invalid hypothesis tests, and misleading statistical inferences or predictions. Moreover, such data are characterized by fixed effects, which are shared by the population under study; random effects, which are individual-specific; or mixed effects, i.e., combinations of fixed and random effects (see the simulation sketch below). With the advent of big data, the number of variables often far exceeds the number of individuals, which greatly increases the need for effective variable selection, computational efficiency, and interpretable models. Last but not least, observations at any given time point often have many missing measurements, and the missing data are generally not missing at random.
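
The distinction between fixed, random, and mixed effects, and how they give rise to LC and CC, can be made concrete with a small simulation. Below is a minimal sketch assuming a standard linear mixed-effects model; all variable names and sizes are illustrative and are not taken from the work described here.

```python
# Minimal sketch (illustrative, not from the work described below):
# simulating longitudinal data under a linear mixed-effects model,
#   y_ij = x_ij' beta + b_i + u_{c(i)} + eps_ij,
# where the individual-specific intercept b_i induces longitudinal
# correlation (LC) and the shared cluster intercept u_{c(i)} induces
# clustered correlation (CC).
import numpy as np

rng = np.random.default_rng(0)
n_individuals, n_obs, n_features, n_clusters = 50, 8, 10, 5

beta = rng.normal(size=n_features)             # fixed effects (population-wide)
b = rng.normal(size=n_individuals)             # random effects -> LC
u = rng.normal(size=n_clusters)                # cluster effects -> CC
cluster = rng.integers(0, n_clusters, size=n_individuals)

rows = []
for i in range(n_individuals):
    for _ in range(n_obs):                     # repeated observations per individual
        x = rng.normal(size=n_features)
        y = x @ beta + b[i] + u[cluster[i]] + rng.normal(scale=0.5)
        rows.append((i, cluster[i], x, y))

# Observations that share an individual i (or a cluster c(i)) are correlated,
# so the usual i.i.d. assumption fails for this data set.
```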

Against this background, we are developing novel methods for predictive modeling from ultra-high-dimensional, irregularly sampled, longitudinal data, including:

  1. Longitudinal Multi-Level Factorization Machines (LMLFM) (Liang et al., 2020), a novel, efficient, provably convergent extension of the Factorization Machine (FM) for predictive modeling of longitudinal data characterized by mixed effects, in the presence of LC, CC, or both. A key feature of FM is that it models interactions of variables by mapping the interactions to dot products of vectors in a low-dimensional latent space (see the FM prediction sketch after this list). LMLFM, like FM, uses latent factors to efficiently model higher-order interactions between features. LMLFM extends FM to handle fixed, random, or mixed effects as needed for predictive modeling from longitudinal data. In addition, LMLFM changes the structure of the underlying model to enhance interpretability and to achieve strictly linear training time in the size of the training data. To the best of our knowledge, LMLFM is the first multi-level regression model that extends variable selection beyond fixed effects to include random effects, and hence can be applied to high-dimensional longitudinal data. LMLFM uses a hierarchical probabilistic graphical model (HPGM) and avoids the need for hyperparameter tuning. It learns the model parameters with a variant of the iterated conditional modes (ICM) algorithm, based on a maximum a posteriori (MAP) formulation derived from the HPGM, and we have established the convergence of this procedure. Results of experiments with simulated data show that LMLFM can effectively cope with high-dimensional longitudinal data in the presence of both LC and CC, whereas state-of-the-art baseline methods fail to do so (e.g., LMLFM can handle longitudinal data with over 5,000 variables, whereas state-of-the-art multilevel mixed-effects baselines fail when the number of variables exceeds 100). Results of experiments with real-world data sets show that LMLFM compares favorably with the state-of-the-art baselines in terms of predictive accuracy, while producing sparse, interpretable models that include only the relevant subset of variables.
  2. Longitudinal deep kernel Gaussian process regression (L-DKGPR) (Liang et al., 2021), which overcomes the limitations of existing methods by fully automating the discovery of complex multilevel correlation structure from longitudinal data. Specifically, L-DKGPR eliminates the need for ad hoc heuristics or trial and error through a novel adaptation of deep kernel learning that combines the expressive power of deep neural networks with the flexibility of non-parametric kernel methods (see the deep kernel sketch after this list). L-DKGPR effectively learns the multilevel correlation with a novel additive kernel that simultaneously accommodates both time-varying and time-invariant effects. We have developed an efficient algorithm to train L-DKGPR using latent-space inducing points and variational inference. Results of extensive experiments on several benchmark data sets demonstrate that L-DKGPR significantly outperforms state-of-the-art longitudinal data analysis (LDA) methods.
  3. A novel, modular, convolution-based feature extraction and attention mechanism that simultaneously identifies the variables, as well as the time intervals over which those variables impact the classifier output (Hsieh et al., 2021); a sketch of such an architecture appears after this list. The results of our extensive experiments with several benchmark data sets show that the proposed method outperforms state-of-the-art baseline methods on the multivariate time series classification task. The results of our case studies demonstrate that the variables and time intervals identified by the proposed method make sense relative to available domain knowledge.
  4. Functional autoencoders (Hsieh et al., 2021), which generalize neural network autoencoders to learn non-linear representations of functional data (see the sketch after this list). We have derived, from first principles, a functional-gradient-based algorithm for training functional autoencoders. The results of experiments demonstrate that functional autoencoders outperform state-of-the-art baseline methods. The resulting methods find applications in many real-world settings, e.g., monitoring of individual health, climate, brain activity, and environmental exposures, where the data of interest change smoothly over a continuum, e.g., time, yielding multi-dimensional functional data. Functional data representations can be used for functional data clustering, classification, and regression.
  5. SRVAR (Hsieh et al., 2021), a novel approach to the problem of simultaneously learning the dynamics of transitions between hidden states and the state-dependent relationships between variables. SRVAR uses state-space recurrent neural networks to model the transitions between hidden states, and exploits the smooth acyclic characterization of DAGs to efficiently learn the state-dependent DAGs (see the acyclicity sketch after this list). Results of experiments on simulated data as well as a real-world data set show the superiority of SRVAR over state-of-the-art baselines in recovering the patterns of state transitions while modeling state-specific dependencies between variables.
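
To make concrete how an FM (and hence LMLFM) maps variable interactions to dot products of latent vectors, here is a minimal sketch of the standard FM prediction function. It is not the authors' implementation; all names and sizes are illustrative.

```python
# Standard factorization machine prediction:
#   y(x) = w0 + w'x + sum_{j<k} <v_j, v_k> x_j x_k,
# computed with the usual O(d*k) identity instead of the naive O(d^2) double sum.
import numpy as np

def fm_predict(x, w0, w, V):
    """x: (d,) features; w0: bias; w: (d,) linear weights; V: (d, k) latent factors."""
    linear = w0 + w @ x
    xv = x @ V                                          # (k,) per-factor sums of x_j v_jf
    pairwise = 0.5 * (xv @ xv - ((x ** 2) @ (V ** 2)).sum())
    return linear + pairwise

rng = np.random.default_rng(0)
d, k = 8, 3
x = rng.normal(size=d)
print(fm_predict(x, 0.1, rng.normal(size=d), rng.normal(size=(d, k))))
```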
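
The sketch below illustrates the general idea behind deep kernel learning and an additive kernel over time-varying and time-invariant features. It is an illustrative toy, not the L-DKGPR model itself: the embedding network is untrained, and the specific kernel forms are our assumptions.

```python
# A "deep kernel" applies a base kernel to learned embeddings,
# k(x, x') = k_RBF(g(x), g(x')); an additive kernel can then combine a
# time-varying component with a time-invariant (individual-level) component.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(10, 16)), rng.normal(size=(16, 4))

def g(X):                                   # embedding network (untrained tanh MLP)
    return np.tanh(np.tanh(X @ W1) @ W2)

def rbf(A, B, ls=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def additive_kernel(X_time, X_static):
    # time-varying effects via the deep kernel + time-invariant effects
    return rbf(g(X_time), g(X_time)) + rbf(X_static, X_static)

X_time = rng.normal(size=(6, 10))           # time-varying features
X_static = rng.normal(size=(6, 3))          # time-invariant features
K = additive_kernel(X_time, X_static)       # (6, 6) multilevel covariance matrix
```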
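
For item 3, the following is a hypothetical PyTorch sketch of a convolution-plus-attention classifier of this general kind. The architectural details (depthwise convolution, per-(variable, time) softmax scores) are our illustrative assumptions, not the published model.

```python
# A 1D convolution extracts features per time step, and a softmax attention
# map over (variable, time) pairs indicates which variables matter and when.
import torch
import torch.nn as nn

class ConvAttentionClassifier(nn.Module):
    def __init__(self, n_vars, n_classes, kernel_size=5):
        super().__init__()
        # depthwise conv keeps one channel per variable, so attention
        # scores remain attributable to individual variables
        self.conv = nn.Conv1d(n_vars, n_vars, kernel_size,
                              padding=kernel_size // 2, groups=n_vars)
        self.score = nn.Conv1d(n_vars, n_vars, 1)   # per-(variable, time) scores
        self.head = nn.Linear(n_vars, n_classes)

    def forward(self, x):                  # x: (batch, n_vars, time)
        h = torch.relu(self.conv(x))
        attn = torch.softmax(self.score(h).flatten(1), dim=1).view_as(h)
        pooled = (attn * h).sum(dim=-1)    # attention-weighted sum over time
        return self.head(pooled), attn     # attn localizes variables and intervals

model = ConvAttentionClassifier(n_vars=12, n_classes=3)
logits, attn = model(torch.randn(4, 12, 100))
```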
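
To illustrate how an autoencoder can be generalized to functional inputs (item 4), the sketch below encodes a discretized curve via quadrature-weighted inner products with learnable weight functions, so each code is a functional of the whole curve rather than a pointwise feature. This is one illustrative formulation, not the paper's.

```python
# Each code z_k = tanh(<x, w_k>), with <x, w_k> approximated by a Riemann sum
# over the sampling grid; the decoder maps codes back to a discretized curve.
import torch
import torch.nn as nn

class FunctionalAutoencoder(nn.Module):
    def __init__(self, n_grid, n_codes, dt):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(n_codes, n_grid))  # weight functions w_k
        self.decoder = nn.Sequential(nn.Linear(n_codes, 64), nn.Tanh(),
                                     nn.Linear(64, n_grid))
        self.dt = dt

    def forward(self, x):                            # x: (batch, n_grid) sampled curves
        z = torch.tanh(x @ self.weights.T * self.dt) # nonlinear functional codes
        return self.decoder(z), z

t = torch.linspace(0, 1, 100)
curves = torch.sin(2 * torch.pi * torch.rand(8, 1) * t)   # toy smooth functions
model = FunctionalAutoencoder(n_grid=100, n_codes=6, dt=float(t[1] - t[0]))
recon, codes = model(curves)
loss = ((recon - curves) ** 2).mean()                # reconstruction error on the grid
```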
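
The "smooth acyclic characterization of DAGs" used by SRVAR refers to replacing the combinatorial acyclicity constraint with a differentiable function that vanishes exactly on acyclic weighted graphs, as in NOTEARS-style DAG learning. A minimal sketch of that acyclicity function (the per-state penalty is our illustrative framing):

```python
# NOTEARS-style acyclicity function: h(W) = tr(exp(W ∘ W)) - d, which is zero
# iff the weighted adjacency matrix W encodes a DAG. A method like SRVAR can
# penalize h(W_s) for each hidden state s while an RNN models state transitions.
import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    d = W.shape[0]
    return np.trace(expm(W * W)) - d       # 0 exactly when W is acyclic

W_dag = np.triu(np.random.default_rng(0).normal(size=(4, 4)), k=1)  # acyclic
W_cyc = W_dag.copy()
W_cyc[3, 0] = 1.0                          # adds a cycle 0 -> ... -> 3 -> 0
print(acyclicity(W_dag))                   # ~0
print(acyclicity(W_cyc))                   # > 0
```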