### PhD Dissertation Abstracts: 2008

## Statistics PhD Alumni 2008:

**Kun Chen**, **Ying Chen**, **Zhonghua Gu**, **Bitao Liu**, **Ziqi Liu**, **Ruixiao Lu**, **Michelle Norris**, **En-Tzu Tang**, **Qiuyan Xu**, **Li Zhu**

### Kun Chen (2008)

ADVISER: Jane-Ling Wang

TITLE: **Functional Approaches for High Dimensional and Gene Expression Data**

ABSTRACT: With the advances of modern science and technology, large-scale data sets have accumulated at an unprecedented rate and size, which calls for efficient and effective statistical methods to extract meaningful information from the data. This presentation addresses two specific questions for high-dimensional gene expression data through the development of Functional Data approaches.

The first question is how to find, among thousands of candidates, differentially expressed genes, that is, genes whose expression values over time distinguish between two or more groups. A Functional Principal Component (FPC) approach is developed and employed to depict the dynamics of the gene trajectories. With few model assumptions, FPC analysis shows better sensitivity and specificity than the commonly used two-way mixed ANOVA.

For the second question, survival prediction with high-dimensional gene expression levels, a novel method, Stringing, is proposed to take advantage of the high dimension by representing such data as discretized and noisy observations that originate from a hidden stochastic process. This methodology substantially broadens the applicability of Functional Data Analysis. A functional Cox regression model is proposed and implemented by supervised iterative selection of predictor subsets. Stringing is compared with existing methods for survival regression with high-dimensional predictors and demonstrates superior performance in an application to a real data set.

### Ying Chen (2008)

ADVISER: David Rocke

TITLE: **Statistical Approaches for Detection of Relevant Genes and Pathways in Analysis of Gene Expression Data**

ABSTRACT: High-throughput DNA microarray technology has become a standard tool in biomedical and genomic research. Measuring the expression of thousands of genes simultaneously provides important insights into the molecular mechanisms of biological processes. This dissertation addresses issues in gene selection, pathway analysis, and differential gene expression detection in microarray studies and proposes new statistical approaches to solve them.

Selection of relevant genes for sample classification is an important issue in microarray studies. Although microarray experiments provide measurements of activity for thousands of genes, we expect only a small number of genes to be related to the phenotype of interest. Identification of relevant genes can help researchers better understand the biological mechanism. A new gene selection method named RA-PLS is developed for partial least squares (PLS) analysis. It is formulated in the framework of hypothesis testing and uses a test statistic based on PLS weights. To estimate the null distribution of the test statistic, we introduce a new idea named random augmentation. RA-PLS is evaluated in terms of sensitivity and specificity in simulation studies and is also compared with other gene selection methods. The results show that RA-PLS has high sensitivity and high specificity, and performs quite stably across different simulation settings. The results of applying RA-PLS to two real microarray data sets are also discussed.

Pathway analysis has become increasingly important in microarray experiments. Biological pathways are sets of genes that serve a particular cellular or physiologic function. The identification of relevant pathways for classifying normal and diseased samples can make the interpretation of the results much easier. A pathway-based model for partial least squares analysis (PB-PLS) is developed by incorporating pathway information into the statistical model building steps. The simulation studies show that the PB-PLS model can identify relevant pathways and also improves predictive accuracy relative to the standard PLS model. The application to two real microarray data sets demonstrates that the results of the PB-PLS model are consistent with those of some other pathway analysis methods.

### Zhonghua Gu (2008)

ADVISER: Jiming Jiang

TITLE: **Model Diagnostics for Generalized Linear Mixed Models**

ABSTRACT: Generalized linear mixed models (GLMM) have received considerable attention over the past several decades. The allowance of discrete and non-normally distributed responses and the incorporation of random effects have made GLMM a flexible approach to modeling a transformation of the mean as a function of both fixed and random effects. Models of this kind can address statistical issues such as heterogeneity, over-dispersion, and intra-cluster correlation.

This dissertation presents the model diagnostics process step by step, starting with an introduction of the model, followed by parameter estimation, and then the development of the test statistic and its asymptotic properties. Our proposed methods cover both nested and crossed random effect structures, and they provide a formal way to test the distributional assumptions on both the random effects and the observations. Computational feasibility has also been considered.

The model diagnostic method proposed for GLMM with nested random effects starts by choosing the minimum chi-square estimate (MCE) as the parameter estimation method. Then, a modified Pearson's chi-square test statistic is defined for GLMM diagnostics. Because the random effects are nested, we still have independence at the unit level, which allows the classical central limit theorem to be applied. The regularity conditions are discussed, and simulation and data analysis are performed with both binary and Poisson distribution assumptions on the observations.

The model diagnostic method proposed for GLMM with crossed random effects starts by using the method of simulated moments (MSM) estimate as the parameter estimation method. The test statistic follows a formula similar to that of the previous chapter. However, the normalizing constant and the choices for the number of cells involved in the test statistic are different. We also need a different central limit theorem, because the crossed random effects lead to dependence across all the observations. The martingale central limit theorem is a natural candidate, and the conditions needed to apply it are verified. Finally, a simulation study and data analysis are also performed in this chapter.

### Bitao Liu (2008)

ADVISER: Hans-Georg Müller

TITLE: **Estimating Derivatives for Samples of Sparsely Observed Functions, with Application to On-line Auction Dynamics**

ABSTRACT: It is often of interest to recover derivatives of a sample of random functions from sparse and noise-contaminated measurements, especially when the dynamics of the underlying processes are of interest. We propose a novel approach based on estimating derivatives of eigenfunctions and expansions of random functions into their eigenfunctions to obtain a representation for derivatives. In combination with estimates for functional principal component scores for sparse data, this leads to a viable solution to the challenging problem of recovering derivatives for sparsely observed functions. We establish consistency results and demonstrate in simulations that the method is superior to alternative approaches (derivative estimation with random effects models based on B-spline bases, kernel smoothing, smoothing splines, or P-splines). Our study is motivated by an analysis of bidding histories for eBay auctions, where bids are typically very sparse in the middle and somewhat more frequent near the beginning and end of an auction. We demonstrate the estimation of derivatives of price curves for individual auctions from the sparsely observed bidding histories and also derive a model-free first-order differential equation that applies in the case of Gaussian processes. This provides a data-driven dynamic model which we employ to elucidate auction dynamics.

### Ziqi Liu (2008)

ADVISER: Rudy Beran

TITLE: **Nonparametric Bootstrap Method on Stiefel Manifold and GPAV algorithm for ASP fit**

ABSTRACT: In the first part of this article, we study statistical models on the Stiefel manifold, which is the collection of n by p matrices X in Riemann space with the restriction X'X = I and is commonly used to describe a rigid configuration of p distinguishable directions in n-dimensional space. We do so by constructing the intrinsic mean and the sample mean of random variables on the manifold. A nonparametric bootstrap confidence region is provided for the intrinsic mean, and the coverage probability of this confidence region is proven to converge to the desired level. As an application, the method is applied to a set of vectorcardiogram data.

The second part of this article is motivated by the lack of an efficient algorithm to realize the ASP method with the Bi-Monotone shrinkage class for two-way layouts when both factors are ordinal. The Generalized Pool Adjacent Violators (GPAV) algorithm is characterized by low computational complexity and a close-to-optimal solution to the monotonic regression problem. We apply it directly to small-scale data sets and, after refinement, to large-scale data sets. The resulting method ensures the feasibility of ASP fits for two-way layouts with the Bi-Monotone shrinkage class and is highly efficient in minimizing the risk of the ASP estimators. It will therefore broaden the use of ASP methods in settings where regression or ANOVA are established tools.
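For readers unfamiliar with the pool-adjacent-violators idea, the following is a minimal sketch of the classical one-dimensional PAV algorithm for nondecreasing least-squares regression. It is not the GPAV algorithm or the bi-monotone ASP fit described in the abstract, which handle more general orderings; the function name `pava` and its arguments are illustrative assumptions.

```python
def pava(y, w=None):
    """Nondecreasing weighted least-squares fit by pooling adjacent violators.

    Illustrative 1-D special case only; the name and signature are assumptions.
    """
    n = len(y)
    w = [1.0] * n if w is None else list(w)
    blocks = []  # each block stores [weighted mean, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    # Expand the pooled blocks back to one fitted value per observation.
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit
```

Each violation of the ordering is resolved by replacing the offending neighbors with their weighted mean, which is what keeps the computational cost low.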

### Ruixiao Lu (2008)

ADVISER: David Rocke

TITLE: **Statistical Issues in Detection of Biological Signals in the Analysis of Microarray Gene Expression Data**

ABSTRACT: Microarray technology is a relatively new and powerful technology that significantly increases the speed of molecular biological research. It is developing rapidly in many respects. In the past few years, gene expression profiles generated from microarrays have provided researchers with a great deal of insightful information for potential biomarker detection and chemical component discovery. Methods for analyzing gene expression microarray data have been developed ever since the technology appeared. Typical analysis steps exist for certain kinds of microarray gene expression experiments, such as comparative experiments, and they have been useful for extracting biological information in the past. However, as research becomes more refined and the biological signals become subtler, the detection of differential expression becomes more difficult.

The two-color microarray platform is one of the major types of microarrays. A major reason for using two-color microarrays is that the use of two samples labeled with different dyes on the same slide, which bind to probes on the same spot, is supposed to adjust for many factors that introduce noise and errors into the analysis. Most users assume that any differences between the dyes can be adjusted out by standard methods of normalization. However, this is not always the case, and sometimes the so-called dye bias and slide bias are large enough to compromise the real biological signals. For this problem, we introduce a diagnostic tool for visualization and quantification.

Identification of differential expression of transcripts, proteins, and other biological molecules is complicated by a number of factors. First, the very comprehensiveness of high-throughput methods makes it difficult to separate real effects from statistical accidents. Second, there are differences in the timing of transcriptional cascades and other molecular pathways between individual samples; this is especially true for human data. We have developed a new method for dealing with this problem by utilizing biological information in the analysis, after careful low-level preprocessing of the raw signals and statistical analysis of the significance of changes in single probes or genes. The first step is to select one or more pathways or gene groups and identify the probes or probe sets on an array that correspond to the pathway or gene group. Second, for each probe or probe set, we conduct the usual statistical analysis, such as a t-test for the difference between treatment and control groups. We then test whether, in the aggregate, this shows up-regulation, down-regulation, or neither. The significance of the test can be assessed by sampling genes or by sampling arrays, and we show the pros and cons of both approaches. This new method can also serve as a powerful tool for evaluating the reproducibility of a study. We illustrate the value of the method with results from a dose-response study of low-dose radiation exposure using data from human expression arrays. An application study and the full theoretical methodology will be described in detail.
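The steps above (per-probe t-tests, an aggregate pathway statistic, and significance by sampling genes) can be sketched as follows. This is only an illustration under stated assumptions, not the dissertation's method: the aggregate statistic is taken to be the mean Welch t-statistic over the pathway, the data are simulated, and all function and variable names are hypothetical.

```python
import numpy as np

def t_statistics(treated, control):
    """Welch t-statistic per gene; inputs are (genes x arrays) matrices."""
    mean_diff = treated.mean(axis=1) - control.mean(axis=1)
    se = np.sqrt(treated.var(axis=1, ddof=1) / treated.shape[1]
                 + control.var(axis=1, ddof=1) / control.shape[1])
    return mean_diff / se

def pathway_test_by_gene_sampling(treated, control, pathway_idx,
                                  n_draws=2000, seed=0):
    """Compare the pathway's mean t-statistic with random gene sets of equal size.

    Assumption: the mean t-statistic is used as the aggregate score.
    """
    rng = np.random.default_rng(seed)
    t = t_statistics(treated, control)
    pathway_idx = np.asarray(pathway_idx)
    observed = t[pathway_idx].mean()
    k = pathway_idx.size
    # Null distribution: mean t-statistic of randomly drawn gene sets.
    null = np.array([t[rng.choice(t.size, size=k, replace=False)].mean()
                     for _ in range(n_draws)])
    # Two-sided p-value with the usual +1 correction.
    p_value = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_draws + 1)
    return observed, p_value

# Simulated example: 500 genes, 6 arrays per group, first 20 genes up-regulated.
rng = np.random.default_rng(1)
control = rng.normal(size=(500, 6))
treated = rng.normal(size=(500, 6))
treated[:20] += 2.0
observed, p_value = pathway_test_by_gene_sampling(treated, control, np.arange(20))
```

A positive `observed` value points toward up-regulation of the gene group and a negative value toward down-regulation. Sampling arrays instead would permute the treatment labels rather than the gene set.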

### Michelle Norris (2008)

ADVISER: Wesley Johnson

TITLE: **Parametric and Semiparametric Joint Modeling for Longitudinal Diagnostic Outcomes**

ABSTRACT: Diagnostic screening involves testing humans or animals for the presence of disease or infection. A variety of screening tests, yielding binary, ordinal and continuous data, are currently being used by diagnosticians. For some diseases, a perfect, "gold-standard" test does not exist or is too invasive or expensive to use. Hence, the goals of diagnostic screening may include: quantifying the performance of an imperfect screening test, diagnosing subjects, and estimating disease prevalence – possibly in the absence of a perfect reference test.

To date, most work in the area of diagnostic screening has focused on cross-sectional data. However, longitudinal diagnostic screening data are currently being collected in many studies. As a result, we develop a novel model for joint longitudinal diagnostic screening outcomes in the no-gold-standard case. We consider the case where two tests are repeatedly administered to each subject – one yielding a continuous response and the other a binary response. For infected subjects, we assume the existence of a changepoint corresponding to time of infection and posit appropriate changes to the model thereafter. This results in a varying-dimensional parameter space since the true infection status of the subjects is unknown. We perform inference using Bayesian Markov chain Monte Carlo methods, incorporating the Reversible Jump Markov chain Monte Carlo algorithm of Green for posterior simulation from a varying-dimensional parameter space. We test the model's performance on simulated data, then analyze a data set for Johne's Disease in cattle. We also propose a semiparametric version of this model, which allows the slope of the linear trajectory modeling the continuous response for infected subjects to be drawn from a flexible, nonparametric distribution.

Additionally, we consider the longitudinal model proposed by Diggle (1988), which includes fixed effects, a random subject effect and a stochastic process. The stochastic process is used to induce an autoregressive correlation structure on the within-subject observations. We propose modeling the distribution of random effects nonparametrically in this setting. A Markov chain Monte Carlo scheme for posterior inference is derived, and its performance is assessed using simulated data.

### En-Tzu Tang (2008)

ADVISER: Jiming Jiang

TITLE: **On Estimation of the Mean Squared Error in Small Area Estimation and Related Topics**

ABSTRACT: Small area estimation has received a lot of attention because of its wide range of applications and an increasing demand for reliable small area statistics. Mixed effects models are used widely in small area estimation because of their flexibility in effectively combining different sources of information and explaining different sources of error. This presentation is concerned with the estimation of small area means and of the MSE of these estimators, as well as other related topics associated with mixed model prediction problems.

In general, the mixed models used in small area estimation can be classified as area-specific models and unit-specific models. Among area-specific models, the Fay-Herriot (1979) model is the most representative one.

Under the Fay-Herriot model, we show how to find the best EBLUP according to the minimum MSE in both balanced and unbalanced cases. For unit-specific models, nested error regression models are considered in order to compare the performance of four different MSE estimators of the EBLUP. The non-cluster model, or crossed random effects model, is another widely used mixed effects model. Most methods of MSE estimation for the EBLUP are not designed for models with correlations across the areas. Building on the results of Das et al. (2004) and the nonparametric matched-moment, double-bootstrap method of Hall and Maiti (2006), we construct an MSE estimator for the non-cluster balanced model. A simulation under different settings is carried out to compare the performance of different MSE estimators in this model.

### Qiuyan Xu (2008)

ADVISER: Jiming Jiang

TITLE: **Estimation of Integrated Covolatility for Asynchronous Assets**

ABSTRACT: Use of high-frequency return data has led to dramatic improvements in both theoretical and applied finance research. Estimators of covariance among multiple processes have been proposed, such as realized variance and the Hayashi-Yoshida estimator. We introduce a new estimator, the random lead-lag estimator (RLLE), which coincides with the Hayashi-Yoshida estimator at very high frequency. We studied the performance of RLLE both with and without microstructure noise for non-synchronous data and obtained the optimal estimator with a good bias-variance trade-off. Our result is confirmed by simulation. We also applied our method to real data.
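As background for the comparison above, here is a minimal sketch of the Hayashi-Yoshida estimator, which sums products of increments of the two processes over every pair of observation intervals that overlap. The RLLE itself is not specified in the abstract and is not shown; the function name and argument layout below are illustrative assumptions.

```python
import numpy as np

def hayashi_yoshida(times_x, x, times_y, y):
    """Hayashi-Yoshida covariation estimate for asynchronously observed paths.

    times_x, times_y: increasing observation times of each process.
    x, y: observed (log-)prices at those times.
    """
    times_x, x = np.asarray(times_x, float), np.asarray(x, float)
    times_y, y = np.asarray(times_y, float), np.asarray(y, float)
    dx, dy = np.diff(x), np.diff(y)
    x_start, x_end = times_x[:-1], times_x[1:]
    y_start, y_end = times_y[:-1], times_y[1:]
    # Intervals (s_{i-1}, s_i] and (t_{j-1}, t_j] overlap when the later
    # start time is strictly before the earlier end time.
    overlap = (np.maximum(x_start[:, None], y_start[None, :])
               < np.minimum(x_end[:, None], y_end[None, :]))
    return float(dx @ overlap @ dy)
```

When both processes are observed on the same time grid, only matching intervals overlap, so the estimate reduces to the ordinary sum of products of synchronous increments.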

### Li Zhu (2008)

ADVISER: Fushing Hsieh

TITLE: **Modeling Dynamics in Two Statistical Problems: Longitudinal Disease Activity Score and Parasite Infection**

ABSTRACT: Oftentimes, certain specific features of a given data set are associated with the underlying dynamic mechanism that generated the data. When this is ignored, commonly accepted methodologies are applied universally to different data, which may lead to the loss of important information. To attain insightful and comprehensive knowledge about a specific data set, it is crucial to devise a tailored methodology according to the actual data structure, so as to capture the intrinsic dynamics and extract sufficient information. In this dissertation, statistical approaches are developed for modeling data dynamics based on two typical real-world problems.

The first problem is the dynamics of serial outcomes in longitudinal data. It is motivated by post-marketing evaluations of pharmaceutical products, which focus particularly on medical treatments that give rise to phase-dependent effects and volatile longitudinal responses. A stochastic modeling procedure is developed for analyzing the dynamics of longitudinal data, which first models individual-specific trajectories and then summarizes the individual information at the population level. Statistical analysis is demonstrated with a data set from a rheumatoid arthritis study, in which the longitudinal measurement is the disease activity score based on 28 joints. The results provide more meaningful interpretations than traditional methods.

The second problem is the dynamics of a decision-making system that includes a mass of indistinguishable agents and one target host, which is motivated by a problem of parasite infection. The huge system-wise heterogeneity is postulated to result from macroscopic behavioral tactics employed by almost all involved agents. A state-space-based mass event-history model is developed here to explore the potential existence of such macroscopic behaviors. By introducing an unobserved internal state-space variable into the system, each individual's event history is made into a composition of a common state duration and an individual-specific time to action. Parametric statistical inferences are derived and illustrated in simulation studies. A real data analysis of mass invasion of host larvae by parasitic nematodes yields results that are coherent with the biology of the nematode as a parasite and that further provide new quantitative interpretations.