### PhD Dissertation Abstracts: 2007

## Statistics PhD Alumni 2007:

### Wei Liu (2007)

ADVISER: Wolfgang Polonik

TITLE: **Statistical Network Comparison**

ABSTRACT: The study of dynamical random networks (graphs) has attracted considerable attention in recent years. Statistical inference is challenging in this context because, in general, only a very small number of observed networks is available. An important statistical problem, considered in this research, is to assess topological dissimilarities between networks.

The proposed approach assesses topological dissimilarities between networks indirectly. The structure of the given networks is destroyed by adding noise (this process is called "scrambling"). The amount of noise necessary in order to make the topologies of the scrambled networks statistically indistinguishable is used as a dissimilarity measure.

To follow this approach, one has to decide on its basic ingredients, such as the way to introduce noise, the way to measure the amount of noise, and the test statistic for comparing the topologies of the scrambled networks. Three scrambling methods are proposed that, to a certain extent, allow control of the level of scrambling imposed on a network. Topologies of networks are compared via the spectral distributions of their (standardized) adjacency matrices. In fact, moments of these spectral distributions are utilized for testing purposes. This is motivated by a recent result of Bai and Yao (1), who derive a functional central limit theorem for an empirical spectral process based on Wigner matrices and indexed by analytic functions. We have extended their results slightly to allow for constant diagonal elements in these matrices. This then allows the application of this result to (standardized) adjacency matrices of networks (graphs) without self-edges.
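As a rough illustration of the ingredients described above, the following sketch scrambles an undirected network and computes moments of the spectral distribution of its standardized adjacency matrix. The edge-flipping scheme, the standardization by edge density, and the function names `scramble` and `spectral_moments` are assumptions made for this example; they are not necessarily the three scrambling methods or the test statistic used in the dissertation.

```python
import numpy as np

def scramble(adj, p, rng):
    """Flip each off-diagonal pair (edge <-> non-edge) independently with probability p.

    Hypothetical scrambling scheme, used here only for illustration.
    """
    n = adj.shape[0]
    flips = np.triu(rng.random((n, n)) < p, k=1)
    flips = flips | flips.T              # mirror so the graph stays undirected
    out = np.where(flips, 1 - adj, adj)
    np.fill_diagonal(out, 0)             # no self-edges
    return out

def spectral_moments(adj, max_order=4):
    """Moments of the empirical spectral distribution of a standardized adjacency matrix.

    Assumes a symmetric 0/1 matrix with zero diagonal and edge density strictly
    between 0 and 1.
    """
    n = adj.shape[0]
    dens = adj[np.triu_indices(n, k=1)].mean()         # observed edge density
    std = (adj - dens) / np.sqrt(n * dens * (1.0 - dens))
    np.fill_diagonal(std, 0.0)                          # constant (zero) diagonal
    eig = np.linalg.eigvalsh(std)                       # real eigenvalues, symmetric input
    return np.array([np.mean(eig ** k) for k in range(1, max_order + 1)])

# Usage: compare moments before and after scrambling
rng = np.random.default_rng(0)
n = 60
a = np.triu((rng.random((n, n)) < 0.3).astype(int), k=1)
a = a + a.T
print(spectral_moments(a))
print(spectral_moments(scramble(a, 0.2, rng)))
```

With this standardization, the first moment is zero and the second equals $(n-1)/n$ by construction, so differences between networks would show up only in the higher moments.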

The proposed methodology is evaluated via simulation studies using model-based networks and is further applied to some protein-protein networks.

Reference: (1) Bai, Z. and Yao, J. On the convergence of the spectral empirical process of Wigner matrices. Bernoulli 11, 1059-1092 (2005).

### Candace Metoyer (2007)

ADVISER: Prabir Burman

TITLE: **Estimation Methods for Linear, Nonlinear, and Multidimensional Time Series: Applications of State-Space Modeling**

ABSTRACT: Burman and Shumway (2004) use penalized least-squares to generate estimates for the trend-only linear time series model, Y(t) = T(t) + e(t), where T(t) is called the trend and e(t) is random error. We extend their approach and apply it to the trend plus seasonal linear time series model, Y(t) = T(t) + S(t) + e(t), where S(t) is called the seasonal. We assume that the d-th order trend differences are iid random variables and we assume that the p-th order seasonal sums are iid random variables. Using penalized least-squares, we obtain closed-form expressions for the trend and seasonal estimators. Next, we generalize this method further and consider the class of time series where the distribution of the observation is a member of the exponential family of distributions. We focus on Poisson and Bernoulli time series problems and present an estimation procedure based on the penalized log-likelihood. Last, we consider the class of time series where the observation is a column vector of length M. In this scenario, our first task is dimension reduction. Using a principal components analysis, we reduce the effective dimension from M to m
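For the trend-only model Y(t) = T(t) + e(t), a penalized least-squares estimate with a closed form can be sketched as follows. The criterion $\|y - T\|^2 + \lambda \|D_d T\|^2$, with $D_d$ the d-th order difference matrix, is a standard formulation assumed here for illustration; the smoothing parameter `lam` and the function name `penalized_trend` are hypothetical and not taken from the dissertation.

```python
import numpy as np

def penalized_trend(y, d=2, lam=10.0):
    """Minimize ||y - T||^2 + lam * ||D_d T||^2 over T.

    The minimizer has the closed form T = (I + lam * D'D)^{-1} y,
    where D is the (n - d) x n matrix of d-th order differences.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    D = np.diff(np.eye(n), n=d, axis=0)   # d-th order difference operator
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

# Usage: smooth a noisy linear trend
rng = np.random.default_rng(0)
t = np.arange(50.0)
y = 0.5 * t + rng.normal(scale=2.0, size=t.size)
trend = penalized_trend(y, d=2, lam=50.0)
```

With `d=2`, any exactly linear series has zero second differences and is therefore returned unchanged, whatever the value of `lam`.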

### Lu Wang (2007)

ADVISER: Rudy Beran

TITLE: **Penalization and Rank Reduction**

ABSTRACT: The Penalized Total Least Squares estimator builds on two well-known least squares estimators for an unknown response surface with additive noise: the Penalized Least Squares estimator and the Total Least Squares estimator. We begin by formulating the estimation problem as the rank-constrained minimization of a penalized least squares criterion, which yields the Penalized Total Least Squares estimator. This formulation leads us to consider further classes of candidate estimators for the unknown means in order to achieve lower risk. Adaptation selects the estimator within a candidate class that minimizes the estimated risk, which is an unbiased estimator of the risk function. Under the model assumptions, such adaptive estimators minimize risk asymptotically over the class of candidate estimators as the number of rows of the matrix tends to infinity. The Penalized Total Least Squares estimator is applied to both simulated and real data; in both cases it outperforms the traditional method in the sense of achieving lower risk.

The $\theta$-separable estimator generalizes the idea of the penalized least squares estimator to a broader class. This section deals with the following approach for estimating the mean $m$ of an $n$-dimensional random vector $x$. First, a family $\{A(\theta): \theta \in \Theta\}$ of $n \times n$ matrices is defined. These so-called $\theta$-separable matrices depend on a $p \times 1$ unknown parameter vector $\theta$ and have a special structure in their eigenvalues and eigenvectors. Examples of such estimators include the ANOVA model, ridge regression, and the multiple shrinkage estimator. Then, James-Stein estimation is introduced as minimization of the risk function: an element $A(\tilde{\theta})$, $\tilde{\theta} \in \Theta$, is selected by minimizing the $L_2$ risk function. Because the risk function involves the unknown parameter $m$ and the noise variance, $A(\hat{\theta})$, $\hat{\theta} \in \Theta$, is instead selected by minimizing the estimated risk function, which is a uniformly consistent estimator of the risk function. Selecting estimators by minimizing estimated risk is also known as the Mallows $C_L$ procedure. Generalized cross-validation methods are also introduced.
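Selection by minimizing estimated risk can be sketched for a simple one-parameter family. For a linear estimator $Ax$ of $m$ with noise variance $\sigma^2$, the quantity $\|x - Ax\|^2 + 2\sigma^2 \operatorname{tr}(A) - n\sigma^2$ is an unbiased estimate of the risk $E\|Ax - m\|^2$. The family $A(\theta) = (I + \theta D'D)^{-1}$ used below, the assumption that $\sigma^2$ is known, and the names `estimated_risk` and `select_theta` are choices made for this example, not the $\theta$-separable class defined in the dissertation.

```python
import numpy as np

def estimated_risk(x, A, sigma2):
    """Mallows C_L-type unbiased estimate of E||A x - m||^2 (sigma2 assumed known)."""
    resid = x - A @ x
    return resid @ resid + 2.0 * sigma2 * np.trace(A) - x.size * sigma2

def select_theta(x, sigma2, thetas, d=2):
    """Pick theta on a grid by minimizing estimated risk.

    Hypothetical family: A(theta) = (I + theta * D'D)^{-1}, with D the d-th order
    difference matrix, so all members share the same eigenvectors.
    """
    n = x.size
    D = np.diff(np.eye(n), n=d, axis=0)
    P = D.T @ D
    risks = [estimated_risk(x, np.linalg.inv(np.eye(n) + t * P), sigma2)
             for t in thetas]
    best = int(np.argmin(risks))
    return thetas[best], risks[best]

# Usage: noisy observations of a smooth mean vector
rng = np.random.default_rng(0)
m = np.sin(np.linspace(0.0, 3.0, 40))
x = m + rng.normal(scale=0.3, size=m.size)
theta_hat, risk_hat = select_theta(x, 0.09, [0.0, 1.0, 10.0, 100.0])
```

In practice the noise variance would have to be estimated, which is one reason criteria such as generalized cross-validation are considered as alternatives.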

The two methods are compared both asymptotically and by numerical experiments.

### Jingjing Ye (2007)

ADVISER: David Rocke

TITLE: **Preprocessing and Biomarker Detection Analysis for Biological Mass Spectrometry Data**

ABSTRACT: Biomarker detection using mass spectrometry has been billed as having high potential to improve public health. It has also presented considerable challenges for statistical analysis: high-dimensional data, massive file sizes, noise, and complexity. In this dissertation, I propose methods of preprocessing the spectral data that overcome these difficulties in order to extract the valuable information contained in mass spectrometry data.

In this talk, we propose a five-step preprocessing algorithm developed for mass spectrometry M/I data. The algorithm consists of imputation of missing intensities, normalization, integration of fractions, transformation, and selection of potential biomarkers. The five-step preprocessing of the M/I spectra is carried out on mass spectrometry glycomics data, an emerging research area for detecting biomarkers. The proposed imputation retains information similar to that in the raw spectrum, and the selection of biomarkers based on statistical models is explored. The algorithm is applied to prostate and ovarian cancer glycomics data, with the selection of biomarkers incorporated in cross-validation for evaluation. Given the low misclassification error rates, good precision, and visually and clinically confirmed oligosaccharides detected in the process, we conclude that the five-step M/I spectrum algorithm is a good choice for preprocessing and conducting differential expression analysis on mass spectrometry data.

Moreover, methods that form linear combinations of the selected potential biomarkers to achieve better classification are proposed. We investigate a non-parametric approach that maximizes the area under the curve with constrained threshold gradient direct regularization (TGDR-AUC) on the ovarian glycomics mass spectrometry data. Simulations of the method are conducted, and an asymptotic approximation of the parameters is proved. In the ovarian cancer application, TGDR-AUC is shown to have superior classification performance in both the few-biomarker, large-sample and the many-biomarker, small-sample scenarios. The method can detect clinical biomarkers, which are confirmed to be oligosaccharides, and provides the flexibility of a built-in dimension reduction technique.

The talk shows mass spectrometry users, step by step, the preprocessing procedures and biomarker detection methods based on the data. Our proposed methods serve the purpose of preprocessing the M/I spectrum and performing differential expression analysis on disease outcome. Thus, the methods are competitive in the analysis of biological mass spectrometry data and, if implemented in software, will be available for mass spectrometry users to conduct their analyses.

### Shuying Zhu (2007)

ADVISER: Rudy Beran

TITLE: **Bootstrap Methods with Applications in Multivariate Analysis**

ABSTRACT: This study is concerned with certain classical methods in hypothesis testing and the construction of simultaneous confidence sets in multivariate linear analysis. Three approaches to hypothesis testing are proposed: asymptotic, bootstrap, and prepivoting methods. The performance of the asymptotic method depends strongly on the availability of the asymptotic expansion. The asymptotic test statistic is first order correct; that is, its coverage differs from the nominal level by. Although certain classical refinements to asymptotic tests have been shown to be second order accurate, they are too cumbersome for an analytical approach. Examples include Yao's (1965) approximate degrees of freedom method for the multivariate Behrens-Fisher problem and the multivariate Bartlett adjustment to the chi-squared asymptotics. The analytical difficulties lie in computing the degrees of freedom and recovering the Bartlett factor. The bootstrap method, however, avoids such difficulties and can be approximated directly by a Monte Carlo algorithm. The principal aim of the present investigation is to compare the bootstrap method to the refinements of the asymptotic method in theory and in simulation. It is shown that the appropriate bootstrap test based on the Behrens-Fisher statistic is equivalent to James's first order asymptotic series and Yao's approximate degrees of freedom test, and that the appropriate bootstrap likelihood ratio test automatically accomplishes Bartlett's adjustment to the chi-squared asymptotics. In addition, prepivoting any test statistic before forming a bootstrap test reduces the order of the error in rejection probability. The prepivoting can be iterated.

The problem of constructing simultaneous confidence sets in multivariate linear analysis is also considered. When the normality assumption is not satisfied, classical methods such as the pivotal method are too difficult for an analytical approach. One way to address this problem is to employ the nonparametric bootstrap method that underlies Beran's (1988) bootstrapped roots method. Under stringent conditions, it is shown that the bootstrapped roots method overcomes distributional difficulties and generates simultaneous confidence sets such that the overall coverage probability is correct and the coverage probabilities of the individual confidence sets are equal, in both multivariate regression and multivariate analysis of variance. For the special case in multivariate analysis of variance where normality holds, the projection method is proposed. It is shown that the projection method is suitable not only for the balanced complete layout but also for the unbalanced complete layout.
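To make the Monte Carlo approximation of a bootstrap test concrete, here is a minimal sketch for the two-sample multivariate Behrens-Fisher problem (equal mean vectors, possibly unequal covariance matrices). The resampling scheme, which centers each sample at its own mean to impose the null hypothesis, and the names `bf_statistic` and `bootstrap_pvalue` are assumptions for this example; the sketch does not include prepivoting and is not claimed to be the exact procedure studied in the dissertation.

```python
import numpy as np

def bf_statistic(x, y):
    """Behrens-Fisher statistic: (xbar - ybar)' (S1/n1 + S2/n2)^{-1} (xbar - ybar)."""
    diff = x.mean(axis=0) - y.mean(axis=0)
    cov = (np.atleast_2d(np.cov(x, rowvar=False)) / len(x)
           + np.atleast_2d(np.cov(y, rowvar=False)) / len(y))
    return float(diff @ np.linalg.solve(cov, diff))

def bootstrap_pvalue(x, y, n_boot=999, seed=0):
    """Monte Carlo bootstrap p-value for H0: equal mean vectors."""
    rng = np.random.default_rng(seed)
    t_obs = bf_statistic(x, y)
    # Center each sample at its own mean so the resampling world satisfies H0
    xc, yc = x - x.mean(axis=0), y - y.mean(axis=0)
    count = 0
    for _ in range(n_boot):
        xb = xc[rng.integers(0, len(xc), len(xc))]   # resample rows with replacement
        yb = yc[rng.integers(0, len(yc), len(yc))]
        count += bf_statistic(xb, yb) >= t_obs
    return (1 + count) / (1 + n_boot)

# Usage: two samples of rows (observations) by columns (variables)
rng = np.random.default_rng(1)
x = rng.normal(size=(30, 2))
y = rng.normal(scale=2.0, size=(40, 2)) + 1.0
print(bootstrap_pvalue(x, y))
```

No degrees-of-freedom approximation or Bartlett factor is computed here; the null distribution of the statistic is approximated entirely by resampling.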