Seminars 2016-2017



Wednesday, June 7, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Michael Binois (Chicago Booth and Virginia Tech)

Title: Practical Heteroskedastic Gaussian Process Modeling and Design for Large Simulation Experiments

Abstract: We present a unified view of likelihood-based Gaussian process regression for simulation experiments exhibiting input-dependent noise. Replication plays a key role in that context, as it allows inference for all parameters while bypassing full-data-sized calculations. We then borrow a latent-variable idea from machine learning to address heteroskedasticity, leveraging both the computational and statistical efficiency of designs with replication. We further propose to create and sequentially enrich designs with a tunable degree of replication. Illustrations are provided, including real-world simulation experiments from manufacturing and the management of epidemics.

Wednesday, May 31, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Yunjin Choi (Department of Statistics and Applied Probability, National University of Singapore)

Title: 2-D logistic fused lasso on large scale data

Abstract: Lasso is a widely used regression analysis method that yields sparse estimators. As sparse estimators lead to an automated model selection procedure, many variations of lasso have been developed to address various problem settings. Fused lasso, one of these variations, utilizes the sparsity property to obtain locally clustered estimators. In this work, we investigate the 2-D fused lasso in a logistic regression problem, which detects two-dimensional geographical features of estimators given binary response variables. Despite many beneficial features, fused lasso can be computationally costly, especially when the cost function does not have an analytic solution. To address this issue, our proposed method facilitates large-scale data analysis by utilizing the ADMM algorithm. We also extend the method to flexible data structures, including scenarios with missing observations. Simulations show promising results.

Wednesday, May 24, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Mladen Kolar (Booth School of Business, University of Chicago)

Title: Post-Regularization Inference for Dynamic Nonparanormal Graphical Models

Abstract: We propose a novel class of dynamic nonparanormal graphical models, which allows us to model high-dimensional heavy-tailed systems and the evolution of their latent network structures. Under this model we develop statistical tests for the presence of edges both locally at a fixed index value and globally over a range of values. The tests are developed for a high-dimensional regime, are robust to model selection mistakes and do not require the commonly assumed minimum signal strength. The testing procedures are based on a high-dimensional, debiasing-free moment estimator, which uses a novel kernel-smoothed Kendall's tau correlation matrix as an input statistic. The estimator consistently estimates the latent inverse Pearson correlation matrix uniformly in both index variable and kernel bandwidth. Its rate of convergence is shown to be minimax optimal. Thorough numerical simulations and an application to a neural imaging dataset support the usefulness of our method.

Joint work with Junwei Lu and Han Liu.

Wednesday, May 17, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Ramtin Pedarsani (ECE, UCSB)

Title: Fast and Robust Compressive Phase Retrieval Using Sparse-Graph Codes

Abstract: In this talk, we consider the problem of compressive phase retrieval that emerges in different applications such as optics, X-ray crystallography, astronomy, etc. The compressive phase retrieval problem is to recover a sparse complex signal from the magnitude of linear measurements. We consider two scenarios where the measurements are noiseless or noisy, and using coding theoretic tools, we develop the "PhaseCode" algorithm that has near-optimal sample and decoding complexity in both settings. Further, we provide extensive simulation results that show tight agreement between theory and practice. Finally, we will briefly discuss how our coding theoretic framework can tackle other sparse recovery problems such as sparse mixed linear regression, group testing, and sparse covariance estimation.
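
Illustrative note (not part of the abstract): the measurement model described above, magnitudes of linear measurements of a sparse complex signal, can be sketched in a few lines. The random Gaussian measurement matrix, the sizes n and m, and the helper name magnitude_measurements are assumptions made for illustration only; this is not the PhaseCode construction. The sketch shows why recovery is possible only up to a global phase.

```python
import cmath
import random


def magnitude_measurements(A, x):
    """Return |A x| for a complex matrix A (list of rows) and complex vector x."""
    return [abs(sum(a_ij * x_j for a_ij, x_j in zip(row, x))) for row in A]


random.seed(0)
n, m = 8, 5  # signal length and number of measurements (illustrative values)

# Sparse complex signal: only two nonzero entries.
x = [0j] * n
x[1] = 2 + 1j
x[6] = -1 + 3j

# Random complex Gaussian measurement matrix (an assumption, not PhaseCode's design).
A = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
     for _ in range(m)]

y = magnitude_measurements(A, x)

# Multiplying x by a unit-modulus constant leaves every magnitude unchanged,
# so x can only be recovered up to a global phase.
phase = cmath.exp(1j * 0.7)
y_rotated = magnitude_measurements(A, [phase * v for v in x])
max_gap = max(abs(a - b) for a, b in zip(y, y_rotated))
```

Here max_gap is zero up to floating-point rounding, which is the global phase ambiguity inherent in any phase retrieval problem.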

Wednesday, May 10, Location: Multi-Cultural Center, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Bin Yu (Statistics and EECS, UC Berkeley) [Sobel Seminar]

Title: Three principles of data science: predictability, stability, and computability

Abstract: In this talk, I'd like to discuss the intertwined importance and connections of the three principles of data science in the title for data-driven decisions. Making prediction as its central task and embracing computation as its core, machine learning has enabled wide-ranging data-driven successes. Prediction is a useful way to check with reality. Good prediction implicitly assumes stability between past and future. Stability (relative to data and model perturbations) is also a minimum requirement for interpretability and reproducibility of data-driven results (cf. Yu, 2013). It is closely related to uncertainty assessment. Obviously, both prediction and stability principles cannot be employed without feasible computational algorithms, hence the importance of computability.

The three principles will be demonstrated in the context of two neuroscience projects and through analytical connections. In particular, the first project adds stability to predictive modeling used for reconstruction of movies from fMRI brain signals for interpretable models. The second project uses predictive transfer learning that combines AlexNet, GoogleNet and VGG with single V4 neuron data for state-of-the-art prediction performance. Our results lend support, to a certain extent, to the resemblance of these CNNs to the brain and at the same time provide stable pattern interpretations of neurons in the difficult primate visual cortex V4.

Wednesday, May 3, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Bala Rajaratnam (University of California, Davis)

Title: A spatial modeling approach to high dimensional statistical paleoclimate reconstructions

Abstract: The study of climate over the earth's history is a topic of current interest whose relevance has increased rapidly with the growing concern over climate change. Reconstructing climates of the past (sometimes referred to as the "hockey stick" problem) has been used to understand whether the current climate is anomalous in a millennial context. To this end, various statistical climate field reconstruction (CFR) methods have been proposed to infer past temperature from (paleoclimate) multiproxy networks. We propose a novel statistical climate field reconstruction method that aims to use recent advances in statistics, and in particular, high-dimensional sparse covariance estimation to tackle this problem. The new CFR method provides a flexible framework for modeling the inherent spatial heterogeneities of high-dimensional spatial fields and at the same time provides the parameter reduction necessary for obtaining precise and well-conditioned estimates of the covariance structure of the field, even when the sample size is much smaller than the number of variables. Our results show that the new method can yield significant improvements over existing methods, with gains uniformly over space. We also show that the new methodology is useful for regional paleoclimate reconstructions, and can yield better uncertainty quantification. We demonstrate that the increase in performance is directly related to recovering the underlying structure in the covariance of the spatial field. We also provide compelling evidence that the new methodology performs well even at spatial locations with few proxies. (Joint work with D. Guillot and J. Emile-Geay).

Wednesday, April 26, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Rajarshi Guhaniyogi (University of California, Santa Cruz)

Title: Scalable Bayesian regression framework for tensor valued objects

Abstract: This talk proposes a Bayesian approach to regression with a tensor predictor or response. Tensor covariates/responses are commonly vectorized prior to analysis, failing to exploit the structure of the tensor, and resulting in poor estimation and predictive performance. We develop a novel class of multiway shrinkage priors for the coefficients in tensor regression models. Properties are described, including posterior consistency under mild conditions, and an efficient Markov chain Monte Carlo algorithm is developed for posterior computation. Simulation studies illustrate substantial gains over vectorizing or using existing tensor regression methods in terms of estimation and parameter inference. The approach is further illustrated in a neuroimaging application.

Monday, April 10, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Hokwon Cho (University of Nevada, Las Vegas)

Title: On Fixed-Width Confidence Limits for the Risk Ratio

Abstract: We propose a sequential method for obtaining approximate confidence limits and optimal sample sizes for the risk ratio (RR) of two independent sequences of binomial variates. The procedure is developed based on a bias-corrected maximum likelihood estimator (MLE). We investigate properties of the estimator: unbiasedness, efficiency in variance and normality. First-order asymptotic expansions are obtained to investigate large-sample properties of the proposed procedure. A Monte Carlo experiment is carried out for various sample scenarios to examine the finite-sample behavior. A two-stage procedure will be discussed if time permits.
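
Illustrative note (not part of the abstract): for readers unfamiliar with the quantity being estimated, the sketch below computes the risk ratio of two binomial samples together with a fixed-sample confidence interval. It uses the standard log-scale Wald (Katz) interval as a stand-in, not the sequential bias-corrected procedure of the talk. The function name risk_ratio_ci and the counts are hypothetical.

```python
import math


def risk_ratio_ci(x1, n1, x2, n2, z=1.96):
    """Point estimate and approximate CI for the risk ratio p1 / p2.

    Uses the standard log-scale Wald (Katz) interval, not the sequential
    bias-corrected procedure from the talk. Requires x1 > 0 and x2 > 0.
    """
    p1, p2 = x1 / n1, x2 / n2
    rr = p1 / p2
    # Delta-method standard error of log(RR).
    se_log = math.sqrt(1 / x1 - 1 / n1 + 1 / x2 - 1 / n2)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Hypothetical counts: 30 events out of 100 versus 15 events out of 100.
rr, lower, upper = risk_ratio_ci(30, 100, 15, 100)
```

The width of this interval is determined by the sample sizes. A fixed-width procedure such as the one in the talk instead chooses the sample sizes so that the interval meets a prescribed width.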

Wednesday, February 22, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Raghu Pasupathy (Purdue University)

Title: The Adaptive Gradient Method for Optimization of Functions Observed with Error

Abstract: We consider the question of unconstrained optimization in certain settings, e.g., stochastic optimization, where the objective function f is unknown but can be estimated using an inexact oracle such as quasi-Monte Carlo (QMC) or numerical quadrature. The inexact oracle is assumed to yield function estimates whose error decays with an increase in oracle effort. For solving such optimization problems, we present the adaptive gradient method (AGM) in two flavors called Constant Step AGM and Backtracking Line Search AGM. AGM's salient feature is the adaptive manner in which it constructs gradient estimates: since gradient information is unavailable, AGM estimates the gradient adaptively, by exerting (only) enough oracle effort to keep the error in the gradient estimate in "lock-step" with the true gradient. We show that AGM's iterates converge to a stationary point of f under minimal conditions. We also prove two sets of results on AGM's work complexity. (i) When f is twice differentiable with Lipschitz first derivative, AGM's work complexity is shown to be arbitrarily close to O(ε^{-2 - 1/μ(α)}), where μ(α) is the error decay rate of the gradient estimate expressed as a function of the error decay rate α of the objective function estimate. (ii) When f is twice differentiable and strongly convex with Lipschitz first derivative, AGM's work complexity is shown to be arbitrarily close to O(ε^{-1/μ(α)}). The corresponding rates in (i) and (ii) when an exact oracle for f and its gradient is available are known to be O(ε^{-2}) and O(-log ε) respectively. We illustrate the calculation of α and μ(α) for common choices, e.g., QMC with finite difference gradients; we also illustrate AGM through a numerical example.

Wednesday, January 25, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Alexander Franks (Harvard University)

Title: Bayesian Covariance Estimation with Applications in High-throughput Biology

Abstract: Understanding the function of biological molecules requires statistical methods for assessing covariability across multiple dimensions as well as accounting for complex measurement error and missing data. In this talk, I will discuss two models for covariance estimation which have applications in molecular biology. In the first part of the talk, I will describe a model-based method for evaluating heterogeneity among several p x p covariance matrices in the large p, small n setting and will illustrate the utility of the method for exploratory analyses of high-dimensional multivariate gene expression data. In the second half of the talk, I will describe the role of covariance estimation in quantifying how cells regulate protein levels. Specifically, estimates of the correlation between steady-state levels of mRNA and protein are used to assess the degree to which protein levels are determined by post-transcriptional processes. Differences in cell preparation, measurement technology and protocol, as well as the pervasiveness of missing data complicate the accurate estimation of this correlation. To address these issues, I fit a Bayesian hierarchical model to a compendium of 58 data sets from multiple labs to infer a structured covariance matrix of measurements. I contextualize and contrast our results to conclusions drawn in previous studies.

Monday, January 23, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Benjamin Risk (SAMSI & University of North Carolina)

Title: Linear non-Gaussian component analysis via maximum likelihood

Abstract: Independent component analysis (ICA) is popular in many applications, including cognitive neuroscience and signal processing. Due to computational constraints, principal component analysis is used for dimension reduction prior to ICA (PCA+ICA), which could remove important information. The problem is that interesting independent components (ICs) could be mixed in several principal components that are discarded and then these ICs cannot be recovered. We formulate a linear non-Gaussian component model with Gaussian noise components. To estimate this model, we propose likelihood component analysis (LCA), in which dimension reduction and latent variable estimation are achieved simultaneously. Our method orders components by their marginal likelihood in a manner that parallels the ordering of components by variance used in principal component analysis (PCA). We present a semi-parametric LCA in which the log densities are estimated using cubic B-splines. In simulations, latent components are recovered that are discarded by PCA+ICA methods. We apply our method to a multivariate dataset on leaf attributes and demonstrate that LCA is a useful data visualization and dimension reduction tool that reveals features not apparent from PCA or PCA+ICA. We also apply our method to an fMRI experiment from the Human Connectome Project and identify artifacts missed by PCA+ICA.

Friday, January 20, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Yuwen Gu (University of Minnesota)

Title: High-dimensional Generalizations of Asymmetric Least Squares and Their Applications

Abstract: Asymmetric least squares (ALS) regression is a convenient and effective method of summarizing the conditional distribution of a response variable given the covariates. Recent years have seen a growing interest in ALS amongst statisticians, econometricians and financial analysts. However, existing work on ALS only considers the traditional low-dimension-and-large-sample setting. In this talk, we systematically explore the Sparse Asymmetric LEast Squares (SALES) regression under high dimensionality. We show the complete theory using penalties such as lasso, MCP and SCAD. A unified efficient algorithm for fitting SALES is proposed and is shown to have a guaranteed linear convergence.

An important application of SALES is to detect heteroscedasticity in high-dimensional data and from that perspective it provides a computationally friendlier alternative to sparse quantile regression (SQR). However, when the goal is to separate the set of significant variables for the mean and that for the standard deviation of the conditional distribution, SALES and SQR can fail when overlapping variables exist. To that end, we further propose a Coupled Sparse Asymmetric LEast Squares (COSALES) regression. We show that COSALES can consistently identify the two important sets of significant variables for the mean and standard deviation simultaneously, even when the two sets have overlaps.
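
Illustrative note (not part of the abstract): the asymmetric least squares loss underlying the methods above weights squared residuals by tau when they are non-negative and by 1 - tau when they are negative. The sketch below covers only the intercept-only (location) case, where the minimizer is the sample tau-expectile. It is not the SALES or COSALES algorithm, and the function names are hypothetical.

```python
def als_loss(residuals, tau):
    """Asymmetric least squares loss: sum of |tau - 1(r < 0)| * r**2."""
    return sum((tau if r >= 0 else 1 - tau) * r * r for r in residuals)


def expectile(values, tau, iterations=100):
    """tau-expectile of a sample, found by iteratively reweighted averaging."""
    m = sum(values) / len(values)
    for _ in range(iterations):
        # Observations at or above the current estimate get weight tau,
        # those below get weight 1 - tau.
        weights = [tau if v >= m else 1 - tau for v in values]
        m = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return m


data = [1.0, 2.0, 3.0, 4.0, 10.0]  # made-up sample
center = expectile(data, 0.5)      # tau = 0.5 reduces to the ordinary mean
upper = expectile(data, 0.9)       # tau > 0.5 moves toward the upper tail
```

With tau = 0.5 the loss is half the ordinary squared error, so the expectile coincides with the mean. Varying tau traces out different parts of the distribution, which is what makes the loss useful for detecting heteroscedasticity.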

Wednesday, January 18, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Kelly Bodwin (University of North Carolina Chapel Hill)

Title: Coherent Set Mining in Binary Data

Abstract: In this talk, I will introduce a new algorithm, Coherent Set Mining (CSM), for extracting sets of associated variables from binary data. This method relies on a new notion of object association, coherence, and an iterative testing approach based on asymptotic distribution results. In particular, CSM is designed to be effective in the common but challenging setting of non-identically distributed samples. I will share applications of CSM in text mining and in music recommendation systems.

Friday, January 13, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Murat Erdogdu (Statistics, Stanford University)

Title: Design and Analysis of Scalable Algorithms via Statistical Tools

Abstract: Statistics and optimization have been closely linked since the very outset. This connection has become more essential lately, mainly because of the recent advances in computational resources, the availability of large amounts of data, and the consequent growing interest in statistical and machine learning algorithms. In this talk, I will discuss how one can use tools from statistics such as Stein's lemma and subsampling to design scalable, efficient, and reliable optimization algorithms. The focus will be on large-scale problems where the iterative minimization of the empirical risk is computationally intractable, i.e., the number of observations n is much larger than the dimension of the parameter p, n >> p >> 1. The proposed algorithms have wide applicability to many supervised learning problems such as binary classification with smooth surrogate losses, generalized linear problems in their canonical representation, and M-estimators. The algorithms rely on iterations that are constructed by Stein's lemma, that achieve quadratic convergence rate, and that are cheaper than any batch optimization method by at least a factor of O(p). I will discuss theoretical guarantees of the proposed algorithms, along with their convergence behavior in terms of data dimensions. Finally, I will demonstrate their performance on well-known classification and regression problems, through extensive numerical studies on large-scale real datasets, and show that they achieve the highest performance compared to other widely used and specialized algorithms.

Monday, January 9, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Zijian Guo (Statistics, University of Pennsylvania)

Title: Inference for High Dimensional Linear Models: Fundamental Limits and Algorithms

Abstract: High dimensional linear regression is one of the most important models in analyzing modern data sets. Although the estimation problem is well understood, there is still a paucity of methods and fundamental theoretical results on confidence intervals for high dimensional linear regression. In this talk, I will present confidence interval results for a general linear functional. I will first construct confidence intervals of optimal expected length in the oracle setting of known sparsity level. Then, I will focus on the problem of adaptation to sparsity for the construction of confidence intervals. I will identify the regimes in which it is possible to construct adaptive confidence intervals. In terms of optimality and adaptivity, there are striking differences between linear functionals with a sparse loading and a dense loading.

In the framework of high dimensional linear models, another interesting quantity is the normalized inner product of the two regression vectors, which can represent an important concept in genetics, the genetic correlation between phenotypes. I will introduce the Functional De-biased Estimator (FDE), which achieves the optimal convergence rate for estimating the genetic correlation. The FDE is applied to estimate the genetic correlations among different phenotypes in a yeast data set. Finally, I will discuss an interesting connection between the aforementioned problems and provide a unified view of the proposed procedures.

Wednesday, November 30, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: John R. Gilbert (Computer Science, UCSB)

Title: Graphs and Sparse Matrices: There and Back Again

Abstract: The mathematical connections between graph theory and linear algebra are intimate and well known. The computational links between the two fields are also deep, extending from basic data structures to fundamental decompositions to the design of efficient algorithms. During the first 50 years of this computational relationship, graphs served numerical linear algebra by enabling efficient sparse matrix computation. Recently, matrix computation has been returning the favor.

I will talk about the past and present of this relationship in both directions, and speculate a bit on its future. Along the way, I will describe two software systems we have built for computing with large graphs and networks on parallel computers, CombBLAS and the Knowledge Discovery Toolbox. The key to their performance and scaling is sparse matrix computation. Finally, I will advertise the Graph BLAS Forum, an open effort to standardize primitives for graph computation, building on many groups' work on graph algorithms in the language of linear algebra.

Wednesday, November 16, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Ali Al-Sharadqah (California State University Northridge)

Title: Geometric Fitting in Error-In-Variables Models

Abstract: We will introduce Errors-in-Variables (EIV) models and their applications in geometric estimation, which is a widely known topic in computer vision and pattern recognition. In geometric estimation, two types of problems will be discussed: (1) Fitting geometric curves such as circles and ellipses to a set of experimental observations, both of whose coordinates are contaminated by noise. (2) Other applications in computer vision such as 'Fundamental Matrix' estimation and 'Homography' computation that are essential in 3D-reconstruction.

Some theoretical results in circle and ellipse fitting will be addressed first. These results lead to some methodological questions that require further investigation. Therefore, we developed our unconventional statistical analysis that allowed us to effectively assess EIV parameter estimates. We validated this approach through a series of numerical tests. We theoretically compared the most popular fits for circles and ellipses with each other and we showed why and by how much each fit differs from others. Our theoretical comparison leads to new unbeatable fits with superior characteristics that surpass all existing fits theoretically and experimentally.
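
Illustrative note (not part of the abstract): as a concrete instance of circle fitting, the sketch below implements the simple algebraic least-squares fit, often called the Kåsa fit, which solves a linear system for the coefficients of x^2 + y^2 + D x + E y + F = 0. It is shown only as a familiar baseline; it is not claimed to be one of the new fits developed in the talk, and the helper names are hypothetical.

```python
import math


def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            factor = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * 3
    for i in reversed(range(3)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = (M[i][3] - s) / M[i][i]
    return x


def fit_circle_algebraic(points):
    """Least-squares fit of x^2 + y^2 + D x + E y + F = 0; returns (cx, cy, r)."""
    rows = [(x, y, 1.0) for x, y in points]
    z = [x * x + y * y for x, y in points]
    # Normal equations: (M^T M) [D, E, F]^T = -M^T z, with M the rows above.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atz = [-sum(r[i] * zi for r, zi in zip(rows, z)) for i in range(3)]
    d, e, f = solve3(ata, atz)
    cx, cy = -d / 2, -e / 2
    return cx, cy, math.sqrt(cx * cx + cy * cy - f)


# Noise-free points on a circle with center (1, -2) and radius 3.
points = [(1 + 3 * math.cos(t), -2 + 3 * math.sin(t))
          for t in (0.0, 1.0, 2.0, 3.5, 5.0)]
cx, cy, r = fit_circle_algebraic(points)
```

On noise-free data the fit recovers the center and radius up to rounding error. With noisy coordinates, different fits behave differently, which is the kind of comparison the abstract describes.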

Wednesday, November 9, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Xiaotong Shen (University of Minnesota)

Title: Personalized prediction and recommender systems

Abstract: Personalized prediction predicts a user's preference for a large number of items through user-specific as well as content-specific information, based on a very small amount of observed preference scores. In a sense, predictive accuracy depends on how to pool the information from similar users and items. Two major approaches are collaborative filtering and content-based filtering. Whereas the former utilizes the information on users that think alike for a specific item, the latter acts on characteristics of the items that a user prefers, on which two kinds of recommender systems Grooveshark and Pandora are built. In this talk, I will discuss various aspects of latent factor modeling, in addition to computational strategies for large problems.

Wednesday, November 2, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Josselin Garnier (Ecole Polytechnique)

Title: Correlation-based imaging in random media

Abstract: Sensor array imaging in a randomly scattering medium is usually limited because coherent signals recorded by the array and coming from a reflector to be imaged are weak and dominated by incoherent signals coming from multiple scattering by the medium. Stochastic and multi-scale analysis has recently allowed for the emergence of original imaging techniques. We will see in this talk how correlation-based imaging techniques can mitigate or even sometimes benefit from the multiple scattering of waves.

Wednesday, October 26, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: (Tony) Jianguo Sun (University of Missouri)

Title: Regression Analysis of Informatively Interval-censored Failure Time Data

Abstract: Interval-censored failure time data occur in many fields such as demography, economics, medical research and reliability, and many inference procedures on them have been developed (Chen et al., 2012; Sun, 2006). However, most of the existing approaches assume that the mechanism that yields interval censoring is independent of the failure time of interest and it is clear that this may not be true in practice. In this talk, we will discuss this latter situation and present some inference procedures for the problem.

Wednesday, October 19, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Jason Marden (ECE-UCSB)

Title: Incentivizing Local Behavior in Distributed Systems

Abstract: The central goal in multiagent systems is to design local control laws for the individual agents to ensure that the emergent global behavior is desirable with respect to a given system level objective. Game theory is beginning to emerge as a valuable set of tools for achieving this goal. A central component of this game theoretic approach is the assignment of utility functions to the individual agents. Here, the goal is to assign utility functions within an "admissible" design space such that the resulting game possesses desirable properties, e.g., existence and efficiency of pure Nash equilibria. Our first set of results focuses on ensuring the existence of pure Nash equilibria. Here, we prove that weighted Shapley values completely characterize the space of "local" utility functions that guarantee the existence of a pure Nash equilibrium. That is, if the agents' utility functions cannot be represented as a weighted Shapley value, then there exists a game for which a pure Nash equilibrium does not exist. One of the interesting consequences of this characterization is that guaranteeing the existence of a pure Nash equilibrium necessitates the use of a game structure termed "potential games". Building on this characterization, our second set of results will focus on characterizing the utility functions that optimize the efficiency of the resulting pure Nash equilibrium.

Wednesday, October 12, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Pierre-Oliver Goffard (PSTAT-UCSB)

Title: Boundary Crossing Problems with Applications to Risk Management

Abstract: Many problems in stochastic modeling come down to studying the crossing time of a certain stochastic process through a given boundary, lower or upper. Typical fields of application are risk theory, epidemic modeling, queueing, reliability and sequential analysis. The purpose of this talk is to present a method to determine boundary crossing probabilities linked to stochastic point processes having the order statistic property. A very well known boundary crossing result is revisited and a detailed proof is given. The same arguments may be used to derive results in trickier situations. We further discuss the practical implications of this classical result and, if there is still some time left, some duality features might be presented.

Wednesday, September 28, South Hall 5607F, 3:30-5:00 p.m.; refreshments served at 3:15 p.m.

Speaker: Debases Sengupta (PSTAT-UCSB and Indian Statistical Institute, Kolkata)

Title: Feature sensitive and automated curve registration with paleo-climatic application

Abstract: Given two sets of functional data having a common underlying mean function but different degrees of distortion in time measurements, we provide a method of estimating the time transformation necessary to align (or 'register') them. The novelty of the method lies in the elimination of prior smoothing, which can be an impediment to good performance. We prove that the proposed method is consistent under fairly general conditions. Simulation results show superiority of the performance of the proposed method over two existing methods. The proposed method is illustrated through the analysis of three paleoclimatic data sets. (This work was done jointly with Dibyendu Bhowmick and Radhendushka Srivastava.)