Research

My research is on methods in spatial and environmental epidemiology, specifically focusing on environmental mixtures and spatio-temporal cluster detection. In my collaborative work, I have explored spatial aspects of breast cancer risk, maternal and obstetric outcomes, and environmental and ecological applications.

A Flexible Method for Identifying Spatial Clusters of Breast Cancer Using Individual-Level Data

Maria E. Kamenetsky, Amy Trentham-Dietz, Polly Newcomb, Jun Zhu, Ronald Gangnon (Annals of Epidemiology, 2022)

Prior research has shown that cancer risk varies by geography, but scan statistic methods for identifying cancer clusters in case-control studies have been limited in their ability to identify multiple clusters and adjust for participant-level risk factors. We develop a method to identify geographic patterns of breast cancer odds using the Wisconsin Women’s Health Study, a series of 5 population-based case-control studies of female Wisconsin residents aged 20-79 enrolled in 1988-2004 (cases=16,076, controls=16,795). We create sets of potential clusters by overlaying a 1 km grid over each county-neighborhood and enumerating a series of overlapping circles. Using a two-step approach, we fit a penalized binomial regression model to the number of cases and trials in each grid cell, penalizing all potential clusters by the least absolute shrinkage and selection operator (Lasso). We use BIC to select the number of clusters, which are included in a participant-level logistic regression model. We identify 15 geographic clusters, resulting in 23 areas of unique geographic odds ratios. After adjustment for known risk factors, confidence intervals narrowed but breast cancer odds ratios did not meaningfully change; one additional hotspot was identified. By considering multiple overlapping spatial clusters simultaneously, we discern gradients of spatial odds across Wisconsin.
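The circle-enumeration step described above can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the code used in the paper: the function name `enumerate_circles`, the 3 x 3 grid, and the 1 km maximum radius are all assumptions made for the example. Each circle is centred on a grid cell and grows to include successively more distant cells, and the result is an indicator matrix with one row per potential cluster.

```python
import numpy as np

def enumerate_circles(centroids, max_radius):
    """Enumerate overlapping circular potential clusters (hypothetical helper).

    Returns a list of (centre index, radius) pairs and a boolean matrix whose
    rows mark which grid cells fall inside each circle.
    """
    centroids = np.asarray(centroids, dtype=float)
    n = len(centroids)
    # Pairwise Euclidean distances between cell centroids.
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    clusters = []  # (centre index, radius) for each potential cluster
    rows = []      # membership indicator for each potential cluster
    for centre in range(n):
        # Every distinct distance up to max_radius defines a larger circle.
        radii = np.unique(dist[centre][dist[centre] <= max_radius])
        for r in radii:
            clusters.append((centre, float(r)))
            rows.append(dist[centre] <= r)
    membership = np.array(rows, dtype=bool)  # shape: (n_clusters, n_cells)
    return clusters, membership

# Assumed toy input: a 3 x 3 grid of cell centroids with 1 km spacing.
xs, ys = np.meshgrid(np.arange(3.0), np.arange(3.0))
grid = np.column_stack([xs.ravel(), ys.ravel()])
clusters, membership = enumerate_circles(grid, max_radius=1.0)
```

In a penalized regression of the kind described above, each row of `membership` would correspond to one candidate cluster effect, and the Lasso penalty would shrink most of those effects to zero.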

Regularized spatial and spatio-temporal cluster detection

Maria E. Kamenetsky, Junho Lee, Jun Zhu, Ronald Gangnon (Spatial and Spatio-Temporal Epidemiology, 2022)

Spatial and spatio-temporal cluster detection are important tools in public health and many other areas of application. Cluster detection can be approached as a multiple testing problem, typically using a space and time scan statistic. We recast the spatial and spatio-temporal cluster detection problem in a high-dimensional data analytical framework with Poisson or quasi-Poisson regression with the Lasso penalty. We adopt a fast and computationally efficient method using a novel sparse matrix representation of the effects of potential clusters. The number of clusters and tuning parameters are selected based on (quasi-)information criteria. We evaluate the performance of our proposed method including the false positive detection rate and power using a simulation study. Application of the method is illustrated using breast cancer incidence data from three prefectures in Japan.

Tutorials supplement “Regularized spatial and spatio-temporal cluster detection” (2022) and are associated with the clusso R package, which can be found here.

Introduction to clusso

Mapping with clusso

Using clusso with case-control data

strm

Maria Kamenetsky, Guangqing Chi, Jun Zhu (2020)

strm is an R package that fits a spatio-temporal regression model based on Chi & Zhu, Spatial Regression Models for the Social Sciences (2019). The approach fits a simultaneous spatial error model (SAR) while incorporating a temporally lagged response variable and temporally lagged explanatory variables. The GitHub page can be found here, and strm is now available on CRAN.
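To make the idea of temporally lagged terms concrete, here is a small Python sketch of how a lagged response and lagged explanatory variables might be assembled from panel data. It does not use strm or reproduce its interface: the function `add_temporal_lags` and the array layout (time periods by spatial units) are assumptions for illustration, and the spatial error structure itself is not modelled here.

```python
import numpy as np

def add_temporal_lags(y, X):
    """Build a one-period-lagged design (hypothetical helper, not strm's API).

    y: array of shape (T, n), response for n spatial units over T periods.
    X: array of shape (T, n, p), p explanatory variables.
    Returns the response for periods 2..T and a design matrix whose columns
    are [X_t, X_{t-1}, y_{t-1}] for each unit and period.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    # The first period has no predecessor, so it is dropped.
    y_now, y_lag = y[1:], y[:-1]
    X_now, X_lag = X[1:], X[:-1]
    design = np.concatenate([X_now, X_lag, y_lag[..., None]], axis=2)
    return y_now.reshape(-1), design.reshape(-1, design.shape[2])

# Assumed toy panel: 3 periods, 2 spatial units, 1 explanatory variable.
y = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X = (10.0 * y)[..., None]
response, design = add_temporal_lags(y, X)
```

A spatial error model would then be fit to `response` and `design`, with the spatial dependence entering through the error term rather than through these columns.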

Clustered Spatio-Temporal Varying Coefficient Regression Model

Junho Lee, Maria E. Kamenetsky, Ronald Gangnon, Jun Zhu (Statistics in Medicine, 2021)

In regression analysis for spatio-temporal data, identifying clusters of spatial units over time in a regression coefficient could provide insight into the unique relationship between a response and covariates in certain subdomains of space and time windows relative to the background in other parts of the spatial domain and the time period of interest. In this article, we propose a varying coefficient regression method for spatial data repeatedly sampled over time, with heterogeneity in regression coefficients across both space and over time. In particular, we extend a varying coefficient regression model for spatial-only data to spatio-temporal data with flexible temporal patterns. We consider the detection of a potential cylindrical cluster of regression coefficients based on testing whether the regression coefficient is the same or not over the entire spatial domain for each time point. For multiple clusters, we develop a sequential identification approach. We assess the power and identification of known clusters via a simulation study. Our proposed methodology is illustrated by the analysis of a cancer mortality dataset in the Southeast of the U.S.

Tutorials supplement “Clustered Spatio-Temporal Varying Coefficient Regression Model” (2021) and are associated with the coefclust package, which can be found here.

Introduction to coefclust

Spatio-Temporal Analysis using coefclust

Spatial Regression Analysis of Poverty in R

Maria Kamenetsky, Guangqing Chi, Donghui Wang, Jun Zhu (Spatial Demography, 2019)

Poverty has been studied across many social science disciplines, resulting in a large body of literature. Scholars of poverty research have long recognized that the poor are not uniformly distributed across space. Understanding the spatial aspect of poverty is important because it helps us understand place-based structural inequalities. There are many spatial regression models, but there is a learning curve in applying them to poverty research. This manuscript aims to introduce the concepts of spatial regression modeling and walk the reader through the steps of conducting poverty research using R: standard exploratory data analysis, standard linear regression, neighborhood structure and spatial weight matrix, exploratory spatial data analysis, and spatial linear regression. We also discuss the spatial heterogeneity and spatial panel aspects of poverty. We provide code for data analysis in the R environment and readers can modify it for their own data analyses. We also present results in their raw format to help readers become familiar with the R environment.
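As a rough illustration of the neighborhood structure and spatial weight matrix steps, the Python sketch below builds row-standardized rook-contiguity weights on a regular grid and computes Moran's I. The abstract does not name a specific statistic, so Moran's I is chosen here as one common exploratory measure of spatial autocorrelation. The paper's own code is in R, and the function names and toy grids below are assumptions for the example.

```python
import numpy as np

def rook_weights(n_rows, n_cols):
    """Row-standardized rook-contiguity weights for a regular grid."""
    n = n_rows * n_cols
    W = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            # Rook neighbours share an edge: up, down, left, right.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    W[i, rr * n_cols + cc] = 1.0
    # Each row is scaled to sum to 1 (grids larger than 1 x 1 only).
    return W / W.sum(axis=1, keepdims=True)

def morans_i(values, W):
    """Moran's I: (n / S0) * (z' W z) / (z' z), with z the centred values."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()
    return (len(z) / W.sum()) * (z @ W @ z) / (z @ z)

W = rook_weights(4, 4)
# Assumed toy data: a checkerboard (neighbours always differ) and a
# left-to-right gradient (neighbours tend to be similar).
checkerboard = np.indices((4, 4)).sum(axis=0).ravel() % 2
gradient = np.tile(np.arange(4.0), 4)
i_checker = morans_i(checkerboard, W)
i_gradient = morans_i(gradient, W)
```

Values near -1 indicate that neighbouring areas tend to differ, and positive values indicate that similar values cluster together, which is the pattern spatial regression models are designed to account for.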

The tutorials below supplement “Spatial Regression Analysis of Poverty in R” by Kamenetsky, Chi, Wang, and Zhu (Spatial Demography, 2019). The SpatialRegPovertyR repository for these tutorials can be found here.

Using tidycensus

Weighting and transformations

Using tmap

Predictive Enforcement of Pollution and Hazardous Waste Violations in New York State - Data Science for Social Good

Eric Potash, Jimmy Jin, Maria Kamenetsky, Dean Magee, Paul van der Boor, Rayid Ghani (2016)

The improper treatment and disposal of hazardous waste can have disastrous effects on the environment and human health. The Resource Conservation and Recovery Act (RCRA) governs hazardous waste management in the United States. To enforce its regulations, the New York State Department of Environmental Conservation (NYSDEC) inspects facilities that handle hazardous materials. However, due to resource constraints, not all facilities can be inspected each year. We worked with NYSDEC to build predictive models that use reporting, monitoring, and enforcement data to prioritize inspection resources.

Details on the project and conference video can be found here.

Predictive Modeling for Environmental Protection: Hazardous Waste Management can be found here.