Welcome to the Waldron lab for public health data science at the CUNY School of Public Health in New York City. I teach biostatistics, run an active research program in cancer genomics and in metagenomic profiling of the human microbiome, and develop methods at the intersection of statistical analysis and computation. My lab aims to generate new insights into human health, disease, and treatment through improved tools and novel analysis of publicly available data.


I believe that health disparities of race, ethnicity, class, and geography can and must be eliminated through social, political, and scientific change. Professionally, I am deeply committed to the Bioconductor project for open-source bioinformatics software, both by contributing individual software packages and by supporting the project as a whole.

Cancer Genomics

I have a long-standing interest in developing methods for cancer genomics data and in using those data to test hypotheses. These efforts have led to a greater understanding of the role of gene expression in defining disease subtypes and patient outcomes in high-grade serous ovarian carcinoma, colorectal cancer, and other cancers. They have also produced software and databases for the analysis of multi-omics data, notably MultiAssayExperiment and curatedTCGAData.

Human Microbiome Studies

Metagenomic sequencing has made it possible to probe the microbial communities that colonize the human body with previously unimaginable depth and resolution. I am fascinated by the roles the microbiome may play as an interface between individuals and their environment, and by the corresponding implications for health. This area of study is made even more enticing by the vast amounts of data becoming publicly available, which can be combined and analyzed in new ways. Contributions in this area include the databases curatedMetagenomicData and HMP16SData.

Validation in Machine Learning

Methods of machine learning, or statistical learning, make it possible to learn prediction models from high-dimensional data such as genomic data. However, predictions for new cases are persistently worse than those for the training data, even when cross-validation is used to control for the effects of over-fitting. I have been involved in a series of related methodological projects to quantify and mitigate the lack of generalizability of prediction models trained on genomic data.