Óscar A. Chávez Ortiz - Data Science Projects

Developing Python Package: Lyra

This project investigates whether a galaxy's Lyman-alpha equivalent width (EW) can be usefully inferred from the galaxy's other properties. I developed a new Python package built on sbi, a simulation-based inference framework that uses normalizing flows, to map galaxy properties to emerged Lyman-alpha properties. The package is fully pip-installable and is mainly used for inference.

Training was done on 11,000,000 galaxies using the full posteriors of 15,000 galaxies to perform the mapping of galaxy properties to emerged Lyman-alpha strengths. I validated the inference tool and found that the trained posterior is well calibrated and neither underpredicts nor overpredicts the emerged Lyman-alpha EW.
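One common way to check calibration of this kind is a rank (coverage) test: for many validation objects, compare the true value against samples drawn from the posterior. The sketch below is only an illustration of that idea. It uses a toy Gaussian problem with a known analytic posterior as a stand-in; it does not use Lyra or sbi, and the function name and all numbers are assumptions.

```python
import random

def rank_calibration(n_objects=2000, n_samples=200, seed=0):
    """Toy rank test: prior mu ~ N(0, 1), observation x = mu + N(0, 1) noise.

    The exact posterior is N(x / 2, variance 1/2), used here as a stand-in
    for a trained posterior. Returns (coverage_68, mean_rank).
    """
    rng = random.Random(seed)
    post_sd = 0.5 ** 0.5
    ranks = []
    for _ in range(n_objects):
        mu_true = rng.gauss(0.0, 1.0)
        x_obs = mu_true + rng.gauss(0.0, 1.0)
        samples = [rng.gauss(x_obs / 2.0, post_sd) for _ in range(n_samples)]
        # Fraction of posterior samples that fall below the true value.
        ranks.append(sum(s < mu_true for s in samples) / n_samples)
    # A calibrated posterior puts the truth inside the central 68% interval
    # about 68% of the time, and the mean rank sits near 0.5 (no bias).
    coverage_68 = sum(0.16 <= r <= 0.84 for r in ranks) / n_objects
    mean_rank = sum(ranks) / n_objects
    return coverage_68, mean_rank

coverage_68, mean_rank = rank_calibration()
print(f"68% coverage: {coverage_68:.3f}, mean rank: {mean_rank:.3f}")
```

A mean rank well above or below 0.5 would point to systematic underprediction or overprediction, while coverage far from 0.68 would indicate posteriors that are too narrow or too wide.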

Airline Delay: Automated SQL Ingestion and Querying

This project builds a scalable data ingestion and analysis pipeline for U.S. airline flight delay data, focusing on reliable ETL workflows, data validation, and reproducible analytics. It automates the loading, cleaning, and storage of large monthly flight records into a structured database, enabling efficient querying and exploratory analysis of delay patterns across airlines, airports, and time periods. The system is designed with extensibility in mind, providing a clean foundation for future integration of machine learning models to predict delays and evaluate operational risk at scale.
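The load, validate, store, and query steps can be sketched with Python's standard library. This is a minimal illustration, not the project's actual pipeline: the column names, the sample rows, the `load_flights` helper, and the use of an in-memory SQLite database are all assumptions.

```python
import csv
import io
import sqlite3

# Hypothetical sample of monthly flight records; column names are assumptions.
RAW_CSV = """flight_date,airline,origin,dep_delay_min
2024-01-03,AA,DFW,12
2024-01-03,AA,ORD,
2024-01-04,DL,ATL,-3
2024-01-04,DL,ATL,45
2024-01-05,UA,ORD,not_a_number
"""

def load_flights(conn, csv_text):
    """Validate rows and insert the clean ones; return (loaded, rejected)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS flights ("
        "flight_date TEXT, airline TEXT, origin TEXT, dep_delay_min REAL)"
    )
    loaded, rejected = 0, 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            delay = float(row["dep_delay_min"])  # rejects blank or non-numeric
        except (TypeError, ValueError):
            rejected += 1
            continue
        conn.execute(
            "INSERT INTO flights VALUES (?, ?, ?, ?)",
            (row["flight_date"], row["airline"], row["origin"], delay),
        )
        loaded += 1
    conn.commit()
    return loaded, rejected

conn = sqlite3.connect(":memory:")
loaded, rejected = load_flights(conn, RAW_CSV)

# Exploratory query: average departure delay per airline.
avg_delay = dict(
    conn.execute("SELECT airline, AVG(dep_delay_min) FROM flights GROUP BY airline")
)
print(loaded, rejected, avg_delay)
```

Counting rejected rows rather than silently dropping them keeps the validation step auditable from month to month.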

Auditory Pain Analysis (Medical Application)

This project investigates whether vocal characteristics can be used to classify physical pain from audio recordings. Initial experiments with a custom neural network showed moderate performance but were limited by the small size of the training dataset. To address this, we leveraged pretrained audio representations using the CLAP model and evaluated multiple machine learning approaches built on top of these embeddings.

We found that models trained with randomly sampled train–test splits achieved higher accuracy, while performance dropped significantly when splitting data by speaker identity. This indicates that the models rely heavily on speaker-specific vocal features and struggle to generalize to unseen voices without an established baseline. These findings highlight the importance of speaker normalization and motivate future work focused on learning speaker-invariant representations, with potential applications in remote and telephone-based medical triage systems.
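The difference between the two evaluation schemes comes down to how the data are partitioned. A speaker-level split, sketched below, holds out entire speakers so that no voice appears in both sets. The record layout, field names, and the `split_by_speaker` helper are hypothetical and are not taken from the project code.

```python
import random

def split_by_speaker(samples, test_fraction=0.25, seed=0):
    """Hold out whole speakers so no voice appears in both train and test."""
    speakers = sorted({s["speaker"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    held_out = set(speakers[:n_test])
    train = [s for s in samples if s["speaker"] not in held_out]
    test = [s for s in samples if s["speaker"] in held_out]
    return train, test

# Hypothetical records: 8 speakers with 5 clips each.
samples = [
    {"speaker": f"spk{i}", "clip": j, "pain": (i + j) % 2}
    for i in range(8)
    for j in range(5)
]
train, test = split_by_speaker(samples)
train_speakers = {s["speaker"] for s in train}
test_speakers = {s["speaker"] for s in test}
print(len(train), len(test), train_speakers & test_speakers)
```

A random split over clips would instead scatter each speaker's recordings across both sets, which lets a model score well by recognizing voices it has already heard.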

ML Comparison: Predicting Lyman-α Emission with Machine Learning

The Epoch of Reionization (EoR) marks the final major phase transition of the universe, during which neutral hydrogen became fully ionized. Understanding the timing and progression of this era remains challenging due to limited observational tracers and the strong absorption of key emission lines by neutral hydrogen at early times. Lyman-α emission is a powerful probe of this process, but its observability depends sensitively on both galaxy properties and surrounding hydrogen.

In this project, I led a team of four to analyze over 11,000 galaxies at lower redshift, where the universe is fully ionized, to model the intrinsic Lyman-α emission using different machine learning models. Physical galaxy properties were inferred using Bayesian modeling, and multiple ML approaches were evaluated to predict the rest-frame Lyman-α equivalent width.

We compare neural networks, Random Forests, Decision Trees, and XGBoost models, finding that XGBoost achieves the best overall performance. Using SHAP interpretability, we identify star formation rate and dust attenuation as the dominant drivers of Lyman-α emission. These results provide a physically interpretable framework for predicting intrinsic Lyman-α emission, enabling future constraints on the neutral hydrogen fraction during the Epoch of Reionization.

Determining Likelihood of Vaccination

Flu Shot Learning is a multilabel classification project focused on predicting whether individuals received H1N1 and seasonal flu vaccines using demographic, behavioral, and health-related survey data from the 2009 U.S. National H1N1 Flu Survey. I developed an XGBoost-based modeling pipeline trained on processed tabular features and deployed it using Amazon SageMaker, leveraging S3 for data storage and managed cloud infrastructure for scalable training. Model performance was evaluated using random train–test splits across multiple preprocessing configurations, achieving strong predictive performance with AUC-ROC scores of 0.82 for H1N1 vaccination and 0.84 for seasonal flu vaccination. The results demonstrate that survey-based features capture meaningful signals related to vaccination behavior and highlight the feasibility of deploying tree-based models for public health prediction tasks in a production-ready cloud environment.
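For readers unfamiliar with the metric, AUC-ROC can be read as the probability that a randomly chosen vaccinated respondent receives a higher predicted score than a randomly chosen unvaccinated one. The sketch below computes it directly from that definition; in a two-target setup like this one it would be applied to each target separately. The labels and scores are made up for illustration and are not project data.

```python
def roc_auc(labels, scores):
    """AUC as P(random positive outscores random negative); ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = vaccinated) and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

The pairwise loop is quadratic in the number of respondents, so it is meant for understanding the metric rather than for scoring a full survey dataset.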

Hypothesis Testing or A/B Testing

This project uses statistical hypothesis testing to evaluate whether a new golf ball performs better than a previously used model. We collected data from multiple rounds and applied a variety of t-tests, including paired, independent, and one-sample tests, to rigorously compare performance. All tests indicated no significant improvement with the new golf ball; under the conditions tested, the new ball performs similarly to the old one. The project demonstrates a structured approach to experimental design, data validation (e.g., checking for normality), and reproducible statistical analysis in Python, with logging of each analysis step.
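As an illustration of one of these tests, the paired t-statistic can be computed from per-round differences with the standard library alone. The measurements below are hypothetical and are not the project's data; the 2.776 threshold is the standard two-sided 5% critical value for 4 degrees of freedom.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for a paired t-test on matched samples."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1

# Hypothetical matched measurements from five rounds (not project data).
old_ball = [250, 245, 260, 255, 248]
new_ball = [252, 244, 263, 254, 251]

t_stat, dof = paired_t_statistic(old_ball, new_ball)
T_CRIT_5PCT_DOF4 = 2.776  # two-sided 5% critical value for 4 degrees of freedom
significant = abs(t_stat) > T_CRIT_5PCT_DOF4
print(f"t = {t_stat:.3f}, dof = {dof}, significant = {significant}")
```

With these made-up numbers the statistic stays below the critical value, so the null hypothesis of no difference would not be rejected.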