Data Science & ML Projects
A showcase of personal data science and machine learning projects I have explored outside my research. If you are interested in my tech resume, look here.
Auditory Pain Analysis (Medical Application)
This project investigates whether vocal characteristics can be used to classify physical pain from audio recordings. Initial experiments with a custom neural network showed moderate performance but were limited by the small size of the training dataset. To address this, we leveraged pretrained audio representations from the CLAP model and evaluated multiple machine learning approaches built on top of these embeddings. Models evaluated on randomly sampled train–test splits achieved higher accuracy than models evaluated on splits by speaker identity, where performance dropped significantly. This indicates that the models rely heavily on speaker-specific vocal features and struggle to generalize to unseen voices without an established baseline for that speaker. These findings highlight the importance of speaker normalization and motivate future work on learning speaker-invariant representations, with potential applications in remote and telephone-based medical triage systems.
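The difference between the two evaluation setups can be illustrated with a minimal sketch. It is not the project's actual pipeline: the sample records, the `speaker` key, and the helper names are hypothetical, and the toy data stands in for real recordings. It only shows how a random split lets the same voice appear in both train and test sets, while a speaker-wise split holds out whole speakers.

```python
import random
from collections import defaultdict


def random_split(samples, test_fraction=0.25, seed=0):
    """Shuffle individual recordings, ignoring who is speaking."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def speaker_split(samples, test_fraction=0.25, seed=0):
    """Hold out whole speakers so no voice appears in both sets."""
    by_speaker = defaultdict(list)
    for sample in samples:
        by_speaker[sample["speaker"]].append(sample)
    speakers = sorted(by_speaker)
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [s for s in samples if s["speaker"] not in test_speakers]
    test = [s for s in samples if s["speaker"] in test_speakers]
    return train, test


def shared_speakers(train, test):
    """Speakers whose recordings occur in both the train and test sets."""
    return {s["speaker"] for s in train} & {s["speaker"] for s in test}


# Hypothetical toy data: 8 speakers with 5 clips each (not the real dataset).
samples = [
    {"speaker": f"spk{i}", "clip": j, "label": j % 2}
    for i in range(8)
    for j in range(5)
]

rand_train, rand_test = random_split(samples)
spk_train, spk_test = speaker_split(samples)

print("speakers shared, random split: ", len(shared_speakers(rand_train, rand_test)))
print("speakers shared, speaker split:", len(shared_speakers(spk_train, spk_test)))
```

In practice, scikit-learn's `GroupShuffleSplit` or `GroupKFold` provide the same kind of grouped split when the speaker ID is passed as the `groups` argument.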
ML Comparison: Predicting Lyman-α Emission with Machine Learning
The Epoch of Reionization (EoR) marks the final major phase transition of the universe, during which neutral hydrogen became fully ionized. Understanding the timing and progression of this era remains challenging due to limited observational tracers and the strong absorption of key emission lines by neutral hydrogen at early times. Lyman-α emission is a powerful probe of this process, but its observability depends sensitively on both galaxy properties and the surrounding hydrogen. In this project, I led a team of four analyzing over 11,000 galaxies at lower redshift, where the universe is fully ionized, to model the intrinsic Lyman-α emission using different machine learning models. Physical galaxy properties were inferred using Bayesian modeling, and multiple ML approaches were evaluated to predict the rest-frame Lyman-α equivalent width. We compared neural networks, Random Forests, Decision Trees, and XGBoost models, finding that XGBoost achieves the best overall performance. Using SHAP interpretability, we identify star formation rate and dust attenuation as the dominant drivers of Lyman-α emission. These results provide a physically interpretable framework for predicting intrinsic Lyman-α emission, enabling future constraints on the neutral hydrogen fraction during the Epoch of Reionization.
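A model comparison of this kind can be sketched as follows. This is an illustration under several assumptions, not the project's code: the data is synthetic (`make_regression`), scikit-learn's `GradientBoostingRegressor` stands in for XGBoost so the example needs only one library, the neural network is left out for brevity, and cross-validated R² is an assumed metric that the write-up does not specify.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for galaxy properties (X) and equivalent width (y).
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)

# Candidate models; gradient boosting is a stand-in for XGBoost here.
models = {
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Use the same folds for every model so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in models.items()
}

# Print the models from best to worst mean R^2.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>20s}  mean R^2 = {score:.3f}")
```

Scoring every model on identical folds keeps differences in score attributable to the models rather than to the particular split.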
Hypothesis Testing (A/B Testing)
This project uses statistical hypothesis testing to evaluate whether a new golf ball performs better than a previously used model. We collected data from multiple rounds and applied several t-tests (paired, independent, and one-sample) to rigorously compare performance. All tests indicated no significant improvement: under the conditions tested, the new ball performs similarly to the old one. The project demonstrates a structured approach to experimental design, data validation (e.g., checking for normality), and reproducible statistical analysis using Python and logging.
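As an illustration of the paired test, here is a minimal sketch using only the standard library. The scores below are made-up numbers, not the project's measurements, and `paired_t_statistic` is a hypothetical helper name. It computes the t statistic from the per-round differences and compares it with the two-sided 5% critical value for 5 degrees of freedom (2.571).

```python
import logging
import math
import statistics

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("golf_ball_test")


def paired_t_statistic(old_scores, new_scores):
    """Return (t, degrees of freedom) for a paired t-test on new - old."""
    if len(old_scores) != len(new_scores):
        raise ValueError("paired samples must have the same length")
    diffs = [new - old for old, new in zip(old_scores, new_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance")
    t = statistics.mean(diffs) / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical measurements for six paired rounds (illustrative only).
old_ball = [250, 245, 260, 255, 248, 252]
new_ball = [252, 244, 261, 254, 250, 251]

t_stat, dof = paired_t_statistic(old_ball, new_ball)
T_CRIT_5PCT_DF5 = 2.571  # two-sided critical value, alpha = 0.05, df = 5
significant = abs(t_stat) > T_CRIT_5PCT_DF5

log.info("t = %.3f, df = %d, significant at 5%%: %s", t_stat, dof, significant)
```

For real analyses, `scipy.stats.ttest_rel` returns the same statistic together with a p-value, which avoids looking up critical values by hand.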
Medical Imaging Classifier
I am currently building a CNN-based classifier for medical images to detect bone fractures and ear canal injuries. The goal is to automate injury detection and assess severity levels to support faster diagnosis.