Predicting Lyman-α Emission with Machine Learning

Scientific Background

The Epoch of Reionization (EoR) is the last major phase transition of the universe, during which the neutral hydrogen in the intergalactic medium became ionized. Although quasars provide evidence of this transition, constructing a complete timeline remains difficult because key emission lines are absorbed at early times.

Lyman-α emission is a sensitive tracer of neutral hydrogen and is produced abundantly in star-forming galaxies. However, its observability is strongly suppressed by scattering in neutral gas, complicating its use as a probe of reionization.

Project Goal

Can machine learning accurately predict intrinsic Lyman-α emission from a galaxy using its physical properties?

By studying galaxies at lower redshift, where the intergalactic medium is fully ionized, we aim to model the intrinsic Lyman-α equivalent width and identify the galaxy properties that drive its observability.

Data

Our dataset consists of over 11,000 galaxies with physical properties inferred using Bayesian modeling. After applying quality cuts on photometric fits, signal-to-noise, and equivalent width, a clean subset was used for machine learning analysis.
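The quality cuts described above can be illustrated with a minimal sketch. The field names (fit_chi2_reduced, lya_snr, ew_rest) and every threshold below are hypothetical placeholders, since the actual cut values are not stated here; the point is only to show the three cuts applied together.

```python
from dataclasses import dataclass


@dataclass
class Galaxy:
    galaxy_id: int
    fit_chi2_reduced: float  # goodness of the photometric fit (hypothetical field)
    lya_snr: float           # signal-to-noise of the Lyman-alpha line (hypothetical field)
    ew_rest: float           # rest-frame equivalent width in Angstrom


# Hypothetical thresholds, chosen only for illustration.
MAX_CHI2_REDUCED = 5.0
MIN_SNR = 3.0
EW_RANGE = (0.0, 500.0)


def passes_cuts(g: Galaxy) -> bool:
    """Return True if the galaxy survives all three quality cuts."""
    return (
        g.fit_chi2_reduced <= MAX_CHI2_REDUCED
        and g.lya_snr >= MIN_SNR
        and EW_RANGE[0] <= g.ew_rest <= EW_RANGE[1]
    )


# Made-up catalog rows, not real measurements.
catalog = [
    Galaxy(1, 1.2, 8.0, 45.0),
    Galaxy(2, 9.5, 6.0, 30.0),   # rejected: poor photometric fit
    Galaxy(3, 0.9, 1.5, 20.0),   # rejected: low signal-to-noise
    Galaxy(4, 1.1, 5.0, 900.0),  # rejected: equivalent width out of range
    Galaxy(5, 2.0, 4.2, 120.0),
]

clean = [g for g in catalog if passes_cuts(g)]
print([g.galaxy_id for g in clean])
```

Keeping the thresholds as named constants makes it easy to report how many galaxies each cut removes.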

Machine Learning Models

We evaluated several non-linear regression models to predict the rest-frame Lyman-α equivalent width, including Decision Trees, Random Forests, XGBoost, and neural networks.
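A model comparison of this kind could be sketched as follows, using scikit-learn with cross-validated mean absolute error. This is an illustration under stated assumptions, not the pipeline used in the project: the data are synthetic, the hyperparameters are arbitrary, and XGBoost is swapped for scikit-learn's GradientBoostingRegressor so that the sketch depends on a single library.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data: 500 "galaxies" with 4 placeholder properties.
X = rng.normal(size=(500, 4))
# Non-linear toy target in Angstrom-like units; not a physical model.
y = 50.0 + 30.0 * np.tanh(X[:, 0]) - 20.0 * X[:, 1] ** 2 + rng.normal(scale=10.0, size=500)

models = {
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # Stand-in for XGBoost: another gradient-boosted tree ensemble.
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
    # Neural networks are sensitive to feature scale, so standardize first.
    "neural_network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0),
    ),
}


def compare_models(models, X, y, cv=5):
    """Return the cross-validated mean absolute error for each model."""
    results = {}
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
        results[name] = -scores.mean()  # scikit-learn reports negated errors
    return results


mae_by_model = compare_models(models, X, y)
for name, mae in sorted(mae_by_model.items(), key=lambda kv: kv[1]):
    print(f"{name:>18s}: MAE = {mae:.1f}")
```

Scoring every model with the same folds and the same metric keeps the comparison fair; the ranking on this toy data says nothing about the ranking on the real catalog.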

Results

XGBoost achieved the best predictive performance, outperforming the neural networks and the other tree-based models (Decision Trees and Random Forests), with prediction errors of approximately 60–80 Å.

Neural networks performed competitively but struggled at high equivalent widths due to data sparsity. Decision Trees showed limited predictive power, while Random Forests achieved moderate success.
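One way to make the effect of data sparsity at high equivalent widths visible is to report the error per equivalent-width bin alongside the number of galaxies in each bin. The sketch below assumes mean absolute error as the metric, which is not specified above, and uses made-up values and arbitrary bin edges.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def binned_mae(y_true, y_pred, edges):
    """Return {(lo, hi): (count, MAE or None)} for half-open bins [lo, hi)."""
    out = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        pairs = [(t, p) for t, p in zip(y_true, y_pred) if lo <= t < hi]
        if pairs:
            ts, ps = zip(*pairs)
            out[(lo, hi)] = (len(pairs), mae(ts, ps))
        else:
            out[(lo, hi)] = (0, None)  # empty bin: no error can be computed
    return out


# Made-up equivalent widths in Angstrom, for illustration only.
ew_true = [10.0, 25.0, 40.0, 60.0, 90.0, 150.0, 300.0]
ew_pred = [15.0, 20.0, 50.0, 55.0, 70.0, 110.0, 180.0]

bins = binned_mae(ew_true, ew_pred, edges=[0, 50, 100, 400])
print(f"overall MAE = {mae(ew_true, ew_pred):.1f}, RMSE = {rmse(ew_true, ew_pred):.1f}")
for (lo, hi), (count, err) in bins.items():
    print(f"EW in [{lo}, {hi}): n = {count}, MAE = {err}")
```

A single overall error figure can hide the fact that the high equivalent-width bins contain few objects and carry most of the error, so the per-bin counts are worth reporting with the metric.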

Feature Importance & Interpretability

Using SHAP analysis on the XGBoost model, we identified the most influential physical drivers of Lyman-α emission:

These results provide physically interpretable insights into the processes that regulate Lyman-α escape from galaxies.
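For readers unfamiliar with SHAP, the sketch below shows the Shapley attribution that SHAP values are built on, computed by brute force for a tiny toy model, followed by the usual ranking by mean absolute attribution. It is not the SHAP library and not the algorithm the library uses for tree models; the toy model, the single all-zero baseline, and the feature names prop_a, prop_b, prop_c are placeholders that do not correspond to the drivers found in this project.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; 'absent' features take their baseline value."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # k = size of the coalition that feature i joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for combo in combinations(others, k):
                s = set(combo)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Placeholder model with one interaction term; not a physical relation.
    return 40.0 + 20.0 * z[0] - 10.0 * z[1] + 5.0 * z[0] * z[2]


feature_names = ["prop_a", "prop_b", "prop_c"]  # hypothetical names
baseline = [0.0, 0.0, 0.0]
samples = [[1.0, 2.0, 0.5], [0.5, -1.0, 2.0], [-1.0, 0.0, 1.0]]

all_phi = [shapley_values(toy_model, x, baseline) for x in samples]

# Global importance: mean absolute attribution per feature across samples.
mean_abs = [
    sum(abs(phi[i]) for phi in all_phi) / len(all_phi)
    for i in range(len(feature_names))
]
ranking = sorted(zip(feature_names, mean_abs), key=lambda kv: kv[1], reverse=True)
for name, score in ranking:
    print(f"{name}: mean |attribution| = {score:.2f}")
```

The attributions for each sample sum to the difference between the model output at that sample and at the baseline, which is the property that makes them readable as per-feature contributions to a prediction.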

Scientific Impact

This framework enables predictions of intrinsic Lyman-α emission, which can be applied at higher redshifts to constrain the neutral hydrogen fraction during the Epoch of Reionization. By combining machine learning with physical interpretability, this work bridges data-driven modeling and astrophysical insight.
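One common way to use an intrinsic prediction at high redshift, assumed here rather than taken from the text above, is to compare it with the observed equivalent width and interpret the ratio as a Lyman-α transmission fraction. The function name and the numbers below are hypothetical, and converting a transmission fraction into a neutral hydrogen fraction requires a reionization model that this sketch does not attempt.

```python
def lya_transmission(ew_observed, ew_intrinsic_pred):
    """Ratio of observed to predicted intrinsic equivalent width."""
    if ew_intrinsic_pred <= 0:
        raise ValueError("predicted intrinsic equivalent width must be positive")
    return ew_observed / ew_intrinsic_pred


# Made-up (observed, predicted intrinsic) equivalent widths in Angstrom.
high_z_sample = [(12.0, 60.0), (30.0, 75.0), (5.0, 50.0)]

transmissions = [lya_transmission(obs, pred) for obs, pred in high_z_sample]
mean_transmission = sum(transmissions) / len(transmissions)
print(f"mean transmission = {mean_transmission:.2f}")
```

In practice the uncertainty on the predicted intrinsic value would need to be propagated into the ratio before drawing any conclusion from it.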

Future Work

Full Project Report

Below is the full technical report for this project, including methodology, model comparisons, and visualizations.