Can you predict whether people got H1N1 and seasonal flu vaccines using information they shared about their backgrounds, opinions, and health behaviors? You can check out the repo here.
In this challenge, we will take a look at vaccination, a key public health measure used to fight infectious diseases. Vaccines provide immunization for individuals, and enough immunization in a community can further reduce the spread of diseases through "herd immunity."

As of the launch of this competition, vaccines for the COVID-19 virus were still under development and not yet available. The competition instead revisits the public health response to a different recent major respiratory disease pandemic. Beginning in spring 2009, a pandemic caused by the H1N1 influenza virus, colloquially named "swine flu," swept across the world. Researchers estimate that in the first year it was responsible for between 151,000 and 575,000 deaths globally. A vaccine for the H1N1 flu virus became publicly available in October 2009.

In late 2009 and early 2010, the United States conducted the National 2009 H1N1 Flu Survey. This phone survey asked respondents whether they had received the H1N1 and seasonal flu vaccines, in conjunction with questions about themselves: their social, economic, and demographic background; their opinions on the risks of illness and vaccine effectiveness; and their behaviors toward mitigating transmission. A better understanding of how these characteristics are associated with personal vaccination patterns can provide guidance for future public health efforts. See the data and challenge here: H1N1 Data
The primary goal of this project was to determine whether we can accurately predict the likelihood that someone received the H1N1 vaccine and the seasonal flu vaccine. This is a multilabel problem: rather than assigning each respondent to a single class, we predict a probability for each of the two vaccine labels independently.
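The multilabel setup can be sketched as training one probabilistic classifier per label. The snippet below is illustrative only: it uses synthetic data and a simple logistic regression stand-in, not the project's actual features or model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for the processed survey features (the real survey has many more columns).
X = rng.normal(size=(500, 5))
# Two binary targets, mirroring the competition's h1n1_vaccine and seasonal_vaccine labels.
y_h1n1 = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
y_seasonal = (X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# One classifier per label: we want a probability for each label,
# not a single multiclass decision over four combinations.
clf_h1n1 = LogisticRegression().fit(X, y_h1n1)
clf_seasonal = LogisticRegression().fit(X, y_seasonal)

# Each respondent gets two independent probabilities.
proba_h1n1 = clf_h1n1.predict_proba(X)[:, 1]
proba_seasonal = clf_seasonal.predict_proba(X)[:, 1]
print(proba_h1n1.shape, proba_seasonal.shape)
```

The same structure carries over to any per-label model, including gradient-boosted trees.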
We began by developing a custom XGBoost model trained directly on processed tabular features extracted from the dataset. This approach allowed for rapid experimentation and baseline performance evaluation.
To scale training and leverage cloud resources, we deployed the model using Amazon SageMaker. The dataset was uploaded to S3, and a SageMaker training job was launched using a custom training script, enabling distributed training and automated management of compute resources.
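A SageMaker launch along these lines is sketched below as configuration rather than runnable code: it assumes AWS credentials are set up, and the role ARN, bucket path, script name, and hyperparameters are all placeholders, not the project's actual values.

```python
# Sketch only: requires configured AWS credentials, a real IAM role,
# and training data already uploaded to S3. All names are placeholders.
import sagemaker
from sagemaker.xgboost import XGBoost

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder
train_s3 = "s3://example-bucket/flu-survey/train.csv"           # placeholder

estimator = XGBoost(
    entry_point="train.py",        # the custom training script
    framework_version="1.5-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,              # raise for distributed training
    role=role,
    sagemaker_session=session,
    hyperparameters={"max_depth": 4, "eta": 0.1, "num_round": 200},
)

# Launches a managed training job; SageMaker provisions the compute,
# runs the script against the S3 data, and tears resources down afterward.
estimator.fit({"train": train_s3})
```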
Model performance was evaluated under multiple preprocessing and cross-validation strategies.
The XGBoost models trained via SageMaker achieved strong predictive performance, with AUC-ROC scores of 0.82 for H1N1 and 0.84 for seasonal flu on hold-out test sets, using different preprocessing and feature configurations. This indicates that the selected tabular features contain meaningful signal for the target outcomes.
These results demonstrate the feasibility of using XGBoost models for predictive tasks in a cloud environment, such as large-scale risk scoring, classification, or anomaly detection. Deploying models via SageMaker allows for scalable training and inference, enabling integration into production pipelines or decision-support systems.
Future extensions could further improve model robustness and applicability.
This project is designed to be extensible. Future contributors could build on this work by integrating new datasets, testing alternative algorithms, automating model retraining on SageMaker, or combining model outputs with other analytics systems in production environments.