Can you predict whether people got H1N1 and seasonal flu vaccines using information they shared about their backgrounds, opinions, and health behaviors? You can check out the repo here.
In this challenge, we will take a look at vaccination, a key public health measure used to fight infectious diseases. Vaccines provide immunization for individuals, and enough immunization in a community can further reduce the spread of diseases through "herd immunity."

As of the launch of this competition, vaccines for the COVID-19 virus were still under development and not yet available. The competition instead revisits the public health response to a different recent major respiratory disease pandemic. Beginning in spring 2009, a pandemic caused by the H1N1 influenza virus, colloquially named "swine flu," swept across the world. Researchers estimate that in the first year it was responsible for between 151,000 and 575,000 deaths globally. A vaccine for the H1N1 flu virus became publicly available in October 2009.

In late 2009 and early 2010, the United States conducted the National 2009 H1N1 Flu Survey. This phone survey asked respondents whether they had received the H1N1 and seasonal flu vaccines, in conjunction with questions about themselves: their social, economic, and demographic background; their opinions on the risks of illness and vaccine effectiveness; and their behaviors toward mitigating transmission. A better understanding of how these characteristics are associated with personal vaccination patterns can provide guidance for future public health efforts. See the data and challenge here: H1N1 Data
The primary goal of this project was to determine whether we can accurately predict the likelihood that someone received the H1N1 vaccine and the seasonal flu vaccine. This is a multilabel problem: rather than assigning each respondent to a single class, we predict a probability for each of the two vaccine labels independently.
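The multilabel setup can be sketched as training one probabilistic classifier per label. The snippet below is illustrative only: it uses synthetic data and a simple logistic regression stand-in, not the project's actual features or model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for the processed survey features (the real survey has many more columns).
X = rng.normal(size=(500, 5))
# Two binary targets, mirroring the competition's h1n1_vaccine and seasonal_vaccine labels.
y_h1n1 = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
y_seasonal = (X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# One classifier per label: we want a probability for each label,
# not a single multiclass decision over four combinations.
clf_h1n1 = LogisticRegression().fit(X, y_h1n1)
clf_seasonal = LogisticRegression().fit(X, y_seasonal)

# Each respondent gets two independent probabilities.
proba_h1n1 = clf_h1n1.predict_proba(X)[:, 1]
proba_seasonal = clf_seasonal.predict_proba(X)[:, 1]
print(proba_h1n1.shape, proba_seasonal.shape)
```

The same structure carries over to any per-label model, including gradient-boosted trees.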
We began by developing a custom XGBoost model trained directly on processed tabular features extracted from the dataset. This approach allowed for rapid experimentation and baseline performance evaluation.
To scale training and leverage cloud resources, we deployed the model using Amazon SageMaker. The dataset was uploaded to S3, and a SageMaker training job was launched using a custom training script, enabling distributed training and automated management of compute resources.
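A SageMaker launch along these lines is sketched below as configuration rather than runnable code: it assumes AWS credentials are set up, and the role ARN, bucket path, script name, and hyperparameters are all placeholders, not the project's actual values.

```python
# Sketch only: requires configured AWS credentials, a real IAM role,
# and training data already uploaded to S3. All names are placeholders.
import sagemaker
from sagemaker.xgboost import XGBoost

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder
train_s3 = "s3://example-bucket/flu-survey/train.csv"           # placeholder

estimator = XGBoost(
    entry_point="train.py",        # the custom training script
    framework_version="1.5-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,              # raise for distributed training
    role=role,
    sagemaker_session=session,
    hyperparameters={"max_depth": 4, "eta": 0.1, "num_round": 200},
)

# Launches a managed training job; SageMaker provisions the compute,
# runs the script against the S3 data, and tears resources down afterward.
estimator.fit({"train": train_s3})
```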
Model performance was evaluated under multiple preprocessing and cross-validation strategies.
The XGBoost models trained via SageMaker achieved strong predictive performance, with AUC-ROC scores of 0.82 for H1N1 and 0.84 for seasonal flu on hold-out test sets, using different preprocessing and feature configurations. This indicates that the selected tabular features contain meaningful signal for the target outcomes.
These results demonstrate the feasibility of using XGBoost models for predictive tasks in a cloud environment, such as large-scale risk scoring, classification, or anomaly detection. Deploying models via SageMaker allows for scalable training and inference, enabling integration into production pipelines or decision-support systems.
Future extensions could further improve model robustness and applicability.
This project is designed to be extensible. Future contributors could build on this work by integrating new datasets, testing alternative algorithms, automating model retraining on SageMaker, or combining model outputs with other analytics systems in production environments.