Population Size and Cigarette Consumption: A Machine Learning Perspective

Adebayo OP, Ahmed I, Ogunjimi OA and Oyeleke KT

Published on: 2025-06-20

Abstract


This study develops a predictive machine learning model to analyze cigarette pack sales using key economic and demographic variables. The gradient boosting model achieved strong predictive performance with a test RMSE of 0.530 and explained 70.6% of sales variance (R² = 0.706), demonstrating reliable forecasting capability. Feature importance analysis revealed price as the dominant predictor, contributing 55.7% of the model's explanatory power, followed by tax (12.7%) and population (11.5%). The implementation of early stopping at the 10th iteration prevented overfitting while maintaining model generalizability. These findings provide actionable insights for public health policy and retail strategy, quantifying how pricing and taxation influence consumption patterns. The analysis also identified negligible contributions from certain variables (e.g., CPI), suggesting opportunities for model simplification. By combining robust predictive accuracy with interpretable feature importance metrics, this research offers a data-driven framework for understanding cigarette market dynamics and supporting evidence-based decision-making. The results highlight the critical role of price sensitivity in tobacco consumption while establishing methodological foundations for future sales forecasting models.

Keywords

Machine learning model; XGBoost regression; LightGBM; Gain; Cover; Frequency metrics

Introduction

The relationship between population size and cigarette consumption represents a critical area of study in public health research, with significant implications for tobacco control policies and population health outcomes. As global populations continue to grow and urbanize, understanding the dynamics of cigarette consumption patterns becomes increasingly important for effective public health interventions [1].

Research has consistently demonstrated that population characteristics significantly influence tobacco use patterns. Larger populations tend to exhibit higher absolute cigarette consumption due to simple scaling effects, but more importantly, population density and urbanization have been shown to correlate with distinct smoking behaviors [2]. The concentration of tobacco retailers in urban areas, combined with targeted marketing strategies, often leads to increased accessibility and normalization of smoking in densely populated regions [3].

Demographic factors interact with population size to create complex consumption patterns. Studies have found that age structure within populations significantly affects smoking prevalence, with younger populations in developing nations showing different consumption patterns compared to aging populations in developed countries [4]. Furthermore, the phenomenon of population aging in many nations has led to shifting patterns of tobacco-related diseases, with longer exposure periods contributing to greater health burdens [5].

Economic theories of consumption suggest that population size affects cigarette demand through multiple mechanisms. The theory of rational addiction [6] posits that social interactions within populations influence smoking behaviors, with larger populations potentially enabling stronger social multiplier effects. Additionally, the distributional economics of tobacco products means that per-unit costs typically decrease in larger markets, potentially increasing consumption [7].

Recent empirical evidence highlights the importance of considering geographic and temporal variations in the population-consumption relationship. A longitudinal study across 100 countries found that while population growth initially predicts increased cigarette consumption, this relationship reverses as nations implement comprehensive tobacco control measures [8]. This suggests that population effects are mediated by policy environments, with Framework Convention on Tobacco Control (FCTC) implementation playing a crucial moderating role [9].

The public health implications of these findings are substantial. As the global population is projected to reach 9.7 billion by 2050 [10], with most growth occurring in low- and middle-income countries where tobacco control may be weaker, understanding these dynamics becomes increasingly urgent. Effective tobacco control strategies must account for population size and distribution factors when designing interventions, particularly in rapidly urbanizing regions where new tobacco markets are emerging [11].

This study developed a machine learning model to predict cigarette sales and identify key influencing factors. By analyzing economic and demographic variables, the research optimized prediction accuracy (measured by RMSE and R²) while preventing overfitting through early stopping. The findings highlight price as the most significant driver of sales, with tax and demographic factors playing secondary roles. These insights offer practical value for both policymakers assessing taxation impacts and businesses optimizing pricing strategies. The study also evaluated model limitations, suggesting opportunities to refine future analyses by excluding less influential variables. Overall, the results provide data-driven insights into cigarette market dynamics, serving academic research and real-world decision-making while establishing a foundation for improved predictive modeling approaches.

Theoretical Foundations of Gradient Boosting

At its core, gradient boosting is an iterative algorithm that minimizes a loss function by adding weak learners to the model sequentially. The process begins with an initial model, such as the mean of the target variable for regression tasks. In each iteration, the algorithm computes the residuals (for squared-error loss, the differences between the actual and predicted values; more generally, the negative gradient of the loss with respect to the current predictions) and trains a new weak learner to predict them. This new learner is then added to the model with a weight chosen to minimize the loss function. The final prediction is the initial model's prediction plus the weighted sum of the predictions from all weak learners.
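The iterative procedure described above can be illustrated with a minimal sketch. It is not the model used in this study: it assumes squared-error loss, a single feature, one-split regression stumps as weak learners, and a fixed shrinkage factor (`learning_rate`) in place of a per-learner weight found by minimizing the loss. All function names and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_stump(x, residuals):
    """Fit a one-split regression stump to the residuals (least squares).

    Assumes x contains at least two distinct values.
    """
    best = None
    for threshold in np.unique(x)[:-1]:  # every split that leaves both sides non-empty
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Return (initial prediction, learning rate, list of fitted stumps)."""
    base = y.mean()                      # initial model: mean of the target
    prediction = np.full(len(y), base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - prediction       # negative gradient of squared-error loss
        stump = fit_stump(x, residuals)  # weak learner trained on the residuals
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    total = np.full(len(x), base, dtype=float)
    for stump in stumps:
        total += learning_rate * predict_stump(stump, x)
    return total


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic, noise-free data used only to exercise the sketch.
x = np.linspace(0.0, 10.0, 200)
y = np.sin(x)

model = gradient_boost(x, y)
rmse_baseline = rmse(y, np.full(len(y), y.mean()))  # mean-only model
rmse_boosted = rmse(y, predict(model, x))           # after 50 boosting rounds
print(rmse_baseline, rmse_boosted)
```

The shrinkage factor trades a slower reduction in training error for smoother updates. Production libraries such as XGBoost and LightGBM use deeper trees, regularization, and second-order information that this sketch leaves out.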

Mathematically, the final model is an additive expansion: the initial prediction plus each weak learner's prediction scaled by its weight. Because each new learner is fit to the negative gradient of the loss at the current predictions, the procedure amounts to gradient descent in function space, which gives gradient boosting its name. This approach allows the model to capture complex, non-linear relationships in the data, making it effective for a wide range of predictive tasks.
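The additive form described above can be written out as follows. The notation is an assumption introduced here for illustration and does not appear in the source: $L$ is the loss function, $F_m$ the model after $m$ iterations, $h_m$ the $m$-th weak learner, $\gamma_m$ its weight, and $M$ the total number of iterations.

```latex
\begin{align}
F_0(x) &= \arg\min_{c} \sum_{i=1}^{n} L(y_i, c) \\
r_{im} &= -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}},
  \qquad i = 1, \dots, n \\
\gamma_m &= \arg\min_{\gamma} \sum_{i=1}^{n} L\bigl(y_i,\; F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr) \\
F_m(x) &= F_{m-1}(x) + \gamma_m\, h_m(x) \\
F_M(x) &= F_0(x) + \sum_{m=1}^{M} \gamma_m\, h_m(x)
\end{align}
```

Here $h_m$ is the weak learner fit to the pseudo-residuals $r_{im}$. For the squared-error loss $L(y, F) = \tfrac{1}{2}(y - F)^2$, the pseudo-residuals reduce to the ordinary residuals $r_{im} = y_i - F_{m-1}(x_i)$, and $F_0$ is the mean of the target variable.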
