Imbalanced data oversampling through subspace optimization with Bayesian reinforcement
2026 (English)
In: Artificial Intelligence Review, ISSN 0269-2821, E-ISSN 1573-7462, Vol. 59, no 1, article id 1
Article in journal (Refereed) Published
Abstract [en]
Many real-world machine learning classification problems suffer from imbalanced training data, where the least frequent label has high relevance and significance for the end user, such as equipment breakdowns or various types of process anomalies. This imbalance can negatively impact the learning algorithm and lead to misclassification of minority labels, resulting in erroneous actions and potentially high unexpected costs. Most previous oversampling methods rely only on the minority samples, often ignoring their overall density and distribution in relation to the other classes. In addition, most of them offer little explainability of the oversampling process. In contrast, this paper proposes a novel oversampling method that considers a subspace of the feature set for the creation of synthetic minority samples using nonlinear optimization of a class-sensitive objective function. Suitable subspaces for oversampling are identified through a Bayesian reinforcement strategy based on Dirichlet smoothing, which may be useful for explainable-AI. The proposed method is compared empirically with 10 existing techniques on 18 real-world datasets using two traditional machine learning classifiers and four evaluation metrics. Statistical analysis of cross-validated runs over the 18 datasets and four metrics (i.e. 72 experiments) reveals that the proposed approach is among the best performing methods in 6 and 2 instances when using a random forest classifier and a support vector machine classifier, respectively, thus placing it at the top. The study also reveals that some feature combinations are more important than others for minority oversampling, and the proposed approach offers a way to identify such features.
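The abstract does not spell out the update rule behind the Bayesian reinforcement strategy, so the following is only a minimal sketch of the general idea of Dirichlet smoothing, not the authors' algorithm. It assumes that each candidate feature subspace keeps a running reward count and that selection probabilities are the posterior mean under a symmetric Dirichlet prior. The names `subspaces`, `toy_reward` and `alpha` are hypothetical and introduced here for illustration only.

```python
import numpy as np


def dirichlet_posterior_mean(counts, alpha=1.0):
    """Smoothed selection probabilities: (count_k + alpha) / (total + K * alpha)."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * counts.size)


def select_subspace(counts, rng, alpha=1.0):
    """Sample the index of one candidate subspace from the smoothed probabilities."""
    probs = dirichlet_posterior_mean(counts, alpha)
    return int(rng.choice(len(probs), p=probs))


# Hypothetical candidate subspaces, given as tuples of feature indices.
subspaces = [(0, 1), (0, 2), (1, 2)]


def toy_reward(subspace):
    """Stand-in reward (assumption): pretend only subspace (0, 2) yields useful samples."""
    return 1.0 if subspace == (0, 2) else 0.0


rng = np.random.default_rng(0)
counts = np.zeros(len(subspaces))

for _ in range(200):
    k = select_subspace(counts, rng)
    counts[k] += toy_reward(subspaces[k])  # reinforce subspaces that were rewarded

probs = dirichlet_posterior_mean(counts)
best = subspaces[int(np.argmax(probs))]
```

With `alpha > 0`, no subspace ever reaches probability zero, so rarely rewarded subspaces can still be explored. The final `probs` vector can be read as a rough importance ranking of the feature combinations, which is the kind of interpretability the abstract alludes to.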
Place, publisher, year, edition, pages
Springer Nature, 2026. Vol. 59, no 1, article id 1
Keywords [en]
Imbalanced data, Oversampling, Nonlinear optimization, Dirichlet distribution, Bayesian reinforcement, Density-based, Features subspace, Feature importance, Explainable-AI
National Category
Computer Sciences; Computer Systems
Research subject
Virtual Production Development (VPD); Skövde Artificial Intelligence Lab (SAIL)
Identifiers
URN: urn:nbn:se:his:diva-25994
DOI: 10.1007/s10462-025-11417-1
ISI: 001610765900001
Scopus ID: 2-s2.0-105021344491
OAI: oai:DiVA.org:his-25994
DiVA, id: diva2:2012898
Projects
TOPAZ - Towards Prescriptive Analytics in Virtual Factories through Structured Data Mining and Optimization
Integrated Manufacturing Analytics Platform for Predictive Maintenance with IoT
Funder
University of Skövde
Knowledge Foundation, 20200011
Vinnova, 2021-02537
Note
CC BY 4.0
Published online: 10 November 2025
Mahesh Kumbhar, mahesh.kumbar@his.se
The authors acknowledge the financial support received from KK-stiftelsen (The Knowledge Foundation, Stockholm, Sweden) and VINNOVA (Sweden Innovation Agency, Stockholm, Sweden) for the research projects ‘TOPAZ - Towards Prescriptive Analytics in Virtual Factories through Structured Data Mining and Optimization’ under grant 20200011 and ‘Integrated Manufacturing Analytics Platform for Predictive Maintenance with IoT’ under grant 2021-02537.
Open access funding provided by University of Skövde.
2025-11-11 2025-11-11 2025-11-20 Bibliographically approved