Assessing the Impact of Feature Quantity in Tree Based Machine Learning Models for the Detection of Malicious Software Packages in PyPI
2025 (English) Independent thesis Basic level (degree of Bachelor), 20 credits / 30 HE credits
Student thesis
Abstract [en]
Software supply chain attacks, where malicious code is inserted into software components or repositories, present a growing threat to the open-source ecosystem. Traditional scanning methods often fail to detect hidden threats. Machine Learning (ML) offers a potential solution by automating the detection of malicious packages. However, previous studies report false positive rates too high for practical deployment. This study investigates whether increasing feature quantity improves ML model performance for detecting malicious software packages on PyPI. Using a quasi-experiment with Decision Tree, Random Forest, and XGBoost classifiers, the feature set is incrementally expanded, and the change in performance is measured at each step. Results show that adding features improves performance initially, but gains diminish beyond a certain point. Future work includes expanding the feature sets further and adding randomized datasets to reduce feature bias.
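The incremental expansion described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the feature names `f1` to `f5`, the per-feature gains, and the `min_gain` threshold are all hypothetical. The scoring function stands in for training and evaluating a classifier (for example a Decision Tree, Random Forest, or XGBoost model) on a given feature subset.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def incremental_feature_curve(
    features: Sequence[str],
    score_fn: Callable[[Sequence[str]], float],
) -> List[Tuple[int, float]]:
    """Score the first k features for k = 1 .. len(features)."""
    curve = []
    for k in range(1, len(features) + 1):
        subset = features[:k]
        curve.append((k, score_fn(subset)))
    return curve


def first_plateau(
    curve: List[Tuple[int, float]], min_gain: float
) -> Optional[int]:
    """Return the smallest feature count k for which adding the next
    feature improves the score by less than min_gain, or None."""
    for (k, score), (_, next_score) in zip(curve, curve[1:]):
        if next_score - score < min_gain:
            return k
    return None


# Hypothetical per-feature gains, for illustration only.
# These are NOT results from the thesis.
TOY_GAINS = {"f1": 0.20, "f2": 0.10, "f3": 0.04, "f4": 0.005, "f5": 0.001}


def toy_score(subset: Sequence[str]) -> float:
    """Placeholder scorer; a real study would train a model and
    compute a metric such as F1 or false positive rate here."""
    return 0.60 + sum(TOY_GAINS[name] for name in subset)


if __name__ == "__main__":
    curve = incremental_feature_curve(list(TOY_GAINS), toy_score)
    for k, score in curve:
        print(f"{k} features: score {score:.3f}")
    print("plateau after", first_plateau(curve, min_gain=0.01), "features")
```

In a real experiment, `toy_score` would be replaced by a function that fits each classifier on the selected columns and returns a held-out metric. The order in which features are added also affects the shape of the curve.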
Place, publisher, year, edition, pages
2025. p. 59
Keywords [en]
Classification, Feature Quantity, Artificial Intelligence, Malicious, cyber security, Software Supply-Chain
National Category
Computer Sciences; Software Engineering
Identifiers
URN: urn:nbn:se:his:diva-25575
OAI: oai:DiVA.org:his-25575
DiVA, id: diva2:1985408
Subject / course
Information Technology
Educational program
Computer Science - Specialization in Systems Development
Supervisors
Examiners
2025-07-24 2025-07-24 2025-09-29 Bibliographically approved