Evaluation of Machine Learning techniques for Master Data Management
2023 (English) Independent thesis Advanced level (degree of Master (One Year)), 10 credits / 15 HE credits
Student thesis
Abstract [en]
In organisations, duplicate customer master data is a recurring problem. Duplicate records, which frequently arise from disparate systems or inadequate data integration, can cause errors, complications, and inefficiency. Because customer information changes over time, the problem grows more complicated, so prompt detection and correction are essential. Eliminating duplicates improves data quality and, in turn, business processes, customer confidence, and decision-making. This master’s thesis explores the application of machine learning to Master Data Management. The main objective is to assess how machine learning can improve the accuracy and consistency of master data records, and thereby to support better data quality within enterprises by addressing issues such as duplicate customer data. Two research questions are studied: whether machine learning can be used to improve the accuracy of customer data, and whether it can be used to investigate scientific models for customer analysis when cleaning data. The study proceeds in four steps: dimension identification, selection of appropriate algorithms, selection of appropriate parameter values, and analysis of the output. As ground truth for the project, we concluded that 22,000 is the correct number of clusters for our clustering algorithms, representing the number of unique customers. On that basis, the best-performing algorithm in terms of number of clusters and silhouette score was K-means, with 22,000 clusters and a silhouette score of 0.596, followed by BIRCH, with 22,000 clusters and a silhouette score of 0.591.
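The silhouette score used above to rank the clustering algorithms can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name, the toy points, and the choice of Euclidean distance are assumptions made for illustration, and a real study of this size would more likely rely on a library routine. For each record, the score compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b), giving (b - a) / max(a, b); values near 1 indicate compact, well-separated clusters.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    # Group point indices by their cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # Common convention: a singleton cluster contributes a score of 0.
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:  # skip the degenerate case where every distance is zero
            total += (b - a) / denom
    return total / len(points)


# Hypothetical toy data: four records encoded as 2-D numeric vectors.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = [0, 0, 1, 1]  # groups nearby records together
bad = [0, 1, 0, 1]   # groups distant records together

good_score = silhouette_score(points, good)
bad_score = silhouette_score(points, bad)
```

In this toy case the grouping that keeps nearby records together receives the higher score, which is the sense in which the abstract's values of 0.596 and 0.591 are compared.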
Place, publisher, year, edition, pages
2023, p. 34
Keywords [en]
Master Data Management, Machine Learning, data quality, data duplicates
National Category
Information Systems
Identifiers
URN: urn:nbn:se:his:diva-23239
OAI: oai:DiVA.org:his-23239
DiVA, id: diva2:1799382
Subject / course
Information Technology
Educational program
Data Science - Master’s Programme
Supervisors
Examiners
2023-09-22 2023-09-22 2023-09-22 Bibliographically approved