Högskolan i Skövde

Evaluation of Machine Learning techniques for Master Data Management
University of Skövde, School of Informatics.
2023 (English)
Independent thesis Advanced level (degree of Master (One Year)), 10 credits / 15 HE credits
Student thesis
Abstract [en]

Duplicate customer master data are a recurring problem in organisations. Duplicate records, which frequently arise from disparate systems or inadequate data integration, can cause errors, complications, and inefficiency. Because customer information changes over time, the problem grows more complicated, so prompt detection and correction are essential. Eliminating duplicates improves data quality, streamlines business processes, boosts customer confidence, and supports better decision-making. This master’s thesis explores the application of machine learning to Master Data Management. The main objective is to assess how machine learning can improve the accuracy and consistency of master data records, and thereby support data quality improvement within enterprises by addressing issues such as duplicate customer data. The study addresses two research questions: whether machine learning can be used to improve the accuracy of customer data, and whether it can be used to investigate scientific models for customer analysis when cleaning data. The study's process has four steps: dimension identification, selection of appropriate algorithms, selection of appropriate parameter values, and output analysis. As the ground truth for the project, we concluded that 22,000 is the correct number of clusters for our clustering algorithms, representing the number of unique customers. On that basis, the best-performing algorithm by number of clusters and silhouette score was K-means, with 22,000 clusters and a silhouette score of 0.596, followed by BIRCH, with 22,000 clusters and a silhouette score of 0.591.
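The silhouette score used above to rank the clustering algorithms can be illustrated with a small sketch. This is not the thesis's code: the function name, the toy two-dimensional "customer record" vectors, and the two labelings are hypothetical, and Euclidean distance is an assumption. It only shows how the metric rewards a labeling that groups near-duplicate records together.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance from p to the other members of its own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance from p to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data: two pairs of near-duplicate records.
records = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

# Labeling that groups each near-duplicate pair together.
good = silhouette_score(records, [0, 0, 1, 1])
# Labeling that splits each pair across clusters.
bad = silhouette_score(records, [0, 1, 0, 1])

print(f"good labeling: {good:.3f}, bad labeling: {bad:.3f}")
```

Scores lie between -1 and 1, with higher values indicating tighter, better-separated clusters. For real data sets, scikit-learn's `sklearn.metrics.silhouette_score` computes the same quantity.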

Place, publisher, year, edition, pages
2023. p. 34
Keywords [en]
Master Data Management, Machine Learning, data quality, data duplicates
National Category
Information Systems
Identifiers
URN: urn:nbn:se:his:diva-23239
OAI: oai:DiVA.org:his-23239
DiVA, id: diva2:1799382
Subject / course
Informationsteknologi
Educational program
Data Science - Master’s Programme
Supervisors
Examiners
Available from: 2023-09-22
Created: 2023-09-22
Last updated: 2023-09-22
Bibliographically approved

Open Access in DiVA

fulltext (588 kB), 417 downloads
File information
File name: FULLTEXT01.pdf
File size: 588 kB
Checksum (SHA-512): 8a7287da05e13f0f25dc665c661891ad7887c9123e9f7907d4a0309bbdde9dbcad808a5ce4d6734ffb5f95689786c3103a1bf71442e0610a7a5b38f5415a8a42
Type: fulltext
Mimetype: application/pdf


Total: 417 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 322 hits