A comparative study of text classification models on invoices: The feasibility of different machine learning algorithms and their accuracy
2018 (English) Independent thesis Basic level (degree of Bachelor), 20 credits / 30 HE credits
Student thesis
Abstract [en]
Text classification is becoming more important for companies as an increasing amount of digital data is made available. The aim of this thesis is to investigate whether five different machine learning algorithms can be used to automate the classification of invoice data, and to determine which one achieves the highest accuracy. At a later stage, the algorithms are combined in an attempt to achieve better results.
N-grams are used, and the algorithms are compared by their total classification accuracy. The chosen algorithms are taken from scikit-learn, a Python library that implements them. The data is collected and generated to represent the data found on a real invoice from which data has been extracted.
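The setup described above can be illustrated with a minimal sketch. It assumes word-level unigrams and bigrams with TF-IDF weighting, which the abstract does not specify, and uses a handful of hypothetical invoice-like strings and label names in place of the thesis data set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy samples and labels; the thesis data set is not reproduced here.
train_texts = [
    "invoice number 10293",
    "invoice no 55871",
    "due date 2018-06-30",
    "payment due 2018-07-15",
    "total amount 1250.00 SEK",
    "amount to pay 980.50 SEK",
]
train_labels = [
    "invoice_number", "invoice_number",
    "due_date", "due_date",
    "total", "total",
]

test_texts = ["invoice number 77120", "total amount 310.00 SEK"]
test_labels = ["invoice_number", "total"]

# Assumption: word unigrams and bigrams; the n-gram type and range
# used in the thesis are not stated in the abstract.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(),
)
model.fit(train_texts, train_labels)

# Total accuracy = share of test samples classified correctly.
predicted = model.predict(test_texts)
accuracy = accuracy_score(test_labels, predicted)
print(f"accuracy: {accuracy:.2f}")
```

Swapping `LinearSVC()` for another scikit-learn classifier in the same pipeline is one way to compare several algorithms on identical features.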
The results show that machine learning can be used for this type of problem. The highest-scoring algorithm (LinearSVC from scikit-learn) classifies 86% of all samples correctly, which is 16 percentage points above the acceptable level of 70%.
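The abstract also mentions combining algorithms in an attempt to improve results. One possible way to do this in scikit-learn is majority (hard) voting, sketched below. The abstract names only LinearSVC, so the other two classifiers, the voting scheme, the word n-gram range, and the toy samples are all assumptions made for illustration.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy samples and labels; not the thesis data set.
texts = [
    "invoice number 10293",
    "invoice no 55871",
    "due date 2018-06-30",
    "payment due 2018-07-15",
    "total amount 1250.00 SEK",
    "amount to pay 980.50 SEK",
]
labels = [
    "invoice_number", "invoice_number",
    "due_date", "due_date",
    "total", "total",
]

# Assumption: three classifiers combined by majority vote. Only LinearSVC
# is named in the abstract; the others are stand-ins.
ensemble = VotingClassifier(
    estimators=[
        ("svc", LinearSVC()),
        ("nb", MultinomialNB()),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",  # each classifier casts one vote per sample
)

# Raw n-gram counts are non-negative, which MultinomialNB requires.
combined_model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), ensemble)
combined_model.fit(texts, labels)

combined_prediction = combined_model.predict(["due date 2018-08-01"])
print(combined_prediction)
```

Hard voting is used here because `LinearSVC` does not expose class probabilities, which soft voting would need.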
Place, publisher, year, edition, pages
2018. p. 42
Keywords [en]
Machine learning, text classification, invoices, supervised learning, information retrieval, ensemble learning
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:his:diva-15647
OAI: oai:DiVA.org:his-15647
DiVA, id: diva2:1219825
External cooperation
Asitis AB
Subject / course
Information technology
Educational program
Computer Science - Specialization in Systems Development
Supervisors
Examiners
2018-06-26 2018-06-18 2025-09-29 Bibliographically approved