Email Mining Classifier: The empirical study on combining the topic modelling with Random Forest classification
University of Skövde, School of Informatics.
2017 (English). Independent thesis, Basic level (degree of Bachelor), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

Filtering emails and replying to them automatically is of interest to many, but it is hard because of the complexity of the language and the dependence on background information that is not present in the email itself.

This paper investigates whether Latent Dirichlet Allocation (LDA) combined with a Random Forest classifier can be used for the more general email classification task, and how it compares to other existing email classifiers. The comparison is based on a literature study and on empirical experimentation using two real-life datasets. First, a literature study is performed to gain insight into the accuracy of other available email classifiers. Second, the proposed model's accuracy is explored through experimentation.

The literature study shows that the accuracy of more general email classifiers differs greatly across user sets. The proposed model's accuracy falls within the reported accuracy range, but in its lower part, which indicates that the proposed model performs poorly compared to other classifiers. On average, the classifier's performance improves by 15 percentage points with additional information. This indicates that Latent Dirichlet Allocation (LDA) combined with a Random Forest classifier is promising; however, future studies are needed to explore the model and ways to further increase its accuracy.

Place, publisher, year, edition, pages
2017, p. 64
Keywords [en]
Email mining, Latent Dirichlet Allocation, Random Forest classification
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:his:diva-14710
OAI: oai:DiVA.org:his-14710
DiVA, id: diva2:1182329
Subject / course
Computer Science
Educational program
Computer Science - Specialization in Systems Development
Available from: 2018-02-16. Created: 2018-02-13. Last updated: 2018-02-16. Bibliographically approved.

Open Access in DiVA

fulltext (5619 kB), 148 downloads
File information
File name: FULLTEXT01.pdf
File size: 5619 kB
Checksum (SHA-512):
42a5072c1a8d111f148607b8738d37d552dfaf0c8bb57f0567a15652e4a3abb67fc4cd28661f97c7820ef286e387776097a55f41486a6d515fa5aa70a6ce872e
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Halmann, Marju
By organisation
School of Informatics
Computer Sciences
