Högskolan i Skövde

Hadoop scalability evaluation for machine learning algorithms on physical machines: Parallel machine learning on computing clusters
University of Skövde, School of Informatics.
2021 (English). Independent thesis, Basic level (degree of Bachelor), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

The amount of available data has allowed the field of machine learning to flourish. But with growing data set sizes comes an increase in algorithm execution times. Cluster computing frameworks provide tools for distributing data and processing power across several computer nodes and allow algorithms to run in feasible time frames when data sets are large. Different cluster computing frameworks come with different trade-offs. In this thesis, the scalability of the execution time of machine learning algorithms running on the Hadoop cluster computing framework is investigated. A recent version of Hadoop and algorithms relevant in industry machine learning, namely K-means, latent Dirichlet allocation, and naive Bayes, are used in the experiments. This thesis provides valuable information to anyone choosing between different cluster computing frameworks.

The results range from moderate scalability to no scalability at all. These results indicate that Hadoop as a framework may have serious restrictions in how well tasks are actually parallelized. Scalability could possibly be improved by modifying the machine learning library algorithms or by tuning Hadoop parameters.

Place, publisher, year, edition, pages
2021, p. 54, xviii
Keywords [en]
Parallel machine learning, cluster computing, Hadoop, performance analysis
National Category
Information Systems, Social aspects
Identifiers
URN: urn:nbn:se:his:diva-20102
OAI: oai:DiVA.org:his-20102
DiVA, id: diva2:1576361
Subject / course
Information Technology
Educational program
Computer Science - Specialization in Systems Development
Available from: 2021-06-30. Created: 2021-06-30. Last updated: 2021-06-30. Bibliographically approved.

Open Access in DiVA

fulltext (801 kB), 165 downloads
File information
File name: FULLTEXT01.pdf
File size: 801 kB
Checksum (SHA-512): 3b20744843aacfa9c73a2f334cf32e8e3fc5dd08d179bc66d718ed2668070cbeaea78bcce536f64df6c75421ca1f6191690ccaf5f11f5a89f4fa62b2a31503ee
Type: fulltext
Mimetype: application/pdf
