Högskolan i Skövde

Semantic frame based automatic extraction of typological information from descriptive grammars
Högskolan i Skövde, Institutionen för informationsteknologi.
2019 (English). Independent thesis, Advanced level (degree of Master (One Year)), 20 credits / 30 HE credits. Student thesis
Abstract [en]

This thesis project addresses the machine learning (ML) modelling aspects of automatically extracting typological linguistic information about natural languages spoken in South Asia from annotated descriptive grammars. Without delving into the theory and methods of Natural Language Processing (NLP), the focus has been to develop and test an ML model dedicated to the information extraction part. Starting with existing state-of-the-art frameworks to obtain labelled training data through a structured representation of the descriptive grammars, the problem has been modelled as a supervised ML classification task in which the annotated text is provided as input and the objective is to classify the input into one of the pre-learned labels. The approach has been to systematically explore the data to develop an understanding of the problem domain and then evaluate a set of four candidate ML algorithms using predetermined performance metrics, namely accuracy, recall, precision, and F-score. It turned out that the problem splits into two independent classification tasks: a binary classification task and a multiclass classification task. The four selected algorithms (Decision Trees, Naïve Bayes, Support Vector Machines, and Logistic Regression), which belong to both linear and non-linear families of ML models, are independently trained and compared for both classification tasks. Performance metrics are measured using stratified 10-fold cross-validation, and the candidate algorithms are compared. Logistic Regression provided the best overall results, with Decision Trees a close second. Finally, the Logistic Regression model was selected for further fine-tuning and used in a web demo of a typological information extraction tool, developed to show the usability of the ML model in the field.
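
The evaluation protocol named in the abstract (stratified 10-fold cross-validation scored by accuracy, precision, recall, and F-score) can be illustrated with a minimal, standard-library-only sketch. This is not the thesis's implementation: the function names `stratified_folds` and `macro_scores` are hypothetical, the round-robin fold assignment is one simple way to stratify, and macro-averaging over classes is an assumption, since the abstract does not say how the multiclass metrics were averaged.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Split sample indices into k folds with similar class proportions.

    Hypothetical helper: indices are dealt round-robin, class by class,
    so each fold receives a near-equal share of every class.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # keeps dealing where the previous class stopped
    for label in sorted(by_class):
        for idx in by_class[label]:
            folds[offset % k].append(idx)
            offset += 1
    return folds


def macro_scores(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall and F-score.

    Macro-averaging is an assumption made for this sketch.
    """
    pairs = list(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f_scores = [], [], []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f_score)

    n = len(classes)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f_score": sum(f_scores) / n,
    }
```

In a full comparison, each candidate classifier would be trained on nine folds and scored on the held-out fold, and the per-fold scores would then be averaged per algorithm and per task (binary and multiclass).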

Place, publisher, year, edition, pages
2019. , p. 49
Keywords [en]
Automatic Information Extraction, Spoken Languages, Typological Linguistic Information, Logistic Regression, Classification
National Category
Identifiers
URN: urn:nbn:se:his:diva-17893; OAI: oai:DiVA.org:his-17893; DiVA, id: diva2:1371627
Subject / course
Computer Science
Educational program
Informatics - Master's Programme
Supervisors
Examiner
Available from: 2019-11-20. Created: 2019-11-20. Last updated: 2019-11-20. Bibliographically approved

Open Access in DiVA

fulltext (1768 kB), 405 downloads
File information
File name: FULLTEXT01.pdf. File size: 1768 kB. Checksum: SHA-512
5105cde5f4110737fbd3fa3eeff131c48550b8224f00d89952cb71a81e308c5681a738a89cf9c1a096fd10ed5cef0e1523a1d18c4ee6cdb3594b14f18caef2a2
Type: fulltext. Mimetype: application/pdf
