his.sePublikationer
RefereraExporteraLänk till posten
Permanent länk

Direktlänk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf
Semantic frame based automatic extraction of typological information from descriptive grammars
Högskolan i Skövde, Institutionen för informationsteknologi.
2019 (Engelska)Självständigt arbete på avancerad nivå (magisterexamen), 20 poäng / 30 hpStudentuppsats (Examensarbete)
Abstract [en]

This thesis project addresses the machine learning (ML) modelling aspects of the problem of automatically extracting typological linguistic information of natural languages spoken in South Asia from annotated descriptive grammars. Without getting stuck into the theory and methods of Natural Language Processing (NLP), the focus has been to develop and test a machine learning (ML) model dedicated to the information extraction part. Starting with the existing state-of-the-art frameworks to get labelled training data through the structured representation of the descriptive grammars, the problem has been modelled as a supervised ML classification task where the annotated text is provided as input and the objective is to classify the input to one of the pre-learned labels. The approach has been to systematically explore the data to develop understanding of the problem domain and then evaluate a set of four potential ML algorithms using predetermined performance metrics namely: accuracy, recall, precision and f-score. It turned out that the problem splits up into two independent classification tasks: binary classification task and multiclass classification task. The four selected algorithms: Decision Trees, Naïve Bayes, Support VectorMachines, and Logistic Regression belonging to both linear and non-linear families ofML models are independently trained and compared for both classification tasks. Using stratified 10-fold cross validation performance metrics are measured and the candidate algorithms  are compared. Logistic Regression provided overall best results with DecisionTree as the close follow up. Finally, the Logistic Regression model was selected for further fine tuning and used in a web demo for typological information extraction tool developed to show the usability of the ML model in the field.

Ort, förlag, år, upplaga, sidor
2019. , s. 49
Nyckelord [en]
Automatic Information Extraction, Spoken Languages, Typological Linguistic Information, Logistic Regression, Classification
Nationell ämneskategori
Datavetenskap (datalogi)
Identifikatorer
URN: urn:nbn:se:his:diva-17893OAI: oai:DiVA.org:his-17893DiVA, id: diva2:1371627
Ämne / kurs
Datavetenskap
Utbildningsprogram
Datavetenskap - magisterprogram
Handledare
Examinatorer
Tillgänglig från: 2019-11-20 Skapad: 2019-11-20 Senast uppdaterad: 2019-11-20Bibliografiskt granskad

Open Access i DiVA

fulltext(1768 kB)15 nedladdningar
Filinformation
Filnamn FULLTEXT01.pdfFilstorlek 1768 kBChecksumma SHA-512
5105cde5f4110737fbd3fa3eeff131c48550b8224f00d89952cb71a81e308c5681a738a89cf9c1a096fd10ed5cef0e1523a1d18c4ee6cdb3594b14f18caef2a2
Typ fulltextMimetyp application/pdf

Av organisationen
Institutionen för informationsteknologi
Datavetenskap (datalogi)

Sök vidare utanför DiVA

GoogleGoogle Scholar
Totalt: 15 nedladdningar
Antalet nedladdningar är summan av nedladdningar för alla fulltexter. Det kan inkludera t.ex tidigare versioner som nu inte längre är tillgängliga.

urn-nbn

Altmetricpoäng

urn-nbn
Totalt: 72 träffar
RefereraExporteraLänk till posten
Permanent länk

Direktlänk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf