Semantic frame based automatic extraction of typological information from descriptive grammars
University of Skövde, School of Informatics.
2019 (English). Independent thesis, Advanced level (degree of Master (One Year)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

This thesis project addresses the machine learning (ML) modelling aspects of automatically extracting typological linguistic information about natural languages spoken in South Asia from annotated descriptive grammars. Rather than delving into the theory and methods of Natural Language Processing (NLP), the focus has been on developing and testing an ML model dedicated to the information extraction part. Starting from existing state-of-the-art frameworks that provide labelled training data through a structured representation of the descriptive grammars, the problem has been modelled as a supervised ML classification task: the annotated text is provided as input, and the objective is to classify the input into one of the pre-learned labels. The approach has been to systematically explore the data to develop an understanding of the problem domain and then evaluate a set of four candidate ML algorithms using predetermined performance metrics, namely accuracy, recall, precision, and F-score. It turned out that the problem splits into two independent classification tasks: a binary classification task and a multiclass classification task. The four selected algorithms (Decision Trees, Naïve Bayes, Support Vector Machines, and Logistic Regression), which cover both linear and non-linear families of ML models, are trained independently and compared on both classification tasks. Performance metrics are measured using stratified 10-fold cross-validation, and the candidate algorithms are compared. Logistic Regression provided the best overall results, with Decision Trees a close second. Finally, the Logistic Regression model was selected for further fine-tuning and used in a web demo of a typological information extraction tool, developed to show the usability of the ML model in the field.
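The evaluation protocol named in the abstract (stratified 10-fold cross-validation scored with accuracy, precision, recall, and F-score) can be illustrated with a minimal, standard-library Python sketch. This is not the thesis's implementation: the function names `stratified_folds` and `binary_metrics`, the round-robin fold assignment, and the toy labels are all assumptions made for illustration, and only the binary task is covered.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Split example indices into k folds with roughly equal class proportions.

    Hypothetical helper: the thesis does not say how its folds were built.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    # Deal the members of each class round-robin across the folds so that
    # every fold receives a similar share of each class.
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-score here is the harmonic mean of precision and recall (F1).
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


if __name__ == "__main__":
    # Toy labels (assumed): 20 negative and 10 positive examples.
    toy_labels = [0] * 20 + [1] * 10
    for fold in stratified_folds(toy_labels, k=10):
        print(sorted(fold))
```

In a full comparison, each candidate classifier would be trained on nine folds and scored on the held-out fold, with the metrics averaged over the ten rounds. The multiclass task would additionally need a per-class averaging scheme (for example macro-averaging), which the abstract does not specify.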

Place, publisher, year, edition, pages
2019. p. 49
Keywords [en]
Automatic Information Extraction, Spoken Languages, Typological Linguistic Information, Logistic Regression, Classification
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:his:diva-17893
OAI: oai:DiVA.org:his-17893
DiVA, id: diva2:1371627
Subject / course
Computer Science
Educational program
Informatics - Master's Programme
Available from: 2019-11-20 Created: 2019-11-20 Last updated: 2019-11-20 (Bibliographically approved)

Open Access in DiVA

fulltext (1768 kB), 8 downloads
File information
File name: FULLTEXT01.pdf
File size: 1768 kB
Checksum (SHA-512): 5105cde5f4110737fbd3fa3eeff131c48550b8224f00d89952cb71a81e308c5681a738a89cf9c1a096fd10ed5cef0e1523a1d18c4ee6cdb3594b14f18caef2a2
Type: fulltext
Mimetype: application/pdf
