Högskolan i Skövde

Investigating the Ability of Machine Learning Models to Classify LLM-Generated Software Artifacts from Student-Written Software Artifacts
University of Skövde, School of Informatics.
2025 (English). Independent thesis, Basic level (degree of Bachelor), 20 credits / 30 HE credits. Student thesis
Abstract [en]

This study investigates the use of machine learning (ML) models to distinguish programming assignments generated by Large Language Models (LLMs) from those written by students. It compares the CSEDM dataset with LLM-generated datasets from ChatGPT, DeepSeek, and Qwen to identify distinguishing features. The LLM dataset was generated with several prompt engineering techniques to ensure a diverse and representative series of outputs. Several ML models, including XGBoost, SVM, and LightGBM, are evaluated for classification performance. The study shows that DeepSeek is the most difficult LLM to classify, whereas the ML models excelled in different areas, with no single model being preeminent. The study also explores prompt engineering techniques to generate LLM code resembling student submissions. Results include an analysis of which ML models perform best and why, and a visualization of how LLM software artifacts differ from student artifacts. Furthermore, the results emphasize and visualize feature importance for the different models using feature extraction. The findings offer promising results and insights for educational policy and the development of stronger automated detection tools.

Place, publisher, year, edition, pages
2025. p. 130
Keywords [en]
Machine Learning, Large Language Models, Programming Education, Feature Extraction, DeepSeek, ChatGPT, Qwen, Prompt Engineering
National Category
Software Engineering; Computer Systems
Identifiers
URN: urn:nbn:se:his:diva-25548
OAI: oai:DiVA.org:his-25548
DiVA, id: diva2:1985257
Subject / course
Information Technology
Educational program
Computer Science - Specialization in Systems Development
Supervisors
Examiners
Available from: 2025-07-23. Created: 2025-07-23. Last updated: 2025-09-29. Bibliographically approved

Open Access in DiVA

fulltext (24256 kB), 306 downloads
File information
File name: FULLTEXT01.pdf
File size: 24256 kB
Checksum (SHA-512):
8bc17f4dcaad1e4ef9c9bd4567a9f6f0ef160c58430105e25311d0a012487cb71ca468861962ffdd6711ea549d5cd268fbbe99f367fe6a4ecee17b4d304da6b1
Type: fulltext
Mimetype: application/pdf
