Data-Centric Text Augmentation to Improve Large Language Model Performance

  • Motivation:
    This thesis explores how to improve the performance of large language models (LLMs) in organizational contexts, focusing on the challenge of limited training data. It aims to develop effective methods for training small language models tailored to specific text input and output tasks within organizations. The central theme is improving and augmenting existing datasets, emphasizing the quality and preparation of data over the complexity of the model's architecture. The goal is to optimize the performance of small language models on specific downstream tasks while making the most of the limited data resources available.


    CURRENT RESEARCH STATUS:
    While LLMs have revolutionized the field of natural language processing, the challenge of obtaining high-quality training data remains, especially in organizational settings, where training data for specialized tasks is often scarce. Although existing models have shown significant successes, their adaptability to specific, often data-constrained, organizational tasks remains an insufficiently explored area.


    RESEARCH GOAL:
    This research will address the gap in knowledge regarding the optimization of LLMs for specific tasks in data-limited scenarios. Key questions include: How can small language models be effectively trained with limited data? What are the most efficient text augmentation and preprocessing techniques? How do these techniques impact the performance of LLMs on specific downstream tasks?


    THEORETICAL & PRACTICAL IMPLICATIONS AND IMPACT ON THE STATUS QUO:
    Theoretically, this research can change the way LLMs are trained, shifting the focus from increasingly complex models to data quality and preparation. Practically, it has the potential to significantly enhance the efficiency and accuracy of LLMs in organizational settings, particularly in tasks where data is scarce.


    METHODOLOGY:

    • Literature Review: A comprehensive review of existing approaches in text augmentation and data preparation for LLMs.
    • Implementation: Applying these approaches to small language models, focusing on the unique challenges of organizational contexts.
    • Evaluation: Assessing various combinations of text augmentation and preprocessing techniques to determine their effectiveness (a minimal sketch of one such augmentation technique follows this list).
    • Best Practices Derivation: Developing a set of best practices based on the evaluation, guiding future implementations in similar contexts.
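
    As a first illustration of the kind of technique to be implemented and evaluated, the sketch below shows two simple token-level augmentation operations (random swap and random deletion) of the kind discussed in text augmentation surveys such as Shorten et al. (2021). It is only a minimal sketch; the function names, parameters, and sample sentence are illustrative and not part of the proposed method.

        import random

        def random_swap(tokens, n_swaps=1):
            """Randomly swap the positions of two tokens, n_swaps times."""
            tokens = tokens.copy()
            for _ in range(n_swaps):
                if len(tokens) < 2:
                    break
                i, j = random.sample(range(len(tokens)), 2)
                tokens[i], tokens[j] = tokens[j], tokens[i]
            return tokens

        def random_deletion(tokens, p=0.1):
            """Drop each token with probability p, keeping at least one token."""
            kept = [t for t in tokens if random.random() > p]
            return kept if kept else [random.choice(tokens)]

        def augment(text, n_variants=2, n_swaps=1, p_delete=0.1):
            """Produce n_variants noisy copies of one training example."""
            tokens = text.split()
            variants = []
            for _ in range(n_variants):
                new_tokens = random_deletion(random_swap(tokens, n_swaps), p_delete)
                variants.append(" ".join(new_tokens))
            return variants

        if __name__ == "__main__":
            example = "The customer asked about the delayed shipment of order 4711."
            for variant in augment(example):
                print(variant)

    In the evaluation step, such operations would be applied to the scarce organizational training data and the resulting downstream performance compared against a non-augmented baseline.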

     

    Literature:
    • Jakubik, J., Vössing, M., Kühl, N., Walk, J., & Satzger, G. (2022). Data-centric artificial intelligence. arXiv preprint arXiv:2212.11854.
    • Zha, D., Bhat, Z. P., Lai, K. H., Yang, F., Jiang, Z., Zhong, S., & Hu, X. (2023). Data-centric artificial intelligence: A survey. arXiv preprint arXiv:2303.10158.
    • Shorten, C., Khoshgoftaar, T. M., & Furht, B. (2021). Text data augmentation for deep learning. Journal of Big Data, 8, 1-34.
    • Exemplary dataset: https://huggingface.co/datasets/knkarthick/dialogsum
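
    To make the exemplary dataset concrete, the following sketch loads it with the Hugging Face datasets library. The "dialogue" and "summary" field names are assumed from the dataset card and should be verified against the downloaded data.

        # Requires: pip install datasets
        from datasets import load_dataset

        # Load the exemplary dialogue-summarization dataset from the Hugging Face Hub.
        dataset = load_dataset("knkarthick/dialogsum")

        # Inspect the splits and one training example.
        # Field names ("dialogue", "summary") are assumed from the dataset card.
        print(dataset)
        sample = dataset["train"][0]
        print(sample["dialogue"][:200])
        print(sample["summary"])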