Linguistic Resource Development for Low-Resource Languages: A Case Study of Gujarati

Abstract

The importance of Natural Language Processing (NLP) in the era of artificial intelligence is growing day by day, driven by applications that enable machines to understand and process human languages efficiently. NLP research and development depends on the data and linguistic resources available for the language at hand. Unfortunately, some languages are still resource-constrained, which makes the development of NLP tools for them more challenging. In this thesis, we discuss linguistic resource development for Gujarati, an Indo-Aryan language with rich and complex morphology. Developing linguistic tools for low-resource languages like Gujarati is much needed, as it helps create new resources for the language and in turn significantly improves the performance of existing NLP applications. The work primarily focuses on analyzing linguistic challenges, generating datasets, and developing Part-of-Speech (POS) tagger and morphological analyzer models for Gujarati to enhance the NLP capabilities of the language. POS tagging and morphological analysis are foundational tasks for any NLP problem. The morphological complexity and resource scarcity of Gujarati present unique challenges in developing these tools. We study the unique linguistic features of the Gujarati language and perform a series of experiments on POS tagging and morphological analysis. During this work, we developed a POS-Morph dataset for Gujarati, integrated into the standard UniMorph schema. Among our research findings, we establish the linguistic interdependence between POS tags and morphological features.
Our initial experiments are conducted on conventional deep learning models such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks; later, with the evolution of pre-trained models, we leveraged multilingual pre-trained models for our experiments.
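To make the UniMorph-style annotation mentioned above concrete, the following is a minimal sketch of what a POS-Morph entry might look like and how the POS tag relates to the morphological feature bundle. The Gujarati word and its feature analysis here are illustrative assumptions, not entries from the thesis's actual dataset; UniMorph bundles list the POS tag first, followed by semicolon-separated morphological features.

```python
# Hypothetical UniMorph-style entries: (lemma, inflected form, feature bundle).
# The feature bundle encodes POS plus morphological features, illustrating the
# POS-morphology interdependence discussed in the abstract.
entries = [
    ("છોકરો", "છોકરાઓ", "N;MASC;PL"),  # "boy" -> "boys" (assumed analysis)
]

def split_features(bundle: str):
    """Split a UniMorph feature bundle into (POS tag, morphological features)."""
    pos, *feats = bundle.split(";")
    return pos, feats

# The POS tag is the first element of the bundle; the rest are morph features.
pos, feats = split_features(entries[0][2])
# pos -> "N", feats -> ["MASC", "PL"]
```

Because the POS tag heads every bundle, a model that predicts the full bundle jointly can exploit the dependence between the POS tag and the remaining features, which is one way the interdependence finding can be put to use.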
