Developing a pilot Hindi Treebank based on Computational Paninian Grammar
Abstract
The Penn Treebank has demonstrated the importance of treebanks as a linguistic resource for NLP. The current research presents an effort to develop a pilot treebank for Hindi, which could serve as the basis for a large-scale Hindi treebank. Building a treebank requires a computational grammar framework, an annotation scheme based on the chosen grammar, guidelines for annotating the various types of constructions in the language concerned, and related resources such as verb frames. Since Hindi has a relatively free word order, the dependency grammar formalism is well suited to it. We therefore chose the Computational Paninian Grammar framework [36]; Panini's grammar is a dependency grammar [99, 162]. Hence, the scheme for annotating treebanks for Indian languages was developed on the basis of this framework. As part of this study, a pilot treebank for Hindi (HyDT, the Hyderabad Dependency Treebank for Hindi) [21] was developed and released for ICON-2009 (International Conference on Natural Language Processing 2009) [86]. The scheme [21] and the guidelines for Hindi treebank annotation developed during this study were subsequently modified and are being used for a multi-layered, multi-representational treebank for Hindi and Urdu [39, 42, 188], a collaborative project among several universities.
Along with the creation of the Hindi treebank (HyDT), I also created a supplementary resource of verb frames for 687 Hindi verbs. I present this work on verb frames [22], describing the methodology used to prepare the frames and the criteria followed for classifying Hindi verbs. The main goal of this work is to create a linguistic resource that will prove indispensable for various NLP applications. I have also worked on the mapping between PropBank annotation and dependency annotation based on the Paninian Grammatical Framework [21, 36].
I also discuss the use of HyDT (Hyderabad Dependency Treebank for Hindi) data [21] in various experiments.