Improving Performance of Classifiers for Medical Text Data using Rare Word Handling

Abstract

In the current digital era, the volume of digital text literature is growing exponentially. This vast body of unstructured text requires deeper understanding before it can be classified automatically. The task of text classification is to assign new documents to predefined categories based on their content. Many researchers are now working on text classification problems and on systems for specific applications, such as spam email classification, classification of research articles or patents, news categorization, sentiment analysis and many more. Sources of text data include web pages, social media, emails, chats, news, user reviews, biomedical text, clinical reports and others. Classifying text data is complex and time-consuming because of the text's complex structure: digital text sources hold data in semi-structured or unstructured form. Notably, text classification remains an unsolved problem in some specific areas, such as medical text classification and other areas where domain-specific knowledge is required. Medical text datasets are of two types. One type is clinical notes/reports, such as doctors' handwritten prescriptions; the other is computerized/digitized biomedical literature, such as short sentences, articles, research papers or literature data. In this thesis, we focus mainly on the second type: automatic classification of biomedical text data without human help. As digitized medical literature grows, it becomes challenging for the medical community, researchers, scientists and doctors to find answers to their questions, because medical text data has issues such as domain-specific words with prefix-root-suffix structure, abbreviations, polysemous and synonymous words, non-grammatical words, and low-frequency medical words or rare words.

The traditional medical text classification process has stages/phases like

