Outlier Detection in Categorical Data

Files

80_recommendation.pdf (102.96 KB)

abstract.pdf (127.73 KB)

annexures.pdf (347.33 KB)

chapter1.pdf (460.21 KB)

chapter2.pdf (372.62 KB)

Abstract

Outlier detection remains a prominent research area because of its many applications, such as fraud detection in financial transactions, fault identification in industrial products, disease identification in healthcare systems, image processing, malware detection in computer systems, and intruder detection in communication systems. Most of the techniques developed in the literature for finding outliers are specific to a particular type of dataset and accept input data in numerical format only. The data objects in real-world datasets may contain categorical features as well as numerical features. Hence, there is a pressing need for outlier detection algorithms that can handle both categorical and numerical data efficiently. The techniques available for outlier detection can be categorized as classification-based, clustering-based, proximity-based, ensemble-based, and statistics-based, among others. In this thesis, a novel similarity measure for finding proximity-based outliers in categorical datasets, a voting-based ensemble of different outlier detectors, and a hybrid method that combines different outlier detection models with two categorical encoders are presented.

A similarity measure called Correlation weighted Probability and Level (CPL) based similarity is proposed in this thesis to measure the similarity between two data objects having multiple categorical attributes. It is based on the probability of occurrence of a particular value in an attribute and on the correlation between that attribute and the class attribute. The association between each attribute and the class attribute is used as the weight of the corresponding attribute when determining the similarity. This association measures how well the value of the class attribute can be predicted from that attribute. The similarity between two data objects is the weighted sum of the similarities between their values on each attribute.
The outlier detection experiments are conducted in the R software package using different categorical
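
The weighted-sum idea described in the abstract can be illustrated with a minimal Python sketch. It is not the CPL formula, which the abstract does not state. Three choices here are assumptions made only for illustration: the per-attribute score (matching values score 1 - p^2, so rarer matches count more, and mismatches score 0), the use of Goodman-Kruskal lambda as the attribute-to-class association, and the normalization by the total weight. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def value_probabilities(rows, j):
    """Relative frequency of each value of attribute j."""
    counts = Counter(row[j] for row in rows)
    n = len(rows)
    return {value: count / n for value, count in counts.items()}


def class_association(rows, labels, j):
    """Goodman-Kruskal lambda: how much knowing attribute j reduces
    the error of predicting the class label (assumed weight)."""
    n = len(rows)
    overall_max = max(Counter(labels).values())
    if n == overall_max:  # only one class present, nothing to predict
        return 0.0
    per_value = defaultdict(Counter)
    for row, label in zip(rows, labels):
        per_value[row[j]][label] += 1
    within_max = sum(max(c.values()) for c in per_value.values())
    return (within_max - overall_max) / (n - overall_max)


def weighted_similarity(x, y, probs, weights):
    """Weighted sum of per-attribute similarities, divided by the
    total weight (the normalization is an assumption)."""
    total = sum(weights)
    if total == 0:
        return 0.0
    score = 0.0
    for j, (a, b) in enumerate(zip(x, y)):
        if a == b:
            # Illustrative per-attribute score, not the CPL definition.
            score += weights[j] * (1.0 - probs[j].get(a, 0.0) ** 2)
    return score / total


# Hypothetical toy data: colour determines the class, size does not.
rows = [("red", "small"), ("red", "large"),
        ("blue", "small"), ("blue", "large")]
labels = ["a", "a", "b", "b"]

n_attrs = len(rows[0])
probs = [value_probabilities(rows, j) for j in range(n_attrs)]
weights = [class_association(rows, labels, j) for j in range(n_attrs)]

sim_same_colour = weighted_similarity(rows[0], rows[1], probs, weights)
sim_same_size = weighted_similarity(rows[0], rows[2], probs, weights)
```

In this toy data the colour attribute receives all of the weight, so two objects that agree only on size are scored as dissimilar. A proximity-based detector could then flag objects whose similarity to their nearest neighbours is unusually low.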

URI

http://hdl.handle.net/10603/462108

Collections

Department of Computer Science and Engineering
