Outlier Detection in Categorical Data
Abstract
Outlier detection remains a prominent research area because of its many applications, such as fraud detection in financial transactions, fault identification in industrial products, disease identification in healthcare systems, image processing, malware detection in computer systems, and intruder detection in communication systems. Most techniques in the literature for finding outliers are specific to a particular type of dataset and accept input data in numerical format only. However, data objects in real-world datasets may contain categorical as well as numerical features. Hence, there is a pressing need for outlier detection algorithms that handle both categorical and numerical data efficiently. The available techniques for outlier detection can be categorized as classification-based, clustering-based, proximity-based, ensemble-based, and statistics-based, among others. This thesis presents a novel similarity measure for finding proximity-based outliers in categorical datasets, a voting-based ensemble of different outlier detectors, and a hybrid method that combines different outlier detection models with two categorical encoders.
A similarity measure called Correlation weighted Probability and Level (CPL) based similarity is proposed in this thesis to measure the similarity between two data objects having multiple categorical attributes. It is based on the probability of occurrence of a particular value in an attribute and on the correlation between that attribute and the class attribute. The association between each attribute and the class attribute is used as the weight of that attribute when determining the similarity. This association measures how well the value of the class attribute can be predicted from the other attributes. The similarity between two data objects is the weighted sum of the similarities between their two values for each attribute. The outlier detection experiments are conducted in the R software environment using different categorical