A Fine Grained Approach to Develop Domain Specific Search Engine

Abstract

The growth of content and users on the internet creates research opportunities in crowdsourcing, search engines and other areas that consume and process web content. In crowdsourcing, open problems include lack of clarity in task definition, uncertainty in completion time, concerns about data confidentiality, unavailability of workers, cost considerations and quality of output. Quality of output is one of the primary concerns in crowdsourcing tasks, as a lack of it defeats the objective of using the crowd. Researchers are exploring various mechanisms based on statistics, machine learning and game theory to achieve acceptable output quality.

As part of this work, an online survey on crowdsourcing reiterated the importance of quality assurance and quality control of tasks. The survey, along with a review of the research literature, confirmed that quality mechanisms for crowdsourcing tasks benefit from the availability of a credible knowledge base in a domain. To build a domain specific knowledge base, this work proposes a fine grained approach to identify sub-domains in a domain and extract related content, enrichment of the knowledge base in the form of an ontology, and credibility assessment based on web genres. The research outcomes on creation and enrichment of the knowledge base are used to develop a domain specific search engine.

Web pages on the internet contain content across domains. Extracting domain specific content requires crawler efficiency, representation of sub-domains and noise reduction. A systematic approach to identify sub-domains in a domain is proposed. Further, the work extends the metaheuristics based Artificial Bee Colony (ABC) algorithm to extract sub-domains' URLs. The extended ABC algorithm for crawling performed better than existing industry scale open source crawlers in terms of volume of extracted URLs and usage of compute resources. A metric, SeedRel, is proposed to measure the precision of seed URLs based on the presence of child URLs and content relevance. The work measured sub-domains coverage with a baseline val
