The Impact of Main Content Extraction on Near-Duplicate Detection
Commercial web search engines employ near-duplicate detection to ensure that users see each relevant result only once, even though the underlying web crawls typically include (near-)duplicates of many web pages. We revisit the risks and potential of near-duplicates with an information retrieval focus, motivating that current efforts toward an open and …