Researcher: Manuel Blum
Research Area: Trustworthy Computing Platforms and Devices
Web spam occurs when webmasters manipulate their sites to exploit search engine ranking algorithms. It is highly prevalent in search results because high rankings for popular search queries can translate into success for site owners. Users tend to click on highly ranked pages and are unlikely to look far past the first page of results; instead, they refine their search. Consequently, search engines try to place the most relevant pages for a query at the top of the results page. This creates tension between search engines and webmasters: engines want to deliver the most relevant content first, while webmasters want their own pages ranked as highly as possible.
Current methods for combating web spam result in an arms race: researchers race to create new algorithms to detect web spam, while spammers (the webmasters who create it) work on ways to circumvent these techniques. Considerable research is devoted to improving page rankings and spam detection, with much of the focus on link spam. Link spam occurs when a target page's rank is inflated by having other pages point to each other and to the target. Less time is spent on content analysis, where algorithms evaluate the actual content of pages. As far as we know, our research will be the first to provide theoretical performance guarantees for a web spam detection scheme.
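To make the link spam mechanism concrete, the following minimal sketch compares a target page's score with and without a small link farm. It uses a simplified PageRank-style power iteration purely for illustration; the ranking algorithm, the damping factor, and the page names (`a`, `b`, `target`, `f1`, `f2`) are all assumptions and are not taken from the research described here.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified power-iteration PageRank over a dict {page: [outgoing links]}.

    Illustrative only: every link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # Pages without outgoing links spread their rank evenly.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical web graph: the farm pages exist but do not point to the target.
baseline = {
    "a": ["b"],
    "b": ["a"],
    "target": ["a"],
    "f1": ["f2"],
    "f2": ["f1"],
}

# Same pages, but the farm pages now point to each other and to the target.
with_farm = dict(baseline, f1=["f2", "target"], f2=["f1", "target"])

rank_before = pagerank(baseline)["target"]
rank_after = pagerank(with_farm)["target"]
print(f"target rank without farm links: {rank_before:.4f}")
print(f"target rank with farm links:    {rank_after:.4f}")
```

In this toy graph the target's score rises once the farm pages link to it, even though nothing about the target's content has changed. That is why link analysis alone cannot tell whether a highly ranked page is actually relevant.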
Our spam detection scheme uses a game to detect web spam, treating the information gathered as votes on whether or not a page is spam. A simpler scheme would collect data by letting users vote as they search, but it would also allow adversaries to control how a page is labeled. Our game is designed as a series of independent questions. For each question, players are shown the same query-snippet pair. (A snippet is a short piece of text that is assumed to be representative of the page content.) Players are given three choices for each question: classify the snippet as "Highly Relevant" or "Not Highly Relevant", or "Pass." Players get one attempt to answer each question and cannot change their minds.
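As a rough illustration of how answers from the game could be turned into votes, the sketch below tallies the three answer choices for a single query-snippet pair. The description above does not specify how votes are combined, so the simple majority rule, the `min_votes` threshold, and the output labels are assumptions made for this example only and carry none of the scheme's guarantees.

```python
from collections import Counter
from enum import Enum


class Answer(Enum):
    """The three choices offered to a player for each question."""
    HIGHLY_RELEVANT = "Highly Relevant"
    NOT_HIGHLY_RELEVANT = "Not Highly Relevant"
    PASS = "Pass"


def label_snippet(answers, min_votes=3):
    """Label one query-snippet pair by majority vote (hypothetical rule).

    Passes are ignored. Returns None when there are fewer than
    `min_votes` real votes or when the vote is tied.
    """
    counts = Counter(a for a in answers if a is not Answer.PASS)
    relevant = counts[Answer.HIGHLY_RELEVANT]
    not_relevant = counts[Answer.NOT_HIGHLY_RELEVANT]
    if relevant + not_relevant < min_votes or relevant == not_relevant:
        return None
    return "relevant" if relevant > not_relevant else "possible spam"


# Each player answers once; answers for the same pair are pooled.
answers = [
    Answer.HIGHLY_RELEVANT,
    Answer.HIGHLY_RELEVANT,
    Answer.NOT_HIGHLY_RELEVANT,
    Answer.PASS,
]
print(label_snippet(answers))
```

Because each player gets a single, final attempt per question, every entry in `answers` corresponds to exactly one player's vote.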
The game can provably detect content spam. There is also reason to believe that we can find provable guarantees for link-based spam detection algorithms. While work proceeds on eliminating link-based spam, content-spam detection techniques can ensure that any results presented to the user are relevant, even if the webmaster is engaging in illicit behavior.