PranamKolariDefense2.mp4
2:07:03  - 2 years ago
Weblogs, or blogs, are an important new way to publish information, engage in discussions, and form communities on the Internet. Spam blogs, or splogs, are blogs with auto-generated or plagiarized content whose sole purpose is to host profitable contextual ads and/or inflate the importance of linked-to sites. Though estimates vary, splogs account for more than 50% of blog content and present a serious threat to the continued utility of blogs. Splogs impact search engines by increasing computational overhead and reducing user satisfaction. Hence, search engines try to minimize the influence of spam, both prior to indexing and after indexing, by eliminating splogs, comment spam, social media spam, or generic web spam. In this work we further the state of the art of splog detection prior to indexing.
First, we have identified and developed techniques for splog detection in a supervised machine learning setting. While some of these are novel, others confirm that techniques that have worked well for e-mail and Web spam detection remain useful in a new domain, the blogosphere. Specifically, our techniques identify spam blogs using the URL, home page, and syndication feeds. Because these techniques must be usable prior to indexing, our effort emphasizes fast online detection.
Second, we have developed a novel system that filters out spam in a stream of update pings from blogs. Our approach applies filters serially, in order of increasing detection cost, which better supports balancing cost and effectiveness. We have used such a system to support multiple blog-related projects, both internally and externally.
Next, we have developed an approach for updating classifiers in an adversarial setting. We show how an ensemble of classifiers can co-evolve and adapt when used on a stream of unlabeled instances susceptible to concept drift. We discuss how our system is amenable to such evolution by discussing approaches that can feed into it.
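The idea of applying filters serially in increasing cost of detection can be illustrated with a minimal sketch. This is not the system described in the talk: the stage names, the cost units, the spam tokens, the repetition threshold, and the pre-fetched `PAGES` table are all hypothetical stand-ins chosen so the example runs without network access. Each stage either returns a verdict or defers to the next, more expensive stage.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Stage:
    name: str
    cost: float  # relative cost of running this stage (hypothetical units)
    judge: Callable[[str], Optional[bool]]  # True=splog, False=legitimate, None=undecided


# Hypothetical stand-in for fetching a blog's home page.
PAGES = {
    "http://myblog.example.org/": "Today I wrote about my garden.",
    "http://news-mix.example.net/": "buy now buy now buy now best deals",
}


def url_tokens(url: str) -> Optional[bool]:
    """Cheap stage: flag URLs containing two or more assumed spam tokens."""
    hits = sum(tok in url.lower() for tok in ("cheap", "casino", "pills"))
    return True if hits >= 2 else None


def home_page(url: str) -> Optional[bool]:
    """Costly stage: flag pages whose text is highly repetitive (assumed heuristic)."""
    text = PAGES.get(url)
    if not text:
        return None
    words = text.lower().split()
    repeat_ratio = 1 - len(set(words)) / len(words)
    return repeat_ratio > 0.4


def classify_ping(url: str, stages: List[Stage]) -> Tuple[Optional[bool], float, str]:
    """Run stages from cheapest to most expensive; stop at the first verdict."""
    spent = 0.0
    for stage in sorted(stages, key=lambda s: s.cost):
        spent += stage.cost
        verdict = stage.judge(url)
        if verdict is not None:
            return verdict, spent, stage.name
    return None, spent, "undecided"


STAGES = [Stage("home_page", 50.0, home_page), Stage("url_tokens", 1.0, url_tokens)]

if __name__ == "__main__":
    for ping in ("http://cheap-casino.example.com/", "http://myblog.example.org/"):
        print(ping, classify_ping(ping, STAGES))
```

In this sketch a ping that the URL stage can decide never pays for the home-page stage, which is the cost and effectiveness trade-off the serial arrangement is meant to expose.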
Finally, we have characterized the specific nature of spam blogs along various dimensions, formalized the problem, and created general awareness of the issue. We are the first to formalize and address the problem of spam in blogs and to identify the general problem of spam in social media. We discuss how the lessons learned can guide follow-up work on spam in social media, an important new problem on the Web.