Google Tech Talks
August 17, 2006
Ziv Bar-Yossef joined Google from the Technion - Israel Institute of Technology in Haifa, Israel. He received his PhD from UC Berkeley in 2002, and was a Research Staff Member at the IBM Almaden Research Center prior to joining Technion.
This was an academic study conducted before Ziv Bar-Yossef joined Google and does not represent Google's views.
ABSTRACT
We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from a search engine's index using only the search engine's public interface?
In this paper we introduce two novel sampling techniques: a lexicon-based technique and a random walk technique. Our methods produce biased sample documents, but each sample is accompanied by a corresponding "weight", which represents the probability of this document to be selected in the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well known Monte Carlo simulation methods: rejection sampling, importance sampling and the Metropolis-Hastings algorithm.
We analyze our methods rigorously and prove that under plausible assumptions, our techniques are guaranteed to produce near-uniform samples from the search engine's index. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long or highly ranked documents.Google Tech Talks
August 17, 2006
Ziv Bar-Yossef joined Google from the Technion - Israel Institute of Technology in Haifa, Israel. He r...all »Google Tech Talks
August 17, 2006
Ziv Bar-Yossef joined Google from the Technion - Israel Institute of Technology in Haifa, Israel. He received his PhD from UC Berkeley in 2002, and was a Research Staff Member at the IBM Almaden Research Center prior to joining Technion.
This was an academic study conducted before Ziv Bar-Yossef joined Google and does not represent Google's views.
ABSTRACT
We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from a search engine's index using only the search engine's public interface?
In this paper we introduce two novel sampling techniques: a lexicon-based technique and a random walk technique. Our methods produce biased sample documents, but each sample is accompanied by a corresponding "weight", which represents the probability of this document to be selected in the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well known Monte Carlo simulation methods: rejection sampling, importance sampling and the Metropolis-Hastings algorithm.
We analyze our methods rigorously and prove that under plausible assumptions, our techniques are guaranteed to produce near-uniform samples from the search engine's index. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long or highly ranked documents.«
Download is starting. Save file to your computer. If the download does not start automatically, right-click this link and choose "Save As". How to get videos onto the iPod or PSP.