ECIR 2011, pp. 479-490.
selection [paper], [slides], [slides], [bib]
Full 40-page version with all proofs: [paper (arXiv:1012.3502)], [bib] (version Dec 2010)
Project page: Unique recall
Assume you crawl 20% of the Web. Are you able to learn 80% of the available information? This paper develops an analytic model (using generally accepted assumptions of power-law distributions in data) to show that we can expect to learn less than 40% of the Web's content; hence the 80-20 rule does not hold. The paper further describes a new family of power-law distributions that remain invariant under sampling, i.e., randomly sampling from such a distribution again yields the original distribution in the sample.
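The question above can be made concrete with a small calculation. The following is a minimal sketch, not the paper's model: it assumes each unique piece of information occurs k times, with k drawn from a truncated power law P(k) ∝ k^(-alpha) for k = 1..k_max, and that every copy is crawled independently with probability r. Under these assumptions, a piece with k copies is seen at least once with probability 1 - (1 - r)^k. The function names, `alpha`, and `k_max` are illustrative choices and do not come from the paper.

```python
def power_law_pmf(alpha, k_max):
    """Truncated power law P(k) proportional to k^(-alpha), for k = 1..k_max."""
    weights = [k ** -alpha for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_unique_recall(r, alpha, k_max=10_000):
    """Expected fraction of unique pieces seen at least once.

    Assumption: each copy is retained independently with probability r,
    so a piece with k copies is missed with probability (1 - r)^k.
    """
    pmf = power_law_pmf(alpha, k_max)
    return sum(p * (1.0 - (1.0 - r) ** k) for k, p in enumerate(pmf, start=1))


if __name__ == "__main__":
    # Crawl 20% of the copies and compare a few hypothetical exponents.
    for alpha in (1.5, 2.0, 3.0):
        print(alpha, round(expected_unique_recall(0.2, alpha), 3))
```

In this toy setting, the recall equals r when every piece occurs exactly once and grows as redundancy increases. The exponents and the resulting values are only illustrative and should not be read as the paper's results.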