Demonry.com | http://don-metzler.net/presentations/broder-cikm10.pdf
Exploiting Site-Level Information to Improve Web Search

Andrei Broder1, Evgeniy Gabrilovich1, Vanja Josifovski1, George Mavromatis1, Donald Metzler2, and Jane Wang1
[Affiliations 1 and 2 are not preserved in the extracted text]

Site-Specific Scoring
• Corpus structure is known to be an important factor for effective textual scoring
• Textual scoring of Web documents often ignores the local structure of the Web (i.e., a page resides within a site)
  – Global structure is captured by, e.g., PageRank
• We propose scoring pages in the context of their site
• Contributions
  – Methods for representing and indexing entire Web sites
  – Scoring approach that combines page and site evidence
  – Significant improvements in search quality compared to page-only scoring

Site-Specific Scoring
• Dual index architecture
  – Traditional page search index
  – Novel additional site representation search index
    • 2 variants: page-based and “anchor text”-based
• Pages are scored according to a combination of their page-index and site-index scores
  – N.B.: The framework can also be used for site retrieval

Scoring Architecture
[Figure: scoring architecture diagram; apart from "Page" and "index", the labels are not recoverable from the extracted text]

Site Representations
• Input: a set of pages sampled from site S
• Output: textual representation of site S. Two possibilities:
  1. Page-based site representation
     • Concatenation of the content of the pages in the set to form a “meta-document”
     • Representation may contain a large amount of non-relevant information
  2. “Anchor text”-based site representation
     • Concatenation of the external anchor text for the pages in the set
     • More focused than the page-based representation
     • Helps overcome the anchor text sparsity problem for individual pages

Scoring
• The score of a page P with respect to query Q is a simple linear combination of the page score and the site score:
  [Equation not preserved in the extracted text]
• Other combination approaches are possible
• It is also possible to use the site score as a feature within a machine-learned ranking function

Scoring
• Page and site scores are computed according to BM25F-SD, which has the following form:
  [Equation not preserved in the extracted text]
• where wt(.,P) are BM25F scores for unigrams, phrases, and term proximity matches, and the λ’s are free parameters

Experimental Setup
• Experiments use a random sample of queries from a major commercial Web search engine
  – Train: 20,120 queries, 416,183 query-page pairs
  – Test: 3,556 queries, 139,940 query-page pairs
• Human relevance judgments
  – Grades: Perfect, Excellent, Good, Fair, and Bad
• Metrics
  – DCG, NDCG, and ERR (a metric recently proposed by Chapelle et al.)
• Baseline: BM25F-SD using only the page index, with no site information

Site Index Details
• Site index consists of 207,222 sites
• Covers 65% of all Web search results retrieved over 4 months
• “Anchor text”-based site index is 43% smaller than the page-based site index (both produced from samples)
[Figure: bar chart of index size in GB (axis from 0 to 3) for the anchor text-based and page-based site indexes]

Results
[Results slide content not preserved in the extracted text]

Conclusions
• Textual scoring for Web search can be consistently and significantly improved by combining page and site-level scores
• “Anchor text”-based site representation is both smaller and more effective than the page-based representation
• Our framework provides a means for retrieving entire sites, rather than just pages (URLs), in response to a query
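
Illustrative sketch (not from the slides): the Scoring slide describes the final score of a page as a simple linear combination of its page-index score and its site-index score, but the exact equation is not preserved here. The Python below shows one plausible form under stated assumptions. The single weight `alpha`, the names `ScoredPage`, `combine_scores`, and `rerank`, and the fallback of 0.0 for pages whose site is absent from the site index are all hypothetical; the slides only say that the site index covers 65% of results and do not describe how uncovered pages are handled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ScoredPage:
    """A retrieved page with its textual score from the page index."""
    url: str
    site: str          # host the page belongs to
    page_score: float  # e.g. a BM25F-style score computed on the page index


def combine_scores(page_score: float, site_score: float, alpha: float = 0.7) -> float:
    """Linear combination of page and site evidence.

    ASSUMPTION: a convex combination with one weight `alpha`; the
    slides do not give the actual weights or parameterization.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * page_score + (1.0 - alpha) * site_score


def rerank(pages: List[ScoredPage],
           site_scores: Dict[str, float],
           alpha: float = 0.7) -> List[ScoredPage]:
    """Order pages by combined score, highest first.

    ASSUMPTION: a page whose site is missing from the site index
    receives a site score of 0.0.
    """
    def final_score(page: ScoredPage) -> float:
        site_score = site_scores.get(page.site, 0.0)
        return combine_scores(page.page_score, site_score, alpha)

    return sorted(pages, key=final_score, reverse=True)


if __name__ == "__main__":
    candidates = [
        ScoredPage("http://x.example/a", "x.example", 2.0),
        ScoredPage("http://y.example/b", "y.example", 1.8),
    ]
    # Site-index scores for the same query (made-up numbers).
    site_index_scores = {"y.example": 2.0}
    for page in rerank(candidates, site_index_scores, alpha=0.5):
        print(page.url)
```

With `alpha=1.0` the sketch reduces to the page-only baseline; lowering `alpha` lets strong site-level evidence promote a page whose own text matches the query less well. In practice the weight would be tuned on the training queries rather than fixed by hand.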