Multimedia Information Systems VO/KU (706.052/706.053)
Web Search Engines
Denis Helic
IICM, TU Graz
Historical (1)
- 1990: Table of contents for internet: Archie
- Data gatherer: looking up content of anonymous ftp-servers
- Stored the retrived data in a database (index)
- Regular expressions based access to the gathered data
Historical (2)
- 1992: Gopher system - a competitor to the Web
- Index of registered Gopher-sites
- Search engine Veronica supported boolean operations
- End of 1994: 15 Mio Gopher, FTP and HTML-pages
- User complaints, didn't find anything
Historical (3)
- Beginning 1993: World Wide Web Wanderer
- Robot, spider, crawler, wanderer
- Explores the Web by following links
- If you repeat it many times - the whole Web will be explored!
Historical (4)
- First versions of robots poorely implemented - retrieved pages a number of times per day
- Caused controversy - network performance degradation
- Controversy remained: robots bad or good?
- You can configure your Web server to forbid crawling
Historical (5)
- Nov. 1993: Aliweb (short for Archie-Like Indexing of the Web)
- Index was created and submited by owner of Web-Site itself
- Advantage: does not consume bandwith
- Disadvantage: Users need to create the index file
- Small index!
- Today typically combinations of submission and crawling
Historical (6)
- End of 1993: Robots took over the Web!
- 1993/1994: Jumpstation, World Wide Web Worm, RBSE Spider, Webcrawler
- Robots gather the data and store it in the index
Historical (7)
- Jumpstation: Title and header data
- Query results listed as they appear in the index
- World Wide Web Worm: Title and URLs
- Query results listed as they appear in the index
Historical (8)
- RBSE Spider: Title, URLs
- Query results sorted according to word counts
- Webcrawler: Full-text index
- The first robot to index the whole text of a page
Historical (9)
- 1994 Lycos
- Query results sorted acoording to word proximity
- 1995: Altavista, Inktomi
- Altavista: Phrase, boolean search, user interface (tips)
- Inktomi: investigated links between pages
Historical (10)
- Parallel to robots - Web directories
- 1994: Galaxy
- Browsable hierarchy of categories
- Documents checked by editors and assigned to categories
Historical (11)
- 1994: Yahoo
- Browsable and searchable directory!
- Checking by editors for inclusion into categories
- Smaller index but better relevance
- Easier for inexperienced users to find relevant information
Today (1)
- Commercial and making a lot of money!
- Pay for ranks by keywords
- For example, Google sells keywords are auctioned
- Companies pay keyword fees per day
- Try to make money on keywords on their own, e.g. by comparing prices on a particular keyword
Today (2)
- Number of pages in WWW: about 20 billions
- There is also The Deep Web behind firewalls, in dynamic content, ...
- Searching for other media types, e.g. images, videos, ...
- Language specific search
Today (3)
- Google: around 15 billions web pages
- 20 billion items in the index
- Google: 200 million search queries each day
- In the beginning: 10.000 Linux computers
- But now a lot of Google operations is secret
Categories
- Index search engines with spiders/robots
- Catalog search engines (Web directories)
- Combinations of index and catalog search engines
- Meta search engines
- Recommendation systems
Index search engines
- Robots collect data by following links
- Complex web-pages: HTML, CSS, JavaScript, text as graphics, flash, frames...
- problems for robots: noframes
- In experimental phase: indexing of Flash by Google
- The gathered data stored in database: user gets list of ranked search-results
Ranking (1)
- No problems in finding information
- Problems in ranking (millions of) results
- Ranking strategies
- Word counts (how many times does a search word appears?)
- Proximity (how close search words are?)
Ranking (2)
- Position of words in a document (title, meta tags, ...)
- Title and meta information
- Sponsors
- Combinations of ranking strategies
- Real breaktrough: linking
Google Ranking (1)
- Currently Google has the best results
- Ranking of Google based on two components
- Hits (Plain, Fancy) and combinations of hits
- PageRank (most important)
Google Ranking (2)
- How does it work?
- Find documents with hits
- Apply PageRank
- Calculate weigths
Google Ranking (3)
- Plain hits - full-text hits (words are somewhere in the text)
- Fancy hits
- URL, Meta Tag
- Font sizes of the text - relative to the document
- Anchor text (most important)
- Title (second important)
Google Ranking (4)
- Anchor text (text of a link)
- Counts towards the destination document
- Consequences
- If destination document is an image you can still find it!
- Similar with e-mails
PageRank (1)
- Google robot investigates links on the Web
- Calculationb of link statistics for the Web
- Pages that have more links pointing to them get higher PageRank
- Higher PageRank - more relevant
PageRank (2)

- d - constant, usually 0.85
- PR(A) - PageRank of Page A
- T1 ... Tn - all pages pointing to Page A
- PR(T1) ... PR (Tn) - PageRanks of pages pointing to Page A
- C(Tx) - number of outgoing links from Page Tx
PageRank (3)
- Formula is iterative
- To calculate PR(A) you need to know PR(T1) ... PR(Tn) - but you don't know it
- Start with 0 for all PRs and iterate until there is no difference in values
PageRank (4)
- The formula converges ;)
- For small networks 20-40 iteration steps needed
- For big networks - possibly millions of iterations (recollect 10.000 machines)
PageRank (5)
- Each page gives its PageRank to pages that it points to
- No discrimination each page shares its PageRank equally:

PageRank (6)
- PageRank forms a probability distribution
- The normalized sum of all PRs (in closed topology) is equal to 1

PageRank Example (1)

PageRank Example (2)

PageRank Example (3)

PageRank Example 3: Discussion (1)
- Average < 1
- Page C wastes its PR
- If no page wastes its PR Average = 1
- If number of pages is very high Average is approx 1
PageRank Example 3: Discussion (2)
- PageRank calculates probability that you can access a page if you browse randomly
- Each page gets at least something:

PageRank Example 3: Discussion (3)

- Obviously: PR(C) > PR(B) > PR(A)
PageRank Example (4)

PageRank Example (5)

PageRank Example (6)

PageRank Analysis (1)
- Hierarchy increases PR of the page on the top (homepage)
- If you point out you give away some of your PR
- Hope that what you give you will get back
- If links point to your homepage you will get a lot of PR
- Especially if a page with high PR points in!
PageRank Analysis (2)
- What does PageRank actually measure?
- Popularity!
- People create links to a page becuase they know about the page!
- Well-known page gets a lot of links - high PR
PageRank Analysis (3)
- It relies on the very nature of the Web and its community
- Social aspects, linking
- Important to have clean URLs so people can easily link to your pages
- Reasons for the success!
- But also disadvantage: is the most popular page always the most relevant or of the best quality?
PageRank Example (7): Google Boombing

PageRank Example (8): Google Boombing

Google Bombing Analysis (1)
- Quality is the most important
- Quality will always lead to popularity
- People will point to your page!
- Google can remove you from the index because of bombing!
Google Bombing Analysis (2)
- When bombing can be successful?
- Take some unusual anchor text
- Make many links with that text to a known page
- Submit a query with that text
Google Bombing Analysis (3)
Teoma: ExpertRank
- Teoma refines PageRank into ExpertRank
- PageRank - general popularity, how many links, no matter what links
- Subject-Specific Popularity, how many links, but which links
- Same as PageRank but counts only pages that belong to the same subject
- Teoma classifies pages into comunities, as they occur on the Web
PageRank: more info (2)

Catalog search engines
- Editorial office checks links/pages
- Less pages in index
- Easier to find (for beginners)
- e.g. Yahoo, Google directory
Meta search engines
- Search simultaneously more search engines
- Collect results
- No specific syntax to learn
- e.g. Metacrawler, Mamma
Recommendation systems (1)
- Bookstore: client likes book; users who liked that book were interested in these books as well
- Shops
- Books, videos, cds, ...
- e.g. Amazon
Recommendation systems (2)
- Social Bookmarking
- Tagging of web pages
- You can browse the tags and access associated pages
- Combine this with recommendation techniques
- e.g. del.icio.us