Modern web search engines are complex software systems built on technology that has evolved over many years. The largest search engines, such as Google and Yahoo!, use tens or hundreds of thousands of computers to process billions of web pages and return results for thousands of searches per second. This volume of queries and text processing requires the software to run in a highly distributed environment with a high degree of redundancy. Modern search engines have the following main components:
Crawl
The first step in preparing web pages for search is to find them so that they can be indexed. In the past, search engines started with a small list of URLs as a seed list, fetched the content, parsed it for links, and fetched the pages those links pointed to, which in turn provided new links; the cycle continued until enough pages were found. Most modern search engines now use a continuous crawl rather than discovery from a seed list. The continuous crawl is an extension of the discovery method, but no seed list is needed because the crawl never stops. The current list of pages is revisited at regular intervals, and new pages are found as links are added to (or removed from) those pages.

Many search engines use sophisticated scheduling algorithms to decide when to revisit a particular page. These algorithms range from a constant visit interval, with higher priority for pages that change more frequently, to an adaptive visit interval based on several criteria: frequency of change, popularity and overall quality of the site, speed of the web server serving the page, and resource constraints such as the amount of hardware and the bandwidth of the Internet connection.

Search engines crawl many more pages than they make available for searching, because crawlers find a lot of duplicate content on the web and many pages have no useful content. Duplicate and useless content often represents more than half of the pages available for indexing.
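The seed-list discovery cycle and the filtering of duplicate content described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "web" is a hypothetical in-memory dictionary (`TOY_WEB`) rather than real HTTP fetches, the function name `crawl` and the `max_pages` limit are invented for the example, and duplicates are detected with an exact hash of the page text, which is far cruder than what a production crawler would use.

```python
import hashlib
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch pages over HTTP and parse the HTML for links.
TOY_WEB = {
    "a.example/": ("home page", ["a.example/about", "b.example/"]),
    "a.example/about": ("about us", ["a.example/"]),
    "b.example/": ("home page", ["b.example/news"]),  # same text as a.example/
    "b.example/news": ("latest news", ["a.example/about", "c.example/missing"]),
}


def crawl(seeds, web, max_pages=100):
    """Breadth-first discovery from a seed list.

    Returns (kept, fetched): the URLs whose content was not a duplicate,
    and the number of pages actually fetched.
    """
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen_urls = set(seeds)    # URLs already queued, to avoid refetching
    seen_hashes = set()       # content fingerprints, to detect duplicates
    kept = []
    fetched = 0

    while frontier and fetched < max_pages:
        url = frontier.popleft()
        page = web.get(url)
        if page is None:      # dead link: nothing to fetch
            continue
        fetched += 1
        text, links = page

        # Keep the page only if its content has not been seen before.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            kept.append(url)

        # Links on the page feed the next round of discovery.
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)

    return kept, fetched
```

Calling `crawl(["a.example/"], TOY_WEB)` fetches more pages than it keeps, which mirrors the point that crawlers visit many more pages than end up searchable. A continuous crawl would replace the one-shot loop with a scheduler that re-queues known URLs at their revisit intervals.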
Link Map
Pages discovered by crawlers are fed into a service (often distributed) that creates a link map of the pages. A link map is a graph structure in which pages are represented as nodes connected by the links among those pages. This data is stored in data structures that allow fast access by algorithms that compute a popularity score for pages on the web, essentially based on how many links point to a web page and on the quality of those links. One such algorithm, PageRank, proposed by Google founders Larry Page and Sergey Brin, is well known and has attracted a lot of attention. The idea of using link analysis to compute a popularity rank is older than PageRank, and many variants of the same idea are currently in use. These ideas fall into three main categories: rank of individual pages, rank of web sites, and nature of web site content (Jon Kleinberg's HITS algorithm). Search engines often differentiate between internal links and external links, on the assumption that links on a page pointing to other pages on the same site are less valuable because they are often created by web site owners to artificially increase the rank of their web sites and pages. Link map data structures typically also store the anchor text embedded in the links, because anchor text often provides a good-quality short summary of a web page's content.
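As an illustration of the link map and of link-based popularity scoring, the following Python sketch builds a small graph with anchor text and runs a simplified power iteration in the spirit of PageRank. It is not the algorithm as deployed by any search engine: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the names `EDGES`, `build_link_map` and `pagerank` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical crawl output: (source page, target page, anchor text).
EDGES = [
    ("a.example", "c.example", "a great guide"),
    ("b.example", "c.example", "guide"),
    ("c.example", "a.example", "home"),
    ("d.example", "c.example", "see this"),
]


def build_link_map(edges):
    """Return (out_links, anchors).

    out_links maps each page to the pages it links to; anchors maps each
    page to the anchor texts of the links pointing at it.
    """
    out_links = defaultdict(list)
    anchors = defaultdict(list)
    for source, target, anchor_text in edges:
        out_links[source].append(target)
        anchors[target].append(anchor_text)
    return dict(out_links), dict(anchors)


def pagerank(out_links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over the link map."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every page starts each round with the same small base score.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links.get(node, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


out_links, anchors = build_link_map(EDGES)
ranks = pagerank(out_links)
```

In this toy graph, `c.example` receives the most inbound links and ends up with the highest score, and `anchors["c.example"]` collects the short descriptions other pages used when linking to it. A variant that discounts internal links would assign a lower weight to edges whose source and target share a site before distributing rank.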