How Google’s Search Engine Works: Crawling and Indexing

Search engines perform a marvellous feat. From a text query, they produce a list of (usually) relevant results drawn from the billions of pages on the World Wide Web. Moreover, they do this at lightning speed, with results appearing as you type.

To generate results, search engines employ three processes:

  1. Crawling: Scour the internet for content, analysing the code and content of each URL they find.
  2. Indexing: Store and organise the content found during the crawling process. Once a page is in the Index, it is eligible to be displayed as a result for relevant queries.
  3. Ranking: Serve the content they judge as the best answer to a searcher’s query, ordering by most relevant to least relevant.

Search engines do not achieve this by searching the live web each time but by accessing a massive database of web content called the search engine’s ‘Index’. This database holds information about the content of millions of websites (e.g. text, images and videos) and the links between them. Search engines use automated programs called robots (a.k.a. spiders, bots or crawlers) to investigate new websites and record any changes to pages already in their Index. This process is known as Crawling. New content is discovered by following links.

Crawling

There is no central registry of all web pages, so Google constantly looks for new and updated pages and adds them to its list of known pages, the Index. This is known as ‘URL discovery’. Many pages are already known to Google because it has visited them before. Other pages are found when Google follows a link to a new page. For example, when a new blog post is published, it is generally linked to from a category page that Google has already visited. Pages can also be discovered when you submit a sitemap for Google to crawl, as in the sketch below.
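To make the sitemap idea concrete, here is a minimal sketch of how a small XML sitemap could be generated with Python’s standard library. The URLs and the output file name are hypothetical placeholders, not anything prescribed by Google.

```python
# Minimal sketch: generating a small XML sitemap with the standard library.
# The URLs and output file name below are hypothetical placeholders.
import xml.etree.ElementTree as ET

PAGES = [
    "https://www.example.com/",
    "https://www.example.com/blog/",
    "https://www.example.com/blog/new-post/",
]

# The sitemap protocol uses a <urlset> root in this namespace,
# with one <url>/<loc> entry per page.
urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for page in PAGES:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = page

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```

The resulting file can then be referenced from your robots.txt or submitted through Google Search Console so the crawler knows where to find it.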

When Google discovers a new page’s URL, it may visit (or crawl) the page to find out what is on it. To do this, Google uses colossal server farms of computers to crawl billions of web pages every day. The program that does the crawling is called Googlebot (also known as a crawler, robot, bot or spider). Googlebot decides algorithmically which sites to crawl and how often to crawl them.
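As a rough illustration of what any crawler does (fetch a page, extract its links, and queue them for later visits), here is a heavily simplified sketch in Python. The seed URL is a placeholder, and a real crawler adds politeness delays, robots.txt checks and far more sophisticated scheduling.

```python
# Heavily simplified crawler sketch: fetch pages, extract links, queue new URLs.
# Real crawlers add politeness delays, robots.txt checks and scheduling logic.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seed, max_pages=10):
    queue, seen = deque([seed]), {seed}
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that cannot be fetched
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)       # discovered a new URL
                queue.append(absolute)   # schedule it for crawling
    return seen

# crawl("https://www.example.com/")  # placeholder seed URL
```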

Remember that Googlebot does not crawl every page it discovers. Some pages may be blocked from crawling by the site owner (robots.txt, sketched below, is one common mechanism), while others may be inaccessible, either intentionally or unintentionally. More about this in a later video.
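The sketch below uses Python’s built-in urllib.robotparser to check whether a given user agent is allowed to fetch a URL. The domain, paths and rules are hypothetical examples.

```python
# Sketch: checking a (hypothetical) site's robots.txt before crawling a URL.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")
robots.read()  # downloads and parses the robots.txt file

# A rule such as "Disallow: /private/" would make the first check return False.
print(robots.can_fetch("Googlebot", "https://www.example.com/private/page.html"))
print(robots.can_fetch("Googlebot", "https://www.example.com/blog/"))
```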

Indexing

Once a page has been crawled, Google tries to understand it. This process is called indexing, and it includes processing and analysing the page’s textual content and key content tags and attributes, such as <title> elements and alt attributes, as well as images and videos.
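To give a rough idea of the kind of parsing involved, this sketch pulls the <title> text and image alt attributes out of a snippet of HTML using Python’s built-in parser. The HTML itself is made up for the example, and a real indexer does far more than this.

```python
# Sketch: extracting the <title> text and image alt attributes from a page,
# the kind of content tags an indexer looks at. The HTML is made up.
from html.parser import HTMLParser

class PageContentParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.image_alts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.image_alts.append(alt)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

html = """<html><head><title>Chocolate Cake Recipe</title></head>
<body><img src="cake.jpg" alt="A chocolate cake on a plate"></body></html>"""

parser = PageContentParser()
parser.feed(html)
print(parser.title)       # Chocolate Cake Recipe
print(parser.image_alts)  # ['A chocolate cake on a plate']
```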

During indexing, Google also determines whether a page is a duplicate of another page on the internet or the ‘canonical’, i.e. the original content. The canonical is the page that might be shown in search results. To select the canonical, Google first groups together (also known as clustering) the pages it has found on the internet that have similar content, and then selects the one that is most representative. The other pages in the group are alternate versions that may be served in different contexts, for example if the user is searching from a mobile device or is looking for a particular page from that group.
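The sketch below gives a toy version of that idea: pages are grouped by a fingerprint of their normalised text, and one page per group is picked as the representative. Grouping by an exact fingerprint and the ‘shortest URL wins’ tie-break are illustrative assumptions, not Google’s actual signals.

```python
# Toy illustration of duplicate clustering and canonical selection.
# Grouping by an exact fingerprint of normalised text and picking the
# shortest URL are illustrative assumptions, not Google's real method.
import hashlib
from collections import defaultdict

pages = {  # hypothetical URLs and their (simplified) textual content
    "https://example.com/post?ref=tw": "Chocolate cake recipe with cocoa.",
    "https://example.com/post":        "Chocolate cake recipe with cocoa.",
    "https://m.example.com/post":      "Chocolate  cake recipe with cocoa.",
    "https://example.com/other":       "A completely different article.",
}

def fingerprint(text):
    # Normalise whitespace and case, then hash, so near-identical copies collide.
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

clusters = defaultdict(list)
for url, text in pages.items():
    clusters[fingerprint(text)].append(url)

for urls in clusters.values():
    canonical = min(urls, key=len)  # toy rule: shortest URL is "most representative"
    alternates = [u for u in urls if u != canonical]
    print("canonical:", canonical, "alternates:", alternates)
```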

Google also stores signals about the canonical page and its contents, which might be used in the next stage to serve the page in search results. Signals include the page’s language, the country the content is most relevant to, and the page’s usability.
