Maximising Index Inclusion
Getting your site crawled and indexed is fundamental to appearing in search engine results pages (SERPs). As a website owner, you want to ensure that as many of your pages are indexed as possible.
Frequently a search engine will be able to access parts of your site by crawling, but some pages or sections might be inaccessible. To maximise your natural search performance, it is important to ensure that search engines are able to discover all the content you want indexed.
Common issues include:
Your site is new
It takes Google a few days to visit and index new sites. If your site is still not indexed after a few weeks, then this could be because there are no other sites linking to it.
You can tell Google about your site by creating a Google Search Console account and adding a sitemap (see below).
If Google has judged that your site is of poor quality, then it may not be indexed. This is called a manual action. The status of your website can be seen in Google Search Console.
Content hidden behind a login
If your site requires users to log in or complete a form before accessing certain content, search engines will not be able to access these pages.
Text displayed using non-text content?
Google cannot read non-text media (images, video, GIFs, etc.), so these should not be used to display text that you want indexed. It is best practice to add text within the HTML markup of your webpage. Furthermore, non-text content is not readable by screen readers, making your site inaccessible to the partially sighted.
Search engines cannot follow your site navigation?
To read your site, a crawler needs a path of links to guide it from page to page. A page will not be indexed if it is not linked to from other pages.
Website best practice
Maximise Crawl budget
Crawl budget is the average number of pages that the Googlebot will crawl on your site before leaving. If you have a very large site (e.g. an eCommerce site), you need to ensure that your crawl budget is not all used up before it has indexed the important pages on your site.
Crawl budget optimisation ensures that Googlebot does not waste time crawling unimportant pages while ignoring your important ones. Pages you may want to exclude include:
- Checkout pages
- Pages with very little content
- Duplicate pages
To ensure that your site is indexed appropriately, you need to tell Google which pages to prioritise. There are three ways to do this:
- Robots.txt file
- Canonical Tag
- Google Search Console parameters
Create a Robots.txt file
Robots.txt files are placed in a website’s root directory (e.g. vendlab.com/robots.txt) and tell search engines which parts of your site they should and should not crawl, as well as how quickly they should crawl it. All sites should have a robots.txt file, and most CMSs will create one automatically.
Robots.txt files are an easy way of stopping Google from crawling parts of your site. However, be careful to ensure that the file does not contain errors and is not blocking pages you actually want indexed.
Googlebot will look for the robots.txt file when it visits a site and act in the following ways:
- If it cannot find a robots.txt file, it will proceed to crawl the site.
- If it finds a robots.txt file for a site, it will usually follow its suggestions and proceed to crawl the site.
- If it encounters an error while trying to access a site’s robots.txt file, it will not crawl the site.
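A minimal robots.txt might look like the following; the paths and sitemap URL are illustrative, not part of any real site:

```text
# Allow all crawlers, but keep them out of checkout and cart pages
User-agent: *
Disallow: /checkout/
Disallow: /cart/

# Point crawlers at the sitemap
Sitemap: https://www.example.com/sitemap.xml
```

Note that Disallow stops compliant crawlers from fetching those paths, but it does not guarantee the URLs never appear in search results; pages that must stay out of the index should also carry a noindex directive.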
Use a flat directory structure
A site structure is how the pages on a website are organised. In general, you should create a structure where all of your site’s pages are only a few links away from one another. This is called a flat structure and makes it easy for Google and other search engines to crawl all of your site’s pages.
Use a consistent URL structure
Your site’s URLs should follow a consistent and logical structure. This not only helps Google to crawl your site but also helps users understand where they are on your site. Furthermore, putting your pages under different categories gives Google extra context about each page in that category.
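For example, a consistent category-based URL structure might look like this (the domain and category names are illustrative):

```text
https://www.example.com/                          home page
https://www.example.com/mens/                     category
https://www.example.com/mens/shoes/               subcategory
https://www.example.com/mens/shoes/trail-runner   product
```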
Use Breadcrumb navigation
Breadcrumbs automatically add internal links to category and subcategory pages on your site and help both Google and users navigate a website.
e.g. for http://www.shop.com/category2/product
Home > Category 1 > Category 2 > Product name
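In HTML, a breadcrumb trail is typically an ordered list of links. A minimal sketch, with illustrative URLs:

```html
<nav aria-label="Breadcrumb">
  <ol>
    <li><a href="/">Home</a></li>
    <li><a href="/category1/">Category 1</a></li>
    <li><a href="/category1/category2/">Category 2</a></li>
    <li aria-current="page">Product name</li>
  </ol>
</nav>
```

Google also supports BreadcrumbList structured data, which can make the breadcrumb trail appear in your search listing.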
Create redirects for removed pages
Use Canonical tags
If Google crawls the same content on different web pages, it will not always know which page to show in search results. By adding a rel="canonical" tag pointing to your preferred page, you identify to search engines which is the preferred version of the content, i.e. the one to index.
For example, content management systems like Shopify can be configured to allow the same product to be accessed via multiple URLs, e.g.:
By adding a canonical tag pointing to the preferred URL, you tell Google which is the most important version of the content to index and avoid duplicate content issues.
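A canonical tag is a link element placed in the page’s head. A minimal sketch, with illustrative URLs:

```html
<!-- Placed in the <head> of each duplicate URL, e.g.
     https://www.example.com/collections/shoes/products/trail-runner -->
<link rel="canonical" href="https://www.example.com/products/trail-runner" />
```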
Create a sitemap
A sitemap is a list of the pages on your site that crawlers can access to discover and index your content. It is a simple way to ensure Google is notified of your highest-priority pages.
Sitemaps should be created using Google’s specification and submitted via Google Search Console.
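A minimal XML sitemap following the sitemaps.org protocol looks like this; the URLs and dates are illustrative:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/products/trail-runner</loc>
    <lastmod>2024-01-10</lastmod>
  </url>
</urlset>
```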
Identifying Website issues
Using the ‘Site’ Search Operator and Google Search Console reports
One method of checking your indexed pages is to use the advanced search operator ‘site:yourdomain.com’ in Google Search.
The number of results Google returns (see ‘About XX results’ above) gives you a good idea of how many pages on your site are indexed.
More accurate results can be obtained from the Index Coverage report in Google Search Console.
When crawling the pages on your site, a crawler may find errors. These can be viewed by visiting Google Search Console’s ‘Crawl Errors’ report.
Types of error include:
4xx errors are client errors, indicating that the requested URL contains bad syntax or cannot be found. The most common 4xx error is the ‘404 – not found’ error. These can occur because of a URL typo, a deleted page, or a broken redirect. When search engines encounter a 404, they cannot access the URL and it will be excluded from search. When users hit a 404, they will probably leave your site and you will lose business. It is best practice to fix or redirect 404 errors.
5xx errors are server errors. This means that the server on which the web page is located failed to fulfill the request to access the page. These normally happen because the request for the URL timed out, so the Googlebot abandoned the request.
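The status-code ranges above can be checked programmatically against your own server logs. A minimal sketch in Python (the list of log codes is illustrative) that groups responses into the categories used in crawl reports:

```python
def classify_status(code: int) -> str:
    """Group an HTTP status code into the categories used by crawl reports."""
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirect"      # e.g. 301 permanent redirect
    if 400 <= code < 500:
        return "client error"  # e.g. 404 not found
    if 500 <= code < 600:
        return "server error"  # e.g. 503 service unavailable
    return "unknown"

# Example: summarise status codes pulled from a crawl or server log
log_codes = [200, 301, 404, 200, 500, 404]
summary = {}
for code in log_codes:
    category = classify_status(code)
    summary[category] = summary.get(category, 0) + 1
```

Running the sketch over the example log would flag two client errors and one server error to investigate.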
Fixing Crawl Errors
There is an easy way to tell searchers and search engines that your page has moved – the 301 (permanent) redirect.
The 301 status code indicates that the page has moved permanently to a new location, so do not redirect URLs to irrelevant pages. If a page is ranking for a query and you use a 301 to redirect to a URL with different content, it will probably drop in rank position because it is no longer relevant to the query.
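On an Apache server, for example, a 301 redirect can be added to the site’s .htaccess file. A minimal sketch, with illustrative paths:

```apache
# Permanently redirect a removed page to its closest relevant replacement
Redirect 301 /old-product https://www.example.com/new-product
```

Other servers and most CMSs offer equivalent redirect settings; the key point is the same: redirect to a page with closely related content, not to an arbitrary URL.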