Crawling is the process used by search engine web crawlers (bots or spiders) to visit and download a page and extract its links in order to discover additional pages.
Pages known to the search engine are crawled periodically to determine whether any changes have been made to the page’s content since the last time it was crawled. If a search engine detects changes to a page after crawling a page, it will update it’s index in response to these detected changes.
HOW DOES CRAWLING WORK?
Search engines use their own web crawlers to discover and access web pages.
All commercial search engine crawlers begin crawling a website by downloading its robots.txt file, which contains rules about what pages search engines should or should not crawl on the website. The robots.txt file may also contain information about sitemaps; this contains lists of URLs that the site wants a search engine crawler to crawl.
Search engine crawlers use a number of algorithms and rules to determine how frequently a page should be re-crawled and how many pages on a site should be indexed. For example, a page which changes a regular basis may be crawled more frequently than one that is rarely modified.
HOW CAN SEARCH ENGINE CRAWLERS BE IDENTIFIED?
The search engine bots crawling a website can be identified from the user agent string that they pass to the web server when requesting web pages.
Here are a few examples of user agent strings used by search engines:
Googlebot User Agent
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Bingbot User Agent
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Baidu User Agent
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Yandex User Agent
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Anyone can use the same user agent as those used by search engines. However, the IP address that made the request can also be used to confirm that it came from the search engine – a process called reverse DNS lookup.
CRAWLING IMAGES AND OTHER NON-TEXT FILES
Search engines will normally attempt to crawl and index every URL that they encounter.
However, if the URL is a non-text file type such as an image, video or audio file, search engines will typically not be able to read the content of the file other than the associated filename and metadata.
Although a search engine may only be able to extract a limited amount of information about non-text file types, they can still be indexed, rank in search results and receive traffic.
You can find a full list of file types that can be indexed by Google available here.
CRAWLING AND EXTRACTING LINKS FROM PAGES
Crawlers discover new pages by re-crawling existing pages they already know about, then extracting the links to other pages to find new URLs. These new URLs are added to the crawl queue so that they can be downloaded at a later date.
Through this process of following links, search engines are able to discover every publicly-available webpage on the internet which is linked from at least one other page.
Another way that search engines can discover new pages is by crawling sitemaps.
Sitemaps contain sets of URLs, and can be created by a website to provide search engines with a list of pages to be crawled. These can help search engines find content hidden deep within a website and can provide webmasters with the ability to better control and understand the areas of site indexing and frequency.
Alternatively, individual page submissions can often be made directly to search engines via their respective interfaces. This manual method of page discovery can be used when new content is published on site, or if changes have taken place and you want to minimise the time that it takes for search engines to see the changed content.
Google states that for large URL volumes you should use XML sitemaps, but sometimes the manual submission method is convenient when submitting a handful of pages. It is also important to note that Google limits webmasters to 10 URL submissions per day.
Additionally, Google says that the response time for indexing is the same for sitemaps as individual submissions.