How Search Engines Crawl & Index: Everything You Need to Know
Optimizing websites without first understanding how search engines function is akin to publishing your great novel without first learning how to write.
Certainly, a thousand monkeys at typewriters will eventually create something useful (at least this monkey likes to think he does from time to time), but it’s a lot easier if you know the core elements of a task beforehand.
So we must understand how search engines work to fully understand how to optimize for them.
While we will be focusing on organic search, we must first briefly talk about one critical truth about search engines.
Paid Search Results
Not Google, not Bing, nor any other major search engine is in the business of providing organic listings.
That is to say, organic results are the means to an end, but do not directly generate revenue for them.
Without organic search results, Google’s paid search results would appear less relevant (Overture anyone?), thus reducing eyeballs and paid clicks.
Basically, Google and Bing (and the others) are advertising engines that happen to draw users to their properties with organic listings. Organic, then, is the means to the end.
Why does this matter?
It’s the key point driving:
- Their layout changes.
- The existence of search features like knowledge panels and featured snippets.
- The click-through rates (CTR) of organic results.
When Google adds a fourth paid search result to commercial-intent queries, it’s because of this.
When Google displays a featured snippet so you don’t have to leave Google.com to get an answer to your query… it is because of this.
Regardless of what change you see taking place, it’s important to keep this in mind and to always question not just what it will impact today but what further changes it implies may be on the horizon.
How Search Engines Work Today: The Series
Alright, now that we have that baseline understanding of why Google even provides organic results, let’s look at the nuts and bolts of how they operate.
To accomplish this we’re going to look at:
- Crawling and indexing
- Machine learning
- User intent
This piece will focus on indexing. So let’s dive in…
Indexing is where it all begins.
For the uninitiated, indexing essentially refers to the adding of a webpage’s content to Google’s index.
When you create a new page on your site there are a number of ways it can be indexed.
The simplest method of getting a page indexed is to do absolutely nothing.
Google has crawlers following links and thus, provided your site is in the index already and that the new content is linked to from within your site, Google will eventually discover it and add it to its index. More on this later.
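To make the "do nothing" route concrete, here is a minimal sketch of link-based discovery. It is not how Googlebot is actually built; the `SITE` dictionary, the example.com URLs, and the `discover` helper are all hypothetical stand-ins, and the in-memory dictionary replaces real HTTP fetches so the example runs on its own.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def discover(start_url, fetch):
    """Breadth-first walk of same-host links; returns URLs in discovery order."""
    host = urlparse(start_url).netloc
    seen = [start_url]
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.append(absolute)
                queue.append(absolute)
    return seen


# Hypothetical in-memory site standing in for real HTTP fetches.
SITE = {
    "https://example.com/": '<a href="/blog/">Blog</a> <a href="https://other.example/">Out</a>',
    "https://example.com/blog/": '<a href="/blog/new-post/">New post</a> <a href="/">Home</a>',
    "https://example.com/blog/new-post/": "<p>Fresh content</p>",
    "https://example.com/orphan/": "<p>No internal links point here</p>",
}

found = discover("https://example.com/", SITE.get)
```

In this toy site the new post is found because the blog page links to it, while the orphan page is never reached. That is the practical point: a new page with no internal links pointing at it gives a link-following crawler nothing to follow.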
But what if you want Googlebot to get to your page faster?
This can be important if you have timely content or if you’ve made an important change to a page you need Google to know about.
One of the top reasons I use faster methods is that I’ve either optimized a critical page or adjusted the title and/or description to improve click-throughs, and I want to know exactly when the changes were picked up and displayed in the SERPs so I know where the measurement of improvement starts.
In these instances there are a few additional methods you can use:
1. XML Sitemaps
There are always XML sitemaps.
Basically, this is a sitemap that is submitted to Google via Search Console.
An XML sitemap gives search engines a list of all the pages on your site, as well as additional details about them, such as when each was last modified.
But when you need a page indexed immediately, it’s not particularly reliable.
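For reference, here is a rough sketch of what such a file contains, built with Python’s standard library. The `urlset`, `url`, `loc` and `lastmod` elements and the namespace come from the sitemaps.org protocol; the `build_sitemap` helper, the example.com URLs and the dates are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML for an iterable of (url, last_modified) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod  # W3C date, e.g. YYYY-MM-DD
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical pages and modification dates.
PAGES = [
    ("https://example.com/", "2024-01-15"),
    ("https://example.com/blog/new-post/", "2024-02-01"),
]

sitemap_xml = build_sitemap(PAGES)
print(sitemap_xml)
```

You would save the output as something like sitemap.xml on your site and submit its URL in Search Console. Most CMS platforms and SEO plugins generate this file for you, so treat the script as an illustration of the format rather than something you need to maintain by hand.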
2. Request Indexing
In Search Console, you can “Request Indexing”.
You begin by clicking on the top search field, which reads by default, “Inspect any URL in domain.com”.
Enter the URL you want to be indexed, then hit Enter.
If the page is already known to Google you will be presented with a bunch of information on it. We won’t get into that here but I recommend logging in and seeing what’s there if you haven’t already.
The important button, for our purposes here, appears whether the page has been indexed or not – meaning that it’s good both for content discovery and for simply asking Google to pick up a recent change.
You’ll find the button …
Within a few seconds to a few minutes, you can search for the new content or URL in Google and find the change or new content picked up.
3. Host Your Content On Google
Crawling sites to index them is a time and resource-consuming process.
One alternative is to host your content directly with them.
This can be done a few different ways but most of us (myself included) have not adopted the technologies or approaches required and Google hasn’t pushed us to them.
We’re seeing the ability to give Google direct access to our content via XML feeds, APIs, etc. and unplug our content from our design.
Firebase, Google’s mobile app platform, gives Google direct access to the app content, bypassing any need to figure out how to crawl it.
This is the future – enabling Google to index content immediately, without effort, so it can then serve it in the format most usable based on the accessing technology.
While we aren’t quite where we need to be in our technologies to stress too much about this side of things, just know it is coming.
I cannot recommend enough following Cindy Krum’s MobileMoxie blog, where she discusses these and mobile-related subjects in great detail and with great insight.
4. And Bing, Too!
To get your content indexed and/or updated quickly by Bing, you will need a Bing Webmaster Tools account.
If you don’t have one, I can’t recommend it enough. The info provided within is substantial and will help you better assess problem areas and improve your rankings on Bing, Google and anywhere else – and probably provide a better user experience as well.
But for getting your content indexed you simply need to click: Configure My Site > Submit URLs
From there you enter the URL(s) you want indexed and click “Submit”.
So – that’s almost everything that you need to know about indexing and how search engines do it (with an eye towards where things are going).
We can’t really talk about indexing without talking about crawl budget.
Basically, crawl budget is a term used to describe the amount of resources that Google will expend crawling a website.
The budget assigned is based on a combination of factors, the two central ones being:
- How fast your server is (i.e., how much can Google crawl without degrading your user experience).
- How important your site is.
If you run a major news site with constantly updating content that search engine users will want to be aware of, your site will get crawled frequently (dare I say … constantly).
If you run a small barbershop, have a couple of dozen links, and rightfully are not deemed important in this context (you may be an important barber in the area but you’re not important when it comes to crawl budget) then the budget will be low.