Crawl optimization should be a priority for any large site looking to
improve its SEO efforts. By tracking, monitoring, and focusing
Googlebot, you can gain an advantage over your competition.
Crawl Budget

It’s important to cover the basics before discussing crawl
optimization. Crawl budget is the time or number of pages Google
allocates to crawl a site. How does Google determine your crawl budget?
The best description comes from an
Eric Enge interview of Matt Cutts.
The best way to think about it is that the number of
pages that we crawl is roughly proportional to your PageRank. So if you
have a lot of incoming links on your root page, we’ll definitely crawl
that. Then your root page may link to other pages, and those will get
PageRank and we’ll crawl those as well. As you get deeper and deeper in
your site, however, PageRank tends to decline.
Another way to think about it is that the low PageRank pages on your
site are competing against a much larger pool of pages with the same or
higher PageRank. There are a large number of pages on the web that have
very little or close to zero PageRank. The pages that get linked to a
lot tend to get discovered and crawled quite quickly. The lower PageRank
pages are likely to be crawled not quite as often.
In other words, your crawl budget is determined by authority. This should not come as a shock. But that was
pre-Caffeine. Have things changed since?
Caffeine

What is Caffeine? In this case it’s not the stimulant in your latte. But it
is a stimulant of sorts. In June of 2010,
Google rebuilt the way they indexed content.
They called this change ‘Caffeine’ and it had a profound impact on the
speed at which Google could crawl and index pages. The biggest change,
as I see it, was incremental indexing.
Our old index had several layers, some of which were
refreshed at a faster rate than others; the main layer would update
every couple of weeks. To refresh a layer of the old index, we would
analyze the entire web, which meant there was a significant delay
between when we found a page and made it available to you.
With Caffeine, we analyze the web in small portions and update our
search index on a continuous basis, globally. As we find new pages, or
new information on existing pages, we can add these straight to the
index. That means you can find fresher information than ever before—no
matter when or where it was published.
Essentially, Caffeine removed the bottleneck for getting pages indexed. The system they built to do this is aptly named
Percolator.
We have built Percolator, a system for incrementally
processing updates to a large data set, and deployed it to create the
Google web search index. By replacing a batch-based indexing system with
an indexing system based on incremental processing using Percolator, we
process the same number of documents per day, while reducing the
average age of documents in Google search results by 50%.
The speed at which Google can crawl is now matched by the speed of
indexation. So did crawl budgets increase as a result? Some did, but not
as much as you might suspect. And here’s where it gets interesting.
Googlebot seems willing to crawl more pages post-Caffeine but it’s
often crawling the same pages (the important pages) with greater
frequency. This makes a bit of sense if you think about Matt’s statement
along with the average age of documents benchmark. Pages deemed to have
more authority are given crawl priority.
Google is looking to ensure the most important pages remain the ‘freshest’ in the index.
Time Since Last Crawl

What I’ve observed over the last few years is that pages that haven’t
been crawled recently are given less authority in the index. To be more
blunt,
if a page hasn’t been crawled recently, it won’t rank well.
Last year I got a call from a client about a downward trend in their
traffic. Using advanced segments it was easy to see that there was
something wrong with their product page traffic.
Looking around the site I found that, unbeknownst to me, they’d
implemented pagination on their category results pages. Instead of all
the products being on one page, they were spread out across a number of
paginated pages.
Products that were on the first page of results seemed to be doing
fine but those on subsequent pages were not. I started to look at the
cache date on product pages and found that those that weren’t crawled
(I’m using cache date as a proxy for crawl date) in the last 7 days were
suffering.
Undo! Undo! Undo!
Depagination
That’s right, I told them to go back to
unpaginated results. What happened?

You guessed it. Traffic returned.
Since then I’ve had success with depagination. The trick here is to think about it in terms of
progressive enhancement and ‘mobile’ user experiences.
The rise of smartphones and tablets has made click-based pagination a
bit of an anachronism. Revealing more results by scrolling (or swiping)
is an established convention and might well become the dominant one in
the near future.
Can you load
all the results in the background and
reveal them only when users scroll to them without crushing your load
time? It’s not always easy and sometimes there are tradeoffs but it’s a
discussion worth having with your team.
Because there’s no better way to get those deep pages crawled than having links to
all of them on that first page of results.
CrawlRank
Was I crazy to think that the time since last crawl could be a factor in ranking? It turns out I wasn’t alone.
Adam Audette (a
smart guy) mentioned he’d seen something like this when I ran into him
at SMX West. Then at SMX Advanced I wound up talking with
Mitul Gandhi, who had been tracking this in more detail at
seoClarity.

Mitul and his team were able to determine that content not crawled
within ~14 days receives materially less traffic. Not only that, but
getting those same pages crawled more frequently produced an increase in
traffic. (Think about that for a minute.)
At first, Google clearly crawls using PageRank as a proxy. But over
time it feels like they’re assigning a self-referring CrawlRank to
pages. Essentially, if a page hasn’t been crawled within a certain time
period then it receives less authority. Let’s revisit Matt’s description
of crawl budget.
Another way to think about it is that the low PageRank
pages on your site are competing against a much larger pool of pages
with the same or higher PageRank. There are a large number of pages on
the web that have very little or close to zero PageRank.
The pages that aren’t crawled as often are pages with little to no
PageRank. CrawlRank is what makes the difference in this very large pool of pages.
You win if you get your low PageRank pages crawled more frequently than the competition.
Now what CrawlRank is really saying is that document age is a
material ranking factor for pages with little to no PageRank. I’m still
not
entirely convinced this is what is happening, but I’m seeing success using this philosophy.
Internal Links
You might argue that what we’re really talking about is internal link structure and density. And I’d agree with you!
Not only should your internal link structure support the most
important pages of your site, it should make it easy for Google to get
to
any page on your site in a minimum of clicks.
One of the easier ways to determine which pages are deemed most
important (based on your internal link structure) is by looking at the
Internal Links report in Google Webmaster Tools.

Do the pages at the top reflect the most important pages on your site? If not, you might have a problem.
I have a client whose blog was receiving 35% of Google’s crawl each
day. (More on how I know this later on.) This is a blog with 400 posts
amid a total content corpus of 2 million+ URLs. Googlebot would crawl
blog content 50,000+ times a day! This wasn’t where we wanted Googlebot
spending its time.
The problem? They had menu links to the blog and
each blog
category on nearly all pages of the site. When I went to the Internal
Links report in Google Webmaster Tools, do you know which pages were at
the top? Yup. The blog and the blog categories.
So, we got rid of those links. Not only did it change the internal
link density but it changed the frequency with which Googlebot crawls
the blog. That’s crawl optimization in action.
Flat Architecture

Remember the advice to create a flat site architecture? Many ran out and got rid of subfolders, thinking that if the
URL didn’t have subfolders then the architecture was flat. Um … not so much.
These folks destroyed the ability to do easy analysis,
potentially removed valuable data for assessing the site, and did nothing to address the underlying issue of getting Google to pages faster.
How many clicks from the home page is each piece of content? That’s
what was, and remains, important. It doesn’t matter if the URL is
domain.com/product-name if it takes Googlebot (and users) 8 clicks to
get there.
Is that
mega-menu
on every single page really doing you any favors? Once you get someone
to a leaf-level page you want them to see similar leaf-level pages.
Related product or content links are the lifeblood of any good internal
link structure and are, sadly, frequently overlooked.
Depagination is one way to flatten your architecture, but a simple HTML sitemap, or specific A-Z sitemaps, can often be
very effective hacks.
Flat architecture shortens the distance between authoritative pages
and all other pages, which increases the chances of low PageRank pages
getting crawled on a frequent basis.
Tracking Googlebot
“A million dollars isn’t cool. You know what’s cool? A billion dollars.”
Okay, Sean Parker probably didn’t say that in real life but it’s an
apt analogy for the difference between knowing how many pages Googlebot
crawled and knowing where Googlebot is crawling, how often, and with what
result.
The Crawl Stats graph in Google Webmaster Tools only shows you how many pages are crawled per day.

For nearly five years I’ve worked with clients to build their own Googlebot crawl reports.

That’s cool.
And it doesn’t always have to look pretty to be cool.

Here I can tell there’s a problem with this specific page type. More
than 50% of the crawl on that page type is producing a 410. That’s
probably not a good use of crawl budget.
All of this is done by parsing or ‘grepping’ log files (a line-by-line history of visits to the site) looking for Googlebot. Here’s a secret. It’s not
that hard, particularly if you’re even halfway decent with Regular Expressions.
I won’t go into details (this post is long enough as it is) but you can check out posts by
Ian Lurie and
Craig Bradford for more on how to grep log files.
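Still, a rough sketch helps make it concrete. Something like the Python below pulls Googlebot requests out of a combined-format access log; the log format and the access.log filename are assumptions, so adapt the pattern to whatever your server actually writes.

```python
import re

# A combined-format access log is assumed here; adjust the pattern to
# whatever format your server writes.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(log_path):
    """Yield (timestamp, path, status) for each Googlebot request in the log."""
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = LOG_LINE.match(line)
            # Anything can claim to be Googlebot; for rigor, verify the IP
            # with a reverse DNS lookup before trusting the user agent.
            if match and "Googlebot" in match.group("agent"):
                yield match.group("time"), match.group("path"), match.group("status")

if __name__ == "__main__":
    for time, path, status in googlebot_hits("access.log"):  # hypothetical path
        print(time, status, path)
```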
In the end I’m interested in looking at the crawl by page type and response code.

You determine page type using RegEx. That sounds mysterious but all
you’re doing is bucketing page types based on pattern matching.
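Here’s roughly what that bucketing looks like in Python. The page types and URL patterns are made up for illustration; map them to your own URL structure.

```python
import re
from collections import Counter

# Hypothetical page types and patterns; swap in your own URL structure.
PAGE_TYPES = [
    ("product", re.compile(r"^/product/")),
    ("category", re.compile(r"^/category/")),
    ("blog", re.compile(r"^/blog/")),
]

def page_type(path):
    """Bucket a URL path into a page type using the first matching pattern."""
    for name, pattern in PAGE_TYPES:
        if pattern.search(path):
            return name
    return "other"

# Feed this whatever you pulled from the logs, e.g. (path, status) pairs
# from the Googlebot parser sketched above.
sample_hits = [("/product/blue-widget", "200"), ("/blog/some-post", "410")]
counts = Counter((page_type(path), status) for path, status in sample_hits)

for (ptype, status), total in counts.most_common():
    print(f"{ptype:10} {status:>4} {total}")
```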
I want to know where Googlebot is spending time on my site. As
Mike King
said, Googlebot is always your last persona. So tracking Googlebot is
just another form of user experience monitoring. (Referencing it like
this might help you get this project prioritized.)
You can also drop the crawl data into a database so you can query
things like time since last crawl, total crawl versus unique crawl or
crawls per page. Of course you could also give seoClarity a try since
they’ve got a lot of this stuff right out of the box.
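If you do roll your own, the queries don’t need to be fancy. Here’s a minimal sqlite3 sketch; the table and column names are just assumptions, not a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect("crawl.db")  # hypothetical database file
conn.execute(
    """CREATE TABLE IF NOT EXISTS crawl_hits (
           crawled_at TEXT,    -- ISO timestamp of the Googlebot request
           path       TEXT,    -- URL path that was requested
           status     INTEGER  -- HTTP response code returned
       )"""
)

# Time since last crawl and total crawls per page, oldest first.
per_page = conn.execute(
    """SELECT path,
              MAX(crawled_at) AS last_crawled,
              COUNT(*)        AS total_crawls
       FROM crawl_hits
       GROUP BY path
       ORDER BY last_crawled ASC"""
).fetchall()

# Total crawl versus unique URLs crawled, by day.
per_day = conn.execute(
    """SELECT DATE(crawled_at)     AS day,
              COUNT(*)             AS total_crawl,
              COUNT(DISTINCT path) AS unique_crawl
       FROM crawl_hits
       GROUP BY day
       ORDER BY day"""
).fetchall()

for path, last_crawled, total in per_page[:20]:
    print(last_crawled, total, path)
```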
If you’re not tracking Googlebot then you’re missing out on the first part of the SEO process.
You Are What Googlebot Eats

What you begin to understand is that you’re assessed based on what
Googlebot crawls. So if they’re crawling a whole bunch of parameter-based,
duplicative URLs or you’ve left the email-a-friend link open to
be crawled on every single product, you’re giving Googlebot a bunch of
empty calories.
It’s not that Google will penalize you; it’s
the opportunity cost of a dirty architecture given a finite crawl budget.
The crawl spent on junk could have been spent crawling low PageRank
pages instead. So managing your URL Parameters and using robots.txt
wisely can make a
big difference.
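As a sketch, a few wildcard rules like these can keep Googlebot out of that junk. The parameter names are made up, so match them to the URLs your own site generates.

```
User-agent: *
# Hypothetical examples: block parameter-laden duplicates and the
# email-a-friend URLs mentioned above. Googlebot honors * and $ wildcards.
Disallow: /*?sort=
Disallow: /*sessionid=
Disallow: /email-a-friend/
```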
Many large sites will also have robust
external link graphs.
I can leverage those external links, rely less on internal link density
to rank well, and focus my internal link structure to ensure low
PageRank pages get crawled more frequently.
There’s no patently right or wrong answer. Every site will be different. But experimenting with your internal link strategies and
measuring the results is what separates the great from the good.
Crawl Optimization Checklist
Here’s a quick crawl optimization checklist to get you started.
Track and Monitor Googlebot
I don’t care how you do it but you need this type of visibility to make
any
inroads into crawl optimization. Information is power. Learn to grep,
perfect your RegEx. Be a collaborative partner with your technical team
to turn this into an automated daily process.
Manage URL Parameters
Yes, it’s confusing. You will probably make some mistakes. But that
shouldn’t stop you from using this feature and changing Googlebot’s
diet.
Use Robots.txt Wisely
Stop feeding Googlebot empty calories. Use robots.txt to keep Googlebot focused and remember to make use of pattern matching.
Don’t Forget HTML Sitemap(s)
Seriously. I know human users might not be using these, but
Googlebot is a different type of user with slightly different needs.
Optimize Your Internal Link Structure
Whether you try depagination to flatten your architecture,
re-evaluate navigation menus, or play around with crosslink modules,
find ways to optimize your internal link structure to get those low
PageRank pages crawled more frequently.