Slides from Brighton SEO Talk, April 13th 2012

SearchBots: Lost Children or Hungry Psychopaths? What Do Searchbots Actually Do?

In addition to the slides below, there's also a write-up of my experience at the conference, including a video and a lovely swirly picture of me, here: Searchbots, Lost Children and More.

  • 2. Google occasionally posts a "Googlebot found an extremely high number of URLs on your site" message in Google Webmaster Tools
  • 3. Although this is a classic Google message, in that it hedges by saying it may mean any of several things, one line in particular stands out: "Google ... may be unable to completely index all the content on your site"
  • 4. Given that this is such a dramatic message, it got me thinking: if there is even the tiniest chance that Google doesn't index all our content, what hard, factual evidence is there to tell us, simply and practically, which URLs Google (and other searchbots) actually index?
  • 5. We could use the "site:" operator, but just how accurate is it? I have been told that it's optimised for speed, not accuracy, and besides, it's quite an effort to keep digging into this if your website serves 10,000, let alone 100,000, URLs
  • 7. Google Analytics - Meh - primarily configured to report on human visits
  • 9. Google Webmaster Tools - gives us some information, but not really what we need to know, namely: what has Googlebot indexed, and when?
  • 13. Which leaves us with webserver logfiles...
  • 15. The problem with webserver logfiles is that they can become very large, so the prospect of digging through them is enough to make you shoot yourself. Maybe.
  • 16. However, if you can dig into them, you can start to extract real gold.
  • 19. For example, on a significant website, Googlebot spent almost 40% of its time requesting just two URLs. Doh! What about the other 39,998 URLs?
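As a very rough illustration, here's a minimal Python sketch of this kind of counting. It assumes a combined-format access log (the Apache/nginx default) called access.log, and identifies Googlebot by user-agent string alone - user-agents can be spoofed, so for serious work you'd confirm visits with a reverse-DNS check.

```python
import re
from collections import Counter

# Matches the combined log format:
# host ident user [time] "METHOD path HTTP/x" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] '
    r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" '
    r'\d+ \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

counts = Counter()
with open("access.log") as f:
    for line in f:                        # stream the file; never load it all
        m = LOG_LINE.match(line)
        if m and "Googlebot" in m.group("agent"):
            counts[m.group("path")] += 1

total = sum(counts.values())
for path, n in counts.most_common(10):    # the URLs Googlebot hits hardest
    print(f"{n:>8}  {100 * n / total:5.1f}%  {path}")
```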
  • 22. Looking at just the top 100 or top 1,000 URLs is only so useful if you serve 10,000 or 100,000 URLs, so instead we want to group the URLs by their top-level section
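Continuing the sketch above (the function and variable names are my own, not from the talk), the per-URL counts can be rolled up by first path segment:

```python
from collections import Counter
from urllib.parse import urlsplit

def top_level_section(path: str) -> str:
    # "/shop/shoes/42?page=2" -> "/shop/"; the bare root stays "/"
    segments = [s for s in urlsplit(path).path.split("/") if s]
    return f"/{segments[0]}/" if segments else "/"

def report_by_section(counts: Counter) -> None:
    # counts maps request path -> no. of Googlebot requests, as built above
    requests_per_section = Counter()
    urls_per_section = Counter()
    for path, n in counts.items():
        section = top_level_section(path)
        requests_per_section[section] += n
        urls_per_section[section] += 1    # each distinct path counted once
    for section, n in requests_per_section.most_common(20):
        print(f"{n:>8} requests  {urls_per_section[section]:>6} URLs  {section}")
```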
  • 24. By grouping URLs in this way, we can plot the sections of the site that Googlebot requested. Each ball represents a section of the site; the height of the ball is the no. of requests, and the size of the ball the no. of URLs within that section. This gives us a quick overview of how Googlebot spent its time. This is a visualisation of Googlebot's visits over six months (to the same real, significant website).
  • 27. What we really need to know, though, is: is Googlebot actually requesting all our content? Because if it isn't, then not all of our content will be indexed, and without doubt we're therefore missing out on traffic. So, we compare the no. of URLs requested by anyone (people, robots, anything) to the no. of URLs requested by Googlebot. Each blue line represents the no. of URLs requested by anyone and anything within a section of the site; each red line represents the no. of URLs requested by Googlebot. It's completely clear that, over six months, Googlebot never requests all URLs - and given that the x-axis is on a logarithmic scale, Googlebot falls quite a bit short.
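The same comparison is easy to compute from the logs. A sketch, assuming you've built two sets of paths - one for all visitors, one for Googlebot - and reusing top_level_section from the sketch above:

```python
from collections import defaultdict

def coverage_by_section(all_paths: set, bot_paths: set) -> None:
    # all_paths: every path requested by anyone; bot_paths: paths Googlebot hit
    totals = defaultdict(int)
    crawled = defaultdict(int)
    for path in all_paths:
        section = top_level_section(path)
        totals[section] += 1
        if path in bot_paths:
            crawled[section] += 1
    for section in sorted(totals, key=totals.get, reverse=True):
        pct = 100 * crawled[section] / totals[section]
        print(f"{section:<30} {crawled[section]:>6}/{totals[section]:<6} crawled ({pct:.0f}%)")
```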
  • 29. If Googlebot (and other searchbots) don't request all our content, we really need to know what content they do request. This slide is an animation of Googlebot's requests over six months. Each ball represents a section of the site; the height is the percentage of requests made by Googlebot, and the size of the ball is the no. of URLs within that section that Googlebot requested. What's really important to appreciate from this animation is that (a) Googlebot gets completely obsessed with a couple of sections of the site (e.g. the yellow and red balls), and (b) those are two sections that people request very rarely and that are not key areas of content for this website owner. So Googlebot is really not spending its time very effectively.
  • 31. What we really want is a quick way of assessing whether Googlebot is spending its time cost-effectively, so we plot Googlebot visits (on the x-axis) against human visits (on the y-axis). The big green ball at the far right-hand side therefore represents a section of the site that Googlebot spends a huge amount of its time in, but that no-one visits. Not a great use of Googlebot's time.
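Here's a sketch of that bot-versus-human plot with matplotlib (the data structure and example numbers are my own illustration): sections that land far right and low are where crawl budget is being wasted.

```python
import matplotlib.pyplot as plt

def plot_bot_vs_human(sections):
    # sections: list of (name, bot_requests, human_requests, url_count) tuples
    xs = [bot for _, bot, _, _ in sections]
    ys = [human for _, _, human, _ in sections]
    sizes = [urls for _, _, _, urls in sections]
    plt.scatter(xs, ys, s=sizes, alpha=0.5)
    for name, bot, human, _ in sections:
        plt.annotate(name, (bot, human))
    # symlog copes with sections that humans never visit (zero counts)
    plt.xscale("symlog")
    plt.yscale("symlog")
    plt.xlabel("Googlebot requests")
    plt.ylabel("Human requests")
    plt.show()

plot_bot_vs_human([("/shop/", 5000, 12000, 900),
                   ("/search/", 45000, 150, 20000),   # bot-heavy, human-light
                   ("/blog/", 800, 9000, 120)])
```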
  • 32. What can we now conclude? To start with: Googlebot does not always request all content. Surely this is an absolutely crucial finding? If Googlebot does not request all our content, then not all of our content is in the SERPs, and hence we miss out on traffic.
  • 33. Googlebot can become distracted, obsessed or even lost, particularly in: (a) on-site search sections, (b) additive filtered sections (faceted navigation), clearly a major issue for retailers, and (c) areas of thin or superficial content
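For example, one blunt remedy for (a) and (b) is to fence those sections off in robots.txt. A hypothetical sketch - the paths and parameter name are illustrative, and note that the * wildcard in Disallow rules is honoured by Googlebot and Bingbot but isn't part of the original robots.txt standard:

```
User-agent: *
# Keep bots out of on-site search results...
Disallow: /search/
# ...and out of additive filter combinations
Disallow: /*?filter=
Disallow: /*&filter=
```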
  • 35. Googlebot can therefore become unfocused, and really needs our help to stay focused on the kind of content that we really value.
  • 36. Given that this is all rather profound (what's the point of link building if your content isn't going to be indexed?), what documentation, blog posts, etc. are there that mention the simple fact that searchbots don't index all our content? There's this comment from the recent "SEO Armageddon Oh My God We're All Doomed" posts - i.e. Googlebot is a bit stupid, but they're going to make it smarter.
  • 37. Then there's this slide from last year, which is self-evident: if Googlebot doesn't think your content is up to much, it's not going to come back as often. So, if Googlebot is spending lots of its time in lesser-quality areas of your site, it may not perceive your site as useful to users and may not come back as often (the result of which is less up-to-date results in the SERPs, or simply not all of your content in the SERPs).
  • 39. Conclusions. As the slide above says.
  • 42. So, what should we do? Regularly check your searchbot behaviour, particularly if you regularly see the "Googlebot found an extremely high number of URLs on your site" message. If your searchbots are behaving oddly and not visiting the areas that you would like, then consider the options listed in the slide, and check your sitemaps (e.g. do they contain all your URLs? - see the sketch below). Use this kind of technique to find out (a) whether all content is being requested and (b) if not, what is being requested - and then work out how to make sure that all content is being requested.
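As a sketch of that sitemap check - again assuming the bot_paths set from the log analysis above, and ignoring sitemap-index files for brevity:

```python
import gzip
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_paths(sitemap_url: str) -> set:
    # Fetch one sitemap and return the paths of its <loc> entries
    with urllib.request.urlopen(sitemap_url) as resp:
        data = resp.read()
    if sitemap_url.endswith(".gz"):
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    return {urlsplit(loc.text.strip()).path for loc in root.iter(NS + "loc")}

# Which sitemap URLs has Googlebot never requested?
# missing = sitemap_paths("https://example.com/sitemap.xml") - bot_paths
```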
  • 44. Are searchbots lost children (do they get lost in myriads of URLs), or psychopaths hungrily hoovering up all our content?
  • 45. Or, are they more like bright teenagers, who sometimes are easily distracted?
  • 49. What does Googlebot sound like? (this is the real programmatic sound of Googlebot)
  • 51. What does Bing sound like? Ahem.