How to verify Googlebot
by Matt Cutts
9/20/2006
Search robots in disguise
Lately I've heard a couple of smart people ask that search engines provide a way to know that a bot is authentic. After all, any spammer could name their bot "Googlebot" and claim to be Google, so which bots do you trust and which do you block?
The common request we hear is to post a list of Googlebot IP addresses in some public place. The problem with that is that if/when the IP ranges of our crawlers change, not everyone will know to check. In fact, the crawl team migrated Googlebot IPs a couple years ago and it was a real hassle alerting webmasters who had hard-coded an IP range. So the crawl folks have provided another way to authenticate Googlebot. Here's an answer from one of the crawl people (quoted with their permission):
Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; eg:
> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.
This answer has also been provided to our help-desk, so I'd consider it an official way to authenticate Googlebot. In order to fetch from the "official" Googlebot IP range, the bot has to respect robots.txt and our internal hostload conventions so that Google doesn't crawl you too hard.
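The reverse-then-forward check described above can be sketched in Python. This is a minimal illustration, not Google's own tooling: the function names (`reverse_lookup`, `forward_lookup`, `is_googlebot`) are made up for this example, and the resolvers are passed in as parameters only so the logic can be exercised without live DNS. It uses the standard library's `socket.gethostbyaddr` and `socket.gethostbyname_ex`, the latter of which handles IPv4 only.

```python
import socket


def reverse_lookup(ip):
    """Return the PTR hostname for an IP address, or None if lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Return the IPv4 addresses a hostname resolves to (empty on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_googlebot(ip, reverse=reverse_lookup, forward=forward_lookup):
    """Hypothetical helper: forward-confirmed reverse DNS for Googlebot."""
    host = reverse(ip)
    if not host:
        return False
    # Normalize: PTR answers may carry a trailing dot or mixed case.
    host = host.rstrip(".").lower()
    # Require a dot boundary so "evilgooglebot.com" does not pass.
    if not host.endswith(".googlebot.com"):
        return False
    # The reverse name alone can be spoofed; the forward lookup must
    # lead back to the same IP address.
    return ip in forward(host)
```

Calling `is_googlebot("66.249.66.1")` performs both lookups against real DNS. A spoofer who controls the reverse zone for their own IP can make the first lookup return a googlebot.com name, but cannot make the googlebot.com forward record point back at their address, which is why the last line matters.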
November 29, 2006
Brent Hands, Program Manager, Live Search
There are plenty of bots out there and, as a result, some conventions have arisen. Well-behaved bots identify themselves with a unique user-agent. They also follow the robots.txt conventions, which allow webmasters to control how their sites are crawled.
Here at Live Search, our crawlers are identified by the user-agent ‘MSNBot’. This may seem a little counterintuitive, but many webmasters depend on it, so we have chosen not to change it. To make things a little more transparent, we also identify our different types of crawlers. The complete list is as follows:
- MSNBot: Main web crawler (www.live.com)
- MSNBot-Media: Images & all other media (images.live.com)
- MSNBot-NewsBlogs: News and blogs (search.live.com/news)
- MSNBot-Products: Products & shopping (products.live.com)
- MSNBot-Academic: Academic search (academic.live.com)
But what about crawlers that aren’t so well-behaved? After all, anyone could call themselves ‘MSNBot’, and proceed to be as rude and aggressive as they like. Fortunately, there is a way you can catch these impersonators. Here is how it works:
1. When you get a page view request, it specifies a user-agent and an IP address. As I described above, all requests from Live Search use a user-agent starting with the word ‘MSNBot’.
2. If you see the MSNBot user-agent, it’s time to check the identity of the bot. Starting with the IP address (e.g. 207.46.98.149), you can use a reverse DNS lookup to find out the registered name of the machine.
3. Once you have the host name (in this case, livebot-207-46-98-149.search.live.com), you can check that it really is coming from Live Search. The names of all Live Search crawlers end with ‘search.live.com’. If the name doesn’t end with ‘search.live.com’, you know it’s not really our crawler.
4. Finally, you need to verify that the name is accurate. To do this, you can use a forward DNS lookup to see the IP address associated with the host name. This should match the IP address you used in Step 2. If it doesn’t, the name was fake.
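The steps above can be sketched as a single Python function that takes the user-agent and IP address of a request and reports what it found. This is only an illustration under stated assumptions: `classify_request`, its three return labels, and the case-insensitive user-agent match are choices made for this example, not part of Live Search's guidance. The DNS resolvers are parameters so the flow can be tried without network access.

```python
import socket

# Suffix from the article, with a leading dot so only real subdomains match.
LIVE_SUFFIX = ".search.live.com"


def _ptr(ip):
    """Reverse DNS: IP address to host name, or None on failure."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def _addrs(host):
    """Forward DNS: host name to list of IPv4 addresses (empty on failure)."""
    try:
        return socket.gethostbyname_ex(host)[2]
    except OSError:
        return []


def classify_request(user_agent, ip, ptr=_ptr, addrs=_addrs):
    """Return 'not-msnbot', 'verified', or 'impostor' (hypothetical labels)."""
    # Step 1: only requests claiming to be MSNBot need checking.
    # Matching case-insensitively is an assumption of this sketch.
    if not user_agent.lower().startswith("msnbot"):
        return "not-msnbot"
    # Step 2: reverse DNS lookup on the requesting IP address.
    host = ptr(ip)
    if host is None:
        return "impostor"
    host = host.rstrip(".").lower()
    # Step 3: the host name must belong to Live Search.
    if not host.endswith(LIVE_SUFFIX):
        return "impostor"
    # Step 4: forward DNS must lead back to the same IP address.
    return "verified" if ip in addrs(host) else "impostor"
```

In practice the result for a given IP address could be cached for a while, since each check costs two DNS queries.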
By verifying the crawler’s identity, you can catch masquerading crawlers. When you do catch one, you can simply return an HTTP error (for example, 403 Forbidden), which blocks it from seeing your content.
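One way to return that error is a small piece of WSGI middleware, sketched below. Everything here is hypothetical: the name `block_impostors` and its two callback parameters are invented for this example, and the choice of 403 Forbidden is an assumption, since the article only says "an HTTP error". The callbacks stand in for whatever user-agent test and DNS verification you already have.

```python
def block_impostors(app, claims_to_be_bot, is_verified_bot):
    """Wrap a WSGI app so that unverified 'bots' receive 403 Forbidden.

    claims_to_be_bot(user_agent) -> bool  (e.g. starts with 'MSNBot')
    is_verified_bot(ip) -> bool           (e.g. reverse + forward DNS check)
    Both callbacks are supplied by the caller; none is provided here.
    """
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        ip = environ.get("REMOTE_ADDR", "")
        if claims_to_be_bot(user_agent) and not is_verified_bot(ip):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Ordinary visitors and verified crawlers pass through untouched.
        return app(environ, start_response)
    return middleware
```

Note that `REMOTE_ADDR` is the address of the immediate peer, so behind a reverse proxy or load balancer it will be the proxy's address rather than the crawler's, and the check would need the real client address instead.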