vBulletin





How to verify Googlebot

minstrel
09-22-2006, 10:28 AM
How to verify Googlebot (http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html)
Posted by Matt Cutts
9/20/2006 11:45:00 AM

Lately I've heard a couple of smart people ask that search engines provide a way to know that a bot is authentic. After all, any spammer could name their bot "Googlebot" and claim to be Google, so which bots do you trust and which do you block?

The common request we hear is to post a list of Googlebot IP addresses in some public place. The problem with that is that if/when the IP ranges of our crawlers change, not everyone will know to check. In fact, the crawl team migrated Googlebot IPs a couple years ago and it was a real hassle alerting webmasters who had hard-coded an IP range. So the crawl folks have provided another way to authenticate Googlebot. Here's an answer from one of the crawl people (quoted with their permission):

Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; eg:

> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.

This answer has also been provided to our help-desk, so I'd consider it an official way to authenticate Googlebot. In order to fetch from the "official" Googlebot IP range, the bot has to respect robots.txt and our internal hostload conventions so that Google doesn't crawl you too hard.
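The reverse-then-forward lookup described above can be sketched in Python roughly as follows. This is only an illustration, not Google's code: the function name `is_verified_googlebot` and the two injectable lookup parameters are assumptions made here so the check can be exercised without touching real DNS, and the default lookups handle IPv4 only.

```python
import socket

def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True only if ip reverse-resolves to a googlebot.com name
    and that name forward-resolves back to the same ip."""
    # Default to real DNS; the parameters are a hypothetical hook for offline testing.
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda name: socket.gethostbyname_ex(name)[2]

    # Step 1: reverse DNS lookup (IP -> hostname).
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False

    # Step 2: the name must be inside the googlebot.com domain.
    if not hostname.endswith(".googlebot.com"):
        return False

    # Step 3: forward DNS lookup (hostname -> IPs) must include the original IP.
    # This is what defeats a spoofer who controls only their own reverse DNS.
    try:
        addresses = forward_lookup(hostname)
    except OSError:
        return False
    return ip in addresses
```

Calling `is_verified_googlebot("66.249.66.1")` with no extra arguments performs live lookups, so the result depends on whatever DNS returns at the time. A spoofer who points their own reverse DNS at a googlebot.com name fails step 3, because the forward lookup of that name will not return the spoofer's IP.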

Peggy
09-22-2006, 10:45 AM
minstrel, this is some great info, thank you for posting it. I have often wondered, and asked I think, how to tell if a Googlebot was indeed a true Googlebot.

I'm curious though: what does he mean when he says "the bot has to respect robots.txt and our internal hostload conventions so that Google doesn't crawl you too hard"? What does that mean, crawl you too hard? I take it that's a bad thing?

minstrel
09-22-2006, 11:02 AM
Some spiders will hit your pages repeatedly at short intervals, causing problems with databases or bandwidth. You can limit this with instructions in a robots.txt file placed in the root directory of your site:

User-Agent: msnbot
Crawl-Delay: 10

User-Agent: Slurp
Crawl-Delay: 10


The above code targets MSN's spider, "MSNBot", and Yahoo's spider, "Slurp". It instructs each spider to wait the specified number of seconds (10 above; the default is 1 second if not specified) before requesting another page from your site. MSNBot and Slurp have been known to crawl some sites very heavily, and this lets webmasters slow them down.
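If you want to check how a robots.txt like the one above is read, Python's standard `urllib.robotparser` module can parse it and report the delay per user agent. This is a small sketch; the helper name `crawl_delays` is made up for illustration, and it only shows how Python's parser interprets the file, not how MSNBot or Slurp actually behave.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt above.
ROBOTS_TXT = """\
User-Agent: msnbot
Crawl-Delay: 10

User-Agent: Slurp
Crawl-Delay: 10
"""

def crawl_delays(robots_txt, agents):
    """Map each user agent to its Crawl-Delay in seconds (None if no rule applies)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.crawl_delay(agent) for agent in agents}

delays = crawl_delays(ROBOTS_TXT, ["msnbot", "Slurp", "Googlebot"])
print(delays)  # {'msnbot': 10, 'Slurp': 10, 'Googlebot': None}
```

Googlebot comes back as `None` here simply because the file has no entry for it.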

Googlebot is normally better behaved by default and so doesn't require this. At present, as far as I know, Googlebot does not support crawl-delay.

Addendum:

I just did a search for googlebot and crawl-delay and I found a number of entries from various forums and even robots.txt files that use

User-Agent: Googlebot
Crawl-Delay: 10


However, be aware that you're likely wasting your time putting this in:

http://www.mattcutts.com/blog/googlebot-keep-out/

(scroll down through the usual crap comments)

Matt Cutts Said,
March 18, 2006 @ 1:37 pm

Dave, Googlebot doesn’t support the Crawl-Delay suggestion in robots.txt. I intend to do a post about why not at some point. If you’re impatient, you can listen to the MP3 of pundits of search from the SES NYC show on webmasterradio.fm. I talked about why we don’t support crawl-delay there. I would like our crawl team to support some way of reporting how much to throttle Googlebot though.

Peggy
09-22-2006, 12:08 PM
hmmmm.. I don't have a robots.txt file in my directory. If you are of the opinion that it's to my advantage to make one, then I will.

minstrel
09-22-2006, 12:13 PM
You don't need one at all if:

1. you have no directories or files you wish NOT to be spidered; and

2. you don't perceive a need to slow down spidering by MSNBot or Slurp; and

3. there are no bots you want to try to block (although in reality if you're trying to block bad bots they'll ignore the robots.txt file anyway)

Peggy
09-22-2006, 12:22 PM
well, I don't know, really, if I want to slow them down or not. How does one know? Do they cause a server load? or what? (I'm really not dumb, I'm learning ;) )

Yahoo Slurp is at both of my sites quite a lot, sometimes 3 or 4 bots at a time. Or at least it's listed in my Who's Online 3 at a time sometimes. Is that too much? or no?

Geez I've been doing forums for years and never came across this.

minstrel
09-22-2006, 12:37 PM
I have never personally felt a need to discourage bots. I don't use those instructions myself. I have a robots.txt file to exclude certain directories or files.

Peggy
09-22-2006, 12:47 PM
ah ok. Thanks for the info and advice!

Big Dan
09-22-2006, 01:13 PM
Quick question: Does vB come prepackaged with a robots.txt? Both my vB directories have robots.txt and I never specifically created it.

minstrel
09-22-2006, 01:19 PM
No. Or at least my version didn't.

It may have been pre-installed by your host. What does it contain? Sometimes an empty robots.txt file is uploaded to stop 404 "file not found" errors, or one that allows everything:

User-agent: *
Disallow:

which says "disallow nothing -- spider everything".
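A quick way to convince yourself of that is to run the two lines through Python's standard `urllib.robotparser`. This is just a sketch: the helper name `allowed`, the bot name `AnyBot`, and the `BLOCK_ALL` rules (added only for contrast) are not from anyone's actual site.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, agent, path):
    """Return True if the given robots.txt text lets `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

# An empty Disallow value blocks nothing.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
# For contrast: a single slash blocks every path on the site.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

print(allowed(ALLOW_ALL, "AnyBot", "/showthread.php"))  # True
print(allowed(BLOCK_ALL, "AnyBot", "/showthread.php"))  # False
```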

Big Dan
09-22-2006, 01:35 PM
Let me open it up.. I did notice it was hidden, I had to check "show hidden files" in my FTP client:

User-agent: *
Disallow: /ajax.php
Disallow: /attachment.php
Disallow: /calendar.php
Disallow: /cron.php
Disallow: /editpost.php
Disallow: /global.php
Disallow: /image.php
Disallow: /inlinemod.php
Disallow: /joinrequests.php
Disallow: /login.php
Disallow: /member.php
Disallow: /memberlist.php
Disallow: /misc.php
Disallow: /moderator.php
Disallow: /newattachment.php
Disallow: /newreply.php
Disallow: /newthread.php
Disallow: /online.php
Disallow: /poll.php
Disallow: /postings.php
Disallow: /printthread.php
Disallow: /private.php
Disallow: /profile.php
Disallow: /register.php
Disallow: /report.php
Disallow: /reputation.php
Disallow: /search.php
Disallow: /sendmessage.php
Disallow: /showgroups.php
Disallow: /subscription.php
Disallow: /threadrate.php
Disallow: /usercp.php
Disallow: /usernote.php
Disallow: /spiders.php

I just realized I had copied that from one of the vB forums. Now that I've looked at it, I remember making it. Sorry for the confusion.

Noppid
09-22-2006, 05:06 PM
Robots.txt is only read in the root folder, not sub-folders. ;)
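To illustrate the point about the root folder, here is a rough Python sketch using the standard library. The host `www.example.com`, the `/forum/` sub-folder, and the helper name `robots_url` are all hypothetical. It shows where a spider looks for the file, and that a rule such as `Disallow: /calendar.php` is matched against the path from the site root, so it does not cover a forum installed in a sub-folder.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

def robots_url(page_url):
    """Return the one robots.txt location a spider checks for this page's host."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# Hypothetical forum page inside a sub-folder.
print(robots_url("http://www.example.com/forum/showthread.php?t=1"))
# http://www.example.com/robots.txt

# One rule taken from the list posted earlier in the thread.
RULES = "User-agent: *\nDisallow: /calendar.php\n"
parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Paths are matched from the site root, so the sub-folder copy is not blocked.
root_blocked = not parser.can_fetch("*", "http://www.example.com/calendar.php")
subdir_blocked = not parser.can_fetch("*", "http://www.example.com/forum/calendar.php")
print(root_blocked, subdir_blocked)  # True False
```

Under these assumptions, a forum living in `/forum/` would need lines like `Disallow: /forum/calendar.php` in the root robots.txt instead.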

Coder1
09-22-2006, 05:54 PM
I use a similar robots.txt... there is no reason for my calendar page to be indexed by a search engine, and robots shouldn't be trying to post new threads, etc.

