Spidering by Google


jolandia
19-04-2003, 01:48 PM
Anyone know of a way of finding out when Google (and other search engines) have come around to spider a site? Apparently, it happens about once a month. It would be useful for me to find out when this is happening (rough outline of the interval).

On another note, to Andy and Craig: apparently Google tends to rank sites better when more links point to them. One idea I had was for you to offer a page on your site where customers could list links to their own sites. Once that page got spidered, Google would see who you link to and give those sites more prominence. You might have problems with non-affiliated people trying to load up links to their sites, but I guess you could use a system that only accepts links to sites hosted on the IP addresses that you use.

If anyone else would find this feature useful then let us know!

Thanks for any insight into this.

Matt
19-04-2003, 04:15 PM
From what I've read, if you are linked to by high-ranking pages then your ranking goes up. What I don't get is how a site becomes highly ranked through this process in the first place. And if you are linked to by low rankers, I think your rating goes down.

You can get flump to link to you through the customer showcase, that has about a 5/10 ranking I think, which isn't bad really.

If you don't have a robots.txt, then you will get a 404 whenever the spider tries to find it, but you would lose the 'crawl' for that month. There must be other ways to spot the spider, though I don't know how; I just used to notice I was getting a 404 for a file I didn't know was supposed to be there.

trevHCS
19-04-2003, 05:28 PM
Each search engine has its own timetable for spidering, but taking Google as the only really interesting one out there, it spiders pretty much all of the time.

As for identifying the bots, the only way you can normally do this is by downloading the raw site logs from Cpanel and either putting them through a log analyser or reading them manually.

Googlebots identify themselves as Googlebot/1.2 or Googlebot/2 or similar. They can also be identified by their IP addresses, which fall around the 216.239.46.x range, although there are others around as well.
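For anyone who would rather script the log check than read the raw logs by eye, here is a minimal Python sketch. It assumes the logs are in the common Apache "combined" format (your host may log differently), and the sample lines, the pattern and the function name are made up for illustration. It counts Googlebot hits per day by user-agent, which gives a rough picture of the visit interval.

```python
import re
from collections import Counter

# Assumed Apache "combined" log layout; adjust if your host logs differently.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def googlebot_visits_per_day(lines):
    """Count hits per day whose user-agent string mentions Googlebot."""
    days = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if not match or "googlebot" not in match.group("agent").lower():
            continue
        # The timestamp looks like 16/Apr/2003:03:12:45 +0100; keep the date part.
        day = match.group("time").split(":", 1)[0]
        days[day] += 1
    return dict(days)

# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '216.239.46.10 - - [16/Apr/2003:03:12:45 +0100] "GET /robots.txt HTTP/1.0" 404 210 "-" "Googlebot/2.1"',
    '216.239.46.10 - - [16/Apr/2003:03:12:47 +0100] "GET / HTTP/1.0" 200 5120 "-" "Googlebot/2.1"',
    '10.0.0.5 - - [16/Apr/2003:09:00:00 +0100] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '64.68.82.20 - - [19/Apr/2003:11:30:02 +0100] "GET /news.html HTTP/1.0" 200 2048 "-" "Googlebot/2.1"',
]

print(googlebot_visits_per_day(SAMPLE_LOG))
```

Bear in mind that a user-agent string can be faked, so treat the counts as a rough guide only.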

With Google, you may find that sites or updates don't get into the results instantly. Each month Google performs the Google Dance, where it updates its indexes with new content and rearranges which sites are the most important.

Btw, no I didn't invent that name! :)


Regarding links pages - that idea would work to a degree, but only really with Google, and even then it's definitely not the be-all and end-all of getting a good ranking.

Google does place a lot of emphasis on "link popularity", but this covers several factors.

1) The more sites that link to you, the more useful you are in the eyes of Google

2) Each link is checked to see if it comes from a popular site or not. The more popular the site is that links to you, the more that link will mean to Google.

3) Links from sites on the same IP address as your site won't be counted as quite as important as we discovered after moving everything to Aaron.
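Points 1 and 2 can be illustrated with a toy version of the publicly described PageRank idea. This is only a sketch under stated assumptions, not Google's actual ranking: the site names, the link graph and the function name are invented, and the damping value of 0.85 is the one usually quoted for the published algorithm. It also shows how the "who is popular in the first place" question gets resolved: every page starts equal and the scores are recalculated repeatedly until they settle.

```python
def toy_pagerank(links, damping=0.85, iterations=50):
    """Repeatedly share each page's score among the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {page: 1.0 / len(pages) for page in pages}  # everyone starts equal
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page in pages:
            # A page with no outgoing links shares its score with every page.
            targets = links.get(page) or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: "you" are linked from a popular page,
# "rival" is linked from a page nobody links to.
LINKS = {
    "popular": ["you"],
    "fan1": ["popular"],
    "fan2": ["popular"],
    "fan3": ["popular"],
    "obscure": ["rival"],
    "you": [],
    "rival": [],
}

ranks = toy_pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:8s} {score:.3f}")
```

In this toy graph "you" ends up above "rival" even though each has exactly one inbound link, because the page linking to "you" is itself well linked.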


There are masses more checks done by Google to work out how much each link is worth and quite a few checks on your site also, but the checklist below might give a few ideas:

- Make sure you're not hosted on a freebie site like Geocities (probably doesn't apply in this case).

- *Do not* submit your site via one of these services that lists you on 10,000 search engines instantly. That is instant Google suicide.

- Get the Google Toolbar from toolbar.google.com and before requesting links from sites, check they have a pagerank of at least 4. You'll see the pagerank level as a green bar on the Google Toolbar.

- Submit your site to lots of the smaller search engines & directories out there, but not the paid for ones. Directories are especially good as they are more easily spidered.


Btw, if you want a list of all the spiders out there, see:

http://www.robotstxt.org/wc/active/html/index.html


Trev
PS: You could of course just pay us to do the hard work for you... :D

trevHCS
19-04-2003, 05:42 PM
Originally posted by Matt
From what I've read, if you are linked to by high-ranking pages then your ranking goes up. What I don't get is how a site becomes highly ranked through this process in the first place. And if you are linked to by low rankers, I think your rating goes down.

Looks like you sneaked in while I was typing :)

Yep, that's it in a nutshell. Actually, if you link to popular sites you can boost your own importance a little as well. I'd guess that's why Lake District Links (http://www.lake-district-links.co.uk/) works, as it breaks quite a few other rules.

Getting high rankings generally requires one of the following:

- Lots of time & effort getting linked from other sites
- Providing a really useful resource which people want to link to

- Getting millions of pounds of advertising money
- Getting millions of pounds of taxpayers' money

Actually as we've proven, the latter 2 don't always work especially when a site run by a couple of people in Cumbria on a budget of pretty much £0 outdoes them in the search engines... :P Me, smug, well yeah...


Regarding robots.txt, don't worry about your site not having one. You will get 404 errors when the spider looks for it, but the file is mainly meant to tell the spider what not to look at, e.g. /my_secret_pics

If it doesn't find one it'll just assume it can spider your entire site but shouldn't think any worse of you.

Important security note: If using a robots.txt file to exclude certain directories, make sure those directories are password protected if they contain important info as hackers quite often look in robots.txt files to find the juicy stuff...
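To see how an exclusion rule is read, here is a small sketch using Python's standard urllib.robotparser module. The robots.txt content is a made-up example built around the /my_secret_pics directory mentioned above. Note that the file only asks well-behaved spiders to stay away; anyone can still fetch it and read the paths it lists.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /my_secret_pics/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Googlebot", "/index.html"))            # allowed
print(parser.can_fetch("Googlebot", "/my_secret_pics/a.jpg"))  # blocked

# With no rules at all, nothing is excluded, so every path is allowed.
empty = RobotFileParser()
empty.parse([])
print(empty.can_fetch("Googlebot", "/anything.html"))
```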


Trev

jolandia
21-04-2003, 09:32 AM
If you don't have a robots.txt file, is it possible to specifically see that it was the google spider that missed you though, for instance?

jamesb
02-05-2003, 08:40 PM
If your stats package allows you to look out for certain IP addresses then keep an eye out for

freshbot IP: 64.68.82.*
deepbot IP: 216.239.46.*

Freshbot usually visits every now and again, whereas deepbot should visit once a month.

It's good to make sure that deepbot visits each page on your site, otherwise it may not be indexed properly. I think the next deepbot crawl is in about 2 weeks.
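If your stats package can't filter by IP, a few lines of Python can do the matching. This is a sketch only: it assumes the two wildcard ranges above are plain /24 blocks, the function name is made up, and the ranges are the ones quoted in this thread rather than any official or current list.

```python
import ipaddress

# Ranges quoted in this thread, assumed to be /24 blocks.
# Not an authoritative or current list of Google crawler addresses.
CRAWLER_RANGES = {
    "freshbot": ipaddress.ip_network("64.68.82.0/24"),
    "deepbot": ipaddress.ip_network("216.239.46.0/24"),
}

def classify_crawler(ip_text):
    """Return 'freshbot', 'deepbot', or None for any other address."""
    try:
        address = ipaddress.ip_address(ip_text)
    except ValueError:
        return None  # not a valid IP address (e.g. a resolved hostname)
    for name, network in CRAWLER_RANGES.items():
        if address in network:
            return name
    return None

for ip in ["64.68.82.17", "216.239.46.200", "192.0.2.1"]:
    print(ip, classify_crawler(ip))
```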

cscarlet
03-05-2003, 02:52 PM
I only have a robots.txt file on one of my sites, so that it doesn't spider certain folders, tbh.

iain_bspin
05-05-2003, 10:52 PM
apologies for the repost if this has been discussed before:

The Google Dance (http://www.google-dance.com/HTML-about.html)

has a lot of info about when and how Google does its thang.

hth,

iain

www.bspin.co.uk

eskimo
06-05-2003, 03:38 AM
Good post iain_bspin.

I tend to drop into the webmasterworld forums every now and again, where they have some really clued-up guys who get very excited when the Google dance begins.

It's an excellent resource if you need to find out anything about Google, as a couple of the regular posters work for Google.

According to my logs the last Google dance was on 16th April, so the next one should be along soon.
Time to add those metatags!

david
11-05-2003, 05:56 PM
Without trying to break the rules or getting all technical - THE best way to get a good ranking in Google is to build a good quality, long-established site over a number of years which gets lots of other sites linking to it.

trevHCS
11-05-2003, 07:20 PM
Originally posted by david
Without trying to break the rules or getting all technical - THE best way to get a good ranking in Google is to build a good quality, long-established site over a number of years which gets lots of other sites linking to it.

Always wondered, if you pay for one of those green adverts or even the ones at the top of the pages, does the ranker look more favourably on your site?

Of course if you want instant traffic, forget Google, just add your site to AardvarkBusiness.net when it launches! :)

Trev

Matt
11-05-2003, 08:06 PM
From what I can see of the £10k-a-month Google advertising special (the green highlighted bar), one of the companies I know of that uses it only has a 4/10 rating, which doesn't sound good to me.

eskimo
11-05-2003, 08:37 PM
Always wondered, if you pay for one of those green adverts or even the ones at the top of the pages, does the ranker look more favourably on your site?

From what I've heard, the Google ads have no effect on your actual Google ranking and are managed separately.

After all if your ranking did improve, you wouldn't have to spend as much on google ads.

PaulD
11-05-2003, 09:21 PM
Found this interesting while looking for a google logger :
http://www.googlestats.com/ so far it seems interesting although I'm having trouble with the graphing.

Paul

eskimo
11-05-2003, 10:07 PM
Is this what you're after?

http://www.darrinward.com/googletrax/

I've never used it; I've just heard about it from the guys at SEOchat.com.


There is also this method.

"Normally when I want to look at Googlebots activity for my site I do the following:

1. Log into my server via SSH (or TelNet).
2. Find my raw access log, which in my case is called "access_log"
3. While in the same directory as the log file I perform this shell command

Code / Sample:
grep '.googlebot.com' access_log > google.txt

This will create a file on your server called 'google.txt' and in that file there will be every hit from the Googlebot.

In the code above '.googlebot.com' is the name of Google's remote host. If your server does not perform a DNS lookup (some don't) then you can change this to 'Googlebot', which will track the Googlebot by its user-agent string (HTTP_USER_AGENT) instead"

May as well be Cantonese for all I can understand of it, though it may be helpful to someone.
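For anyone without shell access, the same filtering can be sketched in Python and run against a log downloaded from Cpanel. The function names here are made up, and the default file names simply mirror the ones in the quoted steps. Matching "googlebot" case-insensitively covers both cases described in the quote: the resolved hostname ending in .googlebot.com and the Googlebot user-agent string.

```python
def googlebot_lines(lines, needle="googlebot"):
    """Return the lines containing the needle, ignoring case (like grep -i)."""
    wanted = needle.lower()
    return [line for line in lines if wanted in line.lower()]

def write_googlebot_hits(log_path="access_log", out_path="google.txt"):
    """Copy every Googlebot line from the raw log into a separate file."""
    with open(log_path, encoding="utf-8", errors="replace") as source:
        hits = googlebot_lines(source)
    with open(out_path, "w", encoding="utf-8") as target:
        target.writelines(hits)
    return len(hits)  # number of matching lines written
```

Calling write_googlebot_hits() in the directory that holds access_log leaves a google.txt behind, just as the grep command does.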

iain_bspin
13-05-2003, 07:07 AM
I can't be the only one sitting here wondering WTF this fraggle (not even worthy of being a muppet) has actually posted this here.

Get the post clippers out, guys ;)

eskimo
13-05-2003, 09:16 AM
Found this interesting while looking for a google logger :