Simon Willison’s Weblog

Google conspiracy theories

Microdoc News have a poorly researched story suggesting that Google have been engineering their search results to favour their own properties:

It could be argued that the most important site that should appear when searching for the word blogs would be the generic site where anyone with a blog can get listed for her/his three minutes of fame, which includes any blog, anywhere in any system. Weblogs.com is a directory of sorts to any current post and is like, if you please, a central nervous system to the world of blogs. However, Google does not list weblogs.com as the primary site -- blogger.com is listed as the prime, first-up site in the listings that result from the blogs search. Is that because Google Inc., owns blogger.com, or is it that blogger.com is really what one would expect as the first result?

Time to break out Python again. I won’t explain the following code in detail, but essentially it downloads the HTML source of the front pages of both Blogger.com and Weblogs.com, strips out the HTML tags (defined as anything between two angle brackets) and counts the number of occurrences of the individual word ’blogs’.

>>> import urllib, re
>>> striptags = re.compile('<[^>]+>')
>>> blogs = re.compile(r'\bblogs\b', re.I)
>>> blogger = urllib.urlopen('http://www.blogger.com/').read()
>>> weblogs = urllib.urlopen('http://www.weblogs.com/').read()
>>> len(blogger), len(weblogs)
(26369, 394323)
>>> blogs.findall(striptags.sub('', blogger))
['blogs', 'blogs', 'blogs', 'Blogs']
>>> blogs.findall(striptags.sub('', weblogs))
['Blogs', 'Blogs', 'blogs']
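
The session above uses the Python 2 `urllib.urlopen` call. For anyone rerunning the experiment on Python 3, here is a minimal sketch of the same tag-stripping and counting idea. The `fetch` and `count_blogs` helpers, the UTF-8 decoding and the sample markup are all assumptions for illustration, not part of the original session, and the sample is deliberately not the real front page of either site.

```python
import re
from urllib.request import urlopen  # Python 3 home of urlopen

# The same two patterns as in the original session.
striptags = re.compile(r'<[^>]+>')
blogs = re.compile(r'\bblogs\b', re.I)

def fetch(url):
    """Download a page and decode it (assumes UTF-8; read() returns bytes on Python 3)."""
    return urlopen(url).read().decode('utf-8', errors='replace')

def count_blogs(html):
    """Strip anything between angle brackets, then list whole-word matches of 'blogs'."""
    return blogs.findall(striptags.sub('', html))

# Hypothetical sample markup so the sketch runs without network access.
sample = (
    '<h1>Blogs</h1>\n'
    '<p>Start a blog. Read other <a href="/blogs">blogs</a>. Weblogs welcome.</p>'
)
print(count_blogs(sample))
```

Note that 'Weblogs' does not count as a match, because `\b` requires a word boundary before the 'b', and the '/blogs' inside the `href` attribute disappears along with the tag.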

The above code shows that while Blogger.com mentions the word ’blogs’ four times in 26,000 characters, Weblogs.com only mentions it three times in 394,000 characters! Per character, Blogger’s ’blogs’ word density is roughly 20 times higher. In fact, the only occurrences of the word on Weblogs.com are when it happens to be part of the name of one of the several thousand blogs listed on the page at any one time.
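
To put a number on that density gap, here is a small sketch using only the figures from the interpreter session. The `per_10k` helper is a made-up name, and the lengths are those of the raw HTML (tags included), since that is what `len()` measured above.

```python
# (occurrences of 'blogs', page length in characters), copied from the session.
pages = {
    'Blogger.com': (4, 26369),
    'Weblogs.com': (3, 394323),
}

def per_10k(count, length):
    """Occurrences per 10,000 characters of raw HTML."""
    return count / length * 10000

density = {name: per_10k(count, length) for name, (count, length) in pages.items()}
ratio = density['Blogger.com'] / density['Weblogs.com']

for name, value in density.items():
    print(f'{name}: {value:.3f} per 10,000 characters')
print(f'ratio: {ratio:.1f}')
```

That works out to about 1.5 occurrences per 10,000 characters for Blogger.com against under 0.1 for Weblogs.com, a ratio of roughly 20 to 1.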

Although word density is a reasonably useful metric for telling if Google will like something, everyone knows that Google’s secret sauce is PageRank, which is based in part on the number of pages linking to a site. Two quick link: searches reveal 7,840 links to Weblogs.com, but a whopping 61,500 links to Blogger.com (no doubt helped by all those “powered by blogger” stickers).

So Blogger.com not only has a higher word density for the designated search term, it also has far more links to it overall. Is it really so surprising that it’s coming out on top?

Furthermore, if you run a search for ’weblogs’, Weblogs.com comes out as the number one result. It’s all in the name.

Dave Winer finds it strange that the Google Weblog (unaffiliated with Google the company) comes out as the first result in a search for ’weblog’. My guess is that this is a result of the blog’s name influencing the text of links made to it: when you link to Doc Searls or me (both of us have ’weblog’ in our site titles) you can abbreviate it to “Doc Searls” or “Simon Willison”, but when you link to the Google Weblog you have to use the fully qualified name or your link won’t make sense. Google can be strongly affected by link text, as last year’s Google bombing epidemic aptly demonstrated.
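
The link text guess can be illustrated with a toy tally of the words people use when linking to a site. This is only a sketch of the idea, not a description of how Google actually weighs link text; the `AnchorTextCollector` class, the `anchor_words` helper, the example.com URLs and the sample pages are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

class AnchorTextCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.current_href = None
        self.buffer = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.current_href = dict(attrs).get('href')
            self.buffer = []

    def handle_data(self, data):
        # Only keep text that sits inside a link with an href.
        if self.current_href is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.current_href is not None:
            self.links.append((self.current_href, ''.join(self.buffer).strip()))
            self.current_href = None

def anchor_words(pages, target):
    """Count the lowercased words used in link text pointing at target."""
    counts = Counter()
    for html in pages:
        parser = AnchorTextCollector()
        parser.feed(html)
        parser.close()
        for href, text in parser.links:
            if href == target:
                counts.update(text.lower().split())
    return counts

# Made-up linking pages: the Google Weblog gets its full name, a person just gets a name.
pages = [
    '<p>Read <a href="http://example.com/google-weblog">the Google Weblog</a> today.</p>',
    '<p>Via <a href="http://example.com/google-weblog">Google Weblog</a>.</p>',
    '<p>See <a href="http://example.com/simon">Simon Willison</a> on this.</p>',
]
print(anchor_words(pages, 'http://example.com/google-weblog'))
print(anchor_words(pages, 'http://example.com/simon'))
```

In this made-up sample, the word 'weblog' appears in every link to the Google Weblog and in none of the links to the personal site, even if that site has 'weblog' in its title.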

This is Google conspiracy theories by Simon Willison, posted on 17th September 2003.

Previously hosted at http://simon.incutio.com/archive/2003/09/17/googleConspiracies