Simon Willison’s Weblog

Google conspiracy theories

Microdoc News have a poorly researched story suggesting that Google have been engineering their search results to favour their own properties:

It could be argued that the most important site that should appear when searching for the word blogs would be the generic site where anyone with a blog can get listed for her/his three minutes of fame, which includes any blog, anywhere in any system. Weblogs.com is a directory of sorts to any current post and is like, if you please, a central nervous system to the world of blogs. However, Google does not list weblogs.com as the primary site -- blogger.com is listed as the prime, first-up site in the listings that result from the blogs search. Is that because Google Inc., owns blogger.com, or is it that blogger.com is really what one would expect as the first result?

Time to break out Python again. I won’t explain the following code in detail, but essentially it downloads the HTML source of the front pages of both Blogger.com and Weblogs.com, strips out the HTML tags (defined as anything between two angle brackets) and counts the number of occurrences of the individual word ’blogs’.

>>> import urllib, re
>>> striptags = re.compile('<[^>]+>')
>>> blogs = re.compile(r'\bblogs\b', re.I)
>>> blogger = urllib.urlopen('http://www.blogger.com/').read()
>>> weblogs = urllib.urlopen('http://www.weblogs.com/').read()
>>> len(blogger), len(weblogs)
(26369, 394323)
>>> blogs.findall(striptags.sub('', blogger))
['blogs', 'blogs', 'blogs', 'Blogs']
>>> blogs.findall(striptags.sub('', weblogs))
['Blogs', 'Blogs', 'blogs']

The above code shows that while Blogger.com mentions the word ’blogs’ four times in 26,000 characters, Weblogs.com only mentions it three times in 394,000 characters! Blogger has a far higher ’blogs’ word density. In fact, the only occurrences of the word on Weblogs.com come when it happens to be part of the name of one of the several thousand blogs listed on the page at any one time.
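To put a number on that comparison, here is a minimal Python 3 sketch (the session above is Python 2, where `urllib.urlopen` still exists). The helper name `word_density` and the "per 1,000 characters of raw HTML" unit are my own choices for illustration; the counts and page sizes are simply the figures from the session above.

```python
import re

STRIP_TAGS = re.compile(r'<[^>]+>')


def word_density(html, word):
    """Return (count, occurrences per 1,000 characters of raw HTML).

    Hypothetical helper: tags are stripped before counting, but the
    density is measured against the raw page length, as in the session.
    """
    if not html:
        return 0, 0.0
    pattern = re.compile(r'\b%s\b' % re.escape(word), re.I)
    count = len(pattern.findall(STRIP_TAGS.sub('', html)))
    return count, 1000.0 * count / len(html)


# Densities implied by the figures recorded in the session above.
blogger_density = 1000.0 * 4 / 26369
weblogs_density = 1000.0 * 3 / 394323
ratio = blogger_density / weblogs_density

print(blogger_density, weblogs_density, ratio)
```

On those figures Blogger.com's density works out at roughly 20 times that of Weblogs.com. To run `word_density` against live pages in Python 3 you would fetch them with `urllib.request.urlopen(url).read()` and decode the bytes to text first.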

Although word density is a reasonably useful metric for telling if Google will like something, everyone knows that Google’s secret sauce is PageRank, which is based in part on the number of pages linking to a site. Two quick link: searches reveal 7,840 links to Weblogs.com, but a whopping 61,500 links to Blogger.com (no doubt helped by all those “powered by blogger” stickers).
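For anyone who hasn't seen how inbound links turn into a score, here is a toy sketch of PageRank as described in the original published paper. It is emphatically not Google's production ranking, which isn't public. The five-page "web" below is made up, and the damping factor of 0.85 and the handling of pages without outbound links are common textbook choices that I'm assuming here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical mini-web: three blogs link to 'blogger', one to 'weblogs'.
toy_web = {
    'blog-a': ['blogger', 'weblogs'],
    'blog-b': ['blogger'],
    'blog-c': ['blogger'],
    'blogger': [],
    'weblogs': [],
}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, ranks[page])
```

In this toy graph the page with more inbound links ends up with the higher score, which is the effect all those “powered by blogger” stickers would have at scale.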

So Blogger.com not only has a higher word density for the designated search term, it also has far more links to it overall. Is it really so surprising that it’s coming out on top?

Furthermore, if you run a search for ’weblogs’, Weblogs.com comes out as the number one result. It’s all in the name.

Dave Winer finds it strange that the Google Weblog (unaffiliated with Google the company) comes out as the first result in a search for ’weblog’. My guess is that this is a result of the blog’s name influencing the text of links made to it—when you link to Doc Searls or myself (both of whom have ’weblog’ in their site title) you can abbreviate it to “Doc Searls” or “Simon Willison”, but when you link to the Google Weblog you have to use the fully qualified name or your link won’t make sense. Google can be strongly affected by link text, as last year’s Google bombing epidemic aptly demonstrated.
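The link text guess is easy to illustrate. The sketch below uses Python 3's standard `html.parser` module to tally the words used in the text of links pointing at each URL. The class name `AnchorTextCollector`, the sample markup and the `.example` URLs are all invented for the demonstration; how Google actually weighs link text is not something I know.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect the words used in link text, grouped by link target."""

    def __init__(self):
        super().__init__()
        self.anchor_words = defaultdict(Counter)
        self._href = None  # target of the <a> we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')

    def handle_endtag(self, tag):
        if tag == 'a':
            self._href = None

    def handle_data(self, data):
        # Only count text that appears inside a link with an href.
        if self._href:
            self.anchor_words[self._href].update(data.lower().split())


# Made-up markup: people shorten personal blog names but not "Google Weblog".
sample = (
    '<p>Read <a href="http://doc.example/">Doc Searls</a>, '
    '<a href="http://simon.example/">Simon Willison</a> and the '
    '<a href="http://google-weblog.example/">Google Weblog</a>.</p>'
    '<p>More from the '
    '<a href="http://google-weblog.example/">Google Weblog</a>.</p>'
)

collector = AnchorTextCollector()
collector.feed(sample)
collector.close()

for href, words in sorted(collector.anchor_words.items()):
    print(href, words['weblog'])
```

Only the links to the site whose name can't be abbreviated carry the word ’weblog’ in their text, so only that site picks up any link text credit for it.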

This is Google conspiracy theories by Simon Willison, posted on 17th September 2003.

7 comments

  1. Poorly researched, indeed. It's cute that his myopic argument is built around just two sites: 'weblogs.com vs. blogger.com'. Cuter still that the site in question appears to be run on Manila/Radio/whatever. And just plain adorable that half the links in his blogroll point to sites on weblogs.com, userland.com, manilasites.com, and salon.com.

    (All domains referenced apart from blogger.com are powered by UserLand software.)

    Dave S. - 17th September 2003 01:09 - #

  2. • Blogger.com has a pagerank of 8
     • Weblog.com has a pagerank of 7

    (according to google directory listings)

    Tim Parkin - 17th September 2003 09:30 - #

  3. My friend Brian said it best: why does anyone still think that single keyword searches return anything of value?

    Mark Morgan - 17th September 2003 18:22 - #

  4. Does anyone even link to weblogs.com in the "This is a worthwhile site to look at on a regular basis" sense?

    Sure, every blog software in the world pings weblogs.com - but many also ping blo.gs, and at least that has features that make it useable to the public, rather than just a very flat list of blogs that moves too fast to be terribly useful.

    I would imagine that's hindering the placement of weblogs.com more than anything else.

    Dan Dickinson - 18th September 2003 19:48 - #

  5. Hi, Google are coming in for a lot of criticism at the moment for the results that they are producing. They are using an artificial intelligence engine, roughly bound by semantics (relationships of words), to try to guess the intent of the searcher. Combine this with a flawed PageRank system (well, not flawed, but one that can be manipulated) and you get some crazy results. There is also a growing belief, as is mentioned here, that they may alter the results their own algorithm produces in order to deliver their own content first. David

    David Cowman Spain - 27th March 2004 13:38 - #

  6. I discovered a good web page about Google's collusion with the US military. The page made me stop and think. Check it out. http://fire.prohosting.com/beobee/google_conspiracy.htm

    Bob Lamdril - 2nd July 2004 19:56 - #

  7. Bob's "good web page about Google's collusion with the US military" states at its end: "This story, of course, is a work of fiction -- written to entertain you..."

    Sy Ali - 5th January 2005 11:03 - #

Previously hosted at http://simon.incutio.com/archive/2003/09/17/googleConspiracies

A django site