Google conspiracy theories
17th September 2003
Microdoc News have a poorly researched story suggesting that Google have been engineering their search results to favour their own properties:
It could be argued that the most important site that should appear when searching for the word blogs would be the generic site where anyone with a blog can get listed for her/his three minutes of fame, which includes any blog, anywhere in any system. Weblogs.com is a directory of sorts to any current post and is like, if you please, a central nervous system to the world of blogs. However, Google does not list weblogs.com as the primary site -- blogger.com is listed as the prime, first-up site in the listings that result from the blogs search. Is that because Google Inc., owns blogger.com, or is it that blogger.com is really what one would expect as the first result?
Time to break out Python again. I won’t explain the following code in detail, but essentially it downloads the HTML source of the front pages of both Blogger.com and Weblogs.com, strips out the HTML tags (defined as anything between two angle brackets) and counts the number of occurrences of the individual word ’blogs’.
>>> import urllib, re
>>> striptags = re.compile('<[^>]+>')
>>> blogs = re.compile(r'\bblogs\b', re.I)
>>> blogger = urllib.urlopen('http://www.blogger.com/').read()
>>> weblogs = urllib.urlopen('http://www.weblogs.com/').read()
>>> len(blogger), len(weblogs)
(26369, 394323)
>>> blogs.findall(striptags.sub('', blogger))
['blogs', 'blogs', 'blogs', 'Blogs']
>>> blogs.findall(striptags.sub('', weblogs))
['Blogs', 'Blogs', 'blogs']
The above code shows that while Blogger.com mentions the word ’blogs’ four times in 26,000 characters, Weblogs.com only mentions it three times in 394,000 characters! Blogger has a far higher ’blogs’ word density. In fact, the only occurrences of the word on Weblogs.com are when it happens to be part of the name of one of the several thousand blogs listed on the page at any one time.
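To put a rough number on that gap, here is a minimal sketch that turns the counts from the session above into occurrences per 1,000 characters. The helper name `per_thousand_chars` is my own invention for illustration, and the lengths are those of the raw HTML (tags included), so treat the result as a ballpark figure rather than a true density over visible text.

```python
def per_thousand_chars(occurrences, total_chars):
    """Occurrences of a term per 1,000 characters of page source."""
    return 1000.0 * occurrences / total_chars

# Figures copied from the interpreter session above:
# match counts for 'blogs' and the length of each page's raw HTML.
blogger_density = per_thousand_chars(4, 26369)
weblogs_density = per_thousand_chars(3, 394323)
ratio = blogger_density / weblogs_density

print("Blogger.com: %.4f per 1,000 characters" % blogger_density)
print("Weblogs.com: %.4f per 1,000 characters" % weblogs_density)
print("Blogger.com is denser by a factor of about %.0f" % ratio)
```

On these figures Blogger.com comes out roughly twenty times denser for the word ’blogs’.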
Although word density is a reasonably useful metric for telling if Google will like something, everyone knows that Google’s secret sauce is PageRank, which is based in part on the number of pages linking to a site. Two quick link: searches reveal 7,840 links to Weblogs.com, but a whopping 61,500 links to Blogger.com (no doubt helped by all those “powered by blogger” stickers).
So Blogger.com not only has a higher word density for the designated search term, it also has far more links to it overall. Is it really so surprising that it’s coming out on top?
Furthermore, if you run a search for ’weblogs’, Weblogs.com comes out as the number one result. It’s all in the name.
Dave Winer finds it strange that the Google Weblog (unaffiliated with Google the company) comes out as the first result in a search for ’weblog’. My guess is that this is a result of the blog’s name influencing the text of links made to it—when you link to Doc Searls or myself (both of whom have ’weblog’ in their site title) you can abbreviate it to “Doc Searls” or “Simon Willison”, but when you link to the Google Weblog you have to use the fully qualified name or your link won’t make sense. Google can be strongly affected by link text, as last year’s Google bombing epidemic aptly demonstrated.
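The link text effect is easy to picture with a small sketch. The following code (written for Python 3) pulls the text out of each link in a snippet of HTML and counts which targets are linked with the word ’weblog’. The `AnchorTextCollector` class, the sample HTML and the example.com URLs are all made up for illustration. This is only a guess at the kind of signal involved, not a description of how Google actually processes link text.

```python
from collections import Counter
from html.parser import HTMLParser

class AnchorTextCollector(HTMLParser):
    """Collect (href, link text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.current_href = None  # href of the link we are inside, if any
        self.current_text = []    # text fragments seen inside that link
        self.links = []           # finished (href, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.current_href = href
                self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            # Collapse runs of whitespace in the link text.
            text = " ".join("".join(self.current_text).split())
            self.links.append((self.current_href, text))
            self.current_href = None

# Hypothetical page: two personal sites linked by author name only,
# and one site whose name cannot be shortened without losing 'Weblog'.
sample = """
<p>Reading <a href="http://simon.example.com/">Simon Willison</a> and
<a href="http://doc.example.com/">Doc Searls</a>, plus the
<a href="http://googleweblog.example.com/">Google Weblog</a>.</p>
"""

collector = AnchorTextCollector()
collector.feed(sample)
collector.close()

# Count how often each target is linked with 'weblog' in the link text.
weblog_mentions = Counter()
for href, text in collector.links:
    if "weblog" in text.lower():
        weblog_mentions[href] += 1

print(collector.links)
print(weblog_mentions)
```

In this made-up sample only the Google Weblog link carries the word ’weblog’ in its text, even though all three sites might have it in their titles.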