Simon Willison’s Weblog

Subscribe

3 items tagged “crawling”

2018

Googlebot’s Javascript random() function is deterministic. random() as executed by Googlebot returns the same predicable sequence. More interestingly, Googlebot runs a much faster timer for setTimeout and setInterval—as Tom Anthony points out, “Why actually wait 5 seconds when you are a bot?” # 7th February 2018, 2:41 am

2009

Official Google Webmaster Blog: A proposal for making AJAX crawlable. It’s horrible! The Google crawler would map url#!state to url?_escaped_fragment_=state, then expect your site to provide rendered HTML that reflects that state (they even go as far as to suggest running a headless browser within your web server to do this). Just stick to progressive enhancement instead, it’s far less hideous. It looks like the proposal may have originated with the GWT team. # 8th October 2009, 5:52 pm

2007

robots.txt Adventure. Interesting notes from crawling 4.6 million robots.txt, including 69 different ways in which the word “disallow” can be mis-spelled. # 22nd September 2007, 12:36 am