Original URL: http://www.theregister.co.uk/2008/04/14/google_crawls_html_forms/
As part of an ongoing effort to index the so-called Invisible Web, Google's automated crawlers are now toying with HTML forms. But only on certain "high-quality sites."
"In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn't find and index for users who search on Google," Googlers Jayant Madhavan and Alon Halevy have told the world from the company's Webmaster Central Blog (http://googlewebmastercentral.blogspot.com/2008/04/crawling-through-html-forms.html). "Specifically, when we encounter a 'Form' element on a high-quality site, we might choose to do a small number of queries using the form."
In essence, Googlebots are plugging data into such forms - much as an ordinary web surfer would. This generates a new page, and then the bots crawl that page. "For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML," Madhavan and Halevy continue. "Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made.
"If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page."
Of course, they insist these bots would never do evil. "Needless to say, this experiment follows good Internet citizenry practices. Only a small number of particularly useful sites receive this treatment, and our crawl agent, the ever-friendly Googlebot, always adheres to robots.txt, nofollow, and noindex directives. That means that if a search form is forbidden in robots.txt, we won't crawl any of the URLs that a form would generate."
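For the curious, the URL-generation step Madhavan and Halevy describe can be approximated in a few lines of Python. What follows is a rough sketch under stated assumptions, not Google's actual code: the names `FormParser` and `candidate_urls`, the `limit` cap, and the list of "site words" are all invented for illustration, and the form is assumed to submit via GET so that each combination of values maps to a crawlable URL.

```python
from html.parser import HTMLParser
from itertools import islice, product
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Collect the action and the fillable fields of the forms on a page.

    Simplification: all fields are pooled, so this suits pages with one form.
    """

    def __init__(self):
        super().__init__()
        self.action = ""
        self.fields = {}          # field name -> values offered by the HTML
        self.text_fields = set()  # names of free-text inputs
        self._select = None       # name of the <select> being parsed, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        name = a.get("name")
        if tag == "form":
            self.action = a.get("action") or ""
        elif tag == "input" and name:
            kind = (a.get("type") or "text").lower()
            if kind == "text":
                self.text_fields.add(name)
                self.fields.setdefault(name, [])
            elif kind in ("radio", "checkbox") and a.get("value"):
                self.fields.setdefault(name, []).append(a["value"])
        elif tag == "select" and name:
            self._select = name
            self.fields.setdefault(name, [])
        elif tag == "option" and self._select and a.get("value"):
            self.fields[self._select].append(a["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def candidate_urls(page_url, html, site_words, limit=5):
    """Return up to `limit` URLs that a GET submission of the form could produce.

    Text boxes are filled with words taken from the site (hypothetical input);
    every other field uses the values listed in the HTML itself.
    """
    parser = FormParser()
    parser.feed(html)
    names = list(parser.fields)
    if not names:
        return []
    choices = [
        site_words if n in parser.text_fields else parser.fields[n]
        for n in names
    ]
    base = urljoin(page_url, parser.action)
    # "A small number of queries": stop after `limit` combinations.
    combos = islice(product(*choices), limit)
    return [base + "?" + urlencode(dict(zip(names, c))) for c in combos]
```

Feeding `candidate_urls` a page containing a search box and a category menu yields a handful of query URLs, one per combination of word and menu value. How Google actually picks its words, and how it judges the resulting pages, is not something the blog post spells out.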
This is all part of Google's plan to index stuff it's never indexed before. "HTML forms have long been thought to be the gateway to large volumes of data beyond the normal scope of search engines," the blog post concludes. "The terms Deep Web, Hidden Web, or Invisible Web have been used collectively to refer to such content that has so far been invisible to search engine users. By crawling using HTML forms (and abiding by robots.txt), we are able to lead search engine users to documents that would otherwise not be easily found in search engines, and provide webmasters and users alike with a better and more comprehensive search experience."
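The "abiding by robots.txt" part is the easiest piece to illustrate. Here is a minimal sketch using Python's standard `urllib.robotparser` module. The helper name `allowed_urls` and the sample rules are assumptions made for the example, and it covers only robots.txt, not the nofollow and noindex directives the Googlers also mention.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, user_agent="Googlebot"):
    """Keep only the URLs that the given robots.txt text permits for user_agent."""
    rules = RobotFileParser()
    # Parse the rules from a string so the example needs no network access.
    rules.parse(robots_txt.splitlines())
    return [u for u in urls if rules.can_fetch(user_agent, u)]


# Hypothetical site that forbids crawling of its search form's results.
ROBOTS_TXT = "User-agent: *\nDisallow: /search\n"

if __name__ == "__main__":
    generated = [
        "http://example.com/search?q=crawler",
        "http://example.com/about",
    ]
    print(allowed_urls(ROBOTS_TXT, generated))
```

With a rule like `Disallow: /search` in place, any form-generated URL under that path is dropped before it is ever fetched, which is the behavior the blog post promises.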
According to a blog post (http://anand.typepad.com/datawocky/2008/04/the-story-behin.html) from a researcher who once worked with Halevy on similar technology, Google's new form-happy bots grew out of work done by Transformic, a company Google acquired back in 2005. "The Transformic team have been working hard for the past two years perfecting the technology and integrating it into the Google crawler," writes Anand Rajaraman. ®