So a little while ago (at 19:56:40) I posted a tweet containing just a single url. Within seconds, I had these bots requesting the file:
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:43 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:43 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
No rDNS, but it is NTT IP space, who are who twitter hosts with, so could well be some official url scraping bot. Forgot to get my robots.txt, though.
::ffff:66.249.67.72 - - [22/Jan/2011:19:56:43 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 143 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Google twitter search foo. Probably already had my robots.txt.
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:43 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:43 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
Twitter again.
::ffff:65.52.2.212 - - [22/Jan/2011:19:56:44 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 199 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
Microsoft space, no rDNS, maybe Bing with a fake UA? No robots.txt retrieval.
::ffff:216.52.242.14 - - [22/Jan/2011:19:56:44 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com)"
Linkedin, obviously - not sure if they scrape all urls or just the ones on feeds people have listed on their profiles (which I have). No robots.txt retrieval.
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:44 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:45 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
Twitter again.
::ffff:67.195.115.246 - - [22/Jan/2011:19:56:46 +1100] "GET /robots.txt HTTP/1.0" 200 146 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
::ffff:67.195.115.246 - - [22/Jan/2011:19:56:46 +1100] "GET /what_loads_twitter_urls HTTP/1.0" 404 169 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
The very polite and honest yahoo bot.
::ffff:38.113.234.181 - - [22/Jan/2011:19:56:47 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "Voyager/1.0"
Some social media search company called ‘kosmix’. Forgot to ask for robots.txt.
::ffff:65.52.17.79 - - [22/Jan/2011:19:56:47 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 199 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
Another MS bot, maybe bing again.
::ffff:89.151.116.52 - - [22/Jan/2011:19:56:48 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 529 "-" "Mozilla/5.0 (compatible; MSIE 6.0b; Windows NT 5.0) Gecko/2009011913 Firefox/3.0.6 TweetmemeBot"
No idea. Reverses to bearpub.favsys.net, but (www.)favsys.net has no working web server. IP space belongs to some random dedicated server company. Possibly related to tweetmeme.com, some blah blah twitter trend thing.
::ffff:72.30.161.218 - - [22/Jan/2011:19:56:48 +1100] "GET /robots.txt HTTP/1.0" 200 146 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
::ffff:72.30.161.218 - - [22/Jan/2011:19:56:48 +1100] "GET /what_loads_twitter_urls HTTP/1.0" 404 169 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
Yahoo again.
::ffff:50.16.239.114 - - [22/Jan/2011:19:56:52 +1100] "GET /robots.txt HTTP/1.1" 200 146 "-" "Mozilla/5.0 (compatible; Birubot/1.0) Gecko/2009032608 Firefox/3.0.8"
::ffff:50.16.239.114 - - [22/Jan/2011:19:56:52 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Birubot/1.0) Gecko/2009032608 Firefox/3.0.8"
::ffff:50.16.239.114 - - [22/Jan/2011:19:56:52 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Birubot/1.0) Gecko/2009032608 Firefox/3.0.8"
Some ec2-hosted link trawling thing I guess.
::ffff:216.52.242.14 - - [22/Jan/2011:19:56:52 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com)"
Linkedin again.
::ffff:184.73.156.250 - - [22/Jan/2011:19:57:02 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Firefox"
::ffff:184.73.156.250 - - [22/Jan/2011:19:57:02 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Firefox"
Random ec2 customer with a fake User-agent.
2620:0:10c0:1002:a800:1ff:fe00:11fe - - [22/Jan/2011:19:57:22 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "@hourlypress"
Google ipv6 space! Given the user-agent and lack of rDNS, maybe it’s some software running on appengine?
::ffff:74.112.128.61 - - [22/Jan/2011:19:57:24 +1100] "GET /robots.txt HTTP/1.0" 200 146 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
::ffff:74.112.128.62 - - [22/Jan/2011:19:57:24 +1100] "GET /what_loads_twitter_urls HTTP/1.0" 404 169 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
Some social web search engine startup I guess.
::ffff:75.101.170.136 - - [22/Jan/2011:19:57:51 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "PycURL/7.18.2"
Some random python ec2-using bot that forgot to change their U-A or get my robots.txt.