What Loads URLs on Twitter

2011-01-22 #twitter #url #web

So a little while ago (at 19:56:40) I posted a tweet containing just a single URL. Within seconds, I had these bots requesting the file:

::ffff:128.242.241.122 - - [22/Jan/2011:19:56:43 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:43 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"

No rDNS, but it is NTT IP space, and NTT is who Twitter hosts with, so it could well be some official URL-scraping bot. Forgot to get my robots.txt, though.
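For anyone wanting to repeat the rDNS checks, here is a minimal sketch of how they could be scripted. It assumes the log addresses are IPv4-mapped IPv6 (the `::ffff:` prefix) that need unwrapping first. The helper names `unmap` and `reverse_dns` are made up for illustration, and the `lookup` parameter exists only so the function can be exercised without touching the network.

```python
import ipaddress
import socket


def unmap(addr):
    """Return the plain IPv4 form of an IPv4-mapped IPv6 address such as ::ffff:1.2.3.4."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 6 and ip.ipv4_mapped is not None:
        return str(ip.ipv4_mapped)
    return str(ip)


def reverse_dns(addr, lookup=socket.gethostbyaddr):
    """Return the reverse-DNS hostname for addr, or None when there is no rDNS."""
    try:
        hostname, _aliases, _addresses = lookup(unmap(addr))
        return hostname
    except (socket.herror, socket.gaierror):
        # No PTR record, or the resolver could not answer.
        return None


if __name__ == "__main__":
    # Addresses taken from the log excerpts in this post.
    for client in ("::ffff:128.242.241.122", "::ffff:89.151.116.52"):
        print(client, "->", reverse_dns(client))
```

Results will differ from what I saw at the time, since PTR records and IP allocations change.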

::ffff:66.249.67.72 - - [22/Jan/2011:19:56:43 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 143 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Google Twitter search foo. Probably already had my robots.txt.

::ffff:128.242.241.122 - - [22/Jan/2011:19:56:43 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:43 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"

Twitter again.

::ffff:65.52.2.212 - - [22/Jan/2011:19:56:44 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 199 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"

Microsoft space, no rDNS, maybe Bing with a fake UA? No robots.txt retrieval.

::ffff:216.52.242.14 - - [22/Jan/2011:19:56:44 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com)"

LinkedIn, obviously. Not sure if they scrape all URLs or just the ones on feeds people have listed on their profiles (which I have). No robots.txt retrieval.

::ffff:128.242.241.122 - - [22/Jan/2011:19:56:44 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:45 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"

Twitter again.

::ffff:67.195.115.246 - - [22/Jan/2011:19:56:46 +1100] "GET /robots.txt HTTP/1.0" 200 146 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
::ffff:67.195.115.246 - - [22/Jan/2011:19:56:46 +1100] "GET /what_loads_twitter_urls HTTP/1.0" 404 169 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

The very polite and honest Yahoo bot.
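"Polite" here just means asking for robots.txt before the page and then obeying it. A rough sketch of that check using Python's standard `urllib.robotparser` follows. The robots.txt body is a hypothetical stand-in, since the real 146-byte file isn't shown in this post, and the `allowed` helper is an illustrative name, not anything Yahoo actually runs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; the real file served by this site is not shown here.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(allowed("Yahoo! Slurp", "/what_loads_twitter_urls"))
    print(allowed("Yahoo! Slurp", "/private/notes"))
```

A well-behaved crawler would fetch /robots.txt first, run a check like this, and only then issue the GET for the page.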

::ffff:38.113.234.181 - - [22/Jan/2011:19:56:47 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "Voyager/1.0"

Some social media search company called ‘Kosmix’. Forgot to ask for robots.txt.

::ffff:65.52.17.79 - - [22/Jan/2011:19:56:47 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 199 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"

Another MS bot, maybe Bing again.

::ffff:89.151.116.52 - - [22/Jan/2011:19:56:48 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 529 "-" "Mozilla/5.0 (compatible; MSIE 6.0b; Windows NT 5.0) Gecko/2009011913 Firefox/3.0.6 TweetmemeBot"

No idea. Reverses to bearpub.favsys.net, but www.favsys.net has no working web server. The IP space belongs to some random dedicated server company. Possibly related to tweetmeme.com, some blah blah Twitter trend thing.

::ffff:72.30.161.218 - - [22/Jan/2011:19:56:48 +1100] "GET /robots.txt HTTP/1.0" 200 146 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
::ffff:72.30.161.218 - - [22/Jan/2011:19:56:48 +1100] "GET /what_loads_twitter_urls HTTP/1.0" 404 169 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

Yahoo again.

::ffff:50.16.239.114 - - [22/Jan/2011:19:56:52 +1100] "GET /robots.txt HTTP/1.1" 200 146 "-" "Mozilla/5.0 (compatible; Birubot/1.0) Gecko/2009032608 Firefox/3.0.8"
::ffff:50.16.239.114 - - [22/Jan/2011:19:56:52 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Birubot/1.0) Gecko/2009032608 Firefox/3.0.8"
::ffff:50.16.239.114 - - [22/Jan/2011:19:56:52 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Birubot/1.0) Gecko/2009032608 Firefox/3.0.8"

Some EC2-hosted link trawling thing I guess.

::ffff:216.52.242.14 - - [22/Jan/2011:19:56:52 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com)"

LinkedIn again.

::ffff:184.73.156.250 - - [22/Jan/2011:19:57:02 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Firefox"
::ffff:184.73.156.250 - - [22/Jan/2011:19:57:02 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Firefox"

Random EC2 customer with a fake User-Agent.

2620:0:10c0:1002:a800:1ff:fe00:11fe - - [22/Jan/2011:19:57:22 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "@hourlypress"

Google IPv6 space! Given the user-agent and lack of rDNS, maybe it’s some software running on App Engine?

::ffff:74.112.128.61 - - [22/Jan/2011:19:57:24 +1100] "GET /robots.txt HTTP/1.0" 200 146 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
::ffff:74.112.128.62 - - [22/Jan/2011:19:57:24 +1100] "GET /what_loads_twitter_urls HTTP/1.0" 404 169 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"

Topsy, going by the user-agent. Some social web search engine startup I guess.

::ffff:75.101.170.136 - - [22/Jan/2011:19:57:51 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "PycURL/7.18.2"

Some random Python EC2-using bot that forgot to change their U-A or get my robots.txt.
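The eyeballing above could be automated. Below is a rough sketch that parses combined-format log lines like the ones in this post and reports, per user-agent, which methods it used and whether it ever asked for robots.txt. The regex and the names `LOG_RE`, `parse_line`, and `summarize` are my own assumptions for illustration, not part of any server or library. It groups by user-agent rather than IP because some bots (Yahoo, and the Butterfly one) come from more than one address.

```python
import re
import sys
from collections import defaultdict

# Assumed layout: combined log format, as in the excerpts above.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Map each user-agent to its hit count, HTTP methods, and robots.txt behaviour."""
    summary = defaultdict(lambda: {"hits": 0, "methods": set(), "robots": False})
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip commentary or malformed lines
        entry = summary[record["agent"]]
        entry["hits"] += 1
        entry["methods"].add(record["method"])
        if record["path"] == "/robots.txt":
            entry["robots"] = True
    return dict(summary)


if __name__ == "__main__":
    # Usage: python summarize_bots.py < access.log
    for agent, info in summarize(sys.stdin).items():
        polite = "fetched robots.txt" if info["robots"] else "no robots.txt"
        print(f'{info["hits"]:3d}  {",".join(sorted(info["methods"]))}  {polite}  {agent}')
```

Note that this only says whether robots.txt was requested somewhere in the input. A bot like Googlebot that cached it earlier would still show up as "no robots.txt" for a short log window.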