What loads URLs posted to twitter?

So a little while ago (at 19:56:40) I posted a tweet containing just a single url. Within seconds, I had these bots requesting the file:

::ffff:128.242.241.122 - - [22/Jan/2011:19:56:43 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:43 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"

No rDNS, but it is NTT IP space, which is who twitter hosts with, so it could well be some official url-scraping bot. It forgot to get my robots.txt, though.

::ffff:66.249.67.72 - - [22/Jan/2011:19:56:43 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 143 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Google twitter search foo. Probably already had my robots.txt.

::ffff:128.242.241.122 - - [22/Jan/2011:19:56:43 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:43 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"

Twitter again.

::ffff:65.52.2.212 - - [22/Jan/2011:19:56:44 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 199 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"

Microsoft space, no rDNS, maybe Bing with a fake UA? No robots.txt retrieval.

::ffff:216.52.242.14 - - [22/Jan/2011:19:56:44 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com)"

LinkedIn, obviously - not sure if they scrape all urls or just the ones on feeds people have listed on their profiles (which I have). No robots.txt retrieval.

::ffff:128.242.241.122 - - [22/Jan/2011:19:56:44 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:45 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"

Twitter again.

::ffff:67.195.115.246 - - [22/Jan/2011:19:56:46 +1100] "GET /robots.txt HTTP/1.0" 200 146 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
::ffff:67.195.115.246 - - [22/Jan/2011:19:56:46 +1100] "GET /what_loads_twitter_urls HTTP/1.0" 404 169 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

The very polite and honest yahoo bot.

::ffff:38.113.234.181 - - [22/Jan/2011:19:56:47 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "Voyager/1.0"

Some social media search company called ‘kosmix’. Forgot to ask for robots.txt.

::ffff:65.52.17.79 - - [22/Jan/2011:19:56:47 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 199 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"

Another MS bot, maybe bing again.

::ffff:89.151.116.52 - - [22/Jan/2011:19:56:48 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 529 "-" "Mozilla/5.0 (compatible; MSIE 6.0b; Windows NT 5.0) Gecko/2009011913 Firefox/3.0.6 TweetmemeBot"

No idea. Reverses to bearpub.favsys.net, but (www.)favsys.net has no working web server, and the IP space belongs to some random dedicated-server company. Given the UA, presumably tweetmeme.com's bot - some blah blah twitter trend thing.

::ffff:72.30.161.218 - - [22/Jan/2011:19:56:48 +1100] "GET /robots.txt HTTP/1.0" 200 146 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
::ffff:72.30.161.218 - - [22/Jan/2011:19:56:48 +1100] "GET /what_loads_twitter_urls HTTP/1.0" 404 169 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

Yahoo again.

::ffff:50.16.239.114 - - [22/Jan/2011:19:56:52 +1100] "GET /robots.txt HTTP/1.1" 200 146 "-" "Mozilla/5.0 (compatible; Birubot/1.0) Gecko/2009032608 Firefox/3.0.8"
::ffff:50.16.239.114 - - [22/Jan/2011:19:56:52 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Birubot/1.0) Gecko/2009032608 Firefox/3.0.8"
::ffff:50.16.239.114 - - [22/Jan/2011:19:56:52 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Birubot/1.0) Gecko/2009032608 Firefox/3.0.8"

Some ec2-hosted link trawling thing I guess.

::ffff:216.52.242.14 - - [22/Jan/2011:19:56:52 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com)"

Linkedin again.

::ffff:184.73.156.250 - - [22/Jan/2011:19:57:02 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Firefox"
::ffff:184.73.156.250 - - [22/Jan/2011:19:57:02 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Firefox"

Random ec2 customer with a fake User-agent.

2620:0:10c0:1002:a800:1ff:fe00:11fe - - [22/Jan/2011:19:57:22 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "@hourlypress"

Google ipv6 space! Given the user-agent and lack of rDNS, maybe it’s some software running on appengine?

::ffff:74.112.128.61 - - [22/Jan/2011:19:57:24 +1100] "GET /robots.txt HTTP/1.0" 200 146 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
::ffff:74.112.128.62 - - [22/Jan/2011:19:57:24 +1100] "GET /what_loads_twitter_urls HTTP/1.0" 404 169 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"

Some social web search engine startup I guess.

::ffff:75.101.170.136 - - [22/Jan/2011:19:57:51 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "PycURL/7.18.2"

Some random python ec2-using bot that forgot to change their U-A or get my robots.txt.
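
If you want to reproduce this sort of tally from your own logs, a quick sketch like this works - the log path is a placeholder, and it assumes the usual combined-style format where the user-agent is the last quoted field:

import re
from collections import Counter

# The User-Agent is the last double-quoted field in a combined-format line.
UA_RE = re.compile(r'"([^"]*)"\s*$')

counts = Counter()
with open('/var/log/apache2/access.log') as log:  # hypothetical path
    for line in log:
        m = UA_RE.search(line)
        if m:
            counts[m.group(1)] += 1

for agent, hits in counts.most_common():
    print("%6d  %s" % (hits, agent))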


wsgiref + multiprocessing ftw

So, I wanted to do some very simple functional tests using a real-ish webserver, but I didn’t want to have to depend on twisted.web or an external setup. So I did this:

import cgi
import unittest

from wsgiref.simple_server import make_server
from multiprocessing import Process, Queue

class TestPosting(unittest.TestCase):
    # FIXME: THIS IS TERRIBLE

    def setUp(self):
        self.q = Queue()
        self.p = Process(target=self.serve, args=(self.q,))
        self.p.start()
        self.port = self.q.get()  # block until the server reports its port

    def serve(self, q):
        httpd = make_server('', 0, simple_app_maker(q))  # port 0: let the OS pick a free port
        port = httpd.server_port
        q.put(port)
        httpd.serve_forever()

    def tearDown(self):
        self.p.terminate()

    def test_something(self):
        url = "http://locahost:%s/whatever" % self.port
        ...
        result = self.q.get()
        # make some assertions on the results

def simple_app_maker(queue):
    def simple_app(environ, start_response):
        post_env = environ.copy()
        post_env['QUERY_STRING'] = ''
        post = cgi.FieldStorage(
            fp=environ['wsgi.input'],
            environ=post_env,
            keep_blank_values=True
        )
        queue.put(...) # put the relevant stuff in the queue
        status = '200 OK'
        response_headers = [('Content-type','text/plain')]
        start_response(status, response_headers)
        return ['Hello world!\n']
    return simple_app

Just have the simple_app function put whatever you want to check in the queue, and have your tests pull it out and make some assertions. Super nasty, but it actually works!
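
By way of example, a fleshed-out test body (inside the TestCase above) might look something like this - urllib2, the field name, and the idea that simple_app queues the parsed form as a dict are all just assumptions for illustration:

import urllib2

def test_posting(self):
    url = "http://localhost:%s/whatever" % self.port
    urllib2.urlopen(url, data="foo=bar")  # the data arg makes this a POST
    posted = self.q.get(timeout=5)        # whatever simple_app put there
    self.assertEqual(posted["foo"], "bar")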


Dune2 on Linux

Download it from here (with a browser, it does some redirect crap). Install dosbox:

sudo aptitude install dosbox

It is a self-extracting zip file, so you can either use wine to unzip (holy nuclear option, batman) or install unzip:

sudo aptitude install unzip

Create a directory that we can ask dosbox to mount as ‘C:’, then unzip it:

mkdir -p ~/.dos/Dune2
cd ~/.dos/Dune2
unzip ~/Downloads/Dune2.exe

Get a default config file by starting up dosbox:

dosbox

and (in dosbox) ask it to write out a config file for you:

config -writeconf /home/you/.dosbox.conf

type ‘exit’ in the dosbox window to, er, exit. Add the line to mount the directory we created:

echo "mount c /home/you/.dos" >> ~/.dosbox.conf

and start dosbox again (with that config):

dosbox -conf ~/.dosbox.conf

and run Dune2 in the dosbox window (dosbox’s shell has tab completion, yay):

cd Dune2
Dune2.exe

Clicking in the window will make dosbox grab your mouse and keyboard - ctrl-f10 to escape.


Highlighting trailing whitespace in vim

This snippet configures vim to highlight trailing whitespace (and stray spaces before tabs) in a horrendous red, making it easy to spot and remove.

highlight ExtraWhitespace ctermbg=red guibg=red
autocmd Syntax * syn match ExtraWhitespace /\s\+$\| \+\ze\t/

Even better would be to only highlight it on lines that have been otherwise modified since the last commit…


Using your iphone as a modem under Debian

Thanks to the awesome work of Diego Giagio (for writing it) and Paul McEnery (for packaging it for Debian), using your iPhone as a modem under Debian is about 60 seconds' work:

  1. Install ipheth-dkms (the kernel module side of things) and ipheth-utils (the userspace pairing daemon).
  2. watch the postinst build the kernel driver for your current kernel
  3. plug your phone in
  4. dmesg|grep iPhone should show something like:

    [867025.370421] ipheth 3-2:4.2: Apple iPhone USB Ethernet device attached

    and you’ll find you have a new Ethernet interface

  5. enable tethering on the phone (Settings -> General -> Network -> Internet Tethering)
  6. the phone now runs a DHCP server on that interface - select it with nm-applet or ifup it or whatever you normally do

Custom User-Agent with twisted.web.client.getPage

Since getPage just passes most of its args through to HTTPClientFactory, you can make a simple wrapper to set the user-agent:

    from twisted.web.client import getPage
    ...
    def my_page_getter(*args, **kwargs):
        if 'agent' not in kwargs:
            kwargs['agent'] = 'your user agent/1.2'
        return getPage(*args, **kwargs)
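
It drops in wherever you'd use getPage; a minimal standalone run (example.com is just a placeholder) looks like:

    from twisted.internet import reactor

    def printed(body):
        print(body)
        reactor.stop()

    d = my_page_getter("http://example.com/")
    d.addCallbacks(printed, lambda failure: reactor.stop())
    reactor.run()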

Excluding draft posts from your feeds

Since feeds in ikiwiki are just the result of the [[inline]] directive generating a list of pages, you can use the combination of a pagespec and the tag plugin to stop ikiwiki from syndicating draft pages. Just include “!tagged(draft)” in the pagespec for the page that generates the feed (e.g. blog.mdwn):

[[!inline  pages="./blog/* and !*/Discussion and !tagged(draft)" show="100" ]]

then for each article you’d like to hide for now, simply add the ‘draft’ tag:

[[!tag  foo bar baz draft]]

nginx and IPv6

nginx now has ipv6 support! Yay! To have it work on Debian, all you need to do is open up /etc/nginx/sites-available/default and replace:

listen   80;

with

listen [::]:80  default ipv6only=on;

ipv6only=on here is a bit of a lie - it will listen on both ipv4 and ipv6.

Then, in your /etc/nginx/sites-available/* files, add

listen       [::]:80;

just below

listen       80;

so nginx listens on both.
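
To convince yourself it really is answering on both stacks, a quick check along these lines (run on the server itself, port 80 assumed) will raise socket.error for whichever one isn't listening:

import socket

# Try a plain TCP connect over each stack in turn.
for family, addr in [(socket.AF_INET, "127.0.0.1"), (socket.AF_INET6, "::1")]:
    s = socket.socket(family, socket.SOCK_STREAM)
    s.connect((addr, 80))
    s.close()
    print("ok: %s" % addr)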


PositiveSSL certificate chaining

If you have one of the free PositiveSSL certs that Namecheap gives away with new domains, you’ll find that it probably needs an intermediate cert to make OpenSSL stop complaining. Since they list a bunch here, let me save you some time: you need this one. If you’re using nginx, just add that file to the bottom of your signed cert (i.e. the thing PositiveSSL emailed you).
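
Once the intermediate is appended and nginx reloaded, you can sanity-check the chain from a reasonably modern Python - example.com stands in for your domain:

import socket
import ssl

# Verification fails with ssl.SSLError unless the server presents a chain
# that validates against the system CA store.
ctx = ssl.create_default_context()
s = ctx.wrap_socket(socket.socket(), server_hostname="example.com")
s.connect(("example.com", 443))
print(s.getpeercert()["subject"])
s.close()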


Silly Nagios error

If nagios claims:

Error: 'bar' is not a valid parent for host 'foo'!

it is because bar doesn’t exist as a host (e.g. you used a name directive - which only names a template - instead of a host_name one).
