What loads URLs posted to Twitter?
So a little while ago (at 19:56:40) I posted a tweet containing just a single URL. Within seconds, I had these bots requesting it:
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:43 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:43 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
No rDNS, but it's NTT IP space, which is where Twitter hosts, so it could well be an official URL-scraping bot. It forgot to get my robots.txt, though.
::ffff:66.249.67.72 - - [22/Jan/2011:19:56:43 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 143 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Google's Twitter-search foo. It probably already had my robots.txt.
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:43 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:43 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
Twitter again.
::ffff:65.52.2.212 - - [22/Jan/2011:19:56:44 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 199 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
Microsoft IP space, no rDNS - maybe Bing with a fake UA? No robots.txt retrieval.
::ffff:216.52.242.14 - - [22/Jan/2011:19:56:44 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com)"
LinkedIn, obviously - not sure if they scrape all URLs or just the ones in feeds people have listed on their profiles (which I have). No robots.txt retrieval.
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:44 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
::ffff:128.242.241.122 - - [22/Jan/2011:19:56:45 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Twitterbot/0.1"
Twitter again.
::ffff:67.195.115.246 - - [22/Jan/2011:19:56:46 +1100] "GET /robots.txt HTTP/1.0" 200 146 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
::ffff:67.195.115.246 - - [22/Jan/2011:19:56:46 +1100] "GET /what_loads_twitter_urls HTTP/1.0" 404 169 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
The very polite and honest Yahoo! bot.
::ffff:38.113.234.181 - - [22/Jan/2011:19:56:47 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "Voyager/1.0"
Some social-media search company called ‘Kosmix’. It forgot to ask for robots.txt.
::ffff:65.52.17.79 - - [22/Jan/2011:19:56:47 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 199 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
Another MS bot - maybe Bing again.
::ffff:89.151.116.52 - - [22/Jan/2011:19:56:48 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 529 "-" "Mozilla/5.0 (compatible; MSIE 6.0b; Windows NT 5.0) Gecko/2009011913 Firefox/3.0.6 TweetmemeBot"
No idea. The IP reverses to bearpub.favsys.net, but (www.)favsys.net has no working web server, and the IP space belongs to some random dedicated-server company. Presumably related to tweetmeme.com (some blah blah Twitter-trends thing), given the UA.
::ffff:72.30.161.218 - - [22/Jan/2011:19:56:48 +1100] "GET /robots.txt HTTP/1.0" 200 146 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
::ffff:72.30.161.218 - - [22/Jan/2011:19:56:48 +1100] "GET /what_loads_twitter_urls HTTP/1.0" 404 169 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
Yahoo again.
::ffff:50.16.239.114 - - [22/Jan/2011:19:56:52 +1100] "GET /robots.txt HTTP/1.1" 200 146 "-" "Mozilla/5.0 (compatible; Birubot/1.0) Gecko/2009032608 Firefox/3.0.8"
::ffff:50.16.239.114 - - [22/Jan/2011:19:56:52 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Birubot/1.0) Gecko/2009032608 Firefox/3.0.8"
::ffff:50.16.239.114 - - [22/Jan/2011:19:56:52 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "Mozilla/5.0 (compatible; Birubot/1.0) Gecko/2009032608 Firefox/3.0.8"
Some EC2-hosted link-trawling thing, I guess.
::ffff:216.52.242.14 - - [22/Jan/2011:19:56:52 +1100] "GET /what_loads_twitter_urls HTTP/1.1" 404 169 "-" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com)"
LinkedIn again.
::ffff:184.73.156.250 - - [22/Jan/2011:19:57:02 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Firefox"
::ffff:184.73.156.250 - - [22/Jan/2011:19:57:02 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "Firefox"
Random EC2 customer with a fake User-Agent.
2620:0:10c0:1002:a800:1ff:fe00:11fe - - [22/Jan/2011:19:57:22 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "@hourlypress"
Google IPv6 space! Given the User-Agent and lack of rDNS, maybe it's some software running on App Engine?
::ffff:74.112.128.61 - - [22/Jan/2011:19:57:24 +1100] "GET /robots.txt HTTP/1.0" 200 146 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
::ffff:74.112.128.62 - - [22/Jan/2011:19:57:24 +1100] "GET /what_loads_twitter_urls HTTP/1.0" 404 169 "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"
Topsy - some social-web search engine startup, I guess.
::ffff:75.101.170.136 - - [22/Jan/2011:19:57:51 +1100] "HEAD /what_loads_twitter_urls HTTP/1.1" 404 0 "-" "PycURL/7.18.2"
Some random Python bot on EC2 that forgot to change its User-Agent or fetch my robots.txt.
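Incidentally, the rDNS claims above are easy to check yourself. A minimal Python sketch (the IPs are just a few of the ones from the logs):
import socket

for ip in ['128.242.241.122', '65.52.2.212', '89.151.116.52']:
    try:
        # gethostbyaddr does the PTR lookup; a missing record raises herror
        name = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        name = 'no rDNS'
    print '%s -> %s' % (ip, name)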
wsgiref + multiprocessing ftw
So, I wanted to do some very simple functional tests using a real-ish webserver, but I didn’t want to have to depend on twisted.web or an external setup. So I did this:
import cgi
import unittest
from wsgiref.simple_server import make_server
from multiprocessing import Process, Queue

class TestPosting(unittest.TestCase):
    # FIXME: THIS IS TERRIBLE
    def setUp(self):
        self.q = Queue()
        self.p = Process(target=self.serve, args=(self.q,))
        self.p.start()
        # the server process sends back the port it bound
        self.port = self.q.get()

    def serve(self, q):
        # port 0 asks the OS for any free port
        httpd = make_server('', 0, simple_app_maker(q))
        port = httpd.server_port
        q.put(port)
        httpd.serve_forever()

    def tearDown(self):
        self.p.terminate()

    def test_something(self):
        url = "http://localhost:%s/whatever" % self.port
        ...
        result = self.q.get()
        # make some assertions on the results

def simple_app_maker(queue):
    def simple_app(environ, start_response):
        post_env = environ.copy()
        post_env['QUERY_STRING'] = ''
        post = cgi.FieldStorage(
            fp=environ['wsgi.input'],
            environ=post_env,
            keep_blank_values=True
        )
        queue.put(...)  # put the relevant stuff in the queue
        status = '200 OK'
        response_headers = [('Content-type', 'text/plain')]
        start_response(status, response_headers)
        return ['Hello world!\n']
    return simple_app
Just have the simple_app function put whatever you want to check in the queue, and your tests pull it out and make some assertions. Super nasty, but it actually works!
Dune2 on Linux
Download it from here (with a browser, it does some redirect crap). Install dosbox:
sudo aptitude install dosbox
It is a self-extracting zip file, so you can either use wine to unzip (holy nuclear option, batman) or install unzip:
sudo aptitude install unzip
Create a directory that we can ask dosbox to mount as ‘C:’, then unzip it:
mkdir -p ~/.dos/Dune2
cd ~/.dos/Dune2
unzip ~/Downloads/Dune2.exe
Get a default config file by starting up dosbox:
dosbox
and (in dosbox) ask it to write out a config file for you:
config -writeconf /home/you/.dosbox.conf
type ‘exit’ in the dosbox window to, er, exit. Add the line to mount the directory we created:
echo "mount c /home/you/.dos" >> ~/.dosbox.conf
and start dosbox again (with that config):
dosbox -conf ~/.dosbox.conf
and run Dune2 in the dosbox window (dosbox’s shell has tab completion, yay):
cd Dune2
Dune2.exe
Clicking in the window will make dosbox grab your mouse and keyboard - ctrl-f10 to escape.
Highlighting trailing whitespace in vim
This config snippet configures vim to highlight trailing whitespace in a horrendous red, making it easy to spot and remove.
highlight ExtraWhitespace ctermbg=red guibg=red
autocmd Syntax * syn match ExtraWhitespace /\s\+$\| \+\ze\t/
Even better would be to only highlight it on lines that have been otherwise modified since the last commit…
Using your iPhone as a modem under Debian
Thanks to the awesome work of Diego Giagio (for writing it) and Paul McEnery (for packaging it for Debian), using your iPhone as a modem under Debian is about 60 seconds' work:
- Install ipheth-dkms (the kernel module side of things) and ipheth-utils (the userspace pairing daemon).
- watch the postinst build the kernel driver for your current kernel
- plug your phone in
dmesg|grep iPhone
should show something like:
[867025.370421] ipheth 3-2:4.2: Apple iPhone USB Ethernet device attached
and you’ll find you have a new Ethernet interface.
- enable tethering on the phone (Settings -> General -> Network -> Internet Tethering)
- now the phone runs a DHCP server on the new interface - select it with nm-applet, ifup it (see the sketch below), or whatever you normally do
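If you go the ifup route, a minimal /etc/network/interfaces stanza looks something like this (eth1 is a guess at the interface name - check dmesg or ip link for the real one):
# the iPhone shows up as an extra ethernet interface; adjust eth1 to taste
allow-hotplug eth1
iface eth1 inet dhcp
then just:
sudo ifup eth1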
Custom User-Agent with twisted.web.client.getPage
Since getPage just passes most of its args through to HTTPClientFactory, you can make a simple wrapper to set the user-agent:
from twisted.web.client import getPage
...
def my_page_getter(*args, **kwargs):
    if 'agent' not in kwargs:
        kwargs['agent'] = 'your user agent/1.2'
    return getPage(*args, **kwargs)
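It then drops in wherever you'd use getPage; here's a minimal usage sketch (the URL and callback are placeholders):
from twisted.internet import reactor

def show(body):
    print body
    reactor.stop()

d = my_page_getter('http://example.com/')
d.addCallback(show)
reactor.run()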
Excluding draft posts from your feeds
Since feeds in ikiwiki are just the result of the [[inline]] directive generating a list of pages, you can use the combination of a pagespec and the tag plugin to stop ikiwiki from syndicating draft pages. Just include “!tagged(draft)” in the pagespec for the page that generates the feed (e.g. blog.mdwn):
[[!inline pages="./blog/* and !*/Discussion and !tagged(draft)" show="100" ]]
then for each article you’d like to hide for now, simply add the ‘draft’ tag:
[[!tag foo bar baz draft]]
nginx and IPv6
nginx now has IPv6 support! Yay! To have it work on Debian, all you need to do is open up /etc/nginx/sites-available/default and replace:
listen 80;
with
listen [::]:80 default ipv6only=on;
ipv6only=on here is a bit of a lie - it will listen on both IPv4 and IPv6.
Then, in your /etc/nginx/sites-available/* files, add
listen [::]:80;
just below
listen 80;
so nginx listens on both.
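So the end result for each vhost looks something like this (server_name and root are placeholders):
server {
    listen 80;
    listen [::]:80;
    server_name example.com;
    root /var/www/example;
}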
PositiveSSL certificate chaining
If you have one of the free PositiveSSL certs that Namecheap gives away with new domains, you’ll find that it probably needs an intermediate cert to make OpenSSL stop complaining. Since they list a bunch here, let me save you some time: you need this one. If you’re using nginx, just add that file to the bottom of your signed cert (i.e. the thing PositiveSSL emailed you).
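In other words, something like this (the filenames are placeholders for wherever you saved the certs):
cat www.example.com.crt PositiveSSLCA.crt > www.example.com.chained.crt
then point nginx's ssl_certificate directive at the chained file.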
Silly Nagios error
If Nagios claims:
Error: 'bar' is not a valid parent for host 'foo'!
it is because bar doesn't exist (e.g. you used a name directive instead of a host_name one).
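i.e. something like this will trigger it (made-up definitions):
# 'bar' here is only a template name, so no actual host is called bar...
define host {
    name        bar
    register    0
}

# ...which makes the parents line here fail with the error above
define host {
    use         bar
    host_name   foo
    address     192.0.2.1
    parents     bar
}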