Dion Moult In honour of the late Dion Moult, 1992 - 2012

Python steals XKCD Comics – snippet

When somebody asks a programmer:

Dude, why do you program? There’s nothing useful that you can make that hasn’t been made already. It’s like making your own hand-phone when you could just buy it at a shop. Go get a life and do something productive.

Then of course, the programmer smiles and replies:

Why, for:

  • The money
  • The girls (“ooh, that looks terribly complex…”, “oh yes, it is…”)
  • Nobody can check whether or not you’re really doing your work.

However, I personally think that programmers do it because they like to communicate with their computer. Since I’m learning Python, I like to learn it through making small things to speed up my day. I’ve made a blackjack game (OK, that slows down my day), a program that executes a series of shell commands to speed up boring tasks, a to-do list program, and my latest creation:

“Something-that-finds-the-latest-comic-on-xkcd.com-and-downloads-it-to-a-file”

Of course, all you need to do is set up a cron job to execute the snippet every time XKCD updates (Mondays, Wednesdays, Fridays) and bingo, you’ve just got yourself a personal archive of missed XKCD comics!
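If it is easier to let cron run the script every day, the script itself can skip the days without an update. Here is a minimal sketch in Python 3, assuming the Monday/Wednesday/Friday schedule mentioned above still holds; `is_update_day` is a made-up helper name and not part of the snippet below:

```python
import datetime

# Monday, Wednesday, Friday in datetime's numbering (Monday == 0).
UPDATE_WEEKDAYS = {0, 2, 4}


def is_update_day(day=None):
    """Return True if `day` (default: today) falls on an assumed update day."""
    if day is None:
        day = datetime.date.today()
    return day.weekday() in UPDATE_WEEKDAYS


if is_update_day():
    print('Update day: run the download.')
else:
    print('No new comic expected today.')
```

The check costs nothing on the other days, so the cron entry can stay as simple as "once a day".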

Here there be snippet:

#!/usr/bin/env python
# Written for Python 2 (urllib.urlopen and the print statement).
import urllib

prefix = 'http://imgs.xkcd.com/comics/'
path = '/home/dion/documents/Projects/Python/'  # change this!

source = urllib.urlopen('http://xkcd.com/').read()
for value in source.splitlines():
    check = value.find(prefix)
    if check != -1:
        # The URL ends at the next occurrence of " after its start.
        end = value.find('"', check)
        image = value[check:end]
        print 'Comic found: ' + image
        # The filename is whatever follows the prefix.
        filename = value[check + len(prefix):end]
        print 'Saved under: ' + filename
        urllib.urlretrieve(image, path + filename)
        break  # only the first match is wanted

Amazing, isn’t it? Here’s the latest one I grabbed:

…and oh yes, it was terribly complicated.


6 Comments

lorg says: (21 July 2008)

1. You should also harvest the alt text.
2. This could be a lot easier if you used regexps, or a ready html parser.
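As a rough illustration of the regexp suggestion, here is a minimal Python 3 sketch that pulls the image URL, the filename, and the next quoted attribute out of a page string with one pattern. The name `find_comic` and the sample markup are hypothetical, and the pattern assumes the text of interest sits in the double-quoted attribute directly after `src`:

```python
import re

# Matches the comic URL inside src="..." and, if present, the value of the
# next double-quoted attribute on the same tag.
COMIC_RE = re.compile(
    r'src="(http://imgs\.xkcd\.com/comics/([^"]+))"'
    r'(?:\s+\w+="([^"]*)")?'
)


def find_comic(html):
    """Return (url, filename, text) for the first comic image, or None."""
    match = COMIC_RE.search(html)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


# Hypothetical sample markup, for illustration only.
sample = ('<img src="http://imgs.xkcd.com/comics/example.png" '
          'title="Some text" alt="Example" />')
print(find_comic(sample))
```

The third element is `None` when no quoted attribute follows `src`, so the caller can tell "no text" apart from "empty text".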

Justin says: (21 July 2008)

Bug: Missing comic alt-text.

Cheers!

admin says: (21 July 2008)

lorg: like BeautifulSoup? I’m just learning, so I’m seeing how far I can get just by referring to the Python documentation.

lorg & Justin:

OK, here is a modification that adds the alt text of the image to the filename:

#!/usr/bin/env python
# Written for Python 2 (urllib.urlopen and the print statement).
import urllib

prefix = 'http://imgs.xkcd.com/comics/'
path = '/home/dion/documents/Projects/Python/'  # change this!

source = urllib.urlopen('http://xkcd.com/').read()
for value in source.splitlines():
    check = value.find(prefix)
    if check != -1:
        # The URL ends at the next occurrence of " after its start.
        end = value.find('"', check)
        image = value[check:end]
        print 'Comic found: ' + image
        filename = value[check + len(prefix):end]
        # Alt text: the next quoted string after the URL.
        alt = value.find('"', end + 1)
        alt2 = value.find('"', alt + 1)
        alttext = value[alt + 1:alt2]
        print 'Alternate text: ' + alttext
        # Spaces in the alt text are converted to underscores.
        alttext = alttext.replace(' ', '_')
        # Insert the alt text before the file extension.
        filename = filename[:-4] + '_ALT_' + alttext + filename[-4:]
        print 'Saved under: ' + filename
        urllib.urlretrieve(image, path + filename)
        break  # only the first match is wanted

lorg says: (22 July 2008)

There are many options for html parsers, for example, htmllib in the standard library. For this kind of task I’d probably still use regexps, would probably be simpler.
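For comparison, here is a minimal sketch of the parser route in Python 3. It uses `html.parser` from the standard library rather than `htmllib`, which no longer exists in Python 3. The class name `ComicFinder` and the sample markup are hypothetical:

```python
from html.parser import HTMLParser

PREFIX = 'http://imgs.xkcd.com/comics/'


class ComicFinder(HTMLParser):
    """Collects the attributes of the first <img> whose src starts with PREFIX."""

    def __init__(self):
        super().__init__()
        self.comic = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; turn it into a dict.
        if tag == 'img' and self.comic is None:
            attributes = dict(attrs)
            src = attributes.get('src') or ''
            if src.startswith(PREFIX):
                self.comic = attributes


# Hypothetical sample markup, for illustration only.
sample = ('<div><img src="http://imgs.xkcd.com/comics/example.png" '
          'title="Some text" alt="Example" /></div>')
finder = ComicFinder()
finder.feed(sample)
print(finder.comic)
```

The parser copes with attributes in any order and with either quote style, which is the main thing it buys over string searching.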

Also, you are putting the alt text as part of the filename without escaping it. Consider http://xkcd.com/327/ ;)

About a year ago I saw another xkcd harvester, and he’d just put a text file along with the image. Another (big overkill) option is to put the text in the PNG’s text chunk.
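The text-file option might look something like the following in Python 3. This is only a sketch: `save_alt_text` is a made-up helper, and it assumes the text should live in a `.txt` file that shares the image's base name:

```python
import os
import tempfile


def save_alt_text(image_path, alt_text):
    """Write alt_text to a .txt file next to image_path and return its path."""
    base, _extension = os.path.splitext(image_path)
    text_path = base + '.txt'
    with open(text_path, 'w', encoding='utf-8') as handle:
        handle.write(alt_text + '\n')
    return text_path


# Demonstration in a temporary directory with a made-up filename.
with tempfile.TemporaryDirectory() as folder:
    written = save_alt_text(os.path.join(folder, 'example.png'), 'Some alt text')
    print(os.path.basename(written))
```

Since the text never becomes part of a filename, it needs no escaping at all.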

admin says: (23 July 2008)

Wouldn’t a replace('\'', '') be sufficient?
I looked at the Python documentation on htmllib, and I think this method should be fine for now.
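On the escaping question: stripping the single quote covers one character, but slashes, semicolons, and other path-sensitive or shell-sensitive characters would remain. Here is a minimal Python 3 sketch of a more conservative approach that keeps only an allow-list of characters. The name `safe_filename_part` and the `max_length` default are hypothetical choices, not something from the script above:

```python
import re


def safe_filename_part(text, max_length=100):
    """Reduce arbitrary text to characters that are safe in a filename."""
    # Replace every run of characters outside [A-Za-z0-9._-] with one underscore.
    cleaned = re.sub(r'[^A-Za-z0-9._-]+', '_', text)
    # Trim dots and underscores at the edges, then cap the length.
    return cleaned.strip('._')[:max_length]


# Illustrative input in the spirit of the comic linked above.
print(safe_filename_part("Robert'); DROP TABLE Students;--"))
```

An allow-list means characters nobody thought of in advance are dropped by default, which a chain of `replace` calls cannot guarantee.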

thinkMoult » Blog Archive » A Little Python Fun says: (4 January 2009)

[...] I last touched Python, I wrote a snippet to steal the latest comic off xkcd.com. Like most of the things that I do (this only applies to individual projects, not teamwork), there [...]
