Archive for the ‘Python’ Category

New Metal Army: Overview

Thursday, April 3rd, 2008

For the last year I've been working on a project in TurboGears. Well, now I can basically say version 1.0 of New Metal Army is done. It's been a long time (nearly a year) and the site is far from finished, but what is there is basically feature complete and it makes a nice site.

Here is a quick summary of New Metal Army:

  • Pulls together the news from the top metal and rock websites from around the world. All news is tagged and associated with the appropriate bands
  • Has full gig listings for the UK rock/metal scene with full venue details and links to buy tickets
  • Brings together band details from Wikipedia, Flickr, YouTube, MusicBrainz and Amazon

As it currently stands there are 2,500+ bands, 400+ gigs, 100+ venues, 500,000+ band pictures and 1,000,000+ band videos, and it's growing all the time.

I've learned a lot about the internet doing this project. Most of my time was spent researching different ways to scrape and understand websites. Here is a summary of what I learned:

  • The internet is a mess. There are a lot of sites with HTML that isn't just slightly wrong but VERY wrong.
  • BeautifulSoup goes a long way towards parsing the bad markup on the internet.
  • Structural markup (bold, italic, div etc) was the least of my worries. The semantic meaning of the data is very hard to discern. Now this is obvious, but when I started I didn't really think about it. I assumed that MicroFormats would come to my rescue... but no one uses them (well, very few people use them). I must confess that I have a task in my Trac to add them but it's not a priority at the moment. So even I don't use them!
  • Even with MicroFormats the data I needed is tainted by human input (like most of the internet). I deal a lot with band names, and band names are often misspelt, with articles like 'a' and 'the' added and removed.
  • Following on from spellings: people in the world still insist on using foreign languages that have funny accents and, despite what you read, Unicode, while simple to code in Python, is not simple to think in. When will everyone learn to speak local :)
  • A human algorithm for tagging things is... to tag things with a scatter gun approach. The code base looks in Flickr for pictures of bands. It does this by looking at the tags associated with a picture. The trouble is that when someone takes some photos at a concert they mass upload them and tag them all with the names of all the bands at the concert, and sometimes with bands that are merely like the bands at the concert... not too useful.
  • Python does scale pretty well to large projects... but it's easy to get 'leaks' which make your heap grow and grow. They aren't really leaks; they are normally lists of things that you are forgetting to tidy up and Python is dutifully holding them for you. This is something I really didn't think about until I deployed the site and associated tools and noticed that my 'Job Runner' (a threaded application that runs various jobs) just grew and grew.
  • Before you start writing anything in Python, look for a module on the internet that does it. If you can't find a module to do it then write your own. However, write your own module knowing this: as soon as you have finished, somehow you will find a module on the internet that is more complete than yours and you will kick yourself for not finding it before!
  • Zombies rock... everyone likes zombies and Simon (my Zombie) is no exception to this :)
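The band-name points above (articles that come and go, accents that may or may not be typed) can be made concrete with a small sketch. This is not the code New Metal Army uses; the function name `normalise_band_name` and the list of articles are assumptions made up for illustration, and it is written for a current Python rather than the 2.5 of the time. It builds a rough comparison key, so two spellings of the same band are more likely to match.

```python
import unicodedata

# Leading articles to drop. This list is an assumption for illustration.
ARTICLES = ("the ", "a ", "an ")


def normalise_band_name(name):
    """Return a rough comparison key for a band name (hypothetical helper)."""
    # Decompose accented characters (NFKD), then drop the combining marks,
    # so an 'o' with an umlaut compares equal to a plain 'o'.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # Lower-case and collapse runs of whitespace.
    key = " ".join(stripped.lower().split())

    # Drop a single leading article, if there is one.
    for article in ARTICLES:
        if key.startswith(article):
            key = key[len(article):]
            break
    return key


print(normalise_band_name("The  Hives"))
print(normalise_band_name("Mot\u00f6rhead") == normalise_band_name("motorhead"))
```

A key like this only helps with matching. It does nothing for genuine misspellings, and it will happily collapse two different bands whose names differ only by an article or an accent.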

I intend to add some articles on how I did various bits and bobs over the next few weeks.

FreeBSD 6.3 and Turbogears

Saturday, January 19th, 2008

I upgraded a test server to FreeBSD 6.3 (released a few days ago) and all was working well apart from my TurboGears app. I run a TurboGears instance behind mod_wsgi and it wouldn't start. Here is the error I got in http_errors.log:

[Sat Jan 19 11:32:42 2008] [error] [client 207.155.93.149] mod_wsgi (pid=1292): Exception occurred within WSGI script '/home/m/release1.0/apache/turbogears.wsgi'.
[Sat Jan 19 11:32:42 2008] [error] [client 207.155.93.149] Traceback (most recent call last):
[Sat Jan 19 11:32:42 2008] [error] [client 207.155.93.149]   File "/home/m/release1.0/apache/turbogears.wsgi", line 67, in <module>
[Sat Jan 19 11:32:42 2008] [error] [client 207.155.93.149]     import turbogears
[Sat Jan 19 11:32:42 2008] [error] [client 207.155.93.149] ImportError: No module named turbogears

That's odd. I've not uninstalled TurboGears and my background processes that import turbogears still work. In fact if I go to the python command line and type import turbogears it all works... bugger.

To complicate matters (in this case) I use a workingenv to contain a very specific version of TurboGears and all of its dependencies. In order for the wsgi script to access the sandbox environment I use an excellent script which tweaks the runtime environment to include the paths in a workingenv. My first thought was that something here had gone wrong. So I turned to prints and some basic error capture.

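# Load all distributions into the working set.
# (root is set earlier in the wsgi script, not shown here)
from pkg_resources import working_set, Environment

env = Environment(root)
env.scan()

# errors maps each distribution that could not be loaded
# to the exception explaining why
distributions, errors = working_set.find_plugins(env)
for dist in distributions:
    working_set.add(dist)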

Printing out errors revealed:

errors:
{Amara 1.2.0.2 (/usr/home/m/tgenv1_0_32/lib/python2.5/Amara-1.2.0.2-py2.5.egg):
   DistributionNotFound(Requirement.parse('4Suite-XML>=1.0.2'),),
TGCaptcha 0.11 (/usr/home/m/tgenv1_0_32/lib/python2.5/TGCaptcha-0.11-py2.5.egg):
   DistributionNotFound(Requirement.parse('pycrypto>=2.0.1'),)}

Well, I hadn't uninstalled those packages and I'm pretty sure that freebsd-update hadn't uninstalled them, so where the hell have they gone? Looking in the workingenv sandbox package directory:

ls -la /usr/home/m/tgenv1_0_32/lib/python2.5
4Suite_XML-1.0.2-py2.5-freebsd-6.2-RELEASE-i386.egg
Amara-1.2.0.2-py2.5.egg
BeautifulSoup-3.0.5-py2.5.egg
Cheetah-2.0.1-py2.5-freebsd-6.2-RELEASE-i386.egg
Cheetah-2.0rc8-py2.5-freebsd-6.2-RELEASE-i386.egg
CherryPy-2.2.1-py2.5.egg
...
PasteScript-1.3.6-py2.5.egg
PyProtocols-1.0a0dev_r2302-py2.5-freebsd-6.2-RELEASE-i386.egg
Routes-1.7.1-py2.5.egg
RuleDispatch-0.5a0.dev_r2306-py2.5-freebsd-6.2-RELEASE-i386.egg
SQLAlchemy-0.3.10-py2.5.egg
...
moved_aside_site.py
psycopg2-2.0.6-py2.5-freebsd-6.2-RELEASE-i386.egg
pycrypto-2.0.1-py2.5-freebsd-6.2-RELEASE-i386.egg
python_dateutil-1.3-py2.5.egg
...
setuptools.pth
simplejson-1.7.3-py2.5-freebsd-6.2-RELEASE-i386.egg
...
 

BUGGER, there are packages in there with the OS version number (freebsd-6.2) in their names, and they need to be updated:

easy_install -U amara
easy_install -U pycrypto
easy_install -U psycopg2
...

Running those fixed all the problems and finally the site is up again :) So I've fixed the problem, but I don't know why my other processes and the python command line worked. If anyone knows, I'd love to know too. Cheers.
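For anyone facing the same upgrade, here is a rough sketch of how the affected eggs could be picked out of a directory listing like the one above. It only looks at file names, on the assumption that they follow the name-version-pyX.Y[-platform].egg pattern seen in the listing. The helper names `platform_of` and `stale_eggs` are made up for illustration, the "freebsd-6.3-RELEASE-i386" string is a guess at the new platform tag by analogy with the 6.2 one, and the exact-match comparison is a simplification rather than the rule setuptools itself applies. It does not explain why the command line still worked.

```python
import re

# name-version-pyX.Y[-platform].egg
# Assumption: the platform part is optional, as in the listing above.
EGG_RE = re.compile(
    r"^(?P<name>[^-]+)-(?P<version>[^-]+)-py(?P<pyver>\d+\.\d+)"
    r"(?:-(?P<platform>.+))?\.egg$"
)


def platform_of(egg_filename):
    """Return the platform tag in an egg file name, or None if there isn't one."""
    match = EGG_RE.match(egg_filename)
    if match is None:
        return None
    return match.group("platform")


def stale_eggs(filenames, current_platform):
    """List eggs whose platform tag differs from current_platform."""
    stale = []
    for filename in filenames:
        platform = platform_of(filename)
        if platform is not None and platform != current_platform:
            stale.append(filename)
    return stale


# A few names taken from the listing in the post.
listing = [
    "4Suite_XML-1.0.2-py2.5-freebsd-6.2-RELEASE-i386.egg",
    "Amara-1.2.0.2-py2.5.egg",
    "pycrypto-2.0.1-py2.5-freebsd-6.2-RELEASE-i386.egg",
    "setuptools.pth",
]

# "freebsd-6.3-RELEASE-i386" is an assumed tag for the upgraded machine.
for name in stale_eggs(listing, "freebsd-6.3-RELEASE-i386"):
    print(name)
```

Each name it prints is a candidate for an easy_install -U of the corresponding package.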

Python Style Plugins Made Easy

Wednesday, January 2nd, 2008


Sometimes you need to write code that loads Python code at runtime. Plugin architectures are a good example of this. Plugins allow extensibility but, more importantly (for me at least), they enforce a strict API. Anyway, I've written this code a few times so I thought I'd modularize it.

The specific bit of code I am going to post is the Python code to look for and load a series of Python plugins. Plugins (in this case) are just classes that are subclasses of the Plugin base class. This plugin base class dictates the API that all plugins must implement. Here is an example plugin abstract base class:

class Plugin(object):
    def setup(self):
        """called before the plugin is asked to do anything"""
        raise NotImplementedError
 
    def teardown(self):
        """called to allow the plugin to free anything"""
        raise NotImplementedError
 
    def domagic(self):
        """do whatever it is the plugin does"""
        raise NotImplementedError
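
To make the API concrete, here is a minimal sketch of what a plugin written against that base class might look like, together with one way a host application could drive it. `ShoutPlugin` and `run_plugin` are hypothetical names invented for this example, and the base class is repeated only so the sketch runs on its own.

```python
class Plugin(object):
    """Base class repeated from the post so this sketch is self-contained."""

    def setup(self):
        raise NotImplementedError

    def teardown(self):
        raise NotImplementedError

    def domagic(self):
        raise NotImplementedError


class ShoutPlugin(Plugin):
    """Hypothetical plugin: upper-cases a fixed list of messages."""

    def setup(self):
        self.messages = ["hello", "world"]

    def teardown(self):
        self.messages = []

    def domagic(self):
        return [message.upper() for message in self.messages]


def run_plugin(plugin_cls):
    """Instantiate a plugin class and run it through its whole lifecycle."""
    plugin = plugin_cls()
    plugin.setup()
    try:
        return plugin.domagic()
    finally:
        # teardown runs even if domagic raises
        plugin.teardown()


print(run_plugin(ShoutPlugin))
```

A plugin that forgets to implement one of the three methods fails loudly with NotImplementedError the first time the host calls it, which is the "strict API" part.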
 

Here is the code to look for and return a list of classes that can be instantiated:

import logging
import os

log = logging.getLogger(__name__)


def find_subclasses(path, cls):
    """
    Find all subclasses of cls in py files located below path
    (does look in sub directories)

    @param path: the path to the top level folder to walk, relative to
                 a directory that Python can import from
    @type path: str
    @param cls: the base class that all subclasses should inherit from
    @type cls: class
    @rtype: list
    @return: a list of classes that are subclasses of cls
    """

    subclasses = []

    def look_for_subclass(modulename):
        log.debug("searching %s" % (modulename))
        module = __import__(modulename)

        # __import__ returns the top level package, so walk the
        # dictionaries to get to the last module in the dotted name
        d = module.__dict__
        for m in modulename.split('.')[1:]:
            d = d[m].__dict__

        # look through this dictionary for things
        # that are subclasses of cls
        # but are not cls itself
        for key, entry in d.items():
            if entry is cls:
                continue

            try:
                if issubclass(entry, cls):
                    log.debug("Found subclass: " + key)
                    subclasses.append(entry)
            except TypeError:
                # this happens when a non-type is passed in to issubclass. We
                # don't care as it can't be a subclass of cls if it isn't a
                # type
                continue

    for root, dirs, files in os.walk(path):
        for name in files:
            if name.endswith(".py") and not name.startswith("__"):
                # normpath strips a leading "./" so that
                # "./pluginsfolder/foo.py" becomes "pluginsfolder.foo"
                filepath = os.path.normpath(os.path.join(root, name))
                modulename = filepath.rsplit('.', 1)[0].replace(os.sep, '.')
                look_for_subclass(modulename)

    return subclasses
 

and here is how you would call it:

classes = find_subclasses("./pluginsfolder/", Plugin)
 
# let's create an instance of the first class
inst = classes[0]()
 

So there you go: create a folder of Python files with classes in them that subclass the Plugin class and you are away.
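
One detail in find_subclasses is worth spelling out: the loop that "walks the dictionaries" is there because `__import__` with a dotted name returns the top level package, not the module at the end of the name. The sketch below shows that behaviour using the standard library's `os.path` as a stand-in for a plugin module. On newer Pythons (2.7 and 3.1 onwards, so not the version this post was written against) `importlib.import_module` returns the last module directly, which would make the walk unnecessary.

```python
import importlib
import os

# __import__ with a dotted name hands back the top level package...
top = __import__("os.path")
print(top is os)

# ...so the module we actually asked for has to be reached by walking down.
walked = top
for part in "os.path".split(".")[1:]:
    walked = walked.__dict__[part]
print(walked is os.path)

# importlib.import_module returns the last module in the dotted name directly.
leaf = importlib.import_module("os.path")
print(leaf is os.path)
```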