Sunday 29 November 2015

Python Weekly #7 - An easy to extend plugin framework

An easy to extend plugin framework

In a number of my projects I have made several iterations of developing a plugin framework - that is, a way that functionality can be extended easily (by me, or anyone else) without having to explicitly edit the main code.
An example might be a game where the user has to manage a number of different types of resources produced by different types of buildings, and with specialist people/aliens/trolls etc staffing those buildings and using those resources. With an efficient plugin system, it is relatively easy to imagine  the base game defining a starting set of buildings, resources, and types of workers/aliens etc, and then be able to extend the game adding new buildings, new resources etc by simply adding a new plugin, and without anyone editing the main game.
A plugin system has to have the following characteristics :
  1. Automatic Discovery : Newly added plugins should be automatically discoverable - i.e. the application should be able to find new plugins easily without any user intervention.
  2. Clear Functionality : It should be obvious to the application what "type" of functionality the plugin adds (using the game example above does the plugin add new resources, new buildings or new workers - or all 3 ?).
  3. Simple to Use : There should be a simple way for the user to use the plugin once it is added; for instance are plugins added to existing menus, or do they create obvious new menus etc.
This article is going to describe a simple framework that can be used to extend your python applications. The framework certainly addresses the first two points of the list above, and I will give you some pointers on the 3rd item, as there is no generic solution to it - it really does depend on your application.

Source Code

Source code to accompany this article is published on GitHub : plugin framework. This includes a package which encapsulates the code within a simple to use class, a very simple example application (showing how to use the framework), and a very simple skeleton plugin, showing the minimal requirements that any plugin should implement to work with this framework.

1 - Automatic Discovery

This is actually three parts; can we find the code that forms the plugin, can we load this code (a python module or package), and can we identify which parts of this loaded code are actually the plugins and which are supporting code only being used by the plugin.

Finding the python code

Perhaps unsurprisingly, this is the simplest problem to address - the application can keep all of the plugins in a standard directory (or maybe two - a system wide and user specific directory) :
import sys
import os

def get_plugin_dirs( app_name ):
    """Return a list of plugin directories for this application or user

    Add all possible plugin paths to sys.path - these could be <app_path>/plugins and ~/.<app_name>/plugins
    Only paths which exist are added to sys.path : It is entirely possible for nothing to be added.
    Return the paths which were added to sys.path
    """
    # Construct the directories into a list
    plugindirs = [os.path.join(os.path.dirname(os.path.abspath(sys.argv[0])), "plugins"),
                  os.path.expanduser("~/.{}/plugins".format(app_name) )]

    # Remove any non-existent directories
    plugindirs =  [path for path in plugindirs if os.path.isdir(path)]
  
    sys.path = plugindirs + sys.path
    return plugindirs

Note

The get_plugin_dirs function presented above relies heavily on the os.path library, since this is the most portable way to ensure that the application correctly constructs valid file paths etc.

We have a list of zero or more directories which may contain plugin code, so let's identify the code in those directories.
In Python code could exist as either :
  • An uncompiled python file, with the `.py` extension.
  • A compiled python file, with the `.pyc` extension
  • A C extension, with `.so` extension (or similar)
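Thankfully - python makes it very easy to identify all of these files : use imp.get_suffixes(). (The imp module is deprecated from python 3.4 onwards and was removed in python 3.12; its replacement is importlib.machinery, which provides SOURCE_SUFFIXES, BYTECODE_SUFFIXES, EXTENSION_SUFFIXES and all_suffixes().) Because of the features we want to use later we actually only want to use the python files (compiled and uncompiled) - and not any of the C extensions.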

Plugins written in C ?

If you are adept enough to write an extension in C which you want to use as a plugin, then you can also easily write one or more wrappers in Python around your C extension code so that it complies with our framework - more on that later.
import imp
import os

def identify_modules(dir_list):
    """Generate a list of valid modules or packages to be imported

    param: dir_list : A list of directories to search in
    return: A set of module/package names which might be importable
    """
    # imp.get_suffixes returns a list of tuples : (<suffix>, <mode>, <type>)
    suff_list = [s[0] for s in imp.get_suffixes() if s[2] in [imp.PY_SOURCE, imp.PY_COMPILED]]

    # By using a set we easily remove duplicated names - e.g. file.py and file.pyc
    candidates = set()

    # Look through all the directories in the dir_list
    for dir_name in dir_list:
        # Get the content of each dir - don't need os.walk
        dir_content = os.listdir(dir_name)

        # Look through each name in the directory
        for name in dir_content:
            full_path = os.path.join(dir_name, name)

            # Does the file have a valid suffix for a python file
            if os.path.isfile(full_path) and os.path.splitext(name)[1] in suff_list:
                candidates.add(os.path.splitext(name)[0])

            # Is the name a package (i.e. a directory containing a __init__.py or __init__.pyc file)
            if os.path.isdir(full_path) and any(
                    os.path.exists(os.path.join(full_path, "__init__" + s)) for s in suff_list):
                candidates.add(name)
    return candidates
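On a recent interpreter, where the imp module is no longer available, the suffix list has to come from importlib.machinery instead. The following is a minimal sketch of the same search under that assumption; the name identify_modules_modern is purely illustrative and is not part of the published framework.

```python
import importlib.machinery
import os


def identify_modules_modern(dir_list):
    """Return the set of module/package names found directly in dir_list."""
    # Source (.py) and bytecode (.pyc) suffixes only - no C extensions
    suffixes = (importlib.machinery.SOURCE_SUFFIXES
                + importlib.machinery.BYTECODE_SUFFIXES)

    candidates = set()
    for directory in dir_list:
        for entry in os.listdir(directory):
            full_path = os.path.join(directory, entry)
            stem, ext = os.path.splitext(entry)

            if os.path.isfile(full_path) and ext in suffixes:
                # A plain module : e.g. plugin.py or plugin.pyc
                candidates.add(stem)
            elif os.path.isdir(full_path) and any(
                    os.path.exists(os.path.join(full_path, "__init__" + s))
                    for s in suffixes):
                # A package : a directory containing an __init__ file
                candidates.add(entry)
    return candidates
```

Directories without an __init__ file, and files with any other extension, are simply ignored.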
In the final discovery step - we need to see if any of the identified files actually implement a plugin, and for this step we can use a hidden gem of the Python Standard Library - the inspect library. The inspect library provides functionality to look inside python modules and classes, including ways to list the classes within modules, and methods in classes (and a lot more besides). We are also going to make use of a key feature of Object Oriented programming - inheritance. We can define a basic PluginBase class, and use the inspect library to look at each of our candidate modules to find a class which inherits from the plugin class. In order to comply with our framework, the classes which implement our plugins must inherit from PluginBase.

Location of PluginBase

Currently we are presenting our framework as a set of functions - without identifying a module etc. If our plugin classes are going to inherit from PluginBase, then our framework, and especially PluginBase, will need to exist in a place where it can be easily imported by our plugin modules. This is achieved by having the PluginBase class defined in a top level module, or a module in a top level package. (A top level module/package is one that exists directly under one of the entries in sys.path).
import inspect
import importlib

class PluginBase(object):
    @classmethod
    def register(cls_):
        """Must be implemented by the actual plugin class

           Must return a basic informational string about the plugin
        """
        # NotImplementedError is the exception; NotImplemented is a constant and cannot be raised
        raise NotImplementedError("Register method not implemented in {}".format(cls_))

def find_plugin_classes(module_list):
    """Return a dictionary of classes which inherit from PluginBase

    param: module_list: a list of valid modules - from identify_modules
    return : A dictionary of classes, which inherit from PluginBase, and implement the register method
             The class is the key in the dictionary, and the value is the returned string from the register method
    """
    cls_dict = {}
    for mod_name in module_list:
        m = importlib.import_module(mod_name)
        for name, cls_ in inspect.getmembers(m, inspect.isclass):
            if issubclass(cls_, PluginBase):
                try:
                    info = cls_.register()
                except NotImplementedError:
                    # PluginBase itself, or a subclass without its own register method
                    continue
                else:
                    cls_dict[cls_] = info

    return cls_dict
And there we have it - the basis of plugin characteristic #1 - that the plugin is automatically discoverable. Using the code above - all that the plugin implementation needs to do is to be in a module which exists in one of the two plugin directories, be a class which inherits from the PluginBase class, and implement a sensible register method.
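To see the inspect-based filtering in isolation, here is a small self-contained sketch. Rather than importing a real plugin file, it builds a throwaway module object in memory; the SawMill and Helper classes and the demo_plugin module name are invented for the illustration.

```python
import inspect
import types


class PluginBase:
    @classmethod
    def register(cls):
        raise NotImplementedError("register not implemented in {}".format(cls))


class SawMill(PluginBase):
    """A hypothetical plugin class."""

    @classmethod
    def register(cls):
        return "SawMill: turns wood into planks"


class Helper:
    """Supporting code - not a plugin, so it must be ignored."""


# Stand-in for a module loaded with importlib.import_module
demo_module = types.ModuleType("demo_plugin")
demo_module.PluginBase = PluginBase
demo_module.SawMill = SawMill
demo_module.Helper = Helper

found = {}
for name, cls in inspect.getmembers(demo_module, inspect.isclass):
    if issubclass(cls, PluginBase):
        try:
            found[name] = cls.register()
        except NotImplementedError:
            continue  # PluginBase itself has no usable register method

print(found)  # {'SawMill': 'SawMill: turns wood into planks'}
```

Only SawMill survives: Helper fails the issubclass test, and PluginBase is dropped because its register method raises.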

2 - Clear Functionality

The clear functionality characteristic is incredibly easy to implement, again using inheritance. Your application will have a number of base classes which define the basic functionality that each element of your game will implement, so just ensure that your plugin classes inherit from one of these classes.

import collections

def categorise_plugins(cls_dict, base_classes):
    """Split the cls_dict into one or more dictionaries depending on which base_class the plugin class inherits from"""

    # defaultdict calls its factory with no arguments, so dict itself is all that is needed
    categorise = collections.defaultdict(dict)
    for base in base_classes:
        for cls_ in cls_dict:
            if issubclass(cls_, base):
                categorise[base][cls_] = cls_dict[cls_]
    return categorise
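As a usage sketch, the snippet below runs a condensed copy of categorise_plugins (repeated so that it runs on its own) against some invented classes. Building, Resource, SawMill and MagicTree are hypothetical names; MagicTree inherits from both base classes to show that a single plugin can legitimately appear in more than one category.

```python
import collections


def categorise_plugins(cls_dict, base_classes):
    """Group plugin classes by the application base class they inherit from."""
    categorised = collections.defaultdict(dict)
    for base in base_classes:
        for cls_, info in cls_dict.items():
            if issubclass(cls_, base):
                categorised[base][cls_] = info
    return categorised


# Hypothetical application base classes
class Building:
    pass


class Resource:
    pass


# Hypothetical plugin classes
class SawMill(Building):
    pass


class MagicTree(Building, Resource):
    """Both a building and a resource, so it lands in both categories."""


plugins = {SawMill: "Saw mill", MagicTree: "Magic tree"}
by_type = categorise_plugins(plugins, [Building, Resource])

print(sorted(by_type[Building].values()))   # ['Magic tree', 'Saw mill']
print(sorted(by_type[Resource].values()))   # ['Magic tree']
```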
We can put all of this together into a useful helper class for the loading and unloading of the plugin functionality - see plugin_framework on GitHub for the full implementation, and a demonstration simple application.

3 - Simple to Use

How your application makes the plugin simple to use and accessible really does depend on your application, but as promised here are some pointers :
  • The register() method on the plugin class could be used to return information on which menus etc this plugin should appear in - or even whether new menus, toolboxes etc should be created to allow access to this plugin.
  • In most cases the plugin class should also allow itself to be instantiated, and each instance may well be in a different state at any given time. The class therefore will need to implement methods to allow the application to use those instances, to change their state etc.
It is up to the application to define the interface expected of each different base class (i.e. the attributes, classmethods, staticmethods and instance methods). This definition should be clearly documented.
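One possible way to make such an interface explicit (this is an assumption on my part, not something the published framework does) is to declare the base class with the standard abc module, so that a plugin which forgets a required method cannot be instantiated. BuildingBase, SawMill, BrokenMill and the produce method are all hypothetical names.

```python
import abc


class BuildingBase(abc.ABC):
    """Hypothetical base class documenting what every building plugin provides."""

    @classmethod
    @abc.abstractmethod
    def register(cls):
        """Return a short description of the plugin."""

    @abc.abstractmethod
    def produce(self, hours):
        """Return the quantity of resource produced in the given hours."""


class SawMill(BuildingBase):
    @classmethod
    def register(cls):
        return "Saw mill"

    def produce(self, hours):
        return 5 * hours


class BrokenMill(BuildingBase):
    @classmethod
    def register(cls):
        return "Broken mill"
    # produce() is deliberately missing


mill = SawMill()
print(mill.produce(2))  # 10

try:
    BrokenMill()
except TypeError as err:
    # abc refuses to instantiate a class with unimplemented abstract methods
    print("rejected:", err)
```

The abstract methods double as the documentation of the interface, and the failure appears as soon as the application tries to create the instance rather than later during play.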

Saturday 21 November 2015

Python Weekly #6 : Regular Expressions - a starter for 10

Regular expressions

or "How I stopped worrying and learned to love regular expressions"
(with apologies to Stanley Kubrick)

In a previous post (Batteries Included - 5 modules every python developer should know) I mentioned that everyone should learn about regular expressions and the re package in the Standard Library. I will be completely honest at this point and say that when I wrote that post, I really did not understand the package myself, but I did understand how important it is, and that I really should learn it one day. That day is now - I have started to learn regular expressions, and some of my early experiments are documented in this article.

In general terms, a regular expression (or regex) is a way of executing a very flexible search (and potentially replace) type operation against a piece of text, without having to worry about parsing each character at a time. Python has the re package which implements a powerful set of regex syntax, and I will be using this in this article.

Useful sites

A few useful sites which would be worth bookmarking :
This article will work through building a regex to extract telephone numbers from some arbitrary text. Throughout this article, I am going to use the findall method to find matching sub strings. Depending on your uses, it could be more advantageous to use some of the other methods within the re package. For instance if you just need the first match, it is likely to be much more efficient to use the search method, or if you plan to simply separate the text into various sections - then the split method is probably what you need.
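As a quick illustration of how those three methods differ, here is a small sketch which uses only literal patterns (no special regex syntax yet) against an invented piece of text.

```python
import re

text = "spam, eggs, spam, ham"

# findall : every non-overlapping match, as a list of strings
all_spam = re.findall(r"spam", text)

# search : only the first match, returned as a match object (or None)
first_eggs = re.search(r"eggs", text)

# split : cut the text into pieces wherever the pattern matches
pieces = re.split(r", ", text)

print(all_spam)             # ['spam', 'spam']
print(first_eggs.start())   # 6
print(pieces)               # ['spam', 'eggs', 'spam', 'ham']
```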
Let's start by simply trying to find numbers within your text; the regex will build in complexity and functionality as we go.
>>> import re
>>> def find_nums(text_str):
...     return re.findall(r"\d+", text_str)
...
>>> find_nums( "Call us on 0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
['0133', '4456', '789', '07664', '282', '777', '8', '6', '08', '00', '18', '00']
And there is the power of regex in a nutshell - one simple line. Let's explain that one line :
return re.findall(r"\d+", text_str)
The key part of this is the first argument to the findall - this is the regular expression (regex):
  • r : this signifies a raw string - which means that python won't try to do anything with the '\' characters within the string - i.e. what gets passed to the findall method is exactly what you type. It is strongly recommended that you always use raw strings when constructing regexs.
  • \d : A special regex sequence which specifically matches a single digit
  • + : A special character (a repeater) which means that you should match one or more of the previous matched sequence - i.e. we match one or more digits
The second argument to the findall method is termed the 'target text', which is the text to be searched or matched. The findall method searches the target text to find all the non-overlapping sub strings which match the regex - in our case all substrings which are numbers. This regex is greedy, which means that each match is the longest possible substring it can be - which is what we wanted here.

Greedy vs not-greedy

Many discussions and tutorials use the terms 'greedy' and 'non-greedy' without clear explanations. Hopefully these examples will make things clear.
  • A regex of r"\d+" with a target text of "1234" will match "1234" - it is a greedy match, and uses the longest possible string.
  • A regex of r"\d+?" with a target text of "1234" will match "1" - it is non-greedy match - and uses the shortest possible string.
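The difference is easy to check interactively. The sketch below uses search for the first match and findall to show every match; the "<a><b>" text is an extra, invented example where the greedy and non-greedy forms give visibly different results.

```python
import re

target = "1234"

greedy_first = re.search(r"\d+", target).group(0)    # longest possible match
lazy_first = re.search(r"\d+?", target).group(0)     # shortest possible match
lazy_all = re.findall(r"\d+?", target)               # one digit at a time

print(greedy_first)   # 1234
print(lazy_first)     # 1
print(lazy_all)       # ['1', '2', '3', '4']

# The same effect with a different pattern
tags = "<a><b>"
print(re.findall(r"<.+>", tags))    # ['<a><b>']
print(re.findall(r"<.+?>", tags))   # ['<a>', '<b>']
```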
As you can see - findall returns a list of the matched sub-strings if there are any, or an empty list if there are no matches, but we don't actually want all of those matches (it has found the text relating to the opening times, as well as the phone numbers).

Let's start building up a more complex regex - that will just extract the phone numbers (or at least things that look like phone numbers). From now on I will use the verbose flag - so that we can document the regex as we go.
Let's start with our first format - that will match a phone number of the format nnnn nnnn nnn (4 digits, 4 digits, 3 digits), with simple spaces separating the various number groups.
>>> import re
>>> def find_nums(text_str):
...     return re.findall(r"""
...         (?x)              # Verbose mode
...         \d{4}             # 4 digits
...         \s                # space
...         \d{4}             # 4 digits
...         \s                # space
...         \d{3}             # 3 digits
...         """, text_str)
...
>>> find_nums( "Call us on 0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
['0133 4456 789']
This seems to do what we want - but just for one format, and it would probably be inefficient to do separate searches for each possible format, so we will use the '|' combination operator to combine multiple searches into a single regex.
>>> import re
>>> def find_nums(text_str):
...     return re.findall(r"""
...         (?x)                   # Verbose mode
...         \d{4}\s\d{4}\s\d{3}    # nnnn nnnn nnn
...         |                      # Or
...         \d{5}\s\d{3}\s\d{3}    # nnnnn nnn nnn
...         """, text_str)
...
>>> find_nums( "Call us on 0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
['0133 4456 789', '07664 282 777']
There is no limit to how complex the search sequence can be between the '|' characters, and brackets are not needed to group together the regex patterns either side of the '|'.
Our regex has found both of the phone numbers, and importantly it no longer finds the time information. So how could we extend it ? In the UK - all full format numbers will always start with a zero :
>>> import re
>>> def find_nums(text_str):
...     return re.findall(r"""
...         (?x)                    # Verbose mode
...         0\d{3}\s\d{4}\s\d{3}    # 0nnn nnnn nnn
...         |                       # Or
...         0\d{4}\s\d{3}\s\d{3}    # 0nnnn nnn nnn
...         """, text_str)
...
>>> find_nums( "Call us on 0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
['0133 4456 789', '07664 282 777']

We continue to add functionality : In the UK we have a habit of putting brackets around the local code (the first block of numbers), but the brackets are optional - so let's extend our regex:
>>> import re
>>> def find_nums(text_str):
...     return re.findall(r"""
...     (?x)                    # Verbose mode
...     \(?                     # Optional opening bracket
...     0\d{3}                  # 0nnn dialling code
...     \)?                     # Optional closing bracket
...     \s\d{4}\s\d{3}          # remainder of number : nnnn nnn
...     |                       # Or
...     \(?                     # Optional opening bracket
...     0\d{4}                  # 0nnnn dialling code
...     \)?                     # Optional closing bracket
...     \s\d{3}\s\d{3}          # remainder of number : nnn nnn
...     """, text_str)
...
>>> find_nums( "Call us on 0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
['0133 4456 789', '07664 282 777']
>>> find_nums( "Call us on (0133) 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
['(0133) 4456 789', '07664 282 777']
There is a problem - this simple regex is very naive - and will match a number which is probably malformed : 
>>> import re
>>> find_nums( "Call us on 0133) 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
['0133) 4456 789', '07664 282 777']
and this is probably not what you want to happen - it certainly looks wrong, so let's make our regex more robust :
>>> import re
>>> def find_nums(text_str):
...     return re.findall(r"""
...      (?x)                    # Verbose mode
...      (                       # Outer Group 0nnn nnnn nnn or 0nnnn nnn nnn
...          (                   # Group 1 - 
...               \(0\d{3}\)     # match (0nnn)
...               |              # or
...               0\d{3}         # match 0nnn
...          )                   # Close Group 1
...          \s\d{4}\s\d{3}      # remainder of number : nnnn nnn
...          |                   # Or
...          (                   # Group 2 
...               \(0\d{4}\)     # match (0nnnn)
...               |              # or
...               0\d{4}         # match 0nnnn
...          )                   # Close Group 2
...          \s\d{3}\s\d{3}      # remainder of number : nnn nnn
...      )                       # Close Outer Group
...          """, text_str)
...
>>> find_nums( "Call us on 0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
[('0133 4456 789', '0133', ''), ('07664 282 777', '', '07664')]
>>> find_nums( "Call us on (0133) 4456 789 or (07664) 282 777 between 8am to 6pm (08:00 to 18:00)")
[('(0133) 4456 789', '(0133)', ''), ('(07664) 282 777', '', '(07664)')]
>>> find_nums( "Call us on 0133) 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
[('07664 282 777', '', '07664')]
Because we are now using groups, the findall method returns a list of tuples, where each element of the tuple is associated with a group in the regex, in the same order that the groups start in the regex. If you are not sure, simply count along the group-starting '(' characters in your regex - but beware of other '(' in your regex which don't start groups (for example an escaped \( which matches a literal bracket).

Beware

One issue that is commonly faced is that the regex syntax uses the same symbols for lots of different things, and they can be tough to read. This is why the verbose mode is so useful, as you can clearly annotate exactly what your regex does, and where groups start (and end).
In our case we have an outer group which surrounds all of our regex - and this appears as the 1st element in our tuple. The other elements are associated with the groups which capture the dialling code (with or without its brackets), and they can be ignored in this example.

This seems like the finished article - we can detect both common number formats (with and without brackets), and we can reject numbers with incorrect brackets. But - there is an issue (we are ignoring that there are a number of other UK number formats or UK numbers with the international prefix). We tested for a misplaced closing bracket - but what about a misplaced opening bracket ?
>>> find_nums( "Call us on (0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
[('0133 4456 789', '0133', ''), ('07664 282 777', '', '07664')]
By including another test case we can see that our current regex will ignore that one of our numbers has an opening bracket and no closing bracket, and match against the digits anyway. The problem we have is that although the regex allows for brackets around the dialling code, it doesn't exclude a '(' in the case which is meant to be matching the un-bracketed dialling code, so let's fix our regex. We will use a special syntax called a Negative Lookbehind Assertion : the syntax is (?<!...). This looks back along the string being matched and generates a positive result if the preceding characters do not match the given pattern - in our case (?<!\() will only match if the previous character is not a '(' - which is exactly what we need.
>>> import re
>>> def find_nums(text_str):
...     return re.findall(r"""
...      (?x)                    # Verbose mode
...      (                       # Outer Group 0nnn nnnn nnn or 0nnnn nnn nnn
...          (                   # Group 1 - 
...               \(0\d{3}\)     # match (0nnn)
...               |              # or
...               (?<!\()0\d{3}  # match 0nnn - not preceded by (
...          )                   # Close Group 1
...          \s\d{4}\s\d{3}      # remainder of number : nnnn nnn
...          |                   # Or
...          (                   # Group 2 
...               \(0\d{4}\)     # match (0nnnn)
...               |              # or
...               (?<!\()0\d{4}  # match 0nnnn - not preceded by (
...          )                   # Close Group 2
...          \s\d{3}\s\d{3}      # remainder of number : nnn nnn
...      )                       # Close Outer Group
...          """, text_str)
...
>>> find_nums( "Call us on 0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
[('0133 4456 789', '0133', ''), ('07664 282 777', '', '07664')]
>>> find_nums( "Call us on (0133) 4456 789 or (07664) 282 777 between 8am to 6pm (08:00 to 18:00)")
[('(0133) 4456 789', '(0133)', ''), ('(07664) 282 777', '', '(07664)')]
>>> find_nums( "Call us on 0133) 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
[('07664 282 777', '', '07664')]
>>> find_nums( "Call us on (0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
[('07664 282 777', '', '07664')]
This now seems to do exactly what we expect - well formed phone numbers are extracted from an arbitrary piece of text. The lesson from this step is that if there is an optional substring, your regex might need two paths: one which looks for the optional substring, and a second path which works on the premise that the optional substring is absent. There is one final issue to address, which can be illustrated by the following test case :
>>> find_nums( "Call me on 0133 4456 78977")
[('0133 4456 789', '0133', '')]
Our regex matches numbers which are too long, but which contain a valid number format - we need to ensure that there are no digits after the last one we match. This is relatively easy - we can use a Negative Lookahead Assertion : the syntax is (?!...). This looks forward along the string being matched and generates a positive result if the next characters do not match the given pattern - in our case (?!\d) will only match if the next character is not a digit - which is exactly what we need.
>>> import re
>>> def find_nums(text_str):
...     return re.findall(r"""
...      (?x)                        # Verbose mode
...      (                           # Outer Group 0nnn nnnn nnn or 0nnnn nnn nnn
...          (                       # Group 1 - 
...               \(0\d{3}\)         # match (0nnn)
...               |                  # or
...               (?<!\()0\d{3}      # match 0nnn
...          )                       # Close Group 1
...          \s\d{4}\s\d{3}(?!\d)    # remainder of number : nnnn nnn
...          |                       # Or
...          (                       # Group 2 
...               \(0\d{4}\)         # match (0nnnn)
...               |                  # or
...               (?<!\()0\d{4}      # match 0nnnn
...          )                       # Close Group 2
...          \s\d{3}\s\d{3}(?!\d)    # remainder of number : nnn nnn
...      )                           # Close Outer Group
...          """, text_str)
>>>
>>> find_nums( "Call us on 0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
 [('0133 4456 789', '0133', ''), ('07664 282 777', '', '07664')]
>>> find_nums( "Call us on (0133) 4456 789 or (07664) 282 777 between 8am to 6pm (08:00 to 18:00)")
 [('(0133) 4456 789', '(0133)', ''), ('(07664) 282 777', '', '(07664)')]
>>> find_nums( "Call us on 0133) 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
 [('07664 282 777', '', '07664')]
>>> find_nums( "Call us on (0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
 [('07664 282 777', '', '07664')]
>>> find_nums( "Call me on 0133 4456 78977")
 []

Non-capturing regex

There is a concept called a non-capturing group, the syntax is (?:...). The intention is that these patterns match against the 'target text' and consume the characters that are matched but, unlike ordinary groups, the results aren't captured as a separate group in the results (i.e. no extra elements in the reported tuple from the findall method, and no extra group in the results from the search or match methods).
At first glance it seems that we could have used them instead of the Lookahead or Lookbehind assertions (which we used in the last two iterations), as all we want to do is ensure that certain substrings exist (or don't), but we don't care what they are. But, in reality, there is a wrinkle :
It is true that the non-capturing group (?:[^(]) will only match if there isn't a '(', and there won't be a group created for whatever is there, but the character that is there is consumed, so it will be reported as part of the containing group; since the entire regex is treated as a group, that character will be reported as part of the overall match - which is not what we want in this case.
The main benefit of a non-capturing group is that it can be added to an existing regex without impacting the numbering of the existing groups, and there is a small performance improvement.
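Both points can be seen in a short sketch. The patterns here are cut-down versions of the phone number regex, and the target texts are invented for the illustration.

```python
import re

text = "Call (0133) 4456 789 today"

# With a capturing group, findall reports only the group content
capturing = re.findall(r"(\(0\d{3}\)|0\d{3})\s\d{4}\s\d{3}", text)

# With a non-capturing group, findall reports the whole match
non_capturing = re.findall(r"(?:\(0\d{3}\)|0\d{3})\s\d{4}\s\d{3}", text)

print(capturing)       # ['(0133)']
print(non_capturing)   # ['(0133) 4456 789']

# The wrinkle : a non-capturing group still consumes what it matches,
# whereas a lookbehind assertion only checks and consumes nothing
consumed = re.findall(r"(?:[^(])0\d{3}", "x0133")
asserted = re.findall(r"(?<!\()0\d{3}", "x0133")

print(consumed)   # ['x0133']
print(asserted)   # ['0133']
```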

Final Comments

It should be noted that we are only testing our regex pattern with 5 test cases - which is definitely not sufficient for production software, but it does seem like we have most of the issues solved.
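One way to keep those test cases from getting lost is to hold them in a table and check them in a loop. The sketch below does that with the final regex; it passes re.VERBOSE as a flag rather than using the inline (?x) form, and the names PHONE_RE, CASES and find_numbers are mine rather than anything from the article.

```python
import re

PHONE_RE = re.compile(r"""
    (                            # outer group : the whole number
        (                        # dialling code, 4 digits
            \(0\d{3}\)           #   (0nnn)
            |
            (?<!\()0\d{3}        #   0nnn, not preceded by an opening bracket
        )
        \s\d{4}\s\d{3}(?!\d)     # nnnn nnn, not followed by a digit
        |
        (                        # dialling code, 5 digits
            \(0\d{4}\)           #   (0nnnn)
            |
            (?<!\()0\d{4}        #   0nnnn, not preceded by an opening bracket
        )
        \s\d{3}\s\d{3}(?!\d)     # nnn nnn, not followed by a digit
    )
    """, re.VERBOSE)

# (target text, expected numbers)
CASES = [
    ("Call us on 0133 4456 789 or 07664 282 777 between 8am to 6pm",
     ["0133 4456 789", "07664 282 777"]),
    ("Call us on (0133) 4456 789 or (07664) 282 777",
     ["(0133) 4456 789", "(07664) 282 777"]),
    ("Call us on 0133) 4456 789 or 07664 282 777", ["07664 282 777"]),
    ("Call us on (0133 4456 789 or 07664 282 777", ["07664 282 777"]),
    ("Call me on 0133 4456 78977", []),
]


def find_numbers(text):
    # findall returns one tuple per match; element 0 is the outer group
    return [groups[0] for groups in PHONE_RE.findall(text)]


for text, expected in CASES:
    assert find_numbers(text) == expected, text
print("all", len(CASES), "cases pass")
```

Adding a new number format then becomes a matter of adding a row to CASES first, and extending the regex until the loop passes again.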

For those of you who like pictures, this is a visualization (created by Debuggex) of our final regex.

Saturday 14 November 2015

Python Weekly #5 : Batteries Included - 5 Standard Library Packages every python developer should know

Batteries Included

5 Standard Library Packages every python developer should know

One of the reasons that python is incredibly popular is that as a language it is relatively fully formed out of the box. By fully formed I mean that much of the low level, and even medium level, functionality is already developed and included in the Standard Library (this is what is often referred to as Batteries Included). The Standard Library is also very well documented for all versions of python - so it is difficult to go wrong.

It is also often the case that even if the standard library doesn't include what you need - someone else may have already published what you need : (See PyPI - the Python Package Index - for a complete list of everything that has been published by other developers - 69350 packages so far. It can be tricky to search if you aren't sure exactly how to describe what you need).

In this blog I am going to summarise 5 packages within the standard library that every developer should learn - in no particular order

re : Regular expression matching

Many applications will require some form of text processing, and if you have to go any more complex than finding a simple string in a piece of input, then you should learn how to use regular expressions - as implemented by the re module.
Regular expressions are a fantastic way to fuzzy search a text string, and extract sections of the text. In a future blog post I will cover the module in more detail.
Despite the power of regular expressions, some text problems are not really conducive to using regular expressions - for instance parsing XML, HTML or CSS : In these cases it is far better to use dedicated libraries to parse those types of input.
A word of warning though : regular expressions are very powerful, and can be complex. They will take time to learn and even longer to master - they are worth it - there are plenty of tutorials available.

datetime : Manipulation of dates and time stamps

Why bother with datetime ? - very simply (unless you are asked to do it as a college assignment) life is far too short to try to write your own module to handle dates and times. Seemingly simple rules can become very complex when you have to take into account leap years, leap seconds, time zones, different formats, and all the other nuances. Don't make the effort, even to write a simple module, when the datetime module (and its partner, calendar) already exist, and are known to work.
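A tiny sketch of the sort of detail the module takes care of for you : stepping over the end of February in an ordinary year and in a leap year.

```python
import calendar
from datetime import date, timedelta

one_day = timedelta(days=1)

print(date(2015, 2, 28) + one_day)   # 2015-03-01
print(date(2016, 2, 28) + one_day)   # 2016-02-29
print(calendar.isleap(2016))         # True

# Number of days between the same date in consecutive years
print((date(2016, 3, 1) - date(2015, 3, 1)).days)   # 366
```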

logging : flexible event logging system

The logging module is a flexible and standard method for getting useful information from your application, instead of including lots of print statements/functions in your code while developing, and then removing them before delivery (and risking breaking your code). With the logging module in use - just change your logging level - and your debug logging will cease - although the best way to debug is to actually use a debugger, in combination with your logging information, and of course your automated test cases.
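A minimal sketch of that level switch is shown below. The logger name demo_app is invented, and the messages are written to an in-memory stream only so that the result is easy to inspect; a real application would more likely log to the console or to a file.

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))

log = logging.getLogger("demo_app")   # hypothetical logger name
log.addHandler(handler)
log.propagate = False                 # keep the demo output in our stream only

# While developing : show everything
log.setLevel(logging.DEBUG)
log.debug("loaded %d plugins", 3)

# For delivery : change the level, leave the debug calls in place
log.setLevel(logging.WARNING)
log.debug("this message is filtered out")
log.warning("plugin directory missing")

print(stream.getvalue())
```

Only the first debug message and the warning reach the stream; the second debug call is still in the code but costs almost nothing.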

argparse : Command line argument parsing

Many complex applications will support a command line with one or more options and arguments. The argparse package within the standard library contains all the tools you need to parse the command line and extract the arguments and options. You will need to write code to identify invalid combinations, or invalid arguments, although in many cases this should not be that complex.
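As a short sketch, here is a parser with one positional argument and two options. The program name and the argument names are invented, and the argument list is passed in explicitly (instead of being read from the real command line) so that the snippet can be run anywhere.

```python
import argparse

parser = argparse.ArgumentParser(prog="demo", description="Process a data file")
parser.add_argument("filename", help="the file to process")
parser.add_argument("-v", "--verbose", action="store_true",
                    help="print progress information")
parser.add_argument("-n", "--count", type=int, default=1,
                    help="number of passes to make (default 1)")

# Normally parse_args() reads sys.argv; a list is given here for the demo
args = parser.parse_args(["data.txt", "--count", "3"])

print(args.filename, args.verbose, args.count)   # data.txt False 3
```

The type conversion, default values and the -h/--help text all come for free.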

unittest : A unit testing framework

As I mentioned in a previous blog post (10 things to learn as a python developer), and again in Incremental Development, it is vital that you use some technique to allow you to run a consistent set of tests against your code, to ensure that it works and continues to work as you develop it. There are plenty of mechanisms that do this, but I would strongly recommend that you use unittest, a very powerful set of tools which allow you to manage and execute your tests.
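For anyone who has not seen it before, a minimal unittest sketch looks like this. The add function is a small stand-in for your own code, and the suite is run explicitly through a loader and runner here; in a real project you would more usually finish the file with unittest.main() or run python -m unittest.

```python
import unittest


def add(number, increment=1):
    """Hypothetical function under test."""
    return number + increment


class AddTests(unittest.TestCase):
    def test_default_increment(self):
        self.assertEqual(add(10), 11)

    def test_explicit_increment(self):
        self.assertEqual(add(10, 2), 12)

    def test_rejects_text(self):
        # Adding an int to a str is an error, and the test pins that down
        with self.assertRaises(TypeError):
            add("10")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```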

Monday 9 November 2015

Python Gotcha #3 : Default Arguments, be careful

Python Gotchas - Default arguments

or how not to get caught out when you start using python

At first glance python seems very familiar, especially if you have used other procedural languages such as C or Java - but actually Python is different - in some cases very different, and those differences can trip you up as you progress along your python journey. In this series of occasional posts, I am going to cover some of those gotchas.

Default Arguments

Once you have spent any time with python, you will be very aware that the language supports the ability to define default values for function arguments.
>>> #---------------- Example 1
>>> def add(number, increment=1):
...    return number + increment
>>>
>>> add(10)
11
>>> add(10, 2)
12
 
This can be an incredibly powerful feature which can dramatically improve the readability of your code, by ensuring that you only need to specify an argument if you are doing something which is not usual or common. The example above is not a very good example of using default argument values.

For instance look at the str.find method (for finding one string in another). The method has both the start and end arguments (if you need to search only part of the string), but both of those arguments have sensible default values, so that in most cases you encounter, when you want to search the entire string, all you need to do is provide a single argument :

>>> #---------------- Example 2
>>> stra = "This is a dead parrot, deceased, no-longer living."
>>> stra.find("e")   # Find the first 'e' in the string.
11
>>> stra.find("e", 12) # Find the first 'e' starting from index 12
24
>>> stra.find("e", 12,23) # Find the first "e" between character 12 & 23
-1


Say for instance that you write a function which, amongst other things, creates and populates a file, and then sets the permissions on that file. Most of the time when your application calls the function you want the file permissions set to "rw" (read/write), but on some rare occasions you want your application to set the permission to "r" (read only). It would be a good idea to make the permissions argument have a default value of "rw" :

#---------------- Example 3
def write_file(data, file_name, permission="rw"):
    ...
    # Code to write file and set permission goes here
    return  0 # Success
 
write_file(data1, "data1.txt") # written with rw permission
...
write_file(data2, "data2.txt")
...
write_file(LicenseInfo, "LicenseReadme.txt", "r") # Don't want the license file to be changed


You can see how having a sensible default value makes the code readable, but not cluttered.

And now to the gotcha : There is a danger lurking here - which I will demonstrate in my next set of examples :

Imagine you are writing a system to record students as they enrol for courses at a college, and the first thing you need to do is to record the Student's name on a list - so you write a function as below :

#---------------- Example 4
>>> def RegisterStudent(student, existing_students=[]):
...    existing_students.append(student)
...    return existing_students
>>>
>>> class1 = RegisterStudent("John")
>>> class1
['John']
>>> class1 = RegisterStudent("Mark", class1)
>>> class1
['John', 'Mark']
>>> # ---------------------------- All Good so far


Your intention is that, by having the default argument as an empty list, you can create multiple lists, one for each course, and you can signify the creation of a new list by omitting the existing_students argument (since when it is a new list there are no existing students).

Now - let's try to create another student list for a second course, using the RegisterStudent function above

#---------------- Example 5
>>> class2 = RegisterStudent("Lucas") # This should work - shouldn't it ?
>>> # -------- Lets check
>>> class2
['John', 'Mark', 'Lucas'] # We have the names from class1 in our list too
>>> # -------- and even worse ?
>>> class1
['John', 'Mark', 'Lucas']
>>> class1 is class2   # They are the same list (the same object)
True


So clearly this does not work - but why not ? How did our two default lists end up as the same object ?

Looking at the definition of RegisterStudent in example 4, it would be reasonable to expect that if the 2nd argument (existing students) isn't provided then a new empty list would be created. However, that is not what happens, and the reality is a bit complicated - so stay with me :
  1. Definition : The existing_students=[] default is evaluated once, when the def statement itself is executed (for example when the module is first imported), not each time the function is called. At that point a new empty list object is created and stored on the RegisterStudent function object as the default value for existing_students.
  2. Execution : When the function is then called, the interpreter checks whether the existing_students argument has been provided, and if not, the object which was created during step 1 is passed into the function body as the existing_students argument.
  3. In the body of the function, when the code changes the existing_students list (by appending to it; append is a change in place, i.e. no new object is created), this is a change to the object created in step 1.
  4. The next time that the function is called with a missing existing_students argument, the interpreter does everything in step 2, and the body of the function will be passed the changed list.
The summary of this is, that if you use a mutable value (list, dictionary, set etc) as the default argument in one of your functions, and then change that variable within your function, then you will actually be changing the value of the default argument which will be used when you call the function again : and often this is not what you want.
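One way to watch this happen is to look at the __defaults__ attribute, which is where Python keeps a function's default values. The sketch below uses a lower-case register_student, a renamed copy of the function from Example 4, purely for illustration.

```python
def register_student(student, existing_students=[]):
    """Same shape as Example 4: the default list is shared between calls."""
    existing_students.append(student)
    return existing_students


# The default object is stored on the function when the def statement runs.
print(register_student.__defaults__)   # ([],)

register_student("John")
register_student("Mark")

# The stored default has been mutated by the two calls above.
print(register_student.__defaults__)   # (['John', 'Mark'],)
```

The list printed the second time is the very object that the next call without an existing_students argument will receive.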

A better version of our RegisterStudent function :

#---------------- Example 6
>>> def RegisterStudent(student, existing_students=None):
...    if existing_students is None:
...        existing_students = []
...    existing_students.append(student)
...    return existing_students
>>>
>>> class1 = RegisterStudent("John")
>>> class1
['John']
>>> class1 = RegisterStudent("Mark", class1)
>>> class1
['John', 'Mark']
>>> # ---------------------------- All Good so far
>>> class2 = RegisterStudent("Lucas")
>>> class2
['Lucas']
>>> class1
['John', 'Mark']
>>> class1 is class2
False


By using None (instead of []) as the default, and adding a simple if statement, we avoid the issue with mutable default arguments and ensure that whenever the existing_students argument is omitted from a function call, we get a brand new list to start adding to.

There are other ways of avoiding the mutable default argument issue - but the above method of using None is the method recommended even in the official documentation.

Note : The if statement can be made even simpler by using a conditional expression :

#---------------- Example 7
>>> def RegisterStudent(student, existing_students=None):
...    existing_students = existing_students if existing_students else []
...    existing_students.append(student)
...    return existing_students


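This conditional expression works due to the rules that python uses to determine the truth value of a value which isn't strictly True/False. Both None and an empty list [] evaluate as False, and any non-empty list evaluates as True. Be aware of one difference from Example 6 : if the caller deliberately passes in an empty list, it is also treated as False, so a new list is created and the caller's own list is left unchanged. If that matters, use the explicit is None test from Example 6.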
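The difference between the truth-value test and the is None test only shows up when a caller passes in an empty list of their own. A minimal sketch, using two hypothetical function names so that both styles can sit side by side :

```python
def register_truthy(student, existing_students=None):
    # Example 7 style: any falsy value, including [], is replaced.
    existing_students = existing_students if existing_students else []
    existing_students.append(student)
    return existing_students


def register_is_none(student, existing_students=None):
    # Example 6 style: only a missing argument (None) is replaced.
    if existing_students is None:
        existing_students = []
    existing_students.append(student)
    return existing_students


course = []                       # an empty list the caller wants filled in
register_truthy("John", course)
print(course)                     # [] - the caller's list was not updated

register_is_none("John", course)
print(course)                     # ['John']
```

If callers always use the returned list, both versions behave the same; the difference only matters when the caller relies on their own list being updated in place.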

Note 2: If you are writing code that is sharing data between a number of different functions, then you would probably be better off investigating writing a class, which can hold the data, rather than pass the data around as arguments.

Functions as default arguments

It is also important to remember that this doesn't just happen when a list or dictionary is used as a default argument. You will also get potentially unexpected results if you try to use a function call as a default argument. For instance it might seem logical to write code like the example below

#---------------- Example 8
>>> from datetime import datetime
>>> def log_message(msg, ts=datetime.now()):
...     """Create a log message in a known format, adding the time stamp to the message (defaulting to now)"""
...     return "Log : {} {}".format(ts, msg)


But as you might now have worked out, the ts=datetime.now() is evaluated only once (when the def statement is executed, which usually means when the file is first run or imported), and if we were to use the log_message function, then it would create messages with the timestamp of that moment every time it is called with the ts argument omitted, which is clearly not the expected functionality.

Thankfully we can use the same mechanism as in Examples 6 or 7 above (using a default argument of None) in order to get the expected functionality :

#---------------- Example 9
>>> from datetime import datetime
>>> def log_message(msg, ts=None):
...    """Create a log message in a known format, adding the time stamp to the message (defaulting to now)"""
...    ts = ts  if ts else datetime.now()
...    return "Log : {} {}".format(ts, msg)

And finally :

There is a case when using a mutable type (dictionary, list etc) as a default argument can be very useful.

Imagine writing a function that will generate the nth Fibonacci number :

>>> #---------------- Example 10
>>> def fib(n):
...     if n == 0:
...             return 0
...     if n == 1:
...             return 1
...     return fib(n-1) + fib(n-2)


This function certainly works, but it does have an issue : every call recalculates the lower values, many of them multiple times, and it gets worse if you calculate a lot of different values. Since the Fibonacci series doesn't change - is there some way we can store the values we have already calculated, to make our code run faster ?

>>> #---------------- Example 11
>>> def fib2(n, cache={0:0, 1:1} ):
...     if n in cache:
...         return cache[n]
...     cache[n] =  fib2(n-1) + fib2(n-2)
...     return cache[n]


In this new function we now have a cache dictionary as a default argument. The cache parameter is never passed when we call the fib2 function, but as you can see, as the function calculates new values it adds them to the cache, and we know that the cache object is shared between multiple calls. The big time difference in the cached version arises because making many recursive function calls is a lot slower than looking up one value in a dictionary.

This technique is memoization, and from the timings below, the speed improvement is considerable :

$ python -m timeit -n 1 -r 10 -s "import fib" "[fib.fib(n) for n in range(30)]"
10 loops, best of 10: 407 msec per loop
$ python -m timeit -n 1 -r 10 -s "import fib" "[fib.fib2(n) for n in range(30)]"
10 loops, best of 10: 5.01 usec per loop

Yes that is really 407 milliseconds (i.e. 0.4 seconds) compared to 5 microseconds (i.e. 0.000005 seconds) - and the code is only building a list of the first 30 numbers - imagine the savings from building bigger lists.

A graph of the timings is illuminating - building a list of Fibonacci numbers from 0 to n.
As you can see, the time taken increases exponentially as n increases for the non-memoized version (example 10), whereas the timing for the memoized function increases at a far lower rate.

Beware though - in this case the optimization works particularly well because we are building a list of values, and as we attempt to calculate the higher values in the list, the lower values have already been calculated and stored. If we just called the memoized version to calculate a single small value, the timings are likely to be similar to the non-memoized version, since there is little repeated work to save.
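If you would rather not rely on a mutable default argument for the cache, the standard library offers an alternative in Python 3.2 and later : the functools.lru_cache decorator. This is a different technique from Example 11, shown here only as a sketch; the name fib3 is made up for the illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib3(n):
    """Fibonacci with results cached by the standard library decorator."""
    if n < 2:
        return n
    return fib3(n - 1) + fib3(n - 2)


print([fib3(n) for n in range(10)])   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
print(fib3.cache_info().currsize)     # 10
```

The effect is the same as in fib2 (each value is calculated once and then looked up), but the cache is managed by the decorator and does not appear in the function's argument list.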

Saturday 7 November 2015

Python Weekly # 4 : Scoping - arguments and variables.

Python Scoping

Arguments, variables and other things

In this article I wanted to explore something of the rules which govern python scoping. Scoping is how Python decides when and where a name is valid and accessible. In Python names are not the same as objects or data, so even if the name is no longer valid, the object may well still exist (it depends on how many other references there are to the object - in simple terms how many other names are bound to that object or how many times the object appears in a dictionary or list).

When a python program executes, scopes are created for each module that is imported, and each class that is defined (although there is a twist to class scopes explained below). A scope is also created each time a function or method runs. You can think of a scope as like a dictionary which matches a name to the object that it is bound to.  As the program runs, python keeps track of the various scopes which are created, in a nested fashion, and at any point there are one or more scopes which exist.

The scoping rules are actually relatively simple : 
  • When a name is bound to an object it is created in the innermost scope by default. If a function is being executed, then a name will be created in that function's scope, unless the name has been listed on a global statement
  • When a name is referenced (i.e. not created), the name is searched for in the innermost scope first, and then going outwards towards the module and builtin scopes
A name is bound when :
  • It is the name of a module which is imported
  • It is the name of a class which is being defined
  • It is the name of a function which is being defined
  • It is the name of an argument to a function which is being executed.
  • It is a name which appears on the left hand side of an assignment
  • It is a name which appears as the loop variable in a for loop 
  • It is a name which appears as the target in an except clause.
Some examples would be helpful here :
  • A function defined in a module : If a name is defined in the function which matches a name defined in a module, then the version defined in the function will be used in the function (unless the global statement is used for that name) - that includes any arguments which are defined for that function.
  • If a function is defined in another function : The inner function can refer to any names defined in the outer function (or of course the module/builtin scope), but the outer function can't refer to names in the inner function. If the inner function rebinds a name used in the outer function, that doesn't change the binding made in the outer function :
     
    
    >>> def outer():
    ...    a = 1
    ...    def inner():
    ...        a = 2
    ...        print "Inner", a
    ...    print
    ...    print "Outer a", a
    ...    inner()
    ...    print "Outer b", a
    ...
    >>> outer()
    
    Outer a 1
    Inner 2
    Outer b 1
    
     
    
  • If a function is defined in a class (i.e. a method), then the method cannot access anything at the class scope, without using either the instance or class identity and qualifying the name - i.e using either self.name or self.__class__.name (or something similar) - this is the class scope twist that was mentioned above. The decision to use a qualified name (rather than just the name) is so that there is a single way that your code refers to or rebinds names in the outer scope (contrast that with the nested function example above, where in Python 2 the inner function has no way to rebind the name defined in the outer scope; Python 3 adds the nonlocal statement for this).
These scoping rules are fairly sensible (the principle of "least surprise" is common in Python - meaning code should always do what the developer expects it to do) and don't hold that many surprises for people already used to other programming languages. However, unlike in some other languages there are no compiler/interpreter warnings if you define a name which hides/masks one of the builtin names. In theory that means you could redefine one of the very critical functions - like open (although it is not recommended unless you really know what you are doing, as you can easily break things).
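The class scope twist is easier to see in a short sketch. The Building class and its capacity attribute are hypothetical, chosen only for illustration, and the sketch assumes there is no module-level name called capacity.

```python
class Building:
    capacity = 5                      # name bound in the class scope

    def bare_lookup(self):
        try:
            return capacity           # class scope is skipped during lookup
        except NameError:
            return "NameError"

    def qualified_lookup(self):
        return self.capacity          # reached through the instance

    def via_class(self):
        return type(self).capacity    # reached through the class


b = Building()
print(b.bare_lookup())       # NameError
print(b.qualified_lookup())  # 5
print(b.via_class())         # 5
```

The bare name fails because name lookup inside a method goes from the method's own scope straight out to the module and builtin scopes, without visiting the class body.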
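As a harmless illustration of masking a builtin name, the sketch below hides len inside a single function rather than touching anything critical like open. It assumes Python 3, where the builtin scope is exposed as the builtins module (in Python 2 the module is called __builtin__); shadowing_demo is a made-up name.

```python
import builtins


def shadowing_demo():
    len = lambda seq: 42        # local name hides the builtin inside this function
    return len([1, 2, 3])


print(shadowing_demo())         # 42
print(len([1, 2, 3]))           # 3 - the builtin is untouched outside the function
print(builtins.len([1, 2, 3]))  # 3 - the original is always reachable via builtins
```

No warning is issued at any point, which is exactly why accidental shadowing of names such as list, id or open can be hard to spot.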

Beware !!!

In Python 2.7 any name defined in a list comprehension is treated exactly as if the name had been defined in a for loop, or similar.
 
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
>>> l = [a for a in range(10)]
>>> a
9
 
This is one example of where the "least surprise" principle isn't actually maintained, especially when a similar generator expression does not do the same - in this case trying to access the generator loop variable does not succeed:
 
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
>>> l = [(a for a in range(10))]
>>> a 
NameError: name 'a' is not defined 
 
This surprise is due to how Python 2.7 implements the list comprehension in the first example, and this implementation issue is resolved in Python 3. It is definitely not recommended that you write any form of code that relies on this implementation detail in Python 2.7, as it is simply not clear what the expected value should be, and your code will break when translated from Python 2 to Python 3.
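For comparison, here is a small sketch of the Python 3 behaviour, written so that it can be run as a script rather than typed at the prompt. The variable names are made up for the illustration.

```python
# Python 3: the comprehension's loop variable stays inside the comprehension.
squares = [n * n for n in range(10)]
try:
    n
    leaked = True
except NameError:
    leaked = False
print(leaked)            # False

# A plain for loop still binds its loop variable in the enclosing scope.
for m in range(10):
    pass
print(m)                 # 9
```

So in Python 3 only the ordinary for loop leaves its loop variable behind; the list comprehension behaves like the generator expression.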