Monday 9 November 2015

Python Gotcha #3 : Default Arguments, be careful

Python Gotchas - Default arguments

or how not to get caught out when you start using python

At first glance python seems very familiar, especially if you have used other procedural languages such as C or Java - but actually Python is different - in some cases very different, and those differences can trip you up as you progress along your python journey. In this series of occasional posts, I am going to cover some of those gotchas.

Default Arguments

Once you have spent any time with python, you will be very aware that the language supports the ability to define default values for function arguments.
>>> #---------------- Example 1
>>> def add(number, increment=1):
...    return number + increment
>>>
>>> add(10)
11
>>> add(10, 2)
12
 
This can be an incredibly powerful feature, which can dramatically improve the readability of your code by ensuring that you only need to specify an argument when you are doing something which is not usual or common. The example above is not a very good example of using default argument values.

For instance look at the str.find method (for finding one string in another). The method has both start and end arguments (if you need to search only part of the string), but both of those arguments have sensible default values, so that in most cases you encounter, when you want to search the entire string, all you need to do is provide a single argument :

>>> #---------------- Example 2
>>> stra = "This is a dead parrot, deceased, no-longer living."
>>> stra.find("e")   # Find the first 'e' in the string.
11
>>> stra.find("e", 12) # Find the first 'e' starting from index 12
24
>>> stra.find("e", 12,23) # Find the first "e" from index 12 up to (not including) index 23
-1


Say for instance that you write a function which, amongst other things, creates and populates a file, and then sets the permissions on that file. Most of the time when your application calls the function you want the file permissions set to "rw" (read/write), but on some rare occasions you want your application to set the permission to "r" (read only). It would be a good idea to make the permissions argument have a default value of "rw" :

#---------------- Example 3
def write_file(data, file_name, permission="rw"):
    ...
    # Code to write file and set permission goes here
    return  0 # Success
 
write_file(data1, "data1.txt") # written with rw permission
...
write_file(data2, "data2.txt")
...
write_file(LicenseInfo, "LicenseReadme.txt", "r") # Don't want the license file to be changed


You can see how having a sensible default value makes the code readable, but not cluttered.
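As a rough sketch of how example 3 might be filled in : the MODES mapping from "rw"/"r" to permission bits, the use of os.chmod, and the demo helper are all my assumptions for illustration, not part of the original function.

```python
import os
import stat
import tempfile

# Assumed mapping from the "rw" / "r" strings to owner permission bits.
MODES = {"rw": stat.S_IRUSR | stat.S_IWUSR, "r": stat.S_IRUSR}


def write_file(data, file_name, permission="rw"):
    """Write data to file_name, then set its permissions."""
    with open(file_name, "w") as handle:
        handle.write(data)
    os.chmod(file_name, MODES[permission])
    return 0  # Success


def demo():
    """Write two files in a temporary directory and return their mode bits."""
    with tempfile.TemporaryDirectory() as folder:
        normal = os.path.join(folder, "data1.txt")
        licence = os.path.join(folder, "LicenseReadme.txt")
        write_file("some data", normal)            # default "rw"
        write_file("licence text", licence, "r")   # read only
        return (stat.S_IMODE(os.stat(normal).st_mode),
                stat.S_IMODE(os.stat(licence).st_mode))


if __name__ == "__main__":
    print([oct(mode) for mode in demo()])
```

Only the rare read-only call has to mention the permission at all; every other call relies on the default.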

And now to the gotcha : There is a danger lurking here - which I will demonstrate in my next set of examples :

Imagine you are writing a system to record students as they enrol for courses at a college, and the first thing you need to do is to record the Student's name on a list - so you write a function as below :

#---------------- Example 4
>>> def RegisterStudent(student, existing_students=[]):
...    existing_students.append(student)
...    return existing_students
>>>
>>> class1 = RegisterStudent("John")
>>> class1
['John']
>>> class1 = RegisterStudent("Mark", class1)
>>> class1
['John', 'Mark']
>>> # ---------------------------- All Good so far


Your intention is that, by having the default argument as an empty list, you can create multiple lists, one for each course, and you can signify the creation of a new list by omitting the existing_students argument (since when it is a new list there are no existing students).

Now - let's try to create another student list for a second course, using the RegisterStudent function above :

#---------------- Example 5
>>> class2 = RegisterStudent("Lucas") # This should work - shouldn't it ?
>>> # -------- Let's check
>>> class2   # We have the names from class1 in our list too
['John', 'Mark', 'Lucas']
>>> # -------- and even worse
>>> class1
['John', 'Mark', 'Lucas']
>>> class1 is class2   # They are the same list (the same object)
True


So clearly this does not work - but why not ? How did our two default lists end up being the same object ?

Looking at the definition of RegisterStudent in example 4, it would be reasonable to expect that if the 2nd argument (existing_students) isn't provided then a new empty list would be created. However, that is not what happens, and the reality is a bit complicated - so stay with me :
  1. Definition : Default values are evaluated once, when the def statement itself is executed (not each time the function is called). At that point the interpreter evaluates existing_students=[], creates a new empty list object, and stores it on the RegisterStudent function object as the default for the existing_students argument.
  2. Execution : When the function is then called, the interpreter checks if the existing_students argument has been provided, and if not, then the object which was created during step 1 is passed into the function body as the existing_students argument.
  3. In the body of the function - when the code changes the existing_students list (by appending to it - and append is a change in place - i.e. no new object is created), this is a change to the object created in step 1.
  4. The next time that the function is called with a missing existing_students argument, the interpreter does everything in step 2 again, and the body of the function will be passed the already changed list.
The summary of this is that if you use a mutable value (list, dictionary, set etc) as the default argument in one of your functions, and then change that value within your function, then you will actually be changing the default which will be used when you call the function again : and often this is not what you want.
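One way to watch this happening is to look at the function's __defaults__ attribute, which is where Python keeps the default objects. This is only a small sketch reusing the function from example 4; the variable names before and stored_default are mine.

```python
def RegisterStudent(student, existing_students=[]):
    existing_students.append(student)
    return existing_students


# Snapshot of the stored default before any call is made.
before = list(RegisterStudent.__defaults__[0])

class1 = RegisterStudent("John")
class2 = RegisterStudent("Lucas")

# The very same list object is still stored on the function.
stored_default = RegisterStudent.__defaults__[0]

print(before)                    # []
print(stored_default)            # ['John', 'Lucas']
print(class1 is stored_default)  # True
print(class2 is stored_default)  # True
```

Both calls returned the one list that lives on the function object, which is why class1 and class2 cannot be told apart.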

A better version of our RegisterStudent function :

#---------------- Example 6
>>> def RegisterStudent(student, existing_students=None): 
...    if existing_students is None:
...        existing_students = [] 
...    existing_students.append(student)
...    return existing_students
>>>
>>> class1 = RegisterStudent("John")
>>> class1
['John']
>>> class1 = RegisterStudent("Mark", class1)
>>> class1
['John', 'Mark']
>>> # ---------------------------- All Good so far
>>> class2 = RegisterStudent("Lucas")
>>> class2
['Lucas']
>>> class1
['John', 'Mark']
>>> class1 is class2
False


By using None (instead of []) as the default, and adding a simple if statement, we avoid the issue with mutable default arguments and ensure that whenever the existing_students argument is omitted in a function call we get a brand new list to start adding to.

There are other ways of avoiding the mutable default argument issue - but the above method of using None is the method recommended even in the official documentation.

Note : The if statement can be made even simpler by using a conditional expression :

#---------------- Example 7
>>> def RegisterStudent(student, existing_students=None):
...    existing_students = existing_students if existing_students else []
...    existing_students.append(student)
...    return existing_students


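This conditional expression works due to the rules that Python uses to determine the truth value of a value which isn't strictly True/False : both None and an empty list ([]) are treated as False, while a non-empty list is treated as True. Be aware of one difference from example 6 : if the caller passes in their own empty list, that is also treated as False and is replaced by a new list, so the caller's list is not the one that gets updated.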
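The two styles are not quite interchangeable, and a small sketch shows where they differ. The function names register_truthy and register_is_none are made up for this comparison; the bodies are those of examples 7 and 6.

```python
def register_truthy(student, existing_students=None):
    # Example 7 style: any falsy value (None or an empty list) is replaced.
    existing_students = existing_students if existing_students else []
    existing_students.append(student)
    return existing_students


def register_is_none(student, existing_students=None):
    # Example 6 style: only a missing argument (None) is replaced.
    if existing_students is None:
        existing_students = []
    existing_students.append(student)
    return existing_students


roster_a = []
result_a = register_truthy("John", roster_a)
print(roster_a, result_a)   # [] ['John'] - the caller's empty list was not used

roster_b = []
result_b = register_is_none("John", roster_b)
print(roster_b, result_b)   # ['John'] ['John'] - the caller's list was updated
```

If callers always use the returned list the difference never shows, but if they rely on their own list being changed in place, the explicit "is None" test is the safer choice.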

Note 2: If you are writing code that is sharing data between a number of different functions, then you would probably be better off investigating writing a class, which can hold the data, rather than pass the data around as arguments.
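As a minimal sketch of that idea (the Course class, its attributes and the course names are hypothetical, invented just for this illustration), each course can own its student list :

```python
class Course:
    """Hypothetical class holding the student list for one course."""

    def __init__(self, name):
        self.name = name
        self.students = []  # created per instance, so never shared

    def register(self, student):
        self.students.append(student)
        return self.students


class1 = Course("Maths")
class1.register("John")
class1.register("Mark")

class2 = Course("Physics")
class2.register("Lucas")

print(class1.students)  # ['John', 'Mark']
print(class2.students)  # ['Lucas']
```

Because the list is created inside __init__, it is built afresh for every Course, and there is no default argument to get wrong.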

Functions as default arguments

It is also important to remember that this doesn't just happen when a list or dictionary is used as a default argument. You will also get potentially unexpected results if you try to use a function call as a default argument. For instance it might seem logical to write code like the example below :

#---------------- Example 8
>>> from datetime import datetime
>>> def log_message(msg, ts=datetime.now()):
...     """Create a log message in a known format, adding the time stamp to the message (defaulting to now)"""
...     return "Log : {} {}".format(ts, msg)


But as you might now have worked out, the ts=datetime.now() is evaluated only once (when the def statement is executed, which normally means when the module is first imported), and if we were to use the log_message function, it would create messages with the date/time of the import every time it is called with the ts argument omitted, which is clearly not the expected functionality.
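A quick way to convince yourself is to call the function from example 8 twice with a pause in between. This is just a sketch : the 0.05 second sleep and the variable names are arbitrary choices of mine.

```python
import time
from datetime import datetime


def log_message(msg, ts=datetime.now()):
    """Example 8: the default timestamp is fixed when def is executed."""
    return "Log : {} {}".format(ts, msg)


first = log_message("started")
time.sleep(0.05)
second = log_message("started")

# Both messages carry the identical, frozen timestamp.
print(first == second)  # True

# The frozen value is stored on the function object itself.
frozen = log_message.__defaults__[0]
print(frozen <= datetime.now())  # True
```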

Thankfully we can use the same mechanism as in Examples 6 or 7 above (using a default argument of None) in order to get the expected functionality :

#---------------- Example 9
>>> from datetime import datetime
>>> def log_message(msg, ts=None):
...    """Create a log message in a known format, adding the time stamp to the message (defaulting to now)"""
...    ts = ts if ts else datetime.now()
...    return "Log : {} {}".format(ts, msg)

And finally :

There is a case when using a mutable type (dictionary, list etc) as a default argument can be very useful.

Imagine writing a function that will generate the nth Fibonacci number :

>>> #---------------- Example 10
>>> def fib(n):
...     if n == 0:
...         return 0
...     if n == 1:
...         return 1
...     return fib(n-1) + fib(n-2)


This function certainly works, but it does have an issue : every call recalculates the lower values from scratch, so if you calculate a lot of different values, you will recalculate many of the lower values multiple times. Since the Fibonacci series doesn't change - is there some way we can store the values we have already calculated, to make our code run faster ?

>>> #---------------- Example 11
>>> def fib2(n, cache={0: 0, 1: 1}):
...     if n in cache:
...         return cache[n]
...     cache[n] = fib2(n-1) + fib2(n-2)
...     return cache[n]


In this new function we have a cache dictionary as a default argument. The cache parameter is never passed when we call the fib2 function, but as you can see, as the function calculates new values it adds them to the cache, and we know that the cache object is shared between multiple calls. The big time difference in the cached version arises because looking up one value in a dictionary is a lot faster than making the many repeated function calls needed to recalculate it.
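You can see the shared cache filling up by looking at the function's __defaults__ attribute after a call. A small sketch, reusing fib2 from example 11 (the name shared_cache is mine) :

```python
def fib2(n, cache={0: 0, 1: 1}):
    if n in cache:
        return cache[n]
    cache[n] = fib2(n - 1) + fib2(n - 2)
    return cache[n]


print(fib2(10))  # 55

# The dictionary stored on the function now holds every value up to n = 10.
shared_cache = fib2.__defaults__[0]
print(sorted(shared_cache))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(shared_cache[10])      # 55
```

Any later call for n up to 10 is then answered straight from this dictionary.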

This technique is memoization, and from the timings below, the speed improvement is considerable :

$ python -m timeit -n 1 -r 10 -s "import fib" "[fib.fib(n) for n in range(30)]"
10 loops, best of 10: 407 msec per loop
$ python -m timeit -n 1 -r 10 -s "import fib" "[fib.fib2(n) for n in range(30)]"
10 loops, best of 10: 5.01 usec per loop

Yes that is really 407 milliseconds (i.e. 0.4 seconds) compared to 5 microseconds (i.e. 0.000005 seconds) - and the code is only building a list of the first 30 numbers - imagine the savings from building bigger lists.
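Counting function calls, rather than timing them, gives a feel for where the difference comes from. The sketch below adds a call counter (the calls dictionary is my addition) to the functions from examples 10 and 11, and builds a list of the first 20 numbers with each :

```python
calls = {"fib": 0, "fib2": 0}


def fib(n):
    calls["fib"] += 1
    if n == 0:
        return 0
    if n == 1:
        return 1
    return fib(n - 1) + fib(n - 2)


def fib2(n, cache={0: 0, 1: 1}):
    calls["fib2"] += 1
    if n in cache:
        return cache[n]
    cache[n] = fib2(n - 1) + fib2(n - 2)
    return cache[n]


plain = [fib(n) for n in range(20)]
memoized = [fib2(n) for n in range(20)]

print(plain == memoized)  # True - both give the same numbers
print(calls["fib"])       # 35400 calls without the cache
print(calls["fib2"])      # 56 calls with the cache
```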

A graph of the timings (building a list of Fibonacci numbers from 0 to n) is illuminating : the time taken by the non-memoized version (example 10) increases exponentially as n increases, whereas the timing for the memoized function increases but at a far lower rate.

Beware though - the timings above favour the memoized version because we are building a list of values : as we attempt to calculate the higher values in the list, the lower values have already been calculated and stored. If we called the memoized version just once, for a single value and with an empty cache, it would still have to calculate every lower value (although only once each), so the gain is smaller, and for small values of n you may well find that the timings are similar to, or even slower than, the non-memoized version.
