Speeding Up Python — Part 2: Optimization

The goal of this post and its predecessor is to provide some tools and tips for improving the performance of Python programs. In the previous post, we examined profiling tools — sophisticated stopwatches for timing programs as they execute. In this post, we will use these tools to demonstrate some general principles that make Python programs run faster.

Remember: If your program already runs fast enough, you do not need to worry about profiling and optimization. Faster is not always better, especially if you end up with code that is difficult to read, modify, or maintain.

Overview

We can summarize our principles for optimizing performance as follows:

- Debug first. Never attempt to optimize a program that does not work.
- Focus on bottlenecks. Find out what takes the most time, and work on that.
- Look for algorithmic improvements. A different theoretical approach might speed up your program by orders of magnitude.
- Use library functions. The routines in NumPy, SciPy, and PyPlot have already been optimized.
- Eliminate Python overhead. Iterating over an index takes more time than you think.

If fast code is important to you, you can start writing new code with these guidelines in mind and apply them to existing code to improve performance.

Debug first.

Your primary objective in programming is to produce a working program. Unless the program already does what it is supposed to do, there is no point in trying to speed it up. Profiling and optimization are not part of debugging. They come afterward.

Focus on bottlenecks.

The greatest gains in performance come from improving the most time-consuming portions of your program. The profiling tools described in Part 1 of this two-part post can help you identify bottlenecks — portions of a program that limit its overall performance. They can also help you identify faster alternatives. Once you have identified the bottlenecks in your program, apply the other principles of optimization to mitigate or eliminate them.

Look for algorithmic improvements.

The greatest gains in speed often come from using a different algorithm to solve your problem.

When comparing algorithms, one analyzes how the computation time scales with some measure of the size of the problem. Some number \(N\) usually characterizes the size of a problem — e.g., the number of elements in a vector or the number of items in a list. The time required for a program to run is expressed as a function of \(N\). If an algorithm scales as \(N^3\), it means that when \(N\) is large enough, doubling \(N\) will cause the algorithm to take roughly 8 times as long to run. If you can find an algorithm with better scaling, you can often improve the performance of your program by orders of magnitude.

As an example, consider the classic problem of sorting a jumbled list of numbers. You might consider using the following approach:

1. Create an empty list to store the sorted numbers.
2. Find the smallest item in the jumbled list.
3. Remove the smallest item from the jumbled list and place it at the end of the sorted list.
4. Repeat steps 2 and 3 until there are no more items in the jumbled list.

This method is called selection sort. It works, and it is fast enough for sorting small lists. You can prove that sorting a jumbled list of \(N\) elements using a selection sort will require on the order of \(N^2\) operations. That means that sorting a list with 1 million entries will take 1 million times longer than sorting a list of 1000 entries. That may take too long, no matter how much you optimize!
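For illustration, here is a direct translation of those steps into Python (a minimal sketch; the function name is mine):

def selection_sort(jumbled):
    """ Sort a list by repeatedly extracting its smallest item: O(N**2). """
    jumbled = list(jumbled)           # work on a copy; step 1: start an empty sorted list
    sorted_list = []
    while jumbled:
        smallest = min(jumbled)       # step 2: find the smallest item (O(N) work)
        jumbled.remove(smallest)      # step 3: remove it from the jumbled list (O(N) work)
        sorted_list.append(smallest)  #         ... and append it to the sorted list
    return sorted_list

Each pass does work proportional to the length of the list, and there are \(N\) passes, which is where the \(N^2\) scaling comes from.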

Delving into a computer science textbook, you might discover merge sort, a different algorithm that requires on the order of \(N \log N\) operations to sort a jumbled list of \(N\) elements. That means that sorting a list of 1 million entries will take roughly 2000 times longer than sorting a list of 1000 entries — not 1 million. For big lists, this algorithm is orders of magnitude faster, no matter what programming language you are using.
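Here is a compact merge sort for illustration (a sketch, not production code; in practice, Python's built-in sorted function already uses an \(N \log N\) algorithm, Timsort):

def merge_sort(items):
    """ Sort a list using merge sort: O(N log N). """
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])    # sort each half recursively ...
    right = merge_sort(items[mid:])
    merged = []                       # ... then merge the two sorted halves
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])           # one of these tails is empty;
    merged.extend(right[j:])          # the other holds the remaining sorted items
    return merged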

As another example, you might need to determine a vector \(\mathbf{x}\) in the linear algebra problem \(A \cdot \mathbf{x} = \mathbf{b}\). If \(A\) is an \(N \times N\) matrix, then inverting \(A\) and multiplying the inverse with \(\mathbf{b}\) takes on the order of \(N^3\) operations. You quickly reach the limits of what your computer can handle around \(N = 10,000\). However, if your matrix has some special structure, there may be an algorithm that takes advantage of that structure and solves the problem much faster. For example, if \(A\) is sparse (most of its elements are zero), you can reduce the scaling to \(N^2\) — or even \(N\) — instead of \(N^3\). That is a huge speedup if \(N = 10,000\)! These kinds of algorithmic improvements make an “impossible” calculation “trivial”.
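SciPy's sparse routines put this within easy reach. Here is a sketch that solves a tridiagonal system; the matrix is just one example of exploitable structure:

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

N = 10000
# Build a tridiagonal matrix: only about 3N of its N**2 entries are nonzero.
main = 2.0 * np.ones(N)
off = -1.0 * np.ones(N - 1)
A = sparse.diags([off, main, off], [-1, 0, 1], format='csc')
b = np.ones(N)

x = spsolve(A, b)   # exploits sparsity; the dense route, np.linalg.solve(A.toarray(), b),
                    # would do on the order of N**3 work and store all N**2 entries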

Unfortunately, there is no algorithm for discovering better algorithms. It is an active branch of computer science. You need to understand your problem, delve into texts and journal articles, talk to people who do research in the field, and think really hard. The payoff is worth the effort.

Use library functions.

Once you have a working program and you have identified the bottlenecks, you are ready to start optimizing your code. If you are already using the best available algorithms, the simplest way to improve the performance of Python code in most scientific applications is to replace homemade Python functions with library functions.

It’s not that you are a bad coder! It’s just that someone else took the time to rewrite a time-consuming operation in C or FORTRAN, compile and optimize it, then provide you with a simple Python function to access it. You are simply taking advantage of work that someone else has already done.

Let’s use the profiling tool %timeit to look at some examples of the speedup possible with library functions.

Building an array of random numbers

Suppose you need an array of random numbers for a simulation. Here is a perfectly acceptable function for generating the array:

import numpy as np

def Random(N):
    """ Return an array of N random numbers in the range [0,1). """
    result = np.zeros(N)
    for i in range(N):
        result[i] = np.random.random()
    return result

This code works and it is easy to understand, but suppose my program does not run fast enough. I need to increase the size of the simulation by a factor of 10, but it already takes a while. After debugging and profiling, I see that the line of my program that calls Random(N) takes 99 percent of the execution time. That is a bottleneck worth optimizing!

I can start by profiling this function:

$ %timeit Random(1000)
1000 loops, best of 3: 465 µs per loop

When I look at the documentation for np.random.random, I discover it is capable of doing more than generating a single random number. Maybe I should just use it to generate the entire array …

$ %timeit np.random.random(1000)
10000 loops, best of 3: 28.9 µs per loop

I can generate exactly the same array in just 6 percent of the time!

In this hypothetical example, my entire program will run roughly 14 times faster: the bottleneck itself speeds up 16-fold, and the 1 percent of run time spent elsewhere limits the overall gain. I can increase the size of my calculation by a factor of 10 and reduce the overall calculation time simply by replacing Random(N) with np.random.random(N).

Array operations

Let’s look at an even more dramatic example. In the previous post, I introduced a library to add, multiply, and take the square root of arrays. (The library, calculator.py, is included at the bottom of this post.) Suppose, once again, I have identified the bottleneck in a working program, and it involves the functions in the calculator.py module.

NumPy has equivalent operations with the same names. (When you write x + y for two arrays, it is shorthand for np.add(x,y).) Let’s see how much we can speed the operations up by switching to the NumPy equivalents. First, import the modules and create two random two-dimensional arrays to act on:

import calculator as calc
import numpy as np

x = np.random.random((100,100))
y = np.random.random((100,100))

Now, time the operations in calculator.py and NumPy:

$ %timeit calc.add(x,y)
100 loops, best of 3: 9.76 ms per loop

$ %timeit np.add(x,y)
100000 loops, best of 3: 18.8 µs per loop

Here we see an even more significant difference. The addition function written in pure Python takes 500 times longer to add the two arrays. Use %timeit to compare calc.multiply with np.multiply, and calc.sqrt with np.sqrt, and you will see similar results.

When to write your own functions

The implications of these examples are clear: NumPy array operations are much, much faster than the equivalent Python code. The same is true of special functions in the SciPy and PyPlot packages and many other Python libraries. To speed up your code, use functions from existing libraries when possible. This can save time in writing code, optimizing code, and running code.

So should you ever write your own functions?

I was once told, “Never write your own linear algebra routines. Somebody already wrote a faster one in the 1970s.” That may be generally true, but it is bad advice nonetheless. If you never write your own routine to invert a matrix, it is difficult to fully understand how these routines work and when they can fail, and you will certainly never discover a better algorithm.

If speed of execution is not important or if your goal is to understand how an algorithm works, you should write your own functions. If you need to speed up a working Python program, look to library functions.

Eliminate Python overhead.

Why are library functions so much faster than their Python equivalents? The answer is a point we discussed in Chapter 2 of A Student’s Guide to Python for Physical Modeling: In Python, everything is an object. When you type “x = 1”, Python does not just store the value 1 in a memory cell. It creates an object endowed with many attributes and methods, one of which is the value 1. Type dir(1) to see all of the attributes and methods of an integer.
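You can glimpse this overhead with the standard library (a small illustration; the exact numbers depend on your platform and Python version):

import sys
import numpy as np

print(sys.getsizeof(1))          # a Python int is a full object: typically ~28 bytes
print(np.arange(1000).itemsize)  # one element of an integer NumPy array: just the raw
                                 # bytes, typically 8 on a 64-bit build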

What’s more, Python has no way of knowing what type of objects are involved in a simple statement like z = x + y. First, it has to determine what kind of object x is and what kind of object y is (type-checking). Then it has to figure out how to interpret “+” for these two objects. If the operation makes sense, Python then has to create a new object to store the result of x + y. Finally, it has to assign this new object to the name z. This gives Python a lot of flexibility: x and y can be integers, floats, arrays, lists, strings, or just about anything else. This flexibility makes it easy to write programs, but it also adds to the total computation time.
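One way to watch this machinery in action is the standard dis module, which disassembles a statement into the bytecode steps the interpreter performs (a small illustration):

import dis

# The disassembly shows separate instructions to load x and y,
# apply "+", and store the result in the name z.
dis.dis("z = x + y")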

To speed up programs, eliminate this overhead. In other words, make Python do as little as possible.

Using library functions from NumPy, SciPy, and PyPlot eliminates overhead, and this is the main reason they run so much faster. In the example above, np.add(x,y) is not doing anything fundamentally different from calc.add(x,y); it simply does the addition and iteration in the background, without Python objects. Recall from the previous post that calc.add(x,y) spent almost 30 percent of its time iterating over the index j in the inner for loop.

Other ways to eliminate overhead are:

- Use in-place operations. Operations like +=, -=, *=, and /= operate on an existing object instead of creating a new one (see the sketch after this list).
- Use built-in methods. These methods are often optimized.
- Use list comprehensions and generators instead of for loops. Initializing a list and accessing its elements take time.
- Vectorize your code. (Section 2.5.1 of A Student’s Guide to Python for Physical Modeling)
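For example, with NumPy arrays, an in-place addition reuses the memory of an existing array instead of allocating a new one (a small sketch):

import numpy as np

x = np.random.random((1000, 1000))
y = np.random.random((1000, 1000))

x = x + y   # allocates a brand-new array, then rebinds the name x to it
x += y      # in place: writes the result into x's existing memory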

Use %timeit to compare the performance of the functions below. They use the principles above to eliminate some of the Python overhead in square_list0.

def square_list0(N):
    """ Return a list of squares from 0 to N-1. """
    squares = []
    for n in range(N):
        squares = squares + [n**2]
    return squares

def square_list1(N):
    """ Return a list of squares from 0 to N-1. """
    squares = []
    for n in range(N):
        # In-place operations: Replace "x = x + ..." with "x += ..."
        squares += [n**2]
    return squares

def square_list2(N):
    """ Return a list of squares from 0 to N-1. """
    squares = []
    for n in range(N):
        # Built-in methods: Replace "x = x + ..." with "x.append(...)"
        squares.append(n**2)
    return squares

def square_list3(N):
    """ Return a list of squares from 0 to N-1. """
    # Use a list comprehension instead of a for loop.
    return [n**2 for n in range(N)]

def square_array(N):
    """ Return an array of squares from 0 to N-1. """
    # Vectorize the entire operation.
    from numpy import arange
    return arange(N)**2
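For example, you might compare the slowest and fastest versions like this (outputs omitted here, since the timings depend on your machine):

$ %timeit square_list0(1000)
$ %timeit square_array(1000)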

In my tests, square_list3(1000) ran about 18 times faster than square_list0(1000), and square_array(1000) was about 350 times faster than square_list0(1000). The last function virtually eliminates Python overhead by using NumPy arrays in vectorized code.

More Options

If performance is still not satisfactory after attempting the optimizations described here, you can try compiling your Python code. Compiling is beyond the scope of this post. You can find out more about Numba (which is included in the Anaconda distribution) or Cython by following these links. Numba allows you to compile pure Python code. Cython allows you to write fast C extensions for Python without learning C.
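As a rough sketch of what Numba usage looks like (the function here is a made-up example; consult the Numba documentation for details):

from numba import jit

@jit(nopython=True)
def sum_of_squares(N):
    # Numba compiles this loop to machine code the first time it is called,
    # eliminating the per-iteration Python overhead.
    total = 0.0
    for n in range(N):
        total += n * n
    return total

print(sum_of_squares(1000000))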

For users of the Anaconda distribution of Python, there is an optional add-on called Accelerate. This add-on will replace the standard NumPy, SciPy, and other scientific libraries with equivalent libraries that use Intel’s MKL routines for linear algebra. On many machines, this will improve performance without any effort on your part beyond installing the package. Accelerate also includes NumbaPro, a proprietary version of the Numba package. Accelerate is free to academic users.

Summary

To summarize, there are a few simple ways to speed up a Python program. Once you have a working program and you have identified its bottlenecks, you can look for library functions to replace the slowest functions in your program, you can rewrite your code to eliminate Python’s overhead, and you can search for faster algorithms to solve your problem. As you develop your programming skills, you will start to incorporate these principles automatically. Happily, this means less time profiling and optimizing!






Code Samples

The calculator.py Module

# -----------------------------------------------------------------------------
# calculator.py
# -----------------------------------------------------------------------------
"""
This module uses NumPy arrays for storage, but executes array math using Python
loops.
"""
import numpy as np

def add(x,y):
    """ Add two arrays using a Python loop.
    x and y must be two-dimensional arrays of the same shape. """
    m,n = x.shape
    z = np.zeros((m,n))
    for i in range(m):
        for j in range(n):
            z[i,j] = x[i,j] + y[i,j]
    return z

def multiply(x,y):
    """ Multiply two arrays using a Python loop.
    x and y must be two-dimensional arrays of the same shape. """
    m,n = x.shape
    z = np.zeros((m,n))
    for i in range(m):
        for j in range(n):
            z[i,j] = x[i,j] * y[i,j]
    return z

def sqrt(x):
    """ Take the square root of the elements of an array using a Python loop. """
    from math import sqrt
    m,n = x.shape
    z = np.zeros((m,n))
    for i in range(m):
        for j in range(n):
            z[i,j] = sqrt(x[i,j])
    return z

def hypotenuse(x,y):
    """ Return sqrt(x**2 + y**2) for two arrays.
    x and y must be two-dimensional arrays of the same shape. """
    xx = multiply(x,x)
    yy = multiply(y,y)
    zz = add(xx, yy)
    return sqrt(zz)