Appendix Y: Generators


Motivation for Generators

We often write functions that return large lists, with the intention to iterate over the items once.

def permute(word):
    if len(word) == 1:
        return [word]
    else:
        result = []
        for i in range(len(word)):
            for perm in permute(word[:i] + word[i + 1 :]):
                result.append(word[i] + perm)
        return result
permute("abc")
['abc', 'acb', 'bac', 'bca', 'cab', 'cba']

This returns n! items for a string of length n.

We often write functions that return lists of substantial size, for example the functions you write to load a list of data.

Yet we often only use a single item at a time:

for item in permute("cat"):
    print(item)
cat
cta
act
atc
tca
tac

Consider that this code is the same as:

items = permute("cat")
for item in items:
    print(item)

How can we avoid this?

One option would be to print from within the permute function itself, which would avoid the need for a list.

But this would reduce our ability to reuse the function: it could then only ever print its results. It seems to create a tension between well-structured code and efficiency.
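To see the problem concretely, here is a sketch of that print-inside approach (a hypothetical `print_permute`, not from the text above). It avoids building the list, but hard-codes printing as the only thing we can do with each permutation:

```python
def print_permute(word, prefix=""):
    # prints each permutation directly instead of collecting a list
    if len(word) == 1:
        print(prefix + word)
    else:
        for i in range(len(word)):
            print_permute(word[:i] + word[i + 1:], prefix + word[i])

print_permute("abc")
# prints abc, acb, bac, bca, cab, cba (one per line)
```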

In practice, we will turn to generators.

Generator Functions

Generator functions are a special type of function. They can be identified by their use of the yield keyword.

def simple_generator_func():
    print("start")
    yield 1
    print("still running")
    yield 2
    print("one more")
    yield 3

When you call a generator function, it seems like nothing happens:

# nothing prints?
g = simple_generator_func()

The function returns an object of type generator.

print(type(g))
<class 'generator'>

A generator is a special type of object that is iterable, but is not a container type.

for item in g:
    print("printing from loop", item)
start
printing from loop 1
still running
printing from loop 2
one more
printing from loop 3

Note that the output alternates between output from within the function and outside.
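Because a generator is not a container, it can only be consumed once. A quick sketch with a simplified version of the function above (prints removed):

```python
def simple_generator_func():
    yield 1
    yield 2
    yield 3

g = simple_generator_func()
print(list(g))  # consumes the generator: [1, 2, 3]
print(list(g))  # already exhausted: []
```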

When the function encounters a yield, it pauses. The calling code receives the value as if it had been returned.

The difference is that on the next iteration, the function resumes where it left off.

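We can watch this pause-and-resume behavior directly by stepping the generator by hand with `next`:

```python
def simple_generator_func():
    print("start")
    yield 1
    print("still running")
    yield 2
    print("one more")
    yield 3

g = simple_generator_func()
value = next(g)   # runs until the first yield, printing "start"
print("got", value)   # got 1
value = next(g)   # resumes after the first yield, printing "still running"
print("got", value)   # got 2
```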

Let’s rewrite permute as a generator function:

def ipermute(word):
    if len(word) == 1:
        yield word
    else:
        for i in range(len(word)):
            for perm in ipermute(word[:i] + word[i + 1 :]):
                yield word[i] + perm

This works the same way, but avoids building the n!-item list:

# iterates lazily, never materializing all n! permutations
for perm in ipermute("abcd"):
    print(perm)
abcd
abdc
acbd
acdb
adbc
adcb
bacd
badc
bcad
bcda
bdac
bdca
cabd
cadb
cbad
cbda
cdab
cdba
dabc
dacb
dbac
dbca
dcab
dcba

Iterables

How does this work?

We’ve seen that Python syntax is largely hiding calls to dunder methods.

A for loop is “syntactic sugar” for calling two dunder methods: __iter__ and __next__.

ll = [1, 2, 3, 4]
for item in ll:
    print(item)
1
2
3
4
ll = [1, 2, 3, 4]
iterator = ll.__iter__()
while True:
    try:
        item = iterator.__next__()
    except StopIteration:
        break
    print(item)
1
2
3
4
  • __iter__ returns an iterator object, an intermediary object that tracks the current position in the iterable. We usually don’t see this variable.
  • __next__ returns the next item in the iterable, and raises StopIteration when there are no more items.

If you implement a type that has these methods, it can be used in for loops and other iteration contexts.
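A minimal sketch of such a type (a hypothetical `Countdown` class, acting as its own iterator):

```python
class Countdown:
    """Iterable that counts down from n to 1."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # return the iterator object; here we serve as our own
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        value = self.n
        self.n -= 1
        return value

for x in Countdown(3):
    print(x)
# prints 3, 2, 1
```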

Generators in Practice

You’ve been using something very generator-like without realizing it.

What does range return?

# does this create a list with 1 trillion elements?
r = range(1_000_000_000_000)
print(f"object {r} is {type(r)=}")
object range(0, 1000000000000) is type(r)=<class 'range'>
# get the iterator object (not needed for generators, which are their own iterators)
ri = iter(r)
# the function `next` gets a single item from any iterable
print(next(ri))
# it is the same as calling __next__
print(ri.__next__())
# it yields one value at a time
print(next(ri))
0
1
2

With generators, we could write our own range function:

# simplified form with one parameter
def our_range(n):
    i = 0
    while i < n:
        yield i
        i += 1
for x in our_range(10):
    print(x)
0
1
2
3
4
5
6
7
8
9
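The simplified version above handles only a stop value. A sketch extending it to mimic range's start/stop/step parameters (step of zero is not handled):

```python
def our_range(start, stop=None, step=1):
    # mimic range's one-argument form: our_range(n) counts from 0 to n
    if stop is None:
        start, stop = 0, start
    i = start
    while (step > 0 and i < stop) or (step < 0 and i > stop):
        yield i
        i += step

print(list(our_range(5)))         # [0, 1, 2, 3, 4]
print(list(our_range(2, 10, 3)))  # [2, 5, 8]
print(list(our_range(3, 0, -1)))  # [3, 2, 1]
```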

If you’ve used map and filter: as of Python 3, these also return lazy iterators rather than lists.
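We can verify this laziness directly: map does no work until a value is requested. (The `double` helper here is just for illustration.)

```python
def double(x):
    print("doubling", x)
    return x * 2

m = map(double, [1, 2, 3])
print(type(m))    # a map object; nothing has been computed yet
print(next(m))    # prints "doubling 1", then 2
print(next(m))    # prints "doubling 2", then 4
```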

Python 2 vs 3 Difference

Though you are increasingly unlikely to encounter Python 2, there is a fair bit of legacy Python 2 code out there.

In all Python 2.x versions range does in fact return a list! Therefore, the code above would crash, since you would not have the RAM to store a list of 1 trillion elements.

Generators were introduced in Python 2.2, but for backwards-compatibility reasons range could not be changed until Python 3.

In Python 2 it is almost always correct to use xrange, the lazy version. This became range in Python 3.

This demonstrates two of the primary benefits of generators:

  1. Generators use significantly less memory: they only need the memory for a single item plus a bit of overhead, rather than for the entire list.
  2. Generators enable lazy evaluation: work is not done until/unless it is needed.
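We can make the memory claim concrete with sys.getsizeof, which reports an object's own size in bytes (exact numbers vary by Python version and platform):

```python
import sys

squares_list = [x ** 2 for x in range(1_000_000)]
squares_gen = (x ** 2 for x in range(1_000_000))

print(sys.getsizeof(squares_list))  # millions of bytes
print(sys.getsizeof(squares_gen))   # a few hundred bytes at most
```

Note that the generator's size does not grow with the number of items it will produce; it only stores its current state.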

Infinite Generators

To demonstrate this idea of lazy evaluation we can look at infinite generators as an extreme case:

# Generators do not have to ever exit.
# Here is an infinite generator:
def powers_of_two():
    n = 2
    while True:
        yield n
        n *= 2

This is an infinite loop!

But since it will pause each time yield is reached, we can use this as long as the calling code breaks the loop:

for x in powers_of_two():
    print(x)
    if x > 1000:
        break
2
4
8
16
32
64
128
256
512
1024

Generator Expressions

In Python we can make a list comprehension:

# list of squares
[x ** 2 for x in range(10)]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

A dict comprehension:

# maps values to their squares
{x: x ** 2 for x in range(10)}
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}

A set comprehension:

# a set of cubes
{x ** 3 for x in range(10)}
{0, 1, 8, 27, 64, 125, 216, 343, 512, 729}

But if you have ever tried to write a tuple comprehension, you did not get what you might have expected:

(x ** 2 for x in range(10))
<generator object <genexpr> at 0x1173c3bc0>

This is in fact a generator expression:

# this is a generator, it will be lazy-evaluated
g = (x ** 2 for x in range(10000000000))
# we can iterate over it and get a single item at a time
for item in g:
    print(item)
    if item > 1000:
        break
0
1
4
9
16
25
36
49
64
81
100
121
144
169
196
225
256
289
324
361
400
441
484
529
576
625
676
729
784
841
900
961
1024
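A common idiom: when a generator expression is the sole argument to a function, the extra parentheses can be dropped, so aggregations never build an intermediate list:

```python
# sum consumes the generator one value at a time; no list is built
total = sum(x ** 2 for x in range(10))
print(total)  # 285

# the same works for max, min, any, all, etc.
print(any(x > 8 for x in range(10)))  # True
```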

itertools

The itertools module contains many useful functions for working with iterators, all of which return lazy iterators rather than lists.

Useful functions include:

  • itertools.permutations - permutations of an iterable
  • itertools.combinations - combinations without replacement
  • itertools.product - Cartesian product of multiple iterables (like nested for loops)
  • itertools.chain - concatenate iterators
  • itertools.islice - slice an iterator the way you would a list
  • itertools.groupby - group items by a key
  • itertools.tee - create multiple iterators from one
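A brief sketch combining a few of these, reusing the infinite generator from earlier:

```python
import itertools

# permutations: the same results as our ipermute, but as tuples of characters
for p in itertools.permutations("abc"):
    print("".join(p))
# prints abc, acb, bac, bca, cab, cba (one per line)

# islice: take the first 5 items of any iterator, even an infinite one
def powers_of_two():
    n = 2
    while True:
        yield n
        n *= 2

print(list(itertools.islice(powers_of_two(), 5)))  # [2, 4, 8, 16, 32]

# chain: concatenate iterables lazily
print(list(itertools.chain([1, 2], [3, 4])))  # [1, 2, 3, 4]
```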