Introduction
This is the second post in the my Interview Question Deep Dive series, the first one of which was about generating balanced parentheses. In this post, we will start by looking at a simple question about flattening an iterator of iterators, and consider a slightly harder variation of it, and finally see how our solution to the harder variation is analogous to an important theorem in a beautiful area of mathematics, namely set theory, which Hilbert referred to as Cantor's paradise, hence the title of the article.
Understanding the problem
Here's the coding question I was asked that inspired this post:
Write a program that flattens an iterator that iterates over iterators.
Reading that, the question might come across as a bit confusing, both in terms of what an "iterator" is and what "flattening" means. Let's clarify both of these, specifically in the context of Python as the language we are using. If you are already familiar with iterators, you can skip the rest of this section and jump to the next section for the first solution.
Abstractly speaking, an iterator is an object that allows visiting a set of objects one at a time and in sequence. An iterator should retain enough state to go from one element to the next and should be able to determine when we are at the end of the set. It's important to mention early on that iterators do not have to be finite, and do not have to iterate over objects that are in fully present in memory (e.g., file iterators, or an iterator that returns the next Tweet corresponding to a hashtag, and so on). An iterator can keep returning new elements each time it is called without ever reaching the end of the set.
Let's consider the above definition more concretely by looking at how iterators work in Python, which is the language we will use for our solutions. In Python, we define an iterator as any object that has a __next__ function, and that behaves as follows. Calling __next__ should return the next object of the set the iterator is iterating over. If no next element exists, a StopIteration exception is thrown. It is also assumed that once a StopIteration exception is thrown, the iterator has finished iterating and any future calls to __next__ will also throw a StopIteration exception.
Objects for which an associated iterator class exists are often called "iterables". In Python, iterables are defined as objects that have an __iter__ member function. Calling that member function should return a new instance of an iterator for that object. Note that you usually do not call __iter__ and __next__ directly, but rather use the iter() and next() functions to call them. For example, let's look at how we can iterate over the same list using two different iterators:
In [1]: a = [1, 2, 3]
In [2]: it1 = iter(a)
In [3]: it2 = iter(a)
In [4]: next(it1)
Out[4]: 1
In [5]: next(it1)
Out[5]: 2
In [6]: next(it2)
Out[6]: 1
In [7]: next(it1)
Out[7]: 3
In [8]: next(it1)
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
Note how the two iterators are iterating over the same list independently.
To loop over an iterator, we can call the next function until a StopIteration exception is thrown. For example, the following will print every element given an iterator:
def print_all(it):
while True:
try:
print(next(it))
except StopIteration:
return
Of course, if you're familiar with iterators in Python you're probably thinking that's silly. And you're right. It is silly because Python's for loops are a shortcut for the above! Given any iterator you can loop over all their elements using a for loop:
def print_all(it):
for x in it:
print(x)
To demonstrate how iterators can be infinite, let's consider the simple example of Python's itertools.count iterator:
In [1]: from itertools import count
In [2]: it = count()
In [3]: next(it)
Out[3]: 0
In [4]: next(it)
Out[4]: 1
In [5]: next(it)
Out[5]: 2
As you can see, an instance of itertool.count simply counts up, starting from 0 and onward, and will never stop (theoretically eventually it would run out of memory needed to keep the state which is the current number, but that would take a very long time given that Python will switch over to using a big integer once a regular integer overflows).
Next let's consider what "flatten" means in the context of this question. To understand it, consider a list of lists: [[1, 2], [3, 4], [], [5]]. To flatten this list means to generate the following: [1, 2, 3, 4, 5], which are all the elements in the lists the top-level list contains. Note that the interview question does not ask for recursive flattening (at least in its first iteration), but rather just one layer of flattening. An example of recursive flattening would be turning [[1, 2], [[3, 4], [5]]] to [1, 2, 3, 4, 5].
A basic solution
The question, once understood, might come across as deceivingly simple: we can simply loop through the first iterator, then the second, and so on, until we iterate through all the inner elements. Namely, we just use two for loops:
def flatten_simple(top_it):
for it in top_it:
for x in it:
yield x
Very simple, but what could go wrong here? I encourage to stop here for a minute and think about it before reading further.
Dealing with infinite iterators
Here's what could go wrong. Our code works perfectly well if all the inner iterators are finite but if any of the iterators are infinite, we will get stuck in them, and never visit any of the iterators coming after them. To notice this, consider [itertools.count(), [-1]] as the input to the above flatten_simple function. The output will simply be the output of itertools.count() and -1 will never be reached.
Noticing this, we could simply say that that's expected behaviour, and warn users about that behaviour in our documentation. However, let's assume we add the following requirement to the problem:
Write a program that flattens an iterator that iterates over iterators. If an object x is visited by any of the iterators in a finite amount of time, your code should also visit x in a finite amount of time.
Reading that, it's not immediately obvious that such a thing is possible. We will look at this in detail in a bit, but first, let's come up with a way of visualizing the problem which will help us come up with a solution.
Visualizing the problem
We can visualize the input as a potentially infinite graph that is not fully "discovered" (i.e., the full shape of the graph is not known a priori, or even in memory). The graph is based on the basic idea that we can treat the next function as an directed edge pointing to the next item. An example can make this clear. For the input [[1, 2], [3], [4, 5, 6]] we can draw the following graph:
The graph has three types of nodes here, the yellow smaller ones correspond to starting points of iterators, the very top left corner one being the top-level input iterator. The orange ones are the individual elements generated by the inner iterators, and I use blue squares to indicate reaching the end of each iterator. In this way of visualizing things, you can think of the top-level iterator moving down vertically and generating the yellow nodes while each inner iterator moves horizontally, generating the orange nodes.
Now let's try visualizing an example with infinite iterators. Let's take [[-1, -2], itertools.count(), [-3, -4, -5]] as the example. Here is what the graph would look like:
Viewing the problem as graph traversal
Visualized as above, it is easy to see the problem as traversing a potentially infinite graph. Our basic solution above was akin to a DFS traversal. Because of its depth-first approach, it of course had the problem of getting stuck in infinite branches of the graph. We can visualize the traversal corresponding to the above simple algorithm as the following animation, which shows how each node is discovered one at a time:
A BFS approach to handle infinite iterators
To solve the issue of getting stuck in an infinite branch of the graph, we need to do a BFS style traversal of the graph instead of DFS. For this, as usual with BFS, we will use a queue, moving forward in each discovered iterator one at a time. The desired traversal, visualized as an animation, would look like the following:
One simple implementation of the above BFS approach is given below in Python. You can view it as visiting items in the order of their distance from the starting point (the top-left yellow node representing the top-level iterator which is the input). We use a special item (the string 'top') in the queue to keep track of visiting the top-level iterator items (yellow nodes in the visualization) since they are the only nodes with two outgoing edges (which is why we append two items onto the queue each time we visit one of these nodes.)
def flatten_bfs(input):
from collections import deque
top_iter = iter(input)
q = deque(['top'])
while q:
it = q.popleft()
if it == 'top':
try:
q.append('top')
q.append(iter(next(top_iter)))
except StopIteration:
pass
else:
try:
x = next(it)
yield x
q.append(it)
except StopIteration:
pass
Set theoretic interpretation
As I mentioned before, the problem we looked at in this post, and the BFS-based solution to it, correspond to an important theorem and a possible proof of the theorem in set theory. Let's take a brief look at how that's the case. I will assume basic familiarity with set theory. If you want to learn more on the subject, I recommend Pinter's A Book of Set Theory or the more axiomatic treatment of Suppes in Axiomatic Set Theory.
The fundamental concept related to our algorithms above is that of countable sets. Countable sets are, in a limited sense, the mathematical equivalent of iterables in that they can be iterated over (or "counted" hence the term "countable"). Let's define as the set of natural numbers. This set is of course infinite, and corresponds to itertools.count() in Python. A set is said to be countable if and only if it can be put in one-to-one and onto correspondence with or a subset of . The one-to-one and onto correspondence in this case is analogous to an iterator, as the iterator can be seen as mapping each element it iterates over to a natural number (starting from 0 and onward).
Back to the problem, our iterator containing iterators is then analogous to a (disjoint) union of a countable number of countable sets. The problem of flattening such an iterator in a way such that any element will be iterated over in a finite amount of time then mirrors the following theorem from set theory:
Theorem: Assume and for all are all countable. Then is also countable.
Finally, our BFS-style algorithm then is the equivalent of a possible proof of the above theorem, and can be translated to a formal mathematical proof of the theorem.
The theorem itself might seem a bit obvious at first but its proof is not completely trivial, and its ramifications are quite wide-ranging. For example, the proof that , the set of rational numbers, is countable is a simple corollary of this theorem and is usually relatively counter-intuitive for most people. Furthermore, since all countable subsets of real numbers can be shown to have zero measure, it follows that has zero measure, which is arguably even more counter-intuitive.
Sources And Further Reading
As mentioned above, Pinter's A Book of Set Theory and Suppes's Axiomatic Set Theory are both good books to start learning about set theory.
In addition to the above, Cantor's original essays are still one of the best pieces of writing on the subject, if you don't mind the denser mathematics and older notation. I highly recommend Contributions to the Founding of the Theory of Transfinite Numbers as reading for anyone interested in set theory.
For for information on iterators in Python, PEP 234 Iterators is a good starting point.
Comments