A Review of Basic Algorithms and Data Structures in Python - Part 1: Graph Algorithms

Introduction

Recently, while reviewing basic graph algorithms, I decided to write down my study notes as an article in case someone else finds them useful. To verify my understanding, I wrote minimal implementations of the algorithms in Python, which make up the bulk of this article. Simple unit tests accompany the code and can double as usage examples.

I'm hoping to write at least a few follow-up posts focusing on combinatorial algorithms and string algorithms, and maybe even one on computational geometry; hence the "Part 1" in the title.

Most of the code was written to be easy to understand without having to reference much else (with a few exceptions; for example, Kruskal's algorithm uses the disjoint set structure defined in the section above it). This results in some duplication, especially in the unit tests. I consider this acceptable, given that the purpose of the code is to serve as educational material rather than production code that needs day-to-day maintenance.

One last thing before we start: I wrote the article and all the code relatively quickly. Mistakes and bugs are definitely possible. Corrections are appreciated; please comment below if you find any.

Table Of Contents

Algorithms and data structures in this article:

Disjoint Set (Union-Find)
Kruskal's Minimum Spanning Tree (MST)
Depth First Search (DFS)
Breadth First Search (BFS)
Kahn's Topological Sort Algorithm
Dijkstra's Shortest Path Algorithm
Bellman-Ford Shortest Path Algorithm

Disjoint Set (Union-Find)

The disjoint set structure is used to keep track of a partitioning of a set of objects into subsets. The main question it needs to answer is "do X and Y belong to the same subset?" and the main operation it needs to support is joining two subsets so that elements in either of the subsets will belong to the same larger subset afterwards.

A quick and minimal implementation is provided below. It uses a forest to keep track of the subsets in the partition: each tree in the forest is one subset, and the root of the tree is the "representative" element of that subset. To check whether two elements belong to the same subset, we check whether they have the same representative.

Noting that the ideal tree in this implementation is a star (this minimizes the number of recursive find calls), we "compress" paths on each call to find: as we unwind the recursive call stack, we set the parent of every element on the path to point directly at the representative.

class DisjointSet(object):
  def __init__(self, n):
    """
    Initializes a disjoint set structure consisting of n disjoint sets.
    """
    self.parent = list(range(n))

  def find(self, x):
    """Returns the representative element of the set x belongs to."""
    if self.parent[x] != x:
      self.parent[x] = self.find(self.parent[x])
    return self.parent[x]

  def union(self, x, y):
    """Joins the sets containing x and y."""
    self.parent[self.find(x)] = self.find(y)

And the accompanying unit test:

import unittest
from union_find import DisjointSet


class DisjointSetTest(unittest.TestCase):
  def test_initialized_state(self):
    d = DisjointSet(3)
    self.assertEqual(d.find(0), 0)
    self.assertEqual(d.find(1), 1)
    self.assertEqual(d.find(2), 2)

  def test_basic_union(self):
    d = DisjointSet(3)
    d.union(0, 1)
    self.assertEqual(d.find(0), d.find(1))
    self.assertNotEqual(d.find(1), d.find(2))

  def test_basic_union_idempotent(self):
    d = DisjointSet(2)
    d.union(0, 1)
    d.union(0, 1)
    self.assertEqual(d.find(0), d.find(1))

  def test_union_all(self):
    d = DisjointSet(100)
    for i in range(1, 100):
      d.union(i - 1, i)
    for i in range(1, 100):
      self.assertEqual(d.find(0), d.find(i))
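
To see the path compression described above in action, here is a small sketch that builds a chain of sets and then inspects the internal parent list (it pokes directly at the parent attribute, which the minimal class above happens to expose):

from union_find import DisjointSet

d = DisjointSet(4)
d.union(0, 1)     # parent: [1, 1, 2, 3]
d.union(1, 2)     # parent: [1, 2, 2, 3]
d.union(2, 3)     # parent: [1, 2, 3, 3] -- a chain 0 -> 1 -> 2 -> 3
print(d.find(0))  # 3
print(d.parent)   # [3, 3, 3, 3] -- every element on the path now points at the root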

Kruskal's Minimum Spanning Tree (MST)

Kruskal's minimum spanning tree algorithm is a good example of a greedy algorithm. Starting with a forest consisting of individual disjoint vertices, at each step we pick the next best edge (one with minimal weight) provided it does not introduce a cycle into the forest, and continue until the forest becomes a tree. It's rather easy to prove that the resulting tree is a minimum spanning tree.

Using the disjoint set structure shown above to keep track of the minimum spanning forest, the implementation below is very simple:

from collections import namedtuple
from union_find import DisjointSet


# Putting weight as the first element means Edges will sort by weight first,
# then source and target (lexicographically).
Edge = namedtuple('Edge', ['weight', 'source', 'target'])


def kruskal_mst(n, edges):
  """
  Given a positive integer n (number of vertices) and a collection of Edge
  namedtuple objects representing the undirected edges of a graph, returns a
  list of edges forming a minimum spanning tree of the graph. Assumes the
  vertices are numbered 0 to n - 1. Also assumes the input is a
  valid connected undirected graph and that for two vertices v and w only one
  of (v, w) or (w, v) is an edge in the input. Output is undefined if these
  assumptions are not satisfied.
  """
  d = DisjointSet(n)
  mst_tree = []
  for edge in sorted(edges):
    if d.find(edge.source) != d.find(edge.target):
      mst_tree.append(edge)
      if len(mst_tree) == n - 1:
        break
      d.union(edge.source, edge.target)
  return mst_tree

And the accompanying unit test:

import unittest
from kruskal import kruskal_mst, Edge


class KruskalMSTTest(unittest.TestCase):
  def test_single_vertex_graph(self):
    self.assertEqual(kruskal_mst(1, []), [])

  def test_single_edge_graph(self):
    edges = [Edge(source=0, target=1, weight=10)]
    self.assertEqual(kruskal_mst(2, edges), edges)

  def test_cycle_5(self):
    edges = [
      Edge(source=0, target=1, weight=50),
      Edge(source=1, target=2, weight=30),
      Edge(source=2, target=3, weight=60),
      Edge(source=3, target=4, weight=20),
      Edge(source=4, target=0, weight=10),
    ]
    # Everything except the heaviest edge. Output sorted by weight.
    self.assertEqual(kruskal_mst(5, edges), [
      Edge(source=4, target=0, weight=10),
      Edge(source=3, target=4, weight=20),
      Edge(source=1, target=2, weight=30),
      Edge(source=0, target=1, weight=50),
    ])

  def test_complete_graph_4(self):
    edges = [
      Edge(source=0, target=1, weight=10),
      Edge(source=0, target=2, weight=30),
      Edge(source=0, target=3, weight=40),
      Edge(source=1, target=2, weight=20),
      Edge(source=1, target=3, weight=50),
      Edge(source=2, target=3, weight=60),
    ]
    self.assertEqual(kruskal_mst(4, edges), [
      Edge(source=0, target=1, weight=10),
      Edge(source=1, target=2, weight=20),
      Edge(source=0, target=3, weight=40),
    ])
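
As a quick usage sketch (reusing the complete graph from the last test above), the total weight of the tree returned by kruskal_mst is often the quantity of interest:

from kruskal import kruskal_mst, Edge

edges = [
  Edge(source=0, target=1, weight=10),
  Edge(source=0, target=2, weight=30),
  Edge(source=0, target=3, weight=40),
  Edge(source=1, target=2, weight=20),
  Edge(source=1, target=3, weight=50),
  Edge(source=2, target=3, weight=60),
]
mst = kruskal_mst(4, edges)
print(sum(edge.weight for edge in mst))  # 70 (the edges with weights 10, 20 and 40)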

Depth First Search (DFS)

Depth first search is arguably the simplest graph traversal algorithm: a short recursive procedure that only needs to keep track of which vertices have already been visited. In fact, many other recursive algorithms can be thought of as a DFS on some underlying graph (e.g. binary search is a guided DFS on the binary search tree). DFS can be used to determine whether there is a path from one vertex to another, and to visit every vertex reachable from a source vertex. Variations of DFS can be used for finding connected components (a small sketch follows the unit tests below) and for topological sorting. The code below simply uses DFS to return all vertices reachable from a starting vertex.

def dfs(graph, source):
  """
  Given a directed graph (format described below), and a source vertex,
  returns a set of vertices reachable from source.

  The graph parameter is expected to be a dictionary mapping each vertex to a
  list of vertices indicating outgoing edges. For example if vertex v has
  outgoing edges to u and w we have graph[v] = [u, w].
  """
  visited = set()

  def _recurse(v):
    if v in visited:
      return
    visited.add(v)
    for w in graph[v]:
      _recurse(w)

  _recurse(source)

  return visited
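
The recursive version above can hit Python's recursion limit on deep graphs. An equivalent iterative version using an explicit stack is a straightforward variation; here is a minimal sketch (dfs_iterative is a name introduced here, same graph format as above; the unit test below only exercises the recursive version):

def dfs_iterative(graph, source):
  """Same contract as dfs above, but uses an explicit stack instead of
  recursion, avoiding Python's recursion limit on deep graphs."""
  visited = set()
  stack = [source]
  while stack:
    v = stack.pop()
    if v in visited:
      continue
    visited.add(v)
    # Push all neighbours; already-visited ones are skipped when popped.
    stack.extend(graph[v])
  return visited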

And the accompanying unit test:

import unittest
from dfs import dfs


class DFSTest(unittest.TestCase):
  def test_single_vertex(self):
    graph = {0: []}
    self.assertEqual(dfs(graph, 0), {0})

  def test_single_vertex_with_loop(self):
    graph = {0: [0]}
    self.assertEqual(dfs(graph, 0), {0})

  def test_two_vertices_no_path(self):
    graph = {
      0: [],
      1: [],
    }
    self.assertEqual(dfs(graph, 0), {0})
    self.assertEqual(dfs(graph, 1), {1})

  def test_two_vertices_with_simple_path(self):
    graph = {
      0: [1],
      1: [],
    }
    self.assertEqual(dfs(graph, 0), {0, 1})
    self.assertEqual(dfs(graph, 1), {1})

  def test_complete_graph(self):
    def _complete_graph(n):
      return {v: list(set(range(n)) - {v}) for v in range(n)}

    for n in range(2, 10):
      graph = _complete_graph(n)
      for v in range(n):
        self.assertEqual(dfs(graph, v), set(range(n)))

  def test_cycle_5(self):
    graph = {
      0: [1],
      1: [2],
      2: [3],
      3: [4],
      4: [0],
    }
    for v in range(5):
      self.assertEqual(dfs(graph, v), {0, 1, 2, 3, 4})
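
As mentioned above, a small variation of DFS finds connected components. Here is a minimal sketch that repeatedly runs dfs from an unvisited vertex, assuming the adjacency dict represents an undirected graph (every vertex appears as a key and every edge is listed in both directions); connected_components is a name introduced here:

from dfs import dfs


def connected_components(graph):
  """Returns a list of sets, one per connected component of the graph."""
  components = []
  remaining = set(graph)
  while remaining:
    component = dfs(graph, next(iter(remaining)))
    components.append(component)
    remaining -= component
  return components


# Two components: {0, 1} and {2} (order of components may vary).
print(connected_components({0: [1], 1: [0], 2: []}))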

Breadth First Search (BFS)

BFS is one of the simplest graph algorithms and a good one to understand before Dijkstra's algorithm, which appears later in this article. It can be used to traverse a graph and visit every vertex, to search for a particular vertex, or to find shortest paths (in terms of number of edges, ignoring weights) from a single source vertex (a sketch of the latter follows the unit tests below).

from collections import deque


def bfs(graph, source, target):
  """
  Given a directed graph (format described below), and source and target
  vertices, returns a shortest unweighted path as a list of vertices going
  from source to target, or None if no such path exists. Returned path will
  not include the source vertex in it.

  The graph parameter is expected to be a dictionary mapping each vertex to a
  list of vertices indicating outgoing edges. For example if vertex v has
  outgoing edges to u and w we have graph[v] = [u, w].
  """
  q = deque([source])
  # previous_vertex[v] holds the immediate vertex before v in the shortest
  # path from source to v. This dictionary also acts as our "visited" set
  # since we set previous_vertex[v] as soon as the vertex enters our queue.
  previous_vertex = {source: source}
  while q:
    v = q.popleft()
    if v == target:
      return _construct_path(previous_vertex, source, target)
    for w in graph[v]:
      if w not in previous_vertex:
        previous_vertex[w] = v
        q.append(w)
  return None


def _construct_path(previous_vertex, source, target):
  if source == target:
    return []
  return _construct_path(previous_vertex, source,
               previous_vertex[target]) + [target]

And the accompanying unit test:

import unittest
from bfs import bfs


class BFSTest(unittest.TestCase):
  def test_single_vertex(self):
    graph = {0: []}
    self.assertEqual(bfs(graph, 0, 0), [])

  def test_single_vertex_with_loop(self):
    graph = {0: [0]}
    self.assertEqual(bfs(graph, 0, 0), [])

  def test_two_vertices_no_path(self):
    graph = {
      0: [],
      1: [],
    }
    self.assertEqual(bfs(graph, 0, 1), None)

  def test_two_vertices_with_simple_path(self):
    graph = {
      0: [1],
      1: [],
    }
    self.assertEqual(bfs(graph, 0, 1), [1])

  def test_complete_graph(self):
    def _complete_graph(n):
      return {v: list(set(range(n)) - {v}) for v in range(n)}

    for n in range(2, 10):
      graph = _complete_graph(n)
      for v in range(n):
        for w in range(n):
          self.assertEqual(bfs(graph, v, w),
                   [] if v == w else [w])

  def test_cycle_5(self):
    graph = {
      0: [4, 1],
      1: [0, 2],
      2: [1, 3],
      3: [2, 4],
      4: [3, 0],
    }
    self.assertEqual(bfs(graph, 0, 2), [1, 2])
    self.assertEqual(bfs(graph, 0, 3), [4, 3])
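
The implementation above returns a single source-to-target path. As noted at the start of this section, the same idea also gives the unweighted distance from the source to every reachable vertex; here is a minimal sketch (bfs_distances is a name introduced here, same graph format as above):

from collections import deque


def bfs_distances(graph, source):
  """Returns a dict mapping each vertex reachable from source to its
  unweighted distance (number of edges) from source."""
  distance = {source: 0}
  q = deque([source])
  while q:
    v = q.popleft()
    for w in graph[v]:
      if w not in distance:
        distance[w] = distance[v] + 1
        q.append(w)
  return distance


# On the 5-cycle from the test above:
graph = {0: [4, 1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
print(bfs_distances(graph, 0))  # {0: 0, 4: 1, 1: 1, 3: 2, 2: 2} (key order may vary)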

Kahn's Topological Sort Algorithm

Given a directed acyclic graph (DAG) representing a set of, say, tasks and their dependencies, topological sort is the problem of finding an order of task execution that will satisfy all the dependencies. This problem arises in a variety of applications. Examples include task scheduling, build systems (e.g. Bazel), parallel pipelines (e.g. Hadoop), and formula evaluation (e.g. in spreadsheets).

While a variation of DFS can be used for topological sorting, my personal favourite algorithm for doing topological sorts is Kahn's algorithm, due to its intuitiveness. The idea behind the algorithm is simple: start with vertices with no incoming edges, process them, and then remove them and all their outgoing edges from the graph and continue until there's nothing left in the graph.

In the code below, instead of returning one particular topological sort, the algorithm assigns a "sequence" number to each vertex such that for every edge (v, w) we have sequence[v] < sequence[w]; sorting the vertices by sequence therefore yields a valid topological order. This simplifies unit testing, and also makes the output easier to use when parallelization is possible, since all tasks with the same sequence number can be executed in parallel (a small sketch of this follows the unit tests below).

from collections import deque, namedtuple

Vertex = namedtuple('Vertex', ['name', 'incoming', 'outgoing'])


def build_doubly_linked_graph(graph):
  """
  Given a graph with only outgoing edges, build a graph with incoming and
  outgoing edges. The returned graph will be a dictionary mapping vertex to a
  Vertex namedtuple with sets of incoming and outgoing vertices.
  """
  g = {v: Vertex(name=v, incoming=set(), outgoing=set(o))
       for v, o in graph.items()}
  # Iterate over a snapshot of the vertices, since the loop below may add new
  # entries to g for vertices that only appear as edge targets in the input.
  for v in list(g.values()):
    for w in v.outgoing:
      if w in g:
        g[w].incoming.add(v.name)
      else:
        g[w] = Vertex(name=w, incoming={v.name}, outgoing=set())
  return g
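
# As a quick illustration, for the small DAG {0: [1, 2], 1: [2], 2: []} the
# helper above produces (a sketch; set display order may differ):
#
#   {0: Vertex(name=0, incoming=set(), outgoing={1, 2}),
#    1: Vertex(name=1, incoming={0}, outgoing={2}),
#    2: Vertex(name=2, incoming={0, 1}, outgoing=set())}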


def kahn_top_sort(graph):
  """
  Given an acyclic directed graph (format described below), returns a
  dictionary mapping vertex to sequence such that sorting by the sequence
  component will result in a topological sort of the input graph. Output is
  undefined if input is a not a valid DAG.

  The graph parameter is expected to be a dictionary mapping each vertex to a
  list of vertices indicating outgoing edges. For example if vertex v has
  outgoing edges to u and w we have graph[v] = [u, w].
  """
  g = build_doubly_linked_graph(graph)
  # For every edge (v, w) we will have sequence[v] < sequence[w], so sorting
  # the vertices by sequence gives a valid topological order.
  q = deque(v.name for v in g.values() if not v.incoming)
  sequence = {v: 0 for v in q}
  while q:
    v = q.popleft()
    for w in g[v].outgoing:
      g[w].incoming.remove(v)
      if not g[w].incoming:
        sequence[w] = sequence[v] + 1
        q.append(w)

  return sequence

And the accompanying unit test:

import unittest
from kahn import kahn_top_sort


class KahnTopSortTest(unittest.TestCase):
  def test_single_vertex(self):
    graph = {
      0: [],
    }
    self.assertEqual(kahn_top_sort(graph), {
      0: 0,
    })

  def test_total_order_2(self):
    graph = {
      0: [1],
      1: [],
    }
    self.assertEqual(kahn_top_sort(graph), {
      0: 0,
      1: 1,
    })

  def test_total_order_3(self):
    graph = {
      0: [1],
      1: [2],
      2: [],
    }
    self.assertEqual(kahn_top_sort(graph), {
      0: 0,
      1: 1,
      2: 2,
    })

  def test_two_independent_total_orders(self):
    # 0 -> 1 -> 2
    # 3 -> 4 -> 5
    graph = {
      0: [1],
      1: [2],
      2: [],
      3: [4],
      4: [5],
      5: [],
    }
    self.assertEqual(kahn_top_sort(graph), {
      0: 0,
      3: 0,
      1: 1,
      4: 1,
      2: 2,
      5: 2,
    })

  def test_simple_dag_1(self):
    # 0 -> 1 -> 2
    #   \ /
    #  3
    graph = {
      0: [1, 3],
      1: [2],
      2: [],
      3: [1],
    }
    self.assertEqual(kahn_top_sort(graph), {
      0: 0,
      3: 1,
      1: 2,
      2: 3,
    })
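
To illustrate the parallelization point made earlier, here is a small sketch that groups vertices by sequence number into "waves", where each wave only depends on earlier waves and can therefore be executed in parallel (parallel_schedule is a name introduced here):

from collections import defaultdict
from kahn import kahn_top_sort


def parallel_schedule(graph):
  """Returns a list of 'waves' of vertices; every vertex only depends on
  vertices in earlier waves, so each wave can run in parallel."""
  waves = defaultdict(list)
  for vertex, seq in kahn_top_sort(graph).items():
    waves[seq].append(vertex)
  return [waves[i] for i in sorted(waves)]


# The "two independent total orders" graph from the tests above:
graph = {0: [1], 1: [2], 2: [], 3: [4], 4: [5], 5: []}
print(parallel_schedule(graph))  # [[0, 3], [1, 4], [2, 5]] (order within a wave may vary)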

Dijkstra's Shortest Path Algorithm

Dijkstra's shortest path algorithm is very similar to BFS, except that a priority queue is used instead of a plain FIFO queue. A proper implementation would use a priority queue supporting an "update key" (decrease-key) operation, which reduces the number of redundant items in the queue. For the sake of simplicity, the implementation below uses the standard-library PriorityQueue, which does not support "update key".

The invariant in the algorithm is that each time we pop a vertex from the queue, we already know the shortest path from the source to it. This is where the guarantee of non-negative weights is key: the invariant can fail if negative weights are present (a small demonstration follows the unit tests below).

from collections import namedtuple, defaultdict
from queue import PriorityQueue  # This module was named "Queue" in Python 2.

Edge = namedtuple('Edge', ['target', 'weight'])


def dijkstra(graph, source, target):
  """
  Given a directed graph (format described below), and source and target
  vertices, returns a shortest path as a list of vertices going from source
  to target, along with the distance of the shortest path, or None if no such
  path exists. Returned path will not include the source vertex in it.
  Assumes non-negative weights.

  The graph parameter is expected to be a dictionary mapping each vertex to a
  list of Edge named tuples indicating the vertex's outgoing edges. For
  example if vertex v has outgoing edges to u and w with weights 10 and 20
  respectively, we have graph[v] = [Edge(u, 10), Edge(w, 20)].
  """
  q = PriorityQueue()
  q.put((0, source))
  # previous_vertex[v] holds the immediate vertex before v in the shortest
  # path found so far from source to v.
  previous_vertex = {source: source}
  # Arguably not the best way to represent infinity but it works for the sake
  # of learning the algorithm.
  shortest_distance = defaultdict(lambda: float('inf'))
  shortest_distance[source] = 0
  while not q.empty():
    (distance, v) = q.get()
    if v == target:
      return (distance, _construct_path(previous_vertex, source, target))

    for edge in graph[v]:
      alt_distance = edge.weight + distance
      if alt_distance < shortest_distance[edge.target]:
        shortest_distance[edge.target] = alt_distance
        q.put((alt_distance, edge.target))
        previous_vertex[edge.target] = v
  return None


def _construct_path(previous_vertex, source, target):
  if source == target:
    return []
  return _construct_path(previous_vertex, source,
               previous_vertex[target]) + [target]

And the accompanying unit test:

import unittest
from dijkstra import dijkstra, Edge


class DijkstraTest(unittest.TestCase):
  def test_single_vertex(self):
    graph = {0: []}
    self.assertEqual(dijkstra(graph, 0, 0), (0, []))

  def test_two_vertices_no_path(self):
    graph = {
      0: [],
      1: [],
    }
    self.assertEqual(dijkstra(graph, 0, 1), None)

  def test_two_vertices_with_path(self):
    graph = {
      0: [Edge(target=1, weight=10)],
      1: [],
    }
    self.assertEqual(dijkstra(graph, 0, 1), (10, [1]))

  def test_cycle_3(self):
    graph = {
      0: [Edge(target=1, weight=10), Edge(target=2, weight=30)],
      1: [Edge(target=0, weight=10), Edge(target=2, weight=10)],
      2: [Edge(target=0, weight=30), Edge(target=1, weight=30)],
    }
    self.assertEqual(dijkstra(graph, 0, 2), (20, [1, 2]))

  def test_clrs_example(self):
    graph = {
      's': [
        Edge(target='t', weight=3),
        Edge(target='y', weight=5),
      ],
      't': [
        Edge(target='x', weight=6),
        Edge(target='y', weight=2),
      ],
      'y': [
        Edge(target='t', weight=1),
        Edge(target='z', weight=6),
      ],
      'x': [
        Edge(target='z', weight=2),
      ],
      'z': [
        Edge(target='x', weight=7),
        Edge(target='s', weight=3),
      ],
    }
    distance, path = dijkstra(graph, 's', 'z')
    self.assertEqual(distance, 11)
    self.assertIn(path, [
      ['y', 'z'],
      ['t', 'y', 'x', 'z'],
    ])

    distance, path = dijkstra(graph, 's', 'x')
    self.assertEqual(distance, 9)
    self.assertIn(path, [
      ['t', 'x'],
      ['y', 'x'],
    ])
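
As a small demonstration of the invariant failing with negative weights (a tiny made-up graph; this is exactly the case Bellman-Ford, below, handles):

from dijkstra import dijkstra, Edge

# The true shortest path from 's' to 'a' is s -> b -> a with total weight 0,
# but 'a' is popped from the queue with distance 1 before 'b' is processed.
graph = {
  's': [Edge(target='a', weight=1), Edge(target='b', weight=2)],
  'a': [],
  'b': [Edge(target='a', weight=-2)],
}
print(dijkstra(graph, 's', 'a'))  # (1, ['a']) -- wrong; the answer should be (0, ['b', 'a'])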

Bellman-Ford Shortest Path Algorithm

Bellman-Ford is another single-source shortest path algorithm. It's very easy to implement but has a worse running time than Dijkstra's. While Dijkstra's relaxes edges greedily based on the next closest vertex to the source, Bellman-Ford simply relaxes every edge n - 1 times, where n is the number of vertices. Each such pass is guaranteed to increase the number of vertices for which we know the shortest path by at least one, and hence after n - 1 passes we have the shortest path to every vertex. We then do a final loop over all the edges and try to relax further; if we succeed, we know a negative cycle exists. This is the key advantage of Bellman-Ford compared to Dijkstra's, which does not work when negative weights are present.

Here's a basic implementation:

from collections import namedtuple, defaultdict

Edge = namedtuple('Edge', ['target', 'weight'])


def bellman_ford(graph, source, target):
  """
  Given a directed graph (format described below), and source and target
  vertices, returns a shortest path as a list of vertices going from source
  to target, along with the distance of the shortest path, or None if no such
  path exists, or -1 if a negative cycle is found. Returned path will not
  include the source vertex in it. Negative weights are allowed.

  The graph parameter is expected to be a dictionary mapping each vertex to a
  list of Edge named tuples indicating the vertex's outgoing edges. For
  example if vertex v has outgoing edges to u and w with weights 10 and 20
  respectively, we have graph[v] = [Edge(u, 10), Edge(w, 20)].
  """
  # previous_vertex[v] holds the immediate vertex before v in the shortest
  # path found so far from source to v.
  previous_vertex = {source: source}
  # Arguably not the best way to represent infinity but it works for the sake
  # of learning the algorithm.
  shortest_distance = defaultdict(lambda: float('inf'))
  shortest_distance[source] = 0

  # Run n - 1 passes. We start by knowing the shortest path to 1 vertex (the
  # source itself), and each pass below increases the number of vertices for
  # which we know the shortest path by at least one. This means at the end we
  # have the shortest path to all 1 + (n - 1) = n vertices.
  for i in range(len(graph) - 1):
    for v in graph:
      for edge in graph[v]:
        alt_distance = shortest_distance[v] + edge.weight
        if alt_distance < shortest_distance[edge.target]:
          shortest_distance[edge.target] = alt_distance
          previous_vertex[edge.target] = v
  # Final loop over all edges to check for negative cycles. If at this point
  # we can still find a shorter alternative path, a negative cycle exists.
  for v in graph:
    for edge in graph[v]:
      alt_distance = shortest_distance[v] + edge.weight
      if alt_distance < shortest_distance[edge.target]:
        return -1

  if shortest_distance[target] < float('inf'):
    return (shortest_distance[target],
        _construct_path(previous_vertex, source, target))

  return None


def _construct_path(previous_vertex, source, target):
  if source == target:
    return []
  return _construct_path(previous_vertex, source,
               previous_vertex[target]) + [target]
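
A common refinement (not used above, but worth knowing): if a full pass over all the edges relaxes nothing, later passes cannot change anything either, so we can stop early. Here is a minimal sketch that only computes distances (bellman_ford_distances is a name introduced here; negative cycle detection is omitted for brevity, and the unit test below only covers the full bellman_ford above):

from collections import defaultdict


def bellman_ford_distances(graph, source):
  """Returns a dict of shortest distances from source, using the same graph
  format as bellman_ford above, stopping early once a pass relaxes nothing."""
  shortest_distance = defaultdict(lambda: float('inf'))
  shortest_distance[source] = 0
  for _ in range(len(graph) - 1):
    relaxed = False
    for v in graph:
      for edge in graph[v]:
        alt_distance = shortest_distance[v] + edge.weight
        if alt_distance < shortest_distance[edge.target]:
          shortest_distance[edge.target] = alt_distance
          relaxed = True
    if not relaxed:
      break  # Nothing changed in a full pass; all distances are final.
  return shortest_distance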

And as before, the accompanying unit test, which is a copy of the one used for Dijkstra's, with an additional test for negative cycles:

import unittest
from bellman import bellman_ford, Edge


class BellmanFordTest(unittest.TestCase):
  def test_single_vertex(self):
    graph = {0: []}
    self.assertEqual(bellman_ford(graph, 0, 0), (0, []))

  def test_two_vertices_no_path(self):
    graph = {
      0: [],
      1: [],
    }
    self.assertEqual(bellman_ford(graph, 0, 1), None)

  def test_two_vertices_with_path(self):
    graph = {
      0: [Edge(target=1, weight=10)],
      1: [],
    }
    self.assertEqual(bellman_ford(graph, 0, 1), (10, [1]))

  def test_cycle_3(self):
    graph = {
      0: [Edge(target=1, weight=10), Edge(target=2, weight=30)],
      1: [Edge(target=0, weight=10), Edge(target=2, weight=10)],
      2: [Edge(target=0, weight=30), Edge(target=1, weight=30)],
    }
    self.assertEqual(bellman_ford(graph, 0, 2), (20, [1, 2]))

  def test_negative_cycle_3(self):
    graph = {
      0: [Edge(target=1, weight=10), Edge(target=2, weight=30)],
      1: [Edge(target=0, weight=10), Edge(target=2, weight=10)],
      2: [Edge(target=0, weight=-30), Edge(target=1, weight=30)],
    }
    self.assertEqual(bellman_ford(graph, 0, 2), -1)

  def test_clrs_example(self):
    graph = {
      's': [
        Edge(target='t', weight=3),
        Edge(target='y', weight=5),
      ],
      't': [
        Edge(target='x', weight=6),
        Edge(target='y', weight=2),
      ],
      'y': [
        Edge(target='t', weight=1),
        Edge(target='z', weight=6),
      ],
      'x': [
        Edge(target='z', weight=2),
      ],
      'z': [
        Edge(target='x', weight=7),
        Edge(target='s', weight=3),
      ],
    }
    distance, path = bellman_ford(graph, 's', 'z')
    self.assertEqual(distance, 11)
    self.assertIn(path, [
      ['y', 'z'],
      ['t', 'y', 'x', 'z'],
    ])

    distance, path = bellman_ford(graph, 's', 'x')
    self.assertEqual(distance, 9)
    self.assertIn(path, [
      ['t', 'x'],
      ['y', 'x'],
    ])
