11  Graphs


Goals


Graphs in Practice

Graphs are one of the most important concepts in computer science because of their real-world applications.

If we want to model social interactions, computer networks, or any other process with lots of interconnections, we find ourselves relying on graphs.

Graphs have two key concepts: the node and the edge.

G A A B B A--B B--B C C B--C E E C--E D D C--D E--A R R D--R S S D--S T T D--T

This graph has lettered nodes, and lines representing edges.

We have seen these before, with trees and linked lists.

G N0 N0 N1 N1 N0->N1 N2 N2 N1->N2 N3 N3 N2->N3 N4 N4 N3->N4

G N0 N0 N1 N1 N0--N1 N2 N2 N0--N2 N3 N3 N1--N3 N4 N4 N1--N4 N5 N5 N2--N5 N6 N6 N2--N6

Both of those can be thought of as graphs themselves, each node typically represents the fundamental unit we’re looking at. In a social network, each node would be a person. In global trade, a country.

An edge represents some relationship between the nodes. This relationship usually will have some kind of label on it, both nodes and edges may have attributes in practice. We could label what kind of relationship two people have, e.g. “sibling” or “neighbor”. When we have a numeric label we sometimes call that a weight.

SocialNetwork Andy Andy Jim Jim Andy--Jim friend Nicole Nicole Andy--Nicole spouse Emily Emily Jim--Emily spouse Shay Shay Jim--Shay coworker Dawn Dawn Shay--Dawn sibling

A weight in a graph of airports might represent the distance, or cost of traveling between the nodes, depending on the needs of our algorithms.

airports LAX LAX SFO SFO LAX--SFO 347 mi ORD ORD LAX--ORD 1741 mi MEX MEX LAX--MEX 1554 mi SFO--ORD 1840 mi SFO--MEX 1890 mi ORD--MEX 1689 mi

Oh that’s hard to read…

airports LAX LAX SFO SFO LAX--SFO 347 mi ORD ORD LAX--ORD 1741 mi MEX MEX LAX--MEX 1554 mi SFO--ORD 1840 mi SFO--MEX 1890 mi ORD--MEX 1689 mi

I left that first graph in to demonstrate an important point about graphs. The positions of the nodes has no meaning. We can reposition them as needed for our display purposes.

All that matters are the nodes and edges.

Directed, Undirected, and Multigraphs

If we want to add costs between cities, we need a second edge.

airports LAX LAX SFO SFO LAX->SFO $90 ORD ORD LAX->ORD $350 MEX MEX LAX->MEX $400 SFO->ORD $410 SFO->MEX $310 ORD->LAX $310 ORD->MEX $780 MEX->LAX $200 MEX->ORD $560

In this case, it costs $780 to travel from Chicago to Mexico City, but $560 to travel from Mexico City to Chicago. When drawing these, we use an arrow to indicate direction.

These are known as directed graphs or digraphs.

When you need to allow multiple edges but there is no directionality that is called a multigraph:

travel_times LAX LAX SFO SFO LAX--SFO 1h 30m (flight) LAX--SFO 6h (driving) LAX--SFO 12h (train) ORD ORD LAX--ORD 4h 15m (flight) LAX--ORD 30h (driving) LAX--ORD 43h (train) SFO--ORD 4h 30m (flight) SFO--ORD 31h (driving) SFO--ORD 51h (train)

Here have multiple edges representing different modes of transportation between the cities.

You can also add directed edges, creating a directed multigraph.

Trees

As mentioned before, trees are a special kind of graph.

Recall that each node is connected to a single parent and potentially many children. This restrictions means there is always a root node, which is connected only to children.

When we draw a tree, we can start with this root node. This allows us to traverse the tree, which we can do depth-first or breadth-first and we will reach every node.

A path is just a list of nodes in the order we visited them based on edges.

G A A B B A--B C C B--C D D B--D E E C--E G G C--G H H G--H S S H--S D--S

In the above graph, we have two paths from A->S. A,B,C,G,H,S or A,B,D,S.

A cycle in a graph is when you can traverse a path and wind up back where you started.

G A A B B A--B C C B--C R R B--R S S B--S D D C--D D--B E E D--E Q Q D--Q T T D--T V V D--V E--A

The highlighted edges form cycles: B,C,D,B or A,B,D,E,A for example.

Nodes may have edges to themselves allowing a cycle within a single node:

G A A A->A

A tree is a special type of graph with no cycles. No child node can send you back to a different parent, and no siblings have edges between themselves.

Adding a single edge, anywhere in a tree will create a cycle.

G N0 N0 N1 N1 N0--N1 N2 N2 N0--N2 N3 N3 N1--N3 N4 N4 N1--N4 N5 N5 N2--N5 N6 N6 N2--N6 N4--N5 N6--N0

Adding either colored edge converts our tree to a graph.

More Terminology

There are some other terms we’ll use to describe graphs.

Connected: A graph is connected if there is a path between every pair of nodes.

Complete: A graph where every pair of nodes has an edge is complete.

G A B A--B C A--C D A--D E A--E B--C B--D B--E C--D C--E D--E

The complete graph K5.

Disconnected: A graph is disconnected if there is no path between some pair of nodes.

G A A B B A--B C C B--C C--A D D C--D E E D--E F F G G F--G K K G--K K--F

This graph has two disconnected components.

Acyclic: A graph is acyclic if it has no cycles.

Tree A tree is a connected acyclic graph.

Forest An acyclic graph that is disconnected can be called a forest, as it is made of many trees.

Isomorphic Two graphs are considered isomorphic if there is a one-to-one correspondence between their respective graphs and edges.

More Examples

What kind of graphs are these? What make up the nodes and edges?

Git is an example of a directed acyclic graph (DAG).

Nodes: Commits are nodes, and the relationships between nodes are edges.

A commit may have multiple parents (a merge), or multiple children (a branch).

It is acyclic however, since no commit can be its own parent (or grandparent, etc.)

The internet can be thought of as a well-connected graph where nodes are individual servers, and the links between them edges.

Depending on the level of modeling you might consider it directed (asymmetric traffic), or undirected (reachability). If you were measuring the actual connectivity it would be a multigraph due to redundant links like wireless and wired connections.

Cities/exits make up nodes and highways themselves directed edges between them.

It may also be considered a multigraph if redundant/alternate routes are taken into account.

Graph Representation

Unlike the simpler array or linked list, graphs have many possible data structure implementations.

We’ll look at three different ways of thinking about, or implementing graphs in our programs.

Formal Definition

Formally, we can define a graph G as an ordered pair G=(V,E) where:

V is a set of vertices (nodes) E is a set of edges, where each is a 2-element subset of V

For a directed graph, the edges are ordered pairs instead of sets:

E(x,y)x,yV

Set Implementation

This points to our first way of implementing a graph, we could store the nodes and edges in two separate structures.

nodes = set(("A", "B", "C", "D"))
edges = set(("A, "B""), ("B", "C"), ("A", "C"), ("C", "D"))

Of course we’re using Python’s set here for convenience, but a carefully managed list or dict could work as well.

set has hashtable-like O(1) containment checking, so if our algorithm simply needs to check if certain edges are present this implementation might be sufficient.

Often though, we will need to query our graph for what edges head in or out of a given node. For example, when traversing our airport graph, “what flights can I take out of Chicago?” might be a question that we ask.

This would require scanning our set of edges.

Adjacency List

To optimize for this case, we can instead store our graph’s edges indexed by node.

adj_list = {
    "A": ["B", "C"],    # edges from A->B and A->C
    "B": ["A"],         # in a non-directed graph we'd store the reciprocals too
    "C": ["A", "D"],         
    "D": ["C"],
    "E": [],            # a node with no edges
}

Adjacency Matrix

If our graph was large, and highly connected, we might find that adjacency lists are inefficient. If almost every edge is connected to every other edge, we can store the graph in an adjacency matrix.

In this matrix the weights are the driving distance between the nodes:

City A B C D E
A 0 45 28 72 91
B 45 0 33 51 86
C 28 33 0 39 64
D 72 51 39 0 25
E 91 86 64 25 0

The numbers represent the distances (in kilometers) between each pair of cities.

The matrix is symmetric since the distance from city A to B is the same as B to A, and the diagonal is 0 since a city has no distance to itself.

Graph Class Example

If we have data we want to store on the edges or nodes themselves, we would often introduce additional dictionaries.

For example:

# we'd add these in addition to our adjacency list or matrix
node_names = {
  "A": "SFO",
  "B": "IAD",
  "C": "MEX",
  "D": "ORD",
  "E": "DCA",
}

edge_prices = {
  ("A", "B"): 500,
  ("A", "C"): 400,
  ("A", "D"): 350,
  ("C", "B"): 400,
}

Whenever we have related data structures to keep in sync we often turn to a class to manage the state & make sure that we don’t allow invalid operations.

class Graph:
    def __init__(self):
        # keys are names of nodes, value are node attributes
        self.nodes = {
        }
        # keys are tuples of nodes, values are lists of adjacent nodes
        self.edges = defaultdict(dict)
        
    def add_node(self, value, **attributes):
        """
        Add a node to the graph.

        Parameters
        ----------
        value : hashable
            The value of the node to add.
        attributes: keyword arguments
            Attributes to add to the node.

        Example:
        >>> g = Graph()
        >>> g.add_node(1, color='red', city_name="Chicago")
        >>> g.nodes
        {1: {'color': 'red'}}
        """
        self.nodes[value] = attributes

    def add_edge(self, from_node, to_node):
        """
        Add an edge to the graph.

        Parameters
        ----------
        from_node : hashable
            The value of the node the edge is coming from.
        to_node : hashable
            The value of the node the edge is going to.
        """
        self.edges[from_node].append(to_node)
        self.edges[to_node].append(from_node)

Graph Algorithms

When we were navigating trees the algorithm was more or less straightforward:

  1. Visit node.
  2. Visit all children.

We sometimes process the node and children in certain orders, such as in a binary search, but any tree traversal has those two steps.

A graph complicates things a bit more, since we need to account for cycles.

G A A B B A--B C C B--C C--A

If we just “visit all neighbors” we wind up in an infinite loop.

We have two common ways to traverse our graph: breadth-first and depth-first.

Dijkstra’s algorithm

BFS can find the shortest path in terms of number of edges, but what if we need to take into account real world distance, or ticket cost?

Dijkstra’s algorithm finds the shortest path (accounting for edge weights).

This means it can be used for real-world routing, networks, but also for many other applications depending on what your edge weights represent.

  1. Build a set of the unvisited nodes called the unvisited set.

  2. Assign to every node a “tentative distance” value, 0 for our starting node, Infinity for all others. We will update these tentative distances with the length of the shortest path to that node yet discovered.

  3. Set the current node to the starting node.

  4. For the current node, consider all unvisited neighbors and calculate their tentative distances through the current node. If the distance through the current node is less than the tentative distance, update the node’s tentative distance.

  5. When we have visited all neighbors of the current node, mark the node as visited/remove it from the unvisited set.

  6. If the destination node has been marked visited, stop. Otherwise select the unvisited node with he smallest tentative distance and set it as the new current node. Go back to step 3.

Dijkstra’s algorithm visualized

NetworkX

NetworkX is a full-featured Python library for working with graphs.
It provides a wide range of graph algorithms, drawing tools, and data structures.

While not every graph-like structure needs NetworkX, you will find its collection of data structures, algorithms, and utilities useful for mid-sized graph problems.

I would recommend you read the NetworkX tutorial which introduces the key concepts. (You can stop one you reach graph generators for our purposes).

Drawing graphs

NetworkX is not a graph drawing package, but provides helper methods to work with Graphviz and Matplotlib.

import networkx as nx
import matplotlib.pyplot as plt

g = nx.petersen_graph()
subax1 = plt.subplot(121)
nx.draw(g, with_labels=True, font_weight="bold")
subax2 = plt.subplot(122)
nx.draw_shell(g, nlist=[range(5, 10), range(5)], with_labels=True, font_weight="bold")
options = {
    "node_color": "black",
    "node_size": 100,
    "width": 3,
}
subax1 = plt.subplot(221)
nx.draw_random(g, **options)
subax2 = plt.subplot(222)
nx.draw_circular(g, **options)
subax3 = plt.subplot(223)
nx.draw_spectral(g, **options)
subax4 = plt.subplot(224)
nx.draw_shell(g, nlist=[range(5, 10), range(5)], **options)
#| tags: []
g = nx.dodecahedral_graph()
shells = [[2, 3, 4, 5, 6], [8, 1, 0, 19, 18, 17, 16, 15, 14, 7], [9, 10, 11, 12, 13]]
nx.draw_shell(g, nlist=shells, **options)

< https://networkx.org/documentation/stable/reference/drawing.html>

Bonus Graph Algorithm: Topological Sort

Topological sort is an algorithm that is specific to Directed Acyclic Graphs.

We mentioned before that DAGs are often used for managing dependencies (e.g. this software package requires these others first) or steps in a data pipeline.

Running a topological sort on a graph gives a list of vertices in an acceptable order, such that each dependent node is only visited after all of its dependencies have been satisfied. (Emphasis on an, many such orderings may exist.)

DFS in psuedocode:

We need a list TL, and a function visit.

We will also define three “colors” for our nodes.

  • White: Unvisited
  • Blue: Temporary Mark
  • Black: Permanent Mark
while there are unvisited nodes:
    select a random white/unmarked node -> n
    visit(n)
    
def visit(node):
    if n is black/visited:
        return
    if n is blue/temporary:
        abort -- the graph has cycles and cannot be resolved
        
    mark n blue
    
    for each node m with an edge from n to m:
        visit(m)
    
    mark n black (removing blue mark)
    add n to head of TL

(this is essentially a Depth First Search with some extra work to detect cycles)

import networkx as nx
dg = nx.DiGraph((["scrape data A", "clean data A"],
                 ["clean data A", "merge data"],
                 ["merge data", "import to SQL"],
                 ["setup database", "import to SQL"],
                 ["import to SQL", "generate queries"],
                 ["generate queries", "data viz"],
                 ["data viz", "presentation"],
                 ["collect data B", "clean data B"],
                 ["clean data B", "merge data"],
                 ["plan data viz", "data viz"],
                ))
nx.draw(dg, with_labels=True, node_color="grey")

import random

def visit(name, existing_path):
    data = dg.nodes(data=True)[name]
    if data["color"] == "BLACK":
        return
    elif data["color"] == "BLUE":
        raise Exception("Cycle Detected!")
    
    data["color"] = "BLUE"
    
    # DFS-like visit to all connected nodes before processing this node
    for m in dg.predecessors(name):
        visit(m, existing_path)
    
    data["color"] = "BLACK"
    # prepend to path
    existing_path.insert(0, name)
    
def topo_sort(graph):
    ordering = []
    
    # add color to each node
    for d in graph.nodes.data():
        d[1]["color"] = "WHITE"
    
    # while there are unvisited nodes
    while len(ordering) < len(graph.nodes):
        # unvisited nodes
        unvisited = [name for name, data in graph.nodes.items() if data["color"] == "WHITE"]
        visit(random.choice(unvisited), ordering)
    return ordering[::-1]
for node in topo_sort(dg):
    print(node)
scrape data A
plan data viz
clean data A
collect data B
clean data B
merge data
setup database
import to SQL
generate queries
data viz
presentation

Further Exploration