15  Data Visualization


Goals


Why Data Visualization?

We have come a long way, gathering and manipulating data, and now we have more data than we know what to do with.

Whether we are using this data to make a decision, build an argument, or learn something new, it will be important to understand it.

Four Samples

What sense can we make of this data?

x1      y1      x2      y2      x3      y3      x4      y4
10.0    8.04    10.0    9.14    10.0    7.46    8.0     6.58
8.0     6.95    8.0     8.14    8.0     6.77    8.0     5.76
13.0    7.58    13.0    8.74    13.0    12.74   8.0     7.71
9.0     8.81    9.0     8.77    9.0     7.11    8.0     8.84
11.0    8.33    11.0    9.26    11.0    7.81    8.0     8.47
14.0    9.96    14.0    8.10    14.0    8.84    8.0     7.04
6.0     7.24    6.0     6.13    6.0     6.08    8.0     5.25
4.0     4.26    4.0     3.10    4.0     5.39    19.0    12.50
12.0    10.84   12.0    9.13    12.0    8.15    8.0     5.56
7.0     4.82    7.0     7.26    7.0     6.42    8.0     7.91
5.0     5.68    5.0     4.74    5.0     5.73    8.0     6.89

We can examine summary statistics, which are identical (to the precision shown) across all four samples:

  • Mean of x: 9
  • Variance of x: 11
  • Mean of y: 7.50
  • Variance of y: 4.125 (±0.003)
  • Correlation between x and y: 0.816
  • Linear regression: y = 3.00 + 0.500x
  • R² coefficient: 0.67

Are we missing anything?








Anscombe’s quartet

This is known as Anscombe’s quartet, and demonstrates the value of data visualization as a tool for understanding data.

Benefits of Visualizing Data

Visualizing data is particularly useful for:

  • Understanding the “shape of the data” and any clusters or outliers.
  • Discovering interesting questions to ask… “Why is it like that?”
  • Leveraging our capacity for pattern recognition and intuition.

Exploratory Visualization

Exploratory visualization aims to deepen our understanding of the data we’re working with.

These visualizations may only exist for their creator’s benefit, or they may be shared more widely among a team. As we can see with Anscombe’s Quartet, some insights are much easier to access with appropriate visualization.

Preparing Data for Visualization

Most data visualization libraries expect your data to be in a “tidy” format, which follows three constraints:

  • Each variable is a column; each column is a variable.
  • Each observation is a row; each row is an observation.
  • Each value is a cell; each cell is a single value.

In pure Python terms, this is often expressed as a list of dicts, NamedTuples, or dataclass instances.

You’ll also recognize this pattern from CSV files, pandas and polars DataFrames, and relational databases.
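
For example, a tidy dataset as a plain list of dicts (hypothetical employee records) looks like:

```python
# each dict is one observation (row); each key is a variable (column);
# each value is a single cell
records = [
    {"employee_id": 101, "name": "Alice", "year": 2023,
     "role": "Senior Engineer", "salary": 80000},
    {"employee_id": 101, "name": "Alice", "year": 2024,
     "role": "Team Lead", "salary": 90000},
    {"employee_id": 102, "name": "Bob", "year": 2023,
     "role": "Engineer", "salary": 70000},
    {"employee_id": 102, "name": "Bob", "year": 2024,
     "role": "Senior Engineer", "salary": 80000},
]
```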

Non-tidy format (wide)

Employee ID   Name    Role_2023         Salary_2023   Role_2024         Salary_2024
101           Alice   Senior Engineer   80000         Team Lead         90000
102           Bob     Engineer          70000         Senior Engineer   80000

Tidy

Employee ID   Name    Year   Role              Salary
101           Alice   2023   Senior Engineer   80000
101           Alice   2024   Team Lead         90000
102           Bob     2023   Engineer          70000
102           Bob     2024   Senior Engineer   80000
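
Reshaping from wide to tidy is a short pandas operation; here is a sketch using `pd.wide_to_long` (column names are written without spaces for convenience):

```python
import pandas as pd

wide = pd.DataFrame({
    "EmployeeID": [101, 102],
    "Name": ["Alice", "Bob"],
    "Role_2023": ["Senior Engineer", "Engineer"],
    "Salary_2023": [80000, 70000],
    "Role_2024": ["Team Lead", "Senior Engineer"],
    "Salary_2024": [90000, 80000],
})

# gather the Role_YYYY / Salary_YYYY columns into one row
# per (employee, year) observation
tidy = pd.wide_to_long(
    wide,
    stubnames=["Role", "Salary"],
    i=["EmployeeID", "Name"],
    j="Year",
    sep="_",
).reset_index()

print(tidy)
```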

Visualizing Distributions

Note

Examples in this section come from the Altair Gallery.

If we are trying to understand the distribution of a variable we might turn to the:

histogram

Code
import altair as alt
from vega_datasets import data

source = data.movies.url

alt.Chart(source).mark_bar().encode(
    alt.X("IMDB_Rating:Q", bin=True),
    y='count()',
)

box & whisker plots

Code
import altair as alt
from vega_datasets import data

source = data.population.url

alt.Chart(source).mark_boxplot(extent='min-max').encode(
    x='age:O',
    y='people:Q'
)

or violin plots.

Code
import altair as alt
from vega_datasets import data

alt.Chart(data.cars(), width=100).transform_density(
    'Miles_per_Gallon',
    as_=['Miles_per_Gallon', 'density'],
    extent=[5, 50],
    groupby=['Origin']
).mark_area(orient='horizontal').encode(
    alt.X('density:Q')
        .stack('center')
        .impute(None)
        .title(None)
        .axis(labels=False, values=[0], grid=False, ticks=True),
    alt.Y('Miles_per_Gallon:Q'),
    alt.Color('Origin:N'),
    alt.Column('Origin:N')
        .spacing(0)
        .header(titleOrient='bottom', labelOrient='bottom', labelPadding=0)
).configure_view(
    stroke=None
)

Visualizing Relationships

If instead, we are trying to understand the relationship between two (or more) variables we turn to:

scatter plots

Code
import altair as alt
from vega_datasets import data

source = data.cars()

alt.Chart(source).mark_circle(size=60).encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin',
    tooltip=['Name', 'Origin', 'Horsepower', 'Miles_per_Gallon']
).interactive()

If we have lots of observations we may use density plots, like the heatmap:

Code
import altair as alt
import numpy as np
import pandas as pd

# Compute x^2 + y^2 across a 2D grid
x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2

# Convert this grid to columnar data expected by Altair
source = pd.DataFrame({'x': x.ravel(),
                       'y': y.ravel(),
                       'z': z.ravel()})

alt.Chart(source).mark_rect().encode(
    x='x:O',
    y='y:O',
    color='z:Q'
)

Interactive Exploratory Visualizations

Many visualization libraries provide the ability to interact with the data:

  • select subsets of data
  • hover/click for additional context
  • pan & zoom within large visualizations

These are particularly well-suited for exploration, where instead of re-drawing the graphic each time you want to see a slightly different view, you can explore in your browser.

Explanatory Visualizations

If an exploratory visualization is a visualization for your understanding, an explanatory visualization can be thought of as a visualization for everyone else.

Explanatory visualizations may instead aim to:

  • highlight interesting findings
  • tell stories
  • present a thesis
  • persuade
  • inspire action

Explanatory visualizations use the same underlying charts, but are typically much more refined.

When you are exploring the data:

  • dozens of quick draft visualizations
  • little need to focus on color choices & design elements beyond making it intelligible
  • getting answers to questions you have about the data

When you are explaining data:

  • each visualization carefully chosen
  • labeling and design are essential, since your audience has not spent hours poring over the data
  • either answer a question the audience has or guide audience to ask a specific question

For each visualization you may ask:

  • What is the story I want to tell?
  • Who is the audience I’m trying to reach?
  • What kinds of mistakes might my audience make when interpreting my visualization?

Keep your “story” and audience in mind

There’s a good chance you have a lot of interesting data, but data visualization is rarely the place to show it off.

By keeping your story or key question in mind, you can usually find 1-3 visualizations that tell the story best.

John Snow’s 1854 map of Cholera cases clearly implicated a specific water pump in a single graphic:

The “hockey stick” graph rang alarm bells for many regarding the severity of climate change when visualized against 1000 years of climate variation:

Simpler is usually better

It is easy to get lost in making all kinds of wild visualizations modern tools offer: a network analysis graph, a 3D scatter plot, a map with multiple translucent layers showing ten different variables overlaid.

Again, with your audience in mind, consider that most people can interpret a simple bar graph or line graph fairly well, but a radar chart or hierarchical treemap is better suited to a more technical audience.

Even with technical audiences, having a clear & simple graph that shows the key message is typically better than a complex graph that attempts to show everything in the data. You can always have a secondary visualization that shows more detail for those that wish to explore further.

If you aren’t visualizing particle collisions, don’t let your charts look like this.

Reduce “chart junk”

Chart junk is any visual element that is not directly related to the data. This includes:

  • 3D effects & shadows
  • Gradients
  • All extraneous lines including grid lines, tick marks, and axes
  • Unnecessary labels

Once you begin this process, you realize just how much stuff on a chart doesn’t add any value.

Edward Tufte originated this term; his book “The Visual Display of Quantitative Information” is a great resource.

Pick an appropriate chart for your data

It is important to consider the type of data you have:

Quantitative data represents intervals or ratios. It can be expressed numerically, and operations like addition and subtraction have a logical meaning.

Nominal data represents categories. Some categories may be numeric (e.g. zip codes or jersey numbers) but still function as nominal.

  • Quantitative / Quantitative: Line, Area, Scatter, Bubble, Heatmaps
  • Quantitative / Nominal: Bar Chart, Histogram, Strip Plot, Pie/Radial Charts
  • Nominal / Nominal: Sankey Diagram, Mosaic Plot

The advice on choosing between graphs is the same as discussed in the exploratory visualization section.

Use color effectively

Color can be used to make elements stand out from one another, but doing so can also lead to confusion.

It may be tempting to use a different color for each bar in a bar chart, but consider if there are ways to use the colors instead in a meaningful way.

Colors are a great way to group similar data, or highlight outliers or important values.

Also keep in mind that about 4% of the population is colorblind. If you are using color as the only way to differentiate between elements, you may want to consider using a different method as well.

The most common form of colorblindness is red-green colorblindness, so you may want to avoid using red and green together.

(Consider how many charts use red for negative values and green for positive values anyway!)

At the very least, take a look at your visualization with a colorblind filter to see if your point is still clear.

Visualizing Geospatial Data

The two most common kinds of geospatial visualization are your typical points-on-a-map:

Code
import altair as alt
from vega_datasets import data

states = alt.topo_feature(data.us_10m.url, feature='states')
airports = data.airports()

background = alt.Chart(states).mark_geoshape(
    fill='lightgray',
    stroke='white'
).project('albersUsa').properties(
    width=500,
    height=300
)

points = alt.Chart(airports).mark_circle().encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    size=alt.value(10),
    tooltip='name'
)

background + points

Or choropleths, which shade areas based on a nominal or quantitative value:

Code
import altair as alt
from vega_datasets import data

states = alt.topo_feature(data.us_10m.url, 'states')
pop = data.population_engineers_hurricanes()

alt.Chart(states).mark_geoshape().encode(
    color='population:Q'
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(pop, 'id', list(pop.columns))
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)

Discouraged Charts

Most data visualization guides will strongly discourage the use of the pie chart:

Code
import altair as alt
import pandas as pd

category = ['Sky', 'Shady side of a pyramid', 'Sunny side of a pyramid']
color = ["#416D9D", "#674028", "#DEAC58"]
df = pd.DataFrame({'category': category, 'value': [75, 10, 15]})

alt.Chart(df, width=150, height=150).mark_arc(outerRadius=80).encode(
    alt.Theta('value:Q').scale(range=[2.356, 8.639]),
    alt.Color('category:N')
        .title(None)
        .scale(domain=category, range=color)
        .legend(orient='none', legendX=160, legendY=50),
    order='value:Q'
).configure_view(
    strokeOpacity=0
)

If you do use pie charts remember:

  • Direct comparison of segments is very difficult at n > 2.
  • Only appropriate when most important information is ratio between sizes, and you have relatively few categories.
  • They must add up to 100%.

(from https://www.storytellingwithdata.com/blog/2020/5/14/what-is-a-pie-chart)

Word Cloud

Word clouds too are rarely the right tool for the job.

(Figure: a chart derived from the same data as the word cloud.)

Source: Nieman Lab: Word Clouds Considered Harmful

Importance of Critique

If you’ve spent hours or days honing your data, you understand it better than nearly anyone on Earth.

Make sure that your visualizations are clear to those encountering them for the first time, or at least with the expertise you expect from your audience.

Ask a friend or colleague to interpret your data without any explanation, and see if they can make sense of it. If they take a long time, or come away with the wrong impression, that’s a good sign that you should adjust your approach.

Remember: there’s not much value in a visualization that itself needs explanation.

Data Viz in Python

There are dozens of data visualization libraries in Python.

If you plan to do a significant amount of visualization, this is an area where you will likely want to explore a bit and pick a library that suits your needs and fits your preferred way of thinking about visualization.

Matplotlib

Matplotlib is the OG of Python plotting libraries, dating back to 2003.

Compared to more modern libraries it is much less intuitive, and requires data to be split into columns. The library does not have built-in support for DataFrames or other common data formats supported by other libraries.

The library itself is primarily meant for making one-off visualizations with a heavy reliance on global state which can make it hard to use.

Given the plethora of much nicer options available, I would only recommend matplotlib if you are throwing together quick exploratory visualizations or using a very specific visualization that other libraries do not support.

# example from https://matplotlib.org/stable/gallery/lines_bars_and_markers/bar_label_demo.html#sphx-glr-gallery-lines-bars-and-markers-bar-label-demo-py
import matplotlib.pyplot as plt
import numpy as np

species = ('Adelie', 'Chinstrap', 'Gentoo')
sex_counts = {
    'Male': np.array([73, 34, 61]),
    'Female': np.array([73, 34, 58]),
}
width = 0.6  # the width of the bars: can also be len(x) sequence


fig, ax = plt.subplots()
bottom = np.zeros(3)

for sex, sex_count in sex_counts.items():
    p = ax.bar(species, sex_count, width, label=sex, bottom=bottom)
    bottom += sex_count

    ax.bar_label(p, label_type='center')

ax.set_title('Number of penguins by sex')
ax.legend()

# you'll notice this method does not take parameters and depends on global state
plt.show()

Seaborn

Seaborn is a wrapper around matplotlib that improves the overall API as well as styling.

It allows you to stick with the matplotlib ecosystem, but from a much better starting point.

# from https://seaborn.pydata.org/examples/different_scatter_variables.html
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")

# Load the example diamonds dataset
diamonds = sns.load_dataset("diamonds")

# Draw a scatter plot while assigning point colors and sizes to different
# variables in the dataset
f, ax = plt.subplots(figsize=(6.5, 6.5))
sns.despine(f, left=True, bottom=True)
clarity_ranking = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]
sns.scatterplot(x="carat", y="price",
                hue="clarity", size="depth",
                palette="ch:r=-.2,d=.3_r",
                hue_order=clarity_ranking,
                sizes=(1, 8), linewidth=0,
                data=diamonds, ax=ax)

This uses an approach inspired by the grammar of graphics, a way of thinking about data visualization popularized by R’s ggplot2.

Altair

Altair takes a full grammar of graphics approach and replaces the underlying matplotlib with Vega-Lite. Vega-Lite is a JavaScript-based tool, which means that Altair can render both static images like matplotlib but also interactive graphics.

# from Altair gallery
import altair as alt
import pandas as pd

source = pd.DataFrame({
    'a': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
    'b': [28, 55, 43, 91, 81, 53, 19, 87, 52]
})

alt.Chart(source).mark_bar().encode(
    x='a',
    y='b'
)

plotnine

plotnine is a grammar of graphics approach that aims to more closely mimic ggplot. If you are coming from R you may find its syntax preferable.

from plotnine import ggplot, geom_point, aes, stat_smooth, facet_wrap
from plotnine.data import mtcars

(
    ggplot(mtcars, aes("wt", "mpg", color="factor(gear)"))
    + geom_point()
    + stat_smooth(method="lm")
    + facet_wrap("gear")
)

Interactive Data Viz

As discussed previously, interactive visualizations are most useful for exploratory visualization. Being able to run the code once and obtain different views of your data is a great productivity boost.

All of the libraries above can render dynamic charts in a Jupyter notebook, though for some an additional plugin is needed.

When it comes to explanatory visualizations, interactives should be used sparingly, for the same reasons that 1-3 visualizations are often more useful than dozens: your goal is to guide the audience towards a given point, and additional options & views of the data may confuse or overwhelm. Most interactive tools either offer less control over exact layout, or are significantly more complex than their static peers.

Another practical consideration is that most interactive data visualizations today run in the web browser, which requires JavaScript.

This does not mean we cannot use Python, as we saw above with altair, some Python libraries can generate interactives by way of JavaScript. It will however mean an additional layer of complexity between code & final product which can make debugging more difficult.

Altair w/ Vega-Lite

An example of what can be done with just a bit of Altair:

import altair as alt
from vega_datasets import data

cars = data.cars()
interval = alt.selection_interval(empty=False)

alt.Chart(cars).mark_point().encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
    color=alt.condition(interval, 'Origin', alt.value('lightgray'))
).add_params(
    interval
)

Plotly and Dash

Plotly is an interactive visualization JavaScript library that has bindings for Python, R, JavaScript, Julia, Matlab, and more.

Note: Unlike everything else we’ve mentioned so far, Plotly is also a company selling a product. The libraries we’re talking about are open source, but the company does offer paid services and upsells.

Plotly is well suited to building interactives, and Dash takes this a step further, allowing you to build full web applications (mainly targeted at dashboards).

Take a look at prior years’ projects for some examples of what you can do with plotly and dash.

Further Exploration

Visualization Libraries:

  • matplotlib - The grandfather of Python plotting libraries. It’s very flexible, but it’s not the easiest to use or make visually appealing.
  • seaborn - Seaborn is a library that builds on top of matplotlib to make it easier to create beautiful plots.
  • plotly - Plotly allows creating interactive plots.
  • dash - Dash is a framework that allows building interactive web applications & dashboards.
  • Altair - Altair is a declarative plotting library that makes it easy to create beautiful plots.
  • plotnine - Plotnine is a Python port of the R ggplot2 library.

Additional Visualization Libraries:

  • bokeh - Creates interactive plots that can be embedded in web pages.
  • folium - Another option for map visualization.
  • geoplot - Another option for map visualization.

Colorblindness Tools: