Goals
We have come a long way, gathering and manipulating data, and now we have more data than we know what to do with.
Whether we are using this data to make a decision, to support an argument, or to learn something, it will be important to understand the data.
What sense can we make of this data?
x1 | y1 | x2 | y2 | x3 | y3 | x4 | y4 |
---|---|---|---|---|---|---|---|
10.0 | 8.04 | 10.0 | 9.14 | 10.0 | 7.46 | 8.0 | 6.58 |
8.0 | 6.95 | 8.0 | 8.14 | 8.0 | 6.77 | 8.0 | 5.76 |
13.0 | 7.58 | 13.0 | 8.74 | 13.0 | 12.74 | 8.0 | 7.71 |
9.0 | 8.81 | 9.0 | 8.77 | 9.0 | 7.11 | 8.0 | 8.84 |
11.0 | 8.33 | 11.0 | 9.26 | 11.0 | 7.81 | 8.0 | 8.47 |
14.0 | 9.96 | 14.0 | 8.10 | 14.0 | 8.84 | 8.0 | 7.04 |
6.0 | 7.24 | 6.0 | 6.13 | 6.0 | 6.08 | 8.0 | 5.25 |
4.0 | 4.26 | 4.0 | 3.10 | 4.0 | 5.39 | 19.0 | 12.50 |
12.0 | 10.84 | 12.0 | 9.13 | 12.0 | 8.15 | 8.0 | 5.56 |
7.0 | 4.82 | 7.0 | 7.26 | 7.0 | 6.42 | 8.0 | 7.91 |
5.0 | 5.68 | 5.0 | 4.74 | 5.0 | 5.73 | 8.0 | 6.89 |
We can examine summary statistics:
| | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
|---|---|---|---|---|
Mean of x | 9 | 9 | 9 | 9 |
Variance of x | 11 | 11 | 11 | 11 |
Mean of y | 7.50 | 7.50 | 7.50 | 7.50 |
Variance of y (±0.003) | 4.125 | 4.125 | 4.125 | 4.125 |
Correlation x & y | 0.816 | 0.816 | 0.816 | 0.816 |
Linear Regression | y = 3.00 + 0.500x | y = 3.00 + 0.500x | y = 3.00 + 0.500x | y = 3.00 + 0.500x |
R² coefficient | 0.67 | 0.67 | 0.67 | 0.67 |
Are we missing anything?
This is known as Anscombe’s quartet, and demonstrates the value of data visualization as a tool for understanding data.
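To see why the table alone is misleading, it helps to actually plot the four samples. Here is a minimal sketch (not one of the course's examples), assuming the `anscombe` dataset bundled with vega_datasets, with columns Series, X, and Y:

import altair as alt
from vega_datasets import data

# assumption: vega_datasets ships Anscombe's quartet as `anscombe`
# with columns Series, X, and Y
source = data.anscombe()

alt.Chart(source).mark_circle(size=60).encode(
    x='X:Q',
    y='Y:Q',
).facet(column='Series:N')

The four scatter plots look nothing alike, despite the identical summary statistics.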
Visualizing data is particularly useful for:
Exploratory visualization aims to deepen our understanding of the data we’re working with.
These visualizations may only exist for their creator’s benefit, or they may be shared more widely among a team. As we can see with Anscombe’s Quartet, some insights are much easier to access with appropriate visualization.
Most data visualization libraries expect your data to be in a “tidy” format:
This means it follows three constraints:
- each variable forms a column,
- each observation forms a row,
- each type of observational unit forms its own table.
In pure Python terms, this is often expressed as a list of `dict`, `NamedTuple`, or `dataclass` instances.
You’ll also recognize this pattern is helpful in CSV files, pandas or polars DataFrames, and relational databases.
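For illustration only (these records mirror the tidy table below), the list-of-records idea with a dataclass might look like this:

from dataclasses import dataclass

# each field is a variable; each instance is one observation (one row)
@dataclass
class SalaryRecord:
    employee_id: int
    name: str
    year: int
    role: str
    salary: int

records = [
    SalaryRecord(101, "Alice", 2023, "Senior Engineer", 80000),
    SalaryRecord(101, "Alice", 2024, "Team Lead", 90000),
]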
Non-tidy format (wide)
Employee ID | Name | Role_2023 | Salary_2023 | Role_2024 | Salary_2024 |
---|---|---|---|---|---|
101 | Alice | Senior Engineer | 80000 | Team Lead | 90000 |
102 | Bob | Engineer | 70000 | Senior Engineer | 80000 |
Tidy
Employee ID | Name | Year | Role | Salary |
---|---|---|---|---|
101 | Alice | 2023 | Senior Engineer | 80000 |
101 | Alice | 2024 | Team Lead | 90000 |
102 | Bob | 2023 | Engineer | 70000 |
102 | Bob | 2024 | Senior Engineer | 80000 |
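Getting from the wide table to the tidy one is a reshaping step. Here is a hedged sketch with pandas, assuming the exact column names shown above (polars or plain Python would work just as well):

import pandas as pd

# the wide-format table from above
wide = pd.DataFrame({
    "Employee ID": [101, 102],
    "Name": ["Alice", "Bob"],
    "Role_2023": ["Senior Engineer", "Engineer"],
    "Salary_2023": [80000, 70000],
    "Role_2024": ["Team Lead", "Senior Engineer"],
    "Salary_2024": [90000, 80000],
})

# wide_to_long splits each "Role_<year>" / "Salary_<year>" pair into
# Role and Salary columns keyed by a new Year column
tidy = pd.wide_to_long(
    wide,
    stubnames=["Role", "Salary"],
    i=["Employee ID", "Name"],
    j="Year",
    sep="_",
).reset_index()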
Examples in this section come from the Altair Gallery.
If we are trying to understand the distribution of a variable we might turn to the:
histogram
import altair as alt
from vega_datasets import data

source = data.movies.url

alt.Chart(source).mark_bar().encode(
    alt.X("IMDB_Rating:Q", bin=True),
    y='count()',
)
box & whisker plots
import altair as alt
from vega_datasets import data

source = data.population.url

alt.Chart(source).mark_boxplot(extent='min-max').encode(
    x='age:O',
    y='people:Q'
)
or violin plots.
import altair as alt
from vega_datasets import data

alt.Chart(data.cars(), width=100).transform_density(
    'Miles_per_Gallon',
    as_=['Miles_per_Gallon', 'density'],
    extent=[5, 50],
    groupby=['Origin']
).mark_area(orient='horizontal').encode(
    alt.X('density:Q')
        .stack('center')
        .impute(None)
        .title(None)
        .axis(labels=False, values=[0], grid=False, ticks=True),
    alt.Y('Miles_per_Gallon:Q'),
    alt.Color('Origin:N'),
    alt.Column('Origin:N')
        .spacing(0)
        .header(titleOrient='bottom', labelOrient='bottom', labelPadding=0)
).configure_view(
    stroke=None
)
If, instead, we are trying to understand the relationship between two (or more) variables, we turn to:
scatter plots
import altair as alt
from vega_datasets import data

source = data.cars()

alt.Chart(source).mark_circle(size=60).encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin',
    tooltip=['Name', 'Origin', 'Horsepower', 'Miles_per_Gallon']
).interactive()
If we have lots of observations we may use density plots, like the heatmap:
import altair as alt
import numpy as np
import pandas as pd

# Compute x^2 + y^2 across a 2D grid
x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2

# Convert this grid to columnar data expected by Altair
source = pd.DataFrame({'x': x.ravel(),
                       'y': y.ravel(),
                       'z': z.ravel()})

alt.Chart(source).mark_rect().encode(
    x='x:O',
    y='y:O',
    color='z:Q'
)
Many visualization libraries provide the ability to interact with the data:
These are particularly well-suited for exploration, where instead of re-drawing the graphic each time you want to see a slightly different view, you can explore in your browser.
If an exploratory visualization is a visualization for your understanding, an explanatory visualization can be thought of as a visualization for everyone else.
Explanatory visualizations may instead aim to:
Explanatory visualizations use the same underlying charts, but are typically much more refined.
When you are exploring the data:
When you are explaining data:
For each visualization you may ask:
There’s a good chance you have a lot of interesting data, but data visualization is rarely the place to show it off.
By keeping your story or key question in mind, you can usually find 1-3 visualizations that tell the story best.
John Snow’s 1854 map of cholera cases clearly implicated a specific water pump in a single graphic:
The “hockey stick” graph rang alarm bells for many regarding the severity of climate change when visualized against 1000 years of climate variation:
It is easy to get lost in making all kinds of wild visualizations that modern tools offer: a network analysis graph, a 3D scatter plot, a map with multiple translucent layers showing ten different variables overlaid.
Again, with your audience in mind, consider that most people can interpret a simple bar graph or line graph fairly well, while a radar chart or hierarchical treemap is better suited to a more technical audience.
Even with technical audiences, having a clear & simple graph that shows the key message is typically better than a complex graph that attempts to show everything in the data. You can always have a secondary visualization that shows more detail for those that wish to explore further.
If you aren’t visualizing particle collisions, don’t let your charts look like this.
Chart junk is any visual element that is not directly related to the data. This includes:
Once you begin this process, you realize just how much stuff on a chart doesn’t add any value.
Edward Tufte originated this term; his book “The Visual Display of Quantitative Information” is a great resource.
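To make this concrete, here is a small sketch (not from the gallery) of stripping common non-data ink from an Altair bar chart; which elements you keep is a judgment call:

import altair as alt
import pandas as pd

# illustrative data
source = pd.DataFrame({"label": ["A", "B", "C"], "value": [4, 7, 3]})

alt.Chart(source).mark_bar().encode(
    x="label:N",
    y="value:Q",
).configure_axis(
    grid=False,     # drop background gridlines
    ticks=False,    # drop tick marks
).configure_view(
    stroke=None,    # drop the outer chart border
)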
It is important to consider the type of data you have:
Quantitative data represents intervals or ratios. It can be expressed numerically, and operations like addition and subtraction have a logical meaning.
Nominal data represents categories. Some categories may be represented by numbers (e.g. ID codes) but still function as nominal.
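In Altair this distinction is exactly what the `:Q` and `:N` shorthands in the examples above express. A small sketch with made-up data, showing how the same numeric column reads differently under each type:

import altair as alt
import pandas as pd

# hypothetical data: region codes are numbers but act as categories
source = pd.DataFrame({"region_code": [1, 2, 3, 1, 2, 3],
                       "sales": [5, 3, 6, 4, 2, 7]})

# quantitative: a continuous axis where 2 is "between" 1 and 3
quantitative = alt.Chart(source).mark_circle().encode(x="region_code:Q", y="sales:Q")

# nominal: each code gets its own discrete slot, with no implied order or distance
nominal = alt.Chart(source).mark_circle().encode(x="region_code:N", y="sales:Q")

quantitative | nominal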
The advice on choosing between graphs is the same as discussed in the exploratory visualization section.
Color can be used to make elements stand out from one another, but doing so can also lead to confusion.
It may be tempting to use a different color for each bar in a bar chart, but consider if there are ways to use the colors instead in a meaningful way.
Colors are a great way to group similar data, or highlight outliers or important values.
Also keep in mind that about 4% of the population is colorblind. If you are using color as the only way to differentiate between elements, you may want to consider using a different method as well.
The most common form of colorblindness is red-green colorblindness, so you may want to avoid using red and green together.
(Consider how many charts use red for negative values and green for positive values anyway!)
At the very least, take a look at your visualization with a colorblind filter to see if your point is still clear.
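One option, sketched here rather than prescribed, is to supply an explicitly colorblind-friendly palette (the Okabe-Ito colors) instead of relying on the default category colors:

import altair as alt
from vega_datasets import data

# Okabe-Ito palette, designed to remain distinguishable under the
# common forms of colorblindness
okabe_ito = ["#E69F00", "#56B4E9", "#009E73", "#F0E442",
             "#0072B2", "#D55E00", "#CC79A7", "#000000"]

alt.Chart(data.cars()).mark_circle(size=60).encode(
    x="Horsepower:Q",
    y="Miles_per_Gallon:Q",
    color=alt.Color("Origin:N").scale(range=okabe_ito),
)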
The two most common kinds of geospatial visualization are your typical points-on-a-map:
import altair as alt
from vega_datasets import data

states = alt.topo_feature(data.us_10m.url, feature='states')
airports = data.airports()

background = alt.Chart(states).mark_geoshape(
    fill='lightgray',
    stroke='white'
).project('albersUsa').properties(
    width=500,
    height=300
)

points = alt.Chart(airports).mark_circle().encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    size=alt.value(10),
    tooltip='name'
)

background + points
Or choropleths, which shade areas based on a nominal or quantitative value:
import altair as alt
from vega_datasets import data

states = alt.topo_feature(data.us_10m.url, 'states')
pop = data.population_engineers_hurricanes()

variable_list = ['population', 'engineers', 'hurricanes']

alt.Chart(states).mark_geoshape().encode(
    color='population:Q'
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(pop, 'id', list(pop.columns))
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)
Most data visualization guides will strongly discourage the use of the pie chart:
import altair as alt
import pandas as pd

category = ['Sky', 'Shady side of a pyramid', 'Sunny side of a pyramid']
color = ["#416D9D", "#674028", "#DEAC58"]
df = pd.DataFrame({'category': category, 'value': [75, 10, 15]})

alt.Chart(df, width=150, height=150).mark_arc(outerRadius=80).encode(
    alt.Theta('value:Q').scale(range=[2.356, 8.639]),
    alt.Color('category:N')
        .title(None)
        .scale(domain=category, range=color)
        .legend(orient='none', legendX=160, legendY=50),
    order='value:Q'
).configure_view(
    strokeOpacity=0
)
If you do use pie charts remember:
(from https://www.storytellingwithdata.com/blog/2020/5/14/what-is-a-pie-chart)
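For comparison, here is a sketch of the usual “just use a bar chart” advice, re-plotting the same three values from the pie example above:

import altair as alt
import pandas as pd

# same values as the pie chart above
category = ['Sky', 'Shady side of a pyramid', 'Sunny side of a pyramid']
color = ["#416D9D", "#674028", "#DEAC58"]
df = pd.DataFrame({'category': category, 'value': [75, 10, 15]})

alt.Chart(df).mark_bar().encode(
    x='value:Q',
    y=alt.Y('category:N').sort('-x').title(None),
    color=alt.Color('category:N').scale(domain=category, range=color).legend(None),
)

Comparing lengths along a common axis is much easier than comparing angles.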
Word Cloud
Word clouds, too, are rarely the right tool for the job.
Derived from the same data as the word cloud.
If you’ve spent hours or days honing your data, you understand it better than nearly anyone on Earth.
Make sure that your visualizations are clear to those encountering them for the first time, or at least with the expertise you expect from your audience.
Ask a friend or colleague to interpret your data without any explanation, and see if they can make sense of it. If they take a long time, or come away with the wrong impression, that’s a good sign that you should adjust your approach.
Remember: there’s not a ton of value in a visualization that itself needs explanation.
There are dozens of data visualization libraries in Python.
If you plan to do a significant amount of visualization, this is an area where you will likely want to explore a bit and pick a library that is suitable for your needs and fits your preferred way of thinking about visualization.
Matplotlib is the OG, nearly as old as Python itself.
Compared to more modern libraries it is much less intuitive, and it requires data to be split into columns; it does not have built-in support for DataFrames or the other common data formats that newer libraries accept.
The library itself is primarily meant for making one-off visualizations, with a heavy reliance on global state that can make it hard to use.
Given the plethora of much nicer options available, I would only recommend `matplotlib` if you are throwing together quick exploratory visualizations or using a very specific visualization that other libraries do not support.
# example from https://matplotlib.org/stable/gallery/lines_bars_and_markers/bar_label_demo.html#sphx-glr-gallery-lines-bars-and-markers-bar-label-demo-py
import matplotlib.pyplot as plt
import numpy as np

species = ('Adelie', 'Chinstrap', 'Gentoo')
sex_counts = {
    'Male': np.array([73, 34, 61]),
    'Female': np.array([73, 34, 58]),
}
width = 0.6  # the width of the bars: can also be len(x) sequence

fig, ax = plt.subplots()
bottom = np.zeros(3)

for sex, sex_count in sex_counts.items():
    p = ax.bar(species, sex_count, width, label=sex, bottom=bottom)
    bottom += sex_count

    ax.bar_label(p, label_type='center')

ax.set_title('Number of penguins by sex')
ax.legend()

# you'll notice this method does not take parameters and depends on global state
plt.show()
Seaborn is a wrapper around `matplotlib` that improves the overall API as well as styling. It allows you to stick with the `matplotlib` ecosystem, but from a much better starting point.
# from https://seaborn.pydata.org/examples/different_scatter_variables.html
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")

# Load the example diamonds dataset
diamonds = sns.load_dataset("diamonds")

# Draw a scatter plot while assigning point colors and sizes to different
# variables in the dataset
f, ax = plt.subplots(figsize=(6.5, 6.5))
sns.despine(f, left=True, bottom=True)
clarity_ranking = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]
sns.scatterplot(x="carat", y="price",
                hue="clarity", size="depth",
                palette="ch:r=-.2,d=.3_r",
                hue_order=clarity_ranking,
                sizes=(1, 8), linewidth=0,
                data=diamonds, ax=ax)
This uses an approach inspired by the grammar of graphics, a way of thinking about data visualization that originated with R’s `ggplot`.
Altair takes a full grammar of graphics approach and replaces the underlying `matplotlib` with Vega-Lite. Vega-Lite is a JavaScript-based tool, which means that Altair can render not only static images like `matplotlib`, but also interactive graphics.
# from Altair gallery
import altair as alt
import pandas as pd

source = pd.DataFrame({
    'a': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
    'b': [28, 55, 43, 91, 81, 53, 19, 87, 52]
})

alt.Chart(source).mark_bar().encode(
    x='a',
    y='b'
)
plotnine is a grammar of graphics library that aims to more closely mimic `ggplot`. If you are coming from R you may find its syntax preferable.
from plotnine import ggplot, geom_point, aes, stat_smooth, facet_wrap
from plotnine.data import mtcars

(
    ggplot(mtcars, aes("wt", "mpg", color="factor(gear)"))
    + geom_point()
    + stat_smooth(method="lm")
    + facet_wrap("gear")
)
As discussed previously, interactive visualizations are most useful for exploratory visualization. Being able to run the code once and obtain different views of your data is a great productivity boost.
All of the libraries above can render dynamic charts in a Jupyter notebook, though for some an additional plugin is needed.
When it comes to explanatory visualizations, interactives should be used sparingly. This is for the same reason that 1-3 visualizations are often more useful than dozens: your goal is to guide the audience towards a given point, and the additional options & views of the data may confuse or overwhelm. Most interactive tools also either offer less strict control over exact layout, or are significantly more complex than their static peers.
Another practical consideration is that most interactive data visualizations today run in the web browser, which requires JavaScript.
This does not mean we cannot use Python; as we saw above with `altair`, some Python libraries can generate interactives by way of JavaScript. It will, however, mean an additional layer of complexity between code & final product, which can make debugging more difficult.
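For instance, an Altair chart can be written out as a standalone HTML file that bundles the Vega-Lite JavaScript doing the actual rendering (a sketch; the file name here is arbitrary):

import altair as alt
import pandas as pd

source = pd.DataFrame({"x": [1, 2, 3, 4], "y": [3, 1, 4, 2]})

chart = alt.Chart(source).mark_line(point=True).encode(
    x="x:Q",
    y="y:Q",
).interactive()

# produces a self-contained HTML page; open it in a browser to pan & zoom
chart.save("line_chart.html")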
An example of what can be done with just a bit of Altair:
import altair as alt
from vega_datasets import data

cars = data.cars()
interval = alt.selection_interval(empty='none')

alt.Chart(cars).mark_point().encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
    color=alt.condition(interval, 'Origin', alt.value('lightgray'))
).add_params(  # add_params replaces the deprecated add_selection in Altair 5+
    interval
)
Plotly is an interactive visualization library built on JavaScript that has bindings for Python, R, JavaScript, Julia, MATLAB, and more.
Note: Unlike everything else we’ve mentioned so far, plotly is a company selling a product. The libraries we’re talking about are open source, but it is worth noting that the company does offer paid services and upsells.
They are well suited to building interactives, and `dash` takes this a step further and allows you to build full web applications (mainly targeted at dashboards).
Take a look at prior years’ projects for some examples of what you can do with `plotly` and `dash`.
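As a small taste of plotly, here is a minimal sketch using one of the sample datasets that ships with the library:

import plotly.express as px

# px.data.iris() is a bundled sample dataset
df = px.data.iris()

fig = px.scatter(df, x="sepal_width", y="sepal_length",
                 color="species", hover_data=["petal_length"])

# opens an interactive figure in the notebook or browser
fig.show()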
Visualization Libraries:
Additional Visualization Libraries:
Colorblindness Tools: