17 Architecture
Goals
- Examine the architecture of a real application to develop an understanding of the big picture: how the various components we’ve touched on fit together.
Levels of Abstraction
Back in Chapter 1, we introduced the structure of a typical Python application: a handful of modules collected into a package, plus one or more entrypoint functions, typically in if __name__ == "__main__" blocks.
As the size and scope of our software grows, we encounter challenges and create increasingly nested abstractions.
We can think about these in an approximate hierarchy:
Component | Built From | Purpose | Analogy |
---|---|---|---|
Function | Statements & Expressions | Encapsulate logic; repeatable behavior. | Recipes |
Class | Functions (Methods) & Attributes | Encapsulate associated behavior & data. | Blueprints |
Module | Functions & Classes | Organize related components & define an interface. | Department |
Package | Modules | Bundle of code that accomplishes a specific task. | Factory |
Program | Packages & Smaller Programs | Accomplish real-world tasks. | |
This table models a program of a certain level of complexity. Of course, not all programs need multiple modules, or even multiple functions.
A 200-line, single-function program that does its job well is no less a program than one constructed from thousands of submodules.
Case Study: Open States
I ran a project called Open States for 13 years. The purpose of the project was to collect, standardize, and publish legislative information from the legislatures of all 50 states, DC, and Puerto Rico.
The project is the collective effort of dozens of individuals, many of whom contributed a handful of scrapers. While I wrote many scrapers myself, over time my primary role became that of a software architect: establishing best practices, writing interfaces, and designing the overall flow of data between components.
Because of my deep familiarity with it, and because it is open source, this project gives us a unique opportunity to look at a full software product.
Architecture Overview
First, let’s see what goes into this full application:
There’s a lot going on here, but for now let’s focus on the largest boxes (with title bars & squared edges).
The key elements of the project include:
- a large Python application “openstates-core” that has a command line interface
- hundreds of small Python modules in “openstates-scrapers”; these act as plugins for the main openstates-core
- two user-facing Python web applications “API v3” and “openstates.org”
- “bobsled” which is another web application used to orchestrate the hundreds of scrapers
While some small components are omitted here, this represents 90% of the process that turns more than 50 different state websites:
- https://www.ncleg.gov
- https://ilga.gov
- https://www.capitol.hawaii.gov/home.aspx
- https://alison.legislature.state.al.us
- etc.
Into:
- a website (example: bill search)
- a JSON API
- and bulk downloads in various formats.
With this big picture in mind, we can take a closer look at various components.
openstates-core
https://github.com/openstates/openstates-core
At the start of this course, we talked about the importance of testing and designing programs to be testable.
Effectively testing web scrapers is challenging and often comes with little benefit. If the site changes, but the test still passes, that isn’t particularly useful.
Instead, given the volatility of the data itself and how often scrapers needed to be updated due to site changes, it became incredibly important to have a well-tested core. This core code behaves mostly as a library responsible for everything that is common across scrapers.
Some of the essential core modules are:
metadata
Stores rarely-changing metadata on the states: How many Senate & House seats are there? What do they call their legislators? Which URLs are important?
models
Provides an intermediate data model for representing bills, votes, legislators, committees, and events. This module is implemented as a set of classes (Bill, Person, Vote, etc.) that each have helpful methods associated with that particular data type.
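To give a feel for the shape of these classes, here is a heavily simplified sketch; the field names and helper methods are illustrative, and the real Bill model has many more fields, validation rules, and serialization logic.

```python
from dataclasses import dataclass, field

@dataclass
class Bill:
    """Simplified stand-in for the intermediate Bill data model."""
    identifier: str                 # e.g. "HB 123" (placeholder format)
    session: str                    # legislative session the bill belongs to
    title: str
    sponsors: list[str] = field(default_factory=list)
    actions: list[dict] = field(default_factory=list)

    def add_sponsor(self, name: str) -> None:
        """Helper method attaching a sponsor to this bill."""
        self.sponsors.append(name)

    def add_action(self, description: str, date: str) -> None:
        """Helper method recording an action (reading, committee vote, etc.)."""
        self.actions.append({"description": description, "date": date})
```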
scrape
Provides the architecture that each state scraper will implement: a series of abstract classes like BillScraper and CommitteeScraper that have abstract scrape methods. The implementations of these classes live in a different project, openstates-scrapers, which we’ll come to shortly. These scrapers, along with the models, write JSON to disk. This allows us to run the scrapers locally without a developer needing a full database set up.
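The structure is essentially the abstract base class pattern. Here is a rough sketch of what such an interface could look like; this is not the actual openstates-core code, and the method and directory names are assumptions:

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path

class BillScraper(ABC):
    """Base class each state implements; the common JSON output lives here."""

    def __init__(self, output_dir: str = "_data"):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)

    @abstractmethod
    def scrape(self, session: str):
        """Yield scraped bill objects for the given session (state-specific)."""

    def save_object(self, obj: dict, name: str) -> None:
        """Write one scraped object to disk as JSON -- no database required."""
        with open(self.output_dir / f"{name}.json", "w") as f:
            json.dump(obj, f, indent=2)
```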
importers
The JSON written to disk by scrape methods is in a common format, but still needs work to be added to the database itself. The importers reconcile the scraped data against the database, performing entity resolution and record linkage as they go.
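A toy sketch of that reconciliation step, using an in-memory dictionary in place of the real database and matching only on session and identifier (the actual importers do much fuzzier matching and full record linkage):

```python
# In-memory stand-in for the bills table; the real importers work against
# PostgreSQL and perform much fuzzier matching and record linkage.
bills_table: dict[tuple[str, str], dict] = {}

def import_bill(scraped: dict) -> dict:
    """Reconcile one scraped bill against existing records."""
    key = (scraped["session"], scraped["identifier"])
    existing = bills_table.get(key)
    if existing is None:
        bills_table[key] = scraped      # no match found: insert a new record
        return scraped
    existing.update(scraped)            # match found: merge into existing record
    return existing
```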
fulltext
After the metadata is written to disk, a separate process fetches the full text of bills, extracts it from PDFs and HTML files, and saves it to the database. This runs as its own process since the logic is fairly uniform, but it uses a similar approach to the scrapers, where states can override the behavior as needed for special cases.
cli
All of the behavior described above (scraping, importing, full text extraction, etc.) can be run from the command line. A handful of command line entrypoints (__main__) allow developers to run given portions of the pipeline locally for development and testing. The final product is also run via this same interface, which we’ll see when we discuss bobsled.
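A minimal sketch of that kind of entrypoint, using argparse subcommands; the command names and dispatch are placeholders rather than the real openstates CLI:

```python
# A __main__-style entrypoint exposing pieces of the pipeline as subcommands.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(prog="pipeline")
    sub = parser.add_subparsers(dest="command", required=True)

    scrape = sub.add_parser("scrape", help="run a state's scrapers")
    scrape.add_argument("state")

    importer = sub.add_parser("import", help="import previously scraped JSON")
    importer.add_argument("state")

    args = parser.parse_args()
    print(f"would run {args.command!r} for {args.state}")  # real dispatch goes here

if __name__ == "__main__":
    main()
```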
utils
Lots of general purpose helper functions. Some of these were eventually pulled out into even more general libraries like scrapelib and jellyfish.
The core of the project has near-total test coverage: we used a tool called coverage.py to determine which lines run during tests and ensured that nearly every line is executed.
Given the critical nature of this portion of the code, this coverage gives peace of mind that most issues that arise are related to a specific state, not endemic to the project as a whole.
This portion of the code also contains the most computationally intensive portions:
- Complex entity resolution in the importers, which in the worst case would require comparing every bill against every other bill in the system.
- A topological sort of a directed acyclic graph built between objects, to ensure that data referencing other items (such as a bill sponsorship) is imported in the correct order (see the sketch below).
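Python’s standard library includes graphlib for exactly this kind of ordering problem. Here is a minimal sketch with made-up object names, not the real importer code:

```python
from graphlib import TopologicalSorter

# Each key depends on the items in its set: a sponsorship or vote cannot be
# imported until the bill it references exists in the database.
dependencies = {
    "bill:HB1": set(),
    "bill:HB2": set(),
    "sponsorship:HB1/Smith": {"bill:HB1"},
    "vote:HB2/final": {"bill:HB2"},
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)  # bills come before the records that reference them
```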
openstates-scrapers
https://github.com/openstates/openstates-scrapers
While the Open States core has had 35 contributors as of 2025, the scrapers themselves have had 165.
A key design goal of the project was that almost anyone could contribute a scraper, and people have in fact taught themselves Python in order to do so!
To make this possible, the core provides a very detailed “blueprint” for a scraper: a base class per scraper type. Authors can then inherit from this class, implement a method or two (as well as any helper methods they wish), and have a working scraper.
We can look at an example scraper such as az.bills to see what this looks like in practice.
These scrapers import modules from openstates-core, which provide common functionality like validation logic, helper functions for common tasks, and writing the output to JSON.
The scrapers themselves use lxml.html, re, and lots of other libraries to extract the information they need from HTML files, PDFs, Excel files, Word documents, and much more.
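Putting the two repositories together, a hypothetical scraper subclass might look roughly like this; the URL, XPath expressions, and field names are invented for illustration and do not come from the az.bills scraper:

```python
import lxml.html
import requests

class ExampleStateBillScraper:  # in practice this would inherit the BillScraper base class
    """Hypothetical scraper: fetch a listing page and pull out bill rows."""

    listing_url = "https://legislature.example.gov/bills"  # placeholder URL

    def scrape(self, session: str):
        html = requests.get(self.listing_url, timeout=30).text
        doc = lxml.html.fromstring(html)
        # The XPath is made up for illustration; real scrapers mirror each
        # state's actual markup and are updated whenever it changes.
        for row in doc.xpath("//table[@id='bills']//tr"):
            identifier = row.xpath("string(./td[1])").strip()
            title = row.xpath("string(./td[2])").strip()
            yield {"session": session, "identifier": identifier, "title": title}
```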
bobsled
https://github.com/openstates/bobsled
In practice, once a scraper is written, we want to run it at least once, and often many times a day, to get the latest data.
There are many tools to do this, but while the project was on a shoestring budget we landed on an approach that would turn a server on, run a single scraper, and then turn the server off. This is handled by a cloud-hosted web application we called bobsled.
I won’t go too far into the design of this application, but it is responsible for maintaining a schedule: checking when given scrapers last ran and what the results were. If the configuration indicates that a new run is needed, bobsled will call the openstates.cli to run a scrape, and then, once the JSON has been written, will run an import as well.
This is also used to run the aforementioned full text processing, download and process legislator photos, and other tasks that need to occur semi-regularly.
Given that scrapers can fail for any number of reasons, this application uses the GitHub API to open a GitHub issue if a scraper fails more than a set number of times.
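Conceptually, the scheduling check is simple. A rough synchronous sketch follows, where the 12-hour threshold and the exact CLI invocation are placeholders (the real bobsled is asynchronous and records results in a database):

```python
import subprocess
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE = timedelta(hours=12)   # placeholder schedule: rerun twice a day

def run_if_due(state: str, last_run: Optional[datetime]) -> bool:
    """Run the scrape and import for a state if its last run is too old."""
    now = datetime.now(timezone.utc)
    if last_run is not None and now - last_run < MAX_AGE:
        return False  # ran recently enough, nothing to do
    # bobsled shells out to the same CLI that developers use locally;
    # the exact subcommand here is a placeholder.
    subprocess.run(["python", "-m", "openstates.cli", "update", state], check=True)
    return True
```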
Manual Processes
https://github.com/openstates/people
This is a good moment to discuss the fact that the design of the system makes room for manual processes.
Some information is far more expensive to scrape than to ask a person to enter by hand. Official sources are also sometimes incorrect, and this gives us a mechanism to correct some of that as needed.
This system writes scraper data out to a Git repository instead of a traditional database. This allows us to track changes and lets people make manual edits as needed.
openstates.geo
Another special component worth a mention is openstates.geo.
By far the most used feature of the website is helping people find out who represents them in their state legislature.
To do this we need to place people within the polygons representing their legislative districts. This is done by loading all of the districts into a PostGIS (PostgreSQL with GIS features) database.
This code is so mission-critical that it runs as a microservice, written in the minimal amount of Python possible. Instead of loading a dozen or so irrelevant modules, it loads only the code that is absolutely necessary to match a latitude & longitude to the district polygons that contain it and return them in a simple JSON API call.
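A rough sketch of that lookup is below, using only the standard library plus psycopg2; the table name, column names, and connection string are assumptions, not the actual openstates.geo schema:

```python
# Minimal lat/lng -> districts lookup against PostGIS; schema names are assumed.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

import psycopg2

QUERY = """
    SELECT division_id
    FROM divisions
    WHERE ST_Contains(shape, ST_SetSRID(ST_Point(%s, %s), 4326))
"""

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        qs = parse_qs(urlparse(self.path).query)
        lng, lat = float(qs["lng"][0]), float(qs["lat"][0])
        with psycopg2.connect("dbname=geo") as conn, conn.cursor() as cur:
            cur.execute(QUERY, (lng, lat))  # PostGIS points are (x=lng, y=lat)
            divisions = [row[0] for row in cur.fetchall()]
        body = json.dumps({"divisions": divisions}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), Handler).serve_forever()
```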
Publishing Data
https://github.com/openstates/api-v3 & https://github.com/openstates/openstates.org
Once the data has been ingested into a database, it is uniform and ready for public consumption.
The data is served out in three ways:
- bulk data, generated nightly via bobsled jobs that write large amounts of data to CSV & JSON files
- a JSON API that serves millions of requests per hour
- a public website that serves millions of users per year
These are all separate Python applications, but they use a shared interface to the data through openstates-core. This ensures that a change to the data model is reflected in the scrapers, API, and website together, since letting them get out of sync would lead to confusion or invalid data.