17 Architecture
Goals
- Examine the architecture of a real application to develop an understanding of the big picture: how the various components we’ve touched on fit together.
Levels of Abstraction
Back in Chapter 1, we introduced the structure of a typical Python application: a handful of modules collected into a package, plus one or more entrypoint functions, typically in if __name__ == "__main__" blocks.
As the size and scope of our software grows, we encounter challenges and create increasingly nested abstractions.
We can think about these in an approximate hierarchy:
Component | Built From | Purpose | Analogy |
---|---|---|---|
Function | Statements & Expressions | Encapsulate logic; repeatable behavior. | Recipes |
Class | Functions (Methods) & Attributes | Encapsulate associated behavior & data. | Blueprints |
Module | Functions & Classes | Organize related components & define an interface. | Department |
Package | Modules | Bundle of code that accomplishes a specific task. | Factory |
Program | Packages & Smaller Programs | Accomplish real-world tasks. | |
This table models a program of a certain level of complexity. Of course, not all programs need multiple modules, or even multiple functions.
A 200-line, single-function program that does its job well is no less a program than one constructed from thousands of submodules.
Case Study: Open States
I ran a project called Open States for 13 years. The purpose of the project was to collect, standardize, and publish legislative information from the legislatures of all 50 states, DC, and Puerto Rico.
The project is the collective effort of dozens of individuals, many of whom contributed a handful of scrapers. While I wrote many scrapers myself, over time my primary role became that of a software architect: establishing best practices, writing interfaces, and designing the overall flow of data between components.
Because of my deep familiarity with it, and because it is open source, this project gives us a unique opportunity to look at a full software product.
Architecture Overview
First, let’s see what goes into this full application:
There’s a lot going on here, but for now let’s focus on the largest boxes (with title bars & squared edges).
The key elements of the project include:
- a large Python application “openstates-core” that has a command line interface
- hundreds of small Python modules in “openstates-scrapers”; these act as plugins for the main openstates-core
- two user-facing Python web applications “API v3” and “openstates.org”
- “bobsled” which is another web application used to orchestrate the hundreds of scrapers
While some small components are omitted here, this represents 90% of the process that turns more than 50 different state websites:
- https://www.ncleg.gov
- https://ilga.gov
- https://www.capitol.hawaii.gov/home.aspx
- https://alison.legislature.state.al.us
- etc.
Into:
- a website (example: bill search)
- a JSON API
- and bulk downloads in various formats.
With this big picture in mind, we can take a closer look at various components.
openstates-core
https://github.com/openstates/openstates-core
At the start of this course, we talked about the importance of testing and designing programs to be testable.
Effectively testing web scrapers is challenging and often comes with little benefit. If the site changes, but the test still passes, that isn’t particularly useful.
Instead, given the volatility of the data itself and how often scrapers needed to be updated due to site changes, it became incredibly important to have a well-tested core. This core code behaves mostly as a library responsible for everything that is common across scrapers.
Some of the essential core modules are:
metadata
Stores rarely-changing metadata on the states: How many Senate & House seats are there? What do they call their legislators? Which URLs are important?
models
Provides an intermediate data model for representing bills, votes, legislators, committees, and events. This module is implemented as a set of classes (Bill, Person, Vote, etc.) that each have helpful methods associated with that particular data type.
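To give a feel for the shape of these classes, here is a heavily simplified sketch; the field names and helper methods are illustrative, and the real Bill model has many more fields, validation rules, and serialization logic.

```python
from dataclasses import dataclass, field

@dataclass
class Bill:
    """Simplified stand-in for the intermediate Bill data model."""
    identifier: str                 # e.g. "HB 123" (placeholder format)
    session: str                    # legislative session the bill belongs to
    title: str
    sponsors: list[str] = field(default_factory=list)
    actions: list[dict] = field(default_factory=list)

    def add_sponsor(self, name: str) -> None:
        """Helper method attaching a sponsor to this bill."""
        self.sponsors.append(name)

    def add_action(self, description: str, date: str) -> None:
        """Helper method recording an action (reading, committee vote, etc.)."""
        self.actions.append({"description": description, "date": date})
```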
scrape
Provides the architecture that each state scraper will implement: a series of abstract classes like BillScraper and CommitteeScraper that have abstract scrape methods. The implementations of these classes live in a different project, openstates-scrapers, which we’ll come to shortly. These scrapers, along with the models, write JSON to disk. This allows us to run the scrapers locally without a developer needing a full database set up.
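The structure is essentially the abstract base class pattern. Here is a rough sketch of what such an interface could look like; this is not the actual openstates-core code, and the method and directory names are assumptions:

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path

class BillScraper(ABC):
    """Base class each state implements; the common JSON output lives here."""

    def __init__(self, output_dir: str = "_data"):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)

    @abstractmethod
    def scrape(self, session: str):
        """Yield scraped bill objects for the given session (state-specific)."""

    def save_object(self, obj: dict, name: str) -> None:
        """Write one scraped object to disk as JSON -- no database required."""
        with open(self.output_dir / f"{name}.json", "w") as f:
            json.dump(obj, f, indent=2)
```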
importers
The JSON written to disk by scrape methods is in a common format, but still needs work to be added to the database itself. The importers reconcile the scraped data against the database, performing entity resolution and record linkage as they go.
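A toy sketch of that reconciliation step, using an in-memory dictionary in place of the real database and matching only on session and identifier (the actual importers do much fuzzier matching and full record linkage):

```python
# In-memory stand-in for the bills table; the real importers work against
# PostgreSQL and perform much fuzzier matching and record linkage.
bills_table: dict[tuple[str, str], dict] = {}

def import_bill(scraped: dict) -> dict:
    """Reconcile one scraped bill against existing records."""
    key = (scraped["session"], scraped["identifier"])
    existing = bills_table.get(key)
    if existing is None:
        bills_table[key] = scraped      # no match found: insert a new record
        return scraped
    existing.update(scraped)            # match found: merge into existing record
    return existing
```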
fulltext
After the metadata is written to disk, a separate process fetches the full text of bills, extracts it from PDFs and HTML files, and saves it to the database. This runs as its own process since the logic is fairly uniform, but it uses a similar approach to the scrapers, where states can override the behavior as needed for special cases.
cli
All of the behavior described above (scraping, importing, full text extraction, etc.) can be run from the command line. A handful of command line entrypoints (__main__) allow developers to run given portions of the pipeline locally for development and testing. The final product is also run via this same interface, which we’ll see when we discuss bobsled.
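A minimal sketch of that kind of entrypoint, using argparse subcommands; the command names and dispatch are placeholders rather than the real openstates CLI:

```python
# A __main__-style entrypoint exposing pieces of the pipeline as subcommands.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(prog="pipeline")
    sub = parser.add_subparsers(dest="command", required=True)

    scrape = sub.add_parser("scrape", help="run a state's scrapers")
    scrape.add_argument("state")

    importer = sub.add_parser("import", help="import previously scraped JSON")
    importer.add_argument("state")

    args = parser.parse_args()
    print(f"would run {args.command!r} for {args.state}")  # real dispatch goes here

if __name__ == "__main__":
    main()
```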
utils
Lots of general purpose helper functions. Some of these were eventually pulled out into even more general libraries like scrapelib and jellyfish.
The core of the project has near-total test coverage: we used a tool called coverage.py to determine which lines run during tests and ensured that nearly every line is executed.
Given the critical nature of this portion of the code, this coverage gives peace of mind that most issues that arise are related to a specific state, not endemic to the project as a whole.
This portion of the code also contains the most computationally intensive portions:
- Complex entity resolution in the importers, which in the worst case would require comparing every bill against every other bill in the system.
- A topological sort of a directed acyclic graph built between objects, to ensure that data referencing other items (such as a bill sponsorship) is imported in the correct order (see the sketch below).
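Python’s standard library includes graphlib for exactly this kind of ordering problem. Here is a minimal sketch with made-up object names, not the real importer code:

```python
from graphlib import TopologicalSorter

# Each key depends on the items in its set: a sponsorship or vote cannot be
# imported until the bill it references exists in the database.
dependencies = {
    "bill:HB1": set(),
    "bill:HB2": set(),
    "sponsorship:HB1/Smith": {"bill:HB1"},
    "vote:HB2/final": {"bill:HB2"},
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)  # bills come before the records that reference them
```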
openstates-scrapers
https://github.com/openstates/openstates-scrapers
While the Open States core has had 35 contributors as of 2025, the scrapers themselves have had 165.
A key design goal of the project was that almost anyone could contribute a scraper, and people have in fact taught themselves Python in order to do so!
To make this possible, the core provides a very detailed “blueprint” for a scraper: a base class per scraper type. Authors can then inherit from this class, implement a method or two (as well as any helper methods they wish), and have a working scraper.
We can look at an example scraper such as az.bills to see what this looks like in practice.
These scrapers import modules from openstates-core, which provide common functionality like validation logic, helper functions for common tasks, and writing the output to JSON.
The scrapers themselves use lxml.html, re, and lots of other libraries to extract the information they need from HTML files, PDFs, Excel files, Word documents, and much more.
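Putting the two repositories together, a hypothetical scraper subclass might look roughly like this; the URL, XPath expressions, and field names are invented for illustration and do not come from the az.bills scraper:

```python
import lxml.html
import requests

class ExampleStateBillScraper:  # in practice this would inherit the BillScraper base class
    """Hypothetical scraper: fetch a listing page and pull out bill rows."""

    listing_url = "https://legislature.example.gov/bills"  # placeholder URL

    def scrape(self, session: str):
        html = requests.get(self.listing_url, timeout=30).text
        doc = lxml.html.fromstring(html)
        # The XPath is made up for illustration; real scrapers mirror each
        # state's actual markup and are updated whenever it changes.
        for row in doc.xpath("//table[@id='bills']//tr"):
            identifier = row.xpath("string(./td[1])").strip()
            title = row.xpath("string(./td[2])").strip()
            yield {"session": session, "identifier": identifier, "title": title}
```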
bobsled
https://github.com/openstates/bobsled
In practice, once a scraper is written, we want to run it at least once, and often many times a day, to get the latest data.
There are many tools to do this, but while the project was on a shoestring budget we landed on an approach that would turn a server on, run a single scraper, and then turn the server off. This is handled by a cloud-hosted web application we called bobsled.
I won’t go too far into the design of this application, but it is responsible for maintaining a schedule: checking when given scrapers last ran and what the results were. If the configuration indicates that a new run is needed, bobsled will call the openstates.cli to run a scrape, and then, once the JSON has been written, will run an import as well.
This is also used to run the aforementioned full text processing, download and process legislator photos, and other tasks that need to occur semi-regularly.
Given that scrapers can fail for any number of reasons, this application uses the GitHub API to open a GitHub issue if a scraper fails more than a set number of times.
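Conceptually, the scheduling check is simple. A rough synchronous sketch follows, where the 12-hour threshold and the exact CLI invocation are placeholders (the real bobsled is asynchronous and records results in a database):

```python
import subprocess
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE = timedelta(hours=12)   # placeholder schedule: rerun twice a day

def run_if_due(state: str, last_run: Optional[datetime]) -> bool:
    """Run the scrape and import for a state if its last run is too old."""
    now = datetime.now(timezone.utc)
    if last_run is not None and now - last_run < MAX_AGE:
        return False  # ran recently enough, nothing to do
    # bobsled shells out to the same CLI that developers use locally;
    # the exact subcommand here is a placeholder.
    subprocess.run(["python", "-m", "openstates.cli", "update", state], check=True)
    return True
```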
Manual Processes
https://github.com/openstates/people
This is a good moment to discuss the fact that the design of the system makes room for manual processes.
Some information is far more expensive to scrape than to ask a person to enter by hand. Official sources are also sometimes incorrect, and this gives us a mechanism to correct some of that as needed.
This system writes scraper data out to a Git repository instead of a traditional database. This allows us to track changes and lets people make manual edits as needed.
openstates.geo
Another special component worth a mention is openstates.geo.
By far the most used feature of the website is helping people find out who represents them in their state legislature.
To do this we need to place people within the polygons representing their legislative districts. This is done by loading all of the districts into a PostGIS (PostgreSQL with GIS features) database.
This code is so mission-critical that it runs as a microservice, written in the minimal amount of Python possible. Instead of loading a dozen or so irrelevant modules, it loads only the code that is absolutely necessary to match a latitude & longitude to the district polygons that contain it and return them in a simple JSON API call.
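A rough sketch of that lookup is below, using only the standard library plus psycopg2; the table name, column names, and connection string are assumptions, not the actual openstates.geo schema:

```python
# Minimal lat/lng -> districts lookup against PostGIS; schema names are assumed.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

import psycopg2

QUERY = """
    SELECT division_id
    FROM divisions
    WHERE ST_Contains(shape, ST_SetSRID(ST_Point(%s, %s), 4326))
"""

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        qs = parse_qs(urlparse(self.path).query)
        lng, lat = float(qs["lng"][0]), float(qs["lat"][0])
        with psycopg2.connect("dbname=geo") as conn, conn.cursor() as cur:
            cur.execute(QUERY, (lng, lat))  # PostGIS points are (x=lng, y=lat)
            divisions = [row[0] for row in cur.fetchall()]
        body = json.dumps({"divisions": divisions}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), Handler).serve_forever()
```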
Publishing Data
https://github.com/openstates/api-v3 & https://github.com/openstates/openstates.org
Once the data has been ingested into a database, it is uniform and ready for public consumption.
The data is served out in three ways:
- bulk data, generated nightly via bobsled jobs that write large amounts of data to CSV & JSON files
- a JSON API that serves millions of requests per hour
- a public website that serves millions of users per year
These are all separate Python applications, but they use a shared interface to the data through openstates-core. This ensures that a change to the data model is reflected in the scrapers, API, and website together, since letting them get out of sync would lead to confusion or invalid data.