Goals
If you are dealing with two different data sets, the ideal scenario is that they share a unique identifier.
Imagine a business registry:
| ID | Business Name | Address |
|---|---|---|
| 123 | ACME, Inc. | 1 ACME Way |
| 456 | Lumen Enterprises | 4 Kier Circle |
| 789 | Yoyodyne Industries | 8 Dimension Ct |
And a separate fines table:
| Business ID | Business Name | Fine |
|---|---|---|
| 123 | ACME, Inc. | OSHA Violation 4/11/2024 … |
| 123 | ACME Incorporated | OSHA Violation 4/11/2024 … |
| 456 | Lumen | Overtime Violation |
| 456 | Lumen Corp. | Overtime Violation |
| 456 | Lumen Enterprises | Overtime Violation |
The business name may often be recorded differently, but the presence of the identifier helps us reconcile these records.
In relational databases the unique identifier is called a primary key and we’ll borrow that term.
In real world data we often do not have a shared key between our data sets, especially if they are created by different organizations.
Sometimes we are able to cobble together a set of fields that are unique across data sets, a compound key. But, given the variability in how fields like names and addresses are often represented, we often must resort to string manipulation & probabilistic approaches.
Record linkage is the term for linking together two or more separate data sets by examination of their records.
Entity resolution is the term for resolving individual entities against a data set: determining that “ACME CORP.” refers to business ID 123 in our example above.
De-duplication can also be a special case of record linkage, essentially matching a data set against itself: determining that “ACME Inc.” and “ACME Incorporated” are the same entity.
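As a minimal sketch of de-duplication, assuming a `records` list and a hypothetical `is_same_entity()` comparison function, we can compare a data set against itself. Using `itertools.combinations` avoids comparing a record to itself or repeating a pair:

```python
import itertools

# records and is_same_entity() are assumptions for illustration: a list of
# rows and a function returning True when two rows appear to match
duplicates = []
for record1, record2 in itertools.combinations(records, 2):
    if is_same_entity(record1, record2):
        duplicates.append((record1, record2))
```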
Let’s say we have a list of restaurants whose owners have given money to a political campaign, and a list of restaurants that have been fined by the city.
As part of an analysis, you want to see if there is a difference in the average fine amount for businesses that have given money to the campaign versus those that have not.
| Business Name | Address | Amount Donated |
|---|---|---|
| Little Coco’s | 123 Main St | $1000 |
| Taqueria Habanero | 901 17th St | $250 |
| Red Derby | 917 Queen Anne Ave | $2000 |
| Susana’s | 8312 Park Blvd | $500 |
| ID | Business Name | Address | Fine Amount |
|---|---|---|---|
| 1230 | Little Coco’s Restaurant LLC | 123 Main Street | $100 |
| 2901 | Taqueria Habañero | 901 17th Street #1 | $100 |
| 3014 | IHOP | 917 Queen Ann Avenue | $5000 |
| 3207 | Susanna’s | 8312 Park Blvd | $100 |
There’s no shared key between the two tables. The campaign finance & restaurant fine systems would typically not use the same IDs.
Furthermore, there are inconsistencies in how the restaurants and addresses are written. It would appear likely that “Little Coco’s” and “Little Coco’s Restaurant LLC” are the same restaurant, but trying to match on columns directly will not work.
With two data sets A and B, the basic approach is:

1. Identify a set of comparable fields between the two data sets.
2. Clean and standardize the data in those fields.
3. Calculate the similarity between each pair of records.
4. Consider all pairs with similarity above a chosen threshold to be matches.
In Python this might look like:
```python
matches = []
for item1 in dataset1:
    for item2 in dataset2:
        # thresholds were chosen arbitrarily
        if (item_similarity(
                clean_name(item1.name),
                clean_name(item2.name)
            ) > 0.8
            and item_similarity(
                clean_addr(item1.address),
                clean_addr(item2.address)
            ) > 0.9):
            matches.append((item1, item2))
```
Note that in code we are using cleaning functions on individual items, despite the more formal notation suggesting we are always acting on the full tuple.
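As a rough sketch of what `clean_name` and `clean_addr` might do (the specific substitutions below are illustrative assumptions, not a complete standardization scheme):

```python
import re

def clean_name(name):
    # lowercase, strip punctuation, and drop common corporate suffixes
    name = re.sub(r"[^\w\s]", "", name.lower())
    name = re.sub(r"\b(llc|inc|incorporated|corp|corporation)\b", "", name)
    return " ".join(name.split())

def clean_addr(addr):
    # lowercase, strip punctuation, and expand common abbreviations
    addr = re.sub(r"[^\w\s]", "", addr.lower())
    expansions = {"st": "street", "ave": "avenue", "blvd": "boulevard"}
    return " ".join(expansions.get(word, word) for word in addr.split())
```

With this, `clean_name("Little Coco's Restaurant LLC")` and `clean_name("Little Coco's")` differ only by the word “restaurant”, a much easier comparison to make.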
itertools.product
When you have a nested loop like the one above, you can use `itertools.product` to make it a bit more concise. `itertools.product` takes two iterables and returns an iterable of tuples of the items from each iterable. (In relational terms, the cross product of the two iterables.)
```python
import itertools

list1 = ["a", "b", "c"]
list2 = [1, 2, 3]
for item1, item2 in itertools.product(list1, list2):
    print(item1, item2)
```
```
a 1
a 2
a 3
b 1
b 2
b 3
c 1
c 2
c 3
```
```python
import itertools

matches = []
for item1, item2 in itertools.product(dataset1, dataset2):
    if (item_similarity(item1.name, item2.name) > 0.8
            and item_similarity(item1.address, item2.address) > 0.9):
        matches.append((item1, item2))
```
The same amount of work is done; the algorithm is still O(n × m) in the sizes of the two data sets.
In the algorithm above we’re setting a threshold. Our results cannot be known to be perfect, and in practice we need to set a threshold high enough to avoid too many false positives. Setting a threshold too high, however, may mean missing matches that would otherwise have been valid.
There’s no single correct threshold; the value will depend upon the data, the metrics in use, and the tolerance for false positives in the intended usage of the linked data.
A more principled approach is to use a probabilistic model to determine the likelihood that two records are linked.
This is a bit more complicated, but it has the advantage that it can be used to determine the optimal threshold for a specific application.
For example, if you are dealing with linking medical records, you might want to be conservative in what you link to avoid accidentally linking two patients who are not the same person. In this case, you would want to configure your model to have a low false positive rate.
On the other hand, if you are doing an analysis across decennial census records, you might want to be more aggressive in linking records to get a more complete picture of the population. In this case, you would want to configure your model to have a low false negative rate.
Probabilistic approaches require already labeled data to train the model. This is often done by having a human look at a sample of the records and label them as linked or not linked. This data will form a baseline that the model’s predictions can be compared against.
We have two lists of people that we’d like to link: `A` and `B`.
We take a sample of the data, `A_sample` and `B_sample`, and label the pairs (`A_sample × B_sample`) as linked or not linked. Call the matched pairs `M` and the unmatched pairs `U`.
We then define a scoring function for each link key. For example, we might have a scoring function for names that returns “good match”, “intermediate match”, or “no match”. We’ll call this `name_scoring_function`. We could also have a similar function for address, year of birth, etc.
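A sketch of what such a scoring function might look like; bucketing Jaro-Winkler similarity at these particular cutoffs is an illustrative assumption, not a tuned choice:

```python
from jellyfish import jaro_winkler_similarity

def name_scoring_function(name1, name2):
    # the cutoffs here are illustrative assumptions, not tuned values
    similarity = jaro_winkler_similarity(name1, name2)
    if similarity > 0.9:
        return "good match"
    elif similarity > 0.7:
        return "intermediate match"
    return "no match"
```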
If we get two records `a` and `b` from `A` and `B` respectively, we’d use our scoring function:

```python
name_score = name_scoring_function(a["name"], b["name"])
# name_score is either "good match", "intermediate match", or "no match"
```
Let’s say we get “intermediate match”.
We need to know the probability of getting such a score among the matches and among the non-matches. We can compute this from the labeled data.
We determine the probability of getting such a value among the matches in `M`, `P(name has "intermediate" | M)`, by looking at the fraction of matches that have such a value for the name field.
Similarly, we determine the probability of getting such a value among the non-matches, `P(name has "intermediate" | U)`.
We compute such probabilities for address as well. Suppose the result of `address_scoring_function(a, b)` is `"good match"`. We then compute `P(address has "good match" | M)` and `P(address has "good match" | U)`.
Now we are ready to find the ratio of the probabilities, assuming independence between name and address:

R = ( P(name has "intermediate" | M) * P(address has "good match" | M) ) / ( P(name has "intermediate" | U) * P(address has "good match" | U) )
Based again on the training data, we determine thresholds T1 and T2 such that when the ratio is above T1 we assign (a, b) as a match, and when it is below T2 as a non-match. Pairs falling in between are sent for manual clerical review.
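A sketch of this computation, assuming `M` and `U` are lists of labeled `(a, b)` record pairs and reusing the hypothetical scoring functions above (a real implementation would also guard against zero probabilities in the denominator):

```python
def score_probability(pairs, scoring_function, field, score):
    # fraction of labeled pairs whose field receives the given score
    hits = sum(1 for a, b in pairs if scoring_function(a[field], b[field]) == score)
    return hits / len(pairs)

def match_ratio(a, b, M, U):
    name_score = name_scoring_function(a["name"], b["name"])
    addr_score = address_scoring_function(a["address"], b["address"])
    # assuming independence between name and address
    return (
        score_probability(M, name_scoring_function, "name", name_score)
        * score_probability(M, address_scoring_function, "address", addr_score)
    ) / (
        score_probability(U, name_scoring_function, "name", name_score)
        * score_probability(U, address_scoring_function, "address", addr_score)
    )
```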
Another approach is to use machine learning to determine the optimal threshold. This is particularly useful when there are a large number of fields and you don’t have a clear idea of how they help in record linkage.
It’s beyond the scope of what we can cover here, but relatively easy to implement using a library like scikit-learn.
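As a minimal sketch of that idea with scikit-learn, assuming `labeled_pairs` is a list of `(record_a, record_b, is_match)` tuples from the manual labeling step; the choice of logistic regression and of Jaro-Winkler features is an assumption for illustration:

```python
from jellyfish import jaro_winkler_similarity
from sklearn.linear_model import LogisticRegression

def features(a, b):
    # similarity scores for each comparable field become model features
    return [
        jaro_winkler_similarity(a["name"], b["name"]),
        jaro_winkler_similarity(a["address"], b["address"]),
    ]

X = [features(a, b) for a, b, _ in labeled_pairs]
y = [is_match for _, _, is_match in labeled_pairs]
model = LogisticRegression().fit(X, y)

# the fitted model yields a match probability for any candidate pair
probability = model.predict_proba([features(record_a, record_b)])[0][1]
```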
If we need to compare every record in one data set against every record in the other, the number of comparisons grows very quickly.
With just 1000 items in each dataset, you’re doing 1,000,000 calls to your `item_similarity()` function.
But `item_similarity()` is going to be fast, right?
As we’ll see shortly, `item_similarity` may itself be expensive; many string comparison algorithms are quadratic in the lengths of the strings being compared.
At larger scales we need a way to avoid making every comparison.
Sometimes there are fields that we are confident are very important for a match. Perhaps 99.9% of matches have the correct state. If we find that two records have a different state, we can be confident that they are not a match and avoid doing the rest of the work.
With 1 million records in each dataset, we’re looking at a trillion comparisons.
If we only compared records to those in the same state, we can view our 1,000,000 records as 50 groups averaging 20,000 each.
This would be 50 × 20,000 × 20,000 = 20,000,000,000 comparisons.
Phew, only 20 billion calls, far better than the expected 1 trillion. A 98% reduction in the number of expensive comparisons.
This is called blocking, considering the data in smaller blocks to reduce expensive comparisons between records that will never match.
The blocking key does not need to be a single field, or even a whole field. For example, you could use the first letter of a person’s last name combined with their zip code, if you had reason to believe that those were unlikely to be incorrect.
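A sketch of blocking in code, assuming each record has a `state` attribute and reusing the earlier `item_similarity` comparison:

```python
import itertools
from collections import defaultdict

def block_by(dataset, key_func):
    # group records into blocks that share a blocking key
    blocks = defaultdict(list)
    for record in dataset:
        blocks[key_func(record)].append(record)
    return blocks

key_func = lambda record: record.state  # or e.g. (record.last_name[0], record.zip_code)
blocks1 = block_by(dataset1, key_func)
blocks2 = block_by(dataset2, key_func)

matches = []
# only compare records whose blocking keys match
for key in blocks1.keys() & blocks2.keys():
    for item1, item2 in itertools.product(blocks1[key], blocks2[key]):
        if item_similarity(item1.name, item2.name) > 0.8:
            matches.append((item1, item2))
```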
Record linkage can be used to de-anonymize data.
With enough data and/or unique attributes, disparate data on individuals thought to be anonymous can be stitched back together.
This can lead to harmful privacy violations, identifying people’s political beliefs or other personal behavior.
When linking records, it’s important to consider the privacy implications of the data you’re working with, particularly when individuals are involved.
As always, practice responsible disclosure: if you uncover a privacy issue, report it to the data owner with suggestions on how to remedy it. Careless de-anonymization can have serious consequences for vulnerable individuals.
Similarity functions must be tailored to the data being compared and intended usage.
For numeric data we can use statistical methods, for geospatial data we can use various distance comparisons, but strings present a challenge.
We need a way to quantify the similarity between two strings. There are many approaches we can examine:
The Hamming distance computes the number of characters that differ between two strings:
```python
def hamming_distance(s1, s2):
    # ensure length of s1 >= s2 using an in-place swap
    if len(s2) > len(s1):
        s1, s2 = s2, s1
    # distance is difference in length + differing chars
    distance = len(s1) - len(s2)
    for i, c in enumerate(s2):
        if c != s1[i]:
            distance += 1
    return distance
```
It has a simple implementation with fast O(n) performance.
```python
print(hamming_distance("Taqueria Habañero", "Taqueria Habanero"))
print(hamming_distance("Susana's", "Susanna's"))
```
```
1
4
```
Both of these pairs are one character off, yet the Hamming distances are 1 and 4, respectively.
Hamming distance only handles replacements, not insertions or deletions. Since it is not normalized to the length of the string, it works best on strings of the same length.
Levenshtein distance is a more general version of counting the number of edits between two strings, often called edit distance.
It allows for counting insertions, deletions, and replacements as “edits”.
The bad news is that the algorithm is O(m × n) in the lengths of the two strings, noticeably more expensive than Hamming distance.
```python
from jellyfish import levenshtein_distance

print(levenshtein_distance("Taqueria Habañero", "Taqueria Habanero"))
print(levenshtein_distance("Susana's", "Susanna's"))
```
```
1
1
```
Both of our comparisons now give one, as we might expect.
Taking this idea a step further, Damerau-Levenshtein considers a transposition a single edit.
This means `damerau("abc", "acb") == 1`, recognizing the two characters were transposed, not replaced.
This algorithm can be even more expensive to compute than plain Levenshtein.
```python
from jellyfish import damerau_levenshtein_distance

print(levenshtein_distance("Fish", "Fsih"))
print(damerau_levenshtein_distance("Fish", "Fsih"))
```
```
2
1
```
But we still have the issue that strings of different lengths are not comparable.
```python
print(damerau_levenshtein_distance("Little Coco's", "Little Coco's Restaurant LLC"))
```
```
15
```
The Jaro similarity is a string similarity measure based on the number and ordering of characters the two strings share.
Its key advantage is that it normalizes the result by the lengths of the strings, so the score always falls between 0 and 1.
```python
from jellyfish import jaro_similarity

print(jaro_similarity("Taqueria Habañero", "Taqueria Habanero"))
# 0.9608
print(jaro_similarity("Susana's", "Susanna's"))
# 0.9630
print(jaro_similarity("Little Coco's", "Little Coco's Restaurant LLC"))
# 0.8214
print(jaro_similarity("Little Coco's", "Susanna's"))
# 0.4587
```
```
0.9607843137254902
0.9629629629629629
0.8214285714285715
0.4586894586894587
```
And the Jaro-Winkler similarity is a modification of the Jaro similarity that gives more favorable ratings to strings that match from the beginning.
This algorithm was developed by a statistician at the US Census Bureau and is useful for name-like strings. It recognizes that misspellings are more likely to occur toward the end of names, and so it weights differences early in the string more heavily.
```python
from jellyfish import jaro_winkler_similarity

print(jaro_winkler_similarity("Taqueria Habañero", "Taqueria Habanero"))
# 0.9765
print(jaro_winkler_similarity("Susana's", "Susanna's"))
# 0.9778
print(jaro_winkler_similarity("Little Coco's", "Little Coco's Restaurant LLC"))
# 0.8929
print(jaro_winkler_similarity("Little Coco's", "Susanna's"))
# 0.4587
```
```
0.9764705882352941
0.9777777777777777
0.8928571428571429
0.4586894586894587
```
An alternative approach is to shorten the string by encoding it in different ways.
A generic version of this is known as the Match Rating Approach. It converts each string into a short code of characters based upon phonetic similarity, and compares the codes instead of the original strings.
Example:
```python
from jellyfish import match_rating_codex, match_rating_comparison

print(match_rating_codex("Taqueria Habañero"))
# TQRBÑR
print(match_rating_codex("Susana's".replace("'", "")))  # codex cannot have non-alpha characters
# SSNS
print(match_rating_comparison("Taqueria Habañero", "Taqueria Habanero"))
# True
print(match_rating_comparison("Susanas", "Susannas"))
# True
```
```
TQRBÑR
SSNS
True
True
```
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.
```python
soundex('Ann') == soundex('Anne') == 'A500'
soundex('Rupert') == soundex('Robert') == 'R163'
```
Metaphone is more complex than Soundex, but it’s also more accurate.
```python
metaphone('Klumpz') == metaphone('Clumps') == 'KLMPS'
```
Metaphone is particularly good at handling soundalikes like C/K and PH/F.
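Both algorithms are available in jellyfish; a quick runnable check of the examples above:

```python
from jellyfish import metaphone, soundex

print(soundex("Ann"), soundex("Anne"))           # A500 A500
print(soundex("Rupert"), soundex("Robert"))      # R163 R163
print(metaphone("Klumpz"), metaphone("Clumps"))  # KLMPS KLMPS
```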
There are many more sound-alike algorithms. Note that these are language specific, and even accent specific since they are based on how words sound.
It is important to understand the “shape” of your chosen metric when setting a threshold.
If you were using `jaro_winkler` and chose 0.5 as your threshold, you’d find far too many false positives.
This code takes the first 2,000 words from a text file and computes Jaro-Winkler scores for a random sample of 10,000 pairs:
```python
import itertools
import random

import altair as alt
import polars as pl
from jellyfish import jaro_winkler_similarity


def graph_jw(list1, list2, sample_size=10000):
    total_pairs = len(list1) * len(list2)
    # sample pairs
    pairs = random.sample(list(itertools.product(list1, list2)), sample_size)

    all_similarities = [
        jaro_winkler_similarity(word1, word2)
        for word1, word2 in pairs
        if jaro_winkler_similarity(word1, word2) > 0
    ]
    df = pl.DataFrame({"similarity": all_similarities})
    chart = alt.Chart(df).mark_bar().encode(
        alt.X("similarity:Q", bin=alt.Bin(maxbins=50), title="Jaro-Winkler Similarity"),
        alt.Y("count():Q", title="Count"),
    ).properties(
        width=600,
        height=400,
        title="Distribution of Jaro-Winkler Similarities",
    )
    return chart


all_words = open("shakespeare.txt").read().split()
words = all_words[:2000]
chart = graph_jw(words, words, sample_size=10000)
chart.show()
```
We can see the vast majority of scores fall between 0.4 and 0.6 despite not matching. This suggests you’ll want a threshold well above 0.6, though the right value still depends on your specific needs.
The seminal paper on record linkage: A Theory for Record Linkage by Ivan Fellegi and Alan Sunter
De-anonymization
The Wikipedia article on data re-identification contains lots of examples.