Goals
If you are dealing with two different data sets, the ideal scenario is that they share a unique identifier.
Imagine a business registry:
| ID | Business Name | Address |
|---|---|---|
| 123 | ACME, Inc. | 1 ACME Way |
| 456 | Lumen Enterprises | 4 Kier Circle |
| 789 | Yoyodyne Industries | 8 Dimension Ct |
And a separate fines table:
| Business ID | Business Name | Fine |
|---|---|---|
| 123 | ACME, Inc. | OSHA Violation 4/11/2024 … |
| 123 | ACME Incorporated | OSHA Violation 4/11/2024 … |
| 456 | Lumen | Overtime Violation |
| 456 | Lumen Corp. | Overtime Violation |
| 456 | Lumen Enterprises | Overtime Violation |
The business name may often be recorded differently, but the presence of the identifier helps us reconcile these records.
In relational databases the unique identifier is called a primary key and we’ll borrow that term.
In real world data we often do not have a shared key between our data sets, especially if they are created by different organizations.
Sometimes we are able to cobble together a set of fields that are unique across data sets, a compound key. But, given the variability in how fields like names and addresses are often represented, we often must resort to string manipulation & probabilistic approaches.
Record linkage is the term for linking together two or more separate data sets by examination of their records.
Entity resolution is the term for resolving individual entities against a data set: determining that “ACME CORP.” refers to business ID 123 in our example above.
De-duplication can also be a special case of record linkage, essentially matching a data set against itself: determining that “ACME Inc.” and “ACME Incorporated” are the same entity.
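As a minimal sketch of de-duplication, assuming a `records` list and a hypothetical `is_same_entity()` comparison function, we can compare a data set against itself. Using `itertools.combinations` avoids comparing a record to itself or repeating a pair:

```python
import itertools

# records and is_same_entity() are assumptions for illustration: a list of
# rows and a function returning True when two rows appear to match
duplicates = []
for record1, record2 in itertools.combinations(records, 2):
    if is_same_entity(record1, record2):
        duplicates.append((record1, record2))
```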
Let’s say we have a list of restaurants whose owners have given money to a political campaign, and a list of restaurants that have been fined by the city.
As part of an analysis, you want to see if there is a difference in the average fine amount for businesses that have given money to the campaign versus those that have not.
| Business Name | Address | Amount Donated |
|---|---|---|
| Little Coco’s | 123 Main St | $1000 |
| Taqueria Habanero | 901 17th St | $250 |
| Red Derby | 917 Queen Anne Ave | $2000 |
| Susana’s | 8312 Park Blvd | $500 |
| ID | Business Name | Address | Fine Amount |
|---|---|---|---|
| 1230 | Little Coco’s Restaurant LLC | 123 Main Street | $100 |
| 2901 | Taqueria Habañero | 901 17th Street #1 | $100 |
| 3014 | IHOP | 917 Queen Ann Avenue | $5000 |
| 3207 | Susanna’s | 8312 Park Blvd | $100 |
There’s no shared key between the two tables. The campaign finance & restaurant fine systems would typically not use the same IDs.
Furthermore, there are inconsistencies in how the restaurants and addresses are written. It would appear likely that “Little Coco’s” and “Little Coco’s Restaurant LLC” are the same restaurant, but trying to match on columns directly will not work.
With two data sets A and B, the basic approach is:

1. Identify a set of comparable fields between the two data sets.
2. Clean and standardize the data in those fields.
3. Calculate the similarity between each pair of records.
4. Consider all pairs with similarity above a chosen threshold to be matches.
In Python this might look like:
```python
matches = []
for item1 in dataset1:
    for item2 in dataset2:
        # thresholds were chosen arbitrarily
        if (item_similarity(
                clean_name(item1.name),
                clean_name(item2.name)
            ) > 0.8
            and item_similarity(
                clean_addr(item1.address),
                clean_addr(item2.address)
            ) > 0.9):
            matches.append((item1, item2))
```
Note that in code we are using cleaning functions on individual items, despite the more formal notation suggesting we are always acting on the full tuple.
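As a rough sketch of what `clean_name` and `clean_addr` might do (the specific substitutions below are illustrative assumptions, not a complete standardization scheme):

```python
import re

def clean_name(name):
    # lowercase, strip punctuation, and drop common corporate suffixes
    name = re.sub(r"[^\w\s]", "", name.lower())
    name = re.sub(r"\b(llc|inc|incorporated|corp|corporation)\b", "", name)
    return " ".join(name.split())

def clean_addr(addr):
    # lowercase, strip punctuation, and expand common abbreviations
    addr = re.sub(r"[^\w\s]", "", addr.lower())
    expansions = {"st": "street", "ave": "avenue", "blvd": "boulevard"}
    return " ".join(expansions.get(word, word) for word in addr.split())
```

With this, `clean_name("Little Coco's Restaurant LLC")` and `clean_name("Little Coco's")` differ only by the word “restaurant”, a much easier comparison to make.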
itertools.product
When you have a nested loop like the one above, you can use `itertools.product` to make it a bit more concise. `itertools.product` takes two iterables and returns an iterable of tuples of the items from each iterable. (In relational terms, the cross product of the two iterables.)
```python
import itertools

list1 = ["a", "b", "c"]
list2 = [1, 2, 3]
for item1, item2 in itertools.product(list1, list2):
    print(item1, item2)
```
```
a 1
a 2
a 3
b 1
b 2
b 3
c 1
c 2
c 3
```
```python
import itertools

matches = []
for item1, item2 in itertools.product(dataset1, dataset2):
    if (item_similarity(item1.name, item2.name) > 0.8
            and item_similarity(item1.address, item2.address) > 0.9):
        matches.append((item1, item2))
```
The same amount of work is done; the algorithm is still O(n × m) in the sizes of the two data sets.
In the algorithm above we’re setting a threshold. Our results cannot be known to be perfect, and in practice we need to set a threshold high enough to avoid too many false positives. Setting a threshold too high, however, may mean missing matches that would otherwise have been valid.
There’s no single correct threshold; the value will depend upon the data, the metrics in use, and the tolerance for false positives in the intended usage of the linked data.
A more principled approach is to use a probabilistic model to determine the likelihood that two records are linked.
This is a bit more complicated, but it has the advantage that it can be used to determine the optimal threshold for a specific application.
For example, if you are dealing with linking medical records, you might want to be conservative in what you link to avoid accidentally linking two patients who are not the same person. In this case, you would want to configure your model to have a low false positive rate.
On the other hand, if you are doing an analysis across decennial census records, you might want to be more aggressive in linking records to get a more complete picture of the population. In this case, you would want to configure your model to have a low false negative rate.
Probabilistic approaches require already labeled data to train the model. This is often done by having a human look at a sample of the records and label them as linked or not linked. This data will form a baseline that the model’s predictions can be compared against.
We have two lists of people that we’d like to link: `A` and `B`.
We take a sample of the data, `A_sample` and `B_sample`, and label the pairs (`A_sample × B_sample`) as linked or not linked. Call the matched pairs `M` and the unmatched pairs `U`.
We then define a scoring function for each link key. For example, we might have a scoring function for names that returns “good match”, “intermediate match”, or “no match”. We’ll call this `name_scoring_function`. We could also have a similar function for address, year of birth, etc.
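A sketch of what such a scoring function might look like; bucketing Jaro-Winkler similarity at these particular cutoffs is an illustrative assumption, not a tuned choice:

```python
from jellyfish import jaro_winkler_similarity

def name_scoring_function(name1, name2):
    # the cutoffs here are illustrative assumptions, not tuned values
    similarity = jaro_winkler_similarity(name1, name2)
    if similarity > 0.9:
        return "good match"
    elif similarity > 0.7:
        return "intermediate match"
    return "no match"
```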
If we get two records `a` and `b` from `A` and `B` respectively, we’d use our scoring function:

```python
name_score = name_scoring_function(a["name"], b["name"])
# name_score is either "good match", "intermediate match", or "no match"
```
Let’s say we get “intermediate match”.
We need to know the probability of getting such a score among the matches and among the non-matches. We can compute this from the labeled data.
We determine the probability of getting such a value among the matches in `M`, `P(name has "intermediate" | M)`, by looking at the fraction of matches that have such a value for the name field.
Similarly, we determine the probability of getting such a value among the non-matches, `P(name has "intermediate" | U)`.
We compute such probabilities for address as well. Suppose the result of `address_scoring_function(a, b)` is `"good match"`. We then compute `P(address has "good match" | M)` and `P(address has "good match" | U)`.
Now we are ready to find the ratio of the probabilities, assuming independence between name and address:

R = ( P(name has "intermediate" | M) * P(address has "good match" | M) ) / ( P(name has "intermediate" | U) * P(address has "good match" | U) )
Based again on the training data, we determine thresholds T1 and T2 such that when the ratio is above T1 we assign (a, b) as a match, and when it is below T2 as a non-match. Pairs falling in between are sent for manual clerical review.
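A sketch of this computation, assuming `M` and `U` are lists of labeled `(a, b)` record pairs and reusing the hypothetical scoring functions above (a real implementation would also guard against zero probabilities in the denominator):

```python
def score_probability(pairs, scoring_function, field, score):
    # fraction of labeled pairs whose field receives the given score
    hits = sum(1 for a, b in pairs if scoring_function(a[field], b[field]) == score)
    return hits / len(pairs)

def match_ratio(a, b, M, U):
    name_score = name_scoring_function(a["name"], b["name"])
    addr_score = address_scoring_function(a["address"], b["address"])
    # assuming independence between name and address
    return (
        score_probability(M, name_scoring_function, "name", name_score)
        * score_probability(M, address_scoring_function, "address", addr_score)
    ) / (
        score_probability(U, name_scoring_function, "name", name_score)
        * score_probability(U, address_scoring_function, "address", addr_score)
    )
```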
Another approach is to use machine learning to determine the optimal threshold. This is particularly useful when there are a large number of fields and you don’t have a clear idea of how they help in record linkage.
It’s beyond the scope of what we can cover here, but relatively easy to implement using a library like scikit-learn.
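As a minimal sketch of that idea with scikit-learn, assuming `labeled_pairs` is a list of `(record_a, record_b, is_match)` tuples from the manual labeling step; the choice of logistic regression and of Jaro-Winkler features is an assumption for illustration:

```python
from jellyfish import jaro_winkler_similarity
from sklearn.linear_model import LogisticRegression

def features(a, b):
    # similarity scores for each comparable field become model features
    return [
        jaro_winkler_similarity(a["name"], b["name"]),
        jaro_winkler_similarity(a["address"], b["address"]),
    ]

X = [features(a, b) for a, b, _ in labeled_pairs]
y = [is_match for _, _, is_match in labeled_pairs]
model = LogisticRegression().fit(X, y)

# the fitted model yields a match probability for any candidate pair
probability = model.predict_proba([features(record_a, record_b)])[0][1]
```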
If we need to compare every record in one data set against every record in the other, the number of comparisons grows very quickly.
With just 1000 items in each dataset, you’re doing 1,000,000 calls to your `item_similarity()` function.
But `item_similarity()` is going to be fast, right?
As we’ll see shortly, `item_similarity` may itself be expensive; many string comparison algorithms are quadratic in the lengths of the strings being compared.
At larger scales we need a way to avoid making every comparison.
Sometimes there are fields that we are confident are very important for a match. Perhaps 99.9% of matches have the correct state. If we find that two records have a different state, we can be confident that they are not a match and avoid doing the rest of the work.
With 1 million records in each dataset, we’re looking at a trillion comparisons.
If we only compared records to those in the same state, we can view our 1,000,000 records as 50 groups averaging 20,000 each.
This would be 50 × 20,000 × 20,000 = 20,000,000,000 comparisons.
Phew, only 20 billion calls, far better than the expected 1 trillion. A 98% reduction in the number of expensive comparisons.
This is called blocking, considering the data in smaller blocks to reduce expensive comparisons between records that will never match.
The blocking key does not need to be a single field, or even a whole field. For example, you could use the first letter of a person’s last name combined with their zip code, if you had reason to believe that those were unlikely to be incorrect.
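A sketch of blocking in code, assuming each record has a `state` attribute and reusing the earlier `item_similarity` comparison:

```python
import itertools
from collections import defaultdict

def block_by(dataset, key_func):
    # group records into blocks that share a blocking key
    blocks = defaultdict(list)
    for record in dataset:
        blocks[key_func(record)].append(record)
    return blocks

key_func = lambda record: record.state  # or e.g. (record.last_name[0], record.zip_code)
blocks1 = block_by(dataset1, key_func)
blocks2 = block_by(dataset2, key_func)

matches = []
# only compare records whose blocking keys match
for key in blocks1.keys() & blocks2.keys():
    for item1, item2 in itertools.product(blocks1[key], blocks2[key]):
        if item_similarity(item1.name, item2.name) > 0.8:
            matches.append((item1, item2))
```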
Record linkage can be used to de-anonymize data.
With enough data and/or unique attributes, disparate data on individuals thought to be anonymous can be stitched back together.
This can lead to harmful privacy violations, identifying people’s political beliefs or other personal behavior.
When linking records, it’s important to consider the privacy implications of the data you’re working with, particularly when individuals are involved.
As always, practice responsible disclosure: if you uncover a privacy issue, report it to the data owner with suggestions on how to remedy it. Careless de-anonymization can have serious consequences for vulnerable individuals.
Similarity functions must be tailored to the data being compared and intended usage.
For numeric data we can use statistical methods, for geospatial data we can use various distance comparisons, but strings present a challenge.
We need a way to quantify the similarity between two strings. There are many approaches we can examine:
The Hamming distance computes the number of characters that differ between two strings:
```python
def hamming_distance(s1, s2):
    # ensure length of s1 >= s2 using an in-place swap
    if len(s2) > len(s1):
        s1, s2 = s2, s1
    # distance is difference in length + differing chars
    distance = len(s1) - len(s2)
    for i, c in enumerate(s2):
        if c != s1[i]:
            distance += 1
    return distance
```
It has a simple implementation with fast O(n) performance.
```python
print(hamming_distance("Taqueria Habañero", "Taqueria Habanero"))
print(hamming_distance("Susana's", "Susanna's"))
```
```
1
4
```
Both of these pairs are one character off, yet the Hamming distances are 1 and 4, respectively.
Hamming distance only handles replacements, not insertions or deletions. Since it is not normalized to the length of the string, it works best on strings of the same length.
Levenshtein distance is a more general version of counting the number of edits between two strings, often called edit distance.
It allows for counting insertions, deletions, and replacements as “edits”.
The bad news is that the algorithm is O(m × n) in the lengths of the two strings, noticeably more expensive than Hamming distance.
```python
from jellyfish import levenshtein_distance

print(levenshtein_distance("Taqueria Habañero", "Taqueria Habanero"))
print(levenshtein_distance("Susana's", "Susanna's"))
```
```
1
1
```
Both of our comparisons now give one, as we might expect.
Taking this idea a step further, Damerau-Levenshtein considers a transposition a single edit.
This means `damerau("abc", "acb") == 1`, recognizing the two characters were transposed, not replaced.
This algorithm can be even more expensive to compute than plain Levenshtein.
```python
from jellyfish import damerau_levenshtein_distance

print(levenshtein_distance("Fish", "Fsih"))
print(damerau_levenshtein_distance("Fish", "Fsih"))
```
```
2
1
```
But we still have the issue that strings of different lengths are not comparable.
```python
print(damerau_levenshtein_distance("Little Coco's", "Little Coco's Restaurant LLC"))
```
```
15
```
The Jaro similarity is a string similarity measure based on the number and ordering of characters the two strings share.
Its key advantage is that it normalizes the result by the lengths of the strings, so the score always falls between 0 and 1.
```python
from jellyfish import jaro_similarity

print(jaro_similarity("Taqueria Habañero", "Taqueria Habanero"))
# 0.9608
print(jaro_similarity("Susana's", "Susanna's"))
# 0.9630
print(jaro_similarity("Little Coco's", "Little Coco's Restaurant LLC"))
# 0.8214
print(jaro_similarity("Little Coco's", "Susanna's"))
# 0.4587
```
```
0.9607843137254902
0.9629629629629629
0.8214285714285715
0.4586894586894587
```
And the Jaro-Winkler similarity is a modification of the Jaro similarity that gives more favorable ratings to strings that match from the beginning.
This algorithm was developed by a statistician at the US Census Bureau and is useful for name-like strings. It recognizes that misspellings are more likely to occur toward the end of names, and so it weights differences early in the string more heavily.
```python
from jellyfish import jaro_winkler_similarity

print(jaro_winkler_similarity("Taqueria Habañero", "Taqueria Habanero"))
# 0.9765
print(jaro_winkler_similarity("Susana's", "Susanna's"))
# 0.9778
print(jaro_winkler_similarity("Little Coco's", "Little Coco's Restaurant LLC"))
# 0.8929
print(jaro_winkler_similarity("Little Coco's", "Susanna's"))
# 0.4587
```
```
0.9764705882352941
0.9777777777777777
0.8928571428571429
0.4586894586894587
```
An alternative approach is to shorten the string by encoding it in different ways.
A generic version of this is known as the Match Rating Approach. It converts each string into a short code of characters based upon phonetic similarity, and compares the codes instead of the original strings.
Example:
```python
from jellyfish import match_rating_codex, match_rating_comparison

print(match_rating_codex("Taqueria Habañero"))
# TQRBÑR
print(match_rating_codex("Susana's".replace("'", "")))  # codex cannot have non-alpha characters
# SSNS
print(match_rating_comparison("Taqueria Habañero", "Taqueria Habanero"))
# True
print(match_rating_comparison("Susanas", "Susannas"))
# True
```
```
TQRBÑR
SSNS
True
True
```
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.
```python
soundex('Ann') == soundex('Anne') == 'A500'
soundex('Rupert') == soundex('Robert') == 'R163'
```
Metaphone is more complex than Soundex, but it’s also more accurate.
```python
metaphone('Klumpz') == metaphone('Clumps') == 'KLMPS'
```
Metaphone is particularly good at handling soundalikes like C/K and PH/F.
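Both algorithms are available in jellyfish; a quick runnable check of the examples above:

```python
from jellyfish import metaphone, soundex

print(soundex("Ann"), soundex("Anne"))           # A500 A500
print(soundex("Rupert"), soundex("Robert"))      # R163 R163
print(metaphone("Klumpz"), metaphone("Clumps"))  # KLMPS KLMPS
```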
There are many more sound-alike algorithms. Note that these are language specific, and even accent specific since they are based on how words sound.
It is important to understand the “shape” of your chosen metric when setting a threshold.
If you were using `jaro_winkler` and chose 0.5 as your threshold, you’d find far too many false positives.
This code takes the first 2,000 words from a text file and computes Jaro-Winkler scores for a random sample of 10,000 pairs:
```python
import itertools
import random

import altair as alt
import polars as pl
from jellyfish import jaro_winkler_similarity


def graph_jw(list1, list2, sample_size=10000):
    total_pairs = len(list1) * len(list2)
    # sample pairs
    pairs = random.sample(list(itertools.product(list1, list2)), sample_size)

    all_similarities = [
        jaro_winkler_similarity(word1, word2)
        for word1, word2 in pairs
        if jaro_winkler_similarity(word1, word2) > 0
    ]
    df = pl.DataFrame({"similarity": all_similarities})
    chart = alt.Chart(df).mark_bar().encode(
        alt.X("similarity:Q", bin=alt.Bin(maxbins=50), title="Jaro-Winkler Similarity"),
        alt.Y("count():Q", title="Count"),
    ).properties(
        width=600,
        height=400,
        title="Distribution of Jaro-Winkler Similarities",
    )
    return chart


all_words = open("shakespeare.txt").read().split()
words = all_words[:2000]
chart = graph_jw(words, words, sample_size=10000)
chart.show()
```
We can see the vast majority of scores fall between 0.4 and 0.6 despite not matching. This suggests you’ll want a threshold well above 0.6, though the right value still depends on your specific needs.
The seminal paper on record linkage: A Theory for Record Linkage by Ivan Fellegi and Alan Sunter
De-anonymization
The Wikipedia article on data re-identification contains lots of examples.