Record linking and fuzzy matching are terms for the process of joining two data sets that do not share a common unique identifier. Examples include joining files based on people's names, or merging data that has only an organization name and address.
Everyone and their uncle seems to have a solution, but none of them just works. Here is a good article on the topic. It covers the fuzzymatcher and recordlinkage modules.
https://pbpython.com/record-linking.html
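To make the problem concrete, here is a minimal sketch of linking two lists that have no shared key. It is not taken from the article: it uses only the standard library's difflib, the sample names are made up, and link_records and its cutoff of 0.6 are hypothetical choices for illustration.

```python
import difflib

# Hypothetical sample data: two sources that spell the same people differently.
customers = ["Jonathan Smith", "Maria Garcia", "Wei Chen"]
invoices = ["Jon Smith", "Maria Garcia-Lopez", "Priya Patel"]


def link_records(left, right, cutoff=0.6):
    """Map each name in `right` to its closest name in `left`, or None."""
    links = {}
    for name in right:
        # get_close_matches returns candidates sorted by similarity, best first;
        # anything scoring below `cutoff` is discarded.
        best = difflib.get_close_matches(name, left, n=1, cutoff=cutoff)
        links[name] = best[0] if best else None
    return links


links = link_records(customers, invoices)
for invoice_name, customer_name in links.items():
    print(f"{invoice_name!r} -> {customer_name!r}")
```

A name with no sufficiently close counterpart maps to None, which is the case that needs manual review. This approach compares every pair of names, so it only suits small lists.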
_____
A simpler solution is mentioned in the comments on the page.
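# pip install string-grouper
import pandas as pd
from string_grouper import match_strings

# The CSV has no header row, so name the two columns ourselves.
df = pd.read_csv("corporate_list.csv", encoding="ISO-8859-1", header=None)
df.columns = ["serial", "company_name"]

# Match the company names against each other within the same column.
ndf = match_strings(df["company_name"])

# Drop rows where a name matched itself and save the remaining pairs for review.
ndf[ndf["left_index"] != ndf["right_index"]].to_csv("to_study.csv")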
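The last step keeps only pairs of different rows whose names look alike, which are the likely duplicates. The sketch below shows the same idea in plain Python so you can see what is being filtered. It is an illustration only: it swaps in difflib's SequenceMatcher ratio as the similarity measure, which is not how string_grouper scores pairs, and the company names, near_duplicates and the 0.7 threshold are made up for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical company names containing two likely duplicates.
names = ["Acme Corp", "ACME Corporation", "Globex Inc", "Globex Inc.", "Initech"]


def near_duplicates(values, threshold=0.7):
    """Return (left_index, right_index, score) for similar-looking pairs."""
    pairs = []
    # combinations() never pairs a row with itself, so there are no
    # self-matches to filter out afterwards.
    for (i, a), (j, b) in combinations(enumerate(values), 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs


pairs = near_duplicates(names)
for left_index, right_index, score in pairs:
    print(names[left_index], "<->", names[right_index], score)
```

Comparing every pair gets slow quickly as the list grows, which is one reason to reach for a dedicated library on real data.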