Match names from two different columns - Comparing strings of different lengths

2 views (last 30 days)
I have a cell variable A with about 4000 rows and 5 columns:
C1 c2 c3 c4 c5
A={1997 'Michelle Applebaum' 'Salmon' 'BASIC' 'STEEL'
1997 'Jambardella Arnold' 'Butter' 'BASIC' 'STEEL'
1999 'Cai von Rumohr' 'Cow' 'CAPITAL' 'AEROPLANE'
2011 'Pierre Smith' 'Milk' 'GOOD' 'AEROPLANE'
2004 'Jinder Kauffman' 'Star' 'CAPITAL' 'PHONE'
And I have a second cell variable B with about 200 000 rows and 8 columns:
c1 c2 c3 c4 c5 c6 c7 c8
B={2013 29 2225 'ELD1' 29 'SMITH P' 4817 'HAYWOOD'
2013 70 2628 'CCRN' 70 'FRANCE J' 11688 'CANTORFZ'
2013 02 952 'ABFS' 02 'KAUFFMAN J' 356 'BUCK'
2013 20297 157 'DUK' 20297 'ARNOLD J' 1382 'LAWRENCE'
2013 78362 260 'APA' 78362 'ARIF A' 2213 'STIFEL'}
The focus is on c2 of A and on c6 of B.
  • C2 of A gives the complete first name and last name (and sometimes other names in between) of an individual.
  • C6 of B gives the last name (in capital letters) and only the initial of the first name of an individual.
I am trying to match both cells . So in case both last names are the same (or silmilar) I would like to add the columns of B to A.
C1 c2 c3 c4 c5
A={1997 'Michelle Applebaum' 'Salmon' 'BASIC' 'STEEL'
1997 'Jambardella Arnold ' 'Butter' 'BASIC' 'STEEL' 2013 20297 157 'DUK' 20297 'ARNOLD J' 1382 'LAWRENCE'
1999 'Cai von Rumohr' 'Cow' 'CAPITAL' 'AEROPLANE'
2011 'Pierre Smith' 'Milk' 'GOOD' 'AEROPLANE' 2013 29 2225 'ELD1' 29 'SMITH P' 4817 'HAYWOOD'
2004 'Jinder Kauffman' 'Star' 'CAPITAL' 'PHONE' 2013 02 952 'ABFS' 02'KAUFFMAN J' 356 'BUCK'
I never matched cells comparing names, and the function has to be case insensitive, ignore points, commas or spaces, and also ignore if the same letter appears twice in a row. I say this because variable A was wrote by me by hand, so it's possible the last name is not exactly equal in both variables. `strrep & strtrim` functions can help solving the problem.
Can someone help me please? Thank you.

Accepted Answer

Guillaume
Guillaume on 6 Sep 2014
You've not specified what happened to compound surnames (like 'von Ruhmor' in your example) or first names (like 'Jean-Pierre'). it also appears that the fields in B are fixed width so what happens to very long surname? So, I've assumed that only the last part of a compound surname is in B, only the initial of compound first name is in B and for very long surname B is larger. I'm also ignoring case where names don't match exactly:
%extract names from A and transform in uppercase 'SURNAME 1STLETTEROFNAME':
namesfromA = upper(regexprep(A(:,2), '([A-Z]).* ([A-Z][a-z]+)', '$2 $1'));
%x=extract names from B and remove extra spaces:
namesfromB = regexprep(B(:, 6), '([A-Z]+) +([A-Z])', '$1 $2');
%find the intersection and postion of matches. Assume there is only ever one match
[~, ia, ib] = intersect(namesfromA, namesfromB);
%add matches to A:
A(ia, 6:13) = B(ib, :);
If surnames don't match exactly, then it gets a lot more complicated. There are a numbers of algorithm ( Damerau-Levenshtein, Hamming) that allows you to find how two strings are similar, but I don't think any of them are built-in in matlab.
  1 Comment
Maria
Maria on 6 Sep 2014
Thank you very much for all the information. I know that getting all the matches will be a problem. I was first thinking of doing something like this:
A(:, end+1) = lower(strrep(strtrim(A(:,2)),' ',''));
B(:,end+1)= lower(strrep(strtrim(B(:,6)),' ',''));
But then I cannot use strcmpi or strncmpi because I am working with different sizes & content even if there is a match. So I was trying to build something that for instance would do a match if they find 5 letters in a row that are equal.
But anyway, your code works and is really helpful! I will read more about the numbers of algorithm. Thank you again.

Sign in to comment.

More Answers (1)

Stephen23
Stephen23 on 6 Sep 2014
You are trying to match words in two cell arrays, including allowing for punctuation characters and possibly slightly different spellings... not an easy task to perform! You will also need to consider initials and name order.
It really depends on how different the strings might be: if the differences only include repeated characters, then you might be able to get away with creating some regexp pattern to help with that.
If the differences are more complicated, then you need to find a similarity measure. Some common similarity measures are the Rabin-Karp algorithm, the Levenshtein distance, the Needleman–Wunsch algorithm, or the Hamming distance.
You will also find submissions on MATLAB File Exchange that support several of these measures for analyzing string similarity, and plenty of examples online.
  1 Comment
Maria
Maria on 6 Sep 2014
Thank you very much for all the information. I will do some research about the topic of similarity measure. Thank you.

Sign in to comment.

Categories

Find more on Get Started with MATLAB in Help Center and File Exchange

Products

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!