Comparing cells in Excel

Author
Discussion

t1grm

Original Poster:

4,655 posts

285 months

Tuesday 31st May 2005
quotequote all
I have a spreadsheet with about 14,000 rows in it and I’d like to delete rows where the value in one column is a duplicate.

The trouble is the column in question contains a user entered text description which is sometimes 95% the same as another cell but not exactly the same. I’d like to remove these duplicates as well. For example I could have:

RILLENKUGELLAGER 6005 25x47x12
RILLENKUGELLAGER 6005 25 X 47 X 12
Rillenkugellager 6005 25 - 47 – 12

All of the above would be considered duplicates in my scenario but obviously a straight cell for cell comparison would not pick them up as such. So far I’m using the following formula to clean up the data and then checking for duplicates in the resulting column:

=SUBSTITUTE(TRIM(UPPER(A1))," ","" )

This strips out any spaces and converts it all to upper case which gets rid of any case sensitivity issues and instances where a user’s hit the space bar twice between words. But it does not completely resolve my problem.

Ideally I’d like to do some kind of scoring system based on the ASCII code of each letter. If more than 90% of the letters match letters in the other cell in the same order then it’s a duplicate.

Does anyone know how I could do this using Excel formulas?

>> Edited by t1grm on Tuesday 31st May 12:25

victormeldrew

8,293 posts

278 months

Tuesday 31st May 2005
quotequote all
Have you tried Soundex? Not exactly what you want, but the technique may give you ideas about fuzzy matching and how you could implement it. Try googling EXCEL SOUNDEX, there's plenty there.

Jay-Aim

598 posts

242 months

Tuesday 31st May 2005
quotequote all
I think you're over complicating it!

assuming A1 is the cell in the row

2 choices:
1) An if formula in a spare colum: =if(a1="yourtext",1,0)

you can have up to 7 nested ifs so that's a few choices

use autofilter to select all 1's and delete

2) insert a module with a macro: pm me if you want that. This will give you the option to include as many variants as you wish

t1grm

Original Poster:

4,655 posts

285 months

Tuesday 31st May 2005
quotequote all
Jay-Aim said:
I think you're over complicating it!

assuming A1 is the cell in the row

2 choices:
1) An if formula in a spare colum: =if(a1="yourtext",1,0)

you can have up to 7 nested ifs so that's a few choices

use autofilter to select all 1's and delete

2) insert a module with a macro: pm me if you want that. This will give you the option to include as many variants as you wish


Getting rid of the duplicates is no problem. I just sort by the column I want to check duplicates for and compare each row to the row above using an IF formula:

=IF(A3=A2,”Dup”,”Non-Dup”)

Then I filter on “Dup” and delete all those rows.

The problem is that only works for exact matches. I need to “scrub” the data as described in my first post before I can filter out the duplicates.

The soundex system looks good. However will it work for non-English words? I see the code is generated by allocating a number to groups of letters. However different groups of letters would sound different together in different languages so I’m wondering if the letter groupings suggested are for English only.

Jay-Aim

598 posts

242 months

Tuesday 31st May 2005
quotequote all
how about if you example is fairly typical of the types of text

from the results of your formula

=if(left(a1,len(a1)-6)=left(a2,len(a2)-6),if(right(a1,2)=right(a2,2),1,0),0)

where you basically comparing the left side of the string before any divider and the last 2 digits

t1grm

Original Poster:

4,655 posts

285 months

Wednesday 1st June 2005
quotequote all
I think what I´ll do is strip out all non alphanumeric charachters and spaces and then convert to uppercase. Thanks for your help.