Comparing Two Columns Of A Csv And Outputting String Similarity Ratio In Another Csv

I am very new to Python programming. I am trying to take a CSV file that has two columns of string values and want to compare the similarity ratio of the strings between both columns.

Solution 1:

Here is another way to get this done using pandas:

Suppose your CSV data looks like this:

Column1,Column2 
tomato,tomatoe 
potato,potatao 
apple,appel

CODE

import pandas as pd
import difflib as diff

# Read the CSV
df = pd.read_csv('datac.csv')
# Create a new column 'diff' holding the similarity ratio of the first two columns.
# .iloc selects by position, so it works whatever the column headers are.
df['diff'] = df.apply(lambda x: diff.SequenceMatcher(None, x.iloc[0].strip(), x.iloc[1].strip()).ratio(), axis=1)
# Save the DataFrame to CSV; other formats such as Excel or HTML are also available
df.to_csv('outdata.csv', index=False)

Result

Column1,Column2 ,diff
tomato,tomatoe ,0.923076923077
potato,potatao ,0.923076923077
apple,appel ,0.8
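
To see where these numbers come from: the difflib documentation describes ratio() as 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings. The sketch below recomputes that by hand for the three sample rows; ratio_by_hand is a hypothetical helper name used only for illustration.

```python
from difflib import SequenceMatcher


def ratio_by_hand(a, b):
    """Recompute the similarity ratio as 2*M/T from the matching blocks."""
    matcher = SequenceMatcher(None, a, b)
    # M: total number of matched characters across all matching blocks
    matches = sum(block.size for block in matcher.get_matching_blocks())
    # T: combined length of both strings
    total = len(a) + len(b)
    return 2.0 * matches / total if total else 1.0


samples = [("tomato", "tomatoe"), ("potato", "potatao"), ("apple", "appel")]
for a, b in samples:
    # Both values on each line should agree
    print(a, b, ratio_by_hand(a, b), SequenceMatcher(None, a, b).ratio())
```

For tomato and tomatoe, six characters match out of thirteen in total, which gives 12/13, the 0.923 shown in the result.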

Solution 2:

The for loop you're setting up here expects something like an array where you have match_ratio, and judging by the error you're getting, that's not what you have. It looks like you're missing the first argument for difflib.SequenceMatcher, which should probably be None. See 6.3.1 here: https://docs.python.org/3/library/difflib.html

Without that first argument specified, I think you're getting back 0.0 from difflib.SequenceMatcher and then trying to run ratio off of that. Even if you correct your SequenceMatcher call, I think you'll still be trying to iterate on a single float value that ratio is returning. I think you need to call SequenceMatcher inside the loop for each set of values you're comparing.

So you'd wind up with a call more like this in your function: difflib.SequenceMatcher(None, a, b). Or if you'd prefer, since these are named arguments, you could do something like this: difflib.SequenceMatcher(a=a, b=b).
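
As a rough sketch of that idea, the snippet below calls SequenceMatcher once per pair inside the loop and collects the resulting floats. The pairs list is made-up sample data standing in for the rows of your CSV, and similar and match_ratios are only illustrative names.

```python
import difflib

# Hypothetical sample pairs standing in for the two CSV columns
pairs = [("tomato", "tomatoe"), ("potato", "potatao"), ("apple", "appel")]


def similar(a, b):
    """Return the similarity ratio of two strings as a float between 0 and 1."""
    # The first positional argument is isjunk; None means no junk filtering
    return difflib.SequenceMatcher(None, a, b).ratio()


match_ratios = []
for a, b in pairs:
    # One SequenceMatcher call per pair, so we build a list of floats
    match_ratios.append(similar(a, b))

# The keyword form skips isjunk entirely and gives the same result
keyword_ratios = [difflib.SequenceMatcher(a=a, b=b).ratio() for a, b in pairs]

print(match_ratios)
```

Because match_ratios is now a list, iterating over it in a later for loop works, which a single float would not allow.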

Solution 3:

Your sample file looks like it contains markup tags. Assuming you are actually reading a CSV file, the error you are getting is because match_ratio is not an iterable datatype, it's a floating point number -- the return value of your function: similar(). In your code, the function call would have to be contained within a for loop to call it for each a, b string pair. Here's a working example I created that does away with the explicit for loops and uses a list comprehension instead:

import csv
from difflib import SequenceMatcher

path_in = 'csv1.csv'
path_out = 'csv2.csv'

with open(path_in, 'r', newline='') as csv_file_in:
    csv_reader = csv.reader(csv_file_in)
    # Read the header row first so it is not compared as data
    col_headers = next(csv_reader)
    # One output row per input row: both strings plus their similarity ratio
    results = [[row[0],
                row[1],
                SequenceMatcher(None, row[0], row[1]).ratio()]
               for row in csv_reader]

with open(path_out, 'w', newline='') as csv_file_out:
    col_headers.append('Ratio')
    out_rows = [col_headers] + results
    writer = csv.writer(csv_file_out, delimiter=',')
    writer.writerows(out_rows)

In addition to the error you received you might also have run into a problem when instantiating the SequenceMatcher object -- its first parameter wasn't specified in your code. You can find more on list comprehensions and SequenceMatcher in the Python docs. Good luck in your future Python coding.

Solution 4:

You are most likely getting that error because row[0] or row[1] contains NaN values. Try converting them to strings first with str(row[0]) and str(row[1]).
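
Here is a minimal sketch of that fix, assuming the missing cells arrive as float NaN (which is what pandas typically produces for empty CSV fields). safe_ratio is a hypothetical helper name, and the NaN value is constructed by hand so the example runs without any input file.

```python
from difflib import SequenceMatcher


def safe_ratio(a, b):
    """Similarity ratio that tolerates non-string cells by converting with str()."""
    return SequenceMatcher(None, str(a), str(b)).ratio()


# Assumption: a missing cell shows up as a float NaN
missing = float('nan')

try:
    SequenceMatcher(None, 'tomato', missing).ratio()
except TypeError as exc:
    # A float is not a sequence, so SequenceMatcher cannot process it
    print('without str():', exc)

# With str(), NaN becomes the text 'nan' and the comparison goes through
print('with str():', safe_ratio('tomato', missing))
```

Note that the missing value is then compared as the literal text 'nan', so you may prefer to replace missing cells with an empty string or drop those rows, depending on what the ratio should mean for incomplete data.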

Solution 5:

You are getting the error because you are running SequenceMatcher on the list of strings, rather than on the strings themselves. When you do this, you get back a single float value, rather than the list of ratio values I think you were expecting.

If I understand what you are trying to do, then you don't need to read in the rows first. You can simply find the diff ratio as you iterate through the rows.

import csv
import difflib

match_list = []
with open('test.csv', newline='') as f:
    csv_f = csv.reader(f)
    for row in csv_f:
        # Compare the two strings in this row; each ratio becomes a one-element output row
        match_list.append([difflib.SequenceMatcher(a=row[0], b=row[1]).ratio()])

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(match_list)
