
How To Compare Two Columns From The Same Data Set?

I have a data set with 6 columns and 4.5 million rows, and I want to iterate through all the data set to compare the value of the last column with the value of the 1st column for e

Solution 1:

You could try NumPy, but expect a runtime on the order of tens of minutes for the full data set, because the loop below rescans the whole last column once per row.

import numpy as np
import time


x = [['2', 'Jack', '8'],['1', 'Ali', '2'],['4' , 'sgee' , '1'],
     ['5' , 'gabe' , '2'],['100' , 'Jack' , '6'],
     ['7' , 'Ali' , '2'],['8' , 'nobody' , '20'],['9' , 'Al', '10']] 

xArr = np.array(x)
st = time.time()
newList = []
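for kk, i in enumerate(xArr):
    # indices of the rows whose last column equals this row's first column
    matches = np.where(xArr[:, -1] == i[0])[0]
    if len(matches) != 0:
        # join the current row with every matching row
        newList.append(np.concatenate([i, xArr[matches].flatten()]))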

print('Runtime',time.time() - st)
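
One way to avoid rescanning the last column for every row is to index it once up front. The following is a minimal sketch using only the standard library, not the answerer's method; `by_last` and `joined` are illustrative names, and the rows are the sample data from the snippet above. It produces plain lists rather than NumPy arrays.

```python
from collections import defaultdict

rows = [['2', 'Jack', '8'], ['1', 'Ali', '2'], ['4', 'sgee', '1'],
        ['5', 'gabe', '2'], ['100', 'Jack', '6'],
        ['7', 'Ali', '2'], ['8', 'nobody', '20'], ['9', 'Al', '10']]

# Build the index in one pass: last-column value -> rows ending in that value.
by_last = defaultdict(list)
for row in rows:
    by_last[row[-1]].append(row)

# Second pass: look up each row's first value in the index.
joined = []
for row in rows:
    matches = by_last.get(row[0])  # .get() avoids creating empty entries
    if matches:
        combined = list(row)
        for match in matches:
            combined.extend(match)
        joined.append(combined)

for combined in joined:
    print(combined)
```

Each lookup is a dictionary access, so the work grows roughly linearly with the number of rows instead of quadratically.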

Solution 2:

You could try using defaultdict:

from collections import defaultdict
from pprint import pprint

x = [['2', 'Jack', '8'],['1', 'Ali', '2'],['4' , 'sgee' , '1'],
     ['5' , 'gabe' , '2'],['100' , 'Jack' , '6'],
     ['7' , 'Ali' , '2'],['8' , 'nobody' , '20'],['9' , 'Al', '10']]

d = defaultdict(list)

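for v in x:
    # file each row under both its first and its last value
    d[v[0]] += v
    d[v[-1]] += v

# a group longer than one 3-column row means at least two rows share that key
pprint([v for v in d.values() if len(v) > 3])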

Prints:

[['2', 'Jack', '8', '1', 'Ali', '2', '5', 'gabe', '2', '7', 'Ali', '2'],
 ['2', 'Jack', '8', '8', 'nobody', '20'],
 ['1', 'Ali', '2', '4', 'sgee', '1']]
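
Note that the `len(v) > 3` filter is tied to the three-column sample: a group exceeds 3 items only when it holds more than one row. For the six-column data set in the question, the threshold would presumably need to be the row width instead. A hedged sketch of that adjustment, where `group_by_shared_key` and the six-column rows are hypothetical names and data made up for illustration:

```python
from collections import defaultdict
from pprint import pprint


def group_by_shared_key(rows):
    """Group rows by first and last value; keep groups holding more than one row.

    Hypothetical helper: assumes every row has the same number of columns.
    As in the original snippet, a row whose first and last values are equal
    is added to its own group twice.
    """
    width = len(rows[0]) if rows else 0
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]] += row
        groups[row[-1]] += row
    # Longer than one row's width means at least two rows were merged.
    return [group for group in groups.values() if len(group) > width]


# Made-up six-column sample: only the first two rows share a key ('2').
sample = [['2', 'a', 'b', 'c', 'd', '8'],
          ['1', 'a', 'b', 'c', 'd', '2'],
          ['3', 'a', 'b', 'c', 'd', '9']]

pprint(group_by_shared_key(sample))
```

With a fixed threshold of 3, every group in this sample would pass the filter, since even a single six-column row is longer than 3 items.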
