How To Compare Two Columns From The Same Data Set?
I have a data set with 6 columns and 4.5 million rows, and I want to iterate through all the data set to compare the value of the last column with the value of the 1st column for e
Solution 1:
You could try using NumPy, but on a data set this size it will take on the order of tens of minutes, since the loop rescans the whole last column for every row.
import numpy as np
import time

x = [['2', 'Jack', '8'], ['1', 'Ali', '2'], ['4', 'sgee', '1'],
     ['5', 'gabe', '2'], ['100', 'Jack', '6'],
     ['7', 'Ali', '2'], ['8', 'nobody', '20'], ['9', 'Al', '10']]
xArr = np.array(x)

st = time.time()
newList = []
for kk, i in enumerate(xArr):
    # Indices of rows whose last column equals this row's first column
    matches = np.where(xArr[:, -1] == i[0])[0]
    if len(matches) != 0:
        # Append the current row followed by all matching rows, flattened
        newList.append(np.concatenate([i, xArr[matches].flatten()]))
print('Runtime', time.time() - st)
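The cost of the loop above comes from scanning the entire last column once per row. A minimal sketch of one way around that, assuming the rows fit in memory as plain Python lists: index the rows by their last-column value in a single pass, then look up each row's first-column value in that index. The function name `join_rows` and the variable `by_last` are hypothetical, chosen only for illustration; this is not the answerer's code, just a dictionary-based variant that is meant to produce the same groupings on the sample data.

```python
from collections import defaultdict


def join_rows(rows):
    """For each row, append every row whose last column equals its first column."""
    # One pass: map each last-column value to the rows that end with it.
    by_last = defaultdict(list)
    for row in rows:
        by_last[row[-1]].append(row)

    joined = []
    for row in rows:
        # .get() avoids creating empty entries for keys with no match.
        matches = by_last.get(row[0])
        if matches:
            combined = list(row)
            for m in matches:
                combined.extend(m)
            joined.append(combined)
    return joined


x = [['2', 'Jack', '8'], ['1', 'Ali', '2'], ['4', 'sgee', '1'],
     ['5', 'gabe', '2'], ['100', 'Jack', '6'],
     ['7', 'Ali', '2'], ['8', 'nobody', '20'], ['9', 'Al', '10']]

result = join_rows(x)
for r in result:
    print(r)
```

Each row is visited a constant number of times, so the work grows roughly linearly with the number of rows (plus the size of the output) instead of quadratically.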
Solution 2:
You could try using defaultdict:
from collections import defaultdict
from pprint import pprint

x = [['2', 'Jack', '8'], ['1', 'Ali', '2'], ['4', 'sgee', '1'],
     ['5', 'gabe', '2'], ['100', 'Jack', '6'],
     ['7', 'Ali', '2'], ['8', 'nobody', '20'], ['9', 'Al', '10']]

d = defaultdict(list)
for v in x:
    # File each row under both its first-column and last-column value
    d[v[0]] += v
    d[v[-1]] += v

# Keep only keys that collected more than one 3-item row
pprint([v for v in d.values() if len(v) > 3])
Prints:
[['2', 'Jack', '8', '1', 'Ali', '2', '5', 'gabe', '2', '7', 'Ali', '2'],
['2', 'Jack', '8', '8', 'nobody', '20'],
['1', 'Ali', '2', '4', 'sgee', '1']]