Counting Line Frequencies And Producing Output Files
With a text file like this:
a;b
b;a
c;d
d;c
e;a
f;g
h;b
b;f
b;f
c;g
a;b
d;f
How can one read it, and produce two output text files: one keeping only the lines representing the most
Solution 1:
Here is an answer without frozen set.
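import pandas as pd
df = pd.read_csv('text.csv', header=None, names=['A','B'], sep=';')
# Sort the two values in each row so that a;b and b;a end up as the same (A, B) pair.
# Returning a Series keeps the result a DataFrame with columns A and B.
df1 = df.apply(lambda row: pd.Series(sorted(row), index=df.columns), axis=1)
# size() yields a column named 0 after reset_index(); sort by it, most frequent first
df_count = df1.groupby(['A', 'B']).size().reset_index().sort_values(0, ascending=False)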
df_count.columns = ['A', 'B', 'Count']
df_all = pd.concat([df_count.assign(letter=lambda x: x['A']),
df_count.assign(letter=lambda x: x['B'])]).sort_values(['letter', 'Count'], ascending =[True, False])
df_first = df_all.groupby(['letter']).first().reset_index()
top = int(len(df_count) / 4)
df_top_25 = df_count.iloc[:top]
------------older answer --------
Since order does not matter within a pair (a;b counts the same as b;a), you can use a frozenset of each row as the key to count on
import pandas as pd
df = pd.read_csv('text.csv', header=None, names=['A','B'], sep=';')
s = df.apply(frozenset, 1)
df_count = s.value_counts().reset_index()
df_count.columns = ['Combos', 'Count']
Which will give you this:
   Combos  Count
0  (a, b)      3
1  (b, f)      2
2  (d, c)      2
3  (g, f)      1
4  (b, h)      1
5  (c, g)      1
6  (d, f)      1
7  (e, a)      1
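As a quick cross-check of the counts above, here is a minimal sketch that uses only the standard library. It is not part of the original answer; the sample lines are inlined from the question so that no input file is needed, and the names `lines` and `pair_counts` are illustrative.

```python
from collections import Counter

# Sample lines from the question, inlined so the sketch needs no input file.
lines = ["a;b", "b;a", "c;d", "d;c", "e;a", "f;g",
         "h;b", "b;f", "b;f", "c;g", "a;b", "d;f"]

# A frozenset ignores order, so "a;b" and "b;a" become the same key.
pair_counts = Counter(frozenset(line.split(";")) for line in lines)

# Print each pair with its count, most frequent first.
for pair, count in pair_counts.most_common():
    print(sorted(pair), count)
```

The same idea carries over to the pandas version: the frozenset is only there to make the key order-insensitive and hashable.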
To get the highest combo for each letter we will concatenate this dataframe on top of itself and make another column that will hold either the first or second letter.
df_a = df_count.copy()
df_b = df_count.copy()
df_a['letter'] = df_a['Combos'].apply(lambda x: list(x)[0])
df_b['letter'] = df_b['Combos'].apply(lambda x: list(x)[1])
df_all = pd.concat([df_a, df_b]).sort_values(['letter', 'Count'], ascending =[True, False])
And since this is sorted by letter and count (descending) just get the first row of each group.
df_first = df_all.groupby('letter').first()
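The sort-then-take-first step can be tried in isolation. The sketch below assumes the pair counts for the sample data have already been computed and types them in by hand, with the two letters in separate columns `A` and `B` as in Solution 1; that table layout is an assumption made for illustration.

```python
import pandas as pd

# Pair counts for the sample data, written out by hand for illustration.
df_count = pd.DataFrame({
    "A": ["a", "b", "c", "f", "b", "c", "d", "a"],
    "B": ["b", "f", "d", "g", "h", "g", "f", "e"],
    "Count": [3, 2, 2, 1, 1, 1, 1, 1],
})

# Stack the table on itself so each pair appears once under each of its letters.
df_all = pd.concat([
    df_count.assign(letter=df_count["A"]),
    df_count.assign(letter=df_count["B"]),
])

# Within each letter, put the highest count on top.
df_all = df_all.sort_values(["letter", "Count"], ascending=[True, False])

# first() then picks the top row of each letter group.
df_first = df_all.groupby("letter").first().reset_index()
print(df_first)
```

When two pairs tie for a letter, only one of them is kept, so this gives one most frequent pair per letter rather than all of them.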
And to get the top 25%, just use
top = int(len(df_count) / 4)
df_top_25 = df_count.iloc[:top]
And then use .to_csv to output to file.
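A minimal sketch of the writing step follows. The small table stands in for `df_top_25` (or `df_first`), and the output file name and temporary directory are assumptions, not something the original answer specifies. Using `sep=';'` mirrors the input format, and `index=False` leaves out the row numbers.

```python
import os
import tempfile
import pandas as pd

# Hypothetical result table standing in for df_top_25 or df_first.
df_top_25 = pd.DataFrame({"A": ["a", "b"], "B": ["b", "f"], "Count": [3, 2]})

# Write into a temporary directory; the file name is an assumption.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "top_25.csv")

# sep=';' matches the input file; index=False drops the row numbers.
df_top_25.to_csv(out_path, sep=";", index=False)

# Read the file back to see exactly what was written.
with open(out_path) as fh:
    written = fh.read()
print(written)
```

Calling `to_csv` once per result frame, each with its own path, produces the two output files.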