Mode Of Row As A New Column In Pyspark Dataframe
Is it possible to add a new column based on the maximum of previous columns, where the previous columns are string literals? Consider the following dataframe: df = spark.createDataFrame(…)
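The createDataFrame call in the question is cut off. A minimal reconstruction of the input, inferred from the df.show() output in the answers below (column names and values are taken from that output, not from the original question), might look like this:

# NOTE: reconstructed from the show() output below; the original
# createDataFrame call was truncated in the question.
df = spark.createDataFrame(
    [
        (1, 25000, "black", "black", "white"),
        (2, 16000, "red", "black", "white"),
    ],
    ["ID", "cash", "colour_body", "colour_head", "colour_foot"],
)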
Solution 1:
Define a UDF around statistics.mode to compute the row-wise mode with the required semantics:
import statistics
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
# statistics.mode raises StatisticsError when there is no unique mode
# (note: on Python >= 3.8 it returns the first mode instead of raising)
def mode(*x):
    try:
        return statistics.mode(x)
    except statistics.StatisticsError:
        # no unique mode: fall back to the value of the last column
        return x[-1]

mode = udf(mode, StringType())

df.withColumn("max_v", mode(*[col(c) for c in df.columns if 'colour' in c])).show()
+---+-----+-----------+-----------+-----------+-----+
| ID| cash|colour_body|colour_head|colour_foot|max_v|
+---+-----+-----------+-----------+-----------+-----+
|  1|25000|      black|      black|      white|black|
|  2|16000|        red|      black|      white|white|
+---+-----+-----------+-----------+-----------+-----+
Solution 2:
For the general case of any number of columns, the udf solution by @cs95 is the way to go. However, in this specific case where you have only 3 columns, you can simplify the logic using just pyspark.sql.functions.when, which will be more efficient than using a udf.
from pyspark.sql.functions import col, when
def mode_of_3_cols(body, head, foot):
    # If any two of the three values match, that value is the mode;
    # otherwise all three differ and we fall back to the last column.
    return (
        when(
            (body == head) | (body == foot),
            body
        ).when(
            head == foot,
            head
        ).otherwise(foot)
    )
df.withColumn(
    "max_v",
    mode_of_3_cols(col("colour_body"), col("colour_head"), col("colour_foot"))
).show()
#+---+-----+-----------+-----------+-----------+-----+
#| ID| cash|colour_body|colour_head|colour_foot|max_v|
#+---+-----+-----------+-----------+-----------+-----+
#|  1|25000|      black|      black|      white|black|
#|  2|16000|        red|      black|      white|white|
#+---+-----+-----------+-----------+-----------+-----+
You just need to check whether any two of the three columns are equal: if so, that value has to be the mode. If all three differ, there is no unique mode, so return the last column (the same tie-breaking the udf above uses).
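To make that tie-breaking concrete, here is a plain-Python sketch of the same decision chain (mode_of_3 is a hypothetical helper used only for illustration, not part of the answer), evaluated on the two rows from the example:

# Plain-Python mirror of the when/otherwise chain above (illustrative only).
def mode_of_3(body, head, foot):
    if body == head or body == foot:
        return body   # body matches another column -> body is the mode
    if head == foot:
        return head   # head and foot match -> head is the mode
    return foot       # all three differ -> fall back to the last column

mode_of_3("black", "black", "white")  # 'black'  (row 1)
mode_of_3("red", "black", "white")    # 'white'  (row 2, no unique mode)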