
Mode Of Row As A New Column In Pyspark Dataframe

Is it possible to add a new column based on the maximum of previous columns, where the previous columns are string literals? Consider the following dataframe: df = spark.createDataFrame

Solution 1:

Define a UDF around statistics.mode to compute the row-wise mode with the required semantics:

import statistics

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

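def mode(*x):
    try:
        # On Python < 3.8 this raises StatisticsError when there is no unique mode.
        # On Python 3.8+ it returns the first mode encountered instead, so the
        # fallback below never runs for ties.
        return statistics.mode(x)
    except statistics.StatisticsError:
        # No unique mode: fall back to the last column's value.
        return x[-1]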

mode = udf(mode, StringType())

df.withColumn("max_v", mode(*[col(c) for c in df.columns if 'colour' in c])).show()

+---+-----+-----------+-----------+-----------+-----+
| ID| cash|colour_body|colour_head|colour_foot|max_v|
+---+-----+-----------+-----------+-----------+-----+
|  1|25000|      black|      black|      white|black|
|  2|16000|        red|      black|      white|white|
+---+-----+-----------+-----------+-----------+-----+
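The "white" in row 2 relies on statistics.mode raising StatisticsError when every value is distinct, which is the behaviour on Python versions before 3.8. From 3.8 onwards it returns the first mode encountered, so the same UDF would give "red" for that row. A minimal sketch of a version-independent alternative, using collections.Counter, is below. The name row_mode is hypothetical and not part of the original answer; it is shown as plain Python so it can be checked without a Spark session.

```python
from collections import Counter

def row_mode(*values):
    """Return the unique most common value, or the last value if there is a tie."""
    if not values:
        return None
    # Take the two highest counts so a tie for first place can be detected.
    ranked = Counter(values).most_common(2)
    # A unique mode exists when only one distinct value is present, or when
    # the top count is strictly greater than the runner-up's count.
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]
    # No unique mode: fall back to the last column, as in the answer above.
    return values[-1]

print(row_mode("black", "black", "white"))  # black
print(row_mode("red", "black", "white"))    # white
```

Assuming the same imports as above, this function could be wrapped with udf(row_mode, StringType()) and applied to the colour columns in exactly the same way as mode.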

Solution 2:

For the general case of any number of columns, the udf solution by @cs95 is the way to go.

However, in this specific case where you have only 3 columns, you can simplify the logic using just pyspark.sql.functions.when, which will be more efficient than a udf because the expression is evaluated natively by Spark instead of row by row in Python.

from pyspark.sql.functions import col, when

def mode_of_3_cols(body, head, foot):
    return (
        when(
            (body == head) | (body == foot),
            body
        ).when(
            (head == foot),
            head
        ).otherwise(foot)
    )

df.withColumn(
    "max_v", 
    mode_of_3_cols(col("colour_body"), col("colour_head"), col("colour_foot"))
).show()
#+---+-----+-----------+-----------+-----------+-----+
#| ID| cash|colour_body|colour_head|colour_foot|max_v|
#+---+-----+-----------+-----------+-----------+-----+
#|  1|25000|      black|      black|      white|black|
#|  2|16000|        red|      black|      white|white|
#+---+-----+-----------+-----------+-----------+-----+

You just need to check whether any two columns are equal: if so, that value must be the mode. If no two are equal, there is no unique mode, so return the last column.
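The claim above can be sanity-checked without Spark. The sketch below mirrors the branch order of the when/otherwise chain in plain Python and compares it against a brute-force reference for every combination of three colours. The names mode_of_3 and reference_mode are hypothetical helpers for illustration only, and the check does not cover null values.

```python
from collections import Counter
from itertools import product

def mode_of_3(body, head, foot):
    # Same branch order as the when/otherwise chain.
    if body == head or body == foot:
        return body
    if head == foot:
        return head
    return foot

def reference_mode(*values):
    # Unique most common value, else the last value.
    ranked = Counter(values).most_common(2)
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]
    return values[-1]

colours = ("red", "black", "white")
# Collect every triple where the two implementations disagree.
mismatches = [
    row for row in product(colours, repeat=3)
    if mode_of_3(*row) != reference_mode(*row)
]
print(mismatches)  # []
```

An empty list means the three-branch rule agrees with the reference on all 27 triples.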
