
Combine Pivoted And Aggregated Column In Pyspark Dataframe

My question is related to this one. I have a PySpark DataFrame, named df, as shown below.

    date       | recipe | percent | volume
    ----------------------------------------
    2019-01-0

Solution 1:

According to Spark's source code, pivoting has a special branch for the case of a single aggregation.

    val singleAgg = aggregates.size == 1

    def outputName(value: Expression, aggregate: Expression): String = {
      val stringValue = value.name

      if (singleAgg) {
        stringValue // <--- here: the pivot value is used as-is, with no suffix
      } else {
        val suffix = {...}
        stringValue + "_" + suffix
      }
    }
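The effect of that branch can be paraphrased in plain Python (a sketch of the naming rule, not Spark's actual API): with a single aggregation the pivot value alone becomes the column name; with several, each aggregate expression is appended as a suffix.

```python
def pivot_column_names(value, aggregates):
    # Mimics the Spark branch above: a single aggregation yields the bare
    # pivot value; multiple aggregations get a "_<aggregate>" suffix each.
    if len(aggregates) == 1:
        return [value]
    return [f"{value}_{agg}" for agg in aggregates]
```

For example, `pivot_column_names("A", ["avg(percent)"])` gives `["A"]`, while passing two aggregates gives `["A_avg(percent)", "A_avg(volume)"]` — which is exactly why the single-aggregation output carries no hint of the aggregated column.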

I don't know the reason for this behavior, but the only remaining option is to rename the columns afterwards.

Here is a simplified version for renaming:

  def rename(identity: Set[String], suffix: String)(df: DataFrame): DataFrame = {
    val fieldNames = df.schema.fields.map(field => field.name)
    val renamed = fieldNames.map { fieldName =>
      if (identity.contains(fieldName)) fieldName
      else fieldName + suffix
    }

    df.toDF(renamed: _*)
  }

Usage:

rename(Set("date"), "_percent")(pivoted).show()

    +----------+---------+---------+
    |      date|A_percent|B_percent|
    +----------+---------+---------+
    |2019-01-01|    0.025|     0.05|
    |2019-01-02|     0.11|     0.06|
    +----------+---------+---------+
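Since the question itself is about PySpark, the same renaming can be sketched in Python. The column-name computation is plain Python, and applying it to a DataFrame only needs `DataFrame.toDF` (the `pivoted` name is carried over from the answer above):

```python
def suffix_columns(columns, identity, suffix):
    # Append `suffix` to every column name not listed in the `identity` set.
    return [c if c in identity else c + suffix for c in columns]

# Applied to a PySpark DataFrame named `pivoted`, this would read:
#   pivoted.toDF(*suffix_columns(pivoted.columns, {"date"}, "_percent")).show()
```

As with the Scala version, the `identity` set holds the columns to leave untouched (here `date`), and everything else gets the suffix.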
