How To Read Csv File With Additional Comma In Quotes Using Pyspark?

October 21, 2024 Post a Comment

I am having some troubles reading the following CSV data in UTF-16: FullName, FullLabel, Type TEST.slice, 'Consideration':'Verde (Spar Verde, Fonte Verde)', Test, As far as I unde

Solution 1:

You can read as text using spark.read.text and split the values using some regex to split by comma but ignore the quotes (you can see this post), then get the corresponding columns from the resulting array:

from pyspark.sql import functions as F

df = spark.read.text(file_path)

df = df.filter("value != 'FullName, FullLabel, Type'") \
    .withColumn(
    "value",
    F.split(F.col("value"), ',(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)')
).select(
    F.col("value")[0].alias("FullName"),
    F.col("value")[1].alias("FullLabel"),
    F.col("value")[2].alias("Type")
)

df.show(truncate=False)

#+----------+--------------------------------------------------+-----+#|FullName  |FullLabel                                         |Type |#+----------+--------------------------------------------------+-----+#|TEST.slice| "Consideration":"Verde (Spar Verde, Fonte Verde)"| Test|#+----------+--------------------------------------------------+-----+

Update:

For input file in utf-16, you can replace spark.read.text by loading the file as binaryFiles and then convert the resulting rdd into dataframe :

df = sc.binaryFiles(file_path) \
    .flatMap(lambda x: [[l] for l in x[1].decode("utf-16").split("\n")]) \
    .toDF(["value"])

Solution 2:

Just another option as below (if you find it simple):

First read the text file as RDD and replace the ":" with ~:~ and save the text file.

sc.textFile(file_path).map(lambda x: x.replace('":"','~:~')).saveAsTextFile(tempPath)

Next, read the temp path and replace ~:~ with ":"again, but this time as a DF.

from pyspark.sql import functions as F
spark.read.option('header','true').csv(tempPath).withColumn('FullLabel',F.regexp_replace(F.col('FullLabel'),'~:~','":"')).show(1, False)

+----------+-----------------------------------------------+----+
|FullName  |FullLabel                                      |Type|
+----------+-----------------------------------------------+----+
|TEST.slice|Consideration":"Verde (Spar Verde, Fonte Verde)|Test|
+----------+-----------------------------------------------+----+

Python College

How To Read Csv File With Additional Comma In Quotes Using Pyspark?

Solution 1:

Solution 2:

Post a Comment for "How To Read Csv File With Additional Comma In Quotes Using Pyspark?"