How to Read a CSV File with Additional Commas in Quotes Using PySpark?
I am having some trouble reading the following CSV data in UTF-16:
FullName, FullLabel, Type
TEST.slice, 'Consideration':'Verde (Spar Verde, Fonte Verde)', Test,
As far as I unde
Solution 1:
You can read the file as text using spark.read.text and split each line with a regex that splits on commas but ignores commas inside quotes (see this post), then pick the corresponding columns from the resulting array:
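from pyspark.sql import functions as F

# Read each line of the file into a single string column named "value".
df = spark.read.text(file_path)

# Drop the header line, split on commas that are outside double quotes,
# then pick the three fields out of the resulting array.
df = df.filter("value != 'FullName, FullLabel, Type'") \
    .withColumn(
        "value",
        F.split(F.col("value"), ',(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)')
    ).select(
        F.col("value")[0].alias("FullName"),
        F.col("value")[1].alias("FullLabel"),
        F.col("value")[2].alias("Type")
    )

df.show(truncate=False)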
#+----------+--------------------------------------------------+-----+
#|FullName  |FullLabel                                         |Type |
#+----------+--------------------------------------------------+-----+
#|TEST.slice| "Consideration":"Verde (Spar Verde, Fonte Verde)"| Test|
#+----------+--------------------------------------------------+-----+
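To see what the lookahead in the regex above does, here is a minimal sketch in plain Python, outside Spark. The sample line and the helper name split_outside_quotes are assumptions made for illustration, and the strip() call is an addition: the Spark snippet keeps the leading spaces, as the output table shows.

```python
import re

# Match a comma only when an even number of double quotes follows it,
# which means the comma is not inside a quoted section.
PATTERN = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def split_outside_quotes(line):
    """Hypothetical helper: split on commas that sit outside double quotes."""
    # strip() removes the spaces left after each comma (not done in the Spark code).
    return [field.strip() for field in PATTERN.split(line)]


# Assumed sample row, modeled on the data in the question.
sample = 'TEST.slice, "Consideration":"Verde (Spar Verde, Fonte Verde)", Test'
fields = split_outside_quotes(sample)
print(fields)
```

The comma inside "Verde (Spar Verde, Fonte Verde)" is followed by a single quote character, so the lookahead fails there and the label stays in one piece.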
Update:
For an input file in UTF-16, you can replace spark.read.text by loading the file with sc.binaryFiles, decoding its bytes, and then converting the resulting RDD into a DataFrame:
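# Each element x is a (path, bytes) pair: decode the bytes as UTF-16
# and emit one single-field row per line.
df = sc.binaryFiles(file_path) \
    .flatMap(lambda x: [[l] for l in x[1].decode("utf-16").split("\n")]) \
    .toDF(["value"])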
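The decoding step can be checked on its own, without Spark. The sketch below is an illustration under assumptions: the helper name decode_utf16_lines and the sample bytes are made up, and it uses splitlines() instead of split("\n") so that Windows line endings and the trailing empty line do not end up in the rows.

```python
def decode_utf16_lines(raw):
    """Hypothetical helper: decode UTF-16 bytes and return the non-empty lines."""
    # With the generic "utf-16" codec, a leading BOM selects the byte order
    # and is removed from the decoded text.
    text = raw.decode("utf-16")
    # splitlines() handles both "\n" and "\r\n"; empty lines are dropped.
    return [line for line in text.splitlines() if line]


# Assumed sample content; encode("utf-16") prepends a BOM.
raw = 'FullName, FullLabel, Type\r\nTEST.slice, "a, b", Test\r\n'.encode("utf-16")
lines = decode_utf16_lines(raw)
print(lines)
```

With split("\n") the same input would keep a trailing "\r" on each line and produce an empty last element, which would then show up as an extra row in the DataFrame.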
Solution 2:
Here is another option, if you find it simpler:
First, read the text file as an RDD, replace ":" with ~:~, and save the result as a text file.
sc.textFile(file_path) \
    .map(lambda x: x.replace('":"', '~:~')) \
    .saveAsTextFile(tempPath)
Next, read the temp path, this time as a DataFrame, and replace ~:~ with ":" again.
from pyspark.sql import functions as F

spark.read.option('header', 'true').csv(tempPath) \
    .withColumn('FullLabel', F.regexp_replace(F.col('FullLabel'), '~:~', '":"')) \
    .show(1, False)
+----------+-----------------------------------------------+----+
|FullName  |FullLabel                                      |Type|
+----------+-----------------------------------------------+----+
|TEST.slice|Consideration":"Verde (Spar Verde, Fonte Verde)|Test|
+----------+-----------------------------------------------+----+
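The placeholder round trip can be tried in plain Python as well. This is only a sketch under assumptions: it uses Python's csv module as a stand-in for Spark's CSV reader, the helper name parse_with_placeholder is hypothetical, and the sample row has no spaces after the commas.

```python
import csv

PLACEHOLDER = "~:~"  # must not occur anywhere in the real data


def parse_with_placeholder(line):
    """Hypothetical helper: mask '":"', parse as CSV, then restore it."""
    # After masking, the label is one ordinary quoted field.
    masked = line.replace('":"', PLACEHOLDER)
    row = next(csv.reader([masked]))
    # Put the original '":"' back into every parsed field.
    return [field.replace(PLACEHOLDER, '":"') for field in row]


# Assumed sample row, modeled on the data in the question.
sample = 'TEST.slice,"Consideration":"Verde (Spar Verde, Fonte Verde)",Test'
row = parse_with_placeholder(sample)
print(row)
```

As in the Spark output above, the outer quotes of the label are consumed by the CSV parser, so only the inner ":" comes back.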