how to loop through each row of dataFrame in pyspark

E.g.:

sqlContext = SQLContext(sc)

sample = sqlContext.sql("select Name, age, city from user")
sample.show()
The above statement prints the entire table on the terminal, but I want to access each row in that table using a for or while loop to perform further calculations.

apache-spark dataframe for-loop pyspark apache-spark-sql


Solution


  
-------

You would define a custom function and use map.

def customFunction(row):
    return (row.name, row.age, row.city)

sample2 = sample.rdd.map(customFunction)
or

sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))
The custom function would then be applied to every row of the DataFrame. Note that sample2 will be an RDD, not a DataFrame.

map is needed if you are going to perform more complex computations. If you just need to add a derived column, you can use withColumn, which returns a DataFrame.

sample3 = sample.withColumn('age2', sample.age + 2)
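
If you need the mapped result back as a DataFrame rather than an RDD, you can convert it with toDF. A minimal sketch, assuming sample2 was built by the map above and a SparkSession/SQLContext is active as in the question:

# Convert the mapped RDD of tuples back to a DataFrame, supplying column names.
sample2_df = sample2.toDF(["name", "age", "city"])
sample2_df.show()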


-------


You simply cannot. DataFrames, like other distributed data structures, are not iterable and can be accessed only through dedicated higher-order functions and/or SQL methods.

You can of course collect:

for row in df.rdd.collect():
    do_something(row)
or convert to a local iterator with toLocalIterator:

for row in df.rdd.toLocalIterator():
    do_something(row)
and iterate locally as shown above, but it defeats the whole purpose of using Spark.
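
If the per-row work only produces side effects (writing to an external store, logging, etc.) and nothing needs to come back to the driver, one of those dedicated higher-order functions is foreach, which runs the function on the executors instead of collecting. A minimal sketch, reusing the do_something placeholder from above; note it must be serializable, and its side effects happen on the workers, not the driver:

# Apply do_something to each Row on the executors; nothing is collected.
df.foreach(do_something)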


-------

 

Using a list comprehension in Python, you can collect an entire column of values into a list in just two lines:

df = sqlContext.sql("show tables in default")
tableList = [x["tableName"] for x in df.rdd.collect()]
In the above example, we return a list of the tables in the database 'default', but the same approach can be adapted by replacing the query used in sql().

Or more abbreviated:

tableList = [x["tableName"] for x in sqlContext.sql("show tables in default").rdd.collect()]
And for your example of three columns, we can create a list of dictionaries, and then iterate through them in a for loop.

sql_text = "select name, age, city from user"
tupleList = [{"name": x["name"], "age": x["age"], "city": x["city"]}
             for x in sqlContext.sql(sql_text).rdd.collect()]
for row in tupleList:
    print("{} is a {} year old from {}".format(
        row["name"],
        row["age"],
        row["city"]))
-------


If you want to do something to each row in a DataFrame object, use map. This will allow you to perform further calculations on each row. It's the equivalent of looping over the entire dataset from 0 to len(dataset) - 1.

Note that this will return a PipelinedRDD, not a DataFrame.
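
A minimal sketch of that pattern, assuming the name/age/city columns from the question:

# map applies the lambda to every Row; the result is an RDD, not a DataFrame.
greetings = sample.rdd.map(lambda row: "{} is {} years old".format(row.name, row.age))
for g in greetings.take(5):  # bring a few results back to the driver
    print(g)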

-------


Give it a try like this:

result = spark.createDataFrame(
    [('SpeciesId', 'int'), ('SpeciesName', 'string')],
    ["col_name", "data_type"])
for f in result.collect():
    print(f.col_name)


-------


