The concepts of actions and transformations in Apache Spark are not something we think about every day while using Spark, but occasionally it is good to know the distinction.
First of all, we may be asked that question during a job interview. It is a good question because it helps spot people who did not bother reading the documentation of the tool they are using. Fortunately, after reading this article, you will know the difference.
The anatomy of a Spark application usually comprises Spark operations, which can be either transformations or actions on your data sets, expressed through Spark's RDD, DataFrame, or Dataset APIs.
In addition to that, we may need to know the difference to explain why the code we are reviewing is too slow. For example, some data engineers have strange ideas like calculating counts in the middle of a Spark job to log the number of rows or store them as a metric. While a well-placed count may help to debug and tremendously speed up problem-solving, counting the number of rows in every other line of code is massive overkill. The difference between a transformation and an action helps us explain why doing that is a significant bottleneck.
What is a transformation?
A transformation is every Spark operation that returns a DataFrame, Dataset, or an RDD. When we build a chain of transformations, we add building blocks to the Spark job, but no data gets processed. That is possible because transformations are lazily evaluated: Spark calculates the values only when they are necessary.
Of course, this also means that Spark needs to recompute the values when we re-use the same transformations. We can avoid that by using the persist or cache functions.
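As a minimal sketch (assuming a local SparkSession created here just for illustration), the chain below defines several transformations and caches the result without processing any data:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()

df = spark.range(1_000_000)                            # transformation: nothing is computed yet
doubled = df.withColumn("doubled", F.col("id") * 2)    # transformation: still nothing computed
filtered = doubled.filter(F.col("doubled") % 4 == 0)   # transformation: still lazy
filtered.cache()                                       # mark for reuse; caching itself is lazy too
```

At this point Spark has only recorded the execution plan. The cached result is materialized the first time an action runs and is then reused instead of being recomputed.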
What is an action?
Actions, on the other hand, are not lazily executed. When we put an action in the code and Spark reaches that line of code when running the job, it will have to perform all of the transformations that lead to that action to produce a value.
Producing a value is the key concept here. While transformations return one of the Spark data types, actions return a count of elements (for example, the count function), a list of them (collect, take, etc.), or store the data in external storage (write, saveAsTextFile, and others).
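Continuing the illustrative sketch above (the filtered DataFrame and the output path are assumptions made for this example), each line below is an action and forces Spark to execute the accumulated transformations:

```python
n = filtered.count()            # action: returns a plain Python int
first_rows = filtered.take(5)   # action: returns a list of Row objects
all_rows = filtered.collect()   # action: returns the whole result as a list (careful with big data)
filtered.write.mode("overwrite").parquet("/tmp/filtered-demo")  # action: stores data externally
```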
How to tell the difference
When we look at the Spark API, we can easily spot the difference between transformations and actions. If a function returns a DataFrame, Dataset, or RDD, it is a transformation. If it returns anything else or does not return a value at all (or returns Unit in the case of the Scala API), it is an action.
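A quick way to check this in a REPL is to inspect the return types. This hedged snippet reuses the illustrative filtered DataFrame from the sketches above:

```python
print(type(filtered.select("id")))  # <class 'pyspark.sql.dataframe.DataFrame'> -> transformation
print(type(filtered.count()))       # <class 'int'>  -> action
print(type(filtered.take(3)))       # <class 'list'> -> action
```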
Basic data munging operations: structured data
This page is developing
| operation | Python pandas | PySpark RDD | PySpark DF | R dplyr | Revo R dplyrXdf |
|---|---|---|---|---|---|
| subset columns | df.colname, df['colname'] | rdd.map() | df.select('col1', 'col2', ..) | select(df, col1, col2, ..) | |
| new columns | df['newcolumn'] = .. | rdd.map(function) | df.withColumn("newcol", content) | mutate(df, col1=col2+col3, col4=col5^2, ..) | |
| subset rows | df[1:10], df.loc['rowname':] | rdd.filter(function or boolean vector), rdd.subtract() | filter | | |
| sample rows | | rdd.sample() | | | |
| order rows | df.sort('col1') | | | arrange | |
| group & aggregate | df.sum(axis=0), df.groupby(['A', 'B']).agg([np.mean, np.std]) | rdd.count(), rdd.countByValue(), rdd.reduce(), rdd.reduceByKey(), rdd.aggregate() | df.groupBy('col1', 'col2').count().show() | group_by(df, var1, var2, ..) %>% summarise(col=func(var3), col2=func(var4), ..) | rxSummary(formula, df) or group_by() %>% summarise() |
| peek at data | df.head() | rdd.take(5) | df.show(5) | first(), last() | |
| quick statistics | df.describe() | | df.describe() | summary() | rxGetVarInfo() |
| schema or structure | | | df.printSchema() | | |
..and there's always SQL
Syntax examples
Python pandas
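A hedged sketch of the pandas operations listed in the table above (the DataFrame and its column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "y"], "B": [1, 2, 3], "C": [4.0, 5.0, 6.0]})

subset = df[["A", "B"]]                                   # subset columns
df["D"] = df["B"] + df["C"]                               # new column
rows = df[df["B"] > 1]                                    # subset rows
ordered = df.sort_values("B", ascending=False)            # order rows
summary = df.groupby("A").agg({"B": "mean", "C": "std"})  # group & aggregate
df.head()                                                 # peek at data
df.describe()                                             # quick statistics
```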
PySpark RDDs & DataFrames
RDDs
Transformations return pointers to new RDDs
- map, flatMap: flexible, element-wise transformations
- reduceByKey
- filter
Actions return values (a short sketch of both groups follows these lists)
- collect
- reduce: for cumulative aggregation
- take, count
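A minimal sketch of these RDD operations, assuming an existing SparkSession named spark (the words data set is made up for illustration):

```python
sc = spark.sparkContext  # assumption: a SparkSession named spark already exists

words = sc.parallelize(["spark", "makes", "big", "data", "simple"])

# transformations: each returns a new RDD, nothing runs yet
pairs = words.map(lambda w: (len(w), 1))          # map
letters = words.flatMap(list)                     # flatMap
long_words = words.filter(lambda w: len(w) > 4)   # filter
counts = pairs.reduceByKey(lambda a, b: a + b)    # reduceByKey

# actions: these trigger the computation and return plain Python values
print(long_words.collect())                       # collect -> list
print(words.reduce(lambda a, b: a + " " + b))     # reduce -> cumulative aggregation
print(letters.take(3), words.count())             # take -> list, count -> int
```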
A reminder: how lambda functions, map, reduce and filter work.
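For reference, this is how they look in plain Python, without Spark (the numbers are arbitrary):

```python
from functools import reduce

nums = [1, 2, 3, 4, 5]

squared = list(map(lambda x: x * x, nums))         # map: apply a function to every element
evens = list(filter(lambda x: x % 2 == 0, nums))   # filter: keep elements matching a predicate
total = reduce(lambda a, b: a + b, nums)           # reduce: fold the list down to a single value
```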
Partitions: rdd.getNumPartitions(), sc.parallelize(data, 500), sc.textFile('file.csv', 500), rdd.repartition(500)
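A brief, hedged sketch of working with partitions, reusing the sc SparkContext from the RDD sketch above (the file path and the figure of 500 partitions are just placeholders):

```python
rdd = sc.parallelize(range(100_000), 500)   # create an RDD with 500 partitions
print(rdd.getNumPartitions())               # inspect the current partition count

lines = sc.textFile("file.csv", 500)        # ask for at least 500 partitions when reading a file
repartitioned = rdd.repartition(500)        # reshuffle into 500 partitions (triggers a shuffle)
```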
Additional functions for DataFrames
If you want to use an RDD method on a DataFrame, you can often call df.rdd.function().
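For example (a sketch, assuming df is an existing PySpark DataFrame):

```python
# drop down to the underlying RDD of Row objects to reach RDD-only methods
num_partitions = df.rdd.getNumPartitions()
sample = df.rdd.map(lambda row: row.asDict()).take(3)
```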
Miscellaneous examples of chained data munging:
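As an illustrative sketch (the orders DataFrame, its columns, and the SparkSession spark are assumptions made for this example):

```python
from pyspark.sql import functions as F

orders = spark.createDataFrame(
    [("alice", "books", 12.0), ("bob", "games", 40.0), ("alice", "games", 25.0)],
    ["customer", "category", "amount"],
)

result = (
    orders
    .filter(F.col("amount") > 10)                    # subset rows
    .withColumn("with_tax", F.col("amount") * 1.2)   # new column
    .groupBy("customer")                             # group
    .agg(F.sum("with_tax").alias("total"))           # aggregate
    .orderBy(F.col("total").desc())                  # order rows
)
result.show()
```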
Further resources
- My IPyNB scrapbook of Spark notes
- Spark programming guide (latest)
- Spark programming guide (1.3)
- Introduction to Spark illustrates how Python functions like map & reduce work and how they translate into Spark, plus many data munging examples in Pandas and then Spark
R dplyr
The 5 verbs:
- select = subset columns
- mutate = new cols
- filter = subset rows
- arrange = reorder rows
- summarise = aggregate
Additional functions in dplyr
- first(x) - The first element of vector x.
- last(x) - The last element of vector x.
- nth(x, n) - The nth element of vector x.
- n() - The number of rows in the data.frame or group of observations that summarise() describes.
- n_distinct(x) - The number of unique values in vector x.
Revo R dplyrXdf
Notes:
- xdf = 'external dataframe' or distributed one in, say, a Teradata database
- If necessary, transformations can be done using rxDataStep(transforms=list(..))
Manipulation with dplyrXdf can use:
- filter, select, distinct, transmute, mutate, arrange, rename,
- group_by, summarise, do
- left_join, right_join, full_join, inner_join
- these summarise functions are supported by rx: sum, n, mean, sd, var, min, max
Further resources