A lambda in programming is a small piece of code that can (among other things) be passed as an argument to a function. It has many uses, but one of the most common is that it lets a calling function customize the behavior of the called function, without the latter needing to know the details.
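As a quick, generic illustration (plain Python, no Spark involved; the greet function and its inputs are made up for this sketch), the called function builds the greeting, while the caller decides how the name should be formatted by passing in a lambda:

```python
def greet(name: str, format_name) -> str:
    # greet() does not know how the name will be formatted; the caller decides.
    return f"Hello, {format_name(name)}!"

print(greet("harald", format_name=lambda n: n.capitalize()))  # Hello, Harald!
print(greet("harald", format_name=lambda n: n.upper()))       # Hello, HARALD!
```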
This article will present a simple example of how lambdas (sometimes called anonymous functions) can be used in Databricks. The cases we will look at demonstrate how we can write transformation code specific to a data set and inject this code into a standard transformation function. We can then use the standard transformation on this particular data set, without having to rewrite existing code. Using this method, we can better organize our notebook and reuse code.
The main goals of this article are to show how a lambda can be used to inject data-set-specific transformation code into a standard transformation function, and to demonstrate how this helps us reuse code and keep our notebooks better organized.
In this article we will use PySpark, but the general principle works for similar languages, like Scala, and outside of Databricks as well, for that matter. Note that for brevity we will omit import commands and code comments.
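That said, for readers who want to run the snippets, the imports they rely on are roughly the following (listed once here as a convenience; the exact list may vary with your Spark version):

```python
from typing import Callable

from pyspark.sql import DataFrame
from pyspark.sql.functions import col, concat, lit, split, element_at, explode
```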
Note that this article is not intended to be an in-depth guide to lambdas in programming, but rather an introduction to their use in data processing.
Consider the data shown below, which we will call DataSet1, and let's say that our task is to write transformations that add columns with each person's initials and full name.
| location  | firstName | lastName  |
|-----------|-----------|-----------|
| Oslo      | Harald    | Haraldson |
| Stockholm | Ingrid    | Hansen    |
| Oslo      | Mark      | Hunter    |
This can be done with the following two functions:
```python
def addInitials(df: DataFrame) -> DataFrame:
    return df.withColumn("initials", concat(col("firstName").substr(1, 1), col("lastName").substr(1, 1)))

def addFullName(df: DataFrame) -> DataFrame:
    return df.withColumn("fullName", concat(col("firstName"), lit(" "), col("lastName")))
```
And to make things even easier we define a third function, nameTransformations, that calls the first two.
```python
def nameTransformations(data: DataFrame) -> DataFrame:
    data = addInitials(data)
    data = addFullName(data)
    return data
```
We can use this to transform our data:
result = nameTransformations(data=DataSet1)
And we get the following:
| location  | firstName | lastName  | initials | fullName         |
|-----------|-----------|-----------|----------|------------------|
| Oslo      | Harald    | Haraldson | HH       | Harald Haraldson |
| Stockholm | Ingrid    | Hansen    | IH       | Ingrid Hansen    |
| Oslo      | Mark      | Hunter    | MH       | Mark Hunter      |
So far so good, but anyone familiar with data engineering or data science will know that data from different sources can be messy and often organized differently. Consider DataSet2:
| location   | firstName | nobleName          |
|------------|-----------|--------------------|
| Copenhagen | Roger     | of house Henriksen |
| Copenhagen | Hans      | of house Dale      |
This new data contains the same type of information (more or less) as DataSet1, but the schema is different: instead of a lastName column we have a nobleName column. We want to reuse the existing code as much as possible when processing this new data, and this is where the lambda comes in.
We first modify nameTransformations to take a callable parameter, which is essentially a reference to a piece of code that can be invoked with a DataFrame as input.
```python
def nameTransformations(data: DataFrame, correctNameColumns: Callable[[DataFrame], DataFrame]) -> DataFrame:
    data = correctNameColumns(data)
    data = addInitials(data)
    data = addFullName(data)
    return data
```
Next, when processing DataSet2, we supply this argument: a single line of code that creates the missing lastName column based on nobleName.
```python
result = nameTransformations(
    data=DataSet2,
    correctNameColumns=lambda df: df.withColumn("lastName", element_at(split(col("nobleName"), " "), -1))
)
```
This gives the correct result.
| location   | firstName | nobleName          | lastName  | initials | fullName        |
|------------|-----------|--------------------|-----------|----------|-----------------|
| Copenhagen | Roger     | of house Henriksen | Henriksen | RH       | Roger Henriksen |
| Copenhagen | Hans      | of house Dale      | Dale      | HD       | Hans Dale       |
Note that this change to nameTransformations breaks backwards compatibility for DataSet1. This is easily solved by adding a default value for the correctNameColumns parameter.
```python
def nameTransformations(data: DataFrame, correctNameColumns: Callable[[DataFrame], DataFrame] = lambda df: df) -> DataFrame:
    ...
```
This default is the shortest lambda possible: it simply returns its input unchanged. The code for DataSet1 will now work as before, and when no correctNameColumns argument is provided, no additional transformation is done.
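As a quick sketch, both of the calls below now behave as expected with the data sets defined above:

```python
# DataSet1: no lambda needed, the identity default is used.
result1 = nameTransformations(data=DataSet1)

# DataSet2: the lambda creates the missing lastName column before the shared steps run.
result2 = nameTransformations(
    data=DataSet2,
    correctNameColumns=lambda df: df.withColumn("lastName", element_at(split(col("nobleName"), " "), -1))
)
```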
We now have a complete example of how to use a lambda to inject additional steps into a data transformation function. This lets us add code specific to the data at hand, without having to modify the nameTransformations function for each new case. Looking at this from a higher perspective, the function nameTransformations is only concerned with creating the new columns, and to do this it needs a firstName and a lastName column. If these are not already in the data, the lambda lets us tell the function how to add one or both: it doesn't know, nor does it need to know, the details.
Imagine an additional data set with neither a firstName nor a lastName column; instead, the names are kept in a single name column formatted like this:
| location | name           |
|----------|----------------|
| Helsinki | Miller, George |
In this case we can still reuse the above nameTransformations function without any modifications by simply providing a lambda that generates the needed columns.
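A lambda along these lines would do the trick (a sketch; the data set name DataSet_Helsinki is made up for illustration):

```python
# Split the "Miller, George" format on ", " and pick out the last and first name.
result = nameTransformations(
    data=DataSet_Helsinki,  # hypothetical data set with the single name column shown above
    correctNameColumns=lambda df: (
        df.withColumn("lastName", element_at(split(col("name"), ", "), 1))
          .withColumn("firstName", element_at(split(col("name"), ", "), 2))
    )
)
```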
The complete code for this case can be found in this notebook: Databricks Lambda Case 1.
The observant reader might have noticed that in the first example the lambda was not strictly necessary. After all, we could create the lastName column for DataSet2 prior to calling the nameTransformations function and achieve the same result, as sketched below. We will next see an example where this is not practical. Note that to make this example clearer, we will not reuse any code from Case 1, except addInitials and addFullName.
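For completeness, that lambda-free alternative for DataSet2 would look roughly like this (DataSet2_prepared is just an illustrative name):

```python
# Create the missing lastName column up front, then run the standard transformations.
DataSet2_prepared = DataSet2.withColumn("lastName", element_at(split(col("nobleName"), " "), -1))
result = nameTransformations(data=DataSet2_prepared)
```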
Consider DataSet3, a variant of DataSet1 where the people living at a location are now kept in an array column.
| location  | population                                                                       |
|-----------|----------------------------------------------------------------------------------|
| Oslo      | [{firstName: Harald, lastName: Haraldson}, {firstName: Mark, lastName: Hunter}]   |
| Stockholm | [{firstName: Ingrid, lastName: Hansen}]                                            |
To produce the same result as before, we need to add an explode-step to our transformation.
```python
def explodePopulation(df: DataFrame) -> DataFrame:
    return df.withColumn("exploded_population", explode("population")).select("location", "exploded_population.*")

def nameTransformationsNested(data: DataFrame) -> DataFrame:
    data = explodePopulation(data)
    data = addInitials(data)
    data = addFullName(data)
    return data
```
The new nameTransformationsNested function includes one additional step compared to nameTransformations. We can now run this using DataSet3 and get the same output as in the first case.
result = nameTransformationsNested(data=DataSet3)
| location  | firstName | lastName  | initials | fullName         |
|-----------|-----------|-----------|----------|------------------|
| Oslo      | Harald    | Haraldson | HH       | Harald Haraldson |
| Stockholm | Ingrid    | Hansen    | IH       | Ingrid Hansen    |
| Oslo      | Mark      | Hunter    | MH       | Mark Hunter      |
Let us now consider DataSet4.
| location   | population                                                                                        |
|------------|---------------------------------------------------------------------------------------------------|
| Copenhagen | [{firstName: Roger, nobleName: of house Henriksen}, {firstName: Hans, nobleName: of house Dale}]   |
As with DataSet3, the names are now in an array column, which makes it hard to add the required lastName column. But once the explode-step has been executed, the exact same lambda as before can be used for this purpose. We modify the nameTransformationsNested function and provide the lambda.
```python
def nameTransformationsNested(data: DataFrame, correctNameColumns: Callable[[DataFrame], DataFrame] = lambda df: df) -> DataFrame:
    data = explodePopulation(data)
    data = correctNameColumns(data)
    data = addInitials(data)
    data = addFullName(data)
    return data

result = nameTransformationsNested(
    data=DataSet4,
    correctNameColumns=lambda df: df.withColumn("lastName", element_at(split(col("nobleName"), " "), -1))
)
```
Notice that correctNameColumns must be executed between explodePopulation and addInitials, which means that, unlike in the first example, we cannot easily create the lastName column prior to calling nameTransformationsNested. The finished function can now be used to transform both DataSet3 and DataSet4.
| location   | firstName | nobleName          | lastName  | initials | fullName        |
|------------|-----------|--------------------|-----------|----------|-----------------|
| Copenhagen | Roger     | of house Henriksen | Henriksen | RH       | Roger Henriksen |
| Copenhagen | Hans      | of house Dale      | Dale      | HD       | Hans Dale       |
The complete code for this case can be found in this notebook: Databricks Lambda Case 2.
We have seen how to use a lambda in Databricks to inject a custom transformation into a function, allowing us to reuse code across multiple data sets with similar, but not identical, schemas. Situations like these are very common in data analysis and data engineering. The described method allows us to write code that is reusable, readable, and maintainable, all of which are important requirements for production-quality software.