Install flowrunner

In [ ]:
%pip install --upgrade flowrunner
Python interpreter will be restarted.
Requirement already satisfied: flowrunner in /local_disk0/.ephemeral_nfs/envs/pythonEnv-3802ecd0-258b-43a4-bd0b-ca580696d949/lib/python3.9/site-packages (0.2.0)
Python interpreter will be restarted.
In [ ]:
# -*- coding: utf-8 -*-
from flowrunner import BaseFlow, end, start, step
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()


class ExampleSparkFlow(BaseFlow):
    @start
    @step(next=["transformation_function_1", "transformation_function_2"])
    def create_data(self):
        """
        This method we create the dataset we are going use. In real use cases,
        you'll have to read from a source (csv, parquet, etc)

        For this example we create two dataframes for students ranked by marked scored
        for when they attempted the example on 1st January 2023 and 12th March 2023

        After creating the dataset we pass it to the next methods

        - transformation_function_1
        - transformation_function_2
        """

        data1 = [
            ("Hermione", 100),
            ("Harry", 85),
            ("Ron", 75),
        ]

        data2 = [
            ("Hermione", 100),
            ("Harry", 90),
            ("Ron", 80),
        ]

        columns = ["Name", "marks"]

        rdd1 = spark.sparkContext.parallelize(data1)
        rdd2 = spark.sparkContext.parallelize(data2)
        self.df1 = spark.createDataFrame(rdd1).toDF(*columns)
        self.df2 = spark.createDataFrame(rdd2).toDF(*columns)

    @step(next=["append_data"])
    def transformation_function_1(self):
        """
        Here we add a snapshot_date to the input dataframe of 2023-03-12
        """

        self.transformed_df_1 = self.df1.withColumn("snapshot_date", lit("2023-03-12"))

    @step(next=["append_data"])
    def transformation_function_2(self):
        """
        Here we add a snapshot_date to the input dataframe of 2023-01-01
        """
        self.transformed_df_2 = self.df2.withColumn("snapshot_date", lit("2023-01-01"))

    @step(next=["show_data"])
    def append_data(self):
        """
        Here we append the two dataframe together
        """
        self.final_df = self.transformed_df_1.union(self.transformed_df_2)

    @end
    @step
    def show_data(self):
        """
        Here we show the new final dataframe of aggregated data. However in real use cases. It would
        be more likely to write the data to some final layer/format
        """
        self.final_df.show()
        return self.final_df
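The `next=` arguments in the class above wire the steps into a small DAG: `create_data` fans out to the two transformation steps, which both feed `append_data`, which feeds `show_data`. As a plain-Python sketch (independent of flowrunner, which builds this graph internally from the decorators), the implied execution order can be checked with a topological sort:

```python
from graphlib import TopologicalSorter

# Hand-written mirror of the next= wiring above; flowrunner derives this
# graph from the decorators rather than from a dict like this.
next_steps = {
    "create_data": ["transformation_function_1", "transformation_function_2"],
    "transformation_function_1": ["append_data"],
    "transformation_function_2": ["append_data"],
    "append_data": ["show_data"],
    "show_data": [],
}

# TopologicalSorter expects each node's predecessors, so invert the edges.
predecessors = {step: set() for step in next_steps}
for step, nexts in next_steps.items():
    for nxt in nexts:
        predecessors[nxt].add(step)

order = list(TopologicalSorter(predecessors).static_order())
print(order)  # create_data comes first, show_data last
```

Any order consistent with the edges is valid; the two transformation steps are independent of each other, which is why `show()` below may list them in either order.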
In [ ]:
# Create an instance of the flow

example_spark_flow = ExampleSparkFlow()
In [ ]:
example_spark_flow.show()
DEBUG | Validating flow for ExampleSparkFlow(data_store={}, param_store={}) | 2023-03-27 16:08:28 | 0327-155113-1n4appw3-10-172-195-208 | flowrunner.system.logger | 1680
DEBUG | Show flow for ExampleSparkFlow(data_store={}, param_store={}) | 2023-03-27 16:08:28 | 0327-155113-1n4appw3-10-172-195-208 | flowrunner.system.logger | 1680
create_data


        In this method we create the dataset we are going to use. In real use
        cases, you would read from a source (csv, parquet, etc.).

        For this example we create two dataframes of students ranked by marks
        scored when they attempted the example on 1st January 2023 and
        12th March 2023.

        After creating the dataset we pass it to the next methods:

        - transformation_function_1
        - transformation_function_2
        
   Next=transformation_function_1, transformation_function_2


transformation_function_2


        Here we add a snapshot_date of 2023-01-01 to the input dataframe
        
   Next=append_data


transformation_function_1


        Here we add a snapshot_date of 2023-03-12 to the input dataframe
        
   Next=append_data


append_data


        Here we append the two dataframes together
        
   Next=show_data


show_data


        Here we show the final dataframe of appended data. However, in real
        use cases it would be more likely to write the data to some final
        layer/format
        
In [ ]:
example_spark_flow.validate()
DEBUG | Validating flow for ExampleSparkFlow(data_store={}, param_store={}) | 2023-03-27 16:08:28 | 0327-155113-1n4appw3-10-172-195-208 | flowrunner.system.logger | 1680
✅ Validated number of start nodes
✅ Validated start nodes 'next' values
✅ Validate number of middle_nodes
✅ Validated middle_nodes 'next' values
✅ Validated end nodes
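`validate()` checks the flow's structure before anything runs. A minimal plain-Python sketch of one such check, assuming it resembles the "'next' values" checks in the output above (this is not flowrunner's actual implementation): every name listed in a `next=` must be a step defined on the flow.

```python
# Hypothetical mirror of the flow's next= wiring; flowrunner reads this
# from the decorators, not from a dict like this.
next_steps = {
    "create_data": ["transformation_function_1", "transformation_function_2"],
    "transformation_function_1": ["append_data"],
    "transformation_function_2": ["append_data"],
    "append_data": ["show_data"],
}

defined_steps = set(next_steps) | {"show_data"}

def check_next_values(graph, defined):
    """Return (step, target) pairs where a next= target is not a defined step."""
    missing = []
    for step, nexts in graph.items():
        for target in nexts:
            if target not in defined:
                missing.append((step, target))
    return missing

print(check_next_values(next_steps, defined_steps))  # [] — all targets exist
```

A typo in a `next=` list (say, `"apend_data"`) would surface here as a non-empty result, which is the kind of mistake validation catches before `run()` executes anything.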
In [ ]:
example_spark_flow.display()
In [ ]:
example_spark_flow.run()
DEBUG | Validating flow for ExampleSparkFlow(data_store={}, param_store={}) | 2023-03-27 16:08:29 | 0327-155113-1n4appw3-10-172-195-208 | flowrunner.system.logger | 1680
WARNING | Validation will raise InvalidFlowException if invalid Flow found | 2023-03-27 16:08:29 | 0327-155113-1n4appw3-10-172-195-208 | flowrunner.system.logger | 1680
DEBUG | Running flow for ExampleSparkFlow(data_store={}, param_store={}) | 2023-03-27 16:08:29 | 0327-155113-1n4appw3-10-172-195-208 | flowrunner.system.logger | 1680
+--------+-----+-------------+
|    Name|marks|snapshot_date|
+--------+-----+-------------+
|Hermione|  100|   2023-03-12|
|   Harry|   85|   2023-03-12|
|     Ron|   75|   2023-03-12|
|Hermione|  100|   2023-01-01|
|   Harry|   90|   2023-01-01|
|     Ron|   80|   2023-01-01|
+--------+-----+-------------+
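Independent of Spark, the pipeline above amounts to tagging each row with a constant snapshot date and concatenating the two row sets; a plain-Python equivalent of the result shown by `run()`:

```python
data1 = [("Hermione", 100), ("Harry", 85), ("Ron", 75)]
data2 = [("Hermione", 100), ("Harry", 90), ("Ron", 80)]

# withColumn("snapshot_date", lit(...)) adds the same constant to every row;
# union then stacks the two row sets in order.
final_rows = (
    [(name, marks, "2023-03-12") for name, marks in data1]
    + [(name, marks, "2023-01-01") for name, marks in data2]
)

for row in final_rows:
    print(row)
```

Note that `union` in Spark, unlike SQL's `UNION`, keeps duplicates and matches columns by position, which is why the two dataframes must share the same schema.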
