primrose package

Subpackages

Submodules

primrose.dag_runner module

Run the DAG: gets the list of nodes to traverse, then calls run(data_object) on each

Author(s):

Carl Anderson (carl.anderson@weightwatchers.com)

class primrose.dag_runner.DagRunner(configuration)

Bases: object

class that runs the DAG: gets the list of nodes to traverse and then asks them to run

cache_data_object(data_object)

cache the data object

Parameters

data_object (DataObject) – instance of DataObject

Returns

whether it was cached (bool)

check_for_upstream(sequence)

check for any upstream paths within the input sequence. That is, suppose we had a reader flowing to a writer: it would not make sense to run the writer before the reader.

Parameters

sequence (list) – list of node names

Raises

Exception if any upstream paths found
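Such a check amounts to graph reachability. The sketch below is a simplified stand-in, not the actual primrose implementation: given the DAG as an adjacency map, it verifies that no node later in the sequence has a path leading back to an earlier node.

```python
from collections import deque

def has_upstream_violation(edges, sequence):
    """Return True if some later node in `sequence` has a path to an
    earlier node, i.e. the order would run a consumer before its
    producer. `edges` maps node name -> list of downstream node names."""

    def reachable(start):
        # Breadth-first search over the adjacency map.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    position = {name: i for i, name in enumerate(sequence)}
    for i, node in enumerate(sequence):
        # Every node reachable downstream of `node` must come after it.
        if any(position.get(dst, i) < i for dst in reachable(node)):
            return True
    return False

# A reader feeding a writer: running the writer first is invalid.
edges = {"reader": ["writer"]}
print(has_upstream_violation(edges, ["writer", "reader"]))  # True
print(has_upstream_violation(edges, ["reader", "writer"]))  # False
```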

create_data_object()

restore data_object from cache

Returns

data_object (DataObject)

filter_sequence(sequence)

The user may have specified some subset of sections to run in metadata.section_run. Since we cannot assume the traverser limits itself to those sections, the sequence is restricted here, if necessary.

Parameters

sequence (list) – list of nodes to run in given order

Returns

complete or subset of input sequence

Return type

sequence (list)

Raises
Exception if there are dupes in the sequence, if nodes are not in the config, or if we have nodes from other sections. The latter can happen if we mix up nodes from different sections. For example, suppose we have section1 (1 node) and section2 (2 nodes), we want to run section2 and then section1, and we receive the sequence [section2_node1, section1_node1, section2_node2]: it will complain about the partition [section2_node1, section1_node1] [section2_node2], as nodes from the two sections are mixed.

initial_check_sequence(sequence)

Some checks on the incoming sequence

Parameters

sequence (list) – list of nodes to run in given order

Returns

nothing

Raises
Exception if there are dupes in the sequence, if nodes are not in the config, or if we have nodes from other sections. The latter can happen if we mix up nodes from different sections. For example, suppose we have section1 (1 node) and section2 (2 nodes), we want to run section2 and then section1, and we receive the sequence [section2_node1, section1_node1, section2_node2]: it will complain about the partition [section2_node1, section1_node1] [section2_node2], as nodes from the two sections are mixed.

run(dry_run=False)

run the whole DAG. Optionally, you can call with dry_run=True, which will log what would be run, and in what order, without actually running it.

Parameters

dry_run (bool) – if True, log the planned run without executing it

Returns

data_object (DataObject instance) and node (Node): the last node run

Return type

data_object, node
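The run loop can be sketched roughly as follows. This is a simplified stand-in, not the actual DagRunner implementation: the EchoNode class and run_dag function are hypothetical, and the real runner also validates and filters the sequence before running it.

```python
import logging

class EchoNode:
    # Hypothetical toy node: appends its name to a shared list.
    def __init__(self, name):
        self.name = name

    def run(self, data_object):
        data_object.append(self.name)
        return data_object

def run_dag(sequence, data_object, dry_run=False):
    """Run each node in order, or just log the planned order.
    Returns the final data_object and the last node run."""
    node = None
    for node in sequence:
        if dry_run:
            logging.info("would run node: %s", node.name)
            continue
        data_object = node.run(data_object)
    return data_object, node

data, last = run_dag([EchoNode("reader"), EchoNode("writer")], [])
print(data, last.name)  # ['reader', 'writer'] writer
```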

primrose.data_object module

Module to handle bookkeeping of data

Author(s):

Carl Anderson (carl.anderson@weightwatchers.com)

class primrose.data_object.DataObject(config)

Bases: object

DataObject: a container for “data” (strings, dicts, arbitrary objects etc)

DATA_KEY = 'data'
DEFAULT_RESPONSE_TYPE = 'kv'
__repr__()

string representation of the class

Returns

string representation

add(requestor, data, key='data', overwrite=False)

for requestor’s instance_name, set key:data in storage

Parameters
  • requestor (Node) – the object (model, pipeline, writer, etc.) that has an instance_name attribute

  • data (object) – some object

  • key (string) – key to store the data under; if not supplied, the default data key is used

  • overwrite (bool) – whether to overwrite an existing value at key (default False)

Returns

nothing.

get(instance_name, pop_data=False, rtype='kv')

get some data from storage, optionally popping it off.

Parameters
  • instance_name (str) – name of node in DAG

  • pop_data (bool) – boolean, whether to pop data from storage

  • rtype (DataObjectResponseType) – DataObjectResponseType value, specifying response type

Returns

data of desired DataObjectResponseType, selected with rtype

Raises

Exception if unrecognized rtype or keys
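The add/get pattern amounts to a two-level dictionary keyed first by instance_name and then by data key. A minimal sketch of that storage pattern (SimpleDataStore and Reader are hypothetical stand-ins, not primrose classes):

```python
class SimpleDataStore:
    """Minimal sketch of DataObject's storage pattern:
    a dict of instance_name -> {key: value}."""

    def __init__(self):
        self.storage = {}

    def add(self, requestor, data, key="data", overwrite=False):
        # For requestor's instance_name, set key:data in storage.
        inner = self.storage.setdefault(requestor.instance_name, {})
        if key in inner and not overwrite:
            raise Exception(
                "key %r already set for %s" % (key, requestor.instance_name)
            )
        inner[key] = data

    def get(self, instance_name, pop_data=False):
        # Fetch (and optionally pop) all data stored for a node.
        if pop_data:
            return self.storage.pop(instance_name)
        return self.storage[instance_name]

class Reader:
    # Hypothetical node standing in for a primrose reader.
    instance_name = "corpus_reader"

store = SimpleDataStore()
store.add(Reader(), {"col": [1, 2]})
print(store.get("corpus_reader"))  # {'data': {'col': [1, 2]}}
```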

get_filtered_upstream_data(instance_name, filter_for_key)

Upstream data where first level dict keys are first checked for the presence of a filter key

Parameters
  • instance_name (str) – name of instance to look upstream from

  • filter_for_key (str) – the key data was saved with (not instance name but the data value key)

Returns

dictionary of stored data for that instance if only one matching dict, if more than one valid dictionary then return list of dicts, None otherwise

get_upstream_data(instance_name, pop_data=False, rtype='kv', operation_type_filter=None)

Return data from upstream source(s), choose to pop or not from the dict

Note

returns dictionary, where keys are instance_names and each value is a dictionary. However, if i) there is only 1 upstream key and ii) value_only=True then return the value only.

This option is useful if you expect 1 upstream source only and it returns a single artifact, such as a single dataframe. In that case just the dataframe is returned

Returns

object (type depends on DEFAULT_RESPONSE_TYPE)

Raises

Exception if no upstream data found

static read_from_cache(filename)

restore DataObject from a dill-cached file

Parameters

filename (str) – cache filename

Returns

DataObject instance from cache

Return type

data_object (DataObject)

upstream_keys(instance_name, operation_type_filter=None)

get list of upstream node names for a given input requestor node

Parameters
  • instance_name (str) – name of requestor

  • operation_type_filter (optional) – operation type to filter on

Returns

list of keys, if any

write_to_cache(filename)

write data_object (self) to dill-cache

Returns

nothing. Side effect is to cache object to file
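The cache round trip can be sketched as follows. primrose uses dill; the stdlib pickle module is used here so the sketch is self-contained, and CacheableData is a hypothetical stand-in for DataObject.

```python
import os
import pickle
import tempfile

class CacheableData:
    """Hypothetical stand-in for DataObject's cache round trip."""

    def __init__(self, storage):
        self.storage = storage

    def write_to_cache(self, filename):
        # Serialize self to file; returns nothing.
        with open(filename, "wb") as f:
            pickle.dump(self, f)

    @staticmethod
    def read_from_cache(filename):
        # Restore an instance from a cached file.
        with open(filename, "rb") as f:
            return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "data_object.cache")
CacheableData({"reader": {"data": [1, 2, 3]}}).write_to_cache(path)
restored = CacheableData.read_from_cache(path)
print(restored.storage["reader"]["data"])  # [1, 2, 3]
```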

class primrose.data_object.DataObjectResponseType

Bases: enum.Enum

Type of object when getting data from DataObject

INSTANCE_KEY_VALUE = dictionary of instance_name keys and their data dictionaries:

{'instance_name': {'key': value}, 'instance_name2': {'key2': value2}, …}, e.g. {'corpus_reader': {'data': dataframe}}. This is useful if a set of upstream data arrives from multiple sources.

KEY_VALUE = dictionary of data for a given instance name: {'key': value}

e.g. {'data': dataframe} or {'data': dataframe, 'query': 'select * from table'}. This is useful if there are multiple keys for a given instance_name, or if you want to explicitly check against expected keys.

VALUE = value only (for the first or only instance name, and its only key):

e.g. dataframe. This is useful if you know a node in the DAG has only a single upstream source and only a single value. Readers are a good example, as they typically read in and provide a single dataframe.

INSTANCE_KEY_VALUE = 'ikv'
KEY_VALUE = 'kv'
VALUE = 'v'
values = <function DataObjectResponseType.values>
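A sketch of how the three rtype values shape what get() returns, using a hypothetical single-instance store (this simplified get function is not the primrose implementation):

```python
# Hypothetical stored data: one instance with one key.
storage = {"corpus_reader": {"data": "dataframe-placeholder"}}

def get(storage, instance_name, rtype):
    """Shape the returned data per DataObjectResponseType value."""
    if rtype == "ikv":  # INSTANCE_KEY_VALUE
        return {instance_name: storage[instance_name]}
    if rtype == "kv":   # KEY_VALUE (the default)
        return storage[instance_name]
    if rtype == "v":    # VALUE: the first/only key's value
        return next(iter(storage[instance_name].values()))
    raise Exception("unrecognized rtype: %r" % rtype)

print(get(storage, "corpus_reader", "ikv"))
# {'corpus_reader': {'data': 'dataframe-placeholder'}}
print(get(storage, "corpus_reader", "kv"))  # {'data': 'dataframe-placeholder'}
print(get(storage, "corpus_reader", "v"))   # dataframe-placeholder
```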

primrose.node_factory module

Singleton Factory where one can register objects/classes for instantiation

Author(s):

Michael Skarlinski (michael.skarlinski@weightwatchers.com)

Carl Anderson (carl.anderson@weightwatchers.com)

class primrose.node_factory.NodeFactory

Bases: object

Singleton Factory where one can register objects/classes for instantiation

CLASS_KEY = 'class'
CLASS_PREFIX = 'class_prefix'
__getattr__(name)

delegate getattr for the given name to the singleton instance

Parameters

name (str) – name of the instance

Returns

the named attribute of the singleton instance

instance = None
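The singleton pattern used here can be sketched generically as below. The registration API and internals of NodeFactory are not shown in these docs, so the registry dict is a hypothetical stand-in; the sketch only illustrates the shared-instance and __getattr__ delegation pattern.

```python
class SingletonFactory:
    """Generic sketch of a singleton factory: one shared inner
    instance, with attribute access delegated to it."""

    class _Inner:
        def __init__(self):
            self.registry = {}  # hypothetical: name -> class

    instance = None

    def __init__(self):
        # First construction creates the shared inner instance.
        if SingletonFactory.instance is None:
            SingletonFactory.instance = SingletonFactory._Inner()

    def __getattr__(self, name):
        # Called only when normal lookup fails: delegate to the
        # shared instance, so all factories see the same state.
        return getattr(SingletonFactory.instance, name)

a, b = SingletonFactory(), SingletonFactory()
a.registry["csv_reader"] = dict  # stand-in for a node class
print(b.registry)                # {'csv_reader': <class 'dict'>}
print(a.instance is b.instance)  # True
```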

primrose.util module

Enum to specify the type of run mode: train, predict, and eval

Author(s):

Michael Skarlinski (michael.skarlinski@weightwatchers.com)

Carl Anderson (carl.anderson@weightwatchers.com)

class primrose.util.RunModes

Bases: enum.Enum

set of operation type identifiers

EVAL = 'eval'
PREDICT = 'predict'
TRAIN = 'train'
names = <function RunModes.names>
values = <function RunModes.values>
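A sketch of an Enum exposing the names and values helpers listed above (the real RunModes implementation may differ in detail):

```python
from enum import Enum

class RunModes(Enum):
    """Sketch of an operation-type enum with names/values helpers."""
    TRAIN = "train"
    PREDICT = "predict"
    EVAL = "eval"

    @classmethod
    def names(cls):
        # Member names, in definition order.
        return [m.name for m in cls]

    @classmethod
    def values(cls):
        # Member values, in definition order.
        return [m.value for m in cls]

print(RunModes.names())   # ['TRAIN', 'PREDICT', 'EVAL']
print(RunModes.values())  # ['train', 'predict', 'eval']
```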

Module contents

primrose.replace_line(file_name, line_num, text)

Replace a single line of a file with text

Parameters
  • file_name (str) – name of file

  • line_num (int) – line to replace

  • text (str) – string to replace at line_num

Returns

Nothing, the file is modified and rewritten
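A minimal sketch of such a helper, assuming 1-indexed line numbers (the actual primrose implementation may differ):

```python
import os
import tempfile

def replace_line(file_name, line_num, text):
    """Replace line `line_num` (assumed 1-indexed) of a file with
    `text`, preserving a trailing newline, and rewrite the file."""
    with open(file_name) as f:
        lines = f.readlines()
    if not text.endswith("\n"):
        text += "\n"
    lines[line_num - 1] = text
    with open(file_name, "w") as f:
        f.writelines(lines)

# Usage: replace the second line of a small temp file.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("alpha\nbeta\ngamma\n")

replace_line(path, 2, "BETA")

with open(path) as f:
    result = f.read().splitlines()
print(result)  # ['alpha', 'BETA', 'gamma']
```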