Converting TO the Collins Economics Result Object (CERO) format

The ToCERO class provides methods for converting data files to the CERO format.

Critical to the successful use of this class is a configuration file in YAML format. Do not be intimidated by the acronym - the YAML format is very simple and human readable. Typically, formal study [1] of the YAML format should be unnecessary - copying a working configuration file and then altering it for the desired purpose should satisfy most users (the tests/data subdirectory provides many examples). This documentation shows how to build a YAML configuration file for use with the ToCERO class in a gradual, example-by-example process. A technical reference to the ToCERO class follows.

Building a YAML file from scratch to convert TO the CERO format

The configuration file can differ significantly depending on the type of file from which data is imported, but one aspect that all configuration files must have in common is the files field. As the name suggests, files specifies the input files that are sources of data for the conversion process. It therefore follows that a minimal (albeit useless) YAML configuration file will look like this:

files:

That is, a single line that doesn’t specify anything. This simple file is interpreted as a dict with the key "files" and a corresponding value of None - the : identifies the key-value nature of the data. That is:

{"files": None}

This top-level dictionary object is referred to as a ToCERO object. The obvious next step is to specify some input files to convert. This is done by adding subsequent indented [2] lines, each consisting of a hyphen, followed by a space, followed by the relevant data. For example:

files:
    - <File Object A>
    - <File Object B>
    - <File Object C>
    - etc.

The hyphens (followed by a space) on subsequent lines identify separate items that collectively are interpreted as a python list. The indentation identifies the list as the value for the key on the line above. In other words, the previous example is interpreted as the python object:

{"files": [<Python interpretation of File Object A>,
           <Python interpretation of File Object B>,
           <Python interpretation of File Object C>,
           <etc.>]}

Note that each item of the "files" list can be either a str or a dict. If a str, the string must refer to a YAML file containing a dict defining a file object. If a dict, then that dict must be a file object. A file object is a dictionary with one mandatory key-value pair - that is, (in YAML form):

file: name_of_file

Where name_of_file is a file path relative to the configuration file. The option search_paths: List[str], provided as an option to the file object (or the encompassing ToCERO object), overrides this behaviour (paths earlier in the list are searched before paths later in the list).

Without further specification, if the file type is comma-separated-values (CSV) and the data is in the default format, ConCERO can import the entire file. The ‘default format’ is discussed in Guidelines for painless importing of data. ConCERO determines the file type:

  1. by the key-value pair type: <ext> in the file object, and if not provided then
  2. by the key-value pair type: <ext> in the ToCERO object, and if not provided then
  3. by determining the extension of the value of file in the file object, and if not determined then
  4. an error is raised.
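The resolution order above can be sketched in plain Python. Note that resolve_file_type is a hypothetical illustration of the described behaviour, not part of ConCERO's actual API:

```python
import os

def resolve_file_type(file_obj: dict, tocero_obj: dict) -> str:
    """Illustrative sketch of the file-type resolution order."""
    # 1. 'type' in the file object takes precedence...
    if "type" in file_obj:
        return file_obj["type"]
    # 2. ...then 'type' inherited from the ToCERO object...
    if "type" in tocero_obj:
        return tocero_obj["type"]
    # 3. ...then the extension of the 'file' value...
    ext = os.path.splitext(file_obj.get("file", ""))[1].lstrip(".")
    if ext:
        return ext
    # 4. ...otherwise an error is raised.
    raise ValueError("Unable to determine file type.")

print(resolve_file_type({"file": "a_file.csv"}, {}))           # csv
print(resolve_file_type({"file": "b_file"}, {"type": "csv"}))  # csv
```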

Providing the type option allows the user to potentially extend the program to import files that the program author was not aware existed, if the file is of a similar format to one of the known and supported formats. For example, if the program author was not aware shk files existed (and thus did not provide support for them), shk files could be imported by specifying type: har (given their similarity to har files). As it is, shk files are supported, so this is not necessary. Naturally, whether the import succeeds will be dependent on whether the underlying library allows importing that file type.

With respect to step 2 (of determining the file type), it can be said that the file object inherits from the ToCERO object. Many key-value pairs can be inherited from the ToCERO object, which reduces duplicating redundant information in the case that some properties apply to all the input files. Given that every key-value pair has some effect on configuration, the term option is used to refer to a key-value pair collectively. So an example of a YAML file including all points discussed so far is:

files:
    - file: a_file.csv
    - file: b_file
      type: csv

In the example above, a_file.csv and b_file would be successfully imported (assuming they are both of default format). The file extension can be discerned for a_file.csv, and b_file has the corresponding type specified. Note that the type option for b_file is indented at the same level as the file option, not at the level of the list.

A minimal configuration form that demonstrates inheritance (and assuming c_file is of default csv type) is:

type: 'csv'
files:
    - file: a_file.csv
    - file: b_file
    - file: c_file

Note that, alternatively, the file name of c_file could be changed to include a file extension. An important point is that the inheritance of type does not mean you - the user - can lazily drop the file extensions. The file extension is part of the file name, and so it must be provided, if it exists, to find the correct file.

In most cases, more specification in the file object is necessary to import data. The necessary and additional options in the file object depend on the type of the file - whether it be a CSV file, Excel file, HAR file, VD file or GDX file. That is, the supported types are given by ToCERO.supported_file_types - a set including:

  • "csv"
  • "xlsx"
  • "har"
  • "shk"
  • "vd"
  • "gdx"

File Objects - CSV files

CSV files can be considered the simplest case with respect to data import. ‘Under the hood’, ConCERO uses the pandas.read_csv() method to implement data import from CSV files (see the pandas documentation for details). Any option available for the pandas.read_csv() method is also available to ConCERO by including that option in the file object.

There are also a few additional options that provide ConCERO-specific functionality. These options are:

series: (list)

the list specifies the series in the index that are relevant, thereby providing a way to select data for export to the CERO. Each item in the list is referred to as a series object - a dictionary with the following options:

name: (str)

name identifies the elements of the index that will be converted into a CERO. name is a mandatory option.

rename: (str)

If provided, changes name to the value of rename after export into the CERO.

A series object can be provided as a string - this is equivalent to the series object {'name': series_name}.
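This shorthand can be illustrated with a small normalisation sketch (normalise_series is a hypothetical helper, not part of ConCERO's API):

```python
def normalise_series(series: list) -> list:
    """A bare string item is shorthand for the series object {'name': <string>}."""
    return [{"name": s} if isinstance(s, str) else s for s in series]

# A mixed list of strings and dicts normalises to a list of series objects.
print(normalise_series(["GDP", {"name": "CPI", "rename": "Inflation"}]))
# [{'name': 'GDP'}, {'name': 'CPI', 'rename': 'Inflation'}]
```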

orientation: (str)

'rows' by default. If the data is in columns with respect to time, change this option to 'cols', (and therefore effectively calling a transposition operation).

skip_cols: (str|list)

A column name, or a list of column names to ignore.

And other pandas.read_csv() options that are regularly used include:

usecols: (list)

From the pandas documentation - return a subset of the columns. If array-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid array-like usecols parameter would be [0, 1, 2] or [‘foo’, ‘bar’, ‘baz’]. Note that usecols takes precedence over skip_cols, and that the argument format for usecols for a csv file differs slightly from that for an xlsx file.
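The assumed precedence of usecols over skip_cols can be sketched with the standard-library csv module (select_columns is a hypothetical illustration, not ConCERO's implementation):

```python
import csv
import io

def select_columns(text, usecols=None, skip_cols=None):
    """Keep only the requested columns of a CSV; usecols wins over skip_cols."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0]
    if usecols is not None:
        # usecols takes precedence: skip_cols is ignored entirely.
        keep = [h for h in header if h in usecols]
    elif skip_cols is not None:
        # skip_cols may be a single column name or a list of names.
        skip = [skip_cols] if isinstance(skip_cols, str) else skip_cols
        keep = [h for h in header if h not in skip]
    else:
        keep = header
    idx = [header.index(h) for h in keep]
    return [[row[i] for i in idx] for row in rows]

data = "year,a,b\n2017,1,2\n2018,3,4\n"
print(select_columns(data, skip_cols="b"))  # column 'b' dropped
```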

index_col: (int|list)

The column or list of columns (zero-indexed) in which the identifiers reside or, if orientation=="cols", the column with the date index.

header: (int|list)

The row or list of rows (zero-indexed) in which the date index resides or, if orientation=="cols", the rows with the data identifiers.

nrows: (int)

Number of rows of the file to read. May be useful with very large csv files that have a lot of irrelevant data.

For further documentation, please consult the pandas documentation.

File Objects - Excel files

The process for importing Excel files is very similar to that for csv files. Underneath, the pandas.read_excel() method is used, with virtually identical options and identical meanings. Consequently, not all the standard options will be mentioned here - just the differences in contrast to those for csv files. For a complete list of available options, please consult the pandas documentation.

sheet: (str) or sheet_name: (str)

The name of the sheet in the workbook to be imported.

usecols: (list[int]|str)

Similar to the csv form of the option, usecols accepts a list of zero-indexed integers to identify the columns to be imported. Unlike the csv option, usecols will not accept a list of str, but will accept a single str with an excel-like specification of columns. For example, usecols: A,C,E:H specifies the import of columns A, C and all columns between E and H inclusive.
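The expansion of an Excel-style column specification into zero-indexed integers can be sketched as follows (parse_excel_usecols is a hypothetical illustration, restricted to single-letter columns for brevity):

```python
def parse_excel_usecols(spec: str) -> list:
    """Expand an Excel-style column spec (e.g. 'A,C,E:H') into 0-indexed ints."""
    def col_index(letter: str) -> int:
        return ord(letter.upper()) - ord("A")

    cols = []
    for part in spec.split(","):
        if ":" in part:
            # A range such as 'E:H' is inclusive of both endpoints.
            start, end = part.split(":")
            cols.extend(range(col_index(start), col_index(end) + 1))
        else:
            cols.append(col_index(part))
    return cols

print(parse_excel_usecols("A,C,E:H"))  # [0, 2, 4, 5, 6, 7]
```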

File Objects - HAR (or SHK) files

In this section of the documentation, shk files can be considered equivalent to har files, so all references to har files apply equally to shk files.

har files contain one or more header arrays, where each header array is an array of one or more dimensions (to a maximum of 7). Each dimension of each array has an associated set. Note that the terminology set can be considered misleading because, unlike the mathematical concept of a set, HAR sets have an order. The order of the set corresponds to the placement of items within the array.

To specify the import of a har file, only one option in the file object is necessary - that is, head_arrs with an associated list of strings specifying the names of header arrays to import from the file. Therefore, an example configuration file that specifies the import of a har file could look like:

files:
  - file: har_file.har
    head_arrs:
      - HEA1
      - HEA2

With the example configuration, header arrays HEA1 and HEA2 would be imported from file har_file.har. Note that it is a restriction of the har format itself that header names can not be longer than 4 characters.

In the example above, each header array name is interpreted as a string. The more general format for a header definition is a dict, referred to as a header_dict. Each header_dict must have the option:

  • name: header_name, where header_name is the name of the header.

header_dict must also have the following option if one of the dimensions of the array is to be interpreted as a time dimension:

  • time_dim: (str), where the string is the name of the set indexing the time-dimension (note that the format/data-type of the time dimension is irrelevant).

If the data has no time dimension (which should definitely be avoided) and therefore time_dim is not specified, then default_year must be provided (or inherited from the file object) - otherwise a ValueError is raised.

Note that it may also be necessary to include some of the file-independent options if the time-dimension has a format that deviates from the default. Please see File independent options for more information.

File Objects - VD files (Experimental)

The coder writing the import connector is not familiar with the diversity of VEDA data files (if there are any). Consequently, the VEDA data file importer has been written with several assumptions. Specifically:

  1. Lines starting with an asterisk (*) are comments.
  2. The number of data columns remains constant throughout a single file.

If these assumptions are incorrect, please raise an issue on GitHub.

To specify the import of a vd file, it is mandatory to specify:

  • date_col: (int), where date_col is the zero-indexed number of the column containing the date.
  • val_col: (int), where val_col is the zero-indexed number of the column containing the values.

And optional to specify:

  • default_year: (int) - If left unspecified, all records with an invalid date in date_col are dropped. If specified (as a year), the value of date_col in all records with an invalid date is changed to default_year.

Example:

files:
  - file: a_file.vd
    date_col: 3
    val_col: 8
    default_year: 2018
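The assumed default_year behaviour described above can be sketched in plain Python (apply_default_year is a hypothetical illustration, not ConCERO's implementation):

```python
def apply_default_year(records, date_col, default_year=None):
    """Drop records with an invalid year in date_col, or patch them
    with default_year if one is given."""
    out = []
    for rec in records:
        rec = list(rec)  # copy, so the input records are left untouched
        try:
            int(rec[date_col])
        except (ValueError, IndexError):
            if default_year is None:
                continue  # invalid date and no default: drop the record
            rec[date_col] = default_year  # invalid date: patch with the default
        out.append(rec)
    return out

records = [["coal", "2017", "1.5"], ["coal", "-", "2.5"]]
print(apply_default_year(records, 1))                     # invalid record dropped
print(apply_default_year(records, 1, default_year=2018))  # invalid record patched
```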

Note that it may also be necessary to include some of the file-independent options if the time-dimension has a format that deviates from the default. Please see File independent options for more information.

File Objects - GDX files (Experimental)

GDX files can be imported by providing the option:
  • symbols: list(dict) - where each list item is a dict (referred to as a “symbol dict”).

Each symbol dict must have the options:

  • name: (str) - where name is the name of the symbol to load.
  • date_col: (int) - where date_col specifies the (zero-indexed) column that includes the date reference.

File Independent Options:

The options in this section are relevant to all input files, regardless of their type. They are:

time_regex: (str)

time_fmt: (str)

default_year: (int)

A fundamental principle ConCERO relies upon is that all data has some reference to time (noting that all data observed to date references the year only). The time-index data will typically be in a string format, and the year is interpreted by searching through the string, using the regular expression time_regex. The default - '.*(\d{4})$' - will attempt to interpret the last four characters of the string as the year. Importantly, the match returns the year as the first ‘group’ (in regular expression parlance). It is this first group that time_fmt is used with to convert the string to a datetime object. The default - '%Y' - assumes that the string contains four digits corresponding to the year (and only that).

In the event that the date-time data isn’t stored in the file itself, a default_year option (a single integer corresponding to the year - e.g. 2017) must be provided. What follows is an example, using the defaults of time_regex and time_fmt, to demonstrate how this works…

Let’s assume the time index series is given, in CSV form, by:

bs1b-2017,bs1b-br1r-pl1p-2018,bs1b-br1r-pl1p-2019,...

which is typically seen with VURM-related data. The last four digits are obviously the year, so the default setting is appropriate. The regex essentially reduces the data to a list of strings:

['2017', '2018', '2019', etc...]

However, ConCERO needs to convert these strings to a pandas datetime format. This is done with the datetime.strptime() method, which relies on matching the strings against a pattern. The default - '%Y' - interprets the strings as four digits corresponding to the year - an obviously satisfactory result. Hence, the following options are appropriate to include in the YAML configuration file:

time_regex: .*(\d{4})$
time_fmt: '%Y'

Note: if the default settings (as per the example immediately above) are appropriate, specifying them is not necessary.
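The two-step interpretation above can be demonstrated in plain Python, using the standard-library re and datetime modules in place of ConCERO's internals:

```python
import re
from datetime import datetime

time_regex = r".*(\d{4})$"  # group 1 captures the trailing four digits
time_fmt = "%Y"             # group 1 is parsed as a four-digit year

labels = ["bs1b-2017", "bs1b-br1r-pl1p-2018", "bs1b-br1r-pl1p-2019"]

# Step 1: extract the year string with time_regex (the first group).
years = [re.match(time_regex, lbl).group(1) for lbl in labels]
print(years)  # ['2017', '2018', '2019']

# Step 2: parse each year string into a datetime object with time_fmt.
dates = [datetime.strptime(y, time_fmt) for y in years]
print([d.year for d in dates])  # [2017, 2018, 2019]
```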

[1]For a more thorough yet simple introduction to YAML files, http://docs.ansible.com/ansible/latest/YAMLSyntax.html is recommended.
[2]‘Indented’ can refer to a tab, 4 spaces or any combination of tabs/spaces. It is however critical that the indentation pattern remains consistent (which is a requirement in common with python).

ToCERO Technical Specification

class to_cero.ToCERO(conf: dict, *args, parent: dict = None, **kwargs) → pandas.core.frame.DataFrame[source]

Loads a ToCERO configuration, suitable for creating CEROs from data files.

Parameters:
  • conf ('Union[dict,str]') – The configuration dictionary, or a path to a YAML file containing the configuration dictionary. If a path, it must be provided as an absolute path, or relative to the current working directory.
  • args – Passed to the superclass (dict) at initialisation.
  • kwargs – Passed to the superclass (dict) at initialisation.
create_cero()[source]

Create a CERO from the configuration (defined by self).

Return pd.DataFrame:
 A CERO is returned.
static is_valid(conf, raise_exception=True)[source]

Performs static validity checks on conf as a ToCERO object.

Parameters:
  • conf (dict) – An object which may or may not be suitable as a ToCERO object.
  • raise_exception (bool) – If True (the default) an exception will be raised in the event a test is failed. Otherwise (in this event) an error message is printed to stdout and False is returned.
Return bool:

A bool indicating the validity of conf as a ToCERO object.

static load_config(conf, parent: dict = None)[source]
Parameters:
  • conf ('Union[dict,str]') – A configuration dictionary, or a str to a path containing a configuration dictionary.
  • parent (dict) – A dict from which to inherit.
Return dict:

The configuration dictionary (suitable as a ToCERO object).

static run_checks(conf, raise_exception=True)[source]

Performs dynamic validity checks on conf as a ToCERO object.

Parameters:
  • conf (dict) – An object which may or may not be suitable as a ToCERO object.
  • raise_exception (bool) – If True (the default) an exception will be raised in the event a test is failed. Otherwise (in this event) an error message is printed to stdout and False is returned.
Return bool:

A bool indicating the validity of conf as a ToCERO object.

Created on Fri Jan 19 11:49:23 2018

Section author: Lyle Collins <Lyle.Collins@csiro.au>

Code author: Lyle Collins <Lyle.Collins@csiro.au>