rainforest.database package

This package provides routines used for database management.

rainforest.database.database : main database class, used as an entry-point to the database

rainforest.database.db_populate : command-line script used to add data to database

rainforest.database.retrieve_radar_data : functions used to add new radar data to database

rainforest.database.retrieve_reference_data : functions used to add new Cartesian reference data to database

rainforest.database.retrieve_dwh_data R module : R functions used to add new gauge data to database

rainforest.database.database module

Main class to update the RADAR/STATION database and run queries to retrieve specific data

Note that I use Spark because there is currently no way to run SQL queries with Dask

class rainforest.database.database.DataFrameWithInfo(name, df)

Bases: pyspark.sql.dataframe.DataFrame

property info
class rainforest.database.database.Database(config_file=None)

Bases: object

Creates a Database instance that can be used to load data, update new data, run queries, etc.

Parameters

config_file (str (optional)) – Path of the configuration file to use; it can also be provided later and is needed only if you want to update the database with new data

add_tables(filepaths_dic, get_summaries=False)

Reads sets of data files contained in folders as Spark DataFrames and adds them to the database instance

Parameters

filepaths_dic (dict) –

Dictionary where the keys are the names of the dataframes to add and the values are wildcard patterns pointing to the files, for example

{'gauge': '/mainfolder/gauge/*.csv', 'radar': '/mainfolder/radar/*.csv', 'reference': '/mainfolder/reference/*.parquet'}

will add the three tables 'gauge', 'radar' and 'reference' to the database
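A minimal usage sketch, assuming such a folder layout exists (the paths are purely illustrative):

    from rainforest.database.database import Database

    db = Database()  # config_file is only needed to update the database with new data
    db.add_tables({
        'gauge':     '/mainfolder/gauge/*.csv',
        'radar':     '/mainfolder/radar/*.csv',
        'reference': '/mainfolder/reference/*.parquet',
    })
    # the three Spark DataFrames are now registered in db.tables under
    # the names 'gauge', 'radar' and 'reference'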

property config_file
query(sql_query, to_memory=True, output_file='')

Performs an SQL query on the database, returns the result and, if requested, writes it to a file

Parameters
  • sql_query (str) – Valid SQL query; all tables referred to in the query must be included in the tables attribute of the database (i.e. they must first be added with the add_tables command)

  • to_memory (bool (optional)) – If True, will try to load the result into RAM as a pandas dataframe; if the predicted size of the query result is larger than the parameter WARNING_RAM in common.constants, this will be ignored

  • output_file (str (optional)) – Full path of an output file where the query will be dumped into. Must end either with .csv, .gz.csv, or .parquet, this will determine the output format

Returns

A pandas DataFrame if the result fits in memory, otherwise a cached Spark DataFrame
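A hedged example of a query on previously added tables; the column name STATION and the output path are illustrative assumptions, not guaranteed to match your tables:

    # db is the Database instance built in the sketch above
    result = db.query(
        "SELECT * FROM gauge WHERE STATION = 'PAY'",  # any valid SQL over added tables
        to_memory=True,                               # pandas DataFrame if it fits in RAM
        output_file='/tmp/gauge_pay.gz.csv'           # optional dump, format given by extension
    )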

update_radar_data(gauge_table_name, output_folder, t0=None, t1=None)

Updates the radar table using timesteps from the gauge table

Inputs:

  • gauge_table_name (str) – name of the gauge table; it must be included in the tables of the database, i.e. you must first add it with add_tables(..)

  • output_folder (str) – directory where to store the computed radar tables

  • t0 (str (optional)) – start time of the retrieval in YYYYMMDD(HHMM) format; by default all timesteps that are in the gauge table will be used

  • t1 (str (optional)) – end time of the retrieval in YYYYMMDD(HHMM) format; by default all timesteps that are in the gauge table will be used

update_reference_data(gauge_table_name, output_folder, t0=None, t1=None)

Updates the reference product table using timesteps from the gauge table

Inputs:

  • gauge_table_name (str) – name of the gauge table; it must be included in the tables of the database, i.e. you must first add it with add_tables(..)

  • output_folder (str) – directory where to store the computed reference tables

  • t0 (str (optional)) – start time of the retrieval in YYYYMMDD(HHMM) format; by default all timesteps that are in the gauge table will be used

  • t1 (str (optional)) – end time of the retrieval in YYYYMMDD(HHMM) format; by default all timesteps that are in the gauge table will be used
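A sketch of the typical update workflow, assuming a valid configuration file and gauge table (paths and dates are illustrative):

    db = Database(config_file='/path/to/config.yml')
    db.add_tables({'gauge': '/mainfolder/gauge/*.csv'})
    # compute radar and reference tables for all gauge timesteps in January 2019
    db.update_radar_data('gauge', '/mainfolder/radar/',
                         t0='201901010000', t1='201902010000')
    db.update_reference_data('gauge', '/mainfolder/reference/',
                             t0='201901010000', t1='201902010000')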

update_station_data(t0, t1, output_folder)

Populates the csv files that contain the point measurement data and that serve as the base to update the database. A different file is created for every station; if a file is already present, the new data is appended to it.

Inputs:

  • t0 (str) – start time in YYYYMMDD(HHMM) format, (HHMM) is optional

  • t1 (str) – end time in YYYYMMDD(HHMM) format, (HHMM) is optional

  • output_folder (str) – directory where the files should be stored; if it is not empty, the new data will be merged with existing files if relevant
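A minimal sketch, with illustrative dates and paths, of populating the per-station csv files that the rest of the update workflow builds on:

    db = Database(config_file='/path/to/config.yml')
    # retrieve point measurements for January 2019; HHMM may be omitted
    db.update_station_data('20190101', '20190201', '/mainfolder/gauge/')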

class rainforest.database.database.TableDict

Bases: dict

This is an extension of the classic Python dict that automatically calls createOrReplaceTempView once a table has been added to the dict
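The behaviour can be pictured with the following minimal sketch (not the actual implementation), where adding a Spark DataFrame immediately registers it as a temporary SQL view under its key:

    class TableDictSketch(dict):
        def __setitem__(self, name, df):
            super().__setitem__(name, df)
            # the table becomes queryable by name with spark.sql(...) right away
            df.createOrReplaceTempView(name)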

rainforest.database.db_populate module

Command line script to add new data to the database

see Database command-line tool

rainforest.database.db_populate.main()

rainforest.database.retrieve_radar_data module

Main routine for retrieving radar data. This is meant to be run as a command-line command from a slurm script

i.e. ./retrieve_radar_data -t <task_file_name> -c <config_file_name> -o <output_folder>

IMPORTANT: this function is called by the main routine in database.py so you should never have to call it manually

class rainforest.database.retrieve_radar_data.Updater(task_file, config_file, output_folder)

Bases: object

Creates an Updater class instance used to add new radar data to the database

Parameters
  • task_file (str) – The full path to a task file, i.e. a file with the following format: timestamp, station1, station2, station3, …, stationN. These files are generated by the database.py module, so normally you shouldn’t have to create them yourself

  • config_file (str) – The full path of a configuration file written in yaml format that indicates how the radar retrieval must be done

  • output_folder (str) – The full path where the generated files will be stored

get_agg_operators()

Returns all aggregation operator codes needed to aggregate all columns to 10 min resolution, 0 = mean, 1 = log mean

process_all_timesteps()

Processes all timesteps that are in the task file
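Although the Updater is normally created by database.py, a manual run could look like the following sketch (paths are illustrative):

    from rainforest.database.retrieve_radar_data import Updater

    u = Updater('/tmp/task_file.txt', '/path/to/config.yml', '/mainfolder/radar/')
    u.process_all_timesteps()  # processes every timestep listed in the task file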

process_single_timestep(list_stations, radar_object, tidx)

Processes a single 5 min timestep for a set of stations

Parameters
  • list_stations (list of str) – Names of all SMN or pluvio stations for which to retrieve the radar data

  • radar_object (Radar object instance as defined in common.radarprocessing) – a radar object which contains all radar variables in polar format

  • tidx (int) – indicates if a radar 5 min timestep is the first or the second in the corresponding 10 min gauge period, 1 = first, 2 = second

retrieve_radar_files(radar, start_time, end_time, include_vpr=True, include_status=True)

Retrieves a set of radar files for a given time range

Parameters
  • radar (char) – The name of the radar, i.e. either ‘A’, ‘D’, ‘L’, ‘P’ or ‘W’

  • start_time (datetime.datetime instance) – starting time of the time range

  • end_time (datetime.datetime instance) – end time of the time range

  • include_vpr (bool (optional)) – Whether or not to also include VPR files

  • include_status (bool (optional)) – Whether or not to also include status files
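A hedged example, assuming retrieve_radar_files can be reached from the retrieve_radar_data module namespace (the dates are illustrative):

    import datetime
    from rainforest.database import retrieve_radar_data

    files = retrieve_radar_data.retrieve_radar_files(
        'A',                                   # one of 'A', 'D', 'L', 'P', 'W'
        datetime.datetime(2019, 1, 1, 0, 0),   # start of the time range
        datetime.datetime(2019, 1, 1, 1, 0),   # end of the time range
        include_vpr=True,
        include_status=True,
    )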

rainforest.database.retrieve_reference_data module

Main routine for retrieving reference MeteoSwiss data (e.g. CPC, RZC, POH, etc.). This is meant to be run as a command-line command from a slurm script

i.e. ./retrieve_reference_data -t <task_file_name> -c <config_file_name> -o <output_folder>

IMPORTANT: this function is called by the main routine in database.py so you should never have to call it manually

Daniel Wolfensberger, LTE-MeteoSwiss, 2020

class rainforest.database.retrieve_reference_data.Updater(task_file, config_file, output_folder)

Bases: object

Creates an Updater class instance used to add new reference data to the database

Parameters
  • task_file (str) – The full path to a task file, i.e. a file with the following format: timestamp, station1, station2, station3, …, stationN. These files are generated by the database.py module, so normally you shouldn’t have to create them yourself

  • config_file (str) – The full path of a configuration file written in yaml format that indicates how the retrieval of reference data must be done

  • output_folder (str) – The full path where the generated files will be stored

process_all_timesteps()

Processes all timesteps that are in the task file

retrieve_cart_files(start_time, end_time, products)

Retrieves a set of reference product files for a given time range

Parameters
  • start_time (datetime.datetime instance) – starting time of the time range

  • end_time (datetime.datetime instance) – end time of the time range

  • products (list of str) – list of all products to retrieve, must be valid MeteoSwiss product names, for example CPC, CPCH, RZC, MZC, BZC, etc
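A hedged example, assuming retrieve_cart_files can be reached from the retrieve_reference_data module namespace (dates and product list are illustrative):

    import datetime
    from rainforest.database import retrieve_reference_data

    files = retrieve_reference_data.retrieve_cart_files(
        datetime.datetime(2019, 1, 1, 0, 0),   # start of the time range
        datetime.datetime(2019, 1, 1, 1, 0),   # end of the time range
        ['CPC', 'RZC', 'BZC'],                 # valid MeteoSwiss product names
    )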

rainforest.database.retrieve_dwh_data R module

Main routine for retrieving station data. This is meant to be run as a command-line command from a slurm script

i.e. Rscript retrieve_dwh_data.r <t0> <t1> <threshold> <stations> <variables> <output_folder> <missing_value> <overwrite>

IMPORTANT: this function is called by the main routine in database.py so you should never have to call it manually

retrieve_dwh_data.R [options]

Options:

  • t0 (str) - start time in YYYYMMDDHHMM format

  • t1 (str) - end time in YYYYMMDDHHMM format

  • threshold (float) - minimum value of hourly precipitation for the entire hour to be included in the database (i.e. all six 10 min timesteps)

  • stations (str) - list of stations for which to retrieve data

  • variables (str) - list of variables to retrieve, using the DWH names, for example “tre200s0,prestas0,ure200s0,rre150z0,dkl010z0,fkl010z0”

  • output_folder (str) - directory where to store the csv files containing the retrieved data

  • missing_value (float) - value used to mark missing data in the output files

  • overwrite (bool) - whether or not to overwrite already existing data in the output_folder