API Reference

intake_spark.spark_sources.SparkRDD(args[, …]) A reference to an RDD definition in Spark
intake_spark.spark_sources.SparkDataFrame(args[, …]) A reference to a DataFrame definition in Spark
intake_spark.spark_cat.SparkTablesCatalog([…]) An automatically-generated Intake catalog for tables stored in Spark
class intake_spark.spark_sources.SparkRDD(args, context_kwargs=None, metadata=None)[source]

A reference to an RDD definition in Spark

RDDs are list-like collections of arbitrary objects, evaluated lazily in Spark.

Examples

>>> args = [('textFile', ('text.*.files', )),
...         ('map', (len,))]
>>> context = {'master': 'spark://master.node:7077'}
>>> source = SparkRDD(args, context)

The output of source.to_spark() is an RDD object holding the lengths of the lines of the input files.
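
Each (method, args) pair in args is applied in sequence to Spark's context object. As an illustrative sketch only, assuming a SparkContext available as sc (sc is not part of this API), the chain above corresponds to:

>>> rdd = sc.textFile('text.*.files').map(len)  # the same chain, applied directly
>>> lengths = source.read()                     # or materialise to a Python list via Intake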

Attributes:

cache_dirs
datashape
description
hvplot: Returns an hvPlot object to provide a high-level plotting API.
plot: Returns an hvPlot object to provide a high-level plotting API.
plots: List custom associated quick-plots.

Methods

close() Close open resources corresponding to this data source.
discover() Open the resource and populate the source attributes.
read() Materialise the whole RDD into a list of objects.
read_chunked() Return an iterator over container fragments of the data source.
read_partition(i) Return one of the partitions of the RDD as a list of objects.
to_dask() Return a Dask container for this data source.
to_spark() Return the Spark object for this data, an RDD.
yaml([with_plugin]) Return a YAML representation of this data source.
set_cache_dir
read()[source]

Materialise the whole RDD into a list of objects

read_partition(i)[source]

Return one of the partitions of the RDD as a list of objects

to_spark()[source]

Return the Spark object for this data, an RDD

class intake_spark.spark_sources.SparkDataFrame(args, context_kwargs=None, metadata=None)[source]

A reference to a DataFrame definition in Spark

DataFrames are tabular Spark objects containing a heterogeneous set of columns and potentially a large number of rows. They are similar in concept to Pandas or Dask data-frames. The Spark variety produced by this driver is a handle to a lazy object; computation is managed by Spark.

Examples

>>> args = [
...    ('read', ),
...    ('format', ('csv', )),
...    ('option', ('header', 'true')),
...    ('load', ('data.*.csv', ))]
>>> context = {'master': 'spark://master.node:7077'}
>>> source = SparkDataFrame(args, context)

The output of source.to_spark() is a Spark DataFrame pointing to the parsed contents of the indicated CSV files.
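
Again, each (method, args) pair is applied in sequence, here starting from the Spark session. As an illustrative sketch only, assuming a SparkSession available as spark, the chain above corresponds to:

>>> df = spark.read.format('csv').option('header', 'true').load('data.*.csv')
>>> pdf = source.read()  # or load everything into an in-memory Pandas data-frame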

Attributes:

cache_dirs
datashape
description
hvplot: Returns an hvPlot object to provide a high-level plotting API.
plot: Returns an hvPlot object to provide a high-level plotting API.
plots: List custom associated quick-plots.

Methods

close() Close open resources corresponding to this data source.
discover() Open the resource and populate the source attributes.
read() Read all of the data into an in-memory Pandas data-frame.
read_chunked() Return an iterator over container fragments of the data source.
read_partition(i) Return one partition of the data as a Pandas data-frame.
to_dask() Return a Dask container for this data source.
to_spark() Return the Spark object for this data, a DataFrame.
yaml([with_plugin]) Return a YAML representation of this data source.
set_cache_dir
read()[source]

Read all of the data into an in-memory Pandas data-frame

read_partition(i)[source]

Return one partition of the data as a Pandas data-frame

to_spark()[source]

Return the Spark object for this data, a DataFrame

class intake_spark.spark_cat.SparkTablesCatalog(database=None, context_kwargs=None, metadata=None)[source]

An automatically-generated Intake catalog for tables stored in Spark

This driver queries Spark's Catalog object for tables and creates an entry for each; when accessed, an entry instantiates a SparkDataFrame source. Commonly, these table definitions come from Hive.
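
Examples

A rough usage sketch, assuming a Spark deployment whose catalog already contains Hive tables; the table name 'mytable' is hypothetical:

>>> cat = SparkTablesCatalog(context_kwargs={'master': 'spark://master.node:7077'})
>>> list(cat)                # names of tables found in Spark's Catalog
>>> source = cat['mytable']  # entry that instantiates a SparkDataFrame source
>>> df = source.to_spark()   # lazy Spark DataFrame for that table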

Attributes:

cache_dirs
datashape
description
hvplot: Returns an hvPlot object to provide a high-level plotting API.
plot: Returns an hvPlot object to provide a high-level plotting API.
plots: List custom associated quick-plots.

Methods

close() Close open resources corresponding to this data source.
discover() Open the resource and populate the source attributes.
force_reload() Imperatively reload the data now.
read() Load the entire dataset into a container and return it.
read_chunked() Return an iterator over container fragments of the data source.
read_partition(i) Return an (offset_tuple, container) pair corresponding to the i-th partition.
reload() Reload the catalog if sufficient time has passed.
to_dask() Return a Dask container for this data source.
to_spark() Provide an equivalent data object in Apache Spark.
walk([sofar, prefix, depth]) Get all entries in this catalog and its sub-catalogs.
yaml([with_plugin]) Return a YAML representation of this data source.
search
set_cache_dir