API Reference#

This page provides an overview of all public objects, functions, and methods included in the scystream-sdk.

Core (scystream.sdk.core)#

Entrypoints are the core of the SDK.

The SDK provides a decorator for defining entrypoints. Use it to mark a function as an entrypoint, and pass a scystream.sdk.env.settings.EnvSettings class if the entrypoint needs user-defined settings.

Config (scystream.sdk.config)#

The scystream-sdk contains two main configuration objects.

  1. SDKConfig (scystream.sdk.config.SDKConfig)

    This object holds all global configuration for the SDK, for example the app name, which is used to identify the compute block on your spark-master.

  2. ComputeBlockConfig (scystream.sdk.config.models.ComputeBlock)

    The ComputeBlockConfig is a file that configures a ComputeBlock's inputs and outputs. It also contains some metadata (e.g. author, Docker image URL, …).
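The split between the two objects can be sketched as follows. The field names below are illustrative assumptions, not the SDK's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the two configuration objects described
# above; field names are illustrative, not the SDK's actual schema.


@dataclass
class SDKConfig:
    # Global SDK settings, e.g. the app name shown on the spark-master.
    app_name: str


@dataclass
class ComputeBlockConfig:
    # Describes one compute block: metadata plus its inputs and outputs.
    name: str
    author: str
    docker_image: str
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)


cfg = SDKConfig(app_name="my-compute-block")
block = ComputeBlockConfig(
    name="preprocessing",
    author="jane",
    docker_image="registry.example.com/preprocess:latest",
)
```

The key point is scope: SDKConfig applies globally, while each compute block carries its own ComputeBlockConfig.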

ENVs and Settings (scystream.sdk.env)#

When using the scystream-sdk and defining entrypoints, it is important to give the user (via the scheduler) the possibility to define settings for each entrypoint.

These settings are set using env variables.

There are three main types of Settings:

  1. EnvSettings (scystream.sdk.env.settings.EnvSettings)

    The EnvSettings class inherits from pydantic's BaseSettings class. It can be used to parse environment variables from a .env file. Use this class when defining the settings for an entrypoint.

    However, you can also use this class to parse custom environment variables that are not user-defined.

  2. InputSettings (scystream.sdk.env.settings.InputSettings)

    Use this when defining settings for your inputs. Under the hood, this works exactly the same as EnvSettings.

  3. OutputSettings (scystream.sdk.env.settings.OutputSettings)

    Use this when defining settings for your outputs. Under the hood, this works exactly the same as EnvSettings.
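The core idea behind these settings classes, declared fields filled from environment variables, can be sketched with the standard library alone. The real SDK classes build on pydantic's BaseSettings; this stand-in only mimics the pattern:

```python
import os

# Stand-in sketch of the EnvSettings idea: a class whose declared
# fields are filled from environment variables. The real SDK uses
# pydantic's BaseSettings; this stdlib version only mimics the pattern.


class EnvSettingsSketch:
    def __init__(self):
        # Read each declared class attribute from the environment,
        # falling back to the declared default.
        for name, default in self.__class__.__dict__.items():
            if name.startswith("_") or callable(default):
                continue
            self.__dict__[name] = os.environ.get(name.upper(), default)


class TrainSettings(EnvSettingsSketch):
    # Hypothetical user-definable settings for a "train" entrypoint.
    learning_rate = "0.01"
    epochs = "10"


os.environ["EPOCHS"] = "25"  # e.g. set by the scheduler
settings = TrainSettings()
```

A scheduler can thus override any setting per entrypoint simply by exporting the matching environment variable before the entrypoint runs.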

The SDK also provides more specific types of inputs and outputs. These offer predefined config-keys:

  1. FileSettings (scystream.sdk.env.settings.FileSettings)

  2. PostgresSettings (scystream.sdk.env.settings.PostgresSettings)
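"Predefined config-keys" means these classes ship with a fixed set of expected variables. The key names below are purely hypothetical examples for illustration; the SDK's actual key names may differ:

```python
import os

# Illustrative sketch only: a settings class with predefined keys.
# The key names (PG_HOST, PG_PORT, ...) are hypothetical examples,
# not the SDK's actual key names.


class PostgresSettingsSketch:
    KEYS = ("PG_HOST", "PG_PORT", "PG_USER", "PG_PASS")

    def __init__(self):
        # Collect every predefined key from the environment.
        self.values = {key: os.environ.get(key) for key in self.KEYS}


os.environ["PG_HOST"] = "localhost"
pg = PostgresSettingsSketch()
```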

Spark Manager (scystream.sdk.spark_manager)#

We aim to handle all our data exchange & data usage using Apache Spark.

To use Spark you need to configure the scystream.sdk.spark_manager.SparkManager, which connects to a spark-master and gives you access to the session.
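The shape of such a manager, hold the connection details, hand out a session, can be sketched as below. A dummy session class stands in for a real pyspark SparkSession, and the constructor arguments are assumptions for illustration:

```python
# Stand-in sketch of what a SparkManager-style wrapper does: hold the
# connection details for a spark-master and hand out a session object.
# A dummy class replaces a real pyspark SparkSession here.


class FakeSession:
    def __init__(self, master, app_name):
        self.master = master
        self.app_name = app_name


class SparkManagerSketch:
    def __init__(self, master_url, app_name):
        self._master_url = master_url
        self._app_name = app_name
        self._session = None

    @property
    def session(self):
        # Create the session lazily on first access, then reuse it.
        if self._session is None:
            self._session = FakeSession(self._master_url, self._app_name)
        return self._session


manager = SparkManagerSketch("spark://spark-master:7077", "my-compute-block")
```

Lazy creation means no connection is attempted until the session is actually needed, and every caller shares the same session instance.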

Bear in mind that currently only the database connection is handled through Spark. When using a database, make sure to set up the connection using the utilities described in the next section.

Database Handling (scystream.sdk.database_handling)#

The database handling package contains all the required utilities to connect to and query a database. It builds on Apache Spark.

Currently the scystream-sdk supports the following databases:

  1. Postgres (scystream.sdk.database_handling.postgres_manager)

    To configure a connection to Postgres, use the scystream.sdk.spark_manager.SparkManager.setup_pg method. The postgres_manager module currently supports:
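To give a feel for the read/write pattern such a manager exposes, here is a stand-in sketch. An in-memory dict plays the role of the database, and the method names are illustrative assumptions; the real module routes these operations through Spark:

```python
# Illustrative stand-in for a Postgres manager: write rows to and read
# rows from named tables. An in-memory dict replaces the database, and
# the method names are assumptions; the real module goes through Spark.


class PostgresManagerSketch:
    def __init__(self):
        self._tables = {}

    def write_table(self, name, rows):
        # Overwrite the named table with the given rows.
        self._tables[name] = list(rows)

    def read_table(self, name):
        return self._tables[name]


pg = PostgresManagerSketch()
pg.write_table("users", [{"id": 1, "name": "ada"}])
rows = pg.read_table("users")
```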

File Handling (scystream.sdk.file_handling)#

The file handling package contains all the required utilities to connect to and query a file storage. Currently the file handling package does NOT use Apache Spark.

Currently the scystream-sdk supports the following file-storages:

  1. S3 Buckets (scystream.sdk.file_handling.s3_manager)

    The s3_manager module currently supports:
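As a rough illustration of what an S3-style manager looks like, here is a stand-in sketch. An in-memory dict replaces a real S3 client, and the method names are illustrative assumptions, not the module's actual API:

```python
# Stand-in sketch of an S3-style file manager: upload and download
# objects addressed by bucket and key. An in-memory dict replaces a
# real S3 client; method names are illustrative assumptions.


class S3ManagerSketch:
    def __init__(self):
        self._objects = {}

    def upload(self, bucket, key, data):
        self._objects[(bucket, key)] = data

    def download(self, bucket, key):
        return self._objects[(bucket, key)]


s3 = S3ManagerSketch()
s3.upload("results", "model.bin", b"weights")
payload = s3.download("results", "model.bin")
```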

Scheduler (scystream.sdk.scheduler)#

The scheduler module can be used to list & execute entrypoints.
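Listing and executing entrypoints can be sketched against a registry of named callables. The class and method names below are assumptions for illustration, not the scheduler's actual API:

```python
# Stand-in sketch of a scheduler that lists and executes registered
# entrypoints. The registry dict stands in for functions registered
# via the SDK's entrypoint decorator; names are illustrative only.

_REGISTRY = {
    "train": lambda: "trained",
    "evaluate": lambda: "evaluated",
}


class SchedulerSketch:
    def list_entrypoints(self):
        # Return the registered entrypoint names in a stable order.
        return sorted(_REGISTRY)

    def execute(self, name):
        return _REGISTRY[name]()


scheduler = SchedulerSketch()
names = scheduler.list_entrypoints()
outcome = scheduler.execute("train")
```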

Modules#

The following modules are part of the scystream-sdk: