The RDP Data Crawler mainly interfaces with external systems and the AIT RDP. It fetches data, either periodically or event-driven, from various sources such as forecasts or measurement systems and stores the data into Redis streams. In addition, the RDP Data Crawler can push data to external systems such as Modbus or OPC UA devices.
The RDP Data Crawler is designed to be integrated into the AIT RDP as a Docker container. The main Docker image is
available on Docker Hub as `ait1/rdp-data-crawler`. In addition to version tags, the following tags are supported:

- `latest`: The latest stable release branch.
- `latest-dev`: The latest version of the development branch.
Since the configurations are commonly rather complex, direct configuration via environment variables is not feasible.
Instead, a configuration file or directory is mounted. By default, the configuration is located at
`/etc/data_crawler/config.yml`. Alternatively, the whole `/etc/data_crawler/` directory can be mounted in case
sub-configuration files are needed. The following example shows a basic docker-compose service definition:
```yaml
services:
  # ...
  data-crawler:
    image: ait1/rdp-data-crawler:latest-dev
    volumes:
      - ./data-crawler/config.yml:/etc/data_crawler/config.yml:ro
    environment:
      REDIS_USERNAME: ${REDIS_USERNAME}
      REDIS_PASSWORD: ${REDIS_PASSWORD}
    depends_on:
      - redis
    restart: unless-stopped
```

For other installation methods, including custom data sources and development setups, please refer to the Advanced Installation section.
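When sub-configuration files are needed, the whole configuration directory can be mounted instead of the single file. A sketch of the corresponding volume definition (the host path `./data-crawler/` is an illustrative assumption):

```yaml
services:
  data-crawler:
    image: ait1/rdp-data-crawler:latest-dev
    volumes:
      # Mount the whole configuration directory instead of a single file.
      - ./data-crawler/:/etc/data_crawler/:ro
```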
The configuration is done via a YAML file that supports the AIT RDP extensions such as variable substitution and templating. The configuration is organized into multiple channels that are described from the perspective of the data sources. For each channel, a single source is created and the data from that source is accessed, processed, and written to a destination. In most cases, either the destination (default) or the source will be an internal Redis instance. However, other configurations are supported as well. Hence, a single data crawler process can handle multiple sources concurrently, which reduces the overhead of spinning up a large number of containers. The basic structure of the configuration is as follows:
```yaml
version: 1 # Configuration file version, mostly to ensure later compatibility
# List of sources; a dedicated channel is spun up for each source
data_sources:
  source.name.0: # Unique name of the channel, mostly used for debugging
    type: <source_type> # Type of the source to instantiate
    source_parameters: {} # Source-specific parameters are defined here
    # The polling section describes the timing of passive data sources that are executed periodically. Active data
    # sources that listen on external events may emit messages at any time. Hence, the polling section can be omitted
    # for these sources.
    polling:
      frequency: 5min
    sink_type: <sink_type> # Type of the sink
    sink_parameters: {} # Sink-specific parameters are defined here
  source.name.1: # Another source
    # ...
```

Each channel has at least a source and a sink configuration. Both sources and sinks are dynamically loaded by the
respective `type`. Source- and sink-specific parameters can be passed in the `source_parameters` and `sink_parameters`
sections, respectively. In case a Redis sink is used, the configuration can be simplified by omitting the `sink_type`
and `sink_parameters` sections and appending a `redis` section instead:
```yaml
version: 1
data_sources:
  source.name.0:
    type: <source_type>
    source_parameters: {}
    polling:
      frequency: 5min
    # The redis section replaces the sink_type and sink_parameters sections.
    redis:
      stream: <stream_name> # Name of the Redis stream to write to.
      tags: # Optional tags to be added to the Redis stream.
        message-key-0: message-value-0
        message-key-1: message-value-1
```

Note that the `redis` sink allows appending arbitrary but static message fields. This can be used to set meta-data that is needed for further data processing.
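The effect of the static `tags` can be illustrated with a small, hypothetical Python sketch (this is not the crawler's actual implementation; the function and field names are made up): each message produced by the source is merged with the configured tags before it is written to the stream.

```python
def build_stream_entry(payload: dict, tags: dict) -> dict:
    """Merge the static tags from the `redis` section into a message payload.

    Hypothetical illustration only: the tags provide constant meta-data,
    while the payload carries the actual crawled values.
    """
    entry = dict(tags)     # start with the static meta-data
    entry.update(payload)  # crawled fields take precedence on key clashes
    return entry

entry = build_stream_entry(
    {"value": "42.0"},
    {"message-key-0": "message-value-0", "message-key-1": "message-value-1"},
)
print(entry)
```

Because the tags are static per channel, downstream consumers can rely on them for routing and further processing.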
Sources that require regular polling can be configured via the `polling` section. Fine-grained control is possible in
order to tune the timing and avoid overloading the various sources. At minimum, the `frequency` parameter must be
set. The complete list of parameters is as follows:
- `frequency`: The nominal interval between two consecutive polling operations. Values are interpreted according to the Pandas Timedelta specification, including the ISO 8601 duration representation. Examples are `5min`, `1h`, `1d`, and `1w`.
- `jitter`: A uniformly distributed random jitter that is added to the nominally scheduled time. This may be useful to load-balance a source serving multiple requests at the same time. The value is interpreted according to the Pandas Timedelta specification, including the ISO 8601 duration representation. The jitter is applied symmetrically, i.e., the scheduled point in time may be both reduced or extended by the jitter value, at maximum.
- `offset`: A fixed offset to shift the scheduling interval. By default, scheduling is aligned to the full hour/day/month/etc. The offset shifts this alignment by the given timedelta. Values are represented as described above.
- `slot_count`, `slot_id`: The scheduling interval can be divided into multiple time slots, where each slot is occupied by another operation. This feature can, for instance, be used to access devices by multiple data crawlers in a round-robin fashion, avoiding concurrent data access. `slot_count` defines the total number of slots, while `slot_id` gives the current slot of the source process. The slot IDs start at 0 and max out at `slot_count - 1`. Note that the slot system only affects the scheduling offset and does not perform any synchronization among sources or data crawlers. Make sure that sufficient timing reserves are available and that the system time among multiple hosts is sufficiently well synchronized. By default, one slot is configured.
- `force_initial`: If set to `true`, the source will be triggered immediately after startup. This behaviour is mainly intended to quickly populate the AIT RDP data structures on startup without having to wait a long time until regular scheduling intervals (e.g., of forecasts) are met. Defaults to `true`.
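Putting these parameters together, a polling section for two crawlers sharing a device in a round-robin fashion might look as follows (the concrete values are illustrative assumptions, not recommendations):

```yaml
polling:
  frequency: 5min   # nominal polling interval
  jitter: 10s       # up to +/- 10 s of random jitter per trigger
  offset: 30s       # shift the alignment away from the full 5-minute mark
  slot_count: 2     # two crawlers share the interval ...
  slot_id: 0        # ... and this one occupies the first slot
  force_initial: true
```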
In case the processing duration exceeds the configured frequency, the next operation is scheduled immediately. If the delay exceeds the following regular interval, that triggering point is skipped in order to avoid a pile-up of delays and unpredictable timing.
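The alignment and slot logic described above can be sketched in plain Python (an illustrative approximation, not the crawler's actual scheduler; jitter and the overrun handling are omitted for clarity, and alignment to full hours/days is approximated by aligning to the epoch):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def next_trigger(now: datetime, frequency: timedelta,
                 offset: timedelta = timedelta(0),
                 slot_count: int = 1, slot_id: int = 0) -> datetime:
    """Compute the next aligned trigger time.

    Intervals are aligned to the epoch, shifted by `offset`, and each
    slot occupies an equal share of the interval.
    """
    anchor = offset + frequency * slot_id / slot_count
    elapsed = (now - EPOCH - anchor) % frequency  # time since the last trigger
    return now + (frequency - elapsed)

now = datetime(2024, 1, 1, 12, 3, tzinfo=timezone.utc)
print(next_trigger(now, timedelta(minutes=5)))  # next full 5-minute mark, 12:05
print(next_trigger(now, timedelta(minutes=5), slot_count=2, slot_id=1))
```

Note that taking the modulo of the elapsed time also reflects the skip behaviour: a delayed trigger simply collapses onto the next aligned point instead of accumulating lag.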
Each channel is executed independently. In case a stale or faulty channel is detected, a restart is triggered to clear transient faults and to support less stable external sources. In addition, the status as well as any restarts are recorded and exposed via detailed Prometheus metrics. By default, port 8000 and the metric path `/` are used. A Prometheus scrape config may therefore look as follows:
```yaml
- job_name: data-crawler
  metrics_path: '/'
  static_configs:
    - targets:
        - data-crawler:8000
```

In order to support development setups that do not directly write to the final assets, a dry-run mode is provided.
For each channel, the `dry_run` parameter can be set to `true` to avoid writing to the final sink. Instead, the
data is dropped after processing without being passed on to any sink. Note that, for safety reasons, the dry-run flag,
if configured, must not be left empty. Hence, disabling dry run requires passing a dedicated `false` value. By
default, `dry_run` is disabled. The following configuration snippet shows an externally supplied `dry_run` flag set
via a corresponding environment variable:
```yaml
controller.output.battery-0:
  # Suppress writing to the final sink if the environment variable DATA_CRAWLER_DRY_RUN_WRITES is set to true.
  dry_run: ${DATA_CRAWLER_DRY_RUN_WRITES}
  type: "data_crawler.sources.redis.RedisStream"
  source_parameters: {} # ... the RedisStream source parameters
  sink_type: "data_crawler.sinks.modbus.ModbusTCP"
  sink_parameters: {} # ... the Modbus sink parameters
```

The default distribution of the AIT RDP Data Crawler already supports a broad variety of data sources and data sinks. The following overview lists the main ones. Detailed configurations can be found in the data source and data sink descriptions.
- Meteorological data
  - Weatherbit
    - `data_crawler.sources.weatherbit.CurrentWeather`: Current weather estimations (not recommended for archiving).
    - `data_crawler.sources.weatherbit.HourlyForecasts`: Hourly numerical weather prediction data for a particular location.
  - met.no
    - `data_crawler.sources.yr_no.LocationForecast`: Numerical weather prediction data for a particular location.
  - Geosphere Austria
    - `data_crawler.sources.zamg.MeasurementStationData`: Live and historic measurements.
    - `data_crawler.sources.zamg.NumericalWeatherPredictionData`: Numerical weather prediction data, both point predictions and ensemble forecasts.
  - KNMI
    - `data_crawler.sources.knmi.WeatherStationsKNMI`: Weather station data.
- Generic protocols and interfaces
  - `data_crawler.sources.modbus.ModbusTCP`: Modbus TCP source.
  - `data_crawler.sinks.modbus.ModbusTCP`: Modbus TCP sink.
  - `data_crawler.sources.opc_ua.OPCUA`: OPC UA source.
  - `data_crawler.sinks.opc_ua.OPCUA`: OPC UA sink.
  - `data_crawler.sinks.redis.RedisStream`: Redis stream sink (default).
  - `data_crawler.sources.teltonika_modbus.TeltonikaModbus`: REST interface to receive Modbus data via Teltonika devices.
- Energy- and market-related services
  - ENTSO-E
    - `data_crawler.sources.entsoe_da.ENTSOEDATransparency`: Day-ahead market prices from ENTSO-E.
- Device-specific interfaces
  - Fronius
    - `data_crawler.sources.fronius.FroniusInverterRealtimeData`: Device-level real-time data from Fronius inverters.
    - `data_crawler.sources.fronius.FroniusInverterPowerFlowRealtimeData`: Real-time power-flow data of all devices connected to the data logger.
    - `data_crawler.sources.fronius.FroniusSystemArchiveData`: Device-level API to query historic values and detailed information from Fronius inverters.