Using a common data exchange format between components
Introduction
This is a collection of ideas and concepts, together with their respective advantages and drawbacks.
The basic idea is to specify a common data exchange format encoding for most of the communications between the components. The intention is to better decouple the systems and allow for better composability of the available components and potentially future ones.
It is hereby proposed to use STAC Items as the data exchange format between the components. The STAC Items are transient, in the sense that they are only put into the queues and not stored on volumes/buckets. VSQ allows embedding the STAC Items into the JSON message structure.
General advantages are:
- it is possible to encode the footprint of the product directly in the JSON (but it can be set to `null` if not immediately available)
- there are several Python libraries available to digest, create or transform items (e.g. PySTAC and stac.py), but they are optional, as it is sometimes easier to simply work with the raw Python objects.
- it combines the data/metadata assets with readily available metadata values.
- referenced assets are not required to be on the same storage, allowing more flexibility
- the transient nature eliminates the requirement to create sidecar files to store metadata from one component to the next (such as GSC files generated in the preprocessor for the registrar)
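As an illustration of the points above, the following is a minimal sketch of a transient item built as a raw Python object and serialized for a queue. The identifier, timestamp and asset URLs are made up for the example, and `make_item` is a hypothetical helper; only the top-level field names are taken from the STAC Item specification.

```python
import json


def make_item(item_id, datetime_str, assets, geometry=None, bbox=None):
    """Build a minimal STAC Item as a plain dict (hypothetical helper)."""
    item = {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "geometry": geometry,  # None is serialized as JSON null
        "properties": {"datetime": datetime_str},
        "links": [],
        "assets": assets,
    }
    if bbox is not None:
        item["bbox"] = bbox
    return item


# Assets may live on different storages; these URLs are placeholders.
item = make_item(
    "example-product-001",
    "2021-01-01T00:00:00Z",
    {
        "data": {
            "href": "s3://bucket-a/products/001/data.tif",
            "roles": ["data"],
        },
        "metadata": {
            "href": "https://example.com/products/001/metadata.xml",
            "roles": ["metadata"],
        },
    },
)

# The serialized form is what would be put into a queue.
message = json.dumps(item)
```

The footprint is left as `None` here, which corresponds to the "not immediately available" case; a later component could fill it in.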
Disadvantages:
- some concepts are harder to represent with STAC, such as data directories (object storage prefixes)
- it is not automatically clear how to deal with missing metadata, e.g. the `geometry` could be `null`, but how would the components handle that?
- verbosity: as the whole STAC Item is put into the queues, it may no longer be practical to inspect the queues directly without additional tools.
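One possible answer to the missing-metadata question is sketched below: a component asks for the geometry, optionally tries a fallback, and fails loudly otherwise. This is only one policy among several (skipping or deferring the item would be others), and `resolve_geometry` and its `fallback` parameter are hypothetical names, not part of any existing component.

```python
def resolve_geometry(item, fallback=None):
    """Return the item's geometry, or derive one via `fallback`.

    `fallback` is a hypothetical callable, e.g. one that opens a
    referenced metadata asset to read the footprint. If the footprint
    stays unknown, this sketch raises instead of guessing.
    """
    geometry = item.get("geometry")
    if geometry is not None:
        return geometry
    if fallback is not None:
        geometry = fallback(item)
        if geometry is not None:
            return geometry
    raise ValueError(f"item {item.get('id')!r} has no geometry")


item = {"id": "example-product-001", "geometry": None}
point = {"type": "Point", "coordinates": [16.37, 48.21]}

# With a fallback, the missing footprint is filled in.
geometry = resolve_geometry(item, fallback=lambda _item: point)
```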
Components involved with registration/ingestion
This listing details the inputs and outputs of each component and assesses how the new format could be of use.
Ingestor
- Input: Browse Report XML files
- Output: custom JSON format (basically a translation of XML -> JSON), which currently only the preprocessor is able to handle properly
- Assessment: The custom JSON format could easily be replaced with the STAC Item format, which would standardize it, and allow for an easier integration with other components.
Preprocessor
- Input: Object storage prefix or custom JSON format
- Output: Object storage prefix
- Assessment: Arguably, this component would benefit the most from a switch to STAC Items. Using the `assets`, it is easy to distinguish which files are of interest. Also, metadata of the input STAC Item could simply be passed through, without the preprocessor being required to understand it. In essence, only the asset links would have to be replaced or enriched with the processed items.
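How the preprocessor could pick the assets of interest is sketched below, assuming that upstream components set the optional `roles` field on each asset. The `select_assets` helper and the sample item are illustrative only.

```python
def select_assets(item, roles=("data",)):
    """Return the assets whose `roles` list shares an entry with `roles`."""
    wanted = set(roles)
    return {
        key: asset
        for key, asset in item.get("assets", {}).items()
        if wanted & set(asset.get("roles", []))
    }


# Placeholder item; only the `assets` part matters for this sketch.
item = {
    "id": "example-product-001",
    "assets": {
        "band-1": {"href": "s3://bucket-a/001/b1.tif", "roles": ["data"]},
        "band-2": {"href": "s3://bucket-a/001/b2.tif", "roles": ["data"]},
        "report": {"href": "s3://bucket-a/001/meta.xml", "roles": ["metadata"]},
        "unlabelled": {"href": "s3://bucket-a/001/readme.txt"},
    },
}

to_process = select_assets(item)
```

Assets without a `roles` entry are ignored by this sketch; a real configuration might instead match on asset keys or media types.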
Registrar
- Input: Object storage prefix
- Output: none
- Assessment: The current approach is not very stable. Several "schemes" are tried in turn to check whether one of them can be applied to register the product. Unifying this to STAC Items would greatly reduce the number of code paths. Metadata from the STAC Item could easily be handed through and mapped to the internal metadata model. It could also be interesting to allow forwarding the registered item to the next queue (e.g. to start seeding the registered product), so that the registrar is not necessarily the "dead end" of the whole ingestion queue.
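The mapping and the optional forwarding could look roughly like the sketch below. `ProductRecord` is a hypothetical stand-in for the internal metadata model, a plain list stands in for the registration backend, and `queue.Queue` stands in for the real message queue.

```python
import queue
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProductRecord:
    """Hypothetical internal metadata model, for illustration only."""
    identifier: str
    begin_time: Optional[str]
    end_time: Optional[str]
    footprint: Optional[dict]


def to_record(item):
    """Map the STAC Item fields onto the (assumed) internal model."""
    props = item["properties"]
    instant = props.get("datetime")
    return ProductRecord(
        identifier=item["id"],
        begin_time=props.get("start_datetime") or instant,
        end_time=props.get("end_datetime") or instant,
        footprint=item.get("geometry"),
    )


def register(item, backend, next_queue=None):
    """Register the item and, if configured, forward it unmodified."""
    backend.append(to_record(item))
    if next_queue is not None:
        next_queue.put(item)


backend = []
seeding_queue = queue.Queue()
item = {
    "id": "example-product-001",
    "geometry": None,
    "properties": {"datetime": "2021-01-01T00:00:00Z"},
}
register(item, backend, next_queue=seeding_queue)
```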
Seeder
- Input: ???
- Output: none
- Assessment: This component is currently not implemented in the new VS. In theory, it could retrieve seeding requests in the form of STAC Items to get the region and time of interest to seed.
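Since the seeder does not exist yet, the following is purely a sketch of how the region and time of interest could be read from an item. The helper names are hypothetical; the sketch uses the `bbox` when present and otherwise derives one from the geometry coordinates. It does not handle `GeometryCollection` geometries or footprints crossing the antimeridian.

```python
def _positions(coords):
    """Yield (x, y) pairs from arbitrarily nested GeoJSON coordinates."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from _positions(part)


def region_of_interest(item):
    """Return [minx, miny, maxx, maxy], or None if no footprint is known."""
    if item.get("bbox") is not None:
        return list(item["bbox"])
    geometry = item.get("geometry")
    if geometry is None:
        return None
    xs, ys = zip(*_positions(geometry["coordinates"]))
    return [min(xs), min(ys), max(xs), max(ys)]


def time_of_interest(item):
    """Return a (start, end) pair; an instant yields start == end."""
    props = item["properties"]
    instant = props.get("datetime")
    return (
        props.get("start_datetime") or instant,
        props.get("end_datetime") or instant,
    )


item = {
    "id": "example-product-001",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[10, 45], [12, 45], [12, 47], [10, 47], [10, 45]]],
    },
    "properties": {"datetime": "2021-01-01T00:00:00Z"},
}
region = region_of_interest(item)
interval = time_of_interest(item)
```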
Harvester
- Input: custom JSON or raw values
- Output: tbd
- Assessment: Currently there is no data format defined. STAC Items would be a "natural" fit, as STAC API is actually one of the intended backends. Some backends may be more tricky though, e.g. object storage listings are not easily translatable into STAC Items without actually reading metadata files at that location. Some OADS outputs (`.index` files, basically just CSV) could actually map quite nicely into STAC Items.
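To illustrate the CSV case, the sketch below turns rows of a CSV listing into STAC Items. The column names and the sample row are assumptions made for this example and are not the actual layout of OADS `.index` files; a real mapping would have to follow the columns those files really contain.

```python
import csv
import io

# Hypothetical CSV layout; NOT the real `.index` column set.
SAMPLE_INDEX = """id,start,end,minx,miny,maxx,maxy,path
product-001,2021-01-01T00:00:00Z,2021-01-01T00:05:00Z,10.0,45.0,12.0,47.0,products/001/data.tif
"""


def row_to_item(row):
    """Map one CSV row (as a dict) onto a STAC Item dict."""
    minx, miny, maxx, maxy = (
        float(row[key]) for key in ("minx", "miny", "maxx", "maxy")
    )
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": row["id"],
        "bbox": [minx, miny, maxx, maxy],
        "geometry": {
            "type": "Polygon",
            "coordinates": [[
                [minx, miny], [maxx, miny], [maxx, maxy],
                [minx, maxy], [minx, miny],
            ]],
        },
        # `datetime` may be null when a start/end pair is given.
        "properties": {
            "datetime": None,
            "start_datetime": row["start"],
            "end_datetime": row["end"],
        },
        "links": [],
        "assets": {"data": {"href": row["path"], "roles": ["data"]}},
    }


items = [row_to_item(row) for row in csv.DictReader(io.StringIO(SAMPLE_INDEX))]
```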
Usage example
Harvester -> Preprocessor -> Registrar -> Seeder
In this example scenario, the Harvester queries an external catalogue and either passes through the STAC Items or transforms them to that format. The items are written to the queue and the harvester is oblivious of which component is the next in the chain.
The preprocessor has an immediate list of files (`assets`) to work with. There is usually no need to retrieve additional metadata, but if necessary a referenced metadata file can be opened and read. It processes selected files from the assets, creates a copy of the input STAC Item and adds the preprocessed files as new assets. All other metadata is kept for other components to digest. This new STAC Item is sent to the next queue.
The registrar receives the STAC Item and, based on its contents and the configuration, starts the registration into its backends. If successful, the STAC Item is passed on to the next component without modification.
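The copy-and-enrich step of the preprocessor can be sketched as follows. `with_processed_assets` is a hypothetical helper, and the asset keys and URLs are placeholders.

```python
import copy


def with_processed_assets(item, processed):
    """Return a deep copy of `item` with the `processed` assets added.

    The input item is left untouched and all other metadata is carried
    over as-is, so later components can still digest it.
    """
    result = copy.deepcopy(item)
    result.setdefault("assets", {}).update(processed)
    return result


incoming = {
    "id": "example-product-001",
    "properties": {"datetime": "2021-01-01T00:00:00Z", "custom:field": 42},
    "assets": {
        "raw": {"href": "s3://bucket-a/001/raw.tif", "roles": ["data"]},
    },
}
outgoing = with_processed_assets(
    incoming,
    {"preprocessed": {"href": "s3://bucket-b/001/cog.tif", "roles": ["data"]}},
)
```

Adding rather than replacing assets keeps the original references available; replacing them would be an equally valid configuration choice.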
The seeder uses the stored spatiotemporal information in the STAC Item to start the seeding process.
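The chain as a whole can be sketched with in-process queues. This is a toy model under stated assumptions: `queue.Queue` stands in for the real message queues, the stage functions are simplified placeholders for the actual components, and all identifiers and URLs are made up. The point it illustrates is that each stage only knows its input and output queue, not its neighbours.

```python
import copy
import queue


def run_stage(in_queue, handler, out_queue=None):
    """Drain `in_queue`, apply `handler`, and forward non-None results."""
    while not in_queue.empty():
        result = handler(in_queue.get())
        if out_queue is not None and result is not None:
            out_queue.put(result)


def preprocess(item):
    # Copy the item and add a (made-up) preprocessed file as a new asset.
    result = copy.deepcopy(item)
    result["assets"]["preprocessed"] = {
        "href": "s3://processed-bucket/" + item["id"] + ".tif",
        "roles": ["data"],
    }
    return result


registered = []  # stands in for the registrar's backends
seeded = []      # stands in for issued seeding jobs


def register(item):
    registered.append(item["id"])
    return item  # passed on without modification


def seed(item):
    seeded.append((item["bbox"], item["properties"]["datetime"]))


harvested, preprocessed, to_seed = queue.Queue(), queue.Queue(), queue.Queue()

# The harvester only writes to its output queue.
harvested.put({
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-product-001",
    "bbox": [10.0, 45.0, 12.0, 47.0],
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [10.0, 45.0], [12.0, 45.0], [12.0, 47.0],
            [10.0, 47.0], [10.0, 45.0],
        ]],
    },
    "properties": {"datetime": "2021-01-01T00:00:00Z"},
    "links": [],
    "assets": {"raw": {"href": "s3://bucket-a/001/raw.tif", "roles": ["data"]}},
})

run_stage(harvested, preprocess, preprocessed)
run_stage(preprocessed, register, to_seed)
run_stage(to_seed, seed)
```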