Modularize preprocessor
Currently the preprocessor in VHR18 is very specific to the products/collections of that project. For use in VS, the implementation needs to be generalized.
In order to generalize the preprocessor, the necessary steps can be identified:
- Data retrieval: download from the swift object storage (maybe abstract this to allow other sources in the future as well)
- Unpacking: (recursively) unpack downloaded source files
- File selection: from the unpacked files, select the ones that actually have significance in VS (data files and metadata files)
- Data file merging: stacking bands if they are separated into multiple files, combining tiles into a single file, etc.
- Output data file generation: COGs (possibly others)
- Metadata extraction
- Upload to output swift bucket (maybe other object storages in the future)
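As a rough, purely illustrative sketch (all function and stage names are placeholders, not the actual VHR18/VS code), the steps above could be modelled as interchangeable pipeline stages:

    # Hypothetical sketch of the generalized pipeline: each step from the list
    # above becomes a stage function that receives and returns a shared context,
    # so stages can be reordered, replaced or configured per collection.
    from typing import Callable, Dict, List

    Stage = Callable[[Dict, Dict], Dict]  # (context, config) -> updated context

    def run_pipeline(stages: List[Stage], config: Dict) -> Dict:
        context: Dict = {}
        for stage in stages:
            context = stage(context, config)
        return context

    # e.g. run_pipeline([download, unpack, select_files, merge_data,
    #                    generate_output, extract_metadata, upload], config)
    # where each stage name is a placeholder for the respective step above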
comments:
I think unpacking and file selection can be done together in one step (currently it is done like that in VHR18), but the file selection should definitely be configurable. Extracting only what we need does not give a noticeable performance benefit, as the images are by far the largest content of those archives, but never mind.
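A minimal sketch of doing unpacking and selection in one pass, with the selection patterns taken from configuration (the function name and signature are hypothetical):

    # Minimal sketch: recursively unpack tar/zip packages and collect the files
    # matching configurable glob patterns in a single pass. The patterns would
    # come from the per-product-type configuration (data_file_globs etc.).
    import fnmatch
    import os
    import tarfile
    import zipfile

    def unpack_and_select(package: str, target_dir: str, patterns: list) -> list:
        if tarfile.is_tarfile(package):
            with tarfile.open(package) as archive:
                archive.extractall(target_dir)
        elif zipfile.is_zipfile(package):
            with zipfile.ZipFile(package) as archive:
                archive.extractall(target_dir)

        selected = []
        for root, _, filenames in os.walk(target_dir):
            for filename in filenames:
                path = os.path.join(root, filename)
                if tarfile.is_tarfile(path) or zipfile.is_zipfile(path):
                    # nested packages: unpack sub-archives recursively
                    selected += unpack_and_select(path, path + ".d", patterns)
                elif any(fnmatch.fnmatch(filename, p) for p in patterns):
                    selected.append(path)
        return sorted(set(selected))

    # e.g. unpack_and_select("product.tar", "/tmp/work", ["*.TIF", "*.RPC", "*GSC*.xml"])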
Data file merging - yes, but before that there should be (and usually have to be) the geotransform corrections: GCPs to geotransform, applying RPCs, etc., because creating a VRT only succeeds if the inputs were previously warped to a shared, non-rotated geotransform.
I think this in general can be abstracted to:
- convert "special datasets" (like NetCDF - not implemented - or HDF5) to "ordinary" raster datasets
- filter out rasters based on band count (configurable)
- filter out "mask" rasters (currently not supported) (configurable)
- ensure we can create a VRT for a data merge by doing the following for each dataset:
  - fix the geotransform if elements [1] and [5] (the pixel sizes) are 0
  - apply RPCs (RPC -> geotransform)
  - GCPs -> geotransform
  - ensure the same projection for all datasets
- data merge (either to stack bands or to merge split files)
  - special case: if JPEG2000 + multiple files (rows/cols) + multiple world files, the RPC describes the geotransform of the top-left corner of the top-left image of the mosaic (R1C1), while the world files describe the positions of the individual images within the mosaic. In this case, overwrite the geotransform of the individual images at the VRT level from the RPC of the first image, because all individual images share a reference to the same RPC and GDAL uses it by design over (not along with) the world files, so all images would otherwise be stacked on top of each other.
Now thinking about it, essentially all of the above points can be summarized into one configuration option: result_coordinate_system, as all those operations essentially aim to warp each sub-raster (through a VRT) into a sub-raster with a geotransform in EPSG:4326 in our case.
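To make that concrete, a sketch of the per-dataset normalization with the GDAL Python bindings (only an illustration of the idea, not the actual VS code; RPC/GCP handling and creation options would need tuning):

    # Sketch: normalize one input dataset to a plain, north-up geotransform in
    # the target CRS so that a merge VRT can be built afterwards. Uses RPCs or
    # GCPs when a usable geotransform is missing.
    from osgeo import gdal

    def normalize_for_vrt(src_path: str, dst_path: str, dst_crs: str = "EPSG:4326") -> str:
        src = gdal.Open(src_path)
        warp_kwargs = dict(dstSRS=dst_crs, format="GTiff")
        if src.GetMetadata("RPC"):
            warp_kwargs["rpc"] = True   # RPC -> geotransform via the RPC transformer
        elif src.GetGCPCount() > 0:
            pass                        # GCPs -> geotransform (GDAL picks the GCP transformer)
        gdal.Warp(dst_path, src, **warp_kwargs)
        return dst_path

    # afterwards e.g. gdal.BuildVRT("merged.vrt", normalized_paths, separate=True)
    # to stack bands, or separate=False to mosaic split files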
Also, between data file merging and output data file generation there would be: transforming values (configurable). Not sure how to make this configurable - if it were just one operation, we could specify the gdal_calc.py numpy math formula, which is passed to "eval", but the transformation to decibel range, for example, consists of two operations, where the intermediary result from the first (log) is needed in the second, since its min/max is used. So it is not easily writable as two or more gdal_calc.py mathematical formulas (technically yes, but practically not).
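For illustration, the decibel case as a small GDAL/numpy sketch - two passes, where the second needs the min/max of the first pass's result, which is exactly what a single gdal_calc.py formula cannot express (the function name and the stretch to bytes are made up for the example):

    # Sketch of a two-step value transformation: linear -> decibel, then a
    # stretch based on the min/max of the intermediate result.
    import numpy as np
    from osgeo import gdal

    def to_decibel_stretched(src_path: str, dst_path: str) -> None:
        src = gdal.Open(src_path)
        data = src.GetRasterBand(1).ReadAsArray().astype("float64")

        # step 1: convert linear values to decibel
        db = 10.0 * np.log10(np.maximum(data, 1e-10))

        # step 2: stretch using the min/max of the intermediate result
        stretched = (db - db.min()) / (db.max() - db.min()) * 255.0

        driver = gdal.GetDriverByName("GTiff")
        dst = driver.Create(dst_path, src.RasterXSize, src.RasterYSize, 1, gdal.GDT_Byte)
        dst.SetGeoTransform(src.GetGeoTransform())
        dst.SetProjection(src.GetProjection())
        dst.GetRasterBand(1).WriteArray(stretched)
        dst.FlushCache()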
Outlining the architecture for preprocessing
    source:
      type: swift
      kwargs:
        OS_USERNAME_DOWNLOAD: edwW57uNJ3w8
        OS_PASSWORD_DOWNLOAD: w8gEBx9tqCrzqqTMExudQrkC9gFMj7n2
        OS_TENANT_NAME_DOWNLOAD: 3603751684599153
        OS_TENANT_ID_DOWNLOAD: f9dbb2ce89fa44f5af93a15c2deeef6e
        OS_REGION_NAME_DOWNLOAD: SERCO-DIAS1
        OS_AUTH_URL_DOWNLOAD: https://auth.cloud.ovh.net/
        ST_AUTH_VERSION_DOWNLOAD: 3
    target:
      type: swift
      kwargs:
        ST_AUTH_VERSION: 3
        OS_AUTH_URL_SHORT: https://auth.cloud.ovh.net/
        OS_AUTH_URL: https://auth.cloud.ovh.net/v3/
        OS_USERNAME: xqNChf3Rz5vs
        OS_PASSWORD: 74aM62YPwuHzjcYsaweb58huHNe3rCuZ
        OS_TENANT_NAME: 7398560954290261
        OS_TENANT_ID: 1b418c4359064774af5d55da3f4bcac0
        OS_REGION_NAME: SERCO-DIAS1
    # metadata file to look for in the downloaded tar/zip file
    metadata_glob: "*GSC*.xml"
    # extractors for Product type / level
    type_extractor:
      # xpath can also be a list of xpaths to be tried one after another
      xpath:
        - /gsc:report/gsc:opt_metadata/gml:using/eop:EarthObservationEquipment/eop:platform/eop:Platform/eop:shortName/text()
        - /gsc:report/gsc:opt_metadata/gml:using/eop:EarthObservationEquipment/eop:platform/eop:Platform/eop:shortName/text()
      map: # optional mapping from extracted type name to used product type name
        PHR_FUS__3: PH00
    level_extractor:
      # xpath can also be a list of xpaths to be tried one after another
      xpath: substring-after(substring-after(/gsc:report/gsc:opt_metadata/gml:metaDataProperty/gsc:EarthObservationMetaData/eop:parentIdentifier/text(), '/'), '/')
      map: # optional mapping
    preprocessing:
      defaults:
        output:
          crs: "EPSG:4326"
          driver: GTiff
          format_options:
            - BLOCKSIZE=512
            - COMPRESS=DEFLATE
            - LEVEL=6
            - NUM_THREADS=8
            - BIGTIFF=IF_SAFER
            - OVERVIEWS=AUTO
            - RESAMPLING=CUBIC
      types:
        PH00: # as extracted/translated above
          # whether the package can contain sub-packages of TARs/ZIPs
          nested: true
          # glob selectors to look for source images in the source package
          data_file_globs:
            - "*.TIF"
          additional_file_globs:
            - "*.RPC"
          # a custom preprocessor function to be called on all selected files
          custom_preprocessor:
            path: "path.to.some.module:attribute"
            # TODO: specify args/kwargs and pass meaningful parameters
          georeference:
            # georeference each file individually
            - type: geotransform # one of geotransform, RPC, GCP, world file
            - type: GCP
          stack_bands:
            # stack all bands for each scene in the product
            group_by: # TODO: figure out a way to get a grouping, e.g. part of the filename using regex?
          output:
            # define a custom postprocessor function to be called on the processed file
            custom_postprocessor:
              path: "path.to.some.module:attribute"
              # TODO: specify args/kwargs and pass meaningful parameters
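The "path.to.some.module:attribute" strings for custom_preprocessor / custom_postprocessor could be resolved along these lines (a sketch, assuming the "module:attribute" convention stays as drafted above):

    # Sketch: resolve a "path.to.some.module:attribute" string into a callable.
    from importlib import import_module

    def resolve_callable(path: str):
        module_name, _, attribute = path.partition(":")
        return getattr(import_module(module_name), attribute)

    # e.g. preprocessor = resolve_callable(config["custom_preprocessor"]["path"])
    #      preprocessor(selected_files)   # args/kwargs still TODO as noted above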
Needs testing, but currently implemented in https://gitlab.eox.at/esa/prism/vs/-/tree/preprocessor-modularization
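For the type_extractor / level_extractor entries in the configuration above, the "list of xpaths to be tried one after another" plus the optional mapping could look roughly like this (a sketch using lxml; the namespace URIs are placeholders and must match the actual GSC report namespaces, everything else is made up for illustration):

    # Sketch: try a list of XPath expressions in order, take the first
    # non-empty result and apply the optional value mapping.
    from lxml import etree

    NAMESPACES = {
        "gsc": "...",  # placeholder: GSC report namespace URI
        "gml": "...",  # placeholder
        "eop": "...",  # placeholder
    }

    def extract(tree, xpaths, mapping=None):
        if isinstance(xpaths, str):
            xpaths = [xpaths]
        for xpath in xpaths:
            result = tree.xpath(xpath, namespaces=NAMESPACES)
            # text() expressions return a list of strings, string functions a string
            value = result[0] if isinstance(result, list) and result else result
            if value:
                return (mapping or {}).get(value, value)
        return None

    # e.g. product_type = extract(etree.parse(metadata_file),
    #                             config["type_extractor"]["xpath"],
    #                             config["type_extractor"].get("map"))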