Documentation metadata

Responsibilities

Product Owner: Frank Schwichtenberg
Developers: van Ross, Daniel
Version number (software): 3.6



Introduction

Based on the concept discussed in early 2018, the Binaries Service is meant to handle all kinds of binaries provided during a data ingest process as it occurs in the DDB. Binaries are submitted to this service by passing a URL from which the binary can be retrieved. After retrieval the service stores the binaries and provides them to the user on demand. A special focus lies on the handling of images: these are not only retrieved and provided on demand, they are also scaled to predefined resolutions which can be accessed using an IIIF Image API endpoint (Level 0).

Components

The Binaries Service is made up of two components: The first component is responsible for retrieving the binaries from the provided URL, scaling images and saving the data to the database. The second component is the delivery component, which is targeted at the user and delivers data stored in the database.

Technology

The Binaries Service stores its data in a Cassandra Database. Cassandra is a cluster-based distributed database capable of safely storing large amounts of data.
The ingest component is a standalone Java application utilizing FIZPro and Apache Spark to process data on the cluster. 
The delivery component is a simple Java servlet. 

Implementation Details

Ingest component

The ingest component is a standalone Java application meant to run on a cluster node.

It offers an API endpoint to submit a URL together with a context that is saved along with the URL. The endpoint is non-blocking and immediately returns a reference which can later be used to access the binary and all of its derivatives. Once submitted, the URL is saved in the database together with the context and the generated reference. It is also enqueued for processing.
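
For illustration, a minimal client sketch for submitting a URL, assuming a placeholder base URL and the JSON body format documented under "Endpoints" below; that the returned reference is contained in the response body is an assumption here:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SubmitBinaryExample {

        public static void main(String[] args) throws Exception {
            // Placeholder base URL of the ingest component.
            String base = "https://binaries-ingest.example.org";

            // JSON body as documented for POST /binaries.
            String body = "{\"url\":\"http://host/image.jpg\","
                    + "\"mimetype\":\"image/jpg\","
                    + "\"context\":\"my-ingest-context\","
                    + "\"priority\":1}";

            HttpRequest request = HttpRequest.newBuilder(URI.create(base + "/binaries"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            // The endpoint is non-blocking; the response is assumed to carry the generated reference.
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }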

For performance reasons and to adapt the download speed, there is one queue per host providing binaries. Queues are ordered by their creation date. Once all objects in a queue are processed, the queue is deleted.

Queues created first are processed first, at half (configurable) of the speed the cluster resources allow. If a host responds with a status code between 500 and 599, the speed for that queue is halved and the objects that produced the error are re-enqueued. If the host still returns status codes between 500 and 599 on the next run of that queue, the speed is reduced further until the host no longer returns these errors. The queue is then processed at that speed until it is empty.

Free cluster resources are used to process objects of other queues, so the remaining capacity is filled up with the objects of the subsequent queues.
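
The following is a hypothetical sketch of this per-host speed adaptation; class, field and method names are assumptions, and the real implementation on Spark is more involved:

    import java.util.List;
    import java.util.Queue;

    /**
     * Hypothetical sketch of the per-host queue throttling described above:
     * processing starts at a configurable fraction of the cluster capacity and
     * the speed is halved whenever the host answers a run with 5xx errors.
     */
    public class HostQueueThrottle {

        private double currentSpeed;

        public HostQueueThrottle(double clusterCapacity, double initialFraction) {
            // e.g. initialFraction = 0.5 -> start at half the cluster speed
            this.currentSpeed = clusterCapacity * initialFraction;
        }

        /** Called after each run of the queue belonging to one host. */
        public void afterRun(boolean hostAnsweredWith5xx, List<String> failedUrls, Queue<String> queue) {
            if (hostAnsweredWith5xx) {
                currentSpeed = currentSpeed / 2;  // reduce the speed for this host's queue
                queue.addAll(failedUrls);         // re-enqueue the objects that failed
            }
        }

        public double currentSpeed() {
            return currentSpeed;
        }
    }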

The main loop that processes the queues is BinariesService.run().

Processing of an object (URL) means the following (a condensed sketch follows the list):

  • DdbBinaryDownloadExecutor
    • has this URL been processed before? If not, download the file
    • if yes, check the Last-Modified header and the ETag provided
    • if one of them differs from previous checks, download the file
    • update the "last checked" date
    • if a file was downloaded
      • if the file is not an image, save it to the database
      • if the file is an image, save it to a temporary table (originals_data_temp)
  • If the file is an image or a PDF:
    • DdbBinaryProcessExecutor
      • BinaryPreparer
        • if the downloaded file was a PDF, get the first page of the PDF as an image
        • scale the image to predefined resolutions
      • BinarySaver
        • write the original together with the width and height of the image to the database
      • BinaryExporter
        • generate a pyramid TIFF from the original image or the image generated from the PDF
        • save the pyramid TIFF to the filesystem (mounted on all nodes and on the IIIF image server) with the originalId as path
        • generate a symlink with the referenceId that points to the file with the originalId
    • DdbTempOriginalsExecutor
      • get the original image files from the temporary table, save them to the filesystem and delete them from the temporary table
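
The following condensed sketch summarizes this flow. Except for the executor and table names listed above, all method names are assumptions; the real executors run on Spark against Cassandra:

    import java.awt.image.BufferedImage;
    import java.nio.file.Path;

    /**
     * Condensed, hypothetical sketch of the per-object processing flow described
     * above. All helper method names are assumptions; the real implementation
     * works on the cluster and against the Cassandra tables.
     */
    abstract class ProcessingSketch {

        void processObject(String url, String originalId, String referenceId) {
            // DdbBinaryDownloadExecutor: only download if the URL is new or
            // Last-Modified / ETag differ from the previous check.
            boolean downloaded = downloadIfChanged(url);
            updateLastCheckedDate(url);
            if (!downloaded) {
                return;
            }

            if (!isImage(url) && !isPdf(url)) {
                saveBinaryToDatabase(url);              // non-image binaries go straight to the DB
                return;
            }
            if (isImage(url)) {
                saveToTemporaryTable(url);              // originals_data_temp
            }

            // DdbBinaryProcessExecutor
            BufferedImage original = isPdf(url) ? firstPdfPageAsImage(url) : readImage(url);
            scaleToPredefinedResolutions(original);     // BinaryPreparer
            saveOriginalWithDimensions(url, original);  // BinarySaver: width and height to the DB
            Path tiff = generatePyramidTiff(original);  // BinaryExporter
            Path stored = saveToSharedFilesystem(tiff, originalId);
            createSymlink(referenceId, stored);

            // DdbTempOriginalsExecutor: move originals from the temporary table to the filesystem.
            flushTemporaryOriginalsToFilesystem();
        }

        abstract boolean downloadIfChanged(String url);
        abstract void updateLastCheckedDate(String url);
        abstract boolean isImage(String url);
        abstract boolean isPdf(String url);
        abstract void saveBinaryToDatabase(String url);
        abstract void saveToTemporaryTable(String url);
        abstract BufferedImage firstPdfPageAsImage(String url);
        abstract BufferedImage readImage(String url);
        abstract void scaleToPredefinedResolutions(BufferedImage original);
        abstract void saveOriginalWithDimensions(String url, BufferedImage original);
        abstract Path generatePyramidTiff(BufferedImage original);
        abstract Path saveToSharedFilesystem(Path tiff, String originalId);
        abstract void createSymlink(String referenceId, Path target);
        abstract void flushTemporaryOriginalsToFilesystem();
    }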

The above implicitly means that if a binary was successfully downloaded and scaled in the past and the server behind the URL now responds with an error code, the status in the table originals is set to "ERROR", but the binary that was successfully downloaded in the past is kept!

So any cleanup job must be prevented from deleting the data and the scaled data if the status in the table originals is "ERROR" or any status other than "OK".
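
A minimal sketch of the guard such a cleanup job would need; the class and method names are hypothetical:

    /**
     * Hypothetical guard for any cleanup job over the originals table: a status
     * other than "OK" (e.g. "ERROR") must never cause the stored binary or its
     * scaled derivatives to be deleted, because the copy that was downloaded
     * successfully in the past is intentionally kept.
     */
    public final class CleanupGuard {

        public static boolean mayDelete(String status, boolean otherCleanupCriteriaMet) {
            if (!"OK".equals(status)) {
                return false;                    // never delete data of failed re-checks
            }
            return otherCleanupCriteriaMet;      // further criteria are out of scope here
        }
    }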

There are different ways of handling errors during this process; in particular, if the server from which the binary is supposed to be downloaded responds with an error, the URL may be re-enqueued to try the download again later.

Monitoring

Planned for the next release. It will then only be accessible FIZ-internally.

Endpoints

  • GET /version
    • retrieve version of BinariesService
  • GET /binaries/reference/{referenceId}
    • retrieve status of the binary
  • GET /admin/context/{context}
    • retrieve the status of the objects in a context ({"queued":"0","processing":"0","successful":"1974","failed":"31"}); see the example after this list
  • GET /binaries/entries
  • GET /binaries/context/{context}/queuesize
    • get the number of objects for a context that are still in the queue
  • POST /binaries
    • submit a URL to the BinariesService ({"url":"http://host/image.jpg","mimetype":"image/jpg","context":"jrgikrgvwkvlv","priority":1})
  • DELETE /binaries/reference/{referenceId}
    • delete a reference
  • DELETE /binaries/context/{context}
    • delete all references assigned to a context
  • POST /binaries/context/{context}/reprocess
    • reprocess all references of a context that have an error
  • POST /admin/imagestofs/<start|stop>
    • start or stop writing original images to the filesystem
    • as long as writing is stopped, the data remains in the table prozessierung.originals_data_temp; when it is started again, the data is flushed from the table to the filesystem as soon as the BinariesService is no longer busy
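
For illustration, a minimal client sketch for polling the status summary of a context via GET /admin/context/{context}; the base URL is a placeholder:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ContextStatusExample {

        public static void main(String[] args) throws Exception {
            // Placeholder base URL of the ingest component.
            String base = "https://binaries-ingest.example.org";
            String context = "jrgikrgvwkvlv";

            // Returns a summary like {"queued":"0","processing":"0","successful":"1974","failed":"31"}.
            HttpRequest request = HttpRequest
                    .newBuilder(URI.create(base + "/admin/context/" + context))
                    .GET()
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }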

Java Package description 

NOTE: Initially both components were in one project; the package structure therefore still needs to be reworked and is not documented here at the moment.


Links

Sources: https://dev.fiz-karlsruhe.de/stash/projects/DDB/repos/ddb-administration-binaries-service

Sonar: https://dev.fiz-karlsruhe.de/sonar/dashboard?id=de.fiz-karlsruhe.binaries-service

Bamboo: https://dev.fiz-karlsruhe.de/bamboo/browse/DDB-DABS

Delivery component

The delivery component is a standard Java Servlet using the JAX-RS implementation Jersey to provide access to the stored binaries and their derivatives. 
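
For orientation, a hypothetical, heavily simplified Jersey/JAX-RS resource for the /binary path; the actual classes in de.fiz.binariesservice.server are more involved (Range requests, image handling):

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.core.Response;

    /**
     * Hypothetical, minimal JAX-RS resource for the /binary path of the
     * delivery component; it only illustrates how such an endpoint is wired.
     */
    @Path("/binary")
    public class BinaryResourceSketch {

        @GET
        @Path("/{reference}")
        public Response getBinary(@PathParam("reference") String reference) {
            byte[] data = loadStoredBinary(reference);   // assumed lookup against the Cassandra tables
            if (data == null) {
                return Response.status(Response.Status.NOT_FOUND).build();
            }
            return Response.ok(data).build();
        }

        private byte[] loadStoredBinary(String reference) {
            // Placeholder: the real implementation reads the stored binary from Cassandra.
            return null;
        }
    }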

Java Package description 

de.fiz.binariesservice.server:

Contains the application main classes and the two classes responsible for serving content for /binary and /image. 

de.fiz.binariesservice.models:

Contains JAX-B classes for xml/json mappings. 

de.fiz.binariesservice.utils:

Utility classes (should be moved to other packages with less generic names)


Noteworthy classes: BinariesRequestHandler contains the code that parses HTTP Range headers to allow streaming data.
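
A minimal sketch of such Range parsing, assuming only single byte ranges of the form "bytes=start-end" or "bytes=start-"; the actual logic in BinariesRequestHandler may differ:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * Hypothetical sketch of parsing a single-range HTTP Range header,
     * as needed for streaming binaries to the client.
     */
    public final class RangeHeaderSketch {

        private static final Pattern SINGLE_RANGE = Pattern.compile("bytes=(\\d+)-(\\d*)");

        /** Returns {start, end} (inclusive) or null if the header is absent or unsupported. */
        public static long[] parse(String rangeHeader, long contentLength) {
            if (rangeHeader == null) {
                return null;
            }
            Matcher m = SINGLE_RANGE.matcher(rangeHeader.trim());
            if (!m.matches()) {
                return null;
            }
            long start = Long.parseLong(m.group(1));
            long end = m.group(2).isEmpty() ? contentLength - 1 : Long.parseLong(m.group(2));
            end = Math.min(end, contentLength - 1);   // clamp to the last available byte
            if (start > end) {
                return null;                          // caller would answer 416 Range Not Satisfiable
            }
            return new long[] { start, end };
        }
    }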

Links

 

Sources: https://dev.fiz-karlsruhe.de/stash/projects/DDB/repos/ddb-administration-binaries-server

Sonar: https://dev.fiz-karlsruhe.de/sonar/dashboard?id=de.fiz-karlsruhe.binaries-server

Bamboo: https://dev.fiz-karlsruhe.de/bamboo/browse/DDB-DDBBS2

Cassandra table structure

The table structure is documented on a separate page:

Cassandra Tabellenstruktur des Binaries Services

API Documentation

Ingest component

  • GET /binaries/{context}
    • provides a basic status summary of the binaries associated with the given context
  • POST /binaries
    • enqueues the URL provided in the JSON request body for processing
    • returns a reference which can be used to access the binary via the delivery component
    • example request body:
      { "url": "http://example.com/file.pdf",
        "mimetype": "application/pdf",
        "context": "arbitrary string"
      }
  • DELETE /binaries/reference/{reference}
    • deletes the given reference
  • DELETE /binaries/context/{context}
    • deletes all references associated with this context


Delivery component

  • HEAD /binary/{reference}
    • provides access to the metadata headers of the corresponding GET request
  • GET /binary/{reference}
    • allows downloading the original binary; supports streaming using HTTP range requests
    • downloading image binaries over this endpoint is not supported
  • GET /image/{reference}/info.json
    • IIIF image info request
    • deprecated, will redirect to /image/2/{reference}/info.json
  • GET /image/2/{reference}/info.json
    • IIIF image info request compliant with IIIF version 2.1
  • GET /image/{reference}/{region}/{size}/{rotation}/default.jpg
    • IIIF image request
    • deprecated, will redirect to /image/2/{reference}/...
  • GET /image/2/{reference}/{region}/{size}/{rotation}/default.jpg
    • IIIF image request compliant with IIIF version 2.1 (see the example after this list)
    • currently only a subset of the parameters is supported:
      • region: full
      • rotation: 0
      • size: "!116,87", "!140,105", "!440,330" or "!800,600"
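
For illustration, a small sketch that builds an IIIF image request URL from the supported parameter subset; the reference value is a placeholder and the host is taken from the production deployment mentioned below:

    /**
     * Builds an IIIF Image API 2.1 request URL using only the parameter subset
     * supported by the delivery component (region "full", rotation "0" and one
     * of the four listed sizes).
     */
    public class IiifUrlExample {

        static String imageUrl(String base, String reference, String size) {
            return base + "/image/2/" + reference + "/full/" + size + "/0/default.jpg";
        }

        public static void main(String[] args) {
            String base = "https://iiif.deutsche-digitale-bibliothek.de";  // production deployment, see below
            String reference = "some-reference-id";                        // placeholder
            System.out.println(imageUrl(base, reference, "!800,600"));
            // -> https://iiif.deutsche-digitale-bibliothek.de/image/2/some-reference-id/full/!800,600/0/default.jpg
        }
    }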


Side projects

Deployments

Ingest component

There are currently two deployments of the Ingest component, one on the development cluster and one on the production cluster. 

Delivery component

A snapshot of the current development status of the delivery component is automatically deployed to the development network by Bamboo after each commit. It can be reached at https://dev-ddb.fiz-karlsruhe.de/binaries-service/.

In the production network there is a redundant deployment accessible via https://iiif.deutsche-digitale-bibliothek.de


Combining the Binaries Service with an existing IIIF Image Server

The Binaries Service's delivery component already serves as an IIIF Image API compliant server. In order to allow more resolutions and tiled images, an IIIF image server is combined with the Binaries Service to serve those resolutions that the Binaries Service currently cannot deliver. The chosen IIIF image server is IIPImage, which requires pyramid images as data sources. Therefore the Binaries Service's ingest component was extended to create a multi-resolution TIFF image for every image it downloads. The IIPImage server is installed in parallel to the Binaries Service's delivery component, and an Apache HTTPD server with mod_proxy ensures that requests are routed to the application that provides the response.