Tech Blog

Automating Full-Text Extraction in the Alma Digital Repository

Alma recently added support for full-text extraction from image files. The extracted text is stored with the file and can be accessed via the file list in the representation resource editor. In addition, the full text can be searched in the Book Reader viewer. Full text is extracted by running the “Extract Fulltext” job.

In this blog post, we’ll show an automated flow for ingesting image files, creating a new representation, and extracting the full text. The full workflow is represented by the flowchart below:

  1. First, we create a new representation on a given bibliographic record using the “Create representation” API
  2. Then we add each image file to the representation by:
    1. Uploading the image to S3 in the special institutional upload folder
    2. Calling the “Add file” API
  3. We call the “Create set” API…
  4. … and then we add all of the file IDs to the newly created set using the “Manage set members” API (which now supports adding up to 1000 members at a time, increased from the previous limit of 100)
  5. Then we run the “Extract fulltext” job by:
    1. Submitting the job on the newly populated set using the “Submit job” API
    2. Monitoring the result with the “Get job instance details” API
  6. Finally, we clean up after ourselves by deleting the set we created using the “Delete set” API
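
Two parts of this flow need a little logic beyond single API calls: splitting the file IDs into batches that respect the 1000-member limit in step 4, and polling the job instance in step 5 until it finishes. The sketch below is a minimal illustration of both and is not taken from the scripts in the Gist. The helper names `chunk_ids` and `wait_for_job`, the `get_status` callable, and the list of "unfinished" status strings are all assumptions made for the example; the actual HTTP calls are left to the caller.

```python
import time
from typing import Callable, Iterator, List, Sequence

# Limit quoted in step 4 for the "Manage set members" API.
MAX_MEMBERS_PER_CALL = 1000

# Assumption: status strings meaning the job has not finished yet.
# Check these against what "Get job instance details" returns for you.
UNFINISHED_STATES = frozenset({"QUEUED", "PENDING", "INITIALIZING", "RUNNING"})


def chunk_ids(
    file_ids: Sequence[str], size: int = MAX_MEMBERS_PER_CALL
) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` file IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(file_ids), size):
        yield list(file_ids[start:start + size])


def wait_for_job(
    get_status: Callable[[], str],
    poll_seconds: float = 10.0,
    max_polls: int = 60,
) -> str:
    """Poll until the job leaves the unfinished states, then return its status.

    `get_status` is a hypothetical callable supplied by the caller; it would
    wrap one "Get job instance details" request and return the status string.
    """
    for attempt in range(max_polls):
        status = get_status()
        if status not in UNFINISHED_STATES:
            return status
        # Sleep between polls, but not after the final attempt.
        if attempt < max_polls - 1:
            time.sleep(poll_seconds)
    raise TimeoutError(f"job still unfinished after {max_polls} polls")
```

With 2,500 file IDs, `chunk_ids` yields batches of 1000, 1000, and 500, so step 4 would take three “Manage set members” calls. Passing the status lookup in as a callable keeps the polling loop independent of any particular HTTP client.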

The result of this flow is automatically extracted full text for the newly ingested image files. We can view the full text in the representation resource editor:

And we can search the extracted full text in the Book Reader viewer:

The Alma Digital Repository has many powerful features which can be combined with the Alma Open Platform to enable fully automated workflows. All of the scripts used in this post are available in this GitHub Gist.
