Automating Full-Text Extraction in the Alma Digital Repository
Alma recently added support for full text extraction from image files. The resulting text is stored with the file and can be accessed via the file list in the representation resource editor. In addition, the full text can be searched in the Book Reader viewer. Full text is extracted by running the “Extract Fulltext” job.
In this blog post, we’ll show an automated flow for ingesting image files, creating a new representation, and extracting the full text. The full workflow is represented by the flowchart below:
- First, we create a new representation on a given bibliographic record using the “Create representation” API
- Then we add each image file to the representation by:
- We call the “Create set” API…
- … and then we add all of the file IDs to the newly created set using the “Manage set members” API (which now supports adding up to 1000 members at a time, increased from the previous limit of 100)
- Then we run the “Extract fulltext” job by:
- Finally, we clean up after ourselves by deleting the set we created using the “Delete set” API
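The 1000-member limit in the steps above means a large ingest has to be split into several “Manage set members” calls. The following is a minimal sketch of that batching, not the scripts from the Gist. It only builds request descriptions and sends nothing. The endpoint path, the `op=add_members` query parameter, and the JSON body shape are assumptions to verify against the Alma API documentation, and `chunk` and `add_member_requests` are hypothetical helper names.

```python
from typing import Dict, Iterator, List

# Limit quoted in the post for the "Manage set members" API.
MAX_MEMBERS_PER_CALL = 1000


def chunk(ids: List[str], size: int = MAX_MEMBERS_PER_CALL) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def add_member_requests(set_id: str, file_ids: List[str]) -> List[Dict]:
    """Describe one request per batch of file IDs, without sending anything."""
    return [
        {
            "method": "POST",
            # Assumed path and query parameter; check the Alma API docs.
            "path": f"/almaws/v1/conf/sets/{set_id}",
            "params": {"op": "add_members"},
            # Assumed body shape for a list of set members.
            "body": {"members": {"member": [{"id": fid} for fid in batch]}},
        }
        for batch in chunk(file_ids)
    ]
```

For example, passing 2,500 file IDs to `add_member_requests` yields three request descriptions carrying 1000, 1000, and 500 members. Each description can then be handed to whatever HTTP client the script uses, along with the API key.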
The result of this flow is automatically extracted full text for the newly ingested image files. We can view the full text in the representation resource editor:
And we can search the extracted full text in the Book Reader view:
The Alma Digital Repository has many powerful features that can be combined with the Alma Open Platform to enable fully automated workflows. All of the scripts used in this post are available in this GitHub Gist.