Tech Blog

Extracting Full Text (OCR) in the Alma Digital Repository

The digital repository features in Alma continue to be developed. Recently, support was added for automatic full text extraction (OCR) from digital images, and you can now search that text in one of the built-in viewers. In this blog post, we’ll review how to extract text from images in Alma and how to search that text using the book reader viewer.

Full Text Extraction Job

Full text can be provided for a digital resource when it is ingested into Alma. For more information see this section of the online help. Alternatively, Alma can extract full text automatically for common text and image file formats (see the online help for the detailed list). This extraction can be performed on an individual file or by using a job on a set of digital files.

Individual File

To extract full text from an individual file, go to the Files tab in the Digital Representation Editor, click the “more actions” button for the file, and select “Fulltext.” In the window that appears, select “Extract” and click OK. Alma then extracts the full text from the file.

Bulk Full Text Extraction via Job

Alternatively, Alma can extract full text from a set of digital files. Select “Run a job” from the Admin menu, search for “fulltext,” and select the Extract Fulltext job. Then select the desired set of digital files and click through the remaining options.

Viewing Full Text Results

Once the process has completed, a link to the full text results is available in the file list in the Representation Editor. Click the link to view the results of the extraction. To change the results, download the file, make the desired changes, and then select “Upload” from the “Fulltext” file action.

Searching Full Text During Delivery

The extracted full text can now be used by the book reader viewer. Activate the book reader viewer service by following the instructions in the online help. You can use a rule which captures image-only representations, such as the following:

Representation's Files Equals \.tif|\.jpeg|\.jpg|\.png
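
To get a feel for which file names such a pattern captures, the short Python sketch below applies the same expression to a few hypothetical file names. How Alma evaluates the rule internally is not described here, so the case-sensitive "match anywhere in the name" behavior is an assumption made for illustration.

```python
import re

# Pattern copied from the rule above. The matching semantics (case-sensitive,
# found anywhere in the file name) are an assumption, not documented Alma behavior.
IMAGE_RULE = re.compile(r"\.tif|\.jpeg|\.jpg|\.png")


def matches_image_rule(file_name: str) -> bool:
    """Return True if the file name contains one of the image extensions."""
    return IMAGE_RULE.search(file_name) is not None


# Hypothetical file names, for illustration only.
for name in ["page-001.tif", "cover.png", "scan.tiff", "thesis.pdf", "PAGE.JPG"]:
    print(name, matches_image_rule(name))
```

Under these assumptions, scan.tiff matches because it contains .tif, while PAGE.JPG does not because the pattern is lowercase. If your files use other extensions or uppercase names, the rule may need to be adjusted accordingly.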

Once configured, the viewer appears as an option in the fulfillment services.

The book reader viewer shows an option to search for text in the book. The search results indicate the page on which the desired term is located and show an excerpt with surrounding text.

You can click here to view this example in the book reader viewer.

Full Text in Services

The full text provided or extracted in Alma can be accessed via APIs for use in external applications or viewers.

JSON Delivery Service

The JSON delivery service returns the data required to display material in an external viewer. See this previous blog post for more details on how to implement a viewer using the service. The delivery service now returns links to the full text for a particular file if available.

"files": [
  {
    "pid": "1337423180000561",
    "label": "Tyler-No-Tail-Mouse-00",
    ...
    "fulltext": {
      "format": "PLAIN",
      "url": "https://na01.alma.exlibrisgroup.com/view/delivery/text/TR_INTEGRATION_INST/1337423180000561"
    }
  }
]
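
As a rough illustration of how an external viewer might pick up these links, the following Python sketch collects the full text URL for each file in a delivery response. The sample is a trimmed version of the fragment above, the second file entry is hypothetical, and treating "files" as a top-level key of the response is an assumption.

```python
import json

# Trimmed sample modeled on the fragment above. A real response contains more
# fields, and "files" being a top-level key is an assumption of this sketch.
sample = json.loads("""
{
  "files": [
    {
      "pid": "1337423180000561",
      "label": "Tyler-No-Tail-Mouse-00",
      "fulltext": {
        "format": "PLAIN",
        "url": "https://na01.alma.exlibrisgroup.com/view/delivery/text/TR_INTEGRATION_INST/1337423180000561"
      }
    },
    {"pid": "0000000000000000", "label": "hypothetical-file-without-fulltext"}
  ]
}
""")


def fulltext_links(delivery: dict) -> dict:
    """Map each file pid to its full text URL, skipping files without one."""
    links = {}
    for file_entry in delivery.get("files", []):
        fulltext = file_entry.get("fulltext") or {}
        if fulltext.get("url"):
            links[file_entry["pid"]] = fulltext["url"]
    return links


print(fulltext_links(sample))
```

Because the link is only returned when full text is available, the sketch treats a missing "fulltext" entry as normal rather than as an error.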

File REST API

Similarly, the REST Files API also returns a link to the full text if available:

"representation_file": [
  {
    "pid": "1337423180000561",
    "path": "TR_INTEGRATION_INST/storage/alma/E7/97/2B/A4/0E/1E/57/F7/43/BC/1B/61/C9/2C/B4/9C/Tyler-No-Tail-Mouse-00.png",
    "thumbnail_url": "https://na01.alma.exlibrisgroup.com/view/delivery/thumbnail/TR_INTEGRATION_INST/1337423180000561",
    "label": "Tyler-No-Tail-Mouse-00",
    "size": 753288,
    "url": null,
    "fulltext": {
      "fulltext_format": "PLAIN",
      "fulltext_url": "https://na01.alma.exlibrisgroup.com/view/delivery/text/TR_INTEGRATION_INST/1337423180000561"
    }
  }
]
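
Note that the key names differ between the two services: the JSON delivery service uses "format" and "url", while the Files API uses "fulltext_format" and "fulltext_url". A client that consumes both could normalize them with a small helper like the one below. This is a minimal sketch in Python; the helper name is hypothetical and the sample entry is trimmed from the fragment above.

```python
from typing import Optional


def fulltext_url(file_entry: dict) -> Optional[str]:
    """Return the full text link from a file entry of either service, or None."""
    fulltext = file_entry.get("fulltext") or {}
    # The Files API uses "fulltext_url"; the JSON delivery service uses "url".
    return fulltext.get("fulltext_url") or fulltext.get("url")


# Trimmed entry modeled on the Files API fragment above.
rest_entry = {
    "pid": "1337423180000561",
    "label": "Tyler-No-Tail-Mouse-00",
    "url": None,
    "fulltext": {
        "fulltext_format": "PLAIN",
        "fulltext_url": "https://na01.alma.exlibrisgroup.com/view/delivery/text/TR_INTEGRATION_INST/1337423180000561",
    },
}

print(fulltext_url(rest_entry))
```

The helper only looks inside the "fulltext" object, so the file-level "url" field (null in the example above) is never mistaken for the full text link.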

In summary, the full text extraction capabilities in Alma make it possible to provide a rich viewing experience for your patrons.

2 Replies to “Extracting Full Text (OCR) in the Alma Digital Repository”

  1. Hi Josh, do you know if it is possible to configure the IA reader to ignore words with a ‘low’ word confidence level? For example, on the following ALTO XML line the wc value is 0

    cheers

    1. Hi Dave,

      Are you referring to the Rosetta or the Alma implementation of the IA Book Reader? The Rosetta side is addressed in this blog post. On the Alma side, the automated text extraction job produces only text, so the ALTO WC score is not relevant. User-provided ALTO is supported, though, so the WC score might be relevant there.

      -Josh
