
Extracting Full Text (OCR) in Alma

Josh Weisman on October 29th, 2018

The digital repository features in Alma continue to be developed. Support was recently added for automatic full text extraction (OCR) from digital images, and you can now search the extracted text within one of the built-in viewers. In this blog post, we'll review how to extract text from images in Alma and how that text can be searched using the book reader viewer.

Full Text Extraction Job

Full text can be provided for a digital resource when it is ingested into Alma. For more information see this section of the online help. Alternatively, Alma can extract full text automatically for image files in the following formats: bmp, png, jpg, and tif. This extraction can be performed on an individual file or by using a job on a set of digital files.

Individual File

To extract full text on an individual file, go into the Representation Editor, click the "more actions" button, and select "Fulltext." In the window which appears, select the option to "Extract" and click OK. The process will begin and full text will be extracted from the file.

Digital Full Text Extract

Bulk Full Text Extraction via Job

Alternatively, Alma can extract full text for a whole set of digital files. Select "Run a job" from the Admin menu, search for "fulltext", and select the Extract Fulltext job. Then select the desired set of digital files and click through the options.

Digital Full Text Job

Viewing Full Text Results

Once the process has completed, a link to the full text results will be available in the file list in the Representation Editor. Click the link to view the results of the extraction. To change the results, download the file, make the desired changes, and then select "Upload" from the "Fulltext" file action.

Digital Full Text Files

 

Searching Full Text During Delivery

The extracted full text can now be used by the book reader viewer. Activate the book reader viewer service by following the instructions in the online help. You can use a rule which captures image-only representations, such as the following:

\.tif|\.jp2|\.jpg|\.png
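To get a feel for which file names such a rule would capture, the pattern can be tried out locally. The following is a minimal sketch in Python, assuming the rule behaves like an ordinary unanchored, case-sensitive regular expression; how Alma itself evaluates the rule is not covered here, and the sample file names are hypothetical.

```python
import re

# The rule from the post, compiled as a Python regular expression.
# Assumption: Alma's own matching semantics (case handling, anchoring)
# may differ from Python's re.search.
IMAGE_RULE = re.compile(r"\.tif|\.jp2|\.jpg|\.png")


def matches_rule(filename: str) -> bool:
    """Return True if the pattern occurs anywhere in the file name."""
    return IMAGE_RULE.search(filename) is not None


# Hypothetical file names, for illustration only.
samples = ["page-001.tif", "cover.png", "scan.jp2", "thesis.pdf", "PAGE-002.JPG"]
matched = [name for name in samples if matches_rule(name)]
print(matched)
```

Under these assumptions the pattern is not anchored to the end of the name, so "page.tiff" would also match, while an upper-case extension such as ".JPG" would not. If either matters for your collection, it is worth checking the rule against real file names before relying on it.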

Once configured, the viewer will appear as an option in the fulfillment services:

Delivery Service View It

The book reader viewer shows an option to search for text in the book. The search results indicate the page on which the desired term is located and show an excerpt with the surrounding text.

Digital Fulltext Viewer

You can click here to view this example in the book reader viewer.

Full Text in Services

The full text provided or extracted in Alma can be accessed via APIs for use in external applications or viewers.

JSON Delivery Service

The JSON delivery service returns the data required to display material in an external viewer. See this previous blog post for more details on how to implement a viewer using the service. The delivery service now returns links to the full text for a particular file if available. 

"files": [
  {
    "pid": "1337423180000561",
    "label": "Tyler-No-Tail-Mouse-00",
    ...
    "fulltext": {
      "format": "PLAIN",
      "url": "https://na01.alma.exlibrisgroup.com/view/delivery/text/TR_INTEGRATION_INST/1337423180000561"
    }
  }
]
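An external viewer might collect these links before rendering. The sketch below parses a payload shaped like the fragment above and maps each file's pid to its full text URL. It assumes the "fulltext" object is simply absent (or null) when no text is available, which the fragment does not confirm; the second file in the sample and the function name `fulltext_links` are hypothetical.

```python
import json

# Sample payload shaped like the fragment above. The second file, which has
# no "fulltext" key, is hypothetical and only illustrates the "if available" case.
SAMPLE = """
{
  "files": [
    {
      "pid": "1337423180000561",
      "label": "Tyler-No-Tail-Mouse-00",
      "fulltext": {
        "format": "PLAIN",
        "url": "https://na01.alma.exlibrisgroup.com/view/delivery/text/TR_INTEGRATION_INST/1337423180000561"
      }
    },
    {"pid": "0000000000000000", "label": "hypothetical-file-without-text"}
  ]
}
"""


def fulltext_links(payload: dict) -> dict:
    """Map each file pid to its full text URL, skipping files without one."""
    links = {}
    for file_entry in payload.get("files", []):
        # Assumption: a missing or null "fulltext" means no text is available.
        fulltext = file_entry.get("fulltext") or {}
        url = fulltext.get("url")
        if url:
            links[file_entry["pid"]] = url
    return links


links = fulltext_links(json.loads(SAMPLE))
print(links)
```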

File REST API

Similarly, the REST Files API also returns a link to the full text if available:

"representation_file": [
  {
    "pid": "1337423180000561",
    "path": "TR_INTEGRATION_INST/storage/alma/E7/97/2B/A4/0E/1E/57/F7/43/BC/1B/61/C9/2C/B4/9C/Tyler-No-Tail-Mouse-00.png",
    "thumbnail_url": "https://na01.alma.exlibrisgroup.com/view/delivery/thumbnail/TR_INTEGRATION_INST/1337423180000561",
    "label": "Tyler-No-Tail-Mouse-00",
    "size": 753288,
    "url": null,
    "fulltext": {
      "fulltext_format": "PLAIN",
      "fulltext_url": "https://na01.alma.exlibrisgroup.com/view/delivery/text/TR_INTEGRATION_INST/1337423180000561"
    }
  }
]
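Note that the two services name the fields differently: the delivery service uses "format" and "url", while the Files API uses "fulltext_format" and "fulltext_url". A client that consumes both could normalize them with a small helper. This is a sketch only; the helper name `fulltext_url` is hypothetical, and it assumes a missing or null "fulltext" object means no text is available.

```python
from typing import Optional


def fulltext_url(file_entry: dict) -> Optional[str]:
    """Return the full text URL from either response shape, or None."""
    # Assumption: a missing or null "fulltext" means no text is available.
    fulltext = file_entry.get("fulltext") or {}
    # The Files API uses "fulltext_url"; the delivery service uses "url".
    return fulltext.get("fulltext_url") or fulltext.get("url")


TEXT_URL = (
    "https://na01.alma.exlibrisgroup.com/view/delivery/text/"
    "TR_INTEGRATION_INST/1337423180000561"
)

# Entries trimmed down from the two examples in the post.
rest_file = {
    "pid": "1337423180000561",
    "label": "Tyler-No-Tail-Mouse-00",
    "fulltext": {"fulltext_format": "PLAIN", "fulltext_url": TEXT_URL},
}
delivery_file = {
    "pid": "1337423180000561",
    "label": "Tyler-No-Tail-Mouse-00",
    "fulltext": {"format": "PLAIN", "url": TEXT_URL},
}

print(fulltext_url(rest_file))
print(fulltext_url(delivery_file))
```

Whether the returned URL can be fetched without an authenticated session is not stated in the post, so the sketch stops at extracting the link.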

 

In summary, the full text extraction capabilities in Alma make it possible to provide a rich viewing experience for your patrons.