Tech Blog

Harvesting OAI into Amazon AWS CloudSearch

Alma supports OAI publishing via its publishing platform. Publishing is a powerful way to share the content of our Alma repository with other systems. When configuring a publishing profile, we specify the set of records to publish. We can also specify whether the results are written to a file or made available via OAI.

Amazon Web Services offers a search engine service called CloudSearch. CloudSearch exposes a Lucene-based search engine in the cloud.

In this recipe, we want to do the following:

  • define a publishing profile with OAI,
  • define a search domain in CloudSearch,
  • create a script to harvest the data from Alma into CloudSearch.

Publishing Profile

The publishing profile defines a set of records and publishes it via OAI with a set spec of “discovery”.

Search Domain

We’ve defined a relatively simple search domain in CloudSearch with four fields. Of course this solution can be expanded to index additional fields if desired.
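To make the indexing side a little more concrete, here is a minimal sketch of building a CloudSearch document batch in XML. It is not taken from the harvest script, which produces its batch with an XSL. The helper name `cloudsearch_batch` and the field names `title` and `author` are placeholders for illustration only; substitute the fields defined in your own search domain.

```ruby
require 'rexml/document'

# Hypothetical helper: builds a CloudSearch XML document batch from an
# array of hashes. Each hash needs an :id; every other key becomes a
# <field> element. The field names are placeholders, not the domain's
# actual fields.
def cloudsearch_batch(records)
  doc = REXML::Document.new
  batch = doc.add_element('batch')
  records.each do |record|
    add = batch.add_element('add', 'id' => record[:id])
    record.each do |name, value|
      next if name == :id
      add.add_element('field', 'name' => name.to_s).text = value
    end
  end
  doc.to_s
end

puts cloudsearch_batch([{ id: '9912345', title: 'Moby Dick', author: 'Melville, Herman' }])
```

Indexing an additional field would then mean adding it to the search domain and including one more key in each record.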

Harvesting

The harvest script is written in Ruby and follows this basic flow:

  • Determine the “from” and “to” dates for subsequent harvesting
  • Call the OAI URL and retrieve the updated records
  • Use an XSL to convert the OAI response into a CloudSearch request
  • Continue to call the OAI URL until all records are processed
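The last step of the flow above relies on the OAI-PMH resumption token: each response may carry a token, and the harvester keeps requesting pages until no token is returned. The following is a rough sketch of that loop, not the actual script. The names `resumption_token` and `harvest` are made up for this example, and `fetch` stands in for whatever HTTP call you use.

```ruby
require 'rexml/document'

OAI_NS = { 'oai' => 'http://www.openarchives.org/OAI/2.0/' }

# Returns the resumption token from an OAI-PMH response, or nil when the
# list is complete (token element missing or empty).
def resumption_token(xml)
  doc = REXML::Document.new(xml)
  token = REXML::XPath.first(doc, '//oai:resumptionToken', OAI_NS)
  text = token && token.text.to_s.strip
  text.nil? || text.empty? ? nil : text
end

# Hypothetical driver: fetches a page, yields the response body to the
# caller (for example to transform it and send it to CloudSearch), then
# follows the resumption token until there is none left.
# `fetch` is any callable that takes a query string and returns the body.
def harvest(first_query, fetch)
  query = first_query
  while query
    body = fetch.call(query)
    yield body
    token = resumption_token(body)
    query = token && "?verb=ListRecords&resumptionToken=#{token}"
  end
end
```

Passing the fetcher in as a callable keeps the loop easy to try out with canned responses before pointing it at a live OAI endpoint.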

Once harvested, we can perform searches in the CloudSearch search engine.

Using the script

The scripts are available as GitHub Gists. The harvest script is available here. It references the XSL directly, so there’s no need to download it, but if you’re interested in the code, it’s here. It uses the following gems:

To configure the script to work in your environment, update the following variables:

### Define variables
### <<<<<<<<<<<<<<<<
s3_bucket = 'exldev-scratch' # S3 bucket used to store the time for the next harvest
inst = 'TR_INTEGRATION_INST' # Alma institution code
domain = 'catalog' # CloudSearch domain
alma_inst = 'na01' # Alma instance name
oai_set = 'discovery' # OAI set spec configured in the publishing profile
### <<<<<<<<<<<<<<<<
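As an illustration of how these variables could fit together, here is a small sketch that assembles the OAI request URL from them. The helper `oai_list_records_url` is hypothetical, and the host and path pattern are assumptions based on the usual shape of Alma's OAI endpoint; check the base URL for your own institution before relying on it.

```ruby
require 'uri'

# Hypothetical helper: builds an OAI-PMH ListRecords URL for an Alma set.
# The host and path pattern are assumptions, not taken from the script.
def oai_list_records_url(alma_inst:, inst:, oai_set:, from: nil, to: nil)
  params = { 'verb' => 'ListRecords', 'set' => oai_set, 'metadataPrefix' => 'marc21' }
  params['from'] = from if from   # lower bound of the harvest window
  params['until'] = to if to      # upper bound of the harvest window
  "https://#{alma_inst}.alma.exlibrisgroup.com/view/oai/#{inst}/request?" +
    URI.encode_www_form(params)
end

puts oai_list_records_url(alma_inst: 'na01', inst: 'TR_INTEGRATION_INST', oai_set: 'discovery')
```

Note that `URI.encode_www_form` percent-encodes the colons in timestamps passed as `from` or `to`, which the server decodes on its side.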

When it runs, the script provides a log of its activities:

$ ruby harvest.rb
2015-04-14 21:26:33 - Starting...
2015-04-14 21:26:33 - Retrieving 'from' time
2015-04-14 21:26:33 - Retrieved from time: &from=2015-04-14T18:25:38Z
2015-04-14 21:26:33 - Set 'to' time to: 2015-04-14T18:26:33Z
2015-04-14 21:26:33 - Calling OAI with query string ?verb=ListRecords&set=discovery&metadataPrefix=marc21&until=2015-04-14T18:26:33Z
2015-04-14 21:26:34 - 100 records retrieved
2015-04-14 21:26:36 - Sent to CloudSearch: <response status="success" adds="100" deletes="0"> </response>
2015-04-14 21:26:36 - Calling OAI with query string ?verb=ListRecords&resumptionToken=all@2015-04-14T18:26:33Z@discovery@marc21@230057560000561
2015-04-14 21:26:37 - 100 records retrieved
2015-04-14 21:26:38 - Sent to CloudSearch: <response status="success" adds="100" deletes="0"> </response>
2015-04-14 21:26:39 - Calling OAI with query string ?verb=ListRecords&resumptionToken=all@2015-04-14T18:26:33Z@discovery@marc21@230062000000561
2015-04-14 21:26:40 - 83 records retrieved
2015-04-14 21:26:41 - Sent to CloudSearch: <response status="success" adds="83" deletes="0"> </response>
2015-04-14 21:26:41 - Storing 'to' time
2015-04-14 21:26:41 - Complete

 
