Harvesting OAI into Amazon AWS CloudSearch
Alma supports OAI publishing via its publishing platform. Publishing is a powerful way to share the content of our Alma repository with other systems. When configuring a publishing profile, we specify the set of records to be published. We can also specify whether the results should be written to a file or made available via OAI.
Amazon Web Services offers a search engine service called CloudSearch. CloudSearch exposes a Lucene-based search engine in the cloud.
In this recipe, we want to do the following:
- define a publishing profile with OAI,
- define a search domain in CloudSearch,
- create a script to harvest the data from Alma into CloudSearch.
The publishing profile defines a set of records and publishes it via OAI with a set spec of “discovery”.
We’ve defined a relatively simple search domain in CloudSearch with four fields. Of course this solution can be expanded to index additional fields if desired.
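To make the shape of such a domain concrete, here is a minimal sketch of a document batch in the JSON form accepted by the CloudSearch 2013-01-01 document API. The four field names are hypothetical, since the actual fields aren't listed here, and the harvest script itself builds its request with an XSL rather than in Ruby.

```ruby
require 'json'

# Hypothetical field names; substitute the index fields defined in your domain.
FIELDS = %i[title author subject isbn].freeze

# Turn an array of record hashes into a CloudSearch "add" batch (JSON string).
# Keys outside FIELDS and nil values are dropped.
def cloudsearch_batch(records)
  docs = records.map do |rec|
    {
      type: 'add',
      id: rec.fetch(:id),
      fields: rec.slice(*FIELDS).compact
    }
  end
  JSON.generate(docs)
end
```

Indexing an additional field would then mean adding it to the domain's index configuration and to the list of fields copied into each document.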
The harvest script is written in Ruby and follows this basic flow:
- Determine the “from” and “to” dates for subsequent harvesting
- Call the OAI URL and retrieve the updated records
- Use an XSL to convert the OAI response into a CloudSearch request
- Continue to call the OAI URL until all records are processed
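The paging part of this flow can be sketched roughly as follows. This is not the published harvest script, only an illustration under stated assumptions: `fetch` is a hypothetical callable that takes a query string and returns the OAI response body, and the helper names `oai_query`, `elements_named`, and `harvest` are invented for the example. The block passed to `harvest` stands in for the XSL transformation and the upload to CloudSearch.

```ruby
require 'rexml/document'

OAI_NS = 'http://www.openarchives.org/OAI/2.0/'

# Build the query string for one ListRecords call. Once a resumption token is
# in play, OAI-PMH expects it to be the only argument besides the verb.
# Values are inserted as-is and are assumed to be URL-safe.
def oai_query(set:, until_time:, from_time: nil, token: nil)
  return "?verb=ListRecords&resumptionToken=#{token}" if token

  query = "?verb=ListRecords&set=#{set}&metadataPrefix=marc21"
  query += "&from=#{from_time}" if from_time
  "#{query}&until=#{until_time}"
end

# Collect descendant elements with the given name in the OAI namespace, so the
# MARC <record> nested inside each OAI <record> is not counted twice.
def elements_named(doc, name)
  found = []
  doc.root.each_recursive do |el|
    found << el if el.name == name && el.namespace == OAI_NS
  end
  found
end

# Fetch page after page until the response carries no resumption token.
# Yields each raw response to the caller and returns the number of records seen.
def harvest(fetch, set:, until_time:, from_time: nil)
  token = nil
  total = 0
  loop do
    xml = fetch.call(oai_query(set: set, until_time: until_time,
                               from_time: from_time, token: token))
    doc = REXML::Document.new(xml)
    total += elements_named(doc, 'record').size
    yield xml if block_given? # e.g. transform and send to CloudSearch
    token = elements_named(doc, 'resumptionToken').first&.text.to_s.strip
    break if token.empty?
  end
  total
end

# Possible wiring (the base URL pattern is an assumption, not taken from the script):
#   require 'net/http'
#   base  = "https://#{alma_inst}.alma.exlibrisgroup.com/view/oai/#{inst}/request"
#   fetch = ->(query) { Net::HTTP.get(URI(base + query)) }
#   harvest(fetch, set: oai_set, until_time: to_time) { |xml| send_to_cloudsearch(xml) }
```

Passing the fetcher in as a callable keeps the loop independent of the HTTP client, which also makes it easy to exercise with canned responses.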
Once harvested, we can perform searches in the CloudSearch search engine.
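As a rough illustration, a search request against the domain is an HTTP GET on its search endpoint. The sketch below assumes the 2013-01-01 search API; the endpoint host in the usage line is a placeholder, and `search_url` is a helper name made up for this example.

```ruby
require 'uri'

# Build a search request URL for the CloudSearch 2013-01-01 search API.
# `endpoint` is the domain's search endpoint host as shown in the AWS console.
def search_url(endpoint, query, size: 10)
  params = URI.encode_www_form(q: query, size: size)
  "https://#{endpoint}/2013-01-01/search?#{params}"
end

# Usage (placeholder endpoint):
#   search_url('search-catalog-example.us-east-1.cloudsearch.amazonaws.com', 'moby dick')
```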
Using the script
The scripts are available as GitHub Gists. The harvest script is available here. It references the XSL directly, so there’s no need to download it, but if you’re interested in the code, it’s here. It uses the following gems:
To configure the script to work in your environment, update the following variables:
### Define variables ### <<<<<<<<<<<<<<<<
s3_bucket = 'exldev-scratch'    # S3 bucket used to store the time for the next harvest
inst = 'TR_INTEGRATION_INST'    # Alma institution code
domain = 'catalog'              # CloudSearch domain
alma_inst = 'na01'              # Alma instance name
oai_set = 'discovery'           # OAI set spec configured in the publishing profile
### <<<<<<<<<<<<<<<<
When it runs, the script provides a log of its activities:
$ ruby harvest.rb
2015-04-14 21:26:33 - Starting...
2015-04-14 21:26:33 - Retrieving 'from' time
2015-04-14 21:26:33 - Retrieved from time: &from=2015-04-14T18:25:38Z
2015-04-14 21:26:33 - Set 'to' time to: 2015-04-14T18:26:33Z
2015-04-14 21:26:33 - Calling OAI with query string ?verb=ListRecords&set=discovery&metadataPrefix=marc21&until=2015-04-14T18:26:33Z
2015-04-14 21:26:34 - 100 records retrieved
2015-04-14 21:26:36 - Sent to CloudSearch: <response status="success" adds="100" deletes="0"> </response>
2015-04-14 21:26:36 - Calling OAI with query string ?verb=ListRecords&resumptionToken=all@2015-04-14T18:26:33Z@discovery@marc21@230057560000561
2015-04-14 21:26:37 - 100 records retrieved
2015-04-14 21:26:38 - Sent to CloudSearch: <response status="success" adds="100" deletes="0"> </response>
2015-04-14 21:26:39 - Calling OAI with query string ?verb=ListRecords&resumptionToken=all@2015-04-14T18:26:33Z@discovery@marc21@230062000000561
2015-04-14 21:26:40 - 83 records to retrieved
2015-04-14 21:26:41 - Sent to CloudSearch: <response status="success" adds="83" deletes="0"> </response>
2015-04-14 21:26:41 - Storing 'to' time
2015-04-14 21:26:41 - Complete
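The 'from' and 'to' bookkeeping visible in the log can be sketched as below. This is an assumption-laden illustration rather than the script's code: `store` is any Hash-like object standing in for the S3 object the script uses, and the key `'next_from'` and both method names are invented for the example.

```ruby
require 'time'

# Work out the harvest window. The 'from' time is whatever the previous run
# stored (nil on the very first run); the 'to' time is the current time in UTC.
def harvest_window(store, now: Time.now)
  from_time = store['next_from']
  to_time = now.getutc.iso8601 # getutc returns a copy, leaving `now` untouched
  [from_time, to_time]
end

# Record this run's 'to' time as the next run's 'from' time. Calling this only
# after every page has been sent means a failed run restarts from the same point.
def commit_window(store, to_time)
  store['next_from'] = to_time
end
```

Fixing the 'to' time before the first OAI call, as the log shows, means records updated while the harvest is running fall into the next run's window rather than being missed.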