
Programmatically Add Files to Representations

Alma supports the loading of digital materials via CSV files. MD Import profiles can create bibliographic records, representations, and digital files. The workflow involves creating a CSV file in the supported format with metadata and file information, uploading the CSV and digital files to the appropriate folder in S3, and running the MD import process.

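Alma also supports loading digital content via API. In this post, we will use the Alma APIs together with AWS to perform the following steps for each file: add a representation to the bibliographic record, upload the file to S3, and add the uploaded file to the new representation.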

The script is written in Python and uses Python's built-in CSV parsing and multi-threading support to process the files in parallel, reducing the total time needed to process all of them. The CSV file contains only two columns, MMS_ID and file path:

99509041500561,file-01.png
99509041400561,file-92.png
99509041300561,file-87.png
99509041200561,file-22.png
99508941400561,file-68.png
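As a minimal sketch of how a file in this format could be read with Python's csv module (the helper name read_rows and the in-memory sample are illustrative assumptions, not taken from the script itself):

```python
import csv
import io


def read_rows(handle):
    """Yield (mms_id, file_path) pairs from a two-column CSV, skipping blank lines."""
    for row in csv.reader(handle):
        if not row:
            continue  # csv.reader yields an empty list for a blank line
        if len(row) < 2:
            raise ValueError(f"Expected MMS_ID and file path, got: {row}")
        yield row[0].strip(), row[1].strip()


# In-memory stand-in for files.csv so the sketch runs without touching disk
sample = io.StringIO("99509041500561,file-01.png\n\n99509041400561,file-92.png\n")
rows = list(read_rows(sample))
print(rows)
```

With a real file, the same helper would be called on a handle opened with open(path, newline=""), as the csv module documentation recommends.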

The script expects the Alma API key to be in the environment (ALMA_APIKEY) and the AWS credentials to be in the default profile (or in the profile named by the AWS_PROFILE environment variable). AWS credentials can be obtained in Alma by following these instructions. To configure the AWS Python SDK (Boto3) with your credentials, follow these directions.
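A script like this might check those settings up front so that a missing key fails early rather than halfway through a batch. The sketch below is an assumption about how that could look; load_settings is a hypothetical helper, and only the two environment variable names come from the post.

```python
import os


def load_settings(env=os.environ):
    """Return (api_key, aws_profile) read from an environment-style mapping."""
    api_key = env.get("ALMA_APIKEY")
    if not api_key:
        raise RuntimeError("Set the ALMA_APIKEY environment variable before running")
    # When AWS_PROFILE is not set, fall back to the profile named "default"
    aws_profile = env.get("AWS_PROFILE", "default")
    return api_key, aws_profile


# Example with an explicit mapping instead of the real environment
print(load_settings({"ALMA_APIKEY": "example-key", "AWS_PROFILE": "alma"}))
```

Boto3 reads AWS_PROFILE on its own, so the profile name is returned here only to make the configuration visible; it does not need to be passed anywhere for the SDK to pick it up.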

The script can be configured by setting the three variables at the top:

INST_CODE = '01MYUNI_INST' # Alma institution code
LIBRARY_CODE = 'MAIN' # Code of the library to which the representations should belong
THREADS = 3 # Number of threads for parallel processing

The main logic of the script is in the following function:

def process_line(l):
  rep = add_rep(l[0]) # Call the Alma API to add a representation
  key = upload_file(l[1]) # Upload the file (second column in the CSV) to AWS
  print(key)
  file = add_file(l[0], rep["id"], key) # Call the Alma API to add the file to the representation
  print(file)
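One way to run such a function over every CSV line in parallel is a thread pool from the standard library. This is a sketch under assumptions: the Gist may dispatch its threads differently, process_all is a hypothetical name, and process_line is replaced here by a stand-in that makes no network calls.

```python
from concurrent.futures import ThreadPoolExecutor

THREADS = 3  # mirrors the THREADS setting at the top of the script


def process_line(line):
    # Stand-in for the real function, which calls the Alma API and AWS
    mms_id, path = line
    return f"{mms_id}:{path}"


def process_all(lines, worker=process_line, threads=THREADS):
    """Run worker on every line using a pool of threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(worker, lines))


results = process_all([
    ["99509041500561", "file-01.png"],
    ["99509041400561", "file-92.png"],
])
print(results)
```

Threads suit this workload because each line spends most of its time waiting on HTTP and S3 calls rather than on computation.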

In the output below, the script is configured with three threads, so the output arrives in batches of three. Depending on how heavily you already use the APIs, you can probably increase the number of threads (8-10 is probably a safe bet).

$ python index.py files.csv 
Processing line: ['99509041500561', 'logo.png']
Processing line: ['99509041400561', 'logo.png']
Processing line: ['99509041300561', 'logo.png']
Uploaded file TR_INTEGRATION_INST/upload/migration/1b13a086-c78e-4478-90fb-2641a61b1e59/logo.png
Uploaded file TR_INTEGRATION_INST/upload/migration/223f7a53-28f5-4a50-aeae-c944241101c4/logo.png
Uploaded file TR_INTEGRATION_INST/upload/migration/a67b897e-b1ad-4c2b-b281-2b745214a0cb/logo.png
Added file to rep 13155769230000561
Processing line: ['99509041200561', 'logo.png']
Added file to rep 13155769220000561
Processing line: ['99508941400561', 'logo.png']
Added file to rep 13155759290000561
Processing line: ['99509041100561', 'logo.png']
...
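The uploaded keys in this output follow the pattern institution code, upload/migration, a random UUID folder, then the file name. A minimal sketch of building a key with that shape follows; the pattern is inferred from the output above, and build_key is a hypothetical helper rather than the script's own function.

```python
import os
import uuid


def build_key(inst_code, file_path, folder=None):
    """Build an S3 key shaped like the ones shown in the sample output."""
    # A fresh UUID folder per file keeps identical file names from colliding
    folder = folder or str(uuid.uuid4())
    file_name = os.path.basename(file_path)
    return f"{inst_code}/upload/migration/{folder}/{file_name}"


print(build_key("TR_INTEGRATION_INST", "logo.png"))
```

The random folder explains why the same logo.png can be uploaded several times in the run above without one upload overwriting another.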

The full text of the script can be found in this Gist, and the script can be expanded to add additional information, such as representation or file labels.
