Migrating Your Digital Repository to Rosetta
Migrating your digital content to Rosetta has become much easier with Rosetta 4.1. In the past, many institutions opted to build complex submission applications that extract data from legacy systems and create Rosetta SIPs. As of version 4.1, any repository that supports OAI-PMH can publish data for direct harvesting by Rosetta. The Rosetta OAI-PMH Harvester generates submission folders and uses the same familiar material flow building blocks, so setting up a workflow is straightforward. In this post we’ll show how this is done.
Creating a Material Flow
Creating a material flow is a basic operation in Rosetta and should be trivial even for novices. In this context we need to know which metadata format the OAI-PMH repository will provide. Most repositories can provide Dublin Core (the dc or oai_dc format), and some can provide METS. Keep in mind that Rosetta also offers the option to run an xsl transformation on the OAI-PMH output (more on that below). For the current example we’ll assume the repository provides dc, and we’ll define a dc material flow accordingly. To do this, we select a DC Converter content structure.
This content structure must be told what metadata field contains the reference to the files. Typically, this will be dc:identifier (perhaps with a qualifier). Rosetta presents me with a pre-configured list to choose from, which can be edited if necessary.
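Before choosing the field, it can help to look at what a sample record actually puts in dc:identifier. The following is a minimal sketch using only the Python standard library; the sample record and its URLs are made up for illustration and do not come from any real repository.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core element namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical oai_dc record; titles and URLs are illustrative only.
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample thesis</dc:title>
  <dc:identifier>https://repository.example.org/record/123</dc:identifier>
  <dc:identifier>https://repository.example.org/files/123/thesis.pdf</dc:identifier>
</oai_dc:dc>
"""


def identifiers(record_xml):
    """Return the text of every dc:identifier element in the record."""
    root = ET.fromstring(record_xml)
    return [
        el.text.strip()
        for el in root.iter(f"{{{DC_NS}}}identifier")
        if el.text
    ]


for value in identifiers(SAMPLE_RECORD):
    print(value)
```

In this made-up record only the second identifier points at a file, which is the kind of detail worth knowing before picking the field in the content structure.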
Note (v. 4.2+): Some repositories provide a URL to a proxy, such as a delivery servlet, but not a direct URL to the files. The URL itself may resolve to a generic filename (e.g. ‘fulltext’, ‘download’, etc.), preventing Rosetta from harvesting the files properly, especially if some of your records refer to more than one file. Test your repository’s behavior by running a simple wget with the stream source URL to confirm that the expected file is downloaded.
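If you would rather script this check than run wget by hand, the sketch below shows one possible approach in Python. It guesses the filename a download would receive, preferring the Content-Disposition header and falling back to the last segment of the URL path. The function names, the list of generic names and the example URL are all assumptions made for illustration; this is not how Rosetta itself resolves filenames.

```python
import posixpath
from email.message import Message
from urllib.parse import unquote, urlsplit
from urllib.request import Request, urlopen

# Generic names mentioned in the note above; extend as needed (assumption).
GENERIC_NAMES = {"fulltext", "download"}


def filename_from_headers(url, content_disposition=None):
    """Guess the download filename from Content-Disposition, else the URL path."""
    if content_disposition:
        msg = Message()
        msg["Content-Disposition"] = content_disposition
        name = msg.get_filename()
        if name:
            return name
    return unquote(posixpath.basename(urlsplit(url).path))


def is_generic(name):
    """True if the name is one of the known generic placeholders."""
    return name.lower() in GENERIC_NAMES


def probe(url):
    """Send a HEAD request and report the filename the server suggests.

    Not called here because it needs network access; some servers
    reject HEAD, in which case a GET would be needed instead.
    """
    with urlopen(Request(url, method="HEAD"), timeout=30) as resp:
        return filename_from_headers(
            resp.geturl(), resp.headers.get("Content-Disposition")
        )


# Offline example with a hypothetical URL.
url = "https://repository.example.org/article/1001/fulltext"
name = filename_from_headers(url)
print(name, "(generic)" if is_generic(name) else "")
```

A generic result for several different records is a sign that the stream source URL will need help, for example from an xsl transformation.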
Since the harvester creates submission folders, my material flow must be automated and use an NFS submission format.
Creating a Submission Job
Submission Jobs are scheduled tasks that load submission folders into Rosetta. To process the data generated by the OAI-PMH Harvester, we must have a submission job that checks for ready folders. Submission jobs are configured to work with a specific material flow, and since we have a new material flow we now need to create a new submission job.
OAI XSL Transformation
As mentioned above, Rosetta allows me to run an xsl transformation on the OAI-PMH output. This may be needed if I want to exclude or enrich certain metadata fields. Advanced users can leverage this to transform dc output into METS, allowing them to populate DNX fields such as CMS, checksums or access rights, or to map multiple files into separate representations.
As part of preparing my migration, analysis of my data will provide me with the information required for a proper transformation. Running xsl transformations on sample records will reduce the risk of encountering errors during ingest. Take an OOTB Rosetta xsl file as a baseline, and use your XML editor of choice to run your tests.
Creating an OAI-PMH Harvester Job
We now have all the peripheral components we need to configure our harvester job. The next step is to gather the mandatory information about our source. This includes the OAI-PMH base URL and metadata prefix, and the set name. The base URL needs to be everything up to ‘?verb=…’. Once you enter a valid URL, Rosetta will populate the metadata prefix and set fields with the available values returned by the repository.
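To make the base URL rule concrete, here is a small Python sketch that strips the query string from a full OAI-PMH request and rebuilds standard requests from the parts. ListMetadataFormats, ListSets and ListRecords are verbs defined by the OAI-PMH protocol; that Rosetta uses the first two to fill its drop-down fields is an assumption, and the example URL and set name are hypothetical.

```python
from urllib.parse import urlencode


def base_url(request_url):
    """Return everything before the query string, i.e. up to '?verb=...'."""
    return request_url.split("?", 1)[0]


def verb_url(base, verb):
    """Build a parameterless request such as ListMetadataFormats or ListSets."""
    return base + "?" + urlencode({"verb": verb})


def list_records_url(base, metadata_prefix, set_name=None):
    """Build a ListRecords request for a metadata prefix and optional set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_name:
        params["set"] = set_name
    return base + "?" + urlencode(params)


# Hypothetical repository endpoint.
base = base_url("https://repository.example.org/oai/request?verb=Identify")
print(base)
print(verb_url(base, "ListMetadataFormats"))
print(verb_url(base, "ListSets"))
print(list_records_url(base, "oai_dc", "theses"))
```

Opening the ListMetadataFormats and ListSets URLs in a browser is a quick way to see which values the repository advertises before configuring the job.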
- Rosetta does not yet support updating existing records that were harvested via OAI-PMH. The harvester currently offers an option for identifying and ignoring existing records. If your records are updated and are republished, your ongoing job should be configured to ignore these records. If, however, you expect all published records in your collection to be new to Rosetta, leave this checkbox clear.
- Any successful harvest job execution updates a timestamp that is appended to the OAI-PMH request for selective (incremental) harvesting. If you need to start over (for example, your transformation didn’t produce the expected results), select the ‘Ignore Last Run Time’ checkbox. Note: If this is a recurring job, only the first harvest run after selecting this option will be affected.
- You will typically want to avoid selecting ‘Check for Duplicates’ and ‘Ignore Last Run Time’ simultaneously on a production environment.
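The interaction between these options can be sketched in a few lines of Python. This is only a model of the behavior described above, not Rosetta’s implementation: the function and parameter names are invented, and the only protocol fact relied on is that OAI-PMH selective harvesting uses a ‘from’ argument carrying a UTC datestamp.

```python
from datetime import datetime, timezone


def harvest_params(metadata_prefix, last_run=None, ignore_last_run_time=False):
    """Build ListRecords arguments, adding 'from' for incremental harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if last_run is not None and not ignore_last_run_time:
        # OAI-PMH datestamps are expressed in UTC.
        stamp = last_run.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        params["from"] = stamp
    return params


def drop_existing(harvested_ids, already_loaded):
    """Model of the duplicate check: keep only identifiers not yet loaded."""
    return [i for i in harvested_ids if i not in already_loaded]


# Hypothetical timestamp of the last successful run.
last_run = datetime(2015, 3, 1, 12, 0, 0, tzinfo=timezone.utc)

print(harvest_params("oai_dc", last_run))
print(harvest_params("oai_dc", last_run, ignore_last_run_time=True))
print(drop_existing(["oai:1", "oai:2", "oai:3"], {"oai:1", "oai:2"}))
```

With the last run time ignored, the request has no ‘from’ argument and the whole set comes back; if the duplicate check is also on, every record already loaded is then filtered out again, which is why combining the two is rarely useful in production.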
We’ll be uploading examples of xsl files for converting data from other repositories. Feel free to share your own examples, either in comments or in separate posts.
- DSpace: METS with MODS to Rosetta METS with DC. Maps multiple filegroups, files and their checksum values.
- DigiTool DC to Rosetta DC. The trick here is that DigiTool does not provide a direct URL to the files (see above about using Original File Name). This example demonstrates building a filename from the PID and mimetype.
- DigitalCommons (bepress) DC to Rosetta DC. No direct link here either. This repository provides two dc:identifiers for each record, only one of which provides the file, which is named ‘fulltext’. We created a new field with a unique Original File Name.
- ContentDM DC to Rosetta DC. The provided link is to the discovery module only, not to the file itself. We used ContentDM’s APIs and data from other fields to construct a URL to the file – details are in the xsl file.
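The filename logic in the DigiTool example lives in the xsl file itself, but the idea is easy to prototype before writing any xsl. The Python sketch below shows one way to derive a filename from a PID and a mimetype. The mapping table, function name and sample values are assumptions made for illustration and are not taken from the published xsl.

```python
# Hypothetical mimetype-to-extension table; extend to match your own data.
EXTENSIONS = {
    "application/pdf": "pdf",
    "image/jpeg": "jpg",
    "image/tiff": "tif",
}


def original_file_name(pid, mimetype):
    """Build a unique filename from a PID and mimetype.

    Falls back to the bare PID when the mimetype is not in the table.
    """
    ext = EXTENSIONS.get(mimetype.strip().lower())
    return f"{pid}.{ext}" if ext else str(pid)


print(original_file_name("12345", "application/pdf"))
print(original_file_name("12346", "image/tiff"))
print(original_file_name("12347", "application/x-unknown"))
```

Running a table like this against a sample of your records is a cheap way to find mimetypes the transformation does not yet handle.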