Tech Blog

Batch Update of DC records in Rosetta using Metadata Update Job

To update a batch of Dublin Core descriptive metadata records in IEs that are already in Rosetta, users have 3 tools at their disposal:

In this post we will show an example of the last solution, the Metadata Update Job, using two different methods:

Method 1. Publish the updateMD records directly from Rosetta using XSLT

1. Add a new XSL stylesheet

Go to Home > Advanced Configuration > General > Configuration Files and add a new XSL stylesheet

  • In the ‘File Group’ drop-down list, choose ‘IE XSL Transformation’, and in the ‘Sub-Group’ choose ‘Publishing’.
  • Give your stylesheet a name, such as ‘../xsl/mets2updateMD’, and a description, such as ‘Publishing’.
  • Paste the XSL code in the text box.
<?xml version="1.0" encoding="UTF-8"?>

<!-- Add your custom namespaces to the list and to exclude-result-prefixes! -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:dnx="http://www.exlibrisgroup.com/dps/dnx"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:mods="http://www.loc.gov/mods/v3"
    xmlns:mets="http://www.loc.gov/METS/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="http://com/exlibris/digitool/repository/api/xmlbeans"
    exclude-result-prefixes="dc dcterms mods xsi dnx mets"
    version="1.0">

  <xsl:strip-space elements="*" />
  <xsl:output method="xml" encoding="UTF-8" />

  <!-- Main -->
  <xsl:template match="/">
    <updateMD xmlns="http://com/exlibris/digitool/repository/api/xmlbeans">
        <xsl:apply-templates select="//mets:techMD[@ID='ie-amd-tech']//dnx:key[../dnx:key[@id='internalIdentifierType']/text()='PID' and @id='internalIdentifierValue']" />
        <metadata>
          <type>descriptive</type>
          <subType>dc</subType>
          <content>
            <xsl:text disable-output-escaping="yes">&lt;![CDATA[</xsl:text>
            <xsl:apply-templates select="//dc:record" />
            <xsl:text disable-output-escaping="yes">]]&gt;</xsl:text>
          </content>
        </metadata>
    </updateMD>
  </xsl:template>

  <!-- Identity transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>

  <!-- Copy PID -->
  <xsl:template match="//mets:techMD[@ID='ie-amd-tech']//dnx:key[../dnx:key[@id='internalIdentifierType']/text()='PID' and @id='internalIdentifierValue']">
    <PID>
      <xsl:apply-templates select="node()" />
    </PID>
  </xsl:template>

</xsl:stylesheet>

The above XSL code creates the updateMD document structure required by the Update Metadata job, but does not actually make any change in the Dublin Core record.

To apply changes to the DC record, add your own templates after the last template in the XSL. The stylesheet relies on the identity template, a technique used to copy all nodes and attributes of an XML document while changing selected parts of the copy.

The following web page gives a good overview of what can be achieved this way: http://www.xmlplease.com/xsltidentity

Here are some examples applied to the Rosetta dc:record structure:

<!-- Rename a tag and remove its attribute -->
  <xsl:template match="dc:title[@xsi:type='dcterms:alternative']">
    <dcterms:alternative>
      <xsl:apply-templates select="node()"/>
    </dcterms:alternative>
  </xsl:template>

  <!-- Remove DC fields -->
  <xsl:template match="dc:language" />

  <!-- Add a new tag before the dc:title tag -->
  <xsl:template match="dc:title">
    <dc:description>Some text</dc:description>
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

… and it is of course possible to achieve much more with XSLT and more precise XPath expressions.

2. Create a new publishing profile

Go to Home > Data Management > Publishing Configuration and add a new publishing configuration. The General Details and Sets tabs do not differ from any other publishing job, so we’ll skip them to concentrate on the two Add Publishing Profile screens.

Add Publishing Profile – Step 1

In ‘Converter Type’, choose ‘XSLIEConverterPlugin’. In ‘Target Type’, choose ‘NFSPublisherPlugin’.

Add Publishing Profile – Step 2

In XSL-IEconverter Converter Parameters, ‘XSL File’, choose the XSL file created in step 1.
In NFS-Publisher Publisher Parameters:

  • In ‘Folder’, choose a folder on your NFS, for example /operational_shared/mdupdate
  • In ‘Number Of Sub-directories’, enter 1, as we want all IEs to be in the same directory. We can later reuse this directory in the update job by entering ‘/operational_shared/mdupdate/0’, where ‘0’ is the directory created by the publishing job.
  • In ‘File Extension’, enter ‘xml’

After saving the Profile, go to the General Details Tab, check the ‘Republish all on next sync’ box and click on the Sync button to start publishing the IEs to the disk.
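Before running the update job, it can be useful to check that a published file has the structure the stylesheet is meant to produce. The following is a minimal Python sketch, not part of Rosetta: the function name check_update_md and the sample document (including the PID IE12345) are hypothetical. It only verifies that the file is well-formed, has a PID, and carries a parseable dc:record in its content element.

```python
import xml.etree.ElementTree as ET

# Namespaces taken from the stylesheet above.
UPDATE_NS = "http://com/exlibris/digitool/repository/api/xmlbeans"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical example of a published file; the PID is made up.
SAMPLE = """<updateMD xmlns="http://com/exlibris/digitool/repository/api/xmlbeans">
  <PID>IE12345</PID>
  <metadata>
    <type>descriptive</type>
    <subType>dc</subType>
    <content><![CDATA[<dc:record xmlns:dc="http://purl.org/dc/elements/1.1/"><dc:title>Example</dc:title></dc:record>]]></content>
  </metadata>
</updateMD>"""


def check_update_md(xml_text):
    """Return (pid, number of fields in the record) or raise ValueError."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{UPDATE_NS}}}updateMD":
        raise ValueError(f"unexpected root element: {root.tag}")
    pid = root.findtext(f"{{{UPDATE_NS}}}PID")
    if not pid:
        raise ValueError("missing or empty PID")
    content = root.findtext(f"{{{UPDATE_NS}}}metadata/{{{UPDATE_NS}}}content")
    if not content or not content.strip():
        raise ValueError("missing or empty content")
    # The CDATA payload is itself XML; parsing it catches malformed records.
    record = ET.fromstring(content.strip())
    if record.tag != f"{{{DC_NS}}}record":
        raise ValueError(f"unexpected payload element: {record.tag}")
    return pid, len(list(record))


print(check_update_md(SAMPLE))
```

To check real files, read each one from the publishing folder and pass its text to the function; a ValueError or a parse error points to a file that should be inspected before the job is run.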

You can jump to the last part of this post to run the metadata update job.

Method 2. Publish the IE’s full METS structure and process it externally using a script

This method is very similar to the first one, except that it publishes the whole METS and uses a script to update the DC record and create the input format for the job. The advantage is that you can use the scripting language of your choice.

1. Add a new XSL sheet to Home > Advanced Configuration > General > Configuration Files

In the ‘File Group’ drop-down list, choose ‘IE XSL Transformation’, and in the ‘Sub-Group’ choose ‘Publishing’.
Give your stylesheet a name, such as ‘../xsl/mets2updateMD’, and a description, such as ‘Publishing’.
Paste the XSL code in the text box.

<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:dnx="http://www.exlibrisgroup.com/dps/dnx"
  xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
  version="1.0">
  <xsl:template match="/">
    <xsl:copy-of select="/"/>
  </xsl:template>
</xsl:stylesheet>

Do not forget to declare your custom namespaces, if you have any, in the xsl:stylesheet element.

2. Create a new publishing profile in Home > Data Management > Publishing Configuration.

The procedure is identical to the one described in Method 1.

3. Extract and update the metadata

Create the mets2updateMD.pl script using the following Perl code in the export directory and run it using the following command:

perl mets2updateMD.pl

Note: To run, this script needs the Perl module XML::LibXML to be installed, which in turn requires libraries such as libxml2, libxml2-dev and zlib1g-dev. Some Linux distributions provide a package, such as libxml-libxml-perl in Ubuntu, or you can use cpan after having installed the libraries.

If the sub chgrec() is left empty, the script will only extract the metadata without any changes. Use the XML::LibXML methods to add elements, change values and attributes, etc.

See http://search.cpan.org/dist/XML-LibXML/LibXML.pod

#!/usr/bin/perl

use strict;
use warnings;
use XML::LibXML;

my @records;
mkdir('ies_out') unless(-d 'ies_out');

opendir(my $dh, ".") || die "Cannot open current directory: $!";
# Keep only published IE files such as IE12345.xml
my @files = grep { /^IE\d+\.xml$/ && -f $_ } readdir($dh);
closedir($dh);

for my $file (@files) {
    extrdc($file);
    $file =~ s/\.xml$//;
    for my $record (@records) {
        chgrec($record);
        output($record,$file);
    }
}

#Extracts the dc:record from the mets
sub extrdc {
    my $file = shift;
    my $parser = XML::LibXML->new();
    $parser->recover(1);
    $parser->keep_blanks(0);
    my $doc = $parser->load_xml(location => $file);
    my $xc = XML::LibXML::XPathContext->new($doc);
    $xc->registerNs("dc","http://purl.org/dc/elements/1.1/");
    @records = $xc->findnodes('//dc:record');
}

#Applies changes on the record
sub chgrec {
    my $record = shift;
# Insert your code here, for example
# Remove a field
#    $_->parentNode->removeChild($_) for $record->findnodes('dc:language');
# Add a field:
#    $record->setNamespace( "http://purl.org/dc/elements/1.1/", "dc" );
#    $record->appendTextChild( "language", "en" );
# Change a dc field to a dcterms field and remove the attribute:
#    for my $alt ($record->findnodes('dc:title[@xsi:type="dcterms:alternative"]')) {
#        $alt->removeAttribute("xsi:type");
#        $alt->setNamespace( "http://purl.org/dc/terms/" , "dcterms" );
#        $alt->setNodeName("alternative");
#    }
}

#Creates the updateMD file
sub output {
    my $record = shift;
    my $pid = shift;
    my $out = XML::LibXML->createDocument( "1.0", "UTF-8" );
    my $root = $out->createElementNS( "http://com/exlibris/digitool/repository/api/xmlbeans" , "updateMD" );
    $out->setDocumentElement( $root );
    $root->appendWellBalancedChunk( "<PID>$pid</PID><metadata><type>descriptive</type><subType>dc</subType><content /></metadata>" );
    my @contents = $root->getElementsByTagName("content");
    for (@contents) {
        # Serialize the record explicitly before wrapping it in a CDATA section
        my $cdata = XML::LibXML::CDATASection->new($record->toString());
        $_->appendChild( $cdata );
    }
    $out->toFile("ies_out/$pid.xml","1");
}
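Since Method 2 lets you pick the language, the same three steps (extract, change, output) can also be sketched in Python using only the standard library. This is a rough equivalent of the Perl script, not a drop-in replacement: the function names, the sample METS fragment and the PID are made up for illustration. One caveat: ElementTree only declares namespaces used in element or attribute names, so a prefix that appears only inside an attribute value (as in xsi:type="dcterms:alternative") may lose its declaration in the output.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

DC_NS = "http://purl.org/dc/elements/1.1/"
UPDATE_NS = "http://com/exlibris/digitool/repository/api/xmlbeans"
ET.register_namespace("dc", DC_NS)  # keep the dc: prefix when serializing

# Hypothetical, heavily reduced METS document used only for this example.
METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:dmdSec ID="ie-dmd">
    <mets:mdWrap MDTYPE="DC">
      <mets:xmlData>
        <dc:record xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example title</dc:title>
          <dc:language>fr</dc:language>
        </dc:record>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
</mets:mets>"""


def extract_records(mets_text):
    """Return every dc:record element found in a METS document."""
    root = ET.fromstring(mets_text)
    return list(root.iter(f"{{{DC_NS}}}record"))


def change_record(record):
    """Example change: drop existing dc:language fields, then add a new one."""
    for lang in record.findall(f"{{{DC_NS}}}language"):
        record.remove(lang)
    ET.SubElement(record, f"{{{DC_NS}}}language").text = "en"


def build_update_md(pid, record):
    """Wrap a dc:record in the updateMD structure and return UTF-8 bytes."""
    doc = minidom.Document()
    root = doc.createElement("updateMD")
    root.setAttribute("xmlns", UPDATE_NS)  # minidom does not add this itself
    doc.appendChild(root)

    def add(parent, name, text=None):
        node = doc.createElement(name)
        if text is not None:
            node.appendChild(doc.createTextNode(text))
        parent.appendChild(node)
        return node

    add(root, "PID", pid)
    metadata = add(root, "metadata")
    add(metadata, "type", "descriptive")
    add(metadata, "subType", "dc")
    content = add(metadata, "content")
    # strip() drops the whitespace that followed the record in the METS file
    payload = ET.tostring(record, encoding="unicode").strip()
    content.appendChild(doc.createCDATASection(payload))
    return doc.toxml(encoding="UTF-8")


record = extract_records(METS)[0]
change_record(record)
print(build_update_md("IE12345", record).decode("utf-8"))
```

In a real run, the PID would be derived from the published file name, as the Perl script does, and the returned bytes would be written to a file named after that PID.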

Update IEs in Rosetta

  1. Create a directory under /operational_shared, for example:

/operational_shared/md_update/INS01/updateMdJob1

  2. Place all the <PID>.xml files created by the publishing job or the script under this directory.
  3. Add a new update job in Data Management > Advanced Tools > Update Metadata.
  4. In the User ID and User Password fields, enter the credentials of a staff user with sufficient privileges, and in the NFS Path field, enter the directory you have created.
  5. Save and click on Run Now to run the job. Logs can be seen in server.log.
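Placing the files in the job directory can be done by hand or scripted. As a small sketch under assumptions, the hypothetical helper below copies every .xml file from an output folder (such as the ies_out folder written by the Perl script) into a job directory, creating that directory if needed. The paths in the usage comment are only examples.

```python
import shutil
from pathlib import Path


def stage_files(source_dir, job_dir):
    """Copy every *.xml file from source_dir into job_dir; return the names copied."""
    source, target = Path(source_dir), Path(job_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.glob("*.xml")):
        shutil.copy2(path, target / path.name)
        copied.append(path.name)
    return copied


# Example call (paths are illustrative):
# stage_files("ies_out", "/operational_shared/md_update/INS01/updateMdJob1")
```

Returning the list of copied names makes it easy to compare the count against the number of IEs that were published.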

Please refer to this article for detailed instructions

 
