File Splitters

General Purpose

A file splitter is a plug-in application that allows you to implement your own parsing methodology and integrate it into the Primo pipe flow. If you need to load an unsupported file format into Primo, you can implement a new file splitter that corresponds to the new file structure.

The file splitter plug-in receives a file to parse and a “record saver” object, which is used to save each parsed record into Primo. A file splitter splits the received file into records and passes them to the “record saver” object to be saved into the Primo database.

Primo ships with many ready-to-use file splitters. You need to implement a new file splitter only if you must parse a file format that is not currently supported, which is unlikely.

When running a Primo pipe using the harvesting of files option (instead of re-piping existing records), the harvested files are parsed and stored in the database. Because a file may contain more than one record, the parser may create more than one record in the database. The records are stored in the P_SOURCE_RECORD table. The NEP process reads these records, creates PNX records, and stores the PNX records in the P_PNX table.

In previous releases, Primo could only parse the following formats of harvested files:

  • XML files in an OAI structure
  • MARC Exchange file formats

For any other format, an XSLT program was needed to convert the XML file to an OAI format. An XSLT program receives an XML file and transforms it into a differently structured XML file.

XSLT has several drawbacks:

  • Performance – XSLT is usually slower than other transformation methods.
  • Memory consumption – An XSLT program loads the whole processed file into memory, so transforming a large file may result in an out-of-memory error.
  • Complexity – Writing XSLT programs requires knowledge of the XSLT syntax.
  • Flexibility – Certain operations might not be possible using XSLT.

Configuring a pipe to use a file splitter

To parse harvested files, you must use the Primo Home > Ongoing Configuration Wizards > Pipe Configuration Wizard > Data Sources Configuration > Data Sources page to associate a file splitter with the data source.

Configuring File Splitters

The following mapping tables were added to configure the file splitters:

  • File Splitters
  • File Splitter Params

File Splitters

This mapping table is located in the Publishing subsystem and configures the file splitters that display in the drop-down list on the Data Source page in the Back Office. Each mapping row contains the following fields:

  • Name – The display name of the file splitter in the list.
  • Splitter class – A fully-qualified name of a class that implements the file splitter JAVA interface (also known as a plug-in implementation).
  • Enabled – Indicates whether this file splitter is accessible from the Back Office.
  • Description – The description for the file splitter.

File Splitter Params

This mapping table is located in the Publishing subsystem and configures which parameters are sent to the file splitter.
Note that not all file splitters receive parameters.

Each mapping row contains the following fields:

  • Enabled – Indicates whether the parameter is sent.
  • Param name/value – A key/value passed to the file splitter. The file splitter should use the parameter name to retrieve the needed value.
  • File Splitter name – A drop down list of the configured file splitters. Select the file splitter to which you want this parameter to be passed.

A file splitter may be passed as many parameters as you need.

Note: Do not change the parameters for the predefined file splitters. Any change to these parameters may result in a nonfunctioning file splitter.

Testing Your File Splitter

The Test File Splitter tool has been added to the Primo Home > Primo Utilities > System Tests & Monitor page to test the output of your file splitter without executing a pipe. To test a file splitter, select a file splitter to test, the file to be parsed (an uncompressed file), the character set, and the number of records to be printed. To display specific records, enter the record IDs separated with commas in the Record Id’s field.

XML File Splitter

The XML file splitter is used for harvested files that are in an XML format. To use this file splitter, specify the following class name in the File Splitters mapping table:

  • com.exlibris.primo.publish.platform.harvest.splitters.generic.DomXmlSplitter

The following table lists the supported parameters. The XPath Support column takes one of the following values:

  • Full XPath support – You can use the full syntax supported by XPath, as defined in http://www.w3.org/TR/xpath. The XPath should start from the record start path as defined in FullRecordXPath and should NOT include all tags from the beginning of the file. You must add “//” to the beginning of the XPath.
  • Partial XPath support – An XPath is expected, but only simple paths to elements are supported. Rules based on the value of an attribute, such as XPath expressions like “item []”, are not supported. The path must include all tags from the beginning of the file.

Name | Description | Explanation | XPath Support | Mandatory
RootXpath
  • Description – The XPath to the first tag in the file.
  • Explanation – For example, for the following file:
    <records>
    <record> </record>
    <record> </record>
    </records>
    Use the value: records
  • XPath support – Partial
  • Mandatory – Yes
FullRecordXPath
  • Description – The XPath to the beginning of a record. A file may contain more than one record.
  • Explanation – For example, for the following file:
    <records>
    <record> </record>
    <record> </record>
    </records>
    Use the value: records/record
  • XPath support – Partial
  • Mandatory – Yes
IdentifierXpath
  • Description – The XPath to the identifier tag of the record. This should be the tag holding the unique identifier of the record.
  • Explanation – For example, for the following file:
    <records>
    <record><id>123</id> </record>
    <record><id>124</id> </record>
    </records>
    Use the value: //record/id
    Note – You do not need to add “records” to the path. The path starts from the root of the record.
  • XPath support – Full
  • Mandatory – Yes
StatusXpath
  • Description – The XPath to the location of the deleted status of the record. The default value for a record is: deleted = false.
  • Explanation – For example, for the following file:
    <records>
    <record>
    <id>123</id><status>Y</status>
    </record>
    <record>
    <id>124</id><status>N</status>
    </record>
    </records>
    Use the value: //record/status
  • XPath support – Full
  • Mandatory – No
StatusWhenDeleted
  • Description – A regular expression that is run against the value found under the StatusXpath path. If the regular expression matches the value found, the record is marked as deleted.
  • Explanation – For example, for the following file:
    <records>
    <record>
    <id>123</id><status>Y</status>
    </record>
    <record>
    <id>124</id><status>N</status>
    </record>
    </records>
    If StatusXpath is defined as //record/status, define StatusWhenDeleted = Y.
    The first record is marked as deleted and the second record is not.
  • XPath support – No
  • Mandatory – No (but if StatusXpath is defined, StatusWhenDeleted is mandatory)
ContentXpath
  • Description – If you do not want to pass the whole record content to the normalization rules, you can define a subsection of the record instead. If ContentXpath is not defined, the content is assumed to be the whole record as defined in FullRecordXPath.
  • Explanation – For example, for the following file:
    <records>
    <record>
    <subRecord>
    <id>123</id><status>Y</status>
    </subRecord>
    </record>
    </records>
    ContentXpath: //record/subRecord
  • XPath support – Full
  • Mandatory – No
ExternalResourceSourceXpath
  • Description – An XPath to the location of a tag holding the path to a full text file. The file path is assumed to point to the machine’s hard drive. If the file path does not include a file extension (suffix), the file is searched for first with a PDF extension and then with an HTML extension. If the file is not found, it is searched for using the original path. If the path starts with “http”, the file is searched for on the web.
  • Explanation – For example, for the following file:
    <records>
    <record>
    <path>www.xxx.com/file.pdf</path>
    </record>
    </records>
    ExternalResourceSourceXpath: //record/path
  • XPath support – Full
  • Mandatory – No
MultipleFulltext
  • Description – Indicates whether there are multiple tags holding full text. If set to true, multiple full texts are extracted. Possible values are true and false (the default value is false).
  • Explanation – For example, for the following file:
    <records>
    <record>
    <path>www.xxx.com/file1.pdf</path>
    <path>www.xxx.com/file2.pdf</path>
    </record>
    </records>
  • Mandatory – No
ExtRsrcXpathRegexp
  • Description – A regular expression run against the path extracted from ExternalResourceSourceXpath. Use this if you want to use only a part of the full file path. Surround the text to be extracted with round brackets. For example:
    Regular expression: .*(xx.*)
    Text extracted: all characters from the occurrence of “xx” until the end of the string.
    /aa/bb/xx/file.pdf -> xx/file.pdf
    If the file path does not include a file extension (suffix), the file is searched for first with a PDF extension and then with an HTML extension.
  • Explanation – For example, for the following file:
    <records>
    <record>
    <path>www.xxx.com/a/b/file</path>
    </record>
    </records>
    ExtRsrcXpathRegexp: .*(b.*)
    Output:
    www.xxx.com/a/b/file -> b/file
  • XPath support – Full
  • Mandatory – No
ExternalFilesPath
  • Description – A string to be used as the prefix of the path found in ExternalResourceSourceXpath. Use this to add the folder location that holds your full text files.
  • Explanation – ExternalFilesPath = /Exlibris/primo/fulltexts/.
  • XPath support – Full
  • Mandatory – No
ExternalResourceTargetXpath
  • Description – If defined, the full text is stored in the generated XML. Define an XPath under which to store the full text. The XPath can be any existing XPath or may include new tags at the end, which results in new tags being added to the file. An additional <Fulltext> tag wraps the text under the given XPath. This is needed if both ExternalResourceSourceXpath and EmbeddedFullTextXpath are defined, in which case you may have more than one full text.
  • Explanation – For example:
    EmbeddedFullTextXpath = //record/myTag
    will generate the following addition to the generated XML:
    <record>
    <myTag>
    <Fulltext>text_1</Fulltext>
    <Fulltext>text_2</Fulltext>
    </myTag>
    </record>
    Or (without a new tag added):
    EmbeddedFullTextXpath = //record
    will generate the following addition to the generated XML:
    <record>
    <Fulltext>text_1</Fulltext>
    <Fulltext>text_2</Fulltext>
    </record>
  • XPath support – Full
  • Mandatory – No. However, if ExternalResourceSourceXpath is defined, either ExternalResourceTargetXpath or AddExtensionsToExtensionsTable is mandatory.
ExternalResourceType
  • Description – Relevant only when AddFullTextToExtension and ExternalResourceTargetXpath are used. Defines a type for the extension. By default, the extension type is FULLTEXT. Use this parameter to overwrite the type value. This type should correlate with the PNX_EXTENSIONS_MAPPING mapping table in the Front End subsystem.
  • Explanation – For example: ExternalResourceType = abstract
  • XPath support – Full
  • Mandatory – No
AddExtensionsToExtensionsTable
  • Description – If defined, the extracted full text is stored in the P_PNX_EXTENSION table.
  • Explanation – The value for this parameter has no meaning. This is just a flag. If the parameter name AddExtensionsToExtensionsTable is defined, the flag is on.
  • XPath support – Full
  • Mandatory – No. However, if ExternalResourceSourceXpath is defined, either ExternalResourceTargetXpath or AddExtensionsToExtensionsTable is mandatory.
EmbeddedFullTextXpath
  • Description – An XPath to the location of the full text, which is assumed to be embedded in the harvested file.
  • Explanation – For example, for the following file:
    <records>
    <record>
    <ft>Full text data is embedded</ft>
    </record>
    </records>
    EmbeddedFullTextXpath = //record/ft
  • XPath support – Full
  • Mandatory – No
EmbeddedFullTextType
  • Description – Relevant only when AddFullTextToExtension and EmbeddedFullTextXpath are used. Defines a type for the extension. By default, the extension type is FULLTEXT. Use this parameter to overwrite the type value. This type should correlate with the PNX_EXTENSIONS_MAPPING mapping table in the Front End subsystem.
  • Explanation – For example: EmbeddedFullTextType = abstract
  • XPath support – Full
  • Mandatory – No
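
The way FullRecordXPath, IdentifierXpath, StatusXpath, StatusWhenDeleted, and ExtRsrcXpathRegexp work together can be illustrated with a small Python sketch. This is not Primo code: the function names are hypothetical, the paths are written relative to the root element as Python's ElementTree expects (so they differ from the values you would enter in Primo), and it assumes that the regular expression must match the whole status value and that a path with no match is left unchanged.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """<records>
<record><id>123</id><status>Y</status></record>
<record><id>124</id><status>N</status></record>
</records>"""


def split_records(xml_text, record_path, id_path, status_path, status_when_deleted):
    """Return (record_id, deleted) for every record found under record_path."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.findall(record_path):
        record_id = record.findtext(id_path)
        status = record.findtext(status_path, default="")
        # Assumption: the expression must match the whole status value.
        deleted = re.fullmatch(status_when_deleted, status) is not None
        results.append((record_id, deleted))
    return results


def extract_resource_path(path, pattern):
    """Keep only the part of the path captured by the round brackets."""
    match = re.match(pattern, path)
    # Assumption: if the expression does not match, the path is left unchanged.
    return match.group(1) if match else path


records = split_records(SAMPLE, "record", "id", "status", "Y")
print(records)
print(extract_resource_path("/aa/bb/xx/file.pdf", r".*(xx.*)"))
```

Because the leading `.*` is greedy, the captured text starts at the last place in the path where the bracketed expression can match.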

Ext parameters – The ExtType<grp>, ExtValue<grp>, and ExtOverride<grp> parameters can be used more than once, as indicated by the group number suffix <grp>. These parameters allow you to insert a hard-coded extension value and type.

In the following example, the parameters are configured for groups 1 and 2:

Group 1:

  • ExtType1=ABSTRACT
  • ExtValue1=”This is the abstract for Harry Potter”
  • ExtOverride1=true

Group 2:

  • ExtType2=ABSTRACT
  • ExtValue2=”My first abstract”
  • ExtOverride2=false

The following rows describe each of these parameters:

ExtType
  • Description – Defines the extension type that will be used. The default value is FULLTEXT. This type should correlate with the PNX_EXTENSIONS_MAPPING mapping table in the Front End subsystem.
  • Explanation – ExtType1 = ABSTRACT
  • XPath support – Full
  • Mandatory – No
ExtValue
  • Description – The value to be inserted as an extension.
  • XPath support – Full
  • Mandatory – No
ExtOverride
  • Description – The default value is true. If true, previous extensions with the same type for this record are deleted before this extension is inserted. If false, the new extension is added to the existing extensions.
  • Explanation – ExtOverride3 = true
  • XPath support – Full
  • Mandatory – No
FilterHtmlFromFullText
  • Description – A flag to remove HTML tags from the extracted full text.
  • Explanation – The value for this parameter has no meaning. This is just a flag. If the parameter name FilterHtmlFromFullText is defined, the flag is on.
  • XPath support – Full
  • Mandatory – No
StringNormalizers
  • Description – Normalizer names separated by a “;” (semicolon). Currently one normalizer is defined, called AmpNormalizer. It handles illegal & signs, which cause errors when parsing an XML file.
  • Explanation – StringNormalizers = AmpNormalizer
  • XPath support – Full
  • Mandatory – No
ConnectionTimeout
  • Description – The amount of time in seconds to wait when trying to connect to a web server to access full text from the web.
  • Explanation – Default is 2
  • XPath support – Full
  • Mandatory – No
ReadTimeOut
  • Description – The amount of time in seconds to wait for reading the full text content from a web server.
  • Explanation – Default is 20
  • XPath support – Full
  • Mandatory – No
NamespaceXpath
  • Description – An XPath pointing to a tag containing the namespaces of the file. In some cases a tag that precedes the record content holds the namespaces. When only the record part is extracted from the XML, the extracted record lacks the needed namespaces. Adding this parameter ensures that the namespaces are copied into the extracted record, which enables the parsing of the generated XML during the NEP phase.
  • Explanation – For example, for the following file:
    <records xmlns:g="ynet.co.il">
    <record>
    <g:id>123</g:id>
    </record>
    </records>
    NamespaceXpath = records
    The generated XML will look like this:
    <record xmlns:g="ynet.co.il">
    <g:id>123</g:id>
    </record>
  • XPath support – Partial
  • Mandatory – No
GenerateNameSpaces
  • Description – A flag to add undeclared namespaces. Use this flag if the file contains namespaces that are not declared.
  • Explanation – The value for this parameter has no meaning. This is just a flag. If the parameter name GenerateNameSpaces is defined, the flag is on.
    For example, for the following file:
    <records >
    <record>
    <g:id>123</g:id>
    </record>
    </records>
    NamespaceXpath = records
    The generated XML will look like this:
    <record xmlns:g="http://1">
    <g:id>123</g:id>
    </record>
  • XPath support – Full
  • Mandatory – No
FilterTagsXpath<number>
  • Description – Handles nested tags. A nested tag is a case such as:
    <doc>
    <a>value1 <b>value2</b>
    value3</a>
    <xx>value1 <i>value2</i>
    value3</xx>
    </doc>
    In such cases, an XPath to this location, such as //doc/a, returns the value: value1 value3. If you also want to extract the value under the nested tag, so that the extracted value is value1 value2 value3, use this parameter with the wanted path (for example, FilterTagsXpath1 = //doc/a).
    This parameter can be used as many times as needed by using a different group number (for example, FilterTagsXpath1 = //doc/a, FilterTagsXpath2 = //doc/xx).
  • Explanation – For example, for the following file:
    <doc>
    <a>value1 <b>value2</b>
    value3</a>
    </doc>
    FilterTagsXpath1 = //doc/a
    The generated XML will look like this:
    <doc>
    <a> value1 value2 value3
    <b>value2</b>
    </a>
    </doc>
    As a result, requesting the value for the path //doc/a returns: value1 value2 value3
  • XPath support – Full
  • Mandatory – No
SplitRecordId
  • Description – By default, the value under IdentifierXpath is split by “:”, and the rightmost value is used as the identifier. To disable the splitting, define this parameter with the value “false”. This is supported from version 4.5 and on.
  • Explanation – For the value oai:digi-tool.com:233566, the identifier is 233566.
  • XPath support – Full
  • Mandatory – No
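
The effect of SplitRecordId and FilterTagsXpath can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the splitter's implementation; the function names are hypothetical, and collapsing whitespace in the flattened text is an assumption.

```python
import xml.etree.ElementTree as ET


def split_record_id(raw_id, split=True):
    """Return the rightmost ':'-separated part unless splitting is disabled."""
    return raw_id.rsplit(":", 1)[-1] if split else raw_id


def flattened_text(element):
    """Join the text of an element and all of its nested tags.

    Assumption: runs of whitespace (including line breaks) collapse to one space.
    """
    return " ".join("".join(element.itertext()).split())


doc = ET.fromstring("<doc><a>value1 <b>value2</b>\nvalue3</a></doc>")

print(split_record_id("oai:digi-tool.com:233566"))         # default behavior
print(split_record_id("oai:digi-tool.com:233566", False))  # SplitRecordId = false
print(flattened_text(doc.find("a")))
```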

Treating namespaces in an XPath

When defining an XPath, you sometimes need to take namespaces into account.
You do not need to understand namespaces in depth, but you must be able to identify them and treat them in a special way when defining XPaths.
An XML file may contain a tag with a prefix followed by a colon, such as “test:”, resulting in a tag that looks like <test:publisher>.
The prefix “test:” indicates a namespace. When an XPath includes a tag with a namespace, entering only the tag name (“publisher” in the previous example) does not work.
To handle the namespace, use the following expression: *[local-name()='publisher']. This matches a tag with any prefix whose local name is “publisher”.
For example, for the file:

<metadata>

  <xb:digital_entity>

    <pid>233566</pid>

  </xb:digital_entity>

</metadata>

You will define the following XPath in order to get to the “pid” value:
//metadata/*[local-name()='digital_entity']/pid
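
The idea behind local-name() can be shown with a short Python sketch. Python's standard ElementTree does not evaluate the local-name() function, so the sketch compares the local part of each tag name by hand, which has the same effect. The namespace URI bound to the xb prefix is made up for the example (the sample file above does not declare one), and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The xb namespace URI below is an assumption made so that the sample parses.
XML = """<metadata xmlns:xb="http://example.org/xb">
  <xb:digital_entity>
    <pid>233566</pid>
  </xb:digital_entity>
</metadata>"""


def local_name(tag):
    """Strip the '{namespace-uri}' part that ElementTree puts in front of a tag."""
    return tag.rsplit("}", 1)[-1]


def find_by_local_names(root, names):
    """Walk down from root, matching each level on the local tag name only."""
    current = [root]
    for name in names:
        current = [
            child
            for node in current
            for child in node
            if local_name(child.tag) == name
        ]
    return current


root = ET.fromstring(XML)
pids = [el.text for el in find_by_local_names(root, ["digital_entity", "pid"])]
print(pids)
```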

HTML File Splitter

This file splitter is used for harvested files that are in HTML format.

The generated XML includes the following sections:

  • Meta params – will store all data found in the HTML under the “meta” tags. See http://www.w3schools.com/tags/tag_meta.asp for more information on meta tags in HTML files.
  • Substrings – A set of parameters can be defined to extract specific parts of the file. These parts will be placed under the <substrings> section.

The resulting XML will look like this:

<record>

    <description>Free Web tutorials</description>

    <keywords> HTML,CSS,XML,JavaScript</keywords>

    <author>Hege Refsnes</author>


    <substrings>

        <sub1>blah blah blah…</sub1>

        <sub2>kuku kuku kuku…</sub2>

    </substrings>

</record>

To use this file splitter, specify the following class name in the File Splitters mapping table:

  • com.exlibris.primo.publish.platform.harvest.splitters.html.HTMLFileSplitter

The following table lists the supported parameters:

Name | Description | Explanation | XPath support | Mandatory
IdentifierXpath
  • Description – A meta param name holding the unique identifier of the record.
  • Explanation – For example, for the following file:
    <html>
    <head>
    <meta name="id">123</meta>
    </head>
    <body>
    </body>
    </html>
    IdentifierXpath = id
  • XPath support – No
  • Mandatory – Yes
StatusXpath
  • Description – A meta param name holding the deleted status of the record.
  • Explanation – For example, for the following file:
    <html>
    <head>
    <meta name="status">Y</meta>
    </head>
    <body>
    </body>
    </html>
    StatusXpath = status
  • XPath support – No
  • Mandatory – No
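StatusWhenDeleted
  • Description – Same as for the XML file splitter.
ExternalResourceSourceXpath
  • Description – Same as for the XML file splitter, except that this is the name of a meta param and not an XPath.
  • XPath support – No
  • Mandatory – No
ExtRsrcXpathRegexp
  • Description – Same as for the XML file splitter.
  • XPath support – No
  • Mandatory – No
ExternalFilesPath
  • Description – Same as for the XML file splitter.
  • XPath support – No
  • Mandatory – No
AddFullTextToExtension
  • Description – Same as for the XML file splitter.
  • XPath support – No
  • Mandatory – No
RemoveDoctype
  • Description – Removes the doctype line from the HTML file. In some cases this is needed so that the file can be parsed.
  • Explanation – For example, for the following file:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html>
    <head>
    The doctype declaration is removed, leaving:
    <html>
    <head>
  • XPath support – No
  • Mandatory – No
ConnectionTimeout
  • Description – Same as for the XML file splitter.
  • XPath support – No
  • Mandatory – No
ReadTimeOut
  • Description – Same as for the XML file splitter.
  • XPath support – No
  • Mandatory – No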

Substring params – The SubStringStart<grp>, SubStringEnd<grp>, SubStringFilter<grp>, and SubStringTag<grp> parameters can be used more than once, as indicated by the group number suffix <grp>.

In the following example, the parameters are configured for groups 1 and 2:

Group 1:

  • SubStringStart1=”kuku”
  • SubStringEnd1=”dudu”
  • SubStringFilter1=true
  • SubStringTag1 = myTagKuku

Group 2:

  • SubStringStart2=”bubu”
  • SubStringEnd2=”lulu”
  • SubStringFilter2=false
  • SubStringTag2 = myTagBubu

These parameters allow you to extract any part of an HTML file and place it in the generated XML.

SubStringStart
  • Description – A unique string to be searched for that identifies the beginning of the text to be extracted.
  • XPath support – No
  • Mandatory – No
SubStringEnd
  • Description – A unique string to be searched for that identifies the end of the text to be extracted.
  • XPath support – No
  • Mandatory – No
SubStringFilter
  • Description – Removes HTML tags from the extracted text.
  • Explanation – The value for this parameter has no meaning. This is just a flag. If the parameter name SubStringFilter is defined, the flag is on.
  • XPath support – No
  • Mandatory – No
SubStringTag
  • Description – A tag name that is placed under the <substrings> section and holds the extracted text. If SubStringTag=“EXTENSION”, the extracted data is inserted into the P_PNX_EXTENSION table and is not located under the <substrings> section.
  • XPath support – No
  • Mandatory – No
SplitRecordId
  • Description – By default, the value under IdentifierXpath is split by “:”, and the rightmost value is used as the identifier. To disable the splitting, define this parameter with the value “false”. This is supported from version 4.5 and on.
  • Explanation – For the value oai:digi-tool.com:233566, the identifier is 233566.
  • XPath support – Full
  • Mandatory – No
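
The substring parameters can be pictured with the following Python sketch. It is a rough illustration under stated assumptions rather than the splitter's actual logic: the function names are hypothetical, the start and end marker strings are assumed to be excluded from the extracted text, and tag removal is done with a simple regular expression.

```python
import re
from xml.sax.saxutils import escape


def extract_substring(html, start, end, filter_tags=False):
    """Return the text between the start and end markers, or None if not found."""
    begin = html.find(start)
    if begin == -1:
        return None
    begin += len(start)  # assumption: the start marker itself is not kept
    finish = html.find(end, begin)
    if finish == -1:
        return None
    text = html[begin:finish]
    if filter_tags:
        # Naive tag removal; real HTML may need a proper parser.
        text = re.sub(r"<[^>]+>", "", text)
    return text.strip()


def to_substring_tag(tag, text):
    """Wrap the extracted text in the tag named by SubStringTag."""
    return f"<{tag}>{escape(text)}</{tag}>"


page = "<body>kuku <b>hello</b> world dudu</body>"
extracted = extract_substring(page, "kuku", "dudu", filter_tags=True)
print(to_substring_tag("myTagKuku", extracted))
```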

CSV File Splitter

This file splitter is used for harvested files that are in CSV format.

To use this file splitter, specify the following class name in the File Splitters mapping table:

  • com.exlibris.primo.publish.platform.harvest.splitters.csv.CSVFileSplitter

The following table lists the supported parameters:

Name | Description | Example | Mandatory
IDColumnNumber
  • Description – The column number that holds the record’s unique identifier. The first column is number 1.
  • Example – IDColumnNumber=1
  • Mandatory – Yes
DeletedColumnNumber
  • Description – The column that holds the deleted indication. If none exists, there is no need to configure this value. The first column is number 1.
  • Example – DeletedColumnNumber=4
  • Mandatory – No
DeletedIndicatorRegexp
  • Description – A regular expression to match against the value in the deleted column.
  • Example – DeletedIndicatorRegexp=Y
  • Mandatory – No
delimiter
  • Description – The delimiter used in the file. If not entered, this value defaults to a tab (\t).
  • Example – delimiter=, (for a comma)
  • Mandatory – No
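
As a rough illustration of how these four parameters interact, the following Python sketch splits a small CSV file into records. The function name and the sample data are hypothetical, and the sketch assumes that the regular expression must match the whole value in the deleted column.

```python
import csv
import io
import re


def split_csv(text, id_column, deleted_column=None, deleted_regexp=None, delimiter="\t"):
    """Return (record_id, deleted, row) for each non-empty line.

    Column numbers start at 1, as in the parameter table above.
    """
    records = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not row:
            continue
        record_id = row[id_column - 1]
        deleted = False
        if deleted_column is not None and deleted_regexp is not None:
            # Assumption: the expression must match the whole column value.
            deleted = re.fullmatch(deleted_regexp, row[deleted_column - 1]) is not None
        records.append((record_id, deleted, row))
    return records


SAMPLE = "1001,Title A,Author A,Y\n1002,Title B,Author B,N\n"
result = split_csv(SAMPLE, id_column=1, deleted_column=4, deleted_regexp="Y", delimiter=",")
for record_id, deleted, _ in result:
    print(record_id, deleted)
```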

OAI File Splitter/ Static OAI Splitter

This file splitter is used for harvesting XML files that are in OAI format. The OAI splitter is used for most pipes because most pipes harvest OAI XML files. The Static OAI splitter replaces all data sources configured in V2 with the “Static OAI repository”.

To configure new OAI file splitters, specify the following class name in the File Splitters mapping table:

  • com.exlibris.primo.publish.platform.harvest.splitters.oai.OAISplitterCB

Note: The static OAI splitter is identical to the OAI splitter except that it is configured with different XPath values.

The following table lists the supported parameters:

Name | Description | Explanation | XPath support | Mandatory
IdentifierXpath
  • Description – The XPath to the tag holding the unique identifier of the record.
  • Explanation – For example, for the following file:
    <OAI-PMH>
    <ListRecords>
    <record>
    <header>
    <identifier>0000582971</identifier>
    </header>
    </record>
    </ListRecords>
    </OAI-PMH>
    IdentifierXpath = /OAI-PMH/ListRecords/record/header/identifier
  • XPath support – No
  • Mandatory – Yes
StatusXpath
  • Description – The XPath to the header tag. We assume the attribute status exists on this tag. We also assume the values for this attribute are either true or false.
  • Explanation – For example, for the following file:
    <OAI-PMH>
    <ListRecords>
    <record>
    <header status="deleted"> … </header>
    </record>
    </ListRecords>
    </OAI-PMH>
    StatusXpath: /OAI-PMH/ListRecords/record/header
  • XPath support – No
  • Mandatory – Yes
ContentXPath
  • Description – An XPath to the metadata section of the record.
  • Explanation – For example, for the following file:
    <OAI-PMH>
    <ListRecords>
    <record>
    <metadata> … </metadata>
    </record>
    </ListRecords>
    </OAI-PMH>
    ContentXpath: /OAI-PMH/ListRecords/record/metadata
  • XPath support – No
  • Mandatory – Yes
SplitByXpath
  • Description – An XPath to the end of a record.
  • Explanation – For example, for the following file:
    <OAI-PMH>
    <ListRecords>
    <record>
    </record>
    </ListRecords>
    </OAI-PMH>
    SplitByXpath: /OAI-PMH/ListRecords/record
  • XPath support – No
  • Mandatory – Yes
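
The four OAI parameters map onto a simple traversal, sketched below in Python. This is an illustration only, not the OAISplitterCB implementation; the function name and sample data are hypothetical. The sample omits the OAI-PMH namespace declaration, as the examples above do, and the sketch treats a header whose status attribute equals "deleted" as a deleted record, following the example above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<OAI-PMH>
<ListRecords>
<record>
<header><identifier>0000582971</identifier></header>
<metadata><title>Example</title></metadata>
</record>
<record>
<header status="deleted"><identifier>0000582972</identifier></header>
</record>
</ListRecords>
</OAI-PMH>"""


def split_oai(xml_text):
    """Return (identifier, deleted, metadata_xml) for each record."""
    root = ET.fromstring(xml_text)
    results = []
    # SplitByXpath: each <record> under <ListRecords> is one record.
    for record in root.findall("ListRecords/record"):
        header = record.find("header")                   # StatusXpath
        identifier = header.findtext("identifier")       # IdentifierXpath
        deleted = header.get("status") == "deleted"
        metadata = record.find("metadata")               # ContentXPath
        content = None
        if metadata is not None:
            content = ET.tostring(metadata, encoding="unicode").strip()
        results.append((identifier, deleted, content))
    return results


oai_records = split_oai(SAMPLE)
for identifier, deleted, _ in oai_records:
    print(identifier, deleted)
```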

Marc Exchange File Splitter

This file splitter is used for harvesting files that are in MARC Exchange format. The MARC Exchange splitter replaces all data sources in V2 that have “Source format” set to “MARC Exchange”.

To configure new Marc Exchange file splitters, specify the following class name in the File Splitters mapping table:

  • com.exlibris.primo.publish.platform.harvest.splitters.marc_exchange.MarcExchangeSplitter

The following table lists the supported parameters:

Name | Description | Explanation | XPath support | Mandatory
MarcExchangeIdField
  • Description – Instructions as to which tag to get the identifier value from.
  • Explanation – Possible value examples:
    001 – go to tag 001.
    907 ## a – go to tag 907, ignore the indicators, and take the value in subfield a.
    907 10 a – go to tag 907 with indicators 10 and take the value in subfield a.
    Default value: If this parameter is not configured, the default value is 001, meaning the ID is taken from field 001.
  • XPath support – No
  • Mandatory – No
StatusField
  • Description – Defines the tag holding the status of the record (whether the record is for deletion).
  • Explanation – Possible value examples:
    001 – go to tag 001.
    907 ## a – go to tag 907, ignore the indicators, and take the value in subfield a.
    907 10 a – go to tag 907 with indicators 10 and take the value in subfield a.
    Default value: If this parameter is not configured, the default value is positions 05.
StatusWhenDeleted
  • Description – A regular expression correlating to the StatusField. It describes the expected value under StatusField that indicates that the record is for deletion.
  • Explanation – Possible value examples:
    d – If the value is d, the record is for deletion.
    d.* – Any value starting with d.
    d.*e – Any value starting with d and ending with e.
    Default value: If this parameter is not defined, the system uses:
    StatusField = positions 05
    StatusWhenDeleted = d
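
The field specifications and the default deletion check can be sketched as follows in Python. The function names are hypothetical. The sketch assumes that “positions 05” refers to character position 05 of the MARC leader (the record status position in MARC 21, where d means deleted) and that the regular expression must match the whole value.

```python
import re


def parse_field_spec(spec):
    """Split a value such as '907 ## a' into (tag, indicators, subfield).

    A bare tag such as '001' yields (tag, None, None).
    """
    parts = spec.split()
    tag = parts[0]
    indicators = parts[1] if len(parts) > 1 else None
    subfield = parts[2] if len(parts) > 2 else None
    return tag, indicators, subfield


def is_deleted(leader, status_when_deleted="d"):
    """Apply StatusWhenDeleted to leader position 05 (zero-based index 5)."""
    # Assumption: the expression must match the whole status value.
    return re.fullmatch(status_when_deleted, leader[5]) is not None


print(parse_field_spec("907 ## a"))
print(is_deleted("00397cas a2200157 a 4500"))
print(is_deleted("00397das a2200157 a 4500"))
```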

MARC XML With Character Delimited Holdings Splitter

This splitter is helpful for ILS systems that can extract MARC XML but may not be able to expand holdings information into the XML or add the OAI header.
Customers need to extract their own data in one line per record. Each line includes the ID, status, and holdings information (outside the MARC record) as well as the MARC metadata.
The splitter takes each bib according to the bib ID and merges the holdings/location/availability information into the MARC XML as 999 subfields.
If the same bib has multiple locations, the splitter output is MARC XML with multiple 999 fields.

Format – The splitter can deal with records with the following format:

  • ID | Status | MARC XML | Holdings information in multiple fields delimited (no specific limit to number of fields)

Input data should be consistent with the following requirements:

  • The extract output must have 1 line per each bib + holdings.
  • The extract output file needs to be sorted according to the record ID (control no. – 1st field).
  • The extract output may be at library level or item level.
  • Records with multiple holdings/items should be extracted multiple times. Each holdings line has the full ID, status, and MARC XML and differs only in the holdings information. The number of lines written to the file should equal the number of items that the bibliographic record has (that is, the bib information is repeated and only the holdings information varies from line to line).
  • The 1st field must always be the record ID.
  • The 2nd field must always be the record status – the delete flag.
  • The 3rd field must always be the MARC xml (legal xml).
  • Fields 4 and onwards are for holdings, location and availability information.

To configure new MARC XML with character delimited holdings splitter, specify the following class name in the File Splitters mapping table:

  • com.exlibris.primo.publish.platform.harvest.splitters.marc_delimited.MarcDelimitedFileSplitter

The following table lists the supported parameters:

Name | Description | Explanation | XPath support | Mandatory
delimiter
  • Description – The delimiter character(s) between fields in the extract file.
  • Explanation – |
  • XPath support – No
  • Mandatory – Yes
deleted_indicator
  • Description – A regular expression that is run against the value found in the second input field. If the regular expression matches the value found, the record is marked as deleted.
  • Explanation – For example, with the following input:
    477222 | d | <?xml version="1.0" encoding="utf-8"?>…
    477223 | n | <?xml version="1.0" encoding="utf-8"?>…
    If deleted_indicator = d is defined, the first record is marked as deleted and the second record is not.
  • XPath support – No
  • Mandatory – Yes
skip_first_line
  • Description – A true/false flag that defines whether the first line in the file should be skipped.
  • Explanation – false
  • XPath support – No
  • Mandatory – Yes

Input example (one line, pipe delimited between fields):

477222 | n | <?xml version="1.0" encoding="utf-8"?>

<collection>

    <record>

        <leader>00397cas a2200157 a 4500</leader>

        <controlfield tag="001">0000477222</controlfield>

        <controlfield tag="005">20060227110415</controlfield>

        <controlfield tag="008">970721c19759999caumr1p       0uuua0eng  </controlfield>

        <datafield tag="022" ind1=" " ind2=" ">

            <subfield code="a">0730-0158</subfield>

        </datafield>

        <datafield tag="040" ind1=" " ind2=" ">

            <subfield code="a">211006</subfield>

            <subfield code="c">211006</subfield>

        </datafield>

        <datafield tag="090" ind1=" " ind2=" ">

            <subfield code="a">780.5</subfield>

            <subfield code="b">K441</subfield>

            <subfield code="x">P</subfield>

        </datafield>

        <datafield tag="245" ind1="0" ind2="0">

            <subfield code="a">Keyboard.</subfield>

        </datafield>

        <datafield tag="260" ind1="0" ind2="0">

            <subfield code="a">San Francisco, CA. :</subfield>

            <subfield code="b">Miller Freeman, Inc.,</subfield>

            <subfield code="c">1975-</subfield>

        </datafield>

        <datafield tag="650" ind1=" " ind2="0">

            <subfield code="a">Music</subfield>

        </datafield>

    </record>

</collection> | MED | W0187164 | available | P 780.5 K441 v.35 no.1-12 2009 | 5462 |

Output example:

Record ID: 477222

<?xml version="1.0" encoding="utf-8"?>

<record>

    <leader>00397cas a2200157 a 4500</leader>

    <controlfield tag="001">0000477222</controlfield>

    <controlfield tag="005">20060227110415</controlfield>

    <controlfield tag="008">970721c19759999caumr1p 0uuua0eng </controlfield>

    <datafield tag="999">

        <subfield code="a">MED</subfield>

        <subfield code="b"> W0187164</subfield>

        <subfield code="c">available</subfield>

        <subfield code="d"> P 780.5 K441 v.35 no.1-12 2009</subfield>

        <subfield code="e">5462</subfield>

    </datafield>

    <datafield tag="022" ind1=" " ind2=" ">

        <subfield code="a">0730-0158</subfield>

    </datafield>

    <datafield tag="040" ind1=" " ind2=" ">

        <subfield code="a">211006</subfield>

        <subfield code="c">211006</subfield>

    </datafield>

    <datafield tag="090" ind1=" " ind2=" ">

        <subfield code="a">780.5</subfield>

        <subfield code="b">K441</subfield>

        <subfield code="x">P</subfield>

    </datafield>

    <datafield tag="245" ind1="0" ind2="0">

        <subfield code="a">Keyboard.</subfield>

    </datafield>

    <datafield tag="260" ind1="0" ind2="0">

        <subfield code="a">San Francisco, CA. :</subfield>

        <subfield code="b">Miller Freeman, Inc.,</subfield>

        <subfield code="c">1975-</subfield>

    </datafield>

    <datafield tag="650" ind1=" " ind2="0">

        <subfield code="a">Music</subfield>

    </datafield>

</record>

Note that the splitter output includes the data from fields 4 and on added as 999 subfields.
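
The merge step can be approximated with the Python sketch below. It is a simplified illustration, not the MarcDelimitedFileSplitter code, and the function name and sample lines are hypothetical. It assumes the delimiter never occurs inside the MARC XML, that empty holdings fields are dropped, and that subfield codes are assigned alphabetically; unlike the output example above, it appends the 999 fields at the end of the record.

```python
import re
import string
import xml.etree.ElementTree as ET
from itertools import groupby

MARC = '<collection><record><controlfield tag="001">0000477222</controlfield></record></collection>'
LINES = [
    f"477222 | n | {MARC} | MED | W0187164 | available |",
    f"477222 | n | {MARC} | LAW | W0187165 | on loan |",
]


def merge_holdings(lines, delimiter="|", deleted_indicator="d"):
    """Group sorted lines by record ID and add one 999 field per line."""
    parsed = [[f.strip() for f in line.split(delimiter)] for line in lines if line.strip()]
    results = []
    for record_id, group in groupby(parsed, key=lambda fields: fields[0]):
        rows = list(group)
        root = ET.fromstring(rows[0][2])
        record = root if root.tag == "record" else root.find("record")
        for row in rows:
            field = ET.SubElement(record, "datafield", {"tag": "999"})
            holdings = [value for value in row[3:] if value]
            # Assumption: subfield codes a, b, c, ... follow the field order.
            for code, value in zip(string.ascii_lowercase, holdings):
                ET.SubElement(field, "subfield", {"code": code}).text = value
        deleted = re.fullmatch(deleted_indicator, rows[0][1]) is not None
        results.append((record_id, deleted, record))
    return results


merged = merge_holdings(LINES)
print(ET.tostring(merged[0][2], encoding="unicode"))
```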

WARC File Splitter

This file splitter is used to harvest files that are in WARC format.

To configure new WARC file splitters, specify the following class name in the File Splitters mapping table:

  • com.exlibris.primo.publish.platform.harvest.splitters.warc.WarcSplitter

The following table lists the supported parameters:

UseSrcFileAsIs
  Description: A flag indicating that the harvested file should be used as is, without trying to extract its content (such as a tar.gz or zip file). The WARC reader can also handle the warc.gz format, so no prior extraction by the system is needed.
  Explanation: The value of this parameter has no meaning; it is just a flag. If the parameter name UseSrcFileAsIs is defined, the flag is on.
  XPath support: No
  Mandatory: Yes

SFX File Splitter

This file splitter is used to harvest files that are in SFX XML format. It uses the generic XML splitter with parameters specific to the structure of the SFX XML file.

The SFX XML splitter replaces the SFXOAI.xsl transformation program (the XSLT program mentioned previously) that data sources used before.

To configure new SFX XML file splitters, specify the following class name in the File Splitters mapping table:

  • com.exlibris.primo.publish.platform.harvest.splitters.generic.DomXmlSplitter

The following table lists the supported parameters:

RootXpath
  Description: The XPath to the opening tag of the file (the first tag in the file).
  Explanation: For example, for the following file:
    <collection> … </collection>
    RootXpath: collection
  XPath support: Partial
  Mandatory: Yes

FullRecordXPath
  Description: The XPath to the beginning of a record. A file may contain more than one record.
  Explanation: For example, for the following file:
    <collection>
    <record> … </record>
    <record> … </record>
    </collection>
    FullRecordXPath: collection/record
  XPath support: Partial
  Mandatory: Yes

IdentifierXpath
  Description: The XPath to the identifier tag of the record. This should be the tag holding the unique identifier of the record.
  Explanation:
    <collection>
    <record>
    <datafield tag="090" ind1="" ind2="">
    <subfield code="a">
    954921333016
    </subfield>
    </datafield>
    </record>
    </collection>
    IdentifierXpath: //record/datafield[]/subfield[]
  XPath support: Full
  Mandatory: Yes

StatusXpath
  Description: The XPath to the location of the deleted status of the record. Default value for a record is: deleted = false
  Explanation: See example of StatusWhenDeleted below.
    StatusXpath: //record/leader
  XPath support: Full
  Mandatory: No

StatusWhenDeleted
  Description: Should be a regular expression. This regular expression will be run against the value found under the StatusXpath path. If the regular expression matches the value found, the record is marked as deleted.
  Explanation: For example, for the following file:
    <record>
    <leader>-----d-nas-a22----z-4500</leader>
    </record>
    StatusWhenDeleted: .....d.*
  XPath support: No
  Mandatory: No (but if StatusXpath is defined then StatusWhenDeleted is mandatory)
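
The StatusWhenDeleted check can be tried outside Primo with the JDK's regular-expression classes. The sketch below is illustrative only. It assumes that the leader in the example is the 24-character string -----d-nas-a22----z-4500, that the expression is .....d.*, and that the whole value must match the expression; how the splitter applies the expression internally is not described here. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Primo API.
final class DeletedStatusCheck {

    // Assumed expression: five arbitrary characters, then 'd' at leader position 5.
    static final Pattern STATUS_WHEN_DELETED = Pattern.compile(".....d.*");

    // Returns true when the value read from StatusXpath matches the expression.
    // Assumption: the entire value must match (Matcher.matches()).
    static boolean isDeleted(String statusValue) {
        return statusValue != null
                && STATUS_WHEN_DELETED.matcher(statusValue).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDeleted("-----d-nas-a22----z-4500"));
        System.out.println(isDeleted("-----n-nas-a22----z-4500"));
    }
}
```

Testing an expression this way before entering it in the mapping table can catch patterns that are too loose, for example one that matches a "d" at any position in the leader.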

OAI File Splitter – configured using the XML file splitter

To parse an OAI file with support for full-text loading, you must manually configure an XML file splitter for your OAI files. You cannot use the out-of-the-box (OTB) OAI file splitter, because it does not support full text.

Below is an example of how to configure the XML file splitter.

Create a new file splitter in the “File splitters” mapping table. Use the following as the “class name”: com.exlibris.primo.publish.platform.harvest.splitters.generic.DomXmlSplitter.

The following are the mandatory parameters:

Name | Value
RootXpath | OAI-PMH
FullRecordXPath | OAI-PMH/ListRecords/record
IdentifierXpath | //record/header/identifier
StatusXpath | //record/header/@status
StatusWhenDeleted | deleted
ContentXpath | //record/metadata/record
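
To see what these XPath values select, they can be evaluated with the JDK's standard javax.xml.xpath classes. The following is a rough sketch, not the splitter's actual implementation. The sample XML is invented and omits the namespaces that real OAI-PMH responses declare, the class name is hypothetical, and it is an assumption that each record is evaluated as a small document of its own.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; it does not call any Primo code.
final class OaiXpathDemo {

    // Invented sample. Real OAI-PMH responses declare XML namespaces, omitted here.
    static final String SAMPLE =
          "<OAI-PMH><ListRecords>"
        + "<record><header><identifier>oai:example:1</identifier></header>"
        + "<metadata><record><title>First</title></record></metadata></record>"
        + "<record><header status=\"deleted\">"
        + "<identifier>oai:example:2</identifier></header></record>"
        + "</ListRecords></OAI-PMH>";

    // Returns one "identifier deleted=<flag>" line per record in the file.
    static List<String> summarize(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document file = builder.parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // FullRecordXPath: one node per record in the harvested file.
            NodeList records = (NodeList) xpath.evaluate(
                    "OAI-PMH/ListRecords/record", file, XPathConstants.NODESET);

            List<String> lines = new ArrayList<>();
            for (int i = 0; i < records.getLength(); i++) {
                // Assumption: each record is evaluated as a document of its own.
                Document single = builder.newDocument();
                single.appendChild(single.importNode(records.item(i), true));

                // IdentifierXpath and StatusXpath from the table above.
                String id = xpath.evaluate("//record/header/identifier", single);
                String status = xpath.evaluate("//record/header/@status", single);

                // StatusWhenDeleted is treated as a regular expression.
                boolean deleted = status.matches("deleted");
                lines.add(id + " deleted=" + deleted);
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate the sample", e);
        }
    }

    public static void main(String[] args) {
        summarize(SAMPLE).forEach(System.out::println);
    }
}
```

A record without a status attribute yields an empty string for StatusXpath, so the expression does not match and the record keeps the default deleted = false.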

Implementing your own File Splitter

You can implement your own file splitter if the out-of-the-box splitters do not cover your needs.

Implementing a new file splitter requires Java programming skills.

You will first need to get a copy of a JAR file from Primo’s installation directory.

From the BO machine get the following JAR file:

1) Up to Primo version 4.5: Extract the primo_publishing-api.jar from the $primo_dev/ng/primo/home/system/publish/client directory

2) From Primo version 4.5 to 4.6: Extract the primo-publishing-api-<version>.jar from the $primo_dev/ng/primo/home/system/tomcat/publish/webapps/primo_publishing#admin/WEB-INF/lib directory

3) From Primo version 4.6.1 and later: Extract the primo-common-api-<version>.jar from the $primo_dev/ng/primo/home/system/tomcat/publish/webapps/primo_publishing#admin/WEB-INF/lib directory

The relevant class files

com.exlibris.primo.api.spliter.plugin.IFileSplitter.java
com.exlibris.primo.api.spliter.plugin.RecordData.java
com.exlibris.primo.api.spliter.plugin.IRecordSaver.java
com.exlibris.primo.api.spliter.plugin.ExtensionData.java
com.exlibris.primo.api.spliter.plugin.TerminateParsingException.java

IFileSplitter

This is the interface you will need to implement. Once implemented, the name of your implementation should be registered in the “File Splitters” mapping table.

The following functions for this interface must be implemented:

  • init()
  • parse()
  • doneParsing()

1) init()

This function is called once for every pipe run. It receives 3 parameters:

  • charSet – The character set configured for the data source. Use this character set to read the given file in the correct encoding.

  • logger – an object allowing you to write to the Primo pipe log any important output needed.
  • Params – all the parameters configured in the “File Splitters Params” mapping table.

2) parse()

This function will be called for every harvested file extracted from the compressed harvested file. This function receives the following parameters:

  • Input stream – An input stream to read the file to be parsed.
  • File – deprecated, DO NOT USE
  • Record saver – The IRecordSaver mentioned above. Use this parameter to save your parsed records.
  • Exceptions thrown – A parse() call may throw an exception during parsing. If an exception is thrown, the file is recorded as a failed file and skipped, and parsing continues with the next file. Be aware that even if an exception is thrown, this file may already have generated records that were saved to the database.
  • TerminateParsingException – The parse() function may also throw the TerminateParsingException, which stops the pipe so that the remaining files are not parsed. The pipe is stopped with a “Stopped harvest error” indication. Use this exception only when you are sure there is no point in continuing to parse files (for example, when the splitter retrieves data from an additional system and that system is down).

3) doneParsing()

This function is called when no more files are left for parsing. Usually it does nothing, unless the splitter holds resources that should be closed or has other tasks to complete when parsing is done.

RecordData

This class holds the following members:

  • identifier – holds the unique record id.
  • recordData – holds the XML file which will be passed on to the normalization rules for creating the PNX record.
  • isDeleted – an indication whether this record is for deletion. Default value is false.

The file splitter must create this object for every parsed record from the file.

IRecordSaver

This interface is passed to the file splitter so that it can save each parsed record. It knows how to save RecordData objects. The save() method is declared as throwing an exception, but you do not need to handle exceptions thrown from it; the framework deals with them. The only exception that might be thrown is a TerminateParsingException, which causes the pipe to terminate. Do not catch these exceptions; let them propagate to the framework.

A basic usage of the IRecordSaver interface would be:

//parse a record from the file and create a RecordData object for it
RecordData record = getNextParsedRecord();
 
//call the recordSaver to save the parsed record into the database
recordSaver.save(record);
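
Putting the pieces together, the overall shape of a splitter can be sketched as follows. The exact method signatures of IFileSplitter, RecordData, and IRecordSaver are not listed in this document, so the sketch uses hypothetical stand-in types (RecordDataStandIn, RecordSaverStandIn) and simplified signatures without the logger, params, and deprecated File parameters. The "id|title" input format is invented for illustration. Check the real signatures in the API JAR before adapting it.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Stand-in for RecordData; the real class ships in the Primo API JAR.
final class RecordDataStandIn {
    final String identifier;   // unique record id
    final String recordData;   // XML handed on to the normalization rules
    final boolean isDeleted;   // false by default in the real class

    RecordDataStandIn(String identifier, String recordData, boolean isDeleted) {
        this.identifier = identifier;
        this.recordData = recordData;
        this.isDeleted = isDeleted;
    }
}

// Stand-in for IRecordSaver; save() is declared as throwing an exception.
interface RecordSaverStandIn {
    void save(RecordDataStandIn record) throws Exception;
}

// Hypothetical splitter: every non-empty "id|title" line becomes one record.
final class LineSplitterSketch {
    private Charset charset = StandardCharsets.UTF_8;

    // Called once per pipe run; remember the configured character set.
    void init(String charSet) {
        charset = Charset.forName(charSet);
    }

    // Called once per harvested file; exceptions from save() are not caught.
    void parse(InputStream in, RecordSaverStandIn saver) throws Exception {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] parts = line.split("\\|", 2);
            String id = parts[0].trim();
            String title = parts.length > 1 ? parts[1].trim() : "";
            String xml = "<record><id>" + escape(id) + "</id><title>"
                    + escape(title) + "</title></record>";
            saver.save(new RecordDataStandIn(id, xml, false));
        }
    }

    // Called when no files are left; nothing to release in this sketch.
    void doneParsing() {
    }

    // Minimal XML escaping for text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Helper for trying the sketch: splits a string and returns the saved ids.
    static List<String> splitToIds(String fileContent) {
        List<String> ids = new ArrayList<>();
        LineSplitterSketch splitter = new LineSplitterSketch();
        splitter.init("UTF-8");
        try {
            InputStream in = new ByteArrayInputStream(
                    fileContent.getBytes(StandardCharsets.UTF_8));
            splitter.parse(in, record -> ids.add(record.identifier));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        splitter.doneParsing();
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(splitToIds("1|Keyboard\n\n2|Music\n"));
    }
}
```

The sketch reads the stream record by record instead of loading the whole file, which keeps memory use flat for large files. It does not close the input stream, on the assumption that the caller owns it.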

Integrating Your Plug-In

  1. Extract the primo-common-api-<ver>.jar file from the location, as mentioned above.
  2. Place this jar file in your development environment so you can start implementing the interfaces defined in the extracted JAR file.
  3. Implement the IFileSplitter interface.
  4. Wrap your implementation class in a JAR file, which must be given a unique name.
  5. Place the new JAR file into your Primo installation under the following directory:
    $primo_dev/ng/primo/home/profile/publish/publish/production/conf/fileSplitter/lib/.
  6. Place any other needed third party JAR file under the lib directory from step #5.
  7. Configure the File Splitters and File Splitter Params mapping tables that were previously mentioned.

Note: There is no need to restart the Back Office server for these changes to take effect. Any change will be seen automatically by the next pipe run.

Adding Extensions with File Splitters

Extensions contain additional data that is indexed along with the PNX record. These extensions enable users to search for terms in the extensions and retrieve the associated PNX records for the extensions.

The generic file splitters provide parameters for adding extensions to your PNX record.

When you implement your own file splitter (in Java) and want to load full-text data for PNX records during the harvesting phase, you can add the extensions to the RecordData object.

Extension Types

Each extension has a type. You can define the type of an extension (usually the type used will be FULLTEXT).

The OTB types are defined in the PNX_EXTENSION_MAPPINGS mapping table, located under the Front End subsystem.

Only the types listed and enabled in this mapping table are indexed by the search engine. You can add any type to this table and map it to the needed PNX path.

When configuring a generic file splitter or implementing your own splitter, make sure to use only types that are configured in the PNX_EXTENSION_MAPPINGS mapping table.

Adding extensions programmatically

When implementing your own file splitter, in order to add extensions to the RecordData, you must first create the ExtensionData object, which represents the extension to be added, and then add the ExtensionData object to RecordData.

The ExtensionData object includes the following members:

  • Type – Should correlate to the types in the PNX_EXTENSION_MAPPINGS table.
  • Value – The value for this extension to be indexed.
  • Overwrite – Default value is true. Determines whether previous extensions for an updated record should be deleted before the new extensions are added. If this value is set to false, any update with extensions will add the new extensions to the previous extensions. The deletion of the previous extensions is done by extension type. For example, if a record previously had FULLTEXT and ABSTRACT extensions and the record is updated only with FULLTEXT extensions, the previous FULLTEXT extensions will be replaced with the new ones and the ABSTRACT extension will be left as is.

A common way of doing this in the file splitter is:

//parse a record from the file and create a RecordData object for it
RecordData record = getNextParsedRecord();
 
//adding 2 different full texts
record.addExtension(new ExtensionData("FULLTEXT", "blah blah balh…"));
record.addExtension(new ExtensionData("FULLTEXT", "kuku kuku kuku…"));
 
//call the recordSaver to save the parsed record into the database
recordSaver.save(record);

Indexing Extensions

You must notify the system that you are loading extensions.

The Datasource Index Extensions mapping table under the Publishing subsystem is used to notify the system which extensions to index.

 

Each mapping row contains the following fields:
Data Source Name – Select the data source that is using the file splitter that adds extensions.
Type – The possible values are:

  • Index All – Use this option if all records parsed by this file splitter will be added with extensions.
  • Index If Exists – Use this option if only a partial set of the parsed records will be added with extensions.

Note: If you do not configure the data source in this mapping table, the system will load the extensions into the database, but it will not index them, making the extensions unsearchable.

Additional interfaces that can be implemented

Record Saver Wrapper

The IRecordSaverWrapper interface allows you to modify any predefined file splitter to make additional modifications to a record before it is saved into the system. As mentioned previously, a file splitter creates a RecordData object and calls the RecordSaver to save the record. IRecordSaver will automatically execute the wrapper containing your modifications between the following lines of code:

RecordData rd = getNextRecord();

recordSaver.save(rd);

In addition, your wrapper can be used to reject a record to keep it from being saved into the system.

Configuring the Wrapper

To use the wrapper, add the following parameter and value to the File Splitter Params mapping table for each file splitter you want to modify:

  • Parameter name: RecordSaverWrapper
  • Parameter value: A class name implementing the IRecordSaverWrapper interface.

Implementing the IRecordSaverWrapper

The same instructions used for implementing a file splitter are needed to implement the IRecordSaverWrapper interface. The only difference is the name of the interface to implement. For more information, see the Implementing and Integrating Your Plug-In section.

To implement the IRecordSaverWrapper interface, you must implement the methods shown in the following code:

public interface IRecordSaverWrapper {

    //will be called before recordSaver.save(rd) is called.
    //returning false will cause the record to not be saved
    public boolean doBeforeSave(RecordData rd);

    //will be called once, after the class is created
    public void init(IPrimoLogger logger, Map<String, Object> params);
}

  • doBeforeSave() – implement your business logic to run against the current record which is going to be saved. Return false if you want to skip this record and keep it from being saved into the system.
  • init() – This function receives the map of parameters sent to the file splitter. This allows you to send the IRecordSaverWrapper additional parameters, including the parameters specifically supported by the file.
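
As an illustration, a wrapper that rejects records by identifier prefix might look like the sketch below. It is written against hypothetical stand-in types so that it compiles on its own: RecordStandIn and its getIdentifier() accessor, LoggerStandIn and its info() method, and the SkipIdPrefix parameter name are all invented for this example. Only the two method shapes, doBeforeSave() and init(), follow the interface shown above.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for RecordData; only the identifier is modelled here.
final class RecordStandIn {
    private final String identifier;

    RecordStandIn(String identifier) {
        this.identifier = identifier;
    }

    String getIdentifier() {   // hypothetical accessor
        return identifier;
    }
}

// Stand-in for IPrimoLogger; the real logger's methods are not shown in this document.
interface LoggerStandIn {
    void info(String message);
}

// Mirrors the two methods of IRecordSaverWrapper using the stand-in types.
interface RecordSaverWrapperStandIn {
    boolean doBeforeSave(RecordStandIn rd);

    void init(LoggerStandIn logger, Map<String, Object> params);
}

// Hypothetical wrapper: rejects records whose id starts with a configured prefix.
final class SkipByPrefixWrapper implements RecordSaverWrapperStandIn {
    private LoggerStandIn logger;
    private String prefix;

    @Override
    public void init(LoggerStandIn logger, Map<String, Object> params) {
        this.logger = logger;
        // "SkipIdPrefix" is an invented parameter name for this sketch.
        Object value = params.get("SkipIdPrefix");
        this.prefix = value == null ? null : value.toString();
    }

    @Override
    public boolean doBeforeSave(RecordStandIn rd) {
        boolean skip = prefix != null && rd.getIdentifier().startsWith(prefix);
        if (skip) {
            logger.info("Skipping record " + rd.getIdentifier());
        }
        return !skip;   // false keeps the record from being saved
    }

    public static void main(String[] args) {
        Map<String, Object> params = new HashMap<>();
        params.put("SkipIdPrefix", "TEST");
        SkipByPrefixWrapper wrapper = new SkipByPrefixWrapper();
        wrapper.init(System.out::println, params);
        System.out.println(wrapper.doBeforeSave(new RecordStandIn("TEST001")));
        System.out.println(wrapper.doBeforeSave(new RecordStandIn("0000477222")));
    }
}
```

Reading the prefix in init() rather than in doBeforeSave() keeps the per-record check cheap, since init() runs once and doBeforeSave() runs for every record.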

End Parsing Handler

The IEndParsingHandler interface allows you to modify a predefined file splitter right after it has executed the doneParsing() method. This allows you to perform additional parsing actions (such as adding additional records) before the records are stored to the database.

This interface is called just after the file splitter’s doneParsing() method is called and before the records are saved to the database.

//call the file splitter's doneParsing function
splitter.doneParsing();

//call end parsing interface if defined
doEndParsing(recordSaver, params);

//save to the database the remaining records
recordSaver.saveLeftOvers(config);

Configuring the Handler

To use the handler, add the following parameter and value to the File Splitter Params mapping table for each file splitter you want to modify:

  • Parameter name: EndParserHandler
  • Parameter value: A class name implementing the IEndParsingHandler interface.

Implementing the IEndParsingHandler Interface

The same instructions used for implementing a file splitter are needed to implement the IEndParsingHandler interface. The only difference is the name of the interface to implement. For more information, see the Implementing and Integrating Your Plug-In section.

To implement the IEndParsingHandler interface, you must implement the methods shown in the following code:

public interface IEndParsingHandler {

    public void init(IRecordSaver recordSaver,
            IPrimoLogger logger, Map<String, Object> params);

    public void execute();
}

  • init() – Receives “helper” classes to be used in the execute() function.
  • execute() – Do your business logic here. Records can still be added to the recordSaver if needed.

Communicating between IRecordSaverWrapper and IEndParsingHandler

To communicate between the IRecordSaverWrapper and IEndParsingHandler interfaces, you can use IRecordSaverWrapper to store any data in the params map and then pass the data to IEndParsingHandler.
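
A minimal sketch of this hand-off is shown below. It assumes that the same params map instance is passed to both init() calls. The class names, the simplified method signatures (without the logger and record saver arguments), and the sketch.recordCount key are hypothetical and serve only to show the pattern.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper side: counts the records it sees in the shared params map.
final class CountingWrapperSketch {
    static final String KEY = "sketch.recordCount";   // invented key name
    private Map<String, Object> params;

    void init(Map<String, Object> params) {
        this.params = params;
        params.put(KEY, 0);
    }

    // Called before each save; always lets the record through.
    boolean doBeforeSave(String recordId) {
        int seen = (Integer) params.get(KEY);
        params.put(KEY, seen + 1);
        return true;
    }
}

// Hypothetical handler side: reads the counter once parsing has ended.
final class ReportingHandlerSketch {
    private Map<String, Object> params;
    private String report = "";

    void init(Map<String, Object> params) {
        this.params = params;
    }

    void execute() {
        Object count = params.getOrDefault(CountingWrapperSketch.KEY, 0);
        report = "records seen: " + count;
    }

    String getReport() {
        return report;
    }
}

// Drives both sides the way a pipe run is assumed to: wrapper first, handler last.
final class ParamsHandOffDemo {
    static String run(String... recordIds) {
        Map<String, Object> params = new HashMap<>();   // assumed to be shared
        CountingWrapperSketch wrapper = new CountingWrapperSketch();
        ReportingHandlerSketch handler = new ReportingHandlerSketch();
        wrapper.init(params);
        handler.init(params);
        for (String id : recordIds) {
            wrapper.doBeforeSave(id);
        }
        handler.execute();
        return handler.getReport();
    }

    public static void main(String[] args) {
        System.out.println(run("a", "b", "c"));
    }
}
```

Using a key name that is unlikely to clash with configured parameter names avoids overwriting values from the File Splitter Params mapping table.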