Accented characters in SIS import and export
Alma accepts Unicode characters in the first name, middle name and last name fields of Users (among others). It even finds names with accents when you don’t type the accents. For example, it finds Jiménez when you search for Jimenez.
For the user XML you give Alma’s SIS import job, the only characters you need to explicitly escape are the ones required by XML itself. So < needs to become &lt;, > needs to become &gt;, and & itself needs to become &amp;.
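As a minimal sketch of that escaping, Python's standard library function xml.sax.saxutils.escape replaces exactly these three characters and leaves accented characters alone. The sample value is hypothetical.

```python
from xml.sax.saxutils import escape

# Hypothetical field value, used only for illustration.
last_name = 'Smith & Jones <Jiménez>'

# escape() replaces &, < and > and leaves every other character untouched.
escaped = escape(last_name)
print(escaped)  # Smith &amp; Jones &lt;Jiménez&gt;
```

Note that the é passes through unchanged. It is handled by the file encoding, not by escaping.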
Unicode characters in general need to be output in UTF-8 encoding. So é needs to be output as the byte sequence c3 a9 in the file. This is how Alma’s SIS export job outputs XML. Alma’s SIS import job also accepts this encoding. (See the footnote.)
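The sketch below shows one way to get those bytes into a file from Python 3, by opening the file with an explicit encoding. The file name users.xml and the temporary directory are assumptions made for the example.

```python
import os
import tempfile

xml_text = '<last_name>Jiménez</last_name>'

# Encoding the string directly shows c3 a9 where the é was.
encoded = xml_text.encode('utf-8')
print(encoded)

# Writing through a text-mode file opened with encoding='utf-8'
# puts the same bytes on disk.
path = os.path.join(tempfile.mkdtemp(), 'users.xml')  # hypothetical file name
with open(path, 'w', encoding='utf-8') as f:
    f.write(xml_text)
with open(path, 'rb') as f:
    on_disk = f.read()
```

Passing encoding='utf-8' explicitly avoids depending on the platform's default encoding.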
If you don’t want anything other than basic (unaccented) Latin characters in your user fields, you can strip everything else out. The way to reduce a general Unicode string to basic Latin is to convert the string into Normalization Form KD (KD = Compatibility Decomposition) and then strip out any code points greater than 7F.
This technique can be used for both SIS import and export.
Python implementation
The following function in Python 3 implements this technique.
```python
import unicodedata


def reduce_to_basic_latin(text: str) -> str:
    nfkd_form = unicodedata.normalize('NFKD', text)
    basic_latin = (ch for ch in nfkd_form if ord(ch) < 0x80)
    reduced_form = ''.join(basic_latin)
    return reduced_form
```
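A few sample calls illustrate what the function returns. The function is repeated in condensed form so the snippet runs on its own, and the sample names are only illustrations. One caveat worth checking against your own data: a character with no decomposition into basic Latin, such as ø, is dropped rather than replaced.

```python
import unicodedata


def reduce_to_basic_latin(text: str) -> str:
    nfkd_form = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in nfkd_form if ord(ch) < 0x80)


# Illustrative inputs only.
samples = ['Jiménez', 'Öçàr', 'field™', 'Søren', '你好']
results = {s: reduce_to_basic_latin(s) for s in samples}

for original, reduced in results.items():
    print(repr(original), '->', repr(reduced))
# 'Jiménez' -> 'Jimenez'
# 'Öçàr' -> 'Ocar'
# 'field™' -> 'fieldTM'
# 'Søren' -> 'Sren'   (ø has no decomposition, so it disappears)
# '你好' -> ''
```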
How it works
From Python’s REPL, you can see how this works.
```python
>>> text = '你好 Jiménez'
>>> bytes(text, 'utf-8')
b'\xe4\xbd\xa0\xe5\xa5\xbd Jim\xc3\xa9nez'
```
When the text ‘你好 Jiménez’ is encoded as UTF-8, 你 encodes as e4 bd a0, 好 encodes as e5 a5 bd, é encodes as c3 a9, and the other characters encode as their ASCII values (which Python’s REPL doesn’t escape into hexadecimal).
```python
>>> import unicodedata
>>> nfkd_form = unicodedata.normalize('NFKD', text)
>>> bytes(nfkd_form, 'utf-8')
b'\xe4\xbd\xa0\xe5\xa5\xbd Jime\xcc\x81nez'
```
When the text is normalized into Normalization Form KD, the encoding remains the same except for é, which goes from c3 a9 to e followed by cc 81. UTF-8 c3 a9 is LATIN SMALL LETTER E WITH ACUTE, while UTF-8 cc 81 is COMBINING ACUTE ACCENT. So, in NFKD, é goes from one character to two.
This happens not just for é. The Unicode standard specifies how all of the characters with Latin base characters decompose in this way. For the compatibility decomposition, it goes on to specify how characters like the single ligature character ﬁ (U+FB01) decompose into f followed by i. This is handy when someone copies from a Word document.
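The difference between the canonical and compatibility decompositions can be checked with the same unicodedata module. This is a small sketch: NFD (canonical only) leaves the ligature alone, while NFKD splits it.

```python
import unicodedata

ligature = '\ufb01'  # LATIN SMALL LIGATURE FI
accented = '\u00e9'  # LATIN SMALL LETTER E WITH ACUTE

# Canonical decomposition leaves the ligature as a single code point.
nfd_ligature = unicodedata.normalize('NFD', ligature)

# Compatibility decomposition turns it into the two letters f and i.
nfkd_ligature = unicodedata.normalize('NFKD', ligature)

# The accented letter decomposes into a base letter plus a combining mark.
nfkd_accented = unicodedata.normalize('NFKD', accented)
names = [unicodedata.name(ch) for ch in nfkd_accented]
print(names)
```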
Here is a table of all the actual bytes in the NFKD form for our example. The last column marks the bytes that are greater than 7F. If these are removed, you are left with a string containing only basic Latin characters.
Character | UTF-8 bytes (hex) | Greater than 7F?
你 | e4 bd a0 | yes (all)
好 | e5 a5 bd | yes (all)
[space] | 20 | no
J | 4a | no
i | 69 | no
m | 6d | no
e | 65 | no
[acute] | cc 81 | yes (all)
n | 6e | no
e | 65 | no
z | 7a | no
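The table describes bytes, while the function filters code points. The sketch below suggests why the two views agree for this example: in UTF-8, every byte of a multi-byte sequence is greater than 7F, so dropping those bytes removes the same characters as dropping code points greater than 7F.

```python
import unicodedata

text = '你好 Jiménez'
nfkd_form = unicodedata.normalize('NFKD', text)

# One row per code point: the character and its UTF-8 bytes in hex.
rows = [
    (ch, ' '.join(f'{b:02x}' for b in ch.encode('utf-8')))
    for ch in nfkd_form
]

# Dropping every byte greater than 7F ...
kept_bytes = bytes(b for b in nfkd_form.encode('utf-8') if b <= 0x7F)

# ... leaves the same text as dropping every code point greater than 7F.
kept_chars = ''.join(ch for ch in nfkd_form if ord(ch) < 0x80)

print(rows)
print(kept_bytes, repr(kept_chars))
```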
Reducing a whole file
While the function above could be applied to a whole XML file, it won’t do things like trimming leading and trailing whitespace after reducing to basic Latin.
The Python 3 script below, however, reads an XML file, reduces any text to basic Latin, trims it, and then outputs the XML in UTF-8.
```python
import re
import sys
import unicodedata
from xml import sax
from xml.sax.saxutils import XMLGenerator


def reduce_to_basic_latin(text: str) -> str:
    if not text:
        return text
    nfkd_form = unicodedata.normalize('NFKD', text)
    basic_latin = (ch for ch in nfkd_form if ord(ch) < 0x80)
    reduced_form = ''.join(basic_latin)
    return reduced_form


class ReduceToBasicLatinContentHandler(XMLGenerator):
    def __init__(self):
        # Write the resulting XML to standard output as UTF-8.
        super().__init__(sys.stdout, 'utf-8')
        self.whitespace_re = re.compile(r'(\s{2,})')

    def startElement(self, name, attrs):
        # Reduce attribute values as well as element text.
        attrs = {k: reduce_to_basic_latin(v) for k, v in attrs.items()}
        super().startElement(name, attrs)

    def characters(self, content):
        # Only touch text that contains something outside basic Latin.
        if any(ord(ch) > 0x7F for ch in content):
            stripped_content = content.strip()
            if stripped_content:
                content = reduce_to_basic_latin(stripped_content).strip()
                # Collapse runs of whitespace left behind by removed characters.
                content = self.whitespace_re.sub(' ', content)
        super().characters(content)


if __name__ == '__main__':
    sax.parse(sys.stdin, ReduceToBasicLatinContentHandler())
```
How to use it
Put the script above into a file called reduce-to-basic-latin.py.
Put the following XML document in a file called a.xml and ensure your editor saves the file as UTF-8.
```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<users>
  <user>
    <record_type>PUBLIC</record_type>
    <primary_id>test</primary_id>
    <first_name>Öçàr</first_name>
    <middle_name>field™</middle_name>
    <last_name>Jiménez</last_name>
    <user_notes>
      <user_note>
        <note_type>LIBRARY</note_type>
        <note_text>This 【敏捷的棕色狐狸跳过了懒狗】 should be stripped.</note_text>
        <user_viewable>false</user_viewable>
      </user_note>
    </user_notes>
  </user>
</users>
```
In your terminal/shell, run python3 reduce-to-basic-latin.py < a.xml
You will see this output.
```xml
<?xml version="1.0" encoding="utf-8"?>
<users>
  <user>
    <record_type>PUBLIC</record_type>
    <primary_id>test</primary_id>
    <first_name>Ocar</first_name>
    <middle_name>fieldTM</middle_name>
    <last_name>Jimenez</last_name>
    <user_notes>
      <user_note>
        <note_type>LIBRARY</note_type>
        <note_text>This should be stripped.</note_text>
        <user_viewable>false</user_viewable>
      </user_note>
    </user_notes>
  </user>
</users>
```
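If you would rather try the idea without a shell, the sketch below runs a cut-down handler on an in-memory string. ReducingGenerator and reduce_xml are hypothetical names introduced here. Unlike the full script, this version reduces element text only and does no trimming or attribute handling.

```python
import io
import unicodedata
from xml import sax
from xml.sax.saxutils import XMLGenerator


def reduce_to_basic_latin(text: str) -> str:
    nfkd_form = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in nfkd_form if ord(ch) < 0x80)


class ReducingGenerator(XMLGenerator):
    """Hypothetical cut-down handler: reduces character data only."""

    def characters(self, content):
        super().characters(reduce_to_basic_latin(content))


def reduce_xml(xml_text: str) -> str:
    """Parse xml_text and return the regenerated XML as a string."""
    out = io.StringIO()
    handler = ReducingGenerator(out, 'utf-8')
    sax.parseString(xml_text.encode('utf-8'), handler)
    return out.getvalue()


result = reduce_xml('<user><last_name>Jiménez</last_name></user>')
print(result)
```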
Footnote on using UTF-8 encoding in imports
At the bottom of the SIS import documentation is a note that the input file must be escaped based on xml encoding, with a link to a 2014 blog post that says you need to use XML character references for characters outside of basic Latin.
Since at least 2019, The University of Sydney has been sending XML files to Alma in UTF-8 encoding, without XML character references, and Alma has accepted and processed them successfully.
“All XML processors must accept the UTF-8 and UTF-16 encodings of Unicode.” [https://www.w3.org/TR/2008/REC-xml-20081126/#charsets]
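To illustrate why the two forms are interchangeable for a conforming parser, the sketch below parses the same name once with an XML character reference and once as raw UTF-8, using Python's xml.etree.ElementTree. It says nothing about Alma's own parser; it only shows what the XML specification leads you to expect.

```python
import xml.etree.ElementTree as ET

# The same element, once with a character reference and once as raw UTF-8.
with_reference = b'<last_name>Jim&#233;nez</last_name>'
with_utf8 = '<last_name>Jiménez</last_name>'.encode('utf-8')

text_from_reference = ET.fromstring(with_reference).text
text_from_utf8 = ET.fromstring(with_utf8).text

print(text_from_reference, text_from_utf8)
```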
2 Replies to “Accented characters in SIS import and export”
Thanks Jim. We have updated the documentation (and removed the outdated blog post) following your useful footnote.
Ori
Alma API Team
Post on Alma’s Forum on the topic: https://developers.exlibrisgroup.com/forums/topic/accented-characters-in-sis-import-and-export/