Tech Blog

Accented characters in SIS import and export

Alma accepts Unicode characters in the first name, middle name and last name fields of Users (among others). It even finds names with accents when you search without typing the accents. For example, searching for Jimenez finds Jiménez.

For the user XML you give Alma’s SIS import job, the only characters you need to explicitly escape are the ones required by XML itself. So < needs to become &lt;, > needs to become &gt;, and & itself needs to become &amp;.
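
As a small illustration of that escaping rule, Python's standard library has xml.sax.saxutils.escape, which replaces exactly these three characters. The sample value below is made up for the example and is not taken from any real Alma record.

```python
from xml.sax.saxutils import escape

# Hypothetical field value containing all three characters that XML reserves.
raw_name = 'Smith & Jones <Library>'
escaped_name = escape(raw_name)
print(escaped_name)  # Smith &amp; Jones &lt;Library&gt;

# Accented characters pass through untouched; the file encoding handles them.
accented_name = escape('Jim\u00e9nez')
print(accented_name)
```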

Unicode characters in general need to be output in UTF-8 encoding. So é needs to be output as the byte sequence c3 a9 in the file. This is how Alma’s SIS export job outputs XML. Alma’s SIS import job also accepts this encoding. (See the footnote.)
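
To check the byte sequence for yourself, a quick sketch like the following should do. When writing a real file, passing encoding='utf-8' to Python's built-in open() produces the same bytes.

```python
# U+00E9 is LATIN SMALL LETTER E WITH ACUTE (é), written as an escape so the
# example does not depend on how this page itself is encoded.
text = '\u00e9'
encoded = text.encode('utf-8')
print(encoded.hex())  # c3a9

# Decoding the same bytes as UTF-8 gives back the original character.
decoded = encoded.decode('utf-8')
```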

If you want only basic (unaccented) Latin characters in your user fields, you can strip everything else out. The way to reduce a general Unicode string to basic Latin is to convert the string into Normalization Form KD (KD = Compatibility Decomposition) and then strip out any code points greater than 7F (hexadecimal).

This technique can be used for both SIS import and export.

Python implementation

The following function in Python 3 implements this technique.

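import unicodedata

def reduce_to_basic_latin(text: str) -> str:
    """Reduce text to basic Latin (ASCII) characters."""
    # Decompose characters, e.g. é becomes e followed by a combining accent.
    nfkd_form = unicodedata.normalize('NFKD', text)
    # Keep only code points below 0x80 (basic Latin).
    basic_latin = (ch for ch in nfkd_form if ord(ch) < 0x80)
    reduced_form = ''.join(basic_latin)
    return reduced_form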

How it works

From Python’s REPL, you can see how this works.

>>> text = '你好 Jiménez'
>>> bytes(text, 'utf-8')
b'\xe4\xbd\xa0\xe5\xa5\xbd Jim\xc3\xa9nez'

When the text ‘你好 Jiménez’ is encoded as UTF-8, 你 encodes as e4 bd a0, 好 encodes as e5 a5 bd, é encodes as c3 a9, and the other characters encode as their ASCII values (which Python’s REPL shows as plain characters rather than hexadecimal escapes).

>>> nfkd_form = unicodedata.normalize('NFKD', text)
>>> bytes(nfkd_form, 'utf-8')
b'\xe4\xbd\xa0\xe5\xa5\xbd Jime\xcc\x81nez'

After the text is normalized into Normalization Form KD, the encoding remains the same except for é, which goes from c3 a9 to e followed by cc 81. UTF-8 c3 a9 is LATIN SMALL LETTER E WITH ACUTE, while UTF-8 cc 81 is COMBINING ACUTE ACCENT. So, in NFKD, é goes from one character to two.
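
If you want to see the decomposition at the code point level rather than the byte level, a sketch along these lines uses unicodedata.name to print the official Unicode name of each code point.

```python
import unicodedata

composed = '\u00e9'  # é as a single code point
decomposed = unicodedata.normalize('NFKD', composed)

# One code point before normalization, two after.
print(len(composed), len(decomposed))  # 1 2

composed_name = unicodedata.name(composed)
decomposed_names = [unicodedata.name(ch) for ch in decomposed]
print(composed_name)
print(decomposed_names)
```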

This happens not just for é. The Unicode standard specifies how all of the characters with Latin base characters decompose in this way. For the compatibility decomposition, it goes on to specify how characters like the single ligature character ﬁ (U+FB01) decompose into f followed by i. This is handy when someone copies from a Word document.
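
The difference between canonical and compatibility decomposition can be seen with the ligature itself. The following sketch compares NFD (canonical only) with NFKD for U+FB01 and for the trade mark sign U+2122, which is used later in this post.

```python
import unicodedata

ligature = '\ufb01'    # LATIN SMALL LIGATURE FI
trade_mark = '\u2122'  # TRADE MARK SIGN

# Canonical decomposition (NFD) leaves compatibility characters alone.
nfd_ligature = unicodedata.normalize('NFD', ligature)

# Compatibility decomposition (NFKD) replaces them with plain letters.
nfkd_ligature = unicodedata.normalize('NFKD', ligature)
nfkd_trade_mark = unicodedata.normalize('NFKD', trade_mark)

print(nfkd_ligature, nfkd_trade_mark)  # fi TM
```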

Here is a table of the actual bytes in the NFKD form for our example. The last column marks the bytes that are greater than 7F; if these are removed, you are left with a string containing only basic Latin characters.

Character   Hex (UTF-8)   Greater than 7F
你          e4 bd a0      yes
好          e5 a5 bd      yes
[space]     20            no
J           4a            no
i           69            no
m           6d            no
e           65            no
[acute]     cc 81         yes
n           6e            no
e           65            no
z           7a            no

You might have noticed that after removing the bytes greater than 7F, the first character of what remains is a space. Generally, you will want to trim/strip whitespace from both ends of the string.
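
The byte-level view can be reproduced directly. This sketch filters the UTF-8 bytes rather than the code points; the two approaches give the same result, because every byte of a multi-byte UTF-8 sequence is greater than 7F.

```python
import unicodedata

text = '\u4f60\u597d Jim\u00e9nez'  # 你好 Jiménez
nfkd_bytes = unicodedata.normalize('NFKD', text).encode('utf-8')

# Drop every byte greater than 7F; what remains is pure ASCII.
kept = bytes(b for b in nfkd_bytes if b < 0x80)
reduced = kept.decode('ascii')

# The space that separated the two words is now at the front, so trim it.
trimmed = reduced.strip()
print(repr(reduced), repr(trimmed))  # ' Jimenez' 'Jimenez'
```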

Reducing a whole file

While the function above could be applied to a whole XML file, it won’t do things like trimming the leading and trailing whitespace after reducing to basic Latin.

The Python 3 script below, however, reads an XML document from standard input, reduces any text to basic Latin, trims the reduced text, and then writes the XML to standard output in UTF-8.

Security warning: Do not use with untrusted data. “The Python XML processing modules are not secure against maliciously constructed data. An attacker can abuse XML features to carry out denial of service attacks, access local files, generate network connections to other machines, or circumvent firewalls.” [https://docs.python.org/3.7/library/xml.html#xml-vulnerabilities]

import re
import sys
import unicodedata
from xml import sax
from xml.sax.saxutils import XMLGenerator

def reduce_to_basic_latin(text: str) -> str:
    if not text:
        return text
    nfkd_form = unicodedata.normalize('NFKD', text)
    basic_latin = (ch for ch in nfkd_form if ord(ch) < 0x80)
    reduced_form = ''.join(basic_latin)
    return reduced_form

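class ReduceToBasicLatinContentHandler(XMLGenerator):
    """SAX handler that re-emits the XML with text reduced to basic Latin."""

    def __init__(self):
        # Write the regenerated XML to standard output as UTF-8.
        super().__init__(sys.stdout, 'utf-8')
        # Matches runs of two or more whitespace characters, which can be
        # left behind when characters are removed from the middle of text.
        self.whitespace_re = re.compile(r'(\s{2,})')

    def startElement(self, name, attrs):
        # Reduce attribute values as well as element text.
        attrs = {k: reduce_to_basic_latin(v) for k, v in attrs.items()}
        super().startElement(name, attrs)

    def characters(self, content):
        # Only touch text containing a code point above 7F, so that
        # indentation and plain ASCII text pass through unchanged.
        if any(ord(ch) > 0x7F for ch in content):
            stripped_content = content.strip()
            if stripped_content:
                content = reduce_to_basic_latin(stripped_content).strip()
                content = self.whitespace_re.sub(' ', content)
        super().characters(content)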

if __name__ == '__main__':
    sax.parse(sys.stdin, ReduceToBasicLatinContentHandler())
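
Regarding the security warning above, one partial mitigation is to build the parser yourself and explicitly switch off external entities. This is only a sketch and does not address every attack listed in the Python documentation, so the warning about untrusted data still applies. The names TextCollector and make_restricted_parser are made up for this example and are not part of the script above.

```python
import io
from xml import sax
from xml.sax.handler import (ContentHandler, feature_external_ges,
                             feature_external_pes)

class TextCollector(ContentHandler):
    """Hypothetical handler that just records the text it is given."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

def make_restricted_parser(handler):
    """Return a SAX parser that will not fetch external entities."""
    parser = sax.make_parser()
    parser.setFeature(feature_external_ges, False)  # external general entities
    parser.setFeature(feature_external_pes, False)  # external parameter entities
    parser.setContentHandler(handler)
    return parser

collector = TextCollector()
parser = make_restricted_parser(collector)
parser.parse(io.BytesIO('<last_name>Jim\u00e9nez</last_name>'.encode('utf-8')))
collected_text = ''.join(collector.chunks)
print(collected_text)
```

In the script above, the same idea would presumably mean calling parser.parse(sys.stdin) on such a parser instead of sax.parse(sys.stdin, ...).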

How to use it

Put the script above into a file called reduce-to-basic-latin.py.

Put the following XML document in a file called a.xml and ensure your editor saves the file as UTF-8.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<users>
    <user>
        <record_type>PUBLIC</record_type>
        <primary_id>test</primary_id>
        <first_name>Öçàr</first_name>
        <middle_name>field™</middle_name>
        <last_name>Jiménez</last_name>
        <user_notes>
            <user_note>
                <note_type>LIBRARY</note_type>
                <note_text>This 【敏捷的棕色狐狸跳过了懒狗】 should be stripped.</note_text>
                <user_viewable>false</user_viewable>
            </user_note>
        </user_notes>
    </user>
</users>

In your terminal/shell, run python3 reduce-to-basic-latin.py < a.xml

You will see this output.

<?xml version="1.0" encoding="utf-8"?>
<users>
    <user>
        <record_type>PUBLIC</record_type>
        <primary_id>test</primary_id>
        <first_name>Ocar</first_name>
        <middle_name>fieldTM</middle_name>
        <last_name>Jimenez</last_name>
        <user_notes>
            <user_note>
                <note_type>LIBRARY</note_type>
                <note_text>This should be stripped.</note_text>
                <user_viewable>false</user_viewable>
            </user_note>
        </user_notes>
    </user>
</users>
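
If you want to confirm that a reduced file really contains nothing beyond basic Latin, a small check such as the one below may help. The function name find_non_basic_latin and the two sample strings are assumptions made for this sketch; in practice you would pass in the text of your own file.

```python
def find_non_basic_latin(text):
    """Return (line_number, character) pairs for code points above 7F."""
    return [
        (line_number, ch)
        for line_number, line in enumerate(text.splitlines(), start=1)
        for ch in line
        if ord(ch) > 0x7F
    ]

# Two lines from the example, before and after running the script.
original_input = ('<middle_name>field\u2122</middle_name>\n'
                  '<last_name>Jim\u00e9nez</last_name>')
reduced_output = ('<middle_name>fieldTM</middle_name>\n'
                  '<last_name>Jimenez</last_name>')

problems_before = find_non_basic_latin(original_input)
problems_after = find_non_basic_latin(reduced_output)
print(problems_before)
print(problems_after)  # []
```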

Footnote on using UTF-8 encoding in imports

At the bottom of the SIS import documentation is a note that the input file must be escaped based on XML encoding. The note links to a 2014 blog post that says you need to use XML character references for characters outside of basic Latin.

Since at least 2019, The University of Sydney has been successfully sending XML files to Alma using UTF-8 encoding and not using XML character references. Alma has been accepting them and successfully processing them.

“All XML processors must accept the UTF-8 and UTF-16 encodings of Unicode.” [https://www.w3.org/TR/2008/REC-xml-20081126/#charsets]
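
To illustrate why the two approaches are equivalent to a conforming XML processor, the sketch below parses the same name written both ways with Python's xml.etree.ElementTree. It says nothing about Alma's own parser; it only shows that a standard parser yields the same text either way.

```python
import xml.etree.ElementTree as ET

name = 'Jim\u00e9nez'  # Jiménez

# Option 1: the character itself, encoded as UTF-8.
as_utf8 = f'<last_name>{name}</last_name>'.encode('utf-8')

# Option 2: an XML character reference in place of the non-ASCII character.
as_char_refs = (b'<last_name>'
                + name.encode('ascii', 'xmlcharrefreplace')
                + b'</last_name>')
print(as_char_refs)  # b'<last_name>Jim&#233;nez</last_name>'

# Both documents parse to the same text.
text_from_utf8 = ET.fromstring(as_utf8).text
text_from_char_refs = ET.fromstring(as_char_refs).text
```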
