Introducing Potnia: A Python language library for the conversion of ancient texts to Unicode

Saturday 1:30 PM–2:00 PM in Eureka 2

The last five years have seen a significant increase in the application of machine learning to the study of ancient scripts. Applications are broad, and include optical character recognition (OCR), textual restoration, palaeographic analysis, topic modelling, representation learning, decipherment and machine translation (Sommerschield et al. 2023). A large number of ancient language corpora have been digitised in recent decades, supporting this research. However, while the necessary Unicode blocks for many of these ancient scripts are available, a number of these data sets are still presented as Romanised transliterations.

In response to this situation, we have created Potnia (https://pypi.org/project/potnia/), an open-source Python language library under the Apache 2.0 license, designed to convert such transliterated texts to Unicode. The session image accompanying this proposal provides an example of Potnia’s conversion process, with a Romanised transliteration of a Linear B text as the input, and the Unicode representation of this same text as the output. This conversion is crucial for downstream machine learning tasks, as tokenisation in the original Unicode script allows for more accurate representation of linguistic structures and mitigates potential biases introduced by transliteration.

Potnia's flexible architecture, built on Python's object-oriented principles, employs string manipulation techniques and regular expressions to handle various complexities inherent in ancient texts, such as uncertain readings, missing elements, and script-specific notations. At present, the library can be used for Linear B texts, with functionality for Linear A, Sumerian and Akkadian soon to follow.
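To make the idea concrete, here is a minimal sketch of transliteration-to-Unicode conversion using regular expressions and string manipulation. It is not Potnia's implementation or API: the function name `to_unicode`, the editorial conventions (square brackets for damage, `?` for doubt, `...` for a lacuna) and the choice of output placeholder are all assumptions made for illustration. The sign table is built from Python's `unicodedata` module rather than taken from the library.

```python
import re
import unicodedata

# Build a syllable -> character table from the Unicode Character Database
# (Linear B Syllabary block, U+10000..U+1007F) instead of hard-coding it.
SYLLABLES = {}
for cp in range(0x10000, 0x10080):
    name = unicodedata.name(chr(cp), "")
    if name.startswith("LINEAR B SYLLABLE "):
        # e.g. "LINEAR B SYLLABLE B008 A" -> key "a"
        SYLLABLES[name.split()[-1].lower()] = chr(cp)

# Assumed editorial marks: brackets around damaged text, "?" for doubt.
EDITORIAL = re.compile(r"[\[\]?]")
# Assumed lacuna notation in the input, and an assumed placeholder in the output.
LACUNA_IN = "..."
LACUNA_OUT = "\u2026"


def to_unicode(text: str) -> str:
    """Convert a hyphen-separated transliteration to Linear B characters."""
    words = []
    for word in text.split():
        signs = []
        for token in word.split("-"):
            cleaned = EDITORIAL.sub("", token)
            if cleaned == LACUNA_IN:
                signs.append(LACUNA_OUT)
            else:
                # Tokens with no known sign (e.g. logograms) pass through unchanged.
                signs.append(SYLLABLES.get(cleaned.lower(), token))
        words.append("".join(signs))
    return " ".join(words)


print(to_unicode("po-ti-ni-ja"))
print(to_unicode("po-[ti]-ni-ja?"))
```

Both calls print the same four signs, because the second input differs only in editorial marks that this sketch strips before mapping. A production converter would need a policy for preserving, rather than discarding, that uncertainty information.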

Potnia's design allows for easy addition of new scripts, each with its own set of rules for tokenisation, regularisation, and character mapping. This extensibility positions us well for future inclusion of additional scripts. To ensure reliability and facilitate open-source contributions, we've implemented a comprehensive test suite using pytest, with test cases defined in YAML files for easy expansion. This approach covers key functionalities across different scripts and simplifies the process of adding new test scenarios as the library grows.
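One way such a per-script design could look is sketched below. The `ScriptRules` class, the `SCRIPTS` registry, the toy two-sign mapping and the `run_cases` helper are hypothetical names invented for this illustration and do not describe Potnia's actual classes or test layout. The test cases are written as plain dictionaries, shaped like records that might be loaded from a YAML file.

```python
import re
from dataclasses import dataclass


@dataclass
class ScriptRules:
    """Hypothetical bundle of the three rule sets one script needs."""
    token_pattern: str                     # tokenisation: regex matching one token
    regularisation: list[tuple[str, str]]  # (regex, replacement) pairs, applied in order
    mapping: dict[str, str]                # transliterated token -> Unicode string

    def convert(self, text: str) -> str:
        for pattern, replacement in self.regularisation:
            text = re.sub(pattern, replacement, text)
        tokens = re.findall(self.token_pattern, text)
        # Tokens without a mapping (spaces, unknown signs) pass through unchanged.
        return "".join(self.mapping.get(token, token) for token in tokens)


# Toy two-sign rule set; a real script would need its full sign inventory.
toy_linear_b = ScriptRules(
    token_pattern=r"[a-z0-9]+|\s+",
    regularisation=[(r"[\[\]]", "")],  # drop damage brackets
    mapping={"a": "\U00010000", "da": "\U00010005"},
)

# Under this sketch, adding a script means registering another ScriptRules instance.
SCRIPTS = {"toy_linear_b": toy_linear_b}

# Data-driven cases, shaped like records that could be loaded from a YAML file.
CASES = [
    {"script": "toy_linear_b", "input": "a-da",
     "expected": "\U00010000\U00010005"},
    {"script": "toy_linear_b", "input": "[a]-da a",
     "expected": "\U00010000\U00010005 \U00010000"},
]


def run_cases(cases):
    """Return the cases whose converted output differs from the expectation."""
    return [c for c in cases
            if SCRIPTS[c["script"]].convert(c["input"]) != c["expected"]]


print(run_cases(CASES))
```

With pytest, the same list of cases could be fed to `@pytest.mark.parametrize`, so that extending coverage only requires adding records to the data file.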

References

Sommerschield, T., Y. Assael, J. Pavlopoulos, V. Stefanak, A. Senior, C. Dyer, J. Bodel, J. Prag, I. Androutsopoulos, and N. de Freitas. 2023. “Machine Learning for Ancient Languages: A Survey.” Computational Linguistics 49 (3): 1–45. doi:10.1162/coli_a_00481.

Emily Tour she/they • @TourEmily

Emily Tour (she/they) is an archaeologist and PhD candidate at the University of Melbourne. Their research focuses on the study of Bronze Age Aegean administrative documents; in particular, the application of digital methods such as 3D modelling, shape analysis and phylogenetics to better understand these artefacts. In addition to their PhD research, Emily is involved in an ongoing collaboration with the Melbourne Data Analytics Platform (MDAP), exploring the application of deep learning techniques to the decipherment of ancient scripts, including the presently undeciphered Linear A. Prior to retraining as an archaeologist, Emily worked in the IT industry as a software tester and business analyst, and is passionate about combining both her digital and archaeological skills for improved research outcomes, as well as supporting and promoting the uptake of digital techniques in the humanities. Emily is a current committee member for the Australasian chapter of Computer Applications and Quantitative Methods in Archaeology (CAA).

Kabir Manandhar Shrestha