
Interscript

Interoperable script conversion systems

Introduction

General

Evaluating existing transliteration technologies for Farsi, we found that the available solutions do little more than basic character mappings and therefore do not deliver a quality result.

To improve on this, and since no good-quality transliteration data was available, we had no alternative but to write our own transliteration code.

This implies not only building a highly complex solution (modelling mappings, solving many special cases and accounting for the complexity of the language), but also porting that solution into production, away from many of the scientific and NLP tools necessary for an advanced design.

On top of that, we also experienced that the linguist and the developer had quite different focuses and understandings, which are not always easy to integrate, creating misunderstandings and efficiency losses.

We decided to solve this challenge with an innovative, aggressively data-based strategy:

  1. The transliteration code is designed by the linguist via a graphical interface (diagrams/graphs).

  2. The transliteration code is then rebuilt on transformers (neural networks), trained as a translation problem.

This has been successfully applied to Farsi, and we think the workflow is portable to any other language: applying steps 1 + 2 wherever no transliteration corpus large enough for step 2 alone is available.

Below, we present the approach in more detail before reviewing our scores and discussing the results at the end of the document.

Transliteration for Farsi

Transliteration is the process of converting text from one script into another, mapping the letters so as to preserve the sounds.

EXAMPLE: in Farsi transliteration, مزایایی becomes mazAyAyi.

Motivation behind the Approach

Building our initial solution, we faced the following problems:

  • Transliteration code had to implement highly complex rules.

  • The language expert’s and the developer’s workflows were quite different and difficult to integrate. For instance, the developer is asked to encode complex rules that he or she has little chance of fully comprehending.

As a result, not only does the code take a considerable amount of time to write, it also offers little flexibility for tweaks, redesigns or even debugging.

For that reason, we considered an alternative approach consisting of:

  1. Empowering language experts in the process of designing complex rules and logic

  2. Encapsulating both language expert and developer workflows

  3. Building a flexible solution that can be visualized and tweaked by rearranging logic blocks

Furthermore, as production runs in JavaScript generated from Ruby, key libraries do not exist there and would need to be fully reverse-engineered.

Fortunately, neural networks can be used here to learn the necessary set of complex rules, provided that large, good-quality datasets are generated with our code. For this, we could partially rely on some of the team’s previous R&D work, secryst.

Strategy

General

In a nutshell:

  1. Transliteration design + code generation

    • linguist designs code diagrams

    • nodes implemented by developer

    • code diagrams are exported (into CSV)

    • transliteration code generated/"learnt" from the diagram export

  2. Learning of transliteration with NNets

    • transliteration dataset creation with above code

    • training of Transformers (Neural Network)

  3. Production/Transliteration in Ruby

    • torch neural networks exported to ONNX from Python

    • production code in Ruby

Designing transliteration logic with diagrams

Our linguist worked out an extensive, original solution combining various mappings, rules and transformations (lemmatization, stemming, PoS tagging, …).

Better than any lengthy description, the full architecture is available online as a diagram: transliteration strategy.

After experimenting with various approaches for this project, we decided to work with Lucidchart. However, any technology that allows creating workflows, and whose exports can be massaged into the same format, could be used as an alternative.

Building code from graphs

Once exported into CSV format, the diagrams can be parsed and modelled as trees, with a rule associated to each node.
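
For illustration, here is a minimal sketch (in Python, whereas the project code is Julia) of turning a flat diagram export into such a tree. The id/parent/text column names are hypothetical placeholders and do not reflect the actual Lucidchart export schema.

# Minimal sketch: build a tree of rule nodes from a flat CSV diagram export.
# Column names (id, parent, text) are hypothetical placeholders.
import csv
from collections import defaultdict

def load_tree(path):
    nodes, children = {}, defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            nodes[row["id"]] = row["text"]      # plain-English rule label
            if row["parent"]:                   # an empty parent marks the root
                children[row["parent"]].append(row["id"])
    return nodes, children

def show(nodes, children, node_id, depth=0):
    # Print the tree with one rule per line, indented by depth.
    print("  " * depth + nodes[node_id])
    for child in children[node_id]:
        show(nodes, children, child, depth + 1)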

Among the problems that had to be solved:

  1. Modelling of logic components as Trees

  2. Modelling of computation "states"

  3. Jumps from one logic flow to another

  4. Handling recursions

We use a functional approach, with the data "states" being transformed as they "flow" through the graphical, logical structures.

Details can be found in the original Julia code.

Nodes/Rules

Each node is associated with an action to be performed, described in plain English. For instance, in the diagram example:

node|> "output its transliteration!"
node|> "is the prefix ب or بی?"
node|> "is it a verb?"
node|> "transliterate it using affix-handler"
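
As a hypothetical illustration (the actual implementation is the Julia code linked below), each of these plain-English labels can be dispatched to a small handler that transforms the computation "state" as it flows through the graph; the handlers, lookup table and state fields here are invented for the example.

# Hypothetical sketch: plain-English node labels dispatched to handlers that
# transform a "state" dictionary as it flows through the graph.
DB = {"یویو": "yoyo"}  # toy stand-in for the mapping database

HANDLERS = {
    "is the word found in the db?":
        lambda s: {**s, "state": "yes" if s["word"] in DB else "no"},
    "output its transliteration!":
        lambda s: {**s, "output": DB[s["word"]]},
}

def run_node(label, state):
    # Apply the handler registered for this node label to the current state.
    return HANDLERS[label](state)

state = {"word": "یویو", "pos": "Noun", "state": None}
state = run_node("is the word found in the db?", state)
state = run_node("output its transliteration!", state)
print(state["output"])  # yoyo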

Our code can be found with more explanations here: learn-graph code.

The current code is based on a design with over 150 nodes, covering both main and edge cases. This represents a very complex solution.

Assets used and created

Mappings

As mentioned above, mappings are an essential part of the workflow; in the code they were transformed into lookup tables. Our rules combine NLP tags to break words down into smaller segments, for each of which a transliteration is found in the database.

The main mapping database we used for this project was a 50k-word database previously used in a Farsi text-to-speech project called Parskhan.

It includes word roots, their frequency in conversation, and affixes that can be attached to those roots. That database had to be edited on multiple occasions.
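
As a hedged illustration of how such tables can be used, the sketch below breaks a word into a root plus an affix and looks both parts up; the entries and the helper are invented for the example and are not taken from the Parskhan database.

# Hypothetical sketch: split a word into root + affix and look both parts up
# in mapping tables. The entries are invented for illustration.
ROOTS = {"مزایا": "mazAyA"}
AFFIXES = {"یی": "yi"}

def transliterate_with_affix(word):
    # Try every split point, longest root first; the remainder must be a known affix.
    for i in range(len(word), 0, -1):
        root, affix = word[:i], word[i:]
        if root in ROOTS and (not affix or affix in AFFIXES):
            return ROOTS[root] + AFFIXES.get(affix, "")
    return None

print(transliterate_with_affix("مزایایی"))  # mazAyAyi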

Mappings, Data creation and generation

Most of the datasets we applied our transliteration method to were publicly available from Farsi NLP communities and GitHub repositories.

As a second step, transliteration data was generated by applying our diagram-generated code to these datasets.

We also produced a small test dataset to benchmark various transliteration algorithms. With this data, we tried to cover many of the cases our rules were designed to solve.

NLP in Farsi

After some research, we selected the hazm library. As already mentioned, it is available only in Python, but we could use neural networks to bypass this issue for production, as explained below.
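
For reference, this is roughly the kind of hazm calls involved (normalization, tokenization, lemmatization, PoS tagging); the POS tagger needs a pretrained model file, and the path below is only a placeholder.

# Basic hazm usage: normalization, tokenization, lemmatization and POS tagging.
# The postagger model path is a placeholder for wherever the model is stored.
from hazm import Normalizer, word_tokenize, Lemmatizer, POSTagger

text = Normalizer().normalize("مزایایی")
tokens = word_tokenize(text)
print(Lemmatizer().lemmatize(tokens[0]))

tagger = POSTagger(model="resources/postagger.model")
print(tagger.tag(tokens))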

Workflow details for Linguist & Developer

While the developer’s job consists of implementing the above commands (searching tables, comparing and concatenating strings, …), the linguist can produce various nodes carrying commands and organize or re-organize them in the graphical editor.

In more detail:

  1. Starting from a diagram, the linguist can use a graphical editor to design various rules. In this process:

    • new rules can be created

    • existing rules and branches can be re-arranged

  2. If a new node has to be created, interact with a developer to implement it.

  3. Learn/build code from the graphs

  4. Run tests and benchmarks and review results and bugs

  5. Run single examples with an extensive debugging mode

  6. Back to 1.

Below, we show the code output in fully verbose debug mode. The linguist can track the computation steps and help identify bugs and inaccuracies.

> julia transliterateSingleString.jl --path-model resources/Model0.9.dat --farsi-text یویو --pos-tagging noun
[ Info: ("brain name ::> ", "transliterator")
[ Info: ("data::> ", Dict{String, Any}("brain" => "transliterator", "pos" => "Noun", "word" => "یویو", "pre_pos" => nothing, "state" => nothing))
[ Info: ("node::> ", "change all instances of ي and ك in the text to ی and ک")
[ Info: ("data::> ", Dict{String, Any}("brain" => "transliterator", "pos" => "Noun", "word" => "یویو", "pre_pos" => nothing, "state" => nothing))
[ Info: ("node::> ", "is the word found in the db?")
[ Info: ("response::> ", "yes")
[ Info: ("data::> ", Dict{String, Any}("brain" => "transliterator", "data" => Dict{Any, Any}[Dict("الگوی تکیه" => "WS", "WrittenForm" => "یویو", "PhonologicalForm" => "yoyo", "Freq" => 1, "SynCatCode" => "N1")], "pos" => "Noun", "word" => "یویو", "pre_pos" => nothing, "state" => "yes"))
[ Info: ("node::> ", "collision?")
[ Info: ("response::> ", "no")
[ Info: ("data::> ", Dict{String, Any}("brain" => "transliterator", "data" => Dict{Any, Any}[Dict("الگوی تکیه" => "WS", "WrittenForm" => "یویو", "PhonologicalForm" => "yoyo", "Freq" => 1, "SynCatCode" => "N1")], "pos" => "Noun", "word" => "یویو", "pre_pos" => nothing, "state" => "no"))
[ Info: ("node::> ", "output its transliteration!")
yoyo

We also provide a test to assess the strategy’s overall scores. Mismatches are output to a local directory for further analysis, for instance with the above mode.

> julia run.jl --path-model resources/Model5.dat --file-name test
accuracy: 0.7019607843137254
error summary in: tests/test_debug.csv
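
The accuracy reported above is plain word accuracy, i.e. (assuming exact string matching) the fraction of test words whose predicted transliteration equals the reference, as in this minimal sketch:

# Word accuracy: fraction of exact matches between predictions and references.
def word_accuracy(predictions, references):
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

print(word_accuracy(["yoyo", "mazAyA"], ["yoyo", "mazAyAyi"]))  # 0.5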

Learning to transliterate with transformers

Transformers are a modern neural network architecture ("Attention Is All You Need") used for transduction problems such as language modelling and translation. They can naturally be applied to the problem of learning to transliterate.

Various libraries can be found online. We experimented with multiple approaches, character-based or word-based. The method currently implemented in production is the latter.
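
To make the framing concrete, the sketch below shows how transliteration pairs can be written out as character-level parallel "translation" data; the file names and the pair list are only an example of the format, not our actual dataset.

# Minimal sketch of framing transliteration as a translation problem:
# each (Farsi word, Latin transliteration) pair becomes a source/target
# sequence of space-separated tokens, here character-based.
pairs = [("مزایایی", "mazAyAyi"), ("یویو", "yoyo")]

def as_char_example(src, tgt):
    # Character-based: every character becomes one token.
    return " ".join(src), " ".join(tgt)

with open("train.src", "w", encoding="utf-8") as fs, \
     open("train.tgt", "w", encoding="utf-8") as ft:
    for src, tgt in pairs:
        s, t = as_char_example(src, tgt)
        fs.write(s + "\n")
        ft.write(t + "\n")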

Several resources are available online to explain transformers.

Training and ONNX conversion

As in other projects, after training, ONNX was used to port the trained neural networks to a universal format.

This work (training + ONNX export) can be found in our Python scripts.
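
A minimal sketch of the kind of export involved, using torch.onnx.export on a stand-in module; the module, shapes and file name are placeholders, not the project’s actual export script.

# Minimal sketch: export a trained PyTorch module to ONNX.
# The module, tensor shapes and file name are placeholders.
import torch

model = torch.nn.Linear(16, 8)   # stand-in for a trained (sub)module
model.eval()
dummy_input = torch.zeros(1, 16)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)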

Implementation of greedy decoding

In production, we found that various components (neural networks) of the transformers had to be exported, such as the generator, tokenizers, encoder and decoder.

They then had to be combined correctly in our native Ruby code.
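
The production code is Ruby, but the same greedy loop can be sketched in Python with onnxruntime; the input/output names, token ids and file names below are hypothetical.

# Python sketch of greedy decoding with separately exported encoder/decoder/
# generator ONNX graphs. Names, ids and files are hypothetical placeholders.
import numpy as np
import onnxruntime as ort

BOS, EOS, MAX_LEN = 1, 2, 64

encoder = ort.InferenceSession("encoder.onnx")
decoder = ort.InferenceSession("decoder.onnx")
generator = ort.InferenceSession("generator.onnx")

def greedy_decode(src_ids):
    # Encode the source sequence once.
    memory = encoder.run(None, {"src": np.array([src_ids], dtype=np.int64)})[0]
    tgt = [BOS]
    for _ in range(MAX_LEN):
        # Re-run the decoder on the growing target prefix.
        dec_out = decoder.run(None, {"tgt": np.array([tgt], dtype=np.int64),
                                     "memory": memory})[0]
        # The generator maps the last decoder state to vocabulary logits.
        logits = generator.run(None, {"dec": dec_out[:, -1]})[0]
        next_id = int(np.argmax(logits, axis=-1)[0])
        tgt.append(next_id)
        if next_id == EOS:
            break
    return tgt[1:]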

Benchmarking

Scores

Our codes can be tested and benchmarked with the test dataset we designed. We report a simple metric, ACCU (word accuracy, in %):

Code           ACCU
CODE 0.9       96%*
CODE D         70%
CODE Transfo   65%
CODE Ruby      65%
UROMAN          5%

  • CODE 0.9 is our first transliteration code.

It was optimized on our test set and, after quite some work, could reach a very decent score. However, the code fails to cover many sentences (>50%), which is bad for training neural networks because of the resulting loss of patterns.

Exactly this issue motivated the rewrite of CODE 0.9, which in turn, because of its difficulty, led to the alternative graphical approach featured in this post.

  • CODE D is the code based on diagrams; it is the solution highlighted in this post

  • CODE Transfo is the code trained with transformers

  • CODE Ruby is the final, production code

  • UROMAN is an alternative resource we found online

Notes on Performance

Even though the code was built simply by translating an expert’s diagram, and despite the considerable amount of data passed around, we found CODE D to be about 2 times faster than the initial implementation.

One reason is Julia, which can be several times faster than Python; another is probably the non-optimal implementation of CODE 0.9.

We find this interesting to report, as none of the many possible performance improvements were needed to transliterate ~20 sentences/sec on a small machine, which is fast enough to produce a huge amount of transliteration data overnight.

Summary & Discussion

For the reasons explained in the benchmarks and in the introduction, we found it impractical and inefficient to build transliteration code directly from a set of mappings and written rules.

Considering that the integration between software developer and linguist was one of the challenges, with the former struggling to develop intuition about a foreign language and the latter unable to debug or implement tweaks and changes alone, we approached the problem with a graphical editor allowing the linguist to create his or her own logic designs.

Transliteration is put into production after training neural networks, which allows us to bypass NLP libraries not available in Ruby while also yielding a compact solution.

In the final step, we found a lightweight way to export torch transformers into native Ruby, using nothing more than very standard libraries (no torch-rb).

We think that the approach, or parts of it, can be ported to the transliteration of other languages, including those for which no transliteration data is available.

Having demonstrated its application to a complex software implementation, we also think that the graphical approach, which allows a good encapsulation of the technical and specialist workflows, can be very useful in many situations.

Several new technologies suggest ideas for scaling up the approach, for instance integrating other work on graph transliteration or, in the near future, technologies like AI pair programmers.

Outlook

  • We found that empowering a linguist to design and visualise complex logic was far more successful than a first attempt that separated the developer and linguist workflows, possibly because the linguist could apply his or her understanding and intuition directly.

  • The solution (diagram to code) is very portable, visual and intuitive, and it creates fewer dependencies, as the software consists of simple code snippets/actions.

  • Details could be improved for debugging, such as specifying types, expected inputs/outputs and standards, for example at the editor level (alternatives to Lucidchart).

Therefore, the approach seems promising for other cases not necessarily confined to transliteration.