Interscript

Interoperable script conversion systems

This document describes the DSL-based file formats that use the extensions .imp and .iml.

An .imp file contains a standalone transliteration map, for instance a map that transliterates Korean text into Latin script.

An .iml file contains a library of aliases and stages to be used by the .imp maps. It follows the same format, but the metadata and tests parts are optional and a main stage is not allowed. Unless noted otherwise, this document describes the map version of the format.

Basic syntax

A # character starts a comment. Everything from the # character to the end of the line is ignored by Interscript and exists only to communicate the intention behind the content to a human reader. In this document, comments are most often hints for the person reading the examples.

A String is a part of the document in one of two forms: "content" or 'content'. It denotes a group of characters to be used. Strings can be joined together with a + character, like so: "a" + "b", which is equivalent to writing just "ab".

Strings of the form "content" (but not 'content') can contain escape sequences such as \u0410, which means "the Unicode character U+0410". The use of escape sequences is discouraged in new maps, but possible.

An array (or a list) is a part of the document of the form ["a", "b", "c"]. It denotes a sequential group of Strings or items of other types.

Document

The root part of an .imp or .iml file is called a document. A map has the following format:

metadata {
  # Metadata part comes here
}

tests {
  # Tests part comes here
}

# A dependency directive may appear zero or more times. It is described in
# a subsection.
dependency "other-map-or-library", as: shortname

# This part is optional
aliases {
  # Aliases part comes here
}

stage {
  # A stage description comes here
}

# There may be more than one stage; every additional stage needs a name. The
# default stage name is `main`. A name can't be used more than once in a document.

stage(stage_name) {
  # A stage description comes here
}

Dependency

A dependency is an instruction that can be issued only in the document context. It declares that we want to import aliases or stages from another map or library.

dependency "other-map-or-library", as: shortname

This instruction lets us reference stages from the dependency as map.shortname.stage.stagename and aliases as map.shortname.aliasname.

There is a second syntax, mostly useful for loading libraries, which imports the stages and aliases into the global context and may result in more readable maps:

dependency "other-map-or-library", as: shortname, import: true

This form allows other stages and aliases to be referenced as stage.stagename and aliasname.

It is not possible to load maps using this form, only libraries, because we can’t override the main stage.

The standard library is implicitly imported this way. There’s no way or need to import it explicitly.

Standard library

All maps depend on a standard library implicitly. This standard library defines a few useful aliases, some of which cannot be expressed by other means.

Below is a table that describes the aliases defined by the standard library:

Alias | Meaning
none | An empty string
space | A space character
whitespace | Any ASCII whitespace character (space, tab, line delimiter, ...)
boundary | A word boundary (see word for what constitutes a word character)
word | An ASCII word character (a-z, A-Z, 0-9, _)
not_word | Negation of word
alpha | Any ASCII alphabetic character (a-z, A-Z)
not_alpha | Negation of alpha
digit | Any ASCII digit
not_digit | Negation of digit
line_start | Beginning of a line
line_end | End of a line
string_start | Beginning of a string
string_end | End of a string

Any alias, from the standard library or otherwise, can be joined with any other item using the + operator, for example: line_start + "rest".
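To make the alias descriptions above more concrete, here is a minimal Python sketch that approximates each alias with a regular expression fragment. The mapping, the names ALIAS_PATTERNS, alias and lit, and the idea that joining with + corresponds to regex concatenation are all assumptions made for illustration; they are not taken from Interscript's implementation.

```python
import re

# Hypothetical regex approximations of the standard library aliases.
# These are illustrative assumptions, not Interscript's own definitions.
ALIAS_PATTERNS = {
    "none": "",
    "space": " ",
    "whitespace": r"[ \t\n\r\f\v]",
    "boundary": r"\b",
    "word": r"[A-Za-z0-9_]",
    "not_word": r"[^A-Za-z0-9_]",
    "alpha": r"[A-Za-z]",
    "not_alpha": r"[^A-Za-z]",
    "digit": r"[0-9]",
    "not_digit": r"[^0-9]",
    "line_start": r"^",    # needs re.MULTILINE to match at every line
    "line_end": r"$",      # needs re.MULTILINE to match at every line
    "string_start": r"\A",
    "string_end": r"\Z",
}


def alias(name):
    """Return the regex fragment assumed for a standard library alias."""
    return ALIAS_PATTERNS[name]


def lit(text):
    """Return a regex fragment that matches the given string literally."""
    return re.escape(text)


# line_start + "rest" in the DSL, sketched as regex concatenation.
# re.ASCII keeps \b limited to ASCII word characters.
pattern = re.compile(alias("line_start") + lit("rest"), re.MULTILINE | re.ASCII)
result = pattern.sub("X", "rest of rest\nrest")
# result == "X of rest\nX": only the occurrences at a line start are replaced
```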

Metadata part

The metadata part describes our map. It follows YAML syntax.

metadata {
  # ID of the authority that provided the transliteration rules we are about to implement
  authority_id: iso
  # ID of the rules, most often the year they were defined
  id: 1996-method1
  # The language code of the map
  language: iso-639-2:kor
  # The source script of our map, in our example Hang for Hangul
  source_script: Hang
  # The destination script of our map
  destination_script: Latn
  # The longer name of our map
  name: ISO/TR 11941:1996 Information and documentation — Transliteration of Korean script into Latin characters
  # The URL where it was published
  url: https://www.iso.org/standard/20564.html
  # The creation date of our map
  creation_date: 1996
  # The adoption date of our map, or empty if not adopted
  adoption_date: ""
  # The description of our map
  description: |
    Establishes a system for the transliteration of the characters of Korean script into Latin characters.
    Intended to provide a means for international communication of written documents.

  # The notes that describe some parts of our map that we are about to implement
  notes:
    - A word-initial hard sign 'ъ' is not represented, but instead is left out of the transliteration.
    - The romanization follows the dialect spoken in Chechnya rather than other local pronunciations.
}

Tests part

The tests part lists the tests that the automated system executes to verify that the map is defined properly. An example tests part looks like this:

tests {
  test "애기", "aeki"
  test "방", "pang"
}

This means that we expect our map to transliterate the string "애기" to "aeki" and "방" to "pang".
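What an automated system does with such test pairs can be sketched in a few lines of Python. The function run_tests and the stand-in toy_transliterate are hypothetical names used only for this illustration; a real run would call the compiled map instead of a lookup table.

```python
def run_tests(transliterate, cases):
    """Run every (source, expected) pair and collect the failures."""
    failures = []
    for source, expected in cases:
        actual = transliterate(source)
        if actual != expected:
            failures.append((source, expected, actual))
    return failures


# Hypothetical stand-in for a compiled map: a fixed lookup table.
TOY_MAP = {"애기": "aeki", "방": "pang"}


def toy_transliterate(text):
    return TOY_MAP.get(text, text)


# The same pairs as in the tests part above.
CASES = [("애기", "aeki"), ("방", "pang")]
failures = run_tests(toy_transliterate, CASES)  # empty list: all tests pass
```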

Aliases part

An aliases part defines a group of aliases that the stages can use to simplify the code of our map.

Let’s suppose that our map refers to "Double consonant jamo" and "Aspirated consonant jamo" quite extensively. We can alias those:

aliases {
  def_alias double_cons_jamo, any("ᄁᄄᄈᄍᄊ")
  def_alias aspirated_cons_jamo, any("ᄏᄐᄑᄎ")
}

Later, in the stage part, we can refer to them simply as double_cons_jamo and aspirated_cons_jamo, without repeating ourselves.

Stage part

A stage part describes a stage: a sequential group of steps that transliterate a string from a source script to a destination script. An example stage looks like the following:

stage {
  run map.hangjamo.stage.main
  sub any("ᄀᆨ"), "k"
  sub any("ᄏᆿ"), "kh"
  parallel {
    sub "ᅡ", "a"
    sub "ᅥ", "eo"
  }
}

A stage can be named, as described in the Document section. The default name of a stage is main.

sub call

A sub call substitutes one item (string, character, alias) with another item.

stage {
  sub "source", "destination"
}

This call allows for some named parameters:

before:

Execute this substitution only if the "source" is preceded by the value of this parameter. The preceding text itself is not replaced; only the "source" is.

after:

The same, but the parameter denotes what must follow the "source".

not_before:, not_after:

Negation of before: and after:. The substitution happens only if the value of the parameter is NOT present before or after the "source".

For example:

stage {
  sub boundary + "Е", "Ye", not_before: "’"
  sub boundary + "е", "ye", not_before: "’"

  sub none, "'", not_before: hangul, after: aspirated_cons
}
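The four named parameters behave much like regular expression lookarounds. The Python sketch below follows this document's wording (before: constrains what precedes the source, after: what follows it) and handles literal strings only. The function name sub and its keyword arguments mirror the DSL for readability, but this is an assumed approximation, not Interscript's implementation.

```python
import re


def sub(text, source, destination, before=None, after=None,
        not_before=None, not_after=None):
    """Replace literal `source` with `destination`, subject to context checks.

    Assumption: before/not_before are modelled as lookbehinds and
    after/not_after as lookaheads, so the context is never consumed.
    """
    pattern = ""
    if before is not None:
        pattern += "(?<=" + re.escape(before) + ")"
    if not_before is not None:
        pattern += "(?<!" + re.escape(not_before) + ")"
    pattern += re.escape(source)
    if after is not None:
        pattern += "(?=" + re.escape(after) + ")"
    if not_after is not None:
        pattern += "(?!" + re.escape(not_after) + ")"
    # A function replacement keeps backslashes in `destination` literal.
    return re.sub(pattern, lambda match: destination, text)


preceded = sub("ab cb", "b", "X", before="a")          # "aX cb"
not_preceded = sub("ab cb", "b", "X", not_before="a")  # "ab cX"
followed = sub("ba bc", "b", "X", after="a")           # "Xa bc"
not_followed = sub("ba bc", "b", "X", not_after="a")   # "ba Xc"
```

In every case only the source is rewritten; the surrounding context is checked but left untouched.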

Multiple replacements

Various maps need to document multiple possible replacements. Let’s suppose our character set has a character "a" that can be transliterated to any of the forms "X", "Y" or "Z". Currently, "a" is always transliterated to "X", because it comes first. In the future it will be possible to execute such a map in reverse as well.

stage {
  sub "a", any("XYZ")
}
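The current behaviour, where only the first alternative is used, can be sketched as follows. The helper name sub_first_alternative is hypothetical and exists only for this illustration.

```python
def sub_first_alternative(text, source, alternatives):
    """Replace `source` with the first listed alternative only."""
    return text.replace(source, alternatives[0])


# Mirrors: sub "a", any("XYZ")
converted = sub_first_alternative("banana", "a", "XYZ")  # "bXnXnX"
```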

parallel block

A parallel block can be defined as a subsection of a stage part. It indicates that the steps inside are to be executed in parallel. Currently, only sub calls can be executed in parallel. It also means that those steps try to match the longest substrings first.

stage {
  parallel {
    sub "А", 'A'
    sub "Б", 'B'
    sub "В", 'V'
    sub "Г", 'G'
  }
}
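One way to picture the difference between parallel and sequential execution is the Python sketch below. It assumes that "in parallel" means a single pass over the input in which the output of one rule is never re-read by another, and that longer sources are tried first. The function names are hypothetical and only literal rules are covered.

```python
import re


def parallel_sub(text, rules):
    """Apply all literal rules in a single pass, longest source first."""
    sources = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(source) for source in sources))
    return pattern.sub(lambda match: rules[match.group(0)], text)


def sequential_sub(text, rules):
    """Apply the same rules one after another, for comparison."""
    for source, destination in rules.items():
        text = text.replace(source, destination)
    return text


swap = {"a": "b", "b": "a"}
swapped = parallel_sub("ab", swap)      # "ba": each character is rewritten once
chained = sequential_sub("ab", swap)    # "aa": the second rule re-reads the first rule's output

digraph = {"s": "1", "sh": "2"}
longest = parallel_sub("sash", digraph)  # "1a2": "sh" wins over "s" where both match
```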

Simple mode

If all the rules are simple sub rules, a fast-track implementation is used. Simple sub rules are those that have no before/after part and that use only string items and, possibly, any items, combined with concatenation.

run call

The run call runs a stage defined inside the document, or in another map or library. If the stage isn’t local, the map or library that defines it needs to be declared using the dependency directive.

For example:

stage {
  # If dependency declared without import: true
  run map.hangjamo.stage.main
  # If dependency declared with import: true, or we reference a local stage
  run stage.remove_spaces
}
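Conceptually, a stage is a sequence of steps, and run uses another stage as one of those steps. The Python sketch below models this with plain functions. The helpers make_stage and sub, and the stage remove_spaces with its single rule, are hypothetical and serve only to illustrate the composition.

```python
def make_stage(*steps):
    """Build a stage: a function that applies its steps in order."""
    def stage(text):
        for step in steps:
            text = step(text)
        return text
    return stage


def sub(source, destination):
    """A literal substitution step (no before/after handling in this sketch)."""
    return lambda text: text.replace(source, destination)


# Hypothetical named stage that other stages can reuse.
remove_spaces = make_stage(sub(" ", ""))

# "run stage.remove_spaces" corresponds to using that stage as a step.
main = make_stage(remove_spaces, sub("ab", "X"))

output = main("a b ab")  # spaces removed first ("abab"), then "ab" -> "X": "XX"
```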

Standard library functions

Certain conversions are hard to achieve using stages alone. These are implemented as functions in the respective standard libraries, using programming languages.

A function named title_case, for example, can be called as follows:

stage {
  title_case
}

A standard library function can take (named) arguments. They are described in the table below and may be omitted if a default value is specified.

List of standard library functions

Function name | Arguments | Sample input | Sample output
title_case | word_separator: " " | "example string" | "Example String"
downcase | | "HELLO WORLD" | "hello world"
compose | | "ᄆ"+"ᅮ" | "무"
decompose | | "무" | "ᄆ"+"ᅮ"
separate | separator: " " | "こんいちは" | "こ ん い ち は"
secryst | model: | Consult: Usage with Secryst. |
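The sample inputs and outputs in the table can be reproduced with the following Python approximations. They are assumptions about what each function does, inferred only from the samples; in particular, mapping compose and decompose to Unicode NFC and NFD normalization is a guess that happens to match the Hangul example. secryst is omitted because it depends on an external model.

```python
import unicodedata


def title_case(text, word_separator=" "):
    """Upper-case the first character of every word."""
    words = text.split(word_separator)
    return word_separator.join(word[:1].upper() + word[1:] for word in words)


def downcase(text):
    """Lower-case the whole string."""
    return text.lower()


def compose(text):
    """Assumed equivalent: Unicode canonical composition (NFC)."""
    return unicodedata.normalize("NFC", text)


def decompose(text):
    """Assumed equivalent: Unicode canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", text)


def separate(text, separator=" "):
    """Insert the separator between every pair of adjacent characters."""
    return separator.join(text)


titled = title_case("example string")   # "Example String"
lowered = downcase("HELLO WORLD")       # "hello world"
composed = compose("\u1106\u116e")      # the two jamo join into the syllable U+BB34
decomposed = decompose("\ubb34")        # back to the two jamo U+1106 U+116E
spaced = separate("こんいちは")          # "こ ん い ち は"
```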

Items

Interscript doesn’t work purely on Strings, even though this document mostly refers to Strings. Items can be used in the alias and stage contexts.

String item

The most basic kind of item. For example "Г" means "match Г" or "replace with Г" depending on usage context. Some contexts will only accept strings, or aliases to strings.

+ method

Items can be concatenated (added together) to denote a complex item. For instance: any("ab") + "e" means "either ae or be" and is equivalent to any(["ae", "be"]).

any item

An any item denotes alternative variations of a string. It can be called in 3 forms:

  • any("abcde") - any character: a, b, c, d or e

  • any(["one", "two"]) - any string: one or two

  • any("a".."z") - any character from a to z

An any item can also be used with kinds of items other than String, for instance:

stage {
  sub any([line_start + "a", "a" + line_end]), none
}
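The three forms of any, and the way + combines them with other items, can be approximated with regular expressions. In the Python sketch below, the function any_of is a hypothetical stand-in for any, and a (first, last) tuple stands in for the "a".."z" range syntax, which Python does not have. The translation to regex is an assumption for illustration.

```python
import re


def any_of(spec):
    """Build a regex fragment for the three forms of `any`.

    str          -> any single character of the string
    list         -> any of the listed strings
    (first, last) tuple -> any character in the inclusive range
    """
    if isinstance(spec, tuple):
        first, last = spec
        return "[" + re.escape(first) + "-" + re.escape(last) + "]"
    # Iterating a str yields characters; iterating a list yields its strings.
    return "(?:" + "|".join(re.escape(option) for option in spec) + ")"


# any("ab") + "e" is described above as equivalent to any(["ae", "be"]).
joined = any_of("ab") + re.escape("e")
listed = any_of(["ae", "be"])

matches_joined = [bool(re.fullmatch(joined, s)) for s in ("ae", "be", "ce")]
matches_listed = [bool(re.fullmatch(listed, s)) for s in ("ae", "be", "ce")]
in_range = bool(re.fullmatch(any_of(("a", "z")), "q"))
```

Both patterns accept "ae" and "be" and reject "ce", which is the equivalence stated in the + method section.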

maybe, some, maybe_some items

To allow a given item to repeat 0 to 1 times, 1 or more times, or 0 or more times, wrap it in maybe(), some(), or maybe_some() respectively.

stage {
  sub "a"+maybe("-")+"b", "AB"       # Equivalent to regexp: a-?b
  sub "a"+some("-")+"b", "AB"        # Equivalent to regexp: a-+b
  sub "a"+maybe_some("-")+"b", "AB"  # Equivalent to regexp: a-*b
}
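The regexp equivalents given in the comments above can be checked with a short Python sketch. The helpers lit, maybe, some and maybe_some are hypothetical functions that build regex fragments; they mirror the DSL names but are not part of Interscript.

```python
import re


def lit(text):
    """Regex fragment matching the string literally."""
    return re.escape(text)


def maybe(fragment):
    """0 or 1 repetitions."""
    return "(?:" + fragment + ")?"


def some(fragment):
    """1 or more repetitions."""
    return "(?:" + fragment + ")+"


def maybe_some(fragment):
    """0 or more repetitions."""
    return "(?:" + fragment + ")*"


sample = "ab a-b a--b"
with_maybe = re.sub(lit("a") + maybe(lit("-")) + lit("b"), "AB", sample)
with_some = re.sub(lit("a") + some(lit("-")) + lit("b"), "AB", sample)
with_maybe_some = re.sub(lit("a") + maybe_some(lit("-")) + lit("b"), "AB", sample)
# with_maybe      == "AB AB a--b"
# with_some       == "ab AB AB"
# with_maybe_some == "AB AB AB"
```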

alias item

An alias item references an alias. For example map.other_map.alias_from_other_map or simply a_local_alias_or_an_alias_from_imported_library.

capture and ref items

Sometimes a group matched in the input needs to be referenced in the output (or later in the input). People who know regular expressions will be familiar with replacements of the form replace /(a)/, '[\1]'. Interscript supports this kind of syntax:

stage {
  sub capture(any("abc")), "["+ref(1)+"]"
}

When run against the string "abcde", this stage produces the output "[a][b][c]de".
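The same substitution written with Python's re module gives the same result, which may help readers who think in regular expressions. The helpers capture and ref are hypothetical and only build the corresponding regex syntax.

```python
import re


def capture(fragment):
    """Wrap a regex fragment in a capturing group."""
    return "(" + fragment + ")"


def ref(number):
    """Backreference to a captured group, for use in the replacement."""
    return "\\" + str(number)


# Mirrors: sub capture(any("abc")), "["+ref(1)+"]"
bracketed = re.sub(capture("[abc]"), "[" + ref(1) + "]", "abcde")
# bracketed == "[a][b][c]de"
```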

Ending notes

This document describes everything Interscript currently supports, but it is strongly advised to read the existing maps to get a grasp of how these features are best used.