Tp-Note's main design goal is to convert some input text - usually provided by the system's clipboard - into a Markdown note file with a descriptive YAML header and meaningful filename. This blog post explains how to configure Tp-Note for multilingual note-taking. Consider the following note file example:

---
title:    Demokratie und Menschenbild
subtitle: Vorlesung
author:   Prof. Rainer Mausfeld
date:     2023-04-21
lang:     de-DE
---

Kognitionswissenschaftliche Einsichten in die Beschaffenheit des Menschen.

The YAML variable lang: indicates the natural language in which the note text is authored, here de-DE. This IETF language tag is useful when working with grammar and spell checkers. For example: when placed in last position, the grammar checker LanguageTool (LTEX) reads the variable lang: to activate the grammar and spelling rules for the indicated language (cf. blog post). The recently released Tp-Note version 1.21.0 integrates the linguistic analysis provided by the Lingua library. This new feature allows Tp-Note to automatically determine the natural language of new note files during their creation.

Linguistic data models of natural languages

Linguistic data models of natural languages are huge: compared to the previous Tp-Note version 1.20.1, the new version's binary is 20 times larger, now approximately 100 MiB. Also, running a linguistic model over some - even trivial - input text requires considerable computational resources.

In the context of Tp-Note, the task consists of detecting the natural language of some short input text, for example the note's title. The set of all potential languages, e.g. English, French, and so on, is represented by one language model per language. At the time of writing, the underlying Lingua library provides 75 language models, which are all - by default - compiled into Tp-Note's binary. This explains Tp-Note's large binary size. However, the binary size is not the only concern: the set of language choices must also be limited at runtime, because the processing time increases with the number of language models. As a rough estimate, my Lenovo T570s laptop loads and processes one language model in about 100 ms. As the note file creation should not take more than 0.5 seconds, the set of potential languages is limited to 4-5 languages. But don't worry: most of us do not write notes in more than 5 different languages, so this limitation has no practical impact on our workflows. Nevertheless, there is one downside: there is no way for Tp-Note to automatically determine which of the 75 languages it should search for. Obviously, this set depends solely on the user's preferences. Note that the user's locale setting - as reported by the operating system - is automatically appended to the list of language candidates. Therefore, manual configuration is only required if you also write notes in languages other than the one defined by your locale setting. The following sections explain how to configure Tp-Note for multilingual note-taking.

Configuration of the natural language detection algorithm

When creating a new header for a new or an existing note file, a linguistic language detection algorithm tries to determine in which natural language the note file is authored. Depending on the context, the algorithm processes as input the header field title: or the first sentence of the text body. The natural language detection algorithm is implemented as a template filter named get_lang, which is used in various Tera content templates tmpl.*_content in Tp-Note's configuration file. The filter get_lang is parametrized by the configuration variable tmpl.filter_get_lang, containing a list of ISO 639-1 encoded languages that the algorithm considers as potential detection candidates, e.g.:

[tmpl]
filter_get_lang = [
    'en',
    'fr',
    'de',
    'et'
]
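To illustrate where the get_lang filter plugs into a Tera content template, consider the following sketch. It is hypothetical: the template name new_content and the variable title are stand-ins for whatever your Tp-Note version's tmpl.*_content templates actually use; only the filters get_lang and map_lang are taken from this post.

```toml
[tmpl]
# Hypothetical content template: the input text is piped first through
# `get_lang` (language detection) and then through `map_lang`
# (ISO 639-1 code to IETF language tag mapping).
new_content = '''---
title:      {{ title }}
date:       {{ now() | date(format="%Y-%m-%d") }}
lang:       {{ title | get_lang | map_lang }}
---
'''
```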

Note that the above list is internally completed with the user's default language as reported by the operating system. Therefore, manual configuration is only required if you also write notes in languages other than the one defined by your locale setting. Please refer to the documentation of the environment variable TPNOTE_LANG for further details.

As natural language detection is CPU intensive, it is advisable to limit the number of detection candidates to 5 or 6, depending on how fast your computer is. The more language candidates you include, the longer the note file creation takes. As a rule of thumb, with all languages enabled the creation of new notes can take up to 4 seconds on my computer. Nevertheless, it is possible to enable all available detection candidates with the pseudo language code “+all”, which stands for “add all languages”:

[tmpl]
filter_get_lang = [
    '+all',
]

Once the language is detected with the filter get_lang, the result is passed to another filter called map_lang. This filter maps the result of get_lang - encoded as an ISO 639-1 code - to an IETF language tag. For example, en is replaced with en-US, or de with de-DE. This additional filtering is useful because the detection algorithm cannot determine the region code (e.g. -US or -DE) by itself. Instead, the region code is appended in a separate processing step. Spell checkers and grammar checkers like LTeX rely on this region information to work properly.

The corresponding configuration looks like this:

[tmpl]
filter_map_lang = [
    ['en', 'en-US'],
    ['de', 'de-DE'],
]

When the user's region setting - as reported by the operating system's locale setting - does not exist in the above list, it is automatically appended as an additional internal mapping. When the filter map_lang encounters a language code for which no mapping is configured, the input language code is forwarded unchanged, e.g. the input fr results in the output fr. Subsequent entries that differ only in the region subtag, e.g. ['en', 'en-GB'], ['en', 'en-US'], are ignored.
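The mapping rules above can be made concrete with a configuration fragment. The comments mark which entries take effect; the fragment is a sketch for illustration, not taken from a real configuration:

```toml
[tmpl]
filter_map_lang = [
    ['en', 'en-GB'],   # first entry for 'en': takes effect
    ['en', 'en-US'],   # ignored: differs only in the region subtag
    ['de', 'de-DE'],
]
# 'fr' has no mapping here, so `map_lang` forwards it unchanged as 'fr'.
```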

Note that the environment variable TPNOTE_LANG_DETECTION - if set - takes precedence over the tmpl.filter_get_lang and tmpl.filter_map_lang settings. This allows configuring the language detection feature system-wide without touching Tp-Note's configuration file. The following example achieves a result equivalent to the configuration above:

TPNOTE_LANG_DETECTION="en-US, fr, de-DE, et" tpnote

If you want to enable all language detection candidates, add the pseudo tag +all somewhere in the list:

TPNOTE_LANG_DETECTION="en-US, de-DE, +all" tpnote

In the above example the IETF language tags en-US and de-DE are retained in order to configure the region codes US and DE used by the map_lang template filter.

For debugging, observe the value of SETTINGS in the debug log:

tpnote -d trace -b

If desired, you can disable Tp-Note's language detection feature by deleting all entries in the tmpl.filter_get_lang variable:

[tmpl]
filter_get_lang = []

As above, you can achieve the same with:

TPNOTE_LANG_DETECTION="" tpnote

Read more

A good start is Tp-Note's project page or the introductory video. The source code is available on GitHub - getreu/tp-note - and binaries and packages for Linux, Windows and macOS can be found here. To get the most out of Tp-Note, I recommend reading Tp-Note's user manual. If you like Tp-Note, you will probably soon want to customize it; how to do so is explained in Tp-Note's manual page.

Last updated on 2.6.2023