Tp-Note news: Multilingual note-taking with linguistic heuristics
Tp-Note's main design goal is to convert some input text - usually provided by the system's clipboard - into a Markdown note file with a descriptive YAML header and a meaningful filename. This blog post explains how to configure Tp-Note for multilingual note-taking. Consider the following note file example:
---
title: Demokratie und Menschenbild
subtitle: Vorlesung
author: Prof. Rainer Mausfeld
date: 2023-04-21
lang: de-DE
---
Kognitionswissenschaftliche Einsichten in die Beschaffenheit des Menschen.
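With Tp-Note's default filename templates, the header above is also reflected in the note's filename. Assuming the note was created on the date shown, the file would typically be named along the lines of:
20230421-Demokratie und Menschenbild--Vorlesung.md
The exact naming scheme - sort tag, separators and extension - is defined by the filename templates in the configuration file, so your result may differ.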
The YAML variable lang: indicates the natural language in which the note text is authored, here de-DE. This IETF language tag is useful when working with grammar and spell checkers. For example, when placed in the last position of the header, the grammar checker LanguageTool (LTeX) reads the variable lang: and activates the grammar and spelling rules for the indicated language (cf. blog post). The recently released Tp-Note version 1.21.0 integrates linguistic analysis provided by the Lingua library. This new feature allows Tp-Note to automatically determine the natural language of new note files during their creation.
Linguistic data models of natural languages
Linguistic data models of natural languages are huge: compared to the previous Tp-Note version 1.20.1, the new version's binary is 20 times larger, now approximately 100 MiB. Moreover, running a linguistic model over some - even trivial - input text requires considerable computational resources.
In the context of Tp-Note, the task consists of detecting the natural language of some short input text, for example the note's title. The set of all potential languages, e.g. English, French, and so on, is represented by one language model per language. At the time of writing, the underlying Lingua library provides 75 language models, which are all - by default - compiled into Tp-Note's binary. This explains Tp-Note's large binary size. However, the binary size is not the only concern: the set of language candidates must also be limited at runtime, because the processing time increases with the number of language models. As a rough estimation, my Lenovo T570s laptop loads and processes one language model per 100 ms. As the creation of a note file should not take more than 0.5 seconds, the set of potential languages is limited to 4-5 languages.
But don't worry: most of us do not write notes in more than 5 different languages, so this limitation has no practical impact on our workflows. Nevertheless, there is one downside: there is no way for Tp-Note to determine automatically which of the 75 languages it should search for. Obviously, this set depends solely on the user's preferences. Note that the user's locale setting - as reported by the operating system - is automatically appended to the list of language candidates. Therefore, manual configuration is only required when you also write notes in languages other than the one defined by your locale setting. The following sections explain how to configure Tp-Note for multilingual note-taking.
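To make the cost argument concrete, the following minimal Rust sketch shows how such a restricted candidate set could be handed to the Lingua library. The candidate list and the sample title are chosen for illustration only; this is not Tp-Note's actual code, just Lingua's documented builder API:
use lingua::{Language, LanguageDetector, LanguageDetectorBuilder};
use lingua::Language::{English, Estonian, French, German};

fn main() {
    // Only four language models are loaded instead of all 75, which keeps
    // both memory usage and processing time small.
    let candidates = vec![English, French, German, Estonian];
    let detector: LanguageDetector =
        LanguageDetectorBuilder::from_languages(&candidates).build();

    // Short input text, e.g. the title of a new note.
    let detected: Option<Language> =
        detector.detect_language_of("Demokratie und Menschenbild");

    // For this German title the result should be `Some(German)`.
    println!("{:?}", detected);
}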
Configuration of the natural language detection algorithm
When creating a new header for a new or an existing note file, a linguistic language detection algorithm tries to determine in what natural language the note file is authored. Depending on the context, the algorithm processes as input either the header field title: or the first sentence of the text body.
The natural language detection algorithm is implemented as a template filter named get_lang, which is used in various Tera content templates tmpl.*_content in Tp-Note's configuration file. The filter get_lang is parametrized by the configuration variable tmpl.filter_get_lang, containing a list of ISO 639-1 encoded languages that the algorithm considers as potential detection candidates, e.g.:
[[scheme]]
name = "default"
[scheme.tmpl]
filter_get_lang = [
    'en',
    'fr',
    'de',
    'et',
]
Note that the above list is internally completed by the user's
default language as reported from the operating system. Therefore,
manual configuration is only required when you also write notes in
languages other than the one defined by your locale setting. Please
refer to the documentation of the environment variable TPNOTE_LANG
for further details.
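For orientation, here is a hypothetical fragment of such a Tera content template. Only the filters get_lang and map_lang are taken from the description above; the context variable clipboard.body and the surrounding template text are illustrative assumptions and differ from the templates shipped with Tp-Note:
[[scheme]]
name = "default"
[scheme.tmpl]
# Illustrative sketch only; apart from `get_lang` and `map_lang`, the
# variable and filter names are assumptions, not Tp-Note's shipped template.
new_content = '''---
lang:       {{ clipboard.body | get_lang | map_lang }}
---
'''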
As natural language detection is CPU intensive, it is advisable to limit the number of detection candidates to 5 or 6, depending on how fast your computer is. The more language candidates you include, the longer the creation of a note file takes. As a rough guideline, with all languages enabled the creation of new notes can take up to 4 seconds on my computer. Nevertheless, it is possible to enable all available detection candidates with the pseudo language code “+all”, which stands for “add all languages”:
[[scheme]]
name = "default"
[scheme.tmpl]
filter_get_lang = [
    '+all',
]
Once the language is detected with the filter get_lang, it passes through another filter called map_lang. This filter maps the result of get_lang - encoded as an ISO 639-1 code - to an IETF language tag. For example, en is replaced with en-US, or de with de-DE. This additional filtering is useful because the detection algorithm cannot figure out the region code (e.g. -US or -DE) by itself. Instead, the region code is appended in a separate processing step. Spell checkers and grammar checkers like LTeX rely on this region information to work properly.
The corresponding configuration looks like this:
[[scheme]]
name = "default"
[scheme.tmpl]
filter_get_lang = [ 'en', 'fr', 'de', 'et' ]
filter_map_lang = [
    [ 'en', 'en-US', ],
    [ 'de', 'de-DE', ],
]
When the user's region setting - as reported from the operating system's locale setting - does not exist in the above list, it is automatically appended as an additional internal mapping. When the filter map_lang encounters a language code for which no mapping is configured, the input language code is forwarded as it is without modification, e.g. the input fr results in the output fr. Subsequent entries that differ only in the region subtag, e.g. ['en', 'en-GB'], ['en', 'en-US'], are ignored.
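The mapping behavior described above can be summarized in a few lines of code. The following Rust sketch is not Tp-Note's actual implementation, only an illustration of the stated rules: explicit mappings win, the user's locale serves as a fallback for its own language, and anything else passes through unchanged:
// Sketch of the described map_lang behavior (not Tp-Note's real code).
fn map_lang(detected: &str, mapping: &[(&str, &str)], user_locale: &str) -> String {
    // First, look for an explicit mapping, e.g. ("de", "de-DE").
    if let Some((_, tag)) = mapping.iter().find(|(iso, _)| *iso == detected) {
        return (*tag).to_string();
    }
    // Next, fall back to the user's locale when its language part matches,
    // e.g. detected "en" with locale "en-GB" yields "en-GB".
    if user_locale.split('-').next() == Some(detected) {
        return user_locale.to_string();
    }
    // Otherwise forward the input unchanged, e.g. "fr" stays "fr".
    detected.to_string()
}

fn main() {
    let mapping = [("en", "en-US"), ("de", "de-DE")];
    assert_eq!(map_lang("de", &mapping, "en-GB"), "de-DE");
    assert_eq!(map_lang("fr", &mapping, "en-GB"), "fr");
    println!("mapping behaves as described");
}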
Note that the environment variable TPNOTE_LANG_DETECTION - if set - takes precedence over the tmpl.filter_get_lang and tmpl.filter_map_lang settings. This allows configuring the language detection feature system-wide without touching Tp-Note's configuration file. The following example achieves a result equivalent to the configuration above:
TPNOTE_LANG_DETECTION="en-US, fr, de-DE, et" tpnote
If you want to enable all language detection candidates, add the pseudo tag +all somewhere to the list:
TPNOTE_LANG_DETECTION="en-US, de-DE, +all" tpnote
In the above example the IETF language tags en-US and de-DE are retained in order to configure the region codes US and DE used by the map_lang template filter.
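To make this setting permanent instead of passing it on every invocation, you can export the variable from your shell's startup file. A sketch for a POSIX-like shell on Linux, reusing the candidate list from above:
# e.g. in ~/.profile
export TPNOTE_LANG_DETECTION="en-US, fr, de-DE, et"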
For debugging, observe the value of SETTINGS in the debug log:
tpnote -d trace -b
If desired, you can disable Tp-Note's language detection feature by deleting all entries in the tmpl.filter_get_lang variable:
[[scheme]]
name = "default"
[scheme.tmpl]
filter_get_lang = []
Like above, you can achieve the same with:
TPNOTE_LANG_DETECTION="" tpnote
Read more
A good starting point is Tp-Note's project page or the introductory video. The source code is available on GitHub - getreu/tp-note - and binaries and packages for Linux, Windows and macOS can be found here. To get the most out of Tp-Note, I recommend reading Tp-Note's user manual. If you like Tp-Note, you will probably soon want to customize it. How to do so is explained in Tp-Note's manual page.
Last updated on 2.6.2023