Tp-Note's main design goal is to convert some input text - usually provided by the system's clipboard - into a Markdown note file with a descriptive YAML header and meaningful filename. This blog post explains how to configure Tp-Note for multilingual note-taking. Consider the following note file example:
    ---
    title: Demokratie und Menschenbild
    subtitle: Vorlesung
    author: Prof. Rainer Mausfeld
    date: 2023-04-21
    lang: de-DE
    ---

    Kognitionswissenschaftliche Einsichten in die Beschaffenheit des Menschen.
The YAML variable lang: indicates the natural language in which the note text is authored, here de-DE. This IETF language tag is useful when working with grammar and spell checkers. For example, when placed in last position, the grammar checker LanguageTool (LTeX) reads the variable lang: to activate the grammar and spelling rules for the indicated language (cf. blog post).
The recently released Tp-Note version 1.21.0 integrates linguistic analysis provided by the Lingua library. This new feature allows Tp-Note to automatically determine the natural language of new note files while they are being created.
Linguistic data models of natural languages
Linguistic data models of natural languages are huge: compared to the previous Tp-Note version 1.20.1, the new version's binary is 20 times larger, now approximately 100 MiB. Also, running linguistic models over some - even trivial - input text requires considerable computational resources.
In the context of Tp-Note, the task consists of detecting the natural language of some short input text, for example the note's title. Each potential language, e.g. English or French, is represented by its own language model. At the time of writing, the underlying Lingua library provides 75 language models, all of which are - by default - compiled into Tp-Note's binary. This explains Tp-Note's large binary size.
However, binary size is not the only concern: the large set of language choices must be further limited at runtime, because the processing time increases with the number of language models. As a rough estimate, my Lenovo T570s laptop needs about 100 ms to load and process one language model. As note file creation should not take more than 0.5 seconds, the set of potential languages is limited to 4-5 languages. But don't worry: most of us do not write notes in more than 5 different languages, so this limitation has no practical impact on our workflows.
Nevertheless, there is one downside: Tp-Note has no way to determine automatically which of the 75 languages it should search for. Obviously, this set depends solely on the user's preferences. Note that the language of the user's locale setting - as reported by the operating system - is automatically appended to the list of language candidates. Therefore, manual configuration is only required when you also write notes in languages other than the one defined by your locale setting. The following sections explain how to configure Tp-Note for multilingual note-taking.
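The time budget discussed in this section can be made explicit with a small back-of-the-envelope calculation. The Python sketch below is purely illustrative: the function name max_candidates is hypothetical, and the figures of 100 ms per model and 0.5 s per note are the rough estimates quoted in the text, not constants taken from Tp-Note.

```python
def max_candidates(budget_ms: int, per_model_ms: int) -> int:
    """Return how many language models fit into the given time budget."""
    if per_model_ms <= 0:
        raise ValueError("per_model_ms must be positive")
    # Each model costs roughly the same time, so integer division suffices.
    return budget_ms // per_model_ms


# Rough estimates from the text: 0.5 s budget, about 100 ms per model.
print(max_candidates(500, 100))
```

On a faster or slower machine only the per-model cost changes, which is why the recommended number of candidates varies with your hardware.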
Configuration of the natural language detection algorithm
When creating a new header for a new or an existing note file, a language detection algorithm tries to determine in which natural language the note file is authored. Depending on the context, the algorithm processes as input either the header field title: or the first sentence of the text body. The natural language detection algorithm is implemented as a template filter get_lang, which is used in various Tera content templates tmpl.*_content in Tp-Note's configuration file. The filter is parametrized by the configuration variable tmpl.filter_get_lang, which contains a list of ISO 639-1 encoded languages that the algorithm considers as potential detection candidates, e.g.:
    [tmpl]
    filter_get_lang = [ 'en', 'fr', 'de', 'et' ]
Note that the above list is internally completed by the user's default language as reported by the operating system. Therefore, manual configuration is only required when you also write notes in languages other than the one defined by your locale setting. Please refer to the documentation of the related environment variable for further details.
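The way the configured list is completed by the locale's language can be pictured with a short Python sketch. This is an assumption-laden illustration, not Tp-Note's actual code: the function name detection_candidates and its parameters are hypothetical, and the locale language is simply passed in as an ISO 639-1 code.

```python
def detection_candidates(configured: list[str], locale_lang: str) -> list[str]:
    """Return the configured candidates, completed by the locale's language."""
    candidates = list(configured)  # copy, so the caller's list stays untouched
    if locale_lang not in candidates:
        # The user's default language is appended only when it is missing.
        candidates.append(locale_lang)
    return candidates


# A user with a Portuguese locale and the configuration shown above:
print(detection_candidates(["en", "fr", "de", "et"], "pt"))
```

If the locale's language is already part of the configured list, the list is returned unchanged, which matches the statement that manual configuration is only needed for additional languages.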
As natural language detection is CPU intensive, it is advisable to limit the number of detection candidates to 5 or 6, depending on how fast your computer is. The more language candidates you include, the longer note file creation takes. For reference, with all languages enabled, the creation of a new note can take up to 4 seconds on my computer. Nevertheless, it is possible to enable all available detection candidates with the pseudo language code “+all”, which stands for “add all languages”:
    [tmpl]
    filter_get_lang = [ '+all', ]
Once the language is detected with the filter get_lang, it passes through another filter, map_lang. This filter maps the result of get_lang - encoded as an ISO 639-1 code - to an IETF language tag. For example, de is mapped to de-DE. This additional filtering is useful because the detection algorithm cannot figure out the region code (e.g. -DE) by itself. Instead, the region code is appended in a separate processing step. Spell checkers and grammar checkers like LTeX rely on this region information to work properly.
The corresponding configuration looks like this:
    [tmpl]
    filter_map_lang = [
        [ 'en', 'en-US', ],
        [ 'de', 'de-DE', ],
    ]
When the user's region setting - as reported by the operating system's locale setting - does not exist in the above list, it is automatically appended as an additional internal mapping. When the filter map_lang encounters a language code for which no mapping is configured, the input language code is forwarded as is, without modification, e.g. the input fr results in the output fr. Subsequent entries that differ only in the region subtag, e.g. ['en', 'en-GB'], ['en', 'en-US'], are ignored.
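The mapping rules described here (first entry per language wins, unmapped codes pass through unchanged) can be summarized in a few lines of Python. This is only a sketch of the described behaviour under my own assumptions; the function below merely borrows the name of the map_lang filter and is not Tp-Note's implementation.

```python
def map_lang(code: str, mappings: list[tuple[str, str]]) -> str:
    """Map an ISO 639-1 code to an IETF language tag, if a mapping exists."""
    table: dict[str, str] = {}
    for lang, tag in mappings:
        # setdefault keeps the first entry; later entries for the same
        # language, differing only in the region subtag, are ignored.
        table.setdefault(lang, tag)
    # Codes without a configured mapping are forwarded unmodified.
    return table.get(code, code)


mappings = [("en", "en-US"), ("de", "de-DE")]
print(map_lang("de", mappings))
print(map_lang("fr", mappings))
```

Resolving duplicates in favour of the first entry means that the order of the configured list decides which region is used for a language.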
Note that the environment variable TPNOTE_LANG_DETECTION - if set - takes precedence over the tmpl.filter_get_lang and tmpl.filter_map_lang settings. This allows you to configure the language detection feature system-wide without touching Tp-Note's configuration file. The following example achieves a result equivalent to the configuration above:
TPNOTE_LANG_DETECTION="en-US, fr, de-DE, et" tpnote
If you want to enable all language detection candidates, add the pseudo tag
+all somewhere to the list:
TPNOTE_LANG_DETECTION="en-US, de-DE, +all" tpnote
In the above example, the IETF language tags en-US and de-DE are retained in order to configure the region codes US and DE used by the map_lang filter.
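To illustrate how a single value of TPNOTE_LANG_DETECTION can carry both the detection candidates and the region mappings, here is a hedged Python sketch. The function parse_lang_detection is hypothetical and reflects only my reading of the behaviour described in this post; Tp-Note's real parser may differ in details such as whitespace or error handling.

```python
def parse_lang_detection(value: str) -> tuple[list[str], list[tuple[str, str]]]:
    """Split a comma-separated tag list into candidates and region mappings."""
    candidates: list[str] = []
    mappings: list[tuple[str, str]] = []
    for item in value.split(","):
        tag = item.strip()
        if not tag:
            continue  # skip empty items, e.g. from an empty string
        if tag == "+all":
            candidates.append(tag)  # the pseudo tag is passed on as is
            continue
        lang = tag.split("-", 1)[0]  # 'en-US' -> 'en', 'fr' -> 'fr'
        if lang not in candidates:
            candidates.append(lang)
        # Only tags with a region subtag yield a mapping; first one wins.
        if "-" in tag and all(lang != known for known, _ in mappings):
            mappings.append((lang, tag))
    return candidates, mappings


print(parse_lang_detection("en-US, fr, de-DE, et"))
```

Under these assumptions, the first example value yields the same candidate list and the same two mappings as the configuration file snippets shown earlier.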
For debugging, observe the value of SETTINGS in the debug log:
tpnote -d trace -b
If you wish, you can disable Tp-Note's language detection feature by deleting all entries in the tmpl.filter_get_lang variable:
    [tmpl]
    filter_get_lang = []
Like above, you can achieve the same with the environment variable TPNOTE_LANG_DETECTION by leaving its value empty.
A good start is Tp-Note's project page or the introductory video. The source code is available on GitHub (getreu/tp-note), and binaries and packages for Linux, Windows and Mac are available as well. To get the most out of Tp-Note, I recommend reading Tp-Note's user manual. If you like Tp-Note, you will probably soon want to customize it. How to do so is explained in Tp-Note's manual page.
Last updated on 2.6.2023