Module stringsext::mission
Parse and convert command-line arguments into static MISSION structures,
which are mainly used to initialize ScannerState objects.
Structs
Mission represents the instruction parameters used mainly in scanner::scan().
Each thread gets its own instance and stores it in ScannerState.
A collection to bundle all Mission objects.
When the decoder finds a valid Unicode character, it decodes it into UTF-8.
The leading byte of this UTF-8 multi-byte character must then pass an
additional filter before being printed: the so-called Utf8Filter. It comes
with three independent filter criteria.
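The leading-byte idea can be illustrated with a minimal sketch that uses only the standard library. The helper name `leading_byte` is hypothetical and not part of this module. Because the first byte of a UTF-8 sequence carries the high bits of the code point, looking at that single byte is enough to tell roughly which Unicode block a character belongs to.

```rust
/// Hypothetical helper (not part of `stringsext`): returns the first byte
/// of the UTF-8 encoding of `c`.
fn leading_byte(c: char) -> u8 {
    // A `char` needs at most 4 bytes in UTF-8.
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes()[0]
}
```

For example, `leading_byte('A')` is `0x41` (ASCII is encoded as a single byte), `leading_byte('é')` is `0xC3` (U+00E9, Latin), `leading_byte('Ж')` is `0xD0` (U+0416, Cyrillic) and `leading_byte('中')` is `0xE4` (U+4E2D, CJK).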
Constants
ASCII filter: Let all ASCII pass the filter (0x01..0x80) except Null (0x00), which is the “end of string” marker. Null character - Wikipedia
ASCII filter: Controls: (0x00..0x20, 0x7F). C0 and C1 control codes - Wikipedia.
Unlike traditional strings, we exclude “Space” (0x20) here, as it can appear in
filenames. Instead, we consider “Space” to be a regular character.
ASCII filter: Set defaults close to those in traditional strings.
ASCII filter: Nothing passes the ASCII filter.
ASCII filter: White-space (0x09..=0x0c, 0x20). C0 and C1 control codes - Wikipedia. It does not include “Carriage Return” (0x0d). This way strings are divided into shorter chunks and we get more location information.
Unicode-block-filter: Accents: (U+300..U+380).
Unicode-block-filter: Armenian: (U+0540..), Hebrew: (U+0580..), Arabic: (U+0600..), Syriac: (U+0700..), Arabic: (U+0740..), Thaana: (U+0780..), N’Ko: (U+07C0..U+800)
Unicode-block-filter: A filter that lets all valid Unicode codepoints pass, except for ASCII, where
it behaves like the original strings. No leading bytes are filtered.
Unicode-block-filter: No leading bytes are filtered.
Unicode-block-filter: Arabic: (U+600..U+700, U+740..U+780)
Unicode-block-filter: Armenian: (U+540..U+580)
Unicode-block-filter: Kana: (U+3000..), CJK: (U+4000..), Asian: (U+A000..), Hangul: (U+B000..U+E000).
Unicode-block-filter: CJK: (U+3000..U+A000).
Unicode-block-filter: All 2-byte UTF-8 (U+07C0..U+800)
Unicode-block-filter: Cyrillic: (U+400..U+540)
Unicode-block-filter: Greek: (U+380..U+400).
Unicode-block-filter: Hangul: (U+B000..U+E000).
Unicode-block-filter: Hebrew: (U+580..U+600)
Unicode-block-filter: These leading bytes are always invalid in UTF-8.
Unicode-block-filter: IPA: (U+240..U+300).
Unicode-block-filter: Kana: (U+3000..U+4000).
Unicode-block-filter: Latin: (U+80..U+240).
Usually used together with UBF_ACCENTS.
Unicode-block-filter: Misc: (U+1000..), Symbol: (U+2000..U+3000), Forms: (U+F000..U+10000).
Unicode-block-filter: No leading byte > 0x7F is accepted. Therefore no multi-byte characters in UTF-8 pass, which makes this an ASCII-only filter.
Unicode-block-filter: Private use area (U+E000..U+F000), (U+10_0000..U+14_0000).
Unicode-block-filter: Syriac: (U+700..U+740)
Unicode-block-filter: Besides PUA, further very uncommon planes: (U+10_000..U+C0_000).
Shortcuts for the hexadecimal representation of a Unicode block filter.
The array is defined as (key, value) tuples.
For value, see the chapter Codepage layout in UTF-8 - Wikipedia.
A filter for ASCII encoding searches only. No control characters pass, but
whitespace is allowed. This works like the traditional stringsext mode.
Unless otherwise specified on the command line, this filter is the default for
ASCII-encoding searches.
A default filter for all non-ASCII encoding searches.
For single-byte characters (af-filter), no control characters
pass, but whitespace is allowed. This works like the traditional
stringsext mode.
For multi-byte characters we allow only Latin characters
with all kinds of accents.
Unless otherwise specified on the command line, this filter
is the default for non-ASCII-encoding searches.
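
The default behaviour described above (no control characters, whitespace allowed, Latin with accents for multi-byte characters) can be sketched with two bit masks. This is an illustration under stated assumptions, not the module's definition: it assumes that an ASCII filter is a 128-bit mask with one bit per ASCII code, and that a Unicode block filter is a 64-bit mask with one bit per leading byte in 0xC0..=0xFF. All names prefixed with `SKETCH_` and the function `passes` are hypothetical.

```rust
// Sketch only: the mask layouts and all names below are assumptions made for
// illustration, not the definitions used by this module.

/// Assumed layout: bit `n` set means ASCII code `n` passes.
/// Controls: 0x00..0x20 and 0x7F.
const SKETCH_AF_CTRL: u128 = ((1u128 << 0x20) - 1) | (1u128 << 0x7F);
/// White-space: 0x09..=0x0c and 0x20 (no "Carriage Return").
const SKETCH_AF_WHITESPACE: u128 = (0b1111u128 << 0x09) | (1u128 << 0x20);
/// No control characters pass, but white-space is allowed.
const SKETCH_AF_DEFAULT: u128 = !SKETCH_AF_CTRL | SKETCH_AF_WHITESPACE;

/// Assumed layout: bit `n` set means leading byte `0xC0 + n` passes.
/// Latin (U+80..U+240) corresponds to leading bytes 0xC2..=0xC8.
const SKETCH_UBF_LATIN: u64 = 0x01FC;
/// Accents (U+300..U+380) correspond to leading bytes 0xCC..=0xCD.
const SKETCH_UBF_ACCENTS: u64 = 0x3000;

/// Returns true if the leading byte of `c` passes the given pair of filters.
fn passes(af: u128, ubf: u64, c: char) -> bool {
    let mut buf = [0u8; 4];
    let lead = c.encode_utf8(&mut buf).as_bytes()[0];
    if lead < 0x80 {
        // Single-byte character: consult the ASCII filter.
        (af & (1u128 << lead)) != 0
    } else if lead >= 0xC0 {
        // Multi-byte character: consult the Unicode block filter.
        (ubf & (1u64 << (lead - 0xC0))) != 0
    } else {
        // 0x80..=0xBF are continuation bytes and never lead a valid sequence.
        false
    }
}
```

With `SKETCH_AF_DEFAULT` and `SKETCH_UBF_LATIN | SKETCH_UBF_ACCENTS`, the characters 'a', tab and 'é' pass, while "Carriage Return", Null, 'Ж' and '中' are rejected. Testing a single bit per leading byte keeps the check cheap, at the price of a coarse block granularity.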