Announcing Stringsext Version 2
This blog post is about stringsext version 2, a major rewrite of the original software. Version 2 comes with performant filters for scripts such as “Latin”, “Arabic”, “Hebrew”, “Cyrillic”, ...
stringsext is a Unicode enhancement of the GNU strings tool with additional functionality: stringsext recognizes Cyrillic, CJKV characters and other scripts in all supported multi-byte encodings, while GNU strings fails to find any of these scripts in UTF-16 and many other encodings.
Update 2020-02-23: Besides Linux, Windows and Mac binaries, Debian/Ubuntu packages are available for download. More articles about Stringsext are available.
Screenshot
stringsext -tx -e utf-8 -e utf-16le -e utf-16be \
-n 10 -a None -u African /dev/disk/by-uuid/567a8410
3de2fff0+ (b UTF-16LE) ݒݓݔݕݖݗݙݪ
3de30000+ (b UTF-16LE) ݫݱݶݷݸݹݺ
<3de36528 (a UTF-8) فيأنمامعكلأورديافىهولملكاولهبسالإنهيأيقدهلثمبهلوليبلايبكشيام
>3de36528+ (a UTF-8) أمنتبيلنحبهممشوش
<3de3a708 (a UTF-8) علىإلىهذاآخرعددالىهذهصورغيركانولابينعرضذلكهنايومقالعليانالكن
>3de3a708+ (a UTF-8) حتىقبلوحةاخرفقطعبدركنإذاكمااحدإلافيهبعضكيفبح
3de3a780+ (a UTF-8) ثومنوهوأناجدالهاسلمعندليسعبرصلىمنذبهاأنهمثلكنتالاحيثمصرشرححو
3de3a7f8+ (a UTF-8) لوفياذالكلمرةانتالفأبوخاصأنتانهاليعضووقدابنخيربنتلكمشاءوهياب
3de3a870+ (a UTF-8) وقصصومارقمأحدنحنعدمرأياحةكتبدونيجبمنهتحتجهةسنةيتمكرةغزةنفسبي
3de3a8e8+ (a UTF-8) تللهلناتلكقلبلماعنهأولشيءنورأمافيكبكلذاترتببأنهمسانكبيعفقدحس
3de3a960+ (a UTF-8) نلهمشعرأهلشهرقطرطلب
3df4cca8 (c UTF-16BE) փօև։֍֏֑֛֚֓֕֗֙֜֝֞
<3df4cd20 (c UTF-16BE) ־ֿ׀ׁׂ׃ׅׄ׆ׇ
History
stringsext was first publicly released in December 2016 and had not changed much over the following three years. Of course, there had been some smaller bug fixes, and its migration to Rust's 2018 edition took some time, but the design and structure remained untouched. In the meantime, the Rust backend library rust-encoding had been abandoned in favour of encoding_rs, and it became apparent that migrating to the new library amounted to a major rewrite of the source code. Starting the new development was also a good opportunity to revise some early design decisions, taken when I first started development in summer 2016. I was not very experienced in Rust programming back then. Since memory safety was, from the start, a fundamental requirement, the first version of stringsext had to get along with no unsafe{} Rust code. This design decision, together with stringsext's multi-threaded architecture, resulted in numerous costly string copies and heap allocations.
In order to optimize speed, stringsext version 2 avoids copying stream data as much as possible. This goal could not have been achieved without some pointer arithmetic tagged unsafe{}. To prevent misunderstanding: the unsafe{} tag does not mean that the enclosed code is unsafe. Rather, it means that for a few well-encapsulated lines of code, the compiler permits operations it cannot verify (such as dereferencing raw pointers), and the programmer has to guarantee and test memory safety without compiler help. The biggest security concern with stringsext is that software tools used in disk and memory forensics are particularly exposed to malicious input. In stringsext, the code that first reads the potentially malicious input is the so-called “decoder”, which is provided by the external library encoding_rs. This explains why stringsext's security depends heavily on the security guarantees of this library. encoding_rs was chosen with security concerns in mind: it is well established, part of common web browsers and as such widely used, tested and trustworthy. In addition, encoding_rs is also written in Rust, and thus offers the same memory safety guarantees as stringsext does. Because of its upstream decoder, stringsext is never directly in contact with the input stream, which reduces its attack surface significantly.
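The idea of a small, well-encapsulated unsafe{} region can be illustrated with a minimal sketch. It is not taken from the stringsext source and uses no pointer arithmetic; the function name ascii_prefix is hypothetical. It returns a borrowed view into the input buffer without copying, and the single unsafe{} line is justified by a check performed just before it.

```rust
/// Returns the longest prefix of `buf` that consists only of printable ASCII,
/// as a borrowed `&str`. No bytes are copied and nothing is allocated.
/// (Hypothetical helper for illustration, not part of stringsext.)
pub fn ascii_prefix(buf: &[u8]) -> &str {
    // Find the end of the printable ASCII run (space ..= tilde).
    let len = buf
        .iter()
        .position(|b| !(0x20..=0x7e).contains(b))
        .unwrap_or(buf.len());
    // SAFETY: every byte in `buf[..len]` was checked to be in 0x20..=0x7e,
    // and any sequence of ASCII bytes is valid UTF-8.
    unsafe { std::str::from_utf8_unchecked(&buf[..len]) }
}
```

Callers only ever see the safe signature: the invariant that makes the unsafe{} line sound is established inside the same function, so it can be reviewed and tested in isolation.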
One of the first features added to stringsext version 1 was the Unicode-Block-Filter, which restricts the characters searched for to a Unicode code-point range. This feature proved very successful in memory forensics, which deals a lot with UTF-16. Almost any random byte stream can be interpreted as valid UTF-16, so the Unicode-Block-Filter is vital for reducing false positives when scanning for this encoding. At the same time, forensic practice also showed the limitations of the first filter design. To improve the situation, stringsext version 2 comes with a new and more configurable Unicode-Block-Filter.
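To make the effect of such a filter concrete, here is a minimal sketch using only the Rust standard library. It is not how stringsext implements its filter (stringsext decodes with encoding_rs); the function name utf16le_strings_in_block and its parameters are assumptions made for this illustration. It decodes a buffer as UTF-16LE and keeps only runs of characters whose code points lie in one given range.

```rust
use std::ops::RangeInclusive;

/// Decodes `bytes` as UTF-16LE and returns the runs of at least `min_len`
/// characters whose code points all fall inside `block`.
/// (Hypothetical helper for illustration, not part of stringsext.)
pub fn utf16le_strings_in_block(
    bytes: &[u8],
    block: RangeInclusive<u32>,
    min_len: usize,
) -> Vec<String> {
    // Pair up bytes into little-endian 16-bit code units;
    // a trailing odd byte is ignored.
    let units = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]));

    let mut found = Vec::new();
    let mut current = String::new();
    for decoded in std::char::decode_utf16(units) {
        match decoded {
            // Decoded fine and inside the wanted block: extend the run.
            Ok(c) if block.contains(&(c as u32)) => current.push(c),
            // Unpaired surrogate or character outside the block: the run ends.
            _ => {
                if current.chars().count() >= min_len {
                    found.push(std::mem::take(&mut current));
                } else {
                    current.clear();
                }
            }
        }
    }
    // Flush a run that reaches the end of the buffer.
    if current.chars().count() >= min_len {
        found.push(current);
    }
    found
}
```

Called with the range 0x0400..=0x04FF (the Cyrillic block), the function drops the arbitrary byte pairs that happen to decode as valid characters from other blocks; called with the full range 0..=0x10FFFF, nearly every byte pair survives, which is the false-positive problem described above.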
stringsext is a stream-oriented tool that harmonizes well with other UNIX tools like head, tail, sed or grep. Version 2 has its own simple grep-like filter, useful for scanning for file paths and URLs. For more complex filtering with regular expressions, pipe stringsext's output through grep or sed.
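What a simple grep-like filter does on a line-oriented stream can be sketched in a few lines. This is an illustration under assumptions, not stringsext's actual code: the function name grep_like is hypothetical, and plain substring matching is assumed here as the simplest possible criterion.

```rust
use std::io::{self, BufRead, Write};

/// Copies to `out` only those lines of `input` that contain `pattern`
/// as a plain substring (no regular expressions).
/// (Hypothetical helper for illustration, not part of stringsext.)
pub fn grep_like<R: BufRead, W: Write>(
    input: R,
    mut out: W,
    pattern: &str,
) -> io::Result<()> {
    for line in input.lines() {
        let line = line?; // propagate read errors and invalid UTF-8
        if line.contains(pattern) {
            writeln!(out, "{}", line)?;
        }
    }
    Ok(())
}
```

With "/" as the pattern, only lines that look like file paths or URLs pass. Because the function reads from any BufRead and writes to any Write, it can sit between standard input and standard output like any other stream filter.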
Summary of changes, improvements and new features
- Much faster (>30%).
- Improved Unicode-Block-Filter:
  - Search in scripts with predefined filters, e.g. Latin, Arabic, Syriac, Cyrillic, etc., and any combination of these.
  - Configurable custom filter.
- Improved ASCII-Filter:
  - Predefined filters, e.g. "all ASCII except control characters" or "all ASCII with white-space, without controls".
  - Custom filters with configurable sets of ASCII-codes that pass the filter.
- New internal "grep"-like filter, mainly useful to search for path strings.
- More detailed position indication for long strings.
- Better interface with other stream oriented tools e.g. "head", "tail", "sed" and "grep".
- Better handling of zero terminated (C-style) strings in large fields.
- New backend "encoding_rs".
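As an illustration of the "configurable sets of ASCII-codes" mentioned in the list, one compact way to represent such a set is a 128-bit mask with one bit per ASCII code. The following is a minimal sketch under that assumption; the type AsciiFilter and its methods are hypothetical names and do not describe stringsext's internals.

```rust
/// A set of ASCII codes stored as one bit per code:
/// bit `n` set means code `n` passes the filter.
/// (Hypothetical type for illustration, not part of stringsext.)
#[derive(Clone, Copy)]
pub struct AsciiFilter(u128);

impl AsciiFilter {
    /// Builds a filter from an inclusive range of codes plus extra single codes.
    /// Codes above 0x7f are silently ignored.
    pub fn new(range: std::ops::RangeInclusive<u8>, extra: &[u8]) -> Self {
        let mut mask = 0u128;
        for code in range.chain(extra.iter().copied()).filter(|c| c.is_ascii()) {
            mask |= 1u128 << code;
        }
        AsciiFilter(mask)
    }

    /// Returns true if `byte` is an ASCII code contained in the set.
    pub fn passes(self, byte: u8) -> bool {
        byte.is_ascii() && (self.0 >> byte) & 1 == 1
    }
}
```

For example, AsciiFilter::new(0x21..=0x7e, &[]) would model "all ASCII except control characters and space", while AsciiFilter::new(0x20..=0x7e, &[b'\t']) additionally lets space and tab pass. Testing a byte is a shift and a mask, which keeps the per-byte cost low.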
Resources
Read more about stringsext on its project page. A paper about stringsext, published in 2019 in the Journal of Digital Forensics, Security and Law, is available for free download.
stringsext comes with a man page and API documentation. The source code is hosted on GitHub and on GitLab. Binaries are available for download here.