Module stringsext::input

This module abstracts the data-input channels, i.e. file and stdin.

Constants

BUF_LEN

The from_stdin() function implements its own reader buffer of size BUF_LEN to allow stepping with overlapping windows. The algorithm requires that BUF_LEN is greater than or equal to WIN_LEN (the larger BUF_LEN, the better the performance).

FINISH_STR_BUF

WIN_OVERLAP is the overlapping fragment of the window. The overlapping fragment is used to read some bytes ahead when the string is not finished. WIN_OVERLAP is subject to certain conditions: for example, the overlapping part must be smaller than WIN_STEP. Furthermore, the size of FINISH_STR_BUF = WIN_OVERLAP - UTF8_LEN_MAX determines the number of bytes at the beginning of a string that are guaranteed not to be split.
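The relations between these constants can be written down as compile-time checks. The following is a minimal sketch: all numeric values are hypothetical and chosen for illustration only, and the definition WIN_LEN = WIN_STEP + WIN_OVERLAP is an assumption that this page does not state.

```rust
// Hypothetical values for illustration; the crate's real constants may differ.
pub const UTF8_LEN_MAX: usize = 6;
pub const WIN_STEP: usize = 4096;
pub const FINISH_STR_BUF: usize = 6;

// Derived so that FINISH_STR_BUF = WIN_OVERLAP - UTF8_LEN_MAX holds.
pub const WIN_OVERLAP: usize = FINISH_STR_BUF + UTF8_LEN_MAX;

// Assumption: a window is one step plus its overlapping fragment.
pub const WIN_LEN: usize = WIN_STEP + WIN_OVERLAP;

// Assumption: the stdin buffer holds several steps plus one overlap.
pub const BUF_LEN: usize = 4 * WIN_STEP + WIN_OVERLAP;

// Compile-time checks of the documented constraints.
const _: () = assert!(WIN_OVERLAP < WIN_STEP);
const _: () = assert!(BUF_LEN >= WIN_LEN);
const _: () = assert!(FINISH_STR_BUF == WIN_OVERLAP - UTF8_LEN_MAX);
```

With checks like these, a change to one constant that violates a constraint fails at build time instead of producing split strings at run time.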

UTF8_LEN_MAX

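UTF8_LEN_MAX is the maximum number of bytes a multi-byte character can occupy in memory: 6 bytes. This is the limit of the original UTF-8 encoding scheme; RFC 3629 restricts valid UTF-8 to at most 4 bytes per character.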

WIN_LEN

WIN_LEN is the length of the memory chunk in which strings are searched in parallel.

WIN_OVERLAP

The scanner reads strings within the WIN_LEN window as far as it can. The first invalid byte indicates the end of a string, and the scanner pauses briefly to store its finding. It then searches further until the next string is found. Once WIN_OVERLAP is entered, the search ends and the start variable is updated so that it points to restart-at-invalid, as shown in the next figure. This way the next iteration can continue at the place where the previous one stopped.
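The stepping described above can be sketched as follows. This is a simplified illustration, not the crate's scanner: "valid" here means printable ASCII only, the names scan_window and scan_all are hypothetical, and a string that does not end inside its window is simply cut at the window end.

```rust
/// Assumption for this sketch: a valid byte is printable ASCII or a space.
fn is_valid(b: u8) -> bool {
    b.is_ascii_graphic() || b == b' '
}

/// Scans one window. Strings may begin only in the first `step` bytes, but
/// may extend into the overlap. Returns the findings and the offset at which
/// the next window (which begins `step` bytes later) must restart.
fn scan_window(window: &[u8], start: usize, step: usize) -> (Vec<String>, usize) {
    let mut findings = Vec::new();
    let mut pos = start;
    while pos < step.min(window.len()) {
        if !is_valid(window[pos]) {
            pos += 1; // skip invalid bytes between strings
            continue;
        }
        // Read as far as possible: up to the first invalid byte or window end.
        let len = window[pos..].iter().take_while(|&&b| is_valid(b)).count();
        findings.push(String::from_utf8_lossy(&window[pos..pos + len]).into_owned());
        pos += len;
    }
    // Translate the stop position into the next window's coordinates.
    (findings, pos.saturating_sub(step))
}

/// Steps over `data` with overlapping windows of `step + overlap` bytes.
fn scan_all(data: &[u8], step: usize, overlap: usize) -> Vec<String> {
    assert!(step > 0, "step must be positive");
    let mut all = Vec::new();
    let (mut offset, mut start) = (0, 0);
    while offset < data.len() {
        let end = (offset + step + overlap).min(data.len());
        let (found, next_start) = scan_window(&data[offset..end], start, step);
        all.extend(found);
        start = next_start;
        offset += step;
    }
    all
}
```

Because each window hands its stop position to the next one, a string that reaches into the overlap is reported once and not rescanned.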

WIN_STEP

As files are accessed through 4 KiB memory pages, we choose WIN_STEP to be a multiple of 4096 bytes.

Functions

from_file

Streams a file by cutting the input into overlapping chunks and feeds the ScannerPool. After each iteration the byte_counter is updated. To avoid additional copying, the memmap crate is used to access the file contents. See: https://en.wikipedia.org/wiki/Memory-mapped_file

from_stdin

Streams the input pipe by cutting it into overlapping chunks and feeds the ScannerPool. This function implements its own rotating input buffer. After each iteration the byte_counter is updated.
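One way to build such a rotating buffer over any reader is sketched below. It is an illustration under assumptions, not the crate's code: the function name for_each_window and its parameters are hypothetical, the buffer has the minimum size step + overlap, and interrupted reads are not retried.

```rust
use std::io::{self, Read};

/// Calls `on_window` for every overlapping window read from `reader`.
/// After each window, the last bytes (the overlap) are moved to the front
/// of the buffer and the rest is refilled from the reader.
fn for_each_window<R: Read>(
    mut reader: R,
    step: usize,
    overlap: usize,
    mut on_window: impl FnMut(&[u8]),
) -> io::Result<()> {
    assert!(step > 0, "step must be positive");
    let mut buf = vec![0u8; step + overlap];
    let mut filled = 0; // number of valid bytes currently in `buf`
    let mut eof = false;
    loop {
        // A single read() may return fewer bytes than requested, so keep
        // reading until the buffer is full or the input is exhausted.
        while !eof && filled < buf.len() {
            match reader.read(&mut buf[filled..])? {
                0 => eof = true,
                n => filled += n,
            }
        }
        if filled == 0 {
            break;
        }
        on_window(&buf[..filled]);
        if eof && filled <= step {
            break; // nothing is left beyond this window's step region
        }
        // Rotate: keep everything after the step region for the next window.
        buf.copy_within(step..filled, 0);
        filled -= step;
    }
    Ok(())
}
```

A call such as for_each_window(std::io::stdin().lock(), step, overlap, |w| { ... }) would then hand each window to a scanner; with step 4 and overlap 2, the input "abcdefghij" yields the windows "abcdef", "efghij" and "ij".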

process_input

Read the appropriate input chunk by chunk and launch the scanners on each chunk. If file_path_str == None, read from stdin; otherwise read from the file.
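The dispatch on file_path_str can be sketched with the standard library alone. This is only an illustration: the helper names open_input and process_input_sketch are hypothetical, the file is read through std::fs::File rather than a memory map, and the scanners are replaced by a simple byte count.

```rust
use std::fs::File;
use std::io::{self, Read};

/// Hypothetical helper: `None` selects stdin, `Some(path)` selects a file.
fn open_input(file_path_str: Option<&str>) -> io::Result<Box<dyn Read>> {
    match file_path_str {
        None => Ok(Box::new(io::stdin())),
        Some(path) => Ok(Box::new(File::open(path)?)),
    }
}

/// Reads the chosen input chunk by chunk and returns the number of bytes
/// seen. A real implementation would launch the scanners on each chunk.
fn process_input_sketch(file_path_str: Option<&str>, chunk_len: usize) -> io::Result<usize> {
    let mut input = open_input(file_path_str)?;
    let mut chunk = vec![0u8; chunk_len];
    let mut byte_counter = 0;
    loop {
        let n = input.read(&mut chunk)?;
        if n == 0 {
            break; // end of input
        }
        // Placeholder for the scanners: only &chunk[..n] holds valid data.
        byte_counter += n;
    }
    Ok(byte_counter)
}
```

Returning a boxed reader keeps the chunk loop identical for both channels; the module itself uses two separate functions, from_file and from_stdin, because the file path can avoid copying.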