Module stringsext::input

This module abstracts the data-input channels, i.e. file and stdin.



The from_stdin() function implements its own reader buffer of size BUF_LEN to allow stepping with overlapping windows. The algorithm requires that BUF_LEN is greater than or equal to WIN_LEN (the larger BUF_LEN is, the better the performance).


WIN_OVERLAP is the overlapping fragment of the window. It is used to read some bytes ahead when a string is not yet finished. WIN_OVERLAP is subject to certain conditions: for example, the overlapping part must be smaller than WIN_STEP. Furthermore, the size of FINISH_STR_BUF = WIN_OVERLAP - UTF8_LEN_MAX determines the number of bytes at the beginning of a string that are guaranteed not to be split.


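The maximum number of bytes a multi-byte character is assumed to occupy in memory is 6 bytes. This matches the original UTF-8 design; RFC 3629 limits valid UTF-8 sequences to 4 bytes, so 6 is a conservative upper bound.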


WIN_LEN is the length of the memory chunk in which strings are searched in parallel.


The scanner tries to read strings in WIN_LEN as far as it can. The first invalid byte indicates the end of a string, and the scanner pauses to store its finding. It then continues searching until the next string is found. Once WIN_OVERLAP is entered, the search ends and the start variable is updated so that it points to restart-at-invalid, as shown in the next figure. This way the next iteration can continue at the same place where the previous one stopped.
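The stepping rule described above can be illustrated with a minimal sketch. It is not the crate's scanner: `scan_window`, `is_valid` and the `overlap_start` parameter are hypothetical names, printable ASCII stands in for the real decoders, and a string that reaches the end of the window is simply cut there.

```rust
/// Stand-in validity test: printable ASCII only (an assumption for this
/// sketch; the real scanners decode several encodings).
fn is_valid(b: u8) -> bool {
    (0x20..=0x7e).contains(&b)
}

/// Scans `window` for runs of valid bytes. New strings are only started
/// before `overlap_start`; a string that begins there may run on into the
/// overlap. Returns the findings and the offset (relative to the window
/// start) at which the next iteration should resume.
pub fn scan_window(window: &[u8], overlap_start: usize) -> (Vec<String>, usize) {
    let mut found = Vec::new();
    let mut pos = 0;
    // Stop starting new strings once the overlap is entered.
    while pos < overlap_start.min(window.len()) {
        if !is_valid(window[pos]) {
            pos += 1; // skip invalid bytes between strings
            continue;
        }
        // Read the string as far as possible, even into the overlap.
        let begin = pos;
        while pos < window.len() && is_valid(window[pos]) {
            pos += 1;
        }
        found.push(String::from_utf8_lossy(&window[begin..pos]).into_owned());
    }
    // `pos` now sits at the overlap start or at the first invalid byte
    // after a string that ended inside the overlap.
    (found, pos)
}
```

In this sketch the returned offset plays the role of the start variable: the caller would begin the next window at that position, so a string that was finished inside the overlap is not reported twice.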


As files are accessed through 4 KiB memory pages, we choose WIN_STEP to be a multiple of 4096 bytes.
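The constraints on these constants can be written down as compile-time checks. The following is only a sketch: the numeric values and `PAGE_SIZE` are hypothetical, and the relation WIN_LEN = WIN_STEP + WIN_OVERLAP is an assumption made here for illustration, not something the text above states.

```rust
// Hypothetical values chosen only to satisfy the documented constraints;
// the crate's real constants may differ.
pub const PAGE_SIZE: usize = 4096; // assumed page size (4 KiB)
pub const UTF8_LEN_MAX: usize = 6;
pub const WIN_STEP: usize = 8 * PAGE_SIZE;
pub const WIN_OVERLAP: usize = 1024 + UTF8_LEN_MAX;
pub const FINISH_STR_BUF: usize = WIN_OVERLAP - UTF8_LEN_MAX;
// Assumption: a window is one step plus its overlapping fragment.
pub const WIN_LEN: usize = WIN_STEP + WIN_OVERLAP;
pub const BUF_LEN: usize = 4 * WIN_STEP + WIN_OVERLAP;

// Compile-time checks of the conditions named in the text.
const _: () = assert!(WIN_STEP % PAGE_SIZE == 0); // multiple of 4096 bytes
const _: () = assert!(WIN_OVERLAP < WIN_STEP); // overlap smaller than step
const _: () = assert!(BUF_LEN >= WIN_LEN); // buffer holds a full window
```

Expressing the conditions as `const` assertions means a bad combination of values fails at build time rather than at run time.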



Streams a file by cutting the input into overlapping chunks and feeds the ScannerPool. After each iteration the byte_counter is updated. To avoid additional copying, the memmap crate is used to access the file contents. See: https://en.wikipedia.org/wiki/Memory-mapped_file
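The chunking itself can be sketched without any file handling, assuming the file contents are already available as a byte slice (as a memory map would provide). `stream_slice` and the `feed` callback are hypothetical names; the callback stands in for handing a chunk to the ScannerPool.

```rust
/// Cuts `data` into chunks of `win_step + win_overlap` bytes that start
/// `win_step` bytes apart, and passes each chunk with its offset to `feed`.
/// Returns the final value of the byte counter (the input length).
pub fn stream_slice<F: FnMut(usize, &[u8])>(
    data: &[u8],
    win_step: usize,
    win_overlap: usize,
    mut feed: F,
) -> usize {
    assert!(win_step > 0, "the window must advance");
    let mut byte_counter = 0;
    while byte_counter < data.len() {
        // The last chunks are shorter when the input runs out.
        let end = (byte_counter + win_step + win_overlap).min(data.len());
        feed(byte_counter, &data[byte_counter..end]);
        byte_counter += win_step; // updated after each iteration
    }
    byte_counter.min(data.len())
}
```

Because every chunk is a sub-slice of `data`, no bytes are copied, which is the property the memory-mapped approach is after.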


Streams the input pipe by cutting it into overlapping chunks and feeds the ScannerPool. This function implements its own rotating input buffer. After each iteration the byte_counter is updated.
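A rotating input buffer of this kind might look like the sketch below. It is a simplification under stated assumptions: the buffer is exactly one window long (the text allows BUF_LEN to be larger), and `stream_reader` and the `feed` callback are hypothetical names, with the callback standing in for the ScannerPool.

```rust
use std::io::{self, Read};

/// Reads `input` window by window. After each full window the last
/// `win_overlap` bytes are moved to the front of the buffer so that the
/// next window overlaps the previous one. Returns the total bytes read.
pub fn stream_reader<R: Read, F: FnMut(usize, &[u8])>(
    mut input: R,
    win_step: usize,
    win_overlap: usize,
    mut feed: F,
) -> io::Result<usize> {
    assert!(win_step > 0, "the window must advance");
    let win_len = win_step + win_overlap;
    let mut buf = vec![0u8; win_len];
    let mut filled = 0; // number of valid bytes in `buf`
    let mut byte_counter = 0; // stream offset of `buf[0]`
    loop {
        // Top up the buffer; a pipe may deliver fewer bytes per read.
        while filled < win_len {
            match input.read(&mut buf[filled..]) {
                Ok(0) => break, // end of input
                Ok(n) => filled += n,
                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                Err(e) => return Err(e),
            }
        }
        if filled == 0 {
            break;
        }
        feed(byte_counter, &buf[..filled]);
        if filled < win_len {
            byte_counter += filled; // short window: input is exhausted
            break;
        }
        // Rotate: keep the overlap as the start of the next window.
        buf.copy_within(win_step.., 0);
        filled = win_overlap;
        byte_counter += win_step;
    }
    Ok(byte_counter)
}
```

Unlike the memory-mapped case, the overlap has to be copied here, because a pipe cannot be revisited once it has been read.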


Read the appropriate input chunk by chunk and launch the scanners on each chunk. If file_path_str == None, read from stdin; otherwise read from the file.
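The selection between the two channels can be sketched as a match on the optional path. This is a simplification, not the crate's code: `open_input` is a hypothetical helper that returns a plain reader for both cases, whereas the module described above treats files and stdin differently.

```rust
use std::fs::File;
use std::io::{self, Read};

/// Returns a reader for the chosen input channel:
/// stdin when no path is given, otherwise the named file.
pub fn open_input(file_path_str: Option<&str>) -> io::Result<Box<dyn Read>> {
    match file_path_str {
        None => Ok(Box::new(io::stdin())),
        Some(path) => Ok(Box::new(File::open(path)?)),
    }
}
```

Returning a boxed `Read` keeps the caller independent of where the bytes come from; an open error for a missing file is passed back to the caller instead of being handled here.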