Searching for non-ASCII characters in a file
03 December 2019
Most online solutions involve grep -P -n "[\x80-\xFF]", but these solutions do not work on Mac OS X or BSD variants of grep. Instead, use Perl for a more portable solution:
perl -ne 'print if /[^[:ascii:]]/' filename.txt
Tags: software, textprocessing, Unix, MacOS
Sampling a large text file
14 June 2019
Given a large enough text file, sampling solutions like the shuf utility can run out of memory. For statistical sampling, use one line of awk:
awk 'BEGIN {srand()} !/^$/ { if (rand() <= 0.01) print $0}' input.txt > output.txt
This returns about 1% of the non-empty lines in the file. For an exact number of lines, use a higher sampling ratio and pipe the output through head -n.
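As a rough sketch of the exact-count variant, the pipeline below samples at about 2% and keeps the first 1000 sampled lines. The input generated with seq, the 2% ratio, and the target of 1000 are all assumptions chosen for illustration:

```shell
# Hypothetical input: 100000 numbered lines.
seq 1 100000 > input.txt

# Sample at about 2% (roughly 2000 lines expected), then keep the first 1000.
awk 'BEGIN {srand()} !/^$/ { if (rand() <= 0.02) print $0}' input.txt | head -n 1000 > output.txt

# Count the lines that were kept.
wc -l < output.txt
```

Note that head keeps the earliest sampled lines, so the result leans toward the start of the file. Choosing a ratio only slightly above the target keeps the sample spread over most of the input.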
Tags: software, textprocessing, Unix