How Did I Fix That?

A log of solutions to problems I've encountered. No warranties.

Sampling a large text file

14 June 2019

Given a large enough text file, sampling tools such as shuf can run out of memory, since they typically load the entire file before shuffling. For statistical sampling, one line of awk suffices:
awk 'BEGIN {srand()} !/^$/ { if (rand() <= 0.01) print $0}' input.txt > output.txt
This keeps roughly 1% of the non-empty lines of the file (the !/^$/ pattern skips blank lines). For an exact number of lines, use a higher sampling ratio and trim the result by piping through head -n N.
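For example, to draw exactly 20 lines, oversample well above the target count and cut the stream down with head. A minimal sketch (the seq test file, the 5% ratio, and the target of 20 are illustrative choices, not from the original tip):

# Build a 1,000-line test file.
seq 1000 > input.txt
# Sample ~5% (about 50 lines expected), then keep only the first 20.
awk 'BEGIN {srand()} !/^$/ { if (rand() <= 0.05) print }' input.txt | head -n 20 > output.txt
wc -l < output.txt

Because the sample size is binomial, the oversampling ratio should be chosen so that falling short of the target is vanishingly unlikely; here the expected 50 lines comfortably cover a target of 20.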

Tags: software, textprocessing, Unix