How Did I Fix That?

A log of solutions to problems I've encountered. No warranties.

Sampling a large text file

14 June 2019

Given a large enough text file, sampling solutions like the shuf utility will run out of memory. For statistical sampling, use one line of awk:
awk 'BEGIN {srand()} !/^$/ { if (rand() <= 0.01) print $0}' input.txt > output.txt
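This returns about 1% of the non-empty lines in the file (the !/^$/ pattern skips blank lines). For an exact number of lines, use a higher sampling ratio and pipe the output through head -n. Note that head truncates the sample, so the lines kept come from the earlier part of the file.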
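If the truncation bias matters, reservoir sampling is a different technique that returns exactly k lines, each equally likely, while holding only k lines in memory. The following is a minimal sketch rather than a tuned tool; sample_exact is a hypothetical helper name, and it skips empty lines the same way the one-liner above does.

```shell
#!/bin/sh
# Reservoir sampling (Algorithm R): keep exactly k non-empty lines,
# each chosen with equal probability, storing only k lines at a time.
# sample_exact is a hypothetical name; $1 = k, $2 = input file.
sample_exact() {
    awk -v k="$1" '
        BEGIN { srand() }                  # seeded from the clock (one-second resolution)
        /^$/  { next }                     # skip empty lines
              { n++ }                      # count of non-empty lines seen so far
        n <= k { keep[n] = $0; next }      # fill the reservoir first
              { j = int(rand() * n) + 1    # uniform integer in 1..n
                if (j <= k) keep[j] = $0 } # replace a slot with probability k/n
        END   { m = (n < k) ? n : k        # short files yield all of their lines
                for (i = 1; i <= m; i++) print keep[i] }
    ' "$2"
}
```

Usage: sample_exact 1000 input.txt > output.txt. The output comes out in reservoir order, not file order, and two runs started within the same second will share a seed and return the same sample.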

Tags: software, textprocessing, Unix

Shrinking large PDF images

10 June 2019

So you have some large (MB-sized) PDF images and you need to reduce them in size, maybe because arXiv requires images to be compressed. Starting with a file f1.pdf (1031756 bytes):

  • Use ImageMagick to compress to JPG or PNG and then re-encode as PDF. The JPG compression is very efficient, but the PDF re-encoding is not: the PDF output came out at the same size for every format and quality setting tried.
    convert f1.pdf -format JPG -quality 50 f1a.jpg  → 78532 bytes (7.6%)
    convert f1.pdf -format JPG -quality 10 f1a.pdf  → 758028 bytes (73%)
    convert f1.pdf -format JPG -quality 90 f1a.pdf  → 758028 bytes (73%)
    convert f1.pdf -format PNG -quality 50 f1a.pdf  → 758028 bytes (73%)
  • ImageMagick output can be processed with jpeg2ps and then epstopdf for better results:
    convert f1.pdf -format JPG -quality 50 f1a.jpg
    jpeg2ps f1a.jpg > f1a.eps
    epstopdf f1a.eps   → 81228 bytes (8%)
  • Use Ghostscript with the /screen or /ebook PDF output settings.
    gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
           -dPDFSETTINGS=/screen -sOutputFile=f1b.pdf f1.pdf  → 176120 bytes (17%)
    The /ebook setting output was nearly the same size as the /screen output.
  • On macOS (10.14.5 Mojave), exporting from Preview with the "Reduce File Size" Quartz filter gave excellent results (105368 bytes, 10%). This is harder to access from the command line, but the default filter is at:
    /System/Library/Filters/Reduce File Size.qfilter
    and ColorSync Utility can create modified versions of that filter in the ~/Library/Filters folder.
    Preview's Save as JPG also gives very good results:
    Quality setting = 7 [1...9]  → 166649 bytes (16%)
    Quality setting = 5 [1...9]  → 87077 bytes (8.4%)
  • Adobe Acrobat Pro has a PDF Optimizer, but its results are less compact, even with 72 DPI output and minimum JPG quality settings (412528 bytes, 40%). Photoshop and Illustrator can produce fairly compact JPGs that can be wrapped back into PDF files (as above), but Preview offers a simple and good enough solution.

Bottom line: Use Preview for conversions by hand. If scripting is required, use the ImageMagick convert utility for JPG output, then wrap the JPG as a PDF via jpeg2ps and epstopdf.
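For the scripted route, the three commands from the list above can be wrapped in a function, with the percentages computed the same way as in this post (output bytes divided by original bytes). This is a sketch under assumptions: shrink_pdf, size_of and pct are hypothetical helper names, the input is taken to be a single-page PDF, and convert, jpeg2ps and epstopdf must already be installed.

```shell
#!/bin/sh
# Size of a file in bytes (tr strips the padding some wc versions add).
size_of() { wc -c < "$1" | tr -d ' '; }

# New size as a percentage of the old size, to one decimal place.
pct() { awk -v new="$1" -v old="$2" 'BEGIN { printf "%.1f%%\n", 100 * new / old }'; }

# Hypothetical wrapper around the convert, jpeg2ps and epstopdf steps.
# Assumes a single-page PDF; writes <name>-small.pdf next to the input.
shrink_pdf() {
    stem="${1%.pdf}-small"
    convert "$1" -format JPG -quality 50 "$stem.jpg" &&
    jpeg2ps "$stem.jpg" > "$stem.eps" &&
    epstopdf "$stem.eps" &&                 # writes $stem.pdf
    echo "$stem.pdf: $(pct "$(size_of "$stem.pdf")" "$(size_of "$1")") of original"
}

pct 78532 1031756    # the JPG figures from the first bullet; prints 7.6%
```

Usage: shrink_pdf f1.pdf. The intermediate .jpg and .eps files are left in place so each stage can be inspected.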

Tags: software, MacOS, graphics, PDF

What's inside a mystery software package file?

27 March 2019

A package (a .pkg file on OS X) is a .xar archive. It holds the installable files as a cpio.gz archive named "Payload", along with a "bill of materials", scripts, etc. To inspect the contents, unpack the .xar into a directory, and then extract the Payload file:

mkdir scratch; cd scratch
xar -xf ../mystery.pkg
gunzip -dc Payload | cpio -i

The file hierarchy shows where the contents of the package would be distributed during installation.
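To preview the hierarchy without writing the payload files to disk, cpio's -t flag prints an archive's table of contents instead of extracting it. A minimal sketch, assuming the Payload is the gzip-compressed cpio archive described above; list_payload is a hypothetical helper name.

```shell
#!/bin/sh
# List the paths stored in a package Payload without extracting them.
# Assumes the Payload is a gzip-compressed cpio archive, as described above.
# list_payload is a hypothetical name; $1 = path to the Payload file.
list_payload() {
    gunzip -dc "$1" | cpio -it
}
```

Usage, from inside the scratch directory after the xar step: list_payload Payload | less. cpio writes its block count to standard error, so only the file names go down the pipe.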

Tags: MacOS, software

How is this blog generated?

29 August 2018

This web log is generated by a modified version of BashBlog, a simple Bash script blogging engine. My version, with customized CSS and global variables, is available here.

Tags: HTML, software