Multicore compress and decompress a directory with tar and pzstd

May 24, 2023   

TL;DR:

I managed multicore compression & decompression of a directory with tar and pzstd.

History

These days, I work with large codebases such as Android or RDK. Sometimes, I need to archive those codebases.

My PC has an 8C/16T CPU (Ryzen 5700G), 64 GB of DDR4 RAM, and a 2 TB Samsung 970 EVO Plus NVMe M.2 SSD, and I want this archiving operation to be as fast as possible.

In the past, I used tools like pigz, but these days I love to use zstd, and a multithreaded version of it is available: pzstd.

Prerequisites

  • tar (should be available on most distros)
  • pzstd (install via your package manager)

Compressing a directory with multiple threads

Generic usage:

tar --use-compress-program pzstd -TAR_ARGUMENTS FINAL_TARBALL_NAME DIRECTORY_TO_ARCHIVE

Example: I cloned the AOSP-13 source with the “repo” tool, which gave me a directory of more than 100 GB. Let’s assume the directory is named “AOSP-13”.

tar --use-compress-program pzstd -cvf AOSP-13.tar.zstd AOSP-13/

Or even simpler:

tar -I pzstd -cvf AOSP-13.tar.zstd AOSP-13/
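
pzstd also accepts options of its own, and they can be passed by quoting the whole compressor command. The following is a minimal sketch, assuming a GNU tar recent enough to split the quoted string into a program and its arguments, and assuming pzstd's -p (number of threads) and -# (compression level) options. The helper name pack and the values 8 and 10 are hypothetical, chosen only for illustration:

```shell
# pack ARCHIVE DIR [COMPRESSOR_COMMAND]
# Hypothetical helper: archives DIR into ARCHIVE through an external compressor.
# The default asks pzstd for 8 threads (-p 8) at compression level 10 (-10);
# both values are illustrative, not tuned.
pack() {
    pack_compressor="${3:-pzstd -p 8 -10}"
    # GNU tar splits the quoted command into the program name and its arguments.
    tar --use-compress-program "$pack_compressor" -cf "$1" "$2"
}

# Example: pack AOSP-13.tar.zstd AOSP-13/
```

The third parameter exists only so that the compressor can be swapped, for instance to compare against pigz.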

Performance

The AOSP-13 directory above is 171 GB in size, with hundreds of thousands of text & binary files of various types in it.

stulluk /media/WORK/ANDROID/temp/AOSP-13 $  du -sh
171G	.
stulluk /media/WORK/ANDROID/temp/AOSP-13 $  find . -type f | sed 's/.*\.//' | sort | uniq -c | sort -rn
 239067 h
 119647 c
 107300 java
  72684 xml
  48560 cpp
  34947 html
  33718 txt
  32929 py
  29148 png
  24600 cc
  21136 sha1
  19403 md5
  18299 go
  14180 jar
  12855 ll
  12477 so
  12390 rs
   9636 lsdump
   8975 bp
   8500 aidl
   7968 rst
   7319 pbtxt
   7068 sh
   6582 pom
   6091 json
   5890 md
   5566 hpp
   5511 dts
   5452 a
   5405 S
   5101 kt
   3897 dtsi
   3577 yaml
   3546 out
   3443 s
   3198 te
   3049 test
   2949 mk
   2779 smali
   2675 in
   2626 gitignore
   2336 git/info/exclude
   2336 git/HEAD
   2336 git/description
   2336 git/config
   2007 cfg
   1877 proto
   1837 aar
   1783 m
   1739 gradle
   1714 o
   1700 bin
   1573 jpeg
   1572 ttf
   1539 frag
   1500 dump
   1435 inc
   1404 pem
   1259 3
   1207 cs
   1202 amber
   1193 1
   1173 git/refs/remotes/m/t-tv-dev
   1173 git/logs/refs/remotes/m/t-tv-dev
   1173 git/index
   1173 git/FETCH_HEAD
   1156 git/packed-refs
   1144 rlib
   1125 idx
   1123 pack
   1118 def
   1090 properties
   1047 js
   1045 hal
   1044 td
   1007 dat
    974 otf
    973 cmake
    947 jpg
    938 rscript
    901 ko
    896 gn
    882 gz
    871 groovy
    828 asm
    814 asc
    803 apk
    797 mm
    797 bat
    772 expected
    738 zip
    727 rc
    723 sha256
    720 crt
    710 tmpl
    693 yml
    684 glsl
    672 data
    650 svg
    648 toml
    636 ogg
    634 bazel
    633 sha512
    632 pcap
    616 go2
    611 ini
    593 vert
    591 keep
    590 err
    577 0
    569 jmod
    564 bzl
    562 gif
    544 expect
    520 bz2
    518 g
    508 mlir
    488 ttx
    472 8
    465 patch
    465 mp4
    461 input
    456 syms
    455 css
    448 sksl
    444 m4
    443 asciipb
    432 conf
    431 ts
    431 d
    427 p12
    427 2
    417 pl
    417 4
    413 csv
    .....

Compressing it with tar & pzstd ( with default params ) takes less than 4 minutes:

stulluk /media/WORK/ANDROID/temp $  /usr/bin/time -v tar -I pzstd -cf AOSP-13.tar.zstd AOSP-13/ 
	Command being timed: "tar -I pzstd -cf AOSP-13.tar.zstd AOSP-13/"
	User time (seconds): 719.49
	System time (seconds): 224.74
	Percent of CPU this job got: 409%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 3:50.83
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 532216
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 2
	Minor (reclaiming a frame) page faults: 1236154
	Voluntary context switches: 5994338
	Involuntary context switches: 627436
	Swaps: 0
	File system inputs: 356223840
	File system outputs: 233924608
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
stulluk /media/WORK/ANDROID/temp $

Note that I did not use the bash built-in “time” keyword: by default it does not report CPU usage as a percentage.
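
The “Percent of CPU” figure is (user time + system time) divided by the elapsed wall-clock time, so a value above 100% means several cores were busy at once. A quick check with the numbers from the log above (3:50.83 is 230.83 seconds), using awk only for the arithmetic:

```shell
# CPU% = (user + system) / elapsed * 100, values copied from the log above
user=719.49
system=224.74
elapsed=230.83   # 3 min 50.83 s

cpu_percent=$(awk -v u="$user" -v s="$system" -v e="$elapsed" \
    'BEGIN { printf "%d", (u + s) / e * 100 }')

echo "CPU usage: ${cpu_percent}%"   # prints "CPU usage: 409%"
```

In other words, the compression kept about four cores' worth of CPU busy on average.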

Decompressing a directory with multiple threads

Generic usage:

tar --use-compress-program pzstd -TAR_ARGUMENTS TARBALL_NAME 

Example: I have the AOSP-13.tar.zstd file from above and I want to decompress it with multiple cores:

tar --use-compress-program pzstd -xvf AOSP-13.tar.zstd

Or even simpler:

tar -I pzstd -xvf AOSP-13.tar.zstd
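
Two variations are often handy: listing the contents without extracting, and extracting into a different directory. This is a minimal sketch, assuming GNU tar. The names list_archive and unpack_to are hypothetical helpers, and the compressor is a parameter only so that it can be swapped:

```shell
# list_archive ARCHIVE [COMPRESSOR]: print the file names stored in ARCHIVE.
list_archive() {
    tar --use-compress-program "${2:-pzstd}" -tf "$1"
}

# unpack_to ARCHIVE DEST_DIR [COMPRESSOR]: extract ARCHIVE below DEST_DIR.
unpack_to() {
    mkdir -p "$2" || return 1
    # -C makes tar change into DEST_DIR before it writes any file.
    tar --use-compress-program "${3:-pzstd}" -xf "$1" -C "$2"
}

# Example (the destination path is made up):
#   list_archive AOSP-13.tar.zstd | head
#   unpack_to AOSP-13.tar.zstd /tmp/restore
```

For both listing and extracting, tar starts the compressor program with -d, so the same program name works in both directions.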

Performance

stulluk /media/WORK/ANDROID/temp $  /usr/bin/time -v tar -I pzstd -xf AOSP-13.tar.zstd 
	Command being timed: "tar -I pzstd -xf AOSP-13.tar.zstd"
	User time (seconds): 141.14
	System time (seconds): 203.48
	Percent of CPU this job got: 168%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 3:24.63
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 392308
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 2
	Minor (reclaiming a frame) page faults: 2371705
	Voluntary context switches: 17620483
	Involuntary context switches: 25312
	Swaps: 0
	File system inputs: 233926600
	File system outputs: 356182424
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
stulluk /media/WORK/ANDROID/temp $

Amazing, isn’t it :)
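
The compression log also allows a rough estimate of the compression ratio and the read throughput. The sketch below assumes that the “File system inputs/outputs” counters of GNU time are in 512-byte units, as they are on Linux. That unit is an assumption, not something the log states:

```shell
# Counters copied from the compression log above
blocks_read=356223840      # "File system inputs"  (uncompressed data read)
blocks_written=233924608   # "File system outputs" (compressed archive written)
elapsed=230.83             # 3:50.83 in seconds

# Assumption: one counter unit is 512 bytes.
ratio=$(awk -v r="$blocks_read" -v w="$blocks_written" \
    'BEGIN { printf "%.2f", r / w }')
read_mb_s=$(awk -v r="$blocks_read" -v e="$elapsed" \
    'BEGIN { printf "%.0f", r * 512 / e / 1000000 }')

echo "compression ratio: $ratio"            # prints "compression ratio: 1.52"
echo "read throughput:   $read_mb_s MB/s"   # prints "read throughput:   790 MB/s"
```

Under the same assumption, 356223840 × 512 bytes is about 170 GiB, which is close to the 171G reported by du.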

BASH Alias

I added the following line to my ${HOME}/.bashrc out of laziness:

alias tpz='tar -I pzstd '

After sourcing this .bashrc, I can easily do:

tpz -cvf AOSP-13.tar.zstd AOSP-13

or

tpz -xvf AOSP-13.tar.zstd

Note

Adding “v” to the tar arguments slightly reduces the speed in my environment, by around 10-15%. I think this is due to the terminal speed of my PC (I use & love terminator).
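
If the file list is still useful, one option is to keep “v” but send the list to a file instead of the terminal. This is a minimal sketch. The helper name pack_logged and the log file name are hypothetical, and how much of the slowdown this avoids was not measured here:

```shell
# pack_logged ARCHIVE DIR LOGFILE [COMPRESSOR]
# Creates ARCHIVE from DIR and writes the verbose file list to LOGFILE,
# so the terminal does not have to render every file name.
pack_logged() {
    # When the archive goes to a file, tar prints the -v listing on stdout.
    tar --use-compress-program "${4:-pzstd}" -cvf "$1" "$2" > "$3"
}

# Example: pack_logged AOSP-13.tar.zstd AOSP-13 AOSP-13.filelist
```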

Final words

Of course there are compressors with better ratios than zstd, but the critical point for me is speed.

zstd gives a good balance between compression / decompression speed and compression ratio.

Most modern kernels support zstd out of the box. Thanks to Facebook/Meta for this beautiful software.


