I need to ship a clone of a VM image “transhemispherically” over a high-latency link that tends to suffer many connection failures with scp and ssh+rsync. I decided to investigate shrinking the raw images, which are in use for performance reasons.
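As an aside, the connection failures themselves can be worked around somewhat independently of image size. The snippet below is only a sketch of a retry-until-done transfer; the remote host name and destination path are placeholders, not part of my actual setup.

# Retry an interrupted rsync until it completes; --partial keeps the
# partially transferred file on the receiver so each attempt resumes
# instead of starting over. "remotehost" and the paths are made up.
until rsync -av --partial --timeout=120 foo.img remotehost:/var/tmp/; do
    echo "transfer interrupted, retrying in 60s..." >&2
    sleep 60
done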
The VM image I used for testing is a clone of a real production image. It’s a 40GB raw file with the following partitioning and data usage.
$ du -sk foo.img
41943044  foo.img
[jhoblitt@foo ~]$ df -h
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/bootvg.foo-root  7.9G  3.5G  4.1G  47% /
tmpfs                        1.9G     0  1.9G   0% /dev/shm
/dev/vda1                    248M   57M  179M  25% /boot
/dev/mapper/bootvg.foo-home  7.9G  147M  7.4G   2% /home
/dev/mapper/bootvg.foo-tmp   4.0G  140M  3.7G   4% /tmp
/dev/mapper/bootvg.foo-var   7.9G  1.6G  6.0G  21% /var
/dev/mapper/bootvg.foo-ftp   7.9G  147M  7.4G   2% /d1/ftp
The test setup was as follows: convert raw -> qcow2 (with and without native qcow2 zlib compression), then apply lzma, bzip2, and gzip (zlib) compression to both the raw image and the uncompressed qcow2 image.
# convert raw to qcow2 (non-sparse)
qemu-img convert -O qcow2 foo.img foo.qcow2

# convert raw to compressed qcow2 (non-sparse)
qemu-img convert -O qcow2 -c foo.img foo.compressed.qcow2

# compressing the raw image
lzma -k --best foo.img &
bzip2 -k --best foo.img &
gzip --best -c foo.img > foo.img.gz &

# compressing the non-natively compressed qcow2 image
lzma -k --best foo.qcow2 &
bzip2 -k --best foo.qcow2 &
gzip --best -c foo.qcow2 > foo.qcow2.gz &
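On the receiving end the process simply reverses. For completeness, here is a sketch of getting back to a raw image, assuming standard xz/lzma and qemu-img tooling; this wasn't part of the measured test.

# decompress the lzma'd raw image (-k keeps the compressed copy around)
unlzma -k foo.img.lzma

# or, if a qcow2 was shipped instead, convert it back to raw
qemu-img convert -O raw foo.qcow2 foo.img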
The results are as follows:
$ du -sk foo.* | sort -nr
41943044  foo.img
 6438348  foo.qcow2
 2106320  foo.compressed.qcow2
 1884580  foo.img.gz
 1850252  foo.qcow2.gz
 1667536  foo.img.bz2
 1667204  foo.qcow2.bz2
 1122428  foo.img.lzma
 1116324  foo.qcow2.lzma
I find several things in the results fairly surprising. I would not have expected the gzip’d qcow2 image to be ~12% smaller than the natively zlib-compressed qcow2 image. The other big surprise was that lzma compression of the raw image came within 1% of running lzma on the qcow2 image. Since I want raw images after transfer anyway, lzma of the raw image is the winner for me, with its factor-of-37 compression ratio. However, keep in mind that for my use case I’m shipping across a high-latency / low-throughput link, so I’m willing to pay for the substantial lzma compression time. I did not include compression times in the results since I was compressing many images in parallel.
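For reference, the factor-of-37 figure falls straight out of the du -sk numbers above (sizes in KiB); bc truncates, so the ratio is roughly 37.4:

# raw size / lzma'd raw size
$ echo "scale=1; 41943044 / 1122428" | bc
37.3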