Aug 21, 2012

Recovering large corrupted files

I uploaded a relatively big archive, about 8G, and sent the download link to a remote colleague.
He downloaded the file, but unfortunately his copy was corrupted. The file size was pretty close, but the checksum was wrong.
The obvious and boring solution is to download it again, but there is a better one.
rdiff is a not-so-well-known utility that uses the rolling checksum algorithm from rsync.
Usage is pretty straightforward:

  • rdiff signature broken-file broken-file-signature
  • rdiff delta broken-file-signature good-file good-file-delta
  • rdiff patch broken-file good-file-delta broken-file-fixed
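
To show why this works, here is a toy pure-Python sketch of the same signature/delta/patch idea. It is not how rdiff is actually implemented: the 4-byte block size, the MD5 strong hash, the op format, and every function name are my own assumptions for illustration. Only the weak rolling checksum follows the scheme described for rsync.

```python
import hashlib

BLOCK = 4        # tiny block size for illustration; real tools use far larger blocks
MOD = 1 << 16    # both halves of the weak checksum are kept to 16 bits


def weak(data):
    """rsync-style weak checksum of one window, computed from scratch."""
    a = sum(data) % MOD
    b = sum((len(data) - i) * x for i, x in enumerate(data)) % MOD
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one byte to the right without rescanning it."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - n * out_byte + a) % MOD
    return a, b


def signature(old):
    """Map weak checksum -> [(strong hash, offset)] for each full block of old."""
    sig = {}
    for off in range(0, len(old) - BLOCK + 1, BLOCK):
        block = old[off:off + BLOCK]
        sig.setdefault(weak(block), []).append((hashlib.md5(block).digest(), off))
    return sig


def delta(sig, new):
    """Describe new as copies of old blocks plus literal bytes."""
    ops, lit, i = [], bytearray(), 0
    a = b = None
    while i + BLOCK <= len(new):
        if a is None:
            a, b = weak(new[i:i + BLOCK])
        window = new[i:i + BLOCK]
        # The weak checksum is a cheap filter; the strong hash confirms the match.
        match = next((off for strong, off in sig.get((a, b), [])
                      if strong == hashlib.md5(window).digest()), None)
        if match is not None:
            if lit:
                ops.append(("data", bytes(lit)))
                lit = bytearray()
            ops.append(("copy", match))
            i += BLOCK
            a = None  # recompute from scratch at the new position
        else:
            lit.append(new[i])
            if i + BLOCK < len(new):
                a, b = roll(a, b, new[i], new[i + BLOCK], BLOCK)
            i += 1
    lit.extend(new[i:])  # tail shorter than one block is always sent literally
    if lit:
        ops.append(("data", bytes(lit)))
    return ops


def patch(old, ops):
    """Rebuild the good file from the broken one plus the delta."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg:arg + BLOCK] if kind == "copy" else arg
    return bytes(out)


good = b"The quick brown fox jumps over the lazy dog."
broken = b"The quick brXwn fox jumps over the lazy dog."

ops = delta(signature(broken), good)
fixed = patch(broken, ops)
literal_bytes = sum(len(arg) for kind, arg in ops if kind == "data")
print(fixed == good, literal_bytes, "literal bytes out of", len(good))
```

The point of the rolling update is that the sender can test every byte offset of the good file against the signature cheaply, so only the regions around the corruption end up in the delta as literal data.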


To summarize:

  1. My colleague creates a signature of his corrupted file and sends it to me.
  2. Using my good copy of the file and the signature from my colleague, I create a delta and send it to him.
  3. My colleague applies the delta to his corrupted file and gets the original good file.
  4. PROFIT
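
The steps above can be sketched as a small Python helper. This is only a sketch: the function names and default file names are hypothetical, and SHA-256 is an assumption (I'm not claiming it is the checksum we actually compared). In real life steps 1 and 3 run on the colleague's machine and step 2 on mine, so `run_locally` is only useful as a dry run on one machine that has both files and rdiff on the PATH.

```python
import hashlib
import subprocess


def rdiff_steps(broken, good, sig="broken-file-signature",
                delta="good-file-delta", fixed="broken-file-fixed"):
    """Return the three rdiff invocations, in order, as argv lists."""
    return [
        ["rdiff", "signature", broken, sig],       # step 1: colleague
        ["rdiff", "delta", sig, good, delta],      # step 2: me
        ["rdiff", "patch", broken, delta, fixed],  # step 3: colleague
    ]


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so an 8G archive never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_locally(broken, good):
    """Dry run: execute all three steps here, then compare checksums."""
    steps = rdiff_steps(broken, good)
    for argv in steps:
        subprocess.run(argv, check=True)  # raises if rdiff exits non-zero
    fixed = steps[-1][-1]
    return sha256_of(fixed) == sha256_of(good)
```

Whatever checksum you use, compare it again after patching, so you know the fixed file really matches the original.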




In our specific case we've got:
~8G original file
~50M signature file
~200M delta file

That is ~250M of traffic and ~30 minutes of CPU time to fix the file, instead of re-downloading all ~8G.
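
For the curious, the arithmetic behind that number, as a quick sketch. All figures are the rough ones above, and treating 8G as 8 * 1024M is my assumption.

```python
MB_PER_GB = 1024          # assumption: binary units

original = 8 * MB_PER_GB  # ~8G original file, in M
signature = 50            # ~50M, colleague -> me
delta = 200               # ~200M, me -> colleague

transferred = signature + delta
saved = original - transferred
ratio = transferred / original

print(f"transferred: ~{transferred}M, {ratio:.1%} of a full re-download")
print(f"saved: ~{saved}M")
```

So the repair cost roughly 3% of the traffic of a full re-download.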

Unfortunately, this utility is not preinstalled on most machines.
On OS X I had to install it through MacPorts (sudo port install rdiff-backup), which installed a pretty ancient version (librsync 0.9.7).
Luckily, the same ancient version exists as a precompiled binary for Windows (http://personal.hlfslinux.hu/hijaszu/rdiff.html).
On Linux, check your distribution's packages first (rdiff is part of librsync); failing that, you'd have to build it from source.

P.S. If you have direct/VPN/tunnel connectivity, it would be easier to just use rsync. In this case I didn't have that luxury.
