Data Integrity

Data loss and corruption are risks inherent in any software activity. This chapter deals with data integrity: the discipline of minimizing the risk of losing valuable information stored on digital media.

We organize all of our data into four categories. For each category, we identify the major threat of losing the data and how to minimize that risk.

Data categories

Backup
    Examples:   source code
    Threat:     human errors
    Prevention: take snapshots through time by using tar to create
                dated archives

Duplicate
    Examples:   e-mails
    Threat:     hardware failures
    Prevention: keep multiple identical copies on different machines
                using rsync

System
    Examples:   compilers
    Threat:     trojans, viruses, broken updates
    Prevention: fingerprint files using mtree and sha

Temp
    Examples:   object files
    Threat:     none; the files can be entirely re-generated
    Prevention: none
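As an illustration, each prevention technique maps to roughly a one-line shell command. The following sketch uses hypothetical pathnames and hostnames; mtree options vary slightly between BSD and Linux ports.

    # Backup: dated tar archive, a snapshot through time
    tar -cjf /var/backup/home-$(date +%Y-%m-%d).tar.bz2 /home

    # Duplicate: identical copy kept on a different machine
    rsync -a /var/mail/ backup.example.com:/var/mail/

    # System: fingerprint files using mtree with sha256 digests
    mtree -c -K sha256digest -p /usr/bin > /var/log/mtree/usr-bin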

Many times in the literature, duplicate and replicate are used interchangeably, but for clarity we will try to stick to the term replicate for data that is copied to an identical pathname on both the local and remote machines, and to the term duplicate for remote data that is archived under a common directory hierarchy on the local machine. In the case of replication, if the local and remote machines switched roles, services would be configured identically on both. In the case of duplication, it is possible to restore data onto the remote machine, but fundamentally the local machine serves only as an archive backup.
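In terms of rsync invocations, the distinction looks roughly like this; hostnames and pathnames are hypothetical examples.

    # Replicate: same pathname on both machines, so either machine
    # can take over the service
    rsync -a /etc/postfix/ mirror.example.com:/etc/postfix/

    # Duplicate: remote data archived under a common local hierarchy;
    # the local machine only serves as an archive backup
    rsync -a remote.example.com:/var/backup/ \
        /var/duplicate/remote.example.com/var/backup/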

How it works

An integrity tool generates a fingerprint of the system, duplicates and backups. Those fingerprints are stored in logDir. The backupTops, which include logDir, are archived into backupDir. Old archives are trimmed out of backupDir through a stamping mechanism. A script running on the backup machine pulls the duplicateTops, which include backupDir, from the remote machine into a local directory (duplicateDir). The data is pulled by the backup machine rather than pushed to it. As a result, duplication works very much the same way as a build of the repository. In fact, authentication of backup machines is set up through the same mechanism as for any contributor.
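A minimal sketch of the pull script running on the backup machine might look as follows; the hostname and directory lists are hypothetical examples, and the variable names are described in the table below.

    #!/bin/sh
    # Pull the duplicateTops from the remote machine into duplicateDir.
    remoteHost=remote.example.com
    duplicateDir=/var/duplicate
    duplicateTops="/var/backup /var/fortylines/data"

    for top in $duplicateTops ; do
        # Recreate the remote hierarchy under duplicateDir/remoteHost
        mkdir -p "$duplicateDir/$remoteHost$(dirname $top)"
        rsync -az "$remoteHost:$top/" "$duplicateDir/$remoteHost$top/"
    done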

Apart from integrity, the difference between two fingerprints of a system can also be used for auditing.
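For instance, with BSD mtree a fingerprint taken earlier can be replayed against the live system; each reported difference is a file that was added, removed or modified since. The paths here are only examples.

    # Record a reference fingerprint of /etc
    mtree -c -K sha256digest -p /etc > /var/log/mtree/etc.ref

    # Later: compare the live system against the reference
    mtree -K sha256digest -p /etc < /var/log/mtree/etc.ref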

backupDir
    Description: directory where archive (.tar.bz2) files are stored
    Example:     /var/backup

backupTops
    Description: all directory hierarchies to back up
    Example:     /home, /var/log

excludeTops
    Description: all directory hierarchies to exclude from fingerprinting
    Example:     *build*

logDir
    Description: directory where system fingerprints are stored
    Example:     /var/log/mtree

duplicateDir
    Description: path on the local backup machine where the remote data
                 is duplicated
    Example:     /var/duplicate

duplicateTops
    Description: all directory hierarchies on the remote machine to duplicate
    Example:     /var/backup, /var/fortylines/data
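The exact configuration syntax depends on the integrity tool in use; a hypothetical shell-style configuration matching the table above could read:

    # Hypothetical configuration file; values match the examples above.
    backupDir=/var/backup
    backupTops="/home /var/log"
    excludeTops="*build*"
    logDir=/var/log/mtree
    duplicateDir=/var/duplicate
    duplicateTops="/var/backup /var/fortylines/data"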

On a conventional Linux system, applications more or less follow the Filesystem Hierarchy Standard. As a result, amongst other things, most of /var/log and /var/lib is worth replicating.

Data retention

Keeping data around costs money and creates headaches. You don't need to keep everything on high-availability disks when fifty percent of your files haven't been touched in ninety days. It is a much better solution to move those old files to cheaper storage.

There are often legal constraints to keep some information for an extended period of time, and potential legal liabilities in keeping it longer, so an even better solution is to delete files, that is, "forget" all data older than a certain date.
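With standard Unix tools, both policies reduce to a find invocation; the pathnames and retention periods below are hypothetical.

    # Move files not modified in the last ninety days to cheaper
    # storage (note: this flattens the directory hierarchy).
    find /home -type f -mtime +90 -exec mv {} /mnt/archive/ \;

    # "Forget" data older than the retention cut-off, here one year.
    find /var/backup -type f -mtime +365 -delete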

Important: Data might be kept for very long periods of time, often on machines that outlive their life expectancy. At some point it is vital to understand the inner workings of the physical media, the file system implementation and the file format in which data is preserved. A good rule of thumb is to keep backup data on a separate partition from the system used to access it. This way the system can be modified and upgraded, for example to patch security issues, without touching the backups. The file system used to organize blocks of data on the medium should be as simple and as widely available across operating systems as possible. This reduces the risk that the machine used to read the data can no longer be made to work reliably. Finally, as much as possible, data should be stored in widely available file formats, preferably human-readable text formats. This ensures that, even if the source code of the application used to read the data were to disappear, it would still be relatively simple to interpret the information contained in the files.