Database Structure
Information about the filesystem to be backed up, and about the current contents of the backup volumes, is stored in a simple MongoDB database.
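For illustration, opening the database with pymongo might look like the following sketch; the connection URI and the database name `backup` are assumptions, not taken from the tool:

```python
from pymongo import MongoClient

# Local MongoDB instance; the URI and database name are assumptions for this sketch.
client = MongoClient("mongodb://localhost:27017/")
db = client["backup"]

files = db["files"]      # current state of the filesystem being backed up
volumes = db["volumes"]  # current contents of the backup volumes
```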
Filesystem
The collection that stores the information about the files currently in the filesystem is (uninspiredly!) named `files`.
The entries/documents in it have the form:
```python
{
    '_id': ObjectId("59e0a71c2afc32cfc4e7fa48"),
    'filename': r"Multimedia\video\animePlex\Shin Chan\Season 01\Shin Chan - S01E613.mp4",
    'hash': "4a7facfe42e8ff8812f9cab058bf79981974d9e2e300d56217d675ec5987cf05",
    'timestamp': 1197773340,
    'size': 68097104
}
```
where:
- The `filename` field is the file path relative to the mountpoint of the filesystem.
- The `hash` field is the SHA-256 hash of the file.
- `timestamp` is the file's last-modified timestamp, obtained with `os.stat(fn).st_mtime`.
- `size` is the size of the file in bytes.
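For illustration, one way such a document could be assembled from a file on disk; the helper name `file_entry` and the chunked-read size are illustrative, not part of the tool's code:

```python
import hashlib
import os


def file_entry(fn, mountpoint):
    """Build a files-collection document for one file (illustrative helper, not FileDB's API)."""
    st = os.stat(fn)
    h = hashlib.sha256()
    with open(fn, "rb") as f:
        # Hash in 1 MiB chunks so large media files need not fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return {
        'filename': os.path.relpath(fn, mountpoint),
        'hash': h.hexdigest(),
        'timestamp': st.st_mtime,
        'size': st.st_size,
    }
```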
The fields used for look-up are `filename` and `hash`, so the collection should have an index on each of them. The one on `filename` should use `unique=True`, to ensure no filename is added twice [2].
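With pymongo, the two indexes could be created like this (a sketch, assuming `files` is the collection handle from above):

```python
# Unique index: a given path may appear only once in the collection.
files.create_index("filename", unique=True)
# Plain index: the same hash may legitimately appear under several paths.
files.create_index("hash")
```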
The class that manages this collection is `FileDB`.
Volumes
On the other hand, the present state of the backup volumes is stored in the collection `volumes`, with entries like
```python
{
    '_id': ObjectId("59e484603e12972bd4209fbe"),
    'volume': "3EC0BECC",
    'hash': "0017eef276f4247807fa3f4e565b8c925a2db0f8bfbb020248ad6c3df6a6ea77",
    'size': 97092
}
```
where:

- `volume` is the volume id. On Windows, volume serial numbers are used; on Linux, disk serial numbers.
- The `hash` field is the SHA-256 hash of the file.
- `size` is the size of the file in bytes.
This entry says that volume 3EC0BECC contains a file with the given hash and a size of 97,092 bytes.
There should be a unique index on the `hash` field [1].
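Again as a pymongo sketch (assuming `volumes` is the collection handle from above):

```python
# Unique index: at most one volume may hold a file with a given hash.
volumes.create_index("hash", unique=True)
```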
The methods that add/remove files from a volume (see class `HashVolume`) also update this collection, so that it remains up to date.
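`HashVolume`'s actual methods are not shown here; a minimal sketch of the corresponding collection updates, with hypothetical helper names, might look like:

```python
def register_file(volumes, volume_id, file_hash, size):
    """Record that a volume now holds a file with this hash (hypothetical helper)."""
    # The unique index on 'hash' will reject a second copy of the same file.
    volumes.insert_one({'volume': volume_id, 'hash': file_hash, 'size': size})


def unregister_file(volumes, file_hash):
    """Record that the file with this hash was removed from its volume."""
    volumes.delete_one({'hash': file_hash})
```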
Footnotes
[1] In fact, this enforces that only one volume may contain a file with a specific hash. If the backup methods are working correctly, this should be the case. If the same file is found in different folders of the filesystem, or under different names, no space is wasted: just one copy will be present in the backup volumes.

[2] This is not true for `hash`, because we need to be able to back up systems that contain the same file in different locations. I was surprised to find that I had about 5% file redundancy by number of files; it turned out that some tiny files were necessary in many locations.