Duplicacy Issue: system design & performance issues


mwang Oct 3 11:05AM 2017 CLI

Hi,

I've been trying out various backup solutions that I hope can replace CrashPlan in a few months. The field has been narrowed down to three candidates: Duplicacy, Arq and Hashbackup. While I like Duplicacy a lot, I've seen some puzzling issues.

Some basic information first:

1. I want to back up to the cloud. (All three can also do local backups; just to make clear my focus of evaluation.)
2. My cloud backup destination of choice post CrashPlan would be Google Drive. (For I get unlimited storage through my university G Suite subscription.)
3. The contract bandwidth with my ISP is 100Mbs/40Mbs.
4. I'm backing up from a Mac, though I do have a Linux home server, and may consider moving my cloud backup operation to it someday.
5. I plan to conduct my backups from a single source machine (Mac now, maybe Linux later).
6. The size of data to be backed up is about 4+ TB at the moment: 10GB documents and other small files, 20GB mail, 50 GB photo, 500 GB music, and 3.5 TB video. Only the first two sets (doc and mail) see regular daily changes. Others come in bunches after travels or when I have time to rip more CD/DVD/BDs.

Now, the questions:

Duplicacy's main attraction is the well-crafted (and well-explained) lock-free deduplication system design that enables backing up multiple sources to the same destination, maximizing dedup efficiency. It also promises fast backup/restore operations. The former isn't that important to me for the time being due to reason 5 above, but is a nice feature nonetheless.

In my tests, Duplicacy is indeed fast when the backup storage is on a local disk. When trying to restore a directory of some 20MB in size from Google Drive, however, Duplicacy spent more than 40 min. to produce the file list for the revision I asked for, and then another 40+ min. to restore the data. For comparison, both Arq and Hashbackup did both jobs in seconds.

Since I haven't uploaded everything yet with Duplicacy (the folder on Google Drive reserved for Duplicacy backups stands at 552 GB), and have conducted only about 20 rounds of backups, it worries me how long a restore will take down the road.

After repeated tests (with similar results), I found Duplicacy spent a long time pulling down some 3+ GB data into the cache (under the hidden .duplicacy folder in the repository) before establishing the file list, then another 3+ GB data into another cache folder under the target directory when restoring. The two cache folders were bit-identical. So obviously Duplicacy didn't utilize the first cache folder during the 2nd step. Is there a good reason for this?

Supposing that's a bug that can be easily corrected, it still doesn't explain the long restoration time. When backing up, I usually see 160-250 MB/min. upload speed, so the download speed should be at least double that. Pulling down 3+ GB of data should take less than 20 min on a bad day. On what was the other 20 min. spent?
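The estimate above can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes the download rate is exactly twice the observed upload rate, treats "3+ GB" as 3 x 1024 MB, and ignores per-request overhead. The function name is made up for illustration.

```python
def minutes_to_download(size_mb: float, rate_mb_per_min: float) -> float:
    """Minutes needed to move size_mb at a steady rate (no per-request overhead)."""
    return size_mb / rate_mb_per_min

# Observed upload rates from the post, in MB/min.
upload_rates = (160, 250)
# Assumption: download is twice as fast as upload.
download_rates = [2 * r for r in upload_rates]

cache_mb = 3 * 1024  # "3+ GB" taken as 3 x 1024 MB for this estimate

for rate in download_rates:
    print(f"{rate} MB/min -> {minutes_to_download(cache_mb, rate):.1f} min")
```

Even at the slower rate the transfer alone comes out at under 10 minutes, so raw bandwidth does not account for the 40+ minutes observed.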

The directory I tried to restore was "~/Library/Application Support". It's not large in size (20MB or so), but it changes constantly, so each backup round picks up some changes. It seems to me Duplicacy was taking time to find what chunks to download.

If that's the case, it's a common issue for all dedup backup software. Arq & Hashbackup can nevertheless turn to their local database for such information, while Duplicacy has to pull the information from the "backup storage". When that storage is local, the performance doesn't suffer, but if it's in the cloud, the performance impact could be quite severe unless you have a very fat pipe. (100/40 Mbs is pretty fat already, at least in my country.)

I wonder if this is something that can be alleviated by a larger cache, large enough to cache all chunk allocation info locally.

That brings us to the next issue: the overhead of Duplicacy operation seems to be substantially greater than that of Arq & Hashbackup. It's an accidental discovery from a pair of tests designed to test dedup efficiency. The first involves two Calibre ebook library folders:

A: 2145 total items (folders and files), total size 2GB.

B: 2036 total items, total size 1.7GB.

B is essentially a subset of A, with only 4 files (totaling less than 2MB) different from A.

If I back up A first, then A + B, with Duplicacy, the total size of the backup storage would grow from 1773MB to 2030MB, an increase of 256MB, much larger than the 2MB difference in original data. Arq, Hashbackup & Borg (a candidate before being eliminated), in comparison, saw their total storage grow by less than 2MB (1.6, 0.8 & 1.5 MB respectively).

Since my curiosity was piqued, another experiment was conducted with two much larger folders of music files:

C: 3358 items, total size 64.5 GB.

D: 3062 items, total size 62.6 GB.

Again D is a subset of C (they are in fact hardlinks, taking up no additional space on the source disk), with only 3 different files (two text files and a .xlsx file) totaling less than 1MB. Backing up C first, then C + D, Duplicacy's total storage grew from 61GB to 74GB, an increase of almost 13GB! Hashbackup & Borg again saw trivial increases: 1.4 & 2 MB respectively. (Arq wasn't tested for my trial period was up.)

The backup time suffered because of it, too. It's tolerable only because the experiment was conducted on local HDs. Uploading 13GB to Google Drive would take me somewhere between 50-80 min. Restoring from the cloud would suffer for the same reason, and to cache all these data locally (as alluded to earlier) would require a lot of space.

There are other issues and questions. Since this post is way too long already, however, I'll stop here. Thanks for reading if you make it here.

gchen Oct 3 12:43PM 2017

First of all thank you for your thorough tests and detailed report. This indeed exposes some weak points of Duplicacy.

After repeated tests (with similar results), I found Duplicacy spent a long time pulling down some 3+ GB data into the cache (under the hidden .duplicacy folder in the repository) before establishing the file list, then another 3+ GB data into another cache folder under the target directory when restoring. The two cache folders were bit-identical. So obviously Duplicacy didn't utilize the first cache folder during the 2nd step. Is there a good reason for this?

Are you using the GUI version to restore? The GUI version can only restore to a different directory (by creating another repository with the same repository id on the destination directory), therefore it needs to download the chunks that make up the snapshot file again.

If you run the CLI version to restore the original repository to a different revision, Duplicacy will be able to use chunks already in the local cache. It may still need to download some chunks from the cloud, but definitely not all of those 3GB.

Also, the file listing operation in the GUI version is known to be slow -- it basically parses the output from the CLI version to get the file list and the parsing code is inefficient. Populating the file list into the file list widget also takes some time due to limitations of wxWidgets (can't create subtree on-demand when a folder is clicked). This will be fixed in the next GUI major version (likely 2.1.0).

If I back up A first, then A + B, with Duplicacy, the total size of the backup storage would grow from 1773MB to 2030MB, an increase of 256MB, much larger than the 2MB difference in original data

Again D is a subset of C (they are in fact hardlinks, taking up no additional space on the source disk), with only 3 different files (two text files and a .xlsx file) totaling less than 1MB. Backing up C first, then C + D, Duplicacy's total storage grew from 61GB to 74GB, an increase of almost 13GB!

I guess the missing files in B or D are scattered among many folders. Each missing file may cause a few new chunks to be created -- this is because Duplicacy uses a pack-and-split approach. Files are packed first into a conceptual tar file which is then split into chunks, so a missing file will cause the file before it and the file after it to form at least one new chunk, and maybe a few more if the chunking algorithm can't quickly arrive at the original breaking points in the files that follow.
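To make the effect concrete, here is a toy pack-and-split sketch. It is not Duplicacy's actual chunker: the rolling hash, the 1 KiB average chunk size, and the random 4 KiB "files" are all assumptions chosen so the example runs quickly. It packs files in name order, splits the stream at content-defined points, then repeats with one file removed and counts the chunks that did not exist before.

```python
import hashlib
import random

rng = random.Random(0)
GEAR = [rng.getrandbits(32) for _ in range(256)]  # per-byte random values
MASK = (1 << 10) - 1  # cut on average every 1024 bytes (toy value)

def split(stream: bytes) -> list[bytes]:
    """Content-defined chunking: cut wherever the rolling hash matches the mask."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(stream):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        if h & MASK == 0:
            chunks.append(stream[start:i + 1])
            start = i + 1
    if start < len(stream):
        chunks.append(stream[start:])
    return chunks

def chunk_ids(files: dict[str, bytes]) -> list[str]:
    """Pack files in name order into one stream, split it, hash each chunk."""
    packed = b"".join(files[name] for name in sorted(files))
    return [hashlib.sha256(c).hexdigest() for c in split(packed)]

# 30 toy "files" of 4 KiB of random bytes each.
files = {f"file{n:02d}": bytes(rng.getrandbits(8) for _ in range(4096))
         for n in range(30)}
before = chunk_ids(files)

# Drop one file from the middle, as when B lacks a book that A has.
subset = {k: v for k, v in files.items() if k != "file15"}
after = chunk_ids(subset)

known = set(before)
new = [c for c in after if c not in known]
print(f"{len(before)} chunks before, {len(after)} after, {len(new)} new")
```

Chunks before the gap are unchanged, and chunks after it are reused once the splitter hits a familiar breaking point. Only the chunk or chunks straddling the gap are new, and with a 4 MB average each such chunk is correspondingly larger.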

You may be able to reduce this overhead by changing the average chunk size from the default 4 MB to 1MB (by contrast in Arq the default chunk size is perhaps 64KB). Even if the use of a smaller chunk size doesn't reduce the overhead significantly, I think this level of overhead is acceptable if A and B, or C and D are on different computers. If that is the case, then I would argue that Duplicacy just needs 74GB to back up 2 computers, whereas others may need 61 * 2 = 122 GB which is 65% more. The saving would be more if there are more computers.

gchen Oct 3 1:37PM 2017

Your dedup tests did remind me of a simple technique that I thought about but didn't get to implement. The current pack-and-split approach treats all files the same way, regardless of their sizes. However, it may be beneficial to introduce artificial breaking points every time it sees a file larger than the average chunk size, so large files will always have their own chunks, not to share with other files. This may make it less susceptible to missing files.
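A minimal sketch of the idea, under the assumption that a boundary is forced both before and after each large file. The function name and the sample listing are made up for illustration, and nothing here reflects how it would actually be implemented in Duplicacy.

```python
def forced_breaks(files: list[tuple[str, int]], avg_chunk: int) -> list[int]:
    """Offsets in the packed stream where a chunk boundary is forced.

    A boundary is placed before and after every file larger than avg_chunk,
    so such a file never shares a chunk with its neighbours.
    """
    breaks, offset = set(), 0
    for _name, size in files:
        if size > avg_chunk:
            breaks.add(offset)         # start of the large file
            breaks.add(offset + size)  # end of the large file
        offset += size
    breaks.discard(0)  # the stream start is already a boundary
    return sorted(breaks)

# Hypothetical listing: (name, size in bytes), in packing order.
listing = [("cover.jpg", 200_000), ("book.pdf", 9_000_000), ("metadata.opf", 4_000)]
print(forced_breaks(listing, avg_chunk=4 * 1024 * 1024))  # prints [200000, 9200000]
```

With these breaks in place, removing cover.jpg or metadata.opf would leave the chunks of book.pdf untouched.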

I'll try it and run some tests myself.

mwang Oct 4 9:22AM 2017

Thanks for the prompt reply.

Are you using the GUI version to restore?

Come to think of it, yes.

The GUI version can only restore to a different directory (by creating another repository with the same repository id on the destination directory), therefore it needs to download the chunks that make up the snapshot file again.

Understood, but shouldn't it be made to utilize the cache in the original repository?

If you run the CLI version to restore the original repository to a different revision, Duplicacy will be able to use chunks already in the local cache. It may still need to download some chunks from the cloud, but definitely not all of those 3GB.

OK, trying that now. The local cache has grown to 10+ GB after two more backup jobs, so hopefully it'll be quick.

And yet I'm having a hard time figuring out the right syntax to restore "~/Library/Application Support". I managed to restore a single file using the following syntax:

duplicacy restore -r 23 -threads 4 -stats -- +*/filename

So I thought it must be the space in the path that's causing troubles. But even

duplicacy restore -r 23 -threads 4 -stats -- +*/Library

would not restore ~/Library for me. And Duplicacy always tells me it restored nothing after 1 min. and 17 or 18 sec., so I gave up after a dozen tries (wasting more than 20 min. doing so).

The guide only refers to the "Include/Exclude Patterns" section for syntax, but I couldn't see what's wrong with "+*/Library" even after re-reading the section. Nor does it tell me how to handle spaces in path. In the filters file they don't have to be escaped as each item is on a single line, but I don't know how to deal with it here. Tried escaping with "\", putting the path in quotes, putting the whole thing (including the "+" sign) in quotes, and so on. Nothing worked.

The lone successful single file (6KB) restore (from the latest revision) also took 1:17, way longer than Hashbackup (and Arq from past experience).

Another small complaint: this is not very convenient for I have to rename the symlink to /Users to avoid restoring into the original location.

Have to leave for a while. Will continue later.

gchen Oct 4 11:11AM 2017

This should work:

duplicacy restore -r 23 -threads 4 -stats -- "Library/Application Support/*"

Double quotes are to prevent bash from expanding the argument. The path is always relative to the root of the repository (assuming it is your home directory).
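The effect of the quotes on word splitting can be seen with Python's shlex module, which follows POSIX shell quoting rules. This sketch only shows how the command line is split into arguments. It does not run Duplicacy, and shlex does not perform glob expansion the way bash would on an unquoted *.

```python
import shlex

cmd = 'duplicacy restore -r 23 -threads 4 -stats -- "Library/Application Support/*"'
argv = shlex.split(cmd)
print(argv[-1])  # the pattern arrives as one argument, space and * intact

unquoted = shlex.split("duplicacy restore -r 23 -- Library/Application Support/*")
print(unquoted[-2:])  # without quotes the path is split at the space
```

In the quoted form the program receives a single pattern, Library/Application Support/*. In the unquoted form it receives two separate arguments, neither of which matches the directory.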

Currently restoring a single file still requires the whole snapshot file to be downloaded. This is highly inefficient and the right way is to download only chunks needed by that file.

mwang Oct 4 11:34AM 2017

This will be fixed in the next GUI major version (likely 2.1.0).

Good to know. Thanks!

I guess the missing files in B or D are scattered among many folders.

A & B are Calibre libraries, so they are structured similarly in the following style:

Author 1 - Book 1 - Book 1 data file (epub or pdf)
                  - Book 1 cover file
                  - Book 1 metadata file
         - Book 2
         - ...

except A has more authors/books.

C & D, on the other hand, have completely different folder structures. Except for the 3 files unique to D (they don't exist in C), all music files (flac or mp3) in D are hardlinks to files in C, but are named and organized differently.

Files are packed first into a conceptual tar file which is then split into chunks, so a missing file will cause the file before it and the file after it to form at least one new chunk, and maybe a few more if the chunking algorithm can't quickly arrive at the original breaking points in the files that follow.

I take it to mean the 13GB extra data are not all snapshot chunks; some are file data chunks. Right?

You may be able to reduce this overhead by changing the average chunk size from the default 4 MB to 1MB (by contrast in Arq the default chunk size is perhaps 64KB).

OK, but then cloud backup performance may suffer because "a lot of overhead will be in sending requests and receiving responses" (quoting from the guide).

Even if the use of a smaller chunk size doesn't reduce the overhead significantly, I think this level of overhead is acceptable if A and B, or C and D are on different computers.

Maybe so, but then Duplicacy is not for me. As I said, I'll be backing up from a single machine. Performance means more than multi-machine dedupe capability.

mwang Oct 4 11:44AM 2017

This should work: duplicacy restore -r 23 -threads 4 -stats -- "Library/Application Support/*"

Hmm, that's one of the earliest ways I tried, except ... I got it, I was missing the ending "/*". I thought "Library/Application Support" was enough to restore the directory.

Thanks for the tip. Would be great if the guide can be revised to make the restoration syntax clearer.

mwang Oct 4 12:52PM 2017

One more thing: I wonder why Duplicacy uses different folder structures for cloud storage and local storage. On Google Drive, all chunks are in one big "chunks" folder, while in local storage, there are a lot of subfolders under "chunks". I found this out the hard way when I ran into performance issues. Google Drive File Stream had just become available, and I thought I could point Duplicacy to the same folder inside the locally mounted Google Drive and see how it would perform, only to see Duplicacy begin making a bunch of subfolders under "chunks" and re-uploading all files.

I made a separate folder on Google Drive later and found Duplicacy didn't work well with Google Drive File Stream, so that doesn't matter now. But still, I wonder why they have to have different folder structures.

One main reason to back up to the cloud is to hedge against the possibility, however remote, that all local backups are lost to a fire or something like that. In a situation like that, one may want to borrow a fatter pipe (from one's company, e.g.) to pull down the whole backup storage from the cloud onto a hard drive, and restore from there. It nevertheless wouldn't work with Duplicacy.

A few other smaller things:

Chinese & Japanese File/folder-names are garbled in logs.
Standard keyboard shortcut for "paste" (cmd-v on Macs) doesn't work in the GUI.
Symlinks are backed up, but not shown in the file list (both in GUI & in CLI).

It's almost 2am here; I'm going to bed. Thanks for your patience. I like Duplicacy a lot and really want to make it work for me.

gchen Oct 4 1:28PM 2017

I take it to mean the 13GB extra data are not all snapshot chunks; some are file data chunks. Right?

Right, in fact most of them should be new file data chunks due to the gaps from those missing files.

Maybe so, but then Duplicacy is not for me. As I said, I'll be backing up from a single machine. Performance means more than multi-machine dedupe capability.

I think the combination of a smaller chunk size and the technique to split large files separately may be able to bring the overhead to an acceptable level for your use case. However, if they don't, then this is one of the corner cases where the pack-and-split method may not perform well. The main advantage of the pack-and-split method is that the number of chunks is roughly bounded, whereas methods taken by other backup tools may create too many small chunks if there are many small directories.

One more thing: I wonder why Duplicacy uses different folder structures for cloud storage and local storage.

There is a plan to use the same structure for all storages: https://github.com/gilbertchen/duplicacy/issues/222

Chinese & Japanese File/folder-names are garbled in logs.

Standard keyboard shortcut for "paste" (cmd-v on Macs) doesn't work in the GUI.

Symlinks are backed up, but not shown in the file list (both in GUI & in CLI).

I'll add these to my to-do list.

mwang Oct 5 10:57AM 2017

I think the combination of a smaller chunk size and the technique to split large files separately may be able to bring the overhead to an acceptable level for your use case.

Indeed that seems to be the case after redoing my tests using 1M average chunk size.

For A -> A + B (the Calibre libraries) backups, the size increase is much more modest (256MB -> 46MB). Virtually no change in performance.

Note: all times recorded here are for local backups between different HDs.

              -c 4M             -c 1M
Backup     sec      MB       sec      MB
A           18   1,773        17   1,763
A + B       13   2,030        15   1,809

Similar story for C -> C + D (music files) backups, except the initial backup time is substantially longer with "-c 1M".

              -c 4M              -c 1M
Backup     time      GB       time      GB
C          9:49    61.3      12:19    61.3
C + D      8:41    74.2       8:39    65.0
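For reference, the storage growth implied by the two tables can be computed directly. The figures are copied from the tables. The 4M Calibre case comes out as 257 MB from these rounded values, against the 256 MB quoted earlier.

```python
# (first backup, second backup) total storage sizes from the two tables above.
calibre_mb = {"-c 4M": (1773, 2030), "-c 1M": (1763, 1809)}  # MB
music_gb = {"-c 4M": (61.3, 74.2), "-c 1M": (61.3, 65.0)}    # GB

def growth(sizes):
    """Storage added by the second backup."""
    first, second = sizes
    return second - first

for label, table, unit in (("Calibre", calibre_mb, "MB"), ("music", music_gb, "GB")):
    for option, sizes in table.items():
        print(f"{label} {option}: +{growth(sizes):.1f} {unit}")
```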

Two things to note, however:

As expected, the number of chunks increases significantly as well with smaller chunk size. How much negative impact that would have for cloud backup remains to be seen.
While the overhead is greatly reduced, it's still substantially higher than others.

The main advantage of the pack-and-split method is that the number of chunks is roughly bounded, whereas methods taken by other backup tools may create too many small chunks if there are many small directories.

From what I can tell, both Arq & Hashbackup (& Borg, too, IIRC) pack small chunks into larger archive files, a necessity I think given the even smaller average chunk size (64K). The obvious issue is the whole archive file has to be downloaded even when one needs just a chunk in it. And yet their restoration performance doesn't seem to suffer with my limited tests. Will keep an eye on this.

There is a plan to use the same structure for all storages: https://github.com/gilbertchen/duplicacy/issues/222

What you guys are discussing there is beyond me, but it's good to know many smart people are working on improving this.

I'll add these to my to-do list.

Thanks!

edited to add a note about the nature (local, not cloud) of tests in this post.

gchen Oct 5 8:09PM 2017

The impact of a smaller chunk size on the performance is expected, as the overhead of checking the existence of a chunk and then sending the chunk becomes proportionally larger. You can always use multiple uploading threads to alleviate this problem.

Arq, Hashbackup, and Borg, I believe, all store the chunk database locally. This approach can't be easily extended to the multiple client case. In fact, even if you plan to run backup from a single machine, having the option to upload from a different machine, especially for the initial backup, is sometimes valuable -- a couple of weeks ago a user wrote to me to report how he sped up the initial backup by bringing his external disk to his office and running Duplicacy from there (instead of the slow DSL connection at home).

mwang Oct 6 2:28AM 2017

In fact, even if you plan to run backup from a single machine, having the option to upload from a different machine, especially for the initial backup, is sometimes valuable ...

Yes, I do appreciate that possibility, for a number of reasons. And that's why I would very much like to stick to Duplicacy.

Come to think of it, I'm using it already even though from a single machine, as I set up different repo folders (each containing numerous symlinks to the real folders to be backed up) so that I can set up different backup schedules for different kinds of data. I'm not expecting cross-repo deduplication though as I've grouped data with dedupe potential in the same repositories. Still, it's comforting to know that if I need to move a symlink from one repository to another someday, (most of) the data won't have to be uploaded again.

Arq, Hashbackup, and Borg, I believe, all store the chunk database locally. This approach can't be easily extended to the multiple client case.

Indeed. What I'm thinking is whether there is any possibility to have the cake and eat it too. Is the kind of performance boost associated with local databases truly incompatible with multi-client backup?

From Arq & Hashbackup's side, I guess that's indeed the case. Adding multi-client support (with cross-client deduplication, not the each-one-his-own-silo kind) is truly difficult.

Duplicacy, on the other hand, has done the hard part. All it needs is some sort of better local caching. Currently, the most recent snapshot chunks are cached. As I mentioned earlier, each time it took Duplicacy 1'17" or so to tell me it could find nothing to restore when I was bumbling on the restoration syntax, even with the cache. In comparison, Hashbackup, with a local database, could respond in a second. So I gather the extra 1'16" was spent on constructing some sort of index/database/access table in memory from the 3.3 GB cache. Can't we save this "database" to avoid having to reconstruct it every time?

Let's call it a "lazy" database, or a "non-authoritative" database, similar to how DNS servers operate. Look at it this way: for non-storage-updating jobs (diff, history, list, restore etc.), Duplicacy (acting on behalf of a specific repository) doesn't need to know if the storage has been updated by other clients, does it? When I merely want to restore a file, I don't really care if changes have been made to the storage, as long as the chunks associated with my files are still there. And they should be, as guaranteed by Duplicacy's brilliant two-step fossil collection mechanism.

When the time comes for this specific repository to update the storage (backing up or pruning), it can update the lazy database, or simply remove it until the next time an access-only job (diff, history, list, restore etc.) is executed.
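To illustrate what I mean, here is a rough sketch of such a "lazy" index. Everything in it is hypothetical: the class, the JSON file, and the rebuild callback merely stand in for whatever Duplicacy does internally when it reconstructs the file list from cached snapshot chunks.

```python
import json
import os
import tempfile

class LazyIndex:
    """Non-authoritative on-disk copy of a revision's file -> chunk list map."""

    def __init__(self, path: str):
        self.path = path

    def load(self, revision: int, rebuild):
        """Return the index for `revision`, rebuilding it only on a miss."""
        if os.path.exists(self.path):
            with open(self.path) as f:
                saved = json.load(f)
            if saved["revision"] == revision:
                return saved["files"]
        files = rebuild(revision)  # slow path: parse the cached snapshot chunks
        with open(self.path, "w") as f:
            json.dump({"revision": revision, "files": files}, f)
        return files

    def invalidate(self):
        """Call after backup or prune, i.e. when this client changes the storage."""
        if os.path.exists(self.path):
            os.remove(self.path)

# Usage with a stand-in for the expensive rebuild step.
calls = []
def rebuild(revision):
    calls.append(revision)
    return {"Library/Application Support/app.plist": ["chunk-a", "chunk-b"]}

index = LazyIndex(os.path.join(tempfile.mkdtemp(), "index.json"))
index.load(23, rebuild)  # builds and saves
index.load(23, rebuild)  # served from disk, no rebuild
index.invalidate()       # e.g. after the next backup
index.load(23, rebuild)  # rebuilt again
print(len(calls))        # prints 2
```

Read-only jobs such as list or restore would take the fast path, and only the first one after a backup or prune pays the rebuild cost.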

All of the above are just wild ideas from the perspective of an amateur programmer not doing much coding these days. So pardon my ignorance if they are in fact useless. Thanks for listening.

gchen Oct 6 10:15AM 2017

Duplicacy always checks if the chunk exists in the storage even if the local cache has it. This is mainly to catch the error early if the chunk exists in the local cache but not in the storage for whatever reason. So this and the need to reconstruct the whole snapshot file may account for most of the 1'17'' it took to restore a single file.
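The check-then-read order described above looks roughly like this. It is a simplified sketch with plain dictionaries standing in for the cache and the storage, not the actual Duplicacy code.

```python
def fetch_chunk(chunk_id: str, cache: dict, storage: dict) -> bytes:
    """Return chunk bytes, confirming the chunk still exists in storage first."""
    if chunk_id not in storage:  # one round trip to the storage per chunk
        raise FileNotFoundError(f"chunk {chunk_id} missing from storage")
    if chunk_id in cache:        # cache hit: no download needed
        return cache[chunk_id]
    data = storage[chunk_id]     # cache miss: download and remember
    cache[chunk_id] = data
    return data

storage = {"c1": b"data-1"}
cache = {"c1": b"data-1", "c2": b"stale"}  # c2 is cached but gone from storage

print(fetch_chunk("c1", cache, storage))
try:
    fetch_chunk("c2", cache, storage)
except FileNotFoundError as err:
    print(err)
```

The existence check costs one request per chunk even on a cache hit, which is cheap on a local disk but adds up over a cloud connection.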

Can you add the -stats to the backup command and paste the statistics log lines here? I am curious about how large your snapshot file is.

The 'lazy' database as you suggested should definitely help. My only concern is that it may take much more space than the current local chunk cache implementation which is de-duplicated. This is perhaps the only reason I didn't implement it in the first place.

mwang Oct 6 5:40PM 2017

Can you add the -stats to the backup command and paste the statistics log lines here? I am curious about how large your snapshot file is.

Sure. With "-threads 4", here are the final lines:

Files: 20937 total, 487,357M bytes; 431 new, 40,476K bytes

File chunks: 99449 total, 487,739M bytes; 6 new, 40,476K bytes, 11,024K bytes uploaded

Metadata chunks: 651 total, 3,260M bytes; 5 new, 21,076K bytes, 13,806K bytes uploaded

All chunks: 100100 total, 491,000M bytes; 11 new, 61,552K bytes, 24,831K bytes uploaded

Total running time: 00:39:58

The cache folder now has 1,547 items, totaling 3.42 GB as reported by Finder.
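A quick calculation on these numbers, assuming the "M" in the log means 2**20 bytes (I'm not sure that it does):

```python
MIB = 1024 * 1024  # assumption: "M" in the log means 2**20 bytes

file_chunks, file_mb = 99_449, 487_739
meta_chunks, meta_mb = 651, 3_260

print(f"average file chunk:     {file_mb / file_chunks:.2f} MB")
print(f"average metadata chunk: {meta_mb / meta_chunks:.2f} MB")
print(f"metadata total:         {meta_mb * MIB / 1e9:.2f} GB (decimal)")
```

Under that assumption the metadata chunks alone come to about 3.42 GB in decimal units, the same figure Finder shows for the cache folder. That may be partly coincidence, since the cache holds 1,547 items while there are only 651 metadata chunks.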

Danny Oct 15 10:08AM 2017

Pulling down 3+ GB of data should take less than 20 min on a bad day. On what was the other 20 min. spent?

I found that duplicacy is much slower downloading than it is uploading. So much so that in my case when I wanted to test restore so that I knew the end-to-end process worked, I used a different tool (the official aws client) to download everything from the cloud storage provider into a new local repo. I also had to write a script to move the chunks to where duplicacy expects them to be when using a local repo. Then I restored from the local repo. It was inconvenient and took a LOT more space than using duplicacy restore. But since I only had a few snapshots whose contents were almost identical it was still much faster than using duplicacy restore. Adding more snapshots with more changes would cause the space and time performance of this workaround to get progressively worse because it would download all chunks, not only the chunks it would need to restore the requested snapshot.
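The relocation script was along these lines. This is a reconstruction for illustration only: the nesting depth, the two-character prefix width, and the sample chunk names are assumptions, so check how your own local storage lays out its chunks folder before relying on it.

```python
import os
import shutil
import tempfile

def nested_path(chunk_id: str, levels: int = 1, width: int = 2) -> str:
    """Map a flat chunk name to a nested relative path, e.g. 'ab/12cd'."""
    parts = [chunk_id[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(*parts, chunk_id[levels * width:])

def relocate(flat_dir: str, nested_dir: str, levels: int = 1) -> int:
    """Move every file in flat_dir into its nested location under nested_dir."""
    moved = 0
    for name in os.listdir(flat_dir):
        target = os.path.join(nested_dir, nested_path(name, levels))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.move(os.path.join(flat_dir, name), target)
        moved += 1
    return moved

# Demo on temporary directories with made-up chunk names.
flat = tempfile.mkdtemp()
nested = tempfile.mkdtemp()
for name in ("ab12cd", "ab34ef", "ff00aa"):
    open(os.path.join(flat, name), "wb").close()

print(relocate(flat, nested))
print(sorted(os.listdir(nested)))
```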

The download code in duplicacy is (or at least was until very recently) single-threaded which I think is what accounts for the terrible download performance. It makes sense that the upload code was optimized first, because this is a backup tool and backups are done much more frequently than restores. But this is an area where I think duplicacy needs to improve in the future. It's only a minor inconvenience for the way I use duplicacy but it might be a showstopper for people who do things a different way.

gchen Oct 15 9:39PM 2017

Duplicacy has supported multi-threaded backup and restore since 2.0.0, which was released about 4 months ago.

It makes sense that the upload code was optimized first, because this is a backup tool and backups are done much more frequently than restores.

This was exactly what I had in mind when I was developing Duplicacy, but now I tend to think that backup and restore should be treated equally, because 1) many users who try Duplicacy will likely run a few restore tests, and 2) restore may be more performance-critical than backup -- it is sometimes done under a tight time constraint.

There are a few snapshot optimizations that will greatly improve the restore performance:

If there is only a small set of files to be restored, there isn't really a need to download the entire snapshot. Just find out which snapshot chunks are needed and download only these chunks to restore the files.
The extended attributes should not be saved in the snapshot. They should have their own chunks (much like how the file contents are saved). This will significantly reduce the size of the snapshot on macOS.
Files should be handled as soon as each snapshot chunk is downloaded and parsed. This can avoid loading the entire snapshot into memory and thus reduce memory usage for both backup and restore.
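The third point can be sketched as a generator that yields file entries while snapshot chunks are still arriving. The newline-delimited JSON layout used here is purely an assumption for illustration; the real snapshot format is different and is not described in this thread.

```python
import json
from typing import Iterable, Iterator

def stream_entries(snapshot_chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield entries as each chunk arrives, keeping only a partial line in memory."""
    tail = b""
    for chunk in snapshot_chunks:
        *lines, tail = (tail + chunk).split(b"\n")
        for line in lines:
            if line:
                yield json.loads(line)
    if tail:
        yield json.loads(tail)

def wanted(entries: Iterable[dict], prefix: str) -> Iterator[dict]:
    """Keep only the entries under the requested path prefix."""
    return (e for e in entries if e["path"].startswith(prefix))

# Hypothetical snapshot: newline-delimited JSON entries, cut at arbitrary points.
raw = b"\n".join(
    json.dumps({"path": p, "chunks": c}).encode()
    for p, c in [
        ("Documents/a.txt", [0]),
        ("Library/Application Support/x.plist", [1, 2]),
        ("Music/song.flac", [3, 4, 5]),
    ]
)
chunks = [raw[i:i + 40] for i in range(0, len(raw), 40)]

for entry in wanted(stream_entries(chunks), "Library/Application Support/"):
    print(entry["path"], entry["chunks"])
```

Only the current chunk and one partial line are held in memory at a time, so memory use no longer grows with the size of the snapshot.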

I'll start working on these optimizations after the new GUI version is done.