Considering Duplicacy with Wasabi

skidvd     Jun 5 8:23AM 2018 GUI

Hi,

I am presently a CrashPlan Home user who, like so many others, is searching for a replacement solution. I have multiple Linux boxes and a single Windows box, with a little over 3TB of data across all of them that I need to back up. My shortlist of storage providers presently includes Wasabi and B2. My shortlist of software solutions includes Duplicacy, Duplicati and CloudBerry.

My primary question is whether Duplicacy plus Wasabi can be relied upon to produce configure-and-forget backups that run continuously in the background, and then be further relied upon to produce valid restores in the case of a failure. I see some issues related to this combination when searching Google. Is this a bad/flawed combination?

More generally, I want to set up backups and have confidence that they are both happening and will be useful in the event of an unexpected failure. How does Duplicacy plus ??? compare to the other alternatives mentioned in terms of reliability?

How/why is Duplicacy a better choice than Duplicati? CloudBerry?

TIA!


towerbr    Jun 5 2:47PM 2018

Duplicacy is a more stable and mature product than Duplicati. I tested Duplicati before adopting Duplicacy. The big problem with Duplicati is its use of a local database to store backup information; you can see from its forum that there are numerous reports of corrupted databases.

I use Duplicacy CLI. It took me some time to learn all the nuances of the settings and nomenclatures (somewhat confusing), but it has been running perfectly with Wasabi for some time now.

I have no experience with CloudBerry, sorry.


gchen    Jun 5 8:43PM 2018

In addition to what @towerbr mentioned, one major advantage of Duplicacy is the ability to back up multiple computers to the same storage location. This can result in a huge saving in storage costs if several of your computers share a lot of identical files. This is a unique feature of Duplicacy; no other tools can do this, including Duplicati and CloudBerry.


skidvd    Jun 6 6:55AM 2018

@towerbr, @gchen, Thanks!

@towerbr, any suggested resources that were particularly helpful to come up the learning curve?

@gchen, do the multiple computers need to be backed up to the same 'bucket' in order for these savings to be possible? Or can you still have separate buckets for each machine, with Duplicacy managing it somehow? Similarly, do the files need to be 'named' and/or 'pathed' identically in order to trigger the savings, or does it work on hashing or something else to find dupes?


towerbr    Jun 6 7:24AM 2018

@towerbr, any suggested resources that were particularly helpful to come up the learning curve?

Unfortunately, nothing specific. The wiki is your best initial reference.

I started with very simple scripts and then added parameters later. It was a very empirical process.

However I have some tips:

1) Use the centralized configuration; it is much easier to manage than the decentralized configuration (see -pref-dir at https://github.com/gilbertchen/duplicacy/wiki/init, and the rough sketch after these tips). (However, not everyone here agrees with me ... lol)

2) When creating your repositories, rename the first "default" entry in the preferences file to a clearer name, to make it easier to use the -storage <storage name> option.

3) The include / exclude patterns are slippery; study them before applying them: https://github.com/gilbertchen/duplicacy/wiki/Include-Exclude-Patterns.
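
To illustrate tips 1 and 2, a rough sketch of what this looks like in practice (the paths, the "docs" repository id and the "wasabi" storage name are placeholders; check the init page linked above for the exact option spelling):

duplicacy init -pref-dir /home/user/.duplicacy-central/docs docs wasabi://bucket

# after renaming the storage from "default" to "wasabi" in the preferences file:
duplicacy backup -storage wasabi -stats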

Similarly, do the files need to be 'named' and/or 'pathed' identically in order to trigger the savings, or does it work on hashing or something else to find dupes?

I think you should study a little more how Duplicacy works; a good pass through the wiki would be worthwhile. Duplicacy does not back up files as such, but chunks of the files.


gchen    Jun 6 8:37PM 2018

do the multiple computers need to be backed up to the same 'bucket' in order for these savings to be possible? Or can you still have separate buckets for each machine, with Duplicacy managing it somehow? Similarly, do the files need to be 'named' and/or 'pathed' identically in order to trigger the savings, or does it work on hashing or something else to find dupes?

Yes, they should be backed up to the same bucket (and the same destination directory) for the cross-computer deduplication to take effect. No, the files don't need to have the same name or the same path relative to their own repository root. Duplicacy breaks down files into chunks, and whenever there are identical chunks, the deduplication mechanism will kick in.


skidvd    Jun 7 9:56AM 2018

Any thoughts on Wasabi vs B2 or others? I'm looking for rock-solid reliability at an affordable price. Wasabi's free API/egress options appear compelling, but perhaps I'm missing something.

Yes, they should be backed up to the same bucket (and the same destination directory) for the cross-computer deduplication to take effect. No, the files don't need to have the same name or the same path relative to their own repository root. Duplicacy breaks down files into chunks, and whenever there are identical chunks, the deduplication mechanism will kick in.

Based upon the above, then, if I want to be able to see and independently restore to/from different machines, what is the recommended approach? Tags? Multiple snapshot names with the same destination?


gchen    Jun 7 1:15PM 2018

You just need to use different repository ids on different machines.

On machine 1:

duplicacy init id1 wasabi://bucket

On machine 2:

duplicacy init id2 wasabi://bucket


skidvd    Jun 9 5:05PM 2018

I'm sorry for the basic questions, but I'm struggling with filters and a few other concepts a bit:

1) Are filters required? What I mean is, if no filters are provided either in the .duplicacy/filters file or on the command line, will the default be to include all dirs/files recursively in the repository?

2) Given the following filters contents:

+<parent dir>
+<parent dir>/<subdir>
+<parent dir>/<subdir>/*

Does this imply that all contents and all child dirs recursively under <subdir> will be backed up? What if / is later added following the initial backup? Will its contents also be recursively backed up even if the filters file is not modified?

3) If they are required, what is the simplest way to achieve the described behavior mentioned above? It seems like adding every single path including parents could become quite cumbersome and error prone - especially if you forget to modify filters every time you add a new dir!

4) Once you have init-ed a repository and run a backup, what command can you use to list all files contained in a specified dir from the repository? the history command appears to allow me to show info for a given dir, but there does not appear to be a way to show similar information for the dir contents?

5) Similarly, what command would you use to restore, for example, a given repository subdir and all of its contents recursively?

6) If I have a rather large repository that I anticipate will take many days, perhaps weeks, to complete the initial backup, what happens if that initial backup is interrupted - perhaps an unanticipated power loss or required system reboot - before the initial backup is completed? Will it have to start all over? Will it recover gracefully? Any important command options to include or special commands to use with this in mind?

7) It appears as if the global options must appear before the command (backup, history, etc), while the non-global options follow the command. Is that correct? I was surprised to see that order mattered this much?

8) When using the -background option, is it possible to query where the credentials are coming from/stored? Is it possible to remove them? Update them?

TIA!


gchen    Jun 10 10:14PM 2018

1) Are filters required? What I mean is, if no filters are provided either in the .duplicacy/filters file or on the command line, will the default be to include all dirs/files recursively in the repository?

Yes; if no filters are provided, everything in the repository is included by default.

2) Given the following filters contents: +<parent dir> +<parent dir>/<subdir> +<parent dir>/<subdir>/* Does this imply that all contents and all child dirs recursively under <subdir> will be backed up? What if / is later added following the initial backup? Will its contents also be recursively backed up even if the filters file is not modified?

Not sure what you meant by "if / is later added following the initial backup".

3) If they are required, what is the simplest way to achieve the described behavior mentioned above? It seems like adding every single path including parents could become quite cumbersome and error prone - especially if you forget to modify filters every time you add a new dir!

I would suggest using exclude patterns instead. Only exclude certain subdirectories that you don't want to back up, so new dirs will be included by default.
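
For example, a filters file along these lines (the directory names are just placeholders) excludes only what you list and leaves everything else included by default:

-Downloads/
-tmp/
-VirtualBox VMs/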

4) Once you have init-ed a repository and run a backup, what command can you use to list all files contained in a specified dir from the repository? the history command appears to allow me to show info for a given dir, but there does not appear to be a way to show similar information for the dir contents?

You can run the list command and then filter the output:

duplicacy list -files | grep path/to/some/dir

5) Similarly, what command would you use to restore, for example, a given repository subdir and all of its contents recursively?

restore takes exclude/include patterns as arguments:

duplicacy restore -r 1 -- +/subdir* -*

6) If I have a rather large repository that I anticipate will take many days, perhaps weeks, to complete the initial backup, what happens if that initial backup is interrupted - perhaps an unanticipated power loss or required system reboot - before the initial backup is completed? Will it have to start all over? Will it recover gracefully? Any important command options to include or special commands to use with this in mind?

An interrupted initial backup can be fast-resumed; chunks that were already uploaded will not be uploaded again. You can try it yourself.

7) It appears as if the global options must appear before the command (backup, history, etc), while the non-global options follow the command. Is that correct? I was surprised to see that order mattered this much?

That is right.

8) When using the -background option, is it possible to query where the credentials are coming from/stored? Is it possible to remove them? Update them?

The -background option was mainly designed for the GUI version. The recommended way to force reentering passwords is to run duplicacy list -reset-passwords.


skidvd    Jun 11 7:43AM 2018

2) Given the following filters contents: +<parent dir> +<parent dir>/<subdir> +<parent dir>/<subdir>/* Does this imply that all contents and all child dirs recursively under <subdir> will be backed up? What if / is later added following the initial backup? Will its contents also be recursively backed up even if the filters file is not modified?

Not sure what you meant by "if / is later added following the initial backup".

My apologies for the typo: the markdown appears to have mangled my intent. The question was supposed to say: What if <parent dir>/<subdir>/<grand-child-dir> is later added after the initial backup? Will the grand-child-dir contents also be recursively backed up even if the filters file is not modified?

8) When using the -background option, is it possible to query where the credentials are coming from/stored? Is it possible to remove them? Update them?

The -background option was mainly designed for the GUI version. The recommended way to force reentering passwords is to run duplicacy list -reset-passwords.

I'd still like to know if it is possible to query where the credentials are coming from/stored please?


gchen    Jun 11 7:15PM 2018

Yes, <parent dir>/<subdir>/<grand-child-dir> matches the pattern +<parent dir>/<subdir>/*, so it and its contents will be included.

You can run duplicacy -d list and there will be some log messages showing how credentials are read.


skidvd    Jun 11 9:04PM 2018

3) If they are required, what is the simplest way to achieve the described behavior mentioned above? It seems like adding every single path including parents could become quite cumbersome and error prone - especially if you forget to modify filters every time you add a new dir!

I would suggest using exclude patterns instead. Only exclude certain subdirectories that you don't want to back up, so new dirs will be included by default

Unfortunately, in my scenario, I will need to use positive/include filters in some places. Given that requirement, what is the simplest filter combination to get a given subdir and all of its recursive contents included, even if further grandchildren are added later, after the filters are set up?


gchen    Jun 12 9:47PM 2018

The patterns you showed above should work:

+<parent dir>
+<parent dir>/<subdir>
+<parent dir>/<subdir>/*

Since no exclude patterns are specified, any paths not matching these patterns will be excluded. Any files/dirs under <parent dir>/<subdir>/ will be included because of the third pattern.

The Include-Exclude-Patterns wiki page (linked above) explains how include/exclude patterns work in detail.


skidvd    Jun 13 11:46AM 2018

My experimentation so far has led me to my next basic question. I have read the wiki page regarding the use of -hash with the backup command, understand its purpose, and have even experimented with it locally, so I can see the resulting effects.

My question though is in terms of best practice... It seems clear that the safest/most conservative approach would be to always back up with the -hash flag enabled - especially when backing up to cloud storage (Wasabi). However, this obviously entails quite a bit more processing overhead - especially for a large repository. What are the best-practice recommendations in this regard? Do most people opt for the cautious approach and pay the overhead premium, or do most opt out of -hash and use some other mechanism to increase confidence (periodic checks with the -files option or something along those lines)? Or, for modern Linux and/or Windows file systems, is -hash just effectively unnecessary overkill in practice?


kevinvinv    Jun 14 11:21AM 2018

Hi skidvd- would you mind if I ask you a question?

What are you thinking -hash will accomplish for you as far as reliability goes? I don't use it, as it doesn't seem that important to me. It seems to me that -hash is just a different way to verify a file has changed. I guess there are other issues I am more concerned about reliability-wise than this...

Do you think I am understanding things wrong?

Thanks!!


gchen    Jun 14 11:36AM 2018

Since the update of the timestamp of a file is automatic, I think it is ok to run the backup without the -hash option. Very rarely would any software modify a file and then deliberately roll back the timestamp to the previous one.
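
If you want extra assurance, one option (just a suggestion; both flags are documented on the backup wiki page) is to rely on the normal timestamp-based detection for frequent runs and do an occasional full re-read with -hash, for example:

duplicacy backup -stats          # frequent runs: detect changes by timestamp/size
duplicacy backup -stats -hash    # occasional run: re-read and hash every file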


skidvd    Jun 14 12:00PM 2018

@kevinvinv, @gchen, no, I think that your understanding is in line with mine. I suppose reliability is a very poor word choice. What I had in mind with that post was the potential for frequently (but very minimally) updated files. I have been concerned that some may occasionally slip through the cracks - especially with a frequent backup interval (perhaps every 15 min or so to approximate continuous backup like CrashPlan). I had thought that a full hash computation on each file might offer some greater assurance that nothing was being missed. Perhaps this is just paranoia?

I guess there are other issues I am more concerned about reliability wise than this...

As a rock-solid, 100% reliable backup (and restore) solution is my ultimate concern as well, I'm very curious what you have in mind and what you may be doing to address it?


kevinvinv    Jun 14 5:58PM 2018

Hi skidvd

My personal opinion is that I just back up once per day. I did that with CrashPlan too. I don't personally want the thing crunching on my CPU all day every 15 minutes. But I see why you would, and I would not criticize that decision...

My reliability concern with Duplicacy is server-side corruption. It doesn't do anything to make sure the backups are restorable and the chunk files are not corrupted.

You can make sure all the chunk files for a given snapshot exist... and that is pretty good, but you can't easily (from the server side... as in a local NAS, etc.) make sure the individual chunks have not been corrupted. The only way you can verify a chunk's integrity is to download it back and then verify it. That is too costly.
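
To put that in command terms, this is roughly the difference as I understand the check command:

duplicacy check          # only verifies that every chunk referenced by the snapshots exists on the storage
duplicacy check -files   # downloads the chunks and verifies each file's hash, i.e. you pay for a full download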

CrashPlan could always verify backup integrity because it was running a server-side app that could verify checksums or hashes or whatever... but Duplicacy doesn't do that... instead it basically trusts that the server won't corrupt the stored backup... a pretty good assumption in general, but that is indeed what I worry about.

gchen is planning on adding some hooks to make server-side chunk integrity verification more possible, but they haven't arrived yet.


skidvd    Jun 14 9:01PM 2018

Hi kevinvinv,

Yes, you raise a very good point. One of the double-edged swords I have been considering in my evaluation process... Duplicati, for example, does download some random chunks with each backup for this express purpose. However, as you noted, this is a rather expensive option - especially if you are charged for egress and/or API calls, as is likely the case with providers other than Wasabi.

I am anxious to learn more about gchen's plans and timeline in this regard.

So, if one were to periodically run check commands (with the -files option), say perhaps weekly, this would ensure snapshots have all required chunks. However, I'm not clear on how a server-side-only solution (I think you are referring to the repository source here - correct?) could verify that a chunk is in fact safely transported and stored on the remote cloud storage. I suppose it could verify a chunk checksum/hash before transport to ensure it was created reliably - is this what you are getting at? However, doesn't that leave an opening for errors to be introduced (and, more importantly, missed without any means of verification) during transport and storage on the remote destination? Do Wasabi or other cloud-based storage providers provide any mechanism to calculate and retrieve a checksum/hash of a file on their storage for comparison to what was sent?


towerbr    Jun 15 8:26AM 2018

IMHO, we should think more about the integrity of the files, and not about the chunks, which would be a consequence.

If you have a file that is broken into 5 chunks, and you check (by sampling) 4 of these chunks and they are all ok, but the last one is not (and you didn't check it), you've lost the file. But if you check the file itself (which is what matters), you will know for sure whether everything is ok - or not.

I, for example, set up a script to download a few random files and compare them with the local files. It is not 100% safe (nothing is), but it lowers the risk. Since I use Wasabi, I have no problem with download or API call charges.


skidvd    Jun 15 8:45AM 2018

@towerbr, nice approach/idea. I'm curious though how you randomly determine file(s) and at approximately what frequency you do this? Would any of this change in your mind as the size of your repository grew? Would you mind sharing your script?


towerbr    Jun 15 9:36AM 2018

I run it daily after the prune script. It's a very simple / ugly / hardcoded Windows batch (CMD) script. I'm gradually modifying my scripts into something more parameterized and better coded, but time has been a problem ;-). Since my backups are working, I confess I'm not prioritizing this.

It basically has these steps:

(remember that everything between % signs or ! signs is a variable or constant)

1) Read the files from the local repository, storing name and size in a array:

for /f "tokens=1* delims=\" %%a in ('forfiles /s /m *.* /p %repository% /c "cmd /c echo @relpath"') do (
  for %%f in (^"%%b) do (
    call set /a cont+=1
    set FILES[!cont!].name=%%~f
    set FILES[!cont!].size=%%a
  )
)

2) Mark random files in the array for testing

for /L %%a in (1,1,%num_tests%) do (
    set /a "num=(!random! %% cont) + 1"
    set FILES[!num!].random=YES
)

(num_tests is a parameter with the number of files to be tested)

3) Retrieves the last revision from the last log file:

for /f "tokens=1,2,3,4,5* delims= " %%m in (!LOG_FILE!) do (
    if "%%p"=="revision" (
        call set "var_revision=%%q"
    )  
)

4) Downloads selected files to a temp folder

for /L %%a in (1,1,%cont%) do (
  if "!FILES[%%a].random!" == "YES" (
    call set file_to_rest=%%FILES[%%a].name%%
    set "file_to_rest=!file_to_rest:\=/!"
    call duplicacy restore -ignore-owner -stats -r %var_revision% -storage %storage_name% "!file_to_rest!"
  )
)

5) Generates hashes of downloaded files

for /L %%a in (1,1,%cont3%) do (
      call md5sum.exe "!file_full_path!" > file_MD5_%%a.txt
)

6) Compares the hashes of the downloaded files with the hashes of the repository files

  for /f "tokens=1,2* delims= " %%m in (file_MD5_%%a.txt) do (
    call set hash_downloaded=%%m
  )
  for /f "tokens=1,2* delims= " %%m in (repository_MD5_%%a.txt) do (
    call set hash_repository=%%m  
  )
  if !hash_ downloaded! == !hash_repository! (
    some code for everythng ok
)

I know there are more elegant and optimized ways to do this, but for now it's working very well. In the future I'll make a "2.0 version" using PowerShell, Python or something else.


kevinvinv    Jun 15 10:10AM 2018

This is an interesting conversation!!

When I mentioned "Server side" verficiation- I admit the terminology wasnt the best. I am not using any sort of cloud service but instead I have a NAS box at a remote location that I back up to. This is what I call the "server" and I want to be able run a routine on this remote NAS to verify the integrity of all the chunks backed up... if that makes sense.

:)

I do like the random download checker too.. that is a cool idea.


towerbr    Jun 15 9:46PM 2018

Yes, but this idea has a weakness: it doesn't work well for verifying the backup integrity of large files (a database, for example). It only works well for "common" files.


skidvd    Jun 16 9:13AM 2018

Both the random file download and the chunk verification are interesting and can go some way towards reliability. Yet they both feel like attempts to work around something I would hope to be a baked-in central feature - one that is apparently not there and/or being questioned at the root of this discussion. Ideally, in terms of backup, I'd like to be able to:

  • know at the time of backup (as a direct consequence of the backup operation, not as a follow-on operation) that it either succeeded or failed to transmit the files to storage, and that the storage received, confirmed and stored the same
  • know, if a failure occurred relative to the above, exactly which file(s) were unsuccessful

The random file download and chunk verification both happen after the fact. While they may indeed point out problems before a potential future restore request would fail, they do not identify the problem as soon as it happens. Additionally, they do not check each and every file/chunk as it is stored, so there is always a chance of something being missed.

What I am not clear on is whether the cloud storage APIs provide any means of requesting a cloud-side checksum of some sort. Perhaps @gchen can shed some light here - or perhaps much more than it appears is already happening? What I'd hope is that the backup operation would not only look for errors during source-side operations and file transmission, but then take a further confirmation step to double-check that the cloud side actually has the correct contents, via a checksum or similar computation cross-checked with the source. Does this make sense?


kevinvinv    Jun 16 9:51PM 2018

I'll just state the obvious, I guess. CP could do everything we want in this regard because the backup receiver was running an app and could do continuous monitoring and "healing" and all of that.

I suspect that tools like this (and many, many others that back up to the cloud) can't really do much on the receiver end because they can't run a program on the receiving computer to calculate hashes and checksums and all of that.

The best that can be done is to download the file back and then check to be sure it was without error.

This is one reason I like not backing up to the "cloud" but instead backing up to my own remote computers so that I CAN EVENTUALLY check the backup integrity at the receiver... hopefully :)


skidvd    Jun 18 7:42AM 2018

I believe that I have read that the Wasabi API is 100% compatible with the S3 API...

While I have not been able to locate something similar (with admittedly brief searches) for Wasabi, this link for S3 suggests that something may actually be possible checksum-wise: [https://aws.amazon.com/premiumsupport/knowledge-center/data-integrity-s3/]. Excerpt: "To ensure that S3 verifies the integrity of the object and stores the MD5 checksum in a custom HTTP header you must use the --content-md5 and --metadata arguments with the appropriate parameters."
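
For illustration, with the plain AWS CLI that excerpt would look roughly like this (bucket and file names are made up, and whether Wasabi honors the same headers is exactly what I'm unsure about):

md5=$(openssl md5 -binary ./photo.jpg | base64)
aws s3api put-object --bucket my-bucket --key photo.jpg --body ./photo.jpg --content-md5 "$md5" --metadata md5chksum="$md5"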

I'm curious what @gchen, @markfeit (or others) may be able to offer in terms of the present state of things and perhaps future plans in this regard? I don't presently see any reference to 'md5' in either duplicacy_wasabistorage.go or duplicacy_s3storage.go (where most of the implementation appears to be shared), so I'm guessing that this is presently not being utilized - however, I'd be happy to learn I am incorrect, as I am certainly no expert on the Duplicacy code.

This Duplicacy GitHub issue appears to speak to this as well: [https://github.com/gilbertchen/duplicacy/issues/205]


dgcom    Jun 19 10:58PM 2018

See the discussion about hashes here: Overview of cloud storage systems

If you don't trust Duplicacy, use rclone to sync your local backup storage to a cloud provider.
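
For example, something along these lines (the remote name and bucket are placeholders for whatever you configured in rclone):

rclone sync /path/to/local/duplicacy-storage wasabi-remote:bucket-name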


skidvd    Jun 20 6:57AM 2018

It's not that I don't trust Duplicacy. There are just too many moving pieces, so I feel that a "trust, but verify" approach would be advisable.


towerbr    Jun 20 7:30AM 2018

There is no 100% error-free software, neither Duplicacy nor rclone, and I use both (for different file groups).

Replacing Duplicacy with rclone does not mean that it will be more reliable.

I always adopt the practice of "trust, but verify" ;-)


dgcom    Jun 20 10:51AM 2018

"Changing Duplicacy by Rclone does not mean that it will be more reliable."

This will make the cloud upload more reliable. Having the backup both locally and in the cloud makes the backup more reliable and accessible. And it also means that you will be able to verify your backup easily and cheaply - first against the local backup, then the local copy against the cloud copy.
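
For that last step, something like the following (again, remote name and paths are placeholders) compares the local copy against the cloud copy using sizes and the provider's hashes where available, without downloading everything:

rclone check /path/to/local/duplicacy-storage wasabi-remote:bucket-name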