Resuming interrupted (initial) backup takes hours

Christoph     Jan 21 10:08AM 2018

In the process of creating the initial backup of a large (1TB+) repository, I noticed that whenever I interrupted (i.e. stopped) the backup process, resuming the backup afterwards takes an increasingly long time. From what I understand, this is because duplicacy has no index of all previously uploaded files and therefore has to "discover" which chunks already exist, until it finds some that it actually still needs to upload.

But I wonder whether it is correct that this literally takes hours (almost 6 at this point, and still counting). If this is by design, I wonder if there is not room for improvement here (though I have no suggestion as to how).

If this is not by design, I wonder what might be causing it. My only guess at the moment is that it has something to do with the fact that I resumed the backup via the CLI instead of the GUI (it was initiated, stopped, and resumed via the GUI several times before without problems, apart from the increasingly long waiting time). Are there any pitfalls when switching between CLI and GUI?

My hunch is that this long waiting time is by design after all (my backup archive now exceeds 900 GB in size). My main reason for believing this is that duplicacy is still actively reading files from the repository, according to the Windows task manager.

My main reason for doubting that everything is in order is that, as I understood it, whenever duplicacy skips chunks during an upload, it says so in the backup -stats output. But so far, the only output the job has produced is this:

Storage set to P:\Backup\duplicacy
No previous backup found
Creating a shadow copy for D:\
Shadow copy {E8237DE6-EC1B-4E41-BF96-660929B660BC} created
Indexing D:\christoph
Loaded 21 include/exclude pattern(s)

In contrast, a different job a couple of days ago produced this when resumed:

Storage set to P:\Backup\duplicacy
No previous backup found
Indexing T:\
Incomplete snapshot loaded from c:\duplicacy\prefs/incomplete
Listing all chunks
Skipped 2299 files from previous incomplete backup
Skipped chunk 1 size 8427437, 8.04MB/s 00:24:38 0.0%
Skipped chunk 2 size 2007431, 9.95MB/s 00:19:54 0.0%
Skipped chunk 3 size 6222934, 15.89MB/s 00:12:28 0.1%
Skipped chunk 4 size 2979547, 18.73MB/s 00:10:34 0.1%
Skipped chunk 5 size 1059019, 19.74MB/s 00:10:02 0.1%
Skipped chunk 6 size 8930050, 28.25MB/s 00:07:00 0.2%
Skipped chunk 7 size 10986522, 38.73MB/s 00:05:06 0.3%
Skipped chunk 8 size 1392134, 40.06MB/s 00:04:56 0.3%
Skipped chunk 9 size 2115144, 42.08MB/s 00:04:42 0.3%
...

The difference is that the latter job never had anything to do with the GUI. And there is no -vss involved, because the repository is on a samba share (accessed from the windows pc).

So this is why I'm confused...


gchen    Jan 21 9:58PM 2018

If you don't see Incomplete snapshot loaded then it means the previous backup didn't leave behind an incomplete snapshot file. Was the previous backup interrupted or stopped due to an error?


Christoph    Jan 22 1:33AM 2018

Was the previous backup interrupted or stopped due to an error?

No error. Just me pressing the stop button in the GUI.

If you don't see Incomplete snapshot loaded then it means the previous incomplete backup didn't leave an incomplete file.

So you are suggesting that the reason it takes so long to resume an interrupted backup is that there is no incomplete file? What exactly does that file do that saves duplicacy so much time?

If the absence of this file explains why resuming takes so long, that leaves us with two questions:

  1. Why is the file sometimes absent?
  2. Why is duplicacy not talking to me (when the file is absent)?

I suppose Q2 is a request for a UX improvement: duplicacy should produce a message about what it is doing (when -stats is enabled), since it is obviously doing something. In my case it looks like it is successfully continuing to upload (though I haven't verified that), but it provides no stats.


gchen    Jan 22 9:36PM 2018

The incomplete file records what files have been scanned and their chunks. So when the next backup reads this file, it knows which files to skip without rescanning them.

Why is the file sometimes absent?

This is a flaw of Windows -- there isn't an easy way to interrupt a running process other than killing it. But if the Duplicacy process gets killed it won't get the chance to save the incomplete file. The CLI version doesn't have this issue when you press Ctrl-C.

Why is duplicacy not talking to me (when the file is absent)?

If the initial backup isn't interrupted or stopped by an error, there won't be such a file. So by default Duplicacy assumes this file doesn't exist.
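(For what it's worth, a quick way to predict whether the next run will fast-resume is to check whether that file exists before restarting the backup. This is only a sketch: the path below is the one shown in the log output quoted earlier, which was for a different job; the exact location depends on where the preferences folder for that repository lives, by default under the repository's .duplicacy folder.)

# Hypothetical check (not a duplicacy feature): see whether the incomplete
# snapshot file from the interrupted run still exists before resuming.
if (Test-Path "C:\duplicacy\prefs\incomplete") {
    "Incomplete snapshot found - the resume should skip already-scanned files quickly."
} else {
    "No incomplete snapshot - duplicacy will re-scan and skip chunks one by one."
}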


Christoph    Jan 24 3:39PM 2018

Why is duplicacy not talking to me (when the file is absent)?

If the initial backup isn't interrupted or stopped by an error, there won't be such a file. So by default Duplicacy assumes this file doesn't exist.

No, that's not what I meant. Or not quite.

So you're saying that those Skipped chunk 1 size 8427437... messages are only produced when duplicacy has the incomplete file to check which files to skip? In other words: when duplicacy attempts to upload a file that turns out to already exist (and hence skips it) then this is not reported as Skipped chunk...? Fine.

But my point was: it doesn't talk to me at all. Nada. Even when it reaches the point where it has skipped all the chunks and starts uploading again, there is no message whatsoever.

That was the situation as of my previous post. In the meantime, I noticed it may be relevant that the command duplicacy.exe backup -vss -stats -limit-rate 4000 was issued via a PowerShell script, which in turn was started (manually) via the Windows Task Scheduler (since the script runs in the background, it writes any duplicacy output to a log file). When I eventually ended the task in Windows, thousands of log entries about uploaded chunks were suddenly written to the log file.

Under other circumstances, the messages from duplicacy are written to the log file before the end of the task. So could there be something specific about how duplicacy produces its output when it resumes a backup in the absence of the incomplete file?


gchen    Jan 24 4:03PM 2018

So you're saying that those Skipped chunk 1 size 8427437... messages are only produced when duplicacy has the incomplete file to check which files to skip? In other words: when duplicacy attempts to upload a file that turns out to already exist (and hence skips it) then this is not reported as Skipped chunk...? Fine.

No, it is the opposite: if there is an incomplete file Duplicacy will know which files to skip so it won't produce many Skipped chunk messages. If that file is absent, it doesn't know which files had been uploaded in the previous backup (nor does it know there was a previous backup), so it will start from the first file and attempt to upload every chunk. But a lot of chunks had indeed been uploaded and that was why you saw so many Skipped chunk messages.


Christoph    Jan 24 4:46PM 2018

Okay. Fine. What about the issue that duplicacy did not log anything until the PowerShell task was ended?


gchen    Jan 24 10:57PM 2018

Duplicacy only writes to stdout. It is normal for Windows to create a buffer to take the output from a running process and only when the buffer is full will the content of the buffer be flushed to the file on disk.
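(If the buffering really happens on the PowerShell/redirection side rather than inside duplicacy itself -- an assumption, not something confirmed here -- streaming the output through the pipeline and appending each line as it arrives might sidestep it. A minimal sketch, using the same paths and flags as the script Christoph posts below:)

# Sketch only: stream duplicacy's stdout line by line into the log instead of
# redirecting the whole command with ">>".
$logfile = "C:\duplicacy\logs\backup.log"   # hypothetical log path
& "c:\Program Files (x86)\Duplicacy\duplicacy.exe" backup -vss -stats -limit-rate 4000 |
    ForEach-Object { Add-Content -Path $logfile -Value $_ }

(Whether this actually defeats the buffering depends on where the buffer sits; if duplicacy's own stdout is block-buffered when not attached to a console, streaming on the PowerShell side won't help.)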


Christoph    Jan 25 2:20PM 2018

Ah, thanks for explaining. But I'm afraid this doesn't explain much of what I'm seeing, because the first six lines appeared within a few seconds of starting the backup, while the following 236000 lines (sic!) appeared only when I stopped the task. So if there is a buffer, it is at most six lines long. Which leaves the question of why the other 236000 lines were kept in the buffer (or wherever they were kept).

I understand that this is probably not a bug in duplicacy, but I suspect that there may be some room for improvement in how it interacts with windows.

Or maybe there is something wrong with my script? Here it is:

# Basic setup (edit here!)
$backupID = "PC_D_christoph"
$repositorypath = "D:\christoph"
# Construct logfile name for the day
$logfiledate = get-date -format yyyy-MM-dd
$logfilename = "backup_$backupID-$logfiledate.log" 
# Go to repository 
Set-Location -Path $repositorypath >> "C:\duplicacy\logs\$logfilename"
$(get-date -Format "yyyy-MM-dd HH:mm:ss") + " *** Starting new backup of " + $(convert-path $(get-location).Path) + " ***" >> "C:\duplicacy\logs\$logfilename"
# Start backup (edit here!)
& "c:\Program Files (x86)\Duplicacy\duplicacy.exe" backup -vss -stats -limit-rate 4000 >> "C:\duplicacy\logs\$logfilename"
$(get-date -Format "yyyy-MM-dd HH:mm:ss") + " *** Backup of " + $(convert-path $(get-location).Path) + " stopped ***" >> "C:\duplicacy\logs\$logfilename"


gchen    Jan 25 9:26PM 2018

Your script looks ok to me although I must admit that I have little experience with PowerShell.


TheBestPessimist    Jan 30 10:28AM 2018

Hey cristoph, maybe this'll help you: https://bitbucket.org/thebestpessimist/duplicacy-utils


Christoph    Jan 30 4:42PM 2018

@TheBestPessimist: wow, thanks for sharing! That looks like what I was hoping to eventually come up with in a couple of years, once I'd had the time to learn enough powershell to make things work! I'll have to take a closer look at those scripts this weekend, but based on a quick look, it seems like you are following duplicacy's decentralized structure, where all the preferences etc. are stored in the repository base folder, right? My plan was to have a central folder with all settings, so that there is only a .duplicacy file in the repository pointing to the central folder. So this is probably something I'd want to add/modify.

I like the idea of setting the options once and for all in a config file. It's a nice workaround for (part of) a feature request that I was contemplating submitting, i.e. that you can modify the default settings for things like chunk size.


gchen    Jan 30 9:56PM 2018

@TheBestPessimist I added a link to your scripts to the wiki home page. Thanks!


TheBestPessimist    Jan 31 2:03AM 2018

@gchen excuse me (or better: DON'T), but i still am too lazy to make that promised wiki page with the utils on github. TeamFortress 2 just doesn't let me do it :^).

@cristoph create issues for me on bitbucket (or ask me to also provide a github repo if you don't like bitbucket) with what you'd like to see modified, and i will see what can be done. That said, i like @gchen's idea better: just keep the folder at the root of the repository and create multiple Scheduled Tasks for everything. Even that can be changed, though, so that the script uses full folder paths and can be adapted to your needs.

@gchen: another request: is it possible that i get a mail from this forum when someone replies to me? I would answer much more quickly if i'd be notified (like on github) of everything. (hehe, this is also linked to the discourse forum request where the functionality exists)


Christoph    Jan 31 4:07AM 2018

@TheBestPessimist Thanks a lot for your kind offer to (possibly) accept "feature requests". I have not used bitbucket so far and therefore have no reason to not like it. But I have been on Github for a while and always appreciate when software I'm interested in is hosted there, as I already know how it works (to some degree, at least). So if you don't mind moving (or copying? if that makes sense) the repository to github, it would be great. I think it also makes sense considering that duplicacy is hosted there.

As for the forum request, you are not alone: see here: https://duplicacy.com/issue?id=5696101055201280 and here: https://duplicacy.com/issue?id=5753627847163904. I'd like to use this opportunity to renew my offer of helping to set up the forum. I really think the best time to make the move is about now. I know that for a busy person (as we all are), one big hurdle is getting started and finding out what needs to be done and what potential problems might arise, because - obviously - only then can you actually plan the process. That is why I'm offering my help (I have set up two forums myself and helped to set up a third). My sense is that the demand for duplicacy will continue to grow, and that also means more support requests. At some point, one person alone won't be able to handle those anymore, so it will be good to have an active community by then that can help out. The initial setup of the forum can be done in less than an hour (all you need is a VPS or equivalent, a (sub)domain, and a mail account for transactional mails at a provider like Mailgun); the main challenge is migrating the existing content to the new forum. But given the simplicity of this "forum" and the fact that posts are already in Markdown, it should not be the biggest of deals...


TheBestPessimist    Jan 31 4:54AM 2018

Here it is: https://github.com/TheBestPessimist/duplicacy-utils


gchen    Jan 31 11:19AM 2018

Christoph, I really appreciate your offer. I'll think about it and check how easy the transition will be and then let you know in a few days.


Christoph    Feb 1 5:53PM 2018

@TheBestPessimist, since you seem to know powershell well, can you explain why the duplicacy logs generated by my script above are sometimes immediately written to the logfile, and sometimes they only show up at the end of the job?

Today, for example, I immediately saw this in the logfile:

2018-02-01 21:47:30 *** Starting new backup of D:\christoph\work\dropbox ***
2018-02-01 21:47:33.240 INFO STORAGE_SET Storage set to P:\Backup\duplicacy-test\

while the remaining entries only appeared in the log after the job completed:

2018-02-01 21:47:30 *** Starting new backup of D:\christoph\work\dropbox ***
2018-02-01 21:47:33.240 INFO STORAGE_SET Storage set to P:\Backup\duplicacy-test\
2018-02-01 21:48:11.403 INFO BACKUP_START Last backup at revision 3 found
2018-02-01 21:48:11.420 INFO VSS_CREATE Creating a shadow copy for D:\
2018-02-01 21:50:01.956 INFO VSS_DONE Shadow copy {7499CB92-6982-4D79-B8E3-A503FEF9408D} created
2018-02-01 21:50:02.001 INFO BACKUP_INDEXING Indexing D:\christoph\work\dropbox
2018-02-01 21:50:03.294 INFO SNAPSHOT_FILTER Loaded 3 include/exclude pattern(s)
2018-02-01 21:51:59.466 INFO UPLOAD_PROGRESS Uploaded chunk 1 size 513595, 10KB/s 08:18:19 0.1%
2018-02-01 21:51:59.977 INFO UPLOAD_PROGRESS Uploaded chunk 2 size 298295, 15KB/s 05:14:54 0.2%
[...]


TheBestPessimist    Feb 3 10:30AM 2018

Hey,

I don't know powershell that well. I am just curious how it works, so i chose it for a small project (after all, doing real work is the best way to learn anything): duplicacy automation.

Anyhow: i think that your problem has something to do with buffering.

This is a hit i got: https://stackoverflow.com/questions/40487671/is-out-host-buffering (after way more searching than i expected).

Anyhow, in my opinion, you could try writing to the file using the Out-File cmdlet (maybe that works):

Get-Date | Out-File -Append -FilePath a.txt

https://docs.microsoft.com/en-us/powershell/module/Microsoft.PowerShell.Utility/Out-File?view=powershell-5.1


Christoph    Feb 4 6:01PM 2018

Thanks for your help with that. As far as I can tell from the Stack Overflow link, the best way forward is to use Start-Process instead of &, because it takes control of the output (instead of letting the application, i.e. duplicacy, decide).

The only downside is that it does not allow appending to an existing file, but a work-around for that seems to be described here: https://stackoverflow.com/questions/8925323/redirection-of-standard-and-error-output-appending-to-the-same-log-file (I have yet to try it).
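For what it's worth, a rough sketch of what that could look like in my script (untested; it reuses the paths from the script above and a hypothetical temporary file, since Start-Process can only overwrite its -RedirectStandardOutput target, not append to it):

# Sketch: let Start-Process own the output redirection, then append the
# captured output to the daily log file afterwards.
$repositorypath = "D:\christoph"
$logfilename = "backup_PC_D_christoph-$(Get-Date -Format yyyy-MM-dd).log"
$tempLog = "C:\duplicacy\logs\current_run.tmp"   # hypothetical temp file
Start-Process -FilePath "c:\Program Files (x86)\Duplicacy\duplicacy.exe" `
    -ArgumentList "backup", "-vss", "-stats", "-limit-rate", "4000" `
    -WorkingDirectory $repositorypath `
    -RedirectStandardOutput $tempLog `
    -NoNewWindow -Wait
# Start-Process can only overwrite its redirect target, so append the result
# to the existing log file once the run has finished.
Get-Content $tempLog | Add-Content -Path "C:\duplicacy\logs\$logfilename"
Remove-Item $tempLog

The obvious drawback is that the output only lands in the main log after the run finishes; to watch progress you would have to look at the temporary file.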


TheBestPessimist    Feb 5 2:51AM 2018

No problem. My question, however, is: why don't you try my implementation for automating duplicacy? It's already there (and working on my pc & laptop).


Christoph    Feb 5 7:54AM 2018

I intend to use it sometime in the future. The reasons why I don't do it immediately are basically: 1. it always takes time to learn and set up a new system, time that I currently don't have (had I seen this when I started with duplicacy in December, I would have used it immediately...), and 2. I prefer the centralized setup and would need to adapt your scripts to that (or submit an issue/feature request for you, as discussed earlier). But that also requires time (which I don't have).

And to cite a third reason for DIY:

doing real work is the best way to learn anything

meaning that getting my own scripts to work (rather than just copying yours) is also a way of learning powershell...

I'll get back with my feature request asap.


Christoph    Feb 15 3:53PM 2018

if the Duplicacy process gets killed it won't get the chance to save the incomplete file. The CLI version doesn't have this issue when you press Ctrl-C

if there is an incomplete file Duplicacy will know which files to skip so it won't produce many Skipped chunk messages. If that file is absent, it doesn't know which files had been uploaded in the previous backup (nor does it know there was a previous backup), so it will start from the first file and attempt to upload every chunk.

I have to come back to this because I cannot confirm the above. I've had this suspicion for a while but was never entirely sure. Now I am: when I stop an ongoing backup with Ctrl + C in PowerShell and then restart the same backup, I see Skipped chunk messages for hours (literally) before it resumes the actual upload.


gchen    Feb 15 10:21PM 2018

If you don't see a log message "Incomplete snapshot saved to", then it means Duplicacy isn't getting the Ctrl-C signal. I'm not sure if PowerShell would pass the signal to Duplicacy when it is running inside a script, but if you run Duplicacy from the command line it should get the signal.


Christoph    Feb 16 2:01AM 2018

I'm not running it as a script. I'm running it directly from the powershell command line.


Christoph    Feb 19 5:50PM 2018

Today the backup failed/stopped because the destination drive became unavailable. I'd assume that in this situation duplicacy had all the time in the world to write the incomplete snapshot, but it didn't:

[powershell screenshot]

Now it will have to skip chunks again for hours. Is there any way I can get it to save the incomplete snapshots?


gchen    Feb 19 11:28PM 2018

Any chance you're not running the latest version?


Christoph    Feb 20 3:42AM 2018

It's one of the v2.1.0 betas. Not sure if it's the latest one though. Does it matter which beta?


gchen    Feb 20 2:01PM 2018

Sorry, I forgot the partial snapshot file is saved for incomplete initial backups only. After the initial backup is done, it always uses the last complete backup as the base to determine which files are new. So if you add a lot of new files after the initial backup, you won't be able to fast-resume after the backup is interrupted.


Christoph    Feb 20 2:11PM 2018

Oh. Well, that is bad news. Why are incomplete snapshots not saved for subsequent backups?

I have about 1 TB of data to back up and, for various reasons, that won't go through without interruptions. Without incomplete snapshots, I lose several hours every time while duplicacy just skips chunks. And as the backup progresses, that waiting time only gets longer.

May I suggest that in v2.1.0 this limitation on incomplete snapshots be removed? Or, if there are serious downsides to that, to at least make it available as an option, -save-incomplete or something?


Christoph    Feb 26 4:11AM 2018

This is really becoming a major issue for me.

What if I delete the first two snapshots? Will duplicacy then save the ongoing incomplete snapshot again, as a workaround? If so, would duplicacy prune -r 1-2 -storage <storage url> be the correct way of doing this? Or should I just delete the two files in the snapshot folder? (I obviously don't want the chunks to be deleted, only the snapshots.)


gchen    Feb 26 1:14PM 2018

You should manually delete the two files 1 and 2 under snapshots/snapshot_id in the storage.
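(In PowerShell terms, and assuming the storage path and backup ID mentioned earlier in this thread -- P:\Backup\duplicacy and PC_D_christoph -- that would look roughly like this; only the revision files are removed, the chunks stay untouched.)

# Sketch: delete only the snapshot (revision) files, never the chunks folder.
Remove-Item "P:\Backup\duplicacy\snapshots\PC_D_christoph\1"
Remove-Item "P:\Backup\duplicacy\snapshots\PC_D_christoph\2"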


Christoph    Feb 26 1:39PM 2018

OK, I have done that, but assuming that duplicacy checks for the existence of previous snapshots at the beginning rather than at the end, this will only become effective the next time I start the backup command, right?

Just to give you an idea how serious a problem this behaviour of duplicacy is: It currently takes around 24 hours for the backup to resume actual uploads, i.e. it is skipping chunks for a full day. I really don't see what the point is of limiting incomplete backups to initial backups only.