Duplicacy keeps stalling.

Danny     Sep 5 10:51AM 2017 CLI

I'm looking into using duplicacy + wasabi as a replacement for CrashPlan. I started with a small directory of about 2 GB which backed up fine. I then added a larger directory of about 500 GB.

The first time I tried on this larger directory, duplicacy stalled after about 1400 chunks. No crash, it just stopped printing output indicating that it was uploading chunks. It stayed like that for at least half an hour. I aborted it, restarted, and went to bed.

This morning I saw it had stopped after 4096 chunks. By looking at the wasabi dashboard I can tell that the newest chunk was uploaded about 4.5 hours before I checked it. It had stalled for that long.

Duplicacy version is 2.0.7. OS is FreeBSD 11.0-STABLE.

It's actually a FreeNAS box and I have duplicacy running in a jail. One of the things that I found appealing about duplicacy is that it has native binaries for FreeBSD. CrashPlan was not easy to get going on FreeNAS because the only binaries you can use are Linux binaries.

I am now running it again with the -debug and -log options. I'll update this issue when I have more data.

Something that makes this kind of annoying is that when I hit ctrl-c it doesn't save the incomplete snapshot, so it has to rescan everything. I know the incomplete-snapshot functionality is there because I've seen it working; now I'm scratching my head trying to figure out where, probably in another jail on this machine. I don't know why it worked before but isn't working now. Maybe whatever issue is causing Wasabi to hang is also preventing ctrl-c from saving the incomplete snapshot.

I saw on another issue that you said in this case you have to "force a crash" and send a stack trace, but I don't know how to do that. I'll gladly do it if you give me a little guidance.


Danny    Sep 5 1:43PM 2017

This might be new ... the program stopped.

2017-09-05 11:34:14.644 INFO UPLOAD_PROGRESS Uploaded chunk 6796 size 3436484, 3.95MB/s 1 day 09:51:25 6.5%
2017-09-05 11:34:15.494 ERROR UPLOAD_CHUNK Failed to upload the chunk 1e7687fa046669d7dcf20956cc0baa1c9bdc4cf50d5654c50528771b5285fddd: XAmzContentSHA256Mismatch: The provided 'x-amz-content-sha256' header does not match what was computed.
        status code: 400, request id: 0756E4C61BDEA13A, host id: JT6/IBwG98s1mQdou1/lS/78m5GH62T0UWWUSJ4etb0zIJiR2Yfb6xz06yl2b31qkwzLpCF9gyde


gchen    Sep 5 2:48PM 2017

Was this a new error, or the same error that caused the process to hang before? Is it reproducible?

You can also try s3c:// instead of s3://. The default s3 backend uses https://github.com/aws/aws-sdk-go, while s3c uses https://github.com/goamz/goamz.
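
For context, the x-amz-content-sha256 header is just the hex-encoded SHA-256 of the request body, computed by the client under AWS Signature V4. The server hashes the body it actually received and returns this 400 when the two don't match, so the error suggests the payload changed somewhere between being hashed and arriving at Wasabi. A simplified sketch of the client-side value (not the SDK's actual code):

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// contentSHA256 returns what a SigV4 client puts in the x-amz-content-sha256
// header for a fully buffered payload: the lowercase hex encoding of the
// SHA-256 of the request body.
func contentSHA256(payload []byte) string {
    sum := sha256.Sum256(payload)
    return hex.EncodeToString(sum[:])
}

func main() {
    chunk := []byte("example chunk bytes")
    fmt.Println("x-amz-content-sha256:", contentSHA256(chunk))
    // The server hashes the body it actually received; if those bytes differ
    // from the ones the client hashed, the request is rejected with
    // XAmzContentSHA256Mismatch (HTTP 400).
}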


Danny    Sep 5 4:23PM 2017

There was no error printed when it hung before, but I wasn't using -log or -debug the first couple of times.

I restarted the process and let it run to around chunk 9000. It was still working fine at that point, but I interrupted it again to see whether it would save the incomplete snapshot when things are working normally, or whether it would still fail to do so. It still failed to save the incomplete snapshot.

I thought the chunk-upload error might essentially be a transmission checksum error, which is somewhat believable because the link to this computer on my LAN should be 1 Gbps but was running at less than 100 Mbps at best, making me suspect the transmission medium was bad. That motivated me to finally troubleshoot the issue properly, and I replaced a faulty network cable.

I fired it up again and I'm going to let it run. I'll let you know if I continue to have issues.


Danny    Sep 6 12:20AM 2017

It happened again.

2017-09-05 21:56:36.347 INFO UPLOAD_PROGRESS Uploaded chunk 12835 size 6500322, 10.18MB/s 12:20:19 12.3%
2017-09-05 21:56:39.768 ERROR UPLOAD_CHUNK Failed to upload the chunk 495da1bf38db9b6c11d7e249762c7fe29e8d7936558eed15f7ac6e29328f8946: XAmzContentSHA256Mismatch: The provided 'x-amz-content-sha256' header does not match what was computed.
        status code: 400, request id: 7ACFB04835F0BAA4, host id: m816YmLlkE0GLISNIhZte5YVJ5htdmNtDj7iQ5lyiLxSDzLcfxT+y1yOd3yDJMapZRplTKNzOi0I

I'm switching to s3c and I'll see what happens.


Danny    Sep 6 12:29AM 2017

s3c seems much slower than s3. The process of scanning through the first part of the backup (the part where the chunks are already uploaded) is going at ~ 12 MBps and it was > 35 MBps before.


gchen    Sep 6 7:44AM 2017

Another user reported exactly the same error: https://github.com/gilbertchen/duplicacy/issues/160


Danny    Sep 6 9:39PM 2017

Yeah, that's the error I'm getting for sure. Maybe the issue is with Wasabi. I'm considering Backblaze as well, but I'm going to bull ahead with this one for now. Wasabi seems to have more sophisticated access controls, which would be valuable for a use case I'm considering for work.

I didn't want to let it keep going with the s3c storage, just because it seemed way slower than the s3 storage. It seems odd that it would make a material difference, but it definitely does, at least for the case where the chunk doesn't need uploading. Once it starts uploading new chunks the upload calls probably begin to dominate.

I downloaded the source code and hacked in some extremely crude retry logic in the UploadFile method of the s3 storage. It'll be sufficient for my purposes if it works.
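
The retry is nothing clever; conceptually it's just a wrapper like the following (a rough sketch of the idea only, not the actual duplicacy code, and the real UploadFile signature may differ):

package main

import (
    "errors"
    "fmt"
    "time"
)

// uploadWithRetry retries a failing upload a few times with a fixed delay,
// on the assumption that the SHA-256 mismatch is transient.
func uploadWithRetry(upload func() error, attempts int, delay time.Duration) error {
    var err error
    for i := 1; i <= attempts; i++ {
        if err = upload(); err == nil {
            return nil
        }
        fmt.Printf("upload failed (attempt %d/%d): %v; retrying\n", i, attempts, err)
        time.Sleep(delay)
    }
    return err
}

func main() {
    calls := 0
    err := uploadWithRetry(func() error {
        calls++
        if calls < 3 { // simulate two transient failures
            return errors.New("XAmzContentSHA256Mismatch")
        }
        return nil
    }, 5, 100*time.Millisecond)
    fmt.Println("final result:", err)
}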


gchen    Sep 8 6:50PM 2017

I checked in an experimental fix: https://github.com/gilbertchen/duplicacy/issues/160

Please let me know if it works or not.


Danny    Sep 9 8:34AM 2017

My backup has been running for a while now with my retry code in place. I'm over 68000 chunks now with no further issues. This is with code I cloned before you committed that fix (197d20f0e07284c9c8a35d9d0800b366d53b2f18). The thing is, it's not exercising my retry code: I added print statements that fire whenever the retry path is hit, and it hasn't had to retry once with my build.

Since the most I was able to get before was about 4000 chunks before hitting the error, and I've now gone about 45000 without hitting it once, it looks like the problem went away without any code changes. It seems one of the following is the case:

  1. The core issue was on the Wasabi side and they fixed it, or it was transient in some other way.
  2. I could be using different revisions of the library code than you because I ran "go get" later.
  3. There's something about my build environment (Go version?).
  4. The problem could still be there and I might just have had a very lucky streak.

Option 2 seems most likely to me, but I've only used Go on one small project at work and I might be confused about how go get works. If you pin the revisions it fetches, I'd be getting the same code as you. I have aws-sdk-go version 1.10.39.
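
For what it's worth, checking the SDK version from a build is easy (this assumes aws-sdk-go's aws.SDKVersion constant, which I believe the v1 SDK exports in its aws package):

package main

import (
    "fmt"

    "github.com/aws/aws-sdk-go/aws"
)

func main() {
    // aws-sdk-go (v1) reports its own name and version as constants
    // in the aws package.
    fmt.Println(aws.SDKName, aws.SDKVersion)
}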


gchen    Sep 9 6:36PM 2017

I tend to believe 1 is the case. My aws-sdk-go version is 1.8.39, but I don't see any changes since that version that could have fixed the issue.

It is interesting to see that the error comes from minio: https://github.com/minio/minio/blob/0d154871d50d9a88511b383bc4c96581bc46b659/cmd/api-errors.go#L575


Danny    Sep 10 8:25PM 2017

It was easy to test. I interrupted my transfer and started over with 2.0.7 again. Just over 600 chunks in:

2017-09-10 18:27:39.113 ERROR UPLOAD_CHUNK Failed to upload the chunk 8efc559e8803a7d8c054a333929c8f509ea6ca772314ddaf6809402967df68e8: XAmzContentSHA256Mismatch: The provided 'x-amz-content-sha256' header does not match what was computed.
        status code: 400, request id: 18AEDF290E162D59, host id: IMKfSFTr352x6oKqmLw6CTrUThgs7wjLHjVFcZ/ZUaHUWNwEhSZfAtFuWv0hyvMy+Z2xxnRnx3jO
goroutine 14 [running]:
runtime/debug.Stack(0x0, 0x0, 0xc427507900)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x79
runtime/debug.PrintStack()
        /usr/local/go/src/runtime/debug/stack.go:16 +0x22
github.com/gilbertchen/duplicacy/src.CatchLogException()
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:171 +0x13d
panic(0xc64bc0, 0xc427ad3050)
        /usr/local/go/src/runtime/panic.go:489 +0x2cf
github.com/gilbertchen/duplicacy/src.logf(0x2, 0xd0f4f4, 0xc, 0xd22597, 0x21, 0xc427507e48, 0x2, 0x2)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:154 +0x1ac
github.com/gilbertchen/duplicacy/src.LOG_ERROR(0xd0f4f4, 0xc, 0xd22597, 0x21, 0xc427507e48, 0x2, 0x2)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:97 +0x73
github.com/gilbertchen/duplicacy/src.(*ChunkUploader).Upload(0xc429492780, 0x0, 0xc429492870, 0x15c55, 0x61532c323a6c01)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkuploader.go:139 +0x553
github.com/gilbertchen/duplicacy/src.(*ChunkUploader).Start.func1(0xc429492780, 0x0)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkuploader.go:60 +0x112
created by github.com/gilbertchen/duplicacy/src.(*ChunkUploader).Start
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkuploader.go:65 +0x54

Just to clarify, this is the download from GitHub (your build). I didn't build it.


Danny    Sep 12 10:26AM 2017

I saw that you released 2.0.9 with a later aws-sdk-go. I downloaded that one and I ran the last few thousand chunks with it. It worked. It's hard to be completely sure because I didn't have much of the backup left at that point, but it looks like this issue is fixed in 2.0.9.

Thanks.


Danny    Sep 19 9:37AM 2017

Now I'm testing restore and I keep getting errors like this:

2017-09-19 08:31:25.444 ERROR UPLOAD_CHUNK Failed to decrypt the chunk a08c93db303e740a47dd6405918179199f2fa656866afb168e055da2da3f3070: cipher: message authentication failed
goroutine 46 [running]:
runtime/debug.Stack(0x0, 0x0, 0xc4202756b8)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x79
runtime/debug.PrintStack()
        /usr/local/go/src/runtime/debug/stack.go:16 +0x22
github.com/gilbertchen/duplicacy/src.CatchLogException()
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:177 +0x13d
panic(0xc6d3a0, 0xc4241ed350)
        /usr/local/go/src/runtime/panic.go:489 +0x2cf
github.com/gilbertchen/duplicacy/src.logf(0x2, 0xd1868f, 0xc, 0xd2c6f6, 0x22, 0xc420275d40, 0x2, 0x2)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:160 +0x1bc
github.com/gilbertchen/duplicacy/src.LOG_ERROR(0xd1868f, 0xc, 0xd2c6f6, 0x22, 0xc420275d40, 0x2, 0x2)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:98 +0x73
github.com/gilbertchen/duplicacy/src.(*ChunkDownloader).Download(0xc428c78a00, 0x3, 0x0, 0x26, 0xc423757720, 0x20, 0x1c5622, 0x1, 0x1)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkdownloader.go:336 +0x547
github.com/gilbertchen/duplicacy/src.CreateChunkDownloader.func1(0xc428c78a00, 0x3)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkdownloader.go:81 +0x161
created by github.com/gilbertchen/duplicacy/src.CreateChunkDownloader
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkdownloader.go:86 +0x23f

It's a different chunk every time.

I'm giving up on Wasabi. Their service is interesting, but all these errors are problematic. Upload and download speeds are also much, much lower than Backblaze's; they don't seem to be optimized for backup. It's not worth it to save $0.0011/GB. I'm just posting this update in case others might benefit from my experience.


gchen    Sep 19 11:41AM 2017

This is worrisome. I can't completely rule out that this is Duplicacy's fault.

As for the speed, my experiences are quite the opposite -- Wasabi is about 4x faster than B2, and even faster than Amazon S3. Maybe that is because my test machine is located on the east coast? What is your machine's location?


Danny    Sep 20 11:53AM 2017

If you're finding Wasabi faster, then maybe I'm doing something wrong. For uploads I can easily saturate my 15 Mbps connection with 3 threads. For downloads I couldn't get it to use much more than 8 Mbps of my 150 Mbps download bandwidth. By contrast, I was able to average 7.25 MBps (capital B) over my test restore from Backblaze, and that includes a period where I throttled the bandwidth to 1280 kBps for a couple of hours.

How many threads are you using to get 4x the B2 performance, and are there other settings I should tune? I left the chunk size at the default.

My machine is in Calgary. It's closer to the west coast than the east, but that shouldn't make this much of a difference. Latency will be higher, but latency shouldn't be a major factor. I did read in Backblaze's documentation that their API lets you store a file with a single API call whereas the S3 API requires two, but that still shouldn't matter for files averaging 4 MB; transfer time should dominate.


Danny    Sep 20 12:39PM 2017

I can increase the bandwidth utilization by adding more threads, but not by that much -- only up to ~3 MBps. I can see on the traffic graph that the bandwidth utilization is very spiky: it spikes up to 70 Mbps, which is about where it was going with Backblaze, but it's not keeping the pipe full. It should be doing readahead. Or, if it's file-oriented, maybe it brings things to a screeching halt while it's extracting the files.

If I use copy instead of restore, the traffic graph is extremely flat, holding at around 1 MBps no matter how many threads are in play, which is surprising, at least to me.


gchen    Sep 20 1:32PM 2017

I used the Linux code base as the backup dataset (https://github.com/gilbertchen/benchmarking).

Here are the total running times (12 backups for each storage):

---------------------------- sftp ----------------------------
real    0m45.700s
real    0m6.542s
real    0m22.733s
real    0m3.544s
real    0m23.702s
real    0m6.393s
real    0m21.581s
real    0m22.677s
real    0m27.055s
real    0m26.547s
real    0m30.198s
real    0m31.033s
---------------------------- s3 ----------------------------
real    0m50.356s
real    0m7.203s
real    0m30.001s
real    0m4.991s
real    0m25.225s
real    0m9.322s
real    0m45.123s
real    0m43.682s
real    0m51.769s
real    0m47.605s
real    1m1.010s
real    1m4.767s
---------------------------- wasabi ----------------------------
real    0m34.644s
real    0m5.538s
real    0m16.971s
real    0m3.997s
real    0m20.551s
real    0m5.976s
real    0m21.438s
real    0m22.126s
real    0m28.537s
real    0m29.045s
real    0m34.536s
real    0m35.741s
---------------------------- b2 ----------------------------
real    2m11.109s
real    0m11.306s
real    2m0.587s
real    0m16.727s
real    2m16.777s
real    0m28.311s
real    1m22.846s
real    0m53.006s
real    1m35.601s
real    0m51.963s
real    1m33.964s
real    1m46.014s

The test machine is an Ubuntu 16.10 VM running on an ESXi server with a 200 Mbps uplink. The SFTP server was running on a different VM on the same ESXi host, so Wasabi is even faster than a local SFTP server.

Another test on a VirtualBox image (the other data set in https://github.com/gilbertchen/benchmarking) showed similar results.


Danny    Sep 20 1:59PM 2017

Maybe I'm comparing apples and oranges. The first part of my backup consists of about 80,000 very small files. The rest of it is much larger files. I think performance might improve once I get past this first part but I haven't been letting it get that far. If it works faster on larger files I'm comparing a slow start with a fast finish.

I switched to the s3c provider, cranked up the threads to 128, and let it run. It got about 600 chunks in and then crashed:

2017-09-20 12:50:45.539 ERROR UPLOAD_CHUNK Failed to decrypt the chunk ca9c51caa3e5cae3e48d72b65cd80618b7e27537b51732c54e884efcce9adc88: The storage doesn't seem to be encrypted
goroutine 134 [running]:
runtime/debug.Stack(0x0, 0x0, 0xc42a62d6b8)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x79
runtime/debug.PrintStack()
        /usr/local/go/src/runtime/debug/stack.go:16 +0x22
github.com/gilbertchen/duplicacy/src.CatchLogException()
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:177 +0x13d
panic(0xc6d3a0, 0xc42834a900)
        /usr/local/go/src/runtime/panic.go:489 +0x2cf
github.com/gilbertchen/duplicacy/src.logf(0x2, 0xd1868f, 0xc, 0xd2c6f6, 0x22, 0xc42a62dd40, 0x2, 0x2)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:160 +0x1bc
github.com/gilbertchen/duplicacy/src.LOG_ERROR(0xd1868f, 0xc, 0xd2c6f6, 0x22, 0xc42a62dd40, 0x2, 0x2)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:98 +0x73
github.com/gilbertchen/duplicacy/src.(*ChunkDownloader).Download(0xc42876cf00, 0x47, 0x0, 0x27d, 0xc42abb77e0, 0x20, 0x331d30, 0x1, 0xc4265e4e01)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkdownloader.go:336 +0x547
github.com/gilbertchen/duplicacy/src.CreateChunkDownloader.func1(0xc42876cf00, 0x47)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkdownloader.go:81 +0x161
created by github.com/gilbertchen/duplicacy/src.CreateChunkDownloader
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkdownloader.go:86 +0x23f

This is an interesting one because it had already successfully restored over 77,000 files at this point. Maybe the issue results from the large number of threads I was using.

I also saw a bunch of messages like this:

2017-09-20 12:50:25.182 WARN CONFIG_CHUNK 190 chunks have been allocated

I suppose these are not important? It has to process the chunks in sequence, and they may arrive out of order when multiple threads are in use. I don't know why it would be holding more than 128 chunks, though.
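
Thinking about it a bit more, a single slow chunk could explain it: if chunks complete out of order but must be consumed in order, everything that finishes after the slow one has to stay allocated. A toy model of that (not Duplicacy's actual downloader code):

package main

import "fmt"

func main() {
    // Chunks complete out of order; chunk 4 is the slow one and finishes last.
    completed := []int{1, 2, 3, 5, 6, 7, 8, 9, 10, 4}
    held := map[int]bool{}
    next := 1
    for _, c := range completed {
        held[c] = true
        fmt.Printf("chunk %2d done, %d chunks held in memory\n", c, len(held))
        // Release everything that can now be consumed in order.
        for held[next] {
            delete(held, next)
            next++
        }
    }
}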


Danny    Sep 20 2:27PM 2017

I switched back to s3 and started over with 16 threads. I get the same error, on the same chunk.


Danny    Sep 20 3:13PM 2017

I think this is a sign of an undetected error when creating the initial backup to Wasabi with duplicacy. This is why:

  1. I backed up directly to Wasabi.
  2. I created a new local storage using duplicacy add -copy.
  3. I copied the config file and the snapshot files from Wasabi to the local storage thus created, so the same encryption keys would be used.
  4. I backed up directly into my new local storage. Since it used the same parameters including the same encryption keys, it created the very same chunks.
  5. Against the local storage, I ran duplicacy check and (manually) copied missing chunks from Wasabi to the local storage until duplicacy check completed successfully.
  6. Against the local storage, I ran duplicacy check with the -files option. It took days but it completed successfully with no issues. I am therefore absolutely confident that the local storage is OK.
  7. As above ... I ran duplicacy restore to restore from Wasabi into a new repository. It failed.
  8. I downloaded the failing chunk from Wasabi and compared it to the chunk from the local storage. They are different. They are exactly the same size, but their contents are completely different.
  9. I uploaded the chunk from the local storage into Wasabi.
  10. I am now running the same restore that failed. It's still going but it is past the chunk that it was getting stuck on.

The chunk on Wasabi was wrong. This could only have happened during the backup. Or after, I guess, if you think it's credible that bit rot did this on the Wasabi end.

It could be a problem with duplicacy or a library it depends on, or a problem with Wasabi.

The working chunk starts with "duplicacy\x00". The corrupt chunk does not. Adjacent chunks on Wasabi also start with "duplicacy\x00". So far this is the only one that doesn't.
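
The check itself is trivial; something like this is all it takes (a quick sketch that only looks for the "duplicacy\x00" prefix mentioned above, nothing more):

package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
    "os"
)

func main() {
    // Usage: headercheck <chunk file>
    data, err := ioutil.ReadFile(os.Args[1])
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    header := []byte("duplicacy\x00")
    if bytes.HasPrefix(data, header) {
        fmt.Println("starts with the duplicacy encryption header")
    } else {
        n := 10
        if len(data) < n {
            n = len(data)
        }
        fmt.Printf("no header; first bytes: % x\n", data[:n])
    }
}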


gchen    Sep 20 3:13PM 2017

That chunk doesn't have a valid encryption header. Can you download it from wasabi.com and check the first 10 bytes of the file? Do you have a local copy of that chunk (if you copied from local storage to wasabi)?


Danny    Sep 20 3:18PM 2017

Yes, I have both chunks; that's how I know they were different. The bad chunk starts with:

46 ac 59 b1 59 f9 17 83 08 b8

Is there a way to tell if it's a valid but unencrypted chunk? It's definitely not encrypted because there are large swaths of repeating sequences which you shouldn't see in ciphertext.