towerbr Feb 5 9:41PM 2018 CLI
I ran a test to evaluate the best configuration for the backup of an Evernote folder, using two jobs: one with a fixed chunk size of 1M and another with variable-size chunks.
I created a page with the results here: GitHub test repository
I'll improve the texts later.
Any suggestions are welcome!
Christoph Feb 6 3:59AM 2018
Thanks for the detailed report. Now we only need to find out whether this result is specific to database files or whether it is generally recommended to use 1M fixed size chunks...
towerbr Feb 6 6:44AM 2018
I'm running another test with mbox files; I should finish in 1 or 2 days, but only with 1M variable chunks, and it seems to be working fine.
Then I can run another test with mbox files comparing fixed vs. variable chunks.
gchen Feb 6 11:18AM 2018
Great work! It confirmed that fixed size chunking works better than variable size chunking for databases.
However, I wouldn't recommend fixed size chunking for general purposes, as it is susceptible to insertions and deletions. As a result, Duplicacy doesn't use the pack-and-split approach with fixed size chunking, and files smaller than the chunk size are stored in individual chunks, leading to too many chunks.
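The insertion problem described above can be illustrated with a toy sketch (this is not Duplicacy's actual code; the "cut at a marker byte" rule below is a made-up stand-in for a real rolling hash):

```python
# Toy sketch: why a one-byte insertion defeats fixed-size chunking
# but not content-defined chunking.

def fixed_chunks(data: bytes, size: int = 8):
    # split at fixed offsets 0, size, 2*size, ...
    return [data[i:i + size] for i in range(0, len(data), size)]

def cdc_chunks(data: bytes, marker: int = ord("M")):
    # stand-in for a rolling hash: cut before every byte equal to `marker`;
    # real content-defined chunking uses a rolling hash plus min/max sizes
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == marker and i > start:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

old = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ" * 2
new = old[:3] + b"x" + old[3:]          # insert a single byte near the front

fixed_shared = len(set(fixed_chunks(old)) & set(fixed_chunks(new)))
cdc_shared = len(set(cdc_chunks(old)) & set(cdc_chunks(new)))
# the insertion shifts every fixed boundary (fixed_shared == 0), while the
# content-defined boundaries resynchronize and most chunks are reused
```

With real content-defined chunking the cut points come from a rolling hash with minimum/maximum chunk sizes, but the effect is the same: boundaries resynchronize after the edit. A database that rewrites data in place, by contrast, plays to the strengths of fixed offsets.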
There are two places in the full data table where it says duplicati. Is that a typo?
Christoph Feb 6 11:51AM 2018
I don't know to what extent mbox files are also a special case, but what I am looking for are some pointers regarding the best strategy for mixed backups, i.e. ones containing "all kinds of stuff", including databases. To start with, the question is: under what conditions does it make sense to set up separate storages for database backups (i.e. exclude all database files from the "ordinary" backup with variable chunks and include them in a separate backup with fixed chunks)?
Duplicacy doesn't use the pack-and-split approach with fixed size chunking, and files smaller than the chunk size are stored in individual chunks, leading to too many chunks.
So what exactly are you saying here? That Duplicacy's approach is "leading to too many chunks"?
towerbr Feb 6 11:51AM 2018
There are two places in the full data table where it says duplicati. Is that a typo?
Yes, it was a typo, thanks for pointing it out. It was a copy-and-paste error from the previous worksheet, used in the Duplicacy vs. Duplicati test.
I split the tables to improve readability, and also created a summary of the tests on the readme page.
towerbr Feb 6 11:56AM 2018
I don't know to what extent mbox files are also a special case
mbox files are very strange: a "stack" of emails in text format, accessed through a sibling file that holds an index.
When you delete a message, it is removed from the index but not from the main file; only when you compact the file is the message actually deleted.
That is, very different from a database.
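A minimal illustration of that behavior (assumptions: a Thunderbird-style mbox with a separate index file; the messages and addresses below are made up):

```python
# Toy mbox: messages are just concatenated text blocks, each starting
# with a "From " separator line.
mbox = (
    b"From alice@example.com Thu Feb  1 10:00:00 2018\n"
    b"Subject: first\n\nhello\n"
    b"From bob@example.com Thu Feb  1 11:00:00 2018\n"
    b"Subject: second\n\nworld\n"
)

# "Deleting" the first message only updates the separate index file,
# so the mbox file itself is unchanged and a backup uploads almost nothing.

# Compacting rewrites the file without the dead message, shifting every
# remaining byte -- which is exactly what invalidates fixed-offset chunks.
compacted = mbox[mbox.find(b"From bob"):]
```

So day-to-day (append-only) changes should dedupe well, while each compaction looks like a large edit to the chunker.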
under what conditions does it make sense to set up separate storages for database backups (i.e. exclude all database files from the "ordinary" backup with variable chunks and include them in a separate backup with fixed chunks)?
I had thought of it initially, maybe a new test to be done ;-)
In the case of Evernote it may not be necessary because the database represents more than 90% of the repository, but in the case of mbox files it might be interesting.
And there is also a third subject (in my case): Veracrypt files.
towerbr Feb 16 9:17PM 2018
I published the results of the two new tests:
There's a lot of data there. ;-)
Christoph Feb 17 5:01AM 2018
Why did you use variable chunks when we already know that those don't work with databases? Or maybe I should ask: why did you use a repository with databases for the tests?
I tried to figure out what the two alternative versions of duplicacy do differently, but I couldn't. Do you understand them?
I also noticed that the file-boundaries version has had some recent "tweaks" added. You might want to check whether those affect your tests. In any case, it probably makes sense to link (also) to the exact version that you used, rather than to the current version of the branch.
@gchen I'm wondering what might be a good way of handling duplicacy's strong reactions to certain kinds of changes in the repository (i.e. the kind of changes that a database reindexing causes). What I mean here, as opposed to some of my previous questions, is how to detect and handle extreme situations, i.e. where a new snapshot has caused a significant increase in storage use.
At a very basic level, I wonder if it would be possible to make duplicacy automatically tag such snapshots for easy identification? The point would be to deliberately target the preceding snapshot with the prune command. In other words, what I'm heading for is to optimize pruning instead of (or in addition to) optimizing backup. Does that make any sense?
At a more advanced level, duplicacy could routinely compare the increase in repository size with the volume of uploaded chunks and issue a warning (or tag the snapshot) if that ratio becomes too big.
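A sketch of that heuristic (nothing like this exists in Duplicacy today; the function name and threshold are made up):

```python
# Proposed heuristic: flag a snapshot when the volume of newly uploaded
# chunks is out of proportion to the actual growth of the repository,
# e.g. after a database reindex rewrote files without adding much data.

def flag_expensive_snapshot(repo_delta_bytes: int,
                            uploaded_chunk_bytes: int,
                            threshold: float = 5.0) -> bool:
    """Return True if the upload volume exceeds `threshold` times the
    repository growth (both the name and the threshold are invented)."""
    if repo_delta_bytes <= 0:
        # repository shrank or stayed flat, yet data was still uploaded
        return uploaded_chunk_bytes > 0
    return uploaded_chunk_bytes / repo_delta_bytes > threshold

# e.g. the repository grew by 10 MB but 500 MB of chunks were uploaded
flag_expensive_snapshot(10 * 2**20, 500 * 2**20)   # → True
```

A snapshot flagged this way would be a natural candidate for the tagging-and-prune strategy described above.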
towerbr Feb 17 10:07AM 2018
Why did you use variable chunks when we already know that those don't work with databases?
Because, as you yourself said, we already know that fixed chunks are better in this case, and the purpose of the test, as I wrote there, was to compare the efficiency of the new branches. Those have to be used with variable chunks, so to make the comparison I used variable chunks in all jobs.
Why did you use a repository with databases for the tests?
That is exactly why I did two tests: one with Evernote (~ a database) and one with Thunderbird (mixed content).
Do you understand them?
Not in depth, only what is written in the #334 issue.
Christoph Feb 17 6:17PM 2018
Very interesting discussion in that issue. I learned a lot, and realized that I had assumed duplicacy was already more advanced than it is, i.e. that it already respects file boundaries.
Note, BTW, that kairisku suggested that you compare the official 1M fixed chunks with his branches' 1M variable chunks:
That makes more sense to me.
towerbr Feb 18 8:17AM 2018
[Off topic] I didn't know Hypothes.is. Liked it!
Christoph Feb 18 1:49PM 2018
I think it's quite revolutionary, actually. Or rather: the fact that web annotations are now (since about a year ago) an official W3C standard. The people at and around Hypothes.is have been driving this for years.