Yesterday in the Twitterverse, there were some concerns over whether the DAOS Estimator Tool is accurate, so I decided to do a quick test. I noticed today that v1.2 of the Tool was released on February 19, 2009, so I ran that version against a single database, and I can see where the concerns may be coming from.
The concern was that the Estimator Tool was only looking at the file name to determine if the attachment was a duplicate. Here's what I did to test this concern:
- Created a new mail file
- Created 2 draft messages that had the same 699 KB attachment (attach.ppt)
- Created a 3rd draft message with an 880 KB attachment (also named attach.ppt). This was a completely different file from the start - it just has the same name.
The results were right in the main file analysis section - you can see that it saw 3 files in the DB with 1 duplicated file resulting in 2 DAOS files. So far, so good. But the problem is with the section I have outlined in red below. As you can see, there are two lines that have the same text ("Total Duplicate Attachments found"). This isn't right.

I went back to review some previous results (gathered from v1.1 of the Tool) and can see that the second line should really be "Total DAOS Eligible Attachments" (see image below). This explains some of the confusion with the above image. With the v1.1 wording applied to the v1.2 results, the numbers are right: only 1 attachment is duplicated, but 3 attachments in total will be moved over into DAOS.

I've reported this to IBM, so hopefully this wording will get corrected in a v1.3. I think we can feel safe that the Tool is working just like DAOS will work in properly determining if an attachment is really a duplicate or if it just has the same file name. Though I certainly welcome any feedback otherwise!
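The numbers from this test can be reproduced with a small sketch. It groups the three attachments by an identity key and counts the groups. The key used here, (file name, size), is an assumption chosen only because it is enough to tell the two files in this test apart; the records and the `summarize` helper are hypothetical and say nothing about how the Tool or DAOS works internally.

```python
from collections import Counter

# Hypothetical records for the three draft messages: (file name, size in KB).
attachments = [
    ("attach.ppt", 699),
    ("attach.ppt", 699),
    ("attach.ppt", 880),
]

def summarize(items):
    """Count total attachments, duplicates, and resulting unique objects."""
    groups = Counter(items)          # one entry per distinct key
    total = sum(groups.values())     # every attachment in the database
    unique = len(groups)             # objects that would be stored once
    return {
        "total": total,
        "duplicates": total - unique,
        "daos_files": unique,
    }

print(summarize(attachments))
# {'total': 3, 'duplicates': 1, 'daos_files': 2}
```

This matches the report: 3 attachments are eligible, 1 of them is a duplicate, and 2 files end up in DAOS.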
Comments (4)
The second "Total Duplicate Attachments found" line does precede the accurate statistic that was formerly labeled "Total DAOS Eligible Attachments." I've created SPR CSCT7PKM8R to correct this. The big thing to remember is that the data is still good. Just remember that the second line is the number of DAOS-eligible attachments.
The problem is not just the wording. The method for determining if an attachment is duplicated IS flawed. You're right that it is using the file name and the size, but that's it. In my case the attachments are auto-generated PDFs and almost all of them have the same size but never the same creation date/time, and this is not being taken into account.
Here's a line from the verbose output:
Attachment Name    Size    Compr Size    Compr Type    Refcount
CUSTOMER.pdf       4218    4218          None          4
All these files are different.
Just a note - I have seen what Vitor is referring to in his comment on my initial post and have verified it: when two attachments have the same file name and the same size in the DAOS Estimator report, the Estimator treats them as the same object. But after I enabled DAOS on the database and compacted it, DAOS treated the two attachments as separate objects even though the Tool did not see them as different.
This means that you may save less space than the Estimator predicts, because of edge cases where attachments have the same name and physical file size but different internal data. In Vitor's case with auto-generated files, the results may be very skewed. Again, each environment is different and YMMV. IBM is already aware of these findings. :)
DAOS Estimator only compares the file name and file size to identify duplicates. This was a deliberate decision to maximize performance. The tool is only meant to estimate disk savings and not to provide a 100-percent accurate report.
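To make the trade-off concrete, here is a minimal sketch that contrasts the two ways of spotting duplicates on data shaped like Vitor's example. The function names and the generated files are hypothetical, and SHA-256 is only a stand-in for "comparing actual content"; nothing here is taken from the Estimator or from DAOS itself.

```python
import hashlib

def count_by_name_and_size(files):
    """Unique objects when only (file name, size) is compared."""
    return len({(name, len(data)) for name, data in files})

def count_by_content(files):
    """Unique objects when the bytes themselves are compared (via a hash)."""
    return len({hashlib.sha256(data).digest() for _, data in files})

# Four hypothetical auto-generated PDFs: same name, same 4218-byte size,
# but different bytes inside each one.
files = [("CUSTOMER.pdf", b"%PDF-" + bytes([i]) * 4213) for i in range(4)]

print(count_by_name_and_size(files))  # 1
print(count_by_content(files))        # 4
```

The name-and-size check never has to read the attachment body, which is why it is fast, but in this case it reports one shared object where a content comparison finds four, so the estimated savings come out too high.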