Deduplication FAQ – Venio Systems

What is Deduplication?

In eDiscovery, deduplication is a critical process used to eliminate duplicate files or emails from a dataset. The goal is to reduce the volume of data that legal professionals need to review, which can save time and money. Venio offers two methods to achieve this: static deduplication and dynamic deduplication.

Understanding Venio Dynamic Deduplication

Venio Dynamic deduplication identifies duplicate items within the current scope of a search without relying on the document's primary status within the system.

Deduplication Priority in Dynamic Dedupe
- Priority is based on each document's group ID, which reflects the order in which documents were processed.
- Dynamic deduplication does not consider custodian dedupe order (priority) in search. If you have set a custodian priority in your project, Dynamic Dedupe might not be suitable.
Flexibility:
- Dynamic deduplication allows for deduping data on-the-fly, adjusting to the specific scope of the search. This means that, unlike static deduplication which remains consistent based on a fixed dataset, dynamic deduplication can dedupe data differently depending on the search parameters or data subset chosen. This can be particularly useful for iterative or evolving reviews.
Adaptability to Data Changes:
- Since dynamic deduplication operates in real-time based on the selected scope, it can easily adapt to changes in the dataset, including new additions or alterations, without the need to rerun the entire deduplication process.
Granular Control:
- Dynamic deduplication can provide different results compared to static deduplication. This can be a benefit when users want more granular control over what gets deduplicated in specific searches, potentially capturing more unique items in the results.
Customizing Deduplication To Better Fit Current Scope:
- Unlike static deduplication, which might hide or flag duplicates even if they are outside of the current scope, dynamic deduplication operates only within the current scope.
Potential for More Comprehensive Results:
- Dynamic deduplication could yield more results than static deduplication, especially if not all duplicates were included in the current scope. Depending on your needs, dynamic deduplication might be more comprehensive in certain contexts, especially when combined with specific filters.
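
The scope-dependent behavior described above can be illustrated with a minimal sketch. This is not Venio's implementation: the `Doc` record, the `dynamic_dedupe` function, and the rule that the lowest group ID wins are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Doc:
    doc_id: str
    group_id: int       # assumption: lower value means processed earlier
    content_hash: str

def dynamic_dedupe(scope):
    """Keep one document per hash, chosen only from the documents in scope."""
    kept = {}
    for doc in scope:
        current = kept.get(doc.content_hash)
        # Assumed priority rule: the earliest-processed group wins.
        if current is None or doc.group_id < current.group_id:
            kept[doc.content_hash] = doc
    return sorted(kept.values(), key=lambda d: d.group_id)

docs = [
    Doc("A1", 1, "h1"),
    Doc("B1", 2, "h1"),   # duplicate of A1
    Doc("B2", 2, "h2"),
    Doc("C1", 3, "h2"),   # duplicate of B2
]

# Searching everything keeps A1 for hash h1.
full = [d.doc_id for d in dynamic_dedupe(docs)]
# Narrowing the scope so A1 is excluded lets B1 surface instead.
narrow = [d.doc_id for d in dynamic_dedupe(docs[1:])]
print(full, narrow)
```

The same document (B1) is suppressed in one search and returned in another, which is the flexibility, and the source of variability, that the points above describe.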

Understanding Static Deduplication

Venio Static deduplication is a process that identifies and hides duplicates from a fixed dataset at a particular point in time.

Static deduplication factors in the system-level primary document within the scope.

Once deduplication occurs, regardless of any new data additions or changes, the deduplicated state remains consistent unless the process is rerun.

Family Preservation:
- One of the essential features of static deduplication is that it doesn't break up document families. This means that when a document (like an email) with associated attachments is identified as a duplicate, the entire family (the email and its attachments) is treated as a unit. This ensures that related content isn't dispersed or separated during the deduplication process.
Scope:
- Static deduplication is typically run across the entire dataset, and when a user searches the entire project after deduplication, they should ideally see a consistent, deduplicated set of results.
Comparison with Dynamic Deduplication:
- When searching the entire project, results from static and dynamic deduplication may not always match due to the inherent differences in search behavior between the two methods. Additionally, the introduction of other filters, such as hascontrolnumber, can influence search results, potentially leading to discrepancies between static and dynamic deduplication outputs.
Consistency:
- Since static deduplication is performed on a fixed dataset, users can expect consistent results from search queries, provided no other intervening factors alter the dataset.
Efficiency:
- By removing duplicates, static deduplication can significantly reduce the volume of data that needs to be reviewed, leading to time and cost savings.
Family Integrity:
- The fact that it maintains the integrity of document families ensures that related items remain together, facilitating context-aware reviews.
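
As a rough sketch of the static behavior described in this section, the example below marks duplicates once across the whole dataset and then reuses that fixed marking for every search. The function names and the "first document seen is primary" rule are illustrative assumptions, not Venio's actual logic, and document families are not modeled.

```python
def mark_duplicates(dataset):
    """One-time pass over the whole dataset: later copies of a hash are duplicates."""
    seen = set()
    duplicates = set()
    for doc_id, content_hash in dataset:
        if content_hash in seen:
            duplicates.add(doc_id)
        else:
            seen.add(content_hash)   # assumption: first document seen is primary
    return duplicates

def static_search(scope, duplicates):
    """Searches reuse the stored marking; nothing is recomputed per scope."""
    return [doc_id for doc_id, _ in scope if doc_id not in duplicates]

dataset = [("A1", "h1"), ("B1", "h1"), ("B2", "h2")]
duplicates = mark_duplicates(dataset)

whole = static_search(dataset, duplicates)
# B1 stays hidden even though its primary (A1) is outside this narrower scope.
narrow = static_search(dataset[1:], duplicates)

# New data is not evaluated until the process is rerun.
dataset.append(("C1", "h2"))
before_rerun = static_search(dataset, duplicates)
after_rerun = static_search(dataset, mark_duplicates(dataset))
print(whole, narrow, before_rerun, after_rerun)
```

The marking is stable between runs, which gives consistent results, but a duplicate added later (C1) remains visible until deduplication is run again.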

How Venio Calculates Deduplication Hash

Standard MD5 and SHA-1 hashing algorithms are employed. For non-email documents, the entire file undergoes hashing. For emails, selected fields like 'subject', 'to', and 'body' determine the hash.

MD5 (Message Digest Algorithm 5) and SHA-1 (Secure Hash Algorithm 1) are cryptographic hash functions. They play a crucial role in various aspects of computer security and data verification.

For email data, the Venio hash may differ from other platforms, as each platform may utilize different methods of combining the selected email metadata to calculate the final hash value.

Venio Hashes for Non-Email Files:
- For non-email data, any non-Venio platform that calculates an MD5 or SHA-1 hash should produce exactly the same hash that Venio calculates, provided the binary data of the file has not been modified in any way.
- Suppose you have a photo, a document, or a song. When you process this file in Venio, an industry-standard MD5 or SHA-1 algorithm produces a specific hash (the "digital barcode"). Every time you hash that exact file, regardless of the file's metadata (name, date modified, etc.), you will get the same hash as long as the contents of the file are unmodified. However, if you make even a tiny edit to the file (say, change one pixel in the photo, change one letter in the document, or edit a millisecond of sound in the song) and then hash it again, the resulting hash will look entirely different. This consistent, yet highly sensitive behavior makes these algorithms very useful for checking the integrity of files. If two files have the same hash, you can be highly confident they are identical in content.
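
The behavior described above can be reproduced with any standard MD5 or SHA-1 implementation. The short sketch below uses Python's `hashlib`; the helper name `content_hashes` and the sample byte strings are hypothetical and chosen only for illustration.

```python
import hashlib

def content_hashes(data: bytes) -> dict:
    """Hash the raw bytes only; file name and timestamps are never read."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }

original = b"hello"
renamed_copy = bytes(original)   # a renamed copy has identical bytes
edited = b"hellp"                # one character changed

h_original = content_hashes(original)
h_copy = content_hashes(renamed_copy)
h_edited = content_hashes(edited)

# Identical content gives identical hashes.
print(h_original == h_copy)
# A one-byte edit produces a different hash.
print(h_original["md5"] != h_edited["md5"])
```

Because only the file's bytes are hashed, a value computed this way should match the hash reported by any other platform for the same unmodified non-email file.
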
Choosing a Hashing Algorithm
- While setting up a project, you can opt for either MD5 or SHA-1. Additionally, a secondary hash algorithm can be calculated for flexibility.
- In advanced settings, you can also specify which email fields to use for hash generation.
Fields for Generation of Email File Hashes
- Attachment Name
- BCC
- CC
- From
- Subject
- To
- Sent Date
- Attachment CRC Hash:
  - A cyclic redundancy check (CRC) is similar to MD5 in that it reveals alterations in digital data. When an email hash is generated, attachments are not hashed with MD5 or SHA-1. Instead, a CRC32 algorithm is used to create a hash of each attachment. This CRC result is combined with the other fields selected for email hash generation when calculating the final MD5 or SHA-1 hash of the email.
    - Please note: The CRC32 hash is only used in the specific case of generating an email hash. When an attachment is extracted and ingested in Venio, it is still hashed with either MD5 or SHA-1, depending on the project settings.
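
To make the idea of a field-based email hash concrete, here is a minimal sketch under stated assumptions. The way fields are ordered, joined, and encoded below is hypothetical; Venio's actual method of combining them is not documented here, which is exactly why email hashes can differ between platforms. The `email_hash` function and the sample values are illustrative only.

```python
import hashlib
import zlib

def email_hash(fields, attachments, algorithm="md5"):
    """Hypothetical combiner: selected fields plus one CRC32 per attachment."""
    # Sort field names so the result does not depend on dict ordering.
    parts = [f"{name}={fields[name]}" for name in sorted(fields)]
    for payload in attachments:
        # Attachments contribute a CRC32 value, not an MD5/SHA-1 digest.
        parts.append(f"attachment_crc={zlib.crc32(payload):08x}")
    digest = hashlib.new(algorithm)   # "md5" or "sha1"
    digest.update("\n".join(parts).encode("utf-8"))
    return digest.hexdigest()

fields = {
    "From": "alice@example.com",
    "To": "bob@example.com",
    "Subject": "Q3 report",
    "Sent Date": "2023-10-01T09:00:00Z",
    "Attachment Name": "report.pdf",
}

base = email_hash(fields, [b"report contents"])
same = email_hash(dict(fields), [b"report contents"])
changed_attachment = email_hash(fields, [b"report content!"])
changed_subject = email_hash({**fields, "Subject": "Q3 report (final)"},
                             [b"report contents"])
sha1_version = email_hash(fields, [b"report contents"], algorithm="sha1")

print(base == same, base != changed_attachment, base != changed_subject)
```

Removing a field from `fields` models deselecting it in advanced settings: two emails that differ only in that field would then share a hash.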

Potential Drawbacks

Potential Drawbacks of Static Deduplication

1. Inflexibility: Once static deduplication is applied to a dataset, it remains consistent and doesn't easily adapt to changes or additions. If new data is added to the project that might contain duplicates not originally deduplicated, it won't automatically be deduplicated unless the process is rerun.

2. Rerun Requirement: If the dataset undergoes significant changes, the entire deduplication process might need to be rerun, which can be time-consuming.

3. Potential Overlook of Relevant Data: Static deduplication could, in some cases, inadvertently hide or remove documents that could become relevant later, especially if they were deemed as duplicates during an initial assessment but gain importance in a different context.

4. Inconsistencies in New Searches: When new data is added to a dataset, and if a new deduplication process isn't performed, it can lead to inconsistencies in search results, especially if duplicates exist within the new data.

Potential Drawbacks of Dynamic Deduplication

1. Lack of Consistency: Because dynamic deduplication operates based on the current scope of a search, results can vary depending on search parameters. This variability might confuse users or lead to perceived inconsistencies.

2. Complexity: Due to its nature, dynamic deduplication might require a deeper understanding of the dataset, the search scope, and search parameters.

3. Custodian Priority Issues: As noted earlier in this article, dynamic deduplication does not consider custodian dedupe order (priority). This can be problematic in cases where custodian priority is critical to the eDiscovery process.

4. Potential for Misinterpretation: Given that dynamic deduplication can yield different results based on search parameters, there's a risk that users might misinterpret or misunderstand the results, especially if they're not familiar with the intricacies of dynamic deduplication.