Near Duplicates and Similar Documents – Venio Systems

Near Duplicates and Similar Documents are two different options available in Venio. This article explains how each is computed and how they differ.

Near Duplicates

Near Duplicate computation in Venio starts when we run 'Compute Near Duplicate' from the quick links. Clicking this option creates a job that queues files for near duplicate detection.

To figure out which documents are near duplicates, look at the columns 'Near Duplicate Group Internal Fileid', 'Near Duplicate Similarity Percentage', and 'IsCentroid Ndd'.

Consider the screenshot below:

Look at the document whose internal file ID is 2: its 'Near Duplicate Similarity Percentage' is 0 and its 'Near Duplicate Group Internal Fileid' is 2. This means it is the source document against which the other documents are compared when computing near duplicates.

Now look at internal file ID 3: its 'Near Duplicate Similarity Percentage' is 99 and its 'Near Duplicate Group Internal Fileid' is 2. This means the file with internal file ID 3 is a near duplicate of the file with internal file ID 2, and is 99% similar to it.

In addition, for the files with file IDs 9, 10, 12, and 13 in the above screenshot, the three fields 'Near Duplicate Group Internal Fileid', 'Near Duplicate Similarity Percentage', and 'IsCentroid Ndd' are not populated at all. This is because these documents do not meet the threshold for computing near duplicates.
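The three cases described above can be summarized in a short Python sketch. It is only an illustration of how to read the columns, not Venio code: the dictionary keys (file_id, group_file_id, similarity_pct) are hypothetical stand-ins for 'Internal Fileid', 'Near Duplicate Group Internal Fileid', and 'Near Duplicate Similarity Percentage', and the sample rows mirror the values discussed for file IDs 2, 3, and 9. The 'IsCentroid Ndd' column is left out because its values are not described here.

```python
from typing import Optional


def classify(row: dict) -> str:
    """Interpret one row of near duplicate columns (hypothetical key names)."""
    group_id: Optional[int] = row["group_file_id"]
    similarity: Optional[int] = row["similarity_pct"]

    # Unpopulated fields: the document did not meet the threshold.
    if group_id is None or similarity is None:
        return "not computed"

    # The group ID points at the document itself: it is the source document.
    if group_id == row["file_id"]:
        return "source"

    # Otherwise the document is a near duplicate of the group's source.
    return f"near duplicate of {group_id} ({similarity}%)"


# Sample rows mirroring the values discussed in the article.
rows = [
    {"file_id": 2, "group_file_id": 2, "similarity_pct": 0},
    {"file_id": 3, "group_file_id": 2, "similarity_pct": 99},
    {"file_id": 9, "group_file_id": None, "similarity_pct": None},
]

results = {row["file_id"]: classify(row) for row in rows}
for file_id, label in results.items():
    print(file_id, "->", label)
```

Reading the grid this way makes it easy to tell a group's source document from its near duplicates and from documents that were never queued.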

The table to look up for near duplicate files is tbl_jb_NDDJob.

Similar Documents

Similar documents in Venio are computed using a third-party library called 'Lucene'. Unlike Near Duplicates, nothing has to be run manually for them to populate. Once the files are ingested and indexed, the similar documents are available in the project.

If we click the settings icon under Similar, we can use the slider to show only documents at or above a minimum similarity score percentage, either across all the documents in the project or within the selected scope.

To calculate similar documents, the Lucene engine uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given document is to a user's query. In general, the idea behind the VSM is that the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. The Boolean model is applied first, to narrow down the documents that need to be scored based on the Boolean logic in the query specification.
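The two-stage idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Lucene's actual scoring formula and not Venio's implementation: the Boolean step is a plain OR match on query terms, the weights are a basic TF-IDF variant, and the function names, sample documents, and query are all made up for the example.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()


def tfidf_vector(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Weight each term by its count in the text times its rarity in the collection."""
    counts = Counter(tokens)
    return {term: counts[term] * idf.get(term, 0.0) for term in counts}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, docs: dict[str, str]) -> list[tuple[str, float]]:
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}

    # Inverse document frequency: terms found in fewer documents weigh more.
    doc_freq = Counter(t for toks in tokenized.values() for t in set(toks))
    idf = {t: math.log(len(docs) / df) + 1.0 for t, df in doc_freq.items()}

    query_tokens = tokenize(query)
    query_vec = tfidf_vector(query_tokens, idf)

    # Boolean step: keep only documents that contain at least one query term.
    candidates = [d for d, toks in tokenized.items() if set(query_tokens) & set(toks)]

    # Vector space step: score the surviving candidates against the query.
    scores = {d: cosine(query_vec, tfidf_vector(tokenized[d], idf)) for d in candidates}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "a": "contract renewal terms contract",
    "b": "lunch menu for friday",
    "c": "renewal notice",
}
ranking = rank("contract renewal", docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

In this toy collection, document "b" shares no term with the query, so the Boolean step drops it before any scoring happens. Document "a" ranks above "c" because it contains both query terms, including the rarer one twice.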

Note: There can be documents which are similar but do not have near duplicates. This is because a document must meet a threshold to be queued for the Near Duplicate job before its near duplicates can be computed. Screenshots attached for reference:

Near Duplicates

Similar Documents
