In VenioFPR, a search looks for a specific term or phrase within the designated media scope. Depending on the query, VenioFPR runs the search against either the Lucene index or the SQL project database. VenioFPR uses Lucene version 2.1.0.3 to index both document full text and metadata.
A query is broken down into terms and operators. There are two types of terms:
Single Terms: A single term is a single word, such as "test" or "hello".
Phrases: A phrase is a group of words enclosed in double quotes, such as "hello dolly".
Multiple terms or phrases can be combined using Boolean operators to form more complex queries.
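The breakdown of a query into single terms, phrases, and operators can be illustrated with a short Python sketch. This is not VenioFPR's or Lucene's actual query parser; the function name `split_query` is hypothetical, and the operator keywords `AND`, `OR`, and `NOT` are an assumption, since the article does not list the supported Boolean operators.

```python
import re

# Matches either a double-quoted phrase or a run of non-space characters.
_PART = re.compile(r'"([^"]*)"|(\S+)')

# Assumed operator keywords; the article does not enumerate them.
OPERATORS = {"AND", "OR", "NOT"}


def split_query(query):
    """Return (kind, text) pairs, where kind is 'term', 'phrase', or 'operator'."""
    parts = []
    for phrase, word in _PART.findall(query):
        if phrase:
            parts.append(("phrase", phrase))
        elif word in OPERATORS:
            parts.append(("operator", word))
        elif word:
            parts.append(("term", word))
    return parts


print(split_query('test AND "hello dolly"'))
```

A phrase stays together as one unit, while unquoted words become individual terms.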
VenioFPR employs a standard analyzer for indexing. Common words like "a," "the," "an," and others, which are typically regarded as noise words by the standard analyzer, are not omitted during indexing and thus remain searchable. Nevertheless, certain symbols or characters are omitted during indexing and, as a result, are not searchable.
Characters Excluded During Indexing
Venio FPR™ treats the following characters as stop words and excludes them during indexing, rendering them unsearchable. Below is the current list of special characters:
Notes Regarding Special Characters
1. Acronyms
In the case of “A.B.C”, the periods are not skipped and the term is indexed as “A.B.C”. However, “A.B.C.” is not treated as an acronym and is therefore indexed as “abc”. The trailing period is what makes the indexer interpret the two examples differently.
2. Web Addresses
Web addresses, such as www.google.com, are indexed as is (www.google.com).
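The trailing-period behavior described for acronyms, which the summary table also lists for web addresses (“www.google.com.” indexed as “wwwgooglecom”), can be approximated in a few lines of Python. This is only a sketch of the documented outcomes, not the analyzer's real logic. The function name `index_dotted_token` is hypothetical, and lowercasing is an assumption inferred from examples such as “A.B.C.” being indexed as “abc”.

```python
def index_dotted_token(token):
    """Approximate how a period-separated token is indexed.

    Assumption: tokens are lowercased. A trailing period causes every
    period to be dropped; otherwise the periods are kept.
    """
    token = token.lower()
    if token.endswith("."):
        return token.replace(".", "")
    return token


for sample in ["A.B.C", "A.B.C.", "www.google.com", "www.google.com."]:
    print(sample, "->", index_dotted_token(sample))
```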
3. Email Addresses
Venio FPR™ will not skip the underscore (_), hyphen (-) or period (.) in an email address. An email address such as name_surname@veniosystems.com will be indexed as is.
However, if an email address contains characters other than “_”, “-” or “.” in the local part, or characters other than “.” and “-” in the domain name, the address is split into separate tokens.
For example:
“name/surname@hotmail.com” will be indexed as “name” and “surname@hotmail.com” separately.
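The email rule can be sketched in Python as follows. This is an approximation of the behavior described here, not the indexer's actual implementation. The function name `index_email_text` and both regular expressions are hypothetical, and lowercasing is an assumption. The sketch only covers text made of email addresses and surrounding words.

```python
import re

# Local part: letters, digits, "_", "-", "."; domain: letters, digits, ".", "-".
EMAIL = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9.-]+")
WORD = re.compile(r"[A-Za-z0-9]+")


def index_email_text(text):
    """Keep email-shaped runs whole; split everything else into plain words."""
    text = text.lower()  # assumed: the index is case-insensitive
    tokens = []
    pos = 0
    for match in EMAIL.finditer(text):
        tokens.extend(WORD.findall(text[pos:match.start()]))
        tokens.append(match.group())
        pos = match.end()
    tokens.extend(WORD.findall(text[pos:]))
    return tokens


print(index_email_text("name_surname@veniosystems.com"))
print(index_email_text("name/surname@hotmail.com"))
```

Because “/” is not allowed in the local part, the second address is split at the slash, which matches the example above.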
4. Punctuation Ending Sentences
Punctuation ending a sentence is not included in the index. For example, if a document contains the sentence “And?”, only “And” is indexed and the question mark (?) is skipped.
5. Numbers & Punctuation
A. Decimal Points: A decimal point inside a number, as in “1.2”, is NOT treated as a period, so the number is indexed as “1.2”.
B. Other Numerical Related Characters: Apart from decimal points, other characters used with numbers are skipped during indexing. For example, the character “=” in the query “1=2” is not indexed, so the search is reduced to (1 2). Likewise, the “&” in “1&2” is skipped, so that search is also reduced to (1 2). The summary table lists exceptions such as “1/2” and “1-2”, which are indexed as is.
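A small Python sketch can illustrate the decimal-point rule and the handling of “=” and “&”. It covers only the purely numeric examples in this section and is not the analyzer's real tokenizer; the function name `index_numbers` and the regular expression are hypothetical.

```python
import re

# A run of digits, optionally followed by ".digits" groups (decimal points kept).
NUMBER = re.compile(r"\d+(?:\.\d+)*")


def index_numbers(text):
    """Return the numeric tokens in text; characters such as = and & are dropped."""
    return NUMBER.findall(text)


for sample in ["1.2", "1=2", "1&2"]:
    print(sample, "->", index_numbers(sample))
```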
6. Company Name
Words in the formats below are treated as company names and indexed as is.
<Alphabet> @ <alphabet> and <Alphabet> & <alphabet>
Example 1:
AT&T and C@t are indexed as is.
Numbers or a mix of alpha-numeric characters are not treated the same way and are split by the indexer.
Example 2:
a@2 is indexed as 2 and a (as two separate tokens). 1@2 is indexed as 1 and 2 separately. A3@a is indexed as a3 and a separately.
However, this behavior does not apply to terms in the format <alphabet>@ or <alphabet>&.
Example 3:
abc@ is indexed as is.
At& is indexed as is.
If there is a space between the characters, the analyzer splits the terms at the white space.
Example 4:
“at & t” is indexed as “at” and “t” (two separate tokens).
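The four company-name examples can be reproduced with the following Python sketch. It is an approximation built from the examples in this section, not the analyzer's real grammar. The names `index_company_text`, `COMPANY`, and `TRAILING` are hypothetical, and lowercasing is an assumption inferred from examples such as “A3@a” being indexed as “a3” and “a”. The trailing forms follow Example 3 as written.

```python
import re

COMPANY = re.compile(r"[A-Za-z]+[@&][A-Za-z]+")  # letters, @ or &, letters
TRAILING = re.compile(r"[A-Za-z]+[@&]")          # letters followed by @ or &
WORD = re.compile(r"[a-z0-9]+")


def index_company_text(text):
    """Split on white space, keep company-style tokens whole, split the rest."""
    tokens = []
    for raw in text.split():
        if COMPANY.fullmatch(raw) or TRAILING.fullmatch(raw):
            tokens.append(raw.lower())  # assumed lowercasing
        else:
            tokens.extend(WORD.findall(raw.lower()))
    return tokens


for sample in ["AT&T", "a@2", "A3@a", "abc@", "at & t"]:
    print(sample, "->", index_company_text(sample))
```

The sketch returns split tokens in left-to-right order; the order in which the indexer stores them is not modeled.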
7. Letters and numbers
Letters and numbers separated by the characters “_”, “-”, “/”, “.” or “,” are indexed as is.
For example:
A-2 is indexed as is.
2-A is indexed as is.
A/2 or a-2 are indexed as is.
BUT
a/b is indexed with “a” and “b” as separate tokens.
ab/ is indexed as “ab”.
2-4 is indexed as is.
2*4 is indexed as “2” and “4” (separate tokens).
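The examples above suggest a simple rule of thumb: a joined token is kept whole when it contains at least one digit, and split otherwise. The Python sketch below encodes that simplified reading. It is not the analyzer's actual rule and does not model the acronym or web address cases. The function name `index_joined_token` is hypothetical, and lowercasing is an assumption.

```python
import re

JOINERS = "_-/.,"
NOT_ALLOWED = re.compile(r"[^A-Za-z0-9_\-/.,]+")  # any other character splits
JOINER_RUN = re.compile(r"[_\-/.,]+")


def index_joined_token(token):
    """Keep joined letter/number tokens whole if they contain a digit."""
    tokens = []
    for chunk in NOT_ALLOWED.split(token.lower()):  # assumed lowercasing
        chunk = chunk.strip(JOINERS)  # leading or trailing joiners are dropped
        if not chunk:
            continue
        if any(ch.isdigit() for ch in chunk):
            tokens.append(chunk)
        else:
            tokens.extend(p for p in JOINER_RUN.split(chunk) if p)
    return tokens


for sample in ["A-2", "2-A", "A/2", "a/b", "ab/", "2-4", "2*4"]:
    print(sample, "->", index_joined_token(sample))
```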
8. Internal apostrophes
Internal apostrophes are not skipped, except in the case of <apostrophe>s.
For example:
O’reilly is indexed as o’reilly
BUT
O’reilly’ is indexed as o’reilly (the trailing apostrophe is skipped)
oreilly’s is indexed as “oreilly” only (’s is skipped entirely)
You’re is indexed as is.
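The apostrophe handling can be sketched in Python as shown below. This is an illustration of the examples in this section rather than the analyzer's implementation; the function name `index_apostrophe_word` is hypothetical, and lowercasing is an assumption based on “O’reilly” being indexed as “o’reilly”. Both the straight (') and curly (’) apostrophes are accepted.

```python
APOSTROPHES = "'’"


def index_apostrophe_word(word):
    """Drop a trailing apostrophe or a trailing <apostrophe>s; keep the rest."""
    word = word.lower().rstrip(APOSTROPHES)  # assumed lowercasing
    if len(word) >= 2 and word[-1] == "s" and word[-2] in APOSTROPHES:
        word = word[:-2]
    return word


for sample in ["O’reilly", "O’reilly’", "oreilly’s", "You’re"]:
    print(sample, "->", index_apostrophe_word(sample))
```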
Summary of Special Character Handling
| Category | Term | Indexed As | Comments |
|---|---|---|---|
| Acronyms | A.B.C | A.B.C | |
| | A.B.C. | ABC | A.B.C. is not treated as an acronym. |
| Web Addresses | www.google.com | www.google.com | Web addresses are indexed as they are (periods are not skipped). |
| | www.google.com. | wwwgooglecom | Periods are skipped because of the trailing period. |
| | http://www.google.com | http and www.google.com are indexed separately | |
| Email Addresses | name_surname@veniosystems.com | name_surname@veniosystems.com | The underscore (_), hyphen (-) or period (.) in an email address is not skipped, so email addresses are indexed as they are. |
| Punctuation | text! | text | Punctuation ending a sentence is not included in the index. |
| Decimal Points | 1.2 | 1.2 | Decimal points in numbers are NOT treated as periods and are not skipped. |
| Other Numerical Related Characters | 1=2, 1+2, 1&2 | 1 and 2 are indexed separately in all cases | (=), (+) and (&) are skipped. |
| | 1/2 | 1/2 | Considered a fraction, so indexed as is. |
| | 1. 1-2 2. A-B | 1. 1-2 2. A and B are indexed separately | |
| Numbers and letters indexing, special cases | | | 1. Indexed as is, since this is considered a single word. 2. Numbers are split in this case. 3. Considered a fraction, so indexed as is. 4. A slash splits the characters in the case of letters. 5. Numbers with “\” are not fractions; “\” is an escape character, so it is skipped. |
Summary of Search Examples Containing Special Characters:
| Search Term | Result | Comments |
|---|---|---|
| “abc” | | |
| “a b c” | | |
| “a?b?c” | | |
| “a.b.c.” | | |
| “a.b.c” | | |
| http://www.google.com | | |
| “ab&” | | |