Search in Results
There are three types of queries you can use to search in your results, that is, within and across the articles or documents that QUOSA has retrieved. The three types are Boolean, Regular Expression, and Left Truncation. Detailed syntax for each type of query is described later in this document.
Which query type is best for my task?
Boolean: use it for basic and advanced keyword and phrase searching. It supports logic operators in the query (AND, OR, NOT) plus wildcards, proximity limits, parentheses to group query words, subscripts and superscripts, and more. For multiple-word queries, Boolean assumes an OR between terms unless you specify otherwise. Boolean search is the fastest way to search through multiple documents.
For example, "statin extracted"~5 will find all documents in which the words "statin" and "extracted" occur within 5 words of each other, in either direction.
Regular Expression: unlike Boolean, which searches on whole words only, Regular Expression searches through the text character by character. It can be very powerful, but it may be the least familiar to you.
Use it to search for a specific character string, say an amino acid sequence that may appear in an article as part of a longer strand and would therefore be missed by a Boolean query. You can also search on symbols: for example, FcRN-/- can be found by Regular Expression, but not by Boolean. You can also use it to find all articles where more than, say, 500 people are enrolled in a clinical trial.
A regular expression, often called a pattern, is an expression that describes a set of strings. Patterns are usually used to give a concise description of a set without having to list all of its elements. For example, the set containing the three strings Handel, Händel, and Haendel can be described by the pattern "H(ä|ae?)ndel" (in other words, the pattern matches each of the three strings).
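As a rough illustration outside QUOSA, the Handel pattern can be tried with Python's re module. Python is used here only as an assumption for demonstration; QUOSA's own engine may differ in details, and the extra name "Hendel" is a made-up negative case.

```python
import re

# Pattern from the text: "H", then either "ä" or "a" with an optional "e", then "ndel".
pattern = re.compile(r"H(ä|ae?)ndel")

names = ["Handel", "Händel", "Haendel", "Hendel"]

# Keep only the names that the pattern matches in full.
matched = [name for name in names if pattern.fullmatch(name)]
print(matched)
```

One pattern stands in for all three spellings, which is the sense in which a pattern describes a set of strings.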
Left Truncation: use it when you know the end of the word that you seek, but the beginning of the word is unknown, ambiguous, or can vary.
For example, to find beta blockers such as "propranolol" and "atenolol," the following query can be used: *olol
How do I construct my query?
A Boolean query is made up of terms and operators. A term can be a single word or a phrase.
A single term is a single word, such as "protein" or "acid."
A phrase is a group of words surrounded by double quotes, such as "heat shock."
Multiple terms can be combined with Boolean operators to form complex queries. The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, the OR operator is used. The OR operator links two terms and finds a document if either of the terms exists in it.
The terms in a query are NOT case-sensitive.
1.1. Operators
Terms in Boolean queries can be
combined using logic operators (OR, AND, "+", NOT, and "-").
Note: The OR, AND, and NOT operators must be entered in CAPS.
OR (or the || symbol)
The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, the OR operator is used. The OR operator links two terms and finds a document if either of the terms exists in it. The symbol || (two bars) can be used in place of the word OR.
For example, to search for documents that contain either "heat shock" or just "heat," specify the search expression as follows:
"heat shock" heat
or
"heat shock" OR heat
AND (or the && symbol)
The AND operator finds documents where both terms exist anywhere in the text. The symbol && (two ampersands) can be used in place of the word AND.
For example, to search for documents that contain "heat shock" and "heat protein," specify the search expression as follows:
"heat shock" AND "heat protein"
+ (plus sign)
The "+" operator (known as the required operator) requires that the
term after the plus sign exist somewhere in a document.
For example, to search for
documents that must contain
"heat" and may contain "shock," specify the search expression
as follows:
+heat shock
NOT (or the ! symbol)
The NOT operator excludes documents that contain the term after NOT. The symbol "!" (exclamation point) can be used in place of the word NOT.
To search for documents that contain "heat shock" but not "heat protein," specify the search expression as follows:
"heat shock" NOT "heat protein"
Note: The NOT operator cannot be used with just one term. For example, the following search will return no results:
NOT "heat shock"
As a workaround, QUOSA adds the single word exclusionpattern to every document it indexes. As a result, to find all documents that do not contain the phrase "heat shock," you can use this query:
exclusionpattern NOT "heat shock"
- (minus sign)
The "-" (minus sign) or prohibit operator excludes documents that
contain the term after the minus symbol.
For example, to search for
documents that contain "heat shock" but not "heat protein," specify the search expression as
follows:
"heat shock" - "heat protein"
Parentheses can be used to group terms to form sub-queries, which can be very useful to control Boolean logic in a query.
For example, to search for documents that contain "protein" and either "heat" or "shock," specify the search expression as follows:
(heat OR shock) AND protein
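The logic of the grouped query can be sketched in a few lines of Python. This is only a model of the Boolean logic under the assumption that a document is treated as a set of lower-cased whole words; the function names and sample documents are hypothetical, and it says nothing about how QUOSA evaluates queries internally.

```python
import re

def words(document):
    """Return the lower-cased set of whole words in a document."""
    return set(re.findall(r"\w+", document.lower()))

def grouped_query(document):
    """Evaluate (heat OR shock) AND protein against one document."""
    w = words(document)
    return ("heat" in w or "shock" in w) and "protein" in w

docs = [
    "Heat stress alters protein folding",   # heat + protein
    "Septic shock without biomarkers",      # shock only
    "Protein expression in yeast",          # protein only
]
results = [grouped_query(d) for d in docs]
print(results)
```

Only the first document satisfies both halves of the query; the parentheses make sure the OR is resolved before the AND.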
There are two wildcard
characters that can be used in Boolean queries. They are as follows:
* (asterisk symbol)
An asterisk (*) may be used to specify zero or more alphanumeric characters. For example, searching for the term h*s would find results that contain words such as “his,” “homes,” and “herbaceous.”
? (question mark symbol)
The question mark (?) may be used to represent a single alphanumeric character in a search expression. For example, searching for the term “ho?se” would find results that contain words such as “house” and “horse.”
Note: You cannot use * or ? as the first character of any term in a search expression. The following two queries cannot be used in Boolean search:
*ice
capsule AND *activity
See the Left Truncation query type below if you want to search this way.
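The two wildcards behave much like shell-style patterns, so their effect can be approximated with Python's standard fnmatch module. This is a sketch under that assumption: fnmatch lets * and ? match any character, not only alphanumeric ones, and the extra words in the list are made up for contrast.

```python
from fnmatch import fnmatchcase

words = ["his", "homes", "herbaceous", "house", "horse", "hose", "hoarse"]

# "*" stands for zero or more characters.
star = [w for w in words if fnmatchcase(w, "h*s")]

# "?" stands for exactly one character.
question = [w for w in words if fnmatchcase(w, "ho?se")]

print(star)
print(question)
```

Note that "hose" and "hoarse" are not matched by ho?se, because the question mark must consume exactly one character.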
1.4. Fuzzy Searches
A fuzzy search can be used to
find words similar in spelling. To create a fuzzy search, add the "~" (tilde) symbol at the end of a single-word term. For example, to search for a term similar in
spelling to "roam," specify the fuzzy search as follows:
roam~
This search will find terms
such as “foam” and “roams.”
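One common way to measure "similar in spelling" is edit distance, the number of single-character changes needed to turn one word into another. The sketch below assumes that measure and a hypothetical threshold of one change; this document does not state how QUOSA scores similarity, so treat it as an illustration of the idea only.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

candidates = ["foam", "roams", "room", "protein"]

# Hypothetical rule: accept words at most one edit away from "roam".
close = [w for w in candidates if edit_distance("roam", w) <= 1]
print(close)
```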
1.5. Proximity Searches
A proximity search can be used to find words that are within a specific distance of each other. To create a proximity search, enclose the words in double quotes and add the "~" (tilde) symbol followed by the maximum distance after the closing quote. For example, to search for the words "heat" and "shock" within 10 words of each other in a document, specify the search as follows:
"heat shock"~10
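The idea behind a proximity search can be sketched as a comparison of word positions. The function below is a hypothetical illustration: it assumes that distance means the difference between word positions and that order does not matter, which may not be exactly how QUOSA counts.

```python
import re

def within_distance(text, first, second, max_gap):
    """True if first and second occur within max_gap word positions
    of each other, in either order."""
    tokens = re.findall(r"\w+", text.lower())
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(abs(i - j) <= max_gap
               for i in first_positions
               for j in second_positions)

text = "Shock proteins are induced when cells are exposed to heat"

near = within_distance(text, "heat", "shock", 10)  # positions 9 and 0
far = within_distance(text, "heat", "shock", 5)
print(near, far)
```

In the sample sentence the two words are nine positions apart, so a limit of 10 finds them and a limit of 5 does not.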
1.6. Searching for an expression with sub or super script
Documents containing expressions with subscripts or superscripts can be found as follows:
If the keyword of interest has a subscript, the subscript must be surrounded by [SB]xxx[SB] tags in the search query.
If the keyword of interest has a superscript, the superscript must be surrounded by [SP]xxx[SP] tags in the search query.
For example, you need to enter
[SP]14[SP]C to search for ¹⁴C
10[SP]4[SP] to search for 10⁴
CO[SB]2[SB] to search for CO₂
p27[SP]kip1[SP] to search for p27ᵏⁱᵖ¹
Subscripts and superscripts are most reliably searchable in HTML versions of full articles, rather than PDFs.
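If queries are assembled by a script, the tags can be added with two small helpers. The function names below are hypothetical and not part of QUOSA; the sketch only reproduces the tag format described above.

```python
def superscript(text):
    """Wrap text in the [SP] tags used for superscripts."""
    return f"[SP]{text}[SP]"

def subscript(text):
    """Wrap text in the [SB] tags used for subscripts."""
    return f"[SB]{text}[SB]"

queries = [
    superscript("14") + "C",       # carbon-14
    "10" + superscript("4"),       # ten to the fourth power
    "CO" + subscript("2"),         # carbon dioxide
    "p27" + superscript("kip1"),
]
print(queries)
```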
1.7. Escaping Special Characters
Boolean search supports
escaping special characters that are part of the search syntax. The current
list of special characters includes the following:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
To escape these characters, use
the “\” (backslash symbol) before
the character. For example, to search for (1+1):2, specify the search
expression as follows:
\(1\+1\)\:2
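When search text comes from another program, the escaping can be automated. The helper below is a hypothetical sketch, not a QUOSA function; it assumes that the two-character symbols && and || can be escaped one character at a time.

```python
# Every single character that appears in the special-character list.
# Assumption: "&&" and "||" are handled by escaping each "&" and "|".
SPECIAL_CHARACTERS = set('+-&|!(){}[]^"~*?:\\')

def escape_query(text):
    """Put a backslash in front of every special character."""
    return "".join("\\" + ch if ch in SPECIAL_CHARACTERS else ch
                   for ch in text)

escaped = escape_query("(1+1):2")
print(escaped)
```

For the input (1+1):2 this produces the same escaped expression as the example above.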
Regular expressions are made up
of normal characters and metacharacters.
Normal characters include
upper- and lowercase letters and digits. In QUOSA, regular expressions are
case-insensitive.
Metacharacters are symbols
(such as the dollar sign) that have special meanings (described below).
In the simplest case, a regular
expression looks like a standard search string. For example, the regular
expression “testing”
contains no metacharacters. It will match “testing,” “123testing,” and “Testing,” but it will not match “sting.”
The following metacharacters
can be used with regular expressions:
. (period)
Matches any single character. For example, the regular expression r.t would match the strings rat, rut, rot, but not root or r t.
^ (caret)
Matches the beginning of a word. For example, the regular expression ^the would match the word therefore or the word "the" in the string "in the event" but would not match "otherwise."
$ (dollar sign)
Matches the end of a word. For example, the regular expression weasel$ would match the word weasel but not the word weasels.
* (asterisk)
Matches zero or more occurrences of the character immediately preceding. For example, the regular expression .* means match any number of any characters.
+ (plus sign)
Matches one or more occurrences of the character or regular expression immediately preceding. For example, the regular expression 9+ matches 9, 99, 999.
? (question mark)
Matches zero or one occurrence of the character or regular expression immediately preceding.
\ (backslash)
This is the quoting character that is used to treat the character that follows as an ordinary character. For example, \$ is used to match the dollar-sign character ($) rather than the end of a word. Similarly, the expression \. is used to match the period character rather than any single character.
[ ] (square brackets)
Matches any one of the characters between the brackets. For example, the regular expression r[aou]t matches rat, rot, and rut, but not ret. Ranges of characters can be specified by using a hyphen. For example, the regular expression [0-9] means match any digit. Multiple ranges can be specified as well. The regular expression [A-Za-z] means match any upper- or lowercase letter. To match any character except those in the range (the complement range), use the caret as the first character after the opening bracket. For example, the expression [^269A-Z] will match any character except 2, 6, 9, and uppercase letters.
( ) (parentheses)
Treats the expression between the left and right parentheses as a group. Use with the quantity modifiers (*, +, ?, {}) and with |.
| (vertical bar)
"Or" two conditions together. For example, t(ry|op) matches try and top but not toy.
{i} (braces)
Matches a specific number of instances, or a number of instances within a range, of the preceding character. For example, the expression A[0-9]{3} will match "A" followed by exactly three digits (that is, it will match A123 but not A1234). The expression [0-9]{4,6} matches any sequence of 4, 5, or 6 digits.
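Several of the examples above can be checked with Python's re module, which shares these metacharacters. This is a sketch under two assumptions: Python stands in for QUOSA's engine, and each pattern is tested against a whole word with fullmatch, whereas QUOSA searches through running text.

```python
import re

# (pattern, word, expected result), taken from the descriptions above.
checks = [
    (r"r.t", "rat", True),
    (r"r.t", "root", False),
    (r"r[aou]t", "rot", True),
    (r"r[aou]t", "ret", False),
    (r"t(ry|op)", "top", True),
    (r"t(ry|op)", "toy", False),
    (r"9+", "999", True),
    (r"A[0-9]{3}", "A123", True),
    (r"A[0-9]{3}", "A1234", False),
    (r"[0-9]{4,6}", "12345", True),
]

# IGNORECASE mirrors the case-insensitive behaviour described for QUOSA.
outcomes = [
    bool(re.fullmatch(pattern, word, re.IGNORECASE)) == expected
    for pattern, word, expected in checks
]
print(all(outcomes))
```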
To match multiple-word phrases, separate each word with a single space. For example, the regular expression th.* .*s f.n.? will match "this is fine" and "that was fun," but not "the cat was found."
Examples:
The simplest metacharacter is
the dot. It matches any one character (excluding the new-line character).
Consider a file named test.txt
consisting of the following lines:
he is a rat
he is in a rut
the food is Rotten
I like root beer
The regular expression r.t matches
an r followed by any character followed by a t. It
will match rat and rut. It
will also match the Rot in Rotten because
regular expressions in QUOSA are case-insensitive.
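The test.txt example can be reproduced with Python's re module, used here as a stand-in for QUOSA's engine (an assumption). The re.IGNORECASE flag mimics the case-insensitive matching described above.

```python
import re

# The four lines of test.txt from the text.
lines = [
    "he is a rat",
    "he is in a rut",
    "the food is Rotten",
    "I like root beer",
]

pattern = re.compile(r"r.t", re.IGNORECASE)

# Collect every piece of text that the pattern matches, line by line.
found = [m.group(0) for line in lines for m in pattern.finditer(line)]
print(found)
```

The fourth line yields nothing: in "root" there are two characters between the r and the t, and the dot stands for exactly one.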
To match characters at the beginning of a word, use the circumflex character "^" (sometimes called a caret). For example, to find the words that begin with the string "he" in test.txt, you might first think of using the simple expression he. However, this would also match the "he" inside "the" in the third line. The regular expression ^he, by contrast, matches "he" only at the beginning of a word.
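The difference between the two expressions can be counted line by line in Python. One caveat: in Python's re module ^ anchors to the start of the string or line, so the word-start behaviour described here is approximated with \b (a word boundary). That mapping is an assumption made for this sketch.

```python
import re

lines = [
    "he is a rat",
    "he is in a rut",
    "the food is Rotten",
    "I like root beer",
]

plain = re.compile(r"he", re.IGNORECASE)        # "he" anywhere
word_start = re.compile(r"\bhe", re.IGNORECASE)  # "he" at the start of a word

plain_counts = [len(plain.findall(line)) for line in lines]
anchored_counts = [len(word_start.findall(line)) for line in lines]
print(plain_counts)
print(anchored_counts)
```

The third line is the one that changes: the unanchored expression finds the "he" inside "the," and the anchored one does not.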
Sometimes it is easier to indicate something that should not be matched rather than all the cases that should be matched. When the circumflex is the first character between square brackets, it means to match any character that is not in the range. For example, to match he when it is not preceded by t or s, the following regular expression can be used:
[^st]he
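A quick check in Python (used as a stand-in for QUOSA's engine, which is an assumption) shows one subtlety of this expression: the bracket must itself match a character, so "he" with nothing in front of it is not found. The word list is made up for illustration.

```python
import re

pattern = re.compile(r"[^st]he", re.IGNORECASE)

words = ["the", "she", "ahead", "where", "he"]

# Keep the words in which some character other than s or t precedes "he".
matched = [w for w in words if pattern.search(w)]
print(matched)
```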
Character ranges can be specified between the square brackets. For example, the regular expression [A-Z] matches any letter in the alphabet, upper- or lowercase, because regular expressions in QUOSA are case-insensitive. The regular expression [a-z] is equivalent. The regular expression [A-Z][A-Z]* matches a letter followed by zero or more letters. You can use the + metacharacter to do the same thing; that is, the regular expression [A-Z]+ means the same thing as [A-Z][A-Z]*.
To specify the number of occurrences matched, use braces. For example, to match instances of 100 and 1000 but not 10, use the following:
10{2,3}
This regular expression matches the digit 1 followed by either two or three 0's. Because a regular expression also matches inside longer strings, it finds the 1000 within 10000 as well; to exclude such cases, anchor the pattern to the whole word as ^10{2,3}$. A useful variation is to omit the second number. For example, the regular expression 0{3,} will match three or more successive 0's.
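The brace quantifiers can be tried on a short list of numbers in Python, again as a stand-in for QUOSA's engine (an assumption). The sketch contrasts matching a whole number (fullmatch) with matching anywhere inside it (search).

```python
import re

numbers = ["10", "100", "1000", "10000"]

# The whole number must be a 1 followed by two or three 0's.
whole = [n for n in numbers if re.fullmatch(r"10{2,3}", n)]

# The pattern may match anywhere inside the number.
partial = [n for n in numbers if re.search(r"10{2,3}", n)]

# Open-ended range: three or more successive 0's.
three_plus = [n for n in numbers if re.search(r"0{3,}", n)]

print(whole)
print(partial)
print(three_plus)
```

The unanchored search also reports 10000, because the first four digits of that number already satisfy the pattern.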
You can perform searches to find words that end in certain letters. This type of search is called left truncation, meaning the left part of the word is ignored when searching for words with a common ending. An asterisk is used as part of this search.
For example, to find all words ending in "olol" in a set of articles, enter
*olol
as the left truncation search term. QUOSA will find words such as "propranolol," "atenolol," and so on.
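The effect of a left truncation search can be imitated with a regular expression that accepts any word characters before the known ending. The function below is a hypothetical Python sketch of that idea, not QUOSA's implementation; the sample sentence is made up.

```python
import re

def left_truncation(suffix, text):
    """Return the distinct words in text that end with suffix,
    ignoring case, in order of first appearance."""
    pattern = re.compile(r"\b\w*" + re.escape(suffix) + r"\b", re.IGNORECASE)
    seen = []
    for word in pattern.findall(text):
        if word.lower() not in seen:
            seen.append(word.lower())
    return seen

text = "Patients received propranolol or atenolol; controls received a statin."
hits = left_truncation("olol", text)
print(hits)
```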