The BioVelo Query Language

Mario Latendresse

1. Introduction

The BioVelo language is designed to let the user write complex queries against Pathway/Genome Databases created using the Pathway Tools software.

By writing queries in BioVelo you can extract lists of objects (e.g., proteins) satisfying specific conditions.

The language is based on a computer science concept called list comprehension, which is popular in functional languages such as Haskell, Miranda, and Python. BioVelo is rich enough to let the user specify complex queries without relying on procedural side effects. This property eases writing queries by starting with simple ones and composing them to create more complex ones. Functional languages like BioVelo offer this advantage over procedural languages.

2. Database Schema

The databases queried by BioVelo contain objects belonging to various classes: metabolic pathways, reactions, proteins, genes, and so on. Each class has a set of attributes associated with it. For example, the class Proteins has attributes that include pI (its isoelectric point), and Gene (the gene encoding the protein). That means that each protein object (instance of the class) has the attribute Gene, although in some objects, the attribute may have no value.

The list of classes, their attributes, and the type of each attribute are part of what is called the schema (or ontology) of the databases. All PGDBs share one schema, although that schema changes with new versions of Pathway Tools. The schema is not part of BioVelo. The graphical user interface provided with BioVelo gives some details of the schema and the list of accessible databases -- this list will change with time.

The Pathway Tools schema is described in several documents. The most comprehensive is the Pathway Tools User's Guide, which is available as part of the Pathway Tools software download package. See also several publications listed on the BioCyc publications page. The better you know the Pathway Tools schema, the more adept you will be at writing BioVelo queries, because you need to know what classes to base your queries on, and which attributes to filter in your queries.

By applying relations (e.g., is equal to) to attributes you can constrain your search to specific attribute values. Several relations on multiple attributes can be used in the same query (or subquery).

Each attribute has a type. For example, a type can be a string, a number, a Boolean, an enumerated list of values (aka ``a controlled vocabulary"), or some specific class. For each type, a specific list of relational operators is allowed. For example, addition can be done only on numbers, not strings. Knowing these types is important to write valid queries.

3. Syntax of BioVelo

BioVelo is based on List Comprehension, a notation that was first created in mathematics as Set Comprehension. The syntax is similar, but BioVelo emphasizes lists over sets. Note: a list may contain duplicates of some elements, whereas a set does not contain duplicates. BioVelo can process sets, but using lists is usually more efficient.

Table 1 presents the formal syntax of a valid BioVelo expression -- that is the element called Expr can be provided as a query. The notation used to define this syntax is described in the caption of the table although informal details are presented in the next paragraphs.

Typically, a query starts and ends with brackets (that is [ ] or { }). From Table 1 a bracketed expression is either ListExpr or SetExpr. We introduce BioVelo with such queries.

A bracketed query has the form

[ e : q1,..., qn]

or

{ e : q1,..., qn}

where e -- called the head -- is an expression, and the qi are qualifiers that are either generators, binders, or filters. The result of a query is an ordered list (i.e., duplicates may exist) if square brackets [ ] are used; but it is a set (i.e., no duplicates exist) if curly braces { } are used. We will detail the syntax and semantics of such expressions in the following paragraphs.

The head expression e is either a single expression (a SingleExpr from Table 3) or a tuple of single expressions (e1,..., en). In the simplest and most common case, the e is a single identifier that is assigned (aka bound) by one of the generators or binders.

3.1 Generators to Iterate Through Lists of Objects

A generator has the form p<-X where p is a variable or a tuple of variables and X is an expression of type list. The scope of the variables in p extends to all qualifiers to the right of that generator -- that is all the variables in p can be referenced by qualifiers on the right of this generator. The variables are iteratively bound to every element of X -- for each iteration the right qualifiers are evaluated. The generators are similar to ``loops" in programming languages. If p is a tuple, the number of variables in it must be the same as the tuples that are in X.

Here is a simple example with a single variable:


[r : r <- ecoli^^reactions]

This gives the list of all reactions in the database E. coli.

3.2 Double caret ^^ to Extract All Objects of a Class

The double caret ^^ operator, applied to a database and a class name of this database, as in the previous example, specifies the list of all objects of that class for this database. The left operand of ^^ must be either the name of a database accessible from the server or the name of a bound variable to a database name accessible from the server. For example ecoli^^proteins gives the list of all protein objects of the database E. coli. The underlying schema decides the exact capitalization of the class names. Although, most of the time, a class name does not have to be exactly capitalized, that is, you can most of the time enter the class name all in lower case. Indeed, we provide the following heuristic to help the user avoid entering the exact capitalization: we first try to find the class name in the schema as you enter it; if this fails, we capitalize the class name provided by capitalizing all the words separated by a dash (e.g., Super-Pathways). For example, if you enter proteins, it is first tried as is in the schema; since it does not exist in the current schema, the capitalization Proteins is then tried, which is correct in the current schema. Similarly with all-genes, it is capitalized to All-Genes. On the other hand, the class DNA-Binding-Sites must be written exactly that way since our heuristic capitalization would give Dna-Binding-Sites -- which is incorrect in the current schema.

3.3 Single caret ^ to Reference the Value of an Attribute

The single caret ^ operator references the value of an attribute of an object of a class. The left operand must be an expression evaluating to an object. The right operand must be the name of a queryable attribute. The name of an attribute is not case sensitive. For example, assume that identifier x is bound to a protein object; then x^dna-footprint-size gives the value of the attribute dna-footprint-size of that protein. In this case, this is an integer value. An attribute of any type can be referenced using ^. If the attribute does not belong to the class of the object bound to the given variable, its value will be the empty list.

The ^? operator is essentially the same as ^ but it always generates a string that represent an HTML link to the object on its left operand. This is typically used in the head of the result to get clickable links in a browser. For example,

[r^?name : r <- ecoli^^reactions]
would give the list of reaction names with a clickable link to get more information for each reaction. Using the ^ operator would only give the name.

The special attribute FRAME-ID exists for all objects (it has the general type THING). Although it is not a real attribute of the schema, it is available to make it explicit that each object has a unique ID that can be referenced. If an expression evaluates to an object, it is actually the FRAME-ID that is used to represent it. For example, in r <- ecoli^^reactions the r value is actually the same as r^FRAME-ID. Therefore, the ^FRAME-ID is redundant in this case, although it may be specified to make a query clearer. (The FRAME-ID is used in the graphical web user interface to let the user select it when creating a query or specifying an output specification.)

3.4 Filters to Apply Conditions to Objects Extracted from Generators

A filter is a Boolean expression formed by connectives, bound variables, comparison operators, arithmetic operators, and/or class predicates. The connective for logical "and" is &, for "or" it is |. The comparison operators <, >, <=, >=, =, != pertain to numeric values. The priority (order of grouping) of these operators is given in Table 2.

For example, in the following we have a simple Boolean expression 3 > #r^left that is true if and only if the number of elements in the left attribute is smaller than 3:


[(r^?name, #r^left) : r <- ecoli^^reactions, 3 > #r^left ]

This expression "loops" through each reaction of E. coli, verifies for each one that the left attribute has less than 3 elements, and returns only these reactions along with their number of elements in left. Notice that in 3 > #r^left, the ^ has higher priority than #, and > has lower priority than #. So it is interpreted as: reference the attribute left from r, then take the length of left, then compare it with 3. The order is important, otherwise, in this case, these operations would not make sense because of invalid types (e.g., you cannot take the length of an object, but only of a list or set).

Several generators and filters can be specified as qualifiers in one bracketed query. The filters are simply evaluated from left to right as if a logical "and" operation was between them. Generators are also evaluated from left to right. They are nested "loops". For example, in


{r^?name : r <- ecoli^^reactions, product <- r^left, product isa proteins}

we have two generators. The first one loops through all the reactions of E. coli. For each reaction, the second generator loops through the products on the left attribute of this reaction. The filter verifies that product is a protein. Notice that we have used the curly braces so that the same reactions will not be returned twice. This will return the list of all reactions that have at least one protein on its left side. The similar query with square brackets:


[r^?name : r <- ecoli^^reactions, product <- r^left, product isa proteins]

would return a similar list of reactions but a reaction that has more than one protein in its left attribute would be repeated.

The following example contains three generators. This query generates a list of all pathways of E. coli with every possible pair of their reactions.


[(p^?name, r1^?name, r2^?name) : 
         p <- ecoli^^pathways, r1 <- p^reaction-list, 
         r2 <- p^reaction-list, r1 != r2]

The monadic operator # gives the number of elements of a list or set. For example, #ecoli^^proteins would give the number of protein objects in database ecoli. It can be applied to any expression that returns a list or set. For example the following expression gives the number of reactions that has only one object in its left attribute.


#[r : r <- ecoli^^reactions, 1 = #r^left ]

3.5 Binding a Variable to a Value of an Expression

It is possible to bind a variable to any expression value. For example, the qualifier e := x - y binds the variable e to the value of the expression x - y. In general, a binder has the form p:=e where p is a pattern and e is an expression. The symbol := is used to bind one or several identifiers to the value of the expression e. If the pattern p is a tuple with n variables, then the expression e must generate a tuple of n values. If the pattern p is a single variable, the expression e can be of any type. This is useful to avoid recalculating a value (e.g., computation time is saved) or simply shortening a query. For example, in


[r^?name : r <- ecoli^^reactions, p := #r^left, 2 < p, p < 5]

we avoid doing a reference to #r^left twice by using the identifier p.

A tuple of variables can be used in some cases to bind multiple variables with multiple values. For example, in


[ (p^?name, e^?name) : p <- ecoli^^proteins, 
  e := [(c1, c2) : (c1, c2) <- protein-to-components p, c2 > 1],
  #e > 4]

there is a tuple of variable (c1, c2) bound for each element returned by protein-to-components -- this function returns a tuple of two values for components of the protein p, the component c1 and the number of times c2 it occurs in p. This query would return all the proteins of E. coli that has more than four components that repeat at least once.

3.6 The List of Available Databases

The list of possible databases is given by the special identifier dbs. For example, the generator db <- dbs would bind the variable db to each available database. A specific database is specified by using its name - for example, ecoli identifies the E. coli K-12 database, EcoCyc. The complete list of available databases is provided by the graphical user interface to BioVelo. It is also possible to have this list by entering the query dbs -- it will return the list of databases currently accessible from the server.

3.7 Arithmetic Expressions

Arithmetic expressions can be formed by using arithmetic operators +, -, /, *, quotient (integer division), remainder, and abs. The subtraction (i.e., -) operation is dyadic as well as monadic (i.e., a minus sign). Moreover, lexically, it is necessary to have a delimiter between a - and an adjacent operator or identifier. For example, writing a-b does not mean a subtraction between identifier a and b, but rather the identifier a-b, since the dash can be used to create identifiers.

In the following query, a subtraction is done between the end and start positions of every gene of E. coli associated with an RNA. This query find all RNAs in E. coli that have less than 100 nucleotides.


[(r^?name, l^?name, r^?gene) : r <- ecoli^^RNA, g <- r^gene, 
  l := abs(g^left-end-position - g^right-end-position), 
  l < 100 ]

3.8 Constraining the Extracted Objects to a Class

The form x isa c -- where x is an expression returning an object, and c a class name -- is a Boolean expression, or filter, that tests the membership of the object bound to x in class c. The expression is true if and only if x is an object of class c. Notice that this includes the cases where the class of the object is a subclass of class c. For example, r isa binding-reactions is true if variable r is bound to an object of class binding-reactions or to one of its subclasses.

The previous example could have been written in the following way using isa:


[(r^?name, l^?name, g^?name) : g <- ecoli^^genes, 
  #(g^product) = 1,
  r <- g^product, r isa RNA,
  l := abs(g^left-end-position - g^right-end-position), 
  l < 100 ]

The following query returns all the proteins of E. coli, with their DNA binding sites, if any; one binding site per line when displayed as a table.


[(p^?name, e^?name) : p <-ecoli^^proteins, c <- p^component-of, 
  c isa Protein-DNA-Complexes,
  e := [c2 : c2 <- c^components, c2 isa DNA-Binding-Sites], 
  #e > 0]

This last expression will actually repeat the same protein for each of its binding sites. To have one protein with all its binding sites "next to it" when displayed in a table, one could write


[(p^?name, e^?name) : p <-ecoli^^proteins, 
 e := [c2 : c <- p^component-of,  c isa Protein-DNA-Complexes,
            c2 <- c^components, c2 isa DNA-Binding-Sites], 
 #e > 0]

3.9 String, Set, and List Operations

The operator in, such as x in X, returns true if and only if element x is in list X. The type of x can be a number, string, or database object.

The Boolean expression s instring S is true if and only if the string s is a substring of string S; or if S is a list of strings, that s is a substring of these concatenated strings; or if S is a list of lists, each list is converted into a string, removing all spaces and parentheses, and s is a substring of these concatenated strings. This expression returns false otherwise. There is also a case insensitive version called instringci.

The monadic operator set, applied to a list, returns the same list but without duplicated elements. Curly braces can be used instead of the square brackets -- for the whole query -- to get a set instead of a list.

The dyadic operator ++ concatenates two lists or sets (it does not remove duplicates). (The union of two sets can be obtained with ++ and set as in (set A ++ B)) As usual, it is used as an infix operator, as in A ++ B. The operator ** does an intersection between two lists or two sets; it does not remove duplicates.

3.10 Order of Operations (Priority and Associativity of Operations)

When forming complex expressions, it is important to consider the priority of the operators used. For example, in the expression y+3*x > 1, the multiplication of 3 to x is done before the addition to y. Likewise, the addition is done before the comparison operator > is applied. That is, the priority of operator * is higher than + which is higher than >. Table 2 gives the relative priority of all operators: an operator higher up in the table has higher priority. As usual, parentheses can be used to apply the operators in a different order; for example in (y+3)*x > 1, the addition is done before the multiplication to x. Adding extra or redundant parentheses is not an error -- in many cases it is better to over-specify the order of evaluation so that the expression is clearer.

The associativity of an operator, or group of operators with the same priority, is either right or left. The operators * and / have the same priority and are left-associative; for example the expression x*y/z*w is interpreted as ((x*y)/z)*w. Note that an expression such as x/y/z is interpreted as (x/y)/z. Table 2 gives the associativity of all operators. In some cases, the associativity does not apply as the operator cannot be composed with itself. For example, an expression such as db^^c^^d is invalid since the second ^^ cannot be applied to a class -- that is, neither db^^(c^^d) nor (db^^c)^^d has a valid interpretation.

The expression [exp1 @ exp2] represents a list of numbers from exp1 up to at most exp2. For example, [1 @ 10] represents the list of integers from 1 to 10. The starting expression exp1 and ending expression exp2 must be of type numbers; they can be integer or non-integer expressions. The list of numbers generated starts at the value of exp1 and, by increasing by 1, goes to a maximum of exp2. If the ending value is smaller than the starting value, the generated list is empty.

Similarly, the expression [exp1 @ exp2, exp3] represents a list of numbers from exp1 up to at most exp2 in steps specified by exp3. For example, [1 @ 10, 2] represents the list of odd integers from 1 to 9. The starting expression exp1 and ending expression exp2 must be of type numbers; they can be integer or non-integer expressions. The list of numbers generated starts at the value of exp1 and, by increasing by exp3, goes to a maximum of exp2. If exp3 is zero, the list has only one element, the value of exp1. In the case that exp3 is negative, if the starting value exp1 is greater than the ending value exp2 than the list of values will be decreasing; otherwise the empty list will be generated.

List of numbers can also be specified by simply using a comma separated list of expressions between square brackets. For example, [1, -1, 100, 9] is a list of four numbers. General numerical expressions can be used as in [n, 2*n, m-1], assuming that the variables n and m are bound to numerical values.

Indexing can be done on list of any type by using the list syntax [...]. For example, [r : r<-ecoli^^reactions][1 @ 5] selects the first five reactions returned by the query on E. coli reactions. The indices do not have to be in numerical sequence, e.g. [3, 5, 7] is also valid. If the list of indices is only one element, the result will be the type of the extracted element, otherwise it is a list of the elements. Applying an indexing operation to a non-list generates an empty list.

3.11 Queries Generating Numbers and Multiple Tables

A BioVelo query can also be an expression that does not start with a bracket -- a query is in general an Expr according to the formal syntax of Table 1. For example, a tuple (#dbs, [c^?name:c<-ecoli^^genes]) is a legal query. This gives the number of accessible databases and the list of genes of E. coli. The expression #[p:p<-ecoli^^pathways] is also valid as a query. It gives the number of pathways in E. coli. The following would return two lists, each one interpreted as a table when displayed. The first table contains all E. coli reactions that have two objects on the right attribute; and the second table contains all E. coli reactions that have three.


([r?^name : r <- ecoli^^reactions, 2 = #r^right ],
 [r?^name : r <- ecoli^^reactions, 3 = #r^right ]
)

So, in general, arithmetic expressions and tuples are also valid queries -- in the former case they return single values; in the latter case they return multiple values of possibly complex values (e.g., lists).

3.12 Library of Special Functions

Table 3 presents a list of functions that can be applied to various objects. In general, these functions return a list of objects of a specific type. For example, the function enzyme-to-genes takes an enzyme as input and returns the list of genes that produce this given enzyme. You can use these functions anywhere an expression is allowed. For example, [(x, reaction-to-genes x): x <- ecoli^^reactions] gives a list of tuples where the first element is a reaction from E. coli and the second element is a list of genes involved in this reaction.

3.13 Sorting the Result of a Query

The result of a query can be sorted using the operator sort. For exampe, sort [x^?name: x<-ecoli^^proteins] would return the sorted list of proteins based on their name. If the result is a list of tuples (i.e., several columns), by default the sorting is done on the first element of the tuple (i.e., the first column). This can be changed by specifying two arguments to the function sort. The first argument is the query, the second is an integer specifying the index of the element to sort on. For example, sort ([(x^name,x^frame-id): x<-ecoli^^proteins],2) sorts on the second element, the frame-id.

3.14 How To Submit BioVelo Queries

There are two ways you can submit BioVelo queries, either via the Web using a browser (e.g., Internet Explorer) or from a desktop version of Pathway Tools. The latter can be done from a Lisp prompt or the Pathway Tools prompt by calling the Lisp function biovelo. For example you can enter (biovelo "dbs") which would return the list of accessible databases. In general, the query is provided as a string to the biovelo function. The Web version can be accessed at Advanced Query Page. The Web version has a GUI interface (actually two GUI interfaces). It also include its own documentation accessible via the previous Web link.

3.15 Syntax Tables


Table 1: Syntax of expressions in BioVelo. How to read this notation: each element on the left is defined by what is presented on the right of ::=; a vertical bar separates various possibilities (i.e., it can be read as the connective or); a character in bold represents itself; a word in italic is defined in this table or the other two tables. For example, an Expr is defined as either a SingleExpr or a Tuple; a ListExpr starts with a square bracket typed as [, then an Expr, as defined in this table, then a colon, and so on. An identifier is a sequence of letters, digits, -, _ or ? that starts with a letter or an underscore (i.e., _). A string constant may contain a double quote or a backslash by escaping them with a backslash.
Expr ::= Tuple | SingleExpr
Tuple ::= (Exprs)
Exprs ::= Expr , Exprs | Expr , Expr
SingleExpr ::= ConstNumber | ConstString | ListExpr | SetExpr
    | Var | (SingleExpr) | SingleExpr Dop SingleExpr
    | Mop SingleExpr
ListExpr ::= [ Expr : Qualifiers] |[Expr:]
SetExpr ::= {Expr:Qualifiers} |{Expr:}
Qualifiers ::= Qualifier,Qualifiers |Qualifier
Qualifier ::= Generator |Filter | Binder
Generator ::= Pattern <- Expr
Filter ::= Expr
Binder ::= Pattern := Expr
Pattern ::= Var | (Vars)
Vars ::= Var,Vars | Var
Var ::= an identifier
ConstNumber ::= integer or real numbers
ConstString ::= ¨any printable character¨
Mop ::= Any monadic operator of Table 2
Dop ::= Any dyadic operator of Table 2



Table 2: Monadic and dyadic operators with their associativity. Operators higher in the table have higher priority. The instringci operator is the case insensitive version of instring. The =ci operator is the case-insensitive version of =.
Operators Meaning Associativity
^^ Reference all objects of a class n/a
^ Reference the value of an attribute of an object left
^? Generates a URL string of the value of an attribute of an object. left
sort Sort a query result right
set Convert a list into a set n/a
# length of a list, or cardinality of a set n/a
++ concatenate two lists left
** intersection of two lists or two sets left
- monadic minus n/a
/,*, remainder, quotient division, multiplication, remainder (integer) and quotient (integer) left
+,- addition, subtraction left
abs absolute value n/a
in, instring instringci element in list/set?, substring in string or list/set of strings? n/a
=, =ci, !=, <,>,<=, >= equal, not equal, smaller-than, greater-than, smaller-or-equal-than, greater-or-equal-than n/a
! logical not n/a
& logical and left
| logical or left
evenp, oddp Is it an even number? Is it an odd number? n/a
special functions see Table 3 n/a



Table 3: Special functions provided in BioVelo.
Function name Value(s) returned
   
reaction-to-genes Given a reaction, returns the list of genes involved in this reaction.
enzyme-to-genes Given an enzyme, returns the list of genes that produce this enzyme. The enzyme can be a complex of proteins.
pathway-to-genes Given a pathway, returns the list of genes involved in this pathway.
pathway-to-reactions Given a pathway, returns the list of reactions in this pathway.
protein-to-components Given a protein, returns a list of pairs (component coefficient) where coefficient is an integer representing the number of occurrences of this component in the protein.
maximum Given a list of numbers, return the largest one.
minimum Given a list of numbers, return the smallest one.
mean Given a list of numbers, return their mean.
sum Given a list of numbers, return their sum.
prod Given a list of numbers, return their product.




Mario Latendresse 2006-11-2