While the logical model of information in PathSys is graph based, there are in reality a variety of informa- tion sources that provide different components of the integrated graph. A source may contribute only node properties. For example, the Yeast GFP Fusion Localization Database (Huh et al. (2003), Ghaemmaghami et al. (2003)) provides, for each gene or ORF, a set of locations where the gene is expressed. The location is considered a multi-valued property of the gene. A second kind of source contributes graphs. All protein-protein interaction sources like Gavin et al. (2002), for example, may con- tribute an interaction graph from co-immunoprecipitation experiments using a matrix model. In this case, the type of the edge is physical interaction and the label on the edge includes the type of experiment (co-IP). Yet a third category of sources, may explicitly provide a set of attributes qualifying the nature of the relationship. The protein-DNA interaction data provided by Lee et al. (2002) records a probability value for every interaction between a protein and a DNA region, both for genes and for intergenic regions. Note that this graph is directed while the protein-protein interaction graphs are not. Given this heterogeneity in the type of information that a source may provide, PathSys uses a very generic internal model to accommodate different kinds of sources, such that a new source, providing a new set of nodes, edges, or node/edge properties can be dynamically incorporated into an existing integrated database.
Source Definition.
Currently, the external data source can be:
a relational database schema
a tree-structured XML document
an RDF-styled triplet that describes an edge set of a graph
a DAG structured OWL (http://www.w3.org/TR/owl-features) document.
Typically, a new ontology or a node/attribute type hierarchy, such as the phenotype classification tree from MIPS:
Schema Mapping.
Schema mapping specifies how an element of the imported source should be interpreted as an element of the internal schema of PathSys. We considering the tree-structured and relational data sources. In the first case, we would like the OWL schema to populate the node type hierarchy. The mapping declarations are:
IMPORT NODE TYPE FROM yeast phenotype ( Class as name,
) GRAPH phenotype tree
IMPORT RELATIONSHIP FROM yeast phenotype(
subClassOf as child of
) GRAPH phenotype tree
The second example of source integration imports a relational schema (a fragment of the MIPS database) into the graph elements of the internal model. Figure 2 shows the relational schema. In the MIPS complex relation, the attribute complex references the cid of the MIPS complex category relation. Notice that gene name is explicitly mapped to the pre-existing attribute name. The expression (orf1, gene1) AS SOURCE REFERENCE states that for the edge the source node uses the attribute pair {orf1, gene1} as a foreign key. Also notice how the protein complexes and their members are defined explicitly as hypernodes and hypernode members.
CREATE DATA_SOURCE mips
(
fullname ‘MIPS CYGD’,
reference ‘http://mips.gsf.de’,
description ‘The MIPS Comprehensive Yeast Genome Database aims to …’
);
CREATE GRAPH mips_graph
(
version STRING VALUE ‘mips-11-2004’,
… other meta-data …
) SOURCE mips;
IMPORT NODE FROM MIPS_genes
(
orf AS ATTRIBUTE,
gene_name AS ATTRIBUTE name,
coordinates AS ATTRIBUTE,
classification AS ATTRIBUTE,
description AS ATTRIBUTE
)
TYPE gene // node type
GRAPH mips_graph
PRIMARY IDENTIFIER (orf)
SECONDARY IDENTIFIER (gene_name);
IMPORT EDGE FROM MIPS_interaction
(
// skip column ‘id’
(orf1, gene1) AS SOURCE REFERENCE,
(orf2, gene2) AS TARGET REFERENCE,
relationship AS ATTRIBUTE,
description AS ATTRIBUTE,
reference AS ATTRIBUTE,
evidence AS ATTRIBUTE
)
DIRECTION undirected
GRAPH mips_graph;
IMPORT NODE ATTRIBUTE FROM MIPS_gene_alias
(
(orf, gene) AS NODE REFERENCE,
alias AS ATTRIBUTE synonym
) GRAPH mips_graph;
IMPORT HYPERNODE FROM MIPS_complex_category
(
cid AS ATTRIBUTE,
name AS ATTRIBUTE
) GRAPH mips_graph
PRIMARY IDENTIFIER (cid)
SECONDARY IDENTIFIER (name);
IMPORT HYPERNODE MEMBER FROM MIPS_complex
(
complex AS NODE REFERENCE,
entry AS MEMBER NODE REFERENCE,
ref AS MEMBER ATTRIBUTE,
evidence AS MEMBER ATTRIBUTE
) GRAPH mips_graph;
Sample Source Schema : Gene Ontology
// two kinds of primary nodes: term and gene_product
Term (
id INTEGER PRIMARY KEY,
name VARCHAR,
term_type VARCHAR,
definition VARCHAR
);
Gene_product (
id INTEGER PRIMARY KEY,
name VARCHAR,
species_id INTEGER REFERENCES SEPECIES(ID)
);
// species table is used to show many-to-one relationship.
Species (
id INTEGER PRIMARY KEY,
lineage VARCHAR,
genus VARCHAR,
species VARCHAR
);
// gene_product_properties is another kind of property table. The property is already
// stored in name-value pair.
Gene_product_properties (
gene_product_id INTEGER REFERENCES gene_product (id),
property_name VARCHAR,
property_value VARCHAR
);
// one-to-many relationship
Gene_product_synonyms (
gene_product_id INTEGER REFERENCES gene_product (id),
synonym VARCHAR
);
// analogous to complex data. A family has one or more members.
// column gene_product_id refers to member’s id.
Gene_product_family (
gene_product_family_name VARCHAR,
gene_product_id INTEGER REFERENCES gene_product (id),
putative BOOLEAN // it is a property on the member
);
// edge table
Term2term (
term1_id INTEGER REFERENCES term (id),
term2_id INTEGER REFERENCES term (id),
label VARCHAR
);
// another edge table
Term2GeneProduct (
term_id INTEGER REFERENCES term (id),
gene_product_id INTEGER REFERENCES gene_product (id)
);
Mapping directives for GO:
CREATE DATA_SOURCE go
(
fullname ‘gene ontology’,
reference ‘http://www.geneontology.org’,
description ‘gene ontology is a controlled vocabulary …’
);
CREATE GRAPH go_graph
(
version STRING VALUE ‘go-11-2004’,
… other meta-data …
) SOURCE go;
IMPORT NODE FROM term
(
id AS ATTRIBUTE,
name AS ATTRIBUTE,
term_type AS ATTRIBUTE,
definition AS ATTRIBUTE
) GRAPH go_graph // indirectly, it suggests that this term is from source GO.
PRIMARY IDENTIFIER (id, name);
// primary identifier is used to compare two nodes which may be from different sources.
// In this example, if two nodes have the same id and name, then merge them.
// store species as a separate table.
// skip field ‘lineage’. Rename ‘genus’ to ‘genus_name’.
IMPORT TABLE go_species FROM species (id, genus AS genus_name, species);
// if we want to import table ‘species’ as it is, just say:
// IMPORT TABLE FORM species;
IMPORT NODE FROM gene_product
(
id AS ATTRIBUTE,
name AS ATTRIBUTE entity_name,
// rename the property name. It is especially useful during node merge.
// for example, genes from MIPS have a property called ‘orf’, and genes from
// SGD has a property called ‘orfname’. Both ‘orf’ and ‘orfname’ refer to the open
// reading frame name of a yeast gene. By renaming ‘orfname’ to ‘orf’, it indicates
// that the mapping tool should merge these two fields.
species_id AS ATTRIBUTE REFERENCES go_species (id)
// this reference association will be stored in system catalog.
) GRAPH go_graph
PRIMARY IDENITIFER (name);
// The edge still means traditional edge. Internally it will be split and stored as
// connectorNode and two links.
IMPORT EDGE FROM term2term
(
term1_id AS SOURCE REFERENCE,
term2_id AS TARGET REFERENCE,
label AS ATTRIBUTE relationship // edge attribute
) DIRECTION directed
GRAPH go_graph;
// Here, REFERENCE suggests to follow the foreign key specified in the relational
// schema to identify the node in term table.
// edge table
IMPORT EDGE FROM term2GeneProduct
(
term1_id AS SOURCE REFERENCE,
gene_product_id AS SOURCE REFERENCE
) DIRECTION directed
GRAPH go_graph;
// simple one-to-many property.
IMPORT NODE ATTRIBUTE FROM gene_product_synonyms
(
gene_product_id AS NODE REFERENCE, // node identifier
synonym AS ATTRIBUTE
) GRAPH go_graph;
// properties are already in name-value pair.
IMPORT NODE ATTRIBUTE FROM gene_product_properties
(
gene_product_id AS NODE REFERENCE,
property_name AS ATTRIBUTE KEY,
property_value AS ATTRIBUTE VALUE
) GRAPH go_graph;
// hypernode for complex type of data.
IMPORT HYPERNODE FROM gene_product_family
(
gene_product_family_name AS ATTRIBUTE,
gene_product_id AS MEMBER NODE REFERENCE,
putative AS MEMBER ATTRIBUTE
// internally, ‘putative’ becomes a property of ‘member-of’ edge.
) GRAPH go_graph
PRIMARY IDENTIFIER (gene_product_family_name);