PathSys

Data Integration

While the logical model of information in PathSys is graph based, there are in reality a variety of informa- tion sources that provide different components of the integrated graph. A source may contribute only node properties. For example, the Yeast GFP Fusion Localization Database (Huh et al. (2003), Ghaemmaghami et al. (2003)) provides, for each gene or ORF, a set of locations where the gene is expressed. The location is considered a multi-valued property of the gene. A second kind of source contributes graphs. All protein-protein interaction sources like Gavin et al. (2002), for example, may con- tribute an interaction graph from co-immunoprecipitation experiments using a matrix model. In this case, the type of the edge is physical interaction and the label on the edge includes the type of experiment (co-IP). Yet a third category of sources, may explicitly provide a set of attributes qualifying the nature of the relationship. The protein-DNA interaction data provided by Lee et al. (2002) records a probability value for every interaction between a protein and a DNA region, both for genes and for intergenic regions. Note that this graph is directed while the protein-protein interaction graphs are not. Given this heterogeneity in the type of information that a source may provide, PathSys uses a very generic internal model to accommodate different kinds of sources, such that a new source, providing a new set of nodes, edges, or node/edge properties can be dynamically incorporated into an existing integrated database.

Source Definition.

Currently, the external data source can be:

a relational database schema
a tree-structured XML document
an RDF-styled triplet that describes an edge set of a graph
a DAG structured OWL (http://www.w3.org/TR/owl-features) document.

Typically, a new ontology or a node/attribute type hierarchy, such as the phenotype classification tree from MIPS:

Schema Mapping.

Schema mapping specifies how an element of the imported source should be interpreted as an element of the internal schema of PathSys. We considering the tree-structured and relational data sources. In the first case, we would like the OWL schema to populate the node type hierarchy. The mapping declarations are:

IMPORT  NODE  TYPE  FROM yeast phenotype ( Class as name,
) GRAPH phenotype tree
IMPORT  RELATIONSHIP  FROM yeast phenotype(
subClassOf as child of
) GRAPH phenotype tree

The second example of source integration imports a relational schema (a fragment of the MIPS database) into the graph elements of the internal model. Figure 2 shows the relational schema. In the MIPS complex relation, the attribute complex references the cid of the MIPS complex category relation. Notice that gene name is explicitly mapped to the pre-existing attribute name. The expression (orf1, gene1) AS SOURCE REFERENCE states that for the edge the source node uses the attribute pair {orf1, gene1} as a foreign key. Also notice how the protein complexes and their members are defined explicitly as hypernodes and hypernode members.

Sample Source Schema. MIPS

MIPS_genes (
        orf                     VARCHAR PRIMARY KEY,
        gene_name               VARCHAR,
        coordinates             VARCHAR,
        classification          INTEGER,
        description             VARCHAR
);

MIPS_interaction (
        id                      INTEGER,
        orf1                    VARCHAR,
        gene1                   VARCHAR,
        relationship            VARCHAR,
        orf2                    VARCHAR,
        gene2                   VARCHAR,
        description             VARCHAR,
        reference               VARCHAR,
        evidence                VARCHAR,
        FOREIGN KEY (orf1, gene1) REFERENCES MIPS_genes (orf, gene_name),
        FOREIGN KEY (orf2, gene2) REFERENCES MIPS_genes (orf, gene_name)
);

MIPS_gene_alias (
        orf                     VARCHAR,
        gene                    VARCHAR,
        alias                   VARCHAR
);

MIPS_complex_category (
        cid                     VARCHAR,
        name                    VARCHAR
);      

MIPS_complex (
        complex         VARCHAR 
REFERENCES MIPS_complex_category (cid),
        entry                   VARCHAR,
        ref                     VARCHAR,
        evidence                VARCHAR
);

Mapping directives for MIPS:

CREATE DATA_SOURCE mips
(
        fullname        ‘MIPS CYGD’,
        reference       ‘http://mips.gsf.de’,
        description     ‘The MIPS Comprehensive Yeast Genome Database aims to …’
);

CREATE GRAPH mips_graph
(
        version         STRING VALUE ‘mips-11-2004’,
        … other meta-data …
) SOURCE mips;

IMPORT NODE FROM MIPS_genes 
(
        orf             AS ATTRIBUTE,
        gene_name       AS ATTRIBUTE name,
        coordinates     AS ATTRIBUTE,
        classification  AS ATTRIBUTE,
        description     AS ATTRIBUTE
)
TYPE gene    // node type 
GRAPH mips_graph
PRIMARY IDENTIFIER (orf)
SECONDARY IDENTIFIER (gene_name);

IMPORT EDGE FROM MIPS_interaction 
(
        // skip column ‘id’
        (orf1, gene1)           AS SOURCE REFERENCE,
        (orf2, gene2)           AS TARGET REFERENCE,
        relationship            AS ATTRIBUTE,
        description             AS ATTRIBUTE,
        reference               AS ATTRIBUTE,
        evidence                AS ATTRIBUTE
)
DIRECTION undirected
GRAPH mips_graph;

IMPORT NODE ATTRIBUTE FROM MIPS_gene_alias
(
        (orf, gene)             AS NODE REFERENCE,
        alias                   AS ATTRIBUTE synonym
) GRAPH mips_graph;

IMPORT HYPERNODE FROM MIPS_complex_category
(
        cid                     AS ATTRIBUTE,
        name                    AS ATTRIBUTE
) GRAPH mips_graph
PRIMARY IDENTIFIER (cid)
SECONDARY IDENTIFIER (name);

IMPORT HYPERNODE MEMBER FROM MIPS_complex
(
        complex                 AS NODE REFERENCE,
        entry                   AS MEMBER NODE REFERENCE,
        ref                     AS MEMBER ATTRIBUTE,
        evidence                AS MEMBER ATTRIBUTE
) GRAPH mips_graph;

Sample Source Schema : Gene Ontology

// two kinds of primary nodes: term and gene_product
Term (
        id              INTEGER PRIMARY KEY,
        name            VARCHAR,
        term_type       VARCHAR,
        definition      VARCHAR
);

Gene_product (
        id              INTEGER PRIMARY KEY,
        name            VARCHAR,
        species_id      INTEGER REFERENCES SEPECIES(ID)
);

// species table is used to show many-to-one relationship.
Species (
        id              INTEGER PRIMARY KEY,
        lineage         VARCHAR,
        genus           VARCHAR,
        species         VARCHAR
);

// gene_product_properties is another kind of property table. The property is already 
// stored in name-value pair. 
Gene_product_properties (
        gene_product_id INTEGER REFERENCES gene_product (id),
        property_name   VARCHAR,
        property_value  VARCHAR
);

// one-to-many relationship
Gene_product_synonyms (
        gene_product_id INTEGER REFERENCES gene_product (id),
        synonym         VARCHAR
);

// analogous to complex data. A family has one or more members. 
// column gene_product_id refers to member’s id.
Gene_product_family (
        gene_product_family_name        VARCHAR,
        gene_product_id         INTEGER REFERENCES gene_product (id),
        putative                        BOOLEAN  // it is a property on the member
);

// edge table
Term2term (
        term1_id                INTEGER REFERENCES term (id),
        term2_id                INTEGER REFERENCES term (id),
        label                   VARCHAR
);

// another edge table
Term2GeneProduct (
        term_id         INTEGER REFERENCES term (id),
        gene_product_id INTEGER REFERENCES gene_product (id)
);

Mapping directives for GO:

CREATE DATA_SOURCE go
(
fullname ‘gene ontology’,
reference ‘http://www.geneontology.org’,
description ‘gene ontology is a controlled vocabulary …’
);

CREATE GRAPH go_graph
(
version STRING VALUE ‘go-11-2004’,
… other meta-data …
) SOURCE go;

IMPORT NODE FROM term
(
id AS ATTRIBUTE,
name AS ATTRIBUTE,
term_type AS ATTRIBUTE,
definition AS ATTRIBUTE
) GRAPH go_graph // indirectly, it suggests that this term is from source GO.
PRIMARY IDENTIFIER (id, name);
// primary identifier is used to compare two nodes which may be from different sources.
// In this example, if two nodes have the same id and name, then merge them.

// store species as a separate table.
// skip field ‘lineage’. Rename ‘genus’ to ‘genus_name’.
IMPORT TABLE go_species FROM species (id, genus AS genus_name, species);

// if we want to import table ‘species’ as it is, just say:
// IMPORT TABLE FORM species;

IMPORT NODE FROM gene_product
(
id AS ATTRIBUTE,
name AS ATTRIBUTE entity_name,
// rename the property name. It is especially useful during node merge.
// for example, genes from MIPS have a property called ‘orf’, and genes from
// SGD has a property called ‘orfname’. Both ‘orf’ and ‘orfname’ refer to the open
// reading frame name of a yeast gene. By renaming ‘orfname’ to ‘orf’, it indicates
// that the mapping tool should merge these two fields.
species_id AS ATTRIBUTE REFERENCES go_species (id)
// this reference association will be stored in system catalog.
) GRAPH go_graph
PRIMARY IDENITIFER (name);

// The edge still means traditional edge. Internally it will be split and stored as
// connectorNode and two links.
IMPORT EDGE FROM term2term
(
term1_id AS SOURCE REFERENCE,
term2_id AS TARGET REFERENCE,
label AS ATTRIBUTE relationship // edge attribute
) DIRECTION directed
GRAPH go_graph;
// Here, REFERENCE suggests to follow the foreign key specified in the relational
// schema to identify the node in term table.

// edge table
IMPORT EDGE FROM term2GeneProduct
(
term1_id AS SOURCE REFERENCE,
gene_product_id AS SOURCE REFERENCE
) DIRECTION directed
GRAPH go_graph;

// simple one-to-many property.
IMPORT NODE ATTRIBUTE FROM gene_product_synonyms
(
gene_product_id AS NODE REFERENCE, // node identifier
synonym AS ATTRIBUTE
) GRAPH go_graph;

// properties are already in name-value pair.
IMPORT NODE ATTRIBUTE FROM gene_product_properties
(
gene_product_id AS NODE REFERENCE,
property_name AS ATTRIBUTE KEY,
property_value AS ATTRIBUTE VALUE
) GRAPH go_graph;

// hypernode for complex type of data.
IMPORT HYPERNODE FROM gene_product_family
(
gene_product_family_name AS ATTRIBUTE,
gene_product_id AS MEMBER NODE REFERENCE,
putative AS MEMBER ATTRIBUTE
// internally, ‘putative’ becomes a property of ‘member-of’ edge.
) GRAPH go_graph
PRIMARY IDENTIFIER (gene_product_family_name);