Data Integration

  1. Source Definition.
  2. Schema Mapping.
  3. Sample Schema Mappings.

While the logical model of information in PathSys is graph based, there are in reality a variety of informa- tion sources that provide different components of the integrated graph. A source may contribute only node properties. For example, the Yeast GFP Fusion Localization Database (Huh et al. (2003), Ghaemmaghami et al. (2003)) provides, for each gene or ORF, a set of locations where the gene is expressed. The location is considered a multi-valued property of the gene. A second kind of source contributes graphs. All protein-protein interaction sources like Gavin et al. (2002), for example, may con- tribute an interaction graph from co-immunoprecipitation experiments using a matrix model. In this case, the type of the edge is physical interaction and the label on the edge includes the type of experiment (co-IP). Yet a third category of sources, may explicitly provide a set of attributes qualifying the nature of the relationship. The protein-DNA interaction data provided by Lee et al. (2002) records a probability value for every interaction between a protein and a DNA region, both for genes and for intergenic regions. Note that this graph is directed while the protein-protein interaction graphs are not. Given this heterogeneity in the type of information that a source may provide, PathSys uses a very generic internal model to accommodate different kinds of sources, such that a new source, providing a new set of nodes, edges, or node/edge properties can be dynamically incorporated into an existing integrated database.

Source Definition.

Currently, the external data source can be:

  • a relational database schema
  • a tree-structured XML document
  • an RDF-styled triplet that describes an edge set of a graph
  • a DAG structured OWL (http://www.w3.org/TR/owl-features) document.

Typically, a new ontology or a node/attribute type hierarchy, such as the phenotype classification tree from MIPS:

 

Schema Mapping.

Schema mapping specifies how an element of the imported source should be interpreted as an element of the internal schema of PathSys. We considering the tree-structured and relational data sources. In the first case, we would like the OWL schema to populate the node type hierarchy. The mapping declarations are:

IMPORT  NODE  TYPE  FROM yeast phenotype ( Class as name,
) GRAPH phenotype tree
IMPORT  RELATIONSHIP  FROM yeast phenotype(
subClassOf as child of
) GRAPH phenotype tree

The second example of source integration imports a relational schema (a fragment of the MIPS database) into the graph elements of the internal model. Figure 2 shows the relational schema. In the MIPS complex relation, the attribute complex references the cid of the MIPS complex category relation. Notice that gene name is explicitly mapped to the pre-existing attribute name. The expression (orf1, gene1) AS SOURCE REFERENCE states that for the edge the source node uses the attribute pair {orf1, gene1} as a foreign key. Also notice how the protein complexes and their members are defined explicitly as hypernodes and hypernode members.

Sample Source Schema. MIPS

MIPS_genes (
	orf			VARCHAR PRIMARY KEY,
	gene_name		VARCHAR,
	coordinates		VARCHAR,
	classification		INTEGER,
	description		VARCHAR
);

MIPS_interaction (
	id			INTEGER,
	orf1			VARCHAR,
	gene1			VARCHAR,
	relationship		VARCHAR,
	orf2			VARCHAR,
	gene2			VARCHAR,
	description		VARCHAR,
	reference		VARCHAR,
	evidence		VARCHAR,
	FOREIGN KEY (orf1, gene1) REFERENCES MIPS_genes (orf, gene_name),
	FOREIGN KEY (orf2, gene2) REFERENCES MIPS_genes (orf, gene_name)
);

MIPS_gene_alias (
	orf			VARCHAR,
	gene			VARCHAR,
	alias			VARCHAR
);

MIPS_complex_category (
	cid			VARCHAR,
	name			VARCHAR
);	

MIPS_complex (
	complex		VARCHAR 
REFERENCES MIPS_complex_category (cid),
	entry			VARCHAR,
	ref			VARCHAR,
	evidence		VARCHAR
);

Mapping directives for MIPS:

CREATE DATA_SOURCE mips
(
	fullname	‘MIPS CYGD’,
	reference	‘http://mips.gsf.de’,
	description	‘The MIPS Comprehensive Yeast Genome Database aims to …’
);

CREATE GRAPH mips_graph
(
	version		STRING VALUE ‘mips-11-2004’,
	… other meta-data …
) SOURCE mips;

IMPORT NODE FROM MIPS_genes 
(
	orf		AS ATTRIBUTE,
	gene_name	AS ATTRIBUTE name,
	coordinates	AS ATTRIBUTE,
	classification	AS ATTRIBUTE,
	description	AS ATTRIBUTE
)
TYPE gene    // node type 
GRAPH mips_graph
PRIMARY IDENTIFIER (orf)
SECONDARY IDENTIFIER (gene_name);

IMPORT EDGE FROM MIPS_interaction 
(
	// skip column ‘id’
	(orf1, gene1)		AS SOURCE REFERENCE,
	(orf2, gene2)		AS TARGET REFERENCE,
	relationship		AS ATTRIBUTE,
	description		AS ATTRIBUTE,
	reference		AS ATTRIBUTE,
	evidence		AS ATTRIBUTE
)
DIRECTION undirected
GRAPH mips_graph;

IMPORT NODE ATTRIBUTE FROM MIPS_gene_alias
(
	(orf, gene)		AS NODE REFERENCE,
	alias			AS ATTRIBUTE synonym
) GRAPH mips_graph;

IMPORT HYPERNODE FROM MIPS_complex_category
(
	cid			AS ATTRIBUTE,
	name			AS ATTRIBUTE
) GRAPH mips_graph
PRIMARY IDENTIFIER (cid)
SECONDARY IDENTIFIER (name);

IMPORT HYPERNODE MEMBER FROM MIPS_complex
(
	complex 		AS NODE REFERENCE,
	entry			AS MEMBER NODE REFERENCE,
	ref			AS MEMBER ATTRIBUTE,
	evidence		AS MEMBER ATTRIBUTE
) GRAPH mips_graph;

Sample Source Schema : Gene Ontology

// two kinds of primary nodes: term and gene_product
Term (
	id		INTEGER PRIMARY KEY,
	name		VARCHAR,
	term_type	VARCHAR,
	definition	VARCHAR
);

Gene_product (
	id		INTEGER PRIMARY KEY,
	name		VARCHAR,
	species_id	INTEGER REFERENCES SEPECIES(ID)
);

// species table is used to show many-to-one relationship.
Species (
	id		INTEGER PRIMARY KEY,
	lineage		VARCHAR,
	genus		VARCHAR,
	species		VARCHAR
);

// gene_product_properties is another kind of property table. The property is already 
// stored in name-value pair. 
Gene_product_properties (
	gene_product_id	INTEGER REFERENCES gene_product (id),
	property_name	VARCHAR,
	property_value	VARCHAR
);

// one-to-many relationship
Gene_product_synonyms (
	gene_product_id	INTEGER REFERENCES gene_product (id),
	synonym		VARCHAR
);

// analogous to complex data. A family has one or more members. 
// column gene_product_id refers to member’s id.
Gene_product_family (
	gene_product_family_name	VARCHAR,
	gene_product_id		INTEGER REFERENCES gene_product (id),
	putative			BOOLEAN  // it is a property on the member
);

// edge table
Term2term (
	term1_id		INTEGER REFERENCES term (id),
	term2_id		INTEGER REFERENCES term (id),
	label			VARCHAR
);

// another edge table
Term2GeneProduct (
	term_id		INTEGER REFERENCES term (id),
	gene_product_id	INTEGER REFERENCES gene_product (id)
);

Mapping directives for GO:

CREATE DATA_SOURCE go
(
	fullname	‘gene ontology’,
	reference	‘http://www.geneontology.org’,
	description	‘gene ontology is a controlled vocabulary …’
);

CREATE GRAPH go_graph
(
	version		STRING VALUE ‘go-11-2004’,
	… other meta-data …
) SOURCE go;	

IMPORT NODE FROM term
(
	id		AS ATTRIBUTE,
	name		AS ATTRIBUTE,
	term_type	AS ATTRIBUTE,
	definition	AS ATTRIBUTE 
 ) GRAPH go_graph	// indirectly, it suggests that this term is from source GO.
PRIMARY IDENTIFIER (id, name);
// primary identifier is used to compare two nodes which may be from different sources. 
// In this example, if two nodes have the same id and name, then merge them. 

// store species as a separate table.
// skip field ‘lineage’. Rename ‘genus’ to ‘genus_name’.
IMPORT TABLE go_species FROM species (id, genus AS genus_name, species);

// if we want to import table ‘species’ as it is, just say:
// IMPORT TABLE FORM species; 

IMPORT NODE FROM gene_product
(
	id		AS ATTRIBUTE,
	name		AS ATTRIBUTE entity_name, 
// rename the property name. It is especially useful during node merge. 
// for example, genes from MIPS have a property called  ‘orf’, and genes from 
// SGD has a property called ‘orfname’. Both ‘orf’ and ‘orfname’ refer to the open 
// reading frame name of a yeast gene. By renaming ‘orfname’ to ‘orf’, it indicates 
// that the mapping tool should merge these two fields.
species_id	AS ATTRIBUTE REFERENCES go_species (id)  
// this reference association will be stored in system catalog. 
) GRAPH go_graph
  PRIMARY IDENITIFER (name);

// The edge still means traditional edge. Internally it will be split and stored as 
// connectorNode and two links. 
IMPORT EDGE FROM term2term
(
term1_id	AS SOURCE REFERENCE,
term2_id	AS TARGET REFERENCE,
label		AS ATTRIBUTE relationship    // edge attribute	
) DIRECTION directed
  GRAPH go_graph;
// Here, REFERENCE suggests to follow the foreign key specified in the relational 
// schema to identify the node in term table.

// edge table
IMPORT EDGE FROM term2GeneProduct
(
	term1_id	AS SOURCE REFERENCE,
	gene_product_id AS SOURCE REFERENCE
) DIRECTION directed
  GRAPH go_graph;

// simple one-to-many property.
IMPORT NODE ATTRIBUTE FROM gene_product_synonyms
(
	gene_product_id	AS NODE REFERENCE,  // node identifier
	synonym		AS ATTRIBUTE
) GRAPH go_graph;

// properties are already in name-value pair.
IMPORT NODE ATTRIBUTE FROM gene_product_properties
(
	gene_product_id	AS NODE REFERENCE, 
	property_name	AS ATTRIBUTE KEY,
	property_value	AS ATTRIBUTE VALUE
) GRAPH go_graph;

// hypernode for complex type of data. 
IMPORT HYPERNODE FROM gene_product_family
(
	gene_product_family_name	AS ATTRIBUTE,
	gene_product_id		AS MEMBER NODE REFERENCE,
	putative			AS MEMBER ATTRIBUTE  
	// internally, ‘putative’ becomes a property of ‘member-of’ edge.
) GRAPH go_graph
PRIMARY IDENTIFIER (gene_product_family_name);

 

Biological Articles: