Pathway diagrams from PubMed and the World Wide Web contain valuable, highly curated information that is difficult to reach without specialized tools. There is currently no search engine or tool that can analyse pathway images, extract their components (genes, proteins, cells, organs, etc.) and indicate the relationships among them.
We present a resource of pathway diagrams retrieved from article and web-page images through optical character recognition (OCR), in conjunction with data-mining and data-integration methods. The recognized pathways are integrated into the BiologicalNetworks research environment, which links them to the wealth of data available in the IntegromeDB knowledgebase; IntegromeDB in turn integrates data from >100 public data sources and the biomedical literature. Multiple search and analytical tools allow the recognized cellular pathways, molecular networks and cell/tissue/organ diagrams to be studied in the context of integrated knowledge, experimental data and the literature.
We scanned a collection of >150 journals, 50,000 articles and 150,000 figures available in PubMed Central and on the World Wide Web; new articles are downloaded daily. The downloaded figures are stored on a remote server, and the open-source Lucene search engine is used to index, retrieve and rank the image text descriptions (using its default statistical ranking). For a publication, the image description is the figure legend, whereas for a web page, a specifically designed algorithm retrieves the most appropriate description from the web-page text surrounding the image. Image publication date and source journal are stored as separate fields that can also be used to sort the results. The constantly growing ‘Imaging Pathways’ repository currently contains 1,025 pathways, more than any existing public repository; for example, BioCarta contains 354 and KEGG contains 345 reference pathways. Given that IntegromeDB, the back-end database of BiologicalNetworks, integrates Reactome, KEGG, BioCarta, NCI-Nature pathways, WikiPathways and HumanCyc, this makes BiologicalNetworks the richest compendium of currently available pathways.
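The indexing and ranking step can be illustrated with a minimal Python sketch. It is not the Lucene-based implementation used in this work: it assumes a simplified TF-IDF-style score as a stand-in for Lucene's default statistical ranking, and the names `Figure`, `FigureIndex`, `tokenize` and `EXAMPLE_FIGURES`, as well as the example records, are hypothetical and introduced only for illustration.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Figure:
    fig_id: str
    description: str  # figure legend, or text surrounding a web-page image
    journal: str      # stored as a separate field
    pub_date: str     # ISO date string, so string order equals date order


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class FigureIndex:
    """Toy in-memory index over figure descriptions (assumption: TF-IDF-style score)."""

    def __init__(self, figures):
        self.figures = {f.fig_id: f for f in figures}
        # Term frequencies per figure description.
        self.tf = {f.fig_id: Counter(tokenize(f.description)) for f in figures}
        # Document frequency: number of descriptions containing each term.
        self.df = Counter()
        for counts in self.tf.values():
            self.df.update(counts.keys())

    def score(self, fig_id, terms):
        n = len(self.figures)
        counts = self.tf[fig_id]
        length = sum(counts.values()) or 1
        total = 0.0
        for term in terms:
            if counts[term]:
                idf = 1.0 + math.log(n / (1 + self.df[term]))
                # sqrt(tf) * idf^2, normalised by description length.
                total += math.sqrt(counts[term]) * idf * idf / math.sqrt(length)
        return total

    def search(self, query, sort_by="relevance"):
        """Return matching figure ids, ranked by score or sorted by date (newest first)."""
        terms = tokenize(query)
        hits = [(self.score(fid, terms), fid) for fid in self.figures]
        hits = [(s, fid) for s, fid in hits if s > 0]
        if sort_by == "date":
            hits.sort(key=lambda h: self.figures[h[1]].pub_date, reverse=True)
        else:
            hits.sort(key=lambda h: h[0], reverse=True)
        return [fid for _, fid in hits]


# Invented example records, not taken from the repository.
EXAMPLE_FIGURES = [
    Figure("f1", "Wnt signaling pathway in colon cancer cells", "Journal A", "2009-03-01"),
    Figure("f2", "Overview of the insulin signaling pathway", "Journal B", "2010-06-15"),
    Figure("f3", "Electron micrograph of mitochondria", "Journal A", "2008-01-20"),
]

index = FigureIndex(EXAMPLE_FIGURES)
by_relevance = index.search("wnt pathway")
by_date = index.search("wnt pathway", sort_by="date")
```

In this sketch the description text drives the ranking, while the publication date is kept as a separate field that can replace relevance as the sort key, mirroring the separation of fields described above.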