Pygr

Pygr is an open source software project used to develop graph database interfaces.

Features

  • Allows easy representation of data as graph structures, as well as easy querying of these structures.
  • Base classes for representing sequences and sequence intervals.
  • Pygr interface to sequence alignment and scalable storage for multigenome alignments.
  • Pygr interface to sequence databases stored in FASTA, BLAST or relational databases.
  • Framework for running subtasks distributed over many computers, in a pythonic way, using SSH for secure process invocation and XMLRPC for message passing. Also provides a simple interface for queuing and managing any number of such "batch jobs".

Description

The basic idea of Pygr is that all Python data can be viewed as a graph whose nodes are objects and whose edges are object relations (in Python, references from one object to another). This has a number of advantages.

  • All data in a Python program become a database that can be queried through simple but general graph query tools. In many cases the need to write new code for some task can be replaced by a database query.
  • Graph databases are more general and flexible in terms of what they can represent and query than relational databases, which is very important for complex bioinformatics data.
  • In Pygr, a query is itself just a graph that can be stored and queried in a database, opening paths to automated query construction.
  • Pygr graphs are fully indexed, making queries about edge relationships (which are often unacceptably slow in relational databases) fast.
  • The interface can be very simple and pythonic: it's just a Mapping. In Python "everything is a dictionary", also known as "the Mapping protocol": a dictionary maps some set of inputs to some set of outputs. For example, m[a]=b maps a onto b as a unique relation. In Pygr, if we want to map a node to multiple target nodes (i.e. allow it to have multiple edges), we simply add another layer of mapping: m[a][b]=edgeInfo, where edgeInfo is optional information about the edge.
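
The two-level mapping idea can be sketched with ordinary Python dictionaries. This is only an illustration of the m[a][b]=edgeInfo convention, not Pygr's own graph classes; the helper name add_edge and the exon labels are made up for the example.

```python
# Plain-dict sketch of the two-level mapping convention (not Pygr's graph classes).
graph = {}

def add_edge(graph, source, target, edge_info=None):
    """Record the edge source -> target with optional edge information."""
    graph.setdefault(source, {})[target] = edge_info
    # Make every node a key, even if it has no outgoing edges.
    graph.setdefault(target, {})

add_edge(graph, "exon1", "exon2", {"type": "splice"})
add_edge(graph, "exon1", "exon3", {"type": "splice"})
add_edge(graph, "exon2", "exon3")  # edge info is optional

targets = sorted(graph["exon1"])   # all nodes that exon1 points to
info = graph["exon1"]["exon2"]     # information stored on one edge
```

Because each node maps to a dictionary of its targets, looking up the edges leaving a node is a single dictionary access.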

Pygr provides one base class representing both sequences and sequence intervals (SeqPath), from which all sequence classes are derived (Sequence, SQLSequence, BlastSequence etc.). Full details are provided in the documentation.

Pygr provides a general model for interfacing with any kind of sequence alignment, and also a uniquely scalable storage system for working with huge multiple sequence alignments such as multigenome alignments. Specifically, it lets you work with an alignment both in the traditional Row-Column model (each row is a sequence, and each column is a set of individual letters from different sequences that are aligned; we will refer to this as the RC-MSA model), and also as a graph structure (known as a Partial Order Alignment, which we will refer to as the PO-MSA model). This supports "traditional" alignment analysis, as well as graph algorithms, and even graph query of alignments.
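
To make the difference between the two views concrete, here is a deliberately simplified sketch that turns a small row-column alignment into a graph. It assumes one node per alignment column and one edge between consecutive non-gap columns of each sequence; a real partial order alignment is more refined than this, and neither the function name rc_to_graph nor the data layout comes from Pygr.

```python
# Simplified illustration only: one node per column, edges follow each sequence.
rows = {
    "seq1": "ACG-T",
    "seq2": "A-GCT",
    "seq3": "ACGCT",
}

def rc_to_graph(rows):
    """Map each column index to {next column index: set of sequence names}."""
    graph = {}
    for name, row in rows.items():
        previous = None
        for column, letter in enumerate(row):
            if letter == "-":
                continue  # a gap means this sequence skips the column
            graph.setdefault(column, {})
            if previous is not None:
                graph[previous].setdefault(column, set()).add(name)
            previous = column
    return graph

alignment_graph = rc_to_graph(rows)
```

In this toy graph a gap shows up as an edge that jumps over a column, so questions such as "which sequences skip column 1?" become questions about edges.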

The seqdb module provides a simple, consistent interface to sequence databases from a variety of different storage sources such as FASTA, BLAST and relational databases. Sequence databases are modeled (like other Pygr container classes) as dictionaries, whose keys are sequence IDs and whose values are sequence objects. Pygr sequence objects use the Python sequence protocol in all the ways you'd expect: a subinterval of a sequence object is just a Python slice (s[0:10]), which returns a sequence object representing that interval; the reverse complement is -s; the length of a sequence is len(s); and the actual sequence string is str(s). Pygr sequence objects work intelligently with different types of back-end storage (e.g. relational databases or BLAST databases) to efficiently access just the parts of sequence that are requested, only when an actual sequence string is needed.
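
The protocol described above (slice, negate, len, str) can be illustrated with a small stand-in class. ToySeq is hypothetical and written for modern Python; it is not Pygr's SeqPath and it ignores back-end storage entirely. It only shows how slicing and negation can return lightweight interval objects while the string is built only when str() is called.

```python
class ToySeq:
    """Hypothetical stand-in for a sequence interval; not Pygr's SeqPath."""

    _COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

    def __init__(self, letters, start=0, stop=None, reverse=False):
        self._letters = letters  # full forward-strand string (the "storage")
        self.start = start
        self.stop = len(letters) if stop is None else stop
        self.reverse = reverse

    def __len__(self):
        return self.stop - self.start

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        lo, hi, step = key.indices(len(self))
        if step != 1:
            raise ValueError("this sketch only supports contiguous slices")
        hi = max(hi, lo)
        if self.reverse:
            # On the reverse strand, positions count back from self.stop.
            return ToySeq(self._letters, self.stop - hi, self.stop - lo, True)
        return ToySeq(self._letters, self.start + lo, self.start + hi, False)

    def __neg__(self):
        # Same interval, opposite strand.
        return ToySeq(self._letters, self.start, self.stop, not self.reverse)

    def __str__(self):
        # The only place where letters are actually read.
        text = self._letters[self.start:self.stop]
        if self.reverse:
            text = "".join(self._COMPLEMENT[c] for c in reversed(text))
        return text


s = ToySeq("ATGGCCTTAA")
head = s[0:4]          # interval object, no string built yet
print(len(head))       # 4
print(str(head))       # ATGG
print(str(-head))      # CCAT
```

Slicing and negation only adjust coordinates here, which is the property that lets a real implementation defer fetching sequence data until a string is requested.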

The coordinator module provides a simple system for running a large collection of tasks on a set of cluster nodes. Full details are provided in the documentation.

Usage

System Requirements

  • OS: n/a
  • Processor: n/a
  • Memory: data-dependent
  • Other: Python 2.3+

Installation

Unpack the distribution archive (unzip, untar), then run 'python setup.py install'.

Purpose

Pygr can be used to represent data as a graph structure that is easily queried. For example, finding a set of exons that satisfy the following relationship (exon 1 is either connected directly to exon 3, or connected to exon 2 [which is then connected to exon 3]) using a traditional SQL database schema might require a six-way (or more) JOIN, which can make computation times infeasible. Using Pygr, the same query can be represented as {1:{2:None, 3:None}, 2:{3:None}}. Other included modules provide powerful and convenient interfaces for working with sequence alignments, sequence data stored in databases, and managing large jobs on cluster nodes.
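
As a rough illustration of what matching a query graph such as {1:{2:None, 3:None}, 2:{3:None}} against a data graph involves, the sketch below uses plain dictionaries and a naive backtracking search. It is written for modern Python and is not Pygr's query engine; the function name match_query and the example splice graph are assumptions made for this example. It assumes every node of the data graph appears as a key, and it reports a match only when every edge of the query is present.

```python
def match_query(query, graph):
    """Yield dicts mapping each query node to a distinct graph node such that
    every edge in the query also exists between the mapped graph nodes."""
    order = list(query)
    for targets in query.values():
        for target in targets:
            if target not in order:
                order.append(target)  # query nodes that appear only as targets

    def consistent(assign):
        # Check all query edges whose two endpoints are already assigned.
        for q_src, q_targets in query.items():
            if q_src not in assign:
                continue
            for q_dst in q_targets:
                if q_dst in assign and assign[q_dst] not in graph.get(assign[q_src], {}):
                    return False
        return True

    def extend(i, assign):
        if i == len(order):
            yield dict(assign)
            return
        for node in graph:
            if node in assign.values():
                continue  # each graph node is used at most once
            assign[order[i]] = node
            if consistent(assign):
                yield from extend(i + 1, assign)
            del assign[order[i]]

    yield from extend(0, {})


# Hypothetical splice graph: exon A splices to B or directly to C, and so on.
splice_graph = {"A": {"B": None, "C": None}, "B": {"C": None}, "C": {"D": None}, "D": {}}
query = {1: {2: None, 3: None}, 2: {3: None}}
matches = list(match_query(query, splice_graph))
print(matches)  # [{1: 'A', 2: 'B', 3: 'C'}]
```

The search tries every assignment in the worst case, so it is only suitable for small examples; the point is that the query itself is just another nested dictionary.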