The program package RetroTector (formerly RetroSpector)
is designed to identify and characterize entire or fragmented
endogenous retroviruses (ERVs) in genomic material, in a fashion robust
to mutations and with considerable flexibility.
It relies on a database of motifs and their
properties, and alignments of known retroviral proteins. It is at
present oriented primarily towards searching for ERVs in the human
genome, but can be adapted to other species.
It also uses a considerable number of adjustable
parameters. They can be tuned to shift the balance
between speed, sensitivity and selectivity as required. The optimal
settings for many of them have not yet been determined.
The architecture is flexible and allows new motifs and
new procedures to be plugged in. It can be set to process an
entire genome automatically.
The program is written in Java and is quite portable.
It is in use under Windows, Mac OS X and Linux.
A full version is available to those seriously
interested (see below). Small jobs (<10 Mbases) can be run at
the RetroTector online URL.
Algorithms
For RetroTector three types of algorithms have been
developed:
1. “Fragment threading”
whereby characteristic motifs are combined into chains, satisfying
distance criteria.
The fragment
threading algorithm depends on a collection of Motifs,
conserved features of various kinds. At present, the bulk of the Motifs
are consensus amino acid sequences, but there are a number of others,
including motifs based on neural networks, weight matrices etc. For a
"Motif hit", perfect fulfilment of the criteria is not needed. However,
the hit is assigned a score depending on the degree of fulfilment.
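As an illustration of scoring by degree of fulfilment, the following Python sketch scores a consensus amino acid motif against every window of a protein sequence. The scoring rule (fraction of matching positions), the "X" wildcard, the threshold and the function names are all assumptions made for this example. They are not taken from RetroTector, whose actual scoring is not described here.

```python
def motif_score(consensus: str, window: str) -> float:
    """Fraction of consensus positions matched by the window.

    An 'X' in the consensus is treated as a wildcard (an assumption of
    this sketch) and matches any residue.
    """
    if len(consensus) != len(window):
        raise ValueError("window must have the same length as the consensus")
    matches = sum(1 for c, w in zip(consensus, window) if c == "X" or c == w)
    return matches / len(consensus)


def find_hits(consensus: str, protein: str, threshold: float = 0.7):
    """Return (position, score) for every window scoring at least `threshold`."""
    hits = []
    width = len(consensus)
    for pos in range(len(protein) - width + 1):
        score = motif_score(consensus, protein[pos:pos + width])
        if score >= threshold:
            hits.append((pos, score))
    return hits


# An imperfect match still yields a hit, only with a lower score.
perfect = find_hits("YXDD", "AAYMDDLL", threshold=0.75)
imperfect = find_hits("YXDD", "AAYMDELL", threshold=0.75)
```

Lowering the threshold trades selectivity for sensitivity: more degenerate hits are kept, at the cost of more false positives to be filtered out later.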
Also vital is information about acceptable distances
between Motif
hits. The acceptable ranges are usually set rather wide, to account for
the possibility that unknown ERVs may have hitherto unknown distances.
The fragment threading algorithm constructs
candidate ERVs as
chains of Motif hits fulfilling the distance criteria and assigns each a
score, depending on the hit scores, the number of hits and some other
criteria such as reading frame consistency. "Broken" chains, violating
a small number of the distance
criteria, are also acceptable.
In practice, the number of possible chains is so
large that an
exhaustive search is not feasible. Procedures to make a semi-exhaustive
search without serious loss have been devised, one of them being a
two-stage process whereby Motif hits are first threaded into "subgene
hits", which are then threaded into chains.
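The idea of threading hits into chains under distance criteria can be sketched in Python as below. This is a deliberately simplified stand-in, not RetroTector's procedure. It checks distances only between consecutive hits in a chain, uses the sum of hit scores as the chain score, does not allow "broken" chains, and finds the best chain by dynamic programming over hits sorted by position. The motif names "A", "B", "C" and the distance ranges are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    motif: str    # motif name (hypothetical labels in this sketch)
    pos: int      # position of the hit in the DNA sequence, in bases
    score: float  # hit score; higher means better fulfilment


# Hypothetical rules: (upstream motif, downstream motif) -> (min, max) distance
DISTANCE_RULES = {
    ("A", "B"): (100, 1000),
    ("B", "C"): (200, 2000),
    ("A", "C"): (300, 3000),
}


def best_chain(hits, rules=DISTANCE_RULES):
    """Return (chain, score) for the highest-scoring chain of hits.

    Two hits may be adjacent in a chain only if a rule exists for their
    motif pair and their distance lies within the allowed range.
    """
    hits = sorted(hits, key=lambda h: h.pos)
    if not hits:
        return [], 0.0
    best = [h.score for h in hits]  # best chain score ending at hit i
    prev = [None] * len(hits)       # predecessor of hit i in that chain
    for i, h in enumerate(hits):
        for j in range(i):
            g = hits[j]
            allowed = rules.get((g.motif, h.motif))
            if allowed is None:
                continue
            low, high = allowed
            if low <= h.pos - g.pos <= high and best[j] + h.score > best[i]:
                best[i] = best[j] + h.score
                prev[i] = j
    # Trace back from the best chain end to recover the chain itself.
    i = max(range(len(hits)), key=lambda k: best[k])
    score = best[i]
    chain = []
    while i is not None:
        chain.append(hits[i])
        i = prev[i]
    chain.reverse()
    return chain, score


hits = [Hit("A", 0, 1.0), Hit("B", 500, 0.8),
        Hit("B", 5000, 0.9), Hit("C", 1500, 0.7)]
chain, score = best_chain(hits)
```

In this example the hit for "B" at position 5000 is left out because it lies too far from "A", even though its own score is higher than that of the "B" hit at position 500.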
2a. A fast dynamic programming
Needleman-Wunsch type algorithm for checking similarity
between two DNA base sequences.
2b. A dynamic programming
Needleman-Wunsch type algorithm for fitting an amino
acid sequence to a DNA base sequence, taking into account known related
peptides and other factors suggesting the preferred reading frame.
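For readers unfamiliar with Needleman-Wunsch, the following Python sketch computes the global alignment score of two DNA sequences, as in algorithm 2a. It is the textbook form with a linear gap penalty. The scoring values and the function name are assumptions of this example, and the optimizations that make RetroTector's variant fast are not shown.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1,
             gap: int = -2) -> int:
    """Global alignment score of sequences a and b (Needleman-Wunsch).

    Only two rows of the dynamic programming matrix are kept, so memory
    use is proportional to len(b) rather than len(a) * len(b).
    """
    # Row 0: aligning an empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[len(b)]


identical = nw_score("ACGT", "ACGT")
one_deletion = nw_score("ACGT", "ACT")
```

Algorithm 2b differs in that one sequence is a protein and the other is DNA, so each step must also consider the reading frame. That extension is not attempted here.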
Procedure
RetroTector may be used to analyze short sequences, but
should then not be expected to perform optimally. It is designed to
search in large DNA sequences such as chromosomes and genomes. The
typical procedure is:
1. The SweepDNA module cuts the sequence into
handier "chunks", and removes Alu and L1 fragments using algorithm 2a.
2. The LTRID module identifies possible LTR pairs
(reasonably well, though rather uncritically) and single LTRs (not very
satisfactorily at present). In principle, the polyadenylation signal or
equivalent is identified, a number of LTR markers are evaluated in its
vicinity, and a pair companion is sought using algorithm 2a.
3. The RetroVID module searches for hits by, at
present, about 275 Motifs, and makes chains out of them and the LTRs
using fragment threading.
4. "Puteins", i.e. attempted reconstructions of gag,
pol, pro and env proteins, are made by the ORFID module, using algorithm
2b. Information about actual ORFs is also provided.
5. Possible other exons may be
suggested by the XonID module.
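The first step of the procedure above, cutting a long sequence into chunks, can be sketched in Python as follows. The chunk size, the use of overlapping chunks and the function name are assumptions of this example. How SweepDNA actually chooses chunk boundaries is not described here. The overlap is included so that a feature lying across a chunk boundary still appears whole in at least one chunk, provided it is no longer than the overlap.

```python
def chunk_sequence(seq: str, chunk_size: int, overlap: int):
    """Split seq into (start, chunk) pairs of at most chunk_size bases.

    Consecutive chunks share `overlap` bases. The start offset is kept
    so that positions found in a chunk can be mapped back to seq.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not seq:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, seq[start:start + chunk_size]))
        if start + chunk_size >= len(seq):
            break  # this chunk reached the end of the sequence
        start += step
    return chunks


chunks = chunk_sequence("ACGTACGTAC", chunk_size=4, overlap=1)
```

Each later module could then be run on one chunk at a time, with hit positions converted back to genome coordinates by adding the chunk's start offset.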
A number of modules for graphic
display of the results are available.
RTShell
This is a separate program written in Visual FoxPro. It
is useful in organizing and presenting the RetroTector results,
collecting them in a database, relating them to RepBase etc. As it is
effectively limited to Windows, some of its functions are being
transferred to the platform-independent RetroTector.
Contacts
For further information, you may turn to
Jonas.Blomberg@medsci.uu.se, the leader of the project (to whom requests should be addressed), or
Goran.Sperber@neuro.uu.se who wrote RetroTector.