RepeatMap

User Documentation

This documentation is currently incomplete. For the time being, a brief description of how to use the program is given below.

What is RepeatMap?

The RepeatMap program quickly calculates the number of times a kmer (i.e. a sequence of length k) occurs in a large sequence. For example, the D. melanogaster genome contains the sequence "ACAGTCGATCGA" 24 times (14 on the forward strand, 10 on the reverse strand).

We're able to perform hundreds of thousands of such lookups in seconds. We currently provide this service for several genomes (see program for current availability) and provide all the tools necessary to add more genomes to the list.

Sample Session

If you'd like to test the program, you can click the button "Load Test Settings" and then look at the "Repeat Graph" or "Get Probes" or both. A typical session goes as follows:

Copy and paste a sequence into the large text area. Alternatively, browse to a FASTA file on your desktop.
Select which organism you'd like to use. Then select the kmer cutoff to use when looking for repeats. If you are doing a microarray or other hybridization protocol, it may be worthwhile to look at the literature to determine an appropriate value for k.
If you just want to see the repeat structure of the sequence, look at the repeat graph (click the button).
If you want to look at possible hybridization probe targets in the sequence, adjust the parameters in the "Probe Constraints" section. There are three parameters: probe length, total number of repeats, and maximum repeated kmer. A standard cDNA probe may be very long (1 kbp or longer), whereas oligo probes are typically shorter (50-300 bp). To look at only one region of the sequence, enter start/stop indices into the sequence; otherwise these can be left at their defaults.

Notes on Use

The program is still in beta, meaning it is usable but may behave unexpectedly. If you have a problem with the program, don't hesitate to contact the maintainers.

Common problems and fixes

Problem: Nothing happens when you ask for potential probes.
Fix: If you're looking at a large sequence (20 kbp or more), the program has probably run out of memory. You can run Java with more memory by going to the directory of the program and running
java -Xmx500M -jar ProbeDesigner.jar
This tells Java to use up to 500 MB of memory. You can use a larger or smaller value depending on the size of your sequence and the specs of your machine.
Problem: The program crashes randomly.
Fix: Email the maintainers and describe the problem. Try to include as much detail as possible in your description of the crash.

Program Documentation

The program is documented in the publications:-------. We provide a brief summary of the results in the paper.

To create the dictionary, we first take a sample across all 20mers and use it to estimate a distribution function across 2^K bins, where K is a number in the range of 5-12 depending on the size of the genome. We then go through all 20mers in the genome and put each one into one of the bins, and sort each bin. Once a bin is sorted, identical 20mers sit next to each other, so a single pass through the bins gives the number of times each 20mer is repeated. Finally, we write the 20mers and their counts to a file.
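
The construction described above can be sketched in Python. This is a minimal illustration under stated assumptions, not RepeatMap's actual code: the function name build_dictionary, its parameters, and the use of plain strings instead of a packed encoding are all invented for the example. It samples kmers to pick bin boundaries, assigns every kmer to a bin, sorts each bin, and counts runs of identical kmers.

```python
import bisect
import random
from itertools import groupby


def build_dictionary(genome, k=20, log2_bins=5, sample_size=1000, seed=0):
    """Return a sorted list of (kmer, count) pairs for one strand of `genome`.

    Hypothetical sketch: names and defaults are assumptions for illustration.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    if not kmers:
        return []

    # Estimate the kmer distribution from a random sample and use its
    # quantiles as boundaries, so the 2^K bins end up roughly equal in size.
    rng = random.Random(seed)
    sample = sorted(rng.choice(kmers) for _ in range(sample_size))
    n_bins = 2 ** log2_bins
    boundaries = [sample[(j * len(sample)) // n_bins] for j in range(1, n_bins)]

    # Put every kmer of the genome into its bin.
    bins = [[] for _ in range(n_bins)]
    for kmer in kmers:
        bins[bisect.bisect_right(boundaries, kmer)].append(kmer)

    # Sort each bin; identical kmers are then adjacent and can be counted
    # in one pass. Bins cover increasing ranges, so the output stays sorted.
    table = []
    for contents in bins:
        contents.sort()
        for kmer, run in groupby(contents):
            table.append((kmer, sum(1 for _ in run)))
    return table
```

Because the bins partition the kmers by range, the concatenated output is globally sorted, which is what makes a later binary search over the file possible. Each bin can also be sorted independently, which keeps memory use bounded for large genomes.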

When the server is started, it reads in the file of 20mers and counts and builds a very large table holding all the counts. The count for a given 20mer is found by binary search. To determine counts on the forward strand, we search for the DNA entry (after converting the byte sequence into a double). To determine counts on the reverse strand, we search for the reverse complement of the sequence of interest.
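
The lookup can be sketched as follows. This is an illustration only: the helper names are hypothetical, the table is built in memory rather than read from a file, and each kmer is packed into an integer at two bits per base, which is an assumed encoding standing in for the numeric conversion mentioned above.

```python
import bisect
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def encode(kmer):
    """Pack a kmer into an integer, two bits per base (assumed encoding)."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def load_table(genome, k):
    """Stand-in for reading the kmer/count file: sorted keys, parallel counts."""
    tally = Counter(encode(genome[i:i + k]) for i in range(len(genome) - k + 1))
    keys = sorted(tally)
    return keys, [tally[key] for key in keys]


def lookup(keys, counts, kmer):
    """Binary search for a kmer; kmers absent from the table have count 0."""
    key = encode(kmer)
    i = bisect.bisect_left(keys, key)
    return counts[i] if i < len(keys) and keys[i] == key else 0


def strand_counts(keys, counts, kmer):
    """Return (forward count, reverse count) using only the forward table."""
    return lookup(keys, counts, kmer), lookup(keys, counts, reverse_complement(kmer))


keys, counts = load_table("ATAACCTGAATGCCGTAATG", 2)
print(strand_counts(keys, counts, "TG"))  # (3, 0)
```

Only the forward strand needs to be stored: an occurrence of a kmer on the reverse strand is the same thing as an occurrence of its reverse complement on the forward strand.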

Let's say our genome is:

  TATTGGACTTACGGCATTAC
3'--------------------5'  Reverse strand
5'--------------------3'  Forward strand
  ATAACCTGAATGCCGTAATG

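Suppose the query sequence is ATGCATT and we are interested in 2mers. Each digit below is the number of times the 2mer starting at that position occurs on the given strand; the last position starts no 2mer and is marked "-". The forward counts: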

ATGCATT
331030-
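
The forward counts above can be reproduced with a few lines of Python. This is a naive sketch for checking the example by hand, not the program's lookup; the function names are made up for the illustration, and it assumes every count is a single digit so that the row lines up with the query.

```python
def count_occurrences(genome, kmer):
    """Count occurrences of kmer in genome, overlapping ones included."""
    k = len(kmer)
    return sum(1 for i in range(len(genome) - k + 1) if genome[i:i + k] == kmer)


def forward_profile(genome, query, k):
    """One count per kmer start in query, padded with '-' where no kmer starts."""
    counts = [
        count_occurrences(genome, query[i:i + k])
        for i in range(len(query) - k + 1)
    ]
    # Assumes single-digit counts; larger counts would break the alignment.
    return "".join(str(c) for c in counts) + "-" * (k - 1)


GENOME_FORWARD = "ATAACCTGAATGCCGTAATG"
print("ATGCATT")
print(forward_profile(GENOME_FORWARD, "ATGCATT", 2))  # 331030-
```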

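The reverse counts (look 5'->3' on the reverse strand):

ATGCATT
301333-

The reverse counts via the reverse complement on the forward strand
(look 5'->3' on the forward strand):

AATGCAT
333103-

This leads to the output file shown below. Each row gives a base of the query, the forward count, the reverse count, and the complementary base. In this example the reverse count of each 2mer appears on the row of its second base, which is where the 2mer begins when read 5'->3' on the reverse strand.

5'    3'
A 3 - T
T 3 3 A
G 1 0 C
C 0 1 G
A 3 3 T
T 0 3 A
T - 3 A
3'    5'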