COSI


The cosi package contains two executable programs. One ("coalescent") generates simulated genetic data using a coalescent model. The other ("recosim") generates a random map of variable recombination rates across a region; its output can be used as one of the inputs to coalescent.

For SeqSIMLA, you only need the "coalescent" program to generate simulated genetic data. Follow the steps below:

  • Download the cosi package from http://www.broadinstitute.org/~sfs/cosi/
  • Compile cosi in the cosi folder following the instructions in the README file. This is generally done by using the command 'make'.
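  • Change to the 'examples/bestfit' directory.
  • Customize the options in the params file (see 'Parameter file').
  • Run run.pl with the command 'perl run.pl'.
  • Locate the simulated genetic data file, which is named 'out.hap-<pop id>'.
  • Format the file with the following Linux commands (shown here for 'out.hap-1'). They remove the first two columns, strip all whitespace, and recode alleles 1 and 2 as 0 and 1: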
    • sed -i -r 's/\S+//1' out.hap-1
    • sed -i -r 's/\S+//1' out.hap-1
    • sed -i 's/\t\t*/ /g' out.hap-1
    • sed -i 's/ //g' out.hap-1
    • sed -i 's/1/0/g' out.hap-1
    • sed -i 's/2/1/g' out.hap-1
  • Follow the tutorial 'Generate reference sequence files: Step2' to convert your files to SeqSIMLA format.
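
The formatting commands above edit the file in place, so a mistake means regenerating the data. A minimal sketch of the same six substitutions as a single filter that leaves the original file untouched is shown below. The name format_hap is a hypothetical helper, not part of cosi or SeqSIMLA, and the sample line is made up for illustration. Like the commands above, it assumes GNU sed (for '-r', '\S' and '\t').

```shell
#!/bin/sh
# format_hap (hypothetical helper): read a cosi haplotype file on stdin
# and write the reformatted haplotypes to stdout.
format_hap() {
    # 1-2: drop the first two columns; 3-4: remove tabs and spaces;
    # 5-6: recode allele 1 as 0, then allele 2 as 1 (order matters).
    sed -r -e 's/\S+//1' -e 's/\S+//1' \
           -e 's/\t\t*/ /g' -e 's/ //g' \
           -e 's/1/0/g' -e 's/2/1/g'
}

# Made-up sample line: two leading columns, then four alleles.
printf '0\t1\t2 1 1 2\n' | format_hap
# prints: 1001
```

For a real run this could be used as 'format_hap < out.hap-1 > out.hap-1.formatted', where the output file name is only an example.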
Cosi saves the recombination rates in model.test. To generate the recombination fraction file for SeqSIMLA, use the command:

awk '{print NR,$2}' model.test > output.rec
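
To see what the awk command does, the sketch below runs it on two made-up lines: it keeps the second column and replaces the first column with the line number (NR). The sample values and the wrapper name make_rec are assumptions for illustration only; the real contents of model.test may differ.

```shell
#!/bin/sh
# make_rec (hypothetical wrapper): same awk program as above, reading stdin.
make_rec() {
    awk '{print NR,$2}'
}

# Made-up two-column input standing in for model.test.
printf '1 1.0e-8\n5000 2.5e-8\n' | make_rec
# prints:
# 1 1.0e-8
# 2 2.5e-8
```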

Parameter file

The parameter file defines the population structure and other controlling parameters for the run, using keywords. The parameter file in the 'bestfit' folder provides parameters to generate realistic human populations. Comments are indicated by "#" at the beginning of a line.

To set the size of the simulated region, modify the 'length' parameter:
# in bp.
length 10000

Any population that appears in the simulation, either as a source of samples or in the history of those samples, must be defined in the file; at least one sampled population is required.

pop id (Population):
1 european
3 african-american
4 asian
5 african

The syntax for defining a population is:
pop_define <pop id> <label>
pop_size <pop id> <size>
sample_size <pop id> <n sample>

For example, the following entries

pop_define 1 European
pop_size 1 10000
sample_size 1 10000

define population 1 as the European population, set its effective population size to 10000, and set the number of sampled chromosomes to 10000. If you don't want to generate a specific population, set the sample_size of that population to 0.

To generate SeqSIMLA reference sequences, generally you will only need to modify the 'length' and 'sample_size' parameters.

Note: SeqSIMLA accepts a sample size of at most 10000.
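
Since usually only 'length' and 'sample_size' change between runs, the edits can be scripted. The sketch below is one possible approach, not part of cosi or SeqSIMLA: set_length and set_sample_size are hypothetical helper names, and it assumes each parameter sits on its own line, starts with its keyword, and uses whitespace between fields. It requires GNU sed for '-r'.

```shell
#!/bin/sh
# set_length (hypothetical): replace the value on the 'length' line.
set_length() {
    sed -r "s/^length[[:space:]]+.*/length $1/"
}

# set_sample_size (hypothetical): replace the sample size of one pop id.
# $1 = pop id, $2 = new number of sampled chromosomes (at most 10000).
set_sample_size() {
    sed -r "s/^sample_size[[:space:]]+$1[[:space:]]+.*/sample_size $1 $2/"
}

# Made-up params fragment: lengthen the region, sample 2000 chromosomes
# from population 1, and skip population 4.
printf 'length 10000\nsample_size 1 10000\nsample_size 4 10000\n' |
    set_length 50000 | set_sample_size 1 2000 | set_sample_size 4 0
# prints:
# length 50000
# sample_size 1 2000
# sample_size 4 0
```

With a real parameter file, the same filters could be applied as 'set_length 50000 < params > params.new', keeping the original file as a backup.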