Generate reference sequence files


We provide several sets of reference sequences generated from the 1000 Genomes Project data. To save disk space and speed up file reading, SeqSIMLA reads its reference sequences from a gzip-compressed binary file. If you would like to use your own reference sequences, you must first convert them to this format. The SeqSIMLA package provides a tool called "convert" for this purpose. A prebuilt executable, convert, is in the bin directory. To compile it from source instead, enter the convert directory and type make. Generating a reference sequence file for SeqSIMLA takes three steps.

Step 1. Format your reference file:

Each row in the reference file is a sequence, and each column is a site. Alleles must be coded as 0 and 1, and the file must not contain any spaces.
For example,

100110
111001
000101

There are 3 sequences and 6 sites in the file.
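
Before converting, it can help to confirm that a file follows this format. The following is a minimal sketch and not part of SeqSIMLA: check_ref_format is a hypothetical helper name, the script assumes awk is available, and it assumes (as the example suggests) that every row must have the same number of sites.

```shell
# check_ref_format: hypothetical helper, not shipped with SeqSIMLA.
# Usage: check_ref_format ref.txt
# Reports rows that contain anything other than 0 and 1, and rows whose
# length differs from the first valid row. Returns non-zero on any problem.
check_ref_format() {
    awk '
        # Any row with a character other than 0 or 1 (including spaces,
        # blank lines, or Windows line endings) is reported and skipped.
        !/^[01]+$/ {
            printf "line %d: expected only the characters 0 and 1\n", NR
            bad = 1
            next
        }
        # Remember the number of sites in the first valid row.
        width == 0 { width = length($0) }
        # Assumption: all sequences must cover the same number of sites.
        length($0) != width {
            printf "line %d: has %d sites, expected %d\n", NR, length($0), width
            bad = 1
        }
        END {
            if (NR == 0) { print "file is empty"; bad = 1 }
            if (!bad) printf "%d sequences, %d sites\n", NR, width
            exit (bad ? 1 : 0)
        }
    ' "$1"
}
```

If the three-line example above is saved as ref.txt, running check_ref_format ref.txt prints "3 sequences, 6 sites".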

Step 2. Convert your reference file to a binary file:

./convert your_file_name output_file_name

For example, if your input file is ref.txt and you would like the output file to be ref.bed, type:

./convert ref.txt ref.bed

Step 3. Gzip the binary file:

On Linux, type:

gzip output_file_from_step2

For example,

gzip ref.bed

gzip replaces ref.bed with the compressed file ref.bed.gz, which can then be used as the reference file for SeqSIMLA.
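
Steps 2 and 3 can also be run together. The following is a minimal sketch under stated assumptions, not a script shipped with SeqSIMLA: build_ref is a hypothetical wrapper name, the convert executable is assumed to be at ./convert unless the CONVERT variable (also hypothetical) points elsewhere, and the final gzip -t integrity check is an extra precaution rather than something SeqSIMLA requires.

```shell
# build_ref: hypothetical wrapper, not part of SeqSIMLA.
# Usage: build_ref input_text_file output_binary_file
# Assumes the convert executable is at ./convert; set CONVERT to override.
build_ref() {
    ref_input=$1
    ref_output=$2

    # Step 2: convert the 0/1 text file to the binary format.
    "${CONVERT:-./convert}" "$ref_input" "$ref_output" || return 1

    # Step 3: compress the binary file. gzip replaces it with
    # "$ref_output.gz"; -f overwrites a .gz left over from an earlier run.
    gzip -f "$ref_output" || return 1

    # Extra precaution: verify the integrity of the compressed file.
    gzip -t "$ref_output.gz" || return 1

    echo "wrote $ref_output.gz"
}
```

For example, build_ref ref.txt ref.bed would leave ref.bed.gz in the current directory, ready to be used with SeqSIMLA.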