Preprocessing of the data

  • Step 1. Downloading the source data from GEO.
  • Step 2. Converting data from sra format to fastq format.
  • Step 3. Mapping paired-end reads to the reference sequence.
  • Step 4. Removing PCR duplicates from the .bam file.
  • Step 5. Splitting the .bam file to separate the two reads in a pair.
  • Step 6. Creating the fragment-end (fend) related .bed file.

If you want to run from step 2 to step 5 you can use HiCtool_preprocessing.bash. You need to copy and paste the code inside a new bash file that you open from the unix shell as following:

$ vi preprocessing.bash
# paste the code inside HiCtool_preprocessing.bash, save and exit

# to make the code executable
$ chmod u+x preprocessing.bash

# to run the bash
$ ./preprocessing.bash
You need first to download the data (step 1). You need also to build the index for the reference genome (see step 3). This because it is needed to compute this step only once for each reference genome. Then, you need to compute also step 6 related to the fend .bed file. Again, this step is needed only once for each restriction enzyme on a reference genome.

1. Downloading the source data from GEO

The source data in sra format are downloaded from GEO, via GEO accession number.

To download the data use Linux command wget as following:

wget <http://link_to_the_file_to_be_downloaded>

In the following process, we assume that the downloaded data are stored into two files: HiCfile1.sra, HiCfile2.sra. If you have more than two files, you need only to run the following commands also including the additional files.

To produce our final results, use this GEO accession number: GSM862723.

2. Converting data from sra format to fastq format

After downloading the source data, we convert the files from .sra to .fastq and dump each read into separate files using SRA Toolkit:

fastq-dump HiCfile1.sra --split-3
fastq-dump HiCfile2.sra --split-3

As output we have these .fastq files:

  • HiCfile1.fastq
  • HiCfile1_1.fastq
  • HiCfile1_2.fastq
  • HiCfile2.fastq
  • HiCfile2_1.fastq
  • HiCfile2_1.fastq

where paired-end reads in HiCfile1.sra are split and stored into HiCfile1_1.fastq and HiCfile1_2.fastq, paired-end reads in HiCfile2.sra are split and stored into HiCfile2_1.fastq and HiCfile2_2.fastq. HiCfile1.fastq and HiCfile2.fastq contain reads with no mates.

3. Mapping paired-end reads to the reference sequence

After we have obtained the fastq files, Bowtie 2 is used for mapping the paired-end reads. To align the reads, first we build a corresponding index for the reference genome (only one time for each reference genome):

bowtie2-build hg38.fa index

hg38.fa is the reference sequence in fasta format, the output files in bt2 format are named with the prefix ‘index’.

Now we can align the reads to the reference sequence:

bowtie2 -p 8 -x index -1 HiCfile1_1.fastq,HiCfile2_1.fastq -2 HiCfile1_2.fastq,HiCfile2_2.fastq -S HiCfile.sam


  • The -p argument refers to a specified number (which is 8 in our case) of parallel search threads. This can be useful to decrease the processing time in aligning the reads.
  • The -x argument specifies the basename of the index for the reference genome. The basename is the name of any of the index files up to but not including the final ‘/.1.bt2’, ‘/.2.bt2’, etc.
  • The -1 argument specifies the files with the mate 1s, while -2 specifies the files with the mate 2s. This causes Bowtie 2 to take the paired nature of the reads into account when aligning them.
  • The -S argument specifies the output file in sam format (HiCfile.sam in this case).

Then, SAMtools is used to convert the file from .sam to .bam and sort it by chromosomal coordinates:

samtools view -bS HiCfile.sam > HiCfile.bam
samtools sort -m 5000000000 HiCfile.bam HiCfile.sort
The output bam file is HiCfile.sort.bam (the extension .bam is inserted automatically).

4. Removing PCR duplicates from the bam file

To remove potential duplicates from the bam file using SAMtools, first we have to sort the bam file by reads name (sort -n) and fill in the mate coordinates, size and mate-related flags into the bam file (fixmate). Lastly, after sorting again by chromosomal coordinates (sort), we can remove duplicates from the bam file (rmdup):

samtools sort -m 5000000000 -n HiCfile.sort.bam HiCfile.namesort
samtools fixmate HiCfile.namesort.bam HiCfile.fixmate_namesort.bam
samtools sort -m 5000000000 HiCfile.fixmate_namesort.bam HiCfile.fixmate_sort
samtools rmdup HiCfile.fixmate_sort.bam HiCfile_noDup.sort.bam

5. Splitting the bam file to separate the two reads in a pair

After removing potential PCR duplicates in the bam file, we split it into separate files for each pair of reads via SAMtools:

samtools view -h -f 0x40 HiCfile_noDup.sort.bam > HiCfile_pair1.bam
samtools view -h -f 0x80 HiCfile_noDup.sort.bam > HiCfile_pair2.bam

The output files are HiCfile_pair1.bam, HiCfile_pair2.bam and each contains one mate for each pair.