Preprocessing of the data

  • Step 1. Downloading the source data from GEO.
  • Step 2. Converting data from sra format to fastq format.
  • Step 3. Mapping paired-end reads to the reference sequence.
  • Step 4. Removing PCR duplicates from the .bam file.
  • Step 5. Splitting the .bam file to separate the two reads in a pair.
  • Step 6. Creating the fragment-end (fend) related .bed file.

If you want to run from step 2 to step 5 you can download HiCtool_preprocessing.sh and do the following:

# to make the code executable
$ chmod u+x HiCtool_preprocessing.sh

# to run the bash
$ ./HiCtool_preprocessing.sh
Note!
You need first to download the data (step 1). You need also to build the index for the reference genome (see step 3). This because it is needed to compute this step only once for each reference genome. Then, you need to compute also step 6 related to the fend .bed file. Again, this step is needed only once for each restriction enzyme on a reference genome. Before starting, you can update the number of cores used in the alignment task by using the -p option on bowtie2.

1. Downloading the source data from GEO

The source data in sra format are downloaded from GEO, via GEO accession number using fastq-dump.

Before proceeding, you may need to setup the output directory where the sra files will be saved. After having installed SRA Toolkit, you need to go to the path where the software has been installed, under the subfolder “bin”, and run the following command line:

./vdb-config -i

This will open an interface that will allow you to setup/change your output directory.

To download the data related to a GEO accession number, go to the bottom of that page and click on the SRA number under the section “Relations”. After that, under the section “Runs” you will find the codes to download the SRA file, then run the following:

fastq-dump SRRXXXXXXX

Where SRRXXXXXXX has to be replaced with the specific code of the run you want to download.

In the following process, we assume that the dataset contains two runs (two SRR files) and the downloaded data are stored into two files: HiCfile1.sra, HiCfile2.sra. If you have more than two files, you need only to run the following commands also including the additional files. If you have only one file, you can assume to run the following only using HiCfile1.sra.

Note!
To produce our final results, use this GEO accession number: GSM862723.

2. Converting data from sra format to fastq format

After downloading the source data, we convert the files from .sra to .fastq and dump each read into separate files using SRA Toolkit:

fastq-dump HiCfile1.sra --split-3
fastq-dump HiCfile2.sra --split-3

As output we have these .fastq files:

  • HiCfile1.fastq
  • HiCfile1_1.fastq
  • HiCfile1_2.fastq
  • HiCfile2.fastq
  • HiCfile2_1.fastq
  • HiCfile2_1.fastq

where paired-end reads in HiCfile1.sra are split and stored into HiCfile1_1.fastq and HiCfile1_2.fastq, paired-end reads in HiCfile2.sra are split and stored into HiCfile2_1.fastq and HiCfile2_2.fastq. HiCfile1.fastq and HiCfile2.fastq contain reads with no mates.

3. Mapping paired-end reads to the reference sequence

After we have obtained the fastq files, Bowtie 2 is used for mapping the paired-end reads. To align the reads, first we build a corresponding index for the reference genome (only one time for each reference genome):

bowtie2-build hg38.fa index

hg38.fa is the reference sequence in fasta format, the output files in bt2 format are named with the prefix ‘index’.

Now we can align the reads to the reference sequence:

bowtie2 -p 8 -x index -1 HiCfile1_1.fastq,HiCfile2_1.fastq -2 HiCfile1_2.fastq,HiCfile2_2.fastq -S HiCfile.sam

where:

  • The -p argument refers to a specified number (which is 8 in our case) of parallel search threads. This can be useful to decrease the processing time in aligning the reads.
  • The -x argument specifies the basename of the index for the reference genome. The basename is the name of any of the index files up to but not including the final ‘/.1.bt2’, ‘/.2.bt2’, etc.
  • The -1 argument specifies the files with the mate 1s, while -2 specifies the files with the mate 2s. This causes Bowtie 2 to take the paired nature of the reads into account when aligning them.
  • The -S argument specifies the output file in sam format (HiCfile.sam in this case).

Then, SAMtools is used to convert the file from .sam to .bam and sort it by chromosomal coordinates:

samtools view -bS HiCfile.sam > HiCfile.bam
samtools sort -m 5000000000 HiCfile.bam -o HiCfile.sort.bam

4. Removing PCR duplicates from the bam file

To remove potential duplicates from the bam file using SAMtools, first we have to sort the bam file by reads name (sort -n) and fill in the mate coordinates, size and mate-related flags into the bam file (fixmate). Lastly, after sorting again by chromosomal coordinates (sort), we can remove duplicates from the bam file (rmdup):

samtools sort -m 5000000000 -n HiCfile.sort.bam -o HiCfile.namesort.bam
samtools fixmate HiCfile.namesort.bam HiCfile.fixmate_namesort.bam
samtools sort -m 5000000000 HiCfile.fixmate_namesort.bam -o HiCfile.fixmate_sort.bam
samtools rmdup HiCfile.fixmate_sort.bam HiCfile_noDup.sort.bam

5. Splitting the bam file to separate the two reads in a pair

After removing potential PCR duplicates in the bam file, we split it into separate files for each pair of reads via SAMtools:

samtools view -h -f 0x40 HiCfile_noDup.sort.bam > HiCfile_pair1.bam
samtools view -h -f 0x80 HiCfile_noDup.sort.bam > HiCfile_pair2.bam

The output files are HiCfile_pair1.bam, HiCfile_pair2.bam and each contains one mate for each pair.