Topological domain analysis

Overview

Regions at the periphery of the topological domains are highly biased in their interaction frequencies: the most upstream portion of a topological domain is highly biased towards interacting downstream, and the downstream portion of a topological domain is highly biased towards interacting upstream. To determine the directional bias at any given bin in the genome the Directionality Index (DI) is used, quantifying the degree of upstream or downstream bias of a given bin (Jesse R. Dixon et al., 2012).

This is the formula to calculate the DI:

\[DI = \left(\frac{B-A}{|B-A|}\right)\left(\frac{\left(A-E\right)^2}{E}+\frac{\left(B-E\right)^2}{E}\right)\]

where:

  • \(A\) is the the number of reads that map from a given 40 kb bin to the upstream 2 Mb.
  • \(B\) is the the number of reads that map from a given 40 kb bin to the downstream 2 Mb.
  • \(E\) is the expected number of contacts for each bin and it equals \(\frac{A+B}{2}\).

To compute DI, we need the fend normalized contact data at a bin size of 40 kb. Since the bin size is 40 kb (here we used 40 kbpb which stands for 40 kb per bin), hence the detection region of upstream or downstream biases 2 Mb is converted to 50 bins (2 Mb / 40 kbpb = 50 bins).

Tutorial

The script HiCtool_DI.py contains all the functions to calculate and plot DI and calculate topological domains coordinates (see API Documentation). To use this code, fend normalized contact data must be provided (see Data analysis and visualization). The following steps should be executed in the presented order, since each function generates the input for the following (except plots).

Before going through the tutorial, download and execute the script:

>>> execfile('HiCtool_DI.py')

DI values, HMM states and topological domain coordinates can be also computed in parallel using the script HiCtool_DI_parallel.py. Open the script, update the parameters on the top and save. Then, just execute the script:

>>> execfile('HiCtool_DI_parallel.py')

1. Calculate and plot the observed DI

To calculate the DI for a chromosome of a species use calculate_chromosome_DI. Here we calculate and plot the DI for chromosome 6. input_file is the normalized contact matrix at 40 kb calculated using the function normalize_chromosome_fend_data (see previous section):

>>> calculate_chromosome_DI(input_file='HiCtool_chr6_40kb_normalized_fend.txt', a_chr='6')

A txt file named HiCtool_chr6_DI.txt is saved containing the DI values.

To plot the DI distribution for a specific portion of the chromosome use plot_chromosome_DI:

>>> plot_chromosome_DI(input_file='HiCtool_chr6_DI.txt', a_chr='6', start_pos=50000000, end_pos=54000000, species='hg38', plot_legend=True, plot_grid=True)

Result figure:

_images/HiCtool_chr6_DI.png

This figure is saved as HiCtool_chr6_DI.pdf.

2. Calculate the true DI and plot full DI distributions

We use a Hidden Markov Model (HMM) based on the Directionality Index to identify biased ‘states’ (true DI).

For true DI calculation, we consider the Emission Sequence as the observed DI values and the Transition Matrix, Emission Matrix and initial State Sequence as unknown. We have three emissions 1, 2, 0 corresponding to a positive (1), negative (2) or zero (0) value of the observed DI. In our analysis, we associate to the emission ‘0’ all the absolute DI values under a threshold of 0.4. So, first we estimate the model parameters and then the most probable sequence of states using the Viterbi algorithm. All these steps are performed using calculate_chromosome_true_DI, which calculates true DI values and save the output to a txt file (HiCtool_chr6_hmm_states.txt):

>>> calculate_chromosome_true_DI(input_file='HiCtool_chr6_DI.txt', a_chr='6')

To plot the full DI distributions use plot_chromosome_DI_full:

>>> plot_chromosome_DI_full(input_file_DI='HiCtool_chr6_DI.txt', input_file_hmm='HiCtool_chr6_hmm_states.txt', a_chr='6', start_pos=50000000, end_pos=54000000, species='hg38', plot_legend=True, plot_grid=True)

Result figure:

_images/HiCtool_chr6_DI_full.png

This figure is saved as HiCtool_chr6_DI_full.png.

3. Calculate topological domain coordinates

The true DI values allow to infer the locations of the topological domains in the genome. A domain is initiated at the beginning of a single downstream biased HMM state (red color in the above figure). The domain is continuous throughout any consecutive downstream biased state. The domain will then end when the last in a series of upstream biased states (green color in the above figure) is reached, with the domain ending at the end of the last HMM upstream biased state.

To calculate the topological domains coordinates, first we extract all the potential start and end coordinates according to the definition, and then we evaluate a list of conditions to take into account the possible presence of gaps between a series of positive or negative states values. The figure below shows a summary of the procedure:


_images/HiCtool_topological_domains_flowchart.png

To calculate the topological domains coordinates for a chromosome use calculate_chromosome_topological_domains:

>>> calculate_chromosome_topological_domains(input_file='HiCtool_chr6_hmm_states.txt', a_chr='6')

Our result for chromosome 6 is saved in HiCtool_chr6_topological_domains.txt. If you want to load the topological domains data, you can do so using load_topological_domains.

Note!
The end coordinate of each domain is saved as the start position of the last bin (40 kb) belonging to each domain.

API Documentation

calculate_chromosome_DI(input_file, a_chr)
Function to calculate the DI values for a chromosome of and save them in a txt file.
Parameters:
  • input_file (str) – normalized fend contact matrix at a bin size of 40kb.
  • a_chr (str) – chromosome number (example for chromosome 1: ‘1’).
Returns:

DI values.

Return type:

list


load_DI_values(input_file)
Load a DI txt file generated with “calculate chromosome DI”.
Parameters:input_file (str) – input file name in txt format (generated by the function “calculate_chromosome_DI”)
Returns:DI values loaded.
Return type:list

plot_chromosome_DI(input_file, a_chr, start_pos, end_pos, species=’hg38’, plot_legend=True, plot_grid=True)
Plot the DI values for a chromosome of species and save the figure to file.
Parameters:
  • input_file (str) – txt file of the DI values generated with the function “calculate_chromosome_DI”.
  • a_chr (str) – chromosome number (example for chromosome 1: ‘1’).
  • start_pos (int) – start coordinate for the plot in bp.
  • end_pos (int) – end coordinate for the plot in bp.
  • species (str) – species name (hg38, mm10, etc.).
  • plot_legend (bool) – if True, plot the legend.
  • plot_grid (bool) – if True, plot the grid.

calculate_chromosome_true_DI(input_file, a_chr)
Calculate the true DI values for a chromosome and save the output in a txt file. The observed DI values are required.
Parameters:
  • input_file (str) – txt file of the DI values generated with the function “calculate_chromosome_DI”.
  • a_chr (str) – chromosome number (example for chromosome 1: ‘1’).
Returns:

true DI values (HMM states).

Return type:

list


load_hmm_states(input_file)
Load a hmm txt file generated with “calculate_chromosome_true_DI”.
Parameters:input_file (str) – input file name in txt format (generated by the function “calculate_chromosome_true_DI”)
Returns:true DI values loaded.
Return type:list

plot_chromosome_DI_full(input_file_DI, input_file_hmm, a_chr, start_pos, end_pos, species=’hg38’, plot_legend=True, plot_grid = True)
Plot the DI values (observed and true DI states) for a chromosome of a species and save the figure to file.
Parameters:
  • input_file_DI (str) – txt file of the DI values generated with the function “calculate_chromosome_DI”.
  • input_file_hmm (str) – txt file of the true DI values generated with the function “calculate_chromosome_true_DI”.
  • a_chr (str) – chromosome number (example for chromosome 1: ‘1’).
  • start_pos (int) – start coordinate for the plot in bp.
  • end_pos (int) – end coordinate for the plot in bp.
  • species (str) – species name (hg38, mm10, etc.).
  • plot_legend (bool) – if True, plot the legend.
  • plot_grid (bool) – if True, plot the grid.

calculate_chromosome_topological_domains(input_file, a_chr)
Calculate the topological domains coordinates of a chromosome. The true DI values are required. Topological domains are stored in each line with tab separated start and end coordinates.
Parameters:
  • input_file (str) – txt file of the true DI values generated with the function “calculate_chromosome_true_DI”.
  • a_chr (str) – hromosome number (example for chromosome 1: ‘1’).
Returns:

list of lists with topological domain coordinates.

Return type:

list


load_topological_domains(input_file)
Load the topological domains coordinates from txt file.
Parameters:input_file (str) – input file name generated with “calculate_topological_domains”.
Returns:list of lists with topological domain coordinates.
Return type:list

compute_parallel_DI(a_chr)
Compute DI values, HMM states and topological domain coordinates in parallel. Only to be used by running the script “HiCtool_DI_parallel.py”.
Parameters:a_chr (str) – chromosome number (example for chromosome 1: ‘1’).
Returns:For each chromosome: 1) txt file containing DI values. 2) txt file containing HMM states. 3) txt file containing topological domains coordinates.
Return type:txt