Topological Domains analysis

Prerequirements

  • This section requires python libraries: csv, numpy, matplotlib, matplotlib.pyplot.
  • This section requires python packages: HiFive, hmmlearn.

Overview

Regions at the periphery of the topological domains are highly biased in their interaction frequencies: the most upstream portion of a topological domain is highly biased towards interacting downstream, and the downstream portion of a topological domain is highly biased towards interacting upstream. To determine the directional bias at any given bin in the genome the Directionality Index (DI) is used, quantifying the degree of upstream or downstream bias of a given bin (Jesse R. Dixon et al., 2012).

This is the formula to calculate the DI:

\[DI = \left(\frac{B-A}{|B-A|}\right)\left(\frac{\left(A-E\right)^2}{E}+\frac{\left(B-E\right)^2}{E}\right)\]

where:

  • \(A\) is the the number of reads that map from a given 40 kb bin to the upstream 2 Mb.
  • \(B\) is the the number of reads that map from a given 40 kb bin to the downstream 2 Mb.
  • \(E\) is the expected number of contacts for each bin and it equals \(\frac{A+B}{2}\).

To compute DI, we need the fend normalized contact data at a bin size of 40 kb. Since the bin size is 40 kb (here we used 40 kbpb which stands for 40 kb per bin), hence the detection region of upstream or downstream biases 2 Mb is converted to 50 bins (2 Mb / 40 kbpb = 50 bins).

Tutorial

The script HiCtool_DI.py contains all the functions to calculate and plot DI and calculate topological domains coordinates (see API Documentation). To use this code, fend normalized contact data must be provided (see Data analysis and visualization). The following steps should be executed in the presented order, since each function generates the input for the following (except plots).

Before going through the tutorial, download and run the code into your workspace. If you are using the Python console or iPython on a remote server use the command %cpaste:

  • Copy the entire code
  • Open the Python console or iPython and run %cpaste
  • Paste the code
  • Enter -- on a line to finish

Or execute the files using execfile(filename).

1. Calculate and plot observed DI

To calculate the DI for a chromosome of a species use calculate_chromosome_DI. Here we calculate and plot the DI for chromosome 6 of hg38:

DI_chr6 = calculate_chromosome_DI(a_chr='6', species='hg38')

Now a txt file (HiCtool_chr6_DI.txt) is saved containing the DI values calculated.

To plot the DI distribution for a specific portion of the chromosome use plot_chromosome_DI; with the parameter show you can decide if saving the plot (default) or showing it (show=True):

plot_chromosome_DI(a_chr='6', start_pos=50000000, end_pos=54000000, species='hg38', show=False)

Result figure:

_images/HiCtool_chr6_DI_50_54.png

This figure is saved as HiCtool_chr6_DI.png.

2. Calculate the true DI and plot full DI distributions

We use a Hidden Markov Model (HMM) based on the Directionality Index to identify biased ‘states’ (true DI).

For true DI calculation, we consider the Emission Sequence as the observed DI values and the Transition Matrix, Emission Matrix and initial State Sequence as unknown. We have three emissions 1, 2, 0 corresponding to a positive (1), negative (2) or zero (0) value of the observed DI. In our analysis, we associate to the emission ‘0’ all the absolute DI values under a threshold of 0.4. So, first we estimate the model parameters and then the most probable sequence of states using the Viterbi algorithm. All these steps are performed using calculate_chromosome_true_DI, which calculates true DI values and save the output to a txt file (HiCtool_chr6_hmm_states.txt):

DI_chr6_true = calculate_chromosome_true_DI(a_chr='6')

To plot the full DI distributions use plot_chromosome_DI_full:

plot_chromosome_DI_full(a_chr='6', start_pos=50000000, end_pos=54000000, species='hg38', show=False)

Result figure:

_images/HiCtool_chr6_DI_50_54_full.png

This figure is saved as HiCtool_chr6_DI_full.png.

3. Calculate topological domains coordinates

The true DI values allow to infer the locations of the topological domains in the genome. A domain is initiated at the beginning of a single downstream biased HMM state (red color in the above figure). The domain is continuous throughout any consecutive downstream biased states. The domain will then end when the last in a series of upstream biased states (green color in the above figure) are reached, with the domain ending at the end of the last HMM upstream biased state.

To calculate the topological domains coordinates, first we extract all the potential start and end coordinates according to the definition, and then we evaluate a list of conditions to take into account the possible presence of gaps between a series of positive or negative states values. The figure below shows a summary of the procedure:


_images/HiCtool_topological_domains_flowchart.png

To calculate the topological domains coordinates for a chromosome use calculate_chromosome_topological_domains:

topological_domains_chr6 = calculate_chromosome_topological_domains(a_chr='6')

Our result for chromosome 6 is saved in HiCtool_chr6_topological_domains.txt. If you want to load the topological domains data, you can do so using load_topological_domains.

Note! The end coordinate of each domain is saved as the start position of the last bin (40 kb) belonging to each domain.


API Documentation

save_list(a_list, output_file)
Save a list in a txt file.
Parameters:
  • a_list (list) – list to be saved
  • output_file (txt) – output filename in txt format

load_matrix(input_file)
Load a list from a txt file and reshape it into a square matrix.
Parameters:input_file (txt) – input filename in txt format
Returns:array containing all the reshaped values stored in the input txt file
Return type:numpy array

calculate_chromosome_DI(a_chr, species=’hg38’)
Calculate the DI values for a chromosome of a species and save the output in a txt file.
Parameters:
  • a_chr (str) – chromosome number
  • species (str) – hg38 or mm10
Returns:

DI values

Return type:

list


plot_chromosome_DI(a_chr, start_pos, end_pos, species=’hg38’, show=False)
Plot the DI values for a chromosome of species and save the figure to file.
Parameters:
  • a_chr (str) – chromosome number
  • start_pos (int) – start coordinate for the plot in bp
  • end_pos (int) – end coordinate for the plot in bp
  • species (str) – hg38 or mm10
  • show (bool) – If false (default), the figure is saved to file. If true, the figure is shown.

calculate_chromosome_true_DI(a_chr)
Calculate the true DI values for a chromosome and save the output in a txt file. The observed DI values are required.
Parameters:a_chr (str) – chromosome number
Returns:true DI values (HMM states)
Return type:list

plot_chromosome_DI_full(a_chr, start_pos, end_pos, species=’hg38’, show=False)
Plot the DI values (observed and true DI states) for a chromosome of a species and save the figure to file.
Parameters:
  • a_chr (str) – chromosome number
  • start_pos (int) – start coordinate for the plot in bp
  • end_pos (int) – end coordinate for the plot in bp
  • species (str) – hg38 or mm10
  • show (bool) – If false (default), the figure is saved to file. If true, the figure is shown.

save_topological_domains(a_matrix, output_file)
Save the topological domains coordinates to txt file. Each topological domain coordinates (start and end) occupy one row and are tab separated.
Parameters:
  • a_matrix (numpy matrix) – matrix to be saved with topological domain coordinates
  • output_file (txt) – output file name in txt format

load_topological_domains(input_file)
Load the topological domains coordinates from txt file.
Parameters:input_file (txt) – input file name in txt format
Returns:list of lists with topological domain coordinates
Return type:list

calculate_chromosome_topological_domains(a_chr)
Calculate the topological domains coordinates of a chromosome. The true DI values are required.
Parameters:a_chr (str) – chromosome number
Returns:list of lists with topological domain coordinates
Return type:list