Data analysis and visualization

Prerequirements

  • This section requires python libraries: math, numpy, matplotlib, matplotlib.pyplot, PIL.
  • This section requires python package: HiFive.

Tutorial

The script HiCtool_hifive.py is used to run only HiFive functions (steps 1-6 below), whose outputs are .hdf5 files. For more information about these functions, please see HiFive’s API documentation.

The script HiCtool_normalization_visualization.py contains all the functions to normalize the data and plot contact maps, histogram and colorbar for a selected chromosome (see API Documentation). To use this code, fend correction values must be provided (i.e. you have already run HiCtool_hifive.py or performed steps 1-6).

Before going through the tutorial, download and run the code into your workspace. If you are using the Python console or iPython on a remote server use the magic command %cpaste:

  • Copy the entire code
  • Open the Python console or iPython and run %cpaste
  • Paste the code
  • Enter -- on a line to finish

1. Creating a Fend object

A Fragment-end (Fend) object (hdf5 format) contains information about the fragments created by digestion of a genome by a specific restriction enzyme (RE). In our script, this information is supplied in the form of a BED-formatted file (restrictionsites.bed) containing the information about the restriction enzyme sites for the target genome and GC content.

To create a Fend object use the function hifive.fend:

import hifive

fend = hifive.Fend('fend_object.hdf5', mode='w')
fend.load_fends('HindIII_hg38_gc.bed', re_name='HindIII', format='bed')
fend.save()

2. Creating a HiCData object

HiC dataset (hdf5 format) created from a Fend object and mapped data in bam format. The original bam file has to be splitted into two bam files (see preprocessing) each containing the information of one of the two mapped reads, and then they are passed to the function as elements of a list. The ‘maxinsert’ parameter (int.) is a cutoff for removing paired-reads whose total distance to their respective restriction sites exceeds this value. According to Yaffe and Tanay, we choose a value of 500 bp to remove spurious ligation products.

To create a HiCData object use the function hifive.HiCData:

import hifive

data = hifive.HiCData('HiC_data_object.hdf5', mode='w')
data.load_data_from_bam('fend_object.hdf5',
                        ['HiCfile_pair1.bam', 'HiCfile_pair2.bam'],
                        maxinsert=500)
data.save()

3. Creating a HiC project object

The HiC project object (hdf5 format) contains links to the HiCData and Fend object, information about which fends to include in the analysis, model parameters and learned model values. This is the standard way of working with Hi-C data in HiFive and this object will be used for learning the correction model and downstream analysis.

To create a HiC project object use the function hifive.HiC:

import hifive

hic = hifive.HiC('HiC_project_object.hdf5', 'w')
hic.load_data('HiC_data_object.hdf5')
hic.save()

where HiC_project_object.hdf5 specifies the location to save the HiC object to and HiC_data_object.hdf5 is the data object.

4. Filtering HiC fends

According to Yaffe and Tanay, to do not take into account of biased regions up to ~400 kb upstream or downstream of an active TSS or CTCF-binding site, we filter out fragments that interact within a distance of 500 kb before learning the correction parameters related to fend biases (length and GC content). We choose 500 kb to be more confident of not considering biased regions, however the user is able to change the filtering parameter according to his preference.

To filter out fends use the function hifive.filter_fends:

import hifive

hic = hifive.HiC('HiC_project_object.hdf5')
hic.filter_fends(mininteractions=1, mindistance=500000, maxdistance=0)
hic.save()

5. Finding HiC distance function

Estimation of the distance-dependence relationship from the data prior to normalization, in order to avoid biases that may result due to restriction site distribution characteristics or the influence of distance/signal relationship.

Restriction sites throughout the genome are unevenly distributed. This results in greatly varying sets of distances between fragments and their neighbors. Because interaction signal is strongly inversely-related to inter-fragment distance, this unequal distribution means that fragments with lots of shorter adjacent fragments have a nearby neighborhood of higher interaction values than fragments surrounded by longer restriction fragments, simply due to cutsite variation.

To find HiC distance function use hifive.find_distance_parameters:

import hifive

hic = hifive.HiC('HiC_project_object.hdf5')
hic.find_distance_parameters(numbins=90, minsize=200, maxsize=0)
hic.save('HiC_distance_function.hdf5')

6. Learning the correction model

Algorithm to learn the correction model for Hi-C data. For the normalization, we take into account of fragments length and GC content biases, according to the information provided by the Fend object.

To normalize the data using binning algorithm (Yaffe E. and Tanay A., 2011) use hic.find_binning_fend_corrections:

import hifive

hic = hifive.HiC('HiC_distance_function.hdf5')
hic.find_binning_fend_corrections(max_iterations=1000,
                                  mindistance=500000,
                                  maxdistance=0,
                                  num_bins=[20,20],
                                  model=['len','gc'],
                                  parameters=['even','even'],
                                  usereads='cis',
                                  learning_threshold=1.0)
hic.save('HiC_norm_binning.hdf5')

7. Normalization of the data

For the normalization, observed data and correction parameters to remove the biases to obtain the corrected read counts are needed. In order to perform this, the observed contact matrix and the fend expected contact matrix are calculated. In addition, the enrichment expected contact matrix is calculated to compute the observed over expected enrichment values, considering also the distance between fends.

For each chromosome, the following five matrices are computed at a bin size of 40 kb (the bin size can be changed with a function parameter). Every contact matrix is saved in txt format using the function save_matrix.

  • The observed data contain the observed reads count for each bin.
  • The fend expected data contain the learned correction value to remove biases related to fends for each bin.
  • The enrichment expected data contain the expected reads count for each bin, considering the distance between fends and the learned correction parameters.
  • The normalized fend data contain the corrected reads count for each bin.
  • The normalized enrichment data contain the enrichment value (O/E) for each bin.

To calculate and save the normalized fend data use normalize_chromosome_fend_data:

normalize_chromosome_fend_data(a_chr='6', bin_size=40000, species='hg38', save_obs=True, save_expect=False)

To calculate and save the normalized enrichment data use normalize_chromosome_enrich_data:

normalize_chromosome_enrich_data(a_chr='6', bin_size=40000, species='hg38', save_obs=True, save_expect=False)

Note! If you need only the normalized contact matrices, there is no need to calculate also the enrichment data. If you do not need the expected data, do not save it since they are the biggest files and the process may take time.

8. Visualizing the normalized data

This is to plot the heatmap, histogram and colorbar for the fend normalized data. To have a better visualization (i.e. setting a proper heatmap color range), only the 98th percentile of the non-zero data is plotted. Then, in order to obtain a heatmap whose colors span from white (RGB[255,255,255]) to red (RGB[255,0,0]), the data are normalized between 0 and 255 before plotting.

To plot and save the heatmap, histogram and colorbar in png format use plot_normalized_chromosome_fend_data:

plot_normalized_chromosome_fend_data(a_chr='6', start_coord=50000000, end_coord=54000000, bin_size=40000, species='hg38', plot_histogram=False, plot_colorbar=False)

Result figures

  heatmap_1 colorbar_1 histogram_1  

Explanation:

This result figure shows the fend normalized contact matrix of chromosome 6 from 50 Mb to 54 Mb with a bin size of 40 kb and the respective histogram. The values in the heatmap are the 98th percentile of the non-zero data.


  heatmap_3 colorbar_3 histogram_3  

Explanation:

This result figure shows the fend normalized contact matrix of chromosome 6 from 0 Mb to 171 Mb with a bin size of 1 Mb and the respective histogram. The values in the heatmap are the 98th percentile of the non-zero data. These results can be obtained by changing the plotting function parameters.


This is to plot the heatmap, histogram and colorbar for the enrichment normalized data. The log2 of the data are plotted to quantify the positive enrichment (red) and the negative enrichment (blue). The zero values before performing the log2 are shown in gray. The 99th percentile of the data is plotted, computed either on positive logs and negative logs.

To plot and save the heatmap, histogram and colorbar in png format plot_normalized_chromosome_enrich_data:

plot_normalized_chromosome_enrich_data(a_chr='6', start_coord=50000000, end_coord=54000000, bin_size=40000, species='hg38', plot_histogram=False, plot_colorbar=False)

Result figures

  heatmap_2 colorbar_2 histogram_2  

Explanation:

This result figure shows the log 2 of the enrichment normalized contact matrix of chromosome 6 from 50 Mb to 54 Mb with a bin size of 40 kb and the respective histogram. The values in the heatmap are the 99th percentiles of the positive and negative log2 values.


  heatmap_4 colorbar_4 histogram_4  

Explanation:

This result figure shows the log2 of the enrichment normalized contact matrix of chromosome 6 from 0 Mb to 171 Mb with a bin size of 1 Mb and the respective histogram. The values in the heatmap are the 99th percentiles of the positive and negative log2 values.


API Documentation

save_matrix(a_matrix, output_file)
Format and save an intra-chromosomal contact matrix in a txt file.
1) The upper-triangular part of the matrix is selected (including the diagonal). 2) Data are reshaped to form a vector. 3) All the consecutive zeros are replaced with a “0” followed by the number of times zeros are repeated consecutively. 4) Data are saved to a txt file.
Parameters:
  • a_matrix (numpy matrix) – input contact matrix to be saved
  • output_file (txt) – output filename in txt format

load_matrix(input_file)
Load formatted contact matrix from a txt file and parse it.
Parameters:input_file (txt) – input filename in txt format (generated by the function “save_matrix”)
Returns:array containing the parsed values stored in the input txt file to build a contact matrix
Return type:numpy array

normalize_chromosome_fend_data(a_chr, bin_size, species=’hg38’, save_obs=True, save_expect=False)
Normalize the contact data by calculating the corrected reads count for each bin. Observed data, expected fend data and normalized fend data are saved into txt files.
Parameters:
  • a_chr (str) – chromosome number
  • bin_size (int) – bin size in bp of the contact matrix
  • species (str) – hg38 or mm10
  • save_obs (bool) – if true, save the observed contact data
  • save_expect (bool) – if true, save the expected contact data

normalize_chromosome_enrich_data(a_chr, bin_size, species=’hg38’, save_obs=True, save_expect=False)
Calculate the enrichment data as observed/expected, where the expected reads count is for each bin considering the distance between fends and the learned correction parameters. Observed, expected and enrichment contact data are saved into txt files.
Parameters:
  • a_chr (str) – chromosome number
  • bin_size (int) – bin size in bp of the contact matrix
  • species (str) – hg38 or mm10
  • save_obs (bool) – if true, save the observed contact data
  • save_expect (bool) – if true, save the expected contact data

make_cmap(colors)
Take a list of tuples that contain RGB values and return a cmap with equally spaced colors. Arrange your tuples so that the first color is the lowest value for the colorbar and the last is the highest.
Parameters:colors – List of tuples containing RGB colors from the lowest to the highest value of the colorbar
Returns:cmap object

plot_normalized_chromosome_fend_data(a_chr, start_coord, end_coord, bin_size, species=’hg38’, plot_histogram=False, plot_colorbar=False)
Plot normalized fend contact map, colorbar and histogram.
Parameters:
  • a_chr (str) – chromosome number
  • start_coord (int) – start coordinate for the plot in bp
  • end_coord (int) – end coordinate for the plot in bp
  • bin_size (int) – bin size of the contact matrix in bp
  • species (str) – hg38 or mm10
  • plot_histogram (bool) – if true, plot the histogram
  • plot_colorbar (bool) – if true, plot the colorbar

plot_normalized_chromosome_enrich_data(a_chr, start_coord, end_coord, bin_size, species=’hg38’, plot_histogram=False, plot_colorbar=False)
Plot the log2 of the normalized enrichment contact map, colorbar and histogram.
Parameters:
  • a_chr (str) – chromosome number
  • start_coord (int) – start coordinate for the plot in bp
  • end_coord (int) – end coordinate for the plot in bp
  • bin_size (int) – bin size of the contact matrix in bp
  • species (str) – hg38 or mm10
  • plot_histogram (bool) – if true, plot the histogram
  • plot_colorbar (bool) – if true, plot the colorbar