Data analysis and visualization

Tutorial

  • 1. Creating the Fend object.
  • 2. Creating the HiCData object.
  • 3. Creating the HiC project object.
  • 4. Filtering HiC fends.
  • 5. Estimating the HiC distance function.
  • 6. Learning the correction model.
  • 7. Normalizing the data.
  • 8. Visualizing the data.

The script HiCtool_hifive.py can be used to run all the HiFive steps (1-6), whose outputs are .hdf5 files. For more information about these functions, please see HiFive’s API documentation. To run together steps 1-6, open the script HiCtool_hifive.py, update the parameters on the top and save. Then just execute the script:

>>> execfile('HiCtool_hifive.py')

The script HiCtool_normalization_visualization.py contains all the functions to normalize the data and plot contact maps and histogram for a selected chromosome (see API Documentation). To use this code, fend correction values must be provided (i.e. you have already run HiCtool_hifive.py or performed steps 1-6).

Since the normalization process for all the chromosomes may be highly time consuming, we provide also a multiprocessing procedure to normalize the data for each chromosome in parallel (see step 7 below).

1. Creating the Fend object

A Fragment-end (Fend) object (hdf5 format) contains information about the fragments created by digestion of a genome by a specific restriction enzyme (RE). In our script, this information is supplied in the form of a BED-formatted file (restrictionsites_gc_map_valid.bed) containing information about the fragment ends like coordinates, GC content and mappability score (see preprocessing, step 5).

To create a Fend object use the function hifive.fend:

import hifive

fend = hifive.Fend('fend_object.hdf5', mode='w')
fend.load_fends('restrictionsites_gc_map_valid.bed', re_name='MboI', format='bed')
fend.save()

2. Creating the HiCData object

HiC dataset (hdf5 format) created from a Fend object and mapped data in bam format. The two bam files are passed to the function as elements of a list. The ‘maxinsert’ parameter (int.) is a cutoff for removing paired-reads whose total distance to their respective restriction sites exceeds this value. According to Yaffe and Tanay, we choose a value of 500 bp to remove spurious ligation products. In addition, when the HiCData object is created, PCR duplicates are removed and reads with ends mapping to the same fragment and reads with ends mapping to adjacent fragments on opposite strands are also excluded, to consider the possibility of incomplete restriction enzyme digestion and fragment circularization.

To create a HiCData object use the function hifive.HiCData:

import hifive

data = hifive.HiCData('HiC_data_object.hdf5', mode='w')
data.load_data_from_bam('fend_object.hdf5',
                        ['HiCfile_pair1.bam', 'HiCfile_pair2.bam'],
                        maxinsert=500,
                        skip_duplicate_filtering=False)
data.save()

3. Creating the HiC project object

The HiC project object (hdf5 format) links the HiCData object with information about which fends to include in the analysis, model parameters and learned model values. This is the standard way of working with Hi-C data in HiFive and this object will be used for learning the correction model and downstream analysis.

To create a HiC project object use the function hifive.HiC:

import hifive

hic = hifive.HiC('HiC_project_object.hdf5', 'w')
hic.load_data('HiC_data_object.hdf5')
hic.save()

where HiC_project_object.hdf5 specifies the location to save the HiC object to and HiC_data_object.hdf5 is the data object.

4. Filtering HiC fends

At this step we filter out fragments that do not have at least one interaction before learning correction parameters. Fragment ends within a distance of 500 kb are filtered out in step 6. This will allow to normalize data without confounding technical biases with features associated to biological-relevant structures (Yaffe and Tanay).

To filter out fends use the function hifive.filter_fends:

import hifive

hic = hifive.HiC('HiC_project_object.hdf5')
hic.filter_fends(mininteractions=1, mindistance=0, maxdistance=0)
hic.save()

5. Estimating the HiC distance function

Estimation of the distance-dependence relationship from the data prior to normalization, in order to avoid biases that may result due to restriction site distribution characteristics or the influence of distance/signal relationship.

Restriction sites over the genome are unevenly distributed and this results in a large set of distances between fragments and their neighbors. Since the interaction frequency is strongly inversely-related to inter-fragment distance, this means that fragments surrounded by shorter ones will show higher nearby interactions than those with longer adjacent fragments, due to the uneven distribution of the restriction sites position.

To estimate the HiC distance function use hifive.find_distance_parameters:

import hifive

hic = hifive.HiC('HiC_project_object.hdf5')
hic.find_distance_parameters(numbins=90, minsize=200, maxsize=0)
hic.save()

6. Learning the correction model

Algorithm to learn the correction model for Hi-C data. For the normalization, we take into account of fragments length, inter-fragment distance, GC content and mappability score biases, according to the information included in the Fend object. We also consider a minimum distance of 500 kb between fragments to take into account of the effect of biological biases (TSSs and CTCF bound sites) while learning the correction parameters.

To normalize the data using the binning algorithm (Yaffe E. and Tanay A., 2011) use hic.find_binning_fend_corrections:

import hifive

hic = hifive.HiC('HiC_project_object.hdf5')
hic.find_binning_fend_corrections(max_iterations=1000,
                                  mindistance=500000,
                                  maxdistance=0,
                                  num_bins=[20,20,20,20],
                                  model=['len','gc','mappability','distance'],
                                  parameters=['even','even','even','even'],
                                  usereads='cis',
                                  learning_threshold=1.0)
hic.save('HiC_norm_binning.hdf5')

7. Normalizing the data

For the normalization, observed data and correction parameters to remove the biases to obtain the corrected read counts are required. Therefore, the observed contact matrix and the fend expected contact matrix are calculated. In addition, the enrichment expected contact matrix is calculated to compute the observed over expected enrichment values, considering also the distance between fends.

For each chromosome, the following five matrices are computed at a bin size of 40 kb (the bin size can be changed with a function parameter). Every contact matrix is AUTOMATICALLY saved in txt format using the function save_matrix.

  • The observed data contain the observed reads count for each bin.
  • The fend expected data contain the learned correction value to remove biases related to fends for each bin.
  • The enrichment expected data contain the expected reads count for each bin, considering the linear distance between read pairs and the learned correction parameters.
  • The normalized fend data contain the corrected reads count for each bin.
  • The normalized enrichment data (“observed over expected” matrix) contain the enrichment value (O/E) for each bin.

First execute the script HiCtool_normalization_visualization.py:

>>> execfile('HiCtool_normalization_visualization.py')

To calculate and save the normalized contact matrix use the function normalize_chromosome_fend_data (see API Documentation):

>>> fend_normalized_chr6 = normalize_chromosome_fend_data(a_chr='6', bin_size=40000, input_file='HiC_norm_binning.hdf5', species='hg38', chr_size=0, save_obs=True, save_expect=False)

To calculate and save the “observed/expected” contact matrix use the function normalize_chromosome_enrich_data (see API Documentation):

>>> enrich_normalized_chr6 = normalize_chromosome_enrich_data(a_chr='6', bin_size=40000, input_file='HiC_norm_binning.hdf5', species='hg38', chr_size=0, save_obs=True, save_expect=False)
Note!
If you need only the normalized contact matrices, there is no need to calculate also the enrichment data. If you do not need the expected data, do not save it since they are the biggest files and the process may take time.

Data are compressed in a format to reduce storage occupation and improving saving and loading time. To load a previously generated contact matrix use the function load_matrix:

>>> my_contact_matrix = load_matrix('my_contact_matrix.txt')

where my_contact_matrix.txt is a contact matrix file saved using either normalize_chromosome_fend_data or normalize_chromosome_enrich_data.

To calculate and save the normalized contact matrices in parallel use the script HiCtool_normalize_fend_parallel.py. Open the script, update the parameters on the top and save. Then, just execute the script:

>>> execfile('HiCtool_normalize_fend_parallel.py')

To calculate and save the “observed/expected” contact matrices in parallel use the script HiCtool_normalize_enrich_parallel.py. Open the script, update the parameters on the top and save. Then, just execute the script:

>>> execfile('HiCtool_normalize_enrich_parallel.py')

8. Visualizing the data

This part is to plot the heatmap and histogram for the normalized contact data.

To plot and save the heatmap and histogram use the function plot_chromosome_data:

>>> plot_chromosome_data('HiCtool_chr6_40kb_normalized_fend.txt', a_chr='6', bin_size=40000, full_matrix=False, start_coord=50000000, end_coord=54000000, species='hg38', data_type="normalized_fend", my_colormap=['white', 'red'], cutoff_type='percentile', cutoff=95, max_color='#460000', my_dpi=1000, plot_histogram=True)

Instead of the filename, the fend contact matrix generated above can be passed as well:

>>> plot_chromosome_data(fend_normalized_chr6, a_chr='6', bin_size=40000, full_matrix=False, start_coord=50000000, end_coord=54000000, species='hg38', data_type="normalized_fend", my_colormap=['white', 'red'], cutoff_type='percentile', cutoff=95, max_color='#460000', my_dpi=1000, plot_histogram=True)

Result figures

Normalized contact matrix of chromosome 6 from 50 Mb to 54 Mb with a bin size of 40 kb and the respective histogram of the contact distribution.

heatmap_1 histogram_1

This part is to plot the heatmap and histogram for the enrichment normalized data (“observed over expected”). The log2 of the data is plotted to quantify the positive enrichment (red) and the negative enrichment (blue). Loci (pixels) equal to zero before performing the log2 (deriving from zero observed contacts) are shown in gray. Loci (pixels) where enrichment expected contact was zero before performing the ratio (observed / expected) are shown in black.

To plot and save the heatmap and histogram use the function plot_chromosome_enrich_data:

>>> plot_chromosome_enrich_data('HiCtool_chr6_40kb_normalized_enrich.txt', a_chr='6', bin_size=40000, full_matrix=False, start_coord=50000000, end_coord=54000000, species='hg38', my_dpi=1000, plot_histogram=True)

Instead of the filename, the enrichment contact matrix generated above can be passed as well:

>>> plot_chromosome_enrich_data(enrich_normalized_chr6, a_chr='6', bin_size=40000, full_matrix=False, start_coord=50000000, end_coord=54000000, species='hg38', my_dpi=1000, plot_histogram=True)

Result figures

This result figure shows the log 2 of the enrichment normalized contact matrix of chromosome 6 from 50 Mb to 54 Mb with a bin size of 40 kb and the respective histogram.

heatmap_2 histogram_2

API Documentation

load_matrix(input_file)
Load formatted contact matrix from a txt file and parse it.
Parameters:input_file (txt) – input filename in txt format (generated automatically by the function “save_matrix”).
Returns:array containing the parsed values stored in the input txt file to build a contact matrix.
Return type:numpy array

normalize_chromosome_fend_data(a_chr, bin_size, input_file=’HiC_norm_binning.hdf5’, species=’hg38’, chr_size=0, save_obs=True, save_expect=False)
Normalize the contact data by calculating the corrected reads count for each bin. Observed data, expected fend data and normalized fend data are saved into txt files.
Parameters:
  • a_chr (str) – chromosome number (example for chromosome 1: ‘1’).
  • bin_size (int) – bin size in bp of the contact matrix.
  • input_file (str) – object containing learned correction parameters in hdf5 format obtained with HiCtool_hifive.py.
  • species (str) – ‘hg38’ or ‘mm10’ or any other species label in string format.
  • chr_size (int) – chromosome size of your custom species if you did not use ‘hg38’ or ‘mm10’.
  • save_obs (bool) – if true, save the observed contact data.
  • save_expect (bool) – if true, save the expected contact data.
Returns:

normalized fend contact matrix.

Return type:

numpy array


normalize_chromosome_enrich_data(a_chr, bin_size, input_file=’HiC_norm_binning.hdf5’, species=’hg38’, chr_size=0, save_obs=True, save_expect=False)
Calculate the enrichment data as “observed/expected” where the expected reads count is for each bin considering the linear distance between read pairs and the learned correction parameters. Observed, expected and enrichment contact data are saved to txt files.
Parameters:
  • a_chr (str) – chromosome number (example for chromosome 1: ‘1’).
  • bin_size (int) – bin size in bp of the contact matrix.
  • input_file (str) – object containing learned correction parameters in hdf5 format obtained with HiCtool_hifive.py.
  • species (str) – ‘hg38’ or ‘mm10’ or any other species label in string format.
  • chr_size (int) – chromosome size of your custom species if you did not use ‘hg38’ or ‘mm10’.
  • save_obs (bool) – if true, save the observed contact data.
  • save_expect (bool) – if true, save the expected contact data.
Returns:

normalized enrichment contact matrix.

Return type:

numpy array


plot_chromosome_data(contact_matrix, a_chr, bin_size, full_matrix=True, start_coord=0, end_coord=0, species=’hg38’, data_type=’normalized_fend’, chr_size=0, my_colormap=[‘white’, ‘red’], cutoff_type=’percentile’, cutoff=95, max_color=’#460000’, plot_histogram=False)
Plot a contact map and histogram of the contact distribution for observed data, normalized fend data, expected fend and enrichment data.
Parameters:
  • contact_matrix – txt file of the contact matrix generated with the function “normalize_chromosome_fend_data” or contact matrix returned by “normalize_chromosome_fend_data”.
  • a_chr (str) – chromosome number (example for chromosome 1: ‘1’).
  • bin_size (int) – bin size of the contact matrix in bp.
  • full_matrix (bool) – if True plot the entire matrix. If False, insert start_coord and end_coord.
  • start_coord (int) – start coordinate for the plot in bp.
  • end_coord (int) – end coordinate for the plot in bp.
  • species (str) – ‘hg38’ or ‘mm10’ or any other species label in string format.
  • data_type (str) – type of data to plot either “observed”, “normalized_fend”, “expected_fend”, “expected_enrich”.
  • chr_size (int) – chromosome size of your custom species if you did not use ‘hg38’ or ‘mm10’.
  • my_colormap – colormap to be used to plot the data. 1) Use a string if choose among any colorbar here https://matplotlib.org/examples/color/colormaps_reference.html 2) Use a list of strings with colors if you want a custom colorbar. Example: [‘white’, ‘red’, ‘black’]. Colors can be specified also in this format: ‘#000000’.
  • cutoff_type (str) – to select a type of cutoff (‘percentile’ or ‘contact_number’) or plot the full range of the data (set the parameter as ‘None’).
  • cutoff (int) – percentile to set a maximum cutoff on the number of contacts for the colorbar.
  • max_color (str) – to set the color of the bins with contact counts over “cutoff”.
  • my_dpi (int) – resolution of the contact map in dpi.
  • plot_histogram (bool) – if true, plot and save to file the histogram of the contact distribution.

plot_chromosome_enrich_data(contact_matrix, a_chr, bin_size, full_matrix=True, start_coord=0, end_coord=0, species=’hg38’, chr_size=0, plot_histogram=False)
Plot the log2 of the “observed over expected” contact map and histogram of the enrichment values distribution generated with the function “normalize_chromosome_enrich_data”.
Parameters:
  • contact_matrix – txt file of the “observed / expected” contact matrix generated with the function “normalize_chromosome_enrich_data” or contact matrix returned by the function “normalize_chromosome_enrich_data”.
  • a_chr (str) – chromosome number (example for chromosome 1: ‘1’).
  • bin_size (int) – bin size of the contact matrix in bp.
  • full_matrix (bool) – if True plot the entire matrix. If False, insert start_coord and end_coord.
  • start_coord (int) – start coordinate for the plot in bp.
  • end_coord (int) – end coordinate for the plot in bp.
  • species (str) – ‘hg38’ or ‘mm10’ or any other species label in string format.
  • chr_size (int) – chromosome size of your custom species if you did not use ‘hg38’ or ‘mm10’.
  • my_dpi (int) – resolution of the contact map in dpi.
  • plot_histogram (bool) – if true, plot and save to file the histogram of the “observed / expected” contact distribution.

normalize_chromosome_fend_data_parallel(a_chr)
Normalize the contact data by calculating the corrected reads count for each bin. Observed data, expected fend data and normalized fend data are saved into txt files. Not to be used as standalone, but with the script “HiCtool_normalize_fend_parallel.py”.
Parameters:a_chr (str) – chromosome number (example for chromosome 1: ‘1’).

normalize_chromosome_enrich_data_parallel(a_chr)
Calculate the enrichment data as observed/expected where the expected reads count is for each bin considering the distance between fends and the learned correction parameters. Observed, expected and enrichment contact data are saved to txt files. Not to be used as standalone, but with the script “HiCtool_normalize_enrich_parallel.py”.
Parameters:a_chr (str) – chromosome number (example for chromosome 1: ‘1’).