Finding issues in consensus sequences#

All functions in this notebook will be available after version 0.1.12!

After building the consensus sequences using Verkko, some assembly errors may remain, such as switching errors. Some of these errors can be corrected by revisiting the graph and identifying incorrect paths. However, the difference in data formats between the final consensus FASTA and the graph makes this step challenging. The final consensus sequence is in FASTA format and is not homopolymer-compressed(HPC), unlike the graph. In this Jupyter notebook, we provide a way to convert the coordinates of a region of interest from the FASTA sequence to the corresponding nodes in the graph, and vice versa.

%load_ext autoreload
%autoreload 2

import sys 
import importlib
import pandas as pd
import time
import os
from IPython.display import Image, display
pd.set_option('mode.chained_assignment', None)
import matplotlib.pyplot as plt
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)
import warnings
import session_info
# Suppress FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import sys
import tqdm
from importlib import reload
print(os.getcwd())
sys.path.append('/data/Phillippy/projects/giraffeT2T/assembly/script/verkko-fillet/src')
import verkkofillet as vf
vf = reload(vf)
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
/path/to/your/folder/script/test_notebooks/basics
verkko-fillet version:0.1.12
verkko-fillet version:0.1.12
import verkkofillet as vf

Trace back from un-HPC sequence to corresponding nodes in HPC graph#

display(Image(filename="../image/igv_snapshot.png",width=1200))

Here is an example on chr6_pat in Giraffe, shown in the middle of the IGV screenshot. Both alignment patterns have issues. Let’s identify which nodes are involved in this region and check if it is one of the regions where we filled a gap.

We use a BED file containing all the intersected issues from both HiFi and ONT alignments. For more details, you can check marbl training material. Additionally, we will add padding regions at both ends of each region in the BED file.

# issue region
bed_file = "/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/issues/Hifi_ONT.cat.sorted.merged.both.bed"
fai = "/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet/assembly_trimmed_flipped_rename_sortedhap.fasta.fai"

vf.tl.addPadding_to_bed(bed_file , fai, force = True)
Force mode is enabled. Existing output files will be overwritten.
Saving padded BED file to /path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/issues/Hifi_ONT.cat.sorted.merged.both.pad_10000.bed, with adding 10000 padding to the start and end positions.

Next, we will generate a JSON map file linking the coordinates between the un-HPC FASTA and the HPC FASTA. This process is performed using the following command and may take a long time to complete:

# input 
uncomp_fasta = "/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet/assembly_trimmed_flipped_rename_sortedhap.fasta"

# output
mapJson_file = "/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/findNodes/map.json"

vf.tl.build_sparse_compression_map(uncomp_fasta, mapJson_file=mapJson_file)

build_sparse_compression_map will generate a map.json file, which will be used in the next step to liftover the regions. By default, the output file name has the same name as the input FASTA file with the suffix .map.json. If you want to specify a custom file name, you can do so manually. For the BED file, this step may take a long time, so it is recommended to filter for regions of interest. Here, we have filtered only the primary contigs.

# input
uncomp_bed= "/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/issues/Hifi_ONT.cat.sorted.merged.both.pad_10000.filtered.bed"

# output
comp_bed= "/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/issues/Hifi_ONT.cat.sorted.merged.both.pad_10000.filtered.comp.bed"

vf.tl.lift_seqs(uncomp_fasta, mapJson_file, uncomp_bed)

After lifting over to HPC coordinates, the regions in the BED file can be processed using vf.pp.bed_to_regionList function. This function returns a list of regions, where The list follows the format: ‘chr:start-end’

regions_list = vf.pp.bed_to_regionsList(comp_bed)
regions_list[0:3]
['chr10_pat:11151143-11173573',
 'chr10_pat:31856658-31882613',
 'chr10_pat:109598646-109618946']
gaf_file = "/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/graphaligner_assembly_dip_with_cigar.gaf"
graph_file = "/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_gapFilling/assembly.homopolymer-compressed.noseq.gfa"

finalnodes = vf.pp.getNodes_from_unHPCregion(gaf_file, graph_file, regions_list)
Finding nodes for regions: 100%|██████████| 33/33 [00:03<00:00,  8.25it/s]
finalnodes
region nodes
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)

In the example above, the region chr6_pat:54815981-54929325 is matched with [utig4-2329, utig4-2651]. We can observe that this region is one of the manually gap-filled regions where we introduced new nodes to the graph to connect unconnected nodes using ONT reads.

display(Image(filename="../image/graphchr6.png",width=500))

Now, you can save the finalnodes DataFrame locally to use it in the next step. Alternatively, we can use the vf.tl.make_bandage_csv function to create a CSV file that can be used in Bandage to visualize the final nodes. The final file will have the following columns:

  • node: Node name

  • uncomp: Position of the node in the uncompressed FASTA

  • comp: Position of the node in the compressed FASTA

vf.tl.make_bandage_csv(finalnodes, uncomp_bed, comp_bed, force = True)
Force mode is enabled. Existing output files will be overwritten.
writing to /path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/issues/Hifi_ONT.cat.sorted.merged.both.pad_10000.filtered.comp.csv

Find Specific regions on nodes#

There may be cases where a clear haplotype switch error occurs in the middle of a large node rather than at the junction of nodes.

If you need to locate a specific region within a node, you can use the following code to identify the region.

assign haplotype to each reads using merly#

file = '/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/hifi.read.meryl.hapAssign.txt'
hifi_read = vf.pp.read_hapAssignRead(file)
hifi_read.head()
read len_read hap1_total_kmer hap1_found_kmer hap2_total_kmer hap2_found_kmer hap_read
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)

Read layout and scfmap for each node#

mainDir = '/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic'

obj = vf.pp.read_Verkko(mainDir)
os.chdir(mainDir)
The Verkko fillet target directory already exists: /path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_verkko_fillet
If you didn't mean this, please set another directory or for overwirting, please use force= True
Lock the original Verkko folder to prevent it from being modified.
[lock_original_folder] Command executed successfully!
Path file loading...from /path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_verkko_fillet/assembly.paths.tsv
Path file loaded successfully.
scfmap file not found or path is None.
nodeinfo, node_layout,node_layout_idx= vf.pp.readNodeInfo()

print(nodeinfo.head())

print(node_layout.head())

print(node_layout_idx.head())
         node    mat    pat    mat:pat    color hap_node      len        piece
0     utig4-0      0      0        0:0  #AAAAAA   lowcov    91826  piece000001
1     utig4-1  59053    342  59053:342  #FF8888      mat  1586764  piece000002
2    utig4-10      0    527      0:527  #8888FF      pat    19952  piece000011
3   utig4-100    627      0      627:0  #FF8888      mat   393691  piece000101
4  utig4-1000      1  14336    1:14336  #8888FF      pat   413025  piece001001
                                         layout_info
0                                   tig\tpiece000001
1                                        len\t205259
2                                             trm\t0
3                                           rds\t860
4  3a15d69c-9360-42ca-9e19-042b77c18780;b25fd07b-...
   index        piece
0      0  piece000001
1    865  piece000002
2   6585  piece000003
3  12296  piece000004
4  24709  piece000005
gaf = '/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/graphaligner_assembly_dip_with_cigar.gaf'
graph = "assembly.homopolymer-compressed.noseq.gfa"
region = ['chr10_pat:11151143-11173573', 'chr10_mat:1943373-10400000']

regions_node_db, regions_node_coor = vf.pp.getNodes_from_unHPCregion(gaf, graph, region)

print(regions_node_db)

print(regions_node_coor)
Finding nodes for regions: 100%|██████████| 2/2 [00:00<00:00,  4.70it/s]
                        region                                 nodes
0  chr10_pat:11151143-11173573                          [utig4-2061]
1   chr10_mat:1943373-10400000  [utig4-1795, utig4-2059, utig4-2060]
         node  start_coor_path_comp  end_coor_path_comp  len_node  \
0  utig4-2061              10165055            11264668   1099613   
1  utig4-1795                  7607            10089484  10081877   
2  utig4-2059              10087816            10246152    158336   
3  utig4-2060              10244708            11344067   1099359   

   start_coor_on_node  end_coor_on_node                       region  
0              982859           1099613  chr10_pat:11151143-11173573  
1             1936147            310897   chr10_mat:1943373-10400000  
2                   0            154229   chr10_mat:1943373-10400000  
3                   0           1099359   chr10_mat:1943373-10400000  

loc_on_node, mergedb_all = vf.pp.getNodeCoor(obj, regions_node_db, hifi_read, node_layout, node_layout_idx, nodeinfo, regions_node_coor)

print(loc_on_node)

print(mergedb_all)
regions of interest: 100%|██████████| 2/2 [00:32<00:00, 16.44s/it]
         node                       region  \
0  utig4-2061  chr10_pat:11151143-11173573   
0  utig4-1795   chr10_mat:1943373-10400000   
0  utig4-2059   chr10_mat:1943373-10400000   
0  utig4-2060   chr10_mat:1943373-10400000   

                                       contig    start      end hap_node  
0  sire_compressed.k31.hapmer_from_utig4-1316   982859  1099613      pat  
0   dam_compressed.k31.hapmer_from_utig4-1315  1936147   310897      mat  
0   dam_compressed.k31.hapmer_from_utig4-1315        0   154229   lowcov  
0   dam_compressed.k31.hapmer_from_utig4-1315        0  1099359      mat  
                                                   read  len_read  \
0     44af7e69-5ecd-4c1f-8003-261aab21c883;a74bfbd1-...   79710.0   
1                 m84124_231207_000930_s3/124455843/ccs   34289.0   
2     bc98aa1a-5d1f-4dfd-93c8-2fbda885000f;94cfbd16-...  123230.0   
3                 m84124_230818_165616_s1/210177320/ccs   20133.0   
4                  m84124_230828_171016_s1/52172300/ccs   22835.0   
...                                                 ...       ...   
4018              m84124_230828_171016_s1/197921269/ccs   19332.0   
4019              m84124_231207_004036_s4/206047197/ccs   20605.0   
4020               m84124_230828_171016_s1/69405515/ccs    7034.0   
4021  495e0f16-df33-4efe-abff-b1164e25dfb9;bc3a472c-...   28173.0   
4022              m84124_230828_171016_s1/195563325/ccs   20649.0   

      hap1_total_kmer  hap1_found_kmer  hap2_total_kmer  hap2_found_kmer  \
0          37089377.0              0.0       36898822.0            554.0   
1          37089377.0              0.0       36898822.0            139.0   
2          37089377.0              0.0       36898822.0           4640.0   
3          37089377.0              0.0       36898822.0             93.0   
4          37089377.0              2.0       36898822.0             93.0   
...               ...              ...              ...              ...   
4018       37089377.0              0.0       36898822.0              0.0   
4019       37089377.0             28.0       36898822.0              0.0   
4020       37089377.0             31.0       36898822.0              0.0   
4021       37089377.0             31.0       36898822.0              0.0   
4022       37089377.0              6.0       36898822.0              0.0   

        hap_read  start_on_node  end_on_node platform        node  \
0           hap2              0        57146      ont  utig4-2061   
1           hap2          10731        35512      ccs  utig4-2061   
2           hap2          13716       101954      ont  utig4-2061   
3           hap2          14837        29390      ccs  utig4-2061   
4           hap2          15467        31963      ccs  utig4-2061   
...          ...            ...          ...      ...         ...   
4018  Unassigned        1142793      1129187      ccs  utig4-2060   
4019        hap1        1144888      1130374      ccs  utig4-2060   
4020        hap1        1145136      1140164      ccs  utig4-2060   
4021        hap1        1145667      1125855      ont  utig4-2060   
4022        hap1        1148218      1133601      ccs  utig4-2060   

                                          contig hap_node  mid_on_node  
0     sire_compressed.k31.hapmer_from_utig4-1316      pat        28573  
1     sire_compressed.k31.hapmer_from_utig4-1316      pat        23121  
2     sire_compressed.k31.hapmer_from_utig4-1316      pat        57835  
3     sire_compressed.k31.hapmer_from_utig4-1316      pat        22113  
4     sire_compressed.k31.hapmer_from_utig4-1316      pat        23715  
...                                          ...      ...          ...  
4018   dam_compressed.k31.hapmer_from_utig4-1315      mat      1135990  
4019   dam_compressed.k31.hapmer_from_utig4-1315      mat      1137631  
4020   dam_compressed.k31.hapmer_from_utig4-1315      mat      1142650  
4021   dam_compressed.k31.hapmer_from_utig4-1315      mat      1135761  
4022   dam_compressed.k31.hapmer_from_utig4-1315      mat      1140909  

[44318 rows x 14 columns]

hap_dict = dict({'hap1': 'mat',
                 'hap2': 'pat',
                 'Unassigned' : 'Unassigned'})
mergedb_all['hap_read'] = mergedb_all['hap_read'].map(hap_dict)
mergedb_all.groupby('hap_read').size()
0
hap_read
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)
node = 'utig4-2061'
vf.pl.plotHist_readOnNode(nodeinfo, mergedb_all, loc_on_node, node, hue = "hap_read", 
                          x = "start_on_node",
                          multiple = 'stack',
                          )
../../_images/fed66b7d8a0f68e2c7f3f0df7dc37a21b13c2d36cf7d086fc1c09a4daf824bcf.png
node = 'utig4-2060'
vf.pl.plotHist_readOnNode(nodeinfo, mergedb_all, loc_on_node, node, hue = "hap_read", 
                          x = "start_on_node",
                          # x = "mid_on_node",
                          multiple = 'stack',
                          )
../../_images/3be7c7215793599f9ac1dbc4cc7dd9d9dc4976c59b4394b3be81f395e4fd3adb.png
vf.pl.plotHist_readOnNode(nodeinfo, mergedb_all, loc_on_node, node, hue = "hap_read", x = "mid_on_node", multiple = 'fill')
../../_images/d12159b2b801d1876168441930f1dacd98f733c61a5b97efccc07fb6ac8b9756.png
vf.pl.plotHist_readOnNode(nodeinfo, mergedb_all, loc_on_node, node, hue = "hap_read", x = "start_on_node")
../../_images/09915e381d167d97769f38f69f477224c746b1a597b0b6d9ae98de0a9da41223.png
vf.pl.plotHist_readOnNode(nodeinfo, mergedb_all, loc_on_node, node, hue = "hap_read", x = "end_on_node")
../../_images/6d3a7b6092fc783572e52f28301f125b7d65799274c84b98149a0502c1cbcf31.png
exc_read_db = mergedb_all.copy()
exc_read_db['exc'] = "Y"
exc_read_db.loc[exc_read_db['hap_read'] == exc_read_db['hap_node'], 'exc'] = "N"
exc_read_db.loc[exc_read_db['hap_read'] == 'Unassigned', 'exc'] = "N"
exc_read_db = exc_read_db[exc_read_db['exc'] == 'Y']
exc_read_db.to_csv(f"../exc_read_db.csv", index=False)
obj = vf.pp.findGaps(obj)
45 gaps were found -> obj.gaps
segment, link = vf.pp.readGraph('assembly.homopolymer-compressed.noseq.gfa')
node_space = vf.pp.getNodeSpace_from_allPath(obj, segment, link, file = '../comp_node.bed')
100%|██████████| 366/366 [00:05<00:00, 67.60it/s] 
sub = node_space.loc[node_space['chrom'] == 'dam_compressed.k31.hapmer_from_utig4-1003',:]
sub['nodeLen'] = sub['chromEnd'] - sub['chromStart']
sub
chrom chromStart chromEnd name score strand nodeLen
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)
sub = sub.merge(nodeinfo , left_on = 'name', right_on = 'node', how = 'left')
sub.loc[sub['nodeLen'] != sub['len'], :]
chrom chromStart chromEnd name score strand nodeLen node mat pat mat:pat color hap_node len piece
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)
node_space.groupby('chrom').max('chromEnd')
chromStart chromEnd score
chrom
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)