Finding issues in consensus sequences#
All functions in this notebook will be available after version 0.1.12!
After building the consensus sequences using Verkko, some assembly errors may remain, such as switching errors. Some of these errors can be corrected by revisiting the graph and identifying incorrect paths. However, the difference in data formats between the final consensus FASTA and the graph makes this step challenging. The final consensus sequence is in FASTA format and is not homopolymer-compressed(HPC), unlike the graph. In this Jupyter notebook, we provide a way to convert the coordinates of a region of interest from the FASTA sequence to the corresponding nodes in the graph, and vice versa.
%load_ext autoreload
%autoreload 2
import sys
import importlib
import pandas as pd
import time
import os
from IPython.display import Image, display
pd.set_option('mode.chained_assignment', None)
import matplotlib.pyplot as plt
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)
import warnings
import session_info
# Suppress FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import sys
import tqdm
from importlib import reload
print(os.getcwd())
sys.path.append('/data/Phillippy/projects/giraffeT2T/assembly/script/verkko-fillet/src')
import verkkofillet as vf
vf = reload(vf)
The autoreload extension is already loaded. To reload it, use:
%reload_ext autoreload
/path/to/your/folder/script/test_notebooks/basics
verkko-fillet version:0.1.12
verkko-fillet version:0.1.12
import verkkofillet as vf
Trace back from un-HPC sequence to corresponding nodes in HPC graph#
Here is an example on chr6_pat in Giraffe, shown in the middle of the IGV screenshot. Both alignment patterns have issues. Let’s identify which nodes are involved in this region and check if it is one of the regions where we filled a gap.
We use a BED file containing all the intersected issues from both HiFi and ONT alignments. For more details, you can check marbl training material. Additionally, we will add padding regions at both ends of each region in the BED file.
# issue region
bed_file = "/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/issues/Hifi_ONT.cat.sorted.merged.both.bed"
fai = "/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet/assembly_trimmed_flipped_rename_sortedhap.fasta.fai"
vf.tl.addPadding_to_bed(bed_file , fai, force = True)
Force mode is enabled. Existing output files will be overwritten.
Saving padded BED file to /path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/issues/Hifi_ONT.cat.sorted.merged.both.pad_10000.bed, with adding 10000 padding to the start and end positions.
Next, we will generate a JSON map file linking the coordinates between the un-HPC FASTA and the HPC FASTA. This process is performed using the following command and may take a long time to complete:
# input
uncomp_fasta = "/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet/assembly_trimmed_flipped_rename_sortedhap.fasta"
# output
mapJson_file = "/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/findNodes/map.json"
vf.tl.build_sparse_compression_map(uncomp_fasta, mapJson_file=mapJson_file)
build_sparse_compression_map will generate a map.json file, which will be used in the next step to liftover the regions.
By default, the output file name has the same name as the input FASTA file with the suffix .map.json. If you want to specify a custom file name, you can do so manually.
For the BED file, this step may take a long time, so it is recommended to filter for regions of interest. Here, we have filtered only the primary contigs.
# input
uncomp_bed= "/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/issues/Hifi_ONT.cat.sorted.merged.both.pad_10000.filtered.bed"
# output
comp_bed= "/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/issues/Hifi_ONT.cat.sorted.merged.both.pad_10000.filtered.comp.bed"
vf.tl.lift_seqs(uncomp_fasta, mapJson_file, uncomp_bed)
After lifting over to HPC coordinates, the regions in the BED file can be processed using vf.pp.bed_to_regionList function.
This function returns a list of regions, where The list follows the format: ‘chr:start-end’
regions_list = vf.pp.bed_to_regionsList(comp_bed)
regions_list[0:3]
['chr10_pat:11151143-11173573',
'chr10_pat:31856658-31882613',
'chr10_pat:109598646-109618946']
gaf_file = "/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/graphaligner_assembly_dip_with_cigar.gaf"
graph_file = "/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_gapFilling/assembly.homopolymer-compressed.noseq.gfa"
finalnodes = vf.pp.getNodes_from_unHPCregion(gaf_file, graph_file, regions_list)
Finding nodes for regions: 100%|██████████| 33/33 [00:03<00:00, 8.25it/s]
finalnodes
| region | nodes |
|---|---|
Loading ITables v2.2.4 from the init_notebook_mode cell...
(need help?) |
In the example above, the region chr6_pat:54815981-54929325 is matched with [utig4-2329, utig4-2651]. We can observe that this region is one of the manually gap-filled regions where we introduced new nodes to the graph to connect unconnected nodes using ONT reads.
Now, you can save the finalnodes DataFrame locally to use it in the next step.
Alternatively, we can use the vf.tl.make_bandage_csv function to create a CSV file that can be used in Bandage to visualize the final nodes.
The final file will have the following columns:
node: Node nameuncomp: Position of the node in the uncompressed FASTAcomp: Position of the node in the compressed FASTA
vf.tl.make_bandage_csv(finalnodes, uncomp_bed, comp_bed, force = True)
Force mode is enabled. Existing output files will be overwritten.
writing to /path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/issues/Hifi_ONT.cat.sorted.merged.both.pad_10000.filtered.comp.csv
Find Specific regions on nodes#
There may be cases where a clear haplotype switch error occurs in the middle of a large node rather than at the junction of nodes.
If you need to locate a specific region within a node, you can use the following code to identify the region.
assign haplotype to each reads using merly#
file = '/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/hifi.read.meryl.hapAssign.txt'
hifi_read = vf.pp.read_hapAssignRead(file)
hifi_read.head()
| read | len_read | hap1_total_kmer | hap1_found_kmer | hap2_total_kmer | hap2_found_kmer | hap_read |
|---|---|---|---|---|---|---|
Loading ITables v2.2.4 from the init_notebook_mode cell...
(need help?) |
Read layout and scfmap for each node#
mainDir = '/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic'
obj = vf.pp.read_Verkko(mainDir)
os.chdir(mainDir)
The Verkko fillet target directory already exists: /path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_verkko_fillet
If you didn't mean this, please set another directory or for overwirting, please use force= True
Lock the original Verkko folder to prevent it from being modified.
[lock_original_folder] Command executed successfully!
Path file loading...from /path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_verkko_fillet/assembly.paths.tsv
Path file loaded successfully.
scfmap file not found or path is None.
nodeinfo, node_layout,node_layout_idx= vf.pp.readNodeInfo()
print(nodeinfo.head())
print(node_layout.head())
print(node_layout_idx.head())
node mat pat mat:pat color hap_node len piece
0 utig4-0 0 0 0:0 #AAAAAA lowcov 91826 piece000001
1 utig4-1 59053 342 59053:342 #FF8888 mat 1586764 piece000002
2 utig4-10 0 527 0:527 #8888FF pat 19952 piece000011
3 utig4-100 627 0 627:0 #FF8888 mat 393691 piece000101
4 utig4-1000 1 14336 1:14336 #8888FF pat 413025 piece001001
layout_info
0 tig\tpiece000001
1 len\t205259
2 trm\t0
3 rds\t860
4 3a15d69c-9360-42ca-9e19-042b77c18780;b25fd07b-...
index piece
0 0 piece000001
1 865 piece000002
2 6585 piece000003
3 12296 piece000004
4 24709 piece000005
gaf = '/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet_polishing/graphaligner_assembly_dip_with_cigar.gaf'
graph = "assembly.homopolymer-compressed.noseq.gfa"
region = ['chr10_pat:11151143-11173573', 'chr10_mat:1943373-10400000']
regions_node_db, regions_node_coor = vf.pp.getNodes_from_unHPCregion(gaf, graph, region)
print(regions_node_db)
print(regions_node_coor)
Finding nodes for regions: 100%|██████████| 2/2 [00:00<00:00, 4.70it/s]
region nodes
0 chr10_pat:11151143-11173573 [utig4-2061]
1 chr10_mat:1943373-10400000 [utig4-1795, utig4-2059, utig4-2060]
node start_coor_path_comp end_coor_path_comp len_node \
0 utig4-2061 10165055 11264668 1099613
1 utig4-1795 7607 10089484 10081877
2 utig4-2059 10087816 10246152 158336
3 utig4-2060 10244708 11344067 1099359
start_coor_on_node end_coor_on_node region
0 982859 1099613 chr10_pat:11151143-11173573
1 1936147 310897 chr10_mat:1943373-10400000
2 0 154229 chr10_mat:1943373-10400000
3 0 1099359 chr10_mat:1943373-10400000
loc_on_node, mergedb_all = vf.pp.getNodeCoor(obj, regions_node_db, hifi_read, node_layout, node_layout_idx, nodeinfo, regions_node_coor)
print(loc_on_node)
print(mergedb_all)
regions of interest: 100%|██████████| 2/2 [00:32<00:00, 16.44s/it]
node region \
0 utig4-2061 chr10_pat:11151143-11173573
0 utig4-1795 chr10_mat:1943373-10400000
0 utig4-2059 chr10_mat:1943373-10400000
0 utig4-2060 chr10_mat:1943373-10400000
contig start end hap_node
0 sire_compressed.k31.hapmer_from_utig4-1316 982859 1099613 pat
0 dam_compressed.k31.hapmer_from_utig4-1315 1936147 310897 mat
0 dam_compressed.k31.hapmer_from_utig4-1315 0 154229 lowcov
0 dam_compressed.k31.hapmer_from_utig4-1315 0 1099359 mat
read len_read \
0 44af7e69-5ecd-4c1f-8003-261aab21c883;a74bfbd1-... 79710.0
1 m84124_231207_000930_s3/124455843/ccs 34289.0
2 bc98aa1a-5d1f-4dfd-93c8-2fbda885000f;94cfbd16-... 123230.0
3 m84124_230818_165616_s1/210177320/ccs 20133.0
4 m84124_230828_171016_s1/52172300/ccs 22835.0
... ... ...
4018 m84124_230828_171016_s1/197921269/ccs 19332.0
4019 m84124_231207_004036_s4/206047197/ccs 20605.0
4020 m84124_230828_171016_s1/69405515/ccs 7034.0
4021 495e0f16-df33-4efe-abff-b1164e25dfb9;bc3a472c-... 28173.0
4022 m84124_230828_171016_s1/195563325/ccs 20649.0
hap1_total_kmer hap1_found_kmer hap2_total_kmer hap2_found_kmer \
0 37089377.0 0.0 36898822.0 554.0
1 37089377.0 0.0 36898822.0 139.0
2 37089377.0 0.0 36898822.0 4640.0
3 37089377.0 0.0 36898822.0 93.0
4 37089377.0 2.0 36898822.0 93.0
... ... ... ... ...
4018 37089377.0 0.0 36898822.0 0.0
4019 37089377.0 28.0 36898822.0 0.0
4020 37089377.0 31.0 36898822.0 0.0
4021 37089377.0 31.0 36898822.0 0.0
4022 37089377.0 6.0 36898822.0 0.0
hap_read start_on_node end_on_node platform node \
0 hap2 0 57146 ont utig4-2061
1 hap2 10731 35512 ccs utig4-2061
2 hap2 13716 101954 ont utig4-2061
3 hap2 14837 29390 ccs utig4-2061
4 hap2 15467 31963 ccs utig4-2061
... ... ... ... ... ...
4018 Unassigned 1142793 1129187 ccs utig4-2060
4019 hap1 1144888 1130374 ccs utig4-2060
4020 hap1 1145136 1140164 ccs utig4-2060
4021 hap1 1145667 1125855 ont utig4-2060
4022 hap1 1148218 1133601 ccs utig4-2060
contig hap_node mid_on_node
0 sire_compressed.k31.hapmer_from_utig4-1316 pat 28573
1 sire_compressed.k31.hapmer_from_utig4-1316 pat 23121
2 sire_compressed.k31.hapmer_from_utig4-1316 pat 57835
3 sire_compressed.k31.hapmer_from_utig4-1316 pat 22113
4 sire_compressed.k31.hapmer_from_utig4-1316 pat 23715
... ... ... ...
4018 dam_compressed.k31.hapmer_from_utig4-1315 mat 1135990
4019 dam_compressed.k31.hapmer_from_utig4-1315 mat 1137631
4020 dam_compressed.k31.hapmer_from_utig4-1315 mat 1142650
4021 dam_compressed.k31.hapmer_from_utig4-1315 mat 1135761
4022 dam_compressed.k31.hapmer_from_utig4-1315 mat 1140909
[44318 rows x 14 columns]
hap_dict = dict({'hap1': 'mat',
'hap2': 'pat',
'Unassigned' : 'Unassigned'})
mergedb_all['hap_read'] = mergedb_all['hap_read'].map(hap_dict)
mergedb_all.groupby('hap_read').size()
| 0 | |
|---|---|
| hap_read | |
Loading ITables v2.2.4 from the init_notebook_mode cell...
(need help?) |
node = 'utig4-2061'
vf.pl.plotHist_readOnNode(nodeinfo, mergedb_all, loc_on_node, node, hue = "hap_read",
x = "start_on_node",
multiple = 'stack',
)
node = 'utig4-2060'
vf.pl.plotHist_readOnNode(nodeinfo, mergedb_all, loc_on_node, node, hue = "hap_read",
x = "start_on_node",
# x = "mid_on_node",
multiple = 'stack',
)
vf.pl.plotHist_readOnNode(nodeinfo, mergedb_all, loc_on_node, node, hue = "hap_read", x = "mid_on_node", multiple = 'fill')
vf.pl.plotHist_readOnNode(nodeinfo, mergedb_all, loc_on_node, node, hue = "hap_read", x = "start_on_node")
vf.pl.plotHist_readOnNode(nodeinfo, mergedb_all, loc_on_node, node, hue = "hap_read", x = "end_on_node")
exc_read_db = mergedb_all.copy()
exc_read_db['exc'] = "Y"
exc_read_db.loc[exc_read_db['hap_read'] == exc_read_db['hap_node'], 'exc'] = "N"
exc_read_db.loc[exc_read_db['hap_read'] == 'Unassigned', 'exc'] = "N"
exc_read_db = exc_read_db[exc_read_db['exc'] == 'Y']
exc_read_db.to_csv(f"../exc_read_db.csv", index=False)
obj = vf.pp.findGaps(obj)
45 gaps were found -> obj.gaps
segment, link = vf.pp.readGraph('assembly.homopolymer-compressed.noseq.gfa')
node_space = vf.pp.getNodeSpace_from_allPath(obj, segment, link, file = '../comp_node.bed')
100%|██████████| 366/366 [00:05<00:00, 67.60it/s]
sub = node_space.loc[node_space['chrom'] == 'dam_compressed.k31.hapmer_from_utig4-1003',:]
sub['nodeLen'] = sub['chromEnd'] - sub['chromStart']
sub
| chrom | chromStart | chromEnd | name | score | strand | nodeLen | |
|---|---|---|---|---|---|---|---|
Loading ITables v2.2.4 from the init_notebook_mode cell...
(need help?) |
sub = sub.merge(nodeinfo , left_on = 'name', right_on = 'node', how = 'left')
sub.loc[sub['nodeLen'] != sub['len'], :]
| chrom | chromStart | chromEnd | name | score | strand | nodeLen | node | mat | pat | mat:pat | color | hap_node | len | piece | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Loading ITables v2.2.4 from the init_notebook_mode cell...
(need help?) |
node_space.groupby('chrom').max('chromEnd')
| chromStart | chromEnd | score | |
|---|---|---|---|
| chrom | |||
Loading ITables v2.2.4 from the init_notebook_mode cell...
(need help?) |