Final touch to the assembly before polishing#

This Jupyter notebook will guide you on how to create a cleaner assembly after rerunning Verkko to fill some gaps and before proceeding to the polishing step. This includes contig trimming, chromosome flipping, and chromosome renaming.

%load_ext autoreload
%autoreload 2
import sys 
import importlib
import pandas as pd
import time
import os
from IPython.display import Image, display
pd.set_option('mode.chained_assignment', None)
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)
import warnings
import session_info
# Suppress FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import verkkofillet as vf
# importlib.reload(vf)
verkkoFillet version 0.1.0
verkkoFillet version 0.1.0
<module 'verkkoFillet' from '/path/to/your/folder/script/post_verkko_pkg/src/verkkoFillet/__init__.py'>

Load data#

verkkoDir="/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0"
obj = vf.pp.read_Verkko(verkkoDir)
The Verkko fillet target directory already exists: /path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet
If you didn't mean this, please set another directory or for overwirting, please use force= True
Lock the original Verkko folder to prevent it from being modified.
[lock_original_folder] Command executed successfully!
Path file loading...from /path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet/assembly.paths.tsv
Path file loaded successfully.
scfmap file loading...from /path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet/assembly.scfmap
scfmap file loaded successfully.
os.chdir(obj.verkko_fillet_dir)
obj
FilletObj
  verkkoDir: /path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0
  verkko_fillet_dir: /path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.0_verkko_fillet
  paths: name, path, assignment
  history: timestamp, activity
  scfmap: contig, pathName

Make stats#

%%time
vf.tl.getT2T(obj)
getT2T was done!
CPU times: user 2.26 ms, sys: 6.25 ms, total: 8.51 ms
Wall time: 6.86 s
ref_fasta = "/path/to/your/folder/ncbiRef/GCA_017591445.1_ASM1759144v1_genomic.rename.fa"
map_file="/path/to/your/folder/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/chromosome.map"
%%time
ref = "/path/to/your/folder/ncbiRef/GCA_017591445.1_ASM1759144v1_genomic.rename.fa"
vf.tl.chrAssign(obj = obj, ref = ref)
The [assembly.mashmap.out] file is already exists
If you want to re-run this job, please detete [assembly.mashmap.out]
CPU times: user 217 Β΅s, sys: 87 Β΅s, total: 304 Β΅s
Wall time: 391 Β΅s
obj = vf.pp.readChr(obj, map_file, sire = "pat", dam = "mat")
The chromosome infomation was stored in obj.stats
vf.pl.contigPlot(obj)
../../_images/6d2450e3bf859f6968d80916f1c6d7003915084e84fdcc415468cc99ba33705d.png

Trimming#

# vf.tl.run_intra_telo(obj)
file = "assembly.tel_finder.windows.0.5.bed"
vf.pp.find_intra_telo(obj, file=file, loc_from_end=15000)
contig old_chr ref_chr hap start end len_fai
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)
trim_contig_dict = { 'contig' : ['dam_compressed.k31.hapmer-0000002'],
                    'from' : [18800],
                    'to' : [152521440]
}
vf.tl.runTrimming(obj, trim_contig_dict)
Trim Chromosomes:   0%|                         | 1/253 [00:00<02:19,  1.80it/s]
dam_compressed.k31.hapmer-0000002 will be trimmed
samtools faidx assembly.fasta dam_compressed.k31.hapmer-0000002:18800-152521440 | sed -e '1d' | sed -e '1i >dam_compressed.k31.hapmer-0000002'>> assembly_trimmed.fasta
Trim Chromosomes: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 253/253 [00:15<00:00, 16.18it/s]
Trimming completed. Output file: assembly_trimmed.fasta

Flipping chromosome#

If your species already has a reference genome, you can examine the orientation of your assembly. If the orientation is flipped, you can adjust (flip) specific chromosomes as needed. However, it is important to ensure that the previous reference genome was carefully curated for orientation. To confirm this, you can review the methods section in the original publication associated with the released reference genome or other relevant documentation.

output

  • assembly.mashmap.out

  • assembly.homopolymer-compressed.chr.csv

  • unitigs.hpc.mashmap.out

  • translation_hap1

  • translation_hap2

  • chr_completeness_max_hap1

  • chr_completeness_max_hap2

image_path = "/path/to/your/folder/script/post_verkko_pkg/data/test_giraffe/fig/chromosomeFlipping.png"
display(Image(filename=image_path,width=400))
vf.tl.showPairwiseAlign(obj, size="large", minLen = 100000)
[showPairwiseAlign_1] Command executed successfully!
[showPairwiseAlign_2] Command executed successfully!
../../_images/2de070cba23d9f6f0da0cb39047b11c47c2c6a37e4a48eb6d34e73b72529e8b2.png
vf.pl.showMashmapOri(obj , by = 'all')
../../_images/c2b527aba0bafb1420ed9d7194a373cfa5dcffd09647ea5745575e9ca8b5803b.png
filp_contig_list = ['sire_compressed.k31.hapmer-0000229', #chr2_pat
                   'dam_compressed.k31.hapmer-0000003', # chr2_mat
                   'sire_compressed.k31.hapmer-0000242', #chr3_pat
                   'dam_compressed.k31.hapmer-0000015', #chr3_mat
                   'sire_compressed.k31.hapmer-0000232', #chr4_mat
                   'dam_compressed.k31.hapmer-0000006', #chr3_pat
                   'sire_compressed.k31.hapmer-0000230', #chr6_pat
                   'dam_compressed.k31.hapmer-0000004', #chr6_mat
                   'sire_compressed.k31.hapmer-0000238', #chr7_pat
                   'dam_compressed.k31.hapmer-0000012', #chr7_mat
                   'sire_compressed.k31.hapmer-0000239', #chr8_pat
                   'dam_compressed.k31.hapmer-0000013', #chr8_mat
                   'sire_compressed.k31.hapmer-0000240', #chr9_pat
                   'dam_compressed.k31.hapmer-0000014',] #chr9_pat
len(filp_contig_list)
14
obj.stats.loc[obj.stats['contig'].isin(filp_contig_list),]
contig ref_chr contig_len ref_chr_len hap old_chr completeness hap_verkko t2tStat
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)

The vf.tl.flipChr() function flips the original FASTA file using the list of chromosomes, flip_contig_list. The final FASTA file will be generated with the suffix _flipped.fasta.

vf.tl.flipContig(filp_contig_list, 
              ori_fasta="assembly_trimmed.fasta",
              final_fasta="assembly_trimmed_flipped.fasta")
Flipping Chromosomes: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 253/253 [00:58<00:00,  4.31it/s]
The chromosome flipping was completed successfully!
Output FASTA: assembly_trimmed_flipped.fasta

Renaming Primary and Other Small Contigs with Chromosome Information and Haplotype#

We will rename contigs using patterns such as [chr]_[hap] for primary chromosomes and [chr]_[hap]_random_[node] for random contigs, where node represents a representative node in the path. You can trace back the original paths later.

  • chr: chr1, chr2, chr3, … UnChr

  • hap: mat, pat, or Unhap

vf.pp.find_multi_used_node() identifies nodes that are used by more than one chromosome. We will exclude these nodes when clustering nodes by chromosome using link information from the GFA file.

You will get two outputs:

  • duplicate_nodes: A list of nodes used in different chromosomes.

  • node_database: A dataframe with two columns (β€œchromosome” and β€œnodeSet”), containing the set of nodes used exclusively in the corresponding chromosome.

image_path = "/path/to/your/folder/script/post_verkko_pkg/data/test_giraffe/fig/nodeClustering.png"
display(Image(filename=image_path,width=400))
duplicate_nodes, node_database = vf.pp.find_multi_used_node(obj)
# duplicates, node_database = vf.pp.find_multi_used_node(obj)
duplicate_nodes
['utig4-1759', 'utig4-1805', 'utig4-1757']

The vf.pp.naming_contigs() function uses the FAI index of assembly.fasta located in the obj.verkkoDir directory, along with obj.paths and obj.scfmap, to assign the corresponding chromosome and haplotype to non-primary contigs.

chrMap = vf.pp.naming_contigs(obj, node_database, duplicate_nodes)
component_15 : empty
chrMap
contig new_contig_name
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)

If you have made any changes to the chromosome naming, please edit the chrMap directly before using vf.tl.renameContig().

chrMap['new_contig_name'] = chrMap['new_contig_name'].replace('chrX_pat', 'chrY_pat')

This is the final step. Using the dataframe obtained above, we can rename the contigs using vf.tl.renameContig(). The final FASTA file will be generated with the suffix _rename.fasta.

vf.tl.renameContig(obj, chrMap, original_fasta="assembly_trimmed_flipped.fasta")
[chrRename] Command executed successfully!
Final renamed fasta file : assembly_trimmed_flipped_rename.fasta

Sort#

Before polishing, we will sort the contigs either by chr or by hap:

  • by hap: Contigs will be sorted by chromosome within haplotype. For example: chr1_mat > chr2_mat > chr3_mat > … > chrX_mat > chr1_pat > chr2_pat > … > chr1_mat_random_utig4-10 > …

  • by chr: Contigs will be sorted by haplotype within chromosome. For example: chr1_mat > chr1_pat > chr2_mat > chr2_pat > … > chr1_mat_random_utig4-10 > …

The default mode is sort_by="hap". You can use sort_by="chr", but hap sorting is better when viewing the assembly in IGV.

vf.tl.sortContig("assembly_trimmed_flipped_rename.fasta", sort_by="hap")
Sorted sequences have been written to assembly_trimmed_flipped_rename_sortedhap.fasta