Formatting bed files for khanlab exome-rnaseq pipeline.¶

The Target bed files you receive from capture kits usually arrive in different formats. Here are the steps to convert them to a pipeline accepted format. This bed file will be used to calculate Readdepth, coverage and failedexons in the pipeline.

1. Filter GTF file to extract bedfile of exons.¶

Pipeline uses gtf annotation gencodev37.
Filter to get exons ie., by col 3.
Convert to bed format (columns 1-3) plus format strand (replace ± with f/r), extract transcript ID, gene name, and exon number from annotations, and add header. here is an example output

chr	start	stop	strand	transcript_ID	Genename	exon_number
chr1	11869	12227	f	ENST00000456328.2_1	DDX11L1	1
chr1	12613	12721	f	ENST00000456328.2_1	DDX11L1	2
chr1	13221	14409	f	ENST00000456328.2_1	DDX11L1	3
chr1	12010	12057	f	ENST00000450305.2_1	DDX11L1	1
chr1	12179	12227	f	ENST00000450305.2_1	DDX11L1	2
chr1	12613	12697	f	ENST00000450305.2_1	DDX11L1	3
chr1	12975	13052	f	ENST00000450305.2_1	DDX11L1	4

2. Downlaod HGNC Biomart table - To identify protein-coding genes¶

Download from HGNCtable with HGNC ID, status, approved symbol, approved name and Ensembl gene ID
Filter to protein-coding transcripts ("NM_") and reduce to columns 8-9 (transcript_ID,RefSeq transcript ID). Here is an example output.

transcript_ID	RefSeq transcript ID
ENST00000263100.8	NM_130786.4
ENST00000373997.8	NM_014576.4
ENST00000318602.12	NM_000014.6
ENST00000299698.12	NM_144670.6
ENST00000442999.3	NM_001080438.1

3. Filter outputs from above two steps by merging on transcript ID¶

#!/usr/bin/env python

import pandas as pd

df1 = pd.read_csv("Filtered_GTF_file_with_exons",sep='\t')
df1['transcript_ID'] = df1['transcript_ID'].str.split('.').str[0]
df1['exon_number'] = df1['exon_number'].round(decimals=0).astype(object)

df2 = pd.read_csv("Filtered_hgnc_biomart_table-transcript-NM",sep='\t')
df2['transcript_ID'] = df2['transcript_ID'].str.split('.').str[0]
df3 = pd.merge(df1,df2, on="transcript_ID",how="inner")
df3['exon_number'] = df3['exon_number'].astype(str)
df3['RefSeq transcript ID'] = df3['RefSeq transcript ID'].str.split('.').str[0]
df3['NM_ID_strand'] = df3['RefSeq transcript ID'] + "_cds_" + df3['strand']

df5 = df3[['chr','start','stop','Genename','exon_number','NM_ID_strand']]
df5['exon_number'] = df5['exon_number'].str.split('.').str[0]
df5.to_csv("gencode-exon-hgnc-biomart-table-khanlab-bed.csv",index=False,encoding='utf-8', sep ='\t')

4. Filter Capture kit target intervals using the output generated in step 3¶

Extract columns 1-3 from the capture kit bed file.

bedtools intersect -a capture_kit_bed_file /
                   -b gencode-exon-hgnc-biomart-table-khanlab-bed.csv /
                   -wa -wb | cut -f1,2,3,7,8,9| bedtools sort |uniq| /
                   -mergeBed -i - -c 4,5,6 -o distinct,collapse,collapse | /
                   awk '{OFS="\t"}{print $1,$2,$3,$4"___"$5,$6}' > khanlab_pipeline_accepted_bed_file.

This page was created through contributions from Erica Pehrsson and Vineela Gangalapudi.