Usage

Usage¶

This page describes how to run the workflow on Biowulf.
For installation and prerequisites, see the Installation page.

Running on Biowulf¶

On Biowulf, the pipeline is hosted in the Khanlab space.
A helper script, launch.sh, is provided to simplify execution. This script wraps the nextflow run command with Biowulf-specific settings and automatically configures output directories, logging, and profiles.

Step 1 — Copy the launch script¶

launch.sh

#!/bin/bash

# Set default values
DEFAULT_OUTDIR="/data/khanlab/projects/processed_DATA"
DEFAULT_GENOME="hg19"
DEFAULT_PLATFORM="biowulf"

# Check if the required samplesheet argument is provided
if [[ "$#" -lt 1 ]]; then
    echo "Usage: $0 <samplesheet_with_full_path> [output_directory] [genome]"
    echo "This script requires at least one positional argument:"
    echo "1. Path to samplesheet"
    echo "Optional arguments:"
    echo "2. Path to results directory (default: $DEFAULT_OUTDIR)"
    echo "3. Genome name. Accepted values are hg19 and mm39 (default: $DEFAULT_GENOME)"
    exit 1
fi

# Assign the first argument (samplesheet) to a variable
export SAMPLESHEET=$1

# Assign the second argument (output_directory) if provided, otherwise use the default
export OUTDIR=${2:-$DEFAULT_OUTDIR}

# Assign the third argument (genome) if provided, otherwise use the default
export GENOME=${3:-$DEFAULT_GENOME}

export PLATFORM=${4:-$DEFAULT_PLATFORM}

# Ensure the genome name is valid
if [[ "$GENOME" != "hg19" && "$GENOME" != "mm39" ]]; then
    echo "Invalid genome specified. Accepted values are hg19 and mm39."
    exit 1
fi

WF_HOME="/data/khanlab/projects/Nextflow_dev/dev/AWS_POC_MVP_NF"
CONFIG_FILE="$WF_HOME/biowulf_nextflow.config"

#export PATIENT=$(awk -F',' 'NR==1 {for (i=1; i<=NF; i++) if ($i=="sample") s=i} NR>1 {print $s}' "$SAMPLESHEET" | sort | uniq)
#export CASENAME=$(awk -F',' 'NR==1 {for (i=1; i<=NF; i++) if ($i=="casename") c=i} NR>1 {print $c}' "$SAMPLESHEET" | sort | uniq)
export PATIENT=$(python3 -c 'import csv,sys; r=csv.DictReader(open(sys.argv[1])); print("\n".join(sorted(set(row["sample"] for row in r))))' "$SAMPLESHEET")
export CASENAME=$(python3 -c 'import csv,sys; r=csv.DictReader(open(sys.argv[1])); print("\n".join(sorted(set(row["casename"] for row in r))))' "$SAMPLESHEET")

export RESULTSDIR="$OUTDIR/$PATIENT/$CASENAME"

mkdir -p "$RESULTSDIR"

export LOG="$RESULTSDIR/log"

mkdir -p "$LOG"

export NXF_HOME="$RESULTSDIR/.nextflow"

if [[ -z "$PATIENT" || -z "$CASENAME" ]]; then
    echo "Error: Could not extract PATIENT or CASENAME from the samplesheet."
    exit 1
fi


cd $RESULTSDIR

if [[ "$GENOME" == "hg19" ]]; then
    PROFILE="biowulf_test_run_slurm"
elif [[ "$GENOME" == "mm39" ]]; then
    PROFILE="biowulf_mouse_RNA_slurm"
else
    echo "Unknown genome: $GENOME"
    exit 1
fi

logname=$(basename "$SAMPLESHEET" .csv)
timestamp=$(date +"%Y%m%d-%H%M%S")
sbatch <<EOT
#!/bin/bash
#SBATCH --job-name="$logname"
#SBATCH --output="$OUTDIR/$PATIENT/$CASENAME/${logname}_%A_${timestamp}.out"
#SBATCH --cpus-per-task=2
#SBATCH --mem=05g
#SBATCH --time=08-00:00:00

module load nextflow/23.10.0 singularity graphviz
nextflow run -c $CONFIG_FILE -profile $PROFILE --logdir $LOG $WF_HOME/main.nf -resume --samplesheet $SAMPLESHEET --resultsdir $OUTDIR --genome_v $GENOME --platform $PLATFORM
exit 0
EOT

Download or copy the script into your Biowulf working directory and make it executable:

cp /data/khanlab/projects/Nextflow_dev/dev/AWS_POC_MVP_NF/launch.sh .
chmod +x launch.sh

Step 2 — View usage instructions¶

Run the script without arguments to display the usage help:

./launch.sh

Usage: ./launch.sh <samplesheet_with_full_path> [output_directory] [genome]
This script requires at least one positional argument:
1. Path to samplesheet
Optional arguments:
2. Path to results directory (default: /data/khanlab/projects/processed_DATA)
3. Genome name. Accepted values are hg19 and mm39 (default: hg19)

Usage Examples:

Run with defaults (output to default path, genome hg19):

./launch.sh /path/to/samplesheet.csv
Run with custom output directory:

./launch.sh /path/to/samplesheet.csv /custom/output/path
Run with custom output directory and genome:

./launch.sh /path/to/samplesheet.csv /custom/output/path mm39

mm39

If mm39 is selected for the genome, the pipeline will run Mapping & Gene expression only for mouse data.

Preparing the Samplesheet¶

You can prepare a samplesheet in two ways, depending on your use case:

Build from a Master Sheet – Recommended if you also plan to visualize results in the ClinOmics Data Portal.
Build Manually – For standalone processing without portal integration.

1. Build from a Master Sheet (Recommended)¶

For Khanlab workflows, master sheets are stored on Biowulf in:

/data/khanlab/projects/DATA/Sequencing_Tracking_Master

Use the script:

samplesheet_builder.py

python ./samplesheet_builder.py <PatientID> <CaseName>

Default Master Sheet Directory: /data/khanlab/projects/DATA/Sequencing_Tracking_Master

Default Input Directory: /data/khanlab/projects/DATA

To use custom directories, edit DEFAULT_SAMPLESHEET_DIR and DEFAULT_INPUT_DIR in the script.

Error Handling:

Invalid FASTQ Paths – If read1 or read2 paths are invalid, the script will print an error prompting you to verify input paths.
Unknown Capture Kit – If the capture kit entered is not listed in Sequencing Capture Kits, the script will assign unknown as the kit name.

2. Build Your Own Samplesheet¶

Create a CSV with these column information:

Column	Description	Example
sample	Patient name	NCI-Test1
library	Sample library name	Test1_T1D_E
read1	Path to R1 FASTQ	/data/khanlab/DATA/Sample_Test1_T1D_E/Sample_Test1_T1D_E.R1.fastq.gz
read2	Path to R2 FASTQ	/data/khanlab/DATA/Sample_Test1_T1D_E/Sample_Test1_T1D_E.R2.fastq.gz
sample_captures	Capture kit name	Sequencing Capture Kits
Matched_RNA	Matched RNA library (optional)	Test1_T1R_T
Matched_normal	Matched exome normal library (optional)	Test1_N1D_E
Diagnosis	Patient diagnosis	Glioma
casename	Case name	NCI-Test1
type	Data type	tumor_RNA, tumor_DNA, normal_DNA, etc.
FCID	Flowcell ID (optional)	ACJ678349
Genome	Genome	hg19 or mm39

Example Input csv:

sample	library	read1	read2	sample_captures	Diagnosis	Matched_RNA	Matched_normal	casename	type	FCID	Genome
Test8	Test5_T1D_E	/data/khanlab/projects/fastq/Test5_T1D_E_R1.fastq.gz	/data/khanlab/projects/fastq/Test5_T1D_E_R2.fastq.gz	clin.ex.v1	Osteosarcoma		Test8_N2D_E	NFtest0523	tumor_DNA	AWXYNH2	hg19
Test8	Test8_N2D_E	/data/khanlab/projects/fastq/Test8_N2D_E_R1.fastq.gz	/data/khanlab/projects/fastq/Test8_N2D_E_R2.fastq.gz	clin.ex.v1	Osteosarcoma			NFtest0523	normal_DNA	AWXYNH2	hg19