Usage¶
This page describes how to run the workflow on Biowulf.
For installation and prerequisites, see the Installation page.
Running on Biowulf¶
On Biowulf, the pipeline is hosted in the Khanlab space. A helper script, launch.sh, is provided to simplify execution. This script wraps the nextflow run command with Biowulf-specific settings and automatically configures output directories, logging, and profiles.
Step 1 — Copy the launch script¶
launch.sh
#!/bin/bash
# Set default values
DEFAULT_OUTDIR="/data/khanlab/projects/processed_DATA"
DEFAULT_GENOME="hg19"
DEFAULT_PLATFORM="biowulf"
# Check if the required samplesheet argument is provided
if [[ "$#" -lt 1 ]]; then
    echo "Usage: $0 <samplesheet_with_full_path> [output_directory] [genome]"
    echo "This script requires at least one positional argument:"
    echo "1. Path to samplesheet"
    echo "Optional arguments:"
    echo "2. Path to results directory (default: $DEFAULT_OUTDIR)"
    echo "3. Genome name. Accepted values are hg19 and mm39 (default: $DEFAULT_GENOME)"
    exit 1
fi
# Assign the first argument (samplesheet) to a variable
export SAMPLESHEET=$1
# Assign the second argument (output_directory) if provided, otherwise use the default
export OUTDIR=${2:-$DEFAULT_OUTDIR}
# Assign the third argument (genome) if provided, otherwise use the default
export GENOME=${3:-$DEFAULT_GENOME}
# Assign the fourth argument (platform) if provided, otherwise use the default
export PLATFORM=${4:-$DEFAULT_PLATFORM}
# Ensure the genome name is valid
if [[ "$GENOME" != "hg19" && "$GENOME" != "mm39" ]]; then
    echo "Invalid genome specified. Accepted values are hg19 and mm39."
    exit 1
fi
WF_HOME="/data/khanlab/projects/Nextflow_dev/dev/AWS_POC_MVP_NF"
CONFIG_FILE="$WF_HOME/biowulf_nextflow.config"
#export PATIENT=$(awk -F',' 'NR==1 {for (i=1; i<=NF; i++) if ($i=="sample") s=i} NR>1 {print $s}' "$SAMPLESHEET" | sort | uniq)
#export CASENAME=$(awk -F',' 'NR==1 {for (i=1; i<=NF; i++) if ($i=="casename") c=i} NR>1 {print $c}' "$SAMPLESHEET" | sort | uniq)
export PATIENT=$(python3 -c 'import csv,sys; r=csv.DictReader(open(sys.argv[1])); print("\n".join(sorted(set(row["sample"] for row in r))))' "$SAMPLESHEET")
export CASENAME=$(python3 -c 'import csv,sys; r=csv.DictReader(open(sys.argv[1])); print("\n".join(sorted(set(row["casename"] for row in r))))' "$SAMPLESHEET")
# Stop before creating any directories if the sample or case name is missing
if [[ -z "$PATIENT" || -z "$CASENAME" ]]; then
    echo "Error: Could not extract PATIENT or CASENAME from the samplesheet."
    exit 1
fi
export RESULTSDIR="$OUTDIR/$PATIENT/$CASENAME"
mkdir -p "$RESULTSDIR"
export LOG="$RESULTSDIR/log"
mkdir -p "$LOG"
export NXF_HOME="$RESULTSDIR/.nextflow"
cd "$RESULTSDIR" || exit 1
if [[ "$GENOME" == "hg19" ]]; then
    PROFILE="biowulf_test_run_slurm"
elif [[ "$GENOME" == "mm39" ]]; then
    PROFILE="biowulf_mouse_RNA_slurm"
else
    echo "Unknown genome: $GENOME"
    exit 1
fi
logname=$(basename "$SAMPLESHEET" .csv)
timestamp=$(date +"%Y%m%d-%H%M%S")
sbatch <<EOT
#!/bin/bash
#SBATCH --job-name="$logname"
#SBATCH --output="$OUTDIR/$PATIENT/$CASENAME/${logname}_%A_${timestamp}.out"
#SBATCH --cpus-per-task=2
#SBATCH --mem=05g
#SBATCH --time=08-00:00:00
module load nextflow/23.10.0 singularity graphviz
nextflow run -c $CONFIG_FILE -profile $PROFILE --logdir $LOG $WF_HOME/main.nf -resume --samplesheet $SAMPLESHEET --resultsdir $OUTDIR --genome_v $GENOME --platform $PLATFORM
exit 0
EOT
Download or copy the script into your Biowulf working directory and make it executable:
cp /data/khanlab/projects/Nextflow_dev/dev/AWS_POC_MVP_NF/launch.sh .
chmod +x launch.sh
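To preview where results will land before submitting a job, the script's lookup of the sample and casename columns can be reproduced in Python. The sketch below mirrors the python3 one-liners in launch.sh; the helper names and the single-patient check are assumptions of this example, not part of the pipeline.

```python
import csv
import io


def unique_column_values(samplesheet_text, column):
    """Return the sorted unique values found in one samplesheet column."""
    reader = csv.DictReader(io.StringIO(samplesheet_text))
    return sorted({row[column] for row in reader})


def results_dir(samplesheet_text, outdir):
    """Build OUTDIR/PATIENT/CASENAME (hypothetical helper, mirrors launch.sh)."""
    patients = unique_column_values(samplesheet_text, "sample")
    cases = unique_column_values(samplesheet_text, "casename")
    # Assumption of this sketch: one patient and one case per samplesheet.
    # launch.sh joins several values with newlines, which is unlikely to
    # give a usable directory name.
    if len(patients) != 1 or len(cases) != 1:
        raise ValueError("expected exactly one sample and one casename")
    return f"{outdir}/{patients[0]}/{cases[0]}"


sheet = (
    "sample,library,casename\n"
    "Test8,Test5_T1D_E,NFtest0523\n"
    "Test8,Test8_N2D_E,NFtest0523\n"
)
print(results_dir(sheet, "/data/khanlab/projects/processed_DATA"))
```

With the two example rows, both libraries share one sample and one casename, so a single results directory is derived.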
Step 2 — View usage instructions¶
Run the script without arguments to display the usage help:
./launch.sh
Usage: ./launch.sh <samplesheet_with_full_path> [output_directory] [genome]
This script requires at least one positional argument:
1. Path to samplesheet
Optional arguments:
2. Path to results directory (default: /data/khanlab/projects/processed_DATA)
3. Genome name. Accepted values are hg19 and mm39 (default: hg19)
Usage Examples:
- Run with defaults (output to default path, genome hg19):
  ./launch.sh /path/to/samplesheet.csv
- Run with custom output directory:
  ./launch.sh /path/to/samplesheet.csv /custom/output/path
- Run with custom output directory and genome:
  ./launch.sh /path/to/samplesheet.csv /custom/output/path mm39
mm39
If mm39 is selected as the genome, the pipeline treats the input as mouse data and runs only the Mapping and Gene expression steps.
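How the positional arguments fall back to defaults and select a Nextflow profile can be summarized in a few lines. This is a Python sketch of the logic in launch.sh, not code the pipeline runs; resolve_launch_args is a hypothetical name, while the default values and profile names are copied from the script.

```python
DEFAULTS = {
    "outdir": "/data/khanlab/projects/processed_DATA",
    "genome": "hg19",
    "platform": "biowulf",
}

# Genome name -> Nextflow profile, as chosen in launch.sh.
PROFILES = {
    "hg19": "biowulf_test_run_slurm",
    "mm39": "biowulf_mouse_RNA_slurm",
}


def resolve_launch_args(args):
    """Map positional arguments to settings the way launch.sh does."""
    if not args:
        raise ValueError("a samplesheet path is required")
    names = ["samplesheet", "outdir", "genome", "platform"]
    resolved = dict(DEFAULTS)
    for name, value in zip(names, args):
        # Like ${N:-default} in bash, an empty argument keeps the default.
        if value:
            resolved[name] = value
    resolved["samplesheet"] = args[0]
    if resolved["genome"] not in PROFILES:
        raise ValueError("accepted genome values are hg19 and mm39")
    resolved["profile"] = PROFILES[resolved["genome"]]
    return resolved


settings = resolve_launch_args(["/path/to/samplesheet.csv", "", "mm39"])
print(settings["outdir"], settings["profile"])
```

Passing an empty string for the output directory, as in the usage line above, keeps the default path while still allowing the genome to be set.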
Preparing the Samplesheet¶
You can prepare a samplesheet in two ways, depending on your use case:
- Build from a Master Sheet – Recommended if you also plan to visualize results in the ClinOmics Data Portal.
- Build Manually – For standalone processing without portal integration.
1. Build from a Master Sheet (Recommended)¶
For Khanlab workflows, master sheets are stored on Biowulf in:
/data/khanlab/projects/DATA/Sequencing_Tracking_Master
Use the script:
python ./samplesheet_builder.py <PatientID> <CaseName>
Default Master Sheet Directory: /data/khanlab/projects/DATA/Sequencing_Tracking_Master
Default Input Directory: /data/khanlab/projects/DATA
To use custom directories, edit DEFAULT_SAMPLESHEET_DIR and DEFAULT_INPUT_DIR in the script.
Error Handling:
- Invalid FASTQ Paths – If read1 or read2 paths are invalid, the script will print an error prompting you to verify input paths.
- Unknown Capture Kit – If the capture kit entered is not listed in Sequencing Capture Kits, the script will assign unknown as the kit name.
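The two checks described above could look roughly like the following. This is an illustrative sketch, not the code of samplesheet_builder.py: the function names, the wording of the error message, and the KNOWN_KITS set are assumptions (clin.ex.v1 is the only kit name that appears on this page).

```python
import os

# Assumption: stand-in for the list of Sequencing Capture Kits.
KNOWN_KITS = {"clin.ex.v1"}


def check_fastq_paths(row):
    """Return one error message per read1/read2 path that is not a file."""
    errors = []
    for column in ("read1", "read2"):
        path = row.get(column, "")
        if not path or not os.path.isfile(path):
            errors.append(
                f"{column} path not found, please verify input paths: {path!r}"
            )
    return errors


def normalize_capture_kit(kit, known_kits=KNOWN_KITS):
    """Keep a recognized kit name, otherwise fall back to 'unknown'."""
    return kit if kit in known_kits else "unknown"


row = {
    "read1": "/no/such/dir/Test5_T1D_E_R1.fastq.gz",
    "read2": "",
    "sample_captures": "my.custom.kit",
}
for message in check_fastq_paths(row):
    print(message)
print(normalize_capture_kit(row["sample_captures"]))
```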
2. Build Your Own Samplesheet¶
Create a CSV with the following columns:
Column | Description | Example |
---|---|---|
sample | Patient name | NCI-Test1 |
library | Sample library name | Test1_T1D_E |
read1 | Path to R1 FASTQ | /data/khanlab/DATA/Sample_Test1_T1D_E/Sample_Test1_T1D_E.R1.fastq.gz |
read2 | Path to R2 FASTQ | /data/khanlab/DATA/Sample_Test1_T1D_E/Sample_Test1_T1D_E.R2.fastq.gz |
sample_captures | Capture kit name (see Sequencing Capture Kits) | clin.ex.v1 |
Matched_RNA | Matched RNA library (optional) | Test1_T1R_T |
Matched_normal | Matched exome normal library (optional) | Test1_N1D_E |
Diagnosis | Patient diagnosis | Glioma |
casename | Case name | NCI-Test1 |
type | Data type | tumor_RNA, tumor_DNA, normal_DNA, etc. |
FCID | Flowcell ID (optional) | ACJ678349 |
Genome | Reference genome | hg19 or mm39 |
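Before launching, it can help to confirm that a hand-built samplesheet has the columns listed above. A minimal sketch follows; which columns count as required is inferred from the "(optional)" labels in the table, and validate_samplesheet is a hypothetical helper rather than something the pipeline ships.

```python
import csv
import io

# Assumption: every column not marked "(optional)" must be present and filled.
REQUIRED = [
    "sample", "library", "read1", "read2", "sample_captures",
    "Diagnosis", "casename", "type", "Genome",
]
ALLOWED_GENOMES = {"hg19", "mm39"}


def validate_samplesheet(handle):
    """Return a list of problems found in a samplesheet (empty if none)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED if c not in header]
    if problems:
        return problems
    # Row 1 is the header, so data rows are counted from 2.
    for row_no, row in enumerate(reader, start=2):
        for column in REQUIRED:
            if not (row[column] or "").strip():
                problems.append(f"row {row_no}: empty {column}")
        genome = (row["Genome"] or "").strip()
        if genome and genome not in ALLOWED_GENOMES:
            problems.append(f"row {row_no}: unsupported genome {genome}")
    return problems


text = (
    "sample,library,read1,read2,sample_captures,Diagnosis,casename,type,Genome\n"
    "Test8,Test5_T1D_E,/r1.fastq.gz,/r2.fastq.gz,clin.ex.v1,Osteosarcoma,"
    "NFtest0523,tumor_DNA,hg38\n"
)
print(validate_samplesheet(io.StringIO(text)))
```

Columns are looked up by name, so their order in the file does not matter for this check.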
Example input CSV:
sample | library | read1 | read2 | sample_captures | Diagnosis | Matched_RNA | Matched_normal | casename | type | FCID | Genome |
---|---|---|---|---|---|---|---|---|---|---|---|
Test8 | Test5_T1D_E | /data/khanlab/projects/fastq/Test5_T1D_E_R1.fastq.gz | /data/khanlab/projects/fastq/Test5_T1D_E_R2.fastq.gz | clin.ex.v1 | Osteosarcoma | | Test8_N2D_E | NFtest0523 | tumor_DNA | AWXYNH2 | hg19 |
Test8 | Test8_N2D_E | /data/khanlab/projects/fastq/Test8_N2D_E_R1.fastq.gz | /data/khanlab/projects/fastq/Test8_N2D_E_R2.fastq.gz | clin.ex.v1 | Osteosarcoma | | | NFtest0523 | normal_DNA | AWXYNH2 | hg19 |
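The rendered table hides that the file itself is comma-separated and that optional fields are simply left empty. The sketch below writes the two example rows with Python's csv module. It assumes the blank cells belong to the Matched_RNA and Matched_normal columns, and samplesheet_csv is a hypothetical helper name.

```python
import csv
import io

FIELDS = [
    "sample", "library", "read1", "read2", "sample_captures", "Diagnosis",
    "Matched_RNA", "Matched_normal", "casename", "type", "FCID", "Genome",
]


def samplesheet_csv(rows):
    """Serialize row dicts to CSV text; missing optional keys become empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


fastq_dir = "/data/khanlab/projects/fastq"
common = {
    "sample": "Test8",
    "sample_captures": "clin.ex.v1",
    "Diagnosis": "Osteosarcoma",
    "casename": "NFtest0523",
    "FCID": "AWXYNH2",
    "Genome": "hg19",
}
tumor = dict(
    common,
    library="Test5_T1D_E",
    read1=f"{fastq_dir}/Test5_T1D_E_R1.fastq.gz",
    read2=f"{fastq_dir}/Test5_T1D_E_R2.fastq.gz",
    Matched_normal="Test8_N2D_E",
    type="tumor_DNA",
)
normal = dict(
    common,
    library="Test8_N2D_E",
    read1=f"{fastq_dir}/Test8_N2D_E_R1.fastq.gz",
    read2=f"{fastq_dir}/Test8_N2D_E_R2.fastq.gz",
    type="normal_DNA",
)
print(samplesheet_csv([tumor, normal]), end="")
```

To produce a file instead of printing, the returned text can be written to a path of your choice and that path passed to launch.sh.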