Domain Overview
The Genomics & Bioinformatics domain provides specialized computing environments optimized for large-scale genomic data analysis. With pre-configured software stacks and high-memory instances, researchers can process whole genome sequences, perform variant calling, and conduct population genetics studies without managing the underlying infrastructure.
🔧 Pre-configured Tools
Complete genomics software stack including GATK, BWA, STAR, Samtools, and Bioconductor packages ready to use.
💾 High-Memory Instances
Memory-optimized instances (up to 768 GB RAM) designed for large genome assemblies and variant calling pipelines.
📊 Scalable Storage
High-performance storage configurations optimized for large genomic datasets and intermediate file handling.
Pre-installed Software
The genomics environment ships with the following industry-standard bioinformatics tools, pinned to the versions listed:
GATK v4.4.0
Genome Analysis Toolkit for variant discovery
BWA v0.7.17
Burrows-Wheeler Aligner for sequence mapping
STAR v2.7.10
RNA-seq aligner for transcript mapping
Samtools v1.17
Suite for manipulating SAM/BAM files
bcftools v1.17
Variant calling and format conversion
HISAT2 v2.2.1
Graph-based alignment of RNA-seq reads
StringTie v2.2.1
Transcript assembly and quantification
Cufflinks v2.2.1
RNA-seq analysis pipeline
FastQC v0.12.1
Quality control for high-throughput data
Trimmomatic v0.39
Flexible read trimming tool
Bowtie2 v2.5.1
Fast and memory-efficient alignment
TopHat v2.1.2
Splice junction mapper for RNA-seq
Recommended Instance Configurations
Choose the optimal configuration based on your data size and analysis requirements:
Configuration | Instance Type | vCPUs | Memory | Storage | Use Case | Est. Cost/Hour |
---|---|---|---|---|---|---|
Small | r6i.large | 2 | 16 GB | 500 GB SSD | Small datasets, testing | $0.252 |
Medium | r6i.xlarge | 4 | 32 GB | 1 TB SSD | Exome sequencing, small cohorts | $0.504 |
Large | r6i.4xlarge | 16 | 128 GB | 2 TB SSD | Whole genome sequencing | $2.016 |
XLarge | r6i.8xlarge | 32 | 256 GB | 4 TB SSD | Large cohorts, population studies | $4.032 |
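To turn the hourly estimates in the table into a rough per-run budget, multiply the rate by the expected runtime. The sketch below does only that; the function name `estimate_cost` and the 24-hour runtime are illustrative assumptions, and the result excludes any charges not covered by the table's hourly figure.

```shell
# Hypothetical helper: estimate the cost of one run.
#   $1 = hourly rate in USD (from the table above)
#   $2 = expected runtime in hours
estimate_cost() {
  awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
}

# Example: a Large configuration ($2.016/hour) running for 24 hours.
estimate_cost 2.016 24   # prints 48.38
```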
Real-World Datasets
The genomics environment provides seamless access to major public genomics datasets:
1000 Genomes Project
Description: Complete genomic sequences from 2,504 individuals across 26 populations worldwide.
Access: s3://1000genomes/
Use Cases: Population genetics, variant discovery, ancestry analysis
Format: VCF, BAM, FASTQ files
gnomAD v3.1.2
Description: Genome Aggregation Database. The v3.1.2 release contains variants from 76,156 genomes; the 125,748 exomes belong to the separate v2.1.1 release.
Access: s3://gnomad-public-us-east-1/
Use Cases: Variant annotation, frequency analysis, clinical genomics
Format: VCF, Hail Tables, Parquet
TCGA Research Network
Description: Cancer genomics data from The Cancer Genome Atlas with clinical annotations.
Access: s3://tcga-2-open/
Use Cases: Cancer research, tumor analysis, biomarker discovery
Format: BAM, VCF, RNA-seq, methylation data
Human Pangenome Reference
Description: Draft human pangenome reference representing genetic diversity.
Access: s3://human-pangenomics/
Use Cases: Reference genome improvement, structural variant analysis
Format: FASTA, GFA, VCF
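The bucket locations listed above can be browsed with the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is installed and that each bucket allows anonymous reads (hence `--no-sign-request`). The short dataset labels and the function names `dataset_uri` and `list_dataset` are hypothetical and exist only for this example.

```shell
# Map a short, made-up dataset label to the bucket URI listed above.
dataset_uri() {
  case "$1" in
    1000genomes) echo "s3://1000genomes/" ;;
    gnomad)      echo "s3://gnomad-public-us-east-1/" ;;
    tcga)        echo "s3://tcga-2-open/" ;;
    pangenome)   echo "s3://human-pangenomics/" ;;
    *)           echo "unknown dataset: $1" >&2; return 1 ;;
  esac
}

# List the top level of a dataset without AWS credentials.
# Assumption: the bucket permits unsigned (anonymous) requests.
list_dataset() {
  uri="$(dataset_uri "$1")" || return 1
  aws s3 ls --no-sign-request "$uri"
}
```

For example, `list_dataset 1000genomes` lists the top-level prefixes of the 1000 Genomes bucket, after which `aws s3 cp --no-sign-request` can copy individual files to the instance.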
Common Workflows
Pre-configured workflows for standard genomics analyses:
Variant Calling Pipeline (GATK Best Practices)
Quality Control & Preprocessing
Run FastQC for quality assessment, followed by Trimmomatic for adapter removal and quality trimming.
fastqc sample_R1.fastq.gz sample_R2.fastq.gz
Read Alignment
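Align paired-end reads to the reference genome with BWA-MEM, attach a read group (required by downstream GATK tools), and write a coordinate-sorted BAM. The read group values shown are placeholders.
bwa mem -M -t 16 -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA' reference.fa sample_R1.fastq.gz sample_R2.fastq.gz | samtools sort -o aligned.bam -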
Mark Duplicates
Use GATK MarkDuplicates to identify and mark PCR duplicates in aligned reads.
gatk MarkDuplicates -I aligned.bam -O marked.bam -M metrics.txt
Base Quality Recalibration
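Run GATK BaseRecalibrator to build a recalibration table, then ApplyBQSR to write a BAM with recalibrated base quality scores.
gatk BaseRecalibrator -I marked.bam -R reference.fa --known-sites dbsnp.vcf -O recal.table
gatk ApplyBQSR -I marked.bam -R reference.fa --bqsr-recal-file recal.table -O recalibrated.bam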
Variant Calling
Call variants using GATK HaplotypeCaller with appropriate parameters for your data type.
gatk HaplotypeCaller -R reference.fa -I recalibrated.bam -O variants.vcf
Variant Filtering
Apply hard filters or VQSR to remove low-quality variants from the callset.
gatk VariantFiltration -R reference.fa -V variants.vcf --filter-expression "QD < 2.0" --filter-name "QD2" -O filtered.vcf
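The six steps above can be chained into a single script so that each step consumes the previous step's output. The sketch below is one possible arrangement, not the platform's own workflow definition. Its assumptions: a `trimmomatic` wrapper command is on the PATH (some installs use `java -jar trimmomatic-0.39.jar` instead), the adapter file `TruSeq3-PE.fa` and the trimming thresholds follow the example in the Trimmomatic manual, and the read group values, file names, and the function name `variant_pipeline` are placeholders. By default it only prints the commands.

```shell
# Sketch of the six steps above for one paired-end sample (POSIX sh).
# Prints the commands by default (DRY_RUN=1); set DRY_RUN=0 to execute them.
variant_pipeline() (
  set -eu
  s="${1:-sample}"                    # sample prefix, as in sample_R1.fastq.gz
  ref="${REF:-reference.fa}"
  known="${KNOWN_SITES:-dbsnp.vcf}"
  threads="${THREADS:-16}"

  # Print the command in dry-run mode, otherwise run it through the shell.
  run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
      printf '%s\n' "$1"
    else
      sh -c "$1"
    fi
  }

  # 1. Quality control, then adapter and quality trimming (assumed settings).
  run "fastqc ${s}_R1.fastq.gz ${s}_R2.fastq.gz"
  run "trimmomatic PE -threads $threads ${s}_R1.fastq.gz ${s}_R2.fastq.gz \
    ${s}_R1.trim.fastq.gz ${s}_R1.unpaired.fastq.gz \
    ${s}_R2.trim.fastq.gz ${s}_R2.unpaired.fastq.gz \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:36"

  # 2. Alignment with a placeholder read group, sorted by coordinate.
  #    Note: plain sh has no pipefail, so a bwa failure inside the pipe
  #    is not detected by set -e.
  run "bwa mem -M -t $threads -R '@RG\tID:${s}\tSM:${s}\tPL:ILLUMINA' $ref \
    ${s}_R1.trim.fastq.gz ${s}_R2.trim.fastq.gz | samtools sort -o ${s}.aligned.bam -"

  # 3. Mark duplicates.
  run "gatk MarkDuplicates -I ${s}.aligned.bam -O ${s}.marked.bam -M ${s}.metrics.txt"

  # 4. Base quality score recalibration (two passes).
  run "gatk BaseRecalibrator -I ${s}.marked.bam -R $ref --known-sites $known -O ${s}.recal.table"
  run "gatk ApplyBQSR -I ${s}.marked.bam -R $ref --bqsr-recal-file ${s}.recal.table -O ${s}.recalibrated.bam"

  # 5. Variant calling.
  run "gatk HaplotypeCaller -R $ref -I ${s}.recalibrated.bam -O ${s}.vcf"

  # 6. Hard filtering on QD, matching the example threshold above.
  run "gatk VariantFiltration -R $ref -V ${s}.vcf --filter-expression \"QD < 2.0\" --filter-name QD2 -O ${s}.filtered.vcf"
)
```

Calling `variant_pipeline sample` prints the eight commands for review; `DRY_RUN=0 variant_pipeline sample` executes them in order and stops at the first failing step. Running for real also assumes the reference has already been indexed for BWA and GATK.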
Performance Benchmarks
Real-world performance metrics for common genomics tasks:
Getting Started
Ready to start your genomics research? Follow these steps:
- Launch Research Wizard: Start the web interface and navigate to the Domains tab
- Select Genomics: Choose "Genomics & Bioinformatics" from the life sciences category
- Configure Environment: Select instance size based on your data requirements
- Deploy: One-click deployment creates your fully configured environment
- Access Tools: Connect via SSH or Jupyter for interactive analysis
- Monitor Usage: Track performance and costs in real-time
💡 Best Practices
- Use spot instances for development and testing (up to roughly 70% cost savings; actual Spot discounts vary and instances can be interrupted)
- Enable auto-shutdown to prevent unnecessary charges
- Store results in S3 for long-term retention
- Use EFS for shared storage across multiple instances
- Monitor memory usage and scale instances as needed
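As one way to act on the memory-monitoring advice, the sketch below reads a Linux `/proc/meminfo` file and reports the share of memory in use. The function names `mem_used_percent` and `check_memory` and the 90% threshold are arbitrary choices for this example, not platform defaults.

```shell
# Print the percentage of memory in use, based on the MemTotal and
# MemAvailable fields of a /proc/meminfo-style file (Linux only).
mem_used_percent() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }' \
    "${1:-/proc/meminfo}"
}

# Warn when usage reaches a threshold.
#   $1 = threshold in percent (default 90, an arbitrary example value)
#   $2 = meminfo file (default /proc/meminfo)
check_memory() {
  used="$(mem_used_percent "${2:-/proc/meminfo}")"
  if [ "$used" -ge "${1:-90}" ]; then
    echo "WARNING: memory usage at ${used}%; consider a larger instance"
  else
    echo "OK: memory usage at ${used}%"
  fi
}
```

Running `check_memory 90` periodically during a long alignment or variant calling job (for example from `watch` or a cron entry) gives an early signal that the next configuration size may be needed.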