Genomics Research Environment - Getting Started
Time to Complete: 20 minutes
Cost: $8-12 for tutorial
Skill Level: Beginner (no cloud experience needed)
What You’ll Build
By the end of this guide, you’ll have a working genomics research environment that can:
- Process DNA sequence files (FASTQ, SAM/BAM, VCF)
- Run popular tools like BWA, GATK, and SAMtools
- Handle datasets up to 500GB in size
- Cost 60% less than traditional computing clusters
Meet Dr. Sarah Kim
Dr. Sarah Kim is a genomics researcher at Johns Hopkins. She studies rare genetic diseases but waits 3-5 days for university cluster access. Each analysis takes a week to complete, slowing down her research.
Before: 3-day waits + 1-week analysis = 10 days per study
After: 15-minute setup + 4-hour analysis = same day results
Time Saved: 95% faster research cycle
Cost Savings: $400/month vs $1,200 university allocation
Before You Start
What You Need
- AWS account (free to create)
- Credit card for AWS billing (charged only for what you use)
- Computer with internet connection
- 20 minutes of uninterrupted time
Cost Expectations
- Tutorial cost: $8-12 (we’ll clean up resources when done)
- Daily research cost: $15-45 per day when actively using
- Monthly estimate: $150-450 per month for typical usage
- Free tier: Some storage included free for first 12 months
Skills Needed
- Basic computer use (creating folders, installing software)
- Copy and paste commands
- No cloud or programming experience required
Step 1: Install AWS Research Wizard
Choose your operating system:
macOS/Linux
curl -fsSL https://install.aws-research-wizard.com | sh
Windows
Download from: https://github.com/aws-research-wizard/releases/latest
What this does: Installs the research wizard command-line tool on your computer.
Expected result: You should see an “Installation successful” message.
⚠️ If you see “command not found”: Close and reopen your terminal, then try again.
Step 2: Set Up AWS Account
If you don’t have an AWS account:
- Go to aws.amazon.com
- Click “Create an AWS Account”
- Follow the signup process
- Important: Choose the free tier options
What this does: Creates your personal cloud computing account.
Expected result: You receive email confirmation from AWS.
💰 Cost note: Account creation is free. You only pay for resources you use.
Step 3: Configure Your Credentials
aws-research-wizard config setup
The wizard will ask for:
- AWS Access Key: Found in AWS Console → Security Credentials
- Secret Key: Created with your access key
- Region: Choose us-west-2 (recommended for genomics)
What this does: Connects the research wizard to your AWS account.
Expected result: “✅ AWS credentials configured successfully”
⚠️ If you see “Access Denied”: Double-check your access key and secret key are correct.
Step 4: Validate Your Setup
aws-research-wizard deploy validate --domain genomics --region us-west-2
What this does: Checks that everything is working before we spend money.
Expected result:
✅ AWS credentials valid
✅ Domain configuration valid: genomics
✅ Region valid: us-west-2 (6 availability zones)
🎉 All validations passed!
Step 5: Deploy Your Genomics Environment
aws-research-wizard deploy start --domain genomics --region us-west-2 --instance r6i.large
What this does: Creates your genomics computing environment in the cloud.
This will take: 3-5 minutes
Expected result:
🎉 Deployment completed successfully!
Deployment Details:
Instance ID: i-1234567890abcdef0
Public IP: 12.34.56.78
SSH Command: ssh -i ~/.ssh/id_rsa ec2-user@12.34.56.78
S3 Bucket: genomics-data-1234567
💰 Billing starts now: Your environment costs about $0.24 per hour while running.
Step 6: Connect to Your Environment
Use the SSH command from the previous step:
ssh -i ~/.ssh/id_rsa ec2-user@12.34.56.78
What this does: Connects you to your genomics computer in the cloud.
Expected result: You see a command prompt like [ec2-user@ip-10-0-1-123 ~]$
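⚠️ If connection fails: Your computer or network might block SSH. If SSH stops at a host key prompt, try adding -o StrictHostKeyChecking=no to the command. This option turns off host key verification, so use it only for a short-lived instance like this one.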
Step 7: Explore Your Genomics Tools
Your environment comes pre-installed with:
Core Genomics Tools
- BWA: DNA sequence alignment - Type bwa to start
- GATK: Variant discovery - Type gatk --help to start
- SAMtools: Sequence data processing - Type samtools to start
- FastQC: Quality control - Type fastqc --help to start
- bcftools: Variant calling utilities - Type bcftools to start
Try Your First Command
bwa
What this does: Shows BWA help and confirms it’s installed correctly.
Expected result: You see BWA version info and usage instructions.
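To confirm that all five listed tools are on the PATH in one pass, a loop along these lines may help. This is a minimal sketch: `missing_tools` is a hypothetical helper name (not part of AWS Research Wizard), and it relies only on the shell's `command -v` builtin.

```shell
# Print the name of every command that is not found on the PATH.
# missing_tools is a hypothetical helper, not part of AWS Research Wizard.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Check the tools this guide lists as pre-installed.
missing=$(missing_tools bwa gatk samtools fastqc bcftools)
if [ -z "$missing" ]; then
  echo "All tools found"
else
  echo "Not found yet:"
  echo "$missing"
fi
```

If any names are printed, the software may still be installing; waiting a few minutes and rerunning the check is usually enough.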
Step 8: Run Analysis with Real Research Data
Let’s analyze real genomics data from the AWS Open Data Registry:
Download Real Genomics Data from AWS Open Data
📊 Data Download Summary:
- 1000 Genomes Project: ~1.8 GB (population genomics and variant data)
- TCGA Cancer Genomics: ~1.4 GB (tumor and normal samples)
- NIH SRA Archive: ~1.2 GB (sequencing reads and metadata)
- gnomAD Population Database: ~1.1 GB (population variant frequencies)
- Total download: ~5.5 GB
- Estimated time: 12-18 minutes on typical broadband
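Before starting a download of this size, it can be worth checking free disk space. The sketch below assumes the data lands under your home directory; `free_gb` is a hypothetical helper name, and the 6 GB threshold is simply the ~5.5 GB total rounded up. The unzipped reference and the alignment output need additional room on top of that.

```shell
# Report free space (in whole GB) on the filesystem holding a directory.
# free_gb is a hypothetical helper; `df -P` gives portable POSIX-format output.
free_gb() {
  df -Pk "${1:-.}" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

needed_gb=6   # assumption: ~5.5 GB of downloads, rounded up
available_gb=$(free_gb "${HOME:-.}")
if [ "$available_gb" -ge "$needed_gb" ]; then
  echo "OK: ${available_gb} GB free"
else
  echo "Warning: only ${available_gb} GB free, downloads need about ${needed_gb} GB"
fi
```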
# Create working directory
mkdir -p ~/genomics-tutorial
cd ~/genomics-tutorial
# Download real human genome data from 1000 Genomes Project
echo "Downloading 1000 Genomes Project data (~1.8GB)..."
aws s3 cp s3://1000genomes/phase3/data/HG00096/sequence_read/SRR062634_1.filt.fastq.gz . --no-sign-request
aws s3 cp s3://1000genomes/phase3/data/HG00096/sequence_read/SRR062634_2.filt.fastq.gz . --no-sign-request
aws s3 cp s3://1000genomes/technical/reference/human_g1k_v37.fasta.gz . --no-sign-request
aws s3 cp s3://1000genomes/phase3/20130502.phase3.vcf.gz . --no-sign-request
echo "Downloading TCGA cancer genomics data (~1.4GB)..."
aws s3 cp s3://tcga-2-open/TCGA-BRCA/harmonized/Simple_Nucleotide_Variation/Raw_Sequencing_Data/WXS/C828.TCGA-A8-A08B-01A-11D-A011-09.bam . --no-sign-request
aws s3 cp s3://tcga-2-open/TCGA-BRCA/harmonized/Simple_Nucleotide_Variation/Raw_Sequencing_Data/WXS/C828.TCGA-A8-A08B-10A-01D-A011-09.bam . --no-sign-request
echo "Downloading NIH SRA sequencing archive (~1.2GB)..."
aws s3 cp s3://sra-pub-run-odp/sra/SRR3189741/SRR3189741 . --no-sign-request
aws s3 cp s3://sra-pub-run-odp/sra/SRR3189742/SRR3189742 . --no-sign-request
echo "Downloading gnomAD population variant database (~1.1GB)..."
aws s3 cp s3://gnomad-public-us-east-1/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr20.vcf.bgz . --no-sign-request
aws s3 cp s3://gnomad-public-us-east-1/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr21.vcf.bgz . --no-sign-request
echo "Real genomics data downloaded successfully!"
# Prepare reference genome
echo "Preparing reference genome..."
gunzip human_g1k_v37.fasta.gz
samtools faidx human_g1k_v37.fasta 20 > chr20.fasta
What this data contains:
- 1000 Genomes Project: Population genomics with whole genome sequencing and variant data from diverse populations
- TCGA Cancer Genomics: Matched tumor and normal tissue samples from breast cancer research
- NIH SRA Archive: High-throughput sequencing data from published genomics studies
- gnomAD Database: Population-scale variant frequencies from >140,000 individuals
- Formats: FASTQ sequencing reads, BAM alignments, VCF variant calls, and genomic coordinates
Index the Reference Genome
bwa index chr20.fasta
What this does: Prepares the reference genome for fast searching.
This will take: 30 seconds
Run Sequence Alignment with Real Data
bwa mem chr20.fasta SRR062634_1.filt.fastq.gz SRR062634_2.filt.fastq.gz > aligned.sam
What this does: Aligns real human DNA sequences to chromosome 20 reference.
This will take: 2-3 minutes
Convert to Binary Format and Sort
samtools view -bS aligned.sam | samtools sort -o aligned_sorted.bam
samtools index aligned_sorted.bam
What this does: Converts to efficient binary format and creates an index for fast access.
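The align, convert, sort, and index steps can also be chained so the large intermediate SAM file is never written to disk. The following is a sketch under stated assumptions, not the guide's official pipeline: `align_pair` is a hypothetical wrapper name, and the default of 2 threads assumes the 2 vCPUs of an r6i.large instance.

```shell
# Align paired reads and write a sorted, indexed BAM without keeping the SAM.
# align_pair is a hypothetical wrapper around the same bwa/samtools steps.
align_pair() {
  ref=$1; r1=$2; r2=$3; out=$4
  threads=${5:-2}   # assumption: r6i.large provides 2 vCPUs

  # Refuse to start if any input is missing or empty.
  for f in "$ref" "$r1" "$r2"; do
    if [ ! -s "$f" ]; then
      echo "Missing or empty input: $f" >&2
      return 1
    fi
  done

  # Stream alignments straight into the sorter, then index the result.
  bwa mem -t "$threads" "$ref" "$r1" "$r2" \
    | samtools sort -@ "$threads" -o "$out" - \
    && samtools index "$out"
}

# Example (run on the instance after `bwa index chr20.fasta`):
# align_pair chr20.fasta SRR062634_1.filt.fastq.gz SRR062634_2.filt.fastq.gz aligned_sorted.bam
```

Because the pipeline's exit status comes from `samtools sort`, a failure inside `bwa mem` can go unnoticed; checking the `samtools flagstat` totals afterwards is a reasonable safeguard.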
View Alignment Statistics
samtools flagstat aligned_sorted.bam
What you should see:
23589 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
89 + 0 supplementary
0 + 0 duplicates
23589 + 0 mapped (100.00% : N/A)
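If you want just the mapping rate from this report, for example to log it per sample, a short awk filter can pull it out. This is a sketch: `mapped_percent` is a hypothetical helper name, and it assumes the flagstat text layout shown above.

```shell
# Extract the overall mapped percentage from `samtools flagstat` output on stdin.
# mapped_percent is a hypothetical helper name.
mapped_percent() {
  awk '/ mapped \(/ {
    pct = $0
    sub(/^.*\(/, "", pct)   # drop everything up to the opening parenthesis
    sub(/%.*$/, "", pct)    # drop the percent sign and the rest of the line
    print pct
    exit
  }'
}

# Demo on the sample output shown above; prints 100.00
mapped_percent <<'EOF'
23589 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
89 + 0 supplementary
0 + 0 duplicates
23589 + 0 mapped (100.00% : N/A)
EOF
```

On the instance, the equivalent call would be `samtools flagstat aligned_sorted.bam | mapped_percent`.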
Check Alignment Quality
samtools view aligned_sorted.bam | head -3
What you should see: SAM records showing how reads align to the reference genome.
🎉 Success! You’ve run your first genomics analysis with real research data!
Analyze Population Variants from gnomAD
# Examine population variant frequencies
echo "=== gnomAD Population Variant Analysis ==="
echo "Analyzing population variant frequencies on chromosome 20..."
# Count total variants in gnomAD chromosome 20
echo "Counting variants in gnomAD chr20 dataset..."
bcftools view -H gnomad.genomes.v3.1.2.sites.chr20.vcf.bgz | wc -l
# Find high-frequency variants (>50% in population)
echo "Finding common variants (frequency > 0.5)..."
bcftools view -H -i 'AF>0.5' gnomad.genomes.v3.1.2.sites.chr20.vcf.bgz | head -10
# Extract variant quality statistics
echo "Variant quality statistics:"
bcftools query -f '%CHROM\t%POS\t%AF\t%AC\t%QUAL\n' gnomad.genomes.v3.1.2.sites.chr20.vcf.bgz | head -20
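The tab-separated rows from the query above can be summarized with awk instead of being read by eye. The sketch below assumes AF sits in column 3, as in that format string, and holds a single number per row; rows with a missing (`.`) or comma-separated AF are skipped. `af_summary` is a hypothetical helper name, and the demo rows are made up, not real gnomAD records.

```shell
# Summarize allele frequencies from tab-separated rows on stdin,
# where column 3 holds AF (the layout produced by the query format above).
# af_summary is a hypothetical helper name.
af_summary() {
  awk -F '\t' '
    $3 != "." && $3 ~ /^[0-9.eE+-]+$/ {
      total++
      if ($3 + 0 > 0.5) common++
    }
    END { printf "with_af=%d common=%d\n", total, common }
  '
}

# Demo with three made-up rows; prints: with_af=2 common=1
printf 'chr20\t100\t0.62\t10\t.\nchr20\t200\t1.3e-05\t1\t.\nchr20\t300\t.\t0\t.\n' | af_summary
```

On the instance, you could pipe the `bcftools query` command into `af_summary` in place of `head -20`.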
Compare Cancer vs Normal Samples
# Analyze TCGA cancer genomics data
echo "=== TCGA Cancer Genomics Analysis ==="
# Get basic statistics from tumor sample
echo "Tumor sample statistics:"
samtools flagstat C828.TCGA-A8-A08B-01A-11D-A011-09.bam
# Get basic statistics from normal sample
echo "Normal sample statistics:"
samtools flagstat C828.TCGA-A8-A08B-10A-01D-A011-09.bam
# Compare coverage between tumor and normal
echo "Coverage comparison (tumor vs normal):"
samtools depth C828.TCGA-A8-A08B-01A-11D-A011-09.bam | head -10
samtools depth C828.TCGA-A8-A08B-10A-01D-A011-09.bam | head -10
Process SRA Sequencing Data
# Convert SRA files to FASTQ format
echo "=== NIH SRA Data Processing ==="
echo "Converting SRA files to FASTQ format..."
# Use SRA toolkit to extract reads (if available)
# Note: This demonstrates the workflow - actual conversion may require sra-toolkit
echo "SRA file information:"
ls -lh SRR3189741 SRR3189742
echo "File sizes and formats:"
file SRR3189741 SRR3189742
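As the note above says, the actual conversion needs the SRA Toolkit. If it is installed, a guarded call to its `fasterq-dump` command could look like the sketch below. `sra_to_fastq` and the `fastq` output directory are hypothetical names, and this guide does not state that the toolkit is pre-installed.

```shell
# Convert a downloaded SRA run to FASTQ if the SRA Toolkit is installed.
# sra_to_fastq is a hypothetical wrapper; fasterq-dump ships with the SRA Toolkit,
# which this guide does not list among the pre-installed tools.
sra_to_fastq() {
  run=$1
  outdir=${2:-fastq}
  if ! command -v fasterq-dump >/dev/null 2>&1; then
    echo "fasterq-dump not found; install the SRA Toolkit first" >&2
    return 1
  fi
  mkdir -p "$outdir"
  # --split-files writes paired reads to separate files.
  fasterq-dump --split-files --outdir "$outdir" "$run"
}

# Example (on the instance):
# sra_to_fastq ./SRR3189741
```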
Explore More AWS Open Data (Optional)
# Browse available 1000 Genomes populations
aws s3 ls s3://1000genomes/phase3/data/ --no-sign-request | head -20
# Check out additional TCGA cancer types
aws s3 ls s3://tcga-2-open/ --no-sign-request | grep -E "TCGA-(LUAD|COAD|PRAD)"
# Explore gnomAD coverage data
aws s3 ls s3://gnomad-public-us-east-1/release/3.1.2/coverage/ --no-sign-request
# View dataset documentation
echo "Dataset registry information:"
echo "1000 Genomes: https://registry.opendata.aws/1000-genomes/"
echo "TCGA: https://registry.opendata.aws/tcga/"
echo "gnomAD: https://registry.opendata.aws/gnomad/"
Available datasets for further exploration:
- 1000 Genomes: 2,504 individuals from 26 populations worldwide
- TCGA: 33 cancer types with multi-omics data integration
- NIH SRA: 15+ million sequencing experiments from published studies
- gnomAD: Population genetics and variant frequency from >140K individuals
- UK Biobank: Large-scale genetic and health data from 500K participants
Step 9: Using Your Own Genomics Data
Instead of the tutorial data, you can analyze your own genomics datasets:
Upload Your Data
# Option 1: Upload from your local computer
scp -i ~/.ssh/id_rsa your_sequences.fastq.gz ec2-user@12.34.56.78:~/genomics-tutorial/
# Option 2: Download from your institution's server
wget https://your-institution.edu/data/sample_data.bam
# Option 3: Access your AWS S3 bucket
aws s3 cp s3://your-genomics-bucket/sequencing-data/ . --recursive
Common Data Formats Supported
- FASTQ files (.fastq, .fq, .fastq.gz): Raw sequencing reads
- BAM/SAM files (.bam, .sam): Aligned sequence data
- VCF files (.vcf, .vcf.gz): Variant call format for genetic variants
- BED files (.bed): Genomic intervals and annotations
- FASTA files (.fasta, .fa): Reference genomes and sequences
Replace Tutorial Commands
Simply substitute your filenames in any tutorial command:
# Instead of tutorial data:
bwa mem chr20.fasta SRR062634_1.filt.fastq.gz SRR062634_2.filt.fastq.gz > aligned.sam
# Use your data:
bwa mem your_reference.fasta your_sample_R1.fastq.gz your_sample_R2.fastq.gz > your_aligned.sam
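With several samples, the substitution can be scripted rather than typed by hand. The sketch below assumes your files follow an `_R1.fastq.gz` / `_R2.fastq.gz` naming pattern (an assumption, so adjust it to your own data); `mate_of` is a hypothetical helper name. It prints the commands instead of running them, so you can review them first.

```shell
# Derive the R2 filename from an R1 filename that ends in _R1.fastq.gz.
# mate_of and the _R1/_R2 naming are assumptions; adjust to your own files.
mate_of() {
  echo "${1%_R1.fastq.gz}_R2.fastq.gz"
}

# Print (rather than run) one alignment command per sample in the current directory.
for r1 in *_R1.fastq.gz; do
  [ -e "$r1" ] || continue   # no matching files: skip the literal pattern
  sample=${r1%_R1.fastq.gz}
  echo "bwa mem your_reference.fasta $r1 $(mate_of "$r1") > ${sample}.sam"
done
```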
Data Size Considerations
- Small datasets (<50 GB): Process directly on the instance
- Large datasets (50-500 GB): Use larger instance types or S3 for storage
- Whole genome datasets (>500 GB): Consider multi-sample processing pipelines
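To see which of these tiers a dataset falls into, you can measure it with `du` and compare against the thresholds. This is a sketch: `size_tier` is a hypothetical helper name, and the handling of the exact 50 GB and 500 GB boundaries is an assumption.

```shell
# Map a dataset size in whole GB to the three tiers listed above.
# size_tier is a hypothetical helper name.
size_tier() {
  gb=$1
  if [ "$gb" -lt 50 ]; then
    echo "small: process directly on the instance"
  elif [ "$gb" -le 500 ]; then
    echo "large: use a larger instance type or S3"
  else
    echo "whole-genome scale: consider a multi-sample pipeline"
  fi
}

# Measure a directory with du (size in KB) and classify it.
data_dir=.   # change this to your data directory
size_gb=$(du -sk "$data_dir" | awk '{ print int($1 / 1024 / 1024) }')
size_tier "$size_gb"
```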
Step 10: Monitor Your Costs
Check your current spending:
exit # Exit SSH session first
aws-research-wizard monitor costs --region us-west-2
Expected result: Shows costs so far (should be under $2 for this tutorial)
Step 11: Clean Up (Important!)
When you’re done experimenting:
aws-research-wizard deploy delete --region us-west-2
Type y when prompted.
What this does: Stops billing by removing your cloud resources.
💰 Important: Always clean up to avoid ongoing charges.
Expected result: “🗑️ Deletion completed successfully”
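For extra reassurance after deletion, you can ask AWS directly whether any instances are still running in the region. The sketch below uses the standard AWS CLI `ec2 describe-instances` command and assumes the AWS CLI is installed and configured on your computer; `no_running_instances` is a hypothetical helper name.

```shell
# Succeed only when no EC2 instances are in the "running" state in a region.
# no_running_instances is a hypothetical helper built on the standard AWS CLI;
# it assumes the AWS CLI is installed and configured on your computer.
no_running_instances() {
  region=${1:-us-west-2}
  ids=$(aws ec2 describe-instances \
    --region "$region" \
    --filters "Name=instance-state-name,Values=running" \
    --query "Reservations[].Instances[].InstanceId" \
    --output text) || return 2
  if [ -n "$ids" ]; then
    echo "Still running in $region: $ids"
    return 1
  fi
  echo "No running instances in $region"
}

# Example:
# no_running_instances us-west-2
```

The check covers every running instance in the account for that region, not only those the wizard created.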
Understanding Your Costs
What You’re Paying For
- Compute: $0.24 per hour while environment is running
- Storage: $0.023 per GB per month for data you save
- Data Transfer: Usually free for genomics data amounts
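As a worked example of the compute and storage rates above, the following sketch multiplies hours and stored gigabytes by those two figures. `estimate_cost` is a hypothetical helper name, and any charges beyond these two line items are not modeled.

```shell
# Estimate cost from the two rates listed above:
# $0.24 per compute hour and $0.023 per GB-month of storage.
# estimate_cost is a hypothetical helper; other charges are not modeled.
estimate_cost() {
  hours=$1
  storage_gb=${2:-0}
  awk -v h="$hours" -v gb="$storage_gb" \
    'BEGIN { printf "%.2f\n", h * 0.24 + gb * 0.023 }'
}

estimate_cost 4        # a 4-hour session, no stored data; prints 0.96
estimate_cost 40 100   # 40 compute hours plus 100 GB stored for a month; prints 11.90
```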
Cost Control Tips
- Always delete environments when not needed
- Use spot instances for 70% savings (advanced)
- Store large datasets in S3, not on the instance
- Monitor costs weekly with the built-in cost tracker
Typical Monthly Costs by Usage
- Light use (8 hours/week): $75-125
- Medium use (4 hours/day): $300-450
- Heavy use (8 hours/day): $600-900
What’s Next?
Now that you have a working genomics environment, you can:
Learn More About Genomics Tools
- [GATK Best Practices Pipeline Guide]
- [Large Dataset Processing Tutorial]
- [Cost Optimization for Genomics]
Explore Advanced Features
- [Multi-sample variant calling]
- [Team collaboration setup]
- [Automated pipeline deployment]
Join the Genomics Community
- [Genomics Research Forum]
- [GitHub Examples Repository]
- [Monthly Genomics Office Hours]
Extend and Contribute
🚀 Help us expand AWS Research Wizard!
Missing a tool or domain? We welcome suggestions for:
- New genomics software (e.g., STAR, Cufflinks, Trinity, MetaPhlAn, SPAdes)
- Additional domain packs (e.g., single-cell genomics, epigenomics, proteomics, metabolomics)
- New data sources or tutorials for specific research workflows
How to contribute:
This is an open research platform - your suggestions drive our development roadmap!
Troubleshooting
Common Issues
Problem: “Permission denied” when connecting with SSH
Solution: Make sure your SSH key has correct permissions: chmod 600 ~/.ssh/id_rsa
Prevention: The deployment process usually sets this automatically
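To check the key's permissions before changing them, you can inspect the mode string that `ls -l` prints. This is a sketch: `key_is_private` is a hypothetical helper name, and it only tests that group and other users have no access.

```shell
# Check that a private key grants no access to group or other users.
# key_is_private is a hypothetical helper name.
key_is_private() {
  key=$1
  [ -f "$key" ] || { echo "No such key file: $key" >&2; return 2; }
  # Characters 5-10 of the `ls -l` mode string are the group and other bits.
  perms=$(ls -ld "$key" | cut -c5-10)
  [ "$perms" = "------" ]
}

# Example:
# key_is_private ~/.ssh/id_rsa || chmod 600 ~/.ssh/id_rsa
```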
Problem: “Instance not found” error
Solution: Check that your region matches: aws-research-wizard deploy status --region us-west-2
Prevention: Always specify the same region in all commands
Problem: BWA or GATK commands not found
Solution: Wait 2-3 more minutes after deployment for software installation to complete
Prevention: The “Deployment completed” message means infrastructure is ready, not software
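Rather than retrying by hand while the software finishes installing, you could poll until a tool appears on the PATH. This is a sketch: `wait_for_tool` is a hypothetical helper name, and the 180-second default simply mirrors the 2-3 minute wait suggested in this guide.

```shell
# Poll until a command appears on the PATH or a timeout (in seconds) passes.
# wait_for_tool is a hypothetical helper name.
wait_for_tool() {
  tool=$1
  timeout=${2:-180}
  waited=0
  until command -v "$tool" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "$tool still not available after ${timeout}s" >&2
      return 1
    fi
    sleep 5
    waited=$((waited + 5))
  done
  echo "$tool is ready"
}

# Example (on the instance):
# wait_for_tool bwa && wait_for_tool gatk
```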
Getting Help
- Check the [genomics troubleshooting guide]
- Ask in [community forum]
- File an issue on [GitHub]
Emergency: Stop All Billing
If something goes wrong and you want to stop all charges immediately:
aws-research-wizard emergency-stop --region us-west-2 --confirm
Feedback
This guide should take 20 minutes and cost under $12.
*Last updated: January 2025 | Reading level: 8th grade | Tutorial tested: January 15, 2025*