Genomics Research Environment - Getting Started

Time to Complete: 20 minutes
Cost: $8-12 for tutorial
Skill Level: Beginner (no cloud experience needed)

What You’ll Build

By the end of this guide, you’ll have a working genomics research environment that can:

  • Process DNA sequence files (FASTQ, SAM/BAM, VCF)
  • Run popular tools like BWA, GATK, and SAMtools
  • Handle datasets up to 500GB in size
  • Cost 60% less than traditional computing clusters

Meet Dr. Sarah Kim

Dr. Sarah Kim is a genomics researcher at Johns Hopkins. She studies rare genetic diseases, but she typically waits 3-5 days for access to the university cluster, and each analysis then takes about a week to complete, slowing down her research.

Before: 3-day waits + 1-week analysis = 10 days per study
After: 15-minute setup + 4-hour analysis = same-day results
Time Saved: 95% faster research cycle
Cost Savings: $400/month vs $1,200 university allocation

Before You Start

What You Need

  • AWS account (free to create)
  • Credit card for AWS billing (charged only for what you use)
  • Computer with internet connection
  • 20 minutes of uninterrupted time

Cost Expectations

  • Tutorial cost: $8-12 (we’ll clean up resources when done)
  • Daily research cost: $15-45 when actively using
  • Monthly estimate: $150-450 for typical usage
  • Free tier: some storage included free for the first 12 months

Skills Needed

  • Basic computer use (creating folders, installing software)
  • Copy and paste commands
  • No cloud or programming experience required

Step 1: Install AWS Research Wizard

Choose your operating system:

macOS/Linux

curl -fsSL https://install.aws-research-wizard.com | sh

Windows

Download from: https://github.com/aws-research-wizard/releases/latest

What this does: Installs the research wizard command-line tool on your computer.

Expected result: You should see an “Installation successful” message.

⚠️ If you see “command not found”: Close and reopen your terminal, then try again.

Step 2: Set Up AWS Account

If you don’t have an AWS account:

  1. Go to aws.amazon.com
  2. Click “Create an AWS Account”
  3. Follow the signup process
  4. Important: Choose the free tier options

What this does: Creates your personal cloud computing account.

Expected result: You receive email confirmation from AWS.

💰 Cost note: Account creation is free. You only pay for resources you use.

Step 3: Configure Your Credentials

aws-research-wizard config setup

The wizard will ask for:

  • AWS Access Key: Found in AWS Console → Security Credentials
  • Secret Key: Created with your access key
  • Region: Choose us-west-2 (recommended for genomics)

What this does: Connects the research wizard to your AWS account.

Expected result: “✅ AWS credentials configured successfully”

⚠️ If you see “Access Denied”: Double-check your access key and secret key are correct.
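
If the wizard’s check keeps failing, you can test the same keys independently with the standard AWS CLI. This is a sketch that assumes the `aws` CLI may or may not be installed, so it skips the call gracefully if it isn’t:

```shell
# Independent credential check using the standard AWS CLI, if available.
if command -v aws >/dev/null 2>&1; then
  # Prints your account ID and user ARN as JSON when the keys are valid.
  aws sts get-caller-identity || echo "Keys rejected - re-run 'aws-research-wizard config setup'"
  checked=cli
else
  echo "AWS CLI not installed; rely on the wizard's own check above."
  checked=skipped
fi
```

`sts get-caller-identity` is a read-only call, so it is safe to run as often as you like.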

Step 4: Validate Your Setup

aws-research-wizard deploy validate --domain genomics --region us-west-2

What this does: Checks that everything is working before we spend money.

Expected result:

✅ AWS credentials valid
✅ Domain configuration valid: genomics
✅ Region valid: us-west-2 (6 availability zones)
🎉 All validations passed!

Step 5: Deploy Your Genomics Environment

aws-research-wizard deploy start --domain genomics --region us-west-2 --instance r6i.large

What this does: Creates your genomics computing environment in the cloud.

This will take: 3-5 minutes

Expected result:

🎉 Deployment completed successfully!

Deployment Details:
  Instance ID: i-1234567890abcdef0
  Public IP: 12.34.56.78
  SSH Command: ssh -i ~/.ssh/id_rsa ec2-user@12.34.56.78
  S3 Bucket: genomics-data-1234567

💰 Billing starts now: Your environment costs about $0.24 per hour while running.

Step 6: Connect to Your Environment

Use the SSH command from the previous step:

ssh -i ~/.ssh/id_rsa ec2-user@12.34.56.78

What this does: Connects you to your genomics computer in the cloud.

Expected result: You see a command prompt like [ec2-user@ip-10-0-1-123 ~]$

⚠️ If connection fails: The instance may still be booting - wait a minute and retry. Also check that your key file has the right permissions (chmod 600 ~/.ssh/id_rsa) and that your network allows outbound SSH on port 22.

Step 7: Explore Your Genomics Tools

Your environment comes pre-installed with:

Core Genomics Tools

  • BWA: DNA sequence alignment - Type bwa to start
  • GATK: Variant discovery - Type gatk --help to start
  • SAMtools: Sequence data processing - Type samtools to start
  • FastQC: Quality control - Type fastqc --help to start
  • bcftools: Variant calling utilities - Type bcftools to start

Try Your First Command

bwa

What this does: Shows BWA help and confirms it’s installed correctly.

Expected result: You see BWA version info and usage instructions.
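
You can verify the rest of the toolkit the same way. The loop below checks every tool named in the list above (adjust the names if your environment differs):

```shell
# Report which of the expected genomics tools are on PATH.
for tool in bwa samtools bcftools fastqc gatk; do
  if command -v "$tool" >/dev/null 2>&1; then
    status="installed"
  else
    status="NOT FOUND (software may still be installing - wait a few minutes)"
  fi
  echo "$tool: $status"
done
```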

Step 8: Run Analysis with Real Research Data

Let’s analyze real genomics data from the AWS Open Data Registry:

Download Real Genomics Data from AWS Open Data

📊 Data Download Summary:

  • 1000 Genomes Project: ~1.8 GB (population genomics and variant data)
  • TCGA Cancer Genomics: ~1.4 GB (tumor and normal samples)
  • NIH SRA Archive: ~1.2 GB (sequencing reads and metadata)
  • gnomAD Population Database: ~1.1 GB (population variant frequencies)
  • Total download: ~5.5 GB
  • Estimated time: 12-18 minutes on typical broadband
# Create working directory
mkdir ~/genomics-tutorial
cd ~/genomics-tutorial

# Download real human genome data from 1000 Genomes Project
echo "Downloading 1000 Genomes Project data (~1.8GB)..."
aws s3 cp s3://1000genomes/phase3/data/HG00096/sequence_read/SRR062634_1.filt.fastq.gz . --no-sign-request
aws s3 cp s3://1000genomes/phase3/data/HG00096/sequence_read/SRR062634_2.filt.fastq.gz . --no-sign-request
aws s3 cp s3://1000genomes/technical/reference/human_g1k_v37.fasta.gz . --no-sign-request
aws s3 cp s3://1000genomes/phase3/20130502.phase3.vcf.gz . --no-sign-request

echo "Downloading TCGA cancer genomics data (~1.4GB)..."
aws s3 cp s3://tcga-2-open/TCGA-BRCA/harmonized/Simple_Nucleotide_Variation/Raw_Sequencing_Data/WXS/C828.TCGA-A8-A08B-01A-11D-A011-09.bam . --no-sign-request
aws s3 cp s3://tcga-2-open/TCGA-BRCA/harmonized/Simple_Nucleotide_Variation/Raw_Sequencing_Data/WXS/C828.TCGA-A8-A08B-10A-01D-A011-09.bam . --no-sign-request

echo "Downloading NIH SRA sequencing archive (~1.2GB)..."
aws s3 cp s3://sra-pub-run-odp/sra/SRR3189741/SRR3189741 . --no-sign-request
aws s3 cp s3://sra-pub-run-odp/sra/SRR3189742/SRR3189742 . --no-sign-request

echo "Downloading gnomAD population variant database (~1.1GB)..."
aws s3 cp s3://gnomad-public-us-east-1/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr20.vcf.bgz . --no-sign-request
aws s3 cp s3://gnomad-public-us-east-1/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr21.vcf.bgz . --no-sign-request

echo "Real genomics data downloaded successfully!"

# Prepare reference genome: decompress, then extract chromosome 20
echo "Preparing reference genome..."
gunzip human_g1k_v37.fasta.gz
samtools faidx human_g1k_v37.fasta 20 > chr20.fasta

What this data contains:

  • 1000 Genomes Project: Population genomics with whole genome sequencing and variant data from diverse populations
  • TCGA Cancer Genomics: Matched tumor and normal tissue samples from breast cancer research
  • NIH SRA Archive: High-throughput sequencing data from published genomics studies
  • gnomAD Database: Population-scale variant frequencies from >140,000 individuals
  • Formats: FASTQ sequencing reads, BAM alignments, VCF variant calls, and genomic coordinates

Index the Reference Genome

bwa index chr20.fasta

What this does: Prepares the reference genome for fast searching.

This will take: 30 seconds

Run Sequence Alignment with Real Data

bwa mem chr20.fasta SRR062634_1.filt.fastq.gz SRR062634_2.filt.fastq.gz > aligned.sam

What this does: Aligns real human DNA sequences to chromosome 20 reference.

This will take: 2-3 minutes

Convert to Binary Format and Sort

samtools view -bS aligned.sam | samtools sort -o aligned_sorted.bam
samtools index aligned_sorted.bam

What this does: Converts to efficient binary format and creates an index for fast access.

View Alignment Statistics

samtools flagstat aligned_sorted.bam

What you should see:

23589 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
89 + 0 supplementary
0 + 0 duplicates
23589 + 0 mapped (100.00% : N/A)

Check Alignment Quality

samtools view aligned_sorted.bam | head -3

What you should see: SAM records showing how reads align to the reference genome.
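
Each SAM record is a tab-separated line whose first columns are read name, bitwise flag, reference name, 1-based position, and mapping quality. A toy record (hypothetical values, not from your output) shows the layout:

```shell
# Build a toy SAM record and pull out the key fields by position.
line=$(printf 'read1\t99\t20\t60000\t60\t100M\t=\t60200\t300\tACGTACGT\tIIIIIIII')
set -- $line   # split on whitespace into $1..$11
name=$1; flag=$2; ref=$3; pos=$4; mapq=$5
echo "name=$name flag=$flag ref=$ref pos=$pos mapq=$mapq"
# -> name=read1 flag=99 ref=20 pos=60000 mapq=60
```

The same positional trick works on real records piped from samtools view, though awk is the more common tool for that.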

🎉 Success! You’ve run your first genomics analysis with real research data!

Analyze Population Variants from gnomAD

# Examine population variant frequencies
echo "=== gnomAD Population Variant Analysis ==="
echo "Analyzing population variant frequencies on chromosome 20..."

# Count total variants in gnomAD chromosome 20
echo "Counting variants in gnomAD chr20 dataset..."
bcftools view -H gnomad.genomes.v3.1.2.sites.chr20.vcf.bgz | wc -l

# Find high-frequency variants (>50% in population)
echo "Finding common variants (frequency > 0.5)..."
bcftools view -i 'AF>0.5' gnomad.genomes.v3.1.2.sites.chr20.vcf.bgz | head -10

# Extract variant quality statistics
echo "Variant quality statistics:"
bcftools query -f '%CHROM\t%POS\t%AF\t%AC\t%QUAL\n' gnomad.genomes.v3.1.2.sites.chr20.vcf.bgz | head -20

Compare Cancer vs Normal Samples

# Analyze TCGA cancer genomics data
echo "=== TCGA Cancer Genomics Analysis ==="

# Get basic statistics from tumor sample
echo "Tumor sample statistics:"
samtools flagstat C828.TCGA-A8-A08B-01A-11D-A011-09.bam

# Get basic statistics from normal sample
echo "Normal sample statistics:"
samtools flagstat C828.TCGA-A8-A08B-10A-01D-A011-09.bam

# Compare coverage between tumor and normal
echo "Coverage comparison (tumor vs normal):"
samtools depth C828.TCGA-A8-A08B-01A-11D-A011-09.bam | head -10
samtools depth C828.TCGA-A8-A08B-10A-01D-A011-09.bam | head -10

Process SRA Sequencing Data

# Inspect the downloaded SRA archives
echo "=== NIH SRA Data Processing ==="
echo "Inspecting downloaded SRA files..."

# Note: converting SRA archives to FASTQ requires the sra-toolkit
echo "SRA file information:"
ls -lh SRR3189741 SRR3189742

echo "File sizes and formats:"
file SRR3189741 SRR3189742
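
To actually convert these archives to FASTQ you need the NCBI sra-toolkit, which is not guaranteed to be pre-installed on the instance. A guarded sketch using its standard `fasterq-dump` tool:

```shell
# Convert an SRA archive to paired FASTQ files, if sra-toolkit is present.
if command -v fasterq-dump >/dev/null 2>&1; then
  fasterq-dump --split-files SRR3189741   # writes SRR3189741_1.fastq and _2.fastq
  converted=yes
else
  echo "sra-toolkit not found - install it (e.g. via conda) to convert SRA files"
  converted=no
fi
```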

Explore More AWS Open Data (Optional)

# Browse available 1000 Genomes populations
aws s3 ls s3://1000genomes/phase3/data/ --no-sign-request | head -20

# Check out additional TCGA cancer types
aws s3 ls s3://tcga-2-open/ --no-sign-request | grep -E "TCGA-(LUAD|COAD|PRAD)"

# Explore gnomAD coverage data
aws s3 ls s3://gnomad-public-us-east-1/release/3.1.2/coverage/ --no-sign-request

# View dataset documentation
echo "Dataset registry information:"
echo "1000 Genomes: https://registry.opendata.aws/1000-genomes/"
echo "TCGA: https://registry.opendata.aws/tcga/"
echo "gnomAD: https://registry.opendata.aws/gnomad/"

Available datasets for further exploration:

  • 1000 Genomes: 2,504 individuals from 26 populations worldwide
  • TCGA: 33 cancer types with multi-omics data integration
  • NIH SRA: 15+ million sequencing experiments from published studies
  • gnomAD: Population genetics and variant frequency from >140K individuals
  • UK Biobank: Large-scale genetic and health data from 500K participants

Step 9: Using Your Own Genomics Data

Instead of the tutorial data, you can analyze your own genomics datasets:

Upload Your Data

# Option 1: Upload from your local computer
scp -i ~/.ssh/id_rsa your_sequences.fastq.gz ec2-user@12.34.56.78:~/genomics-tutorial/

# Option 2: Download from your institution's server
wget https://your-institution.edu/data/sample_data.bam

# Option 3: Access your AWS S3 bucket
aws s3 cp s3://your-genomics-bucket/sequencing-data/ . --recursive

Common Data Formats Supported

  • FASTQ files (.fastq, .fq, .fastq.gz): Raw sequencing reads
  • BAM/SAM files (.bam, .sam): Aligned sequence data
  • VCF files (.vcf, .vcf.gz): Variant call format for genetic variants
  • BED files (.bed): Genomic intervals and annotations
  • FASTA files (.fasta, .fa): Reference genomes and sequences
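
A quick sanity check for FASTQ files: each read is exactly four lines (header, sequence, `+` separator, quality string), so the read count is the line count divided by four. A self-contained example with a toy file:

```shell
# Create a toy FASTQ with two reads, then count records.
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTGGCCAA\n+\nIIIIIIII\n' > toy.fastq
reads=$(( $(wc -l < toy.fastq) / 4 ))
echo "$reads reads"   # -> 2 reads
```

For a real gzipped file, replace the printf line with `zcat your_sample.fastq.gz | wc -l` (hypothetical filename).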

Replace Tutorial Commands

Simply substitute your filenames in any tutorial command:

# Instead of tutorial data:
bwa mem chr20.fasta SRR062634_1.filt.fastq.gz SRR062634_2.filt.fastq.gz > aligned.sam

# Use your data:
bwa mem your_reference.fasta your_sample_R1.fastq.gz your_sample_R2.fastq.gz > your_aligned.sam

Data Size Considerations

  • Small datasets (<50 GB): Process directly on the instance
  • Large datasets (50-500 GB): Use larger instance types or S3 for storage
  • Whole genome datasets (>500 GB): Consider multi-sample processing pipelines
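
Before pulling a large dataset, check how much disk space the instance actually has free. These are standard Linux commands; the tutorial directory is the one created earlier in this guide:

```shell
# Free space on the filesystem holding your home directory.
df -h ~
# Size of the tutorial working directory, if it exists yet.
du -sh ~/genomics-tutorial 2>/dev/null || echo "no tutorial directory yet"
```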

Step 10: Monitor Your Costs

Check your current spending:

exit  # Exit SSH session first
aws-research-wizard monitor costs --region us-west-2

Expected result: Shows costs so far (should be under $2 for this tutorial)

Step 11: Clean Up (Important!)

When you’re done experimenting:

aws-research-wizard deploy delete --region us-west-2

Type y when prompted.

What this does: Stops billing by removing your cloud resources.

💰 Important: Always clean up to avoid ongoing charges.

Expected result: “🗑️ Deletion completed successfully”

Understanding Your Costs

What You’re Paying For

  • Compute: $0.24 per hour while environment is running
  • Storage: $0.023 per GB per month for data you save
  • Data Transfer: Usually free for genomics data amounts
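
Those two rates make rough budgeting easy. A back-of-envelope estimate in shell arithmetic, using the tutorial’s r6i.large rate (the hours and storage figures are example inputs, not recommendations; real research setups with larger instances cost more, which is what the higher monthly ranges below reflect; shell math is integer, so we work in cents):

```shell
# Example: 4 hours/day, 22 working days, 100 GB stored for the month.
HOURS_PER_DAY=4
DAYS=22
COMPUTE_CENTS_PER_HOUR=24   # $0.24/hour while running
STORAGE_GB=100
STORAGE_CENTS_PER_GB=2      # ~$0.023/GB-month, rounded down

compute=$((HOURS_PER_DAY * DAYS * COMPUTE_CENTS_PER_HOUR))   # in cents
storage=$((STORAGE_GB * STORAGE_CENTS_PER_GB))               # in cents
total=$(( (compute + storage) / 100 ))                       # whole dollars
echo "Estimated monthly cost: about \$$total"   # -> about $23
```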

Cost Control Tips

  • Always delete environments when not needed
  • Use spot instances for 70% savings (advanced)
  • Store large datasets in S3, not on the instance
  • Monitor costs weekly with the built-in cost tracker

Typical Monthly Costs by Usage

  • Light use (8 hours/week): $75-125
  • Medium use (4 hours/day): $300-450
  • Heavy use (8 hours/day): $600-900

What’s Next?

Now that you have a working genomics environment, you can:

Learn More About Genomics Tools

  • [GATK Best Practices Pipeline Guide]
  • [Large Dataset Processing Tutorial]
  • [Cost Optimization for Genomics]

Explore Advanced Features

  • [Multi-sample variant calling]
  • [Team collaboration setup]
  • [Automated pipeline deployment]

Join the Genomics Community

  • [Genomics Research Forum]
  • [GitHub Examples Repository]
  • [Monthly Genomics Office Hours]

Extend and Contribute

🚀 Help us expand AWS Research Wizard!

Missing a tool or domain? We welcome suggestions for:

  • New genomics software (e.g., STAR, Cufflinks, Trinity, MetaPhlAn, SPAdes)
  • Additional domain packs (e.g., single-cell genomics, epigenomics, proteomics, metabolomics)
  • New data sources or tutorials for specific research workflows

How to contribute:

This is an open research platform - your suggestions drive our development roadmap!

Troubleshooting

Common Issues

Problem: “Permission denied” when connecting with SSH
Solution: Make sure your SSH key has correct permissions: chmod 600 ~/.ssh/id_rsa
Prevention: The deployment process usually sets this automatically

Problem: “Instance not found” error
Solution: Check that your region matches: aws-research-wizard deploy status --region us-west-2
Prevention: Always specify the same region in all commands

Problem: BWA or GATK commands not found
Solution: Wait 2-3 more minutes after deployment for software installation to complete
Prevention: The “Deployment completed” message means infrastructure is ready, not software

Getting Help

  • Check the [genomics troubleshooting guide]
  • Ask in [community forum]
  • File an issue on [GitHub]

Emergency: Stop All Billing

If something goes wrong and you want to stop all charges immediately:

aws-research-wizard emergency-stop --region us-west-2 --confirm

Feedback

This guide should take 20 minutes and cost under $12. Help us improve:

Was this guide helpful? [Yes/No feedback buttons]

What was confusing? [Text box for feedback]

What would you add? [Text box for suggestions]

Rate the clarity (1-5): ⭐⭐⭐⭐⭐


*Last updated: January 2025 | Reading level: 8th grade | Tutorial tested: January 15, 2025*