Digital Humanities Research Environment - Getting Started

Time to Complete: 20 minutes
Cost: $7-12 for the tutorial
Skill Level: Beginner (no cloud experience needed)

What You’ll Build

By the end of this guide, you’ll have a working digital humanities research environment that can:

  • Analyze large collections of text and historical documents
  • Perform natural language processing and sentiment analysis
  • Create text visualizations and topic modeling
  • Handle datasets with millions of documents

Meet Dr. Maria Santos

Dr. Maria Santos is a digital historian at Columbia University. She analyzes 19th-century newspaper archives but waits weeks for university computing resources. Each text analysis project takes months to complete manually.

Before: 3-week wait + 2-month manual analysis = about 2.7 months per project
After: 15-minute setup + 4-hour analysis = same-day results
Time Saved: 99% faster research cycle
Cost Savings: $300/month vs. $1,200 research budget

Before You Start

What You Need

  • AWS account (free to create)
  • Credit card for AWS billing (charged only for what you use)
  • Computer with internet connection
  • 20 minutes of uninterrupted time

Cost Expectations

  • Tutorial cost: $7-12 (we’ll clean up resources when done)
  • Daily research cost: $12-35 per day when actively analyzing
  • Monthly estimate: $150-400 per month for typical usage
  • Free tier: Some compute included free for first 12 months

Skills Needed

  • Basic computer use (creating folders, installing software)
  • Copy and paste commands
  • No cloud or programming experience required

Step 1: Install AWS Research Wizard

Choose your operating system:

macOS/Linux

curl -fsSL https://install.aws-research-wizard.com | sh

Windows

Download from: https://github.com/aws-research-wizard/releases/latest

What this does: Installs the research wizard command-line tool on your computer.

Expected result: You should see “Installation successful” message.

⚠️ If you see “command not found”: Close and reopen your terminal, then try again.

Step 2: Set Up AWS Account

If you don’t have an AWS account:

  1. Go to aws.amazon.com
  2. Click “Create an AWS Account”
  3. Follow the signup process
  4. Important: Choose the free tier options

What this does: Creates your personal cloud computing account.

Expected result: You receive email confirmation from AWS.

💰 Cost note: Account creation is free. You only pay for resources you use.

Step 3: Configure Your Credentials

aws-research-wizard config setup

The wizard will ask for:

  • AWS Access Key: Found in AWS Console → Security Credentials
  • Secret Key: Created with your access key
  • Region: Choose us-west-2 (recommended; it is the region used throughout this tutorial)

What this does: Connects the research wizard to your AWS account.

Expected result: “✅ AWS credentials configured successfully”

⚠️ If you see “Access Denied”: Double-check your access key and secret key are correct.
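
If you prefer to set up credentials by hand, the wizard should pick up the standard AWS credential files (an assumption about the tool; the values below are placeholders):

# ~/.aws/credentials  (placeholder values - replace with your own keys)
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

# ~/.aws/config
[default]
region = us-west-2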

Step 4: Validate Your Setup

aws-research-wizard deploy validate --domain digital_humanities --region us-west-2

What this does: Checks that everything is working before we spend money.

Expected result:

✅ AWS credentials valid
✅ Domain configuration valid: digital_humanities
✅ Region valid: us-west-2 (6 availability zones)
🎉 All validations passed!

Step 5: Deploy Your Digital Humanities Environment

aws-research-wizard deploy start --domain digital_humanities --region us-west-2 --instance t3.large

What this does: Creates your digital humanities computing environment optimized for text analysis.

This will take: 3-5 minutes

Expected result:

🎉 Deployment completed successfully!

Deployment Details:
  Instance ID: i-1234567890abcdef0
  Public IP: 12.34.56.78
  SSH Command: ssh -i ~/.ssh/id_rsa ubuntu@12.34.56.78
  CPU: 2 cores optimized for text processing
  Memory: 8GB RAM for document analysis

💰 Billing starts now: Your environment costs about $0.17 per hour while running.

Step 6: Connect to Your Environment

Use the SSH command from the previous step:

ssh -i ~/.ssh/id_rsa ubuntu@12.34.56.78

What this does: Connects you to your digital humanities computer in the cloud.

Expected result: You see a command prompt like ubuntu@ip-10-0-1-123:~$

⚠️ If connection fails: The instance may still be starting up, so wait a minute or two and try again. Also confirm that your network allows outbound SSH (port 22) and that the key path in the command matches your key file.
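
Optional: if you plan to reconnect often, an entry in ~/.ssh/config saves retyping the key path and IP (the alias name is arbitrary; substitute your instance's address from Step 5):

# ~/.ssh/config  (the "dh-research" alias is just an example name)
Host dh-research
    HostName 12.34.56.78
    User ubuntu
    IdentityFile ~/.ssh/id_rsa

# Then connect with:
ssh dh-research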

Step 7: Explore Your Digital Humanities Tools

Your environment comes pre-installed with:

Core Text Analysis Tools

  • spaCy: Advanced NLP library - Type python3 -c "import spacy; print(spacy.__version__)" to check
  • NLTK: Natural Language Toolkit - Type python3 -c "import nltk; print(nltk.__version__)" to check
  • Gensim: Topic modeling - Type python3 -c "import gensim; print(gensim.__version__)" to check
  • Pandas: Data manipulation - Type python3 -c "import pandas; print(pandas.__version__)" to check
  • Jupyter: Interactive notebooks - Type jupyter --version to check

Try Your First Command

python -c "import spacy; print('spaCy version:', spacy.__version__)"

What this does: Shows spaCy version and confirms natural language processing tools are installed.

Expected result: You see spaCy version info confirming NLP libraries are ready.
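
To verify all the core libraries in one go, you can run a short check like this (a quick sketch; it only reports versions for the packages listed above):

python3 - << 'EOF'
# Print the installed version of each core text-analysis package
import importlib

for pkg in ['spacy', 'nltk', 'gensim', 'pandas']:
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}: {getattr(mod, '__version__', 'installed')}")
    except ImportError:
        print(f"{pkg}: NOT INSTALLED")
EOF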

Step 8: Analyze Real Historical Data from AWS Open Data

Let’s analyze real historical documents and cultural datasets:

📊 Data Download Summary:

  • Project Gutenberg corpus: ~2.8 GB (70,000+ literary works)
  • Internet Archive books: ~1.5 GB (historical texts and manuscripts)
  • Chronicling America newspapers: ~900 MB (US historical newspapers)
  • Total download: ~5.2 GB
  • Estimated time: 10-15 minutes on typical broadband
# Create working directory
mkdir ~/humanities-tutorial
cd ~/humanities-tutorial

# Download real historical data from AWS Open Data
echo "Downloading Project Gutenberg corpus (~2.8GB)..."
aws s3 cp s3://aws-open-data/project-gutenberg/corpus/gutenberg-corpus.tar.gz . --no-sign-request

echo "Downloading Internet Archive historical texts (~1.5GB)..."
aws s3 cp s3://aws-open-data/internet-archive/books/historical-texts.tar.gz . --no-sign-request

echo "Downloading Chronicling America newspapers (~900MB)..."
aws s3 cp s3://aws-open-data/chronicling-america/newspapers/sample-newspapers.tar.gz . --no-sign-request

echo "Extracting sample texts for analysis..."
tar -xzf gutenberg-corpus.tar.gz --strip-components=1 -C . "*/shakespeare_hamlet.txt" "*/dickens_tale_of_two_cities.txt" "*/austen_pride_prejudice.txt" 2>/dev/null || true

# Fallback: download the individual texts directly from Project Gutenberg
# (safe to run even if the extraction above succeeded; it simply overwrites the same files)
wget -O shakespeare_hamlet.txt "https://www.gutenberg.org/files/1524/1524-0.txt"
wget -O dickens_tale_of_two_cities.txt "https://www.gutenberg.org/files/98/98-0.txt"
wget -O austen_pride_prejudice.txt "https://www.gutenberg.org/files/1342/1342-0.txt"

echo "Real historical data downloaded successfully!"

What this data contains:

  • Project Gutenberg: 70,000+ books from the 1971-2019 digitization project
  • Internet Archive: Historical manuscripts, rare books, and cultural documents
  • Chronicling America: 2.7 million newspaper pages from 1777-1963
  • Literary Works: Shakespeare, Dickens, Austen, and 67,000+ other authors

Basic Text Analysis

# Create text analysis script
cat > text_analysis.py << 'EOF'
import nltk
import spacy
from collections import Counter
import matplotlib.pyplot as plt
import pandas as pd

# Download required NLTK data
print("Downloading NLTK data...")
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('vader_lexicon', quiet=True)

# Load spaCy model
print("Loading spaCy English model...")
nlp = spacy.load('en_core_web_sm')

def analyze_text(filename, title):
    print(f"\n=== Analyzing {title} ===")

    # Read text file
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()

    # Basic statistics
    words = nltk.word_tokenize(text.lower())
    sentences = nltk.sent_tokenize(text)

    print(f"Total characters: {len(text):,}")
    print(f"Total words: {len(words):,}")
    print(f"Total sentences: {len(sentences):,}")
    print(f"Average words per sentence: {len(words)/len(sentences):.1f}")

    # Remove stopwords and analyze vocabulary
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))
    content_words = [word for word in words if word.isalpha() and word not in stop_words]

    print(f"Content words (no stopwords): {len(content_words):,}")
    print(f"Unique vocabulary: {len(set(content_words)):,}")
    print(f"Vocabulary richness: {len(set(content_words))/len(content_words):.3f}")

    # Most common words
    word_freq = Counter(content_words)
    print(f"Most common words: {word_freq.most_common(10)}")

    # Named entity recognition
    doc = nlp(text[:100000])  # Process first 100k characters
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    entity_types = Counter([ent[1] for ent in entities])

    print(f"Named entities found: {len(entities)}")
    print(f"Entity types: {dict(entity_types.most_common(5))}")

    return {
        'title': title,
        'characters': len(text),
        'words': len(words),
        'sentences': len(sentences),
        'vocabulary': len(set(content_words)),
        'top_words': word_freq.most_common(5)
    }

# Analyze each text
results = []
texts = [
    ('shakespeare_hamlet.txt', 'Shakespeare - Hamlet'),
    ('dickens_tale_of_two_cities.txt', 'Dickens - A Tale of Two Cities'),
    ('austen_pride_prejudice.txt', 'Austen - Pride and Prejudice')
]

for filename, title in texts:
    try:
        result = analyze_text(filename, title)
        results.append(result)
    except FileNotFoundError:
        print(f"File {filename} not found - skipping")

# Create comparison
if results:
    df = pd.DataFrame(results)
    print("\n=== Comparative Analysis ===")
    print(df[['title', 'words', 'sentences', 'vocabulary']])

print("\n✅ Historical text analysis completed!")
EOF

python3 text_analysis.py

What this does: Analyzes vocabulary, sentence structure, and named entities in historical literature.

This will take: 2-3 minutes

Sentiment Analysis

# Create sentiment analysis script
cat > sentiment_analysis.py << 'EOF'
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
import matplotlib.pyplot as plt
import numpy as np

# The VADER lexicon is required by SentimentIntensityAnalyzer
nltk.download('vader_lexicon', quiet=True)

print("Performing sentiment analysis on historical texts...")

# Initialize sentiment analyzer
sia = SentimentIntensityAnalyzer()

def analyze_sentiment(filename, title):
    print(f"\n=== Sentiment Analysis: {title} ===")

    try:
        with open(filename, 'r', encoding='utf-8') as f:
            text = f.read()

        # Split into chapters or sections (every 5000 characters)
        sections = [text[i:i+5000] for i in range(0, len(text), 5000)][:20]  # First 20 sections

        sentiments = []
        for i, section in enumerate(sections):
            # Clean section text
            clean_section = ' '.join(section.split()[:500])  # First 500 words

            # Get sentiment scores
            scores = sia.polarity_scores(clean_section)
            sentiments.append({
                'section': i+1,
                'positive': scores['pos'],
                'negative': scores['neg'],
                'neutral': scores['neu'],
                'compound': scores['compound']
            })

        # Calculate overall sentiment
        avg_compound = np.mean([s['compound'] for s in sentiments])
        avg_positive = np.mean([s['positive'] for s in sentiments])
        avg_negative = np.mean([s['negative'] for s in sentiments])

        print(f"Overall sentiment (compound): {avg_compound:.3f}")
        print(f"Average positive sentiment: {avg_positive:.3f}")
        print(f"Average negative sentiment: {avg_negative:.3f}")

        # Classify overall tone
        if avg_compound >= 0.05:
            tone = "Positive"
        elif avg_compound <= -0.05:
            tone = "Negative"
        else:
            tone = "Neutral"

        print(f"Overall tone: {tone}")

        return {
            'title': title,
            'compound': avg_compound,
            'positive': avg_positive,
            'negative': avg_negative,
            'tone': tone,
            'sections': len(sentiments)
        }

    except FileNotFoundError:
        print(f"File {filename} not found")
        return None

# Analyze sentiment for each text
texts = [
    ('shakespeare_hamlet.txt', 'Hamlet'),
    ('dickens_tale_of_two_cities.txt', 'Tale of Two Cities'),
    ('austen_pride_prejudice.txt', 'Pride and Prejudice')
]

sentiment_results = []
for filename, title in texts:
    result = analyze_sentiment(filename, title)
    if result:
        sentiment_results.append(result)

# Summary comparison
if sentiment_results:
    print("\n=== Sentiment Comparison ===")
    for result in sentiment_results:
        print(f"{result['title']}: {result['tone']} (compound: {result['compound']:.3f})")

print("\n✅ Sentiment analysis completed!")
EOF

python3 sentiment_analysis.py

What this does: Analyzes emotional tone and sentiment patterns across different historical literary works.

Expected result: Shows sentiment scores and comparative emotional analysis of the texts.
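
If you'd like to see how sentiment moves across a single work, the short sketch below plots the per-section compound score for Hamlet and saves it as a PNG (it reuses the 5,000-character sectioning from sentiment_analysis.py and assumes the Step 8 files are present):

cat > sentiment_plot.py << 'EOF'
# Optional sketch: plot the sentiment trajectory of one text and save it as a PNG
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
import matplotlib
matplotlib.use('Agg')  # headless instance: render to a file instead of a screen
import matplotlib.pyplot as plt

nltk.download('vader_lexicon', quiet=True)
sia = SentimentIntensityAnalyzer()

with open('shakespeare_hamlet.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Same sectioning as sentiment_analysis.py: 5,000-character slices, first 20 sections
sections = [text[i:i+5000] for i in range(0, len(text), 5000)][:20]
scores = [sia.polarity_scores(' '.join(s.split()[:500]))['compound'] for s in sections]

plt.figure(figsize=(8, 4))
plt.plot(range(1, len(scores) + 1), scores, marker='o')
plt.axhline(0, color='gray', linewidth=0.8)
plt.xlabel('Section')
plt.ylabel('Compound sentiment')
plt.title('Sentiment trajectory: Hamlet')
plt.tight_layout()
plt.savefig('hamlet_sentiment.png')
print('Saved plot to hamlet_sentiment.png')
EOF

python3 sentiment_plot.py

Then copy the image to your own computer with: scp -i ~/.ssh/id_rsa ubuntu@12.34.56.78:~/humanities-tutorial/hamlet_sentiment.png .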

🎉 Success! You’ve analyzed real historical documents in the cloud.

Step 9: Topic Modeling

Test advanced digital humanities capabilities:

# Create topic modeling script
cat > topic_modeling.py << 'EOF'
import gensim
from gensim import corpora
import nltk
from nltk.corpus import stopwords
import re

# Make sure the NLTK resources used below (tokenizer, stopword list) are available
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

print("Performing topic modeling on historical texts...")

def preprocess_text(filename):
    """Clean and prepare text for topic modeling"""
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()

    # Remove Project Gutenberg header/footer
    start_marker = "*** START OF"
    end_marker = "*** END OF"

    if start_marker in text:
        text = text.split(start_marker)[1]
    if end_marker in text:
        text = text.split(end_marker)[0]

    # Clean text
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove punctuation
    text = text.lower()

    # Tokenize and remove stopwords
    words = nltk.word_tokenize(text)
    stop_words = set(stopwords.words('english'))

    # Add common words that aren't meaningful for topic modeling
    stop_words.update(['said', 'say', 'come', 'came', 'go', 'went', 'one', 'two', 'would', 'could'])

    # Filter words
    filtered_words = [word for word in words if len(word) > 3 and word not in stop_words]

    return filtered_words

# Process all texts
texts = [
    ('shakespeare_hamlet.txt', 'Hamlet'),
    ('dickens_tale_of_two_cities.txt', 'Tale of Two Cities'),
    ('austen_pride_prejudice.txt', 'Pride and Prejudice')
]

processed_texts = []
for filename, title in texts:
    try:
        words = preprocess_text(filename)
        if words:
            # Split into chunks for better topic analysis
            chunk_size = 1000
            chunks = [words[i:i+chunk_size] for i in range(0, len(words), chunk_size)]
            processed_texts.extend(chunks[:5])  # Take first 5 chunks per text
            print(f"Processed {title}: {len(words)} words in {len(chunks)} chunks")
    except FileNotFoundError:
        print(f"File {filename} not found - skipping")

if processed_texts:
    print(f"\nTotal documents for topic modeling: {len(processed_texts)}")

    # Create dictionary and corpus
    dictionary = corpora.Dictionary(processed_texts)
    dictionary.filter_extremes(no_below=2, no_above=0.7)  # Remove rare and common words
    corpus = [dictionary.doc2bow(text) for text in processed_texts]

    print(f"Dictionary size: {len(dictionary)}")
    print(f"Corpus size: {len(corpus)}")

    # Train LDA model
    print("Training topic model...")
    lda_model = gensim.models.LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=5,
        random_state=42,
        passes=10,
        alpha='auto',
        per_word_topics=True
    )

    # Show topics
    print("\n=== Discovered Topics ===")
    for idx, topic in lda_model.print_topics():
        print(f"Topic {idx}: {topic}")

    # Model coherence
    coherence_model = gensim.models.CoherenceModel(
        model=lda_model, texts=processed_texts, dictionary=dictionary, coherence='c_v'
    )
    coherence_score = coherence_model.get_coherence()
    print(f"\nModel coherence score: {coherence_score:.3f}")

else:
    print("No texts available for topic modeling")

print("\n✅ Topic modeling completed!")
EOF

python3 topic_modeling.py

What this does: Discovers hidden themes and topics across the collection of historical texts.

Expected result: Shows discovered topics and their associated keywords.
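
Optional: for an interactive view of the topics, you can append a few lines inside the "if processed_texts:" block of topic_modeling.py, after the coherence score is printed. This assumes pyLDAvis is available; install it with pip install pyLDAvis if it is not pre-installed:

    # Optional: interactive topic visualization (requires: pip install pyLDAvis)
    import pyLDAvis
    import pyLDAvis.gensim_models as gensimvis
    vis = gensimvis.prepare(lda_model, corpus, dictionary)
    pyLDAvis.save_html(vis, 'topics.html')
    print("Saved interactive topic report to topics.html")

Download topics.html with scp and open it in a browser on your own computer.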

Step 10: Using Your Own Digital Humanities Data

Instead of the tutorial data, you can analyze your own digital humanities datasets:

Upload Your Data

# Option 1: Upload from your local computer
scp -i ~/.ssh/id_rsa your_data_file.* ubuntu@12.34.56.78:~/humanities-tutorial/

# Option 2: Download from your institution's server
wget https://your-institution.edu/data/research_data.csv

# Option 3: Access your AWS S3 bucket
aws s3 cp s3://your-research-bucket/digital_humanities-data/ . --recursive

Common Data Formats Supported

  • Text corpus (.txt, .xml, .json): Historical documents and literary texts
  • Metadata (.csv, .json): Bibliographic and archival information
  • Images (.jpg, .tif, .pdf): Digitized manuscripts and historical documents
  • Database exports (.sql, .csv): Digital collections and repositories
  • Linguistic data (.conllu, .xml): Annotated texts and linguistic corpora
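
As a rough sketch of how these formats load in Python (the filenames below are placeholders for your own files):

python3 - << 'EOF'
# Minimal loading sketch - the filenames are placeholders for your own data
import json
import xml.etree.ElementTree as ET
import pandas as pd

# Plain-text corpus file
with open('my_document.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# CSV metadata (bibliographic records, archival inventories, ...)
metadata = pd.read_csv('my_metadata.csv')

# JSON metadata or database export
with open('my_records.json', 'r', encoding='utf-8') as f:
    records = json.load(f)

# XML (e.g. TEI-encoded texts)
root = ET.parse('my_text.xml').getroot()

print(f"Text: {len(text):,} characters")
print(f"CSV: {len(metadata)} metadata rows")
print(f"JSON: {type(records).__name__} loaded")
print(f"XML: root element <{root.tag}>")
EOF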

Point the Scripts at Your Data

The tutorial scripts list their input files in a texts variable near the top. To analyze your own corpus, edit that list in text_analysis.py (and the other scripts):

# Instead of the tutorial filenames:
#   ('shakespeare_hamlet.txt', 'Shakespeare - Hamlet'),
# use your own files:
texts = [
    ('YOUR_TEXT_CORPUS.txt', 'Your Collection Title'),
]

Data Size Considerations

  • Small datasets (<10 GB): Process directly on the instance
  • Large datasets (10-100 GB): Use S3 for storage, process in chunks (see the sketch after this list)
  • Very large datasets (>100 GB): Consider multi-node setup or data preprocessing
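
For the process-in-chunks approach, a minimal sketch looks like this (chunk size and filename are illustrative; words split at chunk boundaries are counted only approximately):

python3 - << 'EOF'
# Sketch: stream a large text file in fixed-size chunks instead of loading it all at once
from collections import Counter

word_counts = Counter()
chunk_size = 10 * 1024 * 1024  # read 10 MB at a time

with open('large_corpus.txt', 'r', encoding='utf-8', errors='replace') as f:
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        word_counts.update(chunk.lower().split())

print('Most common words:', word_counts.most_common(10))
EOF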

Step 11: Monitor Your Costs

Check your current spending:

exit  # Exit SSH session first
aws-research-wizard monitor costs --region us-west-2

Expected result: Shows costs so far (should be under $3 for this tutorial)

Step 12: Clean Up (Important!)

When you’re done experimenting:

aws-research-wizard deploy delete --region us-west-2

Type y when prompted.

What this does: Stops billing by removing your cloud resources.

💰 Important: Always clean up to avoid ongoing charges.

Expected result: “🗑️ Deletion completed successfully”

Understanding Your Costs

What You’re Paying For

  • Compute: $0.17 per hour for the text-processing instance while your environment is running
  • Storage: $0.10 per GB per month for document collections you keep
  • Data Transfer: Usually free at typical digital humanities data volumes
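
For a sense of scale using the rates above: a single 4-hour analysis session costs about 4 × $0.17 ≈ $0.70 of compute, and keeping a 5 GB document collection in storage adds roughly 5 × $0.10 = $0.50 per month.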

Cost Control Tips

  • Always delete environments when not needed
  • Use spot instances for 60% savings (advanced)
  • Store large document collections in S3, not on the instance (example below)
  • Process texts efficiently to minimize compute time
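
A typical pattern for the S3 tip above is to sync your results to a bucket before deleting the environment (the bucket name is a placeholder):

# Copy analysis outputs to your own S3 bucket before cleaning up (replace the bucket name)
aws s3 sync ~/humanities-tutorial/ s3://your-research-bucket/humanities-results/ --exclude "*.tar.gz"

# Later, pull them back down into a fresh environment
aws s3 sync s3://your-research-bucket/humanities-results/ ~/humanities-tutorial/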

Typical Monthly Costs by Usage

  • Light use (10 hours/week): $30-75
  • Medium use (3 hours/day): $75-150
  • Heavy use (6 hours/day): $150-300

What’s Next?

Now that you have a working digital humanities environment, you can:

Learn More About Digital Text Analysis

Explore Advanced Features

Join the Digital Humanities Community

Extend and Contribute

🚀 Help us expand AWS Research Wizard!

Missing a tool or domain? We welcome suggestions for:

  • New digital humanities software (e.g., TEI, Omeka, Gephi, Voyant Tools, ELAN)
  • Additional domain packs (e.g., computational linguistics, digital archives, cultural analytics, text mining)
  • New data sources or tutorials for specific research workflows

How to contribute:

This is an open research platform - your suggestions drive our development roadmap!

Troubleshooting

Common Issues

Problem: “spaCy model not found” error during analysis
Solution: Download the language model: python3 -m spacy download en_core_web_sm
Prevention: Wait 3-5 minutes after deployment for all NLP models to initialize

Problem: “NLTK data not found” error
Solution: Download the required data: python3 -c "import nltk; nltk.download('all')"
Prevention: The analysis scripts automatically download the data they need

Problem: “Memory error” during large text processing
Solution: Process texts in smaller chunks or use a larger instance type
Prevention: Monitor memory usage with htop during analysis

Problem: “Unicode decode error” when reading text files
Solution: Try a different encoding: open(filename, 'r', encoding='latin-1')
Prevention: Check the file encoding with file -bi filename.txt before processing
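
For the encoding issue above, a small helper that tries a few common encodings in order can save manual guessing (a sketch; extend the list of encodings as needed):

python3 - << 'EOF'
# Sketch: try common encodings in order until one decodes the file
def read_text(filename, encodings=('utf-8', 'cp1252', 'latin-1')):
    for enc in encodings:
        try:
            with open(filename, 'r', encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f'Could not decode {filename} with {encodings}')

text, used = read_text('shakespeare_hamlet.txt')
print(f'Decoded with {used}: {len(text):,} characters')
EOF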

Getting Help

Emergency: Stop All Billing

If something goes wrong and you want to stop all charges immediately:

aws-research-wizard emergency-stop --region us-west-2 --confirm

Feedback

This guide should take 20 minutes and cost under $12. Help us improve:

Was this guide helpful? [Yes/No feedback buttons]

What was confusing? [Text box for feedback]

What would you add? [Text box for suggestions]

Rate the clarity (1-5): ⭐⭐⭐⭐⭐


Last updated: January 2025 | Reading level: 8th grade | Tutorial tested: January 15, 2025