Welcome to the Web Scraping Microservice! This documentation provides a comprehensive guide for setting up, deploying, and maintaining the service on AWS Lambda. Enjoy the journey! 🚀
Below is the layout of the project:
lambda-services/scrape-service/
├── src/
│ └── scrape_lambda.py
├── requirements.txt
├── Dockerfile
├── build_and_deploy.sh
├── test/
│ └── test_scrape_lambda.py
└── README.md
Before getting started, ensure you have:
- AWS CLI configured with proper credentials 🔑
- Python 3.12 installed 🐍
- Docker installed (optional for container deployment) 🐳
- Node.js (for optional monitoring setup) ⚙️
- Deployment User Policy (for AWS CLI deployment):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"lambda:CreateFunction",
"lambda:UpdateFunctionCode",
"lambda:UpdateFunctionConfiguration",
"lambda:PublishVersion",
"lambda:CreateAlias",
"lambda:UpdateAlias",
"lambda:DeleteFunction",
"lambda:GetFunction",
"lambda:InvokeFunction",
"lambda:CreateFunctionUrlConfig",
"lambda:AddPermission",
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"s3:PutObject",
"s3:GetObject",
"s3:ListBucket",
"apigateway:*",
"cloudwatch:*",
"codebuild:*",
"applicationinsights:*"
],
"Resource": [
"arn:aws:lambda:*:*:function:ScrapeService*",
"arn:aws:logs:*:*:log-group:/aws/lambda/ScrapeService*",
"arn:aws:s3:::inovationai-lambda-services-bucket/*"
]
},
{
"Effect": "Allow",
"Action": [
"iam:PassRole",
"iam:CreateRole",
"iam:GetRole",
"iam:AttachRolePolicy"
],
"Resource": "arn:aws:iam::*:role/lambda-scrape-service-role"
}
]
}
- Lambda Execution Role Policy (automatically attached to Lambda function):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": [
"arn:aws:logs:*:*:log-group:/aws/lambda/ScrapeService*"
]
}
]
}
You can create these policies in the IAM console:
- Go to IAM Console
- Create new policies with the JSON above
- For the deployment user policy, attach it to your deployment user
- The Lambda execution policy will be automatically attached to the Lambda role during deployment
The AWS_PROFILE refers to a named profile in your AWS credentials file. Here's how to set it up:
- After running aws configure, your credentials are stored in:
  - Windows: %UserProfile%\.aws\credentials
  - Linux/MacOS: ~/.aws/credentials
- The credentials file looks like this:
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
region = us-east-1
[development]
aws_access_key_id = ANOTHER_ACCESS_KEY
aws_secret_access_key = ANOTHER_SECRET_KEY
region = us-east-1
- You can create multiple profiles using:
# Create a new named profile
aws configure --profile development
- Then use the profile name in your configuration:
# Linux/MacOS
export AWS_PROFILE="development"
# Windows PowerShell
$env:AWS_PROFILE="development"
Note: If you only have one AWS account, you can use "default" as your profile name or omit the AWS_PROFILE setting entirely.
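The profile lookup described above can be sketched in Python. This is a minimal illustration, assuming the credentials file uses the INI layout shown earlier; `list_profiles` and `resolve_profile` are hypothetical helper names, not part of the service or the AWS CLI.

```python
import configparser

# Sample contents mirroring the credentials file shown above (placeholder values).
SAMPLE_CREDENTIALS = """
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY

[development]
aws_access_key_id = ANOTHER_ACCESS_KEY
aws_secret_access_key = ANOTHER_SECRET_KEY
"""


def list_profiles(credentials_text: str) -> list[str]:
    """Return the profile names (INI section headers) in file order."""
    # Section names are case-sensitive, so [default] is a regular section
    # and is not confused with configparser's special DEFAULT section.
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    return parser.sections()


def resolve_profile(env: dict[str, str], available: list[str]) -> str:
    """Pick AWS_PROFILE from the environment, falling back to "default"."""
    profile = env.get("AWS_PROFILE") or "default"
    if profile not in available:
        raise ValueError(f"profile {profile!r} not found; available: {available}")
    return profile


if __name__ == "__main__":
    profiles = list_profiles(SAMPLE_CREDENTIALS)
    print(resolve_profile({"AWS_PROFILE": "development"}, profiles))
```

Passing the environment as a plain dictionary keeps the helper easy to test; in a real script you would likely pass `dict(os.environ)`.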
# Install AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# Configure AWS CLI
aws configure
# You will be prompted for:
# AWS Access Key ID: [Your access key]
# AWS Secret Access Key: [Your secret key]
# Default region name: [Your region, e.g., us-east-1]
# Default output format: [json]
# Install AWS CLI using MSI installer
# Download from: https://awscli.amazonaws.com/AWSCLIV2.msi
# Or using winget:
winget install -e --id Amazon.AWSCLI
# Configure AWS CLI (PowerShell or Command Prompt)
aws configure
# You will be prompted for:
# AWS Access Key ID: [Your access key]
# AWS Secret Access Key: [Your secret key]
# Default region name: [Your region, e.g., us-east-1]
# Default output format: [json]
# Create virtual environment
python -m venv venv
# Activate virtual environment
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Create virtual environment
python -m venv venv-win
# Activate virtual environment (PowerShell)
.\venv-win\Scripts\Activate.ps1
# OR (Command Prompt)
.\venv-win\Scripts\activate.bat
# Install dependencies
pip install -r requirements.txt
# Add to ~/.bashrc or ~/.zshrc
export AWS_PROFILE="your-profile"
export AWS_REGION="us-east-1"
export LAMBDA_FUNCTION="ScrapeService"
export LOG_LEVEL="INFO"
# Using PowerShell (User level)
[System.Environment]::SetEnvironmentVariable('AWS_PROFILE', 'your-profile', 'User')
[System.Environment]::SetEnvironmentVariable('AWS_REGION', 'us-east-1', 'User')
[System.Environment]::SetEnvironmentVariable('LAMBDA_FUNCTION', 'ScrapeService', 'User')
[System.Environment]::SetEnvironmentVariable('LOG_LEVEL', 'INFO', 'User')
# OR using Command Prompt
setx AWS_PROFILE "your-profile"
setx AWS_REGION "us-east-1"
setx LAMBDA_FUNCTION "ScrapeService"
setx LOG_LEVEL "INFO"
Set your environment variables appropriately:
# Required
export AWS_PROFILE="your-profile" # AWS CLI profile
export AWS_REGION="us-east-1" # AWS region
# Optional
export LAMBDA_FUNCTION="ScrapeService" # Default function name
export LOG_LEVEL="INFO" # DEBUG/INFO/WARNING/ERROR
Deploy your service to AWS Lambda using one of the methods below:
Set these environment variables before deployment:
# Required
export AWS_PROFILE="your-aws-cli-profile" # AWS credentials profile
export AWS_REGION="us-east-1" # AWS region
export LAMBDA_FUNCTION="ScrapeService" # Lambda function name
# Optional
export LAMBDA_TIMEOUT=30 # Execution timeout in seconds
export LAMBDA_MEMORY_SIZE=512 # Memory allocation in MB
export LOG_LEVEL="INFO" # Debugging: DEBUG/INFO/WARNING/ERROR
- Edit deployment script:
# Open the deployment script
nano build_and_deploy.sh
# Modify these lines (if needed):
SERVICE_NAME="scrape-service"
AWS_PROFILE="default" # Change to your AWS profile
AWS_REGION="us-east-1" # Update your region
LAMBDA_FUNCTION="ScrapeService" # Match Lambda console name
- Execute deployment:
chmod +x build_and_deploy.sh
./build_and_deploy.sh
# Build with custom parameters
docker build \
--build-arg AWS_REGION=$AWS_REGION \
--build-arg LAMBDA_FUNCTION=$LAMBDA_FUNCTION \
-t scrape-service .
Add these secrets to your CI/CD platform:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_DEFAULT_REGION
Example GitHub Actions workflow:
- name: Deploy to Lambda
  env:
    AWS_PROFILE: ${{ secrets.AWS_PROFILE }}
    AWS_REGION: ${{ secrets.AWS_REGION }}
  run: |
    chmod +x build_and_deploy.sh
    ./build_and_deploy.sh
# Make script executable
chmod +x build_and_deploy.sh
# Run deployment
./build_and_deploy.sh
# Create new file build_and_deploy.ps1
$SERVICE_NAME = "scrape-service"
$LAMBDA_ZIP = "scrape-lambda.zip"
# Configurable variables with env fallback
$AWS_PROFILE = if ($env:AWS_PROFILE) { $env:AWS_PROFILE } else { "default" }
$AWS_REGION = if ($env:AWS_REGION) { $env:AWS_REGION } else { "us-east-1" }
$LAMBDA_FUNCTION = if ($env:LAMBDA_FUNCTION) { $env:LAMBDA_FUNCTION } else { "ScrapeService" }
$LAMBDA_ROLE = if ($env:LAMBDA_ROLE) { $env:LAMBDA_ROLE } else { "lambda-scrape-service-role" }
# Install dependencies
pip install -r requirements.txt -t ./package
# Package
if (!(Test-Path -Path "package")) {
New-Item -ItemType Directory -Path "package"
}
Push-Location package
Compress-Archive -Path * -DestinationPath "../$LAMBDA_ZIP" -Force
Pop-Location
Compress-Archive -Path src/scrape_lambda.py -Update -DestinationPath $LAMBDA_ZIP
# Check if Lambda function exists.
# Native commands such as aws do not throw on failure, so test $LASTEXITCODE instead of try/catch.
aws lambda get-function --function-name $LAMBDA_FUNCTION --region $AWS_REGION --profile $AWS_PROFILE 2>$null | Out-Null
$functionExists = ($LASTEXITCODE -eq 0)
if (-not $functionExists) {
Write-Host "Lambda function does not exist. Creating new function..."
}
if ($functionExists) {
# Update existing function
Write-Host "Updating existing Lambda function..."
aws lambda update-function-code `
--function-name $LAMBDA_FUNCTION `
--zip-file fileb://$LAMBDA_ZIP `
--region $AWS_REGION `
--profile $AWS_PROFILE
} else {
# Create IAM role if it doesn't exist
# Look up the role ARN; a non-zero exit code means the role does not exist yet
$roleArn = (aws iam get-role --role-name $LAMBDA_ROLE --query 'Role.Arn' --output text --profile $AWS_PROFILE 2>$null)
if ($LASTEXITCODE -ne 0) {
Write-Host "Creating IAM role..."
$trustPolicy = @{
Version = "2012-10-17"
Statement = @(
@{
Effect = "Allow"
Principal = @{
Service = "lambda.amazonaws.com"
}
Action = "sts:AssumeRole"
}
)
} | ConvertTo-Json -Depth 10
$roleArn = (aws iam create-role `
--role-name $LAMBDA_ROLE `
--assume-role-policy-document $trustPolicy `
--profile $AWS_PROFILE `
--query 'Role.Arn' `
--output text)
# Attach basic Lambda execution policy
aws iam attach-role-policy `
--role-name $LAMBDA_ROLE `
--policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole `
--profile $AWS_PROFILE
# Wait for role to propagate
Start-Sleep -Seconds 10
}
# Create new Lambda function
Write-Host "Creating new Lambda function..."
aws lambda create-function `
--function-name $LAMBDA_FUNCTION `
--runtime python3.12 `
--handler scrape_lambda.lambda_handler `
--role $roleArn `
--zip-file fileb://$LAMBDA_ZIP `
--region $AWS_REGION `
--profile $AWS_PROFILE `
--timeout 30 `
--memory-size 512
}
Write-Host "Deployment completed successfully!"
# Optional: For container deployment
<#
docker build -t $SERVICE_NAME .
aws ecr create-repository --repository-name $SERVICE_NAME --image-scanning-configuration scanOnPush=true --image-tag-mutability MUTABLE
docker tag ${SERVICE_NAME}:latest <account-id>.dkr.ecr.${AWS_REGION}.amazonaws.com/${SERVICE_NAME}:latest
docker push <account-id>.dkr.ecr.${AWS_REGION}.amazonaws.com/${SERVICE_NAME}:latest
#>
To run the Windows deployment script:
# Allow script execution (if not already enabled)
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
# Run deployment script
.\build_and_deploy.ps1
Note: Make sure to run PowerShell as Administrator if you encounter permission issues.
| Variable | Required | Default | Description |
|---|---|---|---|
| AWS_PROFILE | No | default | AWS credentials profile |
| AWS_REGION | No | us-east-1 | AWS service region |
| LAMBDA_FUNCTION | No | ScrapeService | Target Lambda function name |
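The same defaults can be resolved from Python tooling. This is a minimal sketch, assuming the variable names and defaults from the table above; `DeployConfig` and `load_config` are hypothetical names introduced for illustration, not part of the deployment scripts.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Defaults copied from the configuration table above.
DEFAULTS = {
    "AWS_PROFILE": "default",
    "AWS_REGION": "us-east-1",
    "LAMBDA_FUNCTION": "ScrapeService",
}


@dataclass(frozen=True)
class DeployConfig:
    aws_profile: str
    aws_region: str
    lambda_function: str


def load_config(env: Optional[Mapping[str, str]] = None) -> DeployConfig:
    """Read deployment settings, using the table defaults for unset or empty values."""
    source = os.environ if env is None else env

    def pick(name: str) -> str:
        return source.get(name) or DEFAULTS[name]

    return DeployConfig(
        aws_profile=pick("AWS_PROFILE"),
        aws_region=pick("AWS_REGION"),
        lambda_function=pick("LAMBDA_FUNCTION"),
    )


if __name__ == "__main__":
    print(load_config())
```

Treating an empty string like an unset variable mirrors how the PowerShell script falls back when `$env:AWS_PROFILE` is empty.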
Temporary Configuration:
# Single deployment with custom config
AWS_PROFILE="staging" AWS_REGION="eu-west-1" ./build_and_deploy.sh
Persistent Configuration:
# Add to shell profile (~/.bashrc or ~/.zshrc)
export AWS_PROFILE="production"
export AWS_REGION="sa-east-1"
Ensure your microservice works as expected:
Run the unit tests:
pytest test/ -v
Test the integration with a sample URL:
# Test with sample URL
python -m pytest test/integration/test_scrape_integration.py
Access the Scrape API through the following endpoint:
Endpoint:
POST https://{api-id}.execute-api.{region}.amazonaws.com/scrape
Request:
curl -X POST https://api.example.com/scrape \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
-d '{"url": "https://example.com"}'
Response:
{
"markdown": "# Example Domain...",
"images": [
"https://example.com/image1.jpg",
"https://example.com/image2.jpg"
],
"metadata": {
"processing_time": "1.23s",
"content_size": "45KB"
}
}
Monitor your AWS Lambda logs using CloudWatch:
# Create CloudWatch log group
aws logs create-log-group --log-group-name /aws/lambda/ScrapeService
# View real-time logs
aws logs tail /aws/lambda/ScrapeService --follow
Key security practices include:
- HTTPS Enforcement: Enabled at API Gateway 🛡️
- Rate Limiting: 100 requests/second ⏱️
- Authentication: API Key required 🔑
- IAM Policies: Least privilege access 🔐
- Secret Rotation: Quarterly key rotation 🔄
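The least-privilege item above can be checked mechanically before a policy is attached. This is a minimal sketch, assuming a policy document shaped like the IAM JSON earlier in this guide; `find_wildcard_actions` is a hypothetical helper, not an AWS API.

```python
import json


def find_wildcard_actions(policy: dict) -> list[str]:
    """Return every allowed action that contains a wildcard, in document order."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be given as an object
        statements = [statements]

    flagged: list[str] = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # a single action may be given as a string
            actions = [actions]
        flagged.extend(action for action in actions if "*" in action)
    return flagged


if __name__ == "__main__":
    # Illustrative policy fragment, not the full deployment policy.
    sample = json.loads("""
    {
      "Version": "2012-10-17",
      "Statement": [
        {"Effect": "Allow",
         "Action": ["lambda:GetFunction", "apigateway:*", "cloudwatch:*"],
         "Resource": "*"}
      ]
    }
    """)
    print(find_wildcard_actions(sample))
```

A non-empty result is a prompt to review whether each wildcard can be narrowed to the specific actions the deployment needs.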
Optimize your AWS deployment by:
- Enabling Lambda Provisioned Concurrency for steady traffic ⚙️
- Utilizing CloudFront caching for frequent requests 📦
- Setting appropriate memory size (512MB recommended) ⚖️
- Activating API Gateway caching 🚀
Common issues and remedies:
- Timeout Errors ⏰
  Increase Lambda timeout (max 15 minutes)
- Missing Dependencies ⚠️
  Run pip install -r requirements.txt
- Permission Denied 🚫
  Verify IAM roles include:
  - AWSLambdaBasicExecutionRole
  - AmazonAPIGatewayInvokeFullAccess
- Invalid URL Format 🌐
  Ensure URLs include the protocol (http:// or https://)
- Virtual Environment Issues 🔧
  - Windows: If unable to activate venv, run Set-ExecutionPolicy RemoteSigned -Scope CurrentUser in PowerShell
  - Linux: If permission denied, run chmod +x venv/bin/activate
- AWS CLI Configuration Issues ⚙️
  - Verify credentials file location:
    - Windows: %UserProfile%\.aws\credentials
    - Linux/MacOS: ~/.aws/credentials
  - Check AWS CLI installation: aws --version
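For the Invalid URL Format item, a client can check the URL before calling the service. This is a minimal sketch using only the standard library; `has_valid_scheme` is a hypothetical helper, and the service's own validation may differ.

```python
from urllib.parse import urlparse


def has_valid_scheme(url: str) -> bool:
    """Return True when the URL starts with http:// or https:// and names a host."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


if __name__ == "__main__":
    for candidate in ("https://example.com", "example.com", "ftp://example.com"):
        print(candidate, has_valid_scheme(candidate))
```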
Automate your deployments with GitHub Actions. Example workflow (.github/workflows/deploy.yml):
name: Deploy
on:
  push:
    branches: [ main ]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt
      - run: chmod +x build_and_deploy.sh
      - run: ./build_and_deploy.sh
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_REGION: 'us-east-1'
Keep your service up-to-date and optimized:
Dependency Updates:
# List outdated packages
pip list --outdated
# After upgrading the packages you need, re-pin requirements.txt
pip freeze > requirements.txt
Scheduled Cleanup:
# Remove old Lambda versions
aws lambda list-versions-by-function --function-name ScrapeService
aws lambda delete-function --function-name ScrapeService --qualifier <version>
graph TD
A[API Gateway] --> B[AWS Lambda]
B --> C{Scraping Process}
C --> D[Fetch HTML]
C --> E[Parse Content]
C --> F[Convert to Markdown]
D --> G[Return Structured Data]
# View logs
aws logs tail /aws/lambda/$LAMBDA_FUNCTION \
--region $AWS_REGION \
--profile $AWS_PROFILE
- ✅ HTTP requests with redirect handling
- ✅ HTML parsing with BeautifulSoup
- ✅ Markdown conversion with link preservation
- ✅ Image extraction with absolute URLs
- ✅ Custom HTTP methods support
- ✅ Custom headers support
- ✅ Multiple response formats
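As an illustration of the image-extraction feature listed above, relative image paths can be resolved against the page URL. The service is described as using BeautifulSoup; this sketch swaps in the standard library's `html.parser` to stay dependency-free, and `ImageCollector` and `extract_images` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect <img src> values as absolute URLs, skipping duplicates."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.images: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if not src:
            return  # <img> without a usable src attribute
        absolute = urljoin(self.base_url, src)
        if absolute not in self.images:
            self.images.append(absolute)


def extract_images(html: str, base_url: str) -> list[str]:
    """Return absolute image URLs found in the HTML, in document order."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    return collector.images


if __name__ == "__main__":
    page = '<p><img src="/logo.png"><img src="img/photo.jpg"></p>'
    print(extract_images(page, "https://example.com/blog/post"))
```

Passing the final URL after redirects as `base_url` keeps relative paths correct when the requested page redirects elsewhere.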
POST https://<function-url>/scrape
{
"url": "https://example.com",
"format": "json|html|text|proxy", // Optional, defaults to "html"
"method": "GET|POST|PUT|...", // Optional, defaults to "GET"
"maxsize": 2000, // Optional, max size of summary text (default: 2000)
"max_level": 0, // Optional, recursion depth for links (default: 0)
"max_recursion_links": 10, // Optional, max number of recursive links to process
"link_exp_filter": "\\.(pdf|docx)$", // Optional, regex to filter links
"images": true, // Optional, include images in response (default: true)
"headers": [ // Optional, custom request headers
{"header-name": "value"},
{"another-header": "value"}
]
}
| Parameter | Type | Description |
|---|---|---|
| url | string | Required. URL to scrape |
| format | string | Response format. Options: json, html, text, proxy. Default: html |
| method | string | HTTP method for the request. Default: GET |
| maxsize | number | Maximum size of summary text. Default: 2000 |
| max_level | number | Recursion depth for link processing. Default: 0 (no recursion) |
| max_recursion_links | number | Maximum number of links to process recursively |
| link_exp_filter | string | Regular expression to filter which links to process |
| images | boolean | Whether to include images in response. Default: true |
| headers | array | Array of header objects to be sent with the request |
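The documented defaults can be applied to an incoming request body before processing. This is a minimal sketch, assuming only the defaults stated in the parameter table; `normalize_request` is a hypothetical helper, and the actual handler in scrape_lambda.py may work differently.

```python
from typing import Any

# Defaults taken from the parameter table; parameters without a documented
# default (such as max_recursion_links) are left untouched.
DEFAULTS: dict[str, Any] = {
    "format": "html",
    "method": "GET",
    "maxsize": 2000,
    "max_level": 0,
    "images": True,
}


def normalize_request(body: dict[str, Any]) -> dict[str, Any]:
    """Apply documented defaults and flatten the headers array into one dict."""
    if not body.get("url"):
        raise ValueError("'url' is required")

    request = {**DEFAULTS, **body}
    request["method"] = str(request["method"]).upper()

    # Each entry in "headers" is a single-key object; merge them in order,
    # so a later entry overrides an earlier one with the same name.
    merged: dict[str, str] = {}
    for entry in body.get("headers", []):
        merged.update(entry)
    request["headers"] = merged
    return request


if __name__ == "__main__":
    print(normalize_request({"url": "https://example.com", "method": "post"}))
```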
The headers parameter accepts an array of objects, where each object represents a header:
{
"headers": [
{"x-api-key": "abc123"},
{"content-type": "application/json"},
{"custom-header": "value"}
]
}
{
"title": "Page Title",
"markdown": "# Markdown content...",
"html": "<p>HTML content...</p>",
"images": ["https://..."],
"links": ["https://..."],
"final_url": "https://..."
}
{
"url": "https://example.com"
}
{
"url": "https://example.com",
"max_level": 2,
"max_recursion_links": 10,
"link_exp_filter": "\\.(pdf|docx)$",
"format": "markdown"
}
{
"url": "https://api.example.com/data",
"method": "POST",
"headers": [
{"x-api-key": "abc123"},
{"content-type": "application/json"}
],
"format": "json"
}
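The link_exp_filter and max_recursion_links parameters used in the recursive example can be illustrated with a short sketch. It assumes that the filter keeps links matching the regular expression and that the limit caps how many links are followed; both are assumptions about the service's behavior, and `select_links` is a hypothetical helper.

```python
import re
from typing import Iterable, Optional


def select_links(
    links: Iterable[str],
    link_exp_filter: Optional[str] = None,
    max_recursion_links: int = 10,
) -> list[str]:
    """Keep unique links that match the filter, up to the given limit."""
    if max_recursion_links <= 0:
        return []
    pattern = re.compile(link_exp_filter) if link_exp_filter else None

    selected: list[str] = []
    for link in links:
        if pattern is not None and not pattern.search(link):
            continue  # assumption: non-matching links are skipped
        if link in selected:
            continue
        selected.append(link)
        if len(selected) >= max_recursion_links:
            break
    return selected


if __name__ == "__main__":
    found = [
        "https://example.com/report.pdf",
        "https://example.com/about",
        "https://example.com/notes.docx",
    ]
    # In JSON the pattern is written "\\.(pdf|docx)$"; decoded, it is \.(pdf|docx)$
    print(select_links(found, r"\.(pdf|docx)$", max_recursion_links=10))
```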