- Elastic MapReduce (EMR) is a managed cluster platform for running big data frameworks
- Over 21 different frameworks are available!
- They come “out of the box”, pre-installed
- But really, **Spark** is the one
It’s on-demand
- Companies can save money by spinning clusters up/down without having to set up their own Hadoop clusters
- Further cost savings come from EMR managed auto-scaling
Effective for OLAP (online analytical processing) and batch processing jobs
- At a minimum, you’ll want terabytes’ worth of data to process
- Suitable for “embarrassingly parallel” tasks
By default, EMR uses YARN (Yet Another Resource Negotiator) for cluster resource management
- Consists of a Resource Manager and Node Managers that coordinate job scheduling
YARN was introduced in Apache Hadoop 2.0 to help with cluster management
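As a quick way to see YARN at work, you can query the ResourceManager’s REST API from the primary node. A minimal sketch, assuming the default port 8088 and the `requests` library; `<primary-node-dns>` is a placeholder:

```python
import requests

# YARN ResourceManager REST API (default port 8088 on the EMR primary node)
RM_URL = "http://<primary-node-dns>:8088/ws/v1/cluster/metrics"  # placeholder host

metrics = requests.get(RM_URL, timeout=10).json()["clusterMetrics"]
print("Active nodes:", metrics["activeNodes"])
print("Running apps:", metrics["appsRunning"])
print("Available memory (MB):", metrics["availableMB"])
```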
Many AWS services integrate with EMR. The biggest ones to know are:

Amazon S3
- Object storage for your input and output data (the project below loads its datasets from S3)

Amazon EC2
- Provision virtual machines for our worker nodes
- Choose from various instance types (e.g., m5 for general-purpose processing, c5 for compute-heavy tasks such as ML, x1 for memory-intensive operations).
- Instance fleets allow you to mix and match On-Demand, Spot, and Reserved Instances (see the sketch below)
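Here is a hedged sketch of what an instance-fleet definition can look like with boto3; the fleet name, capacities, and instance types are placeholder assumptions:

```python
import boto3

emr = boto3.client("emr")

# Sketch: a core fleet mixing On-Demand and Spot capacity across two
# instance types; EMR fills the targets with whichever types are available.
core_fleet = {
    "Name": "core-fleet",                 # placeholder name
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": 2,          # baseline capacity for critical work
    "TargetSpotCapacity": 4,              # cheaper, interruptible capacity
    "InstanceTypeConfigs": [
        {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
        {"InstanceType": "c5.xlarge", "WeightedCapacity": 1},
    ],
}

# This dict would be passed to run_job_flow(Instances={"InstanceFleets": [...]})
# alongside a MASTER fleet when creating the cluster.
```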
Amazon VPC
- Secure your networking environment
- Limit your IP addresses, subnets, and traffic
- Don’t put all your services in a single, default VPC
In short, VPC secures your networking environment with subnets and traffic controls. Other notable services:
AWS Step Functions for workflow orchestration.
AWS Glue for serverless ETL operations. Initially, I worked with AWS Glue for my project but later switched between various tools before settling on my current choice.
An EMR cluster comprises three node types:
- Primary Node:
  - Coordinates data and task distribution.
  - Tracks task statuses and monitors cluster health.
- Core Node:
  - Runs tasks and stores data in the ephemeral Hadoop Distributed File System (HDFS).
- Task Node (optional):
  - Executes tasks but does not store data.
EMR clusters are Elastic
- Horizontal scaling using 1) manual or 2) auto-scaling
- EMR managed scaling allows you to set minimum/maximum cluster size
- 💡 Pro tip: set a fixed number of On-Demand instances for critical tasks, then provision the rest as Spot Instances (a boto3 sketch follows the figure below)
*(figure: example of the EMR managed scaling interface)*
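Attaching a managed scaling policy programmatically is a one-call operation. A minimal sketch with boto3, where the cluster ID and the capacity limits are placeholder assumptions; `MaximumOnDemandCapacityUnits` is what caps the On-Demand portion so the remainder can be filled by Spot:

```python
import boto3

emr = boto3.client("emr")

# Sketch: keep the cluster between 2 and 10 instances, with at most
# 3 On-Demand instances; the rest can be provisioned as Spot.
emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,
            "MaximumCapacityUnits": 10,
            "MaximumOnDemandCapacityUnits": 3,
        }
    },
)
```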
When you run an EMR job, you do so by triggering Steps. There are 3 ways to trigger EMR steps:
- Management Console
- SSH into the primary node and manually run a task
- AWS Command Line Interface (recommended way; a programmatic sketch follows below)
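Since the CLI route maps one-to-one onto the SDK, here is a hedged boto3 sketch of submitting a `spark-submit` step; the cluster ID and script path are placeholders:

```python
import boto3

emr = boto3.client("emr")

# Sketch: submit a Spark application as a step via command-runner.jar
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[
        {
            "Name": "movie-recommendation-job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://your-bucket/app.py"],  # placeholder path
            },
        }
    ],
)
print("Step IDs:", response["StepIds"])
```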
EMR clusters run on 3 different compute platforms:
- Amazon EC2: Default option. High performance and expensive
- Amazon EKS: Run lightweight applications
- EMR Serverless: Small, low-code applications (see the sketch below)
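For the serverless option, jobs run without managing a cluster at all. A hedged sketch using the `emr-serverless` boto3 client; the application ID, IAM role, and script path are all placeholder assumptions:

```python
import boto3

serverless = boto3.client("emr-serverless")

# Sketch: run a PySpark script on an existing EMR Serverless application
response = serverless.start_job_run(
    applicationId="00f0000000000000",  # placeholder application ID
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessRole",  # placeholder
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://your-bucket/app.py",  # placeholder script path
        }
    },
)
print("Job run ID:", response["jobRunId"])
```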
This project demonstrates a movie recommendation system implemented on Amazon EMR using PySpark.
- Flask: Provides a lightweight web framework to create API endpoints for recommendations and feedback.
- PySpark: Leverages Spark’s MLlib for collaborative filtering.
- boto3: Enables interaction with AWS S3 for data loading.
```python
from flask import Flask, request, jsonify
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import col
import boto3

app = Flask(__name__)

spark = SparkSession.builder \
    .appName("MovieRecommendationApp") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()
```
- Loads movie and ratings datasets from an S3 bucket using `spark.read.csv`.
- Replace `BUCKET_NAME`, `MOVIES_FILE`, and `RATINGS_FILE` with your S3 paths.
```python
BUCKET_NAME = 'your-s3-bucket-name'
MOVIES_FILE = 'path/to/movies.csv'
RATINGS_FILE = 'path/to/ratings.csv'

movies = spark.read.csv(f"s3a://{BUCKET_NAME}/{MOVIES_FILE}", header=True, inferSchema=True)
ratings = spark.read.csv(f"s3a://{BUCKET_NAME}/{RATINGS_FILE}", header=True, inferSchema=True)
```
- Uses the ALS (Alternating Least Squares) algorithm from PySpark MLlib to train the recommendation model.
- Key parameters:
  - `maxIter`: Maximum number of iterations.
  - `regParam`: Regularization parameter.
  - `coldStartStrategy`: Handles missing data by dropping rows.
```python
def train_model():
    als = ALS(
        maxIter=10,
        regParam=0.1,
        userCol="userId",
        itemCol="movieId",
        ratingCol="rating",
        coldStartStrategy="drop"
    )
    model = als.fit(ratings)
    return model

model = train_model()
```
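The snippet above fits on the full ratings set. As a hedged aside (not part of the original app), you could hold out a test split and sanity-check the error with MLlib’s `RegressionEvaluator`:

```python
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

# Sketch: hold out 20% of ratings and measure RMSE on the held-out set,
# using the same ALS settings as the app
train, test = ratings.randomSplit([0.8, 0.2], seed=42)
als = ALS(maxIter=10, regParam=0.1, userCol="userId", itemCol="movieId",
          ratingCol="rating", coldStartStrategy="drop")
predictions = als.fit(train).transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
print("RMSE:", evaluator.evaluate(predictions))
```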
- Accepts a user ID and generates a list of recommended movies.
- Steps:
  - Creates a DataFrame for the user.
  - Uses `recommendForUserSubset` to get recommendations.
  - Filters movie metadata from the `movies` dataset for output.
```python
@app.route('/recommend', methods=['GET'])
def recommend():
    user_id = int(request.args.get('user_id'))
    user_df = spark.createDataFrame([(user_id,)], ["userId"])
    # recommendForUserSubset returns one row per user, holding an array of
    # (movieId, rating) structs in its "recommendations" column
    rows = model.recommendForUserSubset(user_df, 10).collect()
    movie_ids = [rec.movieId for rec in rows[0].recommendations] if rows else []
    recommended_movies = movies.filter(col("movieId").isin(movie_ids)).collect()
    return jsonify({
        "user_id": user_id,
        "recommendations": [
            {"movieId": row.movieId, "title": row.title} for row in recommended_movies
        ]
    })
```
- Accepts user feedback (movie ID and rating) via a POST request.
- Updates the `ratings` dataset in memory.
- In production, feedback should be saved to a database.
```python
@app.route('/feedback', methods=['POST'])
def feedback():
    data = request.json
    user_id = data['user_id']
    movie_id = data['movie_id']
    rating = data['rating']
    new_rating = [(user_id, movie_id, rating)]
    new_df = spark.createDataFrame(new_rating, ["userId", "movieId", "rating"])
    global ratings
    ratings = ratings.union(new_df)  # note: the model is not retrained here
    return jsonify({"message": "Feedback received successfully"})
```
- Runs the Flask app on port 5000.
```python
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```
- Cluster Setup:
  - Provision an EMR cluster with Spark installed.
- Data Loading:
  - Ensure datasets are uploaded to S3.
- Code Deployment:
  - Deploy the application to the EMR cluster.
- API Interaction:
  - Use the `/recommend` and `/feedback` endpoints to interact with the model (see the example calls below).
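Once the app is running on the primary node, the endpoints can be exercised with plain HTTP. A sketch using `requests`, where the host and the IDs are placeholders:

```python
import requests

BASE = "http://<primary-node-dns>:5000"  # placeholder host

# GET /recommend: fetch top-10 recommendations for a user
resp = requests.get(f"{BASE}/recommend", params={"user_id": 1})
print(resp.json())

# POST /feedback: submit a new rating for a movie
resp = requests.post(
    f"{BASE}/feedback",
    json={"user_id": 1, "movie_id": 50, "rating": 4.5},
)
print(resp.json())
```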
This project demonstrates how to combine PySpark and AWS services to build scalable, distributed machine learning applications.


