I created a new Python 3.11 & Spark 3.5 Dockerfile locally and built it successfully with the Makefile. However, after using the image in some processing jobs, I'm noticing that only algo-1 seems to do any work; the rest of the algos drop below 10% utilization. Before, with the Python 3.9 & Spark 3.5 image, the work was distributed nicely across all algos.
I haven't seen much of a push to upgrade the image beyond Python 3.9; could that be related to the issue I'm seeing with my processing jobs?
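As a quick way to confirm what the instance metrics suggest, here is a minimal sketch I can run inside a job to see which hosts actually execute tasks (it assumes a standard PySpark session and the default SageMaker hostnames algo-1 .. algo-N):
import socket

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Run one small task per partition and record which host executed it.
# On a healthy N-instance cluster this should return algo-1 .. algo-N.
hosts = (
    sc.parallelize(range(1000), 100)
    .map(lambda _: socket.gethostname())
    .distinct()
    .collect()
)
print(sorted(hosts))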
Here is the Docker image I built:
FROM public.ecr.aws/amazonlinux/amazonlinux:2023
ARG REGION
ENV AWS_REGION ${REGION}
RUN rpm -q system-release --qf '%{VERSION}'
RUN dnf clean all \
&& dnf update -y \
&& dnf install -y awscli vim gcc gzip unzip zip tar wget liblapack* libblas* libopenblas* \
&& dnf install -y openssl openssl-devel \
&& dnf install -y kernel kernel-headers kernel-devel \
&& dnf install -y bzip2-devel libffi-devel sqlite-devel xz-devel \
&& dnf install -y ncurses ncurses-compat-libs binutils \
&& dnf install -y nss-softokn-freebl avahi-libs avahi dbus dbus-libs \
&& dnf install -y python-pillow
# Install python 3.11
ARG PYTHON_BASE_VERSION=3.11
ARG PYTHON_WITH_BASE_VERSION=python${PYTHON_BASE_VERSION}
ARG PIP_WITH_BASE_VERSION=pip${PYTHON_BASE_VERSION}
ARG PYTHON_VERSION=${PYTHON_BASE_VERSION}.9
RUN dnf groupinstall -y 'Development Tools' \
&& wget https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tgz \
&& tar xzf Python-${PYTHON_VERSION}.tgz \
&& cd Python-*/ \
&& ./configure --enable-optimizations \
&& make altinstall \
&& echo -e 'alias python3=python3.11\nalias pip3=pip3.11' >> ~/.bashrc \
&& ln -s $(which ${PYTHON_WITH_BASE_VERSION}) /usr/local/bin/python3 \
&& ln -s $(which ${PIP_WITH_BASE_VERSION}) /usr/local/bin/pip3 \
&& cd .. \
&& rm Python-${PYTHON_VERSION}.tgz \
&& rm -rf Python-${PYTHON_VERSION}
# Amazon Linux 2023 uses dnf instead of yum as its package management tool: https://docs.aws.amazon.com/linux/al2023/ug/package-management.html
# Copied from EMR: https://tiny.amazon.com/kycbidpc/codeamazpackAwsCblob51c8src
RUN dnf install -y java-1.8.0-amazon-corretto-devel nginx python3-virtualenv \
&& dnf -y clean all && rm -rf /var/cache/dnf
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
# http://blog.stuart.axelbrooke.com/python-3-on-spark-return-of-the-pythonhashseed
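# Fixing the hash seed keeps string hashes consistent across executor processes;
# Python's per-process hash randomization would otherwise break hash partitioning.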
ENV PYTHONHASHSEED 0
ENV PYTHONIOENCODING UTF-8
ENV PIP_DISABLE_PIP_VERSION_CHECK 1
# Install EMR Spark/Hadoop
ENV HADOOP_HOME /usr/lib/hadoop
ENV HADOOP_CONF_DIR /usr/lib/hadoop/etc/hadoop
ENV SPARK_HOME /usr/lib/spark
COPY yum/emr-apps.repo /etc/yum.repos.d/emr-apps.repo
# Install hadoop / spark dependencies from EMR's yum repository for Spark optimizations.
# Replace the REGION placeholder in the repository URL with the actual region
RUN sed -i "s/REGION/${AWS_REGION}/g" /etc/yum.repos.d/emr-apps.repo
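# Sanity checks: confirm the repo file exists and the REGION placeholder was replaced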
RUN ls /etc/yum.repos.d/emr-apps.repo
RUN cat /etc/yum.repos.d/emr-apps.repo
RUN adduser -N hadoop
# These packages are a subset of what EMR installs in a cluster with the
# "hadoop", "spark", and "hive" applications.
# They include EMR-optimized libraries and extras.
RUN dnf install -y aws-hm-client \
aws-java-sdk \
emr-goodies \
emr-scripts \
emr-s3-select \
emrfs \
hadoop \
hadoop-client \
hadoop-hdfs \
hadoop-hdfs-datanode \
hadoop-hdfs-namenode \
hadoop-httpfs \
hadoop-kms \
hadoop-lzo \
hadoop-mapreduce \
hadoop-yarn \
hadoop-yarn-nodemanager \
hadoop-yarn-proxyserver \
hadoop-yarn-resourcemanager \
hadoop-yarn-timelineserver \
hive \
hive-hcatalog \
hive-hcatalog-server \
hive-jdbc \
hive-server2 \
s3-dist-cp \
spark-core \
spark-datanucleus \
spark-history-server \
spark-python \
&& dnf -y clean all \
&& rm -rf /var/cache/dnf /var/lib/dnf/* /etc/yum.repos.d/emr-*
# Point Spark at the proper Python binary
ENV PYSPARK_PYTHON=/usr/local/bin/python3
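# Note: executors inherit PYSPARK_PYTHON from the container environment; PySpark
# refuses to run tasks if the driver and executor Python versions differ.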
# Set up Spark/YARN/HDFS users as root
ENV PATH="/usr/bin:/opt/program:${PATH}"
ENV YARN_RESOURCEMANAGER_USER="root"
ENV YARN_NODEMANAGER_USER="root"
ENV HDFS_NAMENODE_USER="root"
ENV HDFS_DATANODE_USER="root"
ENV HDFS_SECONDARYNAMENODE_USER="root"
# Set up bootstrapping program and Spark configuration
COPY hadoop-config /opt/hadoop-config
COPY nginx-config /opt/nginx-config
# COPY aws-config /opt/aws-config
COPY Pipfile Pipfile.lock setup.py *.whl /opt/program/
ENV PIPENV_PIPFILE=/opt/program/Pipfile
# Use the --system flag so pipenv installs all packages into the system Python
# rather than into a virtualenv; Docker containers do not need virtualenvs.
# pipenv > 2022.4.8 fails to build smspark, so the version is pinned below.
RUN /usr/local/bin/python3.11 -m pip --version
RUN /usr/local/bin/python3.11 -m pip install --upgrade pip setuptools wheel
RUN /usr/local/bin/python3.11 -m pip install pipenv==2022.4.8 \
&& pipenv install --system \
&& /usr/local/bin/python3.11 -m pip install /opt/program/*.whl
# Setup container bootstrapper
COPY container-bootstrap-config /opt/container-bootstrap-config
RUN chmod +x /opt/container-bootstrap-config/bootstrap.sh \
&& /opt/container-bootstrap-config/bootstrap.sh
# With SPARK_NO_DAEMONIZE set, the Spark history server runs in the foreground;
# if it daemonized, no foreground process would remain and the container would
# terminate immediately.
ENV SPARK_NO_DAEMONIZE TRUE
WORKDIR $SPARK_HOME
# Install the sagemaker feature store spark connector
# https://docs.aws.amazon.com/sagemaker/latest/dg/batch-ingestion-spark-connector-setup.html
# The feature store connector library currently does not support Spark 3.4, so this line is commented out
# RUN /usr/local/bin/python3.11 -m pip install sagemaker-feature-store-pyspark-3.3==1.1.2 --no-binary :all:
ENTRYPOINT ["smspark-submit"]
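Since the Python upgrade is the only major change from the 3.9 image, one thing I want to rule out is a driver/executor Python mismatch. A quick check along the same lines as the sketch above (again assuming a standard PySpark session):
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def worker_version(_):
    # Executed on the executors, so this reports their Python version.
    import sys
    return ".".join(map(str, sys.version_info[:3]))

print("driver:   ", ".".join(map(str, sys.version_info[:3])))
print("executors:", sc.parallelize(range(8), 8).map(worker_version).distinct().collect())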