I created a new Python 3.11 & Spark 3.5 Dockerfile locally and built it successfully with the Makefile. However, after using the image in some processing jobs, I'm noticing that only algo-1 seems to do any work; the rest of the algos drop below 10% utilization. Before, with the Python 3.9 & Spark 3.5 image, the work was distributed nicely across all algos.
I haven't seen much of a push to upgrade the image beyond Python 3.9; could that be related to the issue I'm seeing with my processing jobs?
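As a quick way to confirm what the instance metrics suggest, here is a minimal sketch I can run inside a job to see which hosts actually execute tasks (it assumes a standard PySpark session and the default SageMaker hostnames algo-1 .. algo-N):
import socket

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Run one small task per partition and record which host executed it.
# On a healthy N-instance cluster this should return algo-1 .. algo-N.
hosts = (
    sc.parallelize(range(1000), 100)
    .map(lambda _: socket.gethostname())
    .distinct()
    .collect()
)
print(sorted(hosts))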
Here is the Docker image I built:
FROM public.ecr.aws/amazonlinux/amazonlinux:2023
ARG REGION
ENV AWS_REGION ${REGION}
RUN rpm -q system-release --qf '%{VERSION}'
RUN dnf clean all \
&& dnf update -y \
&& dnf install -y awscli vim gcc gzip unzip zip tar wget liblapack* libblas* libopenblas* \
&& dnf install -y openssl openssl-devel \
&& dnf install -y kernel kernel-headers kernel-devel \
&& dnf install -y bzip2-devel libffi-devel sqlite-devel xz-devel \
&& dnf install -y ncurses ncurses-compat-libs binutils \
&& dnf install -y nss-softokn-freebl avahi-libs avahi dbus dbus-libs \
&& dnf install -y python-pillow
# Install python 3.11
ARG PYTHON_BASE_VERSION=3.11
ARG PYTHON_WITH_BASE_VERSION=python${PYTHON_BASE_VERSION}
ARG PIP_WITH_BASE_VERSION=pip${PYTHON_BASE_VERSION}
ARG PYTHON_VERSION=${PYTHON_BASE_VERSION}.9
RUN dnf groupinstall -y 'Development Tools' \
&& wget https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tgz \
&& tar xzf Python-${PYTHON_VERSION}.tgz \
&& cd Python-*/ \
&& ./configure --enable-optimizations \
&& make altinstall \
&& echo -e 'alias python3=python3.11\nalias pip3=pip3.11' >> ~/.bashrc \
&& ln -s $(which ${PYTHON_WITH_BASE_VERSION}) /usr/local/bin/python3 \
&& ln -s $(which ${PIP_WITH_BASE_VERSION}) /usr/local/bin/pip3 \
&& cd .. \
&& rm Python-${PYTHON_VERSION}.tgz \
&& rm -rf Python-${PYTHON_VERSION}
# Amazon Linux 2023 uses dnf instead of yum as its package management tool: https://docs.aws.amazon.com/linux/al2023/ug/package-management.html
# Copied from EMR: https://tiny.amazon.com/kycbidpc/codeamazpackAwsCblob51c8src
RUN dnf install -y java-1.8.0-amazon-corretto-devel nginx python3-virtualenv \
&& dnf -y clean all && rm -rf /var/cache/dnf
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
# http://blog.stuart.axelbrooke.com/python-3-on-spark-return-of-the-pythonhashseed
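# Fixing the hash seed keeps string hashes consistent across executor processes;
# Python's per-process hash randomization would otherwise break hash partitioning.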
ENV PYTHONHASHSEED 0
ENV PYTHONIOENCODING UTF-8
ENV PIP_DISABLE_PIP_VERSION_CHECK 1
# Install EMR Spark/Hadoop
ENV HADOOP_HOME /usr/lib/hadoop
ENV HADOOP_CONF_DIR /usr/lib/hadoop/etc/hadoop
ENV SPARK_HOME /usr/lib/spark
COPY yum/emr-apps.repo /etc/yum.repos.d/emr-apps.repo
# Install hadoop / spark dependencies from EMR's yum repository for Spark optimizations.
# Replace the REGION placeholder in the repository URL with the actual region
RUN sed -i "s/REGION/${AWS_REGION}/g" /etc/yum.repos.d/emr-apps.repo
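# Sanity checks: confirm the repo file exists and the REGION placeholder was replaced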
RUN ls /etc/yum.repos.d/emr-apps.repo
RUN cat /etc/yum.repos.d/emr-apps.repo
RUN adduser -N hadoop
# These packages are a subset of what EMR installs in a cluster with the
# "hadoop", "spark", and "hive" applications.
# They include EMR-optimized libraries and extras.
RUN dnf install -y aws-hm-client \
aws-java-sdk \
emr-goodies \
emr-scripts \
emr-s3-select \
emrfs \
hadoop \
hadoop-client \
hadoop-hdfs \
hadoop-hdfs-datanode \
hadoop-hdfs-namenode \
hadoop-httpfs \
hadoop-kms \
hadoop-lzo \
hadoop-mapreduce \
hadoop-yarn \
hadoop-yarn-nodemanager \
hadoop-yarn-proxyserver \
hadoop-yarn-resourcemanager \
hadoop-yarn-timelineserver \
hive \
hive-hcatalog \
hive-hcatalog-server \
hive-jdbc \
hive-server2 \
s3-dist-cp \
spark-core \
spark-datanucleus \
spark-history-server \
spark-python \
&& dnf -y clean all \
&& rm -rf /var/cache/dnf /var/lib/dnf/* /etc/yum.repos.d/emr-*
# Point Spark at the proper Python binary
ENV PYSPARK_PYTHON=/usr/local/bin/python3
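# Note: executors inherit PYSPARK_PYTHON from the container environment; PySpark
# refuses to run tasks if the driver and executor Python versions differ.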
# Set up Spark/YARN/HDFS users as root
ENV PATH="/usr/bin:/opt/program:${PATH}"
ENV YARN_RESOURCEMANAGER_USER="root"
ENV YARN_NODEMANAGER_USER="root"
ENV HDFS_NAMENODE_USER="root"
ENV HDFS_DATANODE_USER="root"
ENV HDFS_SECONDARYNAMENODE_USER="root"
# Set up bootstrapping program and Spark configuration
COPY hadoop-config /opt/hadoop-config
COPY nginx-config /opt/nginx-config
# COPY aws-config /opt/aws-config
COPY Pipfile Pipfile.lock setup.py *.whl /opt/program/
ENV PIPENV_PIPFILE=/opt/program/Pipfile
# Use the --system flag so pipenv installs all packages into the system Python
# rather than into a virtualenv; Docker containers do not need virtualenvs.
# pipenv > 2022.4.8 fails to build smspark, so the version is pinned below.
RUN /usr/local/bin/python3.11 -m pip --version
RUN /usr/local/bin/python3.11 -m pip install --upgrade pip setuptools wheel
RUN /usr/local/bin/python3.11 -m pip install pipenv==2022.4.8 \
&& pipenv install --system \
&& /usr/local/bin/python3.11 -m pip install /opt/program/*.whl
# Setup container bootstrapper
COPY container-bootstrap-config /opt/container-bootstrap-config
RUN chmod +x /opt/container-bootstrap-config/bootstrap.sh \
&& /opt/container-bootstrap-config/bootstrap.sh
# With SPARK_NO_DAEMONIZE set, the Spark history server runs in the foreground;
# if it daemonized, no foreground process would remain and the container would
# terminate immediately.
ENV SPARK_NO_DAEMONIZE TRUE
WORKDIR $SPARK_HOME
# Install the sagemaker feature store spark connector
# https://docs.aws.amazon.com/sagemaker/latest/dg/batch-ingestion-spark-connector-setup.html
# The feature store connector library currently does not support Spark 3.4, so this line is commented out
# RUN /usr/local/bin/python3.11 -m pip install sagemaker-feature-store-pyspark-3.3==1.1.2 --no-binary :all:
ENTRYPOINT ["smspark-submit"]
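Since the Python upgrade is the only major change from the 3.9 image, one thing I want to rule out is a driver/executor Python mismatch. A quick check along the same lines as the sketch above (again assuming a standard PySpark session):
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def worker_version(_):
    # Executed on the executors, so this reports their Python version.
    import sys
    return ".".join(map(str, sys.version_info[:3]))

print("driver:   ", ".".join(map(str, sys.version_info[:3])))
print("executors:", sc.parallelize(range(8), 8).map(worker_version).distinct().collect())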