This repository was archived by the owner on Nov 28, 2025. It is now read-only.

Commit 791a42f

aaudiber and jhseu authored
Add example of running the tf.data service in GKE. (#162)
Co-authored-by: Jonathan Hseu <vomjom@vomjom.net>
1 parent b28db03 commit 791a42f

5 files changed

Lines changed: 281 additions & 0 deletions

data_service/Dockerfile.tf_std_data_server

Lines changed: 4 additions & 0 deletions
FROM tensorflow/tensorflow:nightly

COPY tf_std_data_server.py /
ENTRYPOINT ["python", "-u", "/tf_std_data_server.py"]

data_service/README.md

Lines changed: 143 additions & 0 deletions
# Distributed input processing with tf.data service

This directory provides an example of running the tf.data service to
horizontally scale tf.data input processing. We use GKE
(Google Kubernetes Engine) to manage the tf.data servers.

This directory contains the following files:

- `Dockerfile.tf_std_data_server`: A Dockerfile to build a tf.data server image.
- `data_service.yaml.jinja`: A Jinja-templated Kubernetes definition for running
  tf.data service servers.
- `data_service_interfaces.yaml.jinja`: A Jinja-templated Kubernetes definition
  for creating load balancers which expose the tf.data service endpoints
  outside the GKE cluster (but within the same VPC network). This is needed
  for TPUs to be able to connect to servers running in GKE.
- `tf_std_data_server.py`: A basic tf.data server implementation.
## Run the tf.data service in GKE

### Start a GKE cluster

If you don't already have a [GKE](https://cloud.google.com/kubernetes-engine)
cluster, create one:

- Replace `${CLUSTER_NAME}` with a name of your choice.
- Replace `${NUM_NODES}` with the number of tf.data service machines to run, e.g. `8`.
- Replace `${MACHINE_TYPE}` with the machine type to use, e.g. `e2-standard-4`.

```
gcloud container clusters create ${CLUSTER_NAME} --zone europe-west4-a \
  --scopes=cloud-platform --enable-ip-alias --num-nodes=${NUM_NODES} \
  --machine-type=${MACHINE_TYPE}
```

`--enable-ip-alias` is needed to be able to connect to the cluster from a TPU.
### Create service endpoints

Edit the variable at the start of `data_service_interfaces.yaml.jinja` to set
the number of workers: `{%- set workers = 8 -%}`

Create the data service endpoints so that the data service can be accessed from
outside GKE. This requires `jinja2`; install it if you don't have it already:
`pip3 install jinja2`.

```
python3 ../render_template.py data_service_interfaces.yaml.jinja | kubectl apply -f -
```
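`render_template.py` lives one directory up and is not part of this commit. As a rough sketch of what it presumably does (an assumption; the actual script may differ), rendering a Jinja template file to stdout with `jinja2` looks like:

```
import sys

from jinja2 import Template


def render(path):
    """Reads a Jinja template file and returns the rendered text."""
    with open(path) as f:
        return Template(f.read()).render()


if __name__ == "__main__":
    # Usage: python3 render_template.py data_service_interfaces.yaml.jinja
    sys.stdout.write(render(sys.argv[1]))
```

Piping the rendered text into `kubectl apply -f -` then creates the resources, as in the command above.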
### Create tf.data server image

Replace `${PROJECT_ID}` with your Google Cloud project ID.

```
docker build --no-cache -t gcr.io/${PROJECT_ID}/tf_std_data_server:latest \
  -f Dockerfile.tf_std_data_server .
docker push gcr.io/${PROJECT_ID}/tf_std_data_server:latest
```
### Start tf.data servers

Edit `data_service.yaml.jinja`, setting the image variable at the top of the
file to the image created in the previous step, e.g.
`"gcr.io/${PROJECT_ID}/tf_std_data_server:latest"`.

Wait for GKE to assign endpoints for all services created in the "Create service
endpoints" step. This may take a few minutes. The command below queries all
worker endpoints:

```
kubectl get services -o=jsonpath='{"\n"}{range .items[*]}"{.metadata.name}": "{.status.loadBalancer.ingress[*].ip}",{"\n"}{end}{"\n"}' | grep data-service-worker
```
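Copying addresses by hand is easy to get wrong. As an illustrative helper (not part of this commit; the function name is made up), the worker-to-IP pairs can also be extracted from `kubectl get services -o json` output:

```
import json


def build_ip_mapping(kubectl_json):
    """Maps each data-service worker Service name to its assigned internal IP.

    `kubectl_json` is the output of `kubectl get services -o json`. Workers
    whose load balancer has no address yet map to an empty string.
    """
    mapping = {}
    for item in json.loads(kubectl_json)["items"]:
        name = item["metadata"]["name"]
        if not name.startswith("data-service-worker"):
            continue
        ingress = item.get("status", {}).get("loadBalancer", {}).get("ingress", [])
        mapping[name] = ingress[0]["ip"] if ingress else ""
    return mapping
```

Once every value in the mapping is non-empty, it can be formatted and pasted into `data_service.yaml.jinja` as described below.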
Once the command shows non-empty addresses for all workers, copy its output
into the `ip_mapping` variable at the start of `data_service.yaml.jinja`:

```
{% set ip_mapping = {
  "data-service-worker-0": "10.164.0.40",
  "data-service-worker-1": "10.164.0.41",
  ...
} %}
```

Now launch the tf.data servers:

```
python3 ../render_template.py data_service.yaml.jinja | kubectl apply -f -
```

The service is now ready to use. To find the service address, run

```
kubectl get services data-service-master
```

and examine the `EXTERNAL-IP` and `PORT(S)` columns. To access the cluster,
use the string `'grpc://<EXTERNAL-IP>:<PORT>'`.
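On the training side (not shown in this commit), a pipeline typically consumes the service via `tf.data.experimental.service.distribute`. A minimal sketch, assuming the TF 2.x experimental API and a placeholder service address (the exact API surface may differ across TF versions):

```
import tensorflow as tf


def make_distributed_dataset(service_address):
  """Builds an input pipeline whose elements are produced by the tf.data
  service workers instead of the local process.

  service_address: the `'grpc://<EXTERNAL-IP>:<PORT>'` string found above
  (a placeholder here, not a real endpoint).
  """
  dataset = tf.data.Dataset.range(1000)
  dataset = dataset.map(
      lambda x: x * 2, num_parallel_calls=tf.data.experimental.AUTOTUNE)
  # Everything defined above this point is executed by the tf.data servers.
  return dataset.apply(
      tf.data.experimental.service.distribute(
          processing_mode="parallel_epochs", service=service_address))
```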
## Run ResNet using the tf.data service for input

The `classifier_trainer.py` script in the [TensorFlow Model
Garden](https://github.com/tensorflow/models) supports using the tf.data
service to get input data.

To run the script, do the following:

```
git clone https://github.com/tensorflow/models.git
cd models/official/vision/image_classification
```

Edit either `configs/examples/resnet/imagenet/gpu.yaml` or
`configs/examples/resnet/imagenet/tpu.yaml`, depending on whether you want to
run on GPU or TPU. Under the `train_dataset` and `validation_dataset` sections,
update `builder` from `'tfds'` to `'records'`. Then under the `train_dataset`
section, add `tf_data_service: 'grpc://<EXTERNAL-IP>:<PORT>'`.

Finally, run the ResNet model:

```
export PYTHONPATH=/path/to/models
python3 classifier_trainer.py \
  --mode=train_and_eval --model_type=resnet --dataset=imagenet --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR --data_dir=gs://cloud-tpu-test-datasets/fake_imagenet \
  --config_file=path/to/config
```
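After those edits, the relevant part of the config would look along these lines (a sketch: only the `builder` and `tf_data_service` entries come from the steps above; any other fields in the real config files are unchanged and omitted here):

```
train_dataset:
  builder: 'records'
  tf_data_service: 'grpc://<EXTERNAL-IP>:<PORT>'
validation_dataset:
  builder: 'records'
```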
## Restarting tf.data servers

tf.data servers are meant to live for the duration of a single training job.
When starting a new job, you can use the following command to stop the tf.data
servers:

```
kubectl get rs --no-headers=true | grep "data-service-" | xargs kubectl delete rs
```

Then to start the servers again, run

```
python3 ../render_template.py data_service.yaml.jinja | kubectl apply -f -
```
data_service/data_service.yaml.jinja

Lines changed: 50 additions & 0 deletions
{%- set image = "gcr.io/<project_id>/tf_std_data_server:latest" -%}
{%- set port = 5050 -%}
{% set ip_mapping = {
} %}

kind: ReplicaSet
apiVersion: extensions/v1beta1
metadata:
  name: data-service-master
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: data-service-master
    spec:
      containers:
      - name: tensorflow
        image: {{ image }}
        ports:
        - containerPort: {{ port }}
        args:
        - "--port={{ port }}"
        - "--is_master=true"
---

{% for worker_name, worker_ip in ip_mapping.items() %}
kind: ReplicaSet
apiVersion: extensions/v1beta1
metadata:
  name: {{ worker_name }}
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: {{ worker_name }}
    spec:
      containers:
      - name: tensorflow
        image: {{ image }}
        ports:
        - containerPort: {{ port }}
        args:
        - "--port={{ port }}"
        - "--is_master=false"
        - "--master_address=data-service-master:{{ port }}"
        - "--worker_address={{ worker_ip }}:{{ port }}"
---
{% endfor %}
data_service/data_service_interfaces.yaml.jinja

Lines changed: 35 additions & 0 deletions
{%- set workers = 8 -%}
{%- set port = 5050 -%}

kind: Service
apiVersion: v1
metadata:
  name: data-service-master
  annotations:
    cloud.google.com/load-balancer-type: "Internal"
spec:
  type: LoadBalancer
  selector:
    name: data-service-master
  ports:
  - port: {{ port }}
    targetPort: {{ port }}
    protocol: TCP
---
{% for i in range(workers) %}
kind: Service
apiVersion: v1
metadata:
  name: data-service-worker-{{ i }}
  annotations:
    cloud.google.com/load-balancer-type: "Internal"
spec:
  type: LoadBalancer
  selector:
    name: data-service-worker-{{ i }}
  ports:
  - port: {{ port }}
    targetPort: {{ port }}
    protocol: TCP
---
{% endfor %}

data_service/tf_std_data_server.py

Lines changed: 49 additions & 0 deletions
1+
# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
2+
#
3+
# Licensed under the Apache License, Version 2.0 (the "License");
4+
# you may not use this file except in compliance with the License.
5+
# You may obtain a copy of the License at
6+
#
7+
# http://www.apache.org/licenses/LICENSE-2.0
8+
#
9+
# Unless required by applicable law or agreed to in writing, software
10+
# distributed under the License is distributed on an "AS IS" BASIS,
11+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
# See the License for the specific language governing permissions and
13+
# limitations under the License.
14+
# ==============================================================================
15+
"""Run a tf.data service server."""
16+
from __future__ import absolute_import
17+
from __future__ import division
18+
from __future__ import print_function
19+
20+
import tensorflow as tf
21+
22+
flags = tf.compat.v1.app.flags
23+
24+
flags.DEFINE_integer("port", 0, "Port to listen on")
25+
flags.DEFINE_bool("is_master", False, "Whether to start a master (as opposed to a worker server")
26+
flags.DEFINE_string("master_address", "", "The address of the master server. This is only needed when starting a worker server.")
27+
flags.DEFINE_string("worker_address", "", "The address of the worker server. This is only needed when starting a worker server.")
28+
29+
FLAGS = flags.FLAGS
30+
31+
32+
def main(unused_argv):
33+
if FLAGS.is_master:
34+
print("Starting tf.data service master")
35+
server = tf.data.experimental.service.MasterServer(
36+
port=FLAGS.port,
37+
protocol="grpc")
38+
else:
39+
print("Starting tf.data service worker")
40+
server = tf.data.experimental.service.WorkerServer(
41+
port=FLAGS.port,
42+
protocol="grpc",
43+
master_address=FLAGS.master_address,
44+
worker_address=FLAGS.worker_address)
45+
server.join()
46+
47+
48+
if __name__ == "__main__":
49+
tf.compat.v1.app.run()
