# Distributed input processing with tf.data service

This directory provides an example of running the tf.data service to
horizontally scale tf.data input processing. We use GKE
(Google Kubernetes Engine) to manage the tf.data servers.

This directory contains the following files:

- `Dockerfile.tf_std_data_server`: A dockerfile to build a tf.data server image.
- `data_service.yaml.jinja`: A Jinja-templated Kubernetes definition for running
  tf.data service servers.
- `data_service_interfaces.yaml.jinja`: A Jinja-templated Kubernetes definition
  for creating load balancers which expose the tf.data service endpoints
  outside the GKE cluster (but within the same VPC network). This is needed
  for TPUs to be able to connect to servers running in GKE.
- `tf_std_data_server.py`: A basic tf.data server implementation.

## Run the tf.data service in GKE

### Start a GKE cluster

If you don't already have a [GKE](https://cloud.google.com/kubernetes-engine)
cluster, create one:

Replace `${CLUSTER_NAME}` with a name of your choice.
Replace `${NUM_NODES}` with the number of tf.data service machines to run, e.g.
`8`.
Replace `${MACHINE_TYPE}` with the machine type to use, e.g. `e2-standard-4`.

```
gcloud container clusters create ${CLUSTER_NAME} --zone europe-west4-a \
  --scopes=cloud-platform --enable-ip-alias --num-nodes=${NUM_NODES} \
  --machine-type=${MACHINE_TYPE}
```

`--enable-ip-alias` is needed to be able to connect to the cluster from a TPU.

### Create service endpoints

Set the number of workers by editing the variable at the start of
`data_service_interfaces.yaml.jinja`:

```
{%- set workers = 8 -%}
```

Create data service endpoints so that the data service can be accessed from
outside GKE. Rendering the template requires `jinja2`; install it if you don't
have it already: `pip3 install jinja2`.

```
python3 ../render_template.py data_service_interfaces.yaml.jinja | kubectl apply -f -
```

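`render_template.py` lives one directory up and is not shown here; as a rough
sketch (an assumption, not the actual script), a minimal renderer only needs to
read the template, render it with Jinja2, and print the result:

```python
import sys

import jinja2


def render(path):
    """Read a Jinja template from `path` and return the rendered text."""
    with open(path) as f:
        template = jinja2.Template(f.read())
    return template.render()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print the rendered YAML so it can be piped into `kubectl apply -f -`.
    print(render(sys.argv[1]))
```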
### Create tf.data server image

```
docker build --no-cache -t gcr.io/${PROJECT_ID}/tf_std_data_server:latest \
  -f Dockerfile.tf_std_data_server .
docker push gcr.io/${PROJECT_ID}/tf_std_data_server:latest
```

### Start tf.data servers

Edit `data_service.yaml.jinja`, setting the image variable at the top of the
file to the image created in the previous step, e.g.
`"gcr.io/${PROJECT_ID}/tf_std_data_server:latest"`.

Wait for GKE to assign endpoints for all services created in the "Create service
endpoints" step. This may take a few minutes. The following command queries all
worker endpoints:

```
kubectl get services -o=jsonpath='{"\n"}{range .items[*]}"{.metadata.name}": "{.status.loadBalancer.ingress[*].ip}",{"\n"}{end}{"\n"}' | grep data-service-worker
```

Once the command shows non-empty addresses for all workers, copy the output of
the command into the `ip_mapping` variable at the start of
`data_service.yaml.jinja`:

```
{% set ip_mapping = {
"data-service-worker-0": "10.164.0.40",
"data-service-worker-1": "10.164.0.41",
...
} %}
```

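If editing the mapping by hand is error-prone, the same information can be
pulled from `kubectl get services -o json`. The helper below is a hypothetical
convenience (not part of this directory) that formats the worker entries as
`ip_mapping` lines:

```python
def ip_mapping_lines(services):
    """Format parsed `kubectl get services -o json` output as ip_mapping entries.

    `services` is the parsed JSON dict. Only data-service-worker services
    with an assigned load-balancer IP are included.
    """
    lines = []
    for item in services.get("items", []):
        name = item["metadata"]["name"]
        ingress = item["status"].get("loadBalancer", {}).get("ingress", [])
        if name.startswith("data-service-worker") and ingress:
            lines.append('"{}": "{}",'.format(name, ingress[0]["ip"]))
    return lines
```

A wrapper script could read the JSON from stdin (`json.load(sys.stdin)`) so
that `kubectl get services -o json` can be piped straight into it.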
Now launch the tf.data servers:

```
python3 ../render_template.py data_service.yaml.jinja | kubectl apply -f -
```

The service is now ready to use. To find the service address, run

```
kubectl get services data-service-master
```

and examine the `EXTERNAL-IP` and `PORT(S)` columns. To access the cluster,
you will use the string `'grpc://<EXTERNAL-IP>:<PORT>'`.

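With the address in hand, a job can route its input pipeline through the
service via the tf.data Python API. A minimal sketch, assuming TensorFlow 2.x;
the `range` pipeline here is only a stand-in for a real preprocessing pipeline:

```python
import tensorflow as tf


def make_distributed_dataset(service_address):
    """Build a dataset whose preprocessing runs on the tf.data service."""
    # Stand-in pipeline; replace with your real input processing.
    dataset = tf.data.Dataset.range(100).map(lambda x: x * 2)
    # "parallel_epochs" gives every consumer a full copy of the data.
    return dataset.apply(
        tf.data.experimental.service.distribute(
            processing_mode="parallel_epochs",
            service=service_address,
        )
    )
```

Calling `make_distributed_dataset("grpc://<EXTERNAL-IP>:<PORT>")` and iterating
the result pulls elements from the workers running in GKE.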
## Run ResNet using the tf.data service for input

The `classifier_trainer.py` script in the [TensorFlow Model
Garden](https://github.com/tensorflow/models) supports using the tf.data
service to get input data.

To run the script, do the following:

```
git clone https://github.com/tensorflow/models.git
cd models/official/vision/image_classification
```

Edit either `configs/examples/resnet/imagenet/gpu.yaml` or
`configs/examples/resnet/imagenet/tpu.yaml`, depending on whether you want to
run on GPU or TPU. Under the `train_dataset` and `validation_dataset` sections,
update `builder` from `'tfds'` to `'records'`. Then under the `train_dataset`
section, add `tf_data_service: 'grpc://<EXTERNAL_IP>:<PORT>'`.

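After those edits, the relevant part of the `train_dataset` section might look
like the fragment below (surrounding fields omitted; only the two edited keys
are taken from the instructions above):

```yaml
train_dataset:
  builder: 'records'   # changed from 'tfds'
  tf_data_service: 'grpc://<EXTERNAL_IP>:<PORT>'
```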
Finally, run the ResNet model.

```
export PYTHONPATH=/path/to/models
python3 classifier_trainer.py \
  --mode=train_and_eval --model_type=resnet --dataset=imagenet --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR --data_dir=gs://cloud-tpu-test-datasets/fake_imagenet \
  --config_file=path/to/config
```

## Restarting tf.data servers

tf.data servers are meant to live for the duration of a single training job.
When starting a new job, you can use the following command to stop the tf.data
servers:

```
kubectl get rs --no-headers=true | grep "data-service-" | xargs kubectl delete rs
```

Then to start the servers again, run

```
python3 ../render_template.py data_service.yaml.jinja | kubectl apply -f -
```