amazon-eks-machine-learning-with-terraform-and-kubeflow

Distributed TensorFlow training using Kubeflow on Amazon EKS

Prerequisites

  1. Create and activate an AWS Account

  2. Subscribe to the EKS-optimized AMI with GPU Support from the AWS Marketplace.

  3. Manage your service limits so you can launch at least 4 EKS-optimized GPU enabled Amazon EC2 P3 instances.

  4. Create an AWS Service role for an EC2 instance and add AWS managed policy for power user access to this IAM Role.

  5. We need a build environment with the AWS CLI and Docker installed. Launch an m5.xlarge Amazon EC2 instance from an AWS Deep Learning AMI (Ubuntu) using an EC2 instance profile containing the Role created in Step 4. All steps described under the Step by step section below must be executed on this build environment instance.
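
    For illustration only, such a build instance can be launched with the AWS CLI as sketched below; the AMI ID, key pair, instance profile and subnet are placeholders that you must replace with your own values (use the AWS Deep Learning AMI (Ubuntu) ID for your region):

    # Placeholders: replace the AMI ID, key pair, instance profile and subnet with your own values
    aws ec2 run-instances \
      --image-id ami-xxxxxxxxxxxxxxxxx \
      --instance-type m5.xlarge \
      --key-name <your-key-pair> \
      --iam-instance-profile Name=<your-instance-profile> \
      --subnet-id subnet-xxxxxxxx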

Step by step

While all the concepts described here are quite general, we will make them concrete by focusing on distributed TensorFlow training for the TensorPack Mask/Faster-RCNN model.

The high-level outline of steps is as follows:

  1. Create GPU enabled Amazon EKS cluster

  2. Create Persistent Volume and Persistent Volume Claim for Amazon EFS or Amazon FSx file system

  3. Stage COCO 2017 data for training on Amazon EFS or FSx file system

  4. Use Helm charts to manage training jobs in EKS cluster

Create GPU Enabled Amazon EKS Cluster

Quick start option

This option creates an Amazon EKS cluster with one worker node group. This is the recommended option for walking through this tutorial.

  1. In eks-cluster directory, execute: ./install-kubectl-linux.sh to install kubectl on Linux clients.

    For non-Linux operating systems, install and configure kubectl for EKS, install aws-iam-authenticator, and make sure the command aws-iam-authenticator help works.

  2. Install Terraform.

  3. In eks-cluster/terraform/aws-eks-cluster-and-nodegroup folder, execute:

    terraform init

    The next command requires an Amazon EC2 key pair. If you have not already created an EC2 key pair, create one before executing the command below:

    terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]' -var="k8s_version=1.14" -var="key_pair=xxx"
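
    Once the apply completes, you can point kubectl at the new cluster and confirm the GPU worker nodes register. A minimal sketch, assuming the cluster name and region used in the command above:

    aws eks update-kubeconfig --name my-eks-cluster --region us-west-2 --profile default
    kubectl get nodes    # the P3 worker nodes should eventually report Ready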

Advanced option

This option separates the creation of the EKS cluster from the worker node group. You can create the EKS cluster and later add one or more worker node groups to the cluster.

  1. Install Terraform.

  2. In eks-cluster/terraform/aws-eks-cluster folder, execute:

    terraform init

    terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]' -var="k8s_version=1.14"

    Customize Terraform variables as appropriate. K8s version can be specified using -var="k8s_version=x.xx". Save the output of the apply command for next step below.

  3. In eks-cluster/terraform/aws-eks-nodegroup folder, using the output of previous terraform apply as inputs into this step, execute:

    terraform init

    The next command requires an Amazon EC2 key pair. If you have not already created an EC2 key pair, create one before executing the command below:

    terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var="efs_id=fs-xxx" -var="subnet_id=subnet-xxx" -var="key_pair=xxx" -var="cluster_sg=sg-xxx" -var="nodegroup_name=xxx"

    To create more than one nodegroup in an EKS cluster, copy eks-cluster/terraform/aws-eks-nodegroup folder to a new folder under eks-cluster/terraform/ and specify a unique value for nodegroup_name variable.

  4. In eks-cluster directory, execute: ./install-kubectl-linux.sh to install kubectl on Linux clients. For other operating systems, install and configure kubectl for EKS.

  5. Install aws-iam-authenticator and make sure the command aws-iam-authenticator help works. In eks-cluster directory, customize set-cluster.sh and execute: ./update-kubeconfig.sh to update kube configuration.

    Ensure that you have at least version 1.16.73 of the AWS CLI installed. Your system's Python version must be Python 3, or Python 2.7.9 or greater.

  6. Upgrade Amazon CNI Plugin for Kubernetes, if needed (optional step)

  7. In eks-cluster directory, customize NodeInstanceRole in aws-auth-cm.yaml and execute: ./apply-aws-auth-cm.sh to allow worker nodes to join EKS cluster. Note, if this is not your first EKS node group, you must add the new node instance role Amazon Resource Name (ARN) to aws-auth-cm.yaml, while preserving the existing role ARNs in aws-auth-cm.yaml.

  8. In eks-cluster directory, execute: ./apply-nvidia-plugin.sh to create NVIDIA-plugin daemon set
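
    As a quick sanity check after the node group joins and the NVIDIA plugin is applied (a sketch; daemon set and node names will differ in your cluster):

    kubectl get nodes                               # worker nodes should be Ready
    kubectl get daemonsets -n kube-system           # the NVIDIA device plugin daemon set should be running
    kubectl describe nodes | grep nvidia.com/gpu    # each GPU node should report allocatable GPUs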

Create EKS Persistent Volume

We have two shared file system options for staging data for distributed training:

  1. Amazon EFS

  2. Amazon FSx Lustre

Below, you only need to create a Persistent Volume and Persistent Volume Claim for either EFS or FSx, not both.

Persistent Volume for EFS

  1. Execute: kubectl create namespace kubeflow to create kubeflow namespace

  2. In eks-cluster directory, customize pv-kubeflow-efs-gp-bursting.yaml for EFS file-system id and AWS region and execute: kubectl apply -n kubeflow -f pv-kubeflow-efs-gp-bursting.yaml

  3. Check to see the persistent-volume was successfully created by executing: kubectl get pv -n kubeflow

  4. Execute: kubectl apply -n kubeflow -f pvc-kubeflow-efs-gp-bursting.yaml to create an EKS persistent-volume-claim

  5. Check to see the persistent-volume was successfully bound to the persistent-volume-claim by executing: kubectl get pv -n kubeflow
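
    A combined check of both objects is also possible (a sketch; the names come from the manifests you applied):

    kubectl get pv,pvc -n kubeflow    # both the PV and the PVC should show STATUS Bound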

Persistent Volume for FSx

  1. Install the K8s Container Storage Interface (CSI) driver for the Amazon FSx Lustre file system in your EKS cluster

  2. Execute: kubectl create namespace kubeflow to create kubeflow namespace

  3. In eks-cluster directory, customize pv-kubeflow-fsx.yaml for FSx file-system id and AWS region and execute: kubectl apply -n kubeflow -f pv-kubeflow-fsx.yaml

  4. Check to see the persistent-volume was successfully created by executing: kubectl get pv -n kubeflow

  5. Execute: kubectl apply -n kubeflow -f pvc-kubeflow-fsx.yaml to create an EKS persistent-volume-claim

  6. Check to see the persistent-volume was successfully bound to persistent-volume-claim by executing: kubectl get pv -n kubeflow
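
    As with EFS, you can check binding, and also confirm the FSx CSI driver pods are running; pod names depend on the driver version, so the grep below is only indicative:

    kubectl get pv,pvc -n kubeflow            # both the PV and the PVC should show STATUS Bound
    kubectl get pods -n kube-system | grep fsx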

Build and Upload Docker Image to Amazon EC2 Container Registry (ECR)

We need to package TensorFlow, TensorPack and Horovod in a Docker image and upload the image to Amazon ECR. To that end, in the container/build_tools directory in this project, customize the AWS region and execute the ./build_and_push.sh shell script. This script creates and uploads the required Docker image to Amazon ECR in your selected AWS region, which by default is the region configured in your default AWS CLI profile and may not be us-west-2, the region assumed in this tutorial. Save the ECR URL of the pushed image for later steps.
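
For orientation, a script like build_and_push.sh typically performs an ECR login, build, tag and push sequence along the following lines. This is an illustrative sketch only (the repository name is a hypothetical placeholder); the script in the repo is authoritative:

REGION=us-west-2
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
REPO=mask-rcnn-tensorpack                 # hypothetical repository name
aws ecr create-repository --repository-name $REPO --region $REGION || true
$(aws ecr get-login --no-include-email --region $REGION)
docker build -t $REPO .
docker tag $REPO:latest $ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest
docker push $ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest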

Optimized MaskRCNN

To use an optimized version of MaskRCNN, go into the container-optimized/build_tools directory in this project, customize the AWS region and execute the ./build_and_push.sh shell script. This script creates and uploads the required Docker image to Amazon ECR in your selected AWS region, which by default is the region configured in your default AWS CLI profile and may not be us-west-2, the region assumed in this tutorial. Save the ECR URL of the pushed image for later steps.

Stage Data

To download the COCO 2017 dataset to your build environment instance and upload it to your Amazon S3 bucket, customize the eks-cluster/prepare-s3-bucket.sh script to specify your S3 bucket in the S3_BUCKET variable, then execute eks-cluster/prepare-s3-bucket.sh.
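
For reference, the script roughly amounts to downloading the COCO 2017 archives and syncing them to your bucket; the exact layout expected by the training job is defined by the script itself, and the bucket name and destination prefix below are placeholders, so treat this only as a sketch:

wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
aws s3 sync ./coco2017 s3://<your-s3-bucket>/coco2017    # after unpacking the archives into ./coco2017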

Next, we stage the data on EFS or FSx file-system. We need to use either EFS or FSx below, not both.

Use EFS, or FSx

To stage data on EFS or FSx, set image in eks-cluster/stage-data.yaml to the ECR URL you noted above, customize S3_BUCKET variable and execute:

kubectl apply -f stage-data.yaml -n kubeflow

to stage data on the selected persistent volume claim for EFS (default), or FSx. Customize the persistent volume claim in eks-cluster/stage-data.yaml to use FSx.

Execute kubectl get pods -n kubeflow to check the status of the stage-data Pod. Once the status of the stage-data Pod is Completed, execute the following commands to verify the data has been staged correctly:

kubectl apply -f attach-pvc.yaml -n kubeflow
kubectl exec attach-pvc -it -n kubeflow -- /bin/bash

You will be attached to the EFS or FSx file system persistent volume. Type exit once you have verified the data.
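
Inside the attach-pvc pod, a quick way to confirm the volume is mounted and populated is sketched below; the mount path is whatever attach-pvc.yaml specifies and is shown here only as a placeholder:

df -h                      # the EFS or FSx volume should appear as a mounted file system
ls <pvc-mount-path>        # the staged COCO 2017 data should be visible here
exit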

Install Helm

Helm is a package manager for Kubernetes. It uses a package format named charts. A Helm chart is a collection of files that define Kubernetes resources. Install Helm according to the instructions here.

After installing Helm, initialize Helm as described below:

  1. In eks-cluster folder, execute kubectl create -f tiller-rbac-config.yaml. You should see the following two messages:

     serviceaccount "tiller" created  
     clusterrolebinding "tiller" created
  2. Execute helm init --service-account tiller --history-max 200
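
    To confirm Tiller came up before releasing charts (a sketch):

    kubectl get pods -n kube-system | grep tiller    # the tiller-deploy pod should be Running
    helm version                                     # should report both client and server versions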

Release Helm charts for training

  1. In the charts folder in this project, execute helm install --name mpijob ./mpijob/ to deploy the Kubeflow MPIJob CustomResourceDefinition in EKS using the mpijob chart.

  2. a) In the charts/maskrcnn folder in this project, customize the image, data_fs, shared_fs and shared_pvc variables in values.yaml. Set image to the ECR docker image URL you built and uploaded in a previous step. Set shared_fs to efs or fsx, as applicable. Set data_fs to efs, fsx or ebs, as applicable. Set shared_pvc to the name of the k8s persistent volume claim you created in the relevant k8s namespace.

    b) To use an optimized version of MaskRCNN under active development, in the charts/maskrcnn-optimized folder in this project, customize the image, data_fs, shared_fs and shared_pvc variables in values.yaml. Set image to the optimized MaskRCNN ECR docker image URL you built and uploaded in a previous step. Set shared_fs to efs or fsx, as applicable. Set data_fs to efs, fsx or ebs, as applicable. Set shared_pvc to the name of the k8s persistent volume claim you created in the relevant k8s namespace.

    c) To create a brand new Helm chart for defining a new MPIJob, copy the maskrcnn folder to a new folder under charts. Update the chart name in Chart.yaml. Update the namespace global variable in values.yaml to specify a new K8s namespace.

  3. In the charts folder in this project, execute helm install --name maskrcnn ./maskrcnn/ to create the MPI Operator Deployment resource and also define an MPIJob resource for Mask-RCNN Training.

  4. Execute: kubectl get pods -n kubeflow to see the status of the pods

  5. Execute: kubectl logs -f maskrcnn-launcher-xxxxx -n kubeflow to see live log of training from the launcher (change xxxxx to your specific pod name).

  6. Model checkpoints and logs will be placed on the shared_fs file-system set in values.yaml, i.e. efs or fsx.
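
    Two further commands are handy while the job runs; the mpijobs resource name assumes the MPIJob CustomResourceDefinition installed by the mpijob chart above:

    kubectl get mpijobs -n kubeflow                                                       # high-level job status
    kubectl logs -f -n kubeflow $(kubectl get pods -n kubeflow -o name | grep launcher)   # follow the launcher log without copying the pod name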

Visualize Tensorboard summaries

Execute: kubectl get services -n kubeflow to get Tensorboard service DNS address. Access the Tensorboard DNS service in a browser on port 80 to visualize Tensorboard summaries.
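
If the service is not reachable from your network, port-forwarding is an alternative; the service name below is a placeholder for whatever kubectl get services reports:

kubectl port-forward -n kubeflow service/<tensorboard-service> 8080:80
# then browse to http://localhost:8080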

Purge Helm charts after training

When training is complete, you may purge a release by executing helm del --purge maskrcnn. This will destroy all pods used in training, including Tensorboard service pods. However, the training output will be preserved in the EFS or FSx shared file system used for training.

Destroy GPU enabled EKS cluster

When you are done with distributed training, you can destroy the EKS cluster and worker node group.

Quick start option

If you used the quick start option above to create the EKS cluster and worker node group, then in the eks-cluster/terraform/aws-eks-cluster-and-nodegroup folder, execute terraform destroy with the same arguments you used with terraform apply above.
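
For example, if you used the quick start apply command shown earlier, the matching destroy is:

terraform destroy -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]' -var="k8s_version=1.14" -var="key_pair=xxx"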

Advanced option

In the eks-cluster/terraform/aws-eks-nodegroup folder, execute terraform destroy with the same arguments you used with terraform apply above to destroy the worker node group, and then similarly execute terraform destroy in eks-cluster/terraform/aws-eks-cluster to destroy the EKS cluster.

This step will not destroy the shared EFS or FSx file-system used in training.

