amazon-eks-machine-learning-with-terraform-and-kubeflow
Subscribe to the EKS-optimized AMI with GPU Support from the AWS Marketplace.
Manage your service limits so you can launch at least 4 EKS-optimized GPU enabled Amazon EC2 P3 instances.
Create an AWS Service role for an EC2 instance and add AWS managed policy for power user access to this IAM Role.
We need a build environment with AWS CLI and Docker installed. Launch a m5.xlarge Amazon EC2 instance from an AWS Deep Learning AMI (Ubuntu) using an EC2 instance profile containing the Role created in Step 4. All steps described under Step by step section below must be executed on this build environment instance.
While all the concepts described here are quite general, we will make these concepts concrete by focusing on distributed TensorFlow training for TensorPack Mask/Faster-RCNN model.
The high-level outline of steps is as follows:
Create GPU enabled Amazon EKS cluster
Create Persistent Volume and Persistent Volume Claim for Amazon EFS or Amazon FSx file system
Stage COCO 2017 data for training on Amazon EFS or FSx file system
Use Helm charts to manage training jobs in EKS cluster
This option creates an Amazon EKS cluster with one worker node group. This is the recommended option for walking through this tutorial.
In the eks-cluster directory, execute ./install-kubectl-linux.sh to install kubectl on Linux clients. For non-Linux operating systems, install and configure kubectl for EKS, install aws-iam-authenticator, and make sure the command aws-iam-authenticator help works.
In the eks-cluster/terraform/aws-eks-cluster-and-nodegroup folder, execute:
terraform init
The next command requires an Amazon EC2 key pair. If you have not already created an EC2 key pair, create one before executing the command below:
terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]' -var="k8s_version=1.14" -var="key_pair=xxx"
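Because terraform destroy at the end of this tutorial must be invoked with the same arguments you used with terraform apply, it can help to keep the -var flags in one small wrapper script so apply and destroy cannot drift apart. A minimal sketch, using the tutorial's default values; the key pair name is a placeholder you must replace with your own:

```shell
#!/bin/bash
# Shared terraform -var flags for this tutorial, kept in one place so that
# `terraform apply` and the later `terraform destroy` use identical arguments.
TF_VARS=(
  -var="profile=default"
  -var="region=us-west-2"
  -var="cluster_name=my-eks-cluster"
  -var='azs=["us-west-2a","us-west-2b","us-west-2c"]'
  -var="k8s_version=1.14"
  -var="key_pair=my-key-pair"      # placeholder: your EC2 key pair name
)
echo "${TF_VARS[@]}"
# terraform apply "${TF_VARS[@]}"    # uncomment to apply
# terraform destroy "${TF_VARS[@]}"  # later, destroy with the same flags
```

Source the script (or paste the array into your shell) before running terraform so both commands see the same variable set.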
This option separates the creation of the EKS cluster from the worker node group. You can create the EKS cluster and later add one or more worker node groups to the cluster.
In the eks-cluster/terraform/aws-eks-cluster folder, execute:
terraform init
terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]' -var="k8s_version=1.14"
Customize Terraform variables as appropriate. The Kubernetes version can be specified using -var="k8s_version=x.xx". Save the output of the apply command for the next step below.
In the eks-cluster/terraform/aws-eks-nodegroup folder, using the output of the previous terraform apply as inputs into this step, execute:
terraform init
The next command requires an Amazon EC2 key pair. If you have not already created an EC2 key pair, create one before executing the command below:
terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var="efs_id=fs-xxx" -var="subnet_id=subnet-xxx" -var="key_pair=xxx" -var="cluster_sg=sg-xxx" -var="nodegroup_name=xxx"
To create more than one node group in an EKS cluster, copy the eks-cluster/terraform/aws-eks-nodegroup folder to a new folder under eks-cluster/terraform/ and specify a unique value for the nodegroup_name variable.
In the eks-cluster directory, execute ./install-kubectl-linux.sh to install kubectl on Linux clients. For other operating systems, install and configure kubectl for EKS. Install aws-iam-authenticator and make sure the command aws-iam-authenticator help works. In the eks-cluster directory, customize set-cluster.sh and execute ./update-kubeconfig.sh to update the kube configuration.
Ensure that you have at least version 1.16.73 of the AWS CLI installed. Your system's Python version must be Python 3, or Python 2.7.9 or greater.
Upgrade Amazon CNI Plugin for Kubernetes, if needed (optional step)
In the eks-cluster directory, customize NodeInstanceRole in aws-auth-cm.yaml and execute ./apply-aws-auth-cm.sh to allow worker nodes to join the EKS cluster. Note: if this is not your first EKS node group, you must add the new node instance role Amazon Resource Name (ARN) to aws-auth-cm.yaml while preserving the existing role ARNs in the file.
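For orientation, aws-auth-cm.yaml follows the standard EKS aws-auth ConfigMap shape: each node group's instance role becomes one entry under mapRoles. A sketch with placeholder account id and role names (the file shipped in this repo is authoritative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    # existing node group role: keep this entry
    - rolearn: arn:aws:iam::111122223333:role/first-nodegroup-NodeInstanceRole
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
    # new node group role: append, do not replace
    - rolearn: arn:aws:iam::111122223333:role/second-nodegroup-NodeInstanceRole
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
```

Dropping an existing role entry would immediately evict that node group's workers from the cluster, which is why the note above insists on preserving existing ARNs.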
In the eks-cluster directory, execute ./apply-nvidia-plugin.sh to create the NVIDIA plugin daemon set.
We have two shared file system options for staging data for distributed training: Amazon EFS and Amazon FSx for Lustre. Below, you only need to create a Persistent Volume and Persistent Volume Claim for EFS, or FSx, not both.
Execute kubectl create namespace kubeflow to create the kubeflow namespace.
In the eks-cluster directory, customize pv-kubeflow-efs-gp-bursting.yaml with your EFS file-system id and AWS region, then execute: kubectl apply -n kubeflow -f pv-kubeflow-efs-gp-bursting.yaml
Check to see the persistent-volume was successfully created by executing: kubectl get pv -n kubeflow
Execute kubectl apply -n kubeflow -f pvc-kubeflow-efs-gp-bursting.yaml to create an EKS persistent-volume-claim. Check that the persistent-volume was successfully bound to the persistent-volume-claim by executing: kubectl get pv -n kubeflow
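The pv-kubeflow-efs-gp-bursting.yaml file in this repo is authoritative; conceptually, an EFS-backed persistent volume of this kind exposes the file system's NFS endpoint with ReadWriteMany access. A sketch of the general shape, with a placeholder file-system id and the tutorial's assumed region:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kubeflow-efs-gp-bursting
spec:
  capacity:
    storage: 1Ti               # EFS is elastic; this value is nominal
  accessModes:
    - ReadWriteMany            # shared across all training workers
  nfs:
    # placeholder: your EFS file-system id and region
    server: fs-0123456789abcdef0.efs.us-west-2.amazonaws.com
    path: "/"
```

The matching claim in pvc-kubeflow-efs-gp-bursting.yaml requests the same access mode, which is what lets every worker pod mount the same data concurrently.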
Install the K8s Container Storage Interface (CSI) driver for the Amazon FSx for Lustre file system in your EKS cluster.
Execute kubectl create namespace kubeflow to create the kubeflow namespace, if you have not already done so.
In the eks-cluster directory, customize pv-kubeflow-fsx.yaml with your FSx file-system id and AWS region, then execute: kubectl apply -n kubeflow -f pv-kubeflow-fsx.yaml
Check to see the persistent-volume was successfully created by executing: kubectl get pv -n kubeflow
Execute kubectl apply -n kubeflow -f pvc-kubeflow-fsx.yaml to create an EKS persistent-volume-claim. Check that the persistent-volume was successfully bound to the persistent-volume-claim by executing: kubectl get pv -n kubeflow
We need to package TensorFlow, TensorPack and Horovod in a Docker image and upload the image to Amazon ECR. To that end, in the container/build_tools directory in this project, customize the AWS region and execute the ./build_and_push.sh shell script. This script creates and uploads the required Docker image to Amazon ECR in your selected AWS region, which by default is the region configured in your default AWS CLI profile and may not be us-west-2, the assumed region for this tutorial. Save the ECR URL of the pushed image for later steps.
To use an optimized version of MaskRCNN, go into the container-optimized/build_tools directory in this project, customize the AWS region, and execute the ./build_and_push.sh shell script. This script creates and uploads the required Docker image to Amazon ECR in your selected AWS region, which by default is the region configured in your default AWS CLI profile and may not be us-west-2, the assumed region for this tutorial. Save the ECR URL of the pushed image for later steps.
To download the COCO 2017 dataset to your build environment instance and upload it to an Amazon S3 bucket, customize the eks-cluster/prepare-s3-bucket.sh script to specify your S3 bucket in the S3_BUCKET variable, then execute eks-cluster/prepare-s3-bucket.sh
Next, we stage the data on EFS or FSx file-system. We need to use either EFS or FSx below, not both.
To stage data on EFS or FSx, set image in eks-cluster/stage-data.yaml to the ECR URL you noted above, customize the S3_BUCKET variable, and execute:

kubectl apply -f stage-data.yaml -n kubeflow

This stages the data on the selected persistent volume claim for EFS (default), or FSx. Customize the persistent volume claim in eks-cluster/stage-data.yaml to use FSx.
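The stage-data.yaml in this repo is authoritative; conceptually it is a one-shot Pod that mounts the persistent volume claim and copies the dataset down from S3. A hypothetical sketch for orientation only (the image URL, bucket name, copy command and claim name are all placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: stage-data
spec:
  restartPolicy: Never
  containers:
    - name: stage-data
      image: <ecr-image-url>          # the image you pushed to ECR
      command: ["/bin/bash", "-c"]
      # placeholder: sync the dataset from your bucket onto the shared volume
      args: ["aws s3 sync s3://<your-bucket>/coco2017 /efs/data"]
      volumeMounts:
        - name: shared
          mountPath: /efs
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: pv-efs             # or the FSx claim, as applicable
```

Because the Pod has restartPolicy: Never, its status moves to Completed once the copy finishes, which is the signal the next step waits for.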
Execute kubectl get pods -n kubeflow to check the status of the stage-data Pod. Once the status of the stage-data Pod is marked Completed, execute the following commands to verify that the data has been staged correctly:

kubectl apply -f attach-pvc.yaml -n kubeflow
kubectl exec attach-pvc -it -n kubeflow -- /bin/bash

You will be attached to the EFS or FSx file system persistent volume. Type exit once you have verified the data.
Helm is a package manager for Kubernetes. It uses a package format named charts. A Helm chart is a collection of files that define Kubernetes resources. Install Helm according to the instructions here. After installing Helm, initialize it as described below:
In the eks-cluster folder, execute kubectl create -f tiller-rbac-config.yaml. You should see the following two messages:

serviceaccount "tiller" created
clusterrolebinding "tiller" created

Execute helm init --service-account tiller --history-max 200
In the charts folder in this project, execute helm install --name mpijob ./mpijob/ to deploy the Kubeflow MPIJob CustomResourceDefinition in EKS using the mpijob chart.
a) In the charts/maskrcnn folder in this project, customize the image, data_fs, shared_fs and shared_pvc variables in values.yaml. Set image to the ECR Docker image URL you built and uploaded in a previous step. Set shared_fs to efs or fsx, as applicable. Set data_fs to efs, fsx or ebs, as applicable. Set shared_pvc to the name of the k8s persistent volume you created in the relevant k8s namespace.
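A sketch of the values.yaml fields described above; the field names come from the text, while the values shown are examples for an EFS-backed setup (the ECR URL is a placeholder):

```yaml
# charts/maskrcnn/values.yaml (relevant fields only; example values)
image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/mask-rcnn:latest  # your ECR image URL
shared_fs: efs        # efs | fsx
data_fs: efs          # efs | fsx | ebs
shared_pvc: pv-efs    # name of the persistent volume claim in the namespace
```

Keeping shared_fs and shared_pvc consistent with the persistent volume you actually created earlier is what allows the training pods to mount the staged dataset.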
b) To use an optimized version of MaskRCNN under active development, in the charts/maskrcnn-optimized folder in this project, customize the image, data_fs, shared_fs and shared_pvc variables in values.yaml. Set image to the optimized MaskRCNN ECR Docker image URL you built and uploaded in a previous step. Set shared_fs to efs or fsx, as applicable. Set data_fs to efs, fsx or ebs, as applicable. Set shared_pvc to the name of the k8s persistent volume you created in the relevant k8s namespace.
c) To create a brand new Helm chart for defining a new MPIJob, copy the maskrcnn folder to a new folder under charts. Update the chart name in Chart.yaml. Update the namespace global variable in values.yaml to specify a new K8s namespace.
In the charts folder in this project, execute helm install --name maskrcnn ./maskrcnn/ to create the MPI Operator Deployment resource and also define an MPIJob resource for Mask-RCNN training.
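The exact manifest the chart renders comes from its templates, but an MPIJob custom resource generally pairs a single launcher replica with a set of GPU workers. A rough, hypothetical sketch for orientation only (image URL, entrypoint and replica counts are placeholders, not the chart's actual values):

```yaml
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: maskrcnn
  namespace: kubeflow
spec:
  slotsPerWorker: 8                # GPUs per worker node
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: maskrcnn
              image: <ecr-image-url>
              command: ["mpirun", "python3", "train.py"]   # placeholder entrypoint
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: maskrcnn
              image: <ecr-image-url>
              resources:
                limits:
                  nvidia.com/gpu: 8
```

The MPI Operator watches for resources of this kind, wires up the hostfile and SSH plumbing between the pods, and runs the mpirun command on the launcher once all workers are ready.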
Execute kubectl get pods -n kubeflow to see the status of the pods. Execute kubectl logs -f maskrcnn-launcher-xxxxx -n kubeflow to see the live training log from the launcher (change xxxxx to your specific pod name).
Model checkpoints and logs will be placed on the shared_fs file-system set in values.yaml, i.e. efs or fsx.
Execute kubectl get services -n kubeflow to get the TensorBoard service DNS address. Access the TensorBoard service DNS address in a browser on port 80 to visualize TensorBoard summaries.
When training is complete, you may purge a release by executing helm del --purge maskrcnn. This will destroy all pods used in training, including TensorBoard service pods. However, the training output will be preserved in the EFS or FSx shared file system used for training.
When you are done with distributed training, you can destroy the EKS cluster and worker node group.
If you used the quick start option above to create the EKS cluster and worker node group, then in the eks-cluster/terraform/aws-eks-cluster-and-nodegroup folder, execute terraform destroy with the same arguments you used with terraform apply above.
Otherwise, in the eks-cluster/terraform/aws-eks-nodegroup folder, execute terraform destroy with the same arguments you used with terraform apply above to destroy the worker node group, and then similarly execute terraform destroy in eks-cluster/terraform/aws-eks-cluster to destroy the EKS cluster.
This step will not destroy the shared EFS or FSx file-system used in training.