aistore

AIStore is a lightweight object storage system with the capability to linearly scale-out with each added storage node and a special focus on petascale deep learning.

AIStore (AIS for short) is a built-from-scratch, lightweight storage stack tailored for AI apps. As of version 2.x, AIS consistently shows balanced I/O distribution across arbitrary numbers of clustered servers and hard drives, producing performance charts that look as follows:

[Image "ais-disk-throughput-flat.png" - steady, balanced disk throughput across clustered drives]

The chart above covers 120 HDDs.

The capability to linearly scale-out for millions of stored objects (often also referred to as shards) was, and remains, one of the main incentives to build AIStore. But not the only one.

Features

  • S3-like HTTP REST API to GET/PUT objects, and create, destroy, list and configure buckets, and more (see the curl example below);

  • FUSE client (called aisfs), to access AIS objects as files;

  • arbitrary number of access points - AIS gateways (aka proxies) that are extremely lightweight and can run anywhere;

  • easy-to-use command-line interface (CLI);

  • scale-out with no downtime and no limitation;

  • automated rebalancing upon changes in cluster membership, drive failures, bucket renaming, etc.;

  • n-way mirroring (RAID-1), Reed–Solomon erasure coding, end-to-end data protection;

In addition, AIStore:

  • can be deployed on any commodity hardware;

  • supports Amazon S3 and Google Cloud Storage backends (and all GCS and S3-compliant object storages);

  • can be used as a fast tier or a cache for GCS and S3; can be populated on-demand and/or via separate prefetch and download APIs;

  • can be used as a standalone highly-available protected storage;

  • includes MapReduce extension for massively parallel resharding of very large datasets.

Last but not least, AIS runs natively on Kubernetes and features an open format and, therefore, the freedom to copy or move your data off of AIS at any time using familiar Linux tar(1), scp(1), rsync(1), and similar.
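
Coming back to the REST API bullet above: assuming a gateway listening on localhost:8080 (the default in local deployments) and a hypothetical bucket named mybucket, a PUT and a subsequent GET could look as follows (-L is needed because the gateway redirects each request to a selected target):

$ curl -L -X PUT 'http://localhost:8080/v1/objects/mybucket/myobject' -T myobject
$ curl -L -X GET 'http://localhost:8080/v1/objects/mybucket/myobject' -o myobject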

For a detailed overview, design philosophy, and components, please see this document, where you can also find 5 (five) alternative ways to populate AIStore with existing datasets.

Prerequisites

  • Linux (with gcc, sysstat and attr packages, and kernel 4.15+)

  • Go 1.13 or later

  • Extended attributes (xattrs - see below)

  • Optionally, Amazon (AWS) or Google Cloud Platform (GCP) account(s)

Depending on your Linux distribution, you may or may not have gcc, sysstat, and/or attr packages.

The capability called extended attributes, or xattrs, is a long-standing POSIX feature supported by all mainstream filesystems with no exceptions. Unfortunately, extended attributes (xattrs) may not always be enabled in the Linux kernel configuration of the distribution you are using, a fact that can easily be checked by running the setfattr command.

If disabled, please make sure to enable xattrs in your Linux kernel configuration.
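
For example, the following quick check uses a throwaway file (the attribute name user.test is arbitrary):

$ touch /tmp/xattr-check
$ setfattr -n user.test -v yes /tmp/xattr-check && getfattr -n user.test /tmp/xattr-check

If xattrs are enabled, getfattr prints the attribute back; otherwise, setfattr fails (typically with an "Operation not supported" error).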

Getting Started

AIStore runs on commodity Linux machines with no special hardware requirements.

It is expected, though, that all AIS target machines are identical, hardware-wise.

The implication is that the number of possible deployment options is practically unlimited. This section covers 3 (three) ways to deploy AIS on a single Linux machine and is intended for developers and development, and/or for a quick trial.

Deployment: Local non-Containerized

Assuming that Go is already installed, the remaining getting-started steps are:

$ cd $GOPATH/src
$ go get -v github.com/NVIDIA/aistore/ais
$ cd github.com/NVIDIA/aistore
$ make deploy
$ go test ./tests -v -run=Mirror

where:

  • go get installs sources and dependencies under your $GOPATH.

  • make deploy deploys AIStore daemons locally and interactively, for example:

$ make deploy
Enter number of storage targets:
10
Enter number of proxies (gateways):
3
Number of local cache directories (enter 0 to use preconfigured filesystems):
2
Select Cloud Provider:
1: Amazon Cloud
2: Google Cloud
3: None
Enter your choice:
3

Or, you can run all of the above non-interactively:

$ make kill; make deploy <<< $'10\n3\n2\n0'

The example deploys 3 gateways and 10 targets, each with 2 local simulated filesystems. Also notice the "Cloud Provider" prompt above, and the fact that access to Cloud storage is specified at deployment time.

make kill will terminate local AIStore if it's already running.

Run make help for many other useful commands, including those that build AIS CLI, FUSE, and benchmarks (binaries), deploy AIS cluster, and run some/all tests.

To enable an optional AIStore authentication server, execute instead $ CREDDIR=/tmp/creddir AUTHENABLED=true make deploy. For information on AuthN server, please see AuthN documentation.

Finally, the go test (above) will create an ais bucket, configure it as a two-way mirror, generate thousands of random objects, read them all several times, and then destroy the replicas and eventually the bucket as well.

Alternatively, if you happen to have an Amazon and/or Google Cloud account, make sure to specify the corresponding (S3 or GCS) bucket name when running go test commands. For example, the following will download objects from your (presumably existing) S3 bucket and distribute them across AIStore:

$ BUCKET=myS3bucket go test ./tests -v -run=download

Here's a minor variation of the above:

$ BUCKET=myS3bucket go test ./tests -v -run=download -args -numfiles=100 -match='a\d+'

This command runs a test matching the specified string ("download"). The test then downloads up to 100 objects from the bucket called myS3bucket, whereby the names of those objects match the a\d+ regex.

In addition to AIStore - the storage cluster - you can also deploy aisfs, to access AIS objects as files, and AIS CLI, to monitor, configure, and manage AIS nodes and buckets.

AIS CLI is an easy-to-use command-line management tool supporting a growing number of commands and options (one of the first you may want to try is ais status, which shows the state and status of an AIS cluster). The CLI is documented in the readme; getting started with it boils down to running make cli and following the prompts.
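
For example (assuming a cluster deployed locally as per the steps above):

$ make cli      # build the AIS CLI and follow the prompts
$ ais status    # show the state and status of the cluster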

For more testing commands and command-line options, please refer to the corresponding README.

For tips and help on local non-containerized deployment, please see the tips.

For info on how to run AIS executables, see command-line arguments.

For helpful links and background on Go, AWS, GCP, and Deep Learning, please see helpful links.

And again, run make help to find out how to build and run AIS executables and tests.

Deployment: Local Docker-Compose

The 2nd option to run AIS on your local machine requires Docker and Docker-Compose. It also allows for multi-cluster deployments with multiple separate networks. You can deploy a simple AIS cluster within seconds or deploy a multi-container cluster for development.

To get started with AIStore and Docker, see: Getting started with Docker.

Deployment: Local Kubernetes

The 3rd and final local-deployment option makes use of Kubeadm and is documented here.

Containerized Deployments: Host Resource Sharing

The following applies to all containerized deployments:

  1. AIS nodes always automatically detect containerization.

  2. If deployed as a container, each AIS node independently discovers whether its own container's memory and/or CPU resources are restricted.

  3. Finally, the node abides by those restrictions.

To that end, each AIS node at startup loads and parses cgroup settings for the container and, if the number of CPUs is restricted, adjusts the number of allocated system threads for its goroutines.

This adjustment is accomplished via the Go runtime GOMAXPROCS variable. For in-depth information on CPU bandwidth control and scheduling in a multi-container environment, please refer to the CFS Bandwidth Control document.
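
As a quick illustration (assuming cgroup v1; the file paths differ under cgroup v2), the CPU restriction that a containerized node would discover can be inspected as follows, where a quota of 200000 over a period of 100000 amounts to 2 CPUs, i.e., GOMAXPROCS=2:

$ cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us      # e.g., 200000 (-1 means unrestricted)
$ cat /sys/fs/cgroup/cpu/cpu.cfs_period_us     # e.g., 100000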

Further, given the container's cgroup/memory limitation, each AIS node adjusts the amount of memory available for itself.

Limits on memory may affect dSort performance, forcing it to "spill" the content associated with in-progress resharding onto local drives. The same is true for erasure coding, which also requires memory to rebuild objects from slices, etc.

For technical details on AIS memory management, please see this readme.

Performance Monitoring

As is usually the case with storage clusters, there are multiple ways to monitor their performance.

AIStore includes aisloader - the tool to stress-test and benchmark storage performance. For background, command-line options, and usage, please see AIS Load Generator.

For starters, AIS collects and logs a fairly large and constantly growing number of counters that describe all aspects of its operation, including (but not limited to) those that reflect cluster recovery/rebalancing, all extended long-running operations, and, of course, object storage transactions.

In particular:

  • For dSort monitoring, please see dSort

  • For Downloader monitoring, please see Internet Downloader

The logging interval is called stats_time (default: 10s) and is configurable at the level of each specific node as well as the entire cluster.
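
For illustration only (the exact endpoint and message format here are assumptions based on AIS's JSON-message control API; consult the configuration documentation for the authoritative syntax), updating stats_time cluster-wide through a gateway at localhost:8080 might look like:

$ curl -i -X PUT -H 'Content-Type: application/json' \
    -d '{"action": "setconfig", "name": "stats_time", "value": "8s"}' \
    'http://localhost:8080/v1/cluster'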

Speaking of ways to monitor AIS remotely, the two most obvious ones are the AIS CLI (see above) and Graphite/Grafana.

As far as Graphite/Grafana are concerned, AIS integrates with these popular backends via StatsD, the daemon for easy but powerful stats aggregation. StatsD can be connected to Graphite, which in turn can be used as a data source for Grafana, to get a visual overview of the statistics and metrics.

The scripts for easy deployment of both Graphite and Grafana are included (see below).

For local non-containerized deployments, use ./ais/setup/deploy_grafana.sh to start Graphite and Grafana containers. Local deployment scripts will automatically "notice" the presence of the containers and send statistics to Graphite.

For local docker-compose based deployments, make sure to use the -grafana command-line option. The deploy_docker.sh script will then spin up Graphite and Grafana containers.

In both of these cases, Grafana will be accessible at localhost:3000.

For information on AIS statistics, please see: Statistics, Collected Metrics, Visualization.

Configuration

AIS configuration is consolidated in a single JSON template where the configuration sections and the knobs within them are intended to be self-explanatory; the majority of the knobs (except maybe just a few) have pre-assigned default values. The configuration template serves as the single source for all deployment-specific configurations, examples of which can be found under the folder that consolidates both containerized-development and production deployment scripts.

AIS production deployment, in particular, requires careful consideration of at least some of the configurable aspects. For example, AIS supports 3 (three) logical networks and will, therefore, benefit, performance-wise, if provisioned with up to 3 isolated physical networks or VLANs. The logical networks are:

  • user (aka public)

  • intra-cluster control

  • intra-cluster data

with the corresponding JSON names, respectively (see the illustrative fragment below):

  • ipv4

  • ipv4_intra_control

  • ipv4_intra_data
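
Purely as an illustration (the key names are those listed above; the values are made-up placeholders), the corresponding fragment of a node's configuration might look like:

"ipv4": "10.10.1.10",
"ipv4_intra_control": "10.10.2.10",
"ipv4_intra_data": "10.10.3.10"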
