Use Case: Distributed TensorFlow on a Cluster of 1&1 Cloud Servers

Introduction

Learn how the 1&1 Cloud Server platform can help support and expand your TensorFlow project. There are several use cases for running Distributed TensorFlow on a cluster of 1&1 Cloud Servers. Using a cluster allows you to increase training throughput by harnessing the computing power of multiple servers. Google also recommends using a cluster when dealing with a very large data set, or with very large models.

Use Case: Set up a Cluster of 1&1 Cloud Servers to run Distributed TensorFlow

The 1&1 Cloud Server platform makes it easy to build out a cluster of Cloud Servers to run Distributed TensorFlow.

First, create a "template" server by installing TensorFlow and any necessary dependencies. Then, simply clone this server to create as many nodes in the cluster as you require.
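
Because every clone carries the same software, the only thing that differs from node to node is its role and index in the cluster. The following is a minimal sketch of that idea, assuming the TF_CONFIG environment variable convention used by TensorFlow's distribution strategies. The IP addresses and the build_tf_config helper are hypothetical placeholders, not part of the 1&1 platform or of TensorFlow.

```python
import json

# Hypothetical private IPs of the cloned Cloud Servers; replace with your own.
PS_HOSTS = ["10.0.0.10"]
WORKER_HOSTS = ["10.0.0.11", "10.0.0.12"]
PORT = 2222  # the port used in most TensorFlow examples


def build_tf_config(task_type, task_index):
    """Return the TF_CONFIG JSON string for one node of the cluster."""
    cluster = {
        "ps": [f"{host}:{PORT}" for host in PS_HOSTS],
        "worker": [f"{host}:{PORT}" for host in WORKER_HOSTS],
    }
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"no {task_type} with index {task_index}")
    task = {"type": task_type, "index": task_index}
    return json.dumps({"cluster": cluster, "task": task})


# Each clone would export its own value before starting the training script.
print(build_tf_config("worker", 1))
```

The cluster part of the value is identical on every node; only the task entry changes, which is what makes the clone-a-template approach practical.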

Note that you will also need to create a Firewall Policy to allow TensorFlow traffic to the required ports on the servers. The standard TensorFlow port used in most examples is 2222.
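
One way to confirm that the Firewall Policy lets traffic through is to attempt a TCP connection from one node to another. The sketch below uses only the Python standard library; port_is_open is a hypothetical helper name. The check succeeds only when a process (for example, a running TensorFlow server) is actually listening on the port, so a False result can mean either a blocked port or a server that has not been started yet.

```python
import socket


def port_is_open(host, port=2222, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, calling port_is_open("10.0.0.11") from the parameter server would test the first worker, assuming that placeholder address belongs to one of your nodes.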

Scenario 1: Large Data Set

Google recommends using a clustered solution to handle a large TensorFlow data set. TensorFlow data sets can be several gigabytes in size. A data set this large is difficult to house on a single server, and a single server can take a very long time to process it.

For this situation, we recommend:

  • Data set housed on a Shared Storage volume.
  • Multiple servers in the processing cluster, each accessing that volume.

This greatly increases processing throughput, because the servers work through the data set in parallel. It also reduces the storage bottleneck, because the Shared Storage volume can be accessed by many servers at the same time.
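
To benefit from parallel processing, each server should work on a different part of the data set. A simple way to sketch this is to give every worker its own slice of the file list on the shared volume. The mount path, file names, and the shard_for_worker helper below are hypothetical and only illustrate the idea.

```python
def shard_for_worker(files, worker_index, num_workers):
    """Return every num_workers-th file, starting at worker_index."""
    if not 0 <= worker_index < num_workers:
        raise ValueError("worker_index must be in [0, num_workers)")
    # Sort first so that every worker sees the files in the same order.
    return sorted(files)[worker_index::num_workers]


# Hypothetical file names on the Shared Storage volume.
files = [f"/mnt/shared/train-{i:05d}.tfrecord" for i in range(7)]
for idx in range(3):
    print(idx, shard_for_worker(files, idx, 3))
```

With this scheme no file is read by two workers, and together the workers cover the whole data set.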

Scenario 2: Large Model

A basic Distributed TensorFlow cluster might consist of one parameter server (ps) and one worker server (worker). As the workload of the model grows, you can add worker servers to the cluster as needed. One parameter server can handle all of the read and update requests from a small number of worker servers.

However, as the cluster grows, the parameter server can become a bottleneck. The point at which this happens differs from cluster to cluster, because the size of the model is also a factor. If you have a large number of workers and a large model, throughput may slow dramatically as the parameter server becomes unable to handle the workload.

In this situation, Google recommends adding a second parameter server to handle the additional traffic from the worker servers.
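
A second parameter server helps because the model's variables can be spread across both servers, so each one handles only part of the read and update traffic. The following pure-Python sketch illustrates round-robin placement, which is the default strategy of tf.train.replica_device_setter in TensorFlow 1.x. It is not TensorFlow code; the assign_round_robin helper and the variable names are hypothetical.

```python
def assign_round_robin(variable_names, num_ps):
    """Map each variable name to a parameter server device string."""
    if num_ps < 1:
        raise ValueError("num_ps must be at least 1")
    return {
        name: f"/job:ps/task:{i % num_ps}"
        for i, name in enumerate(variable_names)
    }


# Hypothetical variable names for a small model.
variables = ["embedding", "hidden/weights", "hidden/bias", "output/weights"]
for name, device in assign_round_robin(variables, 2).items():
    print(f"{name} -> {device}")
```

Note that round-robin placement balances the number of variables, not their size, so a single very large variable can still load one parameter server more than the other.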

The 1&1 Cloud Server platform allows you to easily build out new Cloud Servers with a few clicks. Most servers will be provisioned in 55 seconds or less.

For more information on adapting your Distributed TensorFlow code to specify multiple parameter servers, see the official Distributed TensorFlow documentation.
