09 April 2007

Today I posted a public AMI which can be used to run a small Beowulf cluster on Amazon EC2 and do some parallel computations with C, Fortran, or Python. If you prefer another language (Java, Ruby, etc.), just install the appropriate MPI library and rebundle the EC2 image. The following set of Python scripts automates the launch and configuration of an MPI cluster on EC2 (currently limited to 20 nodes while EC2 is in beta):

Update (3-19-08): Code for running a cluster with large or xlarge 64-bit EC2 instances is now hosted on Google Code. The new images include NFS, Ganglia, IPython1, and other useful Python packages.

http://code.google.com/p/elasticwulf/

Update (7-24-07): I've made some important bug fixes to the scripts to address issues mentioned in the comments. See the README file for details.

The download contains some quick scripts I threw together using the AWS Python example code. This is the approach I'm using to bootstrap an MPI cluster until one of the major Linux cluster distros is ported to run on EC2. Details on what is included in the public AMI were covered in Part 1 of this tutorial; Part 3 will cover cluster operation on EC2 in more detail and show how to use Python to carry out some neat parallel computations.

The cluster launch process is pretty simple once you have an Amazon EC2 account and keys: just download the Python scripts and you can be running a compute cluster in a few minutes. In a later post I will look at cluster bandwidth and performance in detail. If you only have an occasional need for running large jobs, $2/hour for a 20 node MPI cluster on EC2 is not a bad deal considering the ~$20K price of building your own comparable system.
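The $2/hour figure is simple arithmetic based on EC2's beta pricing of $0.10 per small instance-hour:

```python
# EC2 beta pricing: $0.10 per small instance-hour
nodes = 20
price_per_instance_hour = 0.10  # USD

cluster_cost_per_hour = nodes * price_per_instance_hour
print("cluster cost: $%.2f/hour" % cluster_cost_per_hour)
```

Keep in mind EC2 bills per instance-hour, so a 5 minute test run of 20 nodes still costs the full $2.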

Prerequisites:

  1. Get a valid Amazon EC2 account
  2. Complete the most recent "getting started guide" tutorial on Amazon EC2 and create all needed web service accounts, authorizations, and keypairs
  3. Download and install the Amazon EC2 Python library
  4. Download the Amazon EC2 MPI cluster management scripts

Launching the EC2 nodes

First, unzip the cluster management scripts and modify the configuration parameters in '''EC2config.py''', substituting your own EC2 keys and changing the cluster size if desired:

#replace these with your AWS keys
AWS_ACCESS_KEY_ID = 'YOUR_KEY_ID_HERE'
AWS_SECRET_ACCESS_KEY = 'YOUR_KEY_HERE'
#change this to your keypair location (see the EC2 getting started guide tutorial on using ec2-add-keypair)
KEYNAME = "gsg-keypair"
KEY_LOCATION = "/Users/pskomoroch/id_rsa-gsg-keypair"
# remove these next two lines when you've updated your credentials.
print "update %s with your AWS credentials" % sys.argv[0]
sys.exit()

MASTER_IMAGE_ID = "ami-3e836657"
IMAGE_ID = "ami-3e836657"

DEFAULT_CLUSTER_SIZE = 5

Launch the EC2 cluster by running the '''ec2-start_cluster.py''' script from your local machine:

peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ./ec2-start-cluster.py 

image ami-3e836657
master image ami-3e836657
----- starting master -----
RESERVATION r-275eb84e  027811143419    default
INSTANCE    i-0ed33167  ami-3e836657            pending
----- starting workers -----
RESERVATION r-265eb84f  027811143419    default
INSTANCE    i-01d33168  ami-3e836657            pending
INSTANCE    i-00d33169  ami-3e836657            pending
INSTANCE    i-03d3316a  ami-3e836657            pending
INSTANCE    i-02d3316b  ami-3e836657            pending

Verify the EC2 nodes are running with '''./ec2-check-instances.py''':


peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ./ec2-check-instances.py 
----- listing instances -----

RESERVATION     r-aec420c7      027811143419    default
INSTANCE        i-ab41a6c2      ami-3e836657    domU-12-31-33-00-02-5A.usma1.compute.amazonaws.com      running
INSTANCE        i-aa41a6c3      ami-3e836657    domU-12-31-33-00-01-E3.usma1.compute.amazonaws.com      running
INSTANCE        i-ad41a6c4      ami-3e836657    domU-12-31-33-00-03-AA.usma1.compute.amazonaws.com      running
INSTANCE        i-ac41a6c5      ami-3e836657    domU-12-31-33-00-04-19.usma1.compute.amazonaws.com      running
INSTANCE        i-af41a6c6      ami-3e836657    domU-12-31-33-00-03-E3.usma1.compute.amazonaws.com      running

Cluster Configuration and Booting MPI

Run '''ec2-mpi-config.py''' to configure MPI on the nodes; this will take a minute or two depending on the number of nodes.


peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ./ec2-mpi-config.py 

---- MPI Cluster Details ----
Number of nodes = 5
Instance= i-ab41a6c2 hostname= domU-12-31-33-00-02-5A.usma1.compute.amazonaws.com state= running
Instance= i-aa41a6c3 hostname= domU-12-31-33-00-01-E3.usma1.compute.amazonaws.com state= running
Instance= i-ad41a6c4 hostname= domU-12-31-33-00-03-AA.usma1.compute.amazonaws.com state= running
Instance= i-ac41a6c5 hostname= domU-12-31-33-00-04-19.usma1.compute.amazonaws.com state= running
Instance= i-af41a6c6 hostname= domU-12-31-33-00-03-E3.usma1.compute.amazonaws.com state= running

The master node is ec2-72-44-46-78.z-2.compute-1.amazonaws.com 


... ...

Configuration complete, ssh into the master node as lamuser and boot the cluster:
$ ssh lamuser@ec2-72-44-46-78.z-2.compute-1.amazonaws.com 
> mpdboot -n 5 -f mpd.hosts 
> mpdtrace
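The mpd.hosts file referenced above is just a plain text list of node hostnames, one per line, which is the format mpdboot expects. The config script builds it from the instance descriptions (it also handles ssh key distribution, which is omitted here). A minimal sketch of that step, using illustrative hostnames:

```python
# Write an mpd.hosts file from a list of EC2 internal hostnames,
# one hostname per line, as mpdboot expects.
def write_mpd_hosts(hostnames, path="mpd.hosts"):
    f = open(path, "w")
    for host in hostnames:
        f.write(host + "\n")
    f.close()

nodes = ["domU-12-31-33-00-01-E3", "domU-12-31-33-00-03-AA"]
write_mpd_hosts(nodes)
```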

Login to the master node, boot the MPI cluster, and test the connectivity:



peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ssh lamuser@ec2-72-44-46-78.z-2.compute-1.amazonaws.com 



Sample Fedora Core 6 + MPICH2 + Numpy/PyMPI compute node image 

http://www.datawrangling.com/on-demand-mpi-cluster-with-python-and-ec2-part-1-of-3

---- Modified From Marcin's Cool Images: Cool Fedora Core 6 Base + Updates Image v1.0 ---

see http://developer.amazonwebservices.com/connect/entry.jspa?externalID=554&categoryID=101


Like Marcin's image, standard disclaimer applies, use as you please...

Amazon EC2 MPI Compute Node Image
Copyright (c) 2006 DataWrangling. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

    * Redistributions of source code must retain the above copyright
       notice, this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above
       copyright notice, this list of conditions and the following
       disclaimer in the documentation and/or other materials provided
       with the distribution.

    * Neither the name of the DataWrangling nor the names of any
       contributors may be used to endorse or promote products derived
       from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
[lamuser@domU-12-31-33-00-02-5A ~]$ 
[lamuser@domU-12-31-33-00-02-5A ~]$ mpdboot -n 5 -f mpd.hosts 
[lamuser@domU-12-31-33-00-02-5A ~]$ mpdtrace
domU-12-31-33-00-02-5A
domU-12-31-33-00-01-E3
domU-12-31-33-00-03-E3
domU-12-31-33-00-03-AA
domU-12-31-33-00-04-19

The results of the mpdtrace command show we have an MPI cluster running on 5 nodes. In the next section, we will verify that we can run some basic MPI tasks. For more detailed information on these mpd commands (and MPI in general), see the MPICH2 documentation.

Testing the MPI Cluster

Next we execute a sample C program bundled with MPICH2 which estimates pi using the cluster:


[lamuser@domU-12-31-33-00-02-5A ~]$  mpiexec -n 5 /usr/local/src/mpich2-1.0.5/examples/cpi
Process 0 of 5 is on domU-12-31-33-00-02-5A
Process 1 of 5 is on domU-12-31-33-00-01-E3
Process 2 of 5 is on domU-12-31-33-00-03-E3
Process 3 of 5 is on domU-12-31-33-00-03-AA
Process 4 of 5 is on domU-12-31-33-00-04-19
pi is approximately 3.1415926544231230, Error is 0.0000000008333298
wall clock time = 0.007539
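The cpi program applies the midpoint rule to the integral of 4/(1+x^2) over [0,1], which equals pi; each rank sums every Nth rectangle and an MPI_Reduce combines the partial sums. A serial Python sketch of the same decomposition (the loop over ranks stands in for the five MPI processes):

```python
import math

def cpi_estimate(n=10000, nprocs=5):
    # Midpoint rule for the integral of 4/(1+x^2) on [0,1], which equals pi.
    # Each "rank" sums every nprocs-th rectangle, mirroring cpi's stride
    # decomposition; the final sum() stands in for MPI_Reduce.
    h = 1.0 / n
    partials = [
        sum(4.0 / (1.0 + ((i - 0.5) * h) ** 2)
            for i in range(rank + 1, n + 1, nprocs))
        for rank in range(nprocs)
    ]
    return h * sum(partials)

print(abs(cpi_estimate() - math.pi))
```

With n = 10000 intervals the error comes out on the order of 1e-9, consistent with the Error value printed in the run above.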

Test the message travel time for the ring of nodes you just created:


[lamuser@domU-12-31-33-00-02-5A ~]$ mpdringtest 100
time for 100 loops = 0.14577794075 seconds
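That ring-test result translates into a rough per-hop latency: 100 loops around a 5 node ring is 500 message hops, so each hop takes about 0.3 ms. This is a useful baseline when comparing EC2 against a dedicated cluster interconnect:

```python
loops = 100
nodes = 5
total_time = 0.14577794075  # seconds, from the mpdringtest run above

hops = loops * nodes
per_hop_ms = total_time / hops * 1000.0
print("%.3f ms per hop" % per_hop_ms)
```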

Verify that the cluster can run a multiprocess job:


[lamuser@domU-12-31-33-00-02-5A ~]$ mpiexec -l -n 5 hostname
3: domU-12-31-33-00-03-AA
0: domU-12-31-33-00-02-5A
1: domU-12-31-33-00-01-E3
4: domU-12-31-33-00-04-19
2: domU-12-31-33-00-03-E3

Testing PyMPI

Let's verify that the PyMPI install is working with our running cluster of 5 nodes. Execute the following on the master node:


[lamuser@domU-12-31-33-00-02-5A ~]$ mpirun -np 5 pyMPI /usr/local/src/pyMPI-2.4b2/examples/fractal.py
Starting computation (groan)

process 1 done with computation!!
process 3 done with computation!!
process 4 done with computation!!
process 2 done with computation!!
process 0 done with computation!!
Header length is  54
BMP size is  (400, 400)
Data length is  480000
[lamuser@domU-12-31-33-00-02-5A ~]$ ls
hosts  id_rsa.pub  mpd.hosts  output.bmp

This produced the following fractal image (output.bmp):

output.bmp

We will show some more examples using PyMPI in the next post.

Changing the Cluster Size

If we want to modify the number of nodes in the cluster, we first need to shut down the MPI daemon ring from the master node as follows:


[lamuser@domU-12-31-33-00-02-5A ~]$ mpdallexit
[lamuser@domU-12-31-33-00-02-5A ~]$ mpdcleanup

Once this is done, you can start additional instances of the public AMI from your local machine, then re-run the '''ec2-mpi-config.py''' script and reboot the cluster.

Cluster Shutdown

Run '''ec2-stop-cluster.py''' to stop all EC2 MPI nodes. If you just want to stop the slave nodes, run '''ec2-stop-slaves.py'''.



peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ./ec2-stop-cluster.py
This will stop all your EC2 MPI images, are you sure (yes/no)? yes
----- listing instances -----
RESERVATION     r-aec420c7      027811143419    default
INSTANCE        i-ab41a6c2      ami-3e836657    domU-12-31-33-00-02-5A.usma1.compute.amazonaws.com      running
INSTANCE        i-aa41a6c3      ami-3e836657    domU-12-31-33-00-01-E3.usma1.compute.amazonaws.com      running
INSTANCE        i-ad41a6c4      ami-3e836657    domU-12-31-33-00-03-AA.usma1.compute.amazonaws.com      running
INSTANCE        i-ac41a6c5      ami-3e836657    domU-12-31-33-00-04-19.usma1.compute.amazonaws.com      running
INSTANCE        i-af41a6c6      ami-3e836657    domU-12-31-33-00-03-E3.usma1.compute.amazonaws.com      running

---- Stopping instance Id's ----
Stopping Instance Id = i-ab41a6c2 
Stopping Instance Id = i-aa41a6c3 
Stopping Instance Id = i-ad41a6c4 
Stopping Instance Id = i-ac41a6c5 
Stopping Instance Id = i-af41a6c6 

Waiting for shutdown ....
----- listing new state of instances -----
RESERVATION     r-aec420c7      027811143419    default
INSTANCE        i-ab41a6c2      ami-3e836657    domU-12-31-33-00-02-5A.usma1.compute.amazonaws.com      shutting-down
INSTANCE        i-aa41a6c3      ami-3e836657    domU-12-31-33-00-01-E3.usma1.compute.amazonaws.com      shutting-down
INSTANCE        i-ad41a6c4      ami-3e836657    domU-12-31-33-00-03-AA.usma1.compute.amazonaws.com      shutting-down
INSTANCE        i-ac41a6c5      ami-3e836657    domU-12-31-33-00-04-19.usma1.compute.amazonaws.com      shutting-down
INSTANCE        i-af41a6c6      ami-3e836657    domU-12-31-33-00-03-E3.usma1.compute.amazonaws.com      shutting-down