Today I posted a public AMI which can be used to run a small Beowulf cluster on Amazon EC2 and do parallel computations with C, Fortran, or Python. If you prefer another language (Java, Ruby, etc.), just install the appropriate MPI library and rebundle the EC2 image. The following set of Python scripts automates the launch and configuration of an MPI cluster on EC2 (currently limited to 20 nodes while EC2 is in beta):
Update (3-19-08): Code for running a cluster with large or xlarge 64-bit EC2 instances is now hosted on Google Code. The new images include NFS, Ganglia, IPython1, and other useful Python packages.
Update (7-24-07): I've made some important bug fixes to the scripts to address issues mentioned in the comments. See the README file for details.
The download contains some quick scripts I threw together using the AWS Python example code. This is the approach I'm using to bootstrap an MPI cluster until one of the major Linux cluster distros is ported to run on EC2. Details on what is included in the public AMI were covered in Part 1 of the tutorial; Part 3 will cover cluster operation on EC2 in more detail and show how to use Python to carry out some neat parallel computations.
The cluster launch process is pretty simple: once you have an Amazon EC2 account and keys, just download the Python scripts and you can be running a compute cluster in a few minutes. In a later post I will look at cluster bandwidth and performance in detail. If you only have an occasional need for running large jobs, $2/hour for a 20-node MPI cluster on EC2 is not a bad deal considering the ~$20K price of building your own comparable system.
- Get a valid Amazon EC2 account
- Complete the most recent "getting started guide" tutorial on Amazon EC2 and create all needed web service accounts, authorizations, and keypairs
- Download and install the Amazon EC2 Python library
- Download the Amazon EC2 MPI cluster management scripts
Launching the EC2 nodes
First, unzip the cluster management scripts and modify the configuration parameters in '''EC2config.py''', substituting your own EC2 keys and changing the cluster size if desired:
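The exact contents of '''EC2config.py''' didn't survive formatting here, but as a rough sketch (these variable names are my assumptions, not necessarily the script's actual ones), the file holds values along these lines:

```python
# Hypothetical sketch of EC2config.py -- variable names are guesses;
# check the actual file in the downloaded scripts before editing.
AWS_ACCESS_KEY_ID = "YOUR_ACCESS_KEY_ID"          # from your AWS account
AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_ACCESS_KEY"
KEYNAME = "gsg-keypair"            # keypair created in the getting started guide
IMAGE_ID = "ami-3e836657"          # the public MPI compute node AMI
DEFAULT_CLUSTER_SIZE = 5           # total nodes, master included (max 20 in beta)
```

The cluster size here includes the master, so a value of 5 launches one master and four workers.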
Launch the EC2 cluster by running the '''ec2-start_cluster.py''' script from your local machine:
peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ./ec2-start-cluster.py
image ami-3e836657
master image ami-3e836657
----- starting master -----
RESERVATION  r-275eb84e  027811143419  default
INSTANCE  i-0ed33167  ami-3e836657  pending
----- starting workers -----
RESERVATION  r-265eb84f  027811143419  default
INSTANCE  i-01d33168  ami-3e836657  pending
INSTANCE  i-00d33169  ami-3e836657  pending
INSTANCE  i-03d3316a  ami-3e836657  pending
INSTANCE  i-02d3316b  ami-3e836657  pending
Verify the EC2 nodes are running with '''./ec2-check-instances.py''':
peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ./ec2-check-instances.py
----- listing instances -----
RESERVATION  r-aec420c7  027811143419  default
INSTANCE  i-ab41a6c2  ami-3e836657  domU-12-31-33-00-02-5A.usma1.compute.amazonaws.com  running
INSTANCE  i-aa41a6c3  ami-3e836657  domU-12-31-33-00-01-E3.usma1.compute.amazonaws.com  running
INSTANCE  i-ad41a6c4  ami-3e836657  domU-12-31-33-00-03-AA.usma1.compute.amazonaws.com  running
INSTANCE  i-ac41a6c5  ami-3e836657  domU-12-31-33-00-04-19.usma1.compute.amazonaws.com  running
INSTANCE  i-af41a6c6  ami-3e836657  domU-12-31-33-00-03-E3.usma1.compute.amazonaws.com  running
Cluster Configuration and Booting MPI
Run '''ec2-mpi-config.py''' to configure MPI on the nodes; this will take a minute or two depending on the number of nodes.
peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ./ec2-mpi-config.py
---- MPI Cluster Details ----
Numer of nodes = 5
Instance= i-ab41a6c2 hostname= domU-12-31-33-00-02-5A.usma1.compute.amazonaws.com state= running
Instance= i-aa41a6c3 hostname= domU-12-31-33-00-01-E3.usma1.compute.amazonaws.com state= running
Instance= i-ad41a6c4 hostname= domU-12-31-33-00-03-AA.usma1.compute.amazonaws.com state= running
Instance= i-ac41a6c5 hostname= domU-12-31-33-00-04-19.usma1.compute.amazonaws.com state= running
Instance= i-af41a6c6 hostname= domU-12-31-33-00-03-E3.usma1.compute.amazonaws.com state= running
The master node is ec2-72-44-46-78.z-2.compute-1.amazonaws.com
...
Configuration complete, ssh into the master node as lamuser and boot the cluster:
$ ssh email@example.com
> mpdboot -n 5 -f mpd.hosts
> mpdtrace
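The '''mpd.hosts''' file referenced above (it also shows up later in the master's home directory listing) is just a plain text file in standard mpdboot format: one node hostname per line. For this run it would presumably contain the four worker hostnames, with the master acting as the local mpd:

```
domU-12-31-33-00-01-E3
domU-12-31-33-00-03-E3
domU-12-31-33-00-03-AA
domU-12-31-33-00-04-19
```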
Login to the master node, boot the MPI cluster, and test the connectivity:
peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ssh firstname.lastname@example.org

Sample Fedora Core 6 + MPICH2 + Numpy/PyMPI compute node image
http://www.datawrangling.com/on-demand-mpi-cluster-with-python-and-ec2-part-1-of-3

---- Modified From Marcin's Cool Images: Cool Fedora Core 6 Base + Updates Image v1.0 ---
see http://developer.amazonwebservices.com/connect/entry.jspa?externalID=554&categoryID=101
Like Marcin's image, standard disclaimer applies, use as you please...

Amazon EC2 MPI Compute Node Image
Copyright (c) 2006 DataWrangling. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

  * Redistributions of source code must retain the above copyright notice,
    this list of conditions and the following disclaimer.
  * Redistributions in binary form must reproduce the above copyright notice,
    this list of conditions and the following disclaimer in the documentation
    and/or other materials provided with the distribution.
  * Neither the name of the DataWrangling nor the names of any contributors
    may be used to endorse or promote products derived from this software
    without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

[lamuser@domU-12-31-33-00-02-5A ~]$ mpdboot -n 5 -f mpd.hosts
[lamuser@domU-12-31-33-00-02-5A ~]$ mpdtrace
domU-12-31-33-00-02-5A
domU-12-31-33-00-01-E3
domU-12-31-33-00-03-E3
domU-12-31-33-00-03-AA
domU-12-31-33-00-04-19
The results of the mpdtrace command show we have an MPI cluster running on 5 nodes. In the next section, we will verify that we can run some basic MPI tasks. For more detailed information on these mpd commands (and MPI in general), see the MPICH2 documentation.
Testing the MPI Cluster
Next we execute a sample C program bundled with MPICH2 which estimates pi using the cluster:
[lamuser@domU-12-31-33-00-02-5A ~]$ mpiexec -n 5 /usr/local/src/mpich2-1.0.5/examples/cpi
Process 0 of 5 is on domU-12-31-33-00-02-5A
Process 1 of 5 is on domU-12-31-33-00-01-E3
Process 2 of 5 is on domU-12-31-33-00-03-E3
Process 3 of 5 is on domU-12-31-33-00-03-AA
Process 4 of 5 is on domU-12-31-33-00-04-19
pi is approximately 3.1415926544231230, Error is 0.0000000008333298
wall clock time = 0.007539
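The cpi program approximates pi by midpoint-rule integration of 4/(1+x^2) over [0,1], splitting the intervals among the MPI processes and combining the partial sums with MPI_Reduce. A serial Python sketch of the same computation (no MPI, just the math; the default interval count of 10000 here matches the error shown in the output above):

```python
# Midpoint-rule approximation of pi = integral of 4/(1+x^2) from 0 to 1.
# This is the sum cpi computes; cpi just assigns a strided subset of the
# intervals to each MPI process and reduces the partial sums on rank 0.
def estimate_pi(n=10000):
    h = 1.0 / n                      # width of each interval
    total = 0.0
    for i in range(1, n + 1):
        x = h * (i - 0.5)            # midpoint of interval i
        total += 4.0 / (1.0 + x * x)
    return h * total

print(abs(estimate_pi() - 3.141592653589793))  # error on the order of 1e-9
```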
Test the message travel time for the ring of nodes you just created:
[lamuser@domU-12-31-33-00-02-5A ~]$ mpdringtest 100
time for 100 loops = 0.14577794075 seconds
Verify that the cluster can run a multiprocess job:
[lamuser@domU-12-31-33-00-02-5A ~]$ mpiexec -l -n 5 hostname
3: domU-12-31-33-00-03-AA
0: domU-12-31-33-00-02-5A
1: domU-12-31-33-00-01-E3
4: domU-12-31-33-00-04-19
2: domU-12-31-33-00-03-E3
Let's verify that the pyMPI install is working with our running cluster of 5 nodes. Execute the following on the master node:
[lamuser@domU-12-31-33-00-02-5A ~]$ mpirun -np 5 pyMPI /usr/local/src/pyMPI-2.4b2/examples/fractal.py
Starting computation (groan)
process 1 done with computation!!
process 3 done with computation!!
process 4 done with computation!!
process 2 done with computation!!
process 0 done with computation!!
Header length is 54
BMP size is (400, 400)
Data length is 480000
[lamuser@domU-12-31-33-00-02-5A ~]$ ls
hosts  id_rsa.pub  mpd.hosts  output.bmp
This produced the following fractal image (output.bmp):
We will show some more examples using PyMPI in the next post.
Changing the Cluster Size
If we want to modify the number of nodes in the cluster, we first need to take down the running MPI ring from the master node as follows:
[lamuser@domU-12-31-33-00-02-5A ~]$ mpdallexit
[lamuser@domU-12-31-33-00-02-5A ~]$ mpdcleanup
Once this is done, you can start additional instances of the public AMI from your local machine, then re-run the '''ec2-mpi-config.py''' script and reboot the cluster.
Run '''ec2-stop-cluster.py''' to stop all EC2 MPI nodes. If you just want to stop the slave nodes, run '''ec2-stop-slaves.py'''.
peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ./ec2-stop-cluster.py
This will stop all your EC2 MPI images, are you sure (yes/no)? yes
----- listing instances -----
RESERVATION  r-aec420c7  027811143419  default
INSTANCE  i-ab41a6c2  ami-3e836657  domU-12-31-33-00-02-5A.usma1.compute.amazonaws.com  running
INSTANCE  i-aa41a6c3  ami-3e836657  domU-12-31-33-00-01-E3.usma1.compute.amazonaws.com  running
INSTANCE  i-ad41a6c4  ami-3e836657  domU-12-31-33-00-03-AA.usma1.compute.amazonaws.com  running
INSTANCE  i-ac41a6c5  ami-3e836657  domU-12-31-33-00-04-19.usma1.compute.amazonaws.com  running
INSTANCE  i-af41a6c6  ami-3e836657  domU-12-31-33-00-03-E3.usma1.compute.amazonaws.com  running
---- Stopping instance Id's ----
Stoping Instance Id = i-ab41a6c2
Stoping Instance Id = i-aa41a6c3
Stoping Instance Id = i-ad41a6c4
Stoping Instance Id = i-ac41a6c5
Stoping Instance Id = i-af41a6c6
Waiting for shutdown ....
----- listing new state of instances -----
RESERVATION  r-aec420c7  027811143419  default
INSTANCE  i-ab41a6c2  ami-3e836657  domU-12-31-33-00-02-5A.usma1.compute.amazonaws.com  shutting-down
INSTANCE  i-aa41a6c3  ami-3e836657  domU-12-31-33-00-01-E3.usma1.compute.amazonaws.com  shutting-down
INSTANCE  i-ad41a6c4  ami-3e836657  domU-12-31-33-00-03-AA.usma1.compute.amazonaws.com  shutting-down
INSTANCE  i-ac41a6c5  ami-3e836657  domU-12-31-33-00-04-19.usma1.compute.amazonaws.com  shutting-down
INSTANCE  i-af41a6c6  ami-3e836657  domU-12-31-33-00-03-E3.usma1.compute.amazonaws.com  shutting-down