MPI Cluster with Python and Amazon EC2 (part 2 of 3)
Today I posted a public AMI which can be used to run a small Beowulf cluster on Amazon EC2 and do some parallel computations with C, Fortran, or Python. If you prefer another language (Java, Ruby, etc.), just install the appropriate MPI library and rebundle the EC2 image. The following set of Python scripts automates the launch and configuration of an MPI cluster on EC2 (currently limited to 20 nodes while EC2 is in beta):
Update (3-19-08): Code for running a cluster with large or xlarge 64 bit EC2 instances is now hosted on google code. The new images include NFS, ganglia, IPython1, and other useful python packages.
http://code.google.com/p/elasticwulf/
Update (7-24-07): I’ve made some important bug fixes to the scripts to address issues mentioned in the comments. See the README file for details
The file contains some quick scripts I threw together using the AWS Python example code. This is the approach I’m using to bootstrap an MPI cluster until one of the major Linux cluster distros is ported to run on EC2. Details on what is included in the public AMI were covered in Part 1 of the tutorial; Part 3 will cover cluster operation on EC2 in more detail and show how to use Python to carry out some neat parallel computations.
The cluster launch process is pretty simple once you have an Amazon EC2 account and keys: just download the Python scripts and you can be running a compute cluster in a few minutes. In a later post I will look at cluster bandwidth and performance in detail. If you have only an occasional need for running large jobs, $2/hour for a 20-node MPI cluster on EC2 is not a bad deal considering the ~$20K price for building your own comparable system.
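As a rough check on that comparison, the arithmetic can be sketched in a few lines of Python. The $0.10 per node-hour rate is simply what $2/hour for 20 nodes implies, the $20K hardware figure is the estimate quoted above, and the function names are hypothetical. Power, hosting, and admin time are ignored.

```python
# Back-of-the-envelope cost comparison (a sketch, not a pricing tool).
# Assumption: 10 cents per node-hour, which is what "$2/hour for 20 nodes" implies.
RATE_CENTS_PER_NODE_HOUR = 10
OWN_CLUSTER_COST_DOLLARS = 20000  # the ~$20K estimate from the text


def cluster_cost_dollars(nodes, hours):
    """Cost in dollars of renting `nodes` instances for `hours` hours."""
    return nodes * hours * RATE_CENTS_PER_NODE_HOUR / 100.0


def break_even_hours(nodes):
    """Hours of rented cluster time that would cost as much as the hardware."""
    return OWN_CLUSTER_COST_DOLLARS / cluster_cost_dollars(nodes, 1)


if __name__ == "__main__":
    print("20 nodes for 1 hour: $%.2f" % cluster_cost_dollars(20, 1))
    print("break-even: %.0f cluster-hours" % break_even_hours(20))
```

Under those assumptions the break-even point is about 10,000 hours of 20-node cluster time, which is more than a year of continuous use.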
Prerequisites:
- Get a valid Amazon EC2 account
- Complete the most recent “getting started guide” tutorial on Amazon EC2 and create all needed web service accounts, authorizations, and keypairs
- Download and install the Amazon EC2 Python library
- Download the Amazon EC2 MPI cluster management scripts
Launching the EC2 nodes
First, unzip the cluster management scripts and modify the configuration parameters in EC2config.py, substituting your own EC2 keys and changing the cluster size if desired:
#replace these with your AWS keys
AWS_ACCESS_KEY_ID = 'YOUR_KEY_ID_HERE'
AWS_SECRET_ACCESS_KEY = 'YOUR_KEY_HERE'
#change this to your keypair location (see the EC2 getting started guide tutorial on using ec2-add-keypair)
KEYNAME = "gsg-keypair"
KEY_LOCATION = "/Users/pskomoroch/id_rsa-gsg-keypair"
# remove these next two lines when you've updated your credentials.
print "update %s with your AWS credentials" % sys.argv[0]
sys.exit()
MASTER_IMAGE_ID = "ami-3e836657"
IMAGE_ID = "ami-3e836657"
DEFAULT_CLUSTER_SIZE = 5
Launch the EC2 cluster by running the ec2-start-cluster.py script from your local machine:
peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ./ec2-start-cluster.py
image ami-3e836657
master image ami-3e836657
----- starting master -----
RESERVATION r-275eb84e 027811143419 default
INSTANCE i-0ed33167 ami-3e836657 pending
----- starting workers -----
RESERVATION r-265eb84f 027811143419 default
INSTANCE i-01d33168 ami-3e836657 pending
INSTANCE i-00d33169 ami-3e836657 pending
INSTANCE i-03d3316a ami-3e836657 pending
INSTANCE i-02d3316b ami-3e836657 pending
Verify the EC2 nodes are running with ./ec2-check-instances.py:
peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ./ec2-check-instances.py
----- listing instances -----
RESERVATION r-aec420c7 027811143419 default
INSTANCE i-ab41a6c2 ami-3e836657 domU-12-31-33-00-02-5A.usma1.compute.amazonaws.com running
INSTANCE i-aa41a6c3 ami-3e836657 domU-12-31-33-00-01-E3.usma1.compute.amazonaws.com running
INSTANCE i-ad41a6c4 ami-3e836657 domU-12-31-33-00-03-AA.usma1.compute.amazonaws.com running
INSTANCE i-ac41a6c5 ami-3e836657 domU-12-31-33-00-04-19.usma1.compute.amazonaws.com running
INSTANCE i-af41a6c6 ami-3e836657 domU-12-31-33-00-03-E3.usma1.compute.amazonaws.com running
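If you want to check the listing programmatically rather than by eye, something like the following could work. This is only a sketch that parses the text shown above: it assumes instance lines have the whitespace-separated fields seen in the two listings (with the hostname missing while an instance is still pending), and the function names are hypothetical rather than part of the scripts.

```python
# Sketch: parse the text printed by ec2-check-instances.py.
# Assumption: lines look like "INSTANCE <id> <image> [<hostname>] <state>",
# as in the listings above; this is not an official output format.

def parse_instances(listing):
    """Return a list of (instance_id, hostname_or_None, state) tuples."""
    instances = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) in (4, 5) and fields[0] == "INSTANCE":
            instance_id = fields[1]
            hostname = fields[3] if len(fields) == 5 else None
            state = fields[-1]
            instances.append((instance_id, hostname, state))
    return instances


def all_running(listing, expected_nodes):
    """True when exactly `expected_nodes` instances are listed and all are running."""
    instances = parse_instances(listing)
    return (len(instances) == expected_nodes
            and all(state == "running" for _, _, state in instances))


SAMPLE = """\
RESERVATION r-aec420c7 027811143419 default
INSTANCE i-ab41a6c2 ami-3e836657 domU-12-31-33-00-02-5A.usma1.compute.amazonaws.com running
INSTANCE i-aa41a6c3 ami-3e836657 domU-12-31-33-00-01-E3.usma1.compute.amazonaws.com running
"""

if __name__ == "__main__":
    print(all_running(SAMPLE, 2))
```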
Cluster Configuration and Booting MPI
Run ec2-mpi-config.py to configure MPI on the nodes. This will take a minute or two, depending on the number of nodes.
peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ./ec2-mpi-config.py
---- MPI Cluster Details ----
Numer of nodes = 5
Instance= i-ab41a6c2 hostname= domU-12-31-33-00-02-5A.usma1.compute.amazonaws.com state= running
Instance= i-aa41a6c3 hostname= domU-12-31-33-00-01-E3.usma1.compute.amazonaws.com state= running
Instance= i-ad41a6c4 hostname= domU-12-31-33-00-03-AA.usma1.compute.amazonaws.com state= running
Instance= i-ac41a6c5 hostname= domU-12-31-33-00-04-19.usma1.compute.amazonaws.com state= running
Instance= i-af41a6c6 hostname= domU-12-31-33-00-03-E3.usma1.compute.amazonaws.com state= running
The master node is ec2-72-44-46-78.z-2.compute-1.amazonaws.com
…<snip> …
Configuration complete, ssh into the master node as lamuser and boot the cluster:
$ ssh lamuser@ec2-72-44-46-78.z-2.compute-1.amazonaws.com
> mpdboot -n 5 -f mpd.hosts
> mpdtrace
Log in to the master node, boot the MPI cluster, and test the connectivity:
peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ssh lamuser@ec2-72-44-46-78.z-2.compute-1.amazonaws.com
Sample Fedora Core 6 + MPICH2 + Numpy/PyMPI compute node image
http://www.datawrangling.com/on-demand-mpi-cluster-with-python-and-ec2-part-1-of-3.html
—- Modified From Marcin’s Cool Images: Cool Fedora Core 6 Base + Updates Image v1.0 —
see http://developer.amazonwebservices.com/connect/entry.jspa?externalID=554&categoryID=101
Like Marcin’s image, standard disclaimer applies, use as you please…
Amazon EC2 MPI Compute Node Image
Copyright (c) 2006 DataWrangling. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided
with the distribution.
* Neither the name of the DataWrangling nor the names of any
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
[lamuser@domU-12-31-33-00-02-5A ~]$
[lamuser@domU-12-31-33-00-02-5A ~]$ mpdboot -n 5 -f mpd.hosts
[lamuser@domU-12-31-33-00-02-5A ~]$ mpdtrace
domU-12-31-33-00-02-5A
domU-12-31-33-00-01-E3
domU-12-31-33-00-03-E3
domU-12-31-33-00-03-AA
domU-12-31-33-00-04-19
The results of the mpdtrace command show we have an MPI cluster running on 5 nodes. In the next section, we will verify that we can run some basic MPI tasks. For more detailed information on these mpi commands (and MPI in general), see the MPICH2 documentation.
Testing the MPI Cluster
Next we execute a sample C program bundled with MPICH2 which estimates pi using the cluster:
[lamuser@domU-12-31-33-00-02-5A ~]$ mpiexec -n 5 /usr/local/src/mpich2-1.0.5/examples/cpi
Process 0 of 5 is on domU-12-31-33-00-02-5A
Process 1 of 5 is on domU-12-31-33-00-01-E3
Process 2 of 5 is on domU-12-31-33-00-03-E3
Process 3 of 5 is on domU-12-31-33-00-03-AA
Process 4 of 5 is on domU-12-31-33-00-04-19
pi is approximately 3.1415926544231230, Error is 0.0000000008333298
wall clock time = 0.007539
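For readers who have not looked at the cpi source: it approximates pi by numerically integrating 4/(1+x^2) over [0, 1] with the midpoint rule, with each process summing every nprocs-th interval and the partial sums then being combined. The sketch below mimics that partitioning serially in plain Python, with no MPI involved. The interval count n = 10000 and the function names are assumptions for illustration.

```python
import math


def partial_pi(rank, nprocs, n):
    """Midpoint-rule contribution of one 'process' to the integral of 4/(1+x^2) on [0, 1]."""
    h = 1.0 / n
    total = 0.0
    # Each rank handles every nprocs-th interval, starting at its own rank.
    for i in range(rank, n, nprocs):
        x = h * (i + 0.5)
        total += 4.0 / (1.0 + x * x)
    return h * total


def estimate_pi(nprocs=5, n=10000):
    """Combine the partial sums; the MPI program would use a reduction here."""
    return sum(partial_pi(rank, nprocs, n) for rank in range(nprocs))


if __name__ == "__main__":
    estimate = estimate_pi()
    print("pi is approximately %.16f, error is %.16f"
          % (estimate, abs(estimate - math.pi)))
```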
Test the message travel time for the ring of nodes you just created:
[lamuser@domU-12-31-33-00-02-5A ~]$ mpdringtest 100
time for 100 loops = 0.14577794075 seconds
Verify that the cluster can run a multiprocess job:
[lamuser@domU-12-31-33-00-02-5A ~]$ mpiexec -l -n 5 hostname
3: domU-12-31-33-00-03-AA
0: domU-12-31-33-00-02-5A
1: domU-12-31-33-00-01-E3
4: domU-12-31-33-00-04-19
2: domU-12-31-33-00-03-E3
Testing PyMPI
Let’s verify that the PyMPI install is working with our running cluster of 5 nodes. Execute the following on the master node:
[lamuser@domU-12-31-33-00-02-5A ~]$ mpirun -np 5 pyMPI /usr/local/src/pyMPI-2.4b2/examples/fractal.py
Starting computation (groan)
process 1 done with computation!!
process 3 done with computation!!
process 4 done with computation!!
process 2 done with computation!!
process 0 done with computation!!
Header length is 54
BMP size is (400, 400)
Data length is 480000
[lamuser@domU-12-31-33-00-02-5A ~]$ ls
hosts id_rsa.pub mpd.hosts output.bmp
This produced the following fractal image (output.bmp):

We will show some more examples using PyMPI in the next post.
Changing the Cluster Size
If we want to modify the number of nodes in the cluster we first need to kill the mpi cluster from the master node as follows:
[lamuser@domU-12-31-33-00-02-5A ~]$ mpdallexit
[lamuser@domU-12-31-33-00-02-5A ~]$ mpdcleanup
Once this is done, you can start additional instances of the public AMI from your local machine, then re-run the ec2-mpi-config.py script and reboot the cluster.
Cluster Shutdown
Run ec2-stop-cluster.py to stop all EC2 MPI nodes. If you just want to stop the slave nodes, run ec2-stop-slaves.py.
peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ./ec2-stop-cluster.py
This will stop all your EC2 MPI images, are you sure (yes/no)? yes
----- listing instances -----
RESERVATION r-aec420c7 027811143419 default
INSTANCE i-ab41a6c2 ami-3e836657 domU-12-31-33-00-02-5A.usma1.compute.amazonaws.com running
INSTANCE i-aa41a6c3 ami-3e836657 domU-12-31-33-00-01-E3.usma1.compute.amazonaws.com running
INSTANCE i-ad41a6c4 ami-3e836657 domU-12-31-33-00-03-AA.usma1.compute.amazonaws.com running
INSTANCE i-ac41a6c5 ami-3e836657 domU-12-31-33-00-04-19.usma1.compute.amazonaws.com running
INSTANCE i-af41a6c6 ami-3e836657 domU-12-31-33-00-03-E3.usma1.compute.amazonaws.com running
---- Stopping instance Id's ----
Stoping Instance Id = i-ab41a6c2
Stoping Instance Id = i-aa41a6c3
Stoping Instance Id = i-ad41a6c4
Stoping Instance Id = i-ac41a6c5
Stoping Instance Id = i-af41a6c6
Waiting for shutdown ….
----- listing new state of instances -----
RESERVATION r-aec420c7 027811143419 default
INSTANCE i-ab41a6c2 ami-3e836657 domU-12-31-33-00-02-5A.usma1.compute.amazonaws.com shutting-down
INSTANCE i-aa41a6c3 ami-3e836657 domU-12-31-33-00-01-E3.usma1.compute.amazonaws.com shutting-down
INSTANCE i-ad41a6c4 ami-3e836657 domU-12-31-33-00-03-AA.usma1.compute.amazonaws.com shutting-down
INSTANCE i-ac41a6c5 ami-3e836657 domU-12-31-33-00-04-19.usma1.compute.amazonaws.com shutting-down
INSTANCE i-af41a6c6 ami-3e836657 domU-12-31-33-00-03-E3.usma1.compute.amazonaws.com shutting-down

Excellent stuff! I’ve gotten started with EC2 and I’ll be trying your images out soon. I doubt that I’ll be trying to make ParallelKnoppix work on EC2, because your approach is the right one, I think. PK is designed to use when the hardware is not known ahead of time. With EC2, the hardware is known, so a tailor-made image is the way to go. Your scripts allow an on-demand cluster to be created in minutes, and that’s all that PK offers, anyway. PK usually needs some remastering so that users can add their own packages. Re-bundling an EC2 image is completely analogous. I’m planning on doing just that, probably starting with your images, and doing some testing of latency on tasks that require different degrees of internode communication. Thanks for all this, it’ll make the rest an easy job.
One question: do you know if something like an NFS shared home directory is possible? Using S3, possibly?
A little report on my trial.
1) ./ec2-start_cluster.py is not always successful in getting the requested number of nodes to come up. The instances sometimes have status “terminated” before anything is done with them.
2) When the 5 nodes all come up, I still get a problem with ./ec2-mpi-config.py requesting a root password:
michael@yosemite:~/ec2/AmazonEC2_MPI_scripts$ ./ec2-mpi-config.py
---- MPI Cluster Details ----
Numer of nodes = 5
Instance= i-e39c7a8a hostname= ec2-72-44-45-138.z-2.compute-1.amazonaws.com state= running
Instance= i-e29c7a8b hostname= ec2-72-44-45-185.z-2.compute-1.amazonaws.com state= running
Instance= i-e59c7a8c hostname= ec2-72-44-45-186.z-2.compute-1.amazonaws.com state= running
Instance= i-e49c7a8d hostname= ec2-72-44-45-122.z-2.compute-1.amazonaws.com state= running
Instance= i-e79c7a8e hostname= ec2-72-44-45-60.z-2.compute-1.amazonaws.com state= running
The master node is ec2-72-44-45-138.z-2.compute-1.amazonaws.com
Writing out mpd.hosts file
nslookup ec2-72-44-45-138.z-2.compute-1.amazonaws.com
(0, ‘Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-138.z-2.compute-1.amazonaws.com\nAddress: 72.44.45.138\n’)
nslookup ec2-72-44-45-185.z-2.compute-1.amazonaws.com
(0, ‘Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-185.z-2.compute-1.amazonaws.com\nAddress: 72.44.45.185\n’)
nslookup ec2-72-44-45-186.z-2.compute-1.amazonaws.com
(0, ‘Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-186.z-2.compute-1.amazonaws.com\nAddress: 72.44.45.186\n’)
nslookup ec2-72-44-45-122.z-2.compute-1.amazonaws.com
(0, ‘Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-122.z-2.compute-1.amazonaws.com\nAddress: 72.44.45.122\n’)
nslookup ec2-72-44-45-60.z-2.compute-1.amazonaws.com
(0, ‘Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-60.z-2.compute-1.amazonaws.com\nAddress: 72.44.45.60\n’)
Warning: Permanently added ‘ec2-72-44-45-138.z-2.compute-1.amazonaws.com,72.44.45.138′ (RSA) to the list of known hosts.
id_rsa.pub 100% 1675 1.6KB/s 00:00
root@ec2-72-44-45-138.z-2.compute-1.amazonaws.com’s password:
This is as far as I can get at the moment. Looks like a minor problem. Cheers, M.
Michael,
I haven’t had the scripts prompt me for a password before. Are you running them from your local machine? The mpi-config script expects the keyname and keypair location to match what was used to start the instance. Take a look at your EC2config.py file and make sure the instances were all started with your own keypair (I used the gsg keypair I created on my laptop in the Amazon “getting started guide” tutorial):
AWS_ACCESS_KEY_ID = 'YOUR_KEY_ID_HERE'
AWS_SECRET_ACCESS_KEY = 'YOUR_KEY_HERE'
MASTER_IMAGE_ID = "ami-3e836657"
IMAGE_ID = "ami-3e836657"
KEYNAME = "gsg-keypair"
KEY_LOCATION = "~/id_rsa-gsg-keypair"
DEFAULT_CLUSTER_SIZE = 5
I’m working on an updated version of the scripts and EC2 image which should make things a bit cleaner. Sorry the code is ugly right now in terms of error handling…I just wanted to toss something together to get people started
Yep, I run the mpi-config script right after creating the instances, doing just what you suggest. The fact that the instances start up at all seems to me to mean that the keypair information is ok. Do you know if anyone but you has been able to launch a cluster? Very cool stuff. I’m going to be looking into making a Debian AMI that works the same way.
Mike Cariaso modified my scripts to fix some path issues and got it working on a windows laptop, he might have also fixed some other errors I didn’t notice. I haven’t had a chance to try them yet, but you can download the modified scripts here:
http://mpiblast.pbwiki.com/AmazonEC2
===== DO NOT USE THESE SCRIPTS! =====
This section of ec2-mpi-config.py is a bit problematic:
os.system('cp %s ~/id_rsa.pub' % KEY_LOCATION )
os.system('cp ~/id_rsa.pub ~/.ssh/id_rsa')
This will clobber any existing RSA key on the initiating machine’s account, and will break normal auth on the next login if you have a different default RSA key!
The script should instead copy the private key directly from KEY_LOCATION to the nodes.
===== DO NOT USE THESE SCRIPTS! =====
Otherwise, way cool. Thanks for putting this tutorial together. We’re trying EC2 clusters out as a way to get quicker feedback from regression tests after changes to our software. Unfortunately, with the one hour granularity I don’t think it will be price competitive. We want 20-100 nodes for about 5 minutes at a time.
Ralph,
Good catch. Thanks for pointing that out. I just lifted those passwordless ssh lines straight from an MPI tutorial.
This might solve the clobbering as well (from http://www.maclife.com/forums/topic/61520):
cat id_rsa.pub >> .ssh/authorized_keys
“The above command will create the “authorized_keys” file in the “.ssh” directory if that file doesn’t already exist, and it will append the new id_rsa.pub file to it if it does already exist.”
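Since the scripts are in Python, the same append-instead-of-overwrite idea could be sketched roughly like this. It is not the code that ended up in the scripts: the function name and paths are hypothetical, and it adds one thing the cat command does not do, which is skipping a key that is already listed.

```python
import os


def append_public_key(pub_key_path, authorized_keys_path):
    """Append a public key to authorized_keys unless it is already listed.

    Returns True if the key was added, False if it was already present.
    """
    with open(pub_key_path) as f:
        key = f.read().strip()

    contents = ""
    if os.path.exists(authorized_keys_path):
        with open(authorized_keys_path) as f:
            contents = f.read()
    if key in [line.strip() for line in contents.splitlines()]:
        return False

    # Create the .ssh directory if needed, readable only by its owner.
    ssh_dir = os.path.dirname(authorized_keys_path)
    if ssh_dir and not os.path.isdir(ssh_dir):
        os.makedirs(ssh_dir, 0o700)

    with open(authorized_keys_path, "a") as f:
        # Make sure we start on a fresh line if the file lacks a trailing newline.
        if contents and not contents.endswith("\n"):
            f.write("\n")
        f.write(key + "\n")
    os.chmod(authorized_keys_path, 0o600)
    return True
```

Existing entries in authorized_keys are left alone, and nothing under ~/.ssh/id_rsa is touched.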
I’ll add that change to the scripts. Good luck with the regression cluster, I heard Oracle developers do something like that using Condor on otherwise idle desktops (see http://www.cs.wisc.edu/condor/doc/nmi-lisa2006-slides.pdf).
-Pete
Yeah, that would work better. Some more detailed comments:
Your image has /home/lamuser/.mpd.conf owned by root. I had to chown it to lamuser before I could start mpd.
Your script passes the public DNS names for the nodes into mpd.hosts. For that to work, a hole has to be opened in the firewall for the ports the MPI daemon is using. A simpler solution is to just pass the internal DNS names. Then all the traffic happens behind the firewall, which probably also improves latency. (Although my ringtest was noticeably slower than yours, averaging 2.2e-3 seconds/loop, so who knows?)
I was surprised that when I originally ran ec2-add-keypair in the EC2 tutorial that it uploaded the public key (ok) and printed out the private key (ok I guess) but didn’t print out the public key locally (weird). Your scripts seem to assume the public key is available as id_rsa.pub on the client machine. Shouldn’t this first be copied either from /root/.ssh/authorized_keys on the master node (as installed by amazon) or retrieved through the query interface?
Is the mutual ssh access required for more than just launching the MPI daemon? If all subsequent traffic goes through the mpi daemons, starting mpd from the client machine, or automatically from the init scripts after pulling mpd.hosts from S3 would save the whole trouble, including uploading the private key at all.
Ralph,
More good points. I’ve been tied up with some other projects, but it sounds like enough feedback is in to make a revised version of the image and scripts. I expect the latency to vary a bit depending on the random EC2 network topology when a cluster is launched…(instances on the same box vs. over ethernet) that might explain the ringtest. The mutual ssh access was set up since we do a lot of file/data shuffling between nodes outside of MPI.
Thanks again, looking forward to hearing how the regression test system works out.
-Pete
Update (7-24-07): I’ve made some important bug fixes to the scripts to address issues mentioned in the comments.
Specific changes made:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=552&categoryID=85
After I run some benchmarks, I’m hoping to find some time to add LAM and OpenMPI to the EC2 image along with NFS configuration, C3 cluster tools, Ganglia, and a benchmarking package.
What about that Part 3?
the first two parts really set the stage … Part 3?
:)
Does the 5 month hiatus in this project mean that it was a bad idea and you guys have learnt enough to waste no more time on it?
Given the virtualization uncertainty, finding the right communication/computation balance for typical MPI programs appears to be very unrewarding. Secondly, MPI development and debug and then QA and scale out are not addressed, which doesn’t bode well. It appears most productive to have a local small cluster for development and debug, and then do QA and scale out on EC2, but some benchmarking numbers would really help.
If EC2 is only robust for embarrassingly parallel problems, then MapReduce style programs are more attractive. There the size of the data set and how well it integrates in a distributed file system appear to be the problems to focus on. Or BOINC like approaches if there is no integrated DFS. Anyone have operational data on these approaches?
Theo,
Sorry for the delay in posting this and responding. I’ve been working on a startup for the past 7 months and was in serious crunch mode. Don’t read too much into the large gap in posts, it is just me working on this as a side-project. I finished moving the blog to another host and finally have some time to get back to the EC2 work. This experience has taught me to never name a series of blog posts “part 1 of N”
You make some excellent points. One thing that has changed since I wrote the first post is that EC2 now offers larger 64bit machine images with better I/O (you can provision an entire physical server and not be limited by sharing network resources in the virtual instance). I’d like to see if this improves the network performance. I’m giving a talk on this in March, so I’m on the hook to have some benchmarks by then.
I also agree on the mapreduce side. For embarrassingly parallel problems, hadoop on ec2 is potentially much more attractive…more robust, easier for most people to program. Ideally, I would like to do some comparisons between the two approaches and run the numbers.
The performance of an EC2 MPI cluster is definitely going to be worse than your own custom hardware, but it still might fit certain niche situations. In my case, I needed to run some MPI code for a large problem and didn’t have access to a large enough cluster. The performance on EC2 was nowhere near what you get on a high-end cluster, but it got the job done for a reasonable price.
This discussion on the beowulf list goes into more detail on the pros/cons:
http://www.beowulf.org/pipermail/beowulf/2008-January/020490.html
-Pete
Can’t get the ec-mpi-config to work. Says list index out of range for mpi-externalnames[0] on line 108
start cluster and check instances are OK so I think that python, EC, elementree
are OK
Any ideas why? Has AWS changed the format of the response you’re parsing (yes I have had a look at the python code but since I haven’t used python before I can’t see anything obvious to me)
BTW you have a typo in mpi config Numer of nodes as opposed to Number of nodes , it even shows in your example above.
Otherwise I like what you’ve done, I’d just like it to work for me.
Thanks,
Pete
Pete found the error: the image IDs he entered into the config module inadvertently contained a capital letter. This doesn’t cause any problems when starting images, since Amazon ignores string case. The corresponding image ID response string from AWS is always lowercase, however, so the Python script’s comparison on the image ID string fails.
In the next version of the scripts, I will handle upper/lowercase differences in the AMI strings. For now, just make sure to use all lowercase or call the Python .lower() method.
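A minimal sketch of that workaround, assuming the scripts end up comparing the configured image ID against the one reported for each instance (the function names and the pair layout here are hypothetical, not taken from the scripts):

```python
def same_image(configured_id, reported_id):
    """Compare two AMI IDs, ignoring letter case and stray whitespace."""
    return configured_id.strip().lower() == reported_id.strip().lower()


def instances_for_image(instances, image_id):
    """Keep the instance IDs whose reported image matches image_id.

    `instances` is a list of (instance_id, reported_image_id) pairs.
    """
    return [iid for iid, ami in instances if same_image(ami, image_id)]


if __name__ == "__main__":
    reported = [("i-ab41a6c2", "ami-3e836657"), ("i-00000000", "ami-ffffffff")]
    # A config value typed with capitals still matches the lowercase response.
    print(instances_for_image(reported, "ami-3E836657"))
```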
Found another typo too, ok I’m nit picking. In the stop-cluster script the message says Stoping as opposed to stopping. A year ago when you first posted this stuff you mentioned that the reason why the non-root user was called lamuser was that the scripts were used for LAM in some previous incarnation. Since I’m actually trying to use LAM, if you have any LAM stuff around that might help me to iron out one or two problems I still have.
Anyway, thanks again,
Pete
No problem, thanks for finding the typos. These were meant to be some quick hacks, but took on a life of their own after a while.
I found this worked for configuring LAM, I’ll send you more details in an email…
The contents of bash_profile should be as follows:
-bash-3.1# more .bash_profile
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi
# User specific environment and startup programs
LAMRSH="ssh -x"
export LAMRSH
LD_LIBRARY_PATH="/usr/local/lam-7.1.2/lib/"
export LD_LIBRARY_PATH
MPICH_PORT_RANGE="2000:8000"
export MPICH_PORT_RANGE
PATH=$PATH:$HOME/bin
PATH=/usr/local/lam-7.1.2/bin:$PATH
MANPATH=/usr/local/lam-7.1.2/man:$MANPATH
export PATH
export MANPATH
Launch the cluster on EC2 and try booting LAM manually:
Why does it ask me for a password when i try to run the ec2-mpi-config.py file.?
it says root@xxx password:
And I get a lot of text on the terminal when I try running the file.
raghav,
I assume you were able to start the instances with ec2-start-cluster.py? The text on the terminal is normal, but it shouldn’t ask you for a password (I should probably add a verbose option instead of streaming out text by default). There was a path issue on windows with an earlier version of the scripts, so that may be the problem.
If you send me the script version number from the README and/or terminal output, I can try to track down what is going on…
peter.skomoroch@gmail.com
-Pete
raghav,
Another suggestion is to make sure the instances are running with ./ec2-check-instances.py and then retry the script, sometimes it takes a while for sshd to start up on EC2.
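If you would rather not retry by hand, one option is to poll the ssh port before running the config script. This is only a sketch with a hypothetical helper name and arbitrary retry settings. Note that an open port 22 just means sshd is accepting connections; it does not prove the keypair is set up correctly.

```python
import socket
import time


def wait_for_port(host, port=22, attempts=10, delay=5.0, timeout=3.0):
    """Return True once a TCP connection to host:port succeeds, else False."""
    for _ in range(attempts):
        try:
            sock = socket.create_connection((host, port), timeout)
        except (socket.error, socket.timeout):
            # Not reachable yet (refused, timed out, DNS not ready): wait and retry.
            time.sleep(delay)
        else:
            sock.close()
            return True
    return False
```

Call it with the public hostname of each node, for example the master hostname printed by ec2-mpi-config.py, and only continue once every node returns True.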
-Pete
Hey guys,
Actually I made a change in the ec2-mpi-cluster.py file. I have no clue about python and I dono why it worked but it worked.
I modified:
template = 'ssh -o "StrictHostKeyChecking no" %(user)s@%(host)s "%(cmd)s"'
to
template = 'ssh -i "/home/id_rsa-gsg-keypair" %(user)s@%(host)s "%(cmd)s"'
and
template = '%(cmd)s %(switches)s -o "StrictHostKeyChecking no" %(src)s %(user)s@%(host)s:%(dest)s'
to
template = '%(cmd)s %(switches)s -i "/home/id_rsa-gsg-keypair" %(src)s %(user)s@%(host)s:%(dest)s'
And it started working perfectly fine. I was able to log in to the master node and the pi problem executed perfectly fine.
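For anyone making the same change, here is a sketch of how the two options could be combined instead of swapping one for the other. It assumes the key path comes from something like KEY_LOCATION in EC2config.py rather than being hard-coded; the names SSH_TEMPLATE and ssh_command are hypothetical. The path is expanded in Python because a ~ inside double quotes is not expanded by the shell.

```python
import os

# Keeps the original host-key option and adds an explicit identity file.
SSH_TEMPLATE = ('ssh -i "%(key)s" -o "StrictHostKeyChecking no" '
                '%(user)s@%(host)s "%(cmd)s"')


def ssh_command(key, user, host, cmd):
    """Build the ssh command line for running `cmd` on `host` as `user`."""
    return SSH_TEMPLATE % {
        "key": os.path.expanduser(key),  # turn ~/... into an absolute path
        "user": user,
        "host": host,
        "cmd": cmd,
    }


if __name__ == "__main__":
    print(ssh_command("/home/id_rsa-gsg-keypair", "root",
                      "ec2-72-44-46-78.z-2.compute-1.amazonaws.com", "hostname"))
```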
Thanks a lot guys
Cheers,
Raghav
Thanks pete. For your prompt reply!!
Thanks Pete. I wish I had made the PyCon session, but these posts have been very helpful. The cluster went up pretty quickly and I have already used it to crunch a few minor data runs.
In setting everything up I also ran into a similar problem as Raghav and ended up solving it in a similar manner by forcing the -i credentials switch. I imagine it has something to do with the way I configured and placed my certs.
i am trying to compile a simple c mpi file “hellompi.c” using the command:
why does it give me the following error?
/usr/bin/ld: cannot open output file /usr/hellompi: Permission denied
collect2: ld returned 1 exit status
how do I get root privileges?
Raghav,
You can ssh in as root instead of lamuser, or compile the output file into your home directory.
Check out the new AMI and management code:
http://www.datawrangling.com/pycon-2008-elasticwulf-slides.html
The new AMI includes a preconfigured NFS mounted directory /home/beowulf. If you compiled the file there, hellompi would be available on all nodes.
Note that the new images default to the ‘large’ instance type, which costs $0.40/hour for each node.
-Pete
Peter,
Very useful tool! I’ve gotten a cluster up and running using the small instance type but am having difficulty launching the _64 AMIs.
$ ./ec2-start-cluster.py
m1.large
image ami-eb13f682
master image ami-e813f681
----- starting master -----
Traceback (most recent call last):
File "./ec2-start-cluster.py", line 39, in ?
master_response = conn.run_instances(imageId=MASTER_IMAGE_ID, minCount=1, maxCount=1, keyName= KEYNAME, instanceType=INSTANCE_TYPE )
TypeError: run_instances() got an unexpected keyword argument 'instanceType'
If I try to start the cluster without passing an INSTANCE_TYPE arg I get the following:
$ ./ec2-start-cluster.py
m1.large
image ami-eb13f682
master image ami-e813f681
----- starting master -----
InvalidParameterValue: The requested instance type's architecture (i386) does not match the architecture in the manifest for ami-e813f681 (x86_64)
----- starting workers -----
InvalidParameterValue: The requested instance type's architecture (i386) does not match the architecture in the manifest for ami-eb13f682 (x86_64)
Any ideas? Thanks!
Patrick,
Did you start with a clean install of the 64 bit scripts? I made some changes to EC2.py in the new scripts to handle the new instance types…
Peter:
I am diving into Hadoop with Map/Reduce as we speak. As you know Google implemented its environment in C++, so I was a bit disappointed that Hadoop had chosen Java VM to do its bidding. Java makes interfacing with hardcore numerical operations much harder. The particular problems I am looking at are large scale Lanczos solvers to find eigen values/vectors of large systems of equations. These systems are of interest in advertising, quantitative finance, and sensor networks. Problem is that they all are environments in which latency is of the essence. So you have a capacity component in terms of the size of the system and a latency issue in terms of the data rate coming in and the opportunity cost for somebody to get to the answer faster.
I would be interested in working on this particular benchmark problem: pick a big eigenvalue/eigenvector problem and solve it on a cluster, EC2, and via Hadoop/MapReduce. Clearly this is going to be a lot of work, so this should be publishing worthy. I am sure many folks would be interested in this experiment, so let me know if this is something that you could invest time in.
Theo
Thanks, Peter. The original EC2.py was the problem. I now have the large AMIs up and running. Thanks again for the article and help!
Patrick