By Moses Temu
What is a cluster?
Put simply, a cluster is a group of computers, called computational nodes, that are networked together in a way that allows them to act as one.
So why do we need clusters?
This is a bit of an extreme example, but I think it gets the idea across really well. Take the movie Avatar. James Cameron, its director, said each frame in the movie took between 30 and 50 hours to render. Movies run at 24 frames per second, and the extended version of Avatar is 178 minutes long. Doing the math, using an average of 40 hours per frame, you’ll find it would have taken over 1,170 years to render the movie if it had been done one frame at a time on a single processor. Weta Digital’s 40,000-processor render farm made it possible to complete within a few years.
This is what clusters are for. They provide the computational power needed to enable projects that would otherwise be impossible, or would take much longer.
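For anyone who wants to check the math, here is a small sketch of the calculation. It uses only the figures quoted above; the 40 hours per frame is the assumed average, and a year is taken as 365 days.

```python
# Back-of-the-envelope render time for Avatar on a single processor,
# using the figures quoted in the text.
RUNTIME_MINUTES = 178      # extended version
FRAMES_PER_SECOND = 24
HOURS_PER_FRAME = 40       # assumed average of the quoted 30 to 50 hours

frames = RUNTIME_MINUTES * 60 * FRAMES_PER_SECOND
total_hours = frames * HOURS_PER_FRAME
years_single = total_hours / (24 * 365)  # a year taken as 365 days

print(f"{frames:,} frames, {total_hours:,} render hours")
print(f"about {years_single:.0f} years on a single processor")
```

That works out to 256,320 frames and a little over 1,170 years.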
The iHub Cluster
The iHub cluster is a GPU cluster, which means that each node contains one or more graphics processing units that are used to do the computation. This differs from a CPU cluster, where all the computation is done on CPUs. Because GPUs are highly parallel processors, they have a significant advantage over CPUs when it comes to raw computational power. General-purpose computing on graphics processing units (GPGPU) is quickly gaining popularity for this reason.
How much faster are graphics processors when it comes to calculations?
The fastest hex-core processor of 2010, Intel's Core i7 980XE, is capable of 109 gigaFLOPS (FLoating-point Operations Per Second). The Nvidia GTX 480, the fastest single-GPU card from the same year, is capable of 672 gigaFLOPS. There’s more. The GFLOPS/watt values for the i7 and the GTX 480 are 0.838 and 2.688 respectively. So not only are GPUs more powerful, they are more efficient too. Each HD 7970 used in the iHub cluster is capable of close to 950 gigaFLOPS in double precision.
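To put those numbers side by side, here is a rough sketch of the comparison. It reads the i7 figure as 109 gigaFLOPS, which is the only reading consistent with the quoted 0.838 GFLOPS/watt; the implied wattages are derived from the quoted figures, not measured.

```python
# Compare the quoted CPU and GPU figures (all values taken from the text).
cpu_gflops = 109.0            # Core i7 980XE, read as gigaFLOPS
gpu_gflops = 672.0            # GTX 480
cpu_gflops_per_watt = 0.838
gpu_gflops_per_watt = 2.688

# How many times faster, and how many times more efficient, the GPU is.
speedup = gpu_gflops / cpu_gflops
efficiency_gain = gpu_gflops_per_watt / cpu_gflops_per_watt

# Power draw implied by the quoted throughput and efficiency figures.
cpu_watts = cpu_gflops / cpu_gflops_per_watt
gpu_watts = gpu_gflops / gpu_gflops_per_watt

print(f"GPU is about {speedup:.1f}x faster and {efficiency_gain:.1f}x more efficient")
print(f"Implied power: CPU about {cpu_watts:.0f} W, GPU about {gpu_watts:.0f} W")
```

On these figures the GPU comes out roughly six times faster and about three times more efficient per watt.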
The computational nodes are fully-fledged computers of their own, and the hardware that makes up each individual node is quite impressive: an Asus P8P67 WS Revolution motherboard, an Intel Core i5-2500K, 8GB of DDR3 memory, three HD 7970 graphics cards, an InfiniBand adapter and a 128GB Samsung SSD, all powered by an Enermax MaxRevo 1350 watt power supply. The iHub cluster will be made up of four of these nodes, all connected to and controlled by an Intel Modular Server.
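As a rough idea of what that adds up to, the sketch below multiplies out the GPU figures quoted above. It is a theoretical peak only: it ignores the CPUs and assumes perfect scaling, which real workloads will not reach.

```python
# Theoretical peak GPU throughput of the planned cluster, from the quoted figures.
NODES = 4
GPUS_PER_NODE = 3
GFLOPS_PER_GPU = 950   # approximate double precision figure for one HD 7970

total_gpus = NODES * GPUS_PER_NODE
peak_gflops = total_gpus * GFLOPS_PER_GPU
peak_tflops = peak_gflops / 1000

print(f"{total_gpus} GPUs, roughly {peak_tflops} teraFLOPS peak")
```

That is twelve GPUs and roughly 11.4 teraFLOPS on paper.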
How is the Cluster Managed, Serviced and Monitored?
The Intel Modular Server is where the cluster is managed, serviced and monitored. It has an integrated SAN (Storage Area Network) with support for up to 14 2.5” drives, and a Management Module, which is used to configure the internal SAN as well as the service modules and the compute servers. The service modules include a Gigabit Ethernet Switch Module, which provides switching and routing functionality to the modular server and a way for the compute servers to connect to one another, and a Storage Control Module, which controls the partitioning of the drives in the server. Up to two of these service modules are supported. Up to six compute servers can be installed, each with two Intel Xeon processors. The iHub Cluster currently has three compute servers installed.
A Cisco SFS 7000 Series InfiniBand switch provides high-speed communication between the computational nodes.
iHub Cluster Software
In terms of software, the cluster is running on Debian 6 and is diskful, which means that all the nodes have a full operating system installed on their individual drives. There are two reasons why I went with Debian on the cluster. The first is that Debian has a relatively long release cycle, and though this means it may not be at the cutting edge, it does mean that it’s very stable. The second reason is familiarity with Debian-based systems. If an issue were to occur, it’s important that support can be provided, and an unfamiliar operating system would make support more difficult.
With all the hardware in a cluster and the many tasks it will potentially have to complete, there needs to be a way to ensure that the cluster is performing as efficiently as possible. Tasks on the cluster need to be balanced so that resources aren’t wasted and performance is maximized. Clusters are also likely to have more than one user, which means that each user needs resources allocated to them in order for their jobs to be carried out. These needs are met through the use of cluster management software.
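To make the balancing idea concrete, here is a toy sketch that hands each job to whichever node currently has the least work. The job names, GPU-hour estimates and node names are all made up for illustration, and real cluster management software uses much richer policies (queues, priorities, per-user limits) than this.

```python
import heapq

def assign_jobs(job_gpu_hours, node_names):
    """Greedy balancing: largest job first, each to the least-loaded node."""
    # Min-heap of (current load in GPU-hours, node name).
    heap = [(0.0, name) for name in node_names]
    heapq.heapify(heap)
    assignment = {name: [] for name in node_names}
    by_size = sorted(job_gpu_hours.items(), key=lambda kv: kv[1], reverse=True)
    for job, hours in by_size:
        load, name = heapq.heappop(heap)       # least-loaded node so far
        assignment[name].append(job)
        heapq.heappush(heap, (load + hours, name))
    loads = {name: load for load, name in heap}
    return assignment, loads

# Hypothetical jobs from several users, with estimated GPU-hours each.
jobs = {"alice-sim": 8, "bob-render": 5, "carol-train": 4,
        "dave-fft": 3, "erin-md": 2, "frank-mc": 2}
nodes = ["node1", "node2", "node3", "node4"]

assignment, loads = assign_jobs(jobs, nodes)
for name in nodes:
    print(name, assignment[name], loads[name])
```

With these made-up jobs no node sits idle, and the busiest node carries only the single largest job.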
The management software installed on the cluster is SLURM. SLURM scales well, so it can manage a cluster at this small scale while still allowing for expansion of the cluster in the future.
It supports a wide variety of operating systems and, in the case of Debian, was available through the package manager once the appropriate repositories were added, which made installation simple. It also did not require any modifications to the kernel.