Cluster Hardware

Cluster Specifications

Basic Details

Although several years old, Gemini is still a state-of-the-art high-performance computing resource. The cluster consists of 18 compute nodes, plus two nodes that serve web and database applications. Each node contains two 8-core Xeon processors with 64 GB RAM and 4 TB of local scratch space. In the default configuration, 4 NVIDIA M2075 GPUs are attached to each node over PCIe through a 16-slot C410x expansion chassis. High-bandwidth, low-latency 40 Gbps QDR InfiniBand interfaces are used for internode communication. There is also a dedicated gigabit Ethernet network for traditional node communication over TCP.
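
As a rough illustration of how this layout appears to an application, the following CUDA host-code sketch enumerates the devices a node can see. The file name is hypothetical and this is not a supported tool; in the default configuration it would report the 4 attached GPUs, but the count depends on how the C410x is currently mapped.

    // device_query.cu -- hypothetical example name; minimal sketch, not a supported tool.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int deviceCount = 0;
        cudaError_t err = cudaGetDeviceCount(&deviceCount);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        // In the default configuration this should report 4 devices per node.
        std::printf("GPUs visible on this node: %d\n", deviceCount);

        for (int i = 0; i < deviceCount; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            std::printf("  Device %d: %s, %.1f GB, PCI bus %d\n", i, prop.name,
                        prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0), prop.pciBusID);
        }
        return 0;
    }

Compiled with nvcc (for example, nvcc device_query.cu -o device_query), this can be run on any node to confirm how many GPUs the chassis is currently presenting to it.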

Two MD1200 storage units in a RAID 5 configuration provide 40 TB of unified work space accessible to all of the nodes in the cluster. As mentioned above, each node is also provisioned with 4 TB of local scratch space for applications that need to write out to disk frequently. Gemini uses SLURM as its job scheduler.

Some Finer Details

Because the GPUs are not physically inside the compute nodes, they can be reconfigured on the fly to provide up to 8 GPUs to a single node. All of the GPUs sit on the same PCIe switch within the C410x housing unit, which means they can access memory across devices via Unified Virtual Addressing without passing through the host bus controller (also called peer-to-peer communication).
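
As a minimal sketch of how an application might use this, the CUDA snippet below checks whether two of the attached devices can reach each other, enables peer access in both directions, and performs a direct device-to-device copy. The file name is hypothetical, and the code assumes the node currently has at least two GPUs (devices 0 and 1) mapped to it.

    // p2p_copy.cu -- hypothetical example name; a sketch only, assuming devices 0 and 1
    // are both mapped to this node and support peer access.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int can01 = 0, can10 = 0;
        cudaDeviceCanAccessPeer(&can01, 0, 1);   // can device 0 reach device 1's memory?
        cudaDeviceCanAccessPeer(&can10, 1, 0);
        if (!can01 || !can10) {
            std::fprintf(stderr, "Peer access not available between devices 0 and 1\n");
            return 1;
        }

        const size_t bytes = 64 * 1024;          // a 64 KB transfer
        float *buf0 = nullptr, *buf1 = nullptr;

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);        // second argument is a reserved flag (must be 0)
        cudaMalloc(&buf0, bytes);

        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
        cudaMalloc(&buf1, bytes);

        // With peer access enabled, this copy moves GPU-to-GPU across the PCIe switch
        // in the C410x rather than staging through host memory.
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
        cudaDeviceSynchronize();

        cudaFree(buf1);
        cudaSetDevice(0);
        cudaFree(buf0);
        return 0;
    }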

Additionally, the software stack that supports the InfiniBand interfaces can be adjusted to allow direct communication between GPUs on different nodes via GPUDirect Remote Direct Memory Access (GPUDirect RDMA). In this configuration, all 72 GPUs in the cluster can access each other's data with much lower latency than would otherwise be possible (the average latency on Gemini for GPUDirect RDMA transfers over InfiniBand is ~7 µs for 64 KB messages).
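
In practice this capability is usually reached through a CUDA-aware MPI library with GPUDirect support. The sketch below assumes such a build is available; it passes device pointers directly to MPI so the transfer can move GPU-to-GPU over the InfiniBand fabric without staging through host memory. The file name is hypothetical.

    // grdma_pingpong.cu -- hypothetical example name; a sketch only, assuming a CUDA-aware
    // MPI build with GPUDirect RDMA support is available on the cluster.
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 16 * 1024;             // 16K floats = 64 KB, the message size quoted above
        float *devBuf = nullptr;
        cudaMalloc(&devBuf, count * sizeof(float));

        if (rank == 0) {
            // With a CUDA-aware, GPUDirect-enabled MPI, the device pointer is handed
            // straight to MPI_Send; no intermediate copy to a host buffer is required.
            MPI_Send(devBuf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(devBuf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaFree(devBuf);
        MPI_Finalize();
        return 0;
    }

Run with two ranks placed on different nodes, whether the transfer actually uses GPUDirect RDMA depends on how the MPI library and InfiniBand stack are configured, as described above.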

For applications that are easily distributed, or for running a large number of independent tasks, the entire cluster can function as a traditional batch-processing cluster. At its peak computing capacity, Gemini provides over 72 teraFLOPS of computing power and consumes about 18,000 W of power.

The cluster is managed with Bright Cluster Manager, which allows nodes to be provisioned quickly and the operating system (CentOS 7), drivers, and other components of the software stack to be reconfigured to meet the needs of a particular application.

The Resources and Help pages also give more implementation-specific details on the various technologies and software currently used on Gemini.