Current Hardware Configuration

The Colonial One High-Performance Cluster currently consists of 213 compute nodes accessible through four login nodes. Login nodes provide remote access through SSH, and file transfer services through SCP/SFTP and Globus. The cluster uses Dell C8000 chassis, with both C8220 (CPU nodes) and C8220x (GPU nodes) models. Additional cluster services are provided by a pair of management servers, serving virtual machine images off a dedicated SAN. Project-specific services requiring tight coordination with the cluster can be hosted within these systems - for example, license servers and front-end web portals are kept running for specific groups' needs.

Fileservers

Two filesystems are available within Colonial One.

The Dell NSS (NFS) redundant fileserver system provides 250TB (usable) of storage space for user home directories, group project space, and cluster-wide application installs.

The Dell / Terascala Lustre HSS solution provides a high-speed (up to 7GB/s) filesystem for scratch data use. The current system is configured with 250TB usable disk space, and can be further expanded when necessary.

Both filesystems are made accessible over the cluster IB fabric to both the login and compute nodes, and are available for remote file access using SCP/SFTP, as well as through Globus using the GW#ColonialOne endpoint.

Interconnect

The entire cluster accesses a shared Mellanox FDR Infiniband (56Gbps) interconnect, which is run with a 2-to-1 over-subscription in the core. It is currently constructed from two SX6036 managed switches and fourteen SX6025 switches, connected with copper (for local connections) and active-fiber (cross-rack) cables. An additional SX6012 switch with VPI allows for cross-connection between the Infiniband network and the campus 10-Gigabit Ethernet network. Tuned MPI stacks that take advantage of the low-latency RDMA capabilities are available through the modules system.
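As a rough illustration of what the 2-to-1 over-subscription means in practice, the arithmetic below estimates the worst-case bandwidth a node sees across the core. This is a minimal sketch under a deliberately simple model: every node is assumed to push traffic through the core at the same time, and the quoted 56Gbps is treated as the usable link rate. The function name is hypothetical, and real throughput will also depend on encoding overhead, routing, and traffic pattern.

```python
def worst_case_core_bandwidth_gbps(link_gbps: float, oversubscription: float) -> float:
    """Per-node bandwidth across the core when every node transmits at once.

    Simplified model (an assumption, not a measurement): the core offers
    1/oversubscription of the aggregate edge bandwidth, shared evenly.
    """
    if oversubscription < 1:
        raise ValueError("oversubscription ratio must be at least 1")
    return link_gbps / oversubscription


FDR_LINK_GBPS = 56.0          # link rate quoted in the text
CORE_OVERSUBSCRIPTION = 2.0   # "2-to-1" from the text

worst_case = worst_case_core_bandwidth_gbps(FDR_LINK_GBPS, CORE_OVERSUBSCRIPTION)
print(f"Worst-case per-node bandwidth across the core: {worst_case:.0f} Gbps")
```

Traffic that stays within a single edge switch does not cross the core and is not affected by this ratio.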

GPU Nodes

First Generation

There are 32 first-generation GPU nodes in Colonial One, accessible through the gpu queue in the Slurm scheduler. Each of these is a:

  • Dell C8220x Compute Node
  • Dual 6-Core 2.0GHz Intel Xeon E5-2620 CPUs
  • 128GB of 1333MHz DDR3 ECC Registered DRAM
  • Dual NVIDIA K20 GPU accelerator cards
  • Mellanox FDR Infiniband controller
  • Dual 160GB Intel SSD (used for boot and local scratch space)

Second Generation

There are 21 second-generation GPU nodes in Colonial One, accessible through the ivygpu queue in the Slurm scheduler. Each of these is a:

  • Dell C8220x Compute Node
  • Dual 6-Core 2.1GHz Intel Xeon E5-2620v2 CPUs
  • 128GB of 1600MHz DDR3 ECC Registered DRAM
  • Dual NVIDIA K20 GPU accelerator cards
  • Mellanox FDR Infiniband controller
  • 100GB Intel SSD (used for boot and local scratch space)
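To show how the two GPU queues named above might be targeted, here is a minimal sketch that generates a Slurm batch script for either generation. It assumes the queue names map directly to Slurm partitions (so the standard sbatch --partition option applies); the helper gpu_batch_script and the sample command are hypothetical, and any site-specific options (GPU resource requests, time limits, accounts) are intentionally left out because the text does not describe them.

```python
# Queue names taken from the text; treating them as Slurm partitions is an assumption.
GPU_QUEUES = {1: "gpu", 2: "ivygpu"}


def gpu_batch_script(generation: int, command: str, nodes: int = 1) -> str:
    """Return the text of a batch script aimed at the chosen GPU generation."""
    if generation not in GPU_QUEUES:
        raise ValueError("generation must be 1 or 2")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --partition={GPU_QUEUES[generation]}",
        f"#SBATCH --nodes={nodes}",
        command,  # the user's own workload (hypothetical here)
    ]
    return "\n".join(lines) + "\n"


script = gpu_batch_script(2, "nvidia-smi")
print(script)
```

The resulting text could be saved to a file and submitted with sbatch; check the cluster's own job submission guide for the options it actually requires.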

Compute Nodes

First Generation

There are 65 first-generation CPU nodes, split into three categories based on the configured memory: eight 256GB nodes, twenty-five 128GB nodes, and thirty-two 64GB nodes. These are accessible through the defq in Slurm, or can be specifically allocated by memory size with the 64gb, 128gb, or 256gb queues respectively. The defq can be used if your job requires less than 64GB to run, although it may be placed on any of the larger systems depending on availability. Each of these is a:

  • Dell C8220 Compute Node
  • Dual 8-Core 2.6GHz Intel Xeon E5-2670 CPUs
  • 256/128/64GB of 1600MHz DDR3 ECC Registered DRAM
  • Mellanox FDR Infiniband controller
  • 160GB Intel SSD (used for boot and local scratch space)
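The queue guidance above can be sketched as a small selection helper. This is an illustration only: pick_cpu_queue is a hypothetical function, the queue names are taken literally from the text, and the thresholds use installed memory, so the memory actually available to a job on each node will be somewhat lower.

```python
def pick_cpu_queue(mem_gb: float, pin_to_size: bool = False) -> str:
    """Suggest a first-generation CPU queue for a job needing mem_gb of memory.

    pin_to_size=True asks for the smallest fixed-size queue instead of defq,
    for jobs that should not land on a larger node.
    """
    if mem_gb <= 0:
        raise ValueError("mem_gb must be positive")
    if mem_gb < 64:
        # Per the text, defq suits jobs under 64GB but may use any node size.
        return "64gb" if pin_to_size else "defq"
    if mem_gb <= 128:
        return "128gb"
    if mem_gb <= 256:
        return "256gb"
    raise ValueError("job exceeds the largest first-generation CPU node (256GB)")


for need in (16, 100, 200):
    print(f"{need}GB job -> {pick_cpu_queue(need)}")
```

Jobs that need more than 256GB do not fit any of these queues, which is why the helper raises an error rather than guessing.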

Second Generation

There are 94 second-generation 128GB CPU nodes. Each of these is a:

  • Dell C8220 Compute Node
  • Dual 8-Core 2.6GHz Intel Xeon E5-2650v2 CPUs
  • 128GB of 1866MHz DDR3 ECC Registered DRAM
  • Mellanox FDR Infiniband controller
  • 100GB Intel SSD (used for boot and local scratch space)

Large Memory Node

There is a single 2TB memory compute node. This system is a:

  • Dell PowerEdge R920 server
  • Quad 12-Core 3.0GHz Intel Xeon E7-8857v2 CPUs
  • 2TB of DDR3 ECC Registered DRAM
  • Mellanox FDR Infiniband controller
  • 100GB Intel SSD (used for boot and local scratch space)
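The per-node figures listed in the sections above can be rolled up into cluster-wide totals. The sketch below does that arithmetic from the numbers in the text; the NodeGroup type and group labels are hypothetical, and the 2TB node is assumed to hold 2048GB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    name: str
    nodes: int
    sockets: int
    cores_per_socket: int
    mem_gb: int
    gpus: int = 0  # GPU cards per node


# Counts and specifications as listed in the sections above.
GROUPS = [
    NodeGroup("gpu-gen1", 32, 2, 6, 128, gpus=2),
    NodeGroup("gpu-gen2", 21, 2, 6, 128, gpus=2),
    NodeGroup("cpu-gen1-256gb", 8, 2, 8, 256),
    NodeGroup("cpu-gen1-128gb", 25, 2, 8, 128),
    NodeGroup("cpu-gen1-64gb", 32, 2, 8, 64),
    NodeGroup("cpu-gen2", 94, 2, 8, 128),
    NodeGroup("large-memory", 1, 4, 12, 2048),  # assumes 2TB = 2048GB
]

total_nodes = sum(g.nodes for g in GROUPS)
total_cores = sum(g.nodes * g.sockets * g.cores_per_socket for g in GROUPS)
total_mem_gb = sum(g.nodes * g.mem_gb for g in GROUPS)
total_gpus = sum(g.nodes * g.gpus for g in GROUPS)

print(f"nodes:  {total_nodes}")
print(f"cores:  {total_cores}")
print(f"memory: {total_mem_gb} GB")
print(f"GPUs:   {total_gpus}")
```

The node total matches the 213 compute nodes quoted in the overview, which is a useful consistency check on the per-section counts.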