Today marks the last day the Dell and DDN teams will be on site, meaning the end of deployment.

After a long day, the team was able to complete the following:

  • Configured Lustre.
  • Verified that both GPFS and Lustre are good to go.
  • Had the motherboard replaced on an additional node, then updated firmware and reapplied BIOS settings after the replacement was complete.
  • Completed the final rack pair Linpack tests.
  • Connected the entire cluster to GPFS and Lustre.

Today the team was able to start wrapping up deployment.

We were able to get the following completed:

  • Fixed, tuned and mounted NFS on the CPU nodes.
  • Prepared the Lustre and client RPMs, which will be ready tomorrow morning.
  • Configured provisional external IP addresses on the head nodes as well as the login nodes.
  • Ran tests for the GPFS NFS mount.
  • Updated firmware and reapplied BIOS settings on the machines that had motherboards replaced due to the power outage yesterday.
  • Started Linpack tests on another rack pair. This is the last pair!

A Dell technician replaces the motherboard on a node that was damaged during the power outage.

Today we dealt with a power outage in the data center caused by bad weather the night before, but we were still able to:

  • Replace a failed DIMM and rerun single-node Linpack.
  • Complete Linpack (performance tests) on the GPU nodes.
  • Complete CPU Linpack on three rack pairs.
  • Modify the Z9100 config per DDN's request.
  • Complete the DDN NFS storage configuration.
  • Complete the DDN Lustre storage configuration.

Today we worked to balance the power load on the data center power distribution units (PDUs). Load tests were run to check the PDUs' limits. Until the PDU power limits can be fixed, smaller-scale Linpack (performance) tests have to be run for validation so that no PDU is brought down. Lastly, IB AlltoAll tests were completed.
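As a rough illustration of the reasoning behind the smaller-scale runs, here is a minimal Python sketch of how one might size Linpack batches so that the combined peak draw stays under a PDU's limit. This is not the tooling the team used. The function names, the 80% headroom, and the wattage figures are all hypothetical and chosen only to show the arithmetic.

```python
def max_nodes_per_pdu(pdu_limit_watts, node_peak_watts, headroom_percent=80):
    """Return the largest node count whose combined peak draw stays
    within headroom_percent of the PDU limit (integer arithmetic only)."""
    if pdu_limit_watts <= 0 or node_peak_watts <= 0:
        raise ValueError("wattages must be positive")
    if not 0 < headroom_percent <= 100:
        raise ValueError("headroom_percent must be in (0, 100]")
    return (pdu_limit_watts * headroom_percent) // (100 * node_peak_watts)


def plan_batches(nodes, batch_size):
    """Split the nodes fed by one PDU into consecutive batches of at most
    batch_size, so that only one batch runs Linpack at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


if __name__ == "__main__":
    # Hypothetical figures: a 10 kW PDU and 450 W peak draw per node.
    limit = max_nodes_per_pdu(pdu_limit_watts=10_000, node_peak_watts=450)
    nodes = [f"node{i:02d}" for i in range(20)]  # hypothetical node names
    for number, batch in enumerate(plan_batches(nodes, limit), start=1):
        print(f"batch {number}: {len(batch)} nodes")
```

With real measurements in place of the made-up numbers, the same calculation gives an upper bound on how many nodes behind a single PDU can be stressed at once.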

Today we ran into a bit of a snag: the new cluster is more powerful than we originally thought! We were running performance tests, and it appears there was not enough power available to run the cluster at full load. We unfortunately learned this the hard way when power was lost to parts of the data center.

We were able to:

  • Verify that all firmware is up to date (Dell and DDN storage).
  • Successfully run stream to test memory bandwidth, with no errors encountered.
  • Successfully run bibw for bi-directional bandwidth to ensure that all nodes receive the expected bandwidth.
  • Complete the DDN network configuration.

Members of the HPC team, facilities staff and electricians try to evaluate the power constraints of the power distribution units.

Today we completed the firmware updates, the re-cabling and the DDN storage firmware update. DDN storage and network configuration is in progress.

One of the DDN team members configures the storage for the new Colonial One.

Dell team members perform further firmware upgrades throughout the cluster.

To prevent damage, the disks were shipped separately. Today the storage enclosures for Lustre and NFS were populated, totaling 4 petabytes!

The team also managed to:

  • Partially update firmware.
  • Configure all of the switches.
  • Rack the DDN controllers.
  • Install all DDN drives.
  • Power up the entire storage system.
  • Complete the DDN storage health check; all systems look good.
  • Start re-running cables through the top of the racks instead of along the cable ladder as initially run and planned.

Today is the first day of setup and configuration, with the Dell and DDN storage teams coming on site to wire and plug everything in. Today we completed the 1GbE, 10GbE and 40GbE rack-to-rack cabling. We powered on 12 racks (we have to wait for the storage folks to finish before powering on the other 2 racks). Lastly, we completed blue-light power-on testing, making sure nothing was shaken too badly during transport and delivery.