National Institute of Science Education and Research
AN AUTONOMOUS INSTITUTE UNDER DAE, GOVT. OF INDIA

 
Cluster (High-Performance) Computing
Kalinga

KALINGA is the HPC cluster of the School of Physical Sciences. It comprises 64 new compute nodes, 32 old compute nodes and one GPU node with four NVIDIA Tesla K40c cards. The cluster has 205 TB of storage on the Lustre® Parallel File System and is interconnected with a 216-port EDR 100 Gb/s InfiniBand Smart Director switch populated with 108 ports.

Old compute nodes: based on the Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz, with two sockets per node (32 cores per node) and 512 GB of memory (8 x 64 GB) per node.

New compute nodes: based on the Intel® Xeon® Gold 6248 CPU @ 2.50GHz, with two sockets per node (40 cores per node) and 192 GB of memory (12 x 16 GB) per node.

This cluster was listed in the January 2021 edition of the top supercomputers list for India maintained by C-DAC Bangalore: https://topsc.cdacb.in/filterdetailstry?page=30&slug=January2021
  • New Master node:
    • Processor: 2 x 20 cores (Intel® Xeon® Gold 6248 CPU @ 2.50GHz)
    • Memory: 192 GB (12 * 16 GB)
  • Old Master node:
    • Processor: 2 x 16 cores (Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz)
    • Memory: 512 GB (8 * 64 GB)
  • New Compute nodes (64 nodes):
    • Processor: 2 x 20 cores (Intel® Xeon® Gold 6248 CPU @ 2.50GHz)
    • Memory: 192 GB (12 * 16 GB)
  • Old Compute nodes (32 nodes):
    • Processor: 2 x 16 cores (Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz)
    • Memory: 512 GB (8 * 64 GB)
  • GPU node (NVIDIA) (1 node)
    • NVIDIA Tesla K40c (4 cards)
  • Storage Nodes: MDS1 & OSS1 (2 nodes)
  • Interconnect:
    • Primary Interconnect: 216-Port EDR 100Gb/s Infiniband Smart Director Switch populated with 108 ports (CS7520).
    • Secondary Interconnect: Ethernet (SSE-G2252) - 48 Port Gigabit Switch
  • Storage: 205TB with Lustre® Parallel File System.
  • Resource Manager: SLURM with fair-share scheduling (a sample submission script follows this list).
  • Monitoring Tool: Ganglia
  • KVM Switch and KVM Console.
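Since jobs on KALINGA are managed by SLURM, work should be submitted through a batch script rather than run directly on the master node. The following is only a minimal sketch: the partition name, module name and executable are placeholders (the actual names are given in the KALINGA user manual), and the core count should match the node type requested.

    #!/bin/bash
    #SBATCH --job-name=my_job
    #SBATCH --partition=standard        # placeholder partition name; check the user manual
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=40        # new nodes have 40 cores per node; use 32 on the old nodes
    #SBATCH --time=24:00:00
    #SBATCH --output=slurm-%j.out

    module load intel-mpi               # placeholder module name
    srun ./my_application               # placeholder executable

The script is submitted with "sbatch job.sh", and queued or running jobs can be checked with "squeue -u $USER".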


Hartree

A new HPC cluster has been installed for the School of Chemical Sciences and named the Hartree cluster. The Hartree cluster is housed inside a SmartRow solution: a SmartRow is in itself a mini data centre that contains almost all the subsystems of a full data centre.

This cluster was listed in the July 2018 edition of the top supercomputers list for India maintained by C-DAC Bangalore: https://topsc.cdacb.in/filterdetailstry?page=30&slug=July2018

Subsystems in the SmartRow:

  • Electrical (row power, UPS power): 150 kVA modular frame, loaded with 90 kVA, with 30 minutes of backup.
  • HVAC (CRV, computer room vertical): 2 x 35 kW in-row PAC units.
  • Fire detection system: two zones (Zone 1 & Zone 2).
  • Fire suppression system: NOVEC 1230.
  • Access control system (for door opening).
  • Rodent repellent system: 3 standalone devices.
  • CCTV
  • Monitoring with Rack Data Unit (RDU)

Hartree Cluster:

The Hartree cluster is an HPC system of 40 compute nodes and 2 master nodes in an HA configuration.
It is based on the Intel® Xeon® CPU E5-2697A v4 @ 2.60GHz, with two sockets per node (32 cores per node), on Supermicro SuperServer SYS-6028R-TR hardware with 64 GB of RAM per node.

HPL Benchmarking Results (for 36 nodes - 1152 cores):

  • Rmax of 38.81 teraflops
  • Rpeak of 53.24 teraflops (the corresponding HPL efficiency is worked out below)
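The efficiency of the HPL run follows directly from these two figures as the ratio of achieved to theoretical peak performance:

    \[
      \eta_{\mathrm{HPL}} \;=\; \frac{R_{\mathrm{max}}}{R_{\mathrm{peak}}}
                          \;=\; \frac{38.81\ \text{TFLOPS}}{53.24\ \text{TFLOPS}} \;\approx\; 0.73
    \]

That is, roughly 73% of the theoretical peak was sustained over the 1152-core run.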

The nodes are connected by a Mellanox 108-port FDR chassis switch (MSX6506-NR) that provides 56 Gb/s of throughput.
The cluster also has 230 TB of storage on the Lustre® Parallel File System, with an aggregate write performance of around 5 GB/s (a short file-striping example follows the storage list below).

  • Total IO Nodes: 4 (2 MDS & 2 OSS)
  • MDT Storage: Qscan (4 TB in RAID 10)
  • OST Storage: DDN (230 TB in RAID 6)
  • The storage is from DDN, with dual controllers working in active-active mode.
  • Controller Enclosure: SFA 7700x
  • Disk Enclosure: SS7700
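Large files on the Lustre file system benefit from being striped across several OSTs. The sketch below shows how a user might inspect and set striping with the standard lfs utilities, assuming they are available on the login and compute nodes; the directory path is a placeholder.

    # Show free space on the MDT and the OSTs
    lfs df -h
    # Check the current stripe layout of a directory or file (path is a placeholder)
    lfs getstripe /lustre/scratch/$USER/myrun
    # Stripe new files in this directory across 4 OSTs with a 4 MB stripe size
    lfs setstripe -c 4 -S 4M /lustre/scratch/$USER/myrun

Striping set on a directory applies to files created in it afterwards; existing files keep their original layout.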


Vikram

The Vikram High-Performance Computing (HPC) Cluster is a modern computational infrastructure deployed at the School of Physical Sciences (SPS), NISER Bhubaneswar. Built with AMD EPYC processors, HDR InfiniBand networking, and a scalable parallel file system, Vikram provides high-throughput computing, low-latency communication, and robust storage capacity for demanding computational workloads.

  • Master Nodes (2 Units):
    • Make/Model: Tyrone / Camarero SDA200A2W-18 (1U Rack)
    • Processor: AMD EPYC™ 9554P (64 cores / 128 threads, 3.1 GHz, 256 MB cache, 360W SP5)
    • Memory: 4 x 32 GB DDR5-4800
    • Storage: 2 x 960 GB SATA 6Gb/s V4 MLC SSD (3 DWPD) in RAID 1
  • Standard Compute Nodes (34 Units):
    • Make/Model: Tyrone / Camarero SDA200A2W-18 (1U Rack)
    • Processor: AMD EPYC™ 9554P (64 cores / 128 threads, 3.1 GHz, 256 MB cache)
    • Memory: 12 x 32 GB DDR5-4800
    • Storage: 2 x 960 GB SATA 6Gb/s V4 MLC SSD (3 DWPD) in RAID 1
  • High-Memory Compute Nodes (2 Units):
    • Make/Model: Tyrone / Camarero SDA200A2W-18 (1U Rack)
    • Processor: AMD EPYC™ 9554P (64 cores / 128 threads, 3.1 GHz, 256 MB cache)
    • Memory: 16 x 64 GB DDR5-4800
    • Storage: 2 x 960 GB SATA 6Gb/s V4 MLC SSD (3 DWPD) in RAID 1
  • InfiniBand Fabric:
    • NVIDIA MQM8790-HS2F 200Gbps HDR
    • Topology: Non-blocking fabric
    • Total IB Port Requirement: 42 ports (HDR-to-2x100G Y-cables) - Master Nodes: 2, Compute Nodes: 34, High-Memory Nodes: 2, PFS Controllers: 4 (a link-status check example follows this list)
  • Management Network:
    • Switch Model: D-Link DGS-1052
    • Ports: 48 x 10/100/1000 Mbps + 4 x Combo 1000Base-T/SFP
  • Free Ports:
    • Management Switch (1GbE): 2 ports available
    • BMC Switch (1GbE): 2 ports available
  • Storage Controller: 500 TB Parallel File System
    • Model: ES200NVX2 - 2U (850mm depth)
    • NVMe Slots: 24
    • Expansion: SAS4 ports
    • Network: 4 x NDR200 QSFP112 ports
  • Disk Enclosure:
    • Model: SS9024-SBOD-T - 4U (1066.7mm depth)
    • Drive Bays: 90 SAS4
    • Modules: Dual SAS4 I/O
    • Cabling: 4 x Copper Mini-SAS HD (CBL-SAS4-MSHD-3.5m)
  • Storage Media:
    • HDDs: 84 x 10 TB, 7.2K RPM, SAS-3, 4K sector
    • SSDs: 6 x 3.84 TB NVMe Gen4 (1 DWPD, SED capable)
    • Total Usable Capacity: 500 TB
    • Use Cases: high-bandwidth writes, metadata acceleration via NVMe tier
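To confirm that a node's InfiniBand link has come up at the expected rate, the standard InfiniBand diagnostic tools can be used. This is only an illustrative check, assuming the infiniband-diags/rdma-core utilities are installed on the nodes; it is not a documented procedure of the cluster.

    # List the host channel adapters and their port status
    ibstat
    # A healthy port reports "State: Active"; the rate is 200 for a full HDR link,
    # or 100 on an HDR100 split port (the Y-cable configuration listed above)
    ibstat | grep -E "State:|Rate:"
    # Optional: query basic device capabilities
    ibv_devinfo | head -n 20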


Dirac

The Dirac High-Performance Computing (HPC) Cluster is a state-of-the-art computational platform at the School of Chemical Sciences (SCS), NISER Bhubaneswar. Equipped with modern AMD EPYC processors, HDR InfiniBand networking, and a high-performance parallel file system, the Dirac cluster delivers exceptional computing power, ultra-low latency, and scalable storage for advanced scientific computing.

  • Master Nodes (2 Units):
    • Make/Model: Tyrone / Camarero SDA200A2W-18 (1U Rack)
    • Processor: AMD EPYC™ 9554P (64 cores / 128 threads)
    • Memory: 384 GB DDR5-4800
    • Storage: 2 x 960 GB MLC SSD (Hardware RAID 1)
    • Network: NVIDIA ConnectX-6 HDR 200 Gbps
  • Compute Nodes (32 Units):
    • Make/Model: Tyrone / Camarero SDA200A2W-18 (1U Rack)
    • Processor: AMD EPYC™ 9554P (64 cores)
    • Total Compute Cores: 2048 (32 nodes x 64 cores)
    • Memory: 384 GB DDR5-4800 per Node
    • Boot Storage: 960 GB MLC SSD
    • Network: HDR InfiniBand 200 Gbps
  • InfiniBand Fabric:
    • NVIDIA MQM8790-HS2F 200Gbps HDR
    • Topology: Non-blocking high-speed fabric
    • IB Endpoints: 42 ports (HDR-to-2x100G Y-cables)
  • Storage Controller: 500 TB Parallel File System
    • Model: ES200NVX2 - 2U (850mm depth)
    • NVMe Slots: 24
    • Expansion: SAS4 ports
    • Network: 4 x NDR200 IB ports
  • Disk Enclosure:
    • Model: SS9024-SBOD-T - 4U (1066.7mm depth)
    • Drive Bays: 90 SAS4
    • Modules: Dual SAS4 I/O
  • Storage Media:
    • HDDs: 84 x 10 TB, 7.2K RPM, SAS-3, 4K sector
    • SSDs: 6 x 3.84 TB NVMe Gen4 (for metadata and acceleration)
    • Performance: Up to 10 GB/s sustained throughput (a rough single-client write check is sketched below)
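The quoted 10 GB/s is a file-system-level figure that is normally reached only with many clients writing in parallel (for example with a benchmark such as IOR). As a rough sanity check of single-client write bandwidth, a simple dd test can be used; the scratch path below is a placeholder.

    # Write 4 GiB with direct I/O, flush to disk, and report the achieved rate
    dd if=/dev/zero of=/pfs/scratch/$USER/ddtest bs=1M count=4096 oflag=direct conv=fsync
    # Clean up the test file afterwards
    rm /pfs/scratch/$USER/ddtest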


Please Do

  • Remember that you share the login nodes (and remote desktop environments, where they are provided) with many other users, and be mindful of that. In particular, request an appropriate number of processing cores for your jobs by testing how well they scale.
  • If you need to run something "strenuous" interactively, use the batch system's interactive mechanism (e.g. qsub -I ...). This applies to compiling and testing code and to pre-processing or post-processing data.
  • If you need to transfer a lot of small files, it is smarter to bundle them into a single tar or zip archive and transfer that one larger file instead (see the example after this list).
  • If you are running jobs that are identical except for the input parameters and input file names, consider using job arrays instead of submitting many individual, almost identical jobs (a job-array sketch follows the "Please Don't" list below).
  • It is strongly recommended that users back up their files and folders periodically, as NISER does not provide a mechanism for backing up user data.
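As an illustration of the bundling tip above, the sketch below archives a directory containing many small files and transfers the single archive instead; the remote host and paths are placeholders.

    # Bundle many small files into one compressed archive
    tar -czf results.tar.gz results/
    # Copy the single archive instead of thousands of individual files
    # (user@login-node and the destination path are placeholders)
    scp results.tar.gz user@login-node:/home/user/
    # Unpack on the receiving side
    tar -xzf results.tar.gz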

Please Don't

  • Run processing for sustained periods on the login nodes (or remote desktops). Your access to resources (CPU and RAM) there is capped, but you should nonetheless avoid inconveniencing other users by running work on the login nodes; use the batch system instead.
  • Fire large numbers of qsub commands at the batch system just because you can script it in shell or Python. Rapidly submitting jobs can cause issues for the batch submission sub-system, and having large numbers of small jobs queued can also impact scheduler performance. Please use job arrays (sketched below) or Nimrod instead when you need to run large numbers of similar jobs.
  • Abuse the login environment with the watch command. In many circumstances, the output of the command being watched changes far less frequently than watch polls it.
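The tips above mention PBS-style commands (qsub), while the KALINGA specification lists SLURM as the resource manager; with SLURM the equivalent of a job array is sbatch --array. The following is only a minimal sketch: the partition name, input file pattern and program are placeholders.

    #!/bin/bash
    #SBATCH --job-name=sweep
    #SBATCH --partition=standard           # placeholder partition name
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --time=02:00:00
    #SBATCH --array=1-50                   # 50 tasks submitted as one array job

    # Each array task selects its own input by index (input_1.dat ... input_50.dat are placeholders)
    INPUT="input_${SLURM_ARRAY_TASK_ID}.dat"
    srun ./my_program "${INPUT}" > "output_${SLURM_ARRAY_TASK_ID}.log"

A single "sbatch sweep.sh" then queues all 50 tasks, which the scheduler handles far more efficiently than 50 separately submitted jobs.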
