## 2024

### 2024-12-27

#### Added
| Dell PowerEdge R760 | Specification |
|---|---|
| Nodes | 11 |
| CPU | Intel Xeon Gold 6246Y (16C/32T) |
| RAM | 192 GB per node |
| Storage (SAS HDD) | 12×22 TB per node |
| Storage (NVMe) | 2×15.36 TB per node |
#### Changed
- **Storage Expansion**: Expanded raw Ceph storage capacity and I/O performance:
    - HDD (SAS): 22 TB × 12 × 11 = 2,904 TB (≈2.9 PB) raw
    - NVMe: 15.36 TB × 2 × 11 = 337.92 TB raw
- Filesystem capacity is now:
    - HDD: 4.2 PiB
    - NVMe: 325 TiB
    - SSD: 279 TiB
    - Total: 4.8 PiB
**User Impact**

- No downtime
- Increased filesystem capacity and available space
### 2024-12-20

#### Changed
- **New Datacenter Launch**:
    - Support for high electrical loads
    - Improved cooling
    - Enhanced robustness and security
### 2024-12-12

#### Changed
- **Hardware Migration**: Relocated the existing infrastructure into the new container
- All hardware (nodes, services, Kubernetes) was migrated from the old container to the new one
### 2024-11-18

#### Changed
- Moved the fast filesystem mount point from `/fast` to `/orfeo/cephfs/fast`.
**User Impact**

- All references to `/fast` will break and must be updated to point to `/orfeo/cephfs/fast` instead; see the sketch below.
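A quick way to migrate existing job scripts is sketched below; the `~/jobs` location is a placeholder, and the in-place rewrite should be run only once, after reviewing the matches:

```bash
# Locate hard-coded references to the old mount point
# (~/jobs is a placeholder; point it at wherever your scripts live)
grep -rn '/fast' ~/jobs/

# Rewrite them in place; run only once, since the new path
# itself contains the string "/fast"
sed -i 's|/fast|/orfeo/cephfs/fast|g' ~/jobs/*.sh
```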
### 2024-08-12

#### Changed
- Moved all computational nodes without accelerators from F37/F38 to F40; the following core packages were affected:
| module | previous version | new version |
|---|---|---|
| Python | <= 3.11 | 3.12 |
| Linux kernel | <= 6.3 | 6.10 |
| Glibc | <= 2.36 | 2.39 |
| gcc | <= 12.3 | 14.2 |
- Cluster-wide Slurm update from 22.05 to 24.05. The number of feature-wise changes is considerable; please consider visiting the official news file for more details. For more details on the server deployment, visit this link. Details on the clients will be made available soon.
- Modules update; among the most relevant:
| module | old version | new version | comments |
|---|---|---|---|
| openmpi | 4.1.5 | 4.1.6 | |
| openblas | 0.3.23 | 0.3.26 | |
| R | 4.2.3 | 4.3.3 | R shared-lib flag was added |
| IGV | 2.16.2 | 2.18.0 | |
| hwloc | 2.8.0 | 2.10.0 | |
| Star | 2.7.9a | 2.7.11b | |
| picard | 3.0.0 | 3.2.0 | |
| foldseek | 8-ef4e960 | 9-427df8a | |
| bedtools2 | 2.21.1 | 2.31.1 | |
The new Java default is 21.0.2; previous versions are available if needed.
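A minimal sketch of switching Java versions with the module system; the exact version string below is a placeholder, so use one reported by `module avail`:

```bash
# List the Java modules installed on the cluster
module avail java

# Load a specific version instead of the 21.0.2 default
# (the version string below is a placeholder)
module load java/17.0.2
```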
If you are interested in a particular module and how it was compiled, please visit the following link. We will open-source repositories as they reach a stable state.
#### Fixed
- Missing kernel modules for ZFS storage
- Corrected BTRFS to ZFS in the previous changelog entry
#### Tested

- Slurm/MPI essential feature performance; results are in line with previous measurements
#### Security

- Removed the mitigation for CVE-2024-6387 and updated `sshd`
**User Impact**

- Codes compiled by you that rely on older versions of libraries will need to be recompiled (this might include some R modules)
- Python virtual environments that rely on the OS Python version will need to be rebuilt
- Some older codes may stop working due to their deprecated dependencies. Please use newer versions whenever possible or transition to a containerized approach. A sketch of how to check and rebuild follows below.
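A minimal sketch of how to check a binary and rebuild a Python virtual environment after the migration; the binary name, environment path, and requirements file are placeholders:

```bash
# Any "not found" entry means the binary links against a library
# version that no longer exists and must be recompiled on F40
ldd ./my_program | grep 'not found'

# Recreate a virtual environment against the OS Python 3.12;
# --clear wipes the old interpreter-specific files first
python3 -m venv --clear ~/envs/myenv
source ~/envs/myenv/bin/activate
pip install -r requirements.txt
```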
### 2024-07-08

#### Changed
- Proxmox VE updated
- Split InfiniBand interfaces on PVE machines
- Changed the operating system from CentOS 7 to AlmaLinux 9 on the LTS head node
#### Fixed
- Applied mitigation for CVE-2024-6387
**User Impact**

- Reduced InfiniBand bandwidth on the login node might impact data transfer to compute nodes.
- LTS volumes are now served on the 25 GbE network.
- ZFS can now be set up on LTS (only upon group leader request).
- Mount points under `/mnt` are deprecated and have been moved to `/orfeo/`. Please contact us if something is missing.
- SSH connections might be refused as a consequence of the CVE mitigation.
### 2024-06-10

#### Changed
- Proxmox VE updated
#### Fixed

- Enabled `gres` constraints in Slurm
**User Impact**

- Any request you make for a node with GPUs must now be revised to book the accelerator correctly. In particular, command-line submissions must carry either the `--gpus={{ # of gpus }}` flag or the more specific `--gres=gpu:{{ gpu model }}:{{ # of gpus }}` flag. The GPU model is the V100 in the GPU partition and the A100 in the DGX partition. You can add the same directives after a `#SBATCH` keyword in your script; see the sketch below.
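A minimal batch-script sketch under the new constraints; the GPU count and program name are illustrative:

```bash
#!/bin/bash
#SBATCH --partition=GPU
#SBATCH --gres=gpu:V100:2   # explicit GPU model and count
# Equivalent generic form (disabled here with a second '#'):
##SBATCH --gpus=2

srun ./my_gpu_program
```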
### 2024-05-13

#### Changed
- Users on Slurm servers are now managed via LDAP + Kerberos
- Switch reconfiguration to accommodate new VLANs throughout the whole fabric.
#### Fixed
- Updated firewall rules for higher security in the HPC environment
**User Impact**
- None, in theory. In practice, if previously running jobs now experience permission-related issues, let us know. There might have been misalignment between different user DBs.
### 2024-04-19

#### Added
- Installed podman on devices running the Fedora and Alma OSes.
#### Changed
- Reinstalled DGX OS 6.2.0 on DGX001.
- Installed `nvidia-cuda-toolkit` from `apt-get` on DGX devices
#### Fixed

- `subuid`/`subgid` behaviour in NSS
**User Impact**

- No need to issue `module load cuda` when running jobs on the DGX nodes
- Test users can run podman on the compute nodes; if you want to try this feature, contact us. A minimal test is sketched below.
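A minimal rootless test, assuming your account has been enabled for the feature; the image choice is illustrative:

```bash
# Pull and run a small public image without root privileges
podman run --rm docker.io/library/alpine:latest echo "hello from podman"
```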
### 2024-04-08

#### Changed
- Moved slurm deployment to a temporary Kubernetes cluster. The normal situation will be restored once the Kubernetes production cluster is ready
- Updated both DGX nodes to DGX OS version 6.2.0 (based on Ubuntu 22.04 with Linux kernel version 5.15.0-1048-nvidia)
- Enrolled both DGX nodes in FreeIPA
- Changed default GPU queue walltime from 150 to 18 hours
### 2024-03-11

#### Changed
- Updated firewall rules for the HPC environment
- Proxmox VE updated
- Partial replacement of the DNS with `bind9` as provided by FreeIPA
#### Tested

- FreeIPA redundancy: in case of a server failure, a backup server will automatically replace it
#### Fixed
- Fixed network configuration for faulty nodes