
2024

2024-12-27

Added

Dell PowerEdge R760
    Nodes               11
    CPU                 Intel Xeon Gold 6246Y (16C/32T)
    RAM                 192 GB per node
    Storage (SAS HDD)   12 × 22 TB per node
    Storage (NVMe)      2 × 15.36 TB per node

Changed

  • Storage Expansion: Expanded raw Ceph storage capacity and improved I/O performance:

    • SAS HDD: 22 TB × 12 × 11 = 2,904 TB (≈2.9 PB) raw
    • NVMe: 15.36 TB × 2 × 11 = 337.92 TB raw
  • Filesystem capacity is now:

    • HDD: 4.2 PiB
    • NVMe: 325 TiB
    • SSD: 279 TiB
    • TOTAL: 4.8 PiB

User Impact

  • No downtime
  • Increased filesystem capacity and available space

2024-12-20

Changed

  • New Datacenter Launch:
    • Support for high electrical loads
    • Improved cooling
    • Enhanced robustness and security

2024-12-12

Changed

  • Hardware Migration: Relocation of the existing infrastructure into the new container

  • All hardware (nodes, services, Kubernetes) migrated from the old container to the new one

2024-11-18

Changed

  • Moved fast filesystem mount point from /fast to /orfeo/cephfs/fast.

User Impact

  • All references to /fast will break and must be changed to point to /orfeo/cephfs/fast instead.
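If your own job scripts hard-code the old path, a search-and-replace along these lines can help; this is only a minimal sketch, and the ~/scripts location is an example:

```bash
# Minimal sketch: find and rewrite hard-coded /fast paths in your own job scripts.
# ~/scripts is only an example location; run the sed step once and keep the backups,
# since re-running it would mangle paths that have already been updated.
grep -rl '/fast' ~/scripts
sed -i.bak 's|/fast|/orfeo/cephfs/fast|g' ~/scripts/*.sh
```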

2024-08-12

Changed

  • Moved all computational nodes without accelerators from Fedora 37/38 (F37/F38) to Fedora 40 (F40); the following core packages were affected:
    module         previous version   new version
    Python         <= 3.11            3.12
    Linux kernel   <= 6.3             6.10
    Glibc          <= 2.36            2.39
    gcc            <= 12.3            14.2
  • Cluster-wide Slurm update from 22.05 to 24.05. The number of feature changes is considerable; please consider visiting the official news file for more details. For details on the server deployment, visit this link. Details on the clients will be made available soon.

  • Module updates; among the most relevant:

    module      old version   new version   comments
    openmpi     4.1.5         4.1.6
    openblas    0.3.23        0.3.26
    R           4.2.3         4.3.3         rshared-lib flag was added
    IGV         2.16.2        2.18.0
    hwloc       2.8.0         2.10.0
    Star        2.7.9a        2.7.11b
    picard      3.0.0         3.2.0
    foldseek    8-ef4e960     9-427df8a
    bedtools2   2.21.1        2.31.1

The new Java default is 21.0.2; previous versions are available if needed.
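Switching to one of the previous releases should only take the usual module commands; the exact module name and version string below are assumptions, take the real ones from the `module avail` output:

```bash
module avail java      # list the Java versions installed on the cluster
module load java/17    # example only: load an older release instead of the 21.0.2 default
```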

If you are interested in a particular module and how it was compiled, please visit the following link. We will open-source repositories as they reach a stable state.

Fixed

  • ZFS storage missing kernel modules
  • Corrected BTRFS to ZFS in the previous changelog entry

Tested

  • Slurm/MPI essential feature performance; results are in line with previous measurements

Security

User Impact

  • Code you compiled yourself that relies on older library versions will need to be recompiled (this might include some R modules)
  • Python virtual environments that rely on the OS Python version will need to be recreated (see the sketch after this list)
  • Some older codes may stop working due to deprecated dependencies. Please use newer versions whenever possible or transition to a containerized approach.
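For the virtual environments, a rebuild along these lines is usually enough; this is a minimal sketch, and the ~/my-venv and requirements.txt names are only examples:

```bash
# Minimal sketch: rebuild a virtual environment against the new OS Python (3.12).
# ~/my-venv and requirements.txt are example names, adapt them to your setup.
mv ~/my-venv ~/my-venv.old            # keep the old environment until the new one works
python3 -m venv ~/my-venv
source ~/my-venv/bin/activate
pip install -r requirements.txt       # reinstalling recompiles any C extensions
```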

2024-07-08

Changed

  • Proxmox VE updated
  • Split InfiniBand interfaces on PVE machines
  • Changed the operating system from CentOS 7 to AlmaLinux 9 on the LTS head node

Fixed

User Impact

  • Reduced InfiniBand bandwidth on the login node might impact data transfer to compute nodes.
  • LTS volumes are now served on the 25 GbE network.
  • Now ZFS can be set up on LTS (only on group leader request).
  • Mount points under /mnt are deprecated and have been moved to /orfeo/. Please contact us if something is missing.
  • SSH connections might be refused as a consequence of the CVE mitigation.

2024-06-10

Changed

  • Proxmox VE updated

Fixed

  • Enabled GRES constraints in Slurm

User Impact

  • Any request you make for a node with GPUs must now be revised to book the accelerator correctly. In particular, command-line submissions must include either the --gpus={{ # of gpus }} flag or the more specific --gres=gpu:{{gpu model}}:{{ # of gpus }} flag. The GPU model is the V100 in the GPU partition and the A100 in the DGX partition. You can add the same directives after an #SBATCH keyword in your script.
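A minimal batch script using the new directives could look like the sketch below; the job name, walltime, and exact GPU model string are examples and may need adjusting for your case:

```bash
#!/bin/bash
#SBATCH --job-name=gpu-test           # example job name
#SBATCH --partition=GPU               # V100 nodes; use the DGX partition for the A100s
#SBATCH --gres=gpu:V100:2             # book two V100s explicitly (check the exact model string)
## the generic form works as well:
## #SBATCH --gpus=2
#SBATCH --time=01:00:00

srun nvidia-smi                       # confirm the allocated GPUs are visible to the job
```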

2024-05-13

Changed

  • Users on the Slurm servers are now managed via LDAP + Kerberos
  • Switch reconfiguration to accommodate new VLANs throughout the whole fabric.

Fixed

  • Updated firewall rules for higher security in the HPC environment

User Impact

  • None, in theory. In practice, if previously running jobs now experience permission-related issues, let us know. There might have been misalignment between different user DBs.

2024-04-19

Added

  • Installed podman on the nodes running the Fedora and Alma OSes.

Changed

  • Reinstalled DGX OS 6.2.0 on DGX001.
  • Installed nvidia-cuda-toolkit via apt-get on DGX devices

Fixed

  • subuid/subgid behaviour in nss.

User Impact

  • No need to issue "module load cuda" when running jobs on the DGX nodes
  • Test users can run podman on the compute nodes; if you want to try this feature, contact us.
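A minimal way to try it from a login node could be the following; the image and command are only examples, and rootless podman is assumed:

```bash
# Minimal sketch: run a containerized command with podman inside a Slurm job.
srun -N1 -n1 --pty podman run --rm docker.io/library/almalinux:9 cat /etc/os-release
```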

2024-04-08

Changed

  • Moved the Slurm deployment to a temporary Kubernetes cluster. The normal setup will be restored once the production Kubernetes cluster is ready
  • Updated both DGX nodes to DGX OS version 6.2.0 (based on Ubuntu 22.04 with Linux kernel version 5.15.0-1048-nvidia)
  • Enrolled both DGX nodes in FreeIPA
  • Changed default GPU queue walltime from 150 to 18 hours

2024-03-11

Changed

  • Updated firewall rules for the HPC environment
  • Proxmox VE updated
  • Partial replacement of the DNS with bind9 as provided by FreeIPA

Tested

  • FreeIPA redundancy: in case of a server failure, a backup server will automatically replace it

Fixed

  • Fixed network configuration for faulty nodes