GSI-Computing Users Meeting

Europe/Berlin
Christopher Huhn (Gesellschaft für Schwerionenforschung mbH), Kilian Schwarz (GSI Ges. für Schwerionenforschung mbH)
Description
Regular meeting between IT and the scientific users of the GSI computing environment (batch farm, Lustre file system, ...): 1. problems are discussed and improvements are proposed; 2. both IT and the experiments briefly present their respective plans for the next 4 weeks. Every interested user group should send at least one representative to take part in the technical discussions about the current state and the future of the GSI computing environment. Today's meeting is chaired by Christopher Huhn.
Participants:

 

  • Benjamin Dönigus, Alice
  • Paul Görgen, Accelerator Physics
  • Klaus Götzen, Panda
  • Christopher Huhn, HPC
  • Peter Malzacher, SC
  • Jochen Markert, Hades
  • Thomas Neff, Theory
  • Bastian Neuburger, HPC
  • Victor Penso, HPC
  • Carsten Preuss, SC
  • Jochen Thaeder, Alice
  • Jan Trautmann, HPC

 

Status Mini Cube/Prometheus/Hera

 

  • Overall operational status is quite satisfactory
  • Hades and Alice need Project definitions for GridEngine to get more jobs on Prometheus
    • Action item: Implement Projects on Prometheus (BN, VP)
    • Exact shares are to be defined later (a task for the GSI FAIR Computing Meeting)
  • MPI jobs on Prometheus
    • GridEngine master sometimes crashes after big MPI jobs finish
      • May be caused by overload on the master but no concrete evidence
      • Action item: BN and VP apply a patch that promises to fix the problem
    • MPI jobs with resource requirements exceeding the limits of the standard queue never start?
      • Only if per-user resource limits are defined (which is the default for all users)
      • Action item: Needs more investigation (CP, TN, BN, VP)
  • GridEngine queue with 3GB memory limit?
    • Alice has jobs slightly larger than 2GB that don't need the full 8GB of the current highmem queue
    • Resource limits should be adjusted (semi-automatically?) to match the workload of the farm and the specified as well as actual resource requirements of the jobs
  • Automatic cleanup of /tmp partitions on Prometheus
    • Existing mechanism needs to be deployed to the new farm
  • Reduction of the job scheduling interval
    • Would provide faster feedback on job submission success
    • Action Item: Find out the minimal job submission interval (BN, VP)
  • Handling of jobs with impossible resource requirements
    • These jobs are never scheduled and stay in the queue
    • Action Item: (Semi-)automatic cleanup of jobs that are not scheduled within a given interval (HPC); a sketch follows this list
  • Access to gStore from Prometheus/Hera
    • Bandwidth is tightly limited when copying via a worker node
    • Full 40Gbit/s when writing directly from gStore data movers to Hera
    • Action Item: Check the status of the Hera/gStore connection (Horst Göhringer, Thomas Roth)
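A minimal sketch of how such a (semi-)automatic cleanup of never-scheduled jobs could look, assuming the local GridEngine qstat supports XML output (-xml) and reports JB_submission_time for pending jobs; the threshold and the dry-run default are placeholders still to be agreed on:

```python
#!/usr/bin/env python
"""List (and optionally delete) jobs that have been pending for too long."""
import subprocess
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

MAX_PENDING_DAYS = 7   # placeholder threshold
DRY_RUN = True         # only report by default ("semi-automatic")

# All pending jobs of all users in XML form (assumes `qstat -xml` is available).
xml_out = subprocess.check_output(["qstat", "-u", "*", "-s", "p", "-xml"])
root = ET.fromstring(xml_out)

cutoff = datetime.now() - timedelta(days=MAX_PENDING_DAYS)
for job in root.iter("job_list"):
    job_id = job.findtext("JB_job_number")
    owner = job.findtext("JB_owner")
    submitted = job.findtext("JB_submission_time")
    if not (job_id and submitted):
        continue
    if datetime.strptime(submitted, "%Y-%m-%dT%H:%M:%S") < cutoff:
        print("job %s of %s pending since %s" % (job_id, owner, submitted))
        if not DRY_RUN:
            subprocess.call(["qdel", job_id])
```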

 

Other Topics

 

  • Backup of certain Lustre subtrees?
    • Desired by Alice for software development
      • aliroot not ready for out-of-source-tree installs
    • Copying the software from a backed-up location (/u?) seems to be the better solution
  • Alice Storage Element on top of Lustre
    • 10 Gbit link to Karlsruhe idle for now
    • Requires (budget for) procurement of additional Lustre storage
  • Backup of CVMFS server
    • VP points out that the experiment groups should take care of the reproducibility of their software deployment
    • However, re-deployment after a disaster may take some time
    • A backup of the build host(s) may be the best solution
    • Action Item: investigate the possibilities (HPC)
  • Action Item: TN requests environment variables that indicate which cluster a worker node belongs to (HPC); see the short example after this list
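A short example of how a job script could consume such a variable; the variable name GSI_CLUSTER and its values are purely hypothetical, since the actual naming is still to be defined by HPC:

```python
import os

# Hypothetical variable exported on the worker nodes (name not yet decided).
cluster = os.environ.get("GSI_CLUSTER", "unknown")

if cluster == "prometheus":
    # e.g. choose cluster-specific scratch space or software paths here
    print("running on the Prometheus cluster")
elif cluster == "hera":
    print("running on the Hera cluster")
else:
    print("cluster not identified: %s" % cluster)
```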

 

 

Open Action Items

 

  • Limiting the physical memory usage on GridEngine worker nodes (12.9.11: IT)
    • Problem: GridEngine/ulimit can only limit the virtual memory size (VSS), not the physical memory actually used
    • Aliroot reserves more memory than it actually uses afterwards, therefore the VSS limit must be set higher than the intended RAM limit per job slot.
    • CP: A cron job for monitoring physical memory usage is ready for production but not necessary on Prometheus at the moment (a monitoring sketch follows this list)
    • Action item closed
  • Upgrade of interactive ALICE nodes lxir35-38 to Squeeze (16.01.12: JT)
    • JT still needs to test whether Alice software built on Squeeze runs on Lenny
    • Not discussed
  • Upgrade of PANDA and CBM Storage Elements from Etch to Squeeze (16.01.12: KS)
    • Not discussed
  • Shutdown of Etch32 and Etch64, Lenny and Squeeze 32bit queues (16.01.12: FOPI, experiments)
    • FOPI software runs on Lenny64
    • No requests for 32bit batch queues atm.
    • Action item closed
  • Wrong Lustre paths in Perl scripts (06.02.12: JM, IT)
    • The problem is related neither to Perl nor to Lustre; it occurs when e.g. calling pwd from inside a script in a directory path that contains symlinks: the CWD path gets normalized (e.g. /lustre/ becomes /SAT/lustre/ on NFS Lustre clients).
    • Solution: Use the environment variable PWD instead (a short illustration follows this list).
    • Action item closed
  • CVMFS on Squeeze desktops (06.02.12: JT, IT)
    • No progress yet
  • Farm-wide availability of Modules (06.02.12: JT, IT)
    • No progress yet
  • WM would like to know when the CBM desktop (and group server) machines can be moved to Squeeze (05.03.12: WM, FU, CH)
    • No progress: No CBM representative attended the meeting
    • Upgrade to Squeeze possible at any time
    • Action item closed
  • CVMFS server for the test Cluster (B) runs on old hardware. No funding for new hardware atm. (05.03.12: VP, WS)
    • CH reminds WS to try to coordinate funding and procurement of new hardware
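Concerning the physical-memory action item above: a minimal sketch of the kind of per-node check such a monitoring cron job could perform, reading the resident set size (RSS) from /proc; the per-slot limit is an assumption for illustration only:

```python
#!/usr/bin/env python
"""Report processes whose resident memory exceeds an assumed per-slot limit."""
import os

RSS_LIMIT_KB = 2 * 1024 * 1024   # assumed 2 GB physical memory per job slot

def rss_kb(pid):
    """Return the resident set size of a process in kB, read from /proc."""
    try:
        with open("/proc/%s/status" % pid) as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except IOError:
        pass  # the process may have exited in the meantime
    return 0

for pid in filter(str.isdigit, os.listdir("/proc")):
    rss = rss_kb(pid)
    if rss > RSS_LIMIT_KB:
        print("PID %s exceeds the assumed per-slot RSS limit: %d kB" % (pid, rss))
```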
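Concerning the pwd/symlink issue above: a minimal illustration of the difference between the physical and the logical working directory, assuming the script runs in a directory that was entered through a symlink and that the shell exports PWD as usual:

```python
import os

# Physical path: symlinks are resolved, e.g. /SAT/lustre/... on NFS Lustre clients.
print(os.getcwd())

# Logical path as set by the shell, e.g. /lustre/... -- this is what scripts should use.
print(os.environ.get("PWD"))
```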