GSI-Computing Users Meeting
Description
Regular meeting between IT and the scientific users of the GSI computing environment (batch farm, Lustre file system, ...):
1. Problems are discussed and improvements are proposed.
2. Both IT and the experiments briefly present their respective plans for the next 4 weeks.
Every interested user group should send at least one representative to take part in the technical discussions about the current state and the future of the GSI computing environment.
Today's meeting is chaired by Christopher Huhn.
Participants:
- Benjamin Dönigus, Alice
- Paul Görgen, Accelerator Physics
- Klaus Götzen, Panda
- Christopher Huhn, HPC
- Peter Malzacher, SC
- Jochen Markert, Hades
- Thomas Neff, Theory
- Bastian Neuburger, HPC
- Victor Penso, HPC
- Carsten Preuss, SC
- Jochen Thaeder, Alice
- Jan Trautmann, HPC
Status Mini Cube/Prometheus/Hera
- Overall operation status is quite satisfactory
- Hades and Alice need Project definitions for GridEngine to get more jobs on Prometheus
- Action item: Implement Projects on Prometheus (BN, VP)
- Exact shares are to be defined later (task for the GSI FAIR Computing Meeting)
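For illustration only, a minimal Python sketch of how jobs would be tagged with such a project once it exists, so that the scheduler can apply per-project shares. The project name "alice", the script name and the use of qsub -P follow standard GridEngine usage and are assumptions, not the agreed Prometheus setup.

    import subprocess

    def submit_under_project(project, script, *args):
        """Submit a batch script under a GridEngine project via 'qsub -P'."""
        cmd = ["qsub", "-P", project, script, *args]
        # check_output raises CalledProcessError if qsub rejects the job,
        # e.g. because the project does not exist or the user is not on its user list.
        return subprocess.check_output(cmd, text=True).strip()

    # Hypothetical usage once an "alice" project has been created with 'qconf -aprj':
    print(submit_under_project("alice", "reco_job.sh"))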
- MPI jobs on Prometheus
- GridEngine master sometimes crashes after big MPI jobs finish
- May be caused by overload on the master but no concrete evidence
- Action item: BN and VP apply a patch that promises to fix the problem
- MPI jobs with resource requirements exceeding the standard queue never start?
- This only happens if per-user resource limits are defined (which is the default for all users)
- Action item: Needs more investigation (CP, TN, BN, VP)
- GridEngine queue with 3GB memory limit?
- Alice has jobs slightly larger than 2GB that don't need the full 8GB of the current highmem queue
- Resource limits should be adjusted (semi-automatically?) to suit the workload of the farm and both the specified and the actual resource requirements of the jobs
- Automatic cleanup of /tmp partitions on Prometheus
- Existing mechanism needs to be deployed to the new farm
- Reduction of the job scheduling interval
- Would provide faster feedback on job submission success
- Action Item: Find out the minimal job submission interval (BN, VP)
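As a starting point for that investigation, a small Python sketch that reads the currently configured interval from the scheduler configuration; it assumes the standard qconf -ssconf output with a schedule_interval line and is not a tested tool.

    import subprocess

    def current_schedule_interval():
        """Return the schedule_interval value from the GridEngine scheduler configuration."""
        out = subprocess.check_output(["qconf", "-ssconf"], text=True)
        for line in out.splitlines():
            if line.startswith("schedule_interval"):
                # The value is given as HH:MM:SS, e.g. "schedule_interval  0:2:0"
                return line.split()[1]
        return None

    print("current schedule_interval:", current_schedule_interval())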
- Handling of jobs with impossible resource requirements
- These jobs are never scheduled and stay in the queue
- Action Item: (Semi-)automatic cleanup of jobs that are not scheduled during a given interval (HPC)
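One possible shape for that cleanup, sketched in Python: list pending jobs, keep those older than a threshold, and delete them (or just report them). The threshold, the dry-run default and the qstat column layout are assumptions, not agreed policy.

    import subprocess
    from datetime import datetime, timedelta

    MAX_PENDING = timedelta(days=7)   # placeholder threshold, to be agreed on

    def stale_pending_jobs(now=None):
        """Yield job IDs that have been in the pending state longer than MAX_PENDING."""
        now = now or datetime.now()
        out = subprocess.check_output(["qstat", "-s", "p", "-u", "*"], text=True)
        for line in out.splitlines()[2:]:          # skip the two header lines
            fields = line.split()
            if len(fields) < 7 or "qw" not in fields[4]:
                continue
            submitted = datetime.strptime(f"{fields[5]} {fields[6]}", "%m/%d/%Y %H:%M:%S")
            if now - submitted > MAX_PENDING:
                yield fields[0]

    def cleanup(dry_run=True):
        for job_id in stale_pending_jobs():
            if dry_run:
                print(f"would delete job {job_id}")
            else:
                subprocess.run(["qdel", job_id], check=False)

    if __name__ == "__main__":
        cleanup()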
- Access to gStore from Prometheus/Hera
- Tight bandwidth limitations when copying via worker node
- Full 40Gbit/s when writing directly from gStore data movers to Hera
- Action Item: Check the status of the Hera/gStore connection (Horst Göhringer, Thomas Roth)
Other Topics
- Backup of certain Lustre subtrees?
- Desired by Alice for software development
- aliroot not ready for out-of-source-tree installs
- Copying the software from a backed-up location (/u?) seems to be the superior solution
- Alice Storage Element on top of Lustre
- 10 Gbit link to Karlsruhe idle for now
- Requires (budget for) procurement of additional Lustre storage
- Backup of CVMFS server
- VP points out that the experiment groups should take care of the reproducibility of their software deployment
- However, re-deployment after a disaster may take some time
- Backup of the build host(s) may be the best solution
- Action Item: investigate the possibilities (HPC)
- Action Item: TN requests environment variables that indicate which cluster a worker node belongs to (HPC)
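On the job side this could look roughly like the sketch below; the variable name GSI_CLUSTER and its values are placeholders, the actual interface is up to HPC.

    import os

    def current_cluster(default="unknown"):
        """Return the cluster this worker node belongs to, e.g. 'prometheus' or 'hera'."""
        return os.environ.get("GSI_CLUSTER", default)

    # A job script could then pick cluster-specific settings:
    if current_cluster() == "prometheus":
        scratch_dir = "/tmp"                              # assumption: local scratch on the new farm
    else:
        scratch_dir = os.environ.get("TMPDIR", "/tmp")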
Open Action Items
- Limiting the physical memory usage on GridEngine worker nodes (12.9.11: IT)
- Problem: GridEngine/ulimit can only limit the virtual memory size (VSZ), not the physical memory actually used
- Aliroot reserves more memory than it actually uses afterwards, therefore the virtual-memory limit must be set higher than the intended RAM limit per job slot.
- CP: Cron job for monitoring is ready for production but not necessary atm. on Prometheus
- Action item closed
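For reference, the limitation can be reproduced directly with setrlimit; this only illustrates why the virtual size can be capped while physical memory cannot, and the 3 GB figure is an arbitrary example.

    import resource

    three_gb = 3 * 1024 ** 3

    # The virtual address space (what 'ulimit -v' and, typically, GridEngine's
    # h_vmem act on) can be enforced: allocations beyond the limit fail.
    resource.setrlimit(resource.RLIMIT_AS, (three_gb, three_gb))

    # An RSS limit is accepted by the call but not enforced by current Linux
    # kernels, so the physical memory actually used cannot be capped this way.
    resource.setrlimit(resource.RLIMIT_RSS, (three_gb, three_gb))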
- Upgrade of interactive ALICE nodes lxir35-38 to Squeeze (16.01.12: JT)
- JT still needs to test whether Alice software built on Squeeze runs on Lenny
- Not discussed
- Upgrade of PANDA and CBM Storage Elements from Etch to Squeeze (16.01.12: KS)
- Not discussed
- Shutdown of Etch32 and Etch64, Lenny and Squeeze 32bit queues (16.01.12: FOPI, experiments)
- FOPI software runs on Lenny64
- No requests for 32bit batch queues atm.
- Action item closed
- Wrong Lustre paths in Perl scripts (06.02.12: JM, IT)
- The problem is neither related to Perl nor to Lustre, but occurs e.g. when calling pwd from inside a script in a directory path that contains symlinks: the CWD path gets normalized (e.g. /lustre/ becomes /SAT/lustre/ on NFS Lustre clients)
- Solution: Use the environment variable PWD instead
- Action item closed
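The effect is not specific to Perl; a minimal Python illustration of the recommendation (run from a directory reached through a symlink, e.g. below /lustre on an NFS Lustre client):

    import os

    # os.getcwd() asks the kernel and therefore returns the physical path with
    # all symlinks resolved (e.g. /SAT/lustre/...), while the PWD variable
    # exported by the shell keeps the logical path the user actually cd'ed into
    # (e.g. /lustre/...).
    physical = os.getcwd()
    logical = os.environ.get("PWD", physical)

    print("physical path:", physical)
    print("logical path: ", logical)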
- CVMFS on Squeeze desktops (06.02.12: JT, IT)
- No progress yet
- Farm-wide availability of Modules (06.02.12: JT, IT)
- No progress yet
- WM would like to know when the CBM desktop (and group server) machines can be moved to Squeeze (05.03.12: WM, FU, CH)
- No progress: No CBM representative attended the meeting
- Upgrade to Squeeze possible at any time
- Action item closed
- CVMFS server for the test Cluster (B) runs on old hardware. No funding for new hardware atm. (05.03.12: VP, WS)
- CH reminds WS to try to coordinate funding and procurement of new hardware