GSI-Computing Users Meeting
Description
Regular meeting between IT and the scientific users of the GSI computing environment (batch farm, Lustre file system, ...):
1. Problems are discussed and improvements are proposed.
2. Both IT and the experiments briefly present their respective plans for the next 4 weeks.
Every interested user group should send at least one representative to take part in the technical discussions about the current state and the future of the GSI computing environment.
Today's meeting is chaired by Christopher Huhn.
Participants:
- Benjamin Dönigus, Alice
- Paul Görgen, Accelerator Physics
- Klaus Götzen, Panda
- Christopher Huhn, HPC
- Peter Malzacher, SC
- Jochen Markert, Hades
- Thomas Neff, Theory
- Bastian Neuburger, HPC
- Victor Penso, HPC
- Carsten Preuss, SC
- Jochen Thaeder, Alice
- Jan Trautmann, HPC
Status Mini Cube/Prometheus/Hera
- Overall operation status is quite satisfactory
- Hades and Alice need Project definitions for GridEngine to get more jobs on Prometheus
- Action item: Implement Projects on Prometheus (BN, VP)
- Exact shares are to be defined later (task for the GSI FAIR Computing Meeting)
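For illustration only, a minimal Python sketch of how jobs would be tagged with such a project once it exists, so that the scheduler can apply per-project shares. The project name "alice", the script name and the use of qsub -P follow standard GridEngine usage and are assumptions, not the agreed Prometheus setup.

    import subprocess

    def submit_under_project(project, script, *args):
        """Submit a batch script under a GridEngine project via 'qsub -P'."""
        cmd = ["qsub", "-P", project, script, *args]
        # check_output raises CalledProcessError if qsub rejects the job,
        # e.g. because the project does not exist or the user is not on its user list.
        return subprocess.check_output(cmd, text=True).strip()

    # Hypothetical usage once an "alice" project has been created with 'qconf -aprj':
    print(submit_under_project("alice", "reco_job.sh"))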
- MPI jobs on Prometheus
- GridEngine master sometimes crashes after big MPI jobs finish
- May be caused by overload on the master but no concrete evidence
- Action item: BN and VP apply a patch that promises to fix the problem
- MPI jobs with resource requirements exceeding the standard queue never start?
- This only happens if per-user resource limits are defined (which is the default for all users)
- Action item: Needs more investigation (CP, TN, BN, VP)
- GridEngine queue with 3GB memory limit?
- Alice has jobs slightly larger than 2GB that don't need the full 8GB of the current highmem queue
- Resource limits should be adjusted (semi-automatically?) to suit the workload of the farm and both the specified and the actual resource requirements of the jobs
- Automatic cleanup of /tmp partitions on Prometheus
- Existing mechanism needs to be deployed to the new farm
- Reduction of the job scheduling interval
- Would provide faster feedback on job submission success
- Action Item: Find out the minimal job submission interval (BN, VP)
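As a starting point for that investigation, a small Python sketch that reads the currently configured interval from the scheduler configuration; it assumes the standard qconf -ssconf output with a schedule_interval line and is not a tested tool.

    import subprocess

    def current_schedule_interval():
        """Return the schedule_interval value from the GridEngine scheduler configuration."""
        out = subprocess.check_output(["qconf", "-ssconf"], text=True)
        for line in out.splitlines():
            if line.startswith("schedule_interval"):
                # The value is given as HH:MM:SS, e.g. "schedule_interval  0:2:0"
                return line.split()[1]
        return None

    print("current schedule_interval:", current_schedule_interval())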
- Handling of jobs with impossible resource requirements
- These jobs are never scheduled and stay in the queue
- Action Item: (Semi-)automatic cleanup of jobs that are not scheduled during a given interval (HPC)
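One possible shape for that cleanup, sketched in Python: list pending jobs, keep those older than a threshold, and delete them (or just report them). The threshold, the dry-run default and the qstat column layout are assumptions, not agreed policy.

    import subprocess
    from datetime import datetime, timedelta

    MAX_PENDING = timedelta(days=7)   # placeholder threshold, to be agreed on

    def stale_pending_jobs(now=None):
        """Yield job IDs that have been in the pending state longer than MAX_PENDING."""
        now = now or datetime.now()
        out = subprocess.check_output(["qstat", "-s", "p", "-u", "*"], text=True)
        for line in out.splitlines()[2:]:          # skip the two header lines
            fields = line.split()
            if len(fields) < 7 or "qw" not in fields[4]:
                continue
            submitted = datetime.strptime(f"{fields[5]} {fields[6]}", "%m/%d/%Y %H:%M:%S")
            if now - submitted > MAX_PENDING:
                yield fields[0]

    def cleanup(dry_run=True):
        for job_id in stale_pending_jobs():
            if dry_run:
                print(f"would delete job {job_id}")
            else:
                subprocess.run(["qdel", job_id], check=False)

    if __name__ == "__main__":
        cleanup()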
- Access to gStore from Prometheus/Hera
- Tight bandwidth limitations when copying via worker node
- Full 40Gbit/s when writing directly from gStore data movers to Hera
- Action Item: Check the status of the Hera/gStore connection (Horst Göhringer, Thomas Roth)
Other Topics
- Backup of certain Lustre subtrees?
- Desired by Alice for software development
- aliroot not ready for out-of-source-tree installs
- Copying the software from a backed-up location (/u?) seems to be the superior solution
- Alice Storage Element on top of Lustre
- 10 Gbit link to Karlsruhe idle for now
- Requires (budget for) procurement of additional Lustre storage
- Backup of CVMFS server
- VP points out that the experiment groups should take care of the reproducibility of their software deployment
- However, re-deployment after a disaster may take some time
- Backup of the build host(s) may be the best solution
- Action Item: investigate the possibilities (HPC)
- Action Item: TN requests environment variables that indicate which cluster a worker node belongs to (HPC)
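On the job side this could look roughly like the sketch below; the variable name GSI_CLUSTER and its values are placeholders, the actual interface is up to HPC.

    import os

    def current_cluster(default="unknown"):
        """Return the cluster this worker node belongs to, e.g. 'prometheus' or 'hera'."""
        return os.environ.get("GSI_CLUSTER", default)

    # A job script could then pick cluster-specific settings:
    if current_cluster() == "prometheus":
        scratch_dir = "/tmp"                              # assumption: local scratch on the new farm
    else:
        scratch_dir = os.environ.get("TMPDIR", "/tmp")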
Open Action Items
- Limiting the physical memory usage on GridEngine worker nodes (12.9.11: IT)
- Problem: GridEngine/ulimit can only limit the virtual memory size (VSZ), not the physical memory actually used
- Aliroot reserves more memory than it actually uses afterwards, therefore the virtual-memory limit must be set higher than the intended RAM limit per job slot.
- CP: Cron job for monitoring is ready for production but not necessary atm. on Prometheus
- Action item closed
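For reference, the limitation can be reproduced directly with setrlimit; this only illustrates why the virtual size can be capped while physical memory cannot, and the 3 GB figure is an arbitrary example.

    import resource

    three_gb = 3 * 1024 ** 3

    # The virtual address space (what 'ulimit -v' and, typically, GridEngine's
    # h_vmem act on) can be enforced: allocations beyond the limit fail.
    resource.setrlimit(resource.RLIMIT_AS, (three_gb, three_gb))

    # An RSS limit is accepted by the call but not enforced by current Linux
    # kernels, so the physical memory actually used cannot be capped this way.
    resource.setrlimit(resource.RLIMIT_RSS, (three_gb, three_gb))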
- Upgrade of interactive ALICE nodes lxir35-38 to Squeeze (16.01.12: JT)
- JT still needs to test whether Alice software built on Squeeze runs on Lenny
- Not discussed
- Upgrade of PANDA and CBM Storage Elements from Etch to Squeeze (16.01.12: KS)
- Not discussed
- Shutdown of Etch32 and Etch64, Lenny and Squeeze 32bit queues (16.01.12: FOPI, experiments)
- FOPI software runs on Lenny64
- No requests for 32bit batch queues atm.
- Action item closed
- Wrong Lustre paths in Perl scripts (06.02.12: JM, IT)
- The problem is neither related to Perl nor to Lustre, but occurs e.g. when calling pwd from inside a script in a directory path that contains symlinks: the CWD path gets normalized (e.g. /lustre/ becomes /SAT/lustre/ on NFS Lustre clients)
- Solution: Use the environment variable PWD instead
- Action item closed
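The effect is not specific to Perl; a minimal Python illustration of the recommendation (run from a directory reached through a symlink, e.g. below /lustre on an NFS Lustre client):

    import os

    # os.getcwd() asks the kernel and therefore returns the physical path with
    # all symlinks resolved (e.g. /SAT/lustre/...), while the PWD variable
    # exported by the shell keeps the logical path the user actually cd'ed into
    # (e.g. /lustre/...).
    physical = os.getcwd()
    logical = os.environ.get("PWD", physical)

    print("physical path:", physical)
    print("logical path: ", logical)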
- CVMFS on Squeeze desktops (06.02.12: JT, IT)
- No progress yet
- Farm-wide availability of Modules (06.02.12: JT, IT)
- No progress yet
- WM would like to know when the CBM desktop (and group server) machines can be moved to Squeeze (05.03.12: WM, FU, CH)
- No progress: No CBM representative attended the meeting
- Upgrade to Squeeze possible at any time
- Action item closed
- CVMFS server for the test Cluster (B) runs on old hardware. No funding for new hardware atm. (05.03.12: VP, WS)
- CH reminds WS to try to coordinate funding and procurement of new hardware