Participants:
- Benjamin Dönigus, Alice
- Paul Görgen, Accelerator Physics
- Klaus Götzen, Panda
- Christopher Huhn, HPC
- Peter Malzacher, SC
- Jochen Markert, Hades
- Thomas Neff, Theory
- Bastian Neuburger, HPC
- Victor Penso, HPC
- Carsten Preuss, SC
- Jochen Thaeder, Alice
- Jan Trautmann, HPC
Status Mini Cube/Prometheus/Hera
- Overall operation status is quite satisfactory
- Hades and Alice need Project definitions for GridEngine to get more jobs on Prometheus
  - Action item: Implement Projects on Prometheus (BN, VP); see the command sketch below this item
  - Exact shares are to be defined later (task for the GSI FAIR Computing Meeting)
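  For illustration, a minimal sketch of how Projects could be registered and jobs submitted against them, assuming the usual SGE project attributes and the qconf -Aprj / qsub -P interface; the project names, the functional share and the job script are placeholders, not values agreed in the meeting:

    #!/usr/bin/env python
    # Illustration only: register GridEngine projects for the experiment groups
    # and submit a job under one of them (qconf -Aprj requires manager rights).
    # Project names, the functional share and "myjob.sh" are placeholders.
    import os
    import subprocess
    import tempfile

    def add_project(name, fshare=100):
        """Register a project from a temporary definition file (qconf -Aprj)."""
        definition = (
            "name    {0}\n"
            "oticket 0\n"
            "fshare  {1}\n"
            "acl     NONE\n"
            "xacl    NONE\n"
        ).format(name, fshare)
        with tempfile.NamedTemporaryFile("w", suffix=".prj", delete=False) as prj:
            prj.write(definition)
            path = prj.name
        try:
            subprocess.check_call(["qconf", "-Aprj", path])
        finally:
            os.unlink(path)

    for project in ("alice", "hades"):
        add_project(project)

    # Jobs are then accounted against a project by submitting with -P:
    subprocess.check_call(["qsub", "-P", "alice", "myjob.sh"])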
- MPI jobs on Prometheus
  - GridEngine master sometimes crashes after big MPI jobs finish
    - May be caused by overload on the master, but no concrete evidence
    - Action item: BN and VP apply a patch that promises to fix the problem
  - MPI jobs with resource requirements exceeding the standard queue never start?
    - Only if resources per user are defined (this is the default for all users)
    - Action item: Needs more investigation (CP, TN, BN, VP)
- GridEngine queue with a 3 GB memory limit?
  - Alice has jobs slightly larger than 2 GB that don't need the full 8 GB of the current highmem queue
  - Resource limits should be adjusted (semi-automatically?) to the workload of the farm and to both the specified and the actual resource requirements of the jobs
- Automatic cleanup of /tmp partitions on Prometheus
  - The existing mechanism needs to be deployed to the new farm (a cron-job sketch follows below)
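  For illustration, a minimal sketch of such a cleanup as it could run from cron on the worker nodes; the 7-day retention is an assumption, not the setting of the existing mechanism:

    #!/usr/bin/env python
    # Sketch of a /tmp cleanup cron job for the Prometheus worker nodes.
    # The 7-day retention is an assumption, not the value of the existing
    # mechanism mentioned in the meeting.
    import os
    import shutil
    import time

    TMP_DIR = "/tmp"
    MAX_AGE = 7 * 24 * 3600  # seconds

    now = time.time()
    for name in os.listdir(TMP_DIR):
        path = os.path.join(TMP_DIR, name)
        try:
            # lstat: judge symlinks by their own age, not their target's
            if now - os.lstat(path).st_mtime < MAX_AGE:
                continue
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path, ignore_errors=True)
            else:
                os.remove(path)
        except OSError:
            # entries owned by other users or removed concurrently: skip
            pass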
- Reduction of the job scheduling interval
  - Would provide faster feedback on job submission success
  - Action item: Find out the minimal scheduling interval (BN, VP); see the sketch below for where the parameter lives
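  For reference, a small sketch of where the knob lives: the interval is the schedule_interval parameter of the GridEngine scheduler configuration; whether sub-minute values are safe on Prometheus is exactly the open question:

    #!/usr/bin/env python
    # Read the current GridEngine scheduler interval from the scheduler
    # configuration (qconf -ssconf, parameter "schedule_interval").
    # Changing it would be done via qconf -msconf.
    import subprocess

    sconf = subprocess.check_output(["qconf", "-ssconf"]).decode()
    for line in sconf.splitlines():
        if line.startswith("schedule_interval"):
            print("current schedule_interval: " + line.split()[1])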
- Handling of jobs with impossible resource requirements
  - These jobs are never scheduled and stay in the queue
  - Action item: (Semi-)automatic cleanup of jobs that are not scheduled within a given interval (HPC); a possible shape is sketched below
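  For illustration, a possible shape of such a cleanup built on plain qstat/qdel; the 14-day cut-off, the dry-run default and the assumed qstat column layout would need to be adapted by HPC:

    #!/usr/bin/env python
    # Sketch of the proposed (semi-)automatic cleanup: list (and optionally
    # delete) jobs that have been pending longer than a cut-off.  The 14-day
    # cut-off, the dry-run default and the assumed qstat column layout
    # (submit date/time in columns 6 and 7) are assumptions.
    import datetime
    import subprocess

    MAX_PENDING = datetime.timedelta(days=14)
    DRY_RUN = True  # set to False to actually delete the jobs

    now = datetime.datetime.now()
    pending = subprocess.check_output(["qstat", "-u", "*", "-s", "p"]).decode()

    for line in pending.splitlines()[2:]:  # skip the two header lines
        fields = line.split()
        if len(fields) < 7:
            continue
        job_id, owner = fields[0], fields[3]
        submitted = datetime.datetime.strptime(
            fields[5] + " " + fields[6], "%m/%d/%Y %H:%M:%S")
        if now - submitted > MAX_PENDING:
            print("stale pending job %s (owner %s, submitted %s)"
                  % (job_id, owner, submitted))
            if not DRY_RUN:
                subprocess.check_call(["qdel", job_id])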
- Access to gStore from Prometheus/Hera
  - Tight bandwidth limitations when copying via a worker node
  - Full 40 Gbit/s when writing directly from the gStore data movers to Hera
  - Action item: Check the status of the Hera/gStore connection (Horst Göhringer, Thomas Roth)
Other Topics
- Backup of certain Lustre subtrees?
  - Desired by Alice for software development
  - aliroot is not ready for out-of-source-tree installs
  - Copying the software from a backed-up location (/u?) seems to be the superior solution
- Alice Storage Element on top of Lustre
  - The 10 Gbit link to Karlsruhe is idle for now
  - Requires (budget for) procurement of additional Lustre storage
- Backup of the CVMFS server
  - VP points out that the experiment groups should take care of the reproducibility of their software deployment
  - However, re-deployment after a disaster may take some time
  - A backup of the build host(s) may be the best solution
  - Action item: Investigate the possibilities (HPC)
- Action item: TN requests environment variables that indicate which cluster a worker node belongs to (HPC)
Open Action Items
- Limiting the physical memory usage on GridEngine worker nodes (12.9.11: IT)
  - Problem: GridEngine/ulimit can only limit the virtual memory size (VSS)
  - Aliroot reserves more RAM than it actually uses afterwards; therefore the VSS limit must be higher than the actual RAM limit per job slot
  - CP: The cron job for monitoring is ready for production but not necessary at the moment on Prometheus
  - Action item closed
- Upgrade of interactive ALICE nodes lxir35-38 to Squeeze (16.01.12: JT)
  - JT still needs to test whether Alice software built on Squeeze runs on Lenny
  - Not discussed
- Upgrade of PANDA and CBM Storage Elements from Etch to Squeeze (16.01.12: KS)
- Shutdown of the Etch32 and Etch64 queues and of the Lenny and Squeeze 32-bit queues (16.01.12: FOPI, experiments)
  - FOPI software runs on Lenny64
  - No requests for 32-bit batch queues at the moment
  - Action item closed
- Wrong Lustre paths in Perl scripts (06.02.12: JM, IT)
  - The problem is related neither to Perl nor to Lustre, but occurs when e.g. calling pwd from inside a script in a directory path that contains symlinks: the CWD path gets normalized (e.g. /lustre/ becomes /SAT/lustre/ on NFS Lustre clients)
  - Solution: Use the environment variable PWD instead (see the short illustration below)
  - Action item closed
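  A short illustration of the underlying behaviour (in Python, since the effect is independent of Perl): functions that resolve the current directory return the physical path, while the shell-maintained PWD variable keeps the logical one:

    #!/usr/bin/env python
    # getcwd() returns the physical path with symlinks resolved
    # (e.g. /SAT/lustre/... on the NFS Lustre clients), while the
    # shell-maintained PWD variable keeps the logical path that was
    # actually entered (e.g. /lustre/...).
    import os

    print("physical cwd: " + os.getcwd())
    print("logical  cwd: " + os.environ.get("PWD", ""))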
- CVMFS on Squeeze desktops (06.02.12: JT, IT)
- Farm-wide availability of Modules (06.02.12: JT, IT)
- WM would like to know when the CBM desktop (and group server) machines can be moved to Squeeze (05.03.12: WM, FU, CH)
  - No progress: no CBM representative attended the meeting
  - Upgrade to Squeeze possible at any time
  - Action item closed
- CVMFS server for the test cluster (B) runs on old hardware; no funding for new hardware at the moment (05.03.12: VP, WS)
  - CH reminds WS to try to coordinate funding and procurement of new hardware