GSI-Computing-Users-Meeting 04/06/2012
Participants:
- Mohammad Al-Turany, SC
- Benjamin Dönigus, Alice
- Tetyana Galatyuk, CBM
- Paul Görgen, Accelerator Physics
- Klaus Götzen, Panda
- Christopher Huhn, HPC
- Silvia Masciocchi, Alice
- Jochen Markert, Hades
- Matthias Pausch, HPC
- Carsten Preuss, SC
- Thomas Roth, HPC
- Kilian Schwarz, SC
- Jochen Thaeder, Alice
- Decommissioning of the LSF cluster
- old analysis software may still be required for some time (until all (3) dissertations are finished?)
- keeping an idle farm running consumes power and costs money
- Debian Etch batch nodes already decommissioned
- Lenny batch nodes to be shut down by Q1/2013
- sufficiently recent hardware can be reinstalled with Squeeze and join Icarus
- Grid jobs can move from LSF to Icarus any time
- HPC proposes a Squeeze/GridEngine "convenience cluster" with shared home directories in the style of the LSF cluster
- interest seems to be very limited
- Action Item: Plan the LSF decommissioning and hardware recycling (HPC)
- Partial backup of Lustre
- Required for software development on Lustre/Icarus/Prometheus
- Software should be kept in a revision control system and not be stored on Lustre because of the meta-data performance impairment
- aliroot cannot be installed out-of-tree
- no other central file-systems available
- CVMFS not suitable for software development
- Action Item: Check if version control can serve as a replacement for backup (JT) - see the sketch below
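- A possible way to check this (sketch only; git is assumed and the paths are placeholders): keep a bare mirror of each source repository on an independent file system and refresh it regularly, which preserves the full history without a classic file backup, e.g.:
    # one-time setup of the mirror (placeholder paths)
    git clone --mirror /path/to/work/repository.git /path/to/independent/storage/repository.git
    # periodic refresh, e.g. from a cron job
    cd /path/to/independent/storage/repository.git && git remote update --prune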
- /lustre to /hera migration
- ~50% of the Alice data already copied
- central data on /lustre to be deleted soon
- Alice users have not migrated yet
- current plan for Grid jobs on Lustre unclear
- dedicated nodes for data migration fail quite often
- caused by interference between Alice and Hades?
- write access from gStore to hera via LNET routers also very slow
- room for manoeuvre is strongly limited by the small amount of free space
- Lustre servers from C25 will presumably move to hera (TH) after being drained
- C25 downtime inevitable in November 2012
- the move is planned, but the current priority is to empty old servers for decommissioning
- Squeeze desktops
- are ready for use
- printing requires a new print server
- not all printers are configured on the new server yet
- GridEngine
- MPI jobs not scheduled outside the default queue
- master jobs require more than 2 GB of memory
- possible solutions
- memory overbooking?
- low-mem-queue?
- different scheduling order
- different default queues depending on the project - not implementable
- Group/project admins
- no built-in support in GridEngine
- Action Item: check whether this can be implemented via sudo -u group_member [ qdel | qmod | ... ] (HPC) - see the sketch below
- GridEngine training pointless for Alice users
- Small job limits still effective for Hades
- Action Item: Increase job limits for Hades (HPC)
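- A sketch of how the sudo approach could look (the group names hades and hades_admins as well as the /usr/bin paths are placeholder assumptions, not the actual configuration):
    # /etc/sudoers.d/group-admins (sketch only)
    # members of hades_admins may run the GridEngine job commands
    # as any member of the hades group, without a password prompt
    %hades_admins ALL = (%hades) NOPASSWD: /usr/bin/qdel, /usr/bin/qmod
    # a group admin would then act on a member's job like this:
    #   sudo -u group_member qdel <job_id>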
- Monitoring of the farm still to be improved
- HPC has an open position dedicated to monitoring
Action Items
- Implement Projects on Prometheus (BN, VP)
- Implemented for Alice
- Patch for the GridEngine master to fix crashes (BN, VP)
- Fix ready, but the crashes have disappeared
- MPI jobs never scheduled outside the default queue? (CP, TN, BN, VP)
- Tweak the minimal job submission interval (BN, VP)
- Reduced to 2 seconds
- Cleanup of jobs that are not scheduled during a given interval (HPC)
- Status of Hera/gStore connection? (Horst Göringer, Thomas Roth)
- Implemented in test environment
- Investigate the possibilities for a backup of the CVMFS servers (HPC)
- High availability may be more appropriate
- Backup of build hosts to be evaluated
- Environment variables that indicate to which cluster a worker node belongs (HPC)
- Implemented, but documentation is still missing - see the usage sketch below
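- Possible usage on a worker node (sketch only; the actual variable name is not documented yet, GSI_CLUSTER is a hypothetical placeholder):
    case "${GSI_CLUSTER:-unknown}" in
      icarus)     echo "running on the Icarus farm" ;;
      prometheus) echo "running on the Prometheus farm" ;;
      *)          echo "cluster not identified" ;;
    esac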
Other open action items:
- Upgrade of interactive ALICE nodes lxir35-38 to Squeeze (16.01.12: JT)
- Canceled
- Upgrade of PANDA and CBM Storage Elements from Etch to Squeeze (16.01.12: KS)
- CVMFS on Squeeze desktops (06.02.12: JT, IT)
- Farm-wide availability of Modules (06.02.12: JT, IT)
- Funding for new hardware, e.g. CVMFS server (05.03.12: VP, WS, CH)
-- ChristopherHuhn - 28 Jun 2012