GSI-Computing-Users-Meeting 04/06/2012
Participants:
- Mohammad Al-Turany, SC
- Benjamin Dönigus, Alice
- Tetyana Galatyuk, CBM
- Paul Görgen, Accelerator Physics
- Klaus Götzen, Panda
- Christopher Huhn, HPC
- Silvia Masciocchi, Alice
- Jochen Markert, Hades
- Matthias Pausch, HPC
- Carsten Preuss, SC
- Thomas Roth, HPC
- Kilian Schwarz, SC
- Jochen Thaeder, Alice
- Decommissioning of the LSF cluster
- old analysis software may still be required for some time (until all (3) dissertations are finished?)
- keeping an idle farm running consumes power and costs money
- Debian Etch batch nodes already decommissioned
- Lenny batch nodes to be shut down by Q1/2013
- sufficiently recent hardware can be reinstalled with Squeeze and join Icarus
- Grid jobs can move from LSF to Icarus any time
- HPC proposes a Squeeze/GridEngine "convenience cluster" with shared home directories in the style of the LSF cluster
- interest seems to be very limited
- Action Item: Plan the LSF decommissioning and hardware recycling (HPC)
- Partial backup of Lustre
- Required for software development on Lustre/Icarus/Prometheus
- Software should be kept in a revision control system and not be stored on Lustre because of the meta-data performance impairment
- aliroot cannot be installed out-of-tree
- no other central file-systems available
- CVMFS not suitable for software development
- Action Item: Check if version control can serve as a replacement for backup (JT) - see the sketch below
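- A possible way to check this (sketch only; git is assumed and the paths are placeholders): keep a bare mirror of each source repository on an independent file system and refresh it regularly, which preserves the full history without a classic file backup, e.g.:
    # one-time setup of the mirror (placeholder paths)
    git clone --mirror /path/to/work/repository.git /path/to/independent/storage/repository.git
    # periodic refresh, e.g. from a cron job
    cd /path/to/independent/storage/repository.git && git remote update --prune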
- /lustre to /hera migration
- ~50% of the Alice data already copied
- central data on /lustre to be deleted soon
- Alice users have not migrated yet
- current plan for Grid jobs on Lustre unclear
- dedicated nodes for data migration fail quite often
- caused by interference between Alice and Hades?
- write access from gStore to hera via LNET routers also very slow
- room for manoeuvre is strongly limited by the small amount of free space
- Lustre servers from C25 will presumably move to hera (TH) after being drained
- C25 downtime inevitable in November 2012
- the move is planned, but the current priority is to empty old servers for decommissioning
- Squeeze desktops
- are ready for use
- printing requires a new print server
- not all printers are configured on the new server yet
- GridEngine
- MPI jobs not scheduled outside the default queue
- master jobs require more than 2 GB of memory
- possible solutions
- memory overbooking?
- low-mem-queue?
- different scheduling order
- different default queues depending on the project - not implementable
- Group/project admins
- no built-in support in GridEngine
- Action Item: check whether this can be implemented via sudo -u group_member [ qdel | qmod | ... ] (HPC) - see the sketch below
- GridEngine training pointless for Alice users
- Small job limits still effective for Hades
- Action Item: Increase job limits for Hades (HPC)
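- A sketch of how the sudo approach could look (the group names hades and hades_admins as well as the /usr/bin paths are placeholder assumptions, not the actual configuration):
    # /etc/sudoers.d/group-admins (sketch only)
    # members of hades_admins may run the GridEngine job commands
    # as any member of the hades group, without a password prompt
    %hades_admins ALL = (%hades) NOPASSWD: /usr/bin/qdel, /usr/bin/qmod
    # a group admin would then act on a member's job like this:
    #   sudo -u group_member qdel <job_id>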
- Monitoring of the farm still to be improved
- HPC has an open position dedicated to monitoring
Action Items
- Implement Projects on Prometheus (BN, VP)
- Implemented for Alice
- Patch for the GridEngine master to fix crashes (BN, VP)
- Fix ready, but the crashes have disappeared
- MPI jobs never scheduled outside the default queue? (CP, TN, BN, VP)
- Tweak the minimal job submission interval (BN, VP)
- Reduced to 2 seconds
- Cleanup of jobs that are not scheduled during a given interval (HPC)
- Status of Hera/gStore connection? (Horst Göringer, Thomas Roth)
- Implemented in test environment
- Investigate the possibilities for a backup of the CVMFS servers (HPC)
- High availability may be more appropriate
- Backup of build hosts to be evaluated
- Environment variables that indicate to which cluster a worker node belongs (HPC)
- Implemented, but documentation is still missing - see the usage sketch below
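- Possible usage on a worker node (sketch only; the actual variable name is not documented yet, GSI_CLUSTER is a hypothetical placeholder):
    case "${GSI_CLUSTER:-unknown}" in
      icarus)     echo "running on the Icarus farm" ;;
      prometheus) echo "running on the Prometheus farm" ;;
      *)          echo "cluster not identified" ;;
    esac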
Other open action items:
- Upgrade of interactive ALICE nodes lxir35-38 to Squeeze (16.01.12: JT)
- Canceled
- Upgrade of PANDA and CBM Storage Elements from Etch to Squeeze (16.01.12: KS)
- CVMFS on Squeeze desktops (06.02.12: JT, IT)
- Farm-wide availability of Modules (06.02.12: JT, IT)
- Funding for new hardware, e.g. CVMFS server (05.03.12: VP, WS, CH)
-- ChristopherHuhn - 28 Jun 2012