GSI-Computing-Users-Meeting 04/06/2012

Participants:

  • Mohammad Al-Turany, SC
  • Benjamin Dönigus, Alice
  • Tetyana Galatyuk, CBM
  • Paul Görgen, Beschleunigerphysik
  • Klaus Götzen, Panda
  • Christopher Huhn, HPC
  • Silvia Masciocchi, Alice
  • Jochen Markert, Hades
  • Matthias Pausch, HPC
  • Carsten Preuss, SC
  • Thomas Roth, HPC
  • Kilian Schwarz, SC
  • Jochen Thaeder, Alice


  • Decommissioning of the LSF cluster
    • old analysis software may still be required for some time (until all (3) dissertations are finished?)
    • keeping an idle farm running consumes power and costs money
    • Debian Etch batch nodes already decommissioned
    • Shutdown of Lenny batch nodes by Q1/2013
    • Recent enough hardware can be reinstalled with Squeeze and join Icarus
    • Grid jobs can move from LSF to Icarus any time
    • HPC proposes a Squeeze/GridEngine "convenience cluster" with shared home directories in the style of the LSF cluster
      • interest seems to be very limited
    • Action Item: Plan the LSF decommissioning and hardware recycling (HPC)

  • Partial backup of Lustre
    • Required for Software development on Lustre/Icarus/Prometheus
    • Software should be kept in a revision control system and not stored on Lustre, where it impairs metadata performance
    • aliroot cannot be installed out-of-tree
    • no other central file-systems available
    • CVMFS not suitable for software development
    • Action Item: Check if version control can serve as a replacement for backup (JT)

  • /lustre to /hera migration
    • ~ 50% of Alice data already copied
      • central data on /lustre to be deleted soon
      • Alice users have not migrated yet
    • current plan for Grid jobs on Lustre unclear
    • dedicated nodes for data migration fail quite often
      • caused by interference of Alice and Hades?
    • write access from gStore to hera via LNET routers also very slow
    • room for manoeuvre is strongly limited by the small amount of free space
      • Lustre servers from C25 will supposedly move to hera (TH) after being drained
        • C25 downtime inevitable in November 2012
        • move
      • but the current priority is to empty old servers for decommissioning

  • Squeeze desktops
    • are ready for use
      • rough edges still exist
    • printing requires new print-server
      • not all printers are configured on the new server yet

  • GridEngine
    • MPI jobs not scheduled outside the default queue
    • master jobs require more than 2 GB
    • possible solutions
      • memory overbooking?
      • low-mem-queue?
      • different scheduling order
      • different default queues depending on the project - not implementable
    • Group/project admins
      • No out-of-the-box support in GridEngine
      • Action Item: check possibility for implementation via sudo -u group_member [ qdel | qmod | ... ] (HPC)
    • GridEngine training pointless for Alice users
    • Small job limits still effective for Hades
      • Action Item: Increase job limits for Hades (HPC)
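
The sudo-based approach for group/project admins mentioned above could be wrapped roughly as follows. This is only a sketch: the function name, the whitelist, and the example user and job id are hypothetical, and a matching sudoers rule permitting the admin to run these commands as group members is assumed to exist. The script only prints the resulting command line and does not execute anything.

```python
import shlex

# Assumption: the GridEngine commands a group admin may run for a member.
ALLOWED_COMMANDS = {"qdel", "qmod"}


def build_admin_command(group_member, command, args):
    """Return the argv for running a GridEngine command as another user."""
    if command not in ALLOWED_COMMANDS:
        raise ValueError(f"command not permitted: {command}")
    if not group_member or group_member.startswith("-"):
        raise ValueError("invalid user name")
    # "--" stops sudo from treating later arguments as its own options.
    return ["sudo", "-u", group_member, "--", command, *args]


if __name__ == "__main__":
    # Hypothetical user and job id; the command is printed, not run.
    print(shlex.join(build_admin_command("alice_user", "qdel", ["4711"])))
```

Restricting the wrapper to a fixed set of commands keeps the delegated rights narrow; whether this is sufficient depends on how the sudoers rule itself is written.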

  • Monitoring of the farm still to be improved
    • HPC has an open position dedicated to monitoring

Action Items

  • Implement Projects on Prometheus (BN, VP)
    • DONE Implemented for Alice
  • Patch for the GridEngine master to fix crashes (BN, VP)
    • DONE Fix ready but crashes disappeared
  • MPI jobs never scheduled outside the default queue? (CP, TN, BN, VP)
  • Tweak the minimal job submission interval (BN, VP)
    • DONE Reduced to 2 seconds
  • Cleanup of jobs that are not scheduled during a given interval (HPC)
    • No progress
  • Status of Hera/gStore connection? (Horst Göringer, Thomas Roth)
    • Implemented in test environment
  • Investigate the possibilities for a backup of the CVMFS servers (HPC)
    • High availability may be more appropriate
    • Backup of build hosts to be evaluated
  • Environment variables that indicate which cluster a worker node belongs to (HPC)
    • Implemented but documentation missing
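
Since the documentation for the cluster environment variables is still missing, the following sketch only illustrates how a job script might use such a variable. The variable name EXAMPLE_CLUSTER_NAME is a placeholder invented for this example; the minutes do not record the real name, and the real variables may behave differently.

```python
import os

# Hypothetical variable name; the minutes do not record the real one.
CLUSTER_VARIABLE = "EXAMPLE_CLUSTER_NAME"


def current_cluster(environ=None, default="unknown"):
    """Return the normalised cluster name, or a default if it is unset."""
    env = os.environ if environ is None else environ
    return env.get(CLUSTER_VARIABLE, "").strip().lower() or default


if __name__ == "__main__":
    print(current_cluster())
```

A job script could branch on the returned name, for example to pick cluster-specific paths, and fall back to a safe default when the variable is not set.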

Other open action items:

  • Upgrade of interactive ALICE nodes lxir35-38 to Squeeze (16.01.12: JT)
    • DONE Canceled
  • Upgrade of PANDA and CBM Storage Elements from Etch to Squeeze (16.01.12: KS)
    • Postponed, no progress
  • CVMFS on Squeeze desktops (06.02.12: JT, IT)
    • In preparation
  • Farm-wide availability of Modules (06.02.12: JT, IT)
    • In preparation
  • Funding for new hardware, e.g. CVMFS server (05.03.12: VP, WS, CH)
    • Negotiations ongoing

-- ChristopherHuhn - 28 Jun 2012

Revision: r1.2 - 28 Jun 2012 - 16:13 - ChristopherHuhn