The CMS “Data Management” team is looking for a highly motivated and skilled DevOps Engineer to join our team. In this role, you will help develop, operate, and evolve the data management system of the Compact Muon Solenoid (CMS) experiment.
Come help us manage more than half an exabyte of tape storage, 250 PB of disk, and about half a million CPU cores, provided by a variety of data centers including High Performance Computing facilities. We support the central operations of CMS. We are looking for one Research Intern, employed by UW-Madison but located at Fermilab, to operate the CMS transfer system, which moves about 10 PB between CMS sites in an average week. The successful candidate would monitor the system, maintain its health, and debug transfer problems together with site admins and transfer system developers, and would be responsible for ensuring smooth operation by developing and optimizing tools and by monitoring transfers with the Rucio data management system. They would receive training and continuous technical advice and would work with the development teams of other parts of the CMS computing infrastructure.
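As a rough, illustrative sketch of what such transfer monitoring can look like in practice (assuming the standard Rucio Python client and a configured Rucio environment; the account name and the printed fields are placeholders for this example, not CMS specifics), a check for stuck transfer rules might be as simple as:

    # Illustrative sketch only: list stuck Rucio replication rules for one account
    # and print their lock counts. Assumes a configured Rucio client environment
    # (rucio.cfg and valid credentials); the account name "transfer_ops" is a
    # placeholder.
    from rucio.client import Client

    client = Client()

    for rule in client.list_replication_rules(filters={"account": "transfer_ops"}):
        if rule["state"] != "STUCK":
            continue
        print(
            "rule {id}: {scope}:{name} -> {rse_expression} "
            "(ok={locks_ok_cnt}, replicating={locks_replicating_cnt}, "
            "stuck={locks_stuck_cnt})".format(**rule)
        )

In day-to-day operations such a check would feed into dashboards and follow-up with site admins rather than stand alone.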
The Compact Muon Solenoid (CMS) is an experiment that investigates a wide range of physics using the Large Hadron Collider (LHC) at CERN in Geneva, Switzerland. The CMS collaboration uses a large, distributed computing infrastructure to record, process, and analyze CMS data and simulation at petabyte scales. Centrally, we ensure that data processing jobs run correctly and that their output is safely stored. CMS Computing Operations manages these central workflows, from processing freshly recorded data at CERN to re-processing data and producing Monte Carlo simulations at computing centers distributed worldwide. It also maintains and manages large-scale data movements between these centers and collaborates with the admins of more than 70 computing sites across the world to sustain the functional services and resources used by central workflows and by physicists analyzing detector data and simulations.
Computing systems used at and between the different sites are based on Grid technologies and use sophisticated software to achieve the required high transfer volumes and processing throughput. State-of-the-art programming technologies are used for the different systems. The processing infrastructure is based on distributed state machines that handle job submission and tracking. The data transfer system is based on a high-performance agent infrastructure, both centrally and at the individual computing centers. All systems use the latest database technologies, local to the systems as well as distributed, relational as well as NoSQL. Central services use a hierarchical HTTP cache infrastructure to provide all jobs running on the Grid with access to the alignment and calibration constants of the detector.
To check the performance of the sites, a sophisticated test suite regularly probes the services of each Grid site, allowing early discovery and mitigation of problems. Uninterrupted operation of all sites and central services is the basis for delivering all centrally produced samples to the CMS collaboration and for guaranteeing analysis access so that all physicists can produce high-quality physics results. Expert operators are needed to continually monitor all central workflows, transfers, and site performance, because most of the workflows are time-critical and very important for the overall success of the CMS experiment. A good overview is shown in the following animation:
http://cds.cern.ch/record/1541893?ln=en
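As a rough, hypothetical illustration of the kind of regular service probe mentioned above (this is not the actual CMS test suite, and the endpoint URLs are placeholders), a minimal health check could look like this:

    # Hypothetical illustration only: probe a few HTTP service endpoints and flag
    # failures. The real CMS site tests are far more sophisticated, and the URLs
    # below are placeholders, not real CMS services.
    import urllib.error
    import urllib.request

    ENDPOINTS = {
        "storage-frontend": "https://storage.example-site.org/health",
        "http-cache": "https://cache.example-site.org/health",
    }

    def probe(name, url, timeout=10):
        """Return True if the endpoint responds successfully within the timeout."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return 200 <= response.status < 400
        except (urllib.error.URLError, OSError) as exc:
            print(f"{name}: FAILED ({exc})")
            return False

    if __name__ == "__main__":
        results = {name: probe(name, url) for name, url in ENDPOINTS.items()}
        print("all probed services healthy" if all(results.values()) else "site degraded")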