Availability - A. Apollonio

The scope is the 2016 proton run; the data were prepared by the AWG and fault review experts using the AFT.

This presentation is a summary of three individual reports that were written for:

Combination of all of these;

213 days were considered, 153 days of which were dedicated to physics and special physics.

Availability ranged from a minimum of 30% to a high of around 90%, and was stable over several weeks.  The best weeks achieved around 3 fb-1.

Of the 175 + 4 = 179 fills reaching stable beams, 47% reached the end of fill, 48% were aborted, and 5% were aborted due to suspected radiation effects.  The main categories of premature aborts were UFOs and FMCM.
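As an illustration, these outcome fractions can be tallied directly from AFT-style fill records. The following minimal sketch assumes hypothetical record fields and fill numbers, not the actual AFT export format.

    from collections import Counter

    # Hypothetical fill records; the fill numbers and "end_reason" labels are illustrative only.
    fills = [
        {"fill": 5005, "end_reason": "end of fill"},
        {"fill": 5006, "end_reason": "premature abort"},
        {"fill": 5007, "end_reason": "suspected radiation effect"},
        # ... one entry per fill that reached stable beams
    ]

    counts = Counter(f["end_reason"] for f in fills)
    total = len(fills)
    for reason, n in counts.most_common():
        print(f"{reason}: {n} fills ({100 * n / total:.0f}%)")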

Short-duration stable beams are due to the intensity ramp-up.  Before MD1 and BCMS the machine was left to run for as long as possible.

In the period of physics production there were 779 faults, with 77 pre-cycles due to faults.  

Top 5 are:

The period was dominated by high-impact faults.

Big improvers versus 2015:

Conclusions:

J. Wenninger - what is in the operation section of the pie-chart? Can it be separated?

A. Apollonio - we can quantify it, but not automatically.

L. Ponce - the column for operation mode can be extracted, but we need an automated means to correlate this.

M. Lamont - the stability of the machine's operational conditions appears to influence the stability of the LHC availability.  Does keeping the operational conditions stable mean that systems will keep (or have kept) the same availability?

A. Apollonio - we will see next year.  The comparison of 2015 to 2016 is difficult, as things like BCMS and the bunch spacing have changed the operational conditions of the machine.

L. Ponce - 2015 was dominated by cryogenic recovery and stability issues; 2016 has not had the same problems.  The sources of aborted fills that are immediately repaired are a factor which needs to be considered: for example, a fault which leads to a beam abort and requires no repair, but still requires the machine to be re-filled.

S. Redaelli - this year was one of the years in which the most operational days were lost to long faults.  Once this is corrected for, what is the characteristic of the fault data?

A. Apollonio - 2016 began with poor availability, with isolated faults of long duration; since then it appears that "random faults" have been the driving factor.

S. Redaelli - what about R2E? Why are there so few failures?

S. Danzeca - the TCL settings are one of the main contributors to R2E failures.

G. Rakness - how come at the end of the year there was high availability and yet not much physics produced?

L. Ponce - at the end of the year there were several areas of machine exploitation that meant the machine was not producing physics, for example several MDs.  It was noted that around 24 October, three consecutive days saw the highest luminosity delivery of the year.

 

Technical Services - J. Nielsen

The five systems which are monitored by TIOC;

These categories are distributed across several elements of the AFT tree.

The events which occur are classified by group; in the future this could be done by mapping systems and equipment instead of by group, matching the approach of the AFT.  This will help classify events more clearly.

For example, some systems are in fact groups of systems; the classification could be improved, and the AFT could show groups of systems.  To achieve this the definition of a "system" should be improved.

The TIOC meets every Wednesday to analyse the events that have occurred during the week; recommendations are then made to mitigate root causes.  The TIOC also coordinates the larger technical interventions.

If an accelerator is stopped, a major event is created.  The data for such an event are taken once and are not subsequently synchronised; this could be improved.  The major events are presented at the weekly TIOC meeting.  The machine or service operator fills in the first part of the information, and the user and/or group then fills in further details.

The fault information for 2016 shows major groups:

Breakdown by fault count (with duration):

Controls and Instrumentation:

Mostly PLC failures.

Equipment Faults:

Usually due to common-mode power supply faults, for example a failure which trips the power supply to several elements (selectivity tripping at a higher level).  Certain events are due to equipment not suited to its use (old installations being re-tasked), general equipment failure, or calibration problems.

Downtime is higher than in 2015, but if the weasel event is excluded it is lower (-30%).

Electrical Perturbations:

A general report from the mains supply services shows that 2016 had 19% fewer thunderstorms than a typical year.

Conclusions

M. Lamont - the weasel showed that there were some spares issues.

J. Nielsen - there were spares, but not in good condition.

D. Nisbet - how do we close the loop with equipment groups? How can we see improvements?

J. Nielsen - next year we hope the fault duration assigned to the technical services will be lower; there have been faults with long-lasting effects.  The follow-up is the responsibility of the equipment groups and users.  An event is not closed in the TIOC unless it has been mitigated, or it is clear that it will not be mitigated.

L. Ponce - the TIOC is doing much more follow-up than the machines do for the AFT.

 

Injector Complex - V. Kain and B. Mikulec

The injectors were the number one cause of 2016 LHC downtime, although it should be taken into account that there are four injectors before the LHC.  If this were split, the "injector" bar per machine would be shorter.

It is not easy to identify which accelerator is the source of LHC downtime; extending the AFT to the injectors is being discussed to assist in this work.

138 faults were attributed to the injectors, with 15 days of downtime.  This analysis was very time consuming, as the connection from the LHC to the injector logbooks is not automatic.
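Because this linking has to be done by hand, it essentially amounts to matching time windows between the LHC fault records and the injector logbooks. A minimal sketch of such a check is shown below; the record structures and timestamps are assumptions for illustration, not the actual AFT or eLogbook formats.

    from datetime import datetime

    # Flag injector logbook entries whose time window overlaps an LHC fault
    # attributed to the injectors (illustrative data only).
    def overlaps(a_start, a_end, b_start, b_end):
        return a_start < b_end and b_start < a_end

    lhc_injector_faults = [
        (datetime(2016, 6, 1, 3, 0), datetime(2016, 6, 1, 5, 30)),
    ]
    injector_entries = [
        ("PS", datetime(2016, 6, 1, 2, 50), datetime(2016, 6, 1, 5, 0)),
        ("SPS", datetime(2016, 6, 2, 10, 0), datetime(2016, 6, 2, 11, 0)),
    ]

    for f_start, f_end in lhc_injector_faults:
        for machine, e_start, e_end in injector_entries:
            if overlaps(f_start, f_end, e_start, e_end):
                print(f"LHC fault starting {f_start:%d %b %H:%M} may trace back to the {machine}")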

LINAC2 - 6h 20m as seen by LHC

3 faults, notably the replacement of an ignitron.

Booster - 11h 45m as seen by LHC 

Several faults, mainly electro-valves; the longest individual fault was 4 hours.

PS - 9 days 10 hours as seen by LHC

Power converters (MPS and POPS), vacuum, and radio frequency.  Power converters account for over 6 days of this, vacuum over 1 day, and RF over 15 hours.

SPS - 4 days 19 hours as seen by LHC

Power converters (no real systematics) over 1 day 8 hours, targets and dumps 23 hours, radio frequency over 18 hours.  There were many systematic issues affecting beam quality, but, like degraded mode, these are not actual faults.

In contrast with the figures as seen by the LHC (the LHC only needs beam during filling), considering each machine as a continuous operation gives the following (see the sketch after this list):

LINAC2 - 97.3% uptime, 166h downtime.

Booster - 93.9% uptime, 384h downtime

PS - 88% uptime, 727h downtime

Availability per user varies from 79% to 94%.

SPS - 74.8% uptime, 1366h downtime
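These uptime figures follow from availability = 1 - downtime / scheduled time. The sketch below reproduces that arithmetic; the 213-day scheduled period is an assumption for illustration, so the printed percentages will not exactly match the quoted ones, whose exact time basis may differ.

    # Availability from total downtime, treating each injector as running continuously.
    # SCHEDULED_HOURS is an assumed figure (213-day proton run); the basis used in the talk may differ.
    SCHEDULED_HOURS = 213 * 24

    downtime_hours = {"LINAC2": 166, "Booster": 384, "PS": 727, "SPS": 1366}

    for machine, down in downtime_hours.items():
        availability = 1.0 - down / SCHEDULED_HOURS
        print(f"{machine}: {availability:.1%} available ({down} h of downtime)")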

Issues with fault tracking in the injectors;

Injector AFT

Injector downtime appears to be dominated by a few longer, uncorrelated breakdowns.

J. Jowett - considering only the proton run has hidden some issues which were observed during the p-Pb run, although other injectors were used for the Pb injection.

M. Lamont - how come the LHC was not adversely affected by poor LINAC availability?

B. Mikulec - the LHC never asked for beam, and therefore no fault was logged.  

M. Lamont - are the breakdowns really uncorrelated? They could be correlated with maintenance activities that need some improvement.

B. Goddard - note that sometimes the maintenance has led to lower availability (e.g. water valves).

L. Ponce - how can we keep track of the degraded modes? This is something that was also abandoned in the LHC; in the AFT it was too difficult to track, and so was not done.

R. Steerenberg - having this degraded-mode information for the whole period would make things clearer; at the moment the reality is obscured by the incomplete capture of the degraded mode.

L. Ponce - agrees, in addition, injector "downtime" can be flagged in AFT, for example, "prevents injection".  Following MKI problem this was added, this was not used in 2016.  For example the 35h fill, for example, was kept so long to avoid injector issues.

 

Cryogenics - K. Brodzinski

There are four cryogenic islands and 8 cryogenic plants (A = low load, B = high load).

Cold boxes were tuned, achieving 175 W per half-cell capacity in the worst-performing sectors (around point 8).  In sectors 2-3 the beam-screen heat-load cooling capacity can reach 195 W.  The general limit is 160 W.

In 2016 cryogenics achieved 94.4% availability; excluding faults from users and the electrical supply, it achieved 98.6%.
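As a rough cross-check, the downtime implied by each of these availability figures is (1 - availability) x scheduled hours. The sketch below assumes a 213-day scheduled period, which is an illustration and may differ from the basis used in the talk; the cryogenics-only figure comes out close to the 79 hours of downtime quoted below.

    # Back-of-envelope downtime implied by the quoted cryogenics availabilities.
    SCHEDULED_HOURS = 213 * 24  # assumed scheduled period, for illustration only

    for label, availability in [("including user/supply stops", 0.944),
                                ("cryogenics faults only", 0.986)]:
        implied_downtime = (1 - availability) * SCHEDULED_HOURS
        print(f"{label}: about {implied_downtime:.0f} h of downtime")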

2015 - total downtime 273 hours

2016 - total downtime 79 hours

This improvement comes from four effects:

  1. Feed forward logic for beam screen heating
  2. points 2 & 8 optimisation
  3. point 8 cold box repairs
  4. DFB level adjustment

Overall, around 60% of downtime was due to PLC failures; this has been a known issue for some time.

4.5K - 1 x human factors, 2 x PLC

1.8K - 1 x mechanical failure, 1 x AMB CC

Helium Losses

Beam Screen Heat Load

2017 plans:

J. Wenninger - is the triplet limit 2.0e34 or 1.7e34?

K. Brodzinski - the limit is really 1.7e34.  After the tests carried out, a baseline of 300 W heat load on the triplet was expected, but once the re-calibration correction factor was applied, the actual load managed was only 240-250 W.  There is still room for improvement; 1.75e34 is something that is known and can be done.  To reach 2.0e34, tuning is needed.

 

Sources of Premature Beam Aborts - I. Romera / M. Zerlauth

86 fills were aborted; a Pareto of these shows three large contributors (summarised in the sketch further below):

Technical Services x 27:

Power converters x 15

Beam Losses / Unidentified Falling Objects (UFOs) x 14

Remaining

Small counts; interesting cases are:

There is no obvious correlation.
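The Pareto described above can be reproduced directly from the quoted counts. In the sketch below, the remainder bucket is simply 86 minus the three named contributors, which is an assumption about how the small counts are grouped.

    # Pareto of premature beam aborts, using the counts quoted in the talk.
    abort_causes = {
        "Technical services": 27,
        "Power converters": 15,
        "Beam losses / UFOs": 14,
        "Other (small counts)": 86 - (27 + 15 + 14),  # assumed grouping of the remainder
    }

    total = sum(abort_causes.values())
    cumulative = 0
    for cause, count in sorted(abort_causes.items(), key=lambda item: item[1], reverse=True):
        cumulative += count
        print(f"{cause:22s} {count:3d}  (cumulative {100 * cumulative / total:.0f}%)")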

Conclusions

Everything looks random: the bottom of the bathtub curve!

G. Arduini - for the power converters which have a possible radiation effect, five out of six are in point 5 RRs.

M. Zerlauth - this is the case; these are planned to be changed, first by replacing the controller (FGClite) and then the converter power part.

S. Danzeca - the events in RR53 / 57, happen when the TCL settings are "closed", when the TCL are opened there are no events.

A. Lechner - concerning the UFOs in the IR, thresholds have already been increased for the year.

B. Goddard - putting the last presentations together, it is remarkable that there was only one dump from the dump kickers and the dilution systems.  This is due to the reliability run, which has been shown to be clearly beneficial.