Availability Analysis of Standby Safety Installations in Nuclear Facilities via Renewal Theory

UCL Stochastic Systems (Fall 2024)

Author

J. Tang

Abstract

The standby safety components in nuclear facilities are crucial for ensuring safe and reliable energy production. The stochastic nature of failure times of said components poses a challenge to the prompt detection and correction of failure. In this paper, we use an alternating renewal process to model the downtime of a system subject to interruptions by failure, inspection, and maintenance. We derive a general expression for expected unavailability using the renewal theory and obtain an equation whose roots provide an optimal inspection frequency that minimises unavailability under varying conditions. Additionally, we develop algorithms and implement code to simulate the renewal process, obtaining numerical results using the Monte Carlo method. By computing expectation and simulations, we find that for a system with a mean time between failures of \(17.72\) months, an inspection rate of \(3.33\) days\(^{-1}\), and a maintenance rate of \(0.14\) days\(^{-1}\), the optimal inspection frequency is approximately \(0.03\) days\(^{-1}\). Under the same conditions, we find that the difference between the average unavailability evaluated at inspection frequencies of \(0.01\) days\(^{-1}\) and \(0.03\) days\(^{-1}\) is not statistically significant, suggesting a more cost-efficient inspection and maintenance policy than the optimal solution.

1 Introduction

A range of safety equipment is installed in nuclear power plants with the fundamental objective to ensure safe and controlled fission reactions and to protect people and the environment from harmful effects of ionising radiation (pp. 4-5, in International Atomic Energy Agency 2006). Examples of back-up safety equipment include emergency diesel generators, valves, back-up pumps, DC batteries, and so on. Such systems act as a critical failsafe against loss of coolant during catastrophic reactor accidents — a lesson bitterly learned from Fukushima Daiichi in 2011, where a tsunami knocked out back-up diesel generators, resulting in a triple meltdown. Therefore, it is crucial to minimise the unavailability of safety equipment.

During the lifetime of a nuclear power plant, accidents are relatively rare, and hence back-up safety equipment is, for the most part, in an idle state. Without being active, any faults or failures in the system are not instantly detected. This means that it may be unavailable at any given time, including when it is needed. As such, nuclear safety equipment is periodically inspected to detect such failures and is repaired if any are found. Inspection shortens the period during which failures remain undetected and safety systems remain nonfunctional. However, safety inspection takes time, manpower and resources. On many occasions, the equipment also needs to be taken offline for the inspection (Tokyo Electric Power Company 2009). This suggests that inspections can neither be excessive nor infrequent to minimise the unavailability and avoid causing interruption to the operation.

How do we decide on the frequency of inspections of safety-critical instruments on-site of a nuclear facility? By mathematical analysis and modelling alternating renewal processes, we evaluate the performance of various inspection and maintenance schemes under different scenarios, with regard to the system’s average life expectancy, failure rate, and other reliability metrics. The results provide insights to help make safe and cost-efficient decisions about operational management of back-up safety systems.

Organisation. In \(\S2\) we provide a review of existing research related to this problem and how our report contributes to the literature. In \(\S3\) we will go over the set-up of a typical inspection problem and define key variables. We will also discuss how renewal theory can be extended to model the inspection problem and derive the expected unavailability function to attempt to evaluate the performance of an inspection and maintenance policy. \(\S4\) will demonstrate, by simulation, the performance of a system under different scenarios and inspection routines. We will compute the optimal inspection frequency given a realistic set of parameters in \(\S5\). These are fully implemented with Python, and code and simulation results are available at my GitHub repository. We will discuss the limitations and conclude the report in \(\S6\).

2 Literature Review

Studies into reliability engineering and maintenance planning date back to Waloddi Weibull (1951), whose Weibull distribution¹ is now widely used to provide accurate failure analysis and failure forecasts with extremely small samples. Caldarola (1977) analysed component availability and described the component behaviour as closed chain renewal processes in Unavailability and Failure Intensity of Components published in Nuclear Engineering and Design.

Vaurio (1996) studied the time-dependent and mean unavailability of periodically tested and aging equipment under various inspection and maintenance policies and developed cost functions for said policies and optimised the test and maintenance interval. The paper also suggests that economic optimisation may result in obtaining a solution that gives rise to an unacceptable accident risk. Therefore, as an extension to the paper, Vaurio introduced the unavailability threshold to formulate the problem into risk-constrained optimisation. The result provides a general formalism for selecting economically optimal test and maintenance intervals, with and without risk constraints.

Cui and Xie (2003) analysed both instantaneous and steady-state availability of periodically inspected systems with random repair time. No particular probabilistic distributional assumptions are made to time between failures and repair time. Cui and Xie primarily used the elementary renewal theory as a foundation to derive the formula for availability of a long-run, periodically inspected system.

More recently, van der Weide and Pandey (2015) estimated the instantaneous (or point) unavailability of standby safety equipment. They modelled the point unavailability subject to a given maintenance policy using stochastic alternating renewal processes, wherein failure time and repair time are modelled by general probability distributions. Based on the key renewal theorem, they derived an asymptotic solution of point unavailability. One of the illustrative examples the paper covers is where the time between failures is exponentially distributed with mean \(17.72\) months, which implies no aging effects on the equipment; inspections are conducted every \(4\) months; and repair time is exponentially distributed with mean \(0.25\) months. Under these conditions, by solving the recursive integral equation they proposed, the point unavailability is calculated as \(0.1269\). It means that if a nuclear facility decides to perform routine inspections on the standby equipment every \(4\) months, the expected proportion of time that the system is unavailable due to undetected failures and repair over the total elapsed time is about \(12.8\) percent. This number provides an anchor point for us to compare analytical and simulation results in later analysis.

The analytical derivations of expectation and distribution of downtime in a renewal process by Cui and Xie (2003) and van der Weide and Pandey (2015) provide useful insights into the instantaneous and long-run behaviour of downtime (and availability) in a complex system subject to interruptions such as failure and maintenance. We contribute to existing literature and to solving the “inspection problem” by optimising inspection frequency with the objective to minimise downtime of a standby safety system, in the context of nuclear facilities, which are typically considered to be a high-stakes situation where extended downtime could lead to serious repercussions.

3 Methodology

3.1 A Recap on the Renewal Theory

Alternating renewal process is briefly discussed in the lecture notes. Consider a system that encounters a failure after time \(F_i\) and then needs to be repaired for time \(R_i\). At this stage, we do not consider inspections; we assume that failures are detected as soon as they occur. Assume that the sequences \(F =\{F_i\}_{i \in \mathbb{N}}\) and \(R= \{R_i\}_{i \in \mathbb{Z}^{+}}\) are independent, and that each sequence is i.i.d. with common distributions given by \(F_F\) and \(F_R\) repectively. If we let \(N(t)\) be the number of repairs completed by time \(t\), then \(N\) is a renewal process, with inter-arrival times given by \(X_i= F_{i-1} + R_{i}\), where the renewal equation is given by \(F_X = F_F \cdot F_R\).

We are interested in the limiting behaviour of

\[ 1 - p(t) = 1 - \mathbb{P}(O_t) = \mathbb{P}(\text{the system is unavailable at time }t) \tag{1}\]

as \(t \to \infty\).

Consider the event \(\{F_0 \leq t\}\), i.e. the first failure occurs before time \(t\). By conditioning on \(X_1\), we have

\[ \begin{eqnarray*} p(t) &=& \mathbb{P}(O_t, F_0 > t) + \mathbb{P}(O_t, F_0 \leq t) \\ &=& 1 - F_F(t) + \mathbb{E}(\mathbb{P}(O_t , F_0 \leq t | X_1)). \end{eqnarray*} \tag{2}\]

Let \(\phi(x) = \mathbb{P}(O_t, F_0 \leq t| X_1=x)\). Then \(\phi(x) = p(t-x) \mathbf{1}[t \geq x]\). Thus

\[ \begin{eqnarray*} p(t) &=& 1- F_F(t) + \mathbb{E}(\phi(X_1)) \\ &=& 1- F_F(t) + \int_0^t p(t-x) dF_X(x). \end{eqnarray*} \tag{3}\]

Thus it gives

\[ p = (1- F_F) + p \cdot F_X \tag{4}\]

and from the uniqueness of renewal equation solutions, we have

\[ p = (1- F_F) + (1-F_F) \cdot \mathbb{E}(N). \tag{5}\]

By the key renewal theorem, we have that as \(t \to \infty\),

\[ 1 - p(t) \to 1 - \frac{1}{\mathbb{E}(X_1)}\int_0^{\infty} [1-F_F(x)]dx = \frac{\mathbb{E}(R_1)}{\mathbb{E}(F_0) + \mathbb{E}(R_1)}. \tag{6}\]

3.2 Problem Setup

As mentioned in \(\S1\), a standby safety system needs to be checked regularly to ensure its functionality. We use a Poisson process to simulate the failure of such a system. Suppose that the device starts running at \(t=0\), and the first failure occurs at \(t=X \sim \text{exponential}(\lambda)\), e.g. \(\lambda=\frac{1}{3600} (\text{days}^{-1})\) means that a failure may occur, on average, once every ten years. Each regular inspection takes the device offline for certain amount of time (e.g. deterministic or stochastic). The time needed to repair a fault found during a regular inspection is \(M \sim \text{exponential}(\gamma)\). After a failure is detected and service is restored, the system starts over anew. Assume that the system runs for a sufficiently long time, we want to find, analytically or otherwise, the optimal inspection frequency such that the system’s downtime can be minimised.

Response variables involved in the “inspection problem” are system uptime, denoted by \(T_{\text{up}}\); and system downtime, denoted by \(T_{\text{down}}\).

Mean time between failures², given by \(\frac{1}{\lambda}\), is an exogenous variable. Exogenous variable is system-specific. For example, a pressure bypass valve can have an expected time between failures of \(150\) days, while an emergency diesel generator on average takes \(500\) days to encounter a failure.

There are two (or three) parameters defining an inspection and maintenance plan, depending on whether or not inspections take time. Firstly, the inspection duration, denoted by \(I\), is the time taken to complete an inspection. \(I\) can be deterministic. For instance, the U.S. Nuclear Regulatory Commission (NRC) mandates the emergency diesel generators in nuclear power plants to undergo regular testing that is required to last for \(24\) hours, with a few exceptions made to approved sites to conduct \(8\)-hour testing (U.S. Nuclear Regulatory Commission 2009). In a more realistic setting, \(I\) is modelled by an exponential distribution with rate denoted by \(\mu\). Secondly, the maintenance (or repair) duration, denoted by \(M\), is the time spent to restore service to a component when a failure is detected. \(M\) is assumed to follow an exponential distribution with rate denoted by \(\gamma\). Finally, inspection frequency \(f\) is the frequency at which inspections are conducted throughout the lifetime of a system.

The evaluation criterion of the performance of an inspection and maintenance policy is the unavailability of a system during its lifetime. Unavailability, denoted by \(O\) in our case, is defined as the proportion of system downtime, \(T_{\text{down}}\), over the total elapsed time in a system. Unvailability should be minimised.

Notation	Description
\(T_{\text{up}}\)	System uptime
\(T_{\text{down}}\)	System downtime
\(O\)	Unavailability
\(f\)	Inspection frequency
\(\lambda\)	Failure rate
\(\mu\)	Inspection rate
\(\gamma\)	Repair rate

3.3 Alternating Renewal Process

3.3.1 Limiting Behaviour of Unavailability

In this part, we want to formalise the idea of “inspection problem” as discussed in \(\S3.2\) into a renewal process and derive the asymptotic expectation of unavailability in terms of inspection frequency \(f\) given a particular set of exogenous parameters.

Let \(T\) be the duration of time that a failure remains undetected in a system. Failures occur in a Poisson process with rate \(\lambda\). Markov property (i.e. memorylessness) of Poisson process suggests that at any given moment, the arrival of next failure is independent of the past. The probability that a failure occurs at time \(t\) in an inspection interval \([0, \frac{1}{f}]\) is uniform over the span of its interval. Hence the residual time, \(\frac{1}{f}-t\), where a failure goes undetected until the next inspection is also uniformly distributed with support \([0, \frac{1}{f}]\), which leads us to Lemma 1.

Lemma 1 Consider a system that has time between failures \(X_i \overset{\text{i.i.d.}}{\sim} \text{Exp}(\lambda)\) and some inspection frequency \(f \in (0, \infty)\). The time elapsed during an undetected failure is given by

\[ T \sim \text{U}(0, f^{-1}) \]

Let \(N\) be the number of inspections performed to a system before it encounters a failure. We have

\[ N = \lfloor{X\cdot f}\rfloor \tag{7}\]

and

\[ \mathbb{E}[N] = \mathbb{E}[X] \cdot f = \frac{f}{\lambda}. \tag{8}\]

We can use Monte Carlo approximations to further demonstrate this point, see results in Figure 1.

Lemma 2 Consider a system that has time between failures \(X_i \overset{\text{i.i.d.}}{\sim} \text{Exp}(\lambda)\) and some inspection frequency \(f \in (0, \infty)\). The number of inspections before a failure is given by

\[ N \sim \text{Poisson}(\frac{f}{\lambda}) \]

In Lemma 2, we state that the number of inspections that occur before a failure in each renewal cycle is Poisson distributed with rate \(\frac{f}{\lambda}\). Suppose that each inspection takes time \(I \sim \text{Exp}(\mu)\). If we assume \(N\), \(I\) are independent random variables, then by the Law of Total Probability,

\[ \mathbb{E}[N\cdot I]=\mathbb{E}[N]\mathbb{E}[I]=\frac{f}{\lambda\mu}. \tag{9}\]

Code

import math
import numpy as np
import matplotlib.pyplot as plt

n = 1000

tFailure = np.random.exponential(150, n)
N = []
for i in range(0, n):
    N.append(tFailure[i]//30)

def poisson(rate, k):
    pmf = (rate**k)*np.exp(-rate)/math.factorial(k)
    return pmf

rate = 150/30
x = np.linspace(0, int(np.max(N)), int(np.max(N))+1)
y = []
for i in range(0, int(np.max(N))+1):
    y.append(poisson(rate, i))

plt.hist(N, bins=round(np.max(N)-np.min(N)), density=True)
plt.axvline(np.mean(N), color='red', linestyle='--', label=f"simulated mean = {round(np.mean(N),3)}")
plt.plot(x, y, "ro", label=f"poisson mean = {rate}")
plt.xlabel("number of inspections")
plt.ylabel("probability density")
plt.legend()

Figure 1: A histogram of simulated \(N\) in comparison to the probability mass function of a \(\text{Poisson}(5)\).

Suppose also that upon the completion of inspection, if a failure is detected, repair starts immediately and takes time \(M \sim \text{Exp}(\gamma)\). Let \(O\) be the proportion of downtime over the total elapsed time in a system. This is given by

\[ \mathbb{E}[O] = \frac{\mathbb{E}[T_\text{down}]}{\mathbb{E}[T_\text{up}]} \tag{10}\]

From the result in Equation 6, we have

\[ \mathbb{E}[O] = \frac{\mathbb{E}[N \cdot I] + \mathbb{E}[T] + \mathbb{E}[I] + \mathbb{E}[M]}{\mathbb{E}[X] + \mathbb{E}[T] + \mathbb{E}[I] + \mathbb{E}[M]}. \tag{11}\]

Theorem 1 Consider a system that has time between failures \(X_i \overset{\text{i.i.d.}}{\sim} \text{Exp}(\lambda)\), inspection duration \(I_i \overset{\text{i.i.d.}}{\sim} \text{Exp}(\mu)\), repair time \(M_i \overset{\text{i.i.d.}}{\sim} \text{Exp}(\gamma)\), and inspection frequency \(f \in (0, \infty)\). Let \(T\) be the residual time of undetected failure and \(N\) be the number of inspections before a failure. The expectation of system unavailability, \(O\), is given by

\[ \mathbb{E}[O]=\frac{\mathbb{E}[N\cdot I]+\mathbb{E}[T]+\frac{1}{\mu}+\frac{1}{\gamma}}{\mathbb{E}[T]+\frac{1}{\lambda}+\frac{1}{\mu}+\frac{1}{\gamma}} \tag{12}\]

and using Lemma 1 and Lemma 2, we get

\[ \mathbb{E}[O]=\frac{\frac{f}{\lambda\mu}+\frac{1}{2f}+\frac{1}{\mu}+\frac{1}{\gamma}}{\frac{1}{\lambda}+\frac{1}{2f}+\frac{1}{\mu}+\frac{1}{\gamma}}. \tag{13}\]

In attempt to find a general solution, we will find the root(s) of the first derivative of the expectation function, as obtained in Theorem 1, set equal to zero.

Let \(g(f)=\frac{f}{\lambda\mu}+\frac{1}{2f}+\frac{1}{\mu}+\frac{1}{\gamma}\) and \(h(f)=\frac{1}{\lambda}+\frac{1}{2f}+\frac{1}{\mu}+\frac{1}{\gamma}\). We have \(g'(f)=\frac{1}{\lambda\mu}-\frac{1}{2f^2}\) and \(h'(f)=-\frac{1}{2f^2}\). By the quotient rule, it gives that

\[ \frac{d}{df}\mathbb{E}[O]=\left(\frac{g(f)}{h(f)}\right)'= \frac{g'(f)h(f)-g(f)h'(f)}{(h(f))^2}. \tag{14}\]

Let \(\alpha=\frac{1}{\lambda}+\frac{1}{\mu}+\frac{1}{\gamma}\) and \(\beta=\frac{1}{\mu}+\frac{1}{\gamma}\). We can simplify Equation 14 to

\[ \frac{d}{df}\mathbb{E}[O]=\frac{\frac{\alpha}{\lambda\mu}+\frac{1}{\lambda\mu f}-\frac{\alpha+\beta}{2f^2}}{(\frac{1}{2f}+\alpha)^2}. \tag{15}\]

Let \(\frac{d}{df}\mathbb{E}[O]=0\). The Equation 15 can further simplify to

\[ 2\alpha^3 f^4 + 4\alpha^2 f^3 + (\frac{5}{2}\alpha - \lambda^2 \mu \alpha^2)f^2 + (\frac{1}{2} - \lambda^2 \mu\alpha)f - \frac{1}{4}\lambda^2 \mu=0 \tag{16}\]

where \(\beta\) term drops out.

Theorem 2 For a system with \(X_i \overset{\text{i.i.d.}}{\sim} \text{Exp}(\lambda)\), \(I_i \overset{\text{i.i.d.}}{\sim} \text{Exp}(\mu)\), and \(M_i \overset{\text{i.i.d.}}{\sim} \text{Exp}(\gamma)\), the expected system unavailability \(\mathbb{E}[O]\) is minimised for frequency \(f\) that satisfies the equation:

\[ \xi(f)=2\alpha^3 f^4 + 4\alpha^2 f^3 + (\frac{5}{2}\alpha - \lambda^2 \mu \alpha^2)f^2 + (\frac{1}{2} - \lambda^2 \mu\alpha)f - \frac{1}{4}\lambda^2 \mu=0 \tag{17}\]

where \(\alpha = \frac{1}{\lambda}+\frac{1}{\mu}+\frac{1}{\gamma}\).

Finally, in Theorem 2 we present a quartic equation whose root(s) gives a general expression of optimal \(f\) that minimises \(\mathbb{E}[O]\) subject to parameters \((\lambda, \mu, \gamma)\). Given a set of real-valued parameters, one can use Newton-Ralphson method (that Prof Richard Chandler taught us in STAT0023: Computing for Practical Statistics, Fall 2023) to locate the roots.

3.3.2 An Estimator of Steady-state Unavailability

Suppose the same problem setup and distributional assumptions. We can also estimate the long-run unavailability \(O\) by

\[ \begin{eqnarray*} O &=& \left\{\begin{array}{lr} \frac{(\frac{1}{\lambda}-(\frac{1}{\lambda}\mod{\frac{1}{f}}))\frac{f}{\mu}+\frac{1}{\gamma}}{\frac{1}{\lambda} + \frac{1}{\mu} + \frac{1}{\gamma}}, & 0 \leq \frac{1}{\lambda}\mod{\frac{1}{f}} < \frac{1}{\mu} \\ \frac{(\frac{1}{\lambda}-(\frac{1}{\lambda}\mod{\frac{1}{f}}))\frac{f}{\mu}+\frac{1}{\mu}+\frac{1}{\gamma}}{\frac{1}{\lambda} + \frac{1}{f}-(\frac{1}{\lambda}\mod{\frac{1}{f}}) + \frac{1}{\mu} + \frac{1}{\gamma}}, & \frac{1}{\mu} \leq \frac{1}{\lambda}\mod{\frac{1}{f}} < \frac{1}{f} \end{array}\right. \end{eqnarray*} \tag{18}\]

The conceptual idea of this derivation is certainly valid for the approximation of the expectation via Monte Carlo methods. However, it is incomplete as an expression of statistical expectation because the expectation of unavailability is conditional on the result of \(\lambda\mod \alpha\) while the probabilities assigned to two events are unknown.

3.3.3 Analytical Solution

We suppose that mean time between failures is \(17.72\) months; mean inspection duration is \(8\) hours; and mean maintenance duration is \(7\) days. That is \(\lambda = (17.72 \cdot 30)^{-1}\), \(\mu = 3\), and \(\gamma = 7^{-1}\). We cite the empirical rates and inspection-related data given by relevant research and nuclear regulatory bodies (see pp. 102-103, in van der Weide and Pandey 2015; also pp. 3, in U.S. Nuclear Regulatory Commission 2008). The empirical rates are also used as function parameters for modelling and simulation in \(\S4\) and \(\S5\).

Given these conditions, Equation 13 and Equation 18 are plotted in Figure 2 and Figure 3 respectively.

Code

def exp(f, Lambda, mu, gamma):
    output=(f/(Lambda*mu)+1/(2*f)+1/mu+1/gamma)/(1/Lambda+1/(2*f)+1/mu+1/gamma)
    return output

x = np.linspace(1/3, 365, 365)
y = exp(x**-1, (17.72*30)**-1, 3, 7**-1)
plt.plot(x, y, label='expectation, $\mathbb{E}[O(f^{-1}, (17.72\cdot 30)^{-1}, 0.3^{-1}, 7^{-1})]$')
plt.xlabel('$f^{-1}$')
plt.ylabel('$\mathbb{E}[O]$')
plt.legend(loc='upper right')
plt.show()

Figure 2: A plot of the function of expected unavailability.

Code

def function(I):
    exp=[]
    for i in range(0, len(I)):
        if 0 <= 532%I[i] < 3:
            exp.append(((532/3)//I[i] + 7)/(532+1/3+7))
        elif 3 <= 532%I[i] < I[i]:
            exp.append(((532/3)//I[i] + 1/3 + 7 + (I[i] - 532%I[i]))/(532+1/3+7+(I[i]-532%I[i])))
    return exp

t=np.linspace(1/3, 365, 365)
y=function(t)

plt.plot(t,y, label='unavailability, $O(f^{-1}, (17.72\cdot 30)^{-1}, 0.3^{-1}, 7^{-1})$')
plt.xlabel('$f^{-1}$')
plt.ylabel('$O$')
plt.legend(loc='upper right')
plt.show()

Figure 3: A plot of the function of unavailability based on alternative derivations.

We believe that modulo terms in Equation 18 cause a zig-zag, wave-like pattern as shown in Figure 3. Because of this, the function is not continuously differentiable. We are hence unable to find a stationary solution. We will attempt to model the concerned renewal process in the next section to obtain a result using stochastic simulation.

Equation 17 does give us an optimal inspection frequency. A list of particular solutions subject to different inspection rate \(\mu\) is obtained and presented in Table 1.

In Figure 4, we observe that the solution point \((\)optimised \(f^{-1}\), \(\mu^{-1}\), minimised \(\mathbb{E}[O]\)\()\) = \((x, y, z)\) ascends in 3-dimensional space as exogenous \(\mu^{-1}\) increases. This suggests that the optimal inspection interval tends to increase as the inspection duration increases, resulting in higher expected unavailability in a system (to put it simply, the longer each inspection takes, the fewer inspection one must plan to compensate for the time lost due to extended period of inspection).

Code

# Response: `na`
# Explanatory: `step` (frequency-related), `mu` (duration-related)
# Controlled: `mtbf` (same subject under examination), `gamma` (repair time)

f_val = np.linspace(1, 60, 100)
mu_val = np.linspace(1/3, 1, 100)
xs, ys = np.meshgrid(f_val, mu_val)

zs = np.array([exp(x**-1, (17.72*30)**-1, y**-1, 7**-1) for x, y in zip(xs.flatten(), ys.flatten())])
zs = zs.reshape(xs.shape)

from matplotlib import cm

# set up a figure twice as wide as it is tall
fig = plt.figure(figsize=plt.figaspect(0.3))

# =============
# First subplot 
# =============
# set up the Axes for the first plot
ax = fig.add_subplot(1, 2, 1, projection='3d')
# plot a 3D surface
surf = ax.plot_surface(xs, ys, zs, vmin=zs.min() * 2, cmap='coolwarm',
                       linewidth=0, antialiased=False)
ax.set_zlim(0, 1.0)
fig.colorbar(surf, shrink=0.5, aspect=10)
ax.set_title("(a) 3D Surface Plot")
ax.set_xlabel('$f^{-1}$')
ax.set_ylabel('$\mu^{-1}$')

# ==============
# Second subplot
# ==============
# set up the Axes for the second plot
ax = fig.add_subplot(1, 2, 2)

# plot a contour
levels = np.linspace(zs.min(), zs.max(), 10)
contour = ax.contourf(xs, ys, zs, levels=levels, cmap='coolwarm')
cbar = fig.colorbar(contour, shrink=0.5, aspect=10)
ax.set_title("(b) Contour Map")
ax.set_xlabel('$f^{-1}$')
ax.set_ylabel('$\mu^{-1}$')
cbar.set_label('$\mathbb{E}[O]$')

plt.show()

Figure 4: The effect of inspection rate, \(\mu^{-1}\), on the relationship between \(f^{-1}\) and \(\mathbb{E}[O]\). The left panel shows a 3D surface plot. The right panel shows a contour map.

Code

from IPython.display import Markdown
from tabulate import tabulate
import pandas as pd

f_val = np.linspace(1, 60, 8)
mu_val = np.linspace(0.3, 1.0, 8)
xs, ys = np.meshgrid(f_val, mu_val)

zs = np.array([round(exp(x**-1, (17.72*30)**-1, y**-1, 7**-1), 3) for x, y in zip(xs.flatten(), ys.flatten())])
zs = zs.reshape(xs.shape)

xs_min = []
ys_min = []
zs_min = []

for i in range(0,8):
    zs_min.append(np.min(zs[i]))
    xs_min.append(round(xs[i, np.argmin(zs[i])], 3))
    ys_min.append(ys[i, np.argmin(zs[i])])

df = pd.DataFrame({
    "unavailability": zs_min,
    "f_inv": xs_min,
    "mu_inv": ys_min
})

Markdown(tabulate(
  df, 
  headers=["$\mathbb{E}[O]$","$f^{-1}$ (in days)", "$\mu^{-1}$ (in days)"],
  showindex=False
))

Table 1: Analytical solution subject to varying average inspection duration.

\(\mathbb{E}[O]\)	\(f^{-1}\) (in days)	\(\mu^{-1}\) (in days)
0.046	17.857	0.3
0.052	17.857	0.4
0.056	26.286	0.5
0.06	26.286	0.6
0.063	26.286	0.7
0.067	26.286	0.8
0.07	34.714	0.9
0.073	34.714	1

4 Simulation

Continuing with the same set of definitions of variables given in \(\S3\), the names of variables used in code are outlined as follows.

Variables	Description
`uptime`	System uptime
`downtime`	System downtime
`na`	Unavailability
`freq`	Inspection frequency
`rFailure`	Failure rate (i.e. mean time between failures)
`rInspection`	Inspection rate
`rRepair`	Repair rate
`n`	Number of simulation steps sufficient for stable output, think relaxation time³, \(\tau\)

4.1 Renewal Model without Inspection Time

We start by writing functions that return uptime, downtime and unavailability for a simple alternating renewal process. Suppose that time between failures is exponentially distributed with rate rFailure, inspection takes no time, and maintenance takes exponential time with rate rRepair, we want to understand how system unavailability varies in relation to the inspection frequency freq.

We can simulate a single step of the renewal process by implementing the following in Python.

import numpy as np

# Set up the explanatory component
def checkpoint(freq, n):
    checkpoints = np.arange(freq**-1, n, freq**-1)
    return checkpoints

# Set up the stochastic component
def expFailure(rFailure):
    tFailure = np.random.exponential(rFailure**-1)
    return tFailure

def expRepair(rRepair):
    tRepair = np.random.exponential(rRepair**-1)
    return tRepair

# Function to return a single step of a simple renewal process
def single_renewal(freq, n, rFailure, rRepair):
    """Return system uptime (online) and system downtime (offline) during a single step in a renewal process."""
    checkpoints = checkpoint(freq, n)
    
    tFailure = expFailure(rFailure)
    tRepair = expRepair(rRepair)
    
    online = tFailure
    offline = 0
    
    for i in range(0, len(checkpoints)):
        if tFailure < checkpoints[i]:
            offline += checkpoints[i]-tFailure+tRepair
            break
        elif tFailure > checkpoints[i]:
            offline += 0
            continue
        
    return online, offline

# Inputs:
# freq        inspection frequency (days^-1)
# n           stopping time
# rFailure    rate of failure (days^-1)
# rRepair     rate of maintenance (days^-1)
#
# Outputs:
# online      a float, system uptime
# offline     a float, system downtime

For a demonstration of the timeline of critical events occurred in a single renewal cycle, see Figure 5. This example is based on the output of an instance of single_renewal(freq=30**-1, n=10*365, rFailure=(17.72*30)**-1, rRepair=7**-1).

Code

# Set up the explanatory component
def checkpoint(freq, n):
    checkpoints = np.arange(freq**-1, n, freq**-1)
    return checkpoints

# Set up the stochastic component
def expFailure(rFailure):
    tFailure = np.random.exponential(rFailure**-1)
    return tFailure

def expRepair(rRepair):
    tRepair = np.random.exponential(rRepair**-1)
    return tRepair

# Function to return a single step of a simple renewal process
def single_renewal(freq, n, rFailure, rRepair):
    
    checkpoints = checkpoint(freq, n)
    
    tFailure = expFailure(rFailure)
    tRepair = expRepair(rRepair)
    
    online = tFailure
    offline = 0
    
    for i in range(0, len(checkpoints)):
        if tFailure < checkpoints[i]:
            offline += checkpoints[i]-tFailure+tRepair
            break
        elif tFailure > checkpoints[i]:
            offline += 0
            continue
        
    return online, offline

interval = 30

# Generate online and offline values.
online, offline = single_renewal(freq=interval**-1, n=10*365, rFailure=(17.72*30)**-1, rRepair=7**-1)

labels = [f"{round(online//interval-1)}th inspection", f"{round(online//interval)}th inspection", f"{round(online//interval+1)}th inspection", "Failure", "Repair completed"]
points = [(online//interval-1)*interval, (online//interval)*interval, (online//interval+1)*interval, online, online+offline]
points, labels = zip(*sorted(zip(points, labels)))  # Sort by increasing order.
levels = [1,1,-1,-1,1]

def is_failure(label):
    return label == 'Failure'

# The figure and the axes.
fig, ax = plt.subplots(figsize=(8.8, 2), layout="constrained")

# The vertical stems.
ax.vlines(points, 0, levels,
          color="tab:red")

# The baseline.
ax.axhline(0, c="black")

# The markers on the baseline.
meso_dates = [point for point, label in zip(points, labels) if is_failure(label)]
micro_dates = [point for point, label in zip(points, labels) if not is_failure(label)]
ax.plot(micro_dates, np.zeros_like(micro_dates), "ko", mfc="white")
ax.plot(meso_dates, np.zeros_like(meso_dates), "ko", mfc="tab:red")

# Annotate the lines.
for point, level, label in zip(points, levels, labels):
    ax.annotate(label, xy=(point, level),
                xytext=(-3, np.sign(level)*3), textcoords="offset points",
                verticalalignment="bottom" if level > 0 else "top",
                bbox=dict(boxstyle='square', pad=0, lw=0, fc=(1, 1, 1, 0.7)))

ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
ax.spines[["left", "top", "right", "bottom"]].set_visible(False)

ax.margins(y=0.1)
plt.show()

Figure 5: A timeline showing the occurrence of failure, regular inspections and maintenance in a single step of a renewal process.

Building up from the single_renewal function, we model an alternating renewal process that can run for n units of time. The function is shown below.

import numpy as np

# Set up the explanatory component
def checkpoint(freq, n):
    checkpoints = np.arange(freq**-1, n, freq**-1)
    return checkpoints

# Set up the stochastic component
def expFailure(rFailure):
    tFailure = np.random.exponential(rFailure**-1)
    return tFailure

def expRepair(rRepair):
    tRepair = np.random.exponential(rRepair**-1)
    return tRepair

# Function to implement a simple renewal process
def renewal(freq, n, rFailure, rRepair):
    """Return system uptimes (online) and system downtimes (offline) in a renewal process."""
    checkpoints = checkpoint(freq, n)
    
    online = []
    offline = []
    
    runtime = 0
    repair_old = 0
    
    while runtime < checkpoints[-1]:
        
        tFailure = expFailure(rFailure)
        tRepair = expRepair(rRepair)
        online.append(tFailure)
        
        for i in range(0, len(checkpoints)):
            if tFailure < checkpoints[i] - repair_old:
                offline.append(checkpoints[i] - repair_old - tFailure + tRepair)
                break
            elif tFailure > checkpoints[i] - repair_old:
                offline.append(0)
                continue
                
        repair_old = tRepair
        runtime = sum(online + offline)
        
    return online, offline

# Inputs:
# freq        inspection frequency (days^-1)
# n           stopping time
# rFailure    rate of failure (days^-1)
# rRepair     rate of maintenance (days^-1)
#
# Outputs:
# online      an array, consisting of floats of system uptimes
# offline     an array, consisting of floats of system downtimes

Running renewal(freq=30**-1, n=30*365, rFailure=(17.72*30)**-1, rRepair=7**-1) simulates us an alternating renewal process running for approximately \(30\) years. The resulting timeline of one instance of such simulation is shown in Figure 6.

Code

# Set up the explanatory component
def checkpoint(freq, n):
    checkpoints = np.arange(freq**-1, n, freq**-1)
    return checkpoints

# Set up the stochastic component
def expFailure(rFailure):
    tFailure = np.random.exponential(rFailure**-1)
    return tFailure

def expRepair(rRepair):
    tRepair = np.random.exponential(rRepair**-1)
    return tRepair

# Function to implement a simple renewal process
def renewal(freq, n, rFailure, rRepair):
    
    checkpoints = checkpoint(freq, n)
    
    online = []
    offline = []
    
    runtime = 0
    repair_old = 0
    
    while runtime < checkpoints[-1]:
        
        tFailure = expFailure(rFailure)
        tRepair = expRepair(rRepair)
        online.append(tFailure)
        
        for i in range(0, len(checkpoints)):
            if tFailure < checkpoints[i] - repair_old:
                offline.append(checkpoints[i] - repair_old - tFailure + tRepair)
                break
            elif tFailure > checkpoints[i] - repair_old:
                offline.append(0)
                continue
                
        repair_old = tRepair
        runtime = sum(online + offline)
        
    return online, offline

online, offline = renewal(freq=30**-1, n=30*365, rFailure=(17.72*30)**-1, rRepair=7**-1)

points = []
t=0
for i in range(0, len(online)):
    t+=online[i]
    points.append(t)
    t+=offline[i]
    points.append(t)

labels = []
for i in range(0, len(online)+len(offline)):
    if i%2==0:
        labels.append("Down")
    else:
        labels.append("Up")

points, labels = zip(*sorted(zip(points, labels)))  # Sort by increasing order.

levels = []
l=0
for i in range(0, len(points)):
    if i%2==0:
        if l==0:
            levels.append(-1)
        elif l==1:
            levels.append(-0.5)

    else:
        if l==0:
            levels.append(1)
            l=1
        elif l==1:
            levels.append(0.5)
            l=0

def is_failure(label):
    return label == "Down"

# The figure and the axes.
fig, ax = plt.subplots(figsize=(8.8, 2), layout="constrained")

# The vertical stems.
ax.vlines(points, 0, levels,
          color="tab:red")

# The baseline.
ax.axhline(0, c="black")

# The markers on the baseline.
meso_dates = [point for point, label in zip(points, labels) if is_failure(label)]
micro_dates = [point for point, label in zip(points, labels) if not is_failure(label)]
ax.plot(micro_dates, np.zeros_like(micro_dates), "ko", mfc="white")
ax.plot(meso_dates, np.zeros_like(meso_dates), "ko", mfc="tab:red")

# Annotate the lines.
for point, level, label in zip(points, levels, labels):
    ax.annotate(label, xy=(point, level),
                xytext=(-3, np.sign(level)*3), textcoords="offset points",
                verticalalignment="bottom" if level > 0 else "top",
                bbox=dict(boxstyle='square', pad=0, lw=0, fc=(1, 1, 1, 0.7)))

ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
ax.spines[["left", "top", "right", "bottom"]].set_visible(False)

ax.margins(y=0.1)
plt.show()

Figure 6: A timeline showing the occurrence of failures and maintenance in a 30-year simulation of the renewal process.

Figure 7 reveals a positive correlation between inspection interval (\(f^{-1}\), in days) and unavailability, suggesting downtime of equipment reduces as we increase the frequency of inspections. Under this scenario, it is clear that the best strategy to reduce system downtime is to increase the inspection frequency. As \(f^{-1} \to 0\), failures are detected nearly instantaneously, thereby reducing the downtime due to unattended faults. However, more frequent inspections mean higher costs of operation. The management will need to find a balance between the cost induced by inspections and the unavailability of safety installations.

Code

def renewal(freq, n, rFailure, rRepair):
    
    checkpoints = checkpoint(freq, n)
    
    online = []
    offline = []
    
    runtime = 0
    repair_old = 0
    
    while runtime < checkpoints[-1]:
        
        tFailure = expFailure(rFailure)
        tRepair = expRepair(rRepair)
        online.append(tFailure)
        
        for i in range(0, len(checkpoints)):
            if tFailure < checkpoints[i] - repair_old:
                offline.append(checkpoints[i] - repair_old - tFailure + tRepair)
                break
            elif tFailure > checkpoints[i] - repair_old:
                offline.append(0)
                continue
                
        repair_old = tRepair
        runtime = sum(online + offline)
        
    na = sum(offline)/sum(online + offline)
    return na

intervals = np.linspace(1/3, 365, 20)
results = {i: [renewal(i**-1, n=30*365, rFailure=(17.72*30)**-1, rRepair=7**-1) for _ in range(10)] for i in intervals}

i_vals = []
na_vals = []
for i, values in results.items():
    i_vals.extend([i] * 10)
    na_vals.extend(values)

plt.scatter(i_vals, na_vals, alpha=0.6, label = "renewal($f^{-1}$, $30 \cdot 365$, $(17.72 \cdot 30)^{-1}$, $7^{-1}$)")
 
plt.xlabel("$f^{-1}$")
plt.ylabel("unavailability")

plt.legend(loc='upper right')
plt.show()

Figure 7: A scatterplot showing the relationship between inspection interval (i.e. the inverse of frequency) and system unavailability.

4.2 Renewal Model with Inspection Time

Below is an excerpt from Fukushima Daiichi Nuclear Power Plant in Japan (Tokyo Electric Power Company 2009). It shows that occasionally the inspection carries a heavy cost, both in terms of money and time, to the plant owner, risking a potential shutdown of the nuclear facility. Hence it is important to consider the inspection duration into our model.

A manual (reactor) shutdown was initiated during a start-up operation on February 25, 2009. An alarm indicating high reactor pressure, among other warnings, was triggered. It was confirmed that the alarm was caused by the shutting of a turbine bypass valve. The reactor was at 12% of full power when the alarm occurred at 4:03 AM (local time) due to a pressure increase to 7.1 MPa, exceeding the regulatory limit of 6.91 MPa. An inspection then confirmed that one of the 8 bypass valves had closed and that the valve had a bad driving fluid connection. To investigate the cause, at 8:49 AM, all control rods were inserted, and the reactor was manually shut down. The reactor had been starting up following its 25th regular inspection, which had begun on October 18, 2008.

In this part we will consider the case where each inspection takes certain amount of time. Time taken for each inspection can be deterministic, or approximated by a uniform distribution, exponential distribution or other viable probility distributions. We will assume that the duration of inspection follows exponential distribution with rate rInspection, along with other assumptions already made in \(\S4.1\), i.e. exponential time between failures and exponential maintenance time. We make the following adaptations to the renewal function to simulate processes with exponential inspection time.

import numpy as np

# Set up the explanatory component
def checkpoint(freq, n):
    checkpoints = np.arange(freq**-1, n, freq**-1)
    return checkpoints

# Set up the stochastic component
def expFailure(rFailure):
    tFailure = np.random.exponential(rFailure**-1)
    return tFailure

def expInspection(rInspection):
    tInspection = np.random.exponential(rInspection**-1)
    return tInspection

def expRepair(rRepair):
    tRepair = np.random.exponential(rRepair**-1)
    return tRepair

# Function for a renewal process
def inspectionProcess(freq, n, rFailure, rInspection, rRepair):
    
    checkpoints = checkpoint(freq, n)
    
    # Set up the response component
    uptime = []
    downtime = []
    
    # Initialise key logging timepoints
    tRenewal = 0
    tResidual = 0 # spillover time at inspection point
    
    while tRenewal < checkpoints[-1]:
        
        tFailure = expFailure(rFailure)
        
        for i in range(0, len(checkpoints)):
            
            if tRenewal + tFailure > checkpoints[i]:
                # No failure occurs **before the checkpoint**, Inspection conducted.
                tInspection = expInspection(rInspection)
                
                if tRenewal + tFailure > checkpoints[i] + tInspection:
                    # No failure occurs **during the inspection**, Loop continues
                    uptime.append(freq**-1 - tResidual)
                    downtime.append(tInspection)
                    tResidual = tInspection
                    continue
                elif tRenewal + tFailure < checkpoints[i] + tInspection:
                    # Failure occurs **during the inspection**, Repair performed, Loop breaks.
                    tRepair = expRepair(rRepair)
                    tLatent = 0 # No latent time
                    uptime.append(freq**-1 - tResidual - tLatent)
                    downtime.append(tLatent + tInspection + tRepair)
                    tResidual = tInspection + tRepair
                    tRenewal = tRenewal + tFailure + tLatent + tInspection + tRepair
                    break
        
            elif tRenewal + tFailure < checkpoints[i]:
                # Failure occurs, Inspection conducted, Repair performed, Loop breaks.
                tInspection = expInspection(rInspection)
                tRepair = expRepair(rRepair)
                tLatent = checkpoints[i] - (tRenewal + tFailure)
                uptime.append(freq**-1 - tResidual - tLatent)
                downtime.append(tLatent + tInspection + tRepair)
                tResidual = tInspection + tRepair
                tRenewal = tRenewal + tFailure + tLatent + tInspection + tRepair
                break
                
    return uptime, downtime

# Inputs:
# freq        inspection frequency (days^-1)
# n           stopping time, max.iter
# rFailure    rate of failure (days^-1)
# rInspection rate of inspection (days^-1)
# rRepair     rate of repair (days^-1)
#
# Outputs:
# uptime      an array of uptimes
# downtime    an array of downtimes

Running inspectionProcess(freq, n=1000, rFailure=(17.72*30)**-1, rInspection=3, rRepair=7**-1) simulates us an alternating renewal process that runs for approximately \(1000\) days. In Figure 8 we observe a positive correlation and heteroskedasticity between \(f^{-1}\) and unavailability. We take note of the function’s asymptotic behaviour as \(f^{-1} \to 0\) (similar to the vertical asymptote of a reciprocal function \(y = x^{-1}\)) and this is possibly due to the addition of inspection duration.

Code

# Set up the explanatory component
def checkpoint(freq, n):
    checkpoints = np.arange(freq**-1, n, freq**-1)
    return checkpoints

# Set up the stochastic component
def expFailure(rFailure):
    tFailure = np.random.exponential(rFailure**-1)
    return tFailure

def expInspection(rInspection):
    tInspection = np.random.exponential(rInspection**-1)
    return tInspection

def expRepair(rRepair):
    tRepair = np.random.exponential(rRepair**-1)
    return tRepair

# Function for a renewal process
def inspectionProcess(freq, n, rFailure, rInspection, rRepair):
    
    checkpoints = checkpoint(freq, n)
    
    # Set up the response component
    uptime = []
    downtime = []
    
    # Initialise key logging timepoints
    tRenewal = 0
    tResidual = 0 # spillover time at inspection point
    
    while tRenewal < checkpoints[-1]:
        
        tFailure = expFailure(rFailure)
        
        for i in range(0, len(checkpoints)):
            
            if tRenewal + tFailure > checkpoints[i]:
                # No failure occurs **before the checkpoint**, Inspection conducted.
                tInspection = expInspection(rInspection)
                
                if tRenewal + tFailure > checkpoints[i] + tInspection:
                    # No failure occurs **during the inspection**, Loop continues
                    uptime.append(freq**-1 - tResidual)
                    downtime.append(tInspection)
                    tResidual = tInspection
                    continue
                elif tRenewal + tFailure < checkpoints[i] + tInspection:
                    # Failure occurs **during the inspection**, Repair performed, Loop breaks.
                    tRepair = expRepair(rRepair)
                    tLatent = 0 # No latent time
                    uptime.append(freq**-1 - tResidual - tLatent)
                    downtime.append(tLatent + tInspection + tRepair)
                    tResidual = tInspection + tRepair
                    tRenewal = tRenewal + tFailure + tLatent + tInspection + tRepair
                    break
        
            elif tRenewal + tFailure < checkpoints[i]:
                # Failure occurs, Inspection conducted, Repair performed, Loop breaks.
                tInspection = expInspection(rInspection)
                tRepair = expRepair(rRepair)
                tLatent = checkpoints[i] - (tRenewal + tFailure)
                uptime.append(freq**-1 - tResidual - tLatent)
                downtime.append(tLatent + tInspection + tRepair)
                tResidual = tInspection + tRepair
                tRenewal = tRenewal + tFailure + tLatent + tInspection + tRepair
                break
                
    na = sum(downtime)/sum(uptime+downtime)
    return na

steps = np.linspace(1/3, 365, 20)
results = {s: [inspectionProcess(s**-1, n=1000, rFailure=(17.72*30)**-1, rInspection=3, rRepair=7**-1) for _ in range(10)] for s in steps}

s_val = []
na_val = []
for s, values in results.items():
    s_val.extend([s] * 10)
    na_val.extend(values)

plt.scatter(s_val, na_val, alpha=0.6, label = "inspectionProcess($f^{-1}$, $1000$, $(17.72 \cdot 30)^{-1}$, $0.3^{-1}$, $7^{-1}$)")
plt.xlabel("$f^{-1}$")
plt.ylabel("unavailability")
plt.legend(loc='upper right')
plt.show()

Figure 8: A scatterplot. Simulation runs for 1000 days.

As we zoom in on the horizontal axis, the simulated unavailability curve is flattened (mostly) for \(25 < f^{-1} < 75\). Figure 9 suggests that system downtime may be indifferent to more frequent inspection plans for some range of \(f\). We will test this hypothesis and examine the results further in the following section.

Code

small_steps = np.linspace(1/3, 180, 20)
results = {s: [inspectionProcess(s**-1, n=1000, rFailure=(17.72*30)**-1, rInspection=3, rRepair=7**-1) for _ in range(10)] for s in small_steps}

s_val = []
na_val = []
for s, values in results.items():
    s_val.extend([s] * 10)
    na_val.extend(values)

plt.scatter(s_val, na_val, alpha=0.6, label = "inspectionProcess($f^{-1}$, $1000$, $(17.72 \cdot 30)^{-1}$, $0.3^{-1}$, $7^{-1}$)")
plt.xlabel("$f^{-1}$")
plt.ylabel("unavailability")
plt.legend(loc='upper right')
plt.show()

Figure 9: A zoomed-in scatterplot. Simulation runs for 1000 days.

In Figure 10, a local minimum is obtained at \(f^{-1}\approx 38.158\) days, with corresponding unavailability of \(0.019\).

Code

mean_nas = []
for i in range(0, len(small_steps)):
    mean_nas.append(np.mean(na_val[i*10: (i+1)*10]))

plt.scatter(small_steps, mean_nas, alpha=0.6, label = "inspectionProcess($f^{-1}$, $1000$, $(17.72 \cdot 30)^{-1}$, $0.3^{-1}$, $7^{-1}$)")
plt.plot(small_steps[np.argmin(mean_nas)], np.min(mean_nas), 
         'bo',
         label = f"minimum: {round(small_steps[np.argmin(mean_nas)], 3), round(np.min(mean_nas), 3)}")
plt.xlabel("$f^{-1}$")
plt.ylabel("unavailability")
plt.legend(loc='upper right')
plt.show()

Figure 10: Asymptotic behaviour and particular solution.

5 Results

Figure 11 demonstrates the function derived in Equation 18 matches the mean curve obtained from simulation.

Code

t=np.linspace(1/3, 180,num=100)
y=function(t)
plt.plot(t,y,'r--',label='expectation')

plt.scatter(small_steps, mean_nas, alpha=0.6, label = "simulated average")
plt.plot(small_steps[np.argmin(mean_nas)], np.min(mean_nas),
         'bo',
         label = f"minimum: {round(small_steps[np.argmin(mean_nas)], 3), round(np.min(mean_nas), 3)}")
plt.xlabel("$f^{-1}$")
plt.ylabel("unavailability")
plt.legend(loc='upper right')
plt.show()

Figure 11: Asymptotic behaviour and particular solution cont’d.

Apparently, there is a positive correlation between unavailbility and inspection interval, for most cases. This relationship is, however, unclear as we zoom in to a small scale, say \(25 < f^{-1} < 50\).

We know that theorectically, unavailability is minimised at \(f^{-1} = 17.857\) days (see Table 1). A \(1000\)-day simulation of the renewal process obtains the optimal \(f^{-1}\) at \(38.158\) days. We would like to investigate if there is any significant difference in unavailability obtained at frequencies of \(17.857\) days per inspection and \(38.158\) days per inspection.

Hence we perform a hypothesis test for difference in means to determine if the average values of unavailability at \(f^{-1}=17.857\) and \(f^{-1}=38.158\) are significantly different. Because the data is heteroscedastic according to Figure 12, we cannot use a Student’s \(t\)-test, as it assumes equal variances in both groups. Welch’s \(t\)-test is used instead. The test statistic is based on the ratio of the difference between the means and the standard deviation of the groups.

Code

var=[]
for i in range(0, len(small_steps)):
    var.append(np.var(na_val[i*10: i*10+10]))

plt.scatter(small_steps, var, marker='s', color='red')
plt.xlabel("$f^{-1}$")
plt.ylabel("variance of unavailability")
plt.show()

Figure 12: A plot of sample variances in the range \(0<f^{-1}<180\).

Code

from scipy import stats

# Sample data
sample1 = na_val[20: 30]
sample2 = na_val[40: 50]

# Perform Welch's t-test
t_stat, p_value = stats.ttest_ind(sample1, sample2, equal_var=False)

print(f"Comparison of means at {round(small_steps[2], 3)} and {round(small_steps[4], 3)} | T-statistic: {t_stat}, p-value: {p_value}")

Comparison of means at 19.246 and 38.158 | T-statistic: 0.8941750456268369, p-value: 0.3855595583690893

Code

# Sample data
sample1 = na_val[40: 50]
sample2 = na_val[80: 90]

# Perform Welch's t-test
t_stat, p_value = stats.ttest_ind(sample1, sample2, equal_var=False)

print(f"Comparison of means at {round(small_steps[4], 3)} and {round(small_steps[8], 3)} | T-statistic: {t_stat}, p-value: {p_value}")

Comparison of means at 38.158 and 75.982 | T-statistic: -2.021441053831314, p-value: 0.061903541513097265

A p-value of \(0.3856\) suggests that there is no sufficient evidence to reject the null hypothesis that the two samples of unavailability at \(f^{-1}=19.246\) and \(f^{-1}=38.158\) have equal means. In fact, at a significance level of \(0.05\), we have no evidence to conclude that a system subject to \(38.158\) days per inspection and \(75.982\) days per inspection results in different (if not better) availability. This suggests, in terms of the cost efficiency of optimal inspection and maintenance solution, both analytical solution and simulated solution via Monte Carlo method may give sub-optimal results.

As discovered in \(\S3.3\), the optimal inspection frequency decreases as each inspection takes a longer span of time. This consequently prolongs system downtime and gives rise to higher rate of unavailability. In Figure 13 we come to the same conclusion about the effect of inspection duration on the optimal choice of inspection frequency and its resulting unavailability.

A list of simulated solutions subject to various inspection rate \(\mu\) is also shown in Table 2.

Code

# Response: `na`
# Explanatory: `step` (frequency-related), `mu` (duration-related)
# Controlled: `mtbf` (same subject under examination), `gamma` (repair time), `n` (same system running time)

step_val = np.linspace(1, 60, 100)
duration_val = np.linspace(0.3, 1, 8)

xs, ys = np.meshgrid(step_val, duration_val)

zs = np.array([inspectionProcess(x**-1, 1000, (17.72*30)**-1, y**-1, 7**-1) for x, y in zip(xs.flatten(), ys.flatten())])
zs = zs.reshape(xs.shape)

# set up a figure twice as wide as it is tall
fig = plt.figure(figsize=plt.figaspect(0.3))

# =============
# First subplot
# =============
# set up the Axes for the first plot
ax = fig.add_subplot(1, 2, 1, projection='3d')

# plot a 3D surface
surf = ax.plot_surface(xs, ys, zs, vmin=zs.min() * 2, cmap='coolwarm',
                       linewidth=0, antialiased=False)
ax.set_zlim(0, 1.0)
fig.colorbar(surf, shrink=0.5, aspect=10)
ax.set_title("(a) 3D Surface Plot")
ax.set_xlabel('$f^{-1}$')
ax.set_ylabel('$\mu^{-1}$')

# ==============
# Second subplot
# ==============
# set up the Axes for the second plot
ax = fig.add_subplot(1, 2, 2)

# plot a contour
levels = np.linspace(zs.min(), zs.max(), 10)
contour = ax.contourf(xs, ys, zs, levels=levels, cmap='coolwarm')
cbar = fig.colorbar(contour, shrink=0.5, aspect=10)
ax.set_title("(b) Contour Map")
ax.set_xlabel('$f^{-1}$')
ax.set_ylabel('$\mu^{-1}$')
cbar.set_label('unavailability')

plt.show()

Figure 13: The effect of inspection rate \(\mu\) on the relationship between inspection frequency and unavailability. The left panel shows a 3D surface plot. The right panel shows a contour map. Simulation results are obtained by running `inspectionProcess(x**-1, 1000, (17.72*30)**-1, y**-1, 7**-1)`.

Code

xs_min = []
ys_min = []
zs_min = []

for i in range(0,8):
    zs_min.append(round(np.min(zs[i]),3))
    xs_min.append(xs[i, np.argmin(zs[i])])
    ys_min.append(ys[i, np.argmin(zs[i])])

df = pd.DataFrame({
    "unavailability": zs_min,
    "f_inv": xs_min[::-1],
    "mu_inv": ys_min
})

Markdown(tabulate(
  df, 
  headers=["Unavailability","$f^{-1}$ (in days)", "$\mu^{-1}$ (in days)"],
  showindex=False
))

Table 2: Simulated solution via Monte Carlo method subject to varying average inspection duration.

Unavailability	\(f^{-1}\) (in days)	\(\mu^{-1}\) (in days)
0.009	59.404	0.3
0.011	50.4646	0.4
0.012	46.8889	0.5
0.013	46.2929	0.6
0.017	57.0202	0.7
0.019	59.404	0.8
0.021	58.2121	0.9
0.022	40.9293	1

6 Conclusions

The standby safety systems play a crucial role in ensuring the continuity of operation in a nuclear power plant, and minimising system downtime is essential for the safety and reliability of nuclear energy. Periodic inspections are conducted to detect and address faults or failures in the system as early as possible. We consider a scenario in which inspections take a random, non-negligible duration of time, and by modelling system uptime and downtime with alternating renewal processes, we find an optimal solution of inspection frequency that minimises system unavailability.

The key finding is that, under the conditions given by empirical parameters provided by van der Weide and Pandey (2015) and the U.S. NRC (2008), the optimal inspection interval is approximately \(38.16\) days. The result differs slighly from the one obtained through derivations of expected unavailability and its corresponding graphic solution for optimal inspection frequency. A secondary finding is that a range of inspection frequencies may result in a similar level of unavailability, in which case it is generally preferable to choose the less frequent inspection plan, given cost consideration and other practical constraints (if any).

We would also like to acknowledge some limitations surrounding the proposed model. Aging is an important aspect in a realistic modelling of mechanical units (and components) in engineering. The degradation of equipment from its ongoing usage, even if the most careful maintenance is carried out, is inevitable and is likely to prolong the duration of inspections and repair over time. Aging also has implications on the distribution of time between failures and its scale parameter, MTBF. To take the “wear and tear” factor into consideration, one possible approach would be to make the probability distribution of failure time time-dependent.

Secondly, if more relevant on-site data become available, our model can incorporate the element of costs — both direct costs associated with inspections and maintenance and those indirectly caused by production interruptions due to faulty safety components — as a cost function. This would allow the problem to be formulated as a constrained optimisation problem. As we have pointed out in Figure 4 and Figure 13, ceteris paribus, there is no statistically significant variations in system unavailability subject to \(38.158 < f^{-1} < 75.982\). Considering the element of operational costs, it is more preferable to pace the inspections every \(76\) days since it is undoubtedly a cheaper option. Hence in practice, the ideal inspection frequency is pushed further to the right side of the optimal solution obtained by long-run analytical solution, as well as our model.

Another important assumption in our model is the distribution of failure and maintenance times. We assumed that the time between failures, inspection durations, and repair time follow exponential distributions with respective parameters cited from existing literature as well as requirements set out by nuclear regulatory agencies. This assumption was made partly for ease of modelling and simulation. Further study is needed to determine whether alternative probabilistic distributions better describe these variables.

Let’s hope that someone like Homer Simpson will never be put in charge of inspection at a nuclear power plant in real life. Until now, we have assumed that inspections detect failures with \(100\) percent accuracy and that human error or negligence never occurs. But, in reality, they do. Human error is often the overlooked factor in scientifically and mathematically rigorous models, yet it remains a critical aspect of system reliability.

In summary, our proposed method for evaluating inspection and maintenance policies, along with the optimisation strategy to minimise system downtime, provides guidance to safety personnel at nuclear power plants and nuclear energy policymakers. We provide evidence to support informed decision-making on the optimal inspection frequency for standby safety components to ensure reliability and efficiency of power production. Additionally, further studies — such as time-dependent failure model or integrating on-site failure data — can contribute to the continuous improvement of nuclear safety standards.

References

Caldarola, L. 1977. “Unavailability and Failure Intensity of Components.” Nuclear Engineering and Design. https://doi.org/10.1016/0029-5493(77)90131-5.

Cui, Lirong, and Min Xie. 2003. “Availability of a Periodically Inspected System with Random Repair or Replacement Times.” Journal of Statistical Planning and Inference. https://doi.org/10.1016/j.jspi.2003.12.008.

International Atomic Energy Agency. 2006. Fundamental Safety Principles. IAEA, Vienna. https://doi.org/10.61092/iaea.hmxn-vw0a.

Lienig, Jens, and Hans Bruemmer. 2017. Fundamentals of Electronic Systems Design. Springer. https://doi.org/10.1007/978-3-319-55840-0.

Tokyo Electric Power Company. 2009. “Press Releases 2009-02-25.” TEPCO, Tokyo. https://www.tepco.co.jp/cc/press/09022501-j.html.

U.S. Nuclear Regulatory Commission. 2008. “Temporary Instruction 2515/176.” U.S. Nuclear Regulatory Commission, Washington, DC. https://www.nrc.gov/docs/ML0804/ML080420064.pdf.

———. 2009. “Emergency Diesel Generator Technical Specification Surveillance Requirements Regarding Endurance and Margin Testing: Summary Report.” U.S. Nuclear Regulatory Commission, Washington, DC. https://www.nrc.gov/docs/ML0933/ML093370252.pdf.

van der Weide, J. A. M., and Mahesh D. Pandey. 2015. “A Stochastic Alternating Renewal Process Model for Unavailability Analysis of Standby Safety Equipment.” Reliability Engineering and System Safety. https://doi.org/10.1016/j.ress.2015.03.005.

Vaurio, J. K. 1996. “On Time-Dependent Availability and Maintenance Optimization of Standby Units Under Various Maintenance Policies.” Reliability Engineering and System Safety. https://doi.org/10.1016/S0951-8320(96)00132-9.

Weibull, Waloddi. 1951. “A Statistical Distribution Function of Wide Applicability.” Journal of Applied Mechanics.

Footnotes

Waloddi Weibull invented the Weibull distribution in 1937 and delivered his hallmark American paper on this subject in 1951. Today, Weibull analysis is the leading method in the world for fitting life data [Abernethy, 2002].↩︎
This is referred to as the mean time between failures (MTBF) in the context of engineering and reliability analysis (see chap. 4, pp. 45-73, in Lienig and Bruemmer 2017).↩︎
See convergence theorem and MCMC [Sylvester, 2019].↩︎