# Introduction to Bayesian Networks and Predictive Maintenance — Part 1

Mohamed Benzerga (Data Scientist, PhD)

*A Bayesian Network is a Machine Learning model which captures dependencies between random variables as a Directed Acyclic Graph (DAG). It’s an explainable model which has many applications, especially in predictive maintenance.*

# Predictive Maintenance at Expleo

Expleo is a consulting and engineering company where data science experts are involved in many business areas (railway, automotive, health, aeronautics, finance…). In particular, a classical problem addressed at Expleo is Predictive Maintenance of various industrial machines or systems. It consists in using data science algorithms to predict failures before they happen, so as to trigger the right maintenance action at the right time. As illustrated below, predictive maintenance can be split in at least 3 main topics:

- Remaining Useful Life (RUL) estimation
- Failure detection, for instance using a health status indicator
- Diagnostic assistance, to determine the type and/or the origin of a detected failure.

This serie of 3 posts focuses on Bayesian networks, which can be an efficient tool for failure diagnostic.

# Some Theory about Bayesian Networks

## What is a Bayesian Network?

First, recall that a graph is **directed **if its edges are… directed, i.e. oriented from a vertex to the other one. Such a graph is **acyclic **if it doesn’t have any cycle, i.e. if you can’t find a path from a vertex to itself by following some edges.

Now, aBayesian Networkis a directed acyclic graph and:

- itsvertices (or nodes)are random variables

- each of itsarrowscorresponds to a conditional dependency relation: an arrow B → A indicates that A depends on B

- moreover, we attach to each node A theconditional probability distribution of the corresponding random variable A given its parents(i.e. given the nodes B for which there is an arrow B → A).

Let’s look at an example. The graph below is a very classical example (called the “Asia Bayesian network”) for a medical diagnosis purpose:

This Bayesian Network contains 8 nodes, corresponding to binary random variables which can be observed or diagnosed by a doctor on a patient:

- 4 variables can be observed: did the patient go to Asia? Is the patient a smoker? Does he suffer from dyspnoea? Does the X-ray (chest radiography) give a positive result?
- 3 other variables correspond to a diagnosis to be chosen by the doctor: does the patient suffer from tuberculosis? Lung cancer? Bronchitis?
- 1 intermediate variable “Either Tuberculosis or Lung cancer” which has deterministic relationship with the 2 variables “tuberculosis” and “lung cancer”.

To complete this Bayesian Network, we must add the conditional probability distribution of each node. Since all nodes correspond to *discrete *random variables (which isn’t necessary the case for a general bayesian network), we have to choose **Conditional Probability Tables (CPT) **for each node.

For instance, the first row of the CPT of dyspnoea tells us that:

ℙ(Dyspnoea = Yes | Tub or Lung = Yes, Bronchitis = Yes) = 0.9

This probability, like any other in the network above, may be either chosen by an expert of pulmonary diseases or learned by an algorithm, as we will see below.

**Remark: **if a node has a continuous conditional distribution (e.g. a Gaussian distribution) given its parents, then the parameters of this distribution (e.g. mean and variance in the Gaussian case) depend on the values of the parents.

Another example of a Bayesian Network is the **Naive Bayes classifier**. Indeed, if X1, … Xn are the observed variables and C is the class variable, a Naive Bayes classifier assumes that all the variables X1, …, Xn are independent given C and each of them depends only on C. This corresponds exactly to the following network:

## Some Elementary Properties of Bayesian Networks

A Bayesian Network is a compact and efficient way to encode the joint probability distribution of several random variables. For example, specifying explicitly the probabilities of all possible cases for the 8 binary variables of the Asia Bayesian Network corresponds to 2⁸ = 256 probabilities whereas the Bayesian Network above contains only 36 probabilities!

Moreover, a Bayesian Network depicts the dependency relations between its random variables. Indeed, these relations can be seen directly on the network, thanks to the:

Markov Property of Bayesian Networks:each variable is conditionally independent of its non-descendants given its parent variables.

For example, in the Asia network on the left, since the only parent of “Lung Cancer” is “Smoking” and its descendants are “Either Tub. or Lung Cancer”, “Positive X-ray” and “Dyspnoea”, we can see immediately that “Lung Cancer” is independent of “Bronchitis” given “Smoking”. Thus, these 2 diseases are independent events if we know if the patient is a smoker or not.

Finally, a bayesian network can be used to compute any probability concerning some of its variables. Usually, 2 properties are sufficient for this purpose:

- the
**Bayes formula :**

- the following property of Bayesian networks, consequence of their Markov property, which we will call the
**Factorization property of bayesian networks**(since it is a factorization of the joint distribution):

## Computing Probabilities with Bayesian Networks — Basic Inference Algorithms

Thanks to these properties, a Bayesian network can be used to compute any (conditional) probability involving some of its variables. More precisely, assuming that we know all the conditional distributions of a bayesian network, we want to compute a probability ℙ(Q|E = e) = ℙ(query|evidence), where Q is a node of a Bayesian network, E is one or several nodes of this network and e is a possible value of E.

For example, in the Asia network, if the patient :

- has a dyspnoea
- has a positive result on X-ray
- is a smoker
- and didn’t go to Asia,

does this patient have a bronchitis?

To answer this question, we can compute the probability below:

and conclude that the patient may have a bronchitis if this probability is greater than 0.5.

There are 2 main types of algorithms to compute probabilities with a Bayesian network:

**Exact inference algorithms**using exact probabilistic computations based on the formulas we have seen above.**Approximate inference algorithms**based on sampling methods.

## Variable Elimination: an Exact Inference Algorithm

**Idea: **if Q, E, X1, …, Xk are the variables of a Bayesian network, then:

where each xi can take any of the possible values of the corresponding variable Xi.

The Variable Elimination algorithm consists in rewriting the final (multiple) sum by:

- using the factorization property we saw before
- and factorize the sum as much as possible to reduce the number of computations, by rewriting it like this (for example):

Finally, the algorithm terminates by computing this factorized multiple sum from the most inner sum to the most outer sum, using the probabilities given by the Bayesian network. This process eliminates each of the variables Xk one by one: that’s why this algorithm is called the “Variable Elimination algorithm”.

Maybe it will be more comprehensive by looking at an example.

**Example: **Using the Asia network, let’s try to compute the probability:

First, note that:

Now, let’s introduce the other variables of the network thanks to the Law of Total Probability:

Using the Factorization property of Bayesian networks, this multiple sum becomes:

Our first goal will be to factorize this multiple sum as much as we can to isolate the probabilities which depend only on some of the 3 variables *Either, Tub, Lung.*

First, we can factor this multiple sum by ℙ(br = Y|smoke = Y) since it’s the only factor which doesn’t depend on any of the 3 variables *Either, Tub, Lung*:

Now, in the sum above, we factor out the probabilities depending only on *Either *(which are boxed in purple):

We make a last transformation by factoring ℙ(*Tub*|Asia = No) out of the parentheses, since it depends only on *Tub *and not on *Lung*:

That’s it for the factorization! Now, the Variable Elimination algorithm consists in calculating *p *from the most inner sum (the one based on *Lung*) to the most outer one (based on *Either*) using the CPTs of the Asia network on the left.

First, ℙ(*Lung = *Y|smoke = Y) = 0.1 and ℙ(*Lung* = No|smoke = Y) = 0.9, hence the sum becomes:

It removes all occurrences of *Lung *alone. Now, let’s do the same with *Tub *by using ℙ(*Tub* = Y|Asia = No) = 0.01 and ℙ(*Tub* = No|Asia = No) = 0.99. We see that *q *(the sum on *Tub*) equals to:

Note that *q *depends on the value given to the variable *Either. *Thus, we can simplify *q *as follows when *Either* = Y:

Looking at the CPT of *Either *(or recalling that *Either* means *Either Lung or Tub*), we have ℙ(*Either* = Y|*Lung = *Y, *Tub = *Y) = ℙ(*Either* = Y|*Lung = *No, *Tub = *Y) = ℙ(*Either* = Y|*Lung = *Y, *Tub = *No) = 1 but ℙ(*Either* = Y|*Lung = *No, *Tub = *No) = 0, thus:

A similar computation gives us:

Now, we can go back to *p *and replace *q *by these 2 values in the sum depending on *Either*:

Finally (!), we just have to read each of these probabilities in the CPTs:

In fact, *p *isn’t exactly the probability we want (!!) since:

Thus, to give a diagnosis to the patient, we would also need to compute with the same method the analogous probability with br = No (in order to compare it with *p*):

The same calculation as above (modified only by replacing “br = Y” by “br = No”) gives:

Thus, since *p *is the greatest of our 2 results, the patient has probably a bronchitis (in fact, this probability is 0.079/(0.079+0.053) = 0.598).

These tedious calculations can take too much time (it has an exponential complexity), especially when the network is “multiply connected”, i.e. when there are several paths from one node to another in the network, as in the figure below:

Thus, it can be useful to have approximate inference methods, based on samplings, as we will see now.

## Some Approximate Inference Algorithms

There are many approximate inference algorithms for Bayesian networks. Here, we’ll concentrate on the 2 simplest ones: the **rejection sampling **algorithm and the **likelihood weighting **algorithm.

Again, recall that we want to compute a probability ℙ(Q|E = e) = ℙ(query|evidence), where Q is a node of a Bayesian network, E is one or several nodes of this network and e is a possible value of E.

**1/ Rejection sampling (or logic sampling):**

The process goes as follows:

- Sample
*N*data points from the distribution of the Bayesian network (that is, create one point by sampling each node from top to bottom, then repeat again this process to have*N*points). - From these
*N*points, reject those which don’t satisfy the evidence E = e. - Use the remaining “good” samples (such that E = e) to estimate the conditional distribution of Q given E = e.

For this last step, if this distribution is discrete, you just have to compute the (empiric) proportion of the occurrences of each possible value of Q in the set of “good” samples.

This algorithm is of course very simple but also very inefficient! Indeed, applying this method leads to perform many useless samplings, which is a waste of computation time, of memory, of energy, etc.

**2/ Likelihood weighting:**

Let’s see how this algorithm works:

- Sample
*N*data points from the distribution of the Bayesian network as follows:

- sample each node
**except those of E**from top to bottom, using the distributions of these nodes **force each node of E**to take the values given in e- and
**give to the data point a weight**equal to the following product of probabilities:

where the … should be replaced by the values (sampled or coming from e) of the parents of Ei.

2. Use all these sampled points to estimate the conditional distribution of Q given E = e. In the discrete case, it suffices to compute the average of the weights of the samples corresponding to a fixed value of Q.

These sampling algorithms are the simplest ones. Other methods of approximate inference exist, especially Monte Carlo methods (like Gibbs sampling). Moreover, some approximate inference algorithms are not sampling-based, as the “loopy belief propagation” method, based on probabilistic computations which are exact in the particular case of trees.

All the algorithms we have seen up to now assume a complete knowledge of the Bayesian network to which they are applied. Now, we will focus on learning methods for Bayesian networks: how can we build (or learn) the graph structure and the probability distributions of a bayesian network from data?

## Parameters Learning Methods

Learning probability distributions of each node of a bayesian network (assuming we already know its structure) only from data is very easy if our dataset is complete (i.e. without missing values). Indeed, it’s a classical **Maximum Likelihood Estimation (MLE)**, and for **discrete **nodes, it boils down to a simple calculation: each probability can be estimated as the **number of favorable** **outcomes divided by the** **total number** **of possible outcomes**. More precisely, a probability ℙ(Xi = j | parents(Xi) = k) can be estimated by:

For example, with the Asia network on the left, if we have a dataset in which:

- 320 patients have a bronchitis and either tuberculosis or a lung cancer
- and among them, 112 have a dyspnoea,

then the MLE of ℙ(dyspnoea = Yes | Either = Yes, Bronchitis = Yes) is simply 112/320 = 0.35.

In the continous case, it’s not much difficult since it’s generally the classical MLE for a given type of continuous distribution. For instance, if you assume a Gaussian distribution for a given node, its mean and variance can be estimated by the usual formulas for empirical mean and variance.

Finally, if some data are missing in our training set, it’s a harder case which can be solved using **EM (Expectation-Maximization) algorithm**, which we won’t discuss here (cf. [1] for a detailed discussion of this algorithm for bayesian networks).

We can also learn probability distributions in a Bayesian network with… **Bayesian inference** (i.e. by combining domain expertise, encoded in a prior distribution, and learning from data). In the discrete case, the prior distribution (in our case, the distribution of the probabilities in the CPTs of nodes) is generally a Dirichlet distribution whose density is given by:

Then, each probability ℙ(Xi = j | parents(Xi) = k) can be estimated as a **Maximum A Posteriori (MAP). **It’s a slight variation of the MLE we found above, given by:

In fact, the parameters α of the Dirichlet distribution can be seen as “virtual counts” given by domain experts (i.e. each α_{ijk} can be seen as a “virtual” number of cases where X_i = j and parents(X_i) = k).

## Structure Learning Algorithms

We reach the last question to solve algorithmically for bayesian networks: how to learn their structure from data? There are 3 main classes of structure learning algorithms that we will present briefly.

**1/ Score-based algorithms**

**Principle:** optimize a score which evaluates how well a structure fits with data.

Thus, we need a scoring method and an optimization algorithm:

**Score:**in general, it’s the log-likelihood of the network G penalized by a term which expresses the complexity of the network (as a function of the number of observations of our dataset D and the number of probabilities to compute) to avoid having a network with too many arrows:

The simplest examples of scores are AIC and BIC, but there are also specific scores for Bayesian networks.

**Optimization:**many algorithms exist. The simplest of them is called**hill-climbing search.**It’s a local optimization method which begins with an initial graph (for example, the “empty” one, without any arrow). At each step, the algorithm chooses the best elementary operation (i.e. addition, removal or inversion of**one**arrow). Here, the “best” operation is the one which increases the most the score. This stops when arriving at a step where no elementary operation increases significantly the score. Finally, one may perform several loops like this, changing randomly the initial network, in order to avoid staying at a sub-optimal local maximum. There exist also other types of optimization algorithms used for structure learning, like genetic algorithms (cf. [2]).

**2/ Constraint-based algorithms**

**Principle**: perform conditional independence tests between the variables to decide the structure of the network.

The simplest algorithm for this purpose is the **PC-stable algorithm** described below (where X_i are the variables of the network, for i between 1 and N):

This approach needs enough data so that statistical tests are reliable.

**3/ Hybrid algorithms**

Finally, there are also some hybrid algorithms, i.e. algorithms which combine independence tests and score optimization. Their main idea can be generally decomposed into 2 main steps:

**reduction of the search space:**for each node/variable X, use conditional independence tests to select the (potential) parents and children of X. Some algorithms can find the exact set of parents and children of X, which means they give the complete*undirected*graph structure.**maximize score in this reduced search space,**by allowing only the elementary operations compatible with the (potential) parents/children of each node.

An example of hybrid algorithm is the **Max-Min Hill Climbing (MMHC) algorithm**, cf. [3].

*This post is the first of a series about Bayesian Networks. Our next post will present the main advantages and drawbacks of Bayesian Networks, with Predictive Maintenance examples.*

# References

[1] Serge Romaric Tembo Mouafo, Sandrine Vaton, Jean-Luc Courant, Stephane Gosselin. *A tutorial on the EM algorithm for Bayesian networks: application to self-diagnosis of GPON-FTTH networks.* IWCMC 2016: 12th International Wireless Communications & Mobile Computing Conference, Sep 2016, Paphos, Cyprus. pp.369–376, ff10.1109/IWCMC.2016.7577086ff. ffhal-01394337f

[2] Larranaga, Pedro & Poza, Mikel & Yurramendi, Yosu & Murga, Roberto & Kuijpers, Cindy. (1996). *Structure Learning of Bayesian Networks by Genetic Algorithms: A Performance Analysis of Control Parameters.* Pattern Analysis and Machine Intelligence, IEEE Transactions on. 18. 912–926. 10.1109/34.537345.

[3] Tsamardinos, I., Brown, L.E. & Aliferis, C.F. *The max-min hill-climbing Bayesian network structure learning algorithm.* Mach Learn **65, **31–78 (2006). https://doi.org/10.1007/s10994-006-6889-7

[4] Stuart Russell and Peter Norvig*. Artificial Intelligence: A Modern Approach (3rd. ed.). *Prentice Hall Press, 2009.