    Introduction to Deep Evidential Regression for Uncertainty Quantification


This article is an introduction to evidential deep learning (EDL), a framework for one-shot quantification of epistemic and aleatoric uncertainty. More specifically, we will focus on a subset: deep evidential regression (DER), as published in Amini et al. (2020). Don't worry if these words are confusing; we will walk through them shortly.

This article assumes some background in machine learning, statistics, and calculus; we will build intuition for the algorithm along the way. Then, we will work through an example of approximating a cubic function and briefly touch on other applications. My goal isn't to convince you that EDL is perfect; rather, I think it's an interesting, developing subject worth keeping an eye on. The code for the demo and visualizations is available here. I hope you enjoy!

    Deep Evidential Regression diagram. Credit: Amini et al., 2020.

    What is Uncertainty and Why is it Important?

Decision-making is hard. Humans draw on innumerable factors from the surrounding environment and from past experience, often subconsciously, and aggregate them to inform choices. This is known as intuition (or vibes), and it can be inversely framed as uncertainty. It's common even in disciplines such as surgery, which are highly technical and grounded in scientific evidence: a 2011 study of 24 surgeons found that a large share of critical decisions (46%) were made through rapid intuition rather than a deliberate, comprehensive analysis of all alternative courses of action.

If it's already hard for humans to quantify uncertainty, how could machines possibly go about it? Machine learning (ML) and especially deep learning (DL) algorithms are increasingly deployed to automate decision-making normally performed by humans, not only in medical procedures but in other high-stakes environments such as autonomous car navigation. Most ML classification models apply a nonlinear activation function in their final layer. Softmax, for instance, converts logits to a categorical distribution summing to one via the following formula:

\[ s(\vec{z})_{i}=\frac{e^{z_{i}}}{\sum_{j=1}^{N}e^{z_{j}}} \]

It's tempting to interpret softmax outputs as probabilities expressing confidence or uncertainty, but this isn't a faithful representation. Consider for a moment a training dataset that contains only black dogs and white cats. What happens if the model encounters a white dog or a black cat? It has no reliable mechanism to express uncertainty: it is forced to make a classification based on what it knows. In other words, out-of-distribution (OOD) datapoints cause big problems.
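To see the problem concretely, here's a tiny sketch (the logit values are made up for illustration): softmax always returns a normalized distribution, so scaling up arbitrary logits only makes the output look *more* confident, with no way to flag an unfamiliar input.

```python
import torch
import torch.nn.functional as F

# softmax has no way to say "I don't know": any logits, including those
# produced by an out-of-distribution input, map to a normalized,
# confident-looking categorical distribution
logits_in = torch.tensor([4.0, 0.0])   # hypothetical in-distribution logits
logits_ood = torch.tensor([8.0, 0.0])  # hypothetical OOD input with larger logits

print(F.softmax(logits_in, dim=0))   # tensor([0.9820, 0.0180])
print(F.softmax(logits_ood, dim=0))  # tensor([0.9997, 0.0003]) -- even *more* "confident"
```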

    Formalizing Uncertainty and Uncertainty Quantification (UQ) Approaches

    Now that we have established problems with naively taking softmax as a measure of uncertainty, we should formalize the concept of uncertainty. Researchers typically separate uncertainty into two categories: epistemic and aleatoric.

    1. Epistemic: stems from a lack of knowledge about the data. Quantified through model disagreement, e.g. training multiple models on the same dataset and comparing their predictions.
    2. Aleatoric: the inherent noisiness of the data. May be quantified through heteroscedastic regression, where the model outputs a mean and a variance for each sample.

    Let’s see an example of what this might look like:

    Approximating a cubic function. We would expect high aleatoric uncertainty where data is noisy but high epistemic uncertainty in out-of-distribution regions. Figure made by author.

Researchers have developed architectures capable of quantifying epistemic and/or aleatoric uncertainty with varying levels of success. Because this article is primarily focused on EDL, other approaches will receive relatively lighter coverage; I encourage you to read about them in greater depth, as many amazing improvements are being made to these algorithms all the time. Three UQ techniques are discussed: deep ensembles, (Bayesian) variational inference, and (split) conformal prediction. From now on, denote U_A and U_E as aleatoric and epistemic uncertainty, respectively.

Deep ensembles: train M independent networks with different initializations, where each network outputs a mean and a variance. During inference, compute epistemic uncertainty as U_E = var(µ): intuitively, we measure model disagreement across initializations by taking the variance over the models' mean outputs. Compute aleatoric uncertainty for one sample as U_A = E[σ]: we estimate the noise inherent to the data as the average of the models' predicted variances.
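As a minimal numeric sketch (the per-model outputs below are made up for one test point), the two quantities are just a variance across models and a mean across models:

```python
import torch

# toy ensemble outputs for a single test point: M = 5 networks,
# each predicting a mean and a variance (values are illustrative)
mus = torch.tensor([1.9, 2.1, 2.0, 1.8, 2.2])            # per-model means
variances = torch.tensor([0.30, 0.25, 0.35, 0.28, 0.32])  # per-model variances

u_epistemic = mus.var()         # disagreement across models -> 0.025
u_aleatoric = variances.mean()  # average predicted noise    -> 0.30
```

Note that the epistemic term vanishes when all ensemble members agree, regardless of how noisy each member thinks the data is.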

Variational inference (for Bayesian neural networks): instead of training M networks, we train one network in which each weight has a learned posterior distribution (approximated as a Gaussian with parameters µ and σ), optimized via the evidence lower bound (ELBO). At inference, uncertainty is estimated by sampling multiple weight configurations and aggregating the predictions.
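Here's a minimal sketch of the idea: a hand-rolled mean-field Gaussian layer (not a library API; all names are my own). Each weight stores a mean and a softplus-parameterized standard deviation, and every forward pass samples fresh weights via the reparameterization trick, so repeated passes disagree in proportion to the posterior's spread:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanFieldLinear(nn.Module):
    """Linear layer with a factorized Gaussian posterior over weights (illustrative)."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(d_out, d_in))
        # sigma = softplus(rho), so rho is unconstrained during optimization
        self.w_rho = nn.Parameter(torch.full((d_out, d_in), -3.0))

    def forward(self, x):
        sigma = F.softplus(self.w_rho)
        # reparameterization trick: sample weights while keeping
        # gradients with respect to mu and rho
        w = self.w_mu + sigma * torch.randn_like(sigma)
        return x @ w.t()

# at inference: aggregate several stochastic forward passes
layer = MeanFieldLinear(3, 1)
x = torch.randn(8, 3)
samples = torch.stack([layer(x) for _ in range(20)])  # shape (20, 8, 1)
u_epistemic = samples.var(dim=0)  # spread across weight samples
```

A full treatment would also add the KL term of the ELBO to the training loss; this sketch only shows where the sampling-based uncertainty comes from.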

Conformal prediction: a post-hoc UQ method that cannot natively disentangle epistemic and aleatoric uncertainty. Instead, it provides a statistical guarantee that a fraction (1 − α) of your data will fall within a predicted range. During training, create a network with "lower" and "upper" heads trained to capture the α/2 and 1 − α/2 quantiles via the pinball loss.
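The pinball (quantile) loss mentioned above is short enough to write out; this is a sketch with my own variable names:

```python
import torch

def pinball_loss(pred, target, q):
    # penalizes under-prediction with weight q and over-prediction with
    # weight (1 - q), so minimizing it pushes pred toward the q-th
    # conditional quantile of the target
    err = target - pred
    return torch.maximum(q * err, (q - 1.0) * err).mean()

# q = 0.5 recovers half the mean absolute error
pred = torch.tensor([0.0, 0.0])
target = torch.tensor([1.0, -1.0])
print(pinball_loss(pred, target, 0.5))  # tensor(0.5000)
```

The split-conformal step then widens the [lower, upper] band by a quantile of residuals on a held-out calibration set, which is what yields the coverage guarantee.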

    Again, this was a very quick overview of other UQ approaches so please read about them in greater depth if you’re interested (references at the end of the article). The important point is: all of these approaches are computationally expensive, often requiring several passes during inference or a post-hoc calibration step to capture uncertainty. EDL aims to solve this problem by quantifying both epistemic and aleatoric uncertainty in a single pass.

    DER Theory

At a high level, EDL is a framework in which we train models to output the parameters of higher-order distributions, i.e. distributions whose samples are themselves the parameters of a lower-order distribution such as the Gaussian.

Before we continue, a preface: we'll skim over the math-heavy proofs, but please read the original paper if you're interested. In deep evidential regression (DER), we are modeling an unknown mean μ and variance σ². We assume that these parameters are themselves distributed in a certain way; to capture this, we predict the parameters of the Normal Inverse Gamma (NIG) distribution for each sample in our dataset.

    The NIG is a joint probability distribution between the Normal (Gaussian) and the Inverse Gamma distributions and its relationship with the standard Gaussian is shown below.

    Relationship between Normal Inverse Gamma and Gaussian Distributions. Credit: Amini et al., 2020.

More formally, we define the NIG density as the product of a Normal density and an Inverse Gamma density. The Normal factor models the mean, whereas the Inverse Gamma factor models the variance.

\[ p(\mu,\sigma^2 \mid \gamma,\lambda,\alpha,\beta)=\mathcal{N}\!\left(\mu \mid \gamma,\tfrac{\sigma^2}{\lambda}\right)\,\Gamma^{-1}(\sigma^2 \mid \alpha,\beta) \]

Thus, γ and λ describe the expected mean and how concentrated it is (for the Normal), whereas α and β describe the shape and scale of the variance (for the Inverse Gamma). In case this is still a bit confusing, here are a few visualizations to help (from my repository, if you would like to experiment further).

    Effects of adjusting gamma and lambda (normal). Reducing gamma moves the expected mean to the left, whereas increasing lambda shrinks the variance of the mean. Figure made by author.
    Effects of adjusting alpha and beta (inverse gamma). Increasing alpha amounts to increasing degrees of freedom for the resulting t-distribution and smaller tails. Increasing beta scales the inverse gamma distribution while affecting tail behavior less. Figure made by author.

Once we have the parameters of the NIG, the authors of deep evidential regression show that we can compute epistemic and aleatoric uncertainty as follows:

\[ U_{A}=\sqrt{\frac{\beta}{\alpha-1}}, \qquad U_{E}=\sqrt{\frac{\beta}{\lambda(\alpha-1)}} \]

Intuitively, as more data is collected, λ and α increase, driving epistemic uncertainty toward zero. Again, for curious readers, the proofs for these equations are provided in the original paper. This calculation is essentially instantaneous compared to deep ensembles or variational inference, where we would have to train multiple models or run multiple passes of inference! Note: redefinitions of epistemic/aleatoric uncertainty have been proposed in later works for improved disentanglement and interpretation, but we're working with the standard formulation.

Now that we have an idea of what the NIG distribution does, how do we get a neural network to predict its parameters? Let's use maximum likelihood estimation. Denoting the parameter tuple (γ, λ, α, β) as m, we want to minimize L_{NLL}, where:

\[ L_{NLL}=-\log p(y \mid m) \]

To find p(y | m), we marginalize over μ and σ², weighting the likelihood of observing our data under each possible (μ, σ²) by the likelihood of drawing those parameters from our NIG distribution. This simplifies nicely to a Student's t-distribution:

\begin{align*}
p(y \mid m)&=\int_{0}^{\infty}\!\int_{-\infty}^{\infty} p(y \mid \mu,\sigma^2)\, p(\mu,\sigma^2 \mid m)\, d\mu\, d\sigma^2 \\
&=\text{St}\!\left(\text{loc}=\gamma,\ \text{scale}=\frac{\beta(1+\lambda)}{\lambda\alpha},\ \text{df}=2\alpha\right)
\end{align*}

Finally, we take the negative log of this marginal likelihood for our loss. We also use a regularization term that punishes high evidence paired with high error, giving our final loss as a weighted sum with hyperparameter λ_{reg} (named so as not to conflict with the λ parameter of the NIG):

\begin{align*}
L_{reg}&=|y-\gamma| \cdot (2\lambda+\alpha) \\
L&=L_{NLL}+\lambda_{reg}\, L_{reg}
\end{align*}
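For completeness, here is one way the combined loss could be implemented. This is my own sketch of the published NLL and regularizer formulas (the analytic NLL expansion is Eq. 8 of Amini et al., 2020), written to match the `evidential_regression(pred, y, lamb=...)` signature used in the training loop later in this article:

```python
import torch

def nig_nll(y, gamma, v, alpha, beta):
    # negative log-likelihood of the Student-t marginal p(y | m),
    # using the closed form from Amini et al. (2020)
    omega = 2.0 * beta * (1.0 + v)
    return (0.5 * torch.log(torch.pi / v)
            - alpha * torch.log(omega)
            + (alpha + 0.5) * torch.log((y - gamma) ** 2 * v + omega)
            + torch.lgamma(alpha) - torch.lgamma(alpha + 0.5))

def evidential_regression(pred, y, lamb=1.0):
    # pred is the (gamma, v, alpha, beta) tuple produced by the NIG head;
    # lamb plays the role of lambda_reg in the weighted sum above
    gamma, v, alpha, beta = pred
    nll = nig_nll(y, gamma, v, alpha, beta)
    reg = (y - gamma).abs() * (2.0 * v + alpha)
    return (nll + lamb * reg).mean()

# tiny smoke test with made-up NIG parameters
pred = (torch.zeros(4, 1), torch.ones(4, 1), torch.full((4, 1), 2.0), torch.ones(4, 1))
y = torch.randn(4, 1)
loss = evidential_regression(pred, y, lamb=3e-2)  # scalar tensor
```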

Whew! With the statistics theory out of the way, let's figure out how to make a neural network learn the parameters of the NIG distribution. This is actually quite simple: use a linear layer and output four parameters per output dimension, applying the softplus activation to each parameter to force it to be positive. There is an additional constraint α > 1 so that the uncertainties are well-defined (recall that the denominators contain α − 1).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NormalInvGamma(nn.Module):
        def __init__(self, in_features, out_units):
            super().__init__()
            self.dense = nn.Linear(in_features, out_units * 4)
            self.out_units = out_units

        def evidence(self, x):
            # softplus keeps the evidential parameters positive
            return F.softplus(x)

        def forward(self, x):
            out = self.dense(x)
            # log-prefix to indicate pre-softplus, unconstrained values
            mu, logv, logalpha, logbeta = torch.split(out, self.out_units, dim=-1)
            v = self.evidence(logv)
            alpha = self.evidence(logalpha) + 1  # enforce alpha > 1
            beta = self.evidence(logbeta)
            return mu, v, alpha, beta

    Let’s move onto some examples!

    Evidential Deep Learning Cubic Example

Here, we first follow the example detailed in the DER paper of estimating a cubic function, just like the example in the first section of this article. The neural network aims to model the simple cubic function y = x³ and is given limited, noisy training data in a window around x = 0.

    Cubic function with added noise in training dataset, which is limited to the interval [-4,4].

    In code, we define data gathering (optionally include other functions to approximate!):

    def get_data(problem_type="cubic"):
        if problem_type == "cubic":
            x_train = torch.linspace(-4, 4, 1000).unsqueeze(-1)
            # zero-mean Gaussian noise with standard deviation 3
            noise = torch.normal(torch.zeros_like(x_train), 3 * torch.ones_like(x_train))
            y_train = x_train**3 + noise
            x_test = torch.linspace(-7, 7, 1000).unsqueeze(-1)
            y_test = x_test**3
        else:
            raise NotImplementedError(f"{problem_type} is not supported")

        return x_train, y_train, x_test, y_test

    Next, let’s make the main training and inference loop:

    from torch.utils.data import DataLoader, TensorDataset
    from tqdm import tqdm

    def edl_model(problem_type="cubic"):
        torch.manual_seed(0)
        x_train, y_train, x_test, y_test = get_data(problem_type)

        model = nn.Sequential(
            nn.Linear(1, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            NormalInvGamma(64, 1),
        )

        optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
        dataloader = DataLoader(TensorDataset(x_train, y_train), batch_size=100, shuffle=True)

        for _ in tqdm(range(500)):
            for x, y in dataloader:
                pred = model(x)
                loss = evidential_regression(pred, y, lamb=3e-2)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        with torch.no_grad():
            pred = model(x_test)

        plot_results(pred, x_train, y_train, x_test, y_test, problem_type)

Now we define plot_results (abridged: the ratio figure is produced similarly):

    import matplotlib.pyplot as plt

    def to_numpy(tensor):
        return tensor.squeeze().detach().cpu().numpy()

    def plot_results(pred, x_train, y_train, x_test, y_test, problem_type="cubic"):
        mu, v, alpha, beta = (d.squeeze() for d in pred)
        x_test = x_test.squeeze()
        epistemic = torch.sqrt(beta / (v * (alpha - 1)))
        aleatoric = torch.sqrt(beta / (alpha - 1))
        total = torch.sqrt(epistemic**2 + aleatoric**2)
        ratio = epistemic / (epistemic + aleatoric + 1e-8)  # used for the ratio figure

        x_np = to_numpy(x_test)
        y_true_np = to_numpy(y_test)
        mu_np = to_numpy(mu)
        total_np = to_numpy(total)
        ratio_np = to_numpy(ratio)

        x_train_np = to_numpy(x_train)
        y_train_np = to_numpy(y_train)

        fig, ax = plt.subplots()
        ax.scatter(x_train_np, y_train_np, s=2, alpha=0.2, label="Train")
        ax.plot(x_np, y_true_np, "k--", label="True")
        ax.plot(x_np, mu_np, color="#008000", label="Prediction")

        std_level = 2
        ax.fill_between(
            x_np,
            mu_np - std_level * total_np,
            mu_np + std_level * total_np,
            alpha=0.5,
            facecolor="#008000",
            label="Total",
        )

        xlim, ylim = get_plot_limits(problem_type)
        if xlim is not None and ylim is not None:
            ax.set_xlim(*xlim)
            ax.set_ylim(*ylim)
        ax.legend(loc="lower right", fontsize=7)
        ax.set_title(f"DER for {problem_type}", fontsize=10, fontweight='normal', pad=6)
        fig.savefig(f"examples/{problem_type}.png")

Here, we're simply computing epistemic and aleatoric uncertainty according to the formulas mentioned earlier, converting everything to NumPy arrays, and then shading two standard deviations around the predicted mean to visualize the uncertainty. Here is what we get:

    Uncertainty overlay on plot. Figure made by author.

It works, amazing! As expected, the uncertainty is high in the regions with no training data. What about the split between epistemic and aleatoric uncertainty? In this case, we would expect low aleatoric uncertainty in the central region. In practice, though, EDL is known to sometimes provide unreliable absolute uncertainty estimates: high aleatoric uncertainty usually leads to high epistemic uncertainty, so they cannot be fully disentangled (see this paper for more details). Instead, we can look at the ratio between epistemic and total uncertainty in different regions.

    Figure displaying ratio between epistemic and total uncertainty at different points on the graph. Figure made by author.

As expected, the ratio is lowest in the center, where we have data, and highest outside the interval [-4, 4] containing our training datapoints.

    Conclusions

    The cubic example is a relatively simple function, but deep evidential regression (and more generally, evidential deep learning) can be applied to a range of tasks. The authors explore it for depth estimation and it has since been used for tasks like video temporal grounding and radiotherapy dose prediction.

However, I believe it is not a silver bullet, at least in its current state. In addition to the previously mentioned challenges with interpreting "absolute" uncertainty and with disentanglement, it can be sensitive to the λ_{reg} regularization hyperparameter. In my testing, uncertainty quality decayed rapidly even after slight adjustments such as λ_{reg} = 0.01 to λ_{reg} = 0.03. The constant "battle" between the regularization and NLL terms means the optimization landscape is more complex than that of a typical neural network. I have personally tried it for image reconstruction in this repository, with mixed results. Regardless, it is still a really interesting and fast alternative to traditional approaches such as Bayesian UQ.

What are some important takeaways from this article? Evidential deep learning is a new and emerging framework for uncertainty quantification, focused on training networks to output the parameters of higher-order distributions. Deep evidential regression in particular learns the parameters of the Normal Inverse Gamma as a prior over the unknown parameters of a Normal distribution. Advantages include large training- and inference-time speedups relative to approaches like deep ensembles and variational inference, along with a compact representation. Challenges include a difficult optimization landscape and incomplete uncertainty disentanglement. This is a field to keep watching for sure!

Thanks for reading!
