A discussion on calibrating a groundwater model, including high parameterization with regulation and Pareto analysis.

The underlying model was originally published in "The cost of uniqueness in groundwater model calibration" (Moore & Doherty 2006, *Advances in Water Resources* 29(4):605-623, DOI: 10.1016/j.advwatres.2005.07.003).

**Nothing of the content below is new.** It has been published a number of times and nothing needs to be added. I am just explaining it using more colorful pictures and avoiding any mathematics, to give the reader a chance to understand the underlying problem before diving into the literature.


This post shows how a simple groundwater model is calibrated, including a number of implications of this process that any groundwater modeller should be aware of.

We are using a very illustrative example from Moore & Doherty, where the transmissivity of a steady-state two-dimensional flow model is calibrated against standing water levels in a number of observation wells.

Moore & Doherty have provided a reference model with a transmissivity distribution that has been generated by a geostatistical model. This will act as our fake reality that we try to reproduce inversely using the hydraulic head values calculated by the reference model.


**Figure 1: Transmissivity of the Reference Model**

... and of course, it provides (by definition) a perfect scatter plot (RMS=0.0):


**Figure 2: Scatter plot of the reference model**

The calibration process starts with the entirely uncalibrated model. Here, we only have a rough guess of what the transmissivity may be (1 m²/d), which has been assigned to the model.


**Figure 3: Uncalibrated model (T = 1 m²/d)**

Running the model in this state, we of course get a very ugly-looking scatter plot:


**Figure 4: Scatter plot of uncalibrated model**

Obviously, the modelled water levels are too high, showing that the transmissivity value is quite far from the real values.

The obvious next step is to calibrate the model.

The parameter we are focussing on here is the transmissivity of the model, previously set to 1 m²/d throughout the model domain.

The model comprises a Dirichlet (fixed-head) boundary condition at the southern (bottom) model boundary, and a Neumann (fixed-flux) boundary condition at the top. Because the total groundwater flow is fixed, the transmissivity determines the head gradient in the model. Because the modelled head gradient is too large, the transmissivity must be too small.

For a start, we simply change the transmissivity uniformly throughout the model (no zonation) until we find the value that gives us the best fit to the observations (minimum RMS). Because the flux in the model is fixed, we should be able to find a unique solution here (this would not be the case with two Dirichlet-type boundary conditions, but that's another story).
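To see why a fixed flux makes the head gradient scale inversely with transmissivity, here is a minimal one-dimensional Darcy sketch. All numbers are illustrative placeholders, not values from the actual model:

```python
# Depth-integrated Darcy's law: q = -T * dh/dx.
# With a fixed flux q (Neumann boundary), the head gradient scales as 1/T.

q = 0.5            # fixed inflow per unit width [m^2/d] (made up)
L = 1000.0         # flow path length [m] (made up)
h_boundary = 10.0  # fixed head at the Dirichlet boundary [m] (made up)

for T in (1.0, 10.5):  # initial guess vs. calibrated value [m^2/d]
    gradient = q / T                          # head gradient [m/m]
    h_upstream = h_boundary + gradient * L    # head at the upstream end
    print(f"T = {T:5.1f} m²/d -> gradient = {gradient:.4f}, "
          f"upstream head = {h_upstream:.2f} m")
```

With T an order of magnitude too small, the gradient (and hence the upstream heads) comes out an order of magnitude too large, which is exactly the pattern seen in the uncalibrated scatter plot.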

This can be done manually by trial and error; I am using PEST, however, because I am a very lazy person. It turns out the best homogeneous transmissivity value is **10.5 m²/d**, about an order of magnitude higher than our initial guess (OK... in reality I would now check my model conceptualization and maybe have a serious word with the guy who evaluated the pump test...).


**Figure 5: Transmissivity in the calibrated model (T = 10.5 m²/d)**


**Figure 6: Scatter plot of the calibrated model**

The best objective function was found at **RMS 1.88 m**, which in many cases would not be considered great (i.e. it will not find approval from my customer). But this is the best we can do, since we have to meet 12 observations with just one parameter.
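For reference, the RMS objective used throughout is simply the root-mean-square of the residuals between modelled and observed heads; a minimal sketch with made-up numbers:

```python
import math

def rms(observed, modelled):
    """Root-mean-square of the model-to-measurement residuals."""
    residuals = [m - o for o, m in zip(observed, modelled)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

# twelve hypothetical head observations and their modelled equivalents [m]
obs = [4.1, 4.8, 5.2, 5.9, 6.3, 6.8, 7.1, 7.6, 8.0, 8.4, 8.9, 9.3]
mod = [4.5, 4.6, 5.5, 6.2, 6.1, 7.0, 7.4, 7.3, 8.3, 8.6, 8.7, 9.6]

print(f"RMS = {rms(obs, mod):.2f} m")
```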

So in the next steps, we need to introduce more degrees of freedom.

The only way to get more degrees of freedom with only one parameter type (here: transmissivity) is to introduce heterogeneity into the system.

In practice there are mainly two different ways of doing this:

- when doing a manual calibration, one would usually use **parameter zones**, where each zone can have a different parameter value
- in automatic calibration, generated parameter fields can be used. A very common method is the **pilot-point** method as implemented in PEST, where the transmissivity is smoothly interpolated between a number of fixed points.

I am using the latter here; however, the choice between these methods does not really matter. What is important is that the added heterogeneity increases the number of degrees of freedom of the model.
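As a rough sketch of the pilot-point mechanics: PEST typically uses kriging for the spatial interpolation; plain inverse-distance weighting is substituted here just to show the idea, and all point locations and values are made up:

```python
def idw_log_T(x, y, points, power=2.0):
    """Interpolate log10(T) at (x, y) from a set of pilot points.

    points: list of (px, py, log10_T) tuples. Inverse-distance
    weighting stands in for the kriging PEST normally uses.
    """
    num = den = 0.0
    for px, py, log_t in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        if d2 == 0.0:
            return log_t  # exactly on a pilot point
        w = 1.0 / d2 ** (power / 2.0)
        num += w * log_t
        den += w
    return num / den

# three hypothetical pilot points carrying log10-transmissivity values
pts = [(0.0, 0.0, 1.0), (100.0, 0.0, 1.2), (0.0, 100.0, 0.8)]
print(idw_log_T(50.0, 50.0, pts))  # smooth value between the anchors
```

The calibration then adjusts only the values at the pilot points; the field in between follows automatically, which is what makes the parameterization "smooth" by construction.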

So far we had more observations to satisfy (12) than parameters available to do it (1), making our inverse problem **overdetermined**. To further reduce the model-to-measurement misfit, we will increase the number of parameters.

And for the fun of it, we will completely overdo this! Let's see what happens if we turn the problem into its opposite and create an equally hopelessly **underdetermined** problem: instead of a parameter-to-observation ratio of roughly 1:10, let's make it 10:1.

Pilot points make this very easy: we simply generate a rectangular grid (8 columns, 13 rows) of pilot points over the model, giving us 104 locations at which to adjust the transmissivity during the calibration process.

With this abundance of degrees of freedom, I have so many screws to turn that I can improve the calibration almost indefinitely (and when I say I, I definitely mean PEST...), eventually getting what at first sight looks like the perfect calibration again:
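A linearized analogue illustrates why 104 parameters can meet 12 observations exactly: an underdetermined linear system with full row rank always has solutions that reproduce the data perfectly. The "Jacobian" and "observations" below are random placeholders, not values from the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_par = 12, 104                 # same ratio as in the text

J = rng.normal(size=(n_obs, n_par))    # stand-in sensitivity (Jacobian) matrix
h_obs = rng.normal(size=n_obs)         # stand-in observations

# minimum-norm solution of the underdetermined system J @ p = h_obs
p, *_ = np.linalg.lstsq(J, h_obs, rcond=None)

print(np.max(np.abs(J @ p - h_obs)))   # residual is numerically zero
```

A perfect fit drops out no matter what the "observations" are, which is exactly why a zero misfit alone says nothing about whether the parameters resemble reality.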


**Figure 7: A perfectly calibrated model** (... but see the catch below)

OK, so judging by the scatter plot alone it looks perfect, but you know there is no such thing as a free lunch. Looking at the transmissivity, you can see all kinds of bumps and bulls-eyes:


**Figure 8: Transmissivities of reality and of the unregularized model**

This looks bad enough already, but if you compare it with the reference model (remember, we have the luxury of knowing the "actual" transmissivity distribution), you can easily see that the solution has nothing to do with reality:

If we used this model to predict anything that is not very close to the observation wells, it would probably go terribly wrong. It seems that calibration metrics that only look at the model-to-measurement misfit (aka RMS) are not a good indicator of the predictive power of a model.

What happened here is called **overfitting** - an effect best described by science-non-fiction author Nate Silver as *the most important scientific problem you do not know about*. But this is worth another blog post on another day...

The above calibration only looked at the RMS, and therefore at the model-to-measurement misfit alone. What was missing, however, was information on the plausibility of the parameter values.

If you look at the figure above, the transmissivity distribution rises and falls in the vicinity of the observation wells. This, however, is highly implausible (the river that deposited our sediments this way thousands of years ago certainly had no clue where we would drill our bores!)

So what we do now is define a second metric that describes plausibility. There are simple and hard ways of doing this. The hard ways are a hot topic in research (involving geostatistics and a lot of other nerdy technology) and beyond the scope of this post. Here are the easy methods:

- **preferred values**: since we usually know, at least roughly, which parameter values are realistic in which region of the model, we can create a second RMS norm from the deviation between the calibrated and the preferred parameter values.

- **preferred homogeneity**: even though we do not expect the transmissivity to be perfectly constant throughout the model, we also do not expect it to jump wildly from one pilot point to the next. By creating an RMS based on the differences between neighboring pilot points (which should ideally be zero), we can put a preference on a smooth parameter field, adding as little heterogeneity as possible to achieve the calibration target.
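Both metrics can be written as RMS norms over the pilot-point values; a minimal sketch in which the pilot-point values and the adjacency list are made up:

```python
import math

def rms_norm(values):
    """Root-mean-square of a list of deviations."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def preferred_value_objective(log_T, preferred):
    """Deviation of each pilot-point log10(T) from a preferred value."""
    return rms_norm([t - preferred for t in log_T])

def homogeneity_objective(log_T, neighbours):
    """Differences between neighbouring pilot points (ideally zero)."""
    return rms_norm([log_T[i] - log_T[j] for i, j in neighbours])

# four hypothetical pilot points on a 2x2 grid, log10(T) values
log_T = [1.0, 1.1, 0.9, 1.05]
neighbours = [(0, 1), (0, 2), (1, 3), (2, 3)]  # grid adjacency

print(preferred_value_objective(log_T, preferred=1.0))
print(homogeneity_objective(log_T, neighbours))
```

Either (or a weighted combination of both) then serves as the regularization objective that PEST trades off against the measurement objective.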

Starting PEST again with this second objective, it is quite illustrative to observe the development of the two objectives over a couple of iterations:


**Figure 9: Finding a balance between measurement and regularization objectives**

The calibration starts with the model that we previously calibrated with a completely homogeneous transmissivity field. At this point (iteration zero), the regularization objective is zero, meaning it is perfectly fulfilled (the transmissivity field is completely homogeneous), while the measurement objective corresponds to the RMS we obtained from that calibration.

During iterations 1, 2 and 3, the measurement objective decreases (the calibration quality increases, the RMS gets smaller). This comes at the cost of a rising regularization objective while PEST carefully introduces additional heterogeneity.

Iterations 4 and 5 comprise some fine-tuning; not much happens, and PEST terminates.

The resulting model is somewhere in between the two extremes we have seen before:

Field measurements are represented a lot better than in the homogeneous model calibration (the RMS dropped from 1.88 m to 0.12 m).


**Figure 10: Scatter plot after regularized calibration**

Looking at the transmissivity, we can see that it is still relatively homogeneous; the bulls-eyes that appeared in the overfitted, unregularized model have not emerged.

Consequently, we could probably achieve an even better calibration without compromising plausibility.


**Figure 11: Transmissivity field after regularized calibration**

In the previous chapter we saw how PEST balances the two objectives against each other (which is actually a bit more than finding a balance; see constrained optimization if you are interested).

After a couple of iterations, the balance is found; at that point you can improve one of the two objectives only at the cost of sacrificing the other. This state is called **Pareto-optimal**.

Pareto-optimality is great: in that state we (or PEST) have done everything possible to get both objectives right. In this case, we achieved an RMS of 0.42 m for the measurement objective and 0.36 for the regularization objective. Unfortunately, this is not a unique solution, as we can still lower the measurement objective by accepting a higher regularization objective.

This means it is possible to specify a particular target for one objective and then find the corresponding Pareto-optimal solution in which the other objective is minimized. There is therefore a mutual relationship between the two objectives, known as the Pareto front:


**Figure 12: The Pareto front calculated by PEST**

The Pareto front will look different for every model, but it typically shows a hyperbolic shape: if you try to squeeze out everything you can to minimize one objective, you drive the other into very poor numbers.
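Conceptually, the front is just the set of non-dominated runs. Here is a small sketch of extracting it from a list of (measurement RMS, regularization RMS) pairs, with hypothetical values:

```python
def pareto_front(points):
    """Keep the non-dominated (measurement, regularization) pairs.

    A point is dominated if some other point is no worse in both
    objectives and strictly better in at least one (both minimized).
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)

# hypothetical (measurement RMS, regularization RMS) pairs from many runs
runs = [(1.9, 0.0), (0.8, 0.2), (0.42, 0.36), (0.9, 0.5), (0.1, 1.5)]
print(pareto_front(runs))  # (0.9, 0.5) is dominated and drops out
```

Every model on the front is a defensible calibration; the choice between them is exactly the subjective trade-off discussed below.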

This means we always have to make a trade-off between the two objectives, and that trade-off will always be subjective.

Unfortunately, there is no objective solution to this problem. You should therefore think about how much you trust your measurements and your model's ability to reproduce them if the parameters were perfectly known (use a low measurement RMS in this case), or whether you have more trust in your prior knowledge and your understanding of the preferred parameter values (use a higher measurement RMS in this case).

Ideally, we should now accept that there is more than one possibility to calibrate the model (as modellers in other fields, like meteorology, do).

If, however, you are in the common but inconvenient situation of having a client who insists on a definite answer, you have to be pragmatic and at least use the Pareto analysis to inform your best pick: choose a handful of models that lie neither on the upper branch of the Pareto front (underfitted models) nor on the right branch (overfitting danger zone). Alternatively, you may want to pick the model with the RMS your client will just accept, and check whether its parameter distribution is somewhat plausible.

We do this for four realizations of the model (as marked in figure 12):


**Figure 13: Model realizations 3 (left) and 10 (right)**


**Figure 14: Model realizations 36 (left) and 55 (right)**

As we move across the front, we can see how the parameter field changes from relatively smooth (upper, underfitted branch of the front) to bumpy (right, overfitted branch).

Which of the distributions is still acceptable and which should be considered overfitted is up to the modeller's expertise.

Using a synthetic model that allows us to compare our calibration against a known reality, we made several attempts to calibrate the model with different strategies:

- overdetermined calibration, with a single model parameter
- underdetermined calibration, with many more parameters than observations to fulfill

Both ended up with unsatisfactory results: with the model being under- and overfitted, respectively, we cannot expect these model realizations to be good predictors of future behavior.

We then constrained the parameters using regularization, giving us the ability to balance our two objectives (physical plausibility and calibration quality). This method yields a swarm of models, each of which is Pareto-optimal.

Plotting the Pareto front already gives us some indication of which models are likely to be over- or underfitted, and which realizations should therefore be rejected as implausible.