Maximum Likelihood Estimators

With Maximum Likelihood Estimator (MLE) problems, a parametric distribution is named, one or more of its parameter values are unknown, a set of data from this distribution is given, and you’re asked to find the parameter values that maximize the probability of observing the data.  You do this by brute force.  For each data point, you express the probability of observing it, which is simply the density function of whatever named distribution was given in the problem.  The probability of observing the whole set of data is simply the product of the probabilities of the individual data points.  If there are lots of data points, you end up with a massive function.

For example, suppose N is exponentially distributed, with density function given by:

f_N(n;\lambda) = \lambda e^{-\lambda n}

in which \lambda is the unknown parameter.  You are also given a set of observations a, b, c and you must find the value of \lambda which maximizes the probability of seeing these particular values.  The likelihood of seeing these values is given by L(\lambda):

L(\lambda) = \left( \lambda e^{-\lambda a}\right)\left( \lambda e^{-\lambda b}\right)\left( \lambda e^{-\lambda c}\right) = \lambda^3 e^{-\lambda (a+b+c)}

To find the \lambda which maximizes this function, you take its derivative, set it equal to 0, and solve for \lambda.  If you’re looking a few steps ahead, you should realize that doing the maximization by brute force will be difficult.  To get around this, we can take the logarithm of the likelihood function and then find the maximum.  Because the logarithm is a strictly increasing function, this does not change the resulting value of \lambda.  The log likelihood l(\lambda) is then:

l(\lambda) = 3\ln(\lambda) -\lambda(a+b+c)

The derivative with respect to \lambda is

\displaystyle \frac{d}{d\lambda}l(\lambda) = \frac{3}{\lambda} - (a+b+c)

Equating to 0 and solving, we have

\lambda = \displaystyle \frac{3}{a+b+c}
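
As a sanity check, here is a minimal numerical sketch, assuming three made-up observations (a, b, c) = (1.2, 0.7, 2.1).  It maximizes the log likelihood over a grid and compares the result against the closed-form answer above:

    import numpy as np

    # Made-up observations (a, b, c) for illustration only
    data = np.array([1.2, 0.7, 2.1])

    def log_likelihood(lam):
        # l(lambda) = n*ln(lambda) - lambda * sum(data)
        return len(data) * np.log(lam) - lam * data.sum()

    grid = np.linspace(0.01, 5, 100_000)              # brute-force search over lambda
    numeric_mle = grid[np.argmax(log_likelihood(grid))]
    closed_form = len(data) / data.sum()              # lambda = 3 / (a + b + c)

    print(numeric_mle, closed_form)                   # both approximately 0.75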

Filed under Empirical Models

Kaplan-Meier and Nelson-Aalen Estimators

When the empirical data is incomplete (truncated or censored), raw empirical estimators will not produce good results.  In this scenario, there are two techniques available to determine the distribution function based on the data.  The Kaplan-Meier product limit estimator can be used to generate a survival distribution function.  The Nelson-Aalen estimator can be used to generate a cumulative hazard rate function.  The Kaplan-Meier estimator is given by:

S_n(t) = \displaystyle \prod_{i=1}^{j-1} \left(1-\frac{s_i}{r_i}\right), \quad y_{j-1} \le t < y_j

where r_i is the risk set at time y_i and s_i is the number of observations at time y_i of the event whose distribution you are trying to estimate.  For example, if the event you are interested in is death, then r_1 could be the number of life insurance policyholders immediately prior to the first death, and s_1 would be the number of observed deaths at the time of the first death (you can have simultaneous deaths).  The key to dealing with problems that use this estimator is to understand how r_i changes with respect to censoring or truncation.  If a person withdraws from the life insurance policy, this decreases r_i, but since a withdrawal is not a death, it does not contribute to s_i.  If new members join at time y_i, they are not part of the risk set until time y_{i+1}.

If the data is censored past a certain point, you can assume an exponential distribution for the censored portion.  Suppose observations past c are censored.  If you know the value of S_n(c) you can solve for \theta using S_n(c) = e^{-c/\theta}.

The Nelson-Aalen cumulative hazard rate estimator is given by:

\tilde H(t) = \displaystyle \sum_{i=1}^{j-1} \frac{s_i}{r_i}, \quad y_{j-1} \le t < y_j

You can use this to get a survival function:

\tilde S(t) = e^{-\tilde H(t)}
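
A minimal sketch of both estimators, assuming made-up event times y_i with deaths s_i and risk sets r_i:

    import math

    # Hypothetical event times y_i, deaths s_i, and risk sets r_i just before each y_i
    ys = [1.0, 3.0, 5.0]
    ss = [2, 1, 1]
    rs = [10, 7, 4]

    def km_survival(t):
        """Kaplan-Meier product-limit estimate S_n(t)."""
        prod = 1.0
        for y, s, r in zip(ys, ss, rs):
            if y <= t:
                prod *= 1 - s / r
        return prod

    def nelson_aalen(t):
        """Nelson-Aalen cumulative hazard estimate."""
        return sum(s / r for y, s, r in zip(ys, ss, rs) if y <= t)

    t = 4.0
    print(km_survival(t))               # (1 - 2/10)(1 - 1/7), about 0.686
    print(math.exp(-nelson_aalen(t)))   # e^{-(2/10 + 1/7)}, about 0.710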

Filed under Empirical Models

Statistics for Empirical Models

An empirical model makes an assumption about the type of distribution underlying a set of data.  The data is then used to estimate the parameters in the assumed underlying distribution.  For example, if it is assumed that the underlying distribution is a uniform distribution, the data may be used to estimate the maximum value that the random variable can have.  If the assumed underlying distribution is Poisson, the data may be used to estimate the rate parameter \lambda.  There are three measures of quality for these estimators– bias, consistency, and mean squared error.


  1. Let \hat\theta be an estimator and \theta be the true parameter being estimated.  The bias is defined by: bias_{\hat\theta}(\theta) = E\left[\hat\theta|\theta\right] - \theta.  An estimator is said to be unbiased if bias_{\hat\theta}(\theta) = 0.
  2. An estimator is said to be consistent if for any \delta > 0, \lim_{n \rightarrow \infty} \Pr\left(|\hat\theta_n - \theta| < \delta\right) = 1, where \hat\theta_n is the estimator based on n observations.
  3. The Mean Squared Error is MSE_{\hat\theta}(\theta) = E\left[(\hat\theta - \theta)^2|\theta\right].  A low mean squared error is desirable.

The following relationship is useful:

MSE_{\hat\theta}(\theta) = Var(\hat\theta) + [bias_{\hat\theta}(\theta)]^2
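
As a quick numerical check of this relationship, here is a minimal simulation sketch using the "sample maximum as an estimate of the uniform upper bound" estimator mentioned above; all parameter values are assumed purely for illustration:

    import numpy as np

    # Simulation check of MSE = Var + bias^2 for a biased estimator:
    # theta_hat = max(X_1, ..., X_n) with X_i ~ Uniform(0, theta) (illustrative values)
    rng = np.random.default_rng(0)
    theta, n, trials = 10.0, 5, 200_000

    estimates = rng.uniform(0, theta, size=(trials, n)).max(axis=1)

    bias = estimates.mean() - theta
    mse_direct = np.mean((estimates - theta) ** 2)
    mse_decomposed = estimates.var() + bias ** 2

    print(bias)                          # close to the theoretical bias of -theta/(n+1)
    print(mse_direct, mse_decomposed)    # the two MSE calculations agree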

To construct a confidence interval for an estimator, you simply add and subtract z\hat\sigma to the estimate, where z is the appropriate z-value from the standard normal table and \hat\sigma is the square root of the estimated variance of the estimator.  So a 95% confidence interval for an estimator \hat\theta would be

\left(\hat\theta - 1.96\sqrt{\widehat{Var}(\hat\theta)},\ \hat\theta + 1.96\sqrt{\widehat{Var}(\hat\theta)}\right)

You can usually express the variance of the estimator in terms of the true parameter.  In that case, you can replace a purely empirical variance estimate with the variance of the estimator implied by your assumed underlying distribution, evaluated at the estimated parameter.

When the variance of an estimator can be expressed in terms of the true parameter that you are trying to estimate, a more accurate confidence interval can be derived from

\displaystyle -z \le \frac{\hat\theta - \theta}{\sqrt{Var(\hat\theta)}} \le z

where Var(\hat\theta) is written as a function of \theta.  Solving the inequality for \theta gives the interval.
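
As an illustration, here is a minimal sketch of this approach, assuming a Poisson rate estimated from n independent exposures so that Var(\hat\lambda) = \lambda/n; the numbers are hypothetical and match the claims example discussed below:

    import math

    # "Solve for the parameter" interval, assuming Var(lambda_hat) = lambda/n
    def poisson_rate_ci(lam_hat, n, z=1.96):
        # |lam_hat - lam| <= z*sqrt(lam/n) becomes the quadratic
        # lam^2 - (2*lam_hat + z^2/n)*lam + lam_hat^2 <= 0; its roots are the endpoints
        b = 2 * lam_hat + z ** 2 / n
        disc = b ** 2 - 4 * lam_hat ** 2
        return (b - math.sqrt(disc)) / 2, (b + math.sqrt(disc)) / 2

    print(poisson_rate_ci(0.05, 100))   # compare with 0.05 +/- 1.96*sqrt(0.05/100)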

An important point to keep in mind is that the results you observe in the data are outcomes of random variables.  Since estimators are calculated from these observations, they are themselves functions of random variables.  If you make some assumptions about the underlying distribution, you can calculate the statistics of an estimator, such as its variance.  For example, in a population of 100, you observe 5 claims.  You assume the underlying distribution for the number of claims per person is Poisson and you estimate its parameter to be \hat\lambda = \frac {5}{100}.  This means your equation for the estimator is:

\displaystyle \hat\lambda = \frac{1}{100}\sum_{i=1}^{100} X_i

where X_i represents the true underlying distribution for each person.  Thus the variance of your estimator is given by

\begin{array}{rll} var(\hat\lambda) &=& \displaystyle var\left(\frac{1}{100}\sum_{i=1}^{100} X_i\right) \\ \\ &=& \displaystyle \frac{1}{100^2} var\left(\sum_{i=1}^{100} X_i\right) \\ \\ &=& \displaystyle \frac{1}{100} var(X) \end{array}

Since you’ve assumed the underlying distribution to be Poisson, Var(X) = \lambda, and so

Var(\hat\lambda) = \displaystyle \frac{\lambda}{100}

Substituting the estimate \hat\lambda for \lambda gives the estimated variance \widehat{Var}(\hat\lambda) = \hat\lambda/100.  This is how you can arrive at equations for the statistics of estimators based on an assumed underlying distribution.  The key is to realize that there is a link between the estimator and the true underlying distribution.
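
A short simulation sketch of this link, using the assumed value \lambda = 0.05 from the example above:

    import numpy as np

    # lambda_hat is the average of 100 independent Poisson(lambda) counts,
    # so its variance should be close to lambda/100
    rng = np.random.default_rng(1)
    lam, n, trials = 0.05, 100, 50_000

    counts = rng.poisson(lam, size=(trials, n))
    lam_hats = counts.mean(axis=1)

    print(lam_hats.var())   # close to lambda/100
    print(lam / n)          # 0.0005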

Filed under Empirical Models

Ruin Theory

You walk into a casino with a certain amount of surplus money.  Every hour you spend in the casino, there is a certain probability of winning or losing some given amount of money.  A ruin theory question might ask: what is the probability that you go bankrupt within x hours?  The bankruptcy state is an absorbing state, so once you enter that state, the probability of leaving it is 0.  Solving these types of problems usually requires calculating an exhaustive list of ruin probabilities by the convolution method (a small sketch follows the notation below).  But first, familiarity with the notation should be developed.


  1. \psi(u) — Starting with surplus u, this is the probability of eventual ruin, with time t defined on \mathbb{R} (the surplus is observed continuously).
  2. \bar{\psi}(u) — Same as above except t \in \mathbb{N}_+, i.e. the surplus is observed only at positive integer times.
  3. \psi(u,t) — Probability of ruin from time 0 to time t, with t \in \mathbb{R}.
  4. \bar\psi(u,t) — Same as above except t \in \mathbb{N}_+.

The analogous survival probabilities follow the same conventions except they are denoted by \phi.  The following relationships are useful:
  1. \psi(u) \geq \psi(u,t) \geq \bar\psi(u,t)
  2. \psi(u) \geq \bar\psi(u) \geq \bar\psi(u,t)
  3. \psi(u) \geq \psi(u+k) for k \geq 0
  4. \phi(u) \leq \phi(u,t) \leq \bar\phi(u,t)
  5. \phi(u) \leq \bar\phi(u) \leq \bar\phi(u,t)
  6. \phi(u) \leq \phi(u+k) for k \geq 0
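
To make the convolution idea concrete, here is a minimal discrete-time sketch of \bar\psi(u,t), assuming the surplus moves up or down by 1 each hour with probability p and 1-p respectively, and that ruin means the surplus reaches 0.  These rules and values are illustrative assumptions, not a general ruin model:

    from functools import lru_cache

    p = 0.45   # assumed probability of winning 1 per hour

    @lru_cache(maxsize=None)
    def ruin_prob(u, t):
        if u <= 0:
            return 1.0          # absorbing bankruptcy state
        if t == 0:
            return 0.0          # survived to the end of the horizon
        return p * ruin_prob(u + 1, t - 1) + (1 - p) * ruin_prob(u - 1, t - 1)

    print(ruin_prob(3, 10))     # ruin within 10 hours starting with surplus 3
    print(ruin_prob(3, 20))     # a longer horizon can only increase the ruin probability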

Filed under Probability, Ruin Theory

Recursive Discrete Aggregate Loss

You have 2 six-sided dice.  You roll the first die to determine the number of times you will roll the second die.  The sum of the results of each roll of the second die is the amount of aggregate loss.  Since the frequency and severity are discrete, for any aggregate loss amount, the number of combinations of rolls that produce that amount is clearly countable and finite.  For example, an aggregate loss of 3 can be arrived at by rolling a 1 on the first die and then rolling a 3; or rolling a 2, then rolling the combinations (1,2),(2,1); or rolling a 3 and then rolling (1,1,1) on the second die.  The probability of experiencing an aggregate loss of 3 is:

\begin{array}{rll} \Pr(S=3) \displaystyle &=& \frac{1}{6^2} + \frac{2}{6^3} + \frac{1}{6^4} \\ \\ \displaystyle &=& \frac{49}{6^4} \end{array}
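
This result can be verified with a brute-force enumeration of the dice outcomes:

    from itertools import product
    from fractions import Fraction

    # Brute-force check of Pr(S = 3) = 49/6^4 for the two-dice example above
    total = Fraction(0)
    for n in range(1, 7):                            # result of the first die
        for rolls in product(range(1, 7), repeat=n): # all outcomes of the second die
            if sum(rolls) == 3:
                total += Fraction(1, 6) * Fraction(1, 6) ** n

    print(total)   # 49/1296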

This method of calculating the probability is called the convolution method.  Now imagine the frequency and severity distributions are discrete but have infinite support.  To calculate \Pr(S=10) would require calculating the probability for many possible combinations.  If the frequency distribution is from the (a,b,0) class and the severity is discrete, there is a recursive formula that can calculate this.  It is given by:

g_k = \displaystyle \frac{1}{1-af_0}\sum_{j=1}^k \left(a+\frac{bj}{k}\right)f_jg_{k-j}

where k is a positive integer, g_k = \Pr(S=k)=f_S(k), f_n = \Pr(X=n), p_n = \Pr(N=n), and a and b are the (a,b,0) parameters satisfying p_n/p_{n-1} = a + b/n.  This is called the recursive method.  To start the recursion, you need g_0 = \Pr(S=0), which is the probability generating function of N evaluated at f_0 (for a Poisson frequency, g_0 = e^{-\lambda(1-f_0)}).  You can then find any g_k.  If a problem asks for F_S(3), this is equal to g_0+g_1+g_2+g_3.  You iterate through the recursion to find each g_k, then add them together.
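
A minimal sketch of the recursion, assuming a Poisson frequency with \lambda = 2 and a made-up severity supported on {1, 2, 3} (both are illustrative assumptions):

    import math

    lam = 2.0
    a, b = 0.0, lam                         # (a, b, 0) parameters for a Poisson frequency
    f = {0: 0.0, 1: 0.5, 2: 0.3, 3: 0.2}    # f_n = Pr(X = n)

    def sev(n):
        return f.get(n, 0.0)

    g = [math.exp(-lam * (1 - sev(0)))]     # g_0 = e^{-lambda(1 - f_0)} for Poisson

    for k in range(1, 11):
        total = sum((a + b * j / k) * sev(j) * g[k - j] for j in range(1, k + 1))
        g.append(total / (1 - a * sev(0)))

    print(sum(g[:4]))                       # F_S(3) = g_0 + g_1 + g_2 + g_3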

Filed under Aggregate Models, Frequency Models, Probability, Severity Models

Approximating Aggregate Losses

An aggregate loss S is the sum of all losses in a certain period of time.  There are an unknown number N of losses that may occur and each loss is an unknown amount X.  N is called the frequency random variable and X is called the severity.  This situation can be modeled using a compound distribution of N and X.  The model is specified by:

\displaystyle S = \sum_{n=1}^N X_n

where N is the random variable for frequency and the X_n’s are IID random variables for severity.  This type of structure is called a collective risk model.

An alternative way to model aggregate loss is to model each risk using a different distribution appropriate to that risk.  For example, in a portfolio of risks, one may be modeled using a pareto distribution and another may be modeled with an exponential distribution.  The expected aggregate loss would be the sum of the individual expected losses.  This is called an individual risk model and is given by: 

\displaystyle S = \sum_{i=1}^n X_i

where n is the number of individual risks in the portfolio and the X_i’s are random variables for the individual losses.  The X_i’s are NOT IID, and n is known.

Both of these models are tested in the exam; however, the individual risk model is usually tested in combination with the collective risk model.  An example of a problem structure that combines the two is given below.

Example 1: Your company sells car insurance policies.  The in-force policies are categorized into high-risk and low-risk groups.  In the high-risk group, the number of claims in a year is Poisson with a mean of 30.  The number of claims for the low-risk group is Poisson with a mean of 10.  The amount of each claim is exponentially distributed with mean \theta = 200.
Analysis: Being able to see the structure of the problem is a very important first step in being able to solve it.  In this situation, you would model the aggregate loss as an individual risk model.  There are 2 individual risks– high and low risk.  For each group, you would model the aggregate loss using a collective risk model.  For the high-risk group, the frequency is Poisson with mean 30 and the severity is exponential with mean 200.  For the low-risk group, the frequency is Poisson with mean 10 and the severity is exponential with the same mean.

For these problems, you will need to know how to:

  1. Find the expected aggregate loss.
  2. Find the variance of aggregate loss.
  3. Approximate the probability that the aggregate loss will be above or below a certain amount using a normal distribution.  
    Example: what is the probability that aggregate losses are below $5,000?
  4. Determine how many risks would need to be in a portfolio for the probability of aggregate loss to reach a given level of certainty for a given amount.
    Example: how many policies should you underwrite so that the aggregate loss is less than the expected aggregate loss with a 95% degree of certainty? 
  5. Determine how long your risk exposure should be for the probability of aggregate loss to reach a given level of certainty for a given amount.

Problems that require you to determine probabilities for the aggregate loss will usually state that you should use a normal approximation.  This will require the calculation of the expected aggregate loss and the variance of the aggregate loss.

Expected aggregate loss for a collective risk model is given by:

E[S] = E[N]E[X]

For the individual risk model, it is

\displaystyle E[S] = \sum_{i=1}^n E[X_i]

Variance under the collective risk model follows from the conditional variance (law of total variance) formula, conditioning on the frequency N:

Var(S) = E[Var(S|N)] + Var(E[S|N])

When frequency and severity are independent, the following shortcut is valid and is called a compound variance:

Var(S) = E[N]Var(X) + Var(N)E[X]^2

Variance under the individual risk model is additive (assuming the individual risks are independent):

\displaystyle Var(S) = \sum_{i=1}^n Var(X_i)

Example 2: Continuing from Example 1, calculate the mean and variance of the aggregate loss.  Assume frequency and severity are independent.
Answer: This is done by

  1. Calculating the expected aggregate loss and variance in the high-risk group.
  2. Calculating the expected aggregate loss and variance in the low-risk group.
  3. Adding the expected values from both groups to get the total expected aggregate loss.
  4. Adding the variances from both groups to get the total variance.

I will use subscript H and L to denote high and low risk groups respectively.

E[S_H] = E[N_H]E[X_H] = 30\times 200 = 6,000

\begin{array}{rll} Var(S_H) &=& E[N_H]Var(X_H) + Var(N_H)E[X_H]^2 \\ &=& 30 \times 40,000 + 30 \times 200^2 \\ &=& 2,400,000 \end{array}

E[S_L] = E[N_L]E[X_L] = 10 \times 200 = 2,000

\begin{array}{rll} Var(S_L) &=& 10 \times 40,000 + 10 \times 200^2 \\ &=& 800,000 \end{array}

Add expected values to get

E[S] = 6,000 + 2,000 = 8,000

Add variances to get

Var(S) = 2,400,000 + 800,000 = 3,200,000

Once the mean and variance of the aggregate loss have been calculated, you can use them to approximate probabilities for aggregate losses using a normal distribution.

Example 3: Continuing from Example 2, use a normal approximation for aggregate loss to calculate the probability that losses exceed $12,000.
Answer:  To solve this, you will need to calculate a z value for the normal distribution using the expected value and variance found in Example 2.

\begin{array}{rll} \Pr(S > 12,000) &=& 1- \Pr(S< 12,000) \\ \\ &=& \displaystyle 1-\Phi\left(\frac{12,000 - 8,000}{\sqrt{3,200,000}}\right) \\ \\ &=& 1 - \Phi(2.24) \\ \\ &=& 0.0125 \end{array}
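
A short sketch reproducing Examples 2 and 3 in code; it uses Python's standard-library normal CDF rather than the printed table, so the final digit differs slightly from the rounded table value:

    from math import sqrt
    from statistics import NormalDist

    # Compound means and variances from Example 2
    mean_S = 30 * 200 + 10 * 200                                        # 8,000
    var_S = (30 * 40_000 + 30 * 200**2) + (10 * 40_000 + 10 * 200**2)   # 3,200,000

    z = (12_000 - mean_S) / sqrt(var_S)
    print(1 - NormalDist().cdf(z))   # about 0.0127 (0.0125 when z is rounded to 2.24)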

Suppose in the above examples the severity X is discrete, for example Poisson.  Under this specification the aggregate loss S is also discrete, and we need to add 0.5 to 12,000 in the calculation of \Pr(S > 12,000), so we would instead calculate \Pr(S > 12,000.5).  This is called a continuity correction and applies whenever a discrete aggregate loss is approximated by a normal distribution.  If we were interested in \Pr(S < 12,000), we would subtract 0.5 instead.  The correction has a greater effect when the range of possible values is small.

Another type of problem I’ve encountered in the samples is constructed as follows:

Example 4: You drive a 1992 Honda Prelude Si piece-of-crap-mobile (no, that’s my old car and you are driving it because I sold it to you to buy my Mercedes).  The failure rate per year is poisson with mean 2.  The average cost of repair for each instance of breakdown is $500 with a standard deviation of $1000.  How many years do you have to continue driving the car so that the probability of the total maintenance cost exceeding 120% of the expected total maintenance cost is less than 10%?  (Assume the car is so crappy that it cannot deteriorate any further so the failure rates and average repair costs remain constant every year.)
Answer:  For one year,

E[S_1] = 1,000

\begin{array}{rll} Var(S_1) &=& 2 \times 1,000^2 + 2 \times 500^2 \\ &=& 2,500,000 \end{array}

For n years, we have

E[S] = 1,000n

Var(S) = 2,500,000n

According to the problem, we are interested in S such that \Pr(S > 1,200n) = 0.1.  Under normal approximation, this implies

\begin{array}{rll} \Pr(S>1,200n) &=& 1-\Pr(S<1,200n) \\ \\ &=& \displaystyle 1- \Phi\left(\frac{1,200n - 1,000n}{\sqrt{2,500,000n}}\right) \end{array}

Which implies

\displaystyle \Phi\left(\frac{200n}{\sqrt{2,500,000n}}\right) = 0.9

The probability 0.9 corresponds to a z value of 1.28.  This implies

\displaystyle \frac{200n}{\sqrt{2,500,000n}} = 1.28

Solving for n, we have n = 102.4, so you would have to keep driving the car for about 103 years.  LOL!
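
A quick check of the algebra, squaring both sides of the z equation above and solving for n:

    # 200n / sqrt(2,500,000n) = 1.28  =>  40,000 n^2 = 1.28^2 * 2,500,000 n
    z = 1.28
    n = z ** 2 * 2_500_000 / 200 ** 2
    print(n)   # 102.4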

Filed under Aggregate Models, Frequency Models, Probability, Severity Models

Frequency with Respect to Exposure and Coverage Modifications

In a portfolio of risks, there are two types of modifications which can influence the frequency distribution of payments.

  1. Exposure Modification (not in syllabus) — increasing or decreasing the number of risks or time periods of coverage in the portfolio
  2. Coverage Modification — applying limits, deductibles, or adjusting for inflation in each individual risk

If there is an exposure modification, you would adjust the frequency distribution by scaling the appropriate parameter to reflect the change in exposure.  The following list provides the appropriate parameter to adjust for each distribution:

  1. Poisson:  \lambda
  2. Negative Binomial:  r
    (Geometric is a Negative Binomial with r = 1)
  3. Binomial:  m
    (Only valid if the new value remains an integer)

Example 1:  You own a portfolio of 10 risks.  You model the frequency of claims with a negative binomial having parameters r = 2 and \beta = 0.5.  The number of risks in your portfolio increases to 15.  What are the parameters for the new distribution?
Answer:  The frequency distribution now has parameters r = 3 and \beta = 0.5.  Note that since the mean and variance are r\beta and r\beta(1+\beta) respectively, the new mean and variance are both multiplied by 1.5.

Coverage modifications shift, censor, or scale the individual risks, usually in the presence of a deductible or claim limit, and they change the conditions that trigger a payment.  For example, adding a deductible d is considered a coverage modification, and it changes the condition for payment because any losses below the deductible do not qualify for payment.  If the risk is represented by random variable X, then adding a deductible changes the random variable to (X-d)_+.  Scaling in the presence of a deductible or claim limit also affects the frequency distribution.  The following parameters are affected by coverage modifications:

  1. Poisson:  \lambda
  2. Negative Binomial:  \beta
  3. Binomial:  q
  4. Geometric:  \beta

These parameters are scaled by the probability that a payment occurs.

Example 2:  The frequency of loss is modeled as a Poisson distribution with parameter \lambda = 5.  A deductible is imposed so that only 80% of losses result in payments.  What is the new distribution?
Answer:  It is Poisson with \lambda = 0.8(5) = 4.

Example 3:  The frequency of payment N is modeled as a negative binomial with parameters r = 3 and \beta = 0.5.  Losses X are Pareto distributed with parameters \alpha = 2 and \theta = 100.  The deductible is changed from d = 10 to d = 15.  What are the new parameters of the frequency distribution?
Answer:  Firstly, N is the frequency of payment, so it already reflects the current deductible of 10.  If you wanted the distribution of N with no deductible, you would divide \beta by \Pr(X>10).  To find the distribution of N with a deductible of 15, you then multiply that by \Pr(X>15).  To summarize:

\begin{array}{rll} \beta_{new} &=& \displaystyle \beta_{old}\times\frac{\Pr(X>15)}{\Pr(X>10)} \\ \\ &=& \displaystyle 0.5\times \frac{0.75614367}{0.82644628} \\ \\ &=& 0.45746692 \end{array}
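
A small sketch of the Example 3 calculation:

    # Rescale beta when the deductible moves from 10 to 15,
    # with Pareto(alpha = 2, theta = 100) ground-up losses
    def pareto_survival(x, alpha=2.0, theta=100.0):
        return (theta / (theta + x)) ** alpha

    beta_old = 0.5
    beta_new = beta_old * pareto_survival(15) / pareto_survival(10)
    print(beta_new)   # about 0.4575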

Filed under Coverage Modifications, Deductibles, Frequency Models, Limits