---
title: "Data Science and Predictive Analytics (UMich HS650)"
subtitle: "__Function Optimization__

"
author: "### SOCR/MIDAS (Ivo Dinov)

"
date: "`r format(Sys.time(), '%B %Y')`"
tags: [DSPA, SOCR, MIDAS, Big Data, Predictive Analytics]
output:
html_document:
theme: spacelab
highlight: tango
includes:
before_body: SOCR_header.html
after_body: SOCR_footer_tracker.html
toc: true
number_sections: true
toc_depth: 2
toc_float:
collapsed: false
smooth_scroll: true
---
Most data-driven scientific inference, qualitative, quantitative and visual analytics involve formulating, understanding the behavior of, and optimizing objective (cost) functions. Presenting the mathematical foundations of representation and interrogation of diverse spectra of objective functions provides mechanisms for obtaining effective solutions to complex big data problems. (Multivariate) *function optimization* (minimization or maximization) is the process of searching for variables $x_1, x_2, x_3, \ldots, x_n$ that either minimize or maximize the multivariate cost (objective) function $f(x_1, x_2, x_3, \ldots, x_n)$. In this Chapter, we will specifically discuss (1) constrained and unconstrained optimization, (2) Lagrange multipliers, (3) linear, quadratic and (general) non-linear programming, and (4) data denoising.
# Free (unconstrained) optimization
Function optimization without restrictions for the domain of the cost function, $\Omega \ni \{x_i\}$. The [extreme value theorem](http://en.wikipedia.org/wiki/Extreme_value_theorem) suggest that a solution to the free optimization processes, $\min_{x_1, x_2, x_3, \ldots, x_n}{f(x_1, x_2, x_3, \ldots, x_n)}$ or $\max_{x_1, x_2, x_3, \ldots, x_n}{f(x_1, x_2, x_3, \ldots, x_n)}$, may be obtained by a gradient vector descent method. This means that we can minimize/maximize the objective function by finding solutions to $\nabla f = \{\frac{d f}{d x_1}, \frac{d f}{d x_2}, \ldots, \frac{d f}{d x_1}\}=\{0, 0, \ldots, 0\}$. Solutions to this equation, $x_1, \ldots, x_n$, will present candidate (local) minima and maxima.
In general, identifying critical points using the gradient or tangent plane, where the partial derivatives are trivial, may not be sufficient to determine the *extrema* (*minima* or *maxima*) of multivariate objective functions. Some critical points may represent inflection points, or local extrema that are far from the global *optimum* of the objective function. The eigen-values of the [Hessian matrix](https://en.wikipedia.org/wiki/Second_partial_derivative_test), which includes the second order partial derivatives, at the critical points provide clues to pinpoint extrema. For instance, invertible Hessian matrices that (1) are positive definite (i.e., all eigenvalues are positive), yield a local minimum at the critical point, (2) are negative definite (all eigenvalues are negative) at the critical point suggests that the objective function has a local maximum, and (3) have both positive and negative eigenvalues yield a saddle point for the objective function at the critical point where the gradient is trivial.
There are two complementary strategies to avoid being trapped in *local* extrema. First, we can run many iterations with different initial vectors. At each iteration, the objective function may achieve a (local) maximum/minimum/saddle point. Finally, we select the overall minimal (or maximal) value from all iterations. Another adaptive strategy involves either adjusting the step sizes or accepting solutions *in probability*, e.g., [simulated annealing](https://en.wikipedia.org/wiki/Simulated_annealing) is one example of an adaptive optimization.
## Example 1: minimizing a univariate function (inverse-CDF)
The cumulative distribution function (CDF) of a real-valued random process $X$, also known as the distribution function of $X$, represents the probability that the random variable $X$ does not exceed a certain level. Mathematically speaking, the CDF of $X$ is $F_X(x) = P(X\leq x)$. Recall the [Chapter 1](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/01_Foundation.html) discussions of Uniform, Normal, Cauchy, Binomial, Poisson and other discrete and continuous distributions. Also explore the [dynamic representations of density and distribution functions included in the Probability Distributome Calculators](http://www.distributome.org/V3/calc/index.html).
For each $p\in [0,1]$, the *inverse distribution function*, also called *quantile function* (e.g., `qnorm`), yields the critical value ($x$) at which the probability of the random variable is less than or equal to the given probability ($p$). When the CDF $F_X$ is continuous and strictly increasing, the value of the inverse CDF at $p$, $F^{-1}(p)=x$, is the unique real number $x$ such that $F(x)=p$.
Below, we will plot the probability density function (PDF) and the CDF for *Normal* distribution in R.
```{r}
par(mfrow=c(1,2), mar=c(3,4,4,2))
# generate 1000 StdNormal variables
x <- rnorm(1000)
# Estimate the StdNormal Density function
density_x <- density(x)
plot(density_x, col="darkgreen",xlab="", ylab="Density", type="l",lwd=2, cex=2, main="PDF of (Simulated) Standard Normal", cex.axis=0.8)
# could also do
# xseq<-seq(-4, 4, 0.01); density<-dnorm(xseq, 0, 1); plot (density, main="Density")
# Compute the CDF
xseq<-seq(-4, 4, 0.01); cumulative<-pnorm(xseq, 0, 1)
# plot (cumulative, main="CDF")
plot(xseq, cumulative, col="darkred", xlab="", ylab="Cumulative Probability", type="l",lwd=2, cex=2, main="CDF of (Simulated) Standard Normal", cex.axis=.8)
```
Suppose we are interested in computing, or estimating, the inverse-CDF from first principles. Specifically, to invert the CDF, we need to be able to solve the following equation (representing our objective function):
$$CDF(x)-p=0.$$
The `uniroot` and `stats::nlm` R functions do *non-linear minimization* of a function $f$ using a [Newton-Raphson algorithm](https://en.wikipedia.org/wiki/Newton%27s_method).
```{r}
set.seed(1234)
x <- rnorm(1000, 100, 20)
pdf_x <- density(x)
# Interpolate the density, the values returned when input x values are outside [min(x): max(x)] should be trivial
f_x <- approxfun(pdf_x$x, pdf_x$y, yleft=0, yright=0)
# Manual computation of the cdf by numeric integration
cdf_x <- function(x){
v <- integrate(f_x, -Inf, x)$value
if (v<0) v <- 0
else if(v>1) v <- 1
return(v)
}
# Finding the roots of the inverse-CDF function by hand (CDF(x)-p=0)
invcdf <- function(p){
uniroot(function(x){cdf_x(x) - p}, range(x))$root
# alternatively, can use
# nlm(function(x){cdf_x(x) - p}, 0)$estimate
# minimum - the value of the estimated minimum of f.
# estimate - the point at which the minimum value of f is obtained.
}
invcdf(0.5)
# We can validate that the inverse-CDF is correctly computed: F^{-1}(F(x))==x
cdf_x(invcdf(0.8))
```
The ability to compute exactly, or at least estimate, the inverse-CDF function is important for many reasons. For instance, generating random observations from a specified probability distribution (e.g., normal, exponential, or gamma distribution) is an important task in many scientific studies. One approach for such random number generation from a specified distribution evaluates the inverse CDF at random uniform $u\sim U(0,1)$ values. Recall that in [Chapter 15](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/15_SpecializedML_FormatsOptimization.html) we showed an example of generating random uniform samples using atmospheric noise. The key step is ability to quickly, efficiently and reliably estimate the inverse CDF function, which we just showed one example of.
Let's see why inverting the CDF using random uniform data works. Consider the cumulative distribution function (CDF) of a probability distribution from which we are interested in sampling. If the CDF has a closed form analytical expression and is invertible, then we generate a random sample from that distribution by evaluating the inverse CDF at $u$, where $u \sim U(0,1)$. This is possible since a continuous CDF, $F$, is a one-to-one mapping of the domain of the CDF (range of $X$) into the interval $[0,1]$. Therefore, if $U$ is a uniform random variable on $[0,1]$, then $X = F^{-1}(U)$ has the distribution $F$. Suppose $U \sim Uniform[0,1]$, then $P(F^{-1}(U) \leq x)= P(U \leq F(x))$, by applying $F$ to both sides of this inequality, since $F$ is monotonic. Thus, $P(F^{-1}(U) \leq x)= F(x)$, since $P(U \leq u) = u$ for uniform random variables.
## Example 2: minimizing a bivariate function
Let's look at the function $f(x_1, x_2) = (x_1-3)^2 + (x_2+4)^2$. We define the function in R and utilize the `optim` function to obtain the extrema points in the support of the objective function and/or the extrema values at these critical points.
```{r}
require("stats")
f <- function(x) { (x[1] - 3)^2 + (x[2] +4)^2 }
initial_x <- c(0, -1)
x_optimal <- optim(initial_x, f, method="CG") # performs minimization
x_min <- x_optimal$par
# x_min contains the domain values where the (local) minimum is attained
x_min # critical point/vector
x_optimal$value # extrema value of the objective function
```
`optim` allows the use of 6 candidate optimization strategies:
* **Nelder-Mead**: robust but relatively slow, works reasonably well for non-differentiable functions.
* **BFGS**: quasi-Newton method (also known as a variable metric algorithm), uses function values and gradients to build up a picture of the surface to be optimized.
* **CG**: conjugate gradients method, fragile, but successful in larger optimization problems because it's unnecessary to save large matrix.
* **L-BFGS-B**: allows box constraints.
* **SANN**: a variant of simulated annealing, belonging to the class of stochastic global optimization methods.
* **Brent**: for one-dimensional problems only, useful in cases where `optim()` is used inside other functions where only method can be specified.
## Example 3: using simulated annealing to find the maximum of an oscillatory function
Consider the function $f(x) = 10 \sin(0.3 x)\times \sin(1.3 x^2) - 0.00002 x^4 + 0.3 x+35$. Maximizing $f()$ is equivalent to minimizing $-f()$. Let's plot this oscillatory function, find and report its critical points and extremum values.
The function `optim` returns two important results:
* `par`: the best set of domain parameters found to optimize the function
* `value`: the extreme values of the function corresponding to par.
```{r}
funct_osc <- function (x) { -(10*sin(0.3*x)*sin(1.3*x^2) - 0.00002*x^4 + 0.3*x+35) }
plot(funct_osc, -50, 50, n = 1000, main = "optim() minimizing an oscillatory function")
abline(v=17, lty=3, lwd=4, col="red")
res <- optim(16, funct_osc, method = "SANN", control = list(maxit = 20000, temp = 20, parscale = 20))
res$par
res$value
```
# Constrained Optimization
## Equality constraints
When there are support restrictions, dependencies or other associations between the domain variables $x_1, x_2, \ldots, x_n$, constrained optimization needs to be applied.
For example, we can have $k$ equations specifying these restrictions, which may specify certain model characteristics:
$$\begin{cases}
g_1(x_1, x_2, \ldots, x_n) = 0\\
\ldots \\
g_k(x_1, x_2, \ldots, x_n) = 0
\end{cases} .$$
Note that the right hand sides of these equations may always be assumed to be trivial ($0$), otherwise we can just move the non-trivial parts within the constraint functions $g_i$. [Linear Programming](http://en.wikipedia.org/wiki/Linear_programming), [Quadratic Programming](http://en.wikipedia.org/wiki/Quadratic_programming), and [Lagrange multipliers](http://en.wikipedia.org/wiki/Lagrange_multiplier) may be used to solve such equality-constrained optimization problems.
## Lagrange Multipliers
We can merge the equality constraints within the objective function ($f \longrightarrow f^*$). Lagrange multipliers represents a typical solution strategy that turns the *constrained* optimization problem ($\min_{x} {f(x)}$ subject to $g_i(x_1, x_2, \ldots, x_n)$, $1\leq i\leq k$), into an *unconstrained* optimization problem:
$$f^*(x_1, x_2, \ldots, x_n; \lambda_1, \lambda_2, \ldots, \lambda_k) = f(x_1, x_2, \ldots, x_n) + \sum_{i=1}^k {\lambda_i g_i(x_1, x_2, \ldots, x_n)}.$$
Then, we can apply traditional unconstrained optimization schemas, e.g., extreme value theorem, to minimize the unconstraint problem:
$$f^*(x_1, x_2, \ldots, x_n; \lambda_1, \lambda_2, \ldots, \lambda_k) = f(x_1, x_2, \ldots, x_n) + \lambda_1 g_1(x_1, x_2, \ldots, x_n) + \ldots + \lambda_k g_k(x_1, x_2, \ldots, x_n).$$
This represent an unconstrained optimization problem using [Lagrange multipliers](https://en.wikipedia.org/wiki/Lagrange_multiplier).
The solution of the constrained problem is also a solution to:
$$\nabla f^* = \left [\frac{d f}{d x_1}, \frac{d f}{d x_2}, \ldots, \frac{d f}{d x_n}; \frac{d f}{d \lambda_1}, \frac{d f}{d \lambda_2}, \ldots, \frac{d f}{d \lambda_k} \right ] = [0, 0, \ldots, 0].$$
## Inequality constrained optimization
There are no general solutions for arbitrary inequality constraints; however, partial solutions do exist when some restrictions on the form of constraints are present.
When both the constraints and the objective function are `linear functions` of the domain variables, then the problem can be solved by Linear Programming.
### Linear Programming (LP)
LP works when the objective function is a linear function. The constraint functions are also linear combination of the same variables.
Consider the following elementary (`minimization`) example:
$$ \min_{x_1, x_2, x_3} (-3x_1 -4x_2 -3x_3)$$
subject to:
$$ \left\{
\begin{array}{rl}
6x_1 + 2x_2 + 4x_3 & \leq 150 \\
x_1 + x_2 + 6x_3 & \geq 0 \\
4x_1 + 5x_2 + 4x_3 & = 40
\end{array} \right. $$
The exact solution is $x_1 = 0, x_2 = 8, x_3 = 0$, and can be computed using the package `lpSolveAPI` to set up the constraint problem and the generic `solve()` method to find its solutions.
```{r}
# install.packages("lpSolveAPI")
library(lpSolveAPI)
lps.model <- make.lp(0, 3) # define 3 variables
# add the constraints as a matrix of the linear coefficients, relations and RHS
add.constraint(lps.model, c(6, 2, 4), "<=", 150)
add.constraint(lps.model, c(1, 1, 6), ">=", 0)
add.constraint(lps.model, c(4, 5, 4), "=" , 40)
# set objective function (default: find minimum)
set.objfn(lps.model, c(-3, -4, -3))
# you can save the model to a file
# write.lp(lps.model, 'c:/Users/LPmodel.lp', type='lp')
# these commands define the constraint linear model
# /* Objective function */
# min: -3 x1 -4 x2 -3 x3;
#
# /* Constraints */
# +6 x1 +2 x2 +4 x3 <= 150;
# + x1 + x2 +6 x3 >= 0;
# +4 x1 +5 x2 +4 x3 = 40;
#
# writing it in the text file named 'LPmodel.lp'
solve(lps.model)
# Retrieve the values of the variables from a solved linear program model
get.variables(lps.model) # check against the exact solution x_1 = 0, x_2 = 8, x_3 = 0
get.objective(lps.model) # get optimal (min) value
```
In lower dimensional problems, we can also plot the constraints to graphically demonstrate the corresponding support restriction. For instance, here is an example of a simpler 2D constraint and its Venn diagrammatic representation.
$$ \left\{
\begin{array}{rl}
x_1 & \leq \frac{150 -2x_2}{6} \\
x_1 & \geq -x_2\\
\end{array} \right. . $$
```{r}
library(ggplot2)
ggplot(data.frame(x = c(-100, 0)), aes(x = x)) +
stat_function(fun=function(x) {(150-2*x)/6}, aes(color="Function 1")) +
stat_function(fun=function(x) { -x }, aes(color = "Function 2")) +
theme_bw() +
scale_color_discrete(name = "Function") +
geom_polygon(
data = data.frame(x = c(-100, -100, 0, 0, Inf), y = c(0,350/6, 150/6, 0, 0)),
aes(x = x, y = y, fill = "Constraint 1"),
inherit.aes = FALSE, alpha = 0.5) +
geom_polygon(
data = data.frame(x = c(-100, -100, 0, Inf), y = c(0, 100, 0, 0)),
aes(x = x, y = y, fill = "Constraint 2"),
inherit.aes = FALSE, alpha = 0.3) +
scale_fill_discrete(name = "Constraint Set") +
scale_y_continuous(limits = c(0, 100))
```
Here is another example of `maximization` of a trivariate cost function, $f(x_1, x_2, x_3)=3x_1+ 4x_2 -x_3$, subject to:
$$ \left\{
\begin{array}{rl}
-x_1 + 2x_2 + 3x_3 & \leq 16 \\
3x_1 - x_2 - 6x_3 & \geq 0 \\
x_1 - x_2 & \leq 2
\end{array} \right. . $$
```{r}
lps.model2 <- make.lp(0, 3)
add.constraint(lps.model2, c(-1, 2, 3), "<=", 16)
add.constraint(lps.model2, c(3, -1, -6), ">=", 0)
add.constraint(lps.model2, c(1, -1, 0), "<=", 2)
set.objfn(lps.model2, c(3, 4, -1), indices = c(1, 2, 3))
lp.control(lps.model2, sense='max') # changes to max: 3 x1 + 4 x2 - x3
solve(lps.model2) # 0 suggests that this solution convergences
get.variables(lps.model2) # get point of maximum
get.objective(lps.model2) # get optimal (max) value
```
In 3D we can utilize the `rgl::surface3d()` method to display the constraints. This output is suppressed, as it can only be interpreted via the pop-out 3D rendering window.
```{r eval=F, echo=T, warning=F, message=FALSE}
library("rgl")
n <- 100
x <- y <- seq(-500, 500, length = n)
region <- expand.grid(x = x, y = y)
z1 <- matrix(((150 -2*region$x -4*region$y)/6), n, n)
z2 <- matrix(-region$x + 6*region$y, n, n)
z3 <- matrix(40 -5*region$x - 4*region$y, n, n)
surface3d(x, y, z1, back = 'line', front = 'line', col = 'red', lwd = 1.5, alpha = 0.4)
surface3d(x, y, z2, back = 'line', front = 'line', col = 'orange', lwd = 1.5, alpha = 0.4)
surface3d(x, y, z3, back = 'line', front = 'line', col = 'blue', lwd = 1.5, alpha = 0.4)
axes3d()
```
It is possible to restrict the domain type to contain only solutions that are:
* *integers*, which makes it an Integer Linear Programming (ILP),
* *binary/boolean* values (BLP), or
* *mixed* types, Mixed Integer Liner Programming (MILP).
Some examples are included below.
### Mixed Integer Linear Programming (MILP)
Let's demonstrate MILP with an example where the type of $x_1$ is unrestricted, $x_2$ is dichotomous (binary), and $x_3$ is restricted to be an integer.
```{r}
lps.model <- make.lp(0, 3)
add.constraint(lps.model, c(6, 2, 4), "<=", 150)
add.constraint(lps.model, c(1, 1, 6), ">=", 0)
add.constraint(lps.model, c(4, 5, 4), "=", 40)
set.objfn(lps.model, c(-3, -4, -3))
set.type(lps.model, 2, "binary")
set.type(lps.model, 3, "integer")
get.type(lps.model) # This is Mixed Integer Linear Programming (MILP)
set.bounds(lps.model, lower=-5, upper=5, columns=c(1))
# give names to columns and restrictions
dimnames(lps.model) <- list(c("R1", "R2", "R3"), c("x1", "x2", "x3"))
print(lps.model)
solve(lps.model)
get.objective(lps.model)
get.variables(lps.model)
get.constraints(lps.model)
```
The next example limits all three variable to be dichotomous (binary).
```{r}
lps.model <- make.lp(0, 3)
add.constraint(lps.model, c(1, 2, 4), "<=", 5)
add.constraint(lps.model, c(1, 1, 6), ">=", 2)
add.constraint(lps.model, c(1, 1, 1), "=", 2)
set.objfn(lps.model, c(2, 1, 2))
set.type(lps.model, 1, "binary")
set.type(lps.model, 2, "binary")
set.type(lps.model, 3, "binary")
print(lps.model)
solve(lps.model)
get.variables(lps.model)
```
## Quadratic Programming (QP)
QP can be used for second order (quadratic) objective functions, but the constraint functions are still linear combination of the domain variables.
A matrix formulation of the problem can be expressed as minimizing an objective function:
$$f(X) = \frac{1}{2} X^T D X - d^T X, $$
where $X$ is a vector $[x_1, x_2, \ldots, x_n]^T$, $D$ is the matrix of weights of each association pair, $x_i, x_j$, and $d$ are the weights for each individual feature, $x_i$. The $\frac{1}{2}$ coefficient ensures that the weights matrix $D$ is symmetric and each $x_i, x_j$ pair is not double-counted. This cost function is subject to the constraints:
$$A^T X\ \ [ = | \geq ]\ \ b, $$
where the first $k$ constrains may represent equalities ($=$) and the remaining ones are inequalities ($\geq$), and $b$ is the constraints right hand size (RHS) constant vector.
Here is an example of a QP objective function and its R optimization:
$$f(x_1, x_2, x_3) = 2 x_1^2 - x_1x_2 - 2 x_2^2 + x_2 x_3 + 2 x_3^2 - 5 x_2 + 3 x_3.$$
Subject to the following constraints:
$$\begin{array}{rcl}
-4x_1 + -3x_2 = -8 \\
2x_1 + x_2 = 2 \\
-2x_2 + x_3 \geq 0
\end{array} .$$
```{r}
library(quadprog)
Dmat <- matrix(c( 2, -1, 0,
-1, 2, -1,
0, -1, 2), 3, 3)
dvec <- c(0, -5, 3)
Amat <- matrix(c(-4, -3, 0,
2, 1, 0,
0, -2, 1), 3, 3)
bvec <- c(-8, 2, 0)
n.eqs <- 2 # the first two constraints are equalities
sol <- solve.QP(Dmat, dvec, Amat, bvec=bvec, meq=2)
sol$solution # get the (x1, x2, x3) point of minimum
sol$value # get the actual cost function minimum
```
The minimum value, $49$, of the QP solution is attained at $x_1=-1, x_2=4, x_3=8$.
When $D$ is a positive definitive matrix, i.e., $X^T D X \gt 0$, for all non-zero $X$, the QP problem may be solved in polynomial time. Otherwise, the QP problem is NP-hard. In general, even if $D$ has only one negative eigenvalue, the QP problem is still [NP-hard](https://en.wikipedia.org/wiki/NP-hardness).
The QP function `solve.QP()` expects a positive definitive matrix $D$.
# General Non-linear Optimization
The package `Rsolnp` provides a special function `solnp()`, which solves the general non-linear programming problem:
$$\min_x { f(x) }$$
subject to:
$$g(x)=0$$
$$l_h \leq h(x) \leq u_h$$
$$l_x \leq x \leq u_x, $$
where $f(x), g(x), h(x)$ are all smooth functions.
## Dual problem optimization
*Duality* in math really just means having two complementary ways to think about an optimization problem. The *primal problem* represents an optimization challenge in terms of the original decision variable $x$. The *dual problem*, also called *Lagrange dual*, searches for a lower bound of a minimization problem or an upper bound for a maximization problem. In general, the primal problem may be difficult to analyze, or solve directly, because it may include non-differentiable penalty terms, e.g., $l_1$ norms, recall LASSO/Ridge regularization in [Chapter 17](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/17_RegularizedLinModel_KnockoffFilter.html). Hence, we turn to the corresponding *Lagrange dual problem* where the solutions may be more amenable, especially for [convex functions](https://ocw.mit.edu/courses/sloan-school-of-management/15-094j-systems-optimization-models-and-computation-sma-5223-spring-2004/lecture-notes/duality_article.pdf), that satisfy the following inequality:
$$f(\lambda x +(1-\lambda)y)\leq \lambda f(x) + (1-\lambda)f(y).$$
### Motivation
Suppose we want to borrow money, $x$, from a bank, or lender, and $f(x)$ represents the borrowing cost to us. There are natural "design constraints" on money lending. For instance, there may be a cap in the interest rate, $h(x)\leq b$, or we can have many other constraints on the loan duration. There may be multiple lenders, including self-funding, that may "charge" us $f(x)$ for lending us $x$. *Lenders goals are to maximize profits*. Yet, they can't charge you more than the prime interest rate plus some premium based on your credit worthiness. Thus, for a given fixed $\lambda$, a lender may make us an offer to lend us $x$ aiming to minimize $$f(x)+\lambda \times h(x).$$
If this cost is not optimized, i.e., minimized, you may be able to get another loan $y$ at lower cost $f(y) 1e-6) {
previous <- y
y <- y - solve( grad_h(y), h(y) )
}
y[1:k]
}
x <- c(0, 0, 0)
# Define the objective cost function
fn4 <- function(x) # f(x, y, z) = 4y-2z + x^2+y^2
{
4*x[2] - 2*x[3] + x[1]^2+ x[2]^2
#sum(x^2)
}
# check the derivative of the objective function
grad(fn4, x)
# define the domain constraints of the problem
# constraint z1: 2x-y-z = 2
# constraint z2: x^2+y^2 +z = 1
eqn4 <- function(x){
z1=2*x[1] - x[2] - x[3] -2
z2=x[1]^2 + x[2]^2 + x[3] -1
return(c(z1, z2))
}
# Check the Jacobian of the constraints
jacobian(eqn4, x)
# Call the Lagrange-multipliers solver
# check one step of the algorithm
k <- length(x)
l <- length(eqn4(x));
h <- function(x) {
c(grad(fn4, x[1:k]) - t(-x[(1:2)]) %*% jacobian(eqn4, x[1:k]), -eqn4(x[1:k]))
}
jacobian(h, x)
# Lagrange-multipliers solver for f(x, y, z) subject to g(x, y, z)
lagrange_multipliers(x, fn4, eqn4)
```
Now, let's double-check the above manual optimization results against the automatic `solnp` solution minimizing
$$f(x, y, z) = 4y-2z + x^2+y^2$$
subject to:
$$\begin{array}{rcl}
2x-y-z=2\\
x^2+y^2=1.
\end{array}$$
```{r}
library(Rsolnp)
fn4 <- function(x) # f(x, y, z) = 4y-2z + x^2+y^2
{
4*x[2] - 2*x[3] + x[1]^2+ x[2]^2
}
# constraint z1: 2x-y-z = 2
# constraint z2: x^2+y^2 +z = 1
eqn4 <- function(x){
z1=2*x[1] - x[2] - x[3]
z2=x[1]^2 + x[2]^2 + x[3]
return(c(z1, z2))
}
constraints4 <- c(2, 1)
x0 <- c(1, 1, 1)
ctrl <- list(trace=0)
sol4 <- solnp(x0, fun = fn4, eqfun = eqn4, eqB = constraints4, control=ctrl)
sol4$values
```
The results of both (manual and automated) experiments identifying the optimal $(x, y, z)$ coordinates minimizing the objective function $f(x, y, z) = 4y-2z + x^2+y^2$ are in agreement.
* *Manual* optimization: `lagrange_multipliers(x, fn4, eqn4)`: 0.3416408 -1.0652476 -0.2514708.
* *Automated* optimization: `solnp(x0, fun = fn4, eqfun = eqn4, eqB = constraints4, control=ctrl)`: 0.3416408 -1.0652476 -0.2514709.
# Data Denoising
Suppose we are given $x_{noisy}$ with $n$ noise-corrupted data points. The noise may be additive ($x_{noisy}\sim x +\epsilon$) or not additive. We may be interested in denoising the signal and recovering a version of the original (unobserved) dataset $x$, potentially as a smother representation of the original (uncorrupted) process. Smoother signals suggest less (random) fluctuations between neighboring data points.
One objective function we can design to denoise the observed signal, $x_{noisy}$, may include a *fidelity term* and a *regularization term*, see the regularized linear modeling, [Chapter 17](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/17_RegularizedLinModel_KnockoffFilter.html).
`Total variation` denoising assumes that for each time point $t$, the observed noisy data $$\underbrace{x_{noisy}(t)}_{\text{observed signal}} \sim \underbrace{x(t)}_{\text{native signal}} + \underbrace{\epsilon(t)}_{\text{random noise}}.$$
To recover the *native signal*, $x(t)$, we can optimize ($\arg\min_x{f(x)}$) the following objective cost function:
$$f(x) = \underbrace{\frac{1}{2} \sum_{t=1}^{n-1} {\|y(t) - x_{noisy}(t)\| ^2}}_{\text{fidelity term}} + \underbrace{\lambda \sum_{t=2}^{n-1} | x(t) - x(t-1)|}_{\text{regularization term}}, $$
where $\lambda$ is the regularization smoothness parameter, $\lambda \rightarrow 0 \implies y \rightarrow x_{noisy}$. Minimizing $f(x)$ provides a minimum total-variation solution to the data denoising problem.
Below is an example illustrating total variation (TV) denoising using a simulated noisy dataset. We start by generating an oscillatory noisy signal. Then, we compute several smoothed versions of the noisy data, plot the initial and smoothed signals, define and optimize the TV denoising objective function, which is a mixture of a fidelity term and a regularization term.
```{r warning=FALSE}
n <- 1000
x <- rep(0, n)
xs <- seq(0, 8, len=n) # seq(from = 1, to = 1, length)
noise_level = 0.3 # sigma of the noise, try varying this noise-level
# here is where we add the zero-mean noise
set.seed(1234)
x_noisy <- function (x) {
sin(x)^2/(1.5+cos(x)) + rnorm(length(x), 0, noise_level)
}
# initialize the manual denoised signal
x_denoisedManu <- rep(0, n)
df <- as.data.frame(cbind(xs, x_noisy(xs)))
# loess fit a polynomial surface determined by numerical predictors,
# using local fitting
poly_model1 <- loess(x_noisy(xs) ~ xs, span=0.1, data=df) # tight model
poly_model2 <- loess(x_noisy(xs) ~ xs, span=0.9, data=df) # smoother model
# To see some of numerical results of hte model-fitting:
# View(as.data.frame(cbind(xs, x_noisy, predict(poly_model1))))
plot(xs, x_noisy(xs), type='l')
lines(xs, poly_model1$fitted, col="red", lwd=2)
lines(xs, poly_model2$fitted, col="blue", lwd=3)
```
Next, let's initiate the parameters, define the objective function and optimize it, i.e., estimate the parameters that minimize the cost function as a mixture of fidelity and regularization terms.
```{r}
# initialization of parameters
betas_0 <- c(0.3, 0.3, 0.5, 1)
betas <- betas_0
# Denoised model
x_denoised <- function(x, betas) {
if (length(betas) != 4) {
print(paste0("Error!!! length(betas)=", length(betas), " != 4!!! Exiting ..."))
break();
}
# print(paste0(" .... betas = ", betas, "...\n"))
# original noise function definition: sin(x)^2/(1.5+cos(x))
return((betas[1]*sin(betas[2]*x)^2)/(betas[3]+cos(x)))
}
library(Rsolnp)
fidelity <- function(x, y) {
sqrt((1/length(x)) * sum((y - x)^2))
}
regularizer <- function(betas) {
reg <- 0
for (i in 1:(length(betas-1))) {
reg <- reg + abs(betas[i])
}
return(reg)
}
# Objective Function
objective_func <- function(betas) {
# f(x) = 1/2 * \sum_{t=1}^{n-1} {|y(t) - x_{noisy}(t)\|^2}} + \lambda * \sum_{t=2}^{n-1} | x(t) - x(t-1)|
fid <- fidelity(x_noisy(xs), x_denoised(xs, betas))
reg <- abs(betas[4])*regularizer(betas)
error <- fid + reg
# uncomment to track the iterative optimization state
# print(paste0(".... Fidelity =", fid, " ... Regularizer = ", reg, " ... TotalError=", error))
# print(paste0("....betas=(",betas[1],",",betas[2],",",betas[3],",",betas[4],")"))
return(error)
}
# inequality constraint forcing the regularization parameter lambda=beta[4]>0
inequalConstr <- function(betas){
betas[4]
}
inequalLowerBound <- 0; inequalUpperBound <- 100
# should we list the value of the objective function and the parameters at every iteration (default trace=1; trace=0 means no interim reports)
# constraint problem
# ctrl <- list(trace=0, tol=1e-5) ## could specify: outer.iter=5, inner.iter=9)
set.seed(121)
sol_lambda <- solnp(betas_0, fun = objective_func, ineqfun = inequalConstr, ineqLB = inequalLowerBound, ineqUB = inequalUpperBound, control=ctrl)
# unconstraint optimization
# ctrl <- list(trace=1, tol=1e-5) ## could specify: outer.iter=5, inner.iter=9)
# sol_lambda <- solnp(betas_0, fun = denoising_func, control=ctrl)
# suppress the report of the the functional values (too many)
# sol_lambda$values
# reprot the optimal parameter estimates (betas)
sol_lambda$pars
# Reconstruct the manually-denoised signal using the optimal betas
betas <- sol_lambda$pars
x_denoisedManu <- x_denoised(xs, betas)
print(paste0("Final Denoised Model:", betas[1], "*sin(", betas[2], "*x)^2/(", betas[3], "+cos(x)))"))
plot(x_denoisedManu)
```
Finally, we can validate our manual denoising protocol against the automated TV denoising using the R package [tvd](https://cran.r-project.org/web/packages/tvd).
```{r}
# install.packages("tvd")
library("tvd")
lambda_0 <- 0.5
x_denoisedTVD <- tvd1d(x_noisy(xs), lambda_0, method = "Condat")
# lambda_o is the total variation penalty coefficient
# method is a string indicating the algorithm to use for denoising.
# Default method is "Condat"
# plot(xs, x_denoisedTVD, type='l')
plot(xs, x_noisy(xs), type='l')
lines(xs, poly_model1$fitted, col="red", lwd=2)
lines(xs, poly_model2$fitted, col="blue", lwd=3)
lines(xs, x_denoisedManu, col="pink", lwd=4)
lines(xs, x_denoisedTVD, col="green", lwd=5)
# add a legend
legend("bottom", c("x_noisy", "poly_model1$fitted", "poly_model2$fitted", "x_denoisedManu", "x_denoisedTVD"), col=c("black", "red", "blue", "pink", "green"), lty=c(1,1, 1,1), cex=0.7, lwd= c(1,2,3,4,5), title="TV Denoising")
```
# References
* [CRAN Optimization & Math Programming Site](http://cran.r-project.org/web/views/Optimization.html) provides details about a broad range of R optimization functions.
* [Vincent Zoonekynd's Optimization Blog](http://zoonek.free.fr/blosxom/R/2012-06-01_Optimization.html).