#' --- #' title: "Data Science and Predictive Analytics (UMich HS650)" #' subtitle: "

## Function Optimization

" #' author: "

### SOCR/MIDAS (Ivo Dinov)

" #' date: "r format(Sys.time(), '%B %Y')" #' tags: [DSPA, SOCR, MIDAS, Big Data, Predictive Analytics] #' output: #' html_document: #' theme: spacelab #' highlight: tango #' includes: #' before_body: SOCR_header.html #' toc: true #' number_sections: true #' toc_depth: 2 #' toc_float: #' collapsed: false #' smooth_scroll: true #' --- #' #' Most data-driven scientific inference, qualitative, quantitative and visual analytics involve formulating, understanding the behavior of, and optimizing objective (cost) functions. Presenting the mathematical foundations of representation and interrogation of diverse spectra of objective functions provides mechanisms for obtaining effective solutions to complex big data problems. (Multivariate) *function optimization* (minimization or maximization) is the process of searching for variables $x_1, x_2, x_3, \ldots, x_n$ that either minimize or maximize the multivariate cost (objective) function $f(x_1, x_2, x_3, \ldots, x_n)$. In this Chapter, we will specifically discuss (1) constrained and unconstrained optimization, (2) Lagrange multipliers, (3) linear, quadratic and (general) non-linear programming, and (4) data denoising. #' #' # Free (unconstrained) optimization #' #' Unconstrained function optimization refers to searching for extrema without restrictions for the domain of the cost function, $\Omega \ni \{x_i\}$. The [extreme value theorem](http://en.wikipedia.org/wiki/Extreme_value_theorem) suggest that a solution to the free optimization processes, $\min_{x_1, x_2, x_3, \ldots, x_n}{f(x_1, x_2, x_3, \ldots, x_n)}$ or $\max_{x_1, x_2, x_3, \ldots, x_n}{f(x_1, x_2, x_3, \ldots, x_n)}$, may be obtained by a gradient vector descent method. This means that we can minimize/maximize the objective function by finding solutions to $\nabla f = \{\frac{d f}{d x_1}, \frac{d f}{d x_2}, \ldots, \frac{d f}{d x_1}\}=\{0, 0, \ldots, 0\}$. Solutions to this equation, $x_1, \ldots, x_n$, will present candidate (local) minima and maxima. #' #' In general, identifying critical points using the gradient or tangent plane, where the partial derivatives are trivial, may not be sufficient to determine the *extrema* (*minima* or *maxima*) of multivariate objective functions. Some critical points may represent inflection points, or local extrema that are far from the global *optimum* of the objective function. The eigen-values of the [Hessian matrix](https://en.wikipedia.org/wiki/Second_partial_derivative_test), which includes the second order partial derivatives, at the critical points provide clues to pinpoint extrema. For instance, invertible Hessian matrices that (1) are positive definite (i.e., all eigenvalues are positive), yield a local minimum at the critical point, (2) are negative definite (all eigenvalues are negative) at the critical point suggests that the objective function has a local maximum, and (3) have both positive and negative eigenvalues yield a saddle point for the objective function at the critical point where the gradient is trivial. #' #' There are two complementary strategies to avoid being trapped in *local* extrema. First, we can run many iterations with different initial vectors. At each iteration, the objective function may achieve a (local) maximum/minimum/saddle point. Finally, we select the overall minimal (or maximal) value from all iterations. Another adaptive strategy involves either adjusting the step sizes or accepting solutions *in probability*, e.g., [simulated annealing](https://en.wikipedia.org/wiki/Simulated_annealing) is one example of an adaptive optimization. #' #' ## Example 1: minimizing a univariate function (inverse-CDF) #' #' The cumulative distribution function (CDF) of a real-valued random process $X$, also known as the distribution function of $X$, represents the probability that the random variable $X$ does not exceed a certain level. Mathematically speaking, the CDF of $X$ is $F_X(x) = P(X\leq x)$. Recall the [Chapter 1](http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/01_Foundation.html) discussions of Uniform, Normal, Cauchy, Binomial, Poisson and other discrete and continuous distributions. Try the [Probability Distributome Navigator](http://distributome.org/V3/). Also explore the [dynamic representations of density and distribution functions included in the Probability Distributome Calculators](http://www.distributome.org/V3/calc/index.html). #' #' For each $p\in [0,1]$, the *inverse distribution function*, also called *quantile function* (e.g., qnorm), yields the critical value ($x$) at which the probability of the random variable is less than or equal to the given probability ($p$). When the CDF $F_X$ is continuous and strictly increasing, the value of the inverse CDF at $p$, $F^{-1}(p)=x$, is the unique real number $x$ such that $F(x)=p$. #' #' Below, we will plot the probability density function (PDF) and the CDF for *Normal* distribution in R. #' #' par(mfrow=c(1,2), mar=c(3,4,4,2)) z<-seq(-4, 4, 0.1) # points from -4 to 4 in 0.1 steps q<-seq(0.001, 0.999, 0.001) # probaility quantile values from 0.1% to 99.9% in 0.1% steps dStandardNormal <- data.frame(Z=z, Density=dnorm(z, mean=0, sd=1), Distribution=pnorm(z, mean=0, sd=1)) plot(z, dStandardNormal$Density, col="darkblue",xlab="z", ylab="Density", type="l",lwd=2, cex=2, main="Standard Normal PDF", cex.axis=0.8) # could also do # xseq<-seq(-4, 4, 0.01); density<-dnorm(xseq, 0, 1); plot (density, main="Density") # Compute the CDF xseq<-seq(-4, 4, 0.01); cumulative<-pnorm(xseq, 0, 1) # plot (cumulative, main="CDF") plot(xseq, cumulative, col="darkred", xlab="", ylab="Cumulative Probability", type="l",lwd=2, cex=2, main="CDF of (Simulated) Standard Normal", cex.axis=.8) #' #' #' The example below shows a semi-interactive standard normal distribution calculator using plot_ly. You can see many other probability distribution calculators on the [Distributome site](http://www.distributome.org/V3/calc/index.html). #' #' library(plotly) library(quantmod) z<-seq(-4, 4, 0.1) # points from -4 to 4 in 0.1 steps q<-seq(0.001, 0.999, 0.01) # probaility quantile values from 0.1% to 99.9% in 0.1% steps dStandardNormal <- data.frame(Z=z, Density=dnorm(z, mean=0, sd=1), Distribution=pnorm(z, mean=0, sd=1)) # plot(z, dStandardNormal$Density, col="darkblue",xlab="z", ylab="Density", type="l",lwd=2, cex=2, main="Standard Normal PDF", cex.axis=0.8) # polygon(z, dStandardNormal$Density, col="red", border="blue") dStandardNormal$ID <- seq.int(nrow(dStandardNormal)) aggregate_by <- function(dataset, feature) { feature <- lazyeval::f_eval(feature, dataset) levels <- plotly:::getLevels(feature) aggData <- lapply(seq_along(levels), function(x) { cbind(dataset[feature %in% levels[seq(1, x)], ], frame = levels[[x]]) }) dplyr::bind_rows(aggData) } dStandardNormal <- dStandardNormal %>% aggregate_by(~ID) plotMe <- dStandardNormal %>% plot_ly( x = ~Z, y = ~Density, frame = ~frame, type = 'scatter', mode = 'lines', fill = 'tozeroy', fillcolor="red", line = list(color = "blue"), text = ~paste("Z: ", Z, "
Density: ", Density, "
CDF: ", Distribution), hoverinfo = 'text' ) %>% layout( title = "Standard Normal Distribution Distribution", yaxis = list( title = "N(0,1) Density", range = c(0,0.45), zeroline = F, tickprefix = "" # density value ), xaxis = list( title = "Z", range = c(-4,4), zeroline = T, showgrid = T ) ) %>% animation_opts( frame = 100, transition = 1, redraw = FALSE ) %>% animation_slider( currentvalue = list( prefix = "Z: " ) ) plotMe # If you want to create a shareable plotly link to the Interactive Normal Distribution Calculator, see the API for estbalishing credentials: https://plot.ly/r/getting-started # stdNormalCalculator = api_create(p, filename="SOCR/interactiveStdNormalCalculator") # stdNormalCalculator #' #' #' Suppose we are interested in computing, or estimating, the inverse-CDF from first principles. Specifically, to invert the CDF, we need to be able to solve the following equation (representing our objective function): #' $$CDF(x)-p=0.$$ #' #' The stats::uniroot and stats::nlm R functions do *non-linear minimization* of a function $f$ using a [Newton-Raphson algorithm](https://en.wikipedia.org/wiki/Newton%27s_method). Let's test that optimization using $N(\mu=100,\sigma=20)$. #' #' set.seed(1234) x <- rnorm(1000, 100, 20) pdf_x <- density(x) # Interpolate the density, the values returned when input x values are outside [min(x): max(x)] should be trivial f_x <- approxfun(pdf_x$x, pdf_x$y, yleft=0, yright=0) # Manual computation of the cdf by numeric integration cdf_x <- function(x){ v <- integrate(f_x, -Inf, x)$value if (v<0) v <- 0 else if(v>1) v <- 1 return(v) } # Finding the roots of the inverse-CDF function by hand (CDF(x)-p=0) invcdf <- function(p){ uniroot(function(x){cdf_x(x) - p}, range(x))$root # alternatively, can use # nlm(function(x){cdf_x(x) - p}, 0)$estimate # minimum - the value of the estimated minimum of f. # estimate - the point at which the minimum value of f is obtained. } invcdf(0.5) # We can validate that the inverse-CDF is correctly computed: F^{-1}(F(x))==x cdf_x(invcdf(0.8)) #' #' #' The ability to compute exactly, or at least estimate, the inverse-CDF function is important for many reasons. For instance, generating random observations from a specified probability distribution (e.g., normal, exponential, or gamma distribution) is an important task in many scientific studies. One approach for such random number generation from a specified distribution evaluates the inverse CDF at random uniform$u\sim U(0,1)$values. Recall that in [Chapter 15](http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/15_SpecializedML_FormatsOptimization.html) we showed an example of generating random uniform samples using atmospheric noise. The key step is ability to quickly, efficiently and reliably estimate the inverse CDF function, which we just showed one example of. #' #' Let's see why inverting the CDF using random uniform data works. Consider the cumulative distribution function (CDF) of a probability distribution from which we are interested in sampling. If the CDF has a closed form analytical expression and is invertible, then we generate a random sample from that distribution by evaluating the inverse CDF at$u$, where$u \sim U(0,1)$. This is possible since a continuous CDF,$F$, is a one-to-one mapping of the domain of the CDF (range of$X$) into the interval$[0,1]$. Therefore, if$U$is a uniform random variable on$[0,1]$, then$X = F^{-1}(U)$has the distribution$F$. Suppose$U \sim Uniform[0,1]$, then$P(F^{-1}(U) \leq x)= P(U \leq F(x))$, by applying$F$to both sides of this inequality, since$F$is monotonic. Thus,$P(F^{-1}(U) \leq x)= F(x)$, since$P(U \leq u) = u$for uniform random variables. #' #' ## Example 2: minimizing a bivariate function #' #' Let's look at the function$f(x_1, x_2) = (x_1-3)^2 + (x_2+4)^2$. We define the function in R and utilize the optim function to obtain the extrema points in the support of the objective function and/or the extrema values at these critical points. #' #' require("stats") f <- function(x) { (x[1] - 3)^2 + (x[2] +4)^2 } initial_x <- c(0, -1) x_optimal <- optim(initial_x, f, method="CG") # performs minimization x_min <- x_optimal$par # x_min contains the domain values where the (local) minimum is attained x_min # critical point/vector x_optimal$value # extrema value of the objective function #' #' #' optim allows the use of 6 candidate optimization strategies: #' #' * **Nelder-Mead**: robust but relatively slow, works reasonably well for non-differentiable functions. #' * **BFGS**: quasi-Newton method (also known as a variable metric algorithm), uses function values and gradients to build up a picture of the surface to be optimized. #' * **CG**: conjugate gradients method, fragile, but successful in larger optimization problems because it's unnecessary to save large matrix. #' * **L-BFGS-B**: allows box constraints. #' * **SANN**: a variant of simulated annealing, belonging to the class of stochastic global optimization methods. #' * **Brent**: for one-dimensional problems only, useful in cases where optim() is used inside other functions where only method can be specified. #' #' ## Example 3: using simulated annealing to find the maximum of an oscillatory function #' #' Consider the function$f(x) = 10 \sin(0.3 x)\times \sin(1.3 x^2) - 0.00002 x^4 + 0.3 x+35$. Maximizing$f()$is equivalent to minimizing$-f()$. Let's plot this oscillatory function, find and report its critical points and extremum values. #' #' The function optim returns two important results: #' #' * par: the best set of domain parameters found to optimize the function #' * value: the extreme values of the function corresponding to par. #' #' funct_osc <- function (x) { -(10*sin(0.3*x)*sin(1.3*x^2) - 0.00002*x^4 + 0.3*x+35) } plot(funct_osc, -50, 50, n = 1000, main = "optim() minimizing an oscillatory function") abline(v=17, lty=3, lwd=4, col="red") res <- optim(16, funct_osc, method = "SANN", control = list(maxit = 20000, temp = 20, parscale = 20)) res$par res$value #' #' #' # Constrained Optimization #' #' ## Equality constraints #' #' When there are support restrictions, dependencies or other associations between the domain variables$x_1, x_2, \ldots, x_n$, constrained optimization needs to be applied. #' #' For example, we can have$k$equations specifying these restrictions, which may specify certain model characteristics: #' $$\begin{cases} #' g_1(x_1, x_2, \ldots, x_n) = 0\\ #' \ldots \\ #' g_k(x_1, x_2, \ldots, x_n) = 0 #' \end{cases} .$$ #' #' Note that the right hand sides of these equations may always be assumed to be trivial ($0$), otherwise we can just move the non-trivial parts within the constraint functions$g_i$. [Linear Programming](http://en.wikipedia.org/wiki/Linear_programming), [Quadratic Programming](http://en.wikipedia.org/wiki/Quadratic_programming), and [Lagrange multipliers](http://en.wikipedia.org/wiki/Lagrange_multiplier) may be used to solve such equality-constrained optimization problems. #' #' ## Lagrange Multipliers #' We can merge the equality constraints within the objective function ($f \longrightarrow f^*$). Lagrange multipliers represents a typical solution strategy that turns the *constrained* optimization problem ($\min_{x} {f(x)}$subject to$g_i(x_1, x_2, \ldots, x_n)=0$,$1\leq i\leq k$), into an *unconstrained* optimization problem: #' $$f^*(x_1, x_2, \ldots, x_n; \lambda_1, \lambda_2, \ldots, \lambda_k) = f(x_1, x_2, \ldots, x_n) + \sum_{i=1}^k {\lambda_i g_i(x_1, x_2, \ldots, x_n)}.$$ #' #' Then, we can apply traditional unconstrained optimization schemas, e.g., extreme value theorem, to minimize the unconstrained problem: #' #' $$f^*(x_1, x_2, \ldots, x_n; \lambda_1, \lambda_2, \ldots, \lambda_k) = f(x_1, x_2, \ldots, x_n) + \lambda_1 g_1(x_1, x_2, \ldots, x_n) + \ldots + \lambda_k g_k(x_1, x_2, \ldots, x_n).$$ #' #' This represent an unconstrained optimization problem using [Lagrange multipliers](https://en.wikipedia.org/wiki/Lagrange_multiplier). #' #' The solution of the constrained problem is also a solution to: #' $$\nabla f^* = \left [\frac{d f}{d x_1}, \frac{d f}{d x_2}, \ldots, \frac{d f}{d x_n}; \frac{d f}{d \lambda_1}, \frac{d f}{d \lambda_2}, \ldots, \frac{d f}{d \lambda_k} \right ] = [0, 0, \ldots, 0].$$ #' #' ## Inequality constrained optimization #' There are no general solutions for arbitrary inequality constraints; however, partial solutions do exist when some restrictions on the form of constraints are present. #' #' When both the constraints and the objective function are linear functions of the domain variables, then the problem can be solved by Linear Programming. #' #' ### Linear Programming (LP) #' LP works when the objective function is a linear function. The constraint functions are also linear combination of the same variables. #' #' Consider the following elementary (minimization) example: #' $$\min_{x_1, x_2, x_3} (-3x_1 -4x_2 -3x_3)$$ #' #' subject to: #' $$\left\{ #' \begin{array}{rl} #' 6x_1 + 2x_2 + 4x_3 & \leq 150 \\ #' x_1 + x_2 + 6x_3 & \geq 0 \\ #' 4x_1 + 5x_2 + 4x_3 & = 40 #' \end{array} \right.$$ #' #' The exact solution is$x_1 = 0, x_2 = 8, x_3 = 0$, and can be computed using the package lpSolveAPI to set up the constraint problem and the generic solve() method to find its solutions. #' #' # install.packages("lpSolveAPI") library(lpSolveAPI) lps.model <- make.lp(0, 3) # define 3 variables # add the constraints as a matrix of the linear coefficients, relations and RHS add.constraint(lps.model, c(6, 2, 4), "<=", 150) add.constraint(lps.model, c(1, 1, 6), ">=", 0) add.constraint(lps.model, c(4, 5, 4), "=" , 40) # set objective function (default: find minimum) set.objfn(lps.model, c(-3, -4, -3)) # you can save the model to a file # write.lp(lps.model, 'c:/Users/LPmodel.lp', type='lp') # these commands define the constraint linear model # /* Objective function */ # min: -3 x1 -4 x2 -3 x3; # # /* Constraints */ # +6 x1 +2 x2 +4 x3 <= 150; # + x1 + x2 +6 x3 >= 0; # +4 x1 +5 x2 +4 x3 = 40; # # writing it in the text file named 'LPmodel.lp' solve(lps.model) # Retrieve the values of the variables from a solved linear program model get.variables(lps.model) # check against the exact solution x_1 = 0, x_2 = 8, x_3 = 0 get.objective(lps.model) # get optimal (min) value #' #' #' In lower dimensional problems, we can also plot the constraints to graphically demonstrate the corresponding support restriction. For instance, here is an example of a simpler 2D constraint and its Venn diagrammatic representation. #' #' $$\left\{ #' \begin{array}{rl} #' x_1 & \leq \frac{150 -2x_2}{6} \\ #' x_1 & \geq -x_2\\ #' \end{array} \right. .$$ #' #' #' library(ggplot2) ggplot(data.frame(x = c(-100, 0)), aes(x = x)) + stat_function(fun=function(x) {(150-2*x)/6}, aes(color="Function 1")) + stat_function(fun=function(x) { -x }, aes(color = "Function 2")) + theme_bw() + scale_color_discrete(name = "Function") + geom_polygon( data = data.frame(x = c(-100, -100, 0, 0, Inf), y = c(0,350/6, 150/6, 0, 0)), aes(x = x, y = y, fill = "Constraint 1"), inherit.aes = FALSE, alpha = 0.5) + geom_polygon( data = data.frame(x = c(-100, -100, 0, Inf), y = c(0, 100, 0, 0)), aes(x = x, y = y, fill = "Constraint 2"), inherit.aes = FALSE, alpha = 0.3) + scale_fill_discrete(name = "Constraint Set") + scale_y_continuous(limits = c(0, 100)) #' #' #' Here is another example of maximization of a trivariate cost function,$f(x_1, x_2, x_3)=3x_1+ 4x_2 -x_3$, subject to: #' $$\left\{ #' \begin{array}{rl} #' -x_1 + 2x_2 + 3x_3 & \leq 16 \\ #' 3x_1 - x_2 - 6x_3 & \geq 0 \\ #' x_1 - x_2 & \leq 2 #' \end{array} \right. .$$ #' #' lps.model2 <- make.lp(0, 3) add.constraint(lps.model2, c(-1, 2, 3), "<=", 16) add.constraint(lps.model2, c(3, -1, -6), ">=", 0) add.constraint(lps.model2, c(1, -1, 0), "<=", 2) set.objfn(lps.model2, c(3, 4, -1), indices = c(1, 2, 3)) lp.control(lps.model2, sense='max') # changes to max: 3 x1 + 4 x2 - x3 solve(lps.model2) # 0 suggests that this solution convergences get.variables(lps.model2) # get point of maximum get.objective(lps.model2) # get optimal (max) value #' #' #' In 3D we can utilize the rgl::surface3d() method to display the constraints. This output is suppressed, as it can only be interpreted via the pop-out 3D rendering window. #' #' library("rgl") n <- 100 x <- y <- seq(-500, 500, length = n) region <- expand.grid(x = x, y = y) z1 <- matrix(((150 -2*region$x -4*region$y)/6), n, n) z2 <- matrix(-region$x + 6*region$y, n, n) z3 <- matrix(40 -5*region$x - 4*region$y, n, n) surface3d(x, y, z1, back = 'line', front = 'line', col = 'red', lwd = 1.5, alpha = 0.4) surface3d(x, y, z2, back = 'line', front = 'line', col = 'orange', lwd = 1.5, alpha = 0.4) surface3d(x, y, z3, back = 'line', front = 'line', col = 'blue', lwd = 1.5, alpha = 0.4) axes3d() #' #' #' We can also use plot_ly to display the linear constraints (planes) in 3D. #' #' #define the regular x,y grid (in 2D) n <- 100 x <- y <- seq(-500, 500, length = n) region <- expand.grid(x = x, y = y) # define the z values for all three 2D planes z1 <- matrix(((150 -2*region$x -4*region$y)/6), n, n) z2 <- matrix(-region$x + 6*region$y, n, n) z3 <- matrix(40 -5*region$x - 4*region$y, n, n) #z.seq <- function(x,y) coef.lm.fit[1]+coef.lm.fit[2]*x+coef.lm.fit[3]*y # define the values of z = z(x.seq, y.seq), as a Matrix of dimension c(dim(x.seq), dim(y.seq)) #z <- t(outer(x.seq, y.seq, z.seq)) library(plotly) myPlotly <- plot_ly(x=~x, y=~y, z=~z1, colors = c("blue", "red"),type="surface", opacity=0.7) %>% add_trace(x=~x, y=~y, z=~z2, type="surface", colors = "green", opacity=0.5) %>% add_trace(x=~x, y=~y, z=~z3, type="surface", colors = "orange", opacity=0.3) %>% layout(scene = list( aspectmode = "manual", aspectratio = list(x=1, y=1, z=1), xaxis = list(title = "X"), yaxis = list(title = "Y"), zaxis = list(title = "ZA")) ) # print(myPlotly) myPlotly #' #' #' It is possible to restrict the domain type to contain only solutions that are: #' #' * *integers*, which makes it an Integer Linear Programming (ILP), #' * *binary/boolean* values (BLP), or #' * *mixed* types, Mixed Integer Liner Programming (MILP). #' #' Some examples are included below. #' #' ### Mixed Integer Linear Programming (MILP) #' Let's demonstrate MILP with an example where the type of$x_1$is unrestricted,$x_2$is dichotomous (binary), and$x_3$is restricted to be an integer. #' #' lps.model <- make.lp(0, 3) add.constraint(lps.model, c(6, 2, 4), "<=", 150) add.constraint(lps.model, c(1, 1, 6), ">=", 0) add.constraint(lps.model, c(4, 5, 4), "=", 40) set.objfn(lps.model, c(-3, -4, -3)) set.type(lps.model, 2, "binary") set.type(lps.model, 3, "integer") get.type(lps.model) # This is Mixed Integer Linear Programming (MILP) set.bounds(lps.model, lower=-5, upper=5, columns=c(1)) # give names to columns and restrictions dimnames(lps.model) <- list(c("R1", "R2", "R3"), c("x1", "x2", "x3")) print(lps.model) solve(lps.model) get.objective(lps.model) get.variables(lps.model) get.constraints(lps.model) #' #' #' The next example limits all three variable to be dichotomous (binary). #' #' lps.model <- make.lp(0, 3) add.constraint(lps.model, c(1, 2, 4), "<=", 5) add.constraint(lps.model, c(1, 1, 6), ">=", 2) add.constraint(lps.model, c(1, 1, 1), "=", 2) set.objfn(lps.model, c(2, 1, 2)) set.type(lps.model, 1, "binary") set.type(lps.model, 2, "binary") set.type(lps.model, 3, "binary") print(lps.model) solve(lps.model) get.variables(lps.model) #' #' #' ## Quadratic Programming (QP) #' QP can be used for second order (quadratic) objective functions, but the constraint functions are still linear combination of the domain variables. #' #' A matrix formulation of the problem can be expressed as minimizing an objective function: #' $$f(X) = \frac{1}{2} X^T D X - d^T X,$$ #' #' where$X$is a vector$[x_1, x_2, \ldots, x_n]^T$,$D$is the matrix of weights of each association pair,$x_i, x_j$, and$d$are the weights for each individual feature,$x_i$. The$\frac{1}{2}$coefficient ensures that the weights matrix$D$is symmetric and each$x_i, x_j$pair is not double-counted. This cost function is subject to the constraints: #' $$A^T X\ \ [ = | \geq ]\ \ b,$$ #' where the first$k$constrains may represent equalities ($=$) and the remaining ones are inequalities ($\geq$), and$b$is the constraints right hand size (RHS) constant vector. #' #' Here is an example of a QP objective function and its R optimization: #' $$f(x_1, x_2, x_3) = 2 x_1^2 - x_1x_2 - 2 x_2^2 + x_2 x_3 + 2 x_3^2 - 5 x_2 + 3 x_3.$$ #' Subject to the following constraints: #' #' $$\begin{array}{rcl} #' -4x_1 + -3x_2 = -8 \\ #' 2x_1 + x_2 = 2 \\ #' -2x_2 + x_3 \geq 0 #' \end{array} .$$ #' #' library(quadprog) Dmat <- matrix(c( 2, -1, 0, -1, 2, -1, 0, -1, 2), 3, 3) dvec <- c(0, -5, 3) Amat <- matrix(c(-4, -3, 0, 2, 1, 0, 0, -2, 1), 3, 3) bvec <- c(-8, 2, 0) n.eqs <- 2 # the first two constraints are equalities sol <- solve.QP(Dmat, dvec, Amat, bvec=bvec, meq=2) sol$solution # get the (x1, x2, x3) point of minimum sol$value # get the actual cost function minimum #' #' #' The minimum value,$49$, of the QP solution is attained at$x_1=-1, x_2=4, x_3=8$. #' #' When$D$is a positive definitive matrix, i.e.,$X^T D X \gt 0$, for all non-zero$X$, the QP problem may be solved in polynomial time. Otherwise, the QP problem is NP-hard. In general, even if$D$has only one negative eigenvalue, the QP problem is still [NP-hard](https://en.wikipedia.org/wiki/NP-hardness). #' #' The QP function solve.QP() expects a positive definitive matrix$D$. #' #' # General Non-linear Optimization #' The package Rsolnp provides a special function solnp(), which solves the general non-linear programming problem: #' #' $$\min_x { f(x) }$$ #' subject to: #' $$g(x)=0$$ #' $$l_h \leq h(x) \leq u_h$$ #' $$l_x \leq x \leq u_x,$$ #' where$f(x), g(x), h(x)$are all smooth functions. #' #' ## Dual problem optimization #' #' *Duality* in math really just means having two complementary ways to think about an optimization problem. The *primal problem* represents an optimization challenge in terms of the original decision variable$x$. The *dual problem*, also called *Lagrange dual*, searches for a lower bound of a minimization problem or an upper bound for a maximization problem. In general, the primal problem may be difficult to analyze, or solve directly, because it may include non-differentiable penalty terms, e.g.,$l_1$norms, recall LASSO/Ridge regularization in [Chapter 17](http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/17_RegularizedLinModel_KnockoffFilter.html). Hence, we turn to the corresponding *Lagrange dual problem* where the solutions may be more amenable, especially for [convex functions](https://ocw.mit.edu/courses/sloan-school-of-management/15-094j-systems-optimization-models-and-computation-sma-5223-spring-2004/lecture-notes/duality_article.pdf), that satisfy the following inequality: #' $$f(\lambda x +(1-\lambda)y)\leq \lambda f(x) + (1-\lambda)f(y).$$ #' #' ### Motivation #' Suppose we want to borrow money,$x$, from a bank, or lender, and$f(x)$represents the borrowing cost to us. There are natural "design constraints" on money lending. For instance, there may be a cap in the interest rate,$h(x)\leq b$, or we can have many other constraints on the loan duration. There may be multiple lenders, including self-funding, that may "charge" us$f(x)$for lending us$x$. *Lenders goals are to maximize profits*. Yet, they can't charge you more than the prime interest rate plus some premium based on your credit worthiness. Thus, for a given fixed$\lambda$, a lender may make us an offer to lend us$x$aiming to minimize $$f(x)+\lambda \times h(x).$$ #' #' If this cost is not optimized, i.e., minimized, you may be able to get another loan$y$at lower cost$f(y) 1e-6) { previous <- y y <- y - solve( grad_h(y), h(y) ) } y[1:k] } x <- c(0, 0, 0) # Define the objective cost function fn4 <- function(x) # f(x, y, z) = 4y-2z + x^2+y^2 { 4*x[2] - 2*x[3] + x[1]^2+ x[2]^2 #sum(x^2) } # check the derivative of the objective function grad(fn4, x) # define the domain constraints of the problem # constraint z1: 2x-y-z = 2 # constraint z2: x^2+y^2 +z = 1 eqn4 <- function(x){ z1=2*x[1] - x[2] - x[3] -2 z2=x[1]^2 + x[2]^2 + x[3] -1 return(c(z1, z2)) } # Check the Jacobian of the constraints jacobian(eqn4, x) # Call the Lagrange-multipliers solver # check one step of the algorithm k <- length(x) l <- length(eqn4(x)); h <- function(x) { c(grad(fn4, x[1:k]) - t(-x[(1:2)]) %*% jacobian(eqn4, x[1:k]), -eqn4(x[1:k])) } jacobian(h, x) # Lagrange-multipliers solver for f(x, y, z) subject to g(x, y, z) lagrange_multipliers(x, fn4, eqn4) #' #' #' Now, let's double-check the above manual optimization results against the automatic solnp solution minimizing #' $$f(x, y, z) = 4y-2z + x^2+y^2$$ #' subject to: #' #' $$\begin{array}{rcl} #' 2x-y-z=2\\ #' x^2+y^2=1. #' \end{array}$$ #' #' library(Rsolnp) fn4 <- function(x) # f(x, y, z) = 4y-2z + x^2+y^2 { 4*x[2] - 2*x[3] + x[1]^2+ x[2]^2 } # constraint z1: 2x-y-z = 2 # constraint z2: x^2+y^2 +z = 1 eqn4 <- function(x){ z1=2*x[1] - x[2] - x[3] z2=x[1]^2 + x[2]^2 + x[3] return(c(z1, z2)) } constraints4 <- c(2, 1) x0 <- c(1, 1, 1) ctrl <- list(trace=0) sol4 <- solnp(x0, fun = fn4, eqfun = eqn4, eqB = constraints4, control=ctrl) sol4$values #' #' #' The results of both (manual and automated) experiments identifying the optimal$(x, y, z)$coordinates minimizing the objective function$f(x, y, z) = 4y-2z + x^2+y^2$are in agreement. #' #' * *Manual* optimization: lagrange_multipliers(x, fn4, eqn4): 0.3416408 -1.0652476 -0.2514708. #' * *Automated* optimization: solnp(x0, fun = fn4, eqfun = eqn4, eqB = constraints4, control=ctrl): 0.3416408 -1.0652476 -0.2514709. #' #' # Data Denoising #' #' Suppose we are given$x_{noisy}$with$n$noise-corrupted data points. The noise may be additive ($x_{noisy}\sim x +\epsilon$) or not additive. We may be interested in denoising the signal and recovering a version of the original (unobserved) dataset$x$, potentially as a smother representation of the original (uncorrupted) process. Smoother signals suggest less (random) fluctuations between neighboring data points. #' #' One objective function we can design to denoise the observed signal,$x_{noisy}$, may include a *fidelity term* and a *regularization term*, see the regularized linear modeling, [Chapter 17](http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/17_RegularizedLinModel_KnockoffFilter.html). #' #' Total variation denoising assumes that for each time point$t$, the observed noisy data $$\underbrace{x_{noisy}(t)}_{\text{observed signal}} \sim \underbrace{x(t)}_{\text{native signal}} + \underbrace{\epsilon(t)}_{\text{random noise}}.$$ #' #' To recover the *native signal*,$x(t)$, we can optimize ($\arg\min_x{f(x)}$) the following objective cost function: #' #' $$f(x) = \underbrace{\frac{1}{2} \sum_{t=1}^{n-1} {\|y(t) - x_{noisy}(t)\| ^2}}_{\text{fidelity term}} + \underbrace{\lambda \sum_{t=2}^{n-1} | x(t) - x(t-1)|}_{\text{regularization term}},$$ #' #' where$\lambda$is the regularization smoothness parameter,$\lambda \rightarrow 0 \implies y \rightarrow x_{noisy}$. Minimizing$f(x)$provides a minimum total-variation solution to the data denoising problem. #' #' Below is an example illustrating total variation (TV) denoising using a simulated noisy dataset. We start by generating an oscillatory noisy signal. Then, we compute several smoothed versions of the noisy data, plot the initial and smoothed signals, define and optimize the TV denoising objective function, which is a mixture of a fidelity term and a regularization term. #' #' n <- 1000 x <- rep(0, n) xs <- seq(0, 8, len=n) # seq(from = 1, to = 1, length) noise_level = 0.3 # sigma of the noise, try varying this noise-level # here is where we add the zero-mean noise set.seed(1234) x_noisy <- function (x) { sin(x)^2/(1.5+cos(x)) + rnorm(length(x), 0, noise_level) } # initialize the manual denoised signal x_denoisedManu <- rep(0, n) df <- as.data.frame(cbind(xs, x_noisy(xs))) # loess fit a polynomial surface determined by numerical predictors, # using local fitting poly_model1 <- loess(x_noisy(xs) ~ xs, span=0.1, data=df) # tight model poly_model2 <- loess(x_noisy(xs) ~ xs, span=0.9, data=df) # smoother model # To see some of numerical results of hte model-fitting: # View(as.data.frame(cbind(xs, x_noisy, predict(poly_model1)))) plot(xs, x_noisy(xs), type='l') lines(xs, poly_model1$fitted, col="red", lwd=2) lines(xs, poly_model2$fitted, col="blue", lwd=3) #' #' #' Next, let's initiate the parameters, define the objective function and optimize it, i.e., estimate the parameters that minimize the cost function as a mixture of fidelity and regularization terms. #' #' # initialization of parameters betas_0 <- c(0.3, 0.3, 0.5, 1) betas <- betas_0 # Denoised model x_denoised <- function(x, betas) { if (length(betas) != 4) { print(paste0("Error!!! length(betas)=", length(betas), " != 4!!! Exiting ...")) break(); } # print(paste0(" .... betas = ", betas, "...\n")) # original noise function definition: sin(x)^2/(1.5+cos(x)) return((betas[1]*sin(betas[2]*x)^2)/(betas[3]+cos(x))) } library(Rsolnp) fidelity <- function(x, y) { sqrt((1/length(x)) * sum((y - x)^2)) } regularizer <- function(betas) { reg <- 0 for (i in 1:(length(betas-1))) { reg <- reg + abs(betas[i]) } return(reg) } # Objective Function objective_func <- function(betas) { # f(x) = 1/2 * \sum_{t=1}^{n-1} {|y(t) - x_{noisy}(t)\|^2}} + \lambda * \sum_{t=2}^{n-1} | x(t) - x(t-1)| fid <- fidelity(x_noisy(xs), x_denoised(xs, betas)) reg <- abs(betas[4])*regularizer(betas) error <- fid + reg # uncomment to track the iterative optimization state # print(paste0(".... Fidelity =", fid, " ... Regularizer = ", reg, " ... TotalError=", error)) # print(paste0("....betas=(",betas[1],",",betas[2],",",betas[3],",",betas[4],")")) return(error) } # inequality constraint forcing the regularization parameter lambda=beta[4]>0 inequalConstr <- function(betas){ betas[4] } inequalLowerBound <- 0; inequalUpperBound <- 100 # should we list the value of the objective function and the parameters at every iteration (default trace=1; trace=0 means no interim reports) # constraint problem # ctrl <- list(trace=0, tol=1e-5) ## could specify: outer.iter=5, inner.iter=9) set.seed(121) sol_lambda <- solnp(betas_0, fun = objective_func, ineqfun = inequalConstr, ineqLB = inequalLowerBound, ineqUB = inequalUpperBound, control=ctrl) # unconstraint optimization # ctrl <- list(trace=1, tol=1e-5) ## could specify: outer.iter=5, inner.iter=9) # sol_lambda <- solnp(betas_0, fun = denoising_func, control=ctrl) # suppress the report of the the functional values (too many) # sol_lambda$values # reprot the optimal parameter estimates (betas) sol_lambda$pars # Reconstruct the manually-denoised signal using the optimal betas betas <- sol_lambda$pars x_denoisedManu <- x_denoised(xs, betas) print(paste0("Final Denoised Model:", betas[1], "*sin(", betas[2], "*x)^2/(", betas[3], "+cos(x)))")) plot(x_denoisedManu) #' #' #' Finally, we can validate our manual denoising protocol against the automated TV denoising using the R package [tvd](https://cran.r-project.org/web/packages/tvd). #' #' # install.packages("tvd") library("tvd") lambda_0 <- 0.5 x_denoisedTVD <- tvd1d(x_noisy(xs), lambda_0, method = "Condat") # lambda_o is the total variation penalty coefficient # method is a string indicating the algorithm to use for denoising. # Default method is "Condat" # plot(xs, x_denoisedTVD, type='l') plot(xs, x_noisy(xs), type='l') lines(xs, poly_model1$fitted, col="red", lwd=2) lines(xs, poly_model2$fitted, col="blue", lwd=3) lines(xs, x_denoisedManu, col="pink", lwd=4) lines(xs, x_denoisedTVD, col="green", lwd=5) # add a legend legend("bottom", c("x_noisy", "poly_model1$fitted", "poly_model2$fitted", "x_denoisedManu", "x_denoisedTVD"), col=c("black", "red", "blue", "pink", "green"), lty=c(1,1, 1,1), cex=0.7, lwd= c(1,2,3,4,5), title="TV Denoising") #' #' #' # Sparse Matrices #' #' In [Chapter 4](http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/04_LinearAlgebraMatrixComputing.html) we saw matrix computing. However, we need a new data structure to efficiently store Big but Sparse matrices, whose elements are mostly trivial (zero). Dense Matrices have the majority of their elements be non-zero, but sparse matrices represent mostly trivial elements. The *sparcity of a matrix* is the proportion of non-zero elements. Even if the original data is not sparse, some data proprocessing may result is sparse data structures. Various techniques to store and process sparse matrices are included in the R package Matrix. #' #' Here is an example of a large $15000\times 15000$ but sparse diagonal matrix ($SM$) compared to a standard diagonal matrix ($D$) of the same size: #' #' # install.packages("pryr") library("Matrix") memory.limit(50000) # increase the RAM allocation n = 15000 SM = sparseMatrix(1:n, 1:n, x = 1) D = diag(1, n, n) # Check the sizes of the SM and D: pryr::object_size(SM); pryr::object_size(D) # Try to invert each of the matrices to see that the SM arithmetic is much more efficient solve(SM)[1:10, 1:10] # skip this is very intensive: solve(D)[1:10, 1:10] #' #' #' The size of the matrix $D$ is $10,000$ times larger than $SM$ despite the fact that both represent $n\times n$ matrices. The difference is that information is sorted differently in *sparse* matrices, where identical values that appear multiple times are only stored once along with pointers to the matrix locations with the same value. #' #' # Parallel Computing #' #' Parallel computing is useful for many processes where the algorithm can be partitioned into smaller jobs that can be run in parallel on different compute cores and results integrated at the end, e.g., Monte-Carlo simulations, machine learning tasks, separable transformations. The user creates a virtual cluster of compute nodes where each core runs independently and the resulting information pieces are aggregated at the end. #' #' Let's try a simple calculation of the mean: #' #' n = 200000 sapply(1:n, mean)[1:20] #' #' #' Let's create a private compute cluster using 4 of the available 8 cores and rerun the same script using parallel processing via the R parallel package. #' #' library("parallel") # check the number of available cores: detectCores() # construct a cluster of 4 cores clust = makeCluster(4) # swaping sapply() with parSapply(): parSapply(clust, 1:n, mean)[1:20] stopCluster(clust) #' #' #' Once done, we should always stop the clusters and release the system resources to avoid memory leaks. #' #' Most computers include multiple processors which can be utilized to optimize the calculations and speed-up performance. Never request, or use, all cores, as this will halt the machine, and do not expect to achieve performance improvement directly proportional to the number of cores used, as there is multi-thread communication overhead. #' #' # Foundational Methods for Function Optimization #' #' Function optimization may not always be possible, i.e., finding *exact* minima or maxima of #' functions are not always analytically tractable. However, there are iterative algorithms to compute *approximate* solutions to such function optimization problems. #' #' ## Basics #' Let's recall the following Newtonian principles for a given *objective* or *cost* function* $f: \Re^n \to \Re$, which we want to optimize, i.e., we are looking for solution(s) #' $x^*= \arg\min_{x} \; f(x)$. #' #' * Maximizing $f$ is the same as minimizing $-f$, as $\arg\max_{x} \; f(x) = \arg\min_{x} \;-f(x)$. #' #' * If $\psi$ is strictly increasing (e.g., $\log$, $x^2$, $\exp(x)$), then $\arg\min_{x} \; f(x) = \arg\min_{x}\; \psi(f(x))$. #' #' * The *accuracy* (bias) measures how close we get to the optimum point $x^*$. #' #' * The *convergence speed* reflects how quickly (in terms of the number of iterations) we get towards $x^*$. #' #' * The *computational complexity* captures how expensive is it to perform a single iteration. #' #' * For 1D functions ($n=1$), if $f$ is smooth, then $x^\ast$ is a *local optimum* implies that $f'(x^\ast) = 0$. The the sign of the second derivative determines if the optimum is minimum ($f''(x^\ast)>0$) or maximum ($f''(x^*)<0$). #' #' * In $\Re^n$, for smooth $f: \Re^n \to \Re$, then at (local) optimal points, $\nabla f(x^\ast) = 0$, where the *gradient* of $f$ is the vector of all partial derivatives:$\nabla f(x) =\left(\begin{array}{c} #' \frac{\partial f(x)}{\partial x_1} \\ #' \frac{\partial f(x)}{\partial x_2} \\ #' \vdots \\ #' \frac{\partial f(x)}{\partial x_n} #' \end{array}\right)$. #' #' * In higher dimensions, the second derivative is represented as the [Hessian matrix](https://en.wikipedia.org/wiki/Hessian_matrix), $\nabla^2 f$, which is defined at each point $x$ by $\nabla^2 f(x)= [\nabla^2 f(x)]_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}$. At a local minimum $x^*$, the Hessian is [positive definite](https://en.wikipedia.org/wiki/Positive-definite_matrix) ($v^T \nabla^2 f(x^\ast) v > 0, \;\;\; \text{for all$v$}$), for minima, or negative-definite ($v^T \nabla^2 f(x^\ast) v < 0, \;\;\; \text{for all$v$}$), for maxima. #' #' ## Gradient Descent #' #' Let's recall the relation between function changes (rise), relative to changes in the argument (run), gradients, and derivatives. #' #' The derivative of a function is defined as the limit of the rise over the run, #' $$f'(x_0) = \lim_{x\rightarrow x_0}{\frac{f(x)-f(x_0)}{x-x_0}}.$$ #' #' Thus, when $x$ is close to $x_0$, a first-order (linear) approximation of the function at $x$ is obtained by $f(x) \approx f(x_0) + f'(x_0)(x-x_0)$. This linear approximation allows us to approximate (locally) the function extrema by moving down the slope, or gradient, $f'(x_0)$. The higher-dimensional analogue is similar, for $x \sim x_0 \in \Re^n$, $f(x) \approx f(x_0) + \nabla f(x_0)^T (x-x_0)$. Effectively, all smooth functions look linear in a small neighborhood around each domain point (including the extrema points). In addition, the gradient $\nabla f(x_0)$ points in the direction of fastest ascent. Therefore, to optimize $f$, we move in direction given by $-\nabla f(x_0)$. Recall that the inner product of two vectors, $$, is defined by  = a^T b = \sum_{i=1}^n a_i b_i. #' #' ### Pseudo algorithm #' #' Here is a simple yes illustrative algorithm for minimizing f by gradient descent, by iteratively moving in the direction of the negative gradient. #' #' 1. (*Initialization*): Start with initial guess x^{(0)}, a step size \eta (aka *learning rate*), and a threshold (tolerance level), #' 2. (*Iterative Update*) For k=1,2,3,\ldots: #' - Compute the gradient \nabla f(x^{(k-1)}) #' - If gradient is close to zero (stopping criterion threshold), then stop, otherwise continue #' - Update x^{(k)} = x^{(k-1)} - \eta \nabla f(x^{(k-1)}), #' 3. Return final x^{(k)} as approximate solution x^*. #' #' #' ### Example #' #' Let's demonstrate a simple example performing gradient descent optimization. #' #' # install.packages("numDeriv") library(numDeriv) # to access the gradient calculation funciton: grad() gradDescent = function(f, x0, max.iter=500, stepSize=0.01, stoppingCriterion=0.001) { n = length(x0) xmat = matrix(0, nrow=n, ncol=max.iter) xmat[,1] = x0 for (k in 2:max.iter) { # Calculate the current gradient currentGrad = grad(f, xmat[,k-1]) # Check Stopping criterion if (all(abs(currentGrad) < stoppingCriterion)) { k = k-1; break } # Move in the opposite direction of the gradient xmat[,k] = xmat[,k-1] - stepSize * currentGrad } xmat = xmat[,1:k] # remove the initial guess (x0) # return the final solution, the solution path, min.value, and the number of iterations return(list(x=xmat[,k], xmat=xmat, min.val=f(xmat[,k]), k=k)) } #' #' #' Let's try one of the earlier examples, \min f(x_1, x_2) = \min \{(x_1-3)^2 + (x_2+4)^2\}, which we solves using optim(): #' #' # Define the funciton f <- function(x) { (x[1] - 3)^2 + (x[2] +4)^2 } initial_x <- c(0, -1) x_optimal <- optim(initial_x, f, method="CG") # performs minimization x_min <- x_optimalpar # x_min contains the domain values where the (local) minimum is attained x_min # critical point/vector x_optimalvalue #' #' #' #' # define cal lthe gadient descent optimizer x0 = c(0, 0) optim.f = gradDescent(f, x0) # local min x optim.fx # optimal value optim.fmin.val # number of iterations optim.fk #' #' #' Let's make a 3D plot to visualize the solution path. #' #' surface = function(f, from.x=0, to.x=1, from.y=0, to.y=1, n.x=30, n.y=30, theta=5, phi=80, ...) { # Build the 2d grid x.seq = seq(from=from.x,to=to.x,length.out=n.x) y.seq = seq(from=from.y,to=to.y,length.out=n.y) plot.grid = expand.grid(x.seq,y.seq) z.vals = apply(plot.grid,1,f) z.mat = matrix(z.vals,nrow=n.x) # Plot with the persp function orig.mar = par()mar # Save the original margins par(mar=c(1,1,1,1)) # Make the margins small r = persp(x.seq,y.seq,z.mat,theta=theta,phi=phi,...) par(mar=orig.mar) # Restore the original margins invisible(r) } # Draw our simple function in 2d, and our gradient descent path on top r = surface(f,from.x=-2,to.x=5,from.y=-6,to.y=5,theta=45,phi=-5,xlab="x1",ylab="x2",zlab="f") points(trans3d(x0[1],x0[2],f(x0),r),col="red",cex=2) lines(trans3d(optim.fxmat[1,],optim.fxmat[2,],apply(optim.fxmat,2,f),r),lwd=4,col="red") points(points(trans3d(optim.fx[1],optim.fx[2],f(optim.fx),r), col="gray",cex=2)) #' #' #' ### SUmmary of Gradient Descent #' #' There are pros and cons of using gradient descent optimization #' #' Advantages of Gradient Descent: #' #' - Simple and intuitive #' - Easy to implement #' - Iterations are usually computationally efficient (involving estimation of the gradient) #' #' Disadvantages: #' #' - The algorithm may be slow and may get trapped into left-right alternating patterns when \nabla f(x) are of #' very different sizes #' - It may require a long time to converge to the optimum #' - For some functions, getting close to the optimum (x^{(k)} \sim x^*, where f(x^{(k)})-f(x^*) \leq \epsilon, may require large number of iterations, k\approx 1/\epsilon. However, for smoother functions this may require k\approx \log(1/\epsilon) iterations. #' #' ## Convexity #' #' There are substantial differences between convex function optimization over convex sets and non-convex function optimization over non-convex domains. The *convexity* assumptions make optimization problem much easier than the general non-convex space optimization problems since (1) local extrema must be global extrema in convex spaces, and (2) [first-order conditions are sufficient conditions for optimality](https://doi.org/10.1137%2F1035044). #' #' Let's reexamine our simple quadratic convex optimization problem (\min f(x_1, x_2) = \min \{(x_1-3)^2 + (x_2+4)^2\}). We can plot a number of gradient descents pathways starting with different initial conditions. If they all merge into each other (convergence), this would suggest robustness and reproducibility of the optimization. #' #' x0 = c(0,0) optim.f0 = gradDescent(f,x0) x1 = c(-2,-2) optim.f1 = gradDescent(f,x1) x2 = c(2,1) optim.f2 = gradDescent(f,x2) x3 = c(3,-1) optim.f3 = gradDescent(f,x3) x4 = c(4,2) optim.f4 = gradDescent(f,x4) # Draw our simple function in 2d, and all gradient descent paths on top r = surface(f,from.x=-2,to.x=5,from.y=-5,to.y=2, theta=45,phi=-5,xlab="x",ylab="x2",zlab="f") lines(trans3d(optim.f0xmat[1,],optim.f0xmat[2,],apply(optim.f0xmat,2,f),r),lwd=4,col="red") lines(trans3d(optim.f1xmat[1,],optim.f1xmat[2,],apply(optim.f1xmat,2,f),r),lwd=4,col="green") lines(trans3d(optim.f2xmat[1,],optim.f2xmat[2,],apply(optim.f2xmat,2,f),r), lwd=4,col="blue") lines(trans3d(optim.f3xmat[1,],optim.f3xmat[2,],apply(optim.f3xmat,2,f),r), lwd=4,col="purple") lines(trans3d(optim.f4xmat[1,],optim.f4xmat[2,],apply(optim.f4xmat,2,f),r),lwd=4,col="orange") points(trans3d(3,-4,f(c(3,-4)),r),col="red",cex=2, lwd=4, pch=0) # optimum solution #' #' #' So, all iterative solution pathways in this convex problem do lead to the optimal point. However, problems are not always convex, and this process can be significantly disrupted. #' #' Let's look at another example, \displaystyle\min_{x=(x_1,x_2)} f_1(x)=\min \{(\frac{1}{2}\times x_1^2-\frac{1}{4}\times x_2^2+3)\times \cos(2x_1+ 1- e^{x_2})\} . #' #' f1 = function(x) { return((1/2*x[1]^2-1/4*x[2]^2+3)*cos(2*x[1]+1-exp(x[2]))) } x0 = c(0.5,0.5) optim.f0 = gradDescent(f1,x0,stepSize=0.01) x1 = c(-0.1,-1.3) optim.f1 = gradDescent(f1,x1,stepSize=0.01,max.iter=400) x2 = c(-0.5,-1.3) optim.f2 = gradDescent(f1,x2,stepSize=0.01,max.iter=400) x3 = c(-0.2,1.4) optim.f3 = gradDescent(f1,x3,stepSize=0.01,max.iter=400) x4 = c(-0.5,-0.5) optim.f4 = gradDescent(f1,x4,stepSize=0.01,max.iter=400) # Show the f1 optimizaiton space, and plot all gradient descent pathways r = surface(f1,from.x=-2,to.x=2,from.y=-2,to.y=2,theta=-10,phi=70,xlab="x1",ylab="x2",zlab="f1") lines(trans3d(optim.f0xmat[1,],optim.f0xmat[2,],apply(optim.f0xmat,2,f1),r),lwd=4,col="red") lines(trans3d(optim.f1xmat[1,],optim.f1xmat[2,],apply(optim.f1xmat,2,f1),r),lwd=4,col="green") lines(trans3d(optim.f2xmat[1,],optim.f2xmat[2,],apply(optim.f2xmat,2,f1),r),lwd=4,col="blue") lines(trans3d(optim.f3xmat[1,],optim.f3xmat[2,],apply(optim.f3xmat,2,f1),r),lwd=4,col="purple") lines(trans3d(optim.f4xmat[1,],optim.f4xmat[2,],apply(optim.f4xmat,2,f1),r),lwd=4,col="orange") #' #' #' Clearly, the non-convex problem has divergent solution pathways, that depend on the initial conditions and the topology of the space. Convexity represents the fundamental difference between these two optimization problems (f(x) and f_1(x)). #' #' If f(x) is a convex real-valued function defined on a convex domain D, f:D \to \Re, then #' \forall x_1, x_2 \in D and \forall t \in [0, 1]: #'$$f(tx_1+(1-t)x_2)\leq t f(x_1)+(1-t)f(x_2).$$#' The convex optimization problem is to find a point x^\ast \in D for which f(x) is optimized, i.e., f(x^\ast) \le f(x), \forall x \in D. #' #' In finite-dimensional normed spaces, the [Hahn-Banach theorem](https://en.wikipedia.org/wiki/Hahn%E2%80%93Banach_theorem) provides theoretical necessary and sufficient conditions for optimality, Duality theory generalizes the classical linear programming problem to more complex situations and provides effective computational methods. #' #' ### Notes #' #' - All extrema of convex functions are global, i.e., there are no local extrema. Optimization of convex problems is of polynomial complexity. #' - Gradient descent optimization provides solutions to most smooth convex functions, subject to selecting appropriate numeric parameters (step-size, i.e., *learning rate*) #' - Gradient descent may often fail for non-convex functions, where moving downhill locally may not always lead to the optimum, #' - Non-convex optimization is NP-hard problem and is much harder in general to solve. #' #' ## Newton's method #' #' As a second order method, Newton's optimization is a bit more sophisticated than gradient descent, and may also be more accurate. Let's recall the *second order Taylor expansion* of functions. #' #'$$f(x) \approx f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f''(x_0)(x-x_0)^2. $$#' This represents a more accurate approximation of f(x) near x_0. This directly generalizes to the multivariate case, f(x): D\subset\Re^n \to \Re: #'$$f(x) \approx f(x_0) + \nabla f(x)^T(x-x_0) + \frac{1}{2}(x-x_0)^T \nabla^2 f(x_0) (x-x_0).$$#' The basic idea of Newton's method is to repeatedly expand the second-order Taylor representation of the objective function, reduce the size of the quadratic term on the right-hand side, and iterate. #' #' ### Newton's method pseudo code #' #' Repeat the formulation and minimization of the quadratic approximation to f. #' #' 1. Start with initial solution guess \x^{(0)}, a step-size \eta (learning step), and a tolerance level (\epsilon), #' 2. Iterate for k=1,2,3,\ldots: #' - Compute the gradient g=\nabla f(x^{(k-1)}), Hessian #' H=\nabla^2 f(x^{(k-1)}) #' - If gradient is close to zero (H<\epsilon), then stop, otherwise continue on #' - Update the solution to x^{(k)} = x^{(k-1)} - \eta H^{-1} g #' 3. Return final x^{(k)} as approximate solution x^*. #' #' ### Advantages and disadvantages of the Newton's method #' #' Advantages: #' #' - Converges faster to a solution than gradient descent, because jointly, the Hessian and #' gradient together point in a more reliable direction than the gradient does alone #' - For nice functions, the expected number of iterations to get f(x^{(k)})-f(x^*) \leq \epsilon, is k\approx \log\log(1/\epsilon) iterations #' - Typically, it requires fewer iterations than gradient descent #' - For quadratic functions, it converges in just one step #' #' Disadvantages: #' #' - In principle, each Newton iteration is much more expensive than its gradient descent counterpart #' than a single gradient descent update. Requires inverting the Hessian #' - If the Hessian isn't invertible (or is close to singular), the algorithm may fail. #' #' #' ## Stochastic Gradient Descent #' #' Recall from [Chapter 4](http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/04_LinearAlgebraMatrixComputing.html), that to estimate the linear regression coefficients on p variables, we minimize the fidelity term: #' #'$$ f(\beta) = \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \beta)^2. $$#' For observed responses y_i and predictors x_i, i=1,\ldots n, the fidelity represents the standard least squares loss. When n,p are reasonable in size, then we can just solve this using linear algebra: #' #'$$\beta^* = (X^T X)^{-1} X^T y,$$#' where X_{n\times p} (design) predictor matrix (with ith row x_i), and #' y_{n\times 1} is the response vector (with ith component y_i). #' #' For extreme sample or parameter sizes, e.g. when n=10^k, it me be difficult to compute X^T X, #' which will impede the computational solution. Gradient descent may work, but may also be expensive, because the gradient \nabla f(\beta) = \frac{1}{n} \sum_{i=1}^n x_i(x_i^T \beta - y_i) involves a sum of n terms. #' #' This problem lead to the development of *stochastic methods* that overcome that challenge of large-scale statistical optimization. The idea is to simplify the calculations so that each time we need to compute a complex gradient or a Hessian, we can approximate it by a simpler analogue computed by *random estimation*. #' #' For instance, instead of updating the solution using #'$$\beta - t \nabla f(\beta) = \beta - \frac{t}{n} \sum_{i=1}^n x_i(x_i^T \beta - y_i), $$#' #' which may be prohibitively expensive, we can update the solution by #'$$\beta - t g = \beta - \frac{t}{m} \sum_{i \in I} x_i (x_i^T \beta - y_i).$$#' #' The range of the sum above is reduced to I\subset\{1, 2, ..., n\}, which represents a random subset of \{1,\ldots n\} of size m\ll n. In other words, we've replaced the *full gradient* by a stochastic gradient estimate: #'$$g = \frac{1}{m} \sum_{i \in I} x_i(x_i^T \beta - y_i). #' Note that the latter is an unbiased estimate of the former and is computed over just $m$ data points. This is much more computationally efficient when $m$ is relatively small. In general, stochastic gradient descent makes rapid *initial* progress in minimizing the objective function at the start, however, it may also take longer to get to highly accurate solutions at the end. #' #' ### Stochastic Gradient Descent: Pseudo code #' #' Let's demonstrate the pseudo code implementation of stochastic gradient descent for high-dimensional data with a large number of predictors, $p$. For both statistical as well as optimization reasons, complete search over all $p$ regression parameters may be impractical. In a statistical sense, doing a full search may yield bad estimates (e.g., with high variance), and in a computational sense, the algorithm may be very expensive and slow to converge. Therefore, when we have large number of features ($p\gg 10^m$), it's likely that most of the predictors may be less important and only a few may be salient features. We can ignore the small effects by shrinking them to zero, just like we did in [Chapter 17](http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/17_RegularizedLinModel_KnockoffFilter.html) for regularized linear modeling. Below is an example of a *sparse* stochastic gradient decent pseudo code: #' #' 1. Start with initial guess $\beta^{(0)}$, step-size $\eta$, and and tolerance level $\epsilon$ #' 2. For $k=1,2,3,\ldots$: #' - Randomly shuffle indices of the cases and select the reduced index set, $I\subset\{1, 2, ..., n\}$ #' - Approximate the true gradient $\nabla f(\beta^{(k-1)})$, by $g = \frac{1}{m} \sum_{i \in I} x_i(x_i^T \beta - y_i)$ #' - Check if gradient is close to zero; is so stop, otherwise continue #' - Update $\beta^{(k)} = \beta^{(k-1)} - \eta \nabla f(\beta^{(k-1)})$ #' - Correct $\beta^{(k)}$ by thresholding the small components to zero #' 3. Return final $\beta^{(k)}$ as approximate solution $\beta^*$. #' #' Note that if all $\beta$ get smaller at the same time that we shrink the small ones to zero, then this will guarantee the converge of the algorithm and will yield the *LASSO* solution. #' #' ## Simulated Annealing (SANN) #' #' Simulated annealing is another approach for *probabilistic* approximation of the global optimum of an objective function over a large domain (search space). It is very effective for discrete search spaces and for problems where quick identification of an approximate global optimum is more important than finding an accurate local optimum. *Annealing* is a term used in in metallurgy to control repeated heating and cooling of materials to increase the size or reduce defects in the materials. In SANN, the *simulation of annealing* refers to finding approximations of the global minimum of the cost function with a large number of variables by equilibration (annealing) and a simulated walk through the solution space to slow decrease in the probability of accepting worse solutions compared to previously censored solutions. #' #' ### Pseudocode #' #' Below is a [simulated annealing pseudocode](https://en.wikipedia.org/wiki/Simulated_annealing) that starts from a state $s_o$ and continues until a maximum of $k_{max}$ steps are completed. At each step, the call neighbour(s) call generates a randomly chosen neighbor of a given state $s$, and random(0, 1) returns a random Uniform distribution value. The annealing schedule is defined by temperature(r), which yields the temperature to use, given the fraction $r$ of the time budget that has been expended so far. #' #' 1. Initialize the algorithm by setting $s=s_o$ #' 2. For $k\in \{0, ..., k_{max}\}$: #' - $T\longleftarrow temperature(k/k_{max})$ #' - Pick a random neighbor, $s_{new} \longleftarrow neighbor(s)$ #' - If $P(E(s), E(s_{new}), T) \ge random(0, 1)$, then $s \longleftarrow s_{new}$ #' 3. Output: the final state, $s$. #' #' Below is a SANN example to optimize $f(x_1,x_2) = (x_1 - 3)^2 + (x_2 +4)^2$. #' #' f <- function(x) { (x[1] - 3)^2 + (x[2] +4)^2 } initial_x <- c(0, -1) x_optimal <- optim(initial_x, f, method="SANN") # performs Simulated Annealing minimization x_min <- x_optimal$par # x_min contains the domain values where the (local) minimum is attained x_min # critical point/vector x_optimal$value #' #' #' #' # References #' #' * The [Efficient R programming](https://csgillespie.github.io/efficientR/performance.html), by Colin Gillespie & Robin Lovelace, includes a number of tricks to improve performance and enhance computational processing. #' * [CRAN Optimization & Math Programming Site](http://cran.r-project.org/web/views/Optimization.html) provides details about a broad range of R optimization functions. #' * [Vincent Zoonekynd's Optimization Blog](http://zoonek.free.fr/blosxom/R/2012-06-01_Optimization.html). #' #' #'
#'
#' SOCR Resource #' Visitor number #' #' #'
#'
#' #' #' #' #' #' #' #' #' #' #' #'
#'