# Symbolic regression using Genetic Programming...in python...

Performing Expression hunting using symbolic regression in python using DEAP package for evolutionary algorithms!

Let's say that we have an expression, and that expression generates some data points *P.* Now, we forget about the expression. Can we ever get back the expression that generated those points only using the points? We can surely design a neural network which simulates that function, but that is not exactly like having an expression for the function.

Disclaimer: The code shown in this blog is brazenly taken from DEAP Symbolic Regression Example Source File. I shall write my own version of symbolic regression soon.

The answer to the question is YES, if the original expression is simple enough! This can be done using the sophisticated algorithm of **Genetic Programming**. Genetic Programming is like asking a computer to write a program which gives us a required output! Ring a bell if you have used *Github Copilot*? But that is a completely different technology, which uses a large transformers based language model, GPT-3.

Genetic programming is a branch of more general set of techniques called *Evolutionary Algorithms.*** Symbolic regression** intends to find an expression that satisfies a set of data points given to the system.

Note: If you are unfamiliar with genetic algorithms, please consider reading my article on solving non-linear equations using genetic algoriothms.

## The Problem that we are trying to solve here!

Say that somehow, we have stumbled upon a **black box**, which acts like a **function**, takes some **input** in and give some **output** out. All we can do is plot the data on a graph and from there we want an expression that gives us the same output.

For example, let's say that there is a task that no one knows how to solve, so we train a **Neural Network** to learn to do that from the data.

This neural netork is like a black box, it can be used to take inputs and give us output, but again, we get to learn nothing from this. This is premise we are in and the absense of the expression is the problem.

## Genetic Programming

A technique to ask a computer to write a program which fits some output when executed. The way it is done is very similar Genetic Algorithms, but with a slight difference, here our population is the population of code.

As we all must know, that any code can be translated into a tree. This happens every time we compile our code, as it is turned into an Abstract Syntax Tree in an intermediate process.

This program can be thought as a single **individual** in the **population** of programs. You can define an **evaluate** function, which calculates, for each **individual**, how close is its output to the target data. Initially, we **randomly** create these program trees out of a **set of** **primitives,** or you can call **operators, **and some inputs.

Every operator takes some input, for example, addition takes in 2 numbers to add, therefore, the primitive for addition(+) will have 2 children, same with multiplication. The tree can be **flattened** into a **sequence** because we know how many children a node must have.

This sequence of the flattened tree can be used to generate more sequences by applying the evolutionary operators like cross-over,mating, mutation, etc.

I am explaining how **Genetic Programming** works on programs but expressions are not much different than programs if you think about it.

## DEAP: Python module for Evolutionary Algorithms

From the official github page,

DEAP is a novel evolutionary computation framework for rapid prototyping and testing of ideas. It seeks to make algorithms explicit and data structures transparent.

Deap will make it extremely simple to implement the concepts of Genetic Programming. Go ahead and install the package in python,

`$ pip install deap`

In this code example we shall solve for the expression:

The way the module **DEAP** works is that we prepare some data structures and then pass them to the algorithm class which does everything for us.

These data structures include **PrimitiveSet**, **Creator** object and **Toolbox** object. The **primitive** set will contains the **operators** to be used as tree components while creating an **individual**. **Creator** will create those individuals with corresponding some random **fitness**.

Toolbox will include all the functions that are needed by the algorithm to work, for example, there will be a function to **evaluate** the fitness of a particular individual, one for **mating, **one for **mutation**, one for **selection **and some related to the statistics of the population.

### Step 0. Adding the imports

Make sure you have installed all the packages we are importing here,

import operator import math import random import numpy from deap import algorithms from deap import base from deap import creator from deap import tools from deap import gp

### Step 1. Build a set of Primitives(operators and constants)

Primitives are what used to build the **Expression Tree. **Expression tree is just like a program tree for our case because of the task at hand does not require loops. We use a class `gp.PrimitiveSet`

provided in `deap`

to initialize. Give is a name(*Main*) and the number of inputs(*1*).

`pset = gp.PrimitiveSet("MAIN", 1)`

Rename the input to `x`

`pset.renameArguments(ARG0='x')`

Now, add some operator in the `pset`

object by providing a **function** for those operators and thier number of **inputs**,

# addPrimitive(function, No. inputs for that function) pset.addPrimitive(operator.add, 2) pset.addPrimitive(operator.sub, 2) pset.addPrimitive(operator.mul, 2)

Let's add a modified version of divide operator, one which does not give **ZeroDivisionError** exception.

def protectedDiv(left, right): try: return left / right except ZeroDivisionError: return 1 pset.addPrimitive(protectedDiv, 2)

We have added the basic operators, now shall add some trigonometric ones,

pset.addPrimitive(math.cos, 1) pset.addPrimitive(math.sin, 1)

and one for negation, like -2 to 2 or 3 to -3.

`pset.addPrimitive(operator.neg, 1)`

Some variables are needed to be created at the time of tree creation, like whenever a new generation individual is born some random value is needed, for that we need **EphemeralConstant.**

`pset.addEphemeralConstant("rand101", lambda: random.randint(-1,1))`

### Step 2. Adding Creator and Toolbox Object

We create classes inside the `creator`

object which can be initialized and instancialized later. We specify the base class and the parameters that it will take.

creator.create("FitnessMin", base.Fitness, weights=(-1.0,)) creator.create("Individual", gp.PrimitiveTree, fitness=creator.FitnessMin)

The `base`

has a base class for a fitness class called `base.fitness`

, we create a new class `FitnessMin`

inside `creator`

namespace with a member of the class called `weights`

with the value `(-1.0)`

. This **FitnessMin** class will be used as member in the `Individual`

class as `fitness`

in the 2nd line. The **Individual** class extends the `gp.PrimitiveTree`

because each **individual** is a **PrimitiveTree. **

Now, for Toolbox,

toolbox = base.Toolbox()

Add tools to our `toolbox`

class,

toolbox.register("expr", gp.genHalfAndHalf, pset=pset, min_=1, max_=2) toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.expr) toolbox.register("population", tools.initRepeat, list, toolbox.individual)

Here, we register 3 functions, `expr`

, `individual`

and `population`

to create **expressions**(line#1) out of the `pset`

object, then those expressions are passed in the `creator.Individual`

class to generate an **individual**(line#2), then a list of those generated individuals create the **population**(line#3)

We need a function which evaluates an individual fitness, and then add it to our `toolbox`

,

toolbox.register("compile", gp.compile, pset=pset) def evalSymbReg(individual, points): # Transform the tree expression in a callable function func = toolbox.compile(expr=individual) # Evaluate the mean squared error between the expression # and the real function : x**4 + x**3 + x**2 + x sqerrors = ((func(x) - x**4 - x**3 - x**2 - x)**2 for x in points) #return Average error over the range: -1 to 1 return math.fsum(sqerrors) / len(points), # range: -1 to 1 at 0.1 gap toolbox.register("evaluate", evalSymbReg, points=[x/10. for x in range(-10,10)])

Here, we have included a new function `compile`

which is an inbuilt function inside the module, used to compile the individual as a callable function.

Now, this callable function `func`

should mimic the required expression so we take the difference of the returned value, `func(x)`

, from the **expression** for the range -1 to 1 and keep the mean square error for each x. We return the average of this value which shows how strong the target data matched the expression geberated by the individual.

We then add it to the `toolbox`

object as `evaluate`

.

Add some functions to perform the evolution using inbuilt functions from `gp`

and `tools`

class,

toolbox.register("select", tools.selTournament, tournsize=3) toolbox.register("mate", gp.cxOnePoint) toolbox.register("expr_mut", gp.genFull, min_=0, max_=2) toolbox.register("mutate", gp.mutUniform, expr=toolbox.expr_mut, pset=pset)

### Step 3. Main Function

Calling the algorithm class with the prepared `toolbox`

class is straight forward,

def main(): random.seed(320) pop = toolbox.population(n=1000) hof = tools.HallOfFame(1) pop, log = algorithms.eaSimple(pop, toolbox, 0.5, 0.1, 40, halloffame=hof, verbose=False) print("Most successful individual tree/expression found is:",hof[0]) return if __name__ == "__main__": main()

We create a population of 1000, a variable to hold the best solution called the hall of fame,** **`hof`

. Initialize an aglorithm `eaSimple`

which stands for **Evolutionary Algorithm Simple**. Under the hood, it is simply applying the function we put inside the toolbox, selecting, mating, mutating, and repeating till our generation is good enough.

The psuedocode for the `eaSimple`

is as follows,( `varAnd()`

applies **crossover** and **mutation**)

evaluate(population) for g in range(ngen): population = select(population, len(population)) offspring = varAnd(population, toolbox, cxpb, mutpb) evaluate(offspring) population = offspring

## The OUTPUT

Run the code by pasting it all in a file called, `gp_sr.py`

.

`$ python3 gp_sr.py`

which gives the output,

`mul(add(mul(x, x), 1), mul(add(x, 1), x))`

which, upon simplification, is exactly the same expression as

## Conclusion

Genetic Programming concepts can be used to identify an expression which represents the data points in question. It does so by creating a population of simple random expression made out of the primitive set and then use the operators of evolutionary algorithms to select, mate, and generate new offsprings of expressions which are slightly better than their parents. This converges to the actual expression after many generations and we find our solution **Expression** at the end.

DEAP package makes it really simple to implement all the above.