Symbolic regression using Genetic Programming...in python...
Performing Expression hunting using symbolic regression in python using DEAP package for evolutionary algorithms!
Let's say that we have an expression, and that expression generates some data points P. Now, we forget about the expression. Can we ever get back the expression that generated those points only using the points? We can surely design a neural network which simulates that function, but that is not exactly like having an expression for the function.
Disclaimer: The code shown in this blog is brazenly taken from DEAP Symbolic Regression Example Source File. I shall write my own version of symbolic regression soon.
The answer to the question is YES, if the original expression is simple enough! This can be done using the sophisticated algorithm of Genetic Programming. Genetic Programming is like asking a computer to write a program which gives us a required output! Ring a bell if you have used Github Copilot? But that is a completely different technology, which uses a large transformers based language model, GPT-3.
Genetic programming is a branch of more general set of techniques called Evolutionary Algorithms. Symbolic regression intends to find an expression that satisfies a set of data points given to the system.
Note: If you are unfamiliar with genetic algorithms, please consider reading my article on solving non-linear equations using genetic algoriothms.
The Problem that we are trying to solve here!
Say that somehow, we have stumbled upon a black box, which acts like a function, takes some input in and give some output out. All we can do is plot the data on a graph and from there we want an expression that gives us the same output.
For example, let's say that there is a task that no one knows how to solve, so we train a Neural Network to learn to do that from the data.
This neural netork is like a black box, it can be used to take inputs and give us output, but again, we get to learn nothing from this. This is premise we are in and the absense of the expression is the problem.
Genetic Programming
A technique to ask a computer to write a program which fits some output when executed. The way it is done is very similar Genetic Algorithms, but with a slight difference, here our population is the population of code.
As we all must know, that any code can be translated into a tree. This happens every time we compile our code, as it is turned into an Abstract Syntax Tree in an intermediate process.
This program can be thought as a single individual in the population of programs. You can define an evaluate function, which calculates, for each individual, how close is its output to the target data. Initially, we randomly create these program trees out of a set of primitives, or you can call operators, and some inputs.
Every operator takes some input, for example, addition takes in 2 numbers to add, therefore, the primitive for addition(+) will have 2 children, same with multiplication. The tree can be flattened into a sequence because we know how many children a node must have.
This sequence of the flattened tree can be used to generate more sequences by applying the evolutionary operators like cross-over,mating, mutation, etc.
I am explaining how Genetic Programming works on programs but expressions are not much different than programs if you think about it.
DEAP: Python module for Evolutionary Algorithms
From the official github page,
DEAP is a novel evolutionary computation framework for rapid prototyping and testing of ideas. It seeks to make algorithms explicit and data structures transparent.
Deap will make it extremely simple to implement the concepts of Genetic Programming. Go ahead and install the package in python,
$ pip install deap
In this code example we shall solve for the expression:
The way the module DEAP works is that we prepare some data structures and then pass them to the algorithm class which does everything for us.
These data structures include PrimitiveSet, Creator object and Toolbox object. The primitive set will contains the operators to be used as tree components while creating an individual. Creator will create those individuals with corresponding some random fitness.
Toolbox will include all the functions that are needed by the algorithm to work, for example, there will be a function to evaluate the fitness of a particular individual, one for mating, one for mutation, one for selection and some related to the statistics of the population.
Step 0. Adding the imports
Make sure you have installed all the packages we are importing here,
import operator import math import random import numpy from deap import algorithms from deap import base from deap import creator from deap import tools from deap import gp
Step 1. Build a set of Primitives(operators and constants)
Primitives are what used to build the Expression Tree. Expression tree is just like a program tree for our case because of the task at hand does not require loops. We use a class gp.PrimitiveSet
provided in deap
to initialize. Give is a name(Main) and the number of inputs(1).
pset = gp.PrimitiveSet("MAIN", 1)
Rename the input to x
pset.renameArguments(ARG0='x')
Now, add some operator in the pset
object by providing a function for those operators and thier number of inputs,
# addPrimitive(function, No. inputs for that function) pset.addPrimitive(operator.add, 2) pset.addPrimitive(operator.sub, 2) pset.addPrimitive(operator.mul, 2)
Let's add a modified version of divide operator, one which does not give ZeroDivisionError exception.
def protectedDiv(left, right): try: return left / right except ZeroDivisionError: return 1 pset.addPrimitive(protectedDiv, 2)
We have added the basic operators, now shall add some trigonometric ones,
pset.addPrimitive(math.cos, 1) pset.addPrimitive(math.sin, 1)
and one for negation, like -2 to 2 or 3 to -3.
pset.addPrimitive(operator.neg, 1)
Some variables are needed to be created at the time of tree creation, like whenever a new generation individual is born some random value is needed, for that we need EphemeralConstant.
pset.addEphemeralConstant("rand101", lambda: random.randint(-1,1))
Step 2. Adding Creator and Toolbox Object
We create classes inside the creator
object which can be initialized and instancialized later. We specify the base class and the parameters that it will take.
creator.create("FitnessMin", base.Fitness, weights=(-1.0,)) creator.create("Individual", gp.PrimitiveTree, fitness=creator.FitnessMin)
The base
has a base class for a fitness class called base.fitness
, we create a new class FitnessMin
inside creator
namespace with a member of the class called weights
with the value (-1.0)
. This FitnessMin class will be used as member in the Individual
class as fitness
in the 2nd line. The Individual class extends the gp.PrimitiveTree
because each individual is a PrimitiveTree.
Now, for Toolbox,
toolbox = base.Toolbox()
Add tools to our toolbox
class,
toolbox.register("expr", gp.genHalfAndHalf, pset=pset, min_=1, max_=2) toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.expr) toolbox.register("population", tools.initRepeat, list, toolbox.individual)
Here, we register 3 functions, expr
, individual
and population
to create expressions(line#1) out of the pset
object, then those expressions are passed in the creator.Individual
class to generate an individual(line#2), then a list of those generated individuals create the population(line#3)
We need a function which evaluates an individual fitness, and then add it to our toolbox
,
toolbox.register("compile", gp.compile, pset=pset) def evalSymbReg(individual, points): # Transform the tree expression in a callable function func = toolbox.compile(expr=individual) # Evaluate the mean squared error between the expression # and the real function : x**4 + x**3 + x**2 + x sqerrors = ((func(x) - x**4 - x**3 - x**2 - x)**2 for x in points) #return Average error over the range: -1 to 1 return math.fsum(sqerrors) / len(points), # range: -1 to 1 at 0.1 gap toolbox.register("evaluate", evalSymbReg, points=[x/10. for x in range(-10,10)])
Here, we have included a new function compile
which is an inbuilt function inside the module, used to compile the individual as a callable function.
Now, this callable function func
should mimic the required expression so we take the difference of the returned value, func(x)
, from the expression for the range -1 to 1 and keep the mean square error for each x. We return the average of this value which shows how strong the target data matched the expression geberated by the individual.
We then add it to the toolbox
object as evaluate
.
Add some functions to perform the evolution using inbuilt functions from gp
and tools
class,
toolbox.register("select", tools.selTournament, tournsize=3) toolbox.register("mate", gp.cxOnePoint) toolbox.register("expr_mut", gp.genFull, min_=0, max_=2) toolbox.register("mutate", gp.mutUniform, expr=toolbox.expr_mut, pset=pset)
Step 3. Main Function
Calling the algorithm class with the prepared toolbox
class is straight forward,
def main(): random.seed(320) pop = toolbox.population(n=1000) hof = tools.HallOfFame(1) pop, log = algorithms.eaSimple(pop, toolbox, 0.5, 0.1, 40, halloffame=hof, verbose=False) print("Most successful individual tree/expression found is:",hof[0]) return if __name__ == "__main__": main()
We create a population of 1000, a variable to hold the best solution called the hall of fame, hof
. Initialize an aglorithm eaSimple
which stands for Evolutionary Algorithm Simple. Under the hood, it is simply applying the function we put inside the toolbox, selecting, mating, mutating, and repeating till our generation is good enough.
The psuedocode for the eaSimple
is as follows,( varAnd()
applies crossover and mutation)
evaluate(population) for g in range(ngen): population = select(population, len(population)) offspring = varAnd(population, toolbox, cxpb, mutpb) evaluate(offspring) population = offspring
The OUTPUT
Run the code by pasting it all in a file called, gp_sr.py
.
$ python3 gp_sr.py
which gives the output,
mul(add(mul(x, x), 1), mul(add(x, 1), x))
which, upon simplification, is exactly the same expression as
Conclusion
Genetic Programming concepts can be used to identify an expression which represents the data points in question. It does so by creating a population of simple random expression made out of the primitive set and then use the operators of evolutionary algorithms to select, mate, and generate new offsprings of expressions which are slightly better than their parents. This converges to the actual expression after many generations and we find our solution Expression at the end.
DEAP package makes it really simple to implement all the above.