An artificial neural network is made of layers, and a layer is made of many perceptrons (a.k.a. neurons). The perceptron is the basic computational unit of the neural network: it multiplies the inputs by weights, adds a bias, and passes the result through the activation function to deliver the output to the next layer.

In this blog, we will first design a single-layer perceptron model for learning the logical AND and OR gates. Then we will design a multi-layer perceptron for learning the XOR gate's properties. While creating these perceptrons, we will see why we need multi-layer neural networks.

A single-layer perceptron contains an input layer with as many neurons as there are features in the dataset, followed by an output layer with as many neurons as there are target classes. Single-layer perceptrons can separate linearly separable datasets like the AND and OR gates. In contrast, a multi-layer perceptron is used when the dataset contains non-linearity. Apart from the input and output layers, an MLP (short for multi-layer perceptron) has hidden layers between the input and output layers. These hidden layers help in learning the complex patterns in our data points.

Logic gates are the basic building blocks of digital circuits. They decide which set of input signals will trigger the circuit using boolean operations. They commonly have two inputs in binary form (0,1) and produce one output (0 or 1) after performing a boolean operation.

**Graph insights:**

- A single straight line can easily separate the data points of OR and AND, so we can use a single-layer perceptron here.
- To separate the data points of XOR, however, we need two straight lines, or we can add a new dimension and then separate them using a plane. A multi-layer Perceptron will work better in this case. In the later part of this blog, we will see how SLP fails in learning XOR properties and will implement MLP for it.
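One way to check these observations numerically is a small brute-force search: try many candidate lines `w1*x1 + w2*x2 + b = 0` and see whether any of them puts all the 1s on one side and all the 0s on the other. The grid of candidate values and the helper name `find_separating_line` are assumptions made for this sketch. Finding no line on a finite grid is not a proof, but it matches the known fact that XOR is not linearly separable.

```python
import itertools
import numpy as np

# The four possible inputs of a two-input logic gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
GATES = {
    "AND": np.array([0, 0, 0, 1]),
    "OR": np.array([0, 1, 1, 1]),
    "XOR": np.array([0, 1, 1, 0]),
}

def find_separating_line(y, grid=np.arange(-2, 2.5, 0.5)):
    """Return (w1, w2, b) of a line that separates the classes, or None."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        # Points with w1*x1 + w2*x2 + b >= 0 are labelled 1, the rest 0
        pred = (X @ np.array([w1, w2]) + b >= 0).astype(int)
        if np.array_equal(pred, y):
            return float(w1), float(w2), float(b)
    return None

for name, y in GATES.items():
    print(name, find_separating_line(y))
```

For AND and OR the search returns some line, while for XOR it returns `None`.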

Let's understand the neural network training process and see how Perceptron maps these lines over the data points.

Writing a neural network entirely from scratch can get lengthy and hard to read. To avoid this complexity, data professionals use Python libraries and frameworks to implement models. But since we are designing an elementary neural network, we will build it without a framework like TensorFlow or PyTorch. We will take the help of NumPy, a Python library known for its mathematical operations and multidimensional arrays. Then we will switch to Keras for building the multi-layer Perceptron.

So let's start!

```
# Install NumPy first by running `pip install numpy` in a terminal
import numpy as np
```

The input for the logic gate consists of two values, each either T or F. T is for true and F for false, equivalent to the binary values 1 and 0. Input is fed to the neural network in the form of a matrix, so we have to define the dimensions (a.k.a. shape) of the input and output matrices. Each input sample has shape (1, 2) because one input set has two values, and each output has shape (1, 1). With all four samples stacked, X has shape (4, 2) and Y has shape (4, 1).

```
T=1.0
F=0.0
# creating data for logical OR operation
def get_OR_data():
    X=[
        [F,F],
        [F,T],
        [T,F],
        [T,T]
    ]
    Y=[
        [F],
        [T],
        [T],
        [T]
    ]
    # return NumPy arrays so that fit() can read X.shape
    return np.array(X),np.array(Y)
X,Y=get_OR_data()
```

We have defined the `get_OR_data` function for fetching inputs and outputs. Similarly, we can define `get_AND_data` and `get_XOR_data` functions using the same set of inputs.

Now, we will define a class MyPerceptron to include various functions which will help the model to train and test. The first function will be a constructor to initialize the parameters like learning rate, epochs, weight, and bias.

```
class MyPerceptron:
    def __init__(self,learning_rate=0.1,n_iterations=1000):
        self.lr=learning_rate
        self.epochs=n_iterations
        self.weights=None
        self.bias=None
```

The second function is divided into four stages:

- **Defining the weight and bias.**
- **Deciding the activation function.**
- **Deciding the loss function.**
- **Deciding the optimization algorithm.**

```
    def fit(self,X,Y):
        # Defining the shape of weight and bias: one weight per input feature
        self.weights=np.zeros(X.shape[1])
        self.bias=0
        # training the model on X and Y
        for epoch in range(self.epochs):
            for i in range(X.shape[0]):
                # Applying the activation function to the weighted sum
                Y_pred=self.Step_activ_func(np.dot(self.weights,X[i]) + self.bias)
                # Computing the error (actual minus predicted) for this sample
                error=Y[i]-Y_pred
                # Updating the weight and bias using the update rule
                self.weights=self.weights + self.lr*error*X[i]
                self.bias=self.bias + self.lr*error
```

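The basic rule of matrix multiplication says that X and W can be multiplied only if the shape of X is (m, n) and the shape of W is (n, k); the shape of XW will then be (m, k). Keeping this in mind, the weight matrix W will be (2, 1). Similarly, the shape of the bias will be (1, 1). In the code above, the weights are stored as a flat NumPy vector of length 2 and the bias as a scalar, which gives the same result for a single output neuron.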
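The shape rule can be checked quickly in NumPy. This is only an illustrative sketch: the names `W`, `b`, and `Z` are chosen here, and the four rows of `X` are the logic-gate inputs.

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # shape (4, 2)
W = np.zeros((2, 1))   # one weight per input feature
b = np.zeros((1, 1))   # a single bias, broadcast over all rows

Z = X @ W + b          # (4, 2) times (2, 1) gives (4, 1)
print(X.shape, W.shape, Z.shape)  # (4, 2) (2, 1) (4, 1)
```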

An activation function transforms the weighted sum computed by a neuron into its output. Many activation functions bound this output, for example to the range (0, 1), although not all of them do: the range of ReLU is [0, infinity). Bounding the output helps keep the values flowing through the network in a manageable range. The other job of the activation function is to introduce non-linearity, so that the model becomes capable of learning complex patterns in the dataset. Let's look at some well-known activation functions.

- **Unit step**: 1 for x ≥ 0 and 0 for x < 0. The advantage of this function is its easy implementation.
- **Sigmoid**: This function is mainly used when the output values are probabilities because its output is confined to the range (0, 1). It is differentiable and widely used in neural networks.
- **ReLU (Rectified Linear Unit) function**: In electrical terms, a rectifier is a device that converts alternating current (AC) to direct current (DC). By the same analogy, the ReLU function eliminates the negative side and passes only the positive side: its value is zero for all negative inputs and equal to the input for positive inputs.
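As a rough illustration, the three functions can be written in a few lines of NumPy. The function names below are chosen for this sketch and are not part of the Perceptron class.

```python
import numpy as np

def unit_step(x):
    # 1 where x >= 0, otherwise 0
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(unit_step(x))
print(sigmoid(x))
print(relu(x))
```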

We will use the Unit step activation function to keep our model simple and similar to traditional Perceptron.

```
    def Step_activ_func(self,activation):
        if(activation>=0):
            return 1
        else:
            return 0
```

After passing the neuron output through the activation function, we must calculate the error between the predicted and actual output. The functions used to calculate this error are called loss functions. While training the neural network, we try to minimize the aggregate (sum or average) of the loss over all the samples, which is called the cost function in Machine Learning. Some of the well-known cost functions in neural networks are:

- **Mean absolute error (MAE)**: The average of the absolute differences between actual and predicted output. This loss function is preferred when we try to solve a regression problem.
- **Binary Cross Entropy (BCE)**: This loss function is used for binary classification problems, and the most commonly used activation function with BCE is the sigmoid function. We will use this in the multi-layer Perceptron.
- **Categorical Cross Entropy**: This is used for multi-class classification problems and is typically paired with the softmax activation function. To know more about it, you can refer to this blog.
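To make the first two definitions concrete, here is a minimal NumPy sketch. The function names, the sample values, and the small clipping constant `eps` (used to avoid taking the log of zero) are assumptions for illustration only.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    # Average of the absolute differences
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions so that log(0) never occurs
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.2, 0.6, 0.4])
print(mean_absolute_error(y_true, y_pred))
print(binary_cross_entropy(y_true, y_pred))
```

Both values shrink toward zero as the predictions move closer to the true labels.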

For the single-layer perceptron, we will use the simplest form of this idea: the difference between the actual and predicted output for each sample.

After everything is in place, our goal is to optimize the performance. To fulfil this goal, we need an optimization algorithm. It starts with initial weight and bias values (often random) and updates them after every iteration to minimize the error. Some of the most well-known optimization algorithms are:

- **Gradient Descent**: This technique uses the gradient of the cost function (the average of the losses over all samples) to update the parameters until the cost is minimized. Gradient descent is easy to implement and works best with convex functions.
- **Stochastic Gradient Descent (SGD)**: Gradient descent calculates the gradient of the cost function over all data points, whereas SGD randomly selects one sample and calculates its gradient. This makes each SGD update faster and less memory-consuming.
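The difference between the two can be sketched on a toy problem: learning a single weight `w` so that `w * x` matches `y = 2 * x`. The data, learning rate, number of steps, and squared-error loss are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                      # the true weight is 2
lr = 0.01

def gradient(w, xs, ys):
    # Gradient of 0.5 * mean((w*x - y)^2) with respect to w
    return np.mean((w * xs - ys) * xs)

w_gd, w_sgd = 0.0, 0.0
for step in range(500):
    # Gradient descent: one update from the gradient over all samples
    w_gd -= lr * gradient(w_gd, x, y)
    # SGD: one update from the gradient of a single random sample
    i = rng.integers(len(x))
    w_sgd -= lr * gradient(w_sgd, x[i:i + 1], y[i:i + 1])

print(w_gd, w_sgd)
```

Both weights end up close to 2, but each SGD step looks at only one sample instead of the whole dataset.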

We are using a simpler optimization technique here. We will update the parameters using the rule below.

**W_new = W_old + learning_rate * error * X**

This is the equation we arrive at when we work through the mathematics of gradient descent and calculate all the terms involved. The bias is updated in the same way, without the X term. To understand how we reached this final result, see this blog.

The learning rate determines how much weight and bias will be changed after every iteration so that the loss will be minimized, and we have set it to 0.1.
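To see the update rule in action, here is a small sketch that applies it by hand to the first two rows of the OR truth table, starting from zero weights and bias with a learning rate of 0.1. The helper name `step` and the loop structure are chosen for this illustration.

```python
import numpy as np

lr = 0.1
w, b = np.zeros(2), 0.0

def step(z):
    # Unit step activation
    return 1 if z >= 0 else 0

# The first two rows of the OR truth table: (input, expected output)
samples = [(np.array([0.0, 0.0]), 0), (np.array([0.0, 1.0]), 1)]
for x, y in samples:
    y_pred = step(np.dot(w, x) + b)
    error = y - y_pred
    w = w + lr * error * x
    b = b + lr * error
    print("x:", x, "predicted:", y_pred, "error:", error, "w:", w, "b:", b)
```

For the first sample the prediction is 1 instead of 0, so the bias drops to -0.1 while the weights stay at zero (the inputs are zero). For the second sample the prediction is 0 instead of 1, so the second weight rises to 0.1 and the bias returns to 0.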

We have implemented all the functions of the Perceptron, and now it's time to train. But before that, we have to define one more parameter: epochs. The number of epochs determines how many times the model is trained on the entire dataset. We have already initialized the epoch value in the constructor of the MyPerceptron class.
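Putting the pieces together, the whole class might look like the sketch below. The `predict` method is an assumption: it is called in the training code but not shown in this article, so a straightforward version is filled in here that applies the learned weights and the step function to every row.

```python
import numpy as np

class MyPerceptron:
    def __init__(self, learning_rate=0.1, n_iterations=1000):
        self.lr = learning_rate
        self.epochs = n_iterations
        self.weights = None
        self.bias = None

    def Step_activ_func(self, activation):
        return 1 if activation >= 0 else 0

    def fit(self, X, Y):
        self.weights = np.zeros(X.shape[1])
        self.bias = 0.0
        for _ in range(self.epochs):
            for x_i, y_i in zip(X, Y):
                y_pred = self.Step_activ_func(np.dot(self.weights, x_i) + self.bias)
                error = y_i[0] - y_pred
                self.weights = self.weights + self.lr * error * x_i
                self.bias = self.bias + self.lr * error

    def predict(self, X):
        # Assumed helper: apply the learned weights and step function to every row
        return np.array([[self.Step_activ_func(np.dot(self.weights, x_i) + self.bias)]
                         for x_i in X])

X_or = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y_or = np.array([[0.0], [1.0], [1.0], [1.0]])

clf_sketch = MyPerceptron()
clf_sketch.fit(X_or, Y_or)
print(clf_sketch.predict(X_or).ravel())  # [0 1 1 1]
```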

```
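# accuracy_score comes from scikit-learn (pip install scikit-learn)
from sklearn.metrics import accuracy_score

clf=MyPerceptron()
# fit() reads X.shape, so make sure the data are NumPy arrays
clf.fit(np.array(X),np.array(Y))
X_test=[
    [F,F],
    [T,F],
    [F,T],
    [F,F]
]
Y_test=[
    [F],
    [T],
    [T],
    [F]
]
X_test=np.array(X_test)
Y_test=np.array(Y_test)
# predict is assumed to apply the trained weights and step function to each test row
Y_predicted=clf.predict(X_test)
print(Y_predicted)
print(accuracy_score(Y_test,Y_predicted))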
```

**Testing Result OR: [[0], [1], [1], [0]]**

As we can see, the Perceptron predicted the correct output for logical OR. Similarly, we can train our Perceptron to predict for the AND and XOR operators. But there is a catch: while the Perceptron learns the correct mapping for AND and OR, it fails to map the output for XOR because the data points are not linearly separable, and hence we need a model which can learn this complexity. Adding a hidden layer will help the Perceptron to learn that non-linearity. This is why the concept of the multi-layer Perceptron came in. And now we are going to design one for XOR.
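The failure on XOR can be reproduced with a compact, self-contained sketch. The function `train_and_score` is a hypothetical helper that condenses the training loop of the single-layer Perceptron and reports accuracy on the same four rows it was trained on.

```python
import numpy as np

def train_and_score(X, Y, lr=0.1, epochs=1000):
    """Train a single-layer perceptron and return its accuracy on the training data."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, Y):
            y_pred = 1 if np.dot(w, x_i) + b >= 0 else 0
            error = y_i - y_pred
            w, b = w + lr * error * x_i, b + lr * error
    preds = np.array([1 if np.dot(w, x_i) + b >= 0 else 0 for x_i in X])
    return float(np.mean(preds == Y))

X_gate = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
for name, Y_gate in [("AND", [0, 0, 0, 1]), ("OR", [0, 1, 1, 1]), ("XOR", [0, 1, 1, 0])]:
    print(name, train_and_score(X_gate, np.array(Y_gate)))
```

AND and OR reach an accuracy of 1.0, while XOR stays below 1.0 no matter how many epochs are used, because no single line separates its classes.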

The design process will remain the same with one change: we will add a hidden layer between the input and output layers. We also need to define the activation function for each layer and the loss function, and update the parameters using the gradient descent optimization algorithm. So let's start.

Neural networks are more complex to code than classical machine learning models. If we compile the whole code of a single-layer perceptron, it will exceed 100 lines. To reduce the effort and increase the efficiency of the code, we will take the help of Keras, an open-source Python library built on top of TensorFlow.

```
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
import tensorflow as tf
import numpy as np
T=1.0
F=0.0
def get_XOR_data():
    X=[
        [F,F],
        [F,T],
        [T,F],
        [T,T]
    ]
    Y=[
        [F],
        [T],
        [T],
        [F]
    ]
    return X,Y
X,Y=get_XOR_data()
X_test=[
[T,F],
[T,T],
[F,T],
[F,F]
]
Y_test=[
[T],
[F],
[T],
[F]
]
```

The first step is to import all the modules and define training and testing data as we did for single-layer Perceptron.

For each layer, we need to decide the number of neurons and the activation function. For the first layer:

**The number of neurons**: We use 16, which gives the layer enough capacity to learn the distribution of the XOR data points.

**Activation Function:** ReLU, a common choice for layers before the output layer.

```
model=Sequential()
model.add(Dense(16,input_shape=(2,),activation='relu'))
```

The Sequential model indicates that data flows sequentially from one layer to the next. Dense defines a fully connected layer of the neural network, with parameters like the number of neurons, input_shape, and activation function.

The hidden layer performs non-linear transformations of the inputs and helps in learning complex relations. We will use 16 neurons and ReLU as the activation function for this layer.

`model.add(Dense(16,activation='relu'))`
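To see why a hidden layer with ReLU is enough for XOR, here is a tiny NumPy sketch with hand-picked (not learned) weights. The 2-2-1 architecture, the linear output, and the names `W1`, `b1`, and `W2` are assumptions for illustration; the Keras model in this article uses 16 neurons per layer and learns its own weights.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

X_in = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

# Hand-picked parameters for a network with 2 inputs, 2 hidden neurons, 1 output
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])      # input -> hidden, shape (2, 2)
b1 = np.array([0.0, -1.0])       # the second hidden neuron fires only when both inputs are 1
W2 = np.array([[1.0], [-2.0]])   # hidden -> output, shape (2, 1)

hidden = relu(X_in @ W1 + b1)    # non-linear transformation of the inputs
output = hidden @ W2             # h1 - 2*h2 reproduces XOR: 0, 1, 1, 0
print(output.ravel())
```

The first hidden neuron counts how many inputs are on, and the second one cancels the case where both are on, which is exactly the part a single straight line cannot handle.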

To design the output layer, we need to define the key constituents again first.

**Number of neurons:** The output layer has neurons equal to the number of output variables, which is one in our case.

**Activation Function:** We want an output that can be converted to either 0 or 1. The sigmoid function squashes the output into the range (0, 1), which can then be rounded to a class label.

**Loss Function:** We commonly use binary cross-entropy as the loss function for binary classification problems.

**Optimization Algorithm:** To optimize the cost and reduce the error or loss, we need to update the parameters like weight and bias. For this, we will use stochastic gradient descent (the `SGD` optimizer in Keras).

```
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='SGD',metrics=['accuracy'])
```
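As a quick sanity check on the size of this model, the number of trainable parameters can be counted by hand: each Dense layer has one weight per input-neuron pair plus one bias per neuron. The sketch below is plain Python and only assumes the three layer sizes used in this article.

```python
# (inputs, neurons) for each Dense layer: 2 -> 16 -> 16 -> 1
layers = [(2, 16), (16, 16), (16, 1)]

total = 0
for n_in, n_out in layers:
    params = n_in * n_out + n_out   # weights plus one bias per neuron
    total += params
    print(f"Dense({n_out}): {params} parameters")

print("Total:", total)  # Total: 337
```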

After compiling the model, it's time to fit the training data with an epoch value of 1000. After training the model, we will calculate the accuracy score and print the predicted output on the test data.

```
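# Keras expects arrays, so convert the Python lists first
X,Y=np.array(X),np.array(Y)
X_test,Y_test=np.array(X_test),np.array(Y_test)
model.fit(X,Y,epochs=1000)
loss,accuracy=model.evaluate(X_test,Y_test,verbose=0)
print(model.predict(X_test))
print('Accuracy: %.2f' % (accuracy*100))
print(loss)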
```

**Final Output: [[0.706279 ], [0.21512125], [0.70059645], [0.49652937]]
Accuracy: 100.00**

**Expected Output: [[1], [0], [1], [0]]**
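The raw sigmoid outputs are probabilities, not class labels. A minimal sketch of the conversion, assuming the usual decision threshold of 0.5 and reusing the four values reported above:

```python
import numpy as np

# Sigmoid outputs reported above for the four test inputs
probabilities = np.array([[0.706279], [0.21512125], [0.70059645], [0.49652937]])

# Assumed decision rule: predict 1 when the probability exceeds 0.5
labels = (probabilities > 0.5).astype(int)
print(labels.tolist())  # [[1], [0], [1], [0]]
```

After thresholding, the predictions match the expected output, which is why the accuracy is 100% even though the probabilities are not exactly 0 or 1.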

We can say that the multi-layer Perceptron performed well and can learn XOR properties. After the successful implementation of MLP, neural networks became very popular and opened vast opportunities to solve complex problems with great accuracy. At the end of this blog, there are two use cases that MLP can easily solve.

These are some basic steps one must follow to train a neural network.

- The number of nodes in the input layer equals the number of features.
- Decide the number of hidden layers and nodes present in them.
- Initialize the value of weight and bias for each layer.
- Define the loss and activation function for each layer.
- Define the optimization algorithm to update the parameters.
- Decide the number of nodes present in the output layer. This is generally equal to the number of classes in classification problems and one for regression problems.
- Decide the epochs to train the model.
- Evaluate the performance.

These steps can be performed by writing a few lines of code in Keras or PyTorch using the inbuilt algorithms, but instead of using them as a black box, we should know the ins and outs of those algorithms. This was the purpose of coding the Perceptron from scratch.

The MNIST dataset is the most famous dataset of handwritten digits used for character recognition. Almost every algorithm has been fitted on this dataset to evaluate its performance, and the highest accuracy score achieved is 99.91%. Although many algorithms perform better than MLP, this dataset is perfect for practising neural network implementation. To understand this dataset in detail and see how a model can be built on it, look at this blog.

The Iris dataset is well suited for understanding which features are important to predict the flower species. Many machine learning and neural network curricula use this dataset as a reference to teach model building, which makes it a good starting point for neural networks. Unlike MNIST, Iris is tabular data: each flower is described by four numeric measurements, so no image-to-vector conversion or flattening is needed before feeding it into the neural network. Please refer to this blog to learn more about this dataset and its implementation.

This blog is intended to familiarize you with the crux of neural networks and show how neurons work. The choice of parameters like the number of layers, neurons per layer, activation function, loss function, optimization algorithm, and epochs can be a game changer. And with the support of python libraries like TensorFlow, Keras, and PyTorch, deciding these parameters becomes easier and can be done in a few lines of code. Stay with us and follow up on the next blogs for more content on neural networks.

If you have any queries/doubts/feedback, please write to us at contact@enjoyalgorithms.com. Enjoy machine learning!