Neural networks are a highly sought-after topic in the software industry today. In a previous article, we discussed the fundamentals of neural networks (NNs). However, understanding the components that make up a neural network is crucial for gaining a comprehensive understanding of the concept. In this article, we will delve deeper into the anatomy of a neural network and examine the importance of each element, including input layers, hidden layers, output layers, neurons, connections, weights, biases, activation functions, and cost functions.

Before moving any further, let's first look at the schematic diagram of an NN.

The input layer is the first layer of any Neural Network and represents the input data to the network. Each neuron, drawn as a small circular node (x1, x2, …, xn) in the diagram above, corresponds to one feature of the dataset. For example, in a house price prediction model, the input layer may have neurons for the house's size, distance from the railway station, and distance from the market. Understanding the input layer and its role in the neural network is crucial for designing and training efficient models.

The output layer of a Neural Network represents the final predictions generated by the network. The number of neurons in this layer corresponds to the number of outputs desired for a given input. In a regression problem, where a single output value is expected, there will be one neuron in the output layer. However, in classification tasks, where multiple output classes are possible, there will be multiple neurons, one for each class. For example, in a handwritten digit recognition task, there will be 10 neurons corresponding to the 10 possible classes (0-9).

Sometimes, the input and output pairs can have complex relationships, and to decode these relations, hidden layers exist between the input and output layers. Hidden layers also contain neurons, and in a fully connected network every neuron connects to every neuron in the adjacent layers. For example, each neuron in hidden layer 1 is connected to every neuron in the input layer and to every neuron in hidden layer 2.

Why not use immensely deeper networks to learn all the complex relations in the data? Because there is a tradeoff between accuracy and latency. As we increase the number of layers and nodes, we might achieve better accuracy, but at the cost of more computation, more power, and, ultimately, more money. Hence, while designing any NN, we must find answers to the following questions:

- How many layers will be sufficient to achieve our target accuracy?
- How many neurons will be sufficient to achieve our target accuracy?
- How many connections should be retained from the previous layer's neurons?

The third one is interesting because we sometimes drop neuron connections deliberately to improve generalization and reduce overfitting. Overfitting is a problem where an ML model fits the training data too closely, noise included, and then performs significantly worse on unseen data, i.e., test datasets.
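
To make the idea of dropping connections concrete, here is a minimal sketch in plain Python. The article does not name a specific technique, so this is only one possible reading: each weight is zeroed at random, which is similar in spirit to DropConnect-style masking (dropout, by contrast, masks whole neurons). The function name `drop_connections`, the `keep_prob` parameter, and the rescaling of surviving weights by `1 / keep_prob` are assumptions made for illustration.

```python
import random


def drop_connections(weight_matrix, keep_prob, rng):
    """Zero each weight with probability 1 - keep_prob and rescale the survivors.

    Hypothetical helper: rescaling by 1 / keep_prob keeps the expected
    weighted sum unchanged (an assumption borrowed from inverted dropout).
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    return [
        [w / keep_prob if rng.random() < keep_prob else 0.0 for w in row]
        for row in weight_matrix
    ]


rng = random.Random(0)  # seeded so the mask is reproducible
W = [[0.2, -0.4, 0.6],
     [1.0, 0.5, -0.3]]
thinned = drop_connections(W, keep_prob=0.5, rng=rng)
```

A fresh mask would normally be drawn at every training step, while all connections are used at prediction time.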

Neurons play a crucial role in the functioning of a Neural Network, as they constitute every layer, including the Input, Output, and Hidden layers. Similar to the nucleus of a brain cell, each neuron, except those in the Input layer, contains a bias parameter that the Neural Network learns and adjusts during the training process. These bias values are typically initialized with zeros or small random numbers, and the Neural Network fine-tunes them to minimize the difference between the computed and actual output.

The connections between neurons in a Neural Network are crucial for the learning process. In a fully connected network, each neuron in one layer is connected to every neuron in the adjacent layers. Each connection carries a weight value, which determines the importance of that connection. The weight values are trainable parameters that the Neural Network learns by iterating over the training dataset, and optimizing them is a key aspect of the learning process.
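
As a rough illustration of how weights and a bias combine inside a single neuron, the sketch below computes the weighted sum of the inputs plus the bias, before any activation function is applied. The function name `neuron_output` and the feature and weight values are made up for this example.

```python
def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the neuron's bias (no activation yet)."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one connection weight")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Hypothetical, already-scaled house features:
# size, distance from the railway station, distance from the market
features = [2.0, 1.0, 3.0]
weights = [0.5, -1.0, 0.25]  # one weight per incoming connection
bias = 0.1

z = neuron_output(features, weights, bias)  # about 0.85
```

A larger absolute weight means that connection influences the neuron's output more, which is what "importance of a connection" means here.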

Fully connected Neural Networks, in which every neuron in one layer connects to every neuron in the adjacent layers, are a popular choice for a basic neural network architecture and are the classic example of a feedforward network. This dense connectivity makes them versatile across various datasets. In deep learning, other forms of connectivity, such as convolutional neural networks, are also commonly used.

The weight matrix, a combination of weight and bias values, is a crucial aspect of Neural Networks. It represents the learnable parameters of the network, and it helps the model make predictions. The weight matrix is used to map the input to the output, and it is in the form of a matrix, which makes it easy to compute and update during the training process. Understanding the weight matrix and its role in Neural Networks is essential for designing and optimizing machine learning models.
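
One way to picture a weight matrix that combines weights and biases is to store, for every neuron in a layer, its incoming weights followed by its bias in one row. The sketch below assumes that layout; `layer_forward` and the numbers are hypothetical and only show how a layer maps its input to its output.

```python
def layer_forward(inputs, weight_matrix):
    """Pre-activation outputs of one fully connected layer.

    Each row holds one neuron's incoming weights followed by its bias,
    so the bias acts as a weight on a constant input of 1.0.
    """
    extended = list(inputs) + [1.0]
    for row in weight_matrix:
        if len(row) != len(extended):
            raise ValueError("each row needs one weight per input plus a bias")
    return [sum(w * x for w, x in zip(row, extended)) for row in weight_matrix]


x = [1.0, 2.0, 3.0]            # three input features
W = [
    [0.1, 0.2, 0.3, 0.0],      # hidden neuron 0: three weights, then bias
    [-0.5, 0.0, 0.5, 1.0],     # hidden neuron 1: three weights, then bias
]
hidden = layer_forward(x, W)   # one value per hidden neuron
```

Stacking several such layers, each followed by an activation function, gives the forward pass of a fully connected network.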

In our brain, Neurons get activated based on the signals received through various sensory organs. As these signals can be related to different tasks, different neurons activate and provide the required responses. Similar to this, in Neural Networks, we have activation functions for Neurons of every layer. We define the activation function for a layer, and all the neurons in that layer follow that same activation function.

Neurons of every layer receive the outputs of the previous layer multiplied by the corresponding weight values. Based on the relationship we want to maintain between the weighted input and the corresponding output from the neuron, we can divide activation functions into two types:

**Linear Activation Functions:** In a Neural Network, the linear activation function passes the weighted input directly as the output without any additional transformation. This function ensures that the relationship between the incoming weighted input and the corresponding output from that neuron is linear. This type of activation function is commonly used in the output layer of a Neural Network, for example in regression problems.

**Non-linear Activation Functions:** Non-linear activation functions are mathematical functions that transform the weighted inputs received by a neuron to produce a non-linear relationship between the input and output. These functions help neurons extract complex patterns present in the data and are mostly used in the hidden layers of a Neural Network. Some popular non-linear activation functions include Sigmoid, ReLU, and Softmax (Softmax is typically applied in the output layer of multi-class classifiers).
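
The activation functions named above can be sketched in a few lines of plain Python. These are the standard textbook definitions, written for single values (and a list of values for Softmax) rather than as any particular library's API.

```python
import math


def linear(z):
    """Pass the weighted input through unchanged."""
    return z


def sigmoid(z):
    """Squash z into (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def relu(z):
    """Keep positive values and clip negative values to zero."""
    return max(0.0, z)


def softmax(zs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(zs)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([1.0, 2.0, 3.0])  # largest score gets the largest share
```

Note that Softmax works on all neurons of a layer together, while the other three act on each neuron independently.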

We will dive deeply into these activation functions in our subsequent blogs.

Loss and Cost functions are among the most vital components of any Machine Learning algorithm. Machines only understand numbers; hence we need to convey our objectives in numbers. If the numerical objective we frame does not represent what we want our machines to learn, the fault is ours when they fail to learn it. In our blog on loss and cost functions in ML, we discussed how we define a regression problem using cost functions such as MAE, MSE, and RMSE. Similarly, we define classification problems using Binary cross-entropy or Categorical cross-entropy, based on the number of categories in the data.
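
For reference, the cost functions mentioned above can be written as small Python functions. This is a minimal sketch using the usual definitions; the `eps` clipping in `binary_cross_entropy` is an added assumption to avoid taking the logarithm of zero, and the sample values are made up.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average cross-entropy for 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# Regression example (hypothetical values)
actual = [1.0, 2.0, 3.0]
predicted = [2.0, 2.0, 5.0]
regression_costs = (mae(actual, predicted), mse(actual, predicted), rmse(actual, predicted))

# Binary classification example (hypothetical values)
classification_cost = binary_cross_entropy([1, 0], [0.5, 0.5])
```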

Machines try to optimize these costs by changing the parameter values until they reach the ones for which the cost is minimum. Some of the most common things that we need to keep in mind while designing a cost function for an NN:

- The cost function should represent the problem statement in the form of errors between the output predicted by the NN and the expected output.
- There can be thousands of parameters in an NN. Hence, even slight changes in the parameter values should be reflected in the cost function.
- The cost function should be differentiable. For example, the error between predicted and expected output can be of degree one, and the absolute value of such an error is not differentiable at zero. Still, we can use the error⁴ as our cost function, which makes it differentiable and usable by machines in training.
- The cost function is computed only from the output layer's predictions and not from the Input and Hidden layers.

Optimizing the cost function in Machine Learning and Neural Networks is a crucial task, as trying all possible values for the weight matrix would take an excessive amount of time, even with the use of supercomputers. To aid in this process, various optimization algorithms such as Gradient Descent, Gradient Descent with momentum, Stochastic Gradient Descent, and Adam (Adaptive Moment Estimation) are utilized. Understanding these optimization algorithms is a common topic in Machine Learning interviews, so we will delve into each of them in separate articles. It is important to note that although the cost function is computed at the output layer only, these optimization algorithms update the weights and biases of every layer, using gradients of the cost propagated backward through the network.
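
To give a feel for how an optimizer searches for good parameter values instead of trying them all, here is a minimal sketch of plain Gradient Descent on the smallest possible model: one weight and one bias fitted with the MSE cost. The function name, the learning rate, the step count, and the toy data are all assumptions chosen for illustration; a real neural network applies the same kind of update to every weight and bias.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by repeatedly stepping against the MSE gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))  # dCost/dw
        grad_b = 2.0 / n * sum(errors)                             # dCost/db
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # toy data generated from y = 2x + 1
w, b = gradient_descent(xs, ys)  # w approaches 2 and b approaches 1
```

Momentum, Stochastic Gradient Descent, and Adam change how each step is computed, but they follow the same loop of measuring the gradient and nudging the parameters.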

When working with Neural Networks, there are two types of parameters to consider: trainable parameters and hyperparameters.

**Trainable Parameters:** A key aspect of Artificial Neural Networks (ANNs) is the weight matrix, which contains trainable parameters that can be modified during the learning process. These parameters include the biases of neurons in all layers except the input layer and the weights of connections between neurons. For a fully connected network, the total number of trainable parameters can be calculated by adding the total number of neurons in the hidden and output layers (one bias each) to the total number of connections (one weight each). Understanding and optimizing these trainable parameters is essential for effectively training and utilizing ANNs for a wide range of Machine Learning tasks.

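
The counting rule for trainable parameters can be checked with a short sketch. It assumes a fully connected network described only by the number of neurons in each layer, with the input layer first; `count_trainable_parameters` is a hypothetical helper name.

```python
def count_trainable_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the neurons per layer, input layer first.
    """
    # One weight per connection between consecutive layers
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # One bias per neuron in every layer except the input layer
    biases = sum(layer_sizes[1:])
    return weights + biases


# 3 inputs, one hidden layer of 4 neurons, 1 output neuron:
# 3*4 + 4*1 = 16 weights and 4 + 1 = 5 biases
small_network = count_trainable_parameters([3, 4, 1])  # 21
```
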
**Hyperparameters (Untrainable Parameters):** When building an Artificial Neural Network (ANN), it is important to set certain fixed values, known as hyperparameters, which are then fine-tuned through experimentation to achieve the lowest possible cost value. Understanding and optimizing these hyperparameters plays a crucial role in effectively training and utilizing ANNs for a wide range of Machine Learning tasks, and helps to ensure the best possible performance and accuracy in predictions.

**What's the difference between trainable and untrainable parameters?** When training an Artificial Neural Network (ANN), it is important to understand the difference between trainable parameters and hyperparameters. Trainable parameters, such as weights and biases, are continuously updated during an experiment to minimize the cost function. On the other hand, hyperparameters, such as the number of hidden layers or the learning rate, remain mostly fixed within an experiment and are modified in a systematic manner across several experiments to reach the overall minimum of the cost function. For example, the learning rate, which controls the magnitude of updates to trainable parameters, can be kept constant during one experiment and adjusted across multiple experiments to find the best value for the specific dataset and problem. Hyperparameters are also of two types:

**Design-related hyperparameters:** These hyperparameters include the number of hidden layers, the number of neurons per hidden layer, the choice of activation function, the optimization algorithm used, and the cost function selected.

**General Hyperparameters tuned after design finalization:** Some of the hyperparameters that get adjusted after the design is finalized are:

**Learning rate:** The learning rate, denoted by α, controls the magnitude of updates to the weight matrix. A high learning rate may overshoot the optimal weight values, while a low rate may prolong training. It is important to experiment with different learning rates to determine the best fit for your dataset.

**Regularization:** Regularization helps to prevent overfitting by constraining the model, for example by penalizing large weights, so that it does not memorize noise in the training data. The amount of regularization to apply can be determined through experimentation.

**Batch size:** The batch size controls the number of samples used to update the weight matrix during training. Online learning, which uses a single sample per update, can be computationally expensive and susceptible to outliers. Batch learning, which uses a group of samples, can improve performance and decrease the impact of outliers. Some disadvantages of online learning are:

- The number of data samples may be very large, and updating the weight matrix for every single sample will require very high processing power.
- There can be outliers in the dataset, which can drive the weight matrix values far from the global minimum.

Hence, there is a concept of batch learning, where we update the weight matrix values based on the cost averaged over the samples in the batch. This type of learning reduces the impact of outliers present in the dataset, as each outlier's contribution gets averaged over all the samples in that batch. Some of the standard batch sizes we see in general experiments are 16, 32, 64, etc.
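
The effect of averaging over a batch can be seen with a tiny numeric sketch. It assumes a one-weight model `y = w * x` with a squared-error cost, and a made-up batch whose last sample is an outlier; the function names are hypothetical.

```python
def sample_gradient(w, x, y):
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x


def batch_gradient(w, batch):
    """Gradient of the cost averaged over all samples in the batch."""
    return sum(sample_gradient(w, x, y) for x, y in batch) / len(batch)


# Three samples that follow y = 2x exactly, plus one outlier at the end
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (1.0, 20.0)]
w = 2.0  # already the right weight for the clean samples

online_update = sample_gradient(w, *batch[-1])  # outlier alone: -36.0
averaged_update = batch_gradient(w, batch)      # averaged over the batch: -9.0
```

With online learning, the outlier alone would push the weight four times harder than it does once its gradient is averaged with the three clean samples.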

There can be other hyperparameters that we will see as we progress toward deep-learning algorithms.

In this article, we have examined the inner workings of Artificial Neural Networks (ANNs) in depth. We have broken down all the components that make up this popular Machine Learning technique, including the Input Layer, Hidden Layer, Output Layer, Weight Matrix, Cost Function, Parameters, and Hyperparameters. We hope you enjoyed reading the article.
