Numerical data is the most frequently used data type in Machine Learning, and to store this tremendous amount of data, we use "array" data structures. NumPy is a famous library package of Python used by Data scientists and analysts for working effectively and efficiently with arrays.
In this article, we will learn about this beautiful package starting with the installation, and then get used to some essential functions frequently used while building machine learning projects.
Let us start by learning more about NumPy.
NumPy is an open-source library that adds support for large, multidimensional arrays and helps us perform high-level mathematical functions effectively and efficiently. Travis Oliphant developed it in 2005. Many famous Python libraries, such as Pandas, Matplotlib, Seaborn, and Scikit-learn, are built on top of NumPy. To know NumPy in more detail, let's first understand what arrays are.
Arrays are collections of elements and can have one or more dimensions. A one-dimensional array is called a vector, and a two-dimensional array is called a matrix. Similarly, a three-dimensional array can be considered a tensor (a set of matrices). NumPy arrays are called ndarrays, or N-dimensional arrays.
NumPy is an excellent choice to learn after gaining confidence in Python basics. After this, to advance our career in data science, we should learn SciPy and Pandas. In short, our learning path should be Python basics, then NumPy, then SciPy or Pandas.
We might wonder: if Python lists already exist, why do we need NumPy? So, let's look at the difference first.
Python lists act as arrays that can store different types of elements. Everything is an object in Python, so it matters how these objects are stored. A list does not hold its values directly; it holds pointers to separate Python objects in memory, and each of those objects can be of a different type.
Lists are excellent because they let us work with different data types in a single data structure. But that comes at the price of memory and computing efficiency, especially when all elements have the same data type.
A NumPy array solves this issue because it stores elements of a single type, which saves memory, especially when the array has many elements. Also, NumPy supports element-wise operations directly, which lists do not (with a list we need a loop or a comprehension). It is also time efficient for mathematical operations; it is often quoted as approximately 14x faster than plain Python, although the actual speedup depends on the operation and the array size.
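The difference between the two containers can be sketched in a few lines. This is only an illustration; the variable names are made up for the example.

```python
import numpy as np

values = [1, 2, 3, 4]
arr = np.array(values)

# Multiplying a list repeats it; it does not scale the elements.
repeated = values * 2

# Element-wise work on a list needs an explicit loop or comprehension.
doubled_list = [v * 2 for v in values]

# The same operation on an ndarray is applied to every element at once.
doubled_arr = arr * 2

print(repeated)       # [1, 2, 3, 4, 1, 2, 3, 4]
print(doubled_list)   # [2, 4, 6, 8]
print(doubled_arr)    # [2 4 6 8]

# Every element of an ndarray shares one dtype and a fixed size in bytes.
# The exact integer dtype depends on the platform, so no output is shown.
print(arr.dtype, arr.itemsize)
```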
One can find detailed instructions to install NumPy on all operating systems in our make your system machine learning-enabled blog. To install NumPy via Python PyPI (pip), we can use the commands below:
Python 2 on terminal → pip install numpy
Python 3 on terminal → pip3 install numpy
Jupyter notebook (Python 2) → !pip install numpy
Once installed, we can import this library and use it in our code. For example:
import numpy as np
We have imported NumPy and shortened its name to "np". So in the following sections, we will use 'np' instead of the complete name. As discussed, NumPy has significant advantages when used for mathematical operations. So let's start with creating NumPy arrays first.
np.array() can be used to create a NumPy array. This function takes values in a list and converts them into an ndarray. For example:
np.array([1,2,3]) #Output: array([1, 2, 3])
We can specify the data type inside the np.array() function. Suppose we select the data type "int", but the input list has float values; then, while creating the array, NumPy drops the fractional part of those values (it truncates toward zero, which matches the floor only for non-negative numbers), as shown in the example below.
np.array([1,2,3.7],dtype=int) #Output: array([1, 2, 3])
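The conversion to int is easiest to see with negative inputs. The short sketch below uses made-up values to compare the dtype=int conversion with np.floor().

```python
import numpy as np

values = [-3.7, -1.2, 1.2, 3.7]

# dtype=int drops the fractional part (truncation toward zero).
truncated = np.array(values, dtype=int)

# np.floor rounds down, which differs from truncation for negative numbers.
floored = np.floor(values).astype(int)

print(truncated)  # [-3 -1  1  3]
print(floored)    # [-4 -2  1  3]
```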
We saw before that the NumPy array can be multidimensional, and the same can be created by passing a list of lists to the function. Here in the example below, a 2X3 matrix has been made. The matrix shape is defined as N x M, where N is the number of rows and M is the number of columns.
np.array([[1,2,3],[4,5,6]]) #Output: array([[1, 2, 3], [4, 5, 6]])
We use the np.full() function to create an array filled with a single fixed number. We have to provide the shape of the array and the number used to fill it. This can be observed in the example below.
np.full((2,2),5) #Output: array([[5, 5], [5, 5]])
We can use np.zeros() and pass the shape of the array as a tuple to get an array containing only zeros. For example:
np.zeros((2,2)) #Output: array([[0., 0.], [0., 0.]])
We can use np.ones() and pass the shape of the array (a tuple, or a single integer for a 1D array) to get an array with each element as 1, as shown below.
np.ones(4) # Output: array([1., 1., 1., 1.])
In data science applications, we often need initial values to be randomized. np.random.rand() can create an array with random values. We pass the length of each dimension as separate arguments (not as a tuple), and all random values lie in the range [0,1), with zero included and 1 excluded:
np.random.rand(2,3) #Output: array([[0.76981844, 0.56005659, 0.61075499], [0.2434684 , 0.8560164 , 0.22834211]])
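Random values change on every run, which makes results hard to reproduce. One way to get repeatable numbers is NumPy's Generator interface with a fixed seed, sketched below; the seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np

# A seeded generator produces the same sequence each time it is created.
rng = np.random.default_rng(seed=42)
sample = rng.random((2, 3))   # here the shape is passed as a tuple

# A second generator with the same seed reproduces the same values.
rng_again = np.random.default_rng(seed=42)
same_sample = rng_again.random((2, 3))

print(sample.shape)                         # (2, 3)
print(np.array_equal(sample, same_sample))  # True
```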
An identity matrix is a square matrix whose diagonal elements are one and whose other elements are zero. Use the np.eye() method to create it. Since the number of rows and columns is the same, we pass just one number; np.eye(4) creates a 4X4 matrix as shown:
np.eye(4) # Output: array([[1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 1., 0.], [0., 0., 0., 1.]])
We can move the diagonal upward or downward by specifying the value of k in np.eye(number of rows, k=value). If the value is positive, it moves upward; for a negative value, it moves downward. Please note that the matrices we get are not identity matrices. An example is shown below where the diagonal moves downward:
np.eye(4,k=-1) # Output: array([[0., 0., 0., 0.], [1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 1., 0.]])
Use the np.arange() method to get an evenly spaced array:
np.arange(4) #Output: array([0, 1, 2, 3])
Specify the starting point, end point, and step size in the function to get a custom array, as shown:
np.arange(10,30,5) # Output: array([10, 15, 20, 25])
In the above example, please note that the end point is not included in our array. That is why 30, our end point, does not appear in the output.
If we want the end point too, we can use np.linspace(), but here we specify the number of elements wanted in the array instead of the step size, as shown:
np.linspace(10,30,6) # Output: array([10., 14., 18., 22., 26., 30.])
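The relationship between the two functions can be checked with a small sketch. The retstep and endpoint arguments used here are optional parameters of np.linspace().

```python
import numpy as np

# retstep=True also returns the spacing that linspace computed.
with_end, step = np.linspace(10, 30, 6, retstep=True)

# endpoint=False excludes the end point, like arange does.
without_end = np.linspace(10, 30, 5, endpoint=False)

# arange with the same spacing gives the same values as integers.
stepped = np.arange(10, 30, 4)

print(with_end)     # [10. 14. 18. 22. 26. 30.]
print(step)         # 4.0
print(without_end)  # [10. 14. 18. 22. 26.]
print(stepped)      # [10 14 18 22 26]
```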
We have learned to make a new array with the help of different methods. Now we will see how to know the shape of an already existing array.
We need to know the array's number of rows, columns, and axes. We also like to see the shape and size of the array.
Let's create an array with the name np_array, which will be directly used for explaining different functions ahead, as shown:
np_array = np.array([[10,20,30],[40,50,60]])
Use the ndim attribute to get the number of axes (also known as dimensions) of an array as shown:
np_array.ndim # Output: 2
Here the output is 2 because the array has two axes.
We use the shape attribute to get the shape of an array. We get the result as a tuple where each entry gives the length of the corresponding axis. The output is (2, 3), corresponding to 2 rows and 3 columns.
np_array.shape # Output: (2, 3)
We can also get the size of the array, which is the total number of elements, that is, the product of the lengths of all axes. Use the size attribute for this. The output is 6 as axis1*axis2 = 2*3 = 6
np_array.size # Output: 6
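The same three attributes work for any number of dimensions. A short sketch with a hypothetical 3D array of zeros:

```python
import numpy as np

cube = np.zeros((2, 3, 4))

print(cube.ndim)   # 3
print(cube.shape)  # (2, 3, 4)
print(cube.size)   # 24

# size is the product of the entries of shape: 2 * 3 * 4 = 24
product_of_shape = int(np.prod(cube.shape))
print(product_of_shape == cube.size)  # True
```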
Here we reshape the array without changing the elements. Use the reshape() method as shown below. The input provided is the shape of the matrix we want. Please note that the new shape must hold the same number of elements as the original array; otherwise, an error will occur.
a = np.array([10,20,30,40,50,60])
print(a)
# Output: [10 20 30 40 50 60]
a.reshape(2,3)
# Output after reshaping: array([[10, 20, 30], [40, 50, 60]])
a.reshape(2,5)
# Output: error in reshaping to 2*5, because 6 elements cannot fill 10 positions
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: cannot reshape array of size 6 into shape (2,5)
In the above example, we know the whole shape of the matrix. In cases where we do not know the entire shape, we can pass -1 in place of the length of one axis, and NumPy infers that length from the number of elements.
a.reshape(3,-1) # Output: array([[10, 20], [30, 40], [50, 60]])
a.reshape(-1,3) # Output: array([[10, 20, 30], [40, 50, 60]])
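The -1 inference only works when the known lengths divide the total number of elements. The sketch below, with a made-up 12-element array, shows both the successful case and how the error can be caught.

```python
import numpy as np

a = np.arange(12)

rows_inferred = a.reshape(-1, 4)   # 12 / 4 = 3 rows
cols_inferred = a.reshape(2, -1)   # 12 / 2 = 6 columns

print(rows_inferred.shape)  # (3, 4)
print(cols_inferred.shape)  # (2, 6)

# 12 is not divisible by 5, so NumPy cannot infer the missing length.
try:
    a.reshape(5, -1)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message is not None)  # True
```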
Transpose is a shaping method where the rows and columns are swapped. Use the transpose() method (or the equivalent T attribute) as shown below:
np_array.transpose() # Output: array([[10, 40], [20, 50], [30, 60]])
We flatten an array while converting a multidimensional array to 1-dimensional. We can use flatten() or ravel().
array1 = np_array.flatten()
array2 = np_array.ravel()
print("array shape after flatten is:",array1.shape)
print("array shape after ravel is:",array2.shape)
print("array after flatten is:",array1)
print("array after ravel is:",array2)
#Output:
array shape after flatten is: (6,)
array shape after ravel is: (6,)
array after flatten is: [10 20 30 40 50 60]
array after ravel is: [10 20 30 40 50 60]
After seeing this, we might think that both functions are the same, but there is a fundamental difference in the output they return. Here flatten() always returns a copy (a deep copy), while ravel() returns a view (a shallow copy) of the original array whenever possible.
A deep copy creates an entirely new ndarray, and a reference to this new location in memory is returned. Changes made to the output will not be reflected in the original array. With a shallow copy, a reference to the original memory is returned, which means that changes made to the output will also be reflected in the original array.
###below is changes made in flatten output
array1[1] = 0
print(np_array)
# Output: [[10 20 30] [40 50 60]]
###below is changes made in ravel output
array2[1] = 0
print(np_array)
# Output: [[10 0 30] [40 50 60]]
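Whether two arrays share the same underlying data can be checked directly with np.shares_memory(). The sketch below repeats the experiment above on a fresh array and also shows a case (a transposed array) where ravel() has to fall back to a copy.

```python
import numpy as np

np_array = np.array([[10, 20, 30], [40, 50, 60]])

flat_copy = np_array.flatten()  # always a new array
flat_view = np_array.ravel()    # a view here, since the data is contiguous

print(np.shares_memory(np_array, flat_copy))  # False
print(np.shares_memory(np_array, flat_view))  # True

# Writing to the copy leaves the original untouched.
flat_copy[1] = 0
after_copy_write = int(np_array[0, 1])   # still 20

# Writing to the view changes the original.
flat_view[1] = 0
after_view_write = int(np_array[0, 1])   # now 0

# A transposed array is not contiguous in row-major order,
# so ravel() cannot return a view and copies instead.
transposed_flat = np_array.T.ravel()
print(np.shares_memory(np_array, transposed_flat))  # False
```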
To expand an array by adding a new axis, use the np.expand_dims() method. The input we need to provide is the array and the position of the new axis. Below, we add a new axis at position 1 to the 1D array a created earlier, which turns each element into its own row:
np.expand_dims(a,axis=1) # Output: array([[10], [20], [30], [40], [50], [60]])
Use the np.squeeze() method for compressing an array. Squeezing an array means removing an axis of length 1. The axis we choose must have a corresponding value equal to 1 in the shape tuple. If we select an axis whose length is not 1, an error will occur.
a = np.array([[[1,2,3],[4,5,6]]])
a.shape # Output: (1, 2, 3)
np.squeeze(a,axis=2)
# Output: We get the following error as the corresponding value is 3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<__array_function__ internals>", line 180, in squeeze
File "/home/avisouser/.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 1545, in squeeze
return squeeze(axis=axis)
ValueError: cannot select an axis to squeeze out which has size not equal to one
np.squeeze(a,axis=0) #Output: array([[1, 2, 3], [4, 5, 6]])
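Expanding and squeezing are inverse operations, which a short round trip makes visible. The array below is a made-up example; np.newaxis is shown as an equivalent indexing shorthand for np.expand_dims().

```python
import numpy as np

a = np.array([10, 20, 30])

column = np.expand_dims(a, axis=1)   # shape (3, 1)
row = np.expand_dims(a, axis=0)      # shape (1, 3)

# Indexing with np.newaxis inserts an axis in the same way.
same_as_column = a[:, np.newaxis]

# Squeezing the added axis restores the original shape.
restored = np.squeeze(column, axis=1)

# Without an axis argument, squeeze removes every axis of length 1.
all_squeezed = np.squeeze(np.zeros((1, 3, 1)))

print(column.shape, row.shape)  # (3, 1) (1, 3)
print(restored.shape)           # (3,)
print(all_squeezed.shape)       # (3,)
```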
We have now seen how to create an array and determine its shape. Now we will see how to access specific elements of an array using slicing and indexing.
Sometimes only a part of the complete array is needed. For that, we only need to pass the starting index, end index, and step size → [start index: end index: step size].
Please note that the end index is not included. The step size is the distance between the indices of two consecutive selected elements, so a step size of 2 picks every second element.
np.array([1,2,3,4,5,6])
# Output: array([1, 2, 3, 4, 5, 6])
np.array([1,2,3,4,5,6])[1:5]
# Output when no step size is given: array([2, 3, 4, 5])
np.array([1,2,3,4,5,6])[1:5:1]
# Output when step size is 1: array([2, 3, 4, 5])
np.array([1,2,3,4,5,6])[1:5:2]
# Output when step size is 2: array([2, 4])
A 2D array has two axes, so slicing here has to be specified for both axes. Please note that this method works for arrays with more dimensions too. We index elements as in a list of lists, with 0-based indexing.
###Indexing
np_array[0,0]
# Output: Here we get the element from first row and first column
10
np_array[0,2]
# Output: Here we get the element from first row and third column
30
np_array[1,2]
# Output: Here we get the element from second row and third column
60
###Slicing
np_array[:,1:2]
# Output: Here we only choose the second column
array([[20], [50]])
np_array[:1,:]
# Output: Here we only choose the first row
array([[10, 20, 30]])
np_array[:1,1:2]
# Output: Here we only choose the common values between first row and second column
array([[20]])
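Two details of 2D slicing are easy to miss: an integer index removes an axis while a slice keeps it, and a basic slice is a view of the original data rather than a copy. The sketch below illustrates both on a fresh copy of the example array.

```python
import numpy as np

np_array = np.array([[10, 20, 30], [40, 50, 60]])

# An integer index drops the axis; a slice of length 1 keeps it.
col_1d = np_array[:, 1]     # shape (2,)
col_2d = np_array[:, 1:2]   # shape (2, 1)
print(col_1d.shape, col_2d.shape)  # (2,) (2, 1)

# A slice is a view: writing to it changes the original array.
first_row = np_array[0, :]
first_row[0] = 99
print(np_array[0, 0])  # 99

# Calling copy() gives independent data when that is not wanted.
safe_row = np_array[1, :].copy()
safe_row[0] = -1
print(np_array[1, 0])  # 40
```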
We first create a 3D matrix using np.array() method.
a = np.array([[[10,20],[30,40],[50,60]], # first 2D array (index 0 along the first axis)
[[70,80],[90,100],[110,120]], # second 2D array (index 1 along the first axis)
[[130,140],[150,160],[170,180]]]) # third 2D array (index 2 along the first axis)
print(a)
# Output:
[[[ 10 20] [ 30 40] [ 50 60]]
[[ 70 80] [ 90 100] [110 120]]
[[130 140] [150 160] [170 180]]]
Please note that the 3D matrix has an additional axis compared to the 2D matrix. We can also say the third axis determines the number of 2D matrices superimposed on one another, as shown in the figure below.
Above, we see a representation of a 3D matrix. As discussed in the section → Slicing and indexing of matrices or 2D arrays, we take slices of each axis to get our required elements.
a.shape
# Output: (3, 3, 2)
## above we see that the array holds 3 2D arrays (first axis), each with 3 rows (second axis) and 2 columns (third axis).
###Indexing of array
a[0,0,1]
# Output: Here we get the element in the first 2D array, first row, second column
20
###Slicing of array
a[1:,0:2,0:2]
# Output: We select first two rows of second and third array
array([[[ 70, 80], [ 90, 100]], [[130, 140], [150, 160]]])
Use the np.flip() method to flip the array horizontally or vertically, depending on the axis.
np_array
# Output: array([[10, 20, 30], [40, 50, 60]])
np.flip(np_array,axis=0)
# Output: array([[40, 50, 60], [10, 20, 30]])
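The example above reverses the order of the rows (a vertical flip). For completeness, here is a small sketch of the other cases, using the same example values.

```python
import numpy as np

np_array = np.array([[10, 20, 30], [40, 50, 60]])

# axis=1 reverses the order of the columns (a horizontal flip).
flipped_columns = np.flip(np_array, axis=1)

# With no axis given, np.flip reverses every axis.
flipped_both = np.flip(np_array)

# Slicing with a negative step gives the same result as axis=1.
same_as_columns = np_array[:, ::-1]

print(flipped_columns.tolist())  # [[30, 20, 10], [60, 50, 40]]
print(flipped_both.tolist())     # [[60, 50, 40], [30, 20, 10]]
```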
In the next section, we see how to stack two arrays and the conditions that apply while doing so.
Stacking and concatenating both combine existing arrays into a new array. The difference is that concatenation joins arrays along an axis that already exists, whereas stacking can place the arrays along a new axis (for example, np.vstack() turns two 1D arrays into the rows of a 2D array). In both cases, the arrays must have matching sizes along every axis other than the one being joined; otherwise, an error will occur. Use the following functions:
a = np.array([1,2,3])
b = np.array([4,5,6])
a1 = np.array([[10,20],[30,40]])
b1 = np.array([[50,60],[70,80]])
np.vstack((a,b))
# Output: array([[1, 2, 3], [4, 5, 6]])
np.hstack((a,b))
# Output: array([1, 2, 3, 4, 5, 6])
np.dstack((a1,b1))
# Output: array([[[10, 50], [20, 60]], [[30, 70], [40, 80]]])
np.concatenate((a,b),axis=0)
# Output: Here we concatenate along row
array([1, 2, 3, 4, 5, 6])
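The distinction between joining along a new axis and joining along an existing one can be sketched with np.stack() and np.concatenate(). The mismatched array c below is a hypothetical input added only to trigger the error case.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# np.stack joins along a NEW axis, so the result gains a dimension.
stacked = np.stack((a, b), axis=0)

# np.concatenate joins along an EXISTING axis, so the dimension count stays.
joined = np.concatenate((a, b), axis=0)

print(stacked.shape)  # (2, 3)
print(joined.shape)   # (6,)

# Rows of different lengths cannot be stacked vertically.
c = np.array([7, 8])
try:
    np.vstack((a, c))
    stack_failed = False
except ValueError:
    stack_failed = True

print(stack_failed)  # True
```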
We have said that NumPy is handy for mathematical operations. These operations can also be performed between an ndarray and a scalar, or between two ndarrays of different shapes. Such operations would be impossible without stretching the smaller operand to match the larger one, and the internal mechanism that does this is known as broadcasting. We see more detail about broadcasting in the next section.
Broadcasting is NumPy's internal process, which is very helpful when we want to multiply an ndarray by a scalar. It is also useful when we operate on two ndarrays of different shapes, as it conceptually stretches the smaller ndarray to match the larger one. Note that for each dimension where the shapes do not match, one of the two arrays must have size 1 along that dimension, and then broadcasting will work. Otherwise, we will get an error.
a = np.arange(10,100,20)
b = np.array([[3],[3]])
a+b
# Output: Here we get the output when we add two ndarrays of different dimensions.
array([[13, 33, 53, 73, 93], [13, 33, 53, 73, 93]])
a*2
# Output: Here we multiply the whole array by a scalar number
array([ 20, 60, 100, 140, 180])
Here the scalar number is conceptually stretched to match the dimensions of the ndarray so that the multiplication is feasible.
Without broadcasting, calculations between two ndarrays would only be feasible when they have the same shape; broadcasting makes them possible for compatible shapes as well.
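The shape rule can be made concrete with a small sketch. The arrays here are illustrative assumptions: a has shape (5,), b is a hypothetical (2, 1) array, and np.broadcast_to() is used only to show what the stretched operand looks like.

```python
import numpy as np

a = np.arange(10, 100, 20)   # shape (5,)
b = np.array([[3], [3]])     # shape (2, 1)

# Shapes (5,) and (2, 1) are compatible: the result has shape (2, 5).
total = a + b
print(total.shape)  # (2, 5)

# broadcast_to shows how a is conceptually repeated to fill (2, 5).
stretched = np.broadcast_to(a, (2, 5))
print(stretched.tolist())
# [[10, 30, 50, 70, 90], [10, 30, 50, 70, 90]]

# Shapes (5,) and (2,) do not match and neither length is 1,
# so broadcasting fails with a ValueError.
try:
    a + np.array([1, 2])
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

print(broadcast_failed)  # True
```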
We have said multiple times that NumPy is useful for mathematical operations, but we have not yet seen what type of operations can be performed using NumPy. We see these in the next section.
Basic mathematical operations work as they do in ordinary arithmetic and are applied to each element. These include addition, subtraction, division, etc.
a = np.arange(10,100,20)
a
print("sum output is:",a+2)
print("subtraction output is:",a-2)
print("division output is:",a/2)
#Output:
array([10, 30, 50, 70, 90])
sum output is: [12 32 52 72 92]
subtraction output is: [ 8 28 48 68 88]
division output is: [ 5. 15. 25. 35. 45.]
Mean → The mean can be found using the np.mean() method. For a vector, it means taking the sum of the elements and dividing it by the length of the vector.
Median → We find the median using the np.median() method. The median is the value that separates the higher half from the lower half of the data, a population, or a probability distribution.
Standard deviation → We find the standard deviation using the function np.std().
np.mean(a)
# Output: 50.0
np.median(a)
# Output: 50.0
np.std(a)
# Output: 28.284271247461902
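To see where the standard deviation value comes from, it can be recomputed by hand. By default np.std() divides by the number of elements (the population formula); passing ddof=1 divides by one less, which gives the sample standard deviation. A short sketch with the same array:

```python
import numpy as np

a = np.arange(10, 100, 20)   # [10 30 50 70 90]

# Mean: sum of the elements divided by their count.
manual_mean = a.sum() / a.size

# Population standard deviation: square root of the mean squared deviation.
manual_std = np.sqrt(np.mean((a - manual_mean) ** 2))

# Sample standard deviation divides by (n - 1) instead of n.
sample_std = np.std(a, ddof=1)

print(manual_mean)                         # 50.0
print(np.isclose(manual_std, np.std(a)))   # True
print(sample_std > manual_std)             # True
```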
Minimum → The minimum element in the array is found using the np.min() method. The index of the minimum element can be determined using the argmin() method.
Maximum → We find the largest element in the array using the np.max() method. The index of the largest element can be determined using the argmax() method.
Array Sum → Use the sum() method to find the array sum.
np_array.sum()
# Output: 210
np.min(a,axis=0)
# Output: 10
np.max(a,axis=0)
# Output: 90
### a is one-dimensional here, so axis=0 (its only axis) gives the min and max element of the whole array
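The axis argument becomes more interesting on a 2D array, and argmin() and argmax() return positions rather than values. A brief sketch using the example arrays from this article:

```python
import numpy as np

np_array = np.array([[10, 20, 30], [40, 50, 60]])
a = np.arange(10, 100, 20)   # [10 30 50 70 90]

# axis=0 collapses the rows, giving one result per column.
column_sums = np_array.sum(axis=0)
# axis=1 collapses the columns, giving one result per row.
row_sums = np_array.sum(axis=1)

print(column_sums)  # [50 70 90]
print(row_sums)     # [ 60 150]

# argmin and argmax return the index of the extreme element.
print(a.argmin(), a.argmax())  # 0 4

# On a 2D array without an axis, the index refers to the flattened array.
flat_index_of_max = int(np_array.argmax())
print(flat_index_of_max)  # 5
```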
Often in data science problems, we need to sort elements. Depending on the implementation and the algorithm used, the time required for sorting can vary greatly. NumPy has implemented various algorithms like mergesort, quicksort, heapsort, Timsort, etc.
a = np.array([10,40,20,500])
np.sort(a, kind='mergesort')
# Output: array([10, 20, 40, 500])
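A few related sorting operations come up often alongside np.sort(). The sketch below shows that np.sort() returns a new array, how to get a descending order, and how np.argsort() returns the indices that would sort the array.

```python
import numpy as np

a = np.array([10, 40, 20, 500])

# np.sort returns a sorted copy; the original array is unchanged.
ascending = np.sort(a)
print(ascending)  # [ 10  20  40 500]
print(a)          # [ 10  40  20 500]

# Reversing the sorted copy gives descending order.
descending = np.sort(a)[::-1]
print(descending)  # [500  40  20  10]

# argsort returns the indices that put the array in sorted order.
order = np.argsort(a)
print(order)  # [0 2 1 3]
```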
We can say that NumPy is a boon for Python developers, helping us perform mathematical operations effectively and efficiently. As a quick summary, in this article, we discussed the basics of the NumPy library, starting with the installation and then performing various operations on ndarrays. To know more about this library, you can see the official documentation. We hope you enjoyed it.