Introduction to Pandas For Beginners

Pandas is a popular Python library used by data scientists and analysts for understanding and preprocessing data, and much more. It offers a wide range of features that ease the journey toward becoming proficient in Machine Learning and Data Science. It also integrates easily with other Python data-analysis packages such as scikit-learn, a complete framework for building machine learning applications. 

In this article, we will start with the installation and then get used to some essential Pandas functions we frequently use while building machine learning projects.

Key takeaways from this blog

  • What is Pandas, and what are its everyday uses?
  • What are the various Pandas data structures?
  • Creating and analyzing a Pandas DataFrame
  • Indexing and sorting DataFrames
  • Handling null or empty values in a DataFrame
  • Concatenating and merging two different DataFrames
  • How to use a lambda function in a DataFrame?
  • When to use a Python list, a NumPy ndarray, or a Pandas DataFrame?

Let us start by learning more about Pandas.

What is Pandas?

Pandas is a software library for data manipulation and analysis. It provides us with numerous tools to do these manipulations and analyses efficiently. Wes McKinney began developing Pandas in 2008, and it was made public in 2009. 
As you might be wondering, the machine learning community praises Pandas highly, but why? Let's see some practical use cases of Pandas.

Practical use-cases of Pandas in ML problems

We use Pandas in multiple domains, including economics, recommendation systems, stock prediction, etc. In economics, we need to visualize many forms of data, and Pandas helps manage massive datasets through structures like the DataFrame, which we will discuss shortly. A recommendation system needs a model to predict output; Pandas helps such models manage their data. 

Now we know this library is essential, so let's begin our journey toward understanding the basics. But first, let's install this library in our python-enabled systems.

Installation and Import of Pandas

One can find detailed instructions for installing Pandas on all operating systems in our make your system machine learning enabled blog. To install Pandas via pip (Python's package installer), we can use the commands below,

Python 2 on a terminal → pip install pandas 

Python 3 on a terminal → pip3 install pandas

Inside a Jupyter notebook → !pip install pandas

Once installed, we can import this library and use it in our codes. For example:

import pandas as pd

We have imported Pandas and shortened its name to "pd". So in later sections, while using this library, we will write 'pd' instead of the full name Pandas. As we discussed, Pandas is mainly a data-handling library, and storing or accessing data efficiently requires data structures. So let's start discussing the various data structures in Pandas. 

What are the various Pandas data structures?

There are mainly two data structures in Pandas: 

  • Pandas Series
  • Pandas DataFrame

Pandas Series

A Pandas Series can be thought of as a single column of a table. It is a one-dimensional labeled array that can hold data of any Python data type.

pd.Series([1,2,3])

###output
'''
0    1
1    2
2    3
dtype: int64
'''

The output contains the elements along with their corresponding positions as indexes. We can define indexes at our convenience; the default index ranges from 0 to n-1, where n is the length of the Series. We access elements the same way as in a Python list, where the index can be the default or a custom one, as shown.

l = pd.Series([1,2,3],index = ["A","B","C"])
print(l)
'''
## Output
A    1
B    2
C    3
dtype: int64

## To access any element at index A,
l["A"]
###output
1
'''

Pandas DataFrame

Pandas DataFrame is a 2-dimensional labeled data structure that consists of columns and rows. Its different columns can have different data types.

DataFrame Snippet

Creating Pandas Dataframe

We can create a pandas DataFrame by reading it from a CSV, Excel, JSON, python dictionary, etc. Let's see some of the most frequent ones,

Using Dictionary

Please note that in a dictionary, each (key, value) pair is treated as a separate column: the key becomes the column heading, and the values become the records stored in subsequent rows of that column.

df = pd.DataFrame({'Name of Track':["Roar","Dark Horse","Blank Space"],"Duration of track ":[269,225,272],"Singer":["Katy Perry","Katy Perry","Taylor Swift"]})
df
'''
###output
  Name of Track  Duration of track         Singer
0          Roar         269              Katy Perry
1    Dark Horse         225              Katy Perry
2   Blank Space         272              Taylor Swift
'''

Reading from a CSV

This is the most frequent way of creating a Pandas DataFrame, as most real-life datasets are available online in this form. One can use any other CSV of their choice.

df = pd.read_csv('songs_stats.csv')
df
'''
###output
  Name of track  Duration of track    Singer name
0          Roar        269             Katy Perry
1    Dark Horse        225             Katy Perry
2   Blank Space        272             Taylor Swift

'''

Please note that we have not specified a delimiter, i.e., a symbol that separates different column values in a CSV file. By default, it is a comma.
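If the file uses a different delimiter, we can pass it via the sep parameter. A minimal, self-contained sketch using a hypothetical semicolon-separated file built in memory (the contents below are made up for illustration):

```python
import io
import pandas as pd

# A small semicolon-separated "file" built in memory for illustration
csv_text = (
    "Name of track;Duration of track;Singer name\n"
    "Roar;269;Katy Perry\n"
    "Dark Horse;225;Katy Perry"
)

# sep tells read_csv which delimiter the file uses (the default is ',')
df_semi = pd.read_csv(io.StringIO(csv_text), sep=";")
print(df_semi.shape)  # (2, 3)
```

The same sep argument works when reading from a real file path instead of an in-memory buffer.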

Reading from JSON

df10 = pd.read_json(json_object)
df10
'''
###output
  Name of Track  Duration of track         Singer
0          Roar         269              Katy Perry
1    Dark Horse         225              Katy Perry
2   Blank Space         272              Taylor Swift
'''

Please note that json_object here is a JSON string (or a path to a JSON file), such as the raw text we read from a JSON file in Python.
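As a self-contained sketch, we can build a hypothetical JSON string ourselves (the content below is made up) and read it back. Wrapping the string in StringIO keeps this compatible with newer pandas versions, which expect a file-like object rather than a raw JSON string:

```python
import io
import pandas as pd

# A hypothetical JSON string standing in for the article's json_object
json_text = (
    '{"Name of Track": {"0": "Roar", "1": "Dark Horse"},'
    ' "Singer": {"0": "Katy Perry", "1": "Katy Perry"}}'
)

# read_json parses the JSON into a DataFrame (keys become columns here)
df10 = pd.read_json(io.StringIO(json_text))
print(df10.shape)  # (2, 2)
```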

Analyzing Pandas DataFrame

  • df.head(n) → Using this, we can see a sample of our DataFrame from the starting index. By default, it shows the first five rows of the DataFrame unless we specify a value of n

    df.head(2)
    '''
    #Here we specified value of n as 2 so we see first two rows
    ###output
    Name of track  Duration of track  Singer name
    0          Roar                 269  Katy Perry
    1    Dark Horse                 225  Katy Perry
    '''
  • df.tail(n) → Using this, we can see the last 'n' rows of our DataFrame. By default, it shows the last five rows of the DataFrame unless we specify a value of n

    df.tail(1)
    '''
    #Here we specified value of n as 1 so we see last row
    ###output
      Name of track  Duration of track    Singer name
    2    Blank Space           272           Taylor Swift
    '''
  • df.info() → It summarises our DataFrame information. It shows column names, memory utilization of our DataFrame, non-null values, data types, etc. Also, as we can see non-null values, it helps us see which columns have null values and helps us better understand the data.

    df.info()
    '''
    ###output
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3 entries, 0 to 2
    Data columns (total 3 columns):
    #   Column              Non-Null Count  Dtype 
    ---  ------              --------------  ----- 
    0   Name of track       3 non-null      object
    1   Duration of track   3 non-null      int64 
    2   Singer name         3 non-null      object
    dtypes: int64(1), object(2)
    memory usage: 200.0+ bytes
    '''
  • df.shape → As the name suggests, it tells us the shape of the DataFrame, i.e., the number of rows and columns in the form of a tuple. Note that shape is an attribute, not a method, so it is used without parentheses.

    df.shape
    '''
    ###output
    (3, 3)
    '''
  • df.to_numpy() → This gives us a NumPy representation of our dataset

    df.to_numpy()
    '''
    ###output
    array([['Roar', 269, 'Katy Perry'],
         ['Dark Horse', 225, 'Katy Perry'],
         ['Blank Space', 272, 'Taylor Swift']], dtype=object)
    '''
  • df.describe() → It gives us summary statistics for all numerical columns, such as the count, mean, standard deviation, and quartiles.

    df.describe()
    '''
    ###output
         Duration of track 
    count            3.000000
    mean           255.333333
    std             26.312228
    min            225.000000
    25%            247.000000
    50%            269.000000
    75%            270.500000
    max            272.000000
    '''
  • df.columns → It gives us all names of columns present in our DataFrame

    df.columns
    '''
    ###output
    Index(['Name of track', 'Duration of track ', 'Singer name'], dtype='object')
    '''
  • df.set_index(column_name) → Here, we can use a particular column of the DataFrame as the index instead of the default 0 to n-1. 

    df_custom_index = df.set_index("Name of Track")
    df_custom_index
    '''
    ###output
                       Duration of track         Singer
                                    
    Roar                   269                   Katy Perry
    Dark Horse             225                   Katy Perry
    Blank Space            272                   Taylor Swift
    '''
  • df[col_name].unique() → It returns all unique values present in that particular column.
    df[col_name].value_counts() → It is an advanced version of df[col_name].unique(), as it returns the unique values along with their counts

    df['Singer name'].unique()
    
    ###output
    ###here we see that we get 1 Katy Perry instead of two
    
    array(['Katy Perry', 'Taylor Swift'], dtype=object)
    df['Singer name'].value_counts()
    
    ###output
    Katy Perry      2
    Taylor Swift    1
    Name: Singer name, dtype: int64

We have seen the essential functions of Pandas. But it is also important that after making changes to any DataFrame, we can store it in our server for later usage. Let's see how we can do that.

Converting dataframe back to CSV, JSON 

Here we use the filename and a required extension to save our DataFrame in the required formats. For example:

df.to_csv('songs.csv')
df.to_json('songs.json')
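Note that to_csv writes the row index as an extra column by default; passing index=False avoids that. A round-trip sketch, using a small frame rebuilt inline and an in-memory buffer instead of a real file:

```python
import io
import pandas as pd

df_demo = pd.DataFrame({
    "Name of Track": ["Roar", "Dark Horse"],
    "Duration": [269, 225],
})

# index=False stops the 0..n-1 row index being written as an extra column
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)

# Reading the CSV back gives us an identical DataFrame
df_back = pd.read_csv(buffer)
print(df_demo.equals(df_back))  # True
```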

Indexing DataFrames

Rows Slicing

  • df[row_index1:row_index2] → Here, we slice the rows we need by giving a starting and an ending index separated by ':', just like slicing a Python list (indexes run from 0 to n-1 in case of default indexing). Please note that directly accessing a row by its index will give an error, i.e., accessing the first row as df[0] will raise a KeyError, because df[...] with a single value looks up a column, not a row.
    The above process is called slicing. Note that row_index1 is inclusive while row_index2 is exclusive in the above slicing. Also note that the output is always a new DataFrame.

    1) ### code when both row indexes are mentioned
    df[0:1]
    
    ###output 
    Name of track  Duration of track  Singer name
    0          Roar                 269  Katy Perry
    
    2) ### code when only one row index is mentioned
    df[1:]
    ###output 
    Name of track  Duration of track    Singer name
    1    Dark Horse                 225    Katy Perry
    2   Blank Space                 272  Taylor Swift
    
    
    3) ### code for accessing an element by a direct index
    df[0]
    ### we get the following error
    The above exception was the direct cause of the following exception:
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/home/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3505, in __getitem__
      indexer = self.columns.get_loc(key)
    File "/home/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
      raise KeyError(key) from err
    KeyError: 0
  • Using loc → Here, we use the index label instead of a number from 0 to n-1. Do note that we use df_custom_index, which we defined earlier with df.set_index.

    loc_result = df_custom_index.loc["Dark Horse"]
    loc_result
    '''
    ###output
    Duration of track            225
    Singer                Katy Perry
    Name: Dark Horse, dtype: object
    '''
  • Using iloc → Here, we use a numerical index ranging from 0 to n-1. 

    iloc_result = df_custom_index.iloc[1]
    iloc_result
    '''
    ###output
    Duration of track            225
    Singer                Katy Perry
    Name: Dark Horse, dtype: object
    '''

    Note that iloc and loc functions can be used like a list for slicing. An example of both is shown below.

    # 1)Using loc
    df_custom_index.loc["Roar":"Blank Space"]
    '''
    ###output
                 Duration of track         Singer
    Name of Track                                  
    Roar                          269    Katy Perry
    Dark Horse                    225    Katy Perry
    Blank Space                   272  Taylor Swift
    2)Using iloc
    df_custom_index.iloc[0:2]
    ###output
                 Duration of track       Singer
    Name of Track                                
    Roar                          269  Katy Perry
    Dark Horse                    225  Katy Perry
    '''

Please note that in the case of loc, both indexes are inclusive, while in the case of iloc, only the starting index is inclusive.
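This inclusive/exclusive difference can be checked directly. A self-contained sketch, rebuilding df_custom_index inline so it runs on its own:

```python
import pandas as pd

df_custom_index = pd.DataFrame(
    {"Duration of track": [269, 225, 272],
     "Singer": ["Katy Perry", "Katy Perry", "Taylor Swift"]},
    index=["Roar", "Dark Horse", "Blank Space"],
)

# loc slicing by label: BOTH endpoints are included -> 3 rows
by_label = df_custom_index.loc["Roar":"Blank Space"]

# iloc slicing by position: the end position is excluded -> 2 rows
by_position = df_custom_index.iloc[0:2]

print(len(by_label), len(by_position))  # 3 2
```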

Retrieving Columns

We can use any of the following to access any column from a DataFrame:

  1. df[column_name] or df.column_name: This produces an output of type <class 'pandas.core.series.Series'>
  2. df[[column_name1, column_name2]]: This produces an output of type DataFrame

As mentioned in point 2, we have to give the column names as a list.

1)###code for Point1
df["Singer name"]

###output
0      Katy Perry
1      Katy Perry
2    Taylor Swift
Name: Singer name, dtype: object

2)###code for Point2
df[["Singer name","Name of track"]]
###output
    Singer name Name of track
0    Katy Perry          Roar
1    Katy Perry    Dark Horse
2  Taylor Swift   Blank Space

Retrieving Required Columns and Rows Together

We can combine the techniques learned earlier.

  • df[row_index1:row_index2][column_name]
  • df[row_index1:row_index2][[column_name1, column_name2]]

Please note that, as mentioned with columns, we get a Series if we mention only a single column name; if we mention a list of columns, we get a DataFrame. The row indexes follow the same rules as in Rows Slicing, and the column names follow the same rules mentioned in Retrieving Columns.

#1)###Example 1
df[1:3]["Singer name"]
'''
###output
1      Katy Perry
2    Taylor Swift
Name: Singer name, dtype: object
'''
#2)###Example 2
df[1:3][["Singer name"]]
'''
###output
    Singer name
1    Katy Perry
2  Taylor Swift
'''

Sorting DataFrames

  • df.sort_index(ascending=False) → It will help us sort the DataFrame in descending order of its row index values. If we specify ascending=True instead, it will be sorted in ascending order.

    df.sort_index(ascending=False) 
    '''
    ###output
    Name of track  Duration of track    Singer name
    2   Blank Space        272           Taylor Swift
    1    Dark Horse        225             Katy Perry
    0          Roar        269             Katy Perry
    '''
  • df.sort_values(by=column_name) → It will help us sort the DataFrame using the values in the column we specify.

    df.sort_values(by='Duration of track ')
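As a runnable sketch (recreating the songs DataFrame inline), sorting by duration reorders the rows while keeping the original index labels:

```python
import pandas as pd

df_songs = pd.DataFrame({
    "Name of track": ["Roar", "Dark Horse", "Blank Space"],
    "Duration of track": [269, 225, 272],
    "Singer name": ["Katy Perry", "Katy Perry", "Taylor Swift"],
})

# Rows are reordered by duration; ascending=True is the default
sorted_df = df_songs.sort_values(by="Duration of track")
print(list(sorted_df["Name of track"]))  # ['Dark Horse', 'Roar', 'Blank Space']
```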

Null and duplicate handling in a DataFrame

In real-world datasets, we can expect some missing or inconsistent values in the rows or columns of our DataFrame. Let's create dummy data with some None values.

df3 = df.replace({"Taylor Swift":None})
'''
### here we create a dataset with None value
###output
  Name of track  Duration of track  Singer name
0          Roar       269            Katy Perry
1    Dark Horse       225            Katy Perry
2   Blank Space       272               None
'''

Now let's see some important functions that can inform us about these missing values and provide some remedies.

  • df.isnull().sum() → It tells us the count of null values in each column of the dataset

    df3.isnull().sum()
    '''
    ###output
    Name of track         0
    Duration of track     0
    Singer name           1
    dtype: int64
    '''
  • df.dropna(subset=["col_name"]) → This command drops all rows containing null values if subset is not specified. If a subset is specified, it drops only the rows that have a None value in one of the specified columns.

    df3.dropna()
    '''
    ###output
    Name of track  Duration of track  Singer name
    0          Roar       269            Katy Perry
    1    Dark Horse       225            Katy Perry
    '''
  • df.drop_duplicates() → This function drops duplicate rows from the DataFrame. It is beneficial while analyzing data or building a model, as it prevents us from giving extra importance to data points merely because they are duplicated.
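A minimal sketch of drop_duplicates, using a made-up frame with one exact repeated row:

```python
import pandas as pd

# A frame with one exact duplicate row ("Roar" appears twice)
df_dup = pd.DataFrame({
    "Name of track": ["Roar", "Roar", "Dark Horse"],
    "Singer name": ["Katy Perry", "Katy Perry", "Katy Perry"],
})

# Only the first occurrence of each duplicated row is kept
deduped = df_dup.drop_duplicates()
print(len(deduped))  # 2
```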

Please note that the above methods do not change the existing DataFrame; they return a new DataFrame with the changes applied. To make the changes happen in the same DataFrame, we can pass "inplace=True", which many Pandas methods accept.
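A small sketch of the difference, using a hypothetical one-column frame: without inplace the original is untouched, while with inplace=True the method mutates the frame and returns None.

```python
import pandas as pd

df_nulls = pd.DataFrame({"Singer name": ["Katy Perry", None]})

# Without inplace, dropna returns a new DataFrame; df_nulls keeps its null row
returned = df_nulls.dropna()
print(len(df_nulls), len(returned))   # 2 1

# With inplace=True, the same DataFrame is modified and None is returned
result = df_nulls.dropna(inplace=True)
print(result is None, len(df_nulls))  # True 1
```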

Concatenating two different DataFrames

Here we want to concatenate two different DataFrames. This is useful when data is spread across separate files, but we want it in a single data structure.

### here we split a single dataframe into two using slicing and then concatenate them back
df4 = df[0:1]
df5 = df[1:3].reset_index()
df5
'''
  Name of track  Duration of track    Singer name
0    Dark Horse                 225    Katy Perry
1   Blank Space                 272  Taylor Swift
df4
  Name of track  Duration of track  Singer name
0          Roar                 269  Katy Perry
'''

pd.concat([dataframe1,dataframe2],ignore_index=True) → It helps us concatenate two DataFrames having the same columns. 

pd.concat([df4,df5],ignore_index=True)
'''
###output
  Name of track  Duration of track    Singer name
0          Roar                 269    Katy Perry
1    Dark Horse                 225    Katy Perry
2   Blank Space                 272  Taylor Swift
'''

Also, 'ignore_index=True' is used so the original index values do not survive into the final DataFrame. Otherwise, our new DataFrame would have two or more rows with the same index number.

pd.concat([df4,df5])
'''
###output
  Name of track  Duration of track    Singer name  index
0          Roar                 269    Katy Perry    NaN
0    Dark Horse                 225    Katy Perry    1.0
1   Blank Space                 272  Taylor Swift    2.0

'''

Merging DataFrames

The pd.merge() function helps join two DataFrames based on a common column.

###please note that df4 and df5 are defined in previous section
pd.merge(df4,df5,on="Singer name")
'''
###output; here the common column is "Singer name"
     Name of track_x  Duration of track _x Singer name   Name of track_y  Duration of track _y
0           Roar            269             Katy Perry      Dark Horse            225
'''
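By default, pd.merge performs an inner join, keeping only keys present in both frames; the how parameter changes this. A sketch using frames like df4 and df5, rebuilt inline here so it runs on its own:

```python
import pandas as pd

df4 = pd.DataFrame({"Name of track": ["Roar"],
                    "Singer name": ["Katy Perry"]})
df5 = pd.DataFrame({"Name of track": ["Dark Horse", "Blank Space"],
                    "Singer name": ["Katy Perry", "Taylor Swift"]})

# Default inner join: only singers present in BOTH frames survive
inner = pd.merge(df4, df5, on="Singer name")
print(len(inner))  # 1

# how="outer" also keeps Taylor Swift, who appears only in df5
outer = pd.merge(df4, df5, on="Singer name", how="outer")
print(len(outer))  # 2
```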

How to use the Lambda function in a DataFrame?

Here we apply a lambda function to a DataFrame column using the apply function, as shown in the example. We use a lambda function when we want to operate on each record of a column.

df["Singer name"] = df["Singer name"].apply(lambda x : x[:4])
df
'''
##output
  Name of track  Duration of track  Singer name
0          Roar                 269        Katy
1    Dark Horse                 225        Katy
2   Blank Space                 272        Tayl
'''

Above, we used a lambda function via the apply function to shorten singer names to their first four characters. Also note that we reassign the output to the singer column of the original DataFrame; otherwise, apply just returns a Series, and nothing changes in the original DataFrame.
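For numeric columns, the same per-record idea can also be expressed with vectorized arithmetic, which is usually preferred over apply for speed. A sketch converting track durations from seconds to whole minutes (a made-up transformation for illustration):

```python
import pandas as pd

df_durations = pd.DataFrame({"Duration of track": [269, 225, 272]})

# Per-element lambda: round each duration (in seconds) to whole minutes
via_apply = df_durations["Duration of track"].apply(lambda s: round(s / 60))

# The vectorized equivalent operates on the whole column at once
via_vector = (df_durations["Duration of track"] / 60).round().astype(int)

print(list(via_apply))  # [4, 4, 5]
```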

That's all for the basics about Pandas, but before closing this session, we want to answer the most important question, i.e.,

When to use a Python list, a NumPy ndarray, or a Pandas DataFrame?

We need to know that different data structures have different speeds. For simple element-by-element operations, the time taken is generally ordered as: list < NumPy ndarray < DataFrame. Now the question comes to mind: if the list is the fastest, why do we even need a DataFrame or an ndarray? Lists are not convenient for mathematical processing, so we use ndarrays, which also make large vectorized numerical operations much faster. If we want to merge data from multiple datasets or read data from Excel or HDF5 files, we use a DataFrame. Each data structure has its benefits and uses, but if we want a one-stop Python library for analyzing data, it is Pandas.

Conclusion

We can say that Pandas is a boon for Python developers, as it helps us visualize, explore, and clean data better. In today's world, Pandas has become essential for developers, whether for building models or for data engineering involving massive amounts of data. In this blog, we covered the basics of Pandas, including handling null values, performing basic operations, and concatenating two DataFrames. We hope you enjoyed it.

Enjoy Learning!


© 2022 Code Algorithms Pvt. Ltd.

All rights reserved.