Introduction to Databases for Data Science

With the increase in data generation, data has become an essential source of information for organizations. Companies collect as much data as possible so as not to miss any valuable information, which has pushed storage devices to their limits. This calls for a more reliable and efficient way of storing data.

This is where databases come into play, providing a way to store, manage, and scale large amounts of data. This data is then used for analysis and to gain meaningful insights. In this blog, we will look at what databases are, how data science relies on them, and what advantages they offer.

Key takeaways from this blog

What is a database?
Why do we need a database in data science?
A case study on the evolution of Supermarkets.
Benefits of using a database.

What is a Database?

A database is an organized collection of data, stored in a way that makes it easy to access and manage. You can think of a database as a large electronic filing system that stores data according to a pre-defined format, or data model. The data model outlines the structure of the database and the rules for organizing and maintaining it.

To perform actions on the data stored in a database, we need a Database Management System (DBMS). A DBMS is the software that sits between the database and the end user, providing a platform to create, modify, and retrieve data. There are many different database management systems available, depending on the type of database being used.
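To make "create, modify, and retrieve" concrete, here is a minimal sketch using SQLite, a small relational DBMS that ships with Python through the sqlite3 module. The products table and its columns are hypothetical and chosen only for illustration.

```python
import sqlite3

# An in-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Create: define a table and add rows (table and column names are hypothetical).
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products (name, price) VALUES (?, ?)",
    [("rice", 2.5), ("milk", 1.2)],
)

# Modify: update an existing row.
conn.execute("UPDATE products SET price = ? WHERE name = ?", (1.5, "milk"))

# Retrieve: read the data back, sorted by name.
rows = conn.execute("SELECT name, price FROM products ORDER BY name").fetchall()
print(rows)
```

The same three kinds of operation exist in every DBMS, although the syntax and the client library differ from one system to another.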

For example, relational databases, which store data in tables, can be managed by database management systems such as MySQL, Oracle, and IBM Db2. Relational databases are among the most commonly used in data science, as much of the data we work with comes in a tabular format. Non-relational databases, which store data in forms such as documents, key-value pairs, column families, or graphs, can be managed by database management systems like MongoDB and Cassandra.

What is the difference between relational and non-relational databases?

In the next blog, we will look at the types of databases and how to select the one best suited to our data science projects. But first, let's go through why data science needs databases and what benefits they bring.

Why do we need a database in data science?

Research from 2020 estimated that an incredible three quintillion bytes of data are generated every day, and that about 65 zettabytes of data were created globally. This figure is expected to reach 181 zettabytes by 2025.

Before database systems appeared in the 1960s, data was kept in paper records and simple files, and even today many people rely on spreadsheets such as Excel to store and manage data. However, without a structured method of storing it, data at this scale would be of little use. This is why databases were introduced to manage and store large amounts of data.

There are two key reasons why databases have become so popular in recent years:

The rapid increase in data generation
The dependence of data science on data

To better understand the importance of databases in our daily lives, let's look at how supermarkets have evolved.

Case study: databases in supermarkets

Earlier, shopping was not a very enjoyable experience as shops often lacked adequate space and customers had to wait or rely on the shopkeeper to find the items they were looking for. The introduction of supermarkets, however, changed that, as they made shopping much more pleasant by displaying all the products in a large space and making them easily accessible to customers. As the number of products and customers in supermarkets increased, the need for a database system to keep track of all the purchases became critical.

After the rise of malls across the country, the 2008 recession hit, putting many of them on the brink of closing down. However, those that managed to survive realized the importance of incorporating new technologies like AI, Machine Learning, and Data Science in order to implement more structured and organized management strategies.

Let's look at some examples:

Data Science plays a crucial role in determining the products that should be placed in which section of the supermarket, the shelf life of each product, and the required quantity of products during different seasons.
By analyzing sales data from different supermarket branches, a supermarket owner can identify which products need to be stocked in larger quantities and in which months sales are highest.
An analysis of the sales data based on the types of products in the store revealed that fashion accessories, food and beverages, and home and lifestyle products were the most popular. Having a well-stocked inventory of these items is crucial for maintaining customer traffic.
To determine the months with the heaviest customer traffic, gross income was plotted against the month. The analysis showed that sales are highest in January and February, due to New Year offers, and then rise and fall every 2-3 months. This is because most customers tend to make bulk purchases every 3 months. Based on these insights, the supermarket owner could be advised to bring in more stock for these high-traffic months.

Reference: https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales
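The kind of analysis described above can be sketched as two SQL aggregations. This is only an illustration: the table layout, the column names, and the four rows below are made up for the example and are not taken from the Kaggle dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_date TEXT, branch TEXT, product_line TEXT, gross_income REAL)"
)
# Made-up rows for illustration only; these are not real sales figures.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("2019-01-05", "A", "Food and beverages", 30.0),
        ("2019-01-20", "B", "Fashion accessories", 20.0),
        ("2019-02-11", "A", "Home and lifestyle", 25.0),
        ("2019-03-02", "B", "Food and beverages", 10.0),
    ],
)

# Total gross income per month (strftime extracts 'YYYY-MM' from the ISO date).
income_by_month = conn.execute(
    "SELECT strftime('%Y-%m', sale_date) AS month, SUM(gross_income) "
    "FROM sales GROUP BY month ORDER BY month"
).fetchall()

# Total gross income per product line, highest first.
income_by_product = conn.execute(
    "SELECT product_line, SUM(gross_income) AS total "
    "FROM sales GROUP BY product_line ORDER BY total DESC, product_line"
).fetchall()

print(income_by_month)
print(income_by_product)
```

Because the grouping and summing happen inside the database, only the small summary tables need to be pulled into the analysis environment for plotting.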

Data science is used extensively for these kinds of analyses, which help businesses grow. The success of these analyses depends on the data. Most importantly, this data should be easy to access, fast to retrieve, and structured. A database fulfils all these requirements.

Benefits of using a database in data science

Apart from providing storage space for data, databases offer several other benefits:

Reliability
Multiple user access
Reduced data redundancy
Data Scalability

Databases are Reliable

Misuse or inadequate protection of personal information can lead to financial fraud and identity theft, so safeguarding user data is crucial. Databases are a trustworthy place to store this data because they follow a predefined data model or schema, which can be designed to comply with the laws on collecting and protecting data.

This helps keep personal information safe and reduces the chances of collecting irrelevant or unnecessary data. Additionally, databases provide the ability to restore data if it is accidentally deleted or erased, which adds to the reliability of the stored data.

Let’s take an example of an Employee-project database to get an idea of data schemas.

Employee project mapping example to show the use of database

A schema is the logical representation of how data is stored in the database. It defines the structure of the database, including tables, relationships between tables, data types for columns, primary keys, and foreign keys. The data type defines what kind of data can be stored in a particular column, such as an integer, string, or date/time. The primary key is a unique identifier for each record in a table, and a foreign key is a column that connects two tables by referencing the primary key of another table.

For example, in the schema illustrated above, there are three tables: Employee, Department, and Project. Each table is identified by its own primary key and the data type of each column is specified.
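A schema like this can be written down as SQL table definitions. The sketch below uses SQLite; the table names come from the example, but every column name and sample value is an assumption made for illustration, since the exact columns of the original diagram are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Column names below are hypothetical.
conn.executescript("""
CREATE TABLE Department (
    dept_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE Project (
    project_id INTEGER PRIMARY KEY,
    title      TEXT NOT NULL
);
CREATE TABLE Employee (
    emp_id     INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    hired_on   TEXT,
    dept_id    INTEGER REFERENCES Department(dept_id),
    project_id INTEGER REFERENCES Project(project_id)
);
""")

conn.execute("INSERT INTO Department VALUES (1, 'Analytics')")
conn.execute("INSERT INTO Project VALUES (10, 'Sales forecast')")
conn.execute("INSERT INTO Employee VALUES (100, 'Asha', '2023-01-15', 1, 10)")

# An employee pointing at a department that does not exist is rejected.
try:
    conn.execute("INSERT INTO Employee VALUES (101, 'Ben', '2023-02-01', 99, 10)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

# The foreign keys let us join the three tables back together.
employee_rows = conn.execute(
    "SELECT e.name, d.name, p.title FROM Employee e "
    "JOIN Department d ON e.dept_id = d.dept_id "
    "JOIN Project p ON e.project_id = p.project_id"
).fetchall()
print(fk_rejected, employee_rows)
```

Here each employee belongs to one department and one project, which is a simplification; a many-to-many mapping between employees and projects would need a separate linking table.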

Databases can handle multiple users

After creating the schema for the data, the next challenge is to handle access by multiple users. A stock exchange company is a good example. The company kept track of the daily changes in stock prices and decided to create a model to predict the next day's low and high prices. Different teams were assigned to create these models, and the one with the highest accuracy would be deployed.

The dynamic nature of stock prices, which change daily, presented a challenge. To build the models, each team needed access to the database to get the latest updates. Instead of assigning one employee to process these requests, each team was given access to the database using its own unique access key.

This is an example of how databases can support multiple users.

Databases Reduce Data Redundancy

Data redundancy refers to the repetition of data, which is inefficient for businesses because it wastes storage space and can weaken machine learning models. Companies often collect data through sensors or scraping tools, but these tools can sometimes produce redundant, incomplete, or irrelevant data.

For example, consider a car equipped with a camera and sensors for the accelerator and brakes. If the accelerator sensor stops functioning and no longer records data when the accelerator is pressed, while the camera is still recording, a database that requires a value in the accelerator column (a NOT NULL constraint) will detect the problem and refuse the incomplete record. The resulting error can then be used to generate an alert.
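The car example can be sketched with two constraints in SQLite: NOT NULL rejects a reading with a missing sensor value, and a primary key on the timestamp rejects an exact repeat. The table name, the column names, and the record helper are hypothetical, and printing a message stands in for a real alerting system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE sensor_readings (
    recorded_at TEXT PRIMARY KEY,  -- one reading per timestamp, so repeats are rejected
    accelerator REAL NOT NULL,     -- a missing sensor value is not allowed
    brake       REAL NOT NULL
)
""")

def record(recorded_at, accelerator, brake):
    """Try to store one reading; return True if stored, False if rejected."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO sensor_readings VALUES (?, ?, ?)",
                (recorded_at, accelerator, brake),
            )
        return True
    except sqlite3.IntegrityError as err:
        print(f"ALERT: reading at {recorded_at} rejected ({err})")
        return False

results = [
    record("2024-01-01T10:00:00", 0.4, 0.0),   # valid reading
    record("2024-01-01T10:00:01", None, 0.0),  # accelerator sensor failed
    record("2024-01-01T10:00:00", 0.4, 0.0),   # duplicate of the first reading
]
stored = conn.execute("SELECT COUNT(*) FROM sensor_readings").fetchone()[0]
print(results, stored)
```

Only the first reading ends up in the table; the other two are turned away by the database itself rather than by application code.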

How do databases handle the data redundancy problem?

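In some cases, companies introduce data redundancy intentionally, for example to keep data available when hardware fails or, when the shared data is small, to let teams work on separate copies. Hadoop's file system (HDFS), for instance, deliberately stores several replicas of each block of data on different machines.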

Data Scalability

So far, we have seen how databases help with data reliability, multiple users, and data redundancy. However, as an organization grows, its data also grows, and sooner or later it reaches the storage limit. To overcome this challenge, databases are scalable, which means their capacity can be increased or decreased to handle changes in data volume and user traffic.

For example, popular social media platforms like Instagram, Facebook, and Twitter are experiencing an increasing number of users, resulting in the following issues:

A growing number of user requests can overload the server's CPU, so that requests are delayed or cannot be processed in a timely manner.
The database server can become overloaded and run out of storage space.

The problem of increased requests can be addressed through caching and by optimizing database queries. Caching involves keeping a copy of frequently accessed data in a smaller, faster storage layer called a cache, typically in memory. This allows for faster retrieval of that data in the future, without having to access the central database.
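As a minimal sketch of caching, the example below puts Python's built-in functools.lru_cache in front of a SQLite query. The profiles table and the get_profile_name function are hypothetical, and the counter exists only to show how often the database is actually read.

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (user_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO profiles VALUES (1, 'Asha')")

stats = {"db_reads": 0}  # counts how many times we really hit the database

@lru_cache(maxsize=1024)  # keep up to 1024 recent results in memory
def get_profile_name(user_id):
    stats["db_reads"] += 1
    row = conn.execute(
        "SELECT name FROM profiles WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None

first = get_profile_name(1)   # cache miss: runs the query
second = get_profile_name(1)  # cache hit: answered from memory
print(first, second, stats["db_reads"])
```

The trade-off is staleness: if the row changes in the database, the cached value stays the same until it is evicted or cleared, for example with get_profile_name.cache_clear(). Production systems often use a separate cache service instead of an in-process cache, but the idea is the same.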

To solve the issue of database overload, two techniques can be used: vertical scaling and horizontal scaling. Vertical scaling involves increasing the capacity of a single server by adding more storage space, memory, and CPUs while keeping the same logical schema and database infrastructure. Horizontal scaling, on the other hand, expands capacity by adding more servers (nodes) that work in parallel and share the data and the requests, thus reducing the load on any single server.
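One common way to scale horizontally is sharding, where each node stores only part of the data. The sketch below is an assumption-laden toy: three in-memory SQLite databases stand in for three separate servers, a hash of the username decides which one holds a row, and all names (users, shard_for, save_user, load_user) are hypothetical.

```python
import sqlite3
import zlib

NUM_SHARDS = 3
# Each in-memory database stands in for a separate database server (node).
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for shard in shards:
    shard.execute("CREATE TABLE users (username TEXT PRIMARY KEY, followers INTEGER)")

def shard_for(username):
    """Pick a node from a stable hash of the key, so lookups find the same node."""
    return shards[zlib.crc32(username.encode("utf-8")) % NUM_SHARDS]

def save_user(username, followers):
    shard = shard_for(username)
    with shard:  # commit the insert on that node
        shard.execute("INSERT INTO users VALUES (?, ?)", (username, followers))

def load_user(username):
    return shard_for(username).execute(
        "SELECT username, followers FROM users WHERE username = ?", (username,)
    ).fetchone()

for i in range(9):
    save_user(f"user{i}", i * 10)

# Every row lives on exactly one node; together the nodes hold all the data.
total_rows = sum(s.execute("SELECT COUNT(*) FROM users").fetchone()[0] for s in shards)
print(total_rows, load_user("user4"))
```

A simple modulo scheme like this makes it costly to change the number of nodes later, because most keys would map to a different node. Real systems use more careful placement strategies, which are beyond the scope of this blog.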

Conclusion

In this blog, we looked at the importance of databases in data science through the example of supermarkets. We also highlighted the benefits of using a database system and the challenges that come with it as data and users grow.

Enjoy learning, Enjoy data science!