The gaming industry is growing rapidly, and many tech giants are investing heavily in it. Google is developing game development kits and cloud platforms for building and playing games online. Similarly, Facebook and Microsoft are investing heavily in VR games. These investments give a sense of how big the gaming industry has become. Recent user engagement also shows that a significant number of people love competitive games.
But this engagement depends heavily on a fundamental question: "How fair is the game for all players?" If a competitive game is biased, genuine players get disappointed and stop playing. Cheaters try to modify game controls to win competitive matches, which is unfair to everyone else. Hence, companies take many measures to make their games as fair as possible. One such step is to detect cheaters or hackers and eliminate them from the competition.
Like all other industries, the gaming industry collects players' data to analyze their performance. Let's understand this through the example of PUBG (PlayerUnknown's Battlegrounds), one of the most famous and most played games on the internet. On average, PUBG has 30 million daily active users. It collects data from players and categorizes them into different segments. But it is impossible to categorize 30 million players manually. That's where machine learning comes into the picture. ML allows us to analyze large volumes of data, draw meaningful results, and separate cheaters from genuine players.
Let's start without any further delay.
In October 2021, PUBG permanently banned 2.5 million accounts from Battlegrounds Mobile India (BGMI) and temporarily banned a further 706,319 accounts. According to a report from Tencent (the publisher of PUBG Mobile), 46 percent of banned accounts were caught using auto-aim and x-ray vision hacks, 18 percent were using speed hacks, 20 percent were modifying their area damage and character model, and 16 percent were banned for other reasons.
Machine learning is the answer to detecting cheaters at this scale. The security team of PUBG constantly monitors large volumes of data, with over 10 million reports per day on average. They identify and remove hackers by scanning for suspect software and modified game data. They also look for impossible events during a match: for example, a player landing a shot from a considerable distance and still dealing damage to the enemy, or a player killing enemies without moving a single step.
PUBG also provides an option to report players in a match. You click the report button, and a new window pops up where you can select the type of misbehavior or write a message manually. The team reviews your report and notifies you whether or not the player has been banned. This is not the end: PUBG uses many more techniques to provide fair gameplay to its users. After this section, we will start building our model by defining our problem statement and classifying our machine learning model on the five bases we have already discussed in this blog.
Our problem statement is to detect whether a player in PUBG is genuine or a cheater. We have labeled data, so we will use a supervised learning approach. This is a binary classification problem, as we have two classes: cheater and non-cheater. We will solve this problem using a classical machine learning algorithm, which we will discuss later in this blog. It will be a non-parametric model with non-probabilistic outputs because we predict class labels directly and are not dealing with probability distributions. The last two classifications depend on the algorithm you choose to solve this problem. Now let's proceed to implement the PUBG cheater detection model.
There are two famous PUBG datasets available on Kaggle:
We will use the PUBG placement prediction dataset to build our model. We need to download the train_V2.csv file in zip format, unzip it, and read it using pandas.read_csv().
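The loading step can be sketched as follows. This is a minimal illustration, not the exact code behind this blog: the helper name load_pubg_data is hypothetical, and a tiny in-memory CSV with made-up values stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Tiny stand-in for the real file: a few of its columns with made-up values.
sample_csv = io.StringIO(
    "Id,kills,walkDistance,rideDistance,swimDistance\n"
    "a1,0,244.8,0.0,0.0\n"
    "b2,3,1434.0,45.1,11.0\n"
)


def load_pubg_data(source):
    """Read the PUBG CSV (a file path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


# On the real data, pass the path of the unzipped CSV instead of sample_csv.
pubg_data = load_pubg_data(sample_csv)
print(pubg_data.shape)
```

On the full dataset, the same call with the file path should return the DataFrame whose shape is discussed next.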
This dataset has the shape (4446966, 29), which means that 29 attributes are recorded for 4,446,966 player entries across different matches. Let's first understand these 29 features in detail.
Once we get insight into all these features, it is essential to understand how they help us identify the cheaters.
Using the features described above, we need to list some impossible events and spot the players involved in them. After categorizing these impossible events, we will create a separate cheaters_data to hold the potential cheaters from our actual data and remove them from the existing dataset.
If you have played any Battle Royale game, you know that killing an enemy without taking a single step is nearly impossible. Having 'kills' > 0 with 'totalDistance' = 0 can be considered an impossible event, and players involved in this event can be potential cheaters.
# Total distance covered by vehicle, on foot, and by swimming
pubg_data['totalDistance'] = pubg_data['rideDistance'] + pubg_data['walkDistance'] + pubg_data['swimDistance']
# Flag players who got kills without moving at all
pubg_data['potential cheaters'] = (pubg_data['kills'] > 0) & (pubg_data['totalDistance'] == 0)
# Copy the flagged rows aside, then remove them from the main dataset
cheaters_data = pubg_data[pubg_data['potential cheaters']]
pubg_data.drop(pubg_data[pubg_data['potential cheaters']].index, inplace=True)
Most players fall in the 0–15 kills range. It is very unlikely that someone breaks the world record of 59 kills in a match. If someone does, it is safer to put them in the potential cheaters category.
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,4))
sns.countplot(data=pubg_data, x='kills').set_title('Kills')
plt.show()
You might be thinking: What if someone genuinely breaks the world record? Will PUBG consider them a cheater? Let's relate this to real life: we don't judge people on just one action, right? We consider their past actions and behavior. Similarly, PUBG also considers a player's past actions, report history, in-game behavior, and more before banning an account.
# Flag players with more kills than the 59-kill record
pubg_data['potential cheaters'] = pubg_data['kills'] > 59
cheaters_data = pd.concat([cheaters_data, pubg_data[pubg_data['potential cheaters']]])
pubg_data.drop(pubg_data[pubg_data['potential cheaters']].index, inplace=True)
Killing an enemy from a distance of 1000 m or more sounds insane unless you are an ultra pro max player, or you knock an enemy down and then drive far away in a vehicle before they die. Both cases are unlikely, so we consider these players potential cheaters.
plt.figure(figsize=(12,4))
# histplot replaces the deprecated distplot
sns.histplot(pubg_data['longestKill'], kde=True, color='orange')
plt.show()
# Flag players whose longest kill was 1000 m or more
pubg_data['potential cheaters'] = pubg_data['longestKill'] >= 1000
cheaters_data = pd.concat([cheaters_data, pubg_data[pubg_data['potential cheaters']]])
pubg_data.drop(pubg_data[pubg_data['potential cheaters']].index, inplace=True)
In a match, a player acquires 0–10 weapons on average. Players acquiring 50 or more weapons are very likely cheaters or hackers, so it's better to set them aside. Also, if a player kills more than ten enemies without acquiring a single weapon, something is fishy, and we put them in the potential cheaters category as well.
plt.figure(figsize=(12,4))
# histplot replaces the deprecated distplot
sns.histplot(pubg_data['weaponsAcquired'], bins=10)
plt.show()
# Flag players who acquired 50 or more weapons
pubg_data['potential cheaters'] = pubg_data['weaponsAcquired'] >= 50
cheaters_data = pd.concat([cheaters_data, pubg_data[pubg_data['potential cheaters']]])
pubg_data.drop(pubg_data[pubg_data['potential cheaters']].index, inplace=True)
# Flag players with more than 10 kills but no weapon acquired
pubg_data['potential cheaters'] = (pubg_data['weaponsAcquired'] == 0) & (pubg_data['kills'] > 10)
cheaters_data = pd.concat([cheaters_data, pubg_data[pubg_data['potential cheaters']]])
pubg_data.drop(pubg_data[pubg_data['potential cheaters']].index, inplace=True)
Most players use fewer than ten healing items in a match, but some players use 30 or more. Isn't it fishy? We shouldn't take risks, so we mark them as potential cheaters.
plt.figure(figsize=(12,4))
# histplot replaces the deprecated distplot
sns.histplot(pubg_data['heals'], bins=10)
plt.show()
# Flag players who used 30 or more healing items
pubg_data['potential cheaters'] = pubg_data['heals'] >= 30
cheaters_data = pd.concat([cheaters_data, pubg_data[pubg_data['potential cheaters']]])
pubg_data.drop(pubg_data[pubg_data['potential cheaters']].index, inplace=True)
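The rules above repeat the same flag, copy, and drop pattern. One possible way to express them in a single pass is sketched below. The helper name split_cheaters and the toy rows are hypothetical; the thresholds are the ones used in the rules above.

```python
import pandas as pd


def split_cheaters(data):
    """Return (clean_rows, flagged_rows) using the impossible-event rules."""
    total_distance = data["rideDistance"] + data["walkDistance"] + data["swimDistance"]
    flagged = (
        ((data["kills"] > 0) & (total_distance == 0))               # kills without moving
        | (data["kills"] > 59)                                       # more kills than the record
        | (data["longestKill"] >= 1000)                              # kill from 1000 m or more
        | (data["weaponsAcquired"] >= 50)                            # implausible weapon count
        | ((data["weaponsAcquired"] == 0) & (data["kills"] > 10))    # many kills, no weapon
        | (data["heals"] >= 30)                                      # implausible heal count
    )
    return data[~flagged], data[flagged]


# Toy rows (made up): one ordinary player and two rule violations.
toy = pd.DataFrame({
    "kills": [2, 5, 1],
    "rideDistance": [0.0, 0.0, 100.0],
    "walkDistance": [350.0, 0.0, 900.0],
    "swimDistance": [0.0, 0.0, 0.0],
    "longestKill": [40.0, 20.0, 1200.0],
    "weaponsAcquired": [4, 3, 5],
    "heals": [1, 0, 2],
})
clean, cheaters = split_cheaters(toy)
print(len(clean), len(cheaters))
```

Because every rule only removes rows, applying the rules one after another or combining them with a logical OR flags the same set of players.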
We have identified the impossible events and made a separate dataset out of them. But we still have 29 features, which is a lot. Models built using all these features will be heavy and impractical. So let's drop some less important features.
In this section, we will apply different feature selection techniques and try to select the best set of features for our model. One such technique is to check whether two attributes are correlated. We want to ensure that our features are not highly correlated, and to check that, we can visualize the correlation matrix.
A correlation matrix is a table depicting correlation coefficients of all possible pairs of attributes present in the dataset. It is a very intuitive method to visualize the dependencies of features on each other.
plt.figure(figsize=[25,12])
# numeric_only=True skips text columns such as Id and matchType
sns.heatmap(pubg_data.corr(numeric_only=True), annot=True, cmap="BuPu")
plt.show()
We can see that "winPoints & killPoints" and "kills & damageDealt" are strongly positively correlated. When killPoints rises, winPoints tends to rise with it, so the two carry largely the same information and we can drop one of them. This method of feature selection is called filtering. You can read more on this blog.
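Reading strongly correlated pairs off a large heatmap by eye is error-prone, so the filtering step can also be done programmatically. The sketch below is an illustration under assumptions: the helper name correlated_pairs, the 0.8 threshold, and the toy values are all hypothetical choices, not values from the dataset.

```python
import pandas as pd


def correlated_pairs(data, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = data.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            value = float(corr.loc[first, second])
            if value > threshold:
                pairs.append((first, second, round(value, 3)))
    return pairs


# Toy data (made up): damageDealt grows with kills, heals does not.
toy = pd.DataFrame({
    "kills": [0, 1, 2, 3, 4],
    "damageDealt": [0.0, 100.0, 210.0, 290.0, 400.0],
    "heals": [3, 0, 2, 0, 1],
})
pairs = correlated_pairs(toy)
print(pairs)
```

For each reported pair, one of the two features is a candidate for removal.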
We can use totalDistance to store "swimDistance + rideDistance + walkDistance" for spotting cheaters and drop the three individual distances from the final feature set. We also eliminate the categorical features, as they are of little use in predicting the output.
Our feature engineering is done, and it's time to decide which machine learning algorithm we should use to build our model. As already stated, this is a classification problem; some well-known classification algorithms are logistic regression, SVM, KNN, and Random Forest.
In this article, we will use the Random Forest algorithm to train our model. This algorithm needs minimal data cleaning and gives strong results while also measuring the relative importance of each feature for the prediction.
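The feature selection described so far can be sketched as below. The helper name select_features and the toy rows are hypothetical. Two assumptions are made: winPoints is the column dropped from the correlated pair (the text leaves the choice open), and Id, groupId, matchId, and matchType are treated as the categorical columns.

```python
import pandas as pd

# Toy rows (made up) using column names from the dataset.
toy = pd.DataFrame({
    "Id": ["a1", "b2"],
    "groupId": ["g1", "g2"],
    "matchId": ["m1", "m1"],
    "matchType": ["squad", "solo"],
    "kills": [1, 4],
    "killPoints": [1000, 1200],
    "winPoints": [1500, 1480],
    "rideDistance": [0.0, 500.0],
    "walkDistance": [300.0, 1200.0],
    "swimDistance": [0.0, 20.0],
})


def select_features(data):
    """Add totalDistance, then drop redundant and categorical columns."""
    out = data.copy()
    out["totalDistance"] = out["rideDistance"] + out["walkDistance"] + out["swimDistance"]
    redundant = ["rideDistance", "walkDistance", "swimDistance", "winPoints"]
    categorical = ["Id", "groupId", "matchId", "matchType"]
    return out.drop(columns=redundant + categorical)


features = select_features(toy)
print(list(features.columns))
```

Working on a copy keeps the original DataFrame intact, which helps when you want to try a different subset of features later.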
Random Forest is a supervised learning algorithm that uses an ensemble learning approach for regression and classification. It builds multiple decision trees and merges their predictions for a more accurate result. This process takes place in three steps:
In bootstrapping, if we have 'n' data samples and 'm' decision trees, each tree receives its own set of 'n' samples drawn at random from the data with replacement, so the same sample can appear more than once in a tree's training set.
Each decision tree is trained independently and in parallel.
In aggregation, we use majority voting for classification: the class predicted by most trees wins. For regression, we take the average of the outputs predicted by the individual decision trees for a more accurate and stable prediction.
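The bootstrapping and aggregation steps can be illustrated in plain Python. This is a sketch of the idea only, not how scikit-learn implements it; the function names are hypothetical, and the tree training itself is left out.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one tree's training set)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Classification: return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Regression: return the mean of the trees' outputs."""
    return sum(predictions) / len(predictions)


rng = random.Random(0)
rows = list(range(10))                                       # stand-in for n = 10 samples
samples = [bootstrap_sample(rows, rng) for _ in range(3)]    # m = 3 trees
print([len(s) for s in samples])                             # every tree gets n rows

print(majority_vote(["cheater", "genuine", "cheater"]))
print(average([0.2, 0.4, 0.9]))
```

Since sampling is done with replacement, each tree sees some rows several times and misses others, which is what makes the trees differ from one another.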
Before building the model, we need to split the dataset into training and testing samples. We will use the test set to evaluate the model's performance later.
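from sklearn.model_selection import train_test_split

# Assumption: put the flagged rows back so that both classes are present
pubg_data = pd.concat([pubg_data, cheaters_data])
target = pubg_data['potential cheaters']
features = pubg_data.drop('potential cheaters', axis=1)
x_train, x_test, y_train, y_test = train_test_split(features, target, train_size=0.3, random_state=0)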
Let's use scikit-learn's random forest to build the model on our training data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

model = RandomForestClassifier(n_estimators=40, min_samples_leaf=3, max_features='sqrt')
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
y_predtrain = model.predict(x_train)
print("test data accuracy: ", accuracy_score(y_test, y_pred))
print("test data precision score: ", precision_score(y_test, y_pred))
print("test data recall score: ", recall_score(y_test, y_pred))
print("test data f1 score: ", f1_score(y_test, y_pred))
print("test data area under curve (auc): ", roc_auc_score(y_test, y_pred))
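A fitted random forest also exposes the relative importance of each feature through its feature_importances_ attribute. The sketch below uses a small synthetic dataset, not the PUBG data: the label is made to depend only on kills, so kills should come out as the dominant feature.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in data (made up), reusing two column names from the dataset.
toy = pd.DataFrame({
    "kills": rng.integers(0, 10, size=200),
    "heals": rng.integers(0, 10, size=200),
})
# Toy label that depends only on kills.
labels = toy["kills"] > 6

toy_model = RandomForestClassifier(n_estimators=20, random_state=0)
toy_model.fit(toy, labels)

# One importance score per feature; the scores sum to 1.
importances = pd.Series(toy_model.feature_importances_, index=toy.columns)
print(importances.sort_values(ascending=False))
```

Running the same two lines on the trained PUBG model (with x_train.columns as the index) would show which of the selected features drive its predictions.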
Now our model is ready, so we must check its performance. There are various evaluation metrics for classification models. If you are not familiar with them, please have a look here. We will evaluate our model on each of them because considering only one metric is often not enough.
A common way to summarize the predictions of a classification model is the confusion matrix, which counts correct and incorrect predictions for each class. Accuracy, precision, recall, and F1 score are all derived from it.
test data accuracy: 0.9999942175660065
test data precision score: 0.9994222992489891
test data recall score: 0.9902690326273612
test data f1 score: 0.9948246118458884
test data area under curve (auc): 0.995134355600319
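The scores above all come from the four cells of the confusion matrix. As a small worked example, the sketch below counts those cells by hand for ten made-up predictions and derives the metrics from them; the function names and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives (True means cheater)."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif not truth and pred:
            fp += 1
        elif truth and not pred:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def summarize(tp, fp, fn, tn):
    """Derive accuracy, precision, recall, and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Ten made-up players: two real cheaters, one caught, one genuine player misflagged.
y_true_toy = [True, True, False, False, False, False, False, False, False, False]
y_pred_toy = [True, False, False, False, False, False, False, False, False, True]

counts = confusion_counts(y_true_toy, y_pred_toy)
scores = summarize(*counts)
print(counts)
print(scores)
```

In this toy case accuracy is 0.8 while precision and recall are only 0.5, which shows why accuracy alone can look flattering when cheaters are a small minority of the players.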
Awesome! We achieved 99.99% accuracy on our test dataset. This is the kind of approach PUBG's security team can use to identify and ban cheaters, improving quality and making the gameplay fairer for all.
In online games, the use of NPCs is widespread. These are non-player characters, or computer-controlled players, that offer users more exciting gameplay. PUBG categorizes them as bots, and we can use the same approach to build a machine learning model to detect them.
Bots are nothing but algorithms interacting with the virtual environment. We can make them more capable by giving them the ability to learn from their mistakes or from historical data. This way, they become more advanced, and the dependency on available online players reduces.
If you mention this project in your resume, these are some possible questions that can be asked in machine learning interviews:
Machine learning has great potential to change the gaming industry. It can enhance the user experience by making games less complex, introducing NPCs, and making games more realistic. Microsoft is working with NVIDIA to make games more realistic by quickly rendering objects without pixel losses. Cheating detection is a widespread problem that every game faces in some manner, and we have discussed one solution in this blog.
Enjoy learning, Enjoy algorithms!