Transfer Price Prediction Model

(Estimates transfer value of a player and Potential Replacements)

CS460 Project : Group 07
Swadeepta Mandal and Manoj Sampath
GitHub Link


Contents:

1. Problem Statement
2. Motivation
3. Introduction
4. Proposed Timeline
5. Datasets
6. Project Proposal Presentation
7. Project Midway
8. Data Acquisition
9. Literature Review
10. Project Midway Presentation
11. Test and Validation of the Model
12. Final Project Presentation
13. References

Project Proposal:


"Transfer Price Prediction and Potential Replacements of a football player in the market" is the problem we will be working on, where we make use of efficient machine learning algorithms to train our model so as to tackle prevailing issues in baseline models.We would be aiming at training with more globally accurate datasets instead of using the routine ratings available.We will also work on improving the algorithms if possible,considering other influencing parameters and on reducing the error function inorder to end up with a precise value on a player in transfer market.With this data in hand,we also aim to design our model that can suggest suitable replacements of the particular player.


In the world of sports, football is not just a game between two teams but a battle of emotions between people, and a driver of wealth in the market. A tiny decision can have a great impact on team play, which indirectly affects the game to an unexpected degree. Being interested in this domain, we were motivated to use this project as an opportunity to train a model that can precisely assess the prices of incoming players in the transfer market, and use those assessments to suggest potential replacements, improving team integrity and gameplay. The model would prove useful in franchising and decision-making, helping clubs strategically obtain desirable outcomes by optimizing their budget and their picks from the transfer market.


The term "transfer market" refers to the arena in which football players are allowed for transfer to clubs.Predicting this transfer market price of a football player would mean predicting the amount of money a club can spend on the player in the transfer market.Bound players are players in bonds and contracts who wouldnt account in the transfer market.Market value differs from transfer value as it focuses on the exact market worth of a given player while transfer value focuses on the players who are available for exchange or replacements in the transfer market.Our model aims to serve this purpose in the transfer market by considering all the heavy factors influencing this transfer value to end up with reliable price predictions and results which can be trusted for replacements.


Data acquisition: (6-15 September)

Literature Review: (16 September-3 October)

Test and Validation of Architecture: (4-17 October)

Fine Tuning: (18-24 October)

Final Report: (25-31 October)


We plan to use BeautifulSoup web scraping to extract datasets from the following websites:

1) TransferMarkt

2) FBref.com

3) footballdatabase.com

4) eu-football.info

We will also filter out noisy data where possible, so that we train our model on reliable and accurate information.
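As a rough illustration of the scraping step (a sketch, not our full pipeline), the snippet below pulls table rows from a page with BeautifulSoup; the URL is a placeholder, and each site's actual HTML structure would need its own parsing logic.

import requests
from bs4 import BeautifulSoup

url = "https://www.transfermarkt.com/"  # placeholder; real player pages differ per site
headers = {"User-Agent": "Mozilla/5.0"}  # some sites reject requests without a browser-like agent

page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, "html.parser")

# collect the text of every table row on the page for later cleaning
rows = []
for table in soup.find_all("table"):
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
# rows can then be filtered and written out to a CSV such as transfer.csv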


Project Midway:


As per our proposed timeline, we have worked on data acquisition and an initial literature review.


Data Acquisition:

In this phase, we put effort into acquiring and extracting the required data from the sites listed in the Datasets section above. The difficulties we faced include gathering the data and dealing with missing values; we adopted strategies to handle both, such as treating missing or placeholder entries as zeros (see the preprocessing code in the final section).


Literature Review:


  1. The Wisdom of Crowds and Transfer Market Values (2021). Dennis Coates, Petr Parshakov. Transfermarkt (TM), one of the most popular sites, uses crowdsourcing to decide the market value of a player. This paper gives us the insight that people generally overestimate the value of players in lower leagues and underestimate the value of players in top leagues. Actual fees for players with time remaining on their contract rise by between £550,000 and £800,000 on average per year of time left. Although the "market value" reported by TM is a biased predictor, it predicts better than the FIFA score or an ELO (head-to-head based) rating.

  2. "Beyond crowd judgments: Data-driven estimation of market value in association football (2017)." Oliver Müller, Alexander Simons, Markus Weinmann. As the model's residuals are likely not independent, which would violate a central assumption of linear regression, multilevel regression was used and some of the factors were specified as random factors and allowed intercept to vary. Also using data analysis will help to predict accurate value for less known result which generally get more biased in crowd based sourcing.

  3. "Football player's performance and market value (2015)." He, Miao & Cachucho, Ricardo & Knobbe, Arno. Players position highly dictates their price tag. Performance of the player matters but in case of the top player correlation between their performance and Market Value is less than usual.

  4. "Money Talks: Team Variables and Player Positions that Most Influence the Market Value of Professional Male Footballers in Europe (2020)." Jose Luis Felipe et al. Teams having better ranking in league have more player from top ranking countries than the lower teams. Teams particates in UCL tends to pay more than market value compare to team who particate in UEl or dont participate in continental championship.


Project Final:


Insight: there are different positions in football, and different attributes matter depending on the position a player plays in. We therefore decided to split players by position and train our model separately for each, for better accuracy.
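A compact way to perform that split (a sketch; our actual code further below filters with an explicit loop) is pandas substring matching on the position column:

# count players matching each position substring in the raw dataset
positions = ['forward', 'goalkeeper', 'midfield', 'defender', 'winger']
for pos in positions:
    subset = datast[datast['pos'].str.lower().str.contains(pos, na=False)]
    print(pos, len(subset))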


We used MLR and decision trees. We ran into what is commonly known as the P >> N problem (very few data points relative to the number of parameters). Generally MLR shouldn't work in these cases, but we dropped a number of less important parameters, and in most cases (especially 'Forward' and 'Goalkeeper') it worked. We went with decision trees where we weren't satisfied with the results of MLR ('Midfielder' and 'Defender/Wing-Back', where we couldn't drop many parameters due to the versatility of these roles). Given that not much prior work tackles this exact problem, we lacked an ML algorithm to compare our model's efficiency against; fortunately, because it is a real-life problem, we had market value to compare with (market value, the crowd's prediction of a player's valuation, is accepted to be one of the best predictors of transfer value), so we used it as a baseline and tried to achieve better results than that.
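The baseline comparison itself is simple. A minimal sketch, using the variable names from the MLR code below and assuming the test split keeps the 'mv' (market value) column:

from sklearn.metrics import mean_absolute_error

# error of the model's predictions vs. error of using market value as the prediction
mae_model = mean_absolute_error(y_test, y_pred_mlr)
mae_market = mean_absolute_error(y_test, x_test['mv'])
print("model MAE:", mae_model, "  market-value MAE:", mae_market)

We count the model as beating the baseline for a position only when mae_model comes in below mae_market.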

This is the initial part of the code, which is common to all the snippets used below. It loads the dataset and converts the non-numerical data/symbols into numerical values.

        
        
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math

path = "transfer.csv"

headernames = ['player', 'pos', 'age', 'mv', 'bfc', 'bfl', 'afc', 'afl',  'tv', 'link', 'country', 'capp', 'cg', 'ca', 'cmin', 'contractyr', 'lmin' , 'Shot total' , 'Shot on Target' , 'Goals Total' , 'Goal  Conceded' , 'Assists' , 'Saves' , 'Passes Total' , 'Key Passes' , 'Dribbles Attempts' , 'Dribble success' , 'Dribble Past' ,'Tackles Total' , 'Blocks' , 'Interception' , 'Duels Total' , 'Duels Won' , 'Fouls Drawn' , 'Fouls Commited' , 'Yellow Card' , 'Second Yellow Card' , 'Straight Red' , 'Penalties Won' , 'Penalties Commited' ,'Penalties Scored' , 'Penalties Missed' , 'Penalties Saved']
datast = pd.read_csv(path, names=headernames)
z = 0
b = "winger"   # position we want to filter players by
print(b)

n = 347        # number of rows in transfer.csv
i = 0

# collect every row whose position string contains b; z counts the matches
dataset = pd.DataFrame(columns=headernames)

while i < n:
    a = datast.iloc[i, 1]   # the 'pos' column
    a = a.lower()

    if b in a:
        dataset.loc[-1] = datast.iloc[i, ]
        dataset.index = dataset.index + 1  # shifting index
        dataset = dataset.sort_index()
        z = z + 1
    i = i + 1
print(z)

dataset = dataset.replace('-', 0)  # '-' marks missing stats on the source sites
dataset.fillna(0, inplace=True)    # treat remaining missing values as zero


# map each unique string in a non-numeric column to a small integer code
def handle_non_numerical_data(dataset):
    columns = dataset.columns.values
    for column in columns:
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]

        if dataset[column].dtype != np.int64 and dataset[column].dtype != np.float64:
            column_contents = dataset[column].values.tolist()
            unique_elements = set(column_contents)
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x+=1

            dataset[column] = list(map(convert_to_int, dataset[column]))

    return dataset

dataset = handle_non_numerical_data(dataset)
        
        

The loop in the above code is an optional one: "b" stores the position we want players from, the loop walks through the database, and "z" ends up holding the number of players found in that position.


Below is the MLR-based code; we also added a temporary loop around it under certain conditions to check its credibility.

        
x = dataset[['pos', 'age', 'mv', 'bfc', 'bfl', 'afc', 'afl', 'country', 'capp', 'cg', 'ca', 'cmin', 'contractyr', 'lmin', 'Shot total' , 'Shot on Target' , 'Goals Total' , 'Goal  Conceded' , 'Assists' , 'Saves' , 'Passes Total' , 'Key Passes' , 'Dribbles Attempts' , 'Dribble success' , 'Dribble Past' ,'Tackles Total' , 'Blocks' , 'Interception' , 'Duels Total' , 'Duels Won' , 'Fouls Drawn' , 'Fouls Commited' , 'Yellow Card' , 'Second Yellow Card' , 'Straight Red' , 'Penalties Won' , 'Penalties Commited' ,'Penalties Scored' , 'Penalties Missed' , 'Penalties Saved']]
y = dataset['tv']

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)

mlr = LinearRegression()  
mlr.fit(x_train, y_train)

print("Intercept: ", mlr.intercept_)
print("Coefficients:", mlr.coef_)
print(list(zip(x, mlr.coef_)))  # pair each feature name with its coefficient

y_pred_mlr= mlr.predict(x_test)

print("Prediction for test set: {}".format(y_pred_mlr))

mlr_diff = pd.DataFrame({'Actual value': y_test, 'Predicted value': y_pred_mlr})
mlr_diff.head()

# Model Evaluation
meanAbErr = metrics.mean_absolute_error(y_test, y_pred_mlr)
meanSqErr = metrics.mean_squared_error(y_test, y_pred_mlr)
rootMeanSqErr = np.sqrt(metrics.mean_squared_error(y_test, y_pred_mlr))
print('R squared: {:.2f}'.format(mlr.score(x, y)*100))
print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)

# predicted vs. actual values, with a dotted y = x reference line
plt.plot(y_test, abs(y_pred_mlr), 'x')
plt.plot(y_test, y_test, linestyle='dotted')
plt.axis('scaled')
plt.show()
        
        

The x parameter in the above code depends on the position we are trying to get results for. The results for each position are below.



For forwards, we got fairly decent results: both the mean absolute error and the root mean squared error were lower than those of the market values.



Goalkeepers' price predictions were quite good, and we got a well-fitted graph, as the number of parameters is reduced in the goalkeeper case.



Defenders' results weren't as good: in some cases the MAE and RMSE were higher than those of the market value.



Midfielders were similar to the defenders' case: the first image shows a good result, but the second one does not.

For wingers, the algorithm didn't find any good fit, hence we decided to go with decision trees.

        
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier  # needed by the training functions below
from sklearn.metrics import accuracy_score       # needed by cal_accuracy below

x = dataset[[ 'age', 'mv', 'bfc', 'bfl', 'afc', 'afl',  'country', 'capp', 'cg', 'ca', 'cmin', 'contractyr', 'lmin' , 'Shot total' , 'Shot on Target' , 'Goals Total' , 'Goal  Conceded' , 'Assists' , 'Saves' , 'Passes Total' , 'Key Passes' , 'Dribbles Attempts' , 'Dribble success' , 'Dribble Past' ,'Tackles Total' , 'Blocks' , 'Interception' , 'Duels Total' , 'Duels Won' , 'Fouls Drawn' , 'Fouls Commited' , 'Yellow Card' , 'Second Yellow Card' , 'Straight Red' , 'Penalties Won' , 'Penalties Commited' ,'Penalties Scored' , 'Penalties Missed' , 'Penalties Saved']]

attrs = x.columns

print(x.columns)

y = dataset['tv']

data = asarray(x)

scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
x = pd.DataFrame(scaled, columns=attrs)

print(y)

from sklearn.model_selection import train_test_split
from sklearn import metrics

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)

# bucket transfer values into 12 classes: 0 = under 1M, 11 = over 100M,
# and classes 1-10 covering 10M-wide bands in between
train_classes = []

for i in y_train:
    if i < 1000000:
        train_classes.append(0)
    elif i > 100000000:
        train_classes.append(11)
    else:
        i /= 1000000
        train_classes.append(int(i/10)+1)

test_classes = []

for i in y_test:
    if i < 1000000:
        test_classes.append(0)
    elif i > 100000000:
        test_classes.append(11)
    else:
        i /= 1000000
        test_classes.append(int(i/10)+1)

# Function to perform training with giniIndex.
def train_using_gini(X_train, X_test, y_train):
  
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion = "gini",
            random_state = 100,max_depth=3, min_samples_leaf=5)
  
    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini
      
# Function to perform training with entropy.
def train_using_entropy(X_train, X_test, y_train):
  
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
            criterion = "entropy", random_state = 100,
            max_depth = 3, min_samples_leaf = 5)
  
    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy
  
  
# Function to make predictions
def prediction(X_test, clf_object):
  
    # Predicton on test with giniIndex
    y_pred = clf_object.predict(X_test)
    return y_pred
      
# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Accuracy : ", accuracy_score(y_test, y_pred)*100)

def predictions(test_classes, y_pred, y_test):
    # print(test_classes)
    for i in range(len(test_classes)):

        st = ""
        corr = "F"

        if test_classes[i] == y_pred[i]:
            corr = "T"
        if y_pred[i] == 0:
            st = "<1M"
        elif y_pred[i] == 11:
            st = ">100M"
        else:
            s = (y_pred[i]-1)
            p = s*10
            if p == 0:
                p = "1"   # the first band runs from 1M, not 0M
            st = str(p)+"M - "+str((s+1)*10)+"M"

        print("Actual : ", y_test.iloc[i], "   Predicted :", st, "   Result :", corr)

clf_gini = train_using_gini(x_train, x_test, train_classes)
clf_entropy = train_using_entropy(x_train, x_test, train_classes)

# Operational Phase
print("Results Using Gini Index:")

# Prediction using gini
y_pred_gini = prediction(x_test, clf_gini)
predictions(test_classes, y_pred_gini, y_test)
cal_accuracy(test_classes, y_pred_gini)
print()

print("Results Using Entropy:")
# Prediction using entropy
y_pred_entropy = prediction(x_test, clf_entropy)
predictions(test_classes, y_pred_entropy, y_test)
cal_accuracy(test_classes, y_pred_entropy)
        
        


For wingers, we got pretty good results with accuracy around 80%, although ideally we want to predict a number, not a range.
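A natural follow-up that we did not fully explore is a regression tree, which predicts the value directly rather than a class range; a minimal sketch on the same splits:

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# same features and train/test split as above, but the target is the raw transfer value
reg = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5, random_state=100)
reg.fit(x_train, y_train)
y_pred_reg = reg.predict(x_test)
print("MAE:", mean_absolute_error(y_test, y_pred_reg))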



For midfielders, we got a decent result with an accuracy of around 60-70%, but failed to predict the higher values (as not many transfers happen in that range).



Defenders behaved similarly to midfielders.

Principal Component Analysis (PCA) is a technique to reduce the number of parameters, so we decided to use it to see whether it would help improve our results in this case. The code blocks below simply compare the two scenarios (without PCA, then with PCA).
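Before fixing the number of components, one quick sanity check (a sketch, not part of the comparison below) is the cumulative share of variance the components explain, computed on the MinMax-scaled feature matrix x built in the blocks below:

from sklearn.decomposition import PCA

# fit PCA with all components, then print the cumulative explained variance
pca_check = PCA().fit(x)
print(pca_check.explained_variance_ratio_.cumsum())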

        

from numpy import asarray
from sklearn.preprocessing import MinMaxScaler

x = dataset[[ 'age', 'mv', 'bfc', 'bfl', 'country', 'capp', 'cg', 'ca', 'cmin', 'contractyr', 'lmin' , 'Shot total' , 'Shot on Target' , 'Goals Total' , 'Goal  Conceded' , 'Assists' , 'Saves' , 'Passes Total' , 'Key Passes' , 'Dribbles Attempts' , 'Dribble success' , 'Dribble Past' ,'Tackles Total' , 'Blocks' , 'Interception' , 'Duels Total' , 'Duels Won' , 'Fouls Drawn' , 'Fouls Commited' , 'Yellow Card' , 'Second Yellow Card' , 'Straight Red' , 'Penalties Won' , 'Penalties Commited' ,'Penalties Scored' , 'Penalties Missed' , 'Penalties Saved']]

attrs = x.columns
print(x.columns)

y = dataset['tv']

data = asarray(x)

scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
x = pd.DataFrame(scaled, columns=attrs)

from sklearn.model_selection import train_test_split
from sklearn import metrics

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
y1 = []
y2 = []
x1 = []

def getbatches(datasetx,datasety,sz):
    
  batchesx=[]
  batchesy=[]
  batchx = []
  batchy = []
  for i in range(0,len(datasetx)):  
    
    batchx.append(datasetx[i])
    batchy.append(datasety[i])

    if len(batchx)==sz:
        batchesx.append(batchx)
        batchesy.append(batchy)
        batchx=[]
        batchy=[]
  
  batchesx = np.array(batchesx)
  batchesy = np.array(batchesy)
  return batchesx,batchesy

# per-sample percentage errors of the fit (despite the name, not an MSE)
def mse2(x,y,w,b):
    error = 0
    err_perc = []

    for i in range(0,len(x)):
        diff = y[i] - np.sum(x[i]*w) - b
        error = error + diff*diff
        per = (diff/y[i])*100
        err_perc.append(per)

    error = error/len(x)
    error = math.sqrt(error)
    return err_perc

# root mean squared error of the linear fit y ≈ x·w + b
def mse(x,y,w,b):
    error = 0
    err_perc = []

    for i in range(0,len(x)):
        diff = y[i] - np.sum(x[i]*w) - b
        error = error + diff*diff
        per = (diff/y[i])*100
        err_perc.append(per)


    error = error/len(x)
    error = math.sqrt(error)
    return error


# gradient of the mean squared error with respect to the weights w and bias b
def meanerr(x, y, w, b):

    merr = 0  # gradient w.r.t. w
    cerr = 0  # gradient w.r.t. b

    for i in range(0, len(x)):
        yp = np.sum(x[i]*w) + b           # y predicted
        merr = merr - x[i]*(y[i] - yp)    # sum of xi*(y predicted - y actual)
        cerr = cerr - (y[i] - yp)         # sum of (y predicted - y actual)

    merr = (2*merr)/len(x)
    cerr = (2*cerr)/len(x)

    return [merr, cerr]

def grad_desc(x, y, epochs, lr, batchsz, xts, yts):
    
    w = np.array([0]*x.shape[1])  # one weight per feature (this block does not use PCA)
    b = 0

    batchesx, batchesy = getbatches(x, y, batchsz)

    nbatches = len(batchesx) #calculates length of batches
    for i in range(0,epochs):
        print(i,mse(x,y,w,b), mse(xts,yts,w,b))
        x1.append(i)
        y1.append(mse(x,y,w,b))
        y2.append(mse(xts,yts,w,b))

        ind = i%nbatches
        error = meanerr(batchesx[ind],batchesy[ind],w,b)
        
        w = w-lr*error[0]
        b = b-lr*error[1]
    print(i,mse2(xts,yts,w,b))
    return w,b

ytrain = np.array(y_train)
xtrain = np.array(x_train)
ytest = np.array(y_test)
xtest = np.array(x_test)
w,b = grad_desc(xtrain,ytrain,10000,0.05,len(xtrain) , xtest, ytest)
print(w)

graph1 = plt.plot(x1, y1, label='Training_data_error', color='red')

graph2 = plt.plot(x1, y2, label='Test_data_error', color='green')

plt.legend()

plt.xlabel('No. of epochs') 

plt.ylabel('Root Mean Squared Error') 
  
plt.title('Errors vs Epochs') 
  
plt.show() 
        
        

        
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler

x = dataset[[ 'age', 'mv', 'bfc', 'bfl', 'afc', 'afl',  'country', 'capp', 'cg', 'ca', 'cmin', 'contractyr', 'lmin' , 'Shot total' , 'Shot on Target' , 'Goals Total' , 'Goal  Conceded' , 'Assists' , 'Saves' , 'Passes Total' , 'Key Passes' , 'Dribbles Attempts' , 'Dribble success' , 'Dribble Past' ,'Tackles Total' , 'Blocks' , 'Interception' , 'Duels Total' , 'Duels Won' , 'Fouls Drawn' , 'Fouls Commited' , 'Yellow Card' , 'Second Yellow Card' , 'Straight Red' , 'Penalties Won' , 'Penalties Commited' ,'Penalties Scored' , 'Penalties Missed' , 'Penalties Saved']]

attrs = x.columns

y = dataset['tv']

data = asarray(x)

scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
x = pd.DataFrame(scaled, columns=attrs)


from sklearn.decomposition import PCA

npca = 2

print(len(attrs))
pca = PCA(n_components = npca)

 
x = pca.fit_transform(x)

x = pd.DataFrame(x, columns=['pc'+str(i+1) for i in range(npca)])  # label the principal components, not the original attributes

from sklearn.model_selection import train_test_split
from sklearn import metrics

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
y1 = []
y2 = []
x1 = []

def getbatches(datasetx,datasety,sz):
    
  batchesx=[]
  batchesy=[]
  batchx = []
  batchy = []
  for i in range(0,len(datasetx)):  
    
    batchx.append(datasetx[i])
    batchy.append(datasety[i])

    if len(batchx)==sz:
        batchesx.append(batchx)
        batchesy.append(batchy)
        batchx=[]
        batchy=[]
  
  batchesx = np.array(batchesx)
  batchesy = np.array(batchesy)
  return batchesx,batchesy

# per-sample percentage errors of the fit (despite the name, not an MSE)
def mse2(x,y,w,b):
    error = 0
    err_perc = []

    for i in range(0,len(x)):
        diff = y[i] - np.sum(x[i]*w) - b
        error = error + diff*diff
        per = (diff/y[i])*100
        err_perc.append(per)


    error = error/len(x)
    error = math.sqrt(error)
    return err_perc

# root mean squared error of the linear fit y ≈ x·w + b
def mse(x,y,w,b):
    error = 0
    err_perc = []

    for i in range(0,len(x)):
        diff = y[i] - np.sum(x[i]*w) - b
        error = error + diff*diff
        per = (diff/y[i])*100
        err_perc.append(per)


    error = error/len(x)
    error = math.sqrt(error)
    return error


# gradient of the mean squared error with respect to the weights w and bias b
def meanerr(x, y, w, b):

    merr = 0  # gradient w.r.t. w
    cerr = 0  # gradient w.r.t. b

    for i in range(0, len(x)):
        yp = np.sum(x[i]*w) + b           # y predicted
        merr = merr - x[i]*(y[i] - yp)    # sum of xi*(y predicted - y actual)
        cerr = cerr - (y[i] - yp)         # sum of (y predicted - y actual)

    merr = (2*merr)/len(x)
    cerr = (2*cerr)/len(x)

    return [merr, cerr]

def grad_desc(x, y, epochs, lr, batchsz, xts, yts):
    
    w = np.array([0]*npca)  # one weight per principal component
    b = 0

    batchesx, batchesy = getbatches(x, y, batchsz)

    nbatches = len(batchesx) #calculates length of batches
    for i in range(0,epochs):
        # print("yo",w)
        print(i,mse(x,y,w,b), mse(xts,yts,w,b))
        x1.append(i)
        y1.append(mse(x,y,w,b))
        y2.append(mse(xts,yts,w,b))

        ind = i%nbatches
        error = meanerr(batchesx[ind],batchesy[ind],w,b)
        
        w = w-lr*error[0]
        b = b-lr*error[1]
    print(i,mse2(xts,yts,w,b))
    return w,b

ytrain = np.array(y_train)
xtrain = np.array(x_train)
ytest = np.array(y_test)
xtest = np.array(x_test)
w,b = grad_desc(xtrain,ytrain,10000,0.05,len(xtrain) , xtest, ytest)
print(w)

graph1 = plt.plot(x1, y1, label='Training_data_error', color='red')

graph2 = plt.plot(x1, y2, label='Test_data_error', color='green')

plt.legend()

plt.xlabel('No. of epochs') 

plt.ylabel('Root Mean Squared Error') 
  
plt.title('Errors vs Epochs') 
  
plt.show()
        
        


PCA (top) and no PCA (bottom) in the case of defenders.

In conclusion we got pretty good result in case of 'forward' and 'goal keepers' and in case of 'winger' after using decision tree. For Midfielder and Defender we didnt get safisfactory result which is due to the fact these 2 are dependendent on too many factors and that is hindering the models. It wasn't possible to split them apart any further as then we wouldn't have enough entries in each of the categories to apply ML.