"Transfer Price Prediction and Potential Replacements of a football player in the market" is the problem we will be working on. We use machine learning algorithms to train a model that addresses the prevailing issues in baseline models. We aim to train on more globally accurate datasets instead of the routinely available ratings. Where possible, we will also improve the algorithms by considering other influencing parameters and by reducing the error function, in order to arrive at a precise value for a player in the transfer market. With this data in hand, we also aim to design the model so that it can suggest suitable replacements for a particular player.
In the world of sports, football is not just a game between two teams but a war of emotions between people and also a source of growing wealth in the market. A tiny decision can have a great impact on team play, which indirectly affects the game to an unexpected degree. Being interested in this domain, we took this project as an opportunity to train a model that can precisely assess the prices of incoming players in the transfer market and, using those prices, suggest potential replacements that improve team integrity and gameplay. The model would prove useful to franchises in strategic decision making, helping them obtain desirable outcomes by optimizing their budget and their choice of players in the transfer market.
The term "transfer market" refers to the arena in which football players are made available for transfer between clubs. Predicting the transfer market price of a football player means predicting the amount of money a club can spend on the player in the transfer market. Bound players are players under bonds and contracts who do not count in the transfer market. Market value differs from transfer value: market value focuses on the market worth of a given player, while transfer value focuses on players who are available for exchange or replacement in the transfer market. Our model aims to serve this purpose by considering all the major factors influencing transfer value, so that its price predictions are reliable and can be trusted when choosing replacements.
Data acquisition: (6-15 September)
Literature Review: (16 September-3 October)
Test and Validation of Architecture: (4-17 October)
Fine Tuning: (18-24 October)
Final Report: (25-31 October)
We plan to use BeautifulSoup web scraping to extract datasets from the following websites:
1) Transfermarkt
2) FBref.com
3) footballdatabase.com
4) eu-football.info
We also focus on filtering out noisy data where possible, so that we can use reliable and accurate information to train our model.
Project Midway:
As per our proposed timeline, we have worked on:
Data Acquisition for training our model
Literature Review for Insights and Inferences
Test and Validation, which is our final goal
In this phase, we worked to acquire and extract the required data from different sites, as stated below:
Transfermarkt (for all the transfers that happened in summer 2021, and for market value (MV))
Sofifa (only for player contract status)
FBref (to fill in missing stats for some players)
In this process, the difficulties we faced were gathering the data and dealing with missing data. We adopted the following strategies to deal with them:
Gathering of data: We had to use different tools such as Beautiful Soup, Scrapy and APIs to gather the required data.
Dealing with missing data: Deleting rows and taking averages are among the easiest ways to deal with this issue, but neither was suitable for us. Deleting the rows with missing data would have caused a huge loss of data. Averaging is not good for our problem because it does not factor in covariance, which could lead to inaccurate results.
There were many instances where certain data was not available for some players, because we collected data from different websites and merged it (all the required data was not available on a single website, or it was locked behind a paywall).
We dealt with the issue case by case.
For example, players who were missing contract details and moved on a free transfer had their contract year set to 2021. Players are allowed to leave a club for free when their contract expires. There are cases where a club agrees to mutually terminate the contract with a player (because of wages), or where a clause allows the player to leave for free (in case of relegation, or an injury release clause), but those cases are few, so setting the year to 2021 is a good approximation.
[check fig 1 and 2]
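The contract-year rule described above can be sketched in pandas roughly as follows. This is only an illustration under assumed names: the toy table, the `fee` column (with 0 standing for a free transfer) and the helper `fill_free_transfer_contract_year` are hypothetical and are not taken from the project's actual dataset.

```python
import pandas as pd

# Hypothetical toy table; the column names are illustrative, not the project's schema.
players = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "fee": [0, 0, 25_000_000, 0],  # 0 stands for a free transfer
    "contractyr": [float("nan"), 2023, float("nan"), 2021],  # NaN = year missing
})

def fill_free_transfer_contract_year(df, year=2021):
    """Set a missing contract year to `year`, but only for free transfers."""
    out = df.copy()
    mask = out["contractyr"].isna() & (out["fee"] == 0)
    out.loc[mask, "contractyr"] = year
    return out

filled = fill_free_transfer_contract_year(players)
print(filled)
```

Player C keeps a missing contract year because the move was not a free transfer; such rows would still need the case-by-case handling described above.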
For players missing national team stats (because the page structure was not the same, so the data was not captured, or because it was simply not available on that website), we looked the values up manually on the internet and filled them in.
[check fig 3 and 4]
"The Wisdom of Crowds and Transfer Market Values (2021)." Dennis Coates, Petr Parshakov.
Transfermarkt, one of the most popular sites, uses crowdsourcing to decide the market value of a player. This paper gives us the insight that people generally overestimate the value of players in lower leagues and underestimate the value of players in top leagues. Actual fees for players with time remaining on their contract rise by between £550,000 and £800,000 on average per year of time left.
Although the "market value" reported by Transfermarkt (TM) is a biased predictor, it predicts better than FIFA scores and ELO ratings (based on head-to-head results).
"Beyond crowd judgments: Data-driven estimation of market value in association football (2017)." Oliver Müller, Alexander Simons, Markus Weinmann.
As the model's residuals are likely not independent, which would violate a central assumption of linear regression, multilevel regression was used: some of the factors were specified as random factors and the intercept was allowed to vary.
Data analysis also helps to predict accurate values for lesser-known players, whose values generally get more biased in crowd-based estimation.
"Football player's performance and market value (2015)." He, Miao & Cachucho, Ricardo & Knobbe, Arno.
A player's position strongly dictates their price tag. Performance matters, but for top players the correlation between performance and market value is weaker than usual.
"Money Talks: Team Variables and Player Positions that Most Influence the Market Value of Professional Male Footballers in Europe (2020)." Jose Luis Felipe et al.
Teams with a better league ranking have more players from top-ranking countries than lower-ranked teams. Teams that participate in the UCL tend to pay more than market value compared to teams that participate in the UEL or do not participate in a continental championship.
Project Final:
Insight: There are different positions in football, and different attributes matter depending on the position a player plays in. We therefore decided to split the players by position and train a separate model for each position for better accuracy.
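The per-position split can be sketched as below. This is a minimal illustration on synthetic data, not the project's code: the column names, the `features` mapping and the choice of scikit-learn's `LinearRegression` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 60

# Synthetic stand-in data; column names are illustrative only.
data = pd.DataFrame({
    "position": np.repeat(["forward", "goalkeeper", "midfielder"], n // 3),
    "age": rng.integers(18, 35, size=n),
    "goals": rng.integers(0, 30, size=n),
    "saves": rng.integers(0, 100, size=n),
})
data["tv"] = 1e6 * (1 + data["goals"]) + rng.normal(0, 1e5, size=n)

# Hypothetical choice of relevant attributes for each position.
features = {
    "forward": ["age", "goals"],
    "goalkeeper": ["age", "saves"],
    "midfielder": ["age", "goals"],
}

# Fit one model per position, each on its own subset of rows and columns.
models = {}
for position, group in data.groupby("position"):
    cols = features[position]
    models[position] = LinearRegression().fit(group[cols], group["tv"])

for position, model in models.items():
    print(position, "uses", len(model.coef_), "features")
```

Keeping the feature list per position in one mapping makes it easy to drop attributes that do not matter for a role, such as saves for a forward.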
We used multiple linear regression (MLR) and decision trees because we ran into the problem commonly known as the P>>N problem (very few data points and a comparatively high number of parameters). Generally MLR should not work in such cases, but we dropped a number of less important parameters, and in most cases (especially 'forward' and 'goalkeeper') it worked.
We went with decision trees where we were not satisfied with the MLR results (for 'Midfielder' and 'Defenders/Wing-Back', where we could not drop many parameters due to the versatility of these roles).
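To illustrate why the P>>N situation is a problem for linear regression, here is a small NumPy sketch on pure noise. The sizes (10 samples, 30 features, 3 kept features) are arbitrary assumptions chosen for the example; the point is only that with more parameters than data points, least squares reproduces the training targets exactly even when there is nothing to learn.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 10, 30  # more parameters than data points (P >> N)

X_train = rng.normal(size=(n_samples, n_features))
y_train = rng.normal(size=n_samples)  # pure noise: there is no real signal
X_test = rng.normal(size=(n_samples, n_features))
y_test = rng.normal(size=n_samples)

def rmse(X, y, w):
    """Root mean squared error of the linear model X @ w against y."""
    return float(np.sqrt(np.mean((X @ w - y) ** 2)))

# With all 30 features, least squares can reproduce the training targets exactly.
w_full, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# After dropping to 3 features, the model can no longer memorise the noise.
k = 3
w_small, *_ = np.linalg.lstsq(X_train[:, :k], y_train, rcond=None)

train_full = rmse(X_train, y_train, w_full)
train_small = rmse(X_train[:, :k], y_train, w_small)
print("train RMSE, 30 features:", train_full)
print("train RMSE,  3 features:", train_small)
print("test RMSE,  30 features:", rmse(X_test, y_test, w_full))
print("test RMSE,   3 features:", rmse(X_test[:, :k], y_test, w_small))
```

A near-zero training error with all features says nothing about unseen players, which is why the number of parameters was reduced before fitting MLR.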
Since not much work has been done on this exact problem, we lacked an ML algorithm against which to compare our results. Fortunately, because it is a real-life problem, we had market value to compare with (market value is people's prediction of a player's valuation and is accepted to be one of the best predictors of transfer value). We decided to use it as a baseline and tried to achieve better results than that.
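The baseline comparison can be sketched as follows: both the market value and the model's prediction are scored against the actual transfer value using MAE and RMSE. The numbers are made up for illustration (in millions) and the helper names `mae` and `rmse` are our own.

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.mean(np.abs(actual - predicted)))

def rmse(actual, predicted):
    """Root mean squared error."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up values in millions, for illustration only.
transfer_value = [10.0, 25.0, 4.0, 60.0]    # what was actually paid
market_value = [12.0, 20.0, 5.0, 50.0]      # crowd-sourced baseline
model_prediction = [11.0, 23.0, 4.5, 57.0]  # hypothetical model output

baseline_mae = mae(transfer_value, market_value)
model_mae = mae(transfer_value, model_prediction)
baseline_rmse = rmse(transfer_value, market_value)
model_rmse = rmse(transfer_value, model_prediction)

print("baseline MAE/RMSE:", baseline_mae, baseline_rmse)
print("model    MAE/RMSE:", model_mae, model_rmse)
```

A model is only worth using for a position if both of its errors are below the corresponding baseline errors.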
This is the initial part of the code, which is common to all the code used below. It loads the database and converts non-numerical data/symbols into numerical ones.
The loop in the above code is optional. "b" stores the position we want players from, the loop goes through the database, and z stores the number of players in that position.
Below is the code based on MLR; we added a temporary loop around it under certain conditions to check its credibility.
The x parameter in the above code depends on the position we are trying to get results for. The results for each position are:
We got pretty decent results here: the mean absolute error and the root mean square error were both lower than those of the market values.
Goalkeeper price predictions were quite good, and we got a well-fitted graph, as the number of parameters was reduced for goalkeepers.
Defender results were not as good, as in some cases MAE and RMSE were higher than those of market value.
Similar to the defenders' case. The first image shows a good result but the second one does not.
For wingers, the algorithm did not find a good fit, hence we decided to go with decision trees.
import pandas as pd
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# `dataset` is the DataFrame loaded in the initial part of the code.
# Feature columns (names exactly as they appear in the dataset).
x = dataset[['age', 'mv', 'bfc', 'bfl', 'afc', 'afl', 'country', 'capp', 'cg', 'ca', 'cmin', 'contractyr', 'lmin',
             'Shot total', 'Shot on Target', 'Goals Total', 'Goal Conceded', 'Assists', 'Saves',
             'Passes Total', 'Key Passes', 'Dribbles Attempts', 'Dribble success', 'Dribble Past',
             'Tackles Total', 'Blocks', 'Interception', 'Duels Total', 'Duels Won',
             'Fouls Drawn', 'Fouls Commited', 'Yellow Card', 'Second Yellow Card', 'Straight Red',
             'Penalties Won', 'Penalties Commited', 'Penalties Scored', 'Penalties Missed', 'Penalties Saved']]
attrs = x.columns
print(x.columns)
y = dataset['tv']  # target: transfer value

# Scale every feature to the [0, 1] range.
data = asarray(x)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
x = pd.DataFrame(scaled, columns=attrs)
print(y)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

def to_price_class(value):
    # Class 0: below 1M, class 11: above 100M,
    # classes 1..10: 10M-wide bands (1 is 1M-10M, 2 is 10M-20M, ...).
    if value < 1000000:
        return 0
    if value > 100000000:
        return 11
    return int(value / 1000000 / 10) + 1

train_classes = [to_price_class(v) for v in y_train]
test_classes = [to_price_class(v) for v in y_test]
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Function to perform training with the Gini index.
def train_using_gini(X_train, X_test, y_train):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                      max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini

# Function to perform training with entropy.
def train_using_entropy(X_train, X_test, y_train):
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                         max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy

# Function to make predictions on the test set with a trained classifier
def prediction(X_test, clf_object):
    y_pred = clf_object.predict(X_test)
    return y_pred

# Function to calculate accuracy (in percent)
def cal_accuracy(y_test, y_pred):
    print("Accuracy : ", accuracy_score(y_test, y_pred) * 100)

# Print the actual transfer value next to the predicted price band for each test sample.
def predictions(test_classes, y_pred, y_test):
    for i in range(len(test_classes)):
        corr = "T" if test_classes[i] == y_pred[i] else "F"
        if y_pred[i] == 0:
            st = "<1M"
        elif y_pred[i] == 11:
            st = ">100M"
        else:
            s = y_pred[i] - 1
            low = s * 10 if s > 0 else 1  # the first band starts at 1M, not 0M
            st = str(low) + "M - " + str((s + 1) * 10) + "M"
        print("Actual : ", y_test.iloc[i], " Predicted :", st, " Result :", corr)

clf_gini = train_using_gini(x_train, x_test, train_classes)
clf_entropy = train_using_entropy(x_train, x_test, train_classes)

# Operational Phase
print("Results Using Gini Index:")
# Prediction using gini
y_pred_gini = prediction(x_test, clf_gini)
predictions(test_classes, y_pred_gini, y_test)
cal_accuracy(test_classes, y_pred_gini)
print()
print("Results Using Entropy:")
# Prediction using entropy
y_pred_entropy = prediction(x_test, clf_entropy)
predictions(test_classes, y_pred_entropy, y_test)
cal_accuracy(test_classes, y_pred_entropy)
We got pretty good results, with accuracy around 80%, although we want to predict a number, not a range.
We got decent results, with accuracy of around 60-70%, failing to predict the higher values (as there are not many transfers happening in that range).
Similar to midfielders.
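Since the goal is a number rather than a range, one possible direction (not something we evaluated in this project) is a regression tree instead of a classification tree. The sketch below uses scikit-learn's `DecisionTreeRegressor` on synthetic data; the data, the tree settings and the metric are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic features in [0, 1] (as after min-max scaling) and a synthetic price.
X = rng.uniform(0, 1, size=(200, 5))
y = 50e6 * X[:, 0] + 10e6 * X[:, 1] + rng.normal(0, 1e6, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100)

# Same depth and leaf-size limits as the classification trees above.
reg = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5, random_state=100)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)  # one predicted price per test sample
print("MAE:", mean_absolute_error(y_test, y_pred))
```

A shallow regression tree still predicts only as many distinct values as it has leaves, so this is closer to a data-driven choice of price bands than to a fully continuous prediction.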
Principal Component Analysis (PCA) is a technique to reduce the number of parameters, so we decided to use it to see whether it would improve our results. The code below simply compares the two scenarios (with and without PCA).
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Scenario without PCA. `dataset` is the DataFrame loaded in the initial part of the code.
x = dataset[['age', 'mv', 'bfc', 'bfl', 'country', 'capp', 'cg', 'ca', 'cmin', 'contractyr', 'lmin',
             'Shot total', 'Shot on Target', 'Goals Total', 'Goal Conceded', 'Assists', 'Saves',
             'Passes Total', 'Key Passes', 'Dribbles Attempts', 'Dribble success', 'Dribble Past',
             'Tackles Total', 'Blocks', 'Interception', 'Duels Total', 'Duels Won',
             'Fouls Drawn', 'Fouls Commited', 'Yellow Card', 'Second Yellow Card', 'Straight Red',
             'Penalties Won', 'Penalties Commited', 'Penalties Scored', 'Penalties Missed', 'Penalties Saved']]
attrs = x.columns
print(x.columns)
y = dataset['tv']  # target: transfer value

# Scale every feature to the [0, 1] range.
data = asarray(x)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
x = pd.DataFrame(scaled, columns=attrs)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Histories filled in by grad_desc, used for plotting.
y1 = []  # training error per iteration
y2 = []  # test error per iteration
x1 = []  # iteration index
def getbatches(datasetx, datasety, sz):
    # Split the data into consecutive batches of size sz
    # (an incomplete final batch is dropped).
    batchesx = []
    batchesy = []
    batchx = []
    batchy = []
    for i in range(0, len(datasetx)):
        batchx.append(datasetx[i])
        batchy.append(datasety[i])
        if len(batchx) == sz:
            batchesx.append(batchx)
            batchesy.append(batchy)
            batchx = []
            batchy = []
    batchesx = np.array(batchesx)
    batchesy = np.array(batchesy)
    return batchesx, batchesy

def mse2(x, y, w, b):
    # Per-sample percentage error of the linear model (assumes no y[i] is zero).
    err_perc = []
    for i in range(0, len(x)):
        diff = y[i] - np.sum(x[i] * w) - b
        per = (diff / y[i]) * 100
        err_perc.append(per)
    return err_perc

def mse(x, y, w, b):
    # Despite its name, this returns the ROOT mean squared error.
    error = 0
    for i in range(0, len(x)):
        diff = y[i] - np.sum(x[i] * w) - b
        error = error + diff * diff
    error = error / len(x)
    error = math.sqrt(error)
    return error

def meanerr(x, y, w, b):
    # Gradient of the mean squared error with respect to w and b.
    merr = 0  # gradient w.r.t. the weights w
    cerr = 0  # gradient w.r.t. the intercept b
    for i in range(0, len(x)):
        yp = np.sum(x[i] * w) + b  # y predicted
        merr = merr - x[i] * (y[i] - yp)  # -sigma xi*(y actual - y predicted)
        cerr = cerr - (y[i] - yp)  # -sigma (y actual - y predicted)
    merr = (2 * merr) / len(x)
    cerr = (2 * cerr) / len(x)
    return [merr, cerr]

def grad_desc(x, y, epochs, lr, batchsz, xts, yts):
    w = np.zeros(x.shape[1])  # one weight per feature column
    b = 0
    batchesx, batchesy = getbatches(x, y, batchsz)
    nbatches = len(batchesx)  # number of batches
    for i in range(0, epochs):
        print(i, mse(x, y, w, b), mse(xts, yts, w, b))
        x1.append(i)
        y1.append(mse(x, y, w, b))
        y2.append(mse(xts, yts, w, b))
        ind = i % nbatches  # one batch per iteration, cycling through the batches
        error = meanerr(batchesx[ind], batchesy[ind], w, b)
        w = w - lr * error[0]
        b = b - lr * error[1]
    print(i, mse2(xts, yts, w, b))
    return w, b
ytrain = np.array(y_train)
xtrain = np.array(x_train)
ytest = np.array(y_test)
xtest = np.array(x_test)

# Batch size equals the training set size, so every iteration uses the full training set.
w, b = grad_desc(xtrain, ytrain, 10000, 0.05, len(xtrain), xtest, ytest)
print(w)

graph1 = plt.plot(x1, y1, label='Training_data_error', color='red')
graph2 = plt.plot(x1, y2, label='Test_data_error', color='green')
plt.legend()
plt.xlabel('No. of epochs')
plt.ylabel('Root Mean Squared Error')
plt.title('Errors vs Epochs')
plt.show()
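For reference, the update that meanerr and grad_desc compute can be written out as below. The notation is ours and not part of the original code: n is the batch size, \eta is the learning rate lr, and \hat{y}_i is the prediction for sample i. The plotted error is the square root of L.

```latex
\begin{aligned}
\hat{y}_i &= w^\top x_i + b \\
L(w, b) &= \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \\
\frac{\partial L}{\partial w} &= -\frac{2}{n} \sum_{i=1}^{n} x_i \left( y_i - \hat{y}_i \right) \\
\frac{\partial L}{\partial b} &= -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right) \\
w &\leftarrow w - \eta \, \frac{\partial L}{\partial w}, \qquad
b \leftarrow b - \eta \, \frac{\partial L}{\partial b}
\end{aligned}
```

The two partial derivatives correspond to the values merr and cerr returned by meanerr.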
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Scenario with PCA. `dataset` is the DataFrame loaded in the initial part of the code.
x = dataset[['age', 'mv', 'bfc', 'bfl', 'afc', 'afl', 'country', 'capp', 'cg', 'ca', 'cmin', 'contractyr', 'lmin',
             'Shot total', 'Shot on Target', 'Goals Total', 'Goal Conceded', 'Assists', 'Saves',
             'Passes Total', 'Key Passes', 'Dribbles Attempts', 'Dribble success', 'Dribble Past',
             'Tackles Total', 'Blocks', 'Interception', 'Duels Total', 'Duels Won',
             'Fouls Drawn', 'Fouls Commited', 'Yellow Card', 'Second Yellow Card', 'Straight Red',
             'Penalties Won', 'Penalties Commited', 'Penalties Scored', 'Penalties Missed', 'Penalties Saved']]
attrs = x.columns
y = dataset['tv']  # target: transfer value

# Scale every feature to the [0, 1] range.
data = asarray(x)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
x = pd.DataFrame(scaled, columns=attrs)

# Project the scaled features onto the first npca principal components.
npca = 2
print(len(attrs))
pca = PCA(n_components=npca)
x = pca.fit_transform(x)
# The components are mixtures of all attributes, so give them neutral names.
x = pd.DataFrame(x, columns=['pc' + str(i + 1) for i in range(npca)])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Histories filled in by grad_desc, used for plotting.
y1 = []  # training error per iteration
y2 = []  # test error per iteration
x1 = []  # iteration index
def getbatches(datasetx, datasety, sz):
    # Split the data into consecutive batches of size sz
    # (an incomplete final batch is dropped).
    batchesx = []
    batchesy = []
    batchx = []
    batchy = []
    for i in range(0, len(datasetx)):
        batchx.append(datasetx[i])
        batchy.append(datasety[i])
        if len(batchx) == sz:
            batchesx.append(batchx)
            batchesy.append(batchy)
            batchx = []
            batchy = []
    batchesx = np.array(batchesx)
    batchesy = np.array(batchesy)
    return batchesx, batchesy

def mse2(x, y, w, b):
    # Per-sample percentage error of the linear model (assumes no y[i] is zero).
    err_perc = []
    for i in range(0, len(x)):
        diff = y[i] - np.sum(x[i] * w) - b
        per = (diff / y[i]) * 100
        err_perc.append(per)
    return err_perc

def mse(x, y, w, b):
    # Despite its name, this returns the ROOT mean squared error.
    error = 0
    for i in range(0, len(x)):
        diff = y[i] - np.sum(x[i] * w) - b
        error = error + diff * diff
    error = error / len(x)
    error = math.sqrt(error)
    return error

def meanerr(x, y, w, b):
    # Gradient of the mean squared error with respect to w and b.
    merr = 0  # gradient w.r.t. the weights w
    cerr = 0  # gradient w.r.t. the intercept b
    for i in range(0, len(x)):
        yp = np.sum(x[i] * w) + b  # y predicted
        merr = merr - x[i] * (y[i] - yp)  # -sigma xi*(y actual - y predicted)
        cerr = cerr - (y[i] - yp)  # -sigma (y actual - y predicted)
    merr = (2 * merr) / len(x)
    cerr = (2 * cerr) / len(x)
    return [merr, cerr]

def grad_desc(x, y, epochs, lr, batchsz, xts, yts):
    w = np.zeros(x.shape[1])  # one weight per principal component (npca in total)
    b = 0
    batchesx, batchesy = getbatches(x, y, batchsz)
    nbatches = len(batchesx)  # number of batches
    for i in range(0, epochs):
        print(i, mse(x, y, w, b), mse(xts, yts, w, b))
        x1.append(i)
        y1.append(mse(x, y, w, b))
        y2.append(mse(xts, yts, w, b))
        ind = i % nbatches  # one batch per iteration, cycling through the batches
        error = meanerr(batchesx[ind], batchesy[ind], w, b)
        w = w - lr * error[0]
        b = b - lr * error[1]
    print(i, mse2(xts, yts, w, b))
    return w, b
ytrain = np.array(y_train)
xtrain = np.array(x_train)
ytest = np.array(y_test)
xtest = np.array(x_test)

# Batch size equals the training set size, so every iteration uses the full training set.
w, b = grad_desc(xtrain, ytrain, 10000, 0.05, len(xtrain), xtest, ytest)
print(w)

graph1 = plt.plot(x1, y1, label='Training_data_error', color='red')
graph2 = plt.plot(x1, y2, label='Test_data_error', color='green')
plt.legend()
plt.xlabel('No. of epochs')
plt.ylabel('Root Mean Squared Error')
plt.title('Errors vs Epochs')
plt.show()
PCA (top) and no PCA (bottom) for defenders.
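The number of components was fixed by hand in the PCA run. One common way to choose it, which we sketch here on synthetic data rather than on the project data, is to look at the cumulative explained variance ratio. The 95% threshold and the synthetic two-factor data are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic data: 10 observed columns driven by 2 hidden factors plus small noise.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(100, 10))

# Same preprocessing as in the scripts above: min-max scaling, then PCA.
X_scaled = MinMaxScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Fraction of the total variance captured by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print("cumulative explained variance:", np.round(cumulative, 3))
print("components needed for", threshold, ":", n_components)
```

If very few components already explain most of the variance, PCA discards little information; if not, a small fixed number of components may throw away attributes that matter for the price.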
In conclusion, we got pretty good results for 'forward' and 'goalkeeper', and for 'winger' after using decision trees.
For midfielders and defenders we did not get satisfactory results, because these two roles depend on too many factors, which hinders the models. It was not possible to split them any further, as we would then not have had enough entries in each of the categories to apply ML.
"Beyond crowd judgments: Data-driven estimation of market value in association football (2017)." Oliver Müller, Alexander Simons, Markus Weinmann.
"Football player's performance and market value (2015)." He, Miao & Cachucho, Ricardo & Knobbe, Arno.
"Money Talks: Team Variables and Player Positions that Most Influence the Market Value of Professional Male Footballers in Europe (2020)." Jose Luis Felipe et al.