Solar cells are the future of energy. Existing technology is based on silicon or germanium, the so-called conventional semiconductors, which are hard to manufacture and expensive to fabricate. Perovskites can replace them: using perovskites instead of conventional semiconductors would drastically reduce the effort needed to make solar cells.
Perovskites are crystals with a specific crystal structure given by
$$ ABX_{3} $$

A and B are cations, and X is an anion. So there is an enormous number of combinations that yield a perovskite crystal, but it is impossible to try all of these combinations and measure the properties that we need.
In this project, we concentrate on the bandgap energy, denoted by Eg, which determines the efficiency of the solar cell and the color of the light emitted if the material is used as an LED. We also want to find a substitute for lead (Pb), which is the B cation in most of the perovskites in use today, because Pb is a heavy metal that causes well-known environmental problems. On a side note, the amount of Pb that reaches the ground from perovskite solar cells is not significant; it is comparable to the Pb exposure from walking barefoot on an asphalt road. Hence finding a Pb-free perovskite is not an immediate goal.
By the midway point, we hope to start training the machine learning models to predict Eg.
Bandgap engineering, that is, obtaining a material with a desired bandgap value, is a crucial step toward perovskite solar cells that are stable and highly efficient. In this paper, Yaogao Li, Yao Lu and their group propose using machine learning as a tool to determine how the bandgap of a material depends on its composition, and thereby to predict the composition that gives a desired bandgap value. They used experimental bandgap data for various compositions of perovskite crystals as the training set, and used linear regression and neural networks to predict the bandgap values of the test dataset. The class of perovskite crystals they considered is mixed lead halide perovskites (APbX3, where A is the cation and X is the halide ion), chosen for their unique optical and electrical properties. By varying the composition of these crystals, they obtained data on perovskites with bandgap values ranging from 1.5 to 3.2 eV.
The different compositions are made by varying the cations (A) and anions (X) while keeping the lead content fixed. For the cations (A) they used Cs (cesium), FA (formamidinium) and MA (methylammonium), and for the anions (halides) they used Cl, Br and I. The dataset of previously reported experiments covers a large range of compositions, including pure Cl, pure Br, pure I, and mixed Br-I and Cl-Br, with the aim of capturing a deep and correct relation between the composition of a lead halide perovskite and its bandgap. The dataset contains more than 300 data points drawn from more than 120 recently published research papers. To further increase the accuracy of their model, they then screened the dataset.
In the screening process, they removed duplicate entries from the dataset, and where several experimental values were reported for a perovskite of the same composition, they kept only the most frequently and most recently reported value. After this screening, the final dataset contained 109 data points. In this final dataset, the maximum bandgap was 3.16 eV, for MAPbCl3, and the minimum was 1.48 eV, for FAPbI3.
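The screening step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are hypothetical placeholders (not values from the paper's dataset), and the rule "most frequent value, with the most recent report breaking ties" is our reading of the description, not a procedure confirmed by the authors.

```python
from collections import Counter

# Hypothetical records: (composition, bandgap in eV, publication year).
# Placeholder values for illustration only, not data from the paper.
records = [
    ("MAPbI3", 1.55, 2015),
    ("MAPbI3", 1.55, 2015),  # exact duplicate of the row above
    ("MAPbI3", 1.60, 2017),
    ("MAPbI3", 1.55, 2018),
    ("FAPbI3", 1.48, 2016),
    ("FAPbI3", 1.51, 2014),
]


def screen(rows):
    """Keep one bandgap per composition.

    Assumed rule: drop exact duplicates, then pick the most frequently
    reported value; ties go to the value with the most recent report.
    """
    by_comp = {}
    for comp, eg, year in set(rows):  # set() removes exact duplicates
        by_comp.setdefault(comp, []).append((eg, year))

    result = {}
    for comp, entries in by_comp.items():
        counts = Counter(eg for eg, _ in entries)
        latest = {}
        for eg, year in entries:
            latest[eg] = max(latest.get(eg, year), year)
        # Rank by frequency first, then by the year of the newest report.
        result[comp] = max(counts, key=lambda eg: (counts[eg], latest[eg]))
    return result


screened = screen(records)
print(screened)
```

With these placeholder rows, each composition ends up with a single bandgap value, which is the shape of data a regression model needs.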
In this paper, the main goal is to find lead-free perovskites using ML and density functional theory (DFT). Lead has well-known harmful effects, but it is widely used in perovskites because lead-based perovskites are stable and have a desirable bandgap.
In this research article, the authors generate data on double perovskites, that is, perovskites with two types of atoms or molecules at the B site, using first-principles DFT calculations. The features are atomic properties such as Pauling’s electronegativity, ionization potential, highest occupied atomic level and lowest unoccupied atomic level of 540 hypothetical perovskites. They used a decision-tree-based algorithm known as Gradient Boosted Regression Trees (GBRT). A single decision tree is prone to overfitting, and GBRT reduces this by combining many small trees through gradient boosting. They also took the heat of formation into account, because it is directly related to the stability of the crystal.
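As a rough sketch of what gradient boosting does (standard textbook notation, not taken from the article), the model is built in stages, and each stage adds one small regression tree to the current model:

$$ F_{m}(x) = F_{m-1}(x) + \nu \, h_{m}(x), \qquad 0 < \nu \le 1 $$

Here F_m is the model after m trees, h_m is the tree fitted at stage m to the negative gradient of the loss (for squared error, simply the residuals of F_{m-1}), and ν is the learning rate. Keeping each tree shallow and ν below 1 is what limits overfitting compared with one large tree.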
Major advantages
The major drawbacks we face here are,
We have implemented linear regression on the data obtained from the paper "Bandgap tuning strategy by cations and halide ions of lead halide perovskites learned from machine learning", and the implementation of the ANN is almost done. Both algorithms are taken from the scikit-learn package.
Data Set
The results are given below
And the corresponding Regression coefficients of the features and the intercept were obtained as,
The final model for predicting the bandgap energy was then obtained as
$$ E_{g} = 2.66635970285812 - 0.347986\,\mathrm{MA} - 0.429095\,\mathrm{FA} - 0.365870\,\mathrm{Cs} + 0.816619\,\mathrm{Cl} - 0.065292\,\mathrm{Br} - 0.751327\,\mathrm{I} $$
This model can be used to predict the composition of lead halide perovskite required to obtain a desired bandgap energy (within error limits). For an LED, the bandgap energy is the parameter that determines the wavelength of the emitted light, so the model can also be used to predict the composition needed to make an LED that emits at a wavelength of interest. Actually making a predicted composition may be possible, difficult, or in some cases even impossible, so experimental verification is required to establish the practical feasibility of any composition predicted by the ML model.
Linear Regression Implementation Code
ANN Implementation Code
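To make the linear model concrete, here is a minimal sketch that evaluates it for a given composition and converts the bandgap to an emission wavelength. The coefficients are copied from the expression above. Two things are our assumptions: that each feature is a site fraction (A-site fractions sum to 1, halide fractions sum to 1), and the helper names `predict_bandgap` and `emission_wavelength_nm`, which are ours. The wavelength uses the standard relation λ = hc / Eg with hc ≈ 1239.84 eV·nm.

```python
# Coefficients copied from the fitted linear model quoted in the text.
INTERCEPT = 2.66635970285812
COEF = {
    "MA": -0.347986, "FA": -0.429095, "Cs": -0.365870,
    "Cl": 0.816619, "Br": -0.065292, "I": -0.751327,
}

HC_EV_NM = 1239.84  # Planck constant times speed of light, in eV*nm


def predict_bandgap(composition):
    """Bandgap in eV for a dict of site fractions.

    Assumption: A-site fractions (MA, FA, Cs) sum to 1 and
    halide fractions (Cl, Br, I) sum to 1.
    """
    return INTERCEPT + sum(COEF[s] * frac for s, frac in composition.items())


def emission_wavelength_nm(eg_ev):
    """Photon wavelength in nm corresponding to a bandgap in eV."""
    return HC_EV_NM / eg_ev


eg_mapbcl3 = predict_bandgap({"MA": 1.0, "Cl": 1.0})
eg_fapbi3 = predict_bandgap({"FA": 1.0, "I": 1.0})
eg_mixed = predict_bandgap({"MA": 1.0, "Br": 0.5, "I": 0.5})

for name, eg in [("MAPbCl3", eg_mapbcl3), ("FAPbI3", eg_fapbi3),
                 ("MAPb(Br0.5 I0.5)3", eg_mixed)]:
    print(f"{name}: Eg = {eg:.3f} eV, wavelength = {emission_wavelength_nm(eg):.0f} nm")
```

Under the site-fraction assumption, the two pure compositions come out at about 3.13 eV and 1.49 eV, close to the 3.16 eV and 1.48 eV endpoints of the screened dataset, which suggests the assumption is reasonable.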
What we have implemented after midway
We first implemented the ANN with 100 hidden layers, which tended to overfit the data and performed very badly on the test data. We therefore changed the hidden layers and optimized all the parameters, including the activation function and the number of epochs. We also changed our solver from the SGD-based “Adam” to the quasi-Newton “LBFGS”, which is unconventional but worked for us.
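A minimal sketch of this kind of setup with scikit-learn's `MLPRegressor` is shown below. It is not our project code: the data are synthetic stand-ins generated on the spot, and the layer size, activation and iteration count are illustrative values rather than the ones we finally tuned.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for composition features (NOT the project data):
# three halide fractions (Cl, Br, I) per sample, each row summing to 1.
X = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=100)
# Made-up target: a linear trend plus a mild nonlinearity and noise.
y = 3.1 * X[:, 0] + 2.3 * X[:, 1] + 1.5 * X[:, 2] - 0.3 * X[:, 1] * X[:, 2]
y = y + rng.normal(scale=0.02, size=y.shape)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A small network trained with the quasi-Newton L-BFGS solver
# instead of the default SGD-based Adam.
model = MLPRegressor(
    hidden_layer_sizes=(8,),  # one hidden layer with 8 units
    activation="tanh",
    solver="lbfgs",
    max_iter=2000,
    random_state=0,
)
model.fit(X_train, y_train)

pred = model.predict(X_test)
rmse = float(np.sqrt(np.mean((y_test - pred) ** 2)))
print(f"test RMSE: {rmse:.4f} eV")
```

Note that in scikit-learn, `hidden_layer_sizes=(8,)` means a single hidden layer with 8 units, and the default `(100,)` means one layer of 100 units. The scikit-learn documentation also notes that `lbfgs` can converge faster and perform better on small datasets, which matches what we saw.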
We implemented Random Forest and Regression Tree algorithms and tried to fit our data on Tin-Lead perovskites. The results were satisfactory but needed further improvement. Since we did not have a large dataset, the partition into train and test data mattered, so we used repeated K-fold validation to find the best division of test and train data, and our results were much better after that.
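The repeated K-fold idea can be sketched with scikit-learn as follows. Again this is an illustration, not our project code: the features and targets are synthetic, and the numbers of splits, repeats and trees are assumed values chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(1)

# Synthetic stand-in for a small Tin-Lead dataset (NOT the project data):
# two made-up features in [0, 1] and a nonlinear target with a dip.
X = rng.uniform(0.0, 1.0, size=(100, 2))
y = 1.6 - 0.8 * X[:, 0] * (1.0 - X[:, 0]) + 0.7 * X[:, 1]
y = y + rng.normal(scale=0.02, size=y.shape)

# 5 folds repeated 3 times with different shuffles: 15 train/test splits.
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0)

# scikit-learn returns negated RMSE so that "higher is better" holds.
scores = cross_val_score(
    model, X, y, scoring="neg_root_mean_squared_error", cv=cv
)
rmse_per_fold = -scores

print(f"mean RMSE over {len(rmse_per_fold)} splits: {rmse_per_fold.mean():.4f}")
print(f"spread (std) across splits: {rmse_per_fold.std():.4f}")
```

With only about 100 points, a single train/test split can look good or bad by luck. Averaging over many reshuffled splits gives a steadier error estimate, and the spread across splits shows how sensitive the model is to the partition.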
As mentioned above, acquiring data for training the models was a tedious process, and we were able to get only 101 data points, which is not enough to train the models. Hence we wanted to make more data, and this is where GANs (Generative Adversarial Networks) come in. The idea is relatively new: it was proposed in 2014 by Ian Goodfellow and his colleagues. Nowadays there are many controversies and restrictions around the use of GANs because they are so good at generating faces of people.
A GAN consists of two deep neural networks, called the generative and discriminative networks. In this framework the two networks play what game theory calls a zero-sum game, which essentially means that one player's gain is the other’s loss (Extreme Competition!).
The generative network tries to make candidates with the same statistics as the data we give it, and the discriminator tries to reject those candidates. As the process repeats, the generative model comes up with better and better candidates, until the discriminator can no longer tell the original data from the data produced by the generative network.
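The zero-sum game described above is usually written as the minimax objective from the original GAN formulation. We quote it here in its standard form; the symbols are the usual ones and do not refer to anything specific in our code:

$$ \min_{G} \max_{D} \; V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right] $$

Here G is the generative network, D is the discriminator, x is a real data point, z is random noise fed to the generator, and D(·) is the probability the discriminator assigns to its input being real. The discriminator tries to make V large and the generator tries to make it small, which is exactly the "one gains, the other loses" situation.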
$$RMSE(obtained) : 0.1052021325418953 $$
The expression we got from linear regression is given below
$$ E_{g} = -0.081040a - 0.017869b + 1.567896x + 0.685927y + 1.5671077288723425 $$
Linear Regression Implementation Code
$$RMSE(obtained) : 0.08131064613786528 $$
DecisionTree Regression Implementation Code
RandomForest Regression
$$RMSE(obtained) : 0.051499223621781595 $$
RandomForest Regression Implementation Code
$$RMSE(obtained) : 0.07509514516846788 $$
Neural Network (Lead only) Implementation Code
$$RMSE(obtained) : 0.08809039177363746 $$
Neural Network (Tin-Lead) Implementation Code
$$RMSE(obtained) : 0.05400406798593189 $$
Neural Network (Tin-Lead) with GAN Implementation Code
$$RMSE(obtained) : 0.1399951363070642 $$
Neural Network (Tin-Lead) with GAN Implementation Code (see the last part of the code for this plot)
$$RMSE(obtained) : 0.09197694758600451 $$
Neural Network (Tin-Lead) with GAN Implementation Code (see the last part of the code for this plot)
The ability to predict bandgaps gives us a view of the desired composition. In this study we used perovskites covering a wide range of bandgaps, which can be used as solar cells or LEDs depending on the bandgap. The major hurdle we faced was the lack of a large amount of data, since this area has not been much explored in the way we have proceeded. We fulfilled most of the targets proposed in our presentations.
We implemented various machine learning algorithms, and the most accurate one was Random Forest: even with a very small amount of data, repeated K-fold cross-validation let us get an excellent result from it. Linear regression was used only for the lead-only perovskites because of the non-linearity we observed in the data; from linear regression we got an empirical formula for the bandgap. We implemented neural networks with and without tin. For the Tin-Lead perovskites, when we trained on the data we collected, the results were not as good as we expected, and the same was true of the Regression Tree algorithm.
We collected data from various sources. For Tin-Lead perovskites in particular, we found no previous publication that used experimental data to arrive at such a result, which means our study is one of the first! Since we lacked data, we came up with an innovative solution: using GANs to generate data from the data we collected. We trained the model on the generated data and tested it on the original (experimental) data, and we got a much-improved result, as the results given above show. With all our models except linear regression, we are able to make predictions in the outlier regions where there is a sudden dip in the bandgap, for which there are no published articles as of now.