Computational Design and Discovery
of High-entropy Alloys
Zhipeng Li
U6766505
A report submitted for the course
COMP8755
Supervised by: Nick Birbilis
The Australian National University
October 2019
Except where otherwise indicated, this report is my own original work.
Zhipeng Li
24 October 2019
Acknowledgement
I would like to thank my supervisor, Professor Nick Birbilis, for providing me with this excellent research opportunity. This project deepened my understanding of using machine learning to solve real-world problems, and it helped me develop research interests and skills that I believe will have a lasting influence on my academic development and career path.
This project could not have been completed without the instruction and help of Professor Nick Birbilis and my external advisor, Will Nash, who were supportive throughout the whole semester. I offer my sincere appreciation for the precious learning experience they provided.
Abstract
High-entropy alloys (HEAs) are alloys composed of five or more metallic elements in nearly equal proportions [2]. This novel class of materials has potentially desirable properties such as better strength-to-weight ratios, higher strength, and fracture resistance [1]. Traditional high-entropy alloy experiments are usually costly and time-consuming, partly due to the inefficiency of the early discovery process, which involves running experiments on a large number of predicted alloy compositions. Hence, it is natural to apply machine learning techniques to the design and discovery of this novel class of alloys.
Generative adversarial networks (GANs) have been studied and applied in various domains in the past few years. In this project, a generative model called cardiGAN, designed on the basis of the Wasserstein GAN, was developed; the name cardiGAN stands for compositionally complex alloy research directive inference GAN. Our results suggest that the proposed cardiGAN model can estimate the underlying probability density function of existing high-entropy alloys in the element and thermodynamic space. We hope this model provides insight to researchers in this field and helps accelerate the discovery of novel high-entropy alloys.
CONTENTS
Acknowledgement
Abstract
1 Introduction
  1.1 Motivations
  1.2 Objective
  1.3 Project Scope
  1.4 Contribution
  1.5 Report Outline
2 Background and Related Work
  2.1 High-entropy Alloys
    2.1.1 Empirical Parameters of High-entropy Alloys
  2.2 Generative Adversarial Networks
  2.3 Wasserstein GAN
    2.3.1 Wasserstein Distance
    2.3.2 Advantages of Wasserstein GAN
3 Methodology
  3.1 Construction of Training Set
    3.1.1 Data Collection and Data Engineering
    3.1.2 Empirical Parameter Calculator
    3.1.3 Construction of Training Set
  3.2 Model Configuration
    3.2.1 Network Overview
    3.2.2 Empirical Parameter Selection
    3.2.3 Calculator Neural Network
    3.2.4 Configuration of cardiGAN
    3.2.5 Training of cardiGAN
    3.2.6 Stop Criteria
4 Evaluation and Analysis
  4.1 Model Evaluation
    4.1.1 Matching Score
    4.1.2 Visualization
  4.2 Result Analysis
Conclusion
Reference
Chapter 1. Introduction
1.1 Motivations
High-entropy alloys (HEAs) are alloys synthesized by mixing five or more metallic elements in nearly equal proportions [3]. This novel class of materials is currently the focus of significant attention because of its desirable properties, such as better strength-to-weight ratios, higher strength, and fracture resistance [1].
Due to the inefficiency of the early discovery process, which usually involves running experiments on a large number of predicted alloy compositions, conventional high-entropy alloy development can be costly in both time and money. Generative adversarial networks (GANs) have been studied and applied in various domains in the past few years, such as image generation and natural language processing, to name just a couple. GANs may handle more complex density functions than other generative models, e.g. variational autoencoders [4]. By applying GANs to high-entropy alloy development (i.e. computational design and discovery), we aim to build a model that may help engineers and scientists reduce the cost of the new-material discovery process and shorten the development cycle.
1.2 Objective
The objective of this project is to build a generative adversarial network model to
estimate the probability distribution of high-entropy alloys in the element and
thermodynamic space.
1.3 Project Scope
This project can be divided into five tasks:
1. Construction of HEA dataset – In this task, around 1500 high-entropy alloy
formulas were manually collected from hundreds of published papers. After parsing
and data engineering, 724 formatted, non-repetitive high-entropy alloy formulas were
added to the HEA dataset.
2. Development of empirical parameter calculator – The stability of a high-entropy alloy's solid-solution state relies on six empirical material parameters: entropy of mixing ΔSmix, enthalpy of mixing ΔHmix, difference in atomic radii δ, a unitless parameter Ω, valence electron concentration (VEC), and average melting point (Tm) [2]. In this project, these six empirical parameters were treated as important features for
the proposed cardiGAN model. An empirical parameter calculator was developed to
calculate these six empirical parameters.
3. Construction of training set – The training set of the cardiGAN model was
constructed from the 724 chemical formulas in the HEA dataset and their six empirical
parameters.
4. Development of cardiGAN – A generative model named cardiGAN, which is
designed based on Wasserstein GAN, was developed. The model’s configuration and
training procedure are explained in Chapter 3.
5. Model evaluation – The proposed cardiGAN model was evaluated in both
machine learning and material aspects.
1.4 Contribution
The main contribution of this project is a study of applying GANs to the computational design and development of high-entropy alloys. In this project, a generative adversarial model that can estimate the underlying probability distribution of high-entropy alloys, namely cardiGAN, was developed. The model can be used to improve discovery efficiency and provide insights into the development of novel high-entropy alloys.
1.5 Report Outline
Chapter 1 provides an introduction to the project. Chapter 2 introduces background knowledge related to this work. Chapter 3 describes the model's
architecture and the basic procedures for applying GANs in novel high-entropy alloy
discovery. Finally, in Chapter 4, we evaluate the model in both machine learning and
material aspects, discuss problems encountered in the current research status, and
propose possible solutions.
Chapter 2. Background and Related Work
2.1 High-entropy Alloys
High-entropy alloys (HEAs) are alloys composed of five or more metallic elements in nearly equal proportions [2]. Before the invention of this novel class of materials, traditional alloys were composed of one or two principal elements with minor proportions of other elements [2]. Because of their special crystal structures, some high-entropy alloys have exceptional mechanical properties in comparison to conventional alloys, especially at high temperatures, such as higher strength and better ductility [3].
2.1.1 Empirical Parameters of High-entropy Alloys
Six empirical parameters have been considered the main factors governing a high-entropy alloy's stable solid-solution phase, namely entropy of mixing ΔSmix, enthalpy of mixing ΔHmix, difference in atomic radii δ, valence electron concentration VEC, average melting point Tm, and a unitless parameter Ω [5]. The design guidelines for high-entropy alloys are shown in [3, Table 1].
The unitless parameter Ω is calculated from Tm, ΔSmix, and ΔHmix [6]:

Ω = Tm·ΔSmix / |ΔHmix|
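For concreteness, Ω can be computed directly from these three quantities. The sketch below uses our own function and variable names and illustrative input values; consistent units for Tm, ΔSmix, and ΔHmix are assumed:

```python
def omega(t_m, ds_mix, dh_mix):
    """Unitless parameter: Omega = Tm * dS_mix / |dH_mix| (consistent units assumed)."""
    return t_m * ds_mix / abs(dh_mix)

# Illustrative (not measured) values:
print(omega(2000.0, 12.0, -8000.0))  # 3.0
```

Note that the formula is undefined when ΔHmix = 0, a point that matters for the model design discussed in Chapter 3.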
2.2 Generative Adversarial Networks (GANs)
Generative adversarial networks are generative models invented by Ian Goodfellow and researchers at the University of Montreal, originally proposed for unsupervised learning [7].
Table 1. Empirical parameters and design guidelines for forming solid-solution HEAs. Note that although most existing high-entropy alloys follow these design guidelines, there also exist synthesizable high-entropy alloys whose empirical parameters fall outside these intervals.
A generative adversarial network comprises two sub-networks: a generator and a discriminator [7]. Given a training set, the generator is trained to map from a latent space to the data distribution of the training set, and the discriminator's training objective is to distinguish the fake data produced by the generator from real data in the training set [7][8]. These two neural networks are trained together and compete against each other until the generator produces data realistic enough that even an optimized discriminator cannot distinguish it, which is why the model is called "adversarial".
As shown in Figure 1, when training GANs, the generator is fed with random noise
input which is sampled from a latent space (e.g. a multivariate Gaussian distribution).
The fake data produced by the generator, along with real data from the training set,
are the inputs of the discriminator. The discriminator evaluates each input by producing a probability that it is real. The loss from this evaluation, typically a binary cross-entropy loss, is back-propagated to both networks so that the generator can learn the data distribution of the training set [9].
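The training step described above can be sketched in PyTorch as follows. This is an illustrative minimal example, not the code used in this project; the layer sizes, optimizer, and batch size are arbitrary choices:

```python
import torch
import torch.nn as nn

g = nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, 56))          # generator
d = nn.Sequential(nn.Linear(56, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))   # discriminator
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(g.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(d.parameters(), lr=1e-4)

real = torch.rand(32, 56)   # stand-in for one batch of real training data
z = torch.randn(32, 12)     # latent noise from a multivariate Gaussian
fake = g(z)

# Discriminator step: push real samples toward label 1, fake samples toward label 0
loss_d = bce(d(real), torch.ones(32, 1)) + bce(d(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to fool the discriminator (fake samples toward label 1)
loss_g = bce(d(fake), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```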
2.3 Wasserstein GAN
Wasserstein GAN (WGAN) is a variant of the standard GAN proposed by Martin Arjovsky and his colleagues [10]. This type of GAN addresses the main training problems of GANs, such as mode collapse, imbalance between the generator and discriminator, and sensitivity to hyperparameter selection [10].
Figure 1: The architecture of generative adversarial networks.
2.3.1 Wasserstein Distance
Wasserstein GAN uses the Wasserstein distance to define its loss function. The Wasserstein distance is the greatest lower bound (infimum) of the transport cost over all transport plans from the real data distribution Pr to the generated data distribution Pg [11].
The Wasserstein distance is defined as:

W(Pr, Pg) = inf_{γ ∈ Π(Pr, Pg)} E_{(x,y)~γ}[|x − y|]
where Pr and Pg denote the density functions of the real and generated data, and Π(Pr, Pg) is the set of all joint distributions γ(x, y) whose marginals are respectively Pr and Pg [10]. Since the above equation is highly intractable, the Wasserstein distance can be simplified using the Kantorovich-Rubinstein duality:

W(Pr, Pg) = sup_{‖f‖_L ≤ 1} E_{x~Pr}[f(x)] − E_{x~Pg}[f(x)]
where sup is the least upper bound (supremum) and f is a 1-Lipschitz function satisfying the constraint [11]:

|f(x1) − f(x2)| ≤ |x1 − x2|
In practice, the discriminator is trained to approximate the Lipschitz function, which is then used to calculate the Wasserstein distance between the distributions of the generated data and the training set [10][11]. Unlike in standard GANs, the discriminator no longer outputs the probability of a sample being real; it produces a scalar score D(x) which can be interpreted as the "realness" of the input data [11].
To maintain the Lipschitz constraint on f, the inventors of WGAN proposed a simple but effective method [10]. By clipping the weights of the discriminator to a small range [−c, c], f always has an upper and a lower bound, which allows the discriminator to maintain the Lipschitz constraint [10][12].
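In PyTorch, this clipping amounts to clamping every parameter after each discriminator update. The sketch below is illustrative; the `critic` network and the threshold `c` are placeholders rather than this project's actual configuration:

```python
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(56, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
c = 0.01  # clipping threshold suggested in the original WGAN paper

# After each critic update, clamp every weight into [-c, c]
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-c, c)

print(all(bool(p.abs().max() <= c) for p in critic.parameters()))  # True
```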
2.3.2 Advantages of Wasserstein GAN
The most significant advantage of Wasserstein GAN is the smooth gradient of its loss function. When training standard GANs, as shown in [10, Figure 2], if the generator is underperforming, the loss function saturates and has a vanishing gradient. This creates an unresolvable imbalance between the generator and the discriminator: the discriminator can easily distinguish data produced by the generator from data in the training set, while the generator learns almost nothing from the saturated loss and stays the same. The only way to avoid this situation is to keep the generator optimized and maintain the equilibrium at all times, which is practically hard to achieve since the discriminator usually gets optimized faster than the generator.
With Wasserstein GAN, when the generator underperforms, the imbalance between the generator and the discriminator can be resolved. This is because Wasserstein GAN uses the Wasserstein distance as its loss function [10]. By maintaining the optimality of the discriminator, the Wasserstein distance always has a smooth gradient, however badly the generator performs [10].
Another significant advantage of Wasserstein GAN is that it has a meaningful loss function which correlates with the generator's performance and the quality of the generated data [10][11]. When training standard GANs, the output of the discriminator is the probability of a sample being real, and the loss is calculated using binary cross-entropy; this loss gives no indication of the quality of the generated data. With Wasserstein GAN, however, the loss function measures how far the generated probability distribution differs from the probability distribution of the training set [11], so the quality of the generated samples can be monitored during training simply by watching the loss.
Figure 2. Optimal discriminator and critic when learning to
differentiate two Gaussians.
Chapter 3. Methodology
This chapter covers the principal components of this project and is divided into two sections: construction of the training set and model configuration. The first section describes the procedure for constructing the training set, which involves data collection and data engineering. The model configuration section explains the architecture and training procedure of the model in chronological order. In this project, we use cardiGAN as the name of our model, which stands for compositionally complex alloy research directive inference GAN.
3.1 Construction of Training Set
3.1.1 Data Collection and Data Engineering
High-entropy alloys are a novel class of alloys that has only been under development for around fifteen years. Currently, there is no available dataset containing all existing high-entropy alloys. The high-entropy alloy formulas in the training set were manually collected from hundreds of published papers. Because the same chemical formula can occur in various forms in different papers, these raw data had to be parsed and formatted.
To generate a unified dataset, a formula parser was developed, which could parse
element compositions from the collected formulas and remove duplicated data. For
example, MnCoFeNiCr and Fe20Mn20Ni20Cr20Co20 are two high-entropy alloy
formulas in the original dataset, and they both indicate the same material. By using
the formula parser, these two formulas are merged into one single formatted chemical
formula, where the elements are arranged in alphabetical order:
Co0.2Cr0.2Fe0.2Mn0.2Ni0.2
After parsing and data engineering, 724 formatted, non-repetitive high-entropy alloys
were added to our HEA dataset, which includes most of the existing high-entropy
alloy formulas developed from 2004 until now. This dataset will be used to create the
training set in the next section.
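A minimal version of such a formula parser can be sketched with a regular expression. This is an illustrative reimplementation, not the project's actual parser:

```python
import re

def parse_formula(formula):
    """Parse e.g. 'Fe20Mn20Ni20Cr20Co20' or 'MnCoFeNiCr' into normalized
    molar ratios, with elements ordered alphabetically."""
    tokens = re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula)
    amounts = {el: float(n) if n else 1.0 for el, n in tokens}  # bare symbol -> 1
    total = sum(amounts.values())
    return {el: amounts[el] / total for el in sorted(amounts)}

print(parse_formula("Fe20Mn20Ni20Cr20Co20"))
# {'Co': 0.2, 'Cr': 0.2, 'Fe': 0.2, 'Mn': 0.2, 'Ni': 0.2}
print(parse_formula("MnCoFeNiCr") == parse_formula("Fe20Mn20Ni20Cr20Co20"))  # True
```

Normalizing the amounts and sorting the element symbols is what lets the two spellings of the same alloy collapse to a single entry, so duplicates can be removed by simple equality.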
3.1.2 Empirical Parameter Calculator
As discussed in Chapter 2, six empirical parameters are relevant to the cardiGAN model, namely entropy of mixing ΔSmix, enthalpy of mixing ΔHmix, difference in atomic radii δ, valence electron concentration VEC, average melting point Tm, and a unitless parameter Ω.
In this section, an empirical parameter calculator, which can calculate the six empirical parameters for any given high-entropy alloy formula, was developed. The empirical parameter calculator integrates the elements' attributes and properties, such as atomic radius, number of valence electrons, and melting point, and applies the corresponding chemical and thermodynamic equations to perform the calculation.
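The underlying calculations can be sketched as follows. The element property values below are illustrative placeholders rather than curated reference data, and the enthalpy of mixing is omitted because it requires tabulated binary mixing enthalpies:

```python
import math

R = 8.314  # gas constant, J/(mol*K)

# Illustrative element properties: (atomic radius in pm, VEC, melting point in K)
PROPS = {"Co": (125, 9, 1768), "Cr": (128, 6, 2180), "Fe": (126, 8, 1811),
         "Mn": (127, 7, 1519), "Ni": (124, 10, 1728)}

def empirical_params(c):
    """c: dict mapping element -> molar ratio (ratios sum to 1)."""
    ds_mix = -R * sum(ci * math.log(ci) for ci in c.values())        # entropy of mixing
    vec = sum(ci * PROPS[el][1] for el, ci in c.items())             # valence electron conc.
    t_m = sum(ci * PROPS[el][2] for el, ci in c.items())             # average melting point
    r_bar = sum(ci * PROPS[el][0] for el, ci in c.items())           # mean atomic radius
    delta = math.sqrt(sum(ci * (1 - PROPS[el][0] / r_bar) ** 2       # atomic-size difference
                          for el, ci in c.items()))
    return ds_mix, vec, t_m, delta

alloy = {el: 0.2 for el in PROPS}  # equiatomic CoCrFeMnNi
ds, vec, t_m, delta = empirical_params(alloy)
print(round(ds, 2))  # R*ln(5), approximately 13.38
```

For an equiatomic five-element alloy, ΔSmix reduces to R·ln(5), which is a useful sanity check on any implementation.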
This empirical parameter calculator plays an important role in the training set
generation and model analysis procedures. A user interface of this software was
developed, so it can be used by people who are interested in this area.
3.1.3 Construction of Training Set
The training set contains 724 high-entropy alloy formulas along with their empirical
parameters. The information of each chemical formula is represented as a list of molar
ratios of the 56 selected elements which are shown in Table 2.
The training set has 724 rows with each row representing the information of a specific
high-entropy alloy formula. The first 56 columns are the molar ratios, and the last 6
columns are the six empirical parameters. The molar ratios and empirical parameters
comprise the feature space of the cardiGAN model. Because most high-entropy alloys
are composed of 3 to 10 elements, this representation requires the generator of the
cardiGAN model to produce a sparse, nonnegative output.
Figure 3. User Interface of Empirical Parameter Calculator
developed in this project
Ag Al Au B Be Bi C Ca
Cd Ce Co Cr Cu Dy Er Fe
Gd Ge Hf Ho In Ir La Li
Lu Mg Mn Mo Nb Nd Ni Os
P Pb Pd Pr Pt Re Rh Ru
Sb Sc Si Sm Sn Sr Ta Tb
Ti Tm V W Y Yb Zn Zr
Figure 4. The pipeline of the cardiGAN model. There are three components: a
generator network for generating fake chemical formulas, a calculator network for
calculating empirical parameters and a discriminator network for detecting whether a
given sample is real or generated.
Table 2. Elements explored to predict novel high-entropy alloys. It is noted that numerous metals were excluded on the basis of impracticality (such as those that are radioactive or highly reactive).
3.2 Model Configuration
3.2.1 Network Overview
As shown in Figure 4, the cardiGAN model has three sub-networks: a generator, a
discriminator and a pre-trained calculator neural network. The most significant
difference between the cardiGAN model and Wasserstein GAN is that cardiGAN has
a pretrained calculator neural network. This neural network is used to simulate the
functionality of the empirical parameter calculator. The generator network is trained
to produce the element compositions of fake HEA formulas. The calculator network takes the generated element compositions as inputs and estimates the empirical parameters of the corresponding high-entropy alloy formulas. The outputs of these two networks are then concatenated to form the fake inputs of the discriminator.
The cardiGAN model is trained with the molar ratios and empirical parameters of the
724 high-entropy alloy formulas in the training set. The input of the discriminator has the form:

x = [m1, m2, …, m56, σ, ΔSmix, ΔHmix, δ, VEC, Tm]^T

where m1, m2, …, m56 are the molar ratios of the 56 selected elements and σ is the sum of the molar ratios, σ = m1 + m2 + ⋯ + m56. The last five dimensions of x are the empirical parameters.
The real HEA formulas’ empirical parameters are calculated using the empirical
parameter calculator before training and can be accessed from the training set, while
the generated formulas’ empirical parameters are calculated by the pre-trained
calculator network. This calculator neural network maps the generated element
compositions to their five empirical parameters.
3.2.2 Empirical Parameter Selection
There are six empirical parameters in traditional high-entropy alloy design and
development, but only five of them, namely entropy of mixing ∆𝑆𝑚𝑖𝑥 , enthalpy of
mixing ∆𝐻𝑚𝑖𝑥 , difference in atomic radii 𝛿, valence electron concentration 𝑉𝐸𝐶 and
average melting point Tm, are used to train the cardiGAN model. The reasons for not including the unitless parameter Ω are as follows:
1. The sixth empirical parameter Ω is calculated from three of the other empirical parameters, so the information it carries is redundant. Including Ω would increase the model's training difficulty, especially for the calculator network. To estimate Ω, the calculator network would not only have to learn to map the element compositions to this extra parameter, but would also need to capture the nonlinear relationship between Ω, Tm, ΔSmix, and ΔHmix, which contains an inverse absolute-value function that is discontinuous at ΔHmix = 0:

Ω = Tm·ΔSmix / |ΔHmix|
2. Ω has no direct connection to the element compositions. All of the other five empirical parameters are calculated directly from the molar ratios and chemical properties of the constituent elements. By including only these five parameters, the calculator neural network can be trained more easily and with higher accuracy.
3. The distribution of Ω in the training set is sparse. The training set contains several extremely large Ω values, which still follow the HEA design guidelines, whereas most Ω values in the training set are less than 10. In this case, forcing the generator to learn this sparse distribution of Ω would lead to overfitting.
3.2.3 Calculator Neural Network
Before training the generator and discriminator networks, a calculator neural network
is constructed and trained. The reasons for having a calculator neural network in our
model are as follows:
1. The empirical parameters are important features, and it is hard to train the
generator to produce element compositions and empirical parameters at the same
time.
Figure 5. Configuration of the calculator neural network
2. Using the empirical parameter calculator script to do the calculation is inefficient and would block the back-propagation path of the empirical-parameter-related loss.
3. Since the calculations of the empirical parameters involve combinations over chemical properties, such as atomic radii and pairwise (binary) mixing enthalpies, it is almost impossible to hardcode the empirical parameter calculator inside a neural network.
4. The five selected empirical parameters are directly associated with the element compositions, so they can be accurately estimated by a neural network.
The configuration of the calculator network is shown in Figure 5. The calculator
network is trained with a large dataset that contains molar ratios and empirical
parameters of 100,000 randomly generated formulas. During training, 30,000 formulas
and the 724 real high-entropy alloy formulas are used as the test set.
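Such a calculator network and one supervised training step can be sketched as follows; the hidden-layer sizes are our own assumption (Figure 5 shows the actual configuration), and the training pairs below are random stand-ins for the generated-formula dataset:

```python
import torch
import torch.nn as nn

# Maps 56 normalized molar ratios to the 5 selected empirical parameters
calculator = nn.Sequential(
    nn.Linear(56, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 5),
)
opt = torch.optim.Adam(calculator.parameters(), lr=1e-3)
mse = nn.MSELoss()

# Stand-ins for one batch of (composition, empirical-parameter) pairs
x = torch.rand(64, 56)
x = x / x.sum(dim=1, keepdim=True)   # normalize so each formula's ratios sum to 1
y = torch.rand(64, 5)

loss = mse(calculator(x), y)         # standard supervised regression step
opt.zero_grad(); loss.backward(); opt.step()
print(calculator(x).shape)           # torch.Size([64, 5])
```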
Figure 6. Training and test loss of the calculator neural network.
3.2.4 Configuration of cardiGAN
Figure 7. The network configuration of the proposed cardiGAN model. Note that the
empirical parameter calculator network is already trained before cardiGAN’s training,
so this network only functions as a calculating tool and is not being trained.
The cardiGAN model has three sub-networks: a generator, a discriminator, and a pre-
trained calculator neural network.
As shown in Figure 7, the generator is composed of two fully connected layers. The input of the generator is 12-dimensional standard normal noise z ~ N(0, 1), and the output G(z) is a 56-dimensional vector, with each dimension representing the molar ratio of a specific element.
Since the molar ratios of elements cannot be negative, and most existing high-entropy
alloys are composed of 3 to 10 elements, the output of the generator should be a non-
negative, sparse vector. To produce such output, we use the Rectified Linear Unit
(ReLU) activation function as the output function of the generator, which will set
negative values to zero.
𝑅𝑒𝐿𝑈(𝑣) = max(0, 𝑣)
The ReLU activation function is shown in Figure 8.
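A sketch of this generator in PyTorch (the hidden-layer size is our assumption; the 12-dimensional noise input and 56-dimensional ReLU output follow the description above):

```python
import torch
import torch.nn as nn

generator = nn.Sequential(
    nn.Linear(12, 64), nn.ReLU(),   # 12-dim standard-normal noise input
    nn.Linear(64, 56), nn.ReLU(),   # ReLU output: non-negative, sparse molar ratios
)

z = torch.randn(8, 12)              # z ~ N(0, 1)
m = generator(z)
print(m.shape)                      # torch.Size([8, 56])
print(bool((m >= 0).all()))         # True: ReLU guarantees non-negative ratios
```

Because ReLU zeroes out negative pre-activations, many output dimensions are exactly zero, matching the fact that most HEAs contain only 3 to 10 of the 56 candidate elements.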
The input of the discriminator includes the molar ratios of generated formulas and real
formulas along with their five empirical parameters. The real formulas’ molar ratios
and empirical parameters can be retrieved from the training set, while the generated
formulas’ empirical parameters are calculated by the pre-trained calculator network.
The discriminator is composed of two fully connected hidden layers and one output layer; each hidden layer's output is activated by the LeakyReLU activation function. The output layer of the discriminator produces a scalar value D(x) which represents the "realness" of the input data. Note that the output layer has no sigmoid function; the scalar value is used directly to calculate the Wasserstein loss.
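The discriminator (critic) can be sketched similarly; its input is 62-dimensional (56 molar ratios, σ, and the 5 empirical parameters), while the hidden-layer sizes are our assumption:

```python
import torch
import torch.nn as nn

# Input: 56 molar ratios + sigma (sum of ratios) + 5 empirical parameters = 62 dims
critic = nn.Sequential(
    nn.Linear(62, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1),              # scalar "realness" score; no sigmoid at the output
)

x = torch.rand(8, 62)
score = critic(x)
print(score.shape)                  # torch.Size([8, 1])
```

Leaving the output unbounded (no sigmoid) is what allows the score to act as the 1-Lipschitz function in the Wasserstein loss rather than as a probability.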
3.2.5 Training of cardiGAN
Due to the sparsity of the molar-ratio distribution (most molar ratios of an HEA formula are 0, and some elements do not occur in the training set at all), the molar ratios are not normalized. Since the molar ratios lie in the interval [0, 1], the empirical parameters were normalized to be around the interval [0, 1], which made it easier for the neural network to converge.
During training, the discriminator is trained five times as much as the generator. This
is because the discriminator has to be optimized to calculate the Wasserstein distance,
which in turn helps train the generator better. Our experiments also showed that this emphasis on training the discriminator helps reduce mode collapse and enhances stability.
The output of the generator is a 56-dimensional vector containing the molar ratios of
the generated formula, 𝑚 = [𝑚1, 𝑚2, … . . , 𝑚56]𝑇 . This output is fed into the pre-trained
calculator network to estimate the 5 empirical parameters of the associated chemical
formula. The normalized molar ratios m_norm and the calculated empirical parameters are then concatenated, along with σg (the sum of the generated molar ratios), to produce the fake input of the discriminator:

x_g = [(m_norm)^T, σg, ΔSmix, ΔHmix, δ, VEC, Tm]^T

where m_norm = m / Σi mi is the vector of normalized molar ratios of the generated formula.

Figure 8. Rectified Linear Unit (ReLU) activation function

Note that the input of the calculator network is also m_norm; this is because calculating empirical parameters for a chemical formula whose molar ratios sum to more or less than 1 is meaningless. The empirical parameters are only defined for valid chemical formulas.
The σ value is the sum of the molar ratios of a chemical formula. For formulas in the training set, this value always equals 1. Without this feature, although the generator could still learn the element distribution of formulas in the training set, the generated molar ratios would have no restriction on their sum. Some generated formulas would then have a very small sum of molar ratios, which is harmful to training: the generated molar ratios could shrink until their sum becomes 0, and a zero sum cannot be used as a divisor when normalizing the molar ratios, which would halt the training process. The σ value also helps the generator produce more realistic element compositions, since the sum of the molar ratios of a real chemical formula should always equal one.
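Assembling the fake discriminator input from the generator's raw output can be sketched as follows. The function and tensor names are ours, and the small epsilon is one way to guard against the zero-sum case just discussed:

```python
import torch

def make_fake_input(m, params_fn, eps=1e-8):
    """m: (batch, 56) raw generator output; params_fn: calculator network
    mapping normalized molar ratios to the 5 empirical parameters."""
    sigma = m.sum(dim=1, keepdim=True)                 # sum of generated molar ratios
    m_norm = m / (sigma + eps)                         # normalize; eps guards zero sums
    params = params_fn(m_norm)                         # estimated empirical parameters
    return torch.cat([m_norm, sigma, params], dim=1)   # (batch, 56 + 1 + 5) = (batch, 62)

m = torch.rand(4, 56)
x_g = make_fake_input(m, lambda v: torch.zeros(v.shape[0], 5))  # dummy calculator
print(x_g.shape)  # torch.Size([4, 62])
```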
When training cardiGAN, the generator is trained to maximize the loss function

Lg(D, gθ) = E_{x_g~Pg}[D(x_g)]

where Pg denotes the distribution of the generated data.
Figure 9. The pipeline of back-propagation in generator training process
As shown in Figure 9, when back-propagating for the generator, the loss coming from the molar ratios is back-propagated directly from the discriminator to the output layer of the generator, while the part of the loss relating to the empirical parameters is passed to the generator through the calculator neural network. Note that, since the calculator network is already trained, the training process of cardiGAN does not update the calculator.
The discriminator is trained in a similar way, except that its loss function is the Wasserstein distance between the probability distributions of the training data and the generated data:

Ld(D, gθ) = E_{x_g~Pg}[D(x_g)] − E_{x_r~Pr}[D(x_r)]

where Pr and Pg denote the distributions of the real and generated data.
Since we used the Wasserstein distance as the loss function, both training processes applied the RMSprop optimization algorithm, as recommended by the authors of the original WGAN paper [10].
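Putting the pieces together, one cardiGAN training iteration can be sketched as follows. The network definitions, hidden sizes, learning rate, and batch size here are illustrative assumptions; only the two losses, the five critic updates per generator update, the weight clipping, and the RMSprop choice follow the description above:

```python
import torch
import torch.nn as nn

gen = nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, 56), nn.ReLU())
critic = nn.Sequential(nn.Linear(62, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
calc = nn.Sequential(nn.Linear(56, 64), nn.ReLU(), nn.Linear(64, 5))  # stands in for the
for p in calc.parameters():                                           # pre-trained, frozen
    p.requires_grad_(False)                                           # calculator network

opt_g = torch.optim.RMSprop(gen.parameters(), lr=5e-5)
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

def fake_batch(n):
    m = gen(torch.randn(n, 12))                   # raw generated molar ratios
    sigma = m.sum(dim=1, keepdim=True)
    m_norm = m / (sigma + 1e-8)                   # eps guards against a zero sum
    return torch.cat([m_norm, sigma, calc(m_norm)], dim=1)  # (n, 62) fake input

x_r = torch.rand(32, 62)                          # stand-in for one real training batch

for _ in range(5):                                # 5 critic updates per generator update
    loss_c = critic(fake_batch(32)).mean() - critic(x_r).mean()   # Ld
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    with torch.no_grad():                         # weight clipping keeps f Lipschitz
        for p in critic.parameters():
            p.clamp_(-0.01, 0.01)

loss_g = -critic(fake_batch(32)).mean()           # minimize -Lg, i.e. maximize E[D(x_g)]
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Because the calculator's parameters are frozen, gradients of the empirical-parameter portion of the loss flow *through* it back to the generator without ever updating the calculator itself.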
3.2.6 Stop Criteria
In the training process of the cardiGAN model, three parameters are used as stopping criteria: the Wasserstein distance (discriminator loss) Ld(D, gθ), the average matching score s_mean, and the number of finely regenerated formulas. During training, we monitor the model's performance by generating 10,000 samples after each epoch and calculating these parameters. Once the three parameters reach their optima, we stop training and save the model.
Figure 10. The Wasserstein loss of cardiGAN during training
As shown in Figure 10, the Wasserstein distance between the probability distributions of the generated data and the training data decreases as the model trains. The reason for using the Wasserstein distance as a stopping criterion is that it correlates with the quality of the generated data, as explained in Chapter 2.
The other two criteria are derived from an evaluation parameter called the matching
score. The average matching score measures the similarity between the distributions
of the generated dataset and the training set, while the number of finely
regenerated formulas counts the real formulas that the model can reproduce with a
high matching score. The matching score, which is designed to quantify the
similarity between generated formulas and real high-entropy alloy formulas, is
defined in Chapter 4 along with its calculation procedure.
Chapter 4. Evaluation and Analysis
This chapter describes the evaluation methods and results for the cardiGAN model.
The model is evaluated from both a machine learning and a materials perspective. On
the machine learning side, an evaluation parameter called the matching score is
introduced, which is used to assess the 'mode diversity' of the model. On the
materials side, the model is evaluated by comparing the distributions of the
empirical parameters of the generated formulas against those of the real
high-entropy alloy formulas.
4.1 Model Evaluation
Due to the small size and sparse distribution of the training set, evaluating the
model is difficult. Because the training data are unlabeled, popular evaluation
metrics such as the Inception Score (IS) cannot be applied. The sparsity of the
training set also makes kernel density estimation via the Parzen-Rosenblatt window
method unreliable; some elements occur only once or twice in the training set.
Hence, we introduce an evaluation parameter named the matching score, which
evaluates the model by assessing the diversity of the generated dataset.
4.1.1 Matching Score
In this section, an evaluation parameter named the matching score is introduced.
This parameter is designed to quantify the similarity between generated and real
high-entropy alloy formulas. Although mathematically simple, it lets us check for
mode collapse during training by estimating the nearest neighbors of the generated
formulas. We first describe how to calculate the matching score for a batch of
generated data.
Suppose there are n formulas in the generated dataset, and G is the matrix
containing the molar ratios of all the generated formulas:

    G = [ m_{1,1}  ...  m_{1,56} ]
        [   ...    ...    ...    ]
        [ m_{n,1}  ...  m_{n,56} ]

R is the matrix containing the molar ratios of the 724 real high-entropy alloy
formulas:

    R = [ r_{1,1}    ...  r_{1,56}   ]
        [   ...      ...    ...      ]
        [ r_{724,1}  ...  r_{724,56} ]

Because all the molar ratios are nonnegative, we can take the square root of each
entry of these two matrices to create G_sqrt and R_sqrt, where
(G_sqrt)_{i,j} = sqrt(m_{i,j}) and (R_sqrt)_{i,j} = sqrt(r_{i,j}). We then define
the matching score matrix M as the product of G_sqrt and the transpose of R_sqrt:

    M = G_sqrt R_sqrt^T = [ s_{1,1}  ...  s_{1,724} ]
                          [   ...    ...    ...     ]
                          [ s_{n,1}  ...  s_{n,724} ]
where s_{i,j} is the matching score between the i-th generated formula and the j-th
real formula:

    s_{i,j} = Σ_{k=1}^{56} sqrt(m_{i,k}) · sqrt(r_{j,k})
The matching score of a generated formula against the training set is then defined
as the largest matching score it attains over the 724 real formulas, and its
nearest neighbor is the real formula with which it attains that largest score:

    s = [ s_{1,max}, ..., s_{n,max} ]^T

where s_{i,max} is the largest value in the i-th row of M. The average matching
score of the generated dataset is the mean over all generated formulas:

    s_mean = (s_{1,max} + ... + s_{n,max}) / n
The matching score lies in the range [0, 1]. When two formulas have exactly the
same element composition, their matching score equals 1, since in that case the
matching score reduces to the sum of the formula's molar ratios:

    s = Σ_{i=1}^{56} sqrt(m_i) · sqrt(m_i) = Σ_{i=1}^{56} m_i = 1
By calculating the matching scores of a batch of generated formulas, we can
estimate their nearest neighbors in the training set. Unlike computing Euclidean
distances, calculating matching scores requires only a few matrix operations, which
makes it computationally efficient and easy to implement. Our experiments showed
that the nearest neighbors found via matching scores and via Euclidean distances
are nearly identical (only about 1% differ), which is acceptable since our task is
to evaluate the diversity of the generated formulas. Because of its low
computational cost and its effectiveness at finding nearest neighbors, the matching
score can be computed at every training epoch without noticeably slowing training.
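The matrix-based calculation described above can be sketched as follows. This is a pure-Python stand-in for the project's implementation, using small 5-element vectors instead of the full 56 columns:

```python
# Sketch of the matching-score computation: entry (i, j) of the score
# matrix is sum_k sqrt(m_{i,k}) * sqrt(r_{j,k}); each generated formula
# keeps its row maximum and the index of the real formula attaining it.
from math import sqrt

def matching_scores(G, R):
    """For each generated formula (row of G), return its best matching
    score against the real formulas (rows of R) together with the index
    of that nearest neighbor."""
    results = []
    for g in G:
        scores = [sum(sqrt(gk) * sqrt(rk) for gk, rk in zip(g, r)) for r in R]
        best = max(range(len(R)), key=lambda j: scores[j])
        results.append((scores[best], best))
    return results

# Identical normalized compositions score exactly 1.
G = [[0.2, 0.2, 0.2, 0.2, 0.2],
     [0.5, 0.5, 0.0, 0.0, 0.0]]
R = [[0.2, 0.2, 0.2, 0.2, 0.2],
     [0.4, 0.3, 0.3, 0.0, 0.0]]
s_max = [s for s, _ in matching_scores(G, R)]
s_mean = sum(s_max) / len(s_max)
```

The first generated row matches the first real formula with score 1 (identical compositions); the second finds its best partial overlap in the second real formula.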
The matching score of two chemical formulas correlates with their similarity.
Table 3 shows some generated high-entropy alloy formulas with different matching
scores. As the table shows, formulas with matching scores above 0.98 are very
similar to their nearest neighbors, and formulas with matching scores close to 1
are almost identical to existing high-entropy alloys.
Generated Formula                        Matching Score   Nearest Neighbor
Al0.14Co0.2Cr0.2Cu0.06Fe0.2Ni0.2         0.9997           Al0.15Co0.2Cr0.2Cu0.05Fe0.2Ni0.2
Al0.2Nb0.21Ti0.26V0.13Zr0.2              0.995            Al0.2Nb0.2Ti0.2V0.2Zr0.2
Mo0.15Nb0.31Ti0.22Zr0.32                 0.99             Mo0.25Nb0.25Ti0.25Zr0.25
Hf0.16Mo0.08Nb0.14Ta0.24Ti0.27Zr0.11     0.9854           Hf0.18Mo0.1Nb0.18Ta0.18Ti0.18Zr0.18
Co0.2Cr0.1Fe0.45Mn0.22Si0.03             0.98             Co0.2Cr0.15Fe0.4Mn0.2Si0.05
Table 3. Generated formulas and their nearest neighbors found using matching score
The average matching score assesses the overall similarity between the generated
dataset and the training set. Figure 11 plots the Wasserstein distance together
with the average matching score of 10,000 generated formulas. As Figure 11 shows,
the average matching score is inversely correlated with the Wasserstein distance
between the probability distributions of the generated data and the training data.
The reason for using the average matching score as a stopping criterion is that the
Wasserstein distance is estimated by the Lipschitz function defined by the
discriminator, whose weights differ between training runs, so the Wasserstein
distance converges to a different value each run. The average matching score, in
contrast, depends only on the element compositions of the generated formulas, which
allows the model to be assessed during training in a more direct way: we can judge
the model's performance simply by inspecting the current average matching score.
For example, an average matching score of 0.6 indicates underfitting, while an
average matching score of 0.99 suggests mode collapse or a memorizing GAN, which
often occurs when the generator has too many hidden layers.
Figure 11. Wasserstein distance vs. average matching score.
The number of finely regenerated formulas is obtained by counting the real formulas
that are regenerated with a high matching score. For example, with the lower bound
of the matching score set to 0.98, we first select the generated formulas whose
matching scores exceed 0.98, and then count how many distinct real formulas appear
among their nearest neighbors.
Since a generative model will almost never reproduce a training formula exactly, we
use the number of real formulas regenerated with a high matching score as a proxy
for the generated dataset's diversity. An ideal model should be able to reproduce
most of the chemical formulas in the training set with high matching scores. During
training, the stopping criterion on this count was set so that the regenerated
formulas with matching scores above 0.98 cover about 50-55% of the high-entropy
alloy formulas in the training set. In the generated dataset, around 10-15% of the
formulas have matching scores above 0.99, covering about 30% of the training
formulas.
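The counting procedure can be sketched as follows, assuming the nearest-neighbor pairs have already been computed as described in Section 4.1.1 (the helper name and example values are illustrative, not from the project):

```python
# Sketch of the "finely regenerated formulas" count: among generated
# formulas whose best matching score exceeds a threshold, count how
# many *distinct* real training formulas appear as nearest neighbors.

def count_finely_regenerated(pairs, threshold=0.98):
    """pairs: list of (best_score, nearest_neighbor_index) tuples,
    one per generated formula."""
    covered = {idx for score, idx in pairs if score > threshold}
    return len(covered)

# Four generated formulas; two of them share the same nearest neighbor
# and one falls below the threshold, so only 2 real formulas are covered.
pairs = [(0.999, 0), (0.985, 0), (0.991, 3), (0.600, 7)]
n_covered = count_finely_regenerated(pairs)
```

Using a set of neighbor indices ensures each real formula is counted once, no matter how many generated formulas land near it.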
4.1.2 Visualization
After the model is trained and saved, a dataset of 10,000 generated high-entropy
alloy formulas is created. The generated data and training data are then loaded
into TensorBoard to visualize their distributions. The following two figures show
2D visualizations of the distributions of the generated dataset and the training
set; the red dots represent generated data, and the blue dots are real data from
the training set.
Figure 12. Number of finely regenerated formulas with matching score > 0.98.
T-SNE (t-distributed Stochastic Neighbor Embedding) is a technique for visualizing
high-dimensional data in a two- or three-dimensional map [13]. It uses random walks
on neighborhood graphs to enable the data's implicit structure to be displayed in
the 2D or 3D visualization [13]. UMAP (Uniform Manifold
Approximation and Projection) is another dimensionality reduction algorithm, which
is competitive with t-SNE in visualization quality [14]. As shown in these two
figures, although the distribution of the training data is rather sparse, the model
still fits it well.
Figure 13. T-SNE figure of generated and training data
Figure 14. UMAP figure of generated and training data
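A rough equivalent of these visualizations can be produced outside TensorBoard, for example with scikit-learn's t-SNE. The sketch below uses random stand-in compositions rather than the project's data; a UMAP projection would follow the same pattern using the umap-learn package:

```python
# Sketch of projecting generated and real compositions into 2-D with
# t-SNE, mirroring the TensorBoard visualization described above.
# The Dirichlet samples below are stand-ins for molar-ratio vectors.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
real = rng.dirichlet(np.ones(5), size=40)   # stand-in "training" ratios
fake = rng.dirichlet(np.ones(5), size=40)   # stand-in "generated" ratios

data = np.vstack([real, fake])
embedding = TSNE(n_components=2, perplexity=10,
                 random_state=0).fit_transform(data)
# embedding[:40] -> real points (blue), embedding[40:] -> generated (red)
```

Plotting the two halves of `embedding` in different colors reproduces the red/blue scatter described in the text.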
4.2 Result Analysis
In this section, the model is evaluated by comparing the empirical parameters of
the generated formulas against those of the training formulas. The design
guidelines are then used to assess the formulas in the generated dataset.
The mean and standard deviation of the empirical parameters of the generated and
training formulas are as follows:

        ΔS_mix (J/(K·mol))  ΔH_mix (kJ/mol)  δ (%)  Ω      VEC   T_m (K)
mean    13.99               -11.39           5.16   11.53  6.91  1835.10
std.    1.88                10.49            3.58   84.16  1.47  316.70
Table 4. Mean and standard deviation of the empirical parameters of the generated formulas

        ΔS_mix (J/(K·mol))  ΔH_mix (kJ/mol)  δ (%)  Ω      VEC   T_m (K)
mean    13.31               -11.21           5.13   7.85   6.80  1816.24
std.    1.93                9.76             3.67   24.72  1.59  345.21
Table 5. Mean and standard deviation of the empirical parameters of the training formulas

Notations: ΔS_mix: entropy of mixing; ΔH_mix: enthalpy of mixing; δ: difference in
atomic radii; VEC: valence electron concentration; T_m: average melting point;
Ω = T_m ΔS_mix / |ΔH_mix|.

The tables show that the generated and training formulas have very similar means
and standard deviations for the empirical parameters. The only exception is Ω: both
the mean and the standard deviation of Ω are larger for the generated dataset than
for the training set. This is because some generated formulas have near-zero ΔH_mix
values, and since Ω is inversely proportional to the absolute value of ΔH_mix,
these formulas can have extremely large Ω values, which raise the mean and standard
deviation. However, since the design guidelines do not restrict large Ω values, and
large Ω values favor the formation of high-entropy alloys [2], the difference in Ω
between the generated formulas and the real high-entropy alloy formulas is not a
significant concern.

The visualizations of the distributions of the generated formulas and the real
high-entropy alloy formulas are shown below. These two figures were produced using
both the molar ratios and the empirical parameters of the formulas. After including
the empirical parameters, the red dots fit the blue dots much better, because the
model is trained on both the molar ratios and the empirical parameters.
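The sensitivity of Ω to near-zero ΔH_mix can be checked with a quick calculation (the input values below are illustrative, not taken from the dataset):

```python
# Worked check of why near-zero ΔH_mix inflates Ω = T_m * ΔS_mix / |ΔH_mix|.
# Note the unit conversion: ΔS_mix in J/(K·mol), ΔH_mix in kJ/mol.

def omega(t_m, ds_mix, dh_mix):
    """Ω for melting point t_m (K), entropy ds_mix (J/(K·mol)),
    and enthalpy dh_mix (kJ/mol)."""
    return t_m * (ds_mix / 1000.0) / abs(dh_mix)   # J -> kJ

typical = omega(1800, 13.4, -11.0)   # a typical enthalpy of mixing
extreme = omega(1800, 13.4, -0.1)    # near-zero enthalpy inflates Ω
```

Here `typical` is on the order of 2, while `extreme` exceeds 200, which is exactly the kind of outlier that inflates the Ω mean and standard deviation of the generated dataset.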
Applying the design guidelines for high-entropy alloys in Table 1, we find that
around 60% of the formulas generated by the model fall within these intervals,
which matches the proportion of high-entropy alloy formulas in the training set
that do.
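Such a guideline check reduces to testing each empirical parameter against an interval. In the sketch below, the bounds are placeholders and should be replaced by the actual intervals from Table 1:

```python
# Sketch of applying design guidelines as interval checks on the
# empirical parameters. The bounds below are PLACEHOLDERS, not the
# report's Table 1 values.
GUIDELINES = {
    "dS_mix": (12.0, 17.5),   # J/(K·mol), placeholder bounds
    "dH_mix": (-15.0, 5.0),   # kJ/mol, placeholder bounds
    "delta":  (0.0, 6.6),     # %, placeholder bounds
}

def passes_guidelines(params):
    """True if every listed parameter falls inside its interval."""
    return all(lo <= params[k] <= hi for k, (lo, hi) in GUIDELINES.items())

ok  = passes_guidelines({"dS_mix": 13.4, "dH_mix": -11.2, "delta": 5.1})
bad = passes_guidelines({"dS_mix": 13.4, "dH_mix": -20.0, "delta": 5.1})
```

Running every generated formula through such a filter yields the "fraction falling into the intervals" quoted above.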
Figure 15. T-SNE figure of generated and training data using
both molar ratios and empirical parameters
Figure 16. UMAP figure of generated and training data using
both molar ratios and empirical parameters
Conclusion
1. The first GAN applied to alloy development, specifically high-entropy alloys,
was developed in this work; it estimates the probability distribution of
high-entropy alloys in both element and thermodynamic space.
2. A single unified dataset containing 724 high-entropy alloys and their empirical
parameters was constructed.
3. An empirical parameter calculator was developed that computes the six empirical
parameters of high-entropy alloys.
4. Owing to the limited empirical data, a fast evaluation metric, the matching
score, was designed and implemented to quantify the similarity between high-entropy
alloy formulas.
5. Potentially novel HEAs that may possess an FCC structure and a single phase
include Al6.6Co6Cr2.2Cu4.7Ni1.8Zr1.9 and Al1.6Co1.8Cr0.8Cu2Fe1.8Ni2.6Pd0.9.
References
[1] M.-H. Tsai and J.-W. Yeh, “High-Entropy Alloys: A Critical Review,” in Materials
Research Letters, vol. 2, no. 3, pp. 107–123, 2014.
[2] J.-W. Yeh, S.-K. Chen, S.-J. Lin, J.-Y. Gan, T.-S. Chin, T.-T. Shun, C.-H. Tsau, and S.-
Y. Chang, “Nanostructured High-Entropy Alloys with Multiple Principal Elements:
Novel Alloy Design Concepts and Outcomes,” in Advanced Engineering Materials, vol.
6, no. 5, pp. 299–303, 2004.
[3] “High entropy alloys,” Wikipedia, 31-Aug-2019. [Online]. Available:
https://en.wikipedia.org/wiki/High_entropy_alloys#cite_note-tsai-2. [Accessed: 24-
Oct-2019]
[4] Z. Wang, Q. She and T. Ward, "Generative Adversarial Networks: A Survey and
Taxonomy", arXiv preprint arXiv: 1906.01529, 2019.
[5] Y. Zhang, Y. J. Zhou, J. P. Lin, G. L. Chen, and P. K. Liaw, “Solid-Solution Phase
Formation Rules for Multi-component Alloys,” in Advanced Engineering Materials, vol.
10, no. 6, pp. 534–538, 2008.
[6] Y Zhang, T. Zuo, Z. Tang, M. C. Gao, K. A. Dahmen, P. K. Liaw, and Z. Lu,
"Microstructures and properties of high-entropy alloys," in Progress in Materials Science,
vol. 61, pp. 1–93. 2014.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A.
Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014.
[8] P. Luc, C. Couprie, S. Chintala, and J. Verbeek, "Semantic Segmentation using
Adversarial Networks", in NIPS, 2016
[9] J. Brownlee, “How to Implement Wasserstein Loss for Generative Adversarial
Networks,” in Machine Learning Mastery, 12-Jul-2019. [Online]. Available:
https://machinelearningmastery.com/how-to-implement-wasserstein-loss-for-
generative-adversarial-networks/. [Accessed: 25-Oct-2019].
[10] M. Arjovsky, S. Chintala, and L. Bottou. “Wasserstein GAN”, arXiv preprint
arXiv:1701.07875, 2017.
[11] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, “Improved
training of wasserstein gans”, arXiv preprint arXiv:1704.00028, 2017.
[12] L. Weng, “From GAN to WGAN,” in Lil'Log, 20-Aug-2017. [Online]. Available:
https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html. [Accessed:
25-Oct-2019].
[13] K. Y. Wong and F.-L. Chung, “Visualizing Time Series Data with Temporal
Matching Based t-SNE”, in International Joint Conference on Neural Networks (IJCNN),
2019.
[14] L. McInnes, J. Healy, and J. Melville, “UMAP: Uniform Manifold Approximation
and Projection for Dimension Reduction”, arXiv preprint arXiv:1802.03426, 2018.
Appendix 1. Final Project Description
High-entropy alloys (HEAs) are alloys composed of five or more metallic elements in
nearly equal proportions. This novel class of materials has potentially desirable
properties such as better strength-to-weight ratios, higher strength, and fracture
resistance.
This one-semester project relies on the use of machine learning to assist in the
development of A.I.-predicted high-entropy alloy compositions.
This project involves five main tasks:
1. Construction of HEA dataset – The training set should include the
majority of the existing HEA compositions that can be found in published
papers related to HEAs.
2. Building an empirical parameter calculator – The stabilization
mechanism of HEAs is related to six empirical parameters: entropy of mixing,
enthalpy of mixing, difference in atomic radii (delta), omega, valence electron
concentration (VEC), and average melting point (Tm). These six empirical
parameters are also treated as important features and used to predict potentially
novel HEAs. The empirical property calculator should be able to perform both
individual and large-scale calculations.
3. Construction of training set – The training set should consist of chemical
formulas along with their 6 empirical parameters calculated using the empirical
property calculator.
4. Construction of machine learning model – An appropriate machine
learning model needs to be constructed and optimized. The machine learning task
is to use the descriptive attributes in the training set to predict potentially
novel high-entropy alloys.
5. Model evaluation – In this step, some evaluation techniques should be
applied to evaluate the model’s predictive performance.
Appendix 2. Independent Study Contract
Appendix 3. Description of Software
Program code files:
Java files:
DataCleaner.java : This class normalizes and reorders the chemical formulas in the
HEA dataset and removes duplicate entries. It also provides element occurrence
statistics.
Elements.java : This class includes all the elements needed in the novel HEA
prediction.
Tokenizer.java : The Tokenizer class can tokenize elements and their molar ratios from
randomly formatted formulas and regenerate uniform, normalized chemical formulas.
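The idea behind the Tokenizer can be sketched in Python (an illustrative stand-in for the Java implementation, not the project's code; the regex and normalization choices here are assumptions):

```python
# Python stand-in for what Tokenizer.java does: parse element symbols
# and molar ratios out of a formula string, then normalize the ratios
# so they sum to one. A symbol with no trailing number gets ratio 1.
import re

def tokenize(formula):
    """Return {element: normalized molar ratio} for a formula string."""
    # One capital letter, optional lowercase letter, optional number.
    tokens = re.findall(r"([A-Z][a-z]?)([0-9]*\.?[0-9]*)", formula)
    ratios = {el: float(r) if r else 1.0 for el, r in tokens if el}
    total = sum(ratios.values())
    return {el: r / total for el, r in ratios.items()}

equimolar = tokenize("AlCoCrFeNi")       # each element -> 0.2
weighted  = tokenize("Al0.5CoCrFeNi")    # Al -> 0.5/4.5, others 1/4.5
```

The real Tokenizer additionally handles duplicate elements and arbitrary formatting, and its cleaned output is what DataCleaner deduplicates and reorders.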
Main.java : The Main class applies the functionality of the above Java files and
completes the HEA dataset generation. The element compositions, molar ratios, and
element occurrence statistics are saved in three .txt files.
TrainningSetTransformer.java : This class transforms the element compositions and
molar ratios into a single csv file. There are 56 columns in this csv file, with
each column corresponding to the molar ratio of a specific element and each row to
a specific HEA formula.
JavaTest.java : This class integrates all the unit tests for the above Java files.
Python files:
calculator.py* : This class provides the empirical parameter calculation
functionality of the empirical property calculator. It imports the open-source
'matminer' library, which provides the atomic radii of the elements. This class
also includes some unit tests.
trainingset_generator.py : This script translates the element compositions and
molar ratios generated by the Java files into a single csv file containing all six
parameters, using calculator.py.
parameter_generator.py : This script is similar to trainingset_generator.py but
writes only the six parameters to a csv file.
GUI.py : This Python file provides a user interface for calculating the six
empirical parameters of any given chemical formula; it also supports large-scale
calculations.
random_data_generator.py : This script generates a large number of random chemical
formulas along with their six empirical parameters. The generated dataset is then
used to train the calculator neural network, which plays a crucial part in training
the subsequent GAN model.
calculator_net.py : This script uses the randomly generated chemical formulas to
train a calculator neural network written in PyTorch, which provides the same
functionality as the empirical property calculator itself. After training, the
model is saved as calculator_net.pt in the saved_models package.
parameters.py : This script holds all the tunable parameters of the GAN model, such
as the learning rates, clip value, and batch size. It functions as the control
panel when fine-tuning the model.
cardiGAN.py : This file is the core of the whole project. It contains all the
classes needed for GAN training, including a dataset loader and two neural network
classes: one generator and one discriminator (both written in PyTorch). It also
provides a training monitor that stops the training process once the generator is
optimized. The name cardiGAN stands for 'compositionally complex alloy research
directive inference GAN'. After training, the generator and discriminator are saved
in the saved_models package as generator.pt and discriminator.pt.
fake_alloy_generator.py : This script applies the saved generator model to produce
a dataset of generated HEAs along with their six empirical parameters.
analyzer_reporter.py : This script analyzes the generated high-entropy alloy
formulas and produces an analysis report, saved in the analysis_report package.
During the analysis, a ranked dataset is produced, which calculates the similarity
between each generated formula and the formulas in the training set and finds the
closest matching formulas.
analyzer_visualizer.py : This script loads the generated and real data into
TensorBoard to produce a visual presentation of both datasets.
model_analyzer.py : This script integrates the functionality of the fake alloy
generator, the analyzer reporter, and the analyzer visualizer.
All the above code files were implemented by the student except calculator.py which
was co-produced by the student and the supervisor.
Testing procedure description:
Since the code was written in both Java and Python, the testing procedure is
divided into two sections. The Java tests mainly cover the DataCleaner.java and
Tokenizer.java modules. The Python unit tests live inside calculator.py, since the
remaining Python files are either generative or discriminative machine learning
models, whose performance can only be assessed during training, or user interfaces
and file writers, which are hard to unit test.
The Java tests first exercise the Tokenizer's parse(), write_ele(), and
write_ratio() methods, which parse element formulas from arbitrarily formatted
input. The parsed results are then cleaned by the DataCleaner, which removes any
duplicates and reorders the formulas alphabetically. The tests used a randomly
generated txt file containing hundreds of random formulas; the DataCleaner and
Tokenizer parsed that file and created a formatted file containing only the 22
original formulas that were prepared. Both the Tokenizer and the DataCleaner
functioned correctly and passed the tests.
The Python unit tests check the calculator's ability to compute all six empirical
material parameters. They use an artificial formula to calculate all six parameters
and then verify the results against hand-calculated values obtained from the
thermodynamic equations. The calculator is both accurate and efficient.
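The kind of check performed can be sketched as follows. This is illustrative: the project's test exercises calculator.py, while here a minimal entropy-of-mixing function stands in for it:

```python
# Sketch of verifying a calculator output against a hand calculation.
# The entropy of mixing is dS_mix = -R * sum(c_i * ln c_i), with the
# gas constant R = 8.314 J/(K·mol).
import math

R = 8.314  # gas constant, J/(K·mol)

def entropy_of_mixing(ratios):
    """dS_mix for a list of molar ratios summing to one."""
    return -R * sum(c * math.log(c) for c in ratios if c > 0)

# Hand calculation: an equimolar quinary alloy gives R * ln(5),
# roughly 13.38 J/(K·mol), matching the ΔS_mix magnitudes in Chapter 4.
computed = entropy_of_mixing([0.2] * 5)
expected = R * math.log(5)
```

Comparing `computed` with `expected` to within a small tolerance is exactly the shape of assertion a calculator unit test makes.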
Description of experiment tools:
The experiments were carried out using two IDEs. The Java code was compiled and run
with IntelliJ IDEA Community Edition and the standard JDK 12.0. The Python code was
run with JetBrains PyCharm Community Edition using an Anaconda 3 interpreter.
The dataset used for model training was manually collected from hundreds of
published papers and engineered by the student.
The model was trained on the student's personal computer.
Appendix 4. README
2019 Semester 2 COMP8755 Individual Project
u6766505 Zhipeng Li
Supervisor: Nick Birbilis
Abstract
High-entropy alloys (HEAs) are alloys composed of five or more metallic
elements in nearly equal proportions. This novel class of materials has
potentially desirable properties such as better strength-to-weight ratios,
higher strength, and fracture resistance. This one-semester project relies on
the use of machine learning to assist in the development of A.I.-predicted
high-entropy alloy compositions.
Objectives
This project involves five main steps:
1. Construction of HEA dataset – The training set should include the majority
of the existing HEA compositions that can be found in published papers
related to HEAs.
2. Building an empirical material property calculator – The stabilization
mechanism of HEAs is related to six empirical material properties: entropy,
enthalpy, difference in atomic radii (delta), omega, valence electron
concentration (VEC), and average melting point (Tm). These six empirical
parameters are also treated as important features and used to predict
potentially novel HEAs. The empirical property calculator should be able to
perform both individual and large-scale calculations.
3. Construction of training set – The training set should consist of chemical
formulas along with their 6 empirical parameters calculated using the
empirical property calculator.
4. Construction of machine learning model – An appropriate machine
learning model needs to be constructed and optimized. The machine learning
task is to use the descriptive attributes in the training set to predict potential
novel High-entropy Alloys.
5. Model evaluation – In this step, some evaluation techniques should be
applied to evaluate the model’s predictive performance.
Installation
Before doing the following installation, please make sure you have Java JDK 10.0
(or a later version), IntelliJ IDEA (or any preferred Java IDE), Python 3, and
JetBrains PyCharm. Instructions for installing this software can be found on
the official websites.
1. git clone this repository.
2. Create an environment for running, e.g.:
$ conda create -n cardiGAN python=3.7
3. Activate the environment and install the required packages:
$ source activate cardiGAN
$ conda (or pip) install pytorch torchvision -c pytorch -y
$ conda (or pip) install numpy pandas -y
$ conda (or pip) install pymatgen -c matsci -y
$ pip install matminer
Instruction
• This project is divided into two sections.
o The Java code implementations help with data engineering and
dataset construction, which are the very first steps of this project.
o The Python section contains the rest of the project.
1. HEA dataset construction:
o Open the Java sub-project 'HEAParser' using any Java IDE.
o Run Main.java program, this program integrated the functionalities
of DataCleaner.java and Tokenizer.java, which will create
three .txt files inside 'parseResult' package, each contains the element
compositions, elements' molar ratios and formatted formulas of the
HEA dataset.
o Run TrainingSetTransformer.java, this will create a .csv file inside
package 'parseResult' in the form of 'n * 56', where n is the number of
chemical formulas in the training set. There are 56 columns in this csv
files, with each column corresponding to the molar ratio of a specific
element and each row a specific HEA formula. This csv file, along with
the training HEAs' empirical parameters will be used to train the GAN
model.
2. Training set construction:
o Open the Python project '2019Project' in JetBrains PyCharm or any
preferred Python IDE. Set the Python interpreter to Anaconda3's
default interpreter.
o Before contructing the Training set, find and copy
the GAN_training_set.csv inside 'HEAParser/parseResult', then paste
this csv file into package 'main/training_set'. Do not change the name
of this file.
o Find trainingset_generator.py and parameters_generator.py inside
package utility and run them. This will
create HEA_params.csv and train_params.csv inside package
training_set, which will be used during the model training and model
evaluation sections.
3. Training the calculator neural network model (Optional)
o Before training the GAN, a calculator neural network has to be trained
and saved (An accurate model is already saved inside package
'main/saved_models'). This pre-trained neural network will be used to
calculate the six empirical parameters of the generated fake formulas.
Which is more efficient than just using the calculator.py script we built,
and it also enables the parameter-related loss to be passed back
through this pre-trained network to the generator.
o Inside package 'main/utility', change attribute 'num_sample'
inside random_data_generator.py to the amount of training data you
need, then run this script, which will create random_result.csv inside
44
package 'main/generated_HEAs'. This file is deleted since the dataset is
big and no longer used in the following steps.
o Run calculator_net.py and train the calculator model. The trained
model will be saved inside package 'main/saved_models' as
calculator_net2.pt.
4. Training the GAN model
o The GAN model will be trained using all the datasets, scripts and
model built in the above steps. The training process of this GAN
model could be time-consuming if the stopping criterion was set high.
All the model's parameters are contained inside parameters.py. The
parameters are already tuned, any modification to these parameters
could give hard time in training or lead to model not converging.
o Run cardiGAN.py, once the generator has met the stopping criterion,
the trained generator model will be saved as generator_net.pt inside
package 'main/saved_models'. Since the loss function of discriminator
is Wasserstein loss, saving the discriminator is not very helpful for this
project.
5. Model Analysis
o Run model_analyzer.py. This script will automatically finish the
dataset generation, generated dataset analysis and TensorBoard
dataset visualization jobs. Then, inside package 'analysis_report',
created are two files. analysis_report.txt saved the analysis result,
and generated_novel_ranking.csv has all the generated formulas ranked
and paired with their closest formulas inside the training set. A
matching score is calculated for each generated formula. The higher
the score, the closer is the formula to the ones in the training set.
o (Optional) Find and Follow an installation video on YouTube, then
install TensorBoard to your computer. Run TensorBoard and see how
the generated data fit to the training data. This step is mainly about
dimensionality reduction and dataset visualization, which is optional,
since I already saved a set of distribution pictures inside
'main/visualization' package.