Computational Design and Discovery
of High-entropy Alloys
Zhipeng Li
U6766505
A report submitted for the course
COMP8755
Supervised by: Nick Birbilis
The Australian National University
October 2019
Except where otherwise indicated, this report is my own original work.
Zhipeng Li
24 October 2019
Acknowledgement
I would like to thank my supervisor, Professor Nick Birbilis, for providing me with this excellent research opportunity. This project deepened my understanding of using machine learning to solve real-world problems, and it helped me develop research interests and skills that I believe will have a lasting influence on my academic development and career path.
This project could not have been completed without the instruction and help of Professor Nick Birbilis and my external advisor, Will Nash, who were supportive throughout the whole semester. I offer my sincere appreciation for the precious learning experience they provided.
Abstract
High-entropy alloys (HEAs) are alloys composed of five or more metallic elements in nearly equal proportions [2]. This novel class of materials has potentially desirable properties such as better strength-to-weight ratios, higher strength, and fracture resistance [1]. Traditional high-entropy alloy experiments are usually costly and time-consuming, partly due to the inefficiency of the early discovery process, which involves running experiments on a large number of predicted alloy compositions. Hence, it is natural to apply machine learning techniques to the design and discovery of this novel class of alloys.
Generative adversarial networks (GANs) have been studied and applied in various domains in the past few years. In this project, a generative model called cardiGAN, designed on the basis of the Wasserstein GAN, was developed; the name cardiGAN stands for compositionally complex alloy research directive inference GAN. Our results suggest that the proposed cardiGAN model can estimate the underlying probability density function of existing high-entropy alloys in the element and thermodynamic space. We hope this model provides insight to researchers in this field and helps accelerate the discovery of novel high-entropy alloys.
CONTENTS
Acknowledgement
Abstract
1 Introduction
  1.1 Motivations
  1.2 Objective
  1.3 Project Scope
  1.4 Contribution
  1.5 Report Outline
2 Background and Related Work
  2.1 High-entropy Alloys
    2.1.1 Empirical Parameters of High-entropy Alloys
  2.2 Generative Adversarial Networks
  2.3 Wasserstein GAN
    2.3.1 Wasserstein Distance
    2.3.2 Advantages of Wasserstein GAN
3 Methodology
  3.1 Construction of Training Set
    3.1.1 Data Collection and Data Engineering
    3.1.2 Empirical Parameter Calculator
    3.1.3 Construction of Training Set
  3.2 Model Configuration
    3.2.1 Network Overview
    3.2.2 Empirical Parameter Selection
    3.2.3 Calculator Neural Network
    3.2.4 Configuration of cardiGAN
    3.2.5 Training of cardiGAN
    3.2.6 Stop Criteria
4 Evaluation and Analysis
  4.1 Model Evaluation
    4.1.1 Matching Score
    4.1.2 Visualization
  4.2 Result Analysis
Conclusion
Reference
Chapter 1. Introduction
1.1 Motivations
High-entropy alloys (HEAs) are alloys synthesized by mixing five or more metallic elements in nearly equal proportions [3]. This novel class of materials is currently the focus of significant attention because of its desirable properties, such as better strength-to-weight ratios, higher strength, and fracture resistance [1].
Due to the inefficiency of the early discovery process, which usually involves running experiments on a large number of predicted alloy compositions, conventional high-entropy alloy development can be costly in both time and money. Generative adversarial networks (GANs) have been studied and applied in various domains in the past few years, such as image generation and natural language processing, to name just a couple. GANs may handle more complex density functions than other generative models, e.g. variational autoencoders [4]. By applying GANs to high-entropy alloy development (i.e. computational design and discovery), we aim to build a model that may help engineers and scientists reduce the cost of the new-material discovery process and shorten the development cycle.
1.2 Objective
The objective of this project is to build a generative adversarial network model to
estimate the probability distribution of high-entropy alloys in the element and
thermodynamic space.
1.3 Project Scope
This project can be divided into five tasks:
1. Construction of HEA dataset – In this task, around 1500 high-entropy alloy
formulas were manually collected from hundreds of published papers. After parsing
and data engineering, 724 formatted, non-repetitive high-entropy alloy formulas were
added to the HEA dataset.
2. Development of empirical parameter calculator – The stability of a high-entropy alloy's solid-solution state relies on six empirical material parameters: entropy of mixing ΔSmix, enthalpy of mixing ΔHmix, difference in atomic radii δ, a unitless parameter Ω, valence electron concentration (VEC), and average melting point (Tm) [2]. In this project, these six empirical parameters were treated as important features for
the proposed cardiGAN model. An empirical parameter calculator was developed to
calculate these six empirical parameters.
3. Construction of training set – The training set of the cardiGAN model was
constructed from the 724 chemical formulas in the HEA dataset and their six empirical
parameters.
4. Development of cardiGAN – A generative model named cardiGAN, which is
designed based on Wasserstein GAN, was developed. The model’s configuration and
training procedure are explained in Chapter 3.
5. Model evaluation – The proposed cardiGAN model was evaluated in both
machine learning and material aspects.
1.4 Contribution
The main contribution of this project is a study of applying GANs to the computational design and development of high-entropy alloys. In this project, a generative adversarial model that can estimate the underlying probability distribution of high-entropy alloys, namely cardiGAN, was developed. The model can be used to improve discovery efficiency and provide insights into the development of novel high-entropy alloys.
1.5 Report Outline
Chapter 1 provides an introduction to the project. Chapter 2 introduces background knowledge related to this work. Chapter 3 describes the model's
architecture and the basic procedures for applying GANs in novel high-entropy alloy
discovery. Finally, in Chapter 4, we evaluate the model in both machine learning and
material aspects, discuss problems encountered in the current research status, and
propose possible solutions.
Chapter 2. Background and Related Work
2.1 High-entropy Alloys
High-entropy alloys (HEAs) are alloys composed of five or more metallic elements in nearly equal proportions [2]. Before the invention of this novel class of materials, traditional alloys were composed of one or two principal elements with minor proportions of other elements [2]. Because of their special crystal structures, some high-entropy alloys have exceptional mechanical properties in comparison to conventional alloys, especially at high temperatures, such as higher strength and better ductility [3].
2.1.1 Empirical Parameters of High-entropy Alloys
Six empirical parameters have been considered the main factors governing a high-entropy alloy's stable solid-solution phase, namely entropy of mixing ΔSmix, enthalpy of mixing ΔHmix, difference in atomic radii δ, valence electron concentration VEC, average melting point Tm, and a unitless parameter Ω [5]. The design guidelines for high-entropy alloys are shown in [3, Table 1].
The unitless parameter Ω is calculated from Tm, ΔSmix, and ΔHmix [6]:

Ω = Tm·ΔSmix / |ΔHmix|
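For concreteness, Ω can be computed directly from these three quantities. The sketch below uses our own function and variable names and illustrative input values; consistent units for Tm, ΔSmix, and ΔHmix are assumed:

```python
def omega(t_m, ds_mix, dh_mix):
    """Unitless parameter: Omega = Tm * dS_mix / |dH_mix| (consistent units assumed)."""
    return t_m * ds_mix / abs(dh_mix)

# Illustrative (not measured) values:
print(omega(2000.0, 12.0, -8000.0))  # 3.0
```

Note that the formula is undefined when ΔHmix = 0, a point that matters for the model design discussed in Chapter 3.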
2.2 Generative Adversarial Networks (GANs)
Generative adversarial networks are generative models invented by Ian Goodfellow and researchers at the University of Montreal, originally proposed for unsupervised learning [7].
Table 1. Empirical parameters and design guidelines for forming solid-solution HEAs. Note that although most existing high-entropy alloys follow these design guidelines, there also exist synthesizable high-entropy alloys whose empirical parameters fall outside these intervals.
A generative adversarial network comprises two sub-networks: a generator and a discriminator [7]. Given a training set, the generator is trained to map from a latent space to the data distribution of the training set, and the discriminator's training objective is to distinguish the fake data produced by the generator from real data in the training set [7][8]. These two neural networks are trained together and compete against each other until the generator produces data realistic enough that even an optimized discriminator cannot distinguish it, which is why the model is called "adversarial".
As shown in Figure 1, when training GANs, the generator is fed with random noise
input which is sampled from a latent space (e.g. a multivariate Gaussian distribution).
The fake data produced by the generator, along with real data from the training set,
are the inputs of the discriminator. The discriminator evaluates each input by producing a probability that it is real. The loss from this evaluation, typically a binary cross-entropy loss, is back-propagated to both networks so that the generator can learn the data distribution of the training set [9].
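The training step described above can be sketched in PyTorch as follows. This is an illustrative minimal example, not the code used in this project; the layer sizes, optimizer, and batch size are arbitrary choices:

```python
import torch
import torch.nn as nn

g = nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, 56))          # generator
d = nn.Sequential(nn.Linear(56, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))   # discriminator
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(g.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(d.parameters(), lr=1e-4)

real = torch.rand(32, 56)   # stand-in for one batch of real training data
z = torch.randn(32, 12)     # latent noise from a multivariate Gaussian
fake = g(z)

# Discriminator step: push real samples toward label 1, fake samples toward label 0
loss_d = bce(d(real), torch.ones(32, 1)) + bce(d(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to fool the discriminator (fake samples toward label 1)
loss_g = bce(d(fake), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```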
2.3 Wasserstein GAN
Wasserstein GAN (WGAN) is a variant of the standard GAN proposed by Martin Arjovsky and his colleagues [10]. This type of GAN addresses the main training problems of GANs, such as mode collapse, imbalance between the generator and discriminator, and sensitivity to hyperparameter selection [10].
Figure 1: The architecture of generative adversarial networks.
2.3.1 Wasserstein Distance
Wasserstein GAN uses the Wasserstein distance to define its loss function. The Wasserstein distance is the greatest lower bound (infimum) of the transport cost over all transport plans from the real data distribution Pr to the generated data distribution Pg [11].
The Wasserstein distance is defined as:

W(Pr, Pg) = inf_{γ ∈ Π(Pr, Pg)} E_{(x,y)~γ}[|x − y|]
where Pr and Pg denote the density functions of the real and generated data, and Π(Pr, Pg) is the set of all joint distributions γ(x, y) whose marginals are respectively Pr and Pg [10]. Since the above equation is highly intractable, the Wasserstein distance can be simplified using the Kantorovich-Rubinstein duality:

W(Pr, Pg) = sup_{‖f‖_L ≤ 1} E_{x~Pr}[f(x)] − E_{x~Pg}[f(x)]
where sup is the least upper bound (supremum) and f is a 1-Lipschitz function satisfying the constraint [11]:

|f(x1) − f(x2)| ≤ |x1 − x2|
In practice, the discriminator is trained to approximate the Lipschitz function, which is then used to calculate the Wasserstein distance between the distributions of the generated data and the training set [10][11]. Unlike in standard GANs, the discriminator no longer outputs the probability of a sample being real; it produces a scalar score D(x) which can be interpreted as the "realness" of the input data [11].
To maintain the Lipschitz constraint on f, the inventors of WGAN proposed a simple but effective method [10]. By clipping the weights of the discriminator to a small range [−c, c], f always has an upper and a lower bound, which allows the discriminator to maintain the Lipschitz constraint [10][12].
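In PyTorch, this clipping amounts to clamping every parameter after each discriminator update. The sketch below is illustrative; the `critic` network and the threshold `c` are placeholders rather than this project's actual configuration:

```python
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(56, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
c = 0.01  # clipping threshold suggested in the original WGAN paper

# After each critic update, clamp every weight into [-c, c]
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-c, c)

print(all(bool(p.abs().max() <= c) for p in critic.parameters()))  # True
```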
2.3.2 Advantages of Wasserstein GAN
The most significant advantage of Wasserstein GAN is the smooth gradient of its loss function. When training standard GANs, as shown in [10, Figure 2], if the generator is underperforming, the loss function saturates and has a vanishing gradient. This creates an unresolvable imbalance between the generator and the discriminator: the discriminator can easily distinguish data produced by the generator from data in the training set, while the generator learns almost nothing from the saturated loss and stays the same. The only way to avoid this situation is to keep the generator optimized and maintain the equilibrium at all times, which is practically hard to achieve since the discriminator usually gets optimized faster than the generator.
With Wasserstein GAN, when the generator underperforms, the imbalance between the generator and the discriminator can be resolved. This is because Wasserstein GAN uses the Wasserstein distance as its loss function [10]. By maintaining the optimality of the discriminator, the Wasserstein distance always has a smooth gradient, however badly the generator performs [10].
Another significant advantage of Wasserstein GAN is that it has a meaningful loss function which correlates with the generator's performance and the quality of the generated data [10][11]. When training standard GANs, the output of the discriminator is the probability of a sample being real, and the loss is calculated using binary cross-entropy; this loss gives no indication of the quality of the generated data. With Wasserstein GAN, however, the loss function measures how far the generated probability distribution differs from the probability distribution of the training set [11], so the quality of the generated samples can be monitored during training simply by watching the loss.
Figure 2. Optimal discriminator and critic when learning to
differentiate two Gaussians.
Chapter 3. Methodology
This chapter covers the principal components of this project and is divided into two sections: construction of the training set and model configuration. The first section describes the procedure for constructing the training set, which involves data collection and data engineering. The model configuration section explains the architecture and training procedure of the model in chronological order. In this project, we use cardiGAN as the name of our model, which stands for compositionally complex alloy research directive inference GAN.
3.1 Construction of Training Set
3.1.1 Data Collection and Data Engineering
High-entropy alloys are a novel class of alloys that has only been under development for around fifteen years. Currently, there is no available dataset containing all existing high-entropy alloys. The high-entropy alloy formulas in the training set were manually collected from hundreds of published papers. Because the same chemical formula can occur in various forms in different papers, these raw data had to be parsed and formatted.
To generate a unified dataset, a formula parser was developed, which could parse
element compositions from the collected formulas and remove duplicated data. For
example, MnCoFeNiCr and Fe20Mn20Ni20Cr20Co20 are two high-entropy alloy
formulas in the original dataset, and they both indicate the same material. By using
the formula parser, these two formulas are merged into one single formatted chemical
formula, where the elements are arranged in alphabetical order:
Co0.2Cr0.2Fe0.2Mn0.2Ni0.2
After parsing and data engineering, 724 formatted, non-repetitive high-entropy alloys
were added to our HEA dataset, which includes most of the existing high-entropy
alloy formulas developed from 2004 until now. This dataset will be used to create the
training set in the next section.
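A minimal version of such a formula parser can be sketched with a regular expression. This is an illustrative reimplementation, not the project's actual parser:

```python
import re

def parse_formula(formula):
    """Parse e.g. 'Fe20Mn20Ni20Cr20Co20' or 'MnCoFeNiCr' into normalized
    molar ratios, with elements ordered alphabetically."""
    tokens = re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula)
    amounts = {el: float(n) if n else 1.0 for el, n in tokens}  # bare symbol -> 1
    total = sum(amounts.values())
    return {el: amounts[el] / total for el in sorted(amounts)}

print(parse_formula("Fe20Mn20Ni20Cr20Co20"))
# {'Co': 0.2, 'Cr': 0.2, 'Fe': 0.2, 'Mn': 0.2, 'Ni': 0.2}
print(parse_formula("MnCoFeNiCr") == parse_formula("Fe20Mn20Ni20Cr20Co20"))  # True
```

Normalizing the amounts and sorting the element symbols is what lets the two spellings of the same alloy collapse to a single entry, so duplicates can be removed by simple equality.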
3.1.2 Empirical Parameter Calculator
As discussed in Chapter 2, six empirical parameters are relevant to the cardiGAN model, namely entropy of mixing ΔSmix, enthalpy of mixing ΔHmix, difference in atomic radii δ, valence electron concentration VEC, average melting point Tm, and a unitless parameter Ω.
In this section, an empirical parameter calculator, which can calculate the six empirical parameters for any given high-entropy alloy formula, was developed. The empirical parameter calculator integrates the elements' attributes and properties, such as atomic radius, number of valence electrons, and melting point, and applies the corresponding chemical and thermodynamic equations to perform the calculation.
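The underlying calculations can be sketched as follows. The element property values below are illustrative placeholders rather than curated reference data, and the enthalpy of mixing is omitted because it requires tabulated binary mixing enthalpies:

```python
import math

R = 8.314  # gas constant, J/(mol*K)

# Illustrative element properties: (atomic radius in pm, VEC, melting point in K)
PROPS = {"Co": (125, 9, 1768), "Cr": (128, 6, 2180), "Fe": (126, 8, 1811),
         "Mn": (127, 7, 1519), "Ni": (124, 10, 1728)}

def empirical_params(c):
    """c: dict mapping element -> molar ratio (ratios sum to 1)."""
    ds_mix = -R * sum(ci * math.log(ci) for ci in c.values())        # entropy of mixing
    vec = sum(ci * PROPS[el][1] for el, ci in c.items())             # valence electron conc.
    t_m = sum(ci * PROPS[el][2] for el, ci in c.items())             # average melting point
    r_bar = sum(ci * PROPS[el][0] for el, ci in c.items())           # mean atomic radius
    delta = math.sqrt(sum(ci * (1 - PROPS[el][0] / r_bar) ** 2       # atomic-size difference
                          for el, ci in c.items()))
    return ds_mix, vec, t_m, delta

alloy = {el: 0.2 for el in PROPS}  # equiatomic CoCrFeMnNi
ds, vec, t_m, delta = empirical_params(alloy)
print(round(ds, 2))  # R*ln(5), approximately 13.38
```

For an equiatomic five-element alloy, ΔSmix reduces to R·ln(5), which is a useful sanity check on any implementation.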
This empirical parameter calculator plays an important role in the training set
generation and model analysis procedures. A user interface of this software was
developed, so it can be used by people who are interested in this area.
3.1.3 Construction of Training Set
The training set contains 724 high-entropy alloy formulas along with their empirical
parameters. The information of each chemical formula is represented as a list of molar
ratios of the 56 selected elements which are shown in Table 2.
The training set has 724 rows with each row representing the information of a specific
high-entropy alloy formula. The first 56 columns are the molar ratios, and the last 6
columns are the six empirical parameters. The molar ratios and empirical parameters
comprise the feature space of the cardiGAN model. Because most high-entropy alloys
are composed of 3 to 10 elements, this representation requires the generator of the
cardiGAN model to produce a sparse, nonnegative output.
Figure 3. User Interface of Empirical Parameter Calculator
developed in this project
Ag Al Au B Be Bi C Ca
Cd Ce Co Cr Cu Dy Er Fe
Gd Ge Hf Ho In Ir La Li
Lu Mg Mn Mo Nb Nd Ni Os
P Pb Pd Pr Pt Re Rh Ru
Sb Sc Si Sm Sn Sr Ta Tb
Ti Tm V W Y Yb Zn Zr
Figure 4. The pipeline of the cardiGAN model. There are three components: a
generator network for generating fake chemical formulas, a calculator network for
calculating empirical parameters and a discriminator network for detecting whether a
given sample is real or generated.
Table 2. Elements explored to predict novel high-entropy alloys. It is noted that numerous metals were excluded on the basis of impracticality (such as those that are radioactive or highly reactive).
3.2 Model Configuration
3.2.1 Network Overview
As shown in Figure 4, the cardiGAN model has three sub-networks: a generator, a
discriminator and a pre-trained calculator neural network. The most significant
difference between the cardiGAN model and Wasserstein GAN is that cardiGAN has
a pretrained calculator neural network. This neural network is used to simulate the
functionality of the empirical parameter calculator. The generator network is trained
to produce the element compositions of fake HEA formulas. The calculator network takes the generated element compositions as inputs and estimates the empirical parameters of the corresponding high-entropy alloy formulas. The outputs of these two networks are then concatenated to form the fake inputs of the discriminator.
The cardiGAN model is trained with the molar ratios and empirical parameters of the
724 high-entropy alloy formulas in the training set. The input of the discriminator has the form:

x = [m1, m2, …, m56, σ, ΔSmix, ΔHmix, δ, VEC, Tm]^T

where m1, m2, …, m56 are the molar ratios of the 56 selected elements and σ is the sum of the molar ratios, σ = m1 + m2 + ⋯ + m56. The last five dimensions of x are the empirical parameters.
The real HEA formulas’ empirical parameters are calculated using the empirical
parameter calculator before training and can be accessed from the training set, while
the generated formulas’ empirical parameters are calculated by the pre-trained
calculator network. This calculator neural network maps the generated element
compositions to their five empirical parameters.
3.2.2 Empirical Parameter Selection
There are six empirical parameters in traditional high-entropy alloy design and
development, but only five of them, namely entropy of mixing ∆𝑆𝑚𝑖𝑥 , enthalpy of
mixing ∆𝐻𝑚𝑖𝑥 , difference in atomic radii 𝛿, valence electron concentration 𝑉𝐸𝐶 and
average melting point Tm, are used to train the cardiGAN model. The reasons for not including the unitless parameter Ω are as follows:
1. The sixth empirical parameter Ω is calculated from three of the other empirical parameters, so the information it carries is redundant. Including Ω would increase the model's training difficulty, especially for the calculator network. To estimate Ω, the calculator network would not only have to learn to map the element compositions to this extra parameter, but would also need to capture the nonlinear relationship between Ω, Tm, ΔSmix, and ΔHmix, which contains an inverse absolute-value function that is discontinuous at ΔHmix = 0:

Ω = Tm·ΔSmix / |ΔHmix|
2. Ω has no direct connection to the element compositions. All of the other five empirical parameters are calculated directly from the molar ratios and chemical properties of the constituent elements. By including only these five parameters, the calculator neural network can be trained more easily and with higher accuracy.
3. The distribution of Ω in the training set is sparse. The training set contains several extremely large Ω values, which still follow the HEA design guidelines, whereas most Ω values in the training set are less than 10. In this case, forcing the generator to learn this sparse distribution of Ω would lead to overfitting.
3.2.3 Calculator Neural Network
Before training the generator and discriminator networks, a calculator neural network
is constructed and trained. The reasons for having a calculator neural network in our
model are as follows:
1. The empirical parameters are important features, and it is hard to train the
generator to produce element compositions and empirical parameters at the same
time.
Figure 5. Configuration of the calculator neural network
2. Using the empirical parameter calculator script to do the calculation is inefficient and would block the back-propagation path of the empirical-parameter-related loss.
3. Since the calculations of the empirical parameters involve combinations over chemical properties, such as atomic radii and pairwise (binary) mixing enthalpies, it is almost impossible to hardcode the empirical parameter calculator inside a neural network.
4. The five selected empirical parameters are directly associated with the element compositions, so they can be accurately estimated by a neural network.
The configuration of the calculator network is shown in Figure 5. The calculator
network is trained with a large dataset that contains molar ratios and empirical
parameters of 100,000 randomly generated formulas. During training, 30,000 formulas
and the 724 real high-entropy alloy formulas are used as the test set.
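Such a calculator network and one supervised training step can be sketched as follows; the hidden-layer sizes are our own assumption (Figure 5 shows the actual configuration), and the training pairs below are random stand-ins for the generated-formula dataset:

```python
import torch
import torch.nn as nn

# Maps 56 normalized molar ratios to the 5 selected empirical parameters
calculator = nn.Sequential(
    nn.Linear(56, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 5),
)
opt = torch.optim.Adam(calculator.parameters(), lr=1e-3)
mse = nn.MSELoss()

# Stand-ins for one batch of (composition, empirical-parameter) pairs
x = torch.rand(64, 56)
x = x / x.sum(dim=1, keepdim=True)   # normalize so each formula's ratios sum to 1
y = torch.rand(64, 5)

loss = mse(calculator(x), y)         # standard supervised regression step
opt.zero_grad(); loss.backward(); opt.step()
print(calculator(x).shape)           # torch.Size([64, 5])
```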
Figure 6. Training and test loss of the calculator neural network.
3.2.4 Configuration of cardiGAN
Figure 7. The network configuration of the proposed cardiGAN model. Note that the
empirical parameter calculator network is already trained before cardiGAN’s training,
so this network only functions as a calculating tool and is not being trained.
The cardiGAN model has three sub-networks: a generator, a discriminator, and a pre-
trained calculator neural network.
As shown in Figure 7, the generator is composed of two fully connected layers. The input of the generator is 12-dimensional standard normal noise z ~ N(0, 1), and the output G(z) is a 56-dimensional vector, with each dimension representing the molar ratio of a specific element.
Since the molar ratios of elements cannot be negative, and most existing high-entropy
alloys are composed of 3 to 10 elements, the output of the generator should be a non-
negative, sparse vector. To produce such output, we use the Rectified Linear Unit
(ReLU) activation function as the output function of the generator, which will set
negative values to zero.
𝑅𝑒𝐿𝑈(𝑣) = max(0, 𝑣)
The ReLU activation function is shown in Figure 8.
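A sketch of this generator in PyTorch (the hidden-layer size is our assumption; the 12-dimensional noise input and 56-dimensional ReLU output follow the description above):

```python
import torch
import torch.nn as nn

generator = nn.Sequential(
    nn.Linear(12, 64), nn.ReLU(),   # 12-dim standard-normal noise input
    nn.Linear(64, 56), nn.ReLU(),   # ReLU output: non-negative, sparse molar ratios
)

z = torch.randn(8, 12)              # z ~ N(0, 1)
m = generator(z)
print(m.shape)                      # torch.Size([8, 56])
print(bool((m >= 0).all()))         # True: ReLU guarantees non-negative ratios
```

Because ReLU zeroes out negative pre-activations, many output dimensions are exactly zero, matching the fact that most HEAs contain only 3 to 10 of the 56 candidate elements.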
The input of the discriminator includes the molar ratios of generated formulas and real
formulas along with their five empirical parameters. The real formulas’ molar ratios
and empirical parameters can be retrieved from the training set, while the generated
formulas’ empirical parameters are calculated by the pre-trained calculator network.
The discriminator is composed of two fully connected hidden layers and one output layer; each hidden layer's output is activated by the LeakyReLU activation function. The output layer of the discriminator produces a scalar value D(x) which represents the "realness" of the input data. Note that the output layer has no sigmoid function; the scalar value is used directly to calculate the Wasserstein loss.
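The discriminator (critic) can be sketched similarly; its input is 62-dimensional (56 molar ratios, σ, and the 5 empirical parameters), while the hidden-layer sizes are our assumption:

```python
import torch
import torch.nn as nn

# Input: 56 molar ratios + sigma (sum of ratios) + 5 empirical parameters = 62 dims
critic = nn.Sequential(
    nn.Linear(62, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1),              # scalar "realness" score; no sigmoid at the output
)

x = torch.rand(8, 62)
score = critic(x)
print(score.shape)                  # torch.Size([8, 1])
```

Leaving the output unbounded (no sigmoid) is what allows the score to act as the 1-Lipschitz function in the Wasserstein loss rather than as a probability.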
3.2.5 Training of cardiGAN
Due to the sparsity of the molar-ratio distribution (most molar ratios of an HEA formula are 0, and some elements do not occur in the training set at all), the molar ratios are not normalized. Since the molar ratios lie in the interval [0, 1], the empirical parameters were normalized to be around the interval [0, 1], which made it easier for the neural network to converge.
During training, the discriminator is trained five times as much as the generator. This
is because the discriminator has to be optimized to calculate the Wasserstein distance,
which in turn helps train the generator better. Our experiments also showed that this emphasis on training the discriminator helps reduce mode collapse and enhances stability.
The output of the generator is a 56-dimensional vector containing the molar ratios of
the generated formula, 𝑚 = [𝑚1, 𝑚2, … . . , 𝑚56]𝑇 . This output is fed into the pre-trained
calculator network to estimate the 5 empirical parameters of the associated chemical
formula. The normalized molar ratios m_norm and the calculated empirical parameters are then concatenated, along with σg (the sum of the generated molar ratios), to produce the fake input of the discriminator:

x_g = [(m_norm)^T, σg, ΔSmix, ΔHmix, δ, VEC, Tm]^T

where m_norm = m / Σi mi is the vector of normalized molar ratios of the generated formula.

Figure 8. Rectified Linear Unit (ReLU) activation function

Note that the input of the calculator network is also m_norm; this is because calculating empirical parameters for a chemical formula whose molar ratios sum to more or less than 1 is meaningless. The empirical parameters are only defined for valid chemical formulas.
The σ value is the sum of the molar ratios of a chemical formula. For formulas in the training set, this value always equals 1. Without this feature, although the generator could still learn the element distribution of formulas in the training set, the generated molar ratios would have no restriction on their sum. Some generated formulas would then have a very small sum of molar ratios, which is harmful to training: the generated molar ratios could shrink until their sum becomes 0, and a zero sum cannot be used as a divisor when normalizing the molar ratios, which would halt the training process. The σ value also helps the generator produce more realistic element compositions, since the sum of the molar ratios of a real chemical formula should always equal one.
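Assembling the fake discriminator input from the generator's raw output can be sketched as follows. The function and tensor names are ours, and the small epsilon is one way to guard against the zero-sum case just discussed:

```python
import torch

def make_fake_input(m, params_fn, eps=1e-8):
    """m: (batch, 56) raw generator output; params_fn: calculator network
    mapping normalized molar ratios to the 5 empirical parameters."""
    sigma = m.sum(dim=1, keepdim=True)                 # sum of generated molar ratios
    m_norm = m / (sigma + eps)                         # normalize; eps guards zero sums
    params = params_fn(m_norm)                         # estimated empirical parameters
    return torch.cat([m_norm, sigma, params], dim=1)   # (batch, 56 + 1 + 5) = (batch, 62)

m = torch.rand(4, 56)
x_g = make_fake_input(m, lambda v: torch.zeros(v.shape[0], 5))  # dummy calculator
print(x_g.shape)  # torch.Size([4, 62])
```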
When training cardiGAN, the generator is trained to maximize the loss function

Lg(D, gθ) = E_{x_g~Pg}[D(x_g)]

where Pg denotes the distribution of the generated data.
Figure 9. The pipeline of back-propagation in generator training process
As shown in Figure 9, when back-propagating for the generator, the loss coming from the molar ratios is back-propagated directly from the discriminator to the output layer of the generator, while the part of the loss relating to the empirical parameters is passed to the generator through the calculator neural network. Note that, since the calculator network is already trained, the training process of cardiGAN does not update the calculator.
The discriminator is trained in a similar way, except that its loss function is the Wasserstein distance between the probability distributions of the training data and the generated data:

Ld(D, gθ) = E_{x_g~Pg}[D(x_g)] − E_{x_r~Pr}[D(x_r)]

where Pr and Pg denote the distributions of the real and generated data.
Since we used the Wasserstein distance as the loss function, both training processes applied the RMSprop optimization algorithm, as recommended by the authors of the original WGAN paper [10].
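Putting the pieces together, one cardiGAN training iteration can be sketched as follows. The network definitions, hidden sizes, learning rate, and batch size here are illustrative assumptions; only the two losses, the five critic updates per generator update, the weight clipping, and the RMSprop choice follow the description above:

```python
import torch
import torch.nn as nn

gen = nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, 56), nn.ReLU())
critic = nn.Sequential(nn.Linear(62, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
calc = nn.Sequential(nn.Linear(56, 64), nn.ReLU(), nn.Linear(64, 5))  # stands in for the
for p in calc.parameters():                                           # pre-trained, frozen
    p.requires_grad_(False)                                           # calculator network

opt_g = torch.optim.RMSprop(gen.parameters(), lr=5e-5)
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

def fake_batch(n):
    m = gen(torch.randn(n, 12))                   # raw generated molar ratios
    sigma = m.sum(dim=1, keepdim=True)
    m_norm = m / (sigma + 1e-8)                   # eps guards against a zero sum
    return torch.cat([m_norm, sigma, calc(m_norm)], dim=1)  # (n, 62) fake input

x_r = torch.rand(32, 62)                          # stand-in for one real training batch

for _ in range(5):                                # 5 critic updates per generator update
    loss_c = critic(fake_batch(32)).mean() - critic(x_r).mean()   # Ld
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    with torch.no_grad():                         # weight clipping keeps f Lipschitz
        for p in critic.parameters():
            p.clamp_(-0.01, 0.01)

loss_g = -critic(fake_batch(32)).mean()           # minimize -Lg, i.e. maximize E[D(x_g)]
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Because the calculator's parameters are frozen, gradients of the empirical-parameter portion of the loss flow *through* it back to the generator without ever updating the calculator itself.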
3.2.6 Stop Criteria
In the training process of the cardiGAN model, three parameters are used as stopping criteria: the Wasserstein distance (discriminator loss) Ld(D, gθ), the average matching score s_mean, and the number of finely regenerated formulas. During training, we monitor the model's performance by generating 10,000 samples after each epoch and calculating these parameters. Once the three parameters reach their optima, we stop training and save the model.
Figure 10. The Wasserstein loss of cardiGAN during training
As shown in Figure 10, the Wasserstein distance between the probability distributions of the generated data and the training data decreases as the model trains. The reason for using the Wasserstein distance as a stopping criterion is that it correlates with the quality of the generated data, as explained in Chapter 2.
The other two criteria are derived from an evaluation parameter called the matching
score. The average matching score measures the similarity between the distributions
of the generated dataset and the training set, while the number of finely
regenerated formulas counts the real formulas that the model can reproduce with a
high matching score. The matching score, which is designed to quantify the
similarity between generated formulas and real high-entropy alloy formulas, is
defined in Chapter 4 along with its calculation procedure.
Chapter 4. Evaluation and Analysis
This chapter describes the evaluation methods and results for the cardiGAN model.
The model is evaluated from both a machine learning and a materials perspective. On
the machine learning side, an evaluation parameter called the matching score is
introduced, which is used to assess the 'mode diversity' of the model. On the
materials side, the model is evaluated by comparing the distributions of the
empirical parameters of the generated formulas against those of the real
high-entropy alloy formulas.
4.1 Model Evaluation
Due to the small size and sparse distribution of the training set, evaluating the
model is difficult. Because the training data are unlabeled, popular evaluation
metrics such as the Inception Score (IS) cannot be applied. The sparsity of the
training set also makes kernel density estimation via the Parzen-Rosenblatt window
method unreliable; some elements occur only once or twice in the training set.
Hence, we introduce an evaluation parameter named the matching score, which
evaluates the model by assessing the diversity of the generated dataset.
4.1.1 Matching Score
In this section, an evaluation parameter named the matching score is introduced.
This parameter is designed to quantify the similarity between generated and real
high-entropy alloy formulas. Although mathematically simple, it lets us check for
mode collapse during training by estimating the nearest neighbors of the generated
formulas. We first describe how to calculate the matching score for a batch of
generated data.
Suppose there are n formulas in the generated dataset, and G is the matrix
containing the molar ratios of all the generated formulas:

    G = [ m_{1,1}  ...  m_{1,56} ]
        [   ...    ...    ...    ]
        [ m_{n,1}  ...  m_{n,56} ]

R is the matrix containing the molar ratios of the 724 real high-entropy alloy
formulas:

    R = [ r_{1,1}    ...  r_{1,56}   ]
        [   ...      ...    ...      ]
        [ r_{724,1}  ...  r_{724,56} ]

Because all the molar ratios are nonnegative, we can take the square root of each
entry of these two matrices to create G_sqrt and R_sqrt, where
(G_sqrt)_{i,j} = sqrt(m_{i,j}) and (R_sqrt)_{i,j} = sqrt(r_{i,j}). We then define
the matching score matrix M as the product of G_sqrt and the transpose of R_sqrt:

    M = G_sqrt R_sqrt^T = [ s_{1,1}  ...  s_{1,724} ]
                          [   ...    ...    ...     ]
                          [ s_{n,1}  ...  s_{n,724} ]
where s_{i,j} is the matching score between the i-th generated formula and the j-th
real formula:

    s_{i,j} = Σ_{k=1}^{56} sqrt(m_{i,k}) · sqrt(r_{j,k})
The matching score of a generated formula against the training set is then defined
as the largest matching score it attains over the 724 real formulas, and its
nearest neighbor is the real formula with which it attains that largest score:

    s = [ s_{1,max}, ..., s_{n,max} ]^T

where s_{i,max} is the largest value in the i-th row of M. The average matching
score of the generated dataset is the mean over all generated formulas:

    s_mean = (s_{1,max} + ... + s_{n,max}) / n
The matching score lies in the range [0, 1]. When two formulas have exactly the
same element composition, their matching score equals 1, since in that case the
matching score reduces to the sum of the formula's molar ratios:

    s = Σ_{i=1}^{56} sqrt(m_i) · sqrt(m_i) = Σ_{i=1}^{56} m_i = 1
By calculating the matching scores of a batch of generated formulas, we can
estimate their nearest neighbors in the training set. Unlike computing Euclidean
distances, calculating matching scores requires only a few matrix operations, which
makes it computationally efficient and easy to implement. Our experiments showed
that the nearest neighbors found via matching scores and via Euclidean distances
are nearly identical (only about 1% differ), which is acceptable since our task is
to evaluate the diversity of the generated formulas. Because of its low
computational cost and its effectiveness at finding nearest neighbors, the matching
score can be computed at every training epoch without noticeably slowing training.
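The matrix-based calculation described above can be sketched as follows. This is a pure-Python stand-in for the project's implementation, using small 5-element vectors instead of the full 56 columns:

```python
# Sketch of the matching-score computation: entry (i, j) of the score
# matrix is sum_k sqrt(m_{i,k}) * sqrt(r_{j,k}); each generated formula
# keeps its row maximum and the index of the real formula attaining it.
from math import sqrt

def matching_scores(G, R):
    """For each generated formula (row of G), return its best matching
    score against the real formulas (rows of R) together with the index
    of that nearest neighbor."""
    results = []
    for g in G:
        scores = [sum(sqrt(gk) * sqrt(rk) for gk, rk in zip(g, r)) for r in R]
        best = max(range(len(R)), key=lambda j: scores[j])
        results.append((scores[best], best))
    return results

# Identical normalized compositions score exactly 1.
G = [[0.2, 0.2, 0.2, 0.2, 0.2],
     [0.5, 0.5, 0.0, 0.0, 0.0]]
R = [[0.2, 0.2, 0.2, 0.2, 0.2],
     [0.4, 0.3, 0.3, 0.0, 0.0]]
s_max = [s for s, _ in matching_scores(G, R)]
s_mean = sum(s_max) / len(s_max)
```

The first generated row matches the first real formula with score 1 (identical compositions); the second finds its best partial overlap in the second real formula.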
The matching score of two chemical formulas correlates with their similarity.
Table 3 shows some generated high-entropy alloy formulas with different matching
scores. As the table shows, formulas with matching scores above 0.98 are very
similar to their nearest neighbors, and formulas with matching scores close to 1
are almost identical to existing high-entropy alloys.
Generated Formula                        Matching Score   Nearest Neighbor
Al0.14Co0.2Cr0.2Cu0.06Fe0.2Ni0.2         0.9997           Al0.15Co0.2Cr0.2Cu0.05Fe0.2Ni0.2
Al0.2Nb0.21Ti0.26V0.13Zr0.2              0.995            Al0.2Nb0.2Ti0.2V0.2Zr0.2
Mo0.15Nb0.31Ti0.22Zr0.32                 0.99             Mo0.25Nb0.25Ti0.25Zr0.25
Hf0.16Mo0.08Nb0.14Ta0.24Ti0.27Zr0.11     0.9854           Hf0.18Mo0.1Nb0.18Ta0.18Ti0.18Zr0.18
Co0.2Cr0.1Fe0.45Mn0.22Si0.03             0.98             Co0.2Cr0.15Fe0.4Mn0.2Si0.05
Table 3. Generated formulas and their nearest neighbors found using matching score
The average matching score assesses the overall similarity between the generated
dataset and the training set. Figure 11 plots the Wasserstein distance together
with the average matching score of 10,000 generated formulas. As Figure 11 shows,
the average matching score is inversely correlated with the Wasserstein distance
between the probability distributions of the generated data and the training data.
The reason for using the average matching score as a stopping criterion is that the
Wasserstein distance is estimated by the Lipschitz function defined by the
discriminator, whose weights differ between training runs, so the Wasserstein
distance converges to a different value each run. The average matching score, in
contrast, depends only on the element compositions of the generated formulas, which
allows the model to be assessed during training in a more direct way: we can judge
the model's performance simply by inspecting the current average matching score.
For example, an average matching score of 0.6 indicates underfitting, while an
average matching score of 0.99 suggests mode collapse or a memorizing GAN, which
often occurs when the generator has too many hidden layers.
Figure 11. Wasserstein distance vs. average matching score.
The number of finely regenerated formulas is obtained by counting the real formulas
that are regenerated with a high matching score. For example, with the lower bound
of the matching score set to 0.98, we first select the generated formulas whose
matching scores exceed 0.98, and then count how many distinct real formulas appear
among their nearest neighbors.
Since a generative model will almost never reproduce a training formula exactly, we
use the number of real formulas regenerated with a high matching score as a proxy
for the generated dataset's diversity. An ideal model should be able to reproduce
most of the chemical formulas in the training set with high matching scores. During
training, the stopping criterion on this count was set so that the regenerated
formulas with matching scores above 0.98 cover about 50-55% of the high-entropy
alloy formulas in the training set. In the generated dataset, around 10-15% of the
formulas have matching scores above 0.99, covering about 30% of the training
formulas.
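The counting procedure can be sketched as follows, assuming the nearest-neighbor pairs have already been computed as described in Section 4.1.1 (the helper name and example values are illustrative, not from the project):

```python
# Sketch of the "finely regenerated formulas" count: among generated
# formulas whose best matching score exceeds a threshold, count how
# many *distinct* real training formulas appear as nearest neighbors.

def count_finely_regenerated(pairs, threshold=0.98):
    """pairs: list of (best_score, nearest_neighbor_index) tuples,
    one per generated formula."""
    covered = {idx for score, idx in pairs if score > threshold}
    return len(covered)

# Four generated formulas; two of them share the same nearest neighbor
# and one falls below the threshold, so only 2 real formulas are covered.
pairs = [(0.999, 0), (0.985, 0), (0.991, 3), (0.600, 7)]
n_covered = count_finely_regenerated(pairs)
```

Using a set of neighbor indices ensures each real formula is counted once, no matter how many generated formulas land near it.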
4.1.2 Visualization
After the model is trained and saved, a dataset of 10,000 generated high-entropy
alloy formulas is created. The generated data and training data are then loaded
into TensorBoard to visualize their distributions. The following two figures show
2D visualizations of the distributions of the generated dataset and the training
set; the red dots represent generated data, and the blue dots are real data from
the training set.
Figure 12. Number of finely regenerated formulas with matching score > 0.98.
T-SNE (t-distributed Stochastic Neighbor Embedding) is a technique for visualizing
high-dimensional data in a two- or three-dimensional map [13]. It uses random walks
on neighborhood graphs to enable the data's implicit structure to be displayed in
the 2D or 3D visualization [13]. UMAP (Uniform Manifold
Approximation and Projection) is another dimensionality reduction algorithm, which
is competitive with t-SNE in visualization quality [14]. As shown in these two
figures, although the distribution of the training data is rather sparse, the model
still fits it well.
Figure 13. T-SNE figure of generated and training data
Figure 14. UMAP figure of generated and training data
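A rough equivalent of these visualizations can be produced outside TensorBoard, for example with scikit-learn's t-SNE. The sketch below uses random stand-in compositions rather than the project's data; a UMAP projection would follow the same pattern using the umap-learn package:

```python
# Sketch of projecting generated and real compositions into 2-D with
# t-SNE, mirroring the TensorBoard visualization described above.
# The Dirichlet samples below are stand-ins for molar-ratio vectors.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
real = rng.dirichlet(np.ones(5), size=40)   # stand-in "training" ratios
fake = rng.dirichlet(np.ones(5), size=40)   # stand-in "generated" ratios

data = np.vstack([real, fake])
embedding = TSNE(n_components=2, perplexity=10,
                 random_state=0).fit_transform(data)
# embedding[:40] -> real points (blue), embedding[40:] -> generated (red)
```

Plotting the two halves of `embedding` in different colors reproduces the red/blue scatter described in the text.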
4.2 Result Analysis
In this section, the model is evaluated by comparing the empirical parameters of
the generated formulas against those of the training formulas. The design
guidelines are then used to assess the formulas in the generated dataset.
The mean and standard deviation of the empirical parameters of the generated and
training formulas are as follows:

        ΔS_mix (J/(K·mol))  ΔH_mix (kJ/mol)  δ (%)  Ω      VEC   T_m (K)
mean    13.99               -11.39           5.16   11.53  6.91  1835.10
std.    1.88                10.49            3.58   84.16  1.47  316.70
Table 4. Mean and standard deviation of the empirical parameters of the generated formulas

        ΔS_mix (J/(K·mol))  ΔH_mix (kJ/mol)  δ (%)  Ω      VEC   T_m (K)
mean    13.31               -11.21           5.13   7.85   6.80  1816.24
std.    1.93                9.76             3.67   24.72  1.59  345.21
Table 5. Mean and standard deviation of the empirical parameters of the training formulas

Notations: ΔS_mix: entropy of mixing; ΔH_mix: enthalpy of mixing; δ: difference in
atomic radii; VEC: valence electron concentration; T_m: average melting point;
Ω = T_m ΔS_mix / |ΔH_mix|.

The tables show that the generated and training formulas have very similar means
and standard deviations for the empirical parameters. The only exception is Ω: both
the mean and the standard deviation of Ω are larger for the generated dataset than
for the training set. This is because some generated formulas have near-zero ΔH_mix
values, and since Ω is inversely proportional to the absolute value of ΔH_mix,
these formulas can have extremely large Ω values, which raise the mean and standard
deviation. However, since the design guidelines do not restrict large Ω values, and
large Ω values favor the formation of high-entropy alloys [2], the difference in Ω
between the generated formulas and the real high-entropy alloy formulas is not a
significant concern.

The visualizations of the distributions of the generated formulas and the real
high-entropy alloy formulas are shown below. These two figures were produced using
both the molar ratios and the empirical parameters of the formulas. After including
the empirical parameters, the red dots fit the blue dots much better, because the
model is trained on both the molar ratios and the empirical parameters.
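The sensitivity of Ω to near-zero ΔH_mix can be checked with a quick calculation (the input values below are illustrative, not taken from the dataset):

```python
# Worked check of why near-zero ΔH_mix inflates Ω = T_m * ΔS_mix / |ΔH_mix|.
# Note the unit conversion: ΔS_mix in J/(K·mol), ΔH_mix in kJ/mol.

def omega(t_m, ds_mix, dh_mix):
    """Ω for melting point t_m (K), entropy ds_mix (J/(K·mol)),
    and enthalpy dh_mix (kJ/mol)."""
    return t_m * (ds_mix / 1000.0) / abs(dh_mix)   # J -> kJ

typical = omega(1800, 13.4, -11.0)   # a typical enthalpy of mixing
extreme = omega(1800, 13.4, -0.1)    # near-zero enthalpy inflates Ω
```

Here `typical` is on the order of 2, while `extreme` exceeds 200, which is exactly the kind of outlier that inflates the Ω mean and standard deviation of the generated dataset.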
Applying the design guidelines for high-entropy alloys in Table 1, we find that
around 60% of the formulas generated by the model fall within these intervals,
which matches the proportion of high-entropy alloy formulas in the training set
that do.
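Such a guideline check reduces to testing each empirical parameter against an interval. In the sketch below, the bounds are placeholders and should be replaced by the actual intervals from Table 1:

```python
# Sketch of applying design guidelines as interval checks on the
# empirical parameters. The bounds below are PLACEHOLDERS, not the
# report's Table 1 values.
GUIDELINES = {
    "dS_mix": (12.0, 17.5),   # J/(K·mol), placeholder bounds
    "dH_mix": (-15.0, 5.0),   # kJ/mol, placeholder bounds
    "delta":  (0.0, 6.6),     # %, placeholder bounds
}

def passes_guidelines(params):
    """True if every listed parameter falls inside its interval."""
    return all(lo <= params[k] <= hi for k, (lo, hi) in GUIDELINES.items())

ok  = passes_guidelines({"dS_mix": 13.4, "dH_mix": -11.2, "delta": 5.1})
bad = passes_guidelines({"dS_mix": 13.4, "dH_mix": -20.0, "delta": 5.1})
```

Running every generated formula through such a filter yields the "fraction falling into the intervals" quoted above.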
Figure 15. T-SNE figure of generated and training data using
both molar ratios and empirical parameters
Figure 16. UMAP figure of generated and training data using
both molar ratios and empirical parameters
Conclusion
1. The first GAN applied to alloy development, specifically high-entropy alloys,
was developed in this work; it estimates the probability distribution of
high-entropy alloys in both element and thermodynamic space.
2. A single unified dataset containing 724 high-entropy alloys and their empirical
parameters was constructed.
3. An empirical parameter calculator was developed that computes the six empirical
parameters of high-entropy alloys.
4. Owing to the limited empirical data, a fast evaluation metric, the matching
score, was designed and implemented to quantify the similarity between high-entropy
alloy formulas.
5. Potentially novel HEAs that may possess an FCC structure and a single phase
include Al6.6Co6Cr2.2Cu4.7Ni1.8Zr1.9 and Al1.6Co1.8Cr0.8Cu2Fe1.8Ni2.6Pd0.9.
References
[1] M.-H. Tsai and J.-W. Yeh, “High-Entropy Alloys: A Critical Review,” in Materials
Research Letters, vol. 2, no. 3, pp. 107–123, 2014.
[2] J.-W. Yeh, S.-K. Chen, S.-J. Lin, J.-Y. Gan, T.-S. Chin, T.-T. Shun, C.-H. Tsau, and S.-
Y. Chang, “Nanostructured High-Entropy Alloys with Multiple Principal Elements:
Novel Alloy Design Concepts and Outcomes,” in Advanced Engineering Materials, vol.
6, no. 5, pp. 299–303, 2004.
[3] “High entropy alloys,” Wikipedia, 31-Aug-2019. [Online]. Available:
https://en.wikipedia.org/wiki/High_entropy_alloys#cite_note-tsai-2. [Accessed: 24-
Oct-2019]
[4] Z. Wang, Q. She and T. Ward, "Generative Adversarial Networks: A Survey and
Taxonomy", arXiv preprint arXiv: 1906.01529, 2019.
[5] Y. Zhang, Y. J. Zhou, J. P. Lin, G. L. Chen, and P. K. Liaw, “Solid-Solution Phase
Formation Rules for Multi-component Alloys,” in Advanced Engineering Materials, vol.
10, no. 6, pp. 534–538, 2008.
[6] Y Zhang, T. Zuo, Z. Tang, M. C. Gao, K. A. Dahmen, P. K. Liaw, and Z. Lu,
"Microstructures and properties of high-entropy alloys," in Progress in Materials Science,
vol. 61, pp. 1–93. 2014.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A.
Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014.
[8] P. Luc, C. Couprie, S. Chintala, and J. Verbeek, "Semantic Segmentation using
Adversarial Networks", in NIPS, 2016
[9] J. Brownlee, “How to Implement Wasserstein Loss for Generative Adversarial
Networks,” in Machine Learning Mastery, 12-Jul-2019. [Online]. Available:
https://machinelearningmastery.com/how-to-implement-wasserstein-loss-for-
generative-adversarial-networks/. [Accessed: 25-Oct-2019].
[10] M. Arjovsky, S. Chintala, and L. Bottou. “Wasserstein GAN”, arXiv preprint
arXiv:1701.07875, 2017.
[11] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, “Improved
training of wasserstein gans”, arXiv preprint arXiv:1704.00028, 2017.
[12] L. Weng, “From GAN to WGAN,” in Lil'Log, 20-Aug-2017. [Online]. Available:
https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html. [Accessed:
25-Oct-2019].
[13] K. Y. Wong and F.-L. Chung, “Visualizing Time Series Data with Temporal
Matching Based t-SNE”, in International Joint Conference on Neural Networks (IJCNN),
2019.
[14] L. McInnes, J. Healy, and J. Melville, “UMAP: Uniform Manifold Approximation
and Projection for Dimension Reduction”, arXiv preprint arXiv:1802.03426, 2018.
Appendix 1. Final Project Description
High-entropy alloys (HEAs) are alloys composed of five or more metallic elements in
nearly equal proportions. This novel class of materials has potentially desirable
properties such as better strength-to-weight ratios, higher strength, and fracture
resistance.
This one-semester project relies on the use of machine learning to assist in the
development of A.I.-predicted high-entropy alloy compositions.
This project involves five main tasks:
1. Construction of HEA dataset – The training set should include the
majority of the existing HEA compositions that can be found in published
papers related to HEAs.
2. Building an empirical parameter calculator – The stabilization
mechanism of HEAs is related to six empirical parameters: entropy of mixing,
enthalpy of mixing, difference in atomic radii (delta), omega, valence electron
concentration (VEC), and average melting point (Tm). These six empirical
parameters are also treated as important features and used to predict potentially
novel HEAs. The empirical property calculator should be able to perform both
individual and large-scale calculations.
3. Construction of training set – The training set should consist of chemical
formulas along with their 6 empirical parameters calculated using the empirical
property calculator.
4. Construction of machine learning model – An appropriate machine
learning model needs to be constructed and optimized. The machine learning task
is to use the descriptive attributes in the training set to predict potentially
novel high-entropy alloys.
5. Model evaluation – In this step, some evaluation techniques should be
applied to evaluate the model’s predictive performance.
Appendix 2. Independent Study Contract
Appendix 3. Description of Software
Program code files:
Java files:
DataCleaner.java : This class normalizes and reorders the chemical formulas in the
HEA dataset and removes duplicate entries. It also provides element occurrence
statistics.
Elements.java : This class includes all the elements needed in the novel HEA
prediction.
Tokenizer.java : The Tokenizer class can tokenize elements and their molar ratios from
randomly formatted formulas and regenerate uniform, normalized chemical formulas.
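The idea behind the Tokenizer can be sketched in Python (an illustrative stand-in for the Java implementation, not the project's code; the regex and normalization choices here are assumptions):

```python
# Python stand-in for what Tokenizer.java does: parse element symbols
# and molar ratios out of a formula string, then normalize the ratios
# so they sum to one. A symbol with no trailing number gets ratio 1.
import re

def tokenize(formula):
    """Return {element: normalized molar ratio} for a formula string."""
    # One capital letter, optional lowercase letter, optional number.
    tokens = re.findall(r"([A-Z][a-z]?)([0-9]*\.?[0-9]*)", formula)
    ratios = {el: float(r) if r else 1.0 for el, r in tokens if el}
    total = sum(ratios.values())
    return {el: r / total for el, r in ratios.items()}

equimolar = tokenize("AlCoCrFeNi")       # each element -> 0.2
weighted  = tokenize("Al0.5CoCrFeNi")    # Al -> 0.5/4.5, others 1/4.5
```

The real Tokenizer additionally handles duplicate elements and arbitrary formatting, and its cleaned output is what DataCleaner deduplicates and reorders.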
Main.java : The Main class applies the functionality of the above Java files and
completes the HEA dataset generation. The element compositions, molar ratios, and
element occurrence statistics are saved in three .txt files.
TrainningSetTransformer.java : This class transforms the element compositions and
molar ratios into a single csv file. There are 56 columns in this csv file, with
each column corresponding to the molar ratio of a specific element and each row to
a specific HEA formula.
JavaTest.java : This class integrates all the unit tests for the above Java files.
Python files:
calculator.py* : This class provides the empirical parameter calculation
functionality of the empirical property calculator. It imports the open-source
'matminer' library, which provides the atomic radii of the elements. This class
also includes some unit tests.
trainingset_generator.py : This script translates the element compositions and
molar ratios generated by the Java files into a single csv file containing all six
parameters, using calculator.py.
parameter_generator.py : This script is similar to trainingset_generator.py but
writes only the six parameters to a csv file.
GUI.py : This Python file provides a user interface for calculating the six
empirical parameters of any given chemical formula; it also supports large-scale
calculations.
random_data_generator.py : This script generates a large number of random chemical
formulas along with their six empirical parameters. The generated dataset is then
used to train the calculator neural network, which plays a crucial part in training
the subsequent GAN model.
calculator_net.py : This script uses the randomly generated chemical formulas to
train a calculator neural network written in PyTorch, which provides the same
functionality as the empirical property calculator itself. After training, the
model is saved as calculator_net.pt in the saved_models package.
parameters.py : This script holds all the tunable parameters of the GAN model, such
as the learning rates, clip value, and batch size. It functions as the control
panel when fine-tuning the model.
cardiGAN.py : This file is the core of the whole project. It contains all the
classes needed for GAN training, including a dataset loader and two neural network
classes: one generator and one discriminator (both written in PyTorch). It also
provides a training monitor that stops the training process once the generator is
optimized. The name cardiGAN stands for 'compositionally complex alloy research
directive inference GAN'. After training, the generator and discriminator are saved
in the saved_models package as generator.pt and discriminator.pt.
fake_alloy_generator.py : This script applies the saved generator model to produce
a dataset of generated HEAs along with their six empirical parameters.
analyzer_reporter.py : This script analyzes the generated high-entropy alloy
formulas and produces an analysis report, saved in the analysis_report package.
During the analysis, a ranked dataset is produced, which calculates the similarity
between each generated formula and the formulas in the training set and finds the
closest matching formulas.
analyzer_visualizer.py : This script loads the generated and real data into
TensorBoard to produce a visual presentation of both datasets.
model_analyzer.py : This script integrates the functionality of the fake alloy
generator, the analyzer reporter, and the analyzer visualizer.
All the above code files were implemented by the student except calculator.py which
was co-produced by the student and the supervisor.
Testing procedure description:
Since the code was written in both Java and Python, the testing procedure is
divided into two sections. The Java tests mainly cover the DataCleaner.java and
Tokenizer.java modules. The Python unit tests live inside calculator.py, since the
remaining Python files are either generative or discriminative machine learning
models, whose performance can only be assessed during training, or user interfaces
and file writers, which are hard to unit test.
The Java tests first exercise the Tokenizer's parse(), write_ele(), and
write_ratio() methods, which parse element formulas from arbitrarily formatted
input. The parsed results are then cleaned by the DataCleaner, which removes any
duplicates and reorders the formulas alphabetically. The tests used a randomly
generated txt file containing hundreds of random formulas; the DataCleaner and
Tokenizer parsed that file and created a formatted file containing only the 22
original formulas that were prepared. Both the Tokenizer and the DataCleaner
functioned correctly and passed the tests.
The Python unit tests check the calculator's ability to compute all six empirical
material parameters. They use an artificial formula to calculate all six parameters
and then verify the results against hand-calculated values obtained from the
thermodynamic equations. The calculator is both accurate and efficient.
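The kind of check performed can be sketched as follows. This is illustrative: the project's test exercises calculator.py, while here a minimal entropy-of-mixing function stands in for it:

```python
# Sketch of verifying a calculator output against a hand calculation.
# The entropy of mixing is dS_mix = -R * sum(c_i * ln c_i), with the
# gas constant R = 8.314 J/(K·mol).
import math

R = 8.314  # gas constant, J/(K·mol)

def entropy_of_mixing(ratios):
    """dS_mix for a list of molar ratios summing to one."""
    return -R * sum(c * math.log(c) for c in ratios if c > 0)

# Hand calculation: an equimolar quinary alloy gives R * ln(5),
# roughly 13.38 J/(K·mol), matching the ΔS_mix magnitudes in Chapter 4.
computed = entropy_of_mixing([0.2] * 5)
expected = R * math.log(5)
```

Comparing `computed` with `expected` to within a small tolerance is exactly the shape of assertion a calculator unit test makes.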
Description of experiment tools:
The experiments were carried out using two IDEs. The Java code was compiled and run
with IntelliJ IDEA Community Edition and the standard JDK 12.0. The Python code was
run with JetBrains PyCharm Community Edition using an Anaconda 3 interpreter.
The dataset used for model training was manually collected from hundreds of
published papers and engineered by the student.
The model was trained on the student's personal computer.
Appendix 4. README
2019 Semester 2 COMP8755 Individual Project
u6766505 Zhipeng Li
Supervisor: Nick Birbilis
Abstract
High-entropy alloys (HEAs) are alloys composed of five or more metallic
elements in nearly equal proportions. This novel class of materials has
potentially desirable properties such as better strength-to-weight ratios,
higher strength, and fracture resistance. This one-semester project relies on
the use of machine learning to assist in the development of A.I.-predicted
high-entropy alloy compositions.
Objectives
This project involves five main steps:
1. Construction of HEA dataset – The training set should include the majority
of the existing HEA compositions that can be found in published papers
related to HEAs.
2. Building an empirical material property calculator – The stabilization
mechanism of HEAs is related to six empirical material properties: entropy,
enthalpy, difference in atomic radii (delta), omega, valence electron
concentration (VEC), and average melting point (Tm). These six empirical
parameters are also treated as important features and used to predict
potentially novel HEAs. The empirical property calculator should be able to
perform both individual and large-scale calculations.
3. Construction of training set – The training set should consist of chemical
formulas along with their 6 empirical parameters calculated using the
empirical property calculator.
4. Construction of machine learning model – An appropriate machine
learning model needs to be constructed and optimized. The machine learning
task is to use the descriptive attributes in the training set to predict potential
novel High-entropy Alloys.
5. Model evaluation – In this step, some evaluation techniques should be
applied to evaluate the model’s predictive performance.
Installation
Before doing the following installation, please make sure you have Java JDK 10.0
(or a later version), IntelliJ IDEA (or any preferred Java IDE), Python 3, and
JetBrains PyCharm. Instructions for installing this software can be found on
the official websites.
1. git clone this repository.
2. Create an environment for running, e.g.:
$ conda create -n cardiGAN python=3.7
3. Activate the environment and install the required packages:
$ source activate cardiGAN
$ conda (or pip) install pytorch torchvision -c pytorch -y
$ conda (or pip) install numpy pandas -y
$ conda (or pip) install pymatgen -c matsci -y
$ pip install matminer
Instruction
• This project is divided into two sections.
o The Java code implementations help with data engineering and
dataset construction, which are the very first steps of this project.
o The Python section contains the rest of the project.
1. HEA dataset construction:
o Open the Java sub-project 'HEAParser' using any Java IDE.
o Run Main.java program, this program integrated the functionalities
of DataCleaner.java and Tokenizer.java, which will create
three .txt files inside 'parseResult' package, each contains the element
compositions, elements' molar ratios and formatted formulas of the
HEA dataset.
o Run TrainingSetTransformer.java, this will create a .csv file inside
package 'parseResult' in the form of 'n * 56', where n is the number of
chemical formulas in the training set. There are 56 columns in this csv
files, with each column corresponding to the molar ratio of a specific
element and each row a specific HEA formula. This csv file, along with
the training HEAs' empirical parameters will be used to train the GAN
model.
2. Training set construction:
o Open the Python project '2019Project' in JetBrains PyCharm or any
preferred Python IDE. Set the Python interpreter to Anaconda3's
default interpreter.
o Before contructing the Training set, find and copy
the GAN_training_set.csv inside 'HEAParser/parseResult', then paste
this csv file into package 'main/training_set'. Do not change the name
of this file.
o Find trainingset_generator.py and parameters_generator.py inside
package utility and run them. This will
create HEA_params.csv and train_params.csv inside package
training_set, which will be used during the model training and model
evaluation sections.
3. Training the calculator neural network model (Optional)
o Before training the GAN, a calculator neural network has to be trained
and saved (An accurate model is already saved inside package
'main/saved_models'). This pre-trained neural network will be used to
calculate the six empirical parameters of the generated fake formulas.
Which is more efficient than just using the calculator.py script we built,
and it also enables the parameter-related loss to be passed back
through this pre-trained network to the generator.
o Inside package 'main/utility', change attribute 'num_sample'
inside random_data_generator.py to the amount of training data you
need, then run this script, which will create random_result.csv inside
44
package 'main/generated_HEAs'. This file is deleted since the dataset is
big and no longer used in the following steps.
o Run calculator_net.py and train the calculator model. The trained
model will be saved inside package 'main/saved_models' as
calculator_net2.pt.
4. Training the GAN model
o The GAN model will be trained using all the datasets, scripts and
model built in the above steps. The training process of this GAN
model could be time-consuming if the stopping criterion was set high.
All the model's parameters are contained inside parameters.py. The
parameters are already tuned, any modification to these parameters
could give hard time in training or lead to model not converging.
o Run cardiGAN.py, once the generator has met the stopping criterion,
the trained generator model will be saved as generator_net.pt inside
package 'main/saved_models'. Since the loss function of discriminator
is Wasserstein loss, saving the discriminator is not very helpful for this
project.
5. Model Analysis
o Run model_analyzer.py. This script will automatically finish the
dataset generation, generated dataset analysis and TensorBoard
dataset visualization jobs. Then, inside package 'analysis_report',
created are two files. analysis_report.txt saved the analysis result,
and generated_novel_ranking.csv has all the generated formulas ranked
and paired with their closest formulas inside the training set. A
matching score is calculated for each generated formula. The higher
the score, the closer is the formula to the ones in the training set.
o (Optional) Find and Follow an installation video on YouTube, then
install TensorBoard to your computer. Run TensorBoard and see how
the generated data fit to the training data. This step is mainly about
dimensionality reduction and dataset visualization, which is optional,
since I already saved a set of distribution pictures inside
'main/visualization' package.