diamond price model - cae usershomepages.cae.wisc.edu/~ece539/project/f17/yang_rpt.pdfthe last c of...

Diamond Price Model

Summary

To compare neural networks with a traditional statistical model, I first visualize the diamonds data set and construct an Econometrics model, log(price) = 7.56104309911 + 0.0320045461853 * cut + 0.0778686923705 * color + 0.123285127953 * clarity + 1.87768943035 * log(carat), and analyze the importance of each variable. Then I construct a one-layer neural network and a deep neural network, and tune the parameters of the one-layer neural network by trying many combinations of parameter values. The data show that a one-layer neural network with 17 neurons, a learning rate of 0.04 and a momentum of 0.86 works well and beats the statistical model in predictive ability. Since a neural network alone cannot explain the data, I conclude that both traditional statistical models and neural networks play an important role.

Keywords log transform, Econometrics, deep neural network, back propagation algorithm, TensorFlow

1. Introduction and related work

Kaggle hosts a very neat data set of 53,940 diamonds that records carat, cut, color, clarity, depth percentage, table, price, length, width and depth. Fortunately, the data set has no missing values.

The most popular report on the data set is another price model built in the traditional statistical way. After visualizing the data in several graphs, the author simply applies a linear model with a log transform and gets a pretty good model whose adjusted R-squared is 0.9.

The second most popular report is Diamond Cut's Prediction with XGBoost. Its confusion rate is up to 32%, and it uses price as an input and cut as the output, which is a different task from a price model.

2. Background Knowledge of Diamonds

Diamond is a metastable allotrope of carbon in which the carbon atoms are arranged in a variation of the face-centered cubic crystal structure called a diamond lattice. Since it is the hardest natural mineral on earth, with a Mohs hardness of 10, it was used in glass cutters before the invention of artificial diamonds. It is one of the most important, famous and brilliant gems in the world. Diamonds are so beautiful and expensive that they are often set in engagement rings to show the groom-to-be's sincerity and love.

Data set: http://www.kaggle.com/shivam2503/diamonds

In fact, diamonds are not as rare as most people think. In September 2012, Russia held a press conference claiming that a massive diamond deposit with trillions of carats of reserves had been discovered in eastern Siberia, tens of times the size of the world's diamond reserves and enough to satisfy all of humankind's demand for 3000 years. However, De Beers Consolidated Mines, Ltd. controls more than 90% of the diamond mines in the world and has established a trust with the owners of the other mines. As a monopolist, De Beers intentionally extracts only a few diamonds every year to keep prices high and maximize its profits. If any other miner extracts plenty of diamonds and sells them on the market, De Beers temporarily sells its large inventories to flood the market until the disobedient miner goes bankrupt. As a result, nobody dares to lower the price of diamonds.

Since diamonds are light, convenient to carry, precious and easy to sell, arms dealers use them as payment, as shown in the film Lord of War. In the film Schindler's List, Jews also carry diamonds with them during the Holocaust. Because diamonds can serve as a medium of exchange, figuring out their market value is essential. Besides, ordinary consumers need to know whether they are buying diamonds at a reasonable price in order to avoid fraud.

Usually, prices are affected by a diamond's weight and physical appearance, so we can judge its price from its characteristics. There is a generally accepted international standard, the 4Cs of diamond quality, for classifying diamonds and judging their quality. Although it was created by the Gemological Institute of America (GIA), you can get a diamond grading report based on the 4Cs from other institutes such as the International Gemological Institute. In brief, the 4Cs are cut, color, clarity and carat.

The most important of the 4Cs is cut¹ because it has the greatest influence on a diamond's sparkle. In determining the quality of the cut, the diamond grader evaluates the cutter's skill in fashioning the diamond. The more precise the cut, the more captivating the diamond is to the eye.

The second most important of the 4Cs is color, which refers to a diamond's lack of color. Gem-quality diamonds occur in many hues, in the range from colorless to light yellow or light brown. Colorless diamonds are the rarest, so the less color, the higher the grade. It is noteworthy that the color grading system starts at D. Before GIA universalized the D-to-Z Color Grading Scale, a variety of other systems were used loosely, from A, B and C (used without clear definition), to Arabic (0, 1, 2, 3) and Roman (I, II, III) numerals, to descriptive terms like “gem blue” or “blue white,” which are notorious for misinterpretation. The inventors of the 4Cs standard wanted to start fresh, without any association with earlier systems, so the scale starts at the letter D. Other natural colors (blue, red or pink, for example) are known as “fancy,” and their color grading differs from that of colorless diamonds.

Clarity is often the least important of the 4Cs. Diamonds can have internal characteristics known as inclusions or external characteristics known as blemishes.² Diamonds without inclusions or blemishes are rare; however, most characteristics are tiny imperfections that can only be seen under magnification, so clarity is not as important as cut and color.

The last C of 4Cs is the carat, the diamond’s physical weight measured in metric carats.

One carat equals 1/5 gram and is subdivided into 100 points. Carat weight is the most

objective grade of the 4Cs. Diamond prices jump at the full- and half-carat weights.

Diamonds just below these weights cost significantly less, and, because carat weight is

distributed across the entirety of the diamond, small size differences are almost

impossible to detect. Visually, there’s little difference between a 0.99 carat diamond and

one that weighs a full carat. But the price differences between the two can be significant.

This feature is important because it has great influence and impact on the accuracy of a

prediction model.

¹ www.bluenile.com/education/diamonds
² www.gia.edu/gia-about/4cs-clarity

3. Visualization of the data

As our professor said in class, before we apply a neural network model, it is always

helpful to look at the graphs generated from the data.

The data set has the following fields:

field     description
price     price in US dollars ($326--$18,823)
carat     weight of the diamond (0.2--5.01)
cut       quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color     diamond color, from J (worst) to D (best)
clarity   a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x         length in mm (0--10.74)
y         width in mm (0--58.9)
z         depth in mm (0--31.8)
depth     total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
table     width of top of diamond relative to widest point (43--95)

Table 1 attributes

Carat, cut, color and clarity have been introduced above, while x, y, z, depth and table are attributes related to cut. Since cut is described as the most important element, these attributes may give cut a greater weight in the model and make predictions more precise.

Before we use these fields in our model, we must convert the linguistic grades into numbers. To give the 4Cs a positive relation with price, the better the diamond's quality is, the greater the numerical value will be. For cut, Fair is replaced with 1, Good with 2, Very Good with 3, Premium with 4 and Ideal with 5. For color, grade J gets 1 point, grade I gets 2 points … grade D gets 7 points. For clarity, 1 stands for I1, 2 stands for SI2 … 8 stands for IF.

I also add dummy variables such as cutFair, cutGood, cutPremium, colorJ, colorI, clarityI1, claritySI2 and so on. Each takes the value 1 when the diamond has that grade and 0 otherwise.
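The encoding just described can be sketched in a few lines of Python. This is only an illustration: the function names (`ordinal`, `dummies`, `encode`) and the dictionary layout of a record are assumptions made for this sketch, not part of the original workflow, which used other tools.

```python
# Ordinal encodings described in the text: a higher number means a better grade.
CUT_ORDER = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR_ORDER = ["J", "I", "H", "G", "F", "E", "D"]
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]


def ordinal(grade, order):
    """Map a grade label to 1..len(order), worst grade first."""
    return order.index(grade) + 1


def dummies(grade, order, prefix):
    """One 0/1 indicator per grade, e.g. cutFair, cutGood, ..."""
    return {prefix + g.replace(" ", ""): int(g == grade) for g in order}


def encode(diamond):
    """Return numerical grades plus dummy variables for one record (a dict)."""
    row = {
        "numericalCut": ordinal(diamond["cut"], CUT_ORDER),
        "numericalColor": ordinal(diamond["color"], COLOR_ORDER),
        "numericalClarity": ordinal(diamond["clarity"], CLARITY_ORDER),
    }
    row.update(dummies(diamond["cut"], CUT_ORDER, "cut"))
    row.update(dummies(diamond["color"], COLOR_ORDER, "color"))
    row.update(dummies(diamond["clarity"], CLARITY_ORDER, "clarity"))
    return row


# Hypothetical record, used only to show the output shape.
example = encode({"cut": "Premium", "color": "E", "clarity": "VS2"})
```

Exactly one dummy variable per attribute is 1 for any given diamond, so a regression that uses the dummies together with a constant term would need to drop one dummy per attribute.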

Figure 1 carat distribution

The minimum weight is 0.2 carat and the maximum weight is 5.01 carats. The average weight in the data set is 0.798 carat and the sample standard deviation is 0.474 carat. As the bar graph in Figure 1 shows, most diamonds are lighter than 2.5 carats. More importantly, carat is not normally distributed. Many diamonds weigh 0.3, 0.4, 0.5, 0.7, 0.9, 1.2 or 1.5 carats, while only a few weigh 0.29, 0.39, 0.49, 0.69, 0.89, 1.19 or 1.49 carats. As mentioned above, the price of a diamond jumps when it reaches 0.5 carat or 1 carat, so jewelers aim for these weights when cutting a raw diamond to maximize its value. Since we have only a few observations just below 0.5 or 1 carat, it is difficult for a model to reflect the price jump between 0.49 and 0.5 carat or between 0.99 and 1 carat. We had better handle these price jumps carefully.

Figure 2 cuts distribution

Cut is the most important element, and cutting technique is under human control, so most diamond dealers sell diamonds with ideal cuts. In our data set, the better the cut grade, the more diamonds there are.

[Figure 2 axes: counts (0 to 25000) against cuts (Fair, Good, VeryGood, Premium, Ideal)]

Figure 3 color distribution

[Figure 3 axes: counts (0 to 12000) against color (J, I, H, G, F, E, D)]

Figure 4 clarity distribution

The color and clarity distributions both look roughly bell-shaped. It is hard for the naked eye to judge a diamond's quality once its clarity is better than grade VS2 and its color is closer to colorless than grade G, so gem dealers do not have to discard diamonds with imperfect clarity or color.

Since depth is simply defined as $\frac{2z}{x+y}$, there is no need to visualize the relation between depth and x, depth and y, or depth and z.
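As a small worked example of the depth definition, the sketch below computes it from x, y and z. The factor of 100 is an assumption inferred from the 43--79 range listed for the depth field, and the measurements are illustrative values, not rows from the data set.

```python
def depth_percentage(x, y, z):
    """Total depth as a percentage: 100 * z / mean(x, y) = 100 * 2z / (x + y)."""
    return 100.0 * 2.0 * z / (x + y)


# Illustrative measurements in mm (assumed, not taken from the data set).
d = depth_percentage(4.0, 4.0, 2.48)
```

Because depth is a deterministic function of x, y and z, plotting it against any one of them adds no information beyond the definition itself.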

From Figure 5, which shows three almost straight lines, we can conclude that y is proportional to x, z is proportional to x, z is proportional to y, and that the three are highly correlated. So we may pick only one variable to represent x, y and z.

[Figure 4 axes: counts (0 to 14000) against clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF)]

Figure 5 x-y, x-z, y-z scatters

Figure 6 x – table scatter

From Figure 6, there appears to be no clear relation between x and table. Since x, y and z are highly correlated, there is probably no clear relation between y and table or between z and table, either.

Since the distribution of x is quite even across all cut grades in Figure 7, x itself has nothing to do with cut: whether a diamond is big or small, it could have the worst cut or the best cut.

Figure 7 x-cut scatter

However, a diamond's depth influences its cut greatly. As shown before, diamonds with an ideal cut far outnumber diamonds with a fair cut. To eliminate the bias caused by different group sizes, I randomly pick about 1610 diamonds in each cut grade. Figure 8 shows that the closer depth is to the range from 55 to 65, the more likely the cut is ideal.

Figure 8 depth – cut scatter

Now we can use depth to represent cut, the most important factor of a price model. In Figure 9, both cut and carat influence price. There is a strong positive correlation between carat and price. Besides, the better a diamond's cut is, the higher its price. More importantly, when carat is fixed, price appears to be normally distributed, and whatever the carat is, depth has the same mean and standard deviation, which suggests that carat and cut are unrelated.

Figure 9 carat-depth-price

Figures 10 and 11 are both carat-price scatter plots. Dots in Figure 10 are colored according to the diamonds' color, while dots in Figure 11 are colored according to clarity. They suggest a power-function relation between carat and price when color and clarity are fixed. Consistent with the description above, a more colorless stone or a purer clarity usually leads to a higher price in Figures 10 and 11. So it is reasonable to set color J as 1, color I as 2, color H as 3 … and to convert I1 to 1, SI2 to 2, SI1 to 3 and so forth.

Figure 10 carat-color-price

Figure 11 carat-clarity-price

4. Statistical Analysis

Since we will later compare traditional statistical analysis with neural networks, I randomly pick 90% of the data (48,546 diamonds) as the training set and let the rest (5,394 diamonds) be the testing set to make the comparison fair.
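A random 90/10 split of this kind can be sketched as follows. The helper name `train_test_split` and the seed are assumptions for illustration; the report does not say which tool or seed produced the actual split.

```python
import random


def train_test_split(n_rows, train_fraction=0.9, seed=0):
    """Shuffle row indices and cut them into a training and a testing part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = round(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]


# 53,940 diamonds in the data set, as stated in the introduction.
train_idx, test_idx = train_test_split(53940)
```

Fixing the split once and reusing the same testing rows for every model is what keeps the later MSE comparison fair.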

Following the 4Cs standard, we first use only cut, color, clarity and carat as independent variables and regard depth as an instrumental variable that may be used if necessary.

Since Figures 10 and 11 suggest a power-function relation between carat and price when cut, color and clarity are fixed, we get the formula:

$\text{price} = \alpha \times \text{carat}^{\beta}$ ($\alpha$ and $\beta$ are unknown parameters for now)

$\log(\text{price}) = \log(\alpha) + \beta \log(\text{carat})$

So we use log(price) and log(carat) instead of price and carat themselves. Another advantage of the log transform is that we can later analyze the percentage change of price caused by a change in cut, color, clarity or carat. Now we can establish the first model:

$\log(\text{price}) = w_0 + w_1 \times \text{numericalCut} + w_2 \times \text{numericalColor} + w_3 \times \text{numericalClarity} + w_4 \times \log(\text{carat})$

and obtain the coefficients shown in the table below.

Dependent Variable: LOG(PRICE)
Method: Least Squares
Date: 12/02/17 Time: 12:23
Sample: 1 48546
Included observations: 48546

Variable            Coefficient   Std. Error   t-Statistic   Prob.
C                   7.561043      0.003316     2280.377      0.0000
NUMERICALCUT        0.032005      0.000606     52.79007      0.0000
NUMERICALCOLOR      0.077869      0.000407     191.2331      0.0000
NUMERICALCLARITY    0.123285      0.000444     277.7187      0.0000
LOG(CARAT)          1.877689      0.001287     1459.116      0.0000

R-squared            0.979329     Mean dependent var      7.786411
Adjusted R-squared   0.979328     S.D. dependent var      1.015213
S.E. of regression   0.145966     Akaike info criterion   -1.010777
Sum squared resid    1034.224     Schwarz criterion       -1.009872
Log likelihood       24539.60     Hannan-Quinn criter.    -1.010493
F-statistic          574938.5     Durbin-Watson stat      1.987218
Prob(F-statistic)    0.000000

Table 1 regression model

The p-value of the F-statistic is near 0, so the equation as a whole is statistically significant. The p-value of every coefficient's t-statistic is near 0, so all variables are statistically significant. The adjusted R-squared is 0.979328, so the model explains about 98% of the variation in log(price).

Firstly, we use the White heteroscedasticity test to check our model.

Heteroskedasticity Test: White

F-statistic           271.9102   Prob. F(14,48531)      0.0000
Obs*R-squared         3530.954   Prob. Chi-Square(14)   0.0000
Scaled explained SS   7743.107   Prob. Chi-Square(14)   0.0000

Test Equation:
Dependent Variable: RESID^2
Method: Least Squares
Date: 12/02/17 Time: 12:33
Sample: 1 48546
Included observations: 48546

Variable                        Coefficient   Std. Error   t-Statistic   Prob.
C                               0.100515      0.003449     29.14721      0.0000
NUMERICALCUT^2                  0.002080      0.000158     13.17050      0.0000
NUMERICALCUT*NUMERICALCOLOR     0.000122      0.000109     1.124595      0.2608
NUMERICALCUT*NUMERICALCLARITY   0.001098      0.000123     8.954261      0.0000
NUMERICALCUT*LOG(CARAT)         -0.002102     0.000361     -5.826661     0.0000
NUMERICALCUT                    -0.021523     0.001205     -17.86120     0.0000
NUMERICALCOLOR^2                0.000453      6.80E-05     6.654784      0.0000
NUMERICALCOLOR*NUMERICALCLAR    -0.000308     8.40E-05     -3.660054     0.0003
NUMERICALCOLOR*LOG(CARAT)       0.001876      0.000237     7.929443      0.0000
NUMERICALCOLOR                  -0.001799     0.000768     -2.342169     0.0192
NUMERICALCLARITY^2              0.002286      7.35E-05     31.11894      0.0000
NUMERICALCLARITY*LOG(CARAT)     0.003184      0.000266     11.97124      0.0000
NUMERICALCLARITY                -0.023094     0.000829     -27.87292     0.0000
LOG(CARAT)^2                    0.028311      0.000707     40.05595      0.0000
LOG(CARAT)                      0.009504      0.001965     4.836121      0.0000

Table 2 White heteroscedasticity test

The p-value of the F-statistic is near 0, which is smaller than 0.05, so the test rejects the null hypothesis of homoscedasticity: heteroscedasticity is present. The least squares coefficient estimates remain unbiased, but the reported standard errors and t-statistics should be read with caution; heteroscedasticity-robust standard errors or weighted least squares would address this.

Secondly, Figure 12 shows that the residuals of the model are evenly distributed, so there may be no autocorrelation problem. In the regression output, the Durbin-Watson statistic is 1.987218, which is very close to 2, so there is no evidence of first-order autocorrelation. The Durbin-Watson statistic only addresses first-order autocorrelation; higher orders would need a separate check such as the Breusch-Godfrey test.

Figure 12 residues
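To make the mechanics of the White test concrete, here is a minimal NumPy sketch of its Obs*R-squared statistic: the squared residuals are regressed on the regressors, their squares and their cross products, and n times the R-squared of that auxiliary regression is compared with a chi-square critical value. The function name `white_lm_statistic` and the synthetic data are assumptions for illustration; this is not the Eviews routine used for the table. With four regressors the auxiliary regression has 14 terms, which matches the 14 degrees of freedom reported above.

```python
import numpy as np
from itertools import combinations_with_replacement


def white_lm_statistic(X, residuals):
    """Return (n * R^2, degrees of freedom) of White's auxiliary regression.

    X holds the regressors WITHOUT the constant column.
    """
    n, k = X.shape
    cols = [np.ones(n)] + [X[:, j] for j in range(k)]
    # Squares and cross products of every pair of regressors.
    cols += [X[:, i] * X[:, j]
             for i, j in combinations_with_replacement(range(k), 2)]
    Z = np.column_stack(cols)
    u2 = residuals ** 2
    coef, *_ = np.linalg.lstsq(Z, u2, rcond=None)
    ss_res = np.sum((u2 - Z @ coef) ** 2)
    ss_tot = np.sum((u2 - u2.mean()) ** 2)
    return n * (1.0 - ss_res / ss_tot), Z.shape[1] - 1


# Synthetic example (not the diamonds data): noise variance grows with X[:, 0].
rng = np.random.default_rng(1)
n = 2000
X = rng.uniform(0.0, 2.0, size=(n, 2))
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(0.0, 1.0, n) * (0.2 + X[:, 0])

A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
lm, df = white_lm_statistic(X, y - A @ beta)
```

A statistic far above the chi-square critical value for `df` degrees of freedom (about 11.07 at the 5% level for 5 degrees of freedom) means that homoscedasticity is rejected.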

Thirdly, all t-statistics in Table 1 are large, which suggests that multicollinearity is not inflating the standard errors; a variance inflation factor check would confirm this.

Fourthly, the covariance between cut and the residuals is $-1.06 \times 10^{-14}$, between color and the residuals $-1.13 \times 10^{-14}$, between clarity and the residuals $-1.67 \times 10^{-14}$, and between log(carat) and the residuals $-1.58 \times 10^{-14}$. These values are zero up to rounding error. Least squares residuals are uncorrelated with the regressors by construction, so this check alone cannot rule out a stochastic explanatory variables problem. In the absence of other evidence of one, we do not use variables like x, y, z, depth and table as instrumental variables in our model.

In the end, we get the model by the traditional statistical approach:

$\log(\text{price}) = 7.56104309911 + 0.0320045461853 \times \text{numericalCut} + 0.0778686923705 \times \text{numericalColor} + 0.123285127953 \times \text{numericalClarity} + 1.87768943035 \times \log(\text{carat})$
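A least squares fit of a model of this form can be sketched with NumPy. The data below are synthetic and the "true" coefficients are assumed values chosen only to resemble the fitted model; the report itself obtained its coefficients with Eviews on the Kaggle data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-in for the data set (NOT the Kaggle data): grades are drawn
# uniformly and log(price) follows an assumed linear relation plus noise.
cut = rng.integers(1, 6, n)       # 1..5
color = rng.integers(1, 8, n)     # 1..7
clarity = rng.integers(1, 9, n)   # 1..8
carat = rng.uniform(0.2, 3.0, n)
true_w = np.array([7.5, 0.03, 0.08, 0.12, 1.9])  # assumed values

X = np.column_stack([np.ones(n), cut, color, clarity, np.log(carat)])
log_price = X @ true_w + rng.normal(0.0, 0.15, n)

# Ordinary least squares: minimise ||X w - log_price||^2.
w_hat, *_ = np.linalg.lstsq(X, log_price, rcond=None)

residuals = log_price - X @ w_hat
r_squared = 1.0 - residuals.var() / log_price.var()
```

The last entry of `w_hat` estimates the exponent of the power-function relation between carat and price, which is why the log transform makes that exponent directly readable from the regression.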

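Assume we have a model $\log(y) = \beta_0 + \beta_1 x$. Then

$\Delta \log(y) = \beta_1 \Delta x$

When $\Delta x \to 0$, $\frac{\Delta y}{y} \approx \beta_1 \Delta x$, that is, $\%\Delta y \approx (100 \times \beta_1) \times \Delta x$.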

So when cut is upgraded to the adjacent higher grade, price increases by about 3.20%.

When color is upgraded to the adjacent higher grade, price increases by about 7.79%.

When clarity is upgraded to the adjacent higher grade, price increases by about 12.33%.

Because carat enters the model in logs, when carat increases by 1%, price increases by about 1.88%.
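The percentages above rely on the first-order approximation $100 \times \beta_1$. For a full one-grade step the exact effect is $100 \times (e^{\beta_1} - 1)$, which is slightly larger. The sketch below compares the two using the fitted coefficients; the variable names are chosen for illustration.

```python
import math

# Fitted coefficients of the Econometrics model.
coefficients = {
    "cut": 0.0320045461853,
    "color": 0.0778686923705,
    "clarity": 0.123285127953,
}

effects = {}
for name, beta in coefficients.items():
    approx = 100.0 * beta                    # first-order approximation
    exact = 100.0 * (math.exp(beta) - 1.0)   # exact effect of a one-grade step
    effects[name] = (approx, exact)
    print(f"{name}: approx {approx:.2f}%, exact {exact:.2f}%")

# For the log-log carat term, a 1% increase in carat multiplies price
# by 1.01 ** beta.
carat_effect = 100.0 * (1.01 ** 1.87768943035 - 1.0)
```

The gap between the approximate and exact figures grows with the size of the coefficient, so it matters most for clarity.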

It seems that clarity is more important than color and color is more important than cut, which contradicts the background knowledge about diamonds. One reason is that the data set is not evenly distributed across grades: we have only a few diamonds with a fair cut but many with an ideal cut, as shown in Figure 2. Another reason is that cutting is always under human control while color and clarity are determined by nature, so it is in diamond dealers' interest to exaggerate and emphasize the impact of cut to make more money.

5. Prediction by neural networks

At first, I assume the price function is continuous, so I use Matlab's neural network toolbox to construct a one-layer network, that is, a network with one hidden layer.

Since the only critical parameter we can change in the toolbox is the number of neurons in the hidden layer, I divide my training set into an actual training set of 43,692 diamonds and a validation set of 4,854 diamonds. I vary the number of neurons from 1 to 20, and for every number of neurons I run 100 trials and keep the best performance.

Figure 13 performance – number of neurons

As Figure 13 shows, when the number of neurons is greater than or equal to 10, the performance is quite stable, and the best result is reached with 17 neurons in the hidden layer. The mean square error on the testing set (not the validation set) is 1752824777.787574.

Figure 14 the structure of the best neural network

Figure 15 Gradient, Mu and Validation checks

Figure 16 Error Histograms

Figure 17 performance during the best trial

Figure 18 regression plot

A neural network with one hidden layer can approximate continuous functions, while a multiple-layer perceptron with more hidden layers can also fit discontinuous functions. As mentioned in the background section, diamond prices usually jump at the full- and half-carat weights, so the price function may be discontinuous. Figures 10 and 11 suggest that the price function looks quite continuous, and Figure 1 shows that we have few diamonds weighing 0.99 or 0.49 carat, so the discontinuity may not be a problem here; still, a deeper network is worth trying.

I use TensorFlow to construct a deep neural network as shown below.

Figure 18 panorama of deep neural network

Figure 19 details inside DNN (the red box in figure 18)

Since we use 17 neurons in the one-layer neural network, it seems reasonable to use 20 neurons in total, with 10 neurons in each of two hidden layers. After 2 hours of training, the loss function on the training set eventually becomes stable, as shown in Figure 20.

Figure 20 loss function of training set

However, its MSE on the testing set is 2693111703.808056, worse than that of the one-layer neural network, which can be trained in half a minute. So we stop here and conclude that one hidden layer with 17 neurons is the best model.

The built-in Matlab neural network toolbox is a black box in which we cannot set the learning rate, the momentum, the number of epochs between convergence checks and many other crucial parameters. To get a better neural network, I modify our professor's Matlab code for the back-propagation multi-layer perceptron and set up a model with 1 hidden layer of 17 neurons. To keep the later comparison fair, I use 3-way cross validation on the training set only and obtain the relation between the mean square error and the combination of learning rate and momentum.
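The idea of tuning the learning rate and momentum of a one-hidden-layer network can be sketched in NumPy as below. This is not the professor's Matlab code: the tanh activation, full-batch updates, weight initialisation, epoch count, single validation split (instead of 3-way cross validation) and synthetic target are all assumptions made to keep the example short and self-contained.

```python
import numpy as np


def train_mlp(X, y, hidden=17, lr=0.04, momentum=0.86, epochs=500, seed=0):
    """One tanh hidden layer, linear output, gradient descent with momentum."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    params = [rng.normal(0, 0.5, (d, hidden)), np.zeros(hidden),
              rng.normal(0, 0.5, (hidden, 1)), np.zeros(1)]
    velocity = [np.zeros_like(p) for p in params]
    for _ in range(epochs):
        W1, b1, W2, b2 = params
        h = np.tanh(X @ W1 + b1)               # forward pass
        err = (h @ W2 + b2).ravel() - y        # prediction error
        # Backward pass for the loss 0.5 * mean(err ** 2).
        g_out = err[:, None] / n
        g_h = (g_out @ W2.T) * (1.0 - h ** 2)
        grads = [X.T @ g_h, g_h.sum(0), h.T @ g_out, g_out.sum(0)]
        for p, v, g in zip(params, velocity, grads):
            v *= momentum                      # keep a fraction of the last step
            v -= lr * g                        # add the new gradient step
            p += v                             # in-place parameter update
    W1, b1, W2, b2 = params
    return lambda Z: (np.tanh(Z @ W1 + b1) @ W2 + b2).ravel()


# Synthetic regression target (not diamond prices).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (600, 2))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] ** 2
X_tr, y_tr, X_va, y_va = X[:450], y[:450], X[450:], y[450:]


def val_mse(lr, momentum):
    """Validation MSE of a network trained with the given hyperparameters."""
    predict = train_mlp(X_tr, y_tr, lr=lr, momentum=momentum)
    return float(np.mean((predict(X_va) - y_va) ** 2))


# Small grid over learning rate and momentum; the pair with the lowest
# validation MSE is selected.
grid = {(lr, m): val_mse(lr, m) for lr in (0.01, 0.04) for m in (0.0, 0.86)}
best = min(grid, key=grid.get)
```

A real search would use a finer grid and average the validation error over several folds, as the 3-way cross validation in the text does.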

Figure 21 learning rate – momentum – MSE

From Figure 21, the best learning rate is 0.04 and the best momentum is 0.86. With these parameters, we get a quite small mean square error (691834782.0560297) on the testing set, which is only about one-fifth of the best statistical model's mean square error.

6. Prediction Comparison

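The author of the report Shine bright like a diamond proposed 5 linear models with a log transform:

Formula 1: $\log(\text{price}) = w_0 + w_1 \times \text{carat}^{1/3}$

Formula 2: $\log(\text{price}) = w_0 + w_1 \times \text{carat}^{1/3} + w_2 \times \text{carat}$

Formula 3: $\log(\text{price}) = w_0 + w_1 \times \text{carat}^{1/3} + w_2 \times \text{carat} + w_3 \times \text{cut}$

Formula 4: $\log(\text{price}) = w_0 + w_1 \times \text{carat}^{1/3} + w_2 \times \text{carat} + w_3 \times \text{cut} + w_4 \times \text{color}$

Formula 5: $\log(\text{price}) = w_0 + w_1 \times \text{carat}^{1/3} + w_2 \times \text{carat} + w_3 \times \text{cut} + w_4 \times \text{color} + w_5 \times \text{clarity}$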

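By using Eviews or Excel, we can get the coefficients:

Formula 1: $\log(\text{price}) = 2.82034553075 + 5.55851129815 \times \text{carat}^{1/3}$

Formula 2: $\log(\text{price}) = \ldots + \ldots \times \text{carat}^{1/3} - 1.13661881495 \times \text{carat}$

Formula 3: $\log(\text{price}) = \ldots + \ldots \times \text{carat}^{1/3} - 1.16623519277 \times \text{carat} + 0.0552672749281 \times \text{cut}$

Formula 4: $\log(\text{price}) = \ldots + \ldots \times \text{carat}^{1/3} - 1.03155001669 \times \text{carat} + 0.0565930262223 \times \text{cut} + 0.0624816274995 \times \text{color}$

Formula 5: $\log(\text{price}) = -0.572161029767 + 9.31558417329 \times \text{carat}^{1/3} - 1.16644626416 \times \text{carat} + 0.0329350331621 \times \text{cut} + 0.0776484129798 \times \text{color} + 0.122328957467 \times \text{clarity}$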

Now we can compare his models' mean squared errors with my results on the testing set:

model                                 testing set's MSE
The others' formula 1                 26875409211.834496
…
Econometrics model                    4085100041.590896
The most predictive statistic model   3359615842.0380077
One-layer network                     1752824777.787574
Deep neural network                   2693111703.808056
Refined network                       691834782.0560297

Table 3 MSE on testing data set

As Table 3 shows, the most predictive of the other author's formulas is formula 5, whose MSE is 3445772154.118181. It predicts a little better than my Econometrics model. However, the most significant work for statisticians is explaining what has happened and what is going on. In other words, the advantage of a statistical model is its power to explain the influence of the independent variables. I can conclude that when carat increases 1%, price increases about 1.88%, while no similar conclusion follows from formula 5, since both $\text{carat}$ and $\text{carat}^{1/3}$ appear on the right-hand side.
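A short derivation shows why formula 5 gives no single percentage effect for carat. Differentiating its right-hand side with respect to log(carat) yields an elasticity that depends on carat itself (here $w_1$ and $w_2$ are the coefficients of $\text{carat}^{1/3}$ and $\text{carat}$):

```latex
\frac{\partial \log(\text{price})}{\partial \log(\text{carat})}
  = \text{carat} \cdot \frac{\partial \log(\text{price})}{\partial\, \text{carat}}
  = \frac{w_1}{3}\,\text{carat}^{1/3} + w_2\,\text{carat}
```

With the fitted values $w_1 = 9.31558417329$ and $w_2 = -1.16644626416$, this gives roughly 1.94 at 1 carat and roughly 1.88 at 0.5 carat, so the effect of a 1% change in carat varies with the weight of the stone, whereas the log-log Econometrics model has a constant elasticity.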

Besides, the author of the report Shine bright like a diamond does not explain how he derived the formula or why he set the power to $\frac{1}{3}$. In fact, I can get a model with an even higher adjusted R-squared and a lower mean square error on the testing set by setting the power to 0.62.

$\log(\text{price}) = 2.08191951735 + 9.20137897266 \times \text{carat}^{0.62} - 3.69894098461 \times \text{carat} + 0.0333092842757 \times \text{cut} + 0.077985772295 \times \text{color} + 0.121101322659 \times \text{clarity}$

However, with this model I likewise lose the ability to explain the influence of a change in carat, so I prefer my original Econometrics model.

Although the adjusted R-squared of the Econometrics model is 0.98, which is remarkably high, its mean square error is still about 5.9 times that of the best neural network, as shown in Table 3.
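The relative sizes of the errors in Table 3 can be checked with a few lines of arithmetic. The dictionary below simply restates the MSE values from the table; the variable names are chosen for illustration.

```python
# Testing-set MSE values copied from Table 3.
mse = {
    "Econometrics model": 4085100041.590896,
    "The most predictive statistic model": 3359615842.0380077,
    "One-layer network": 1752824777.787574,
    "Deep neural network": 2693111703.808056,
    "Refined network": 691834782.0560297,
}

# Ratio of each model's MSE to that of the refined network.
best = mse["Refined network"]
ratios = {name: value / best for name, value in mse.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} times the refined network's MSE")
```

Expressing every model relative to the refined network makes the gap between the statistical models and the tuned network easy to read off.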

7. Conclusion

Both statistical models and neural networks play a big part in data mining. A statistical model combined with data visualization gives us insight into the relations in the data and the relative importance of every variable, while a neural network is good at fitting data and predicts more precisely. Since neither can replace the other, we had better use statistics to analyze the past and neural networks to predict the future.

Reference

[1] Vivek Mangipudi, Diamonds are Forever

[2] Benjamin Lott, Diamond Cut's Prediction with XGBoost

[3] Jeffrey M. Wooldridge, Econometric Analysis of Cross Section and Panel Data (MIT Press)

[4] Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning

[5] Michael Nielsen, A visual proof that neural nets can compute any function, http://michaelnielsen.org/

Dependent Variable: LOG(PRICE)
Method: Least Squares
Date: 12/03/17 Time: 13:49
Sample: 1 48546
Included observations: 48546

Variable            Coefficient   Std. Error   t-Statistic   Prob.
C                   2.081920      0.007306     284.9576      0.0000
CARAT^(0.62)        9.201379      0.017928     513.2458      0.0000
CARAT               -3.698941     0.011542     -320.4752     0.0000
NUMERICALCUT        0.033309      0.000577     57.71657      0.0000
NUMERICALCOLOR      0.077986      0.000392     199.1133      0.0000
NUMERICALCLARITY    0.121101      0.000422     287.0655      0.0000
