
Appendix A: Technical Details for Convolutional Neural Networks

Convolutional neural networks (CNNs) are based on artificial neural networks (ANNs), a class of machine learning algorithms. Briefly speaking, a basic ANN has three layers: an input layer, a hidden layer, and an output layer. In theory, the universal approximation theorem asserts that a sufficiently large ANN can approximate any function, including those underlying image recognition tasks. In practice, however, using such an ANN for image recognition would require a network too large to be feasible on modern computers. CNNs were introduced to solve this problem. Unlike the basic ANN described above, which has only one hidden layer, CNNs usually have multiple hidden layers. The intuition behind this structure is that a CNN divides the image recognition process into smaller steps, each carried out by a different layer of the network. At the very beginning, a convolutional layer (one of the first few layers) looks for a specific feature, such as a straight line or a curve. Deeper layers then put these small features together into more complex features, such as circle-like or eye-like shapes. Going further, at the output layer, these complex features are put together into objects such as computers, tables, and trees. Technically, due to the local connectivity property, a single convolutional layer cannot recognize a large object, only part of it. With many convolutional layers, however, the network can gradually move from small parts to a complete picture of the object.

State-of-the-art CNNs can have dozens of layers and millions of parameters, and they achieve accuracy rates of more than 95% (Denil et al. 2013). Since such architectures are highly complex, they are trained on huge datasets with millions of training examples. This training takes days on GPUs, processors specifically designed for image processing that are significantly faster than CPUs for such jobs. We decided not to train our own CNN-based image classifier for two reasons. First, we do not have the hardware (e.g., GPUs) to train CNNs within an affordable amount of time. Second, training CNNs requires very large datasets of labeled images; in our task, the images are highly differentiated, requiring a comprehensive training dataset. We do not have sufficient training examples or the time resources to manually label the training images.


In addition, deep learning, especially CNNs, is already a well-established topic in the computer science literature. We believe a detailed training algorithm here would not provide an innovative contribution, and such a contribution is beyond the scope of the current manuscript. We therefore outsourced the training task to a start-up, Imagga, which uses its own training set, including both publicly available image databases (e.g., ImageNet, an online database with 15 million labeled images for academic use) and its own data sources. At the time of our research, this company was one of the leading start-ups in the field and had a huge training set. However, the work could also be outsourced to similar application programming interfaces (APIs; i.e., services through which we can obtain a firm's functionality), such as Clarifai and Cloud Vision.

Imagga trains its CNNs on these huge training sets using GPUs. Although the company's CNNs are not specifically designed for our task, they still achieve a much higher accuracy rate than our own algorithm trained on small datasets, and they maintain a leading position in academic image recognition competitions. We accessed the company's algorithm through its API to process our images, and the recognized image classes (e.g., building) were then returned to us for further analysis.

In particular, the API returned the probability that an image features a given object; for instance, for one example image, the probability that it features a guitar is 54.9%.¹

¹ Note that the company has been updating its algorithms and training data; these values are thus not up to date.
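For illustration only, here is a minimal sketch of how such an image-tagging service might be queried from Python. The endpoint URL, authentication header, and JSON response shape are placeholders rather than Imagga's actual interface; the only assumption taken from the text is that the service returns, for each image, object classes with associated probabilities.

```python
import requests  # assumes the requests library is installed


def get_tag_probabilities(image_path, api_url, api_key):
    """Send an image to a (hypothetical) tagging endpoint and return
    a dict mapping object classes to probabilities in [0, 1].

    The URL, auth header, and JSON layout below are placeholders; a real
    service such as Imagga, Clarifai, or Cloud Vision defines its own.
    """
    with open(image_path, "rb") as f:
        response = requests.post(
            api_url,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"image": f},
        )
    response.raise_for_status()
    payload = response.json()
    # Assumed response shape: {"tags": [{"label": "guitar", "probability": 0.549}, ...]}
    return {t["label"]: t["probability"] for t in payload.get("tags", [])}


# Example (placeholder URL and key):
# probs = get_tag_probabilities("frame_001.jpg", "https://api.example.com/v1/tags", "MY_KEY")
# probs.get("guitar")  # e.g., 0.549 for the guitar example mentioned above
```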


We convert the continuous probability to a discrete dummy in the following way: if the probability is above 10%, then the dummy is 1; otherwise, it is 0.² (A minimal sketch of this conversion is given after the list of reasons below.) Note that the choice of this cut-off point applies only to the current research and may not generalize to other software or other contexts. Based on human judgment, this cut-off point best fits our research purpose. The cut-off point of 10% is somewhat low for a few reasons:

1) The API (Imagga) that we rely on itself assigns relatively low probabilities in order to reduce Type I error.

2) These images are video frames rather than professional photographs, implying that the average image quality is low (many images are blurred) and the algorithm returns low probabilities.

3) In many frames, the feature appears only in a corner of the image, or only part of the feature appears in the image. The algorithm returns low probabilities for these items; however, for our research purpose, we would like to include them.
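As a minimal sketch of this conversion, assuming hypothetical tag labels and applying the 10% threshold together with the take-the-maximum rule over multiple tags in a category (see footnote 2):

```python
# Convert continuous tag probabilities (from the API) into a 0/1 dummy.
# The 0.10 threshold follows the cut-off discussed above; the tag lists are
# illustrative placeholders, not the exact labels returned by the API.

HUMAN_TAGS = ["person", "man", "woman", "people"]          # hypothetical labels
INSTRUMENT_TAGS = ["guitar", "piano", "violin", "drum"]    # hypothetical labels
THRESHOLD = 0.10


def category_dummy(tag_probs, category_tags, threshold=THRESHOLD):
    """Return 1 if any tag in the category exceeds the threshold, else 0.

    tag_probs: dict mapping tag label -> probability in [0, 1], as returned
    by the image-tagging API sketched earlier.
    """
    return int(any(tag_probs.get(tag, 0.0) > threshold for tag in category_tags))


# Example with made-up probabilities for one video frame:
frame_probs = {"guitar": 0.549, "person": 0.34, "table": 0.12}
human_dummy = category_dummy(frame_probs, HUMAN_TAGS)            # 1 (0.34 > 0.10)
instrument_dummy = category_dummy(frame_probs, INSTRUMENT_TAGS)  # 1 (0.549 > 0.10)
```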

For the sake of brevity, we provide some examples of the labeled images below (only the probability returned for each example image is reproduced here):

Images featuring humans: 34.0%, 23.1%, 16.5%, 54.0%, 20.4%, 25.8%

Images featuring instruments: 25.4%, 100%, 100%, 11.2%, 14.1%, 51.7%

² In this research, we convert the continuous probabilities into dummy variables for the following reason. An image frame sometimes features multiple objects that fall into the same category (e.g., an image can feature both a guitar and a wind instrument), but the algorithm returns different probabilities for them. It is thus unclear how to combine these probabilities. Currently, we set the dummy to 1 whenever one of the probabilities exceeds the threshold (10%).

References

Denil M, Shakibi B, Dinh L, de Freitas N, et al. (2013) Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, 2148–2156.


Appendix B: Measuring Audio Content

We follow the standard computer science and acoustics approach to control for audio content. We refer readers to Giannakopoulos and Pikrakis (2014) for detailed technical descriptions.

Digital audio signals are sampled from natural sounds. Let x_i(n), n = 1, ..., L, be the sequence of audio samples of the i-th frame, where L is the length of that frame. First, we calculate the average zero crossing rate (ZCR) of the audio content, which is based on the rate of sign changes within each audio frame. More precisely,

Z_i = \frac{1}{2L} \sum_{n=1}^{L} \left| \operatorname{sign}[x_i(n)] - \operatorname{sign}[x_i(n-1)] \right|

and ZCR is the average of Z_i over all frames. ZCR can be interpreted as a measure of the "noisiness" of a signal, as it usually exhibits higher values for noisy signals. ZCR has frequently been used in acoustic studies for speech detection, speech-music discrimination, and music genre classification (Panagiotakis and Tziritas 2005).
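A minimal sketch of the ZCR computation, assuming NumPy and non-overlapping frames of an illustrative length:

```python
import numpy as np


def zero_crossing_rate(frame):
    """Z_i for one frame: average rate of sign changes, per the formula above."""
    signs = np.sign(frame)
    # Treat exact zeros as positive so that sign() never adds spurious changes.
    signs[signs == 0] = 1
    return np.sum(np.abs(np.diff(signs))) / (2.0 * len(frame))


def average_zcr(samples, frame_length):
    """ZCR measure: Z_i averaged over all non-overlapping frames of the signal."""
    n_frames = len(samples) // frame_length
    frames = samples[: n_frames * frame_length].reshape(n_frames, frame_length)
    return float(np.mean([zero_crossing_rate(f) for f in frames]))


# Example with a synthetic signal (white noise yields a high ZCR):
# rng = np.random.default_rng(0)
# average_zcr(rng.standard_normal(44100), frame_length=1024)
```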

The second measure we construct is the short-term energy (energy) of the audio content, which is computed according to the following equation:

E_i = \frac{1}{L} \sum_{n=1}^{L} x_i^2(n)

E_i is averaged over the frames to obtain the energy measure. Energy reflects the power of an audio file and is expected to change rapidly over the file. We use another measure, entropy, to capture this rapid change in energy. Each frame is divided into sub-frames, and for each sub-frame j we calculate its sub-frame energy value

e_j = \frac{E_{\mathrm{subframe}\,j}}{\sum_k E_{\mathrm{subframe}\,k}}

and the entropy of that frame is

H(i) = -\sum_{j=1}^{K} e_j \log_2(e_j)


and we further average H(i) over all frames to obtain the entropy measure. Energy and entropy are used for genre classification and emotion detection (Giannakopoulos et al. 2007).
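A minimal sketch of the short-term energy and energy-entropy computations for a single frame; the number of sub-frames K is an illustrative choice, not a value reported here:

```python
import numpy as np


def short_term_energy(frame):
    """E_i = (1/L) * sum of squared samples, as defined above."""
    return float(np.mean(frame ** 2))


def energy_entropy(frame, n_subframes=10):
    """H(i): entropy of normalized sub-frame energies within one frame.

    The number of sub-frames (K = 10 here) is an illustrative choice.
    """
    sub_length = len(frame) // n_subframes
    subframes = frame[: sub_length * n_subframes].reshape(n_subframes, sub_length)
    sub_energies = np.sum(subframes ** 2, axis=1)
    e = sub_energies / np.sum(sub_energies)   # e_j, normalized to sum to 1
    e = e[e > 0]                              # avoid log2(0)
    return float(-np.sum(e * np.log2(e)))


# The energy and entropy controls are these quantities averaged over all frames.
```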

We then apply the discrete Fourier transform (DFT) to the signal and obtain the spectral coefficients of an audio frame, X_i(k), k = 1, ..., FL. Using the spectral coefficients, we calculate two measures, spectral centroid (brightness) and spectral entropy. The spectral centroid is defined as

C_i = \frac{\sum_{k=1}^{FL} k\, X_i(k)}{\sum_{k=1}^{FL} X_i(k)}

Research has shown that higher values of the spectral centroid correspond to brighter sounds; hence we call this measure brightness. Spectral entropy is analogous to entropy, and we omit the details here. Together, these frequency-domain measures have proven effective in genre classification, emotion detection, and speech-music discrimination (Misra et al. 2004).
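A minimal sketch of the spectral centroid computation from the DFT magnitudes of one frame:

```python
import numpy as np


def spectral_centroid(frame):
    """C_i: magnitude-weighted average of DFT bin indices, per the formula above."""
    spectrum = np.abs(np.fft.rfft(frame))   # magnitude spectral coefficients X_i(k)
    k = np.arange(1, len(spectrum) + 1)     # bin indices k = 1, ..., FL
    return float(np.sum(k * spectrum) / np.sum(spectrum))


# Spectral entropy is computed analogously to the energy entropy above,
# but over normalized spectral values rather than sub-frame energies.
```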

As noted, these measures can be used for different audio analysis tasks ranging from speech-music classification to emotion detection. Because a typical video in our data comprises speech, music, and a mixture of emotions, we do not train a model to classify the audio content (which is not the focus of our research). We simply take these measures as controls for the audio content of a video.


Appendix C: Robustness Check Using Data from the Technology Category

To demonstrate the robustness of our visual measures, we replicate our analysis in the technology category of Kickstarter. The technology category is smaller than the music category on Kickstarter. For the technology category, our data include all technology projects in the following six US states: California, Illinois, Massachusetts, New York, Texas, and Washington. It comprises all completed projects in these states from the inception of Kickstarter to March 6, 2017. Together, we have 6,958 observations, among which 5,291 projects have a video. Almost all technology projects aim to offer new technology products (e.g., glasses, watches, cameras). All the measures and variables used here are essentially the same as those we use for the music category (except for the CNN variables, where we replace musical instruments with computers, as computers are commonly featured in technology projects).

Table C1 summarizes the logistic regression results for the technology category (with the same control variables used in the paper). The non-video-related control variables are omitted from the table.
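For concreteness, a minimal sketch of this kind of logistic regression in Python (statsmodels), with illustrative variable names and an abbreviated control set:

```python
import pandas as pd
import statsmodels.formula.api as smf

# df is assumed to hold one row per project, with the variables described in the
# text (illustrative column names; the full control set is abbreviated here).
df = pd.read_csv("technology_projects.csv")  # hypothetical file

formula = (
    "success ~ log_duration + I(log_duration**2) "
    "+ visual_variation + I(visual_variation**2) "
    "+ zcr + energy + entropy + brightness + spectral_entropy "
    "+ target + menu_length + creator_experience + price + word_count"
)

# Column (2)-style specification: projects that have a video.
model = smf.logit(formula, data=df[df["has_video"] == 1]).fit()
print(model.summary())
```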

Table C1: Logistic regression results for the technology category

                                          Success
                                   (1)                  (2)
Video                        1.348*** (0.120)
Log(Duration)                                     2.280*** (0.586)
Log(Duration) - Squared                          -0.221*** (0.060)
Visual Variation                                  3.102*** (0.602)
Visual Variation - Squared                       -1.301*** (0.462)
ZCR                                               2.514    (4.869)
Energy                                           -5.590*** (2.040)
Entropy                                           0.732**  (0.290)
Brightness                                        0.443    (2.327)
Spectral Entropy                                 -0.441    (0.542)
Observations                     6,958                5,291
Log Likelihood               -2,963.649           -2,562.247
Akaike Inf. Crit.             5,979.298            5,192.493

Note: Column (1) uses data on all projects. Column (2) uses data on projects with a video. Regressions include target, project duration, menu length, creator experience, price, word count, sentiments, and genre. Standard errors in parentheses. **p<0.05 ***p<0.01

The linear effect of video duration is positive and significant, whereas the nonlinear effect of video duration is negative and significant, thereby suggesting that the marginal effect of duration is decreasing.

The effects of visual variation are similar. In Table C1, the coefficient of visual variation is positive and significant, whereas the coefficient of visual variation - squared is negative and significant, which is again consistent with the optimal stimulation level theory.

So far, we have discussed the effects of video duration and visual variation. These effects may depend on project and creator characteristics. In Appendix D, we discuss the effects of project size and creators' past crowdfunding experiences.

Finally, we replicate the analysis on the technology category to demonstrate robustness. As mentioned, we construct two CNN variables capturing the content of the videos. Again, we construct a dummy denoting whether or not a video features humans. Unlike music projects, technology projects do not feature musical instruments, so we do not use that variable. Instead, we use a similar tool in technology: computers (including both desktops and laptops). Creators often use computers to show the software or programs they have developed. As in the music category, the two CNN variables could potentially help creators communicate credibility to buyers. As can be seen from Table C2, the effects of both humans and computers are positive and significant, thereby supporting the credibility effect.


Table C2: The effects on project success in the technology category

                                              Success
                                   (1)                (2)                (3)
Human                        0.498*** (0.093)                      0.510*** (0.094)
Computer                                        0.187**  (0.077)  0.207*** (0.077)
Log(Duration)                2.050*** (0.581)   2.298*** (0.586)  2.063*** (0.582)
Log(Duration) - Squared     -0.200*** (0.060)  -0.223*** (0.060) -0.202*** (0.060)
Visual Variation             2.959*** (0.605)   3.042*** (0.604)  2.886*** (0.607)
Visual Variation - Squared  -1.194**  (0.464)  -1.263*** (0.463) -1.148**  (0.466)
Observations                    5,291              5,291             5,291
Log Likelihood              -2,547.636         -2,559.297        -2,544.046
Akaike Inf. Crit.            5,165.272          5,188.593         5,160.092

Note: This table uses data on projects with a video. Regressions include audio controls, target, project duration, menu length, creator experience, price, word count, sentiments, and genre. Standard errors in parentheses. **p<0.05 ***p<0.01


Appendix D: Additional Analysis for Crowdfunding Application

In the main text, we described the main effects of the video-related variables. We now extend the analysis to study the interaction between visual information and other project content. We intend to show that the video measures also affect project success through other factors.

Effect of Visual Information on Project Success: Interaction with Project Size

In this section, we consider the interaction between the video measures and project size. The rationale is that, when evaluating a large project, the associated risk is usually higher, so potential buyers may value the information provided more and are less likely to feel bored when watching the related video. Project size can be defined by the price of the offerings or by the total target set by the creators. Since the majority of Kickstarter projects have multiple offerings (and prices), it is not easy to identify a unique price for each project and use it for analysis. For instance, how would we compare a project with two prices ($5 and $100) to another project with a single price ($20)? In contrast, each project has a unique target that can readily be used for analysis. In our data, the median target level is $5,000. We follow a median-split strategy and divide our sample of 6,822 projects into two types. We create a dummy variable, Large (= 1), to indicate projects that call for more than $5,000.³ Since many projects call for exactly $5,000, only 43.5% of the projects qualify as large (see Figure D1).

We then run the logistic regression including the interaction between project size and video duration, as well as the interaction between project size and visual variation, i.e., Video Duration × Large and Visual Variation × Large. The control variables are the same as in the basic model, except that we now replace the funding target with Large.
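A minimal sketch of the median split and interaction specification, again with illustrative variable names and an abbreviated control set:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("music_projects.csv")  # hypothetical file, one row per project

# Median split on the funding target: Large = 1 for targets above the $5,000 median.
median_target = df["target"].median()          # $5,000 in the text
df["large"] = (df["target"] > median_target).astype(int)

# Interactions of Large with video duration and visual variation (controls abbreviated).
formula = (
    "success ~ log_duration + I(log_duration**2) "
    "+ visual_variation + I(visual_variation**2) "
    "+ large + log_duration:large + visual_variation:large "
    "+ zcr + energy + entropy + brightness + spectral_entropy"
)

model = smf.logit(formula, data=df).fit()
print(model.summary())
```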

The results are summarized in Table D1 below. Table D1 shows that the interaction between project size and video duration is positive and significant (p-value < .01), suggesting that, for projects with bigger targets, the tedium effect is weaker and buyers value longer videos. In addition, the interaction between project size and visual variation is positive and significant, implying that the optimal stimulation level could be higher when buyers evaluate large projects.

³ The results are not qualitatively altered when we use the continuous variable Target instead of the dummy variable Large.

Figure D1: The distribution of project targets. (Histogram; horizontal axis: project target in bins from $500 to $14,500 plus "More than 15000"; vertical axis: number of projects, 0 to 1,600.)

Table D1: Does the effect of visual information depend on project size?

                                                        Success
                                   (1)                (2)                (3)                (4)
Log(Duration)                1.669*** (0.451)   1.885*** (0.459)   1.703*** (0.451)   1.901*** (0.459)
Log(Duration) - Squared     -0.187*** (0.045)  -0.222*** (0.047)  -0.191*** (0.045)  -0.223*** (0.047)
Visual Variation             2.640*** (0.445)   2.651*** (0.446)   2.491*** (0.449)   2.526*** (0.449)
Visual Variation - Squared  -1.579*** (0.415)  -1.595*** (0.416)  -1.682*** (0.416)  -1.681*** (0.417)
Large                       -0.990*** (0.066)  -2.750*** (0.518)  -1.290*** (0.141)  -2.874*** (0.523)
Log(Duration)×Large                             0.341*** (0.099)                      0.316*** (0.100)
Visual Variation×Large                                             0.583**  (0.242)   0.487**  (0.244)
Observations                    6,822              6,822              6,822              6,822
Log Likelihood              -3,645.645         -3,639.707         -3,642.724         -3,637.710
Akaike Inf. Crit.            7,371.291          7,361.415          7,367.447          7,359.421

Note: This table uses data on projects with a video. Regressions include audio controls, project duration, menu length, creator experience, price, word count, sentiments, genre, and gender. Standard errors in parentheses. *p<0.1 **p<0.05 ***p<0.01

We then replicate the analysis on the technology category and present the results in Table D2.

Table D2: Does the effect of a video ad depend on project size?

                                      Success
Log(Duration)                    2.206*** (0.563)
Visual Variation                 2.646*** (0.579)
Large                           -3.343*** (0.693)
Log(Duration) - Squared         -0.233*** (0.059)
Visual Variation - Squared      -1.345*** (0.459)
Log(Duration)×Large              0.337**  (0.133)
Visual Variation×Large           0.741**  (0.311)
Observations                        5,291
Log Likelihood                  -2,724.642
Akaike Inf. Crit.                5,521.284

Note: This table uses data on projects with a video. Regression includes audio controls, project duration, menu length, creator experience, price, word count, sentiments, and genre. Standard errors in parentheses. **p<0.05 ***p<0.01

From Table D2, we see that the interaction between visual variation and project size is still positive and significant, whereas the interaction between video duration and project size is positive but not significant. As previously discussed, the learning effect is relatively stronger and the tedium effect relatively weaker in the technology category, and most videos are shorter than the optimal length. Therefore, the effect of project size in the two-factor model is not significant. Also, we have fewer observations in the technology category.

Effect of Visual Information on Project Success: Interaction with Creators' Prior Crowdfunding Experience

Some creators are more experienced than others. On Kickstarter, if a creator has created projects before, these projects are displayed on that creator's personal webpage. When buyers already have some prior knowledge about a creator, they can learn about the creator's new projects faster. As a result, the positive learning effect saturates more quickly. Moreover, when the musicians and their projects are not fresh to buyers, the boredom effect should be stronger. Therefore, we expect consumers to be less patient when watching videos produced by experienced creators.

In the empirical analysis, we incorporate the interaction between creator experience and video duration, as well as the interaction between creator experience and visual variation, into our logistic regression. We summarize the results in Table D3.

Table D3: Does the effect of visual information depend on the creator's crowdfunding experience?

                                                           Success
                                      (1)                (2)                (3)                (4)
Log(Duration)                   1.803*** (0.466)   1.900*** (0.465)   1.807*** (0.466)   1.910*** (0.465)
Log(Duration) - Squared        -0.198*** (0.047)  -0.199*** (0.046)  -0.198*** (0.047)  -0.200*** (0.046)
Visual Variation                2.722*** (0.455)   2.696*** (0.456)   2.684*** (0.460)   2.623*** (0.461)
Visual Variation - Squared     -1.633*** (0.425)  -1.609*** (0.425)  -1.630*** (0.425)  -1.601*** (0.425)
Experience                     -0.329*** (0.074)   1.834*** (0.627)  -0.416**  (0.170)   1.770*** (0.631)
Log(Duration)×Experience                          -0.419*** (0.121)                     -0.439*** (0.122)
Visual Variation×Experience                                           0.170    (0.299)   0.323    (0.304)
Observations                       6,822              6,822              6,822              6,822
Log Likelihood                 -3,541.517         -3,535.479         -3,541.355         -3,534.913
Akaike Inf. Crit.               7,163.034          7,152.959          7,164.709          7,153.826

Note: This table uses data on projects with a video. Regressions include audio controls, target, project duration, menu length, price, word count, sentiments, genre, and gender. Standard errors in parentheses. **p<0.05 ***p<0.01

The results in Table D3 show that all the main effects of video duration and visual variation remain after incorporating the interactions with creator experience. The results also show that the interaction between video duration and creator experience is negative and significant (p-value < 0.01), suggesting that, all other things being equal, buyers are likely to prefer shorter videos when evaluating projects posted by experienced creators. It is worth noting that, in our data, the average video duration of projects posted by experienced creators is not significantly different from that of videos posted by their inexperienced counterparts (p > 0.1). Thus, our results suggest that projects posted by experienced creators might benefit from abridged versions of their videos.

The interaction between visual variation and creator experience is positive but not statistically significant, which suggests that the optimal level of stimulation in a video does not seem to depend on buyers' familiarity with the creators.

For the technology category, the effects of the creator's past crowdfunding experience on the effectiveness of video ads are replicated; these results are summarized in Table D4.

Table D4: Does the effect of a video ad depend on the creator's crowdfunding experience?

                                              Success
Log(Duration)                            2.752*** (0.613)
Visual Variation                         3.133*** (0.620)
Creator Experience                       2.540*** (0.700)
Log(Duration) - Squared                 -0.257*** (0.062)
Visual Variation - Squared              -1.288*** (0.463)
Log(Duration)×Creator Experience        -0.378*** (0.137)
Visual Variation×Experience             -0.211    (0.324)
Observations                                5,291
Log Likelihood                          -2,558.033
Akaike Inf. Crit.                        5,188.067

Note: This table uses data on projects with a video. Regression includes audio controls, target, project duration, menu length, price, word count, sentiments, and genre. Standard errors in parentheses. **p<0.05 ***p<0.01
