Appendix A: Technical Details for Convolutional Neural Networks
Convolutional neural networks (CNNs) are based on artificial neural networks (ANNs), a class of machine learning algorithms. Briefly, a basic ANN has three layers: an input layer, a hidden layer, and an output layer. Theoretically, the universal approximation theorem asserts that a sufficiently large ANN can approximate any function, including those underlying image recognition tasks. However, if we used such ANNs for image recognition, the network size would be too large to be feasible on modern computers. CNNs were introduced to solve this problem. Unlike a basic ANN, which has only one hidden layer, a CNN usually has multiple hidden layers. The intuition behind this structure is that a CNN divides the image recognition process into smaller steps, which are carried out by different layers of the network. At the very beginning, a convolutional layer (one of the first few layers) looks for a specific feature, such as a straight line or a curve. Subsequent layers then put these small features together into more complex features, such as circle-like or eye-like shapes. Going further, at the output layer, these complex features are combined into objects such as computers, tables, and trees. Technically, due to the local connectivity property, a single convolutional layer cannot recognize a huge object, only part of it. With many convolutional layers, however, we can gradually move from small parts to a complete picture of the object.
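The local connectivity idea can be illustrated with a single convolutional filter: a small kernel slides over the image and responds strongly wherever its feature (here, a vertical edge) appears. A minimal NumPy sketch for illustration only, not the actual network used in this research:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (valid mode, stride 1) and
    return the map of local filter responses."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# A toy 5x5 image containing a bright vertical line in column 2.
image = np.zeros((5, 5))
image[:, 2] = 1.0

# A simple vertical-edge detector: responds to dark-to-bright transitions.
kernel = np.array([[-1.0, 1.0],
                   [-1.0, 1.0]])

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # peaks (value 2) in column 1, where the edge begins
```

Because the kernel only ever sees a 2x2 patch, it can detect the edge locally but cannot recognize a whole object; stacking many such layers is what lets a CNN build up from edges to objects.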
State-of-the-art CNNs can have dozens of layers and millions of parameters and achieve accuracy rates of more than 95% (Denil et al. 2013). Because such networks are highly complex, they are trained on huge datasets with millions of training examples. This training takes days even on GPUs, processors specifically designed for graphics workloads that are significantly faster than CPUs for image-processing jobs. We decided not to train our own CNN-based image classifier for two reasons. First, we do not have the specialized hardware (e.g., GPUs) to train CNNs within an affordable amount of time. Second, training CNNs requires very large datasets of labeled images; in our task, the images are highly differentiated, requiring a comprehensive training dataset. We have neither sufficient training examples nor the time to manually label the training images.
In addition, deep learning, and CNN training in particular, is already a well-established process in the computer science literature. We believe a detailed training algorithm here would not provide an innovative contribution, and such a contribution is beyond the scope of the current manuscript. We therefore outsourced the training task to a start-up, Imagga, which trains on its own training set, including both publicly available image databases (e.g., ImageNet, an online database with 15 million labeled images for academic use) and its own data sources. At the time of our research, this company was one of the leading start-ups in the field and had a huge training set. However, the work can also be outsourced to similar application programming interfaces (APIs, interfaces through which we can obtain the services of other firms) such as Clarifai and Cloud Vision.
Imagga trains its CNNs on these huge training sets using GPUs. Although the company's CNNs are not specifically designed for our task, they still achieve a much higher accuracy rate than our own algorithm, which is trained on small datasets, and the company maintains a leading position in academic image recognition competitions. We used the company's algorithm through its API to process our images, and the recognized image classes (e.g., building) were then returned to us for further analysis.
In particular, the API returned the probability that an image features an object; for instance, for the following image, the probability that it features a guitar is 54.9%.¹
¹ Note that the company has been updating its algorithms and training data; the values are thus not up to date.
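A hedged sketch of how such a response can be post-processed. The JSON shape below is hypothetical, not Imagga's documented format; the point is only that the service returns per-tag confidence values that we then analyze:

```python
# Hypothetical response shape; the real API's field names may differ.
response = {
    "tags": [
        {"tag": "guitar", "confidence": 0.549},
        {"tag": "person", "confidence": 0.231},
    ]
}

def tag_probability(response, tag):
    """Return the confidence reported for `tag`, or 0.0 if absent."""
    for entry in response["tags"]:
        if entry["tag"] == tag:
            return entry["confidence"]
    return 0.0

print(tag_probability(response, "guitar"))  # 0.549
```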
We convert the continuous probability to a discrete dummy in the following way: if the probability is above 10%, the dummy is 1; otherwise, it is 0.² Note that this cut-off point applies only to the current research and may not generalize to other software or other contexts. According to human judgment, this cut-off point best fits our research purpose. The 10% cut-off is somewhat low for a few reasons:
1) The API (Imagga) itself sets low relative probabilities to reduce type I error.
2) These images are video frames rather than professional photographs, so the average quality is low (many images are blurred) and the algorithm returns low probabilities.
3) In many frames, the feature appears only in a corner of the image, or only part of the feature appears in the image. The algorithm returns low probabilities on these items; however, for our research purpose, we would like to include them.
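The conversion rule above, together with the footnote's handling of multiple objects in one category, can be sketched as follows. The 10% cut-off and the "any probability exceeds the threshold" rule follow the text; the function name is ours:

```python
def category_dummy(probabilities, threshold=0.10):
    """Return 1 if any returned probability for objects in the category
    is above the cut-off (10% in our research), otherwise 0."""
    return int(any(p > threshold for p in probabilities))

# An image featuring both a guitar (54.9%) and wind instruments (8%)
# counts as featuring an instrument, because one probability exceeds 10%.
print(category_dummy([0.549, 0.08]))  # 1
print(category_dummy([0.08, 0.05]))   # 0
```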
For the sake of brevity, we provide some examples of the labeled images below:
Images featuring humans: [three example frames, with returned probabilities 34.0%, 23.1%, and 16.5%]
² In this research, we convert the continuous probabilities into dummy variables for the following reason. An image frame sometimes features multiple objects that fall into the same category (e.g., an image can feature both a guitar and wind instruments), but the algorithm returns different probabilities for them, and it is unclear how to combine these probabilities. We therefore set the dummy to 1 whenever any of the probabilities exceeds the threshold (10%).
[three more example frames featuring humans, with returned probabilities 54.0%, 20.4%, and 25.8%]
Images featuring instruments: [six example frames, with returned probabilities 25.4%, 100%, 100%, 11.2%, 14.1%, and 51.7%]
References
Denil M, Shakibi B, Dinh L, de Freitas N, et al. (2013) Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, 2148–2156.
Appendix B: Measuring Audio Content
We follow the standard approach in computer science and acoustics to control for audio content. We refer readers to Giannakopoulos and Pikrakis (2014) for detailed technical descriptions.
Digital audio signals are sampled from natural sounds. Let x_i(n), n = 1, ..., L, be the sequence of audio samples of the i-th frame, where L is the length of that frame. First, we calculate the average zero crossing rate (ZCR) of the audio content, which is based on the rate of sign changes during each audio frame. More precisely,

Z_i = \frac{1}{2L} \sum_{n=1}^{L} \left| \operatorname{sign}[x_i(n)] - \operatorname{sign}[x_i(n-1)] \right|

and ZCR is the average of Z_i over all frames. ZCR can be interpreted as a measure of the "noisiness" of a signal, as it usually exhibits higher values for noisy signals. ZCR has been frequently used in acoustic studies for speech detection, speech-music discrimination, and music genre classification (Panagiotakis and Tziritas 2005).
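The frame-level quantity Z_i can be computed directly from its definition. A short NumPy sketch (the toy signals are ours, for illustration):

```python
import numpy as np

def frame_zcr(x):
    """Zero crossing rate of one audio frame:
    Z = (1 / (2L)) * sum over n of |sign(x[n]) - sign(x[n-1])|."""
    L = len(x)
    s = np.sign(x)
    return np.sum(np.abs(s[1:] - s[:-1])) / (2 * L)

# A pure tone crosses zero a few times; white noise crosses far more often,
# so its ZCR is higher, matching the "noisiness" interpretation.
t = np.arange(1000)
tone = np.sin(2 * np.pi * 5 * t / 1000)   # 5 cycles over the frame
rng = np.random.default_rng(0)
noise = rng.standard_normal(1000)

print(frame_zcr(tone) < frame_zcr(noise))  # True
```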
The second measure we construct is the short-term energy (energy) of the audio content, computed according to the following equation:

E_i = \frac{1}{L} \sum_{n=1}^{L} x_i^2(n)

E_i is averaged over the frames to obtain the energy measure. Energy reflects the power of an audio file and is expected to change rapidly over the file. We use another measure, entropy, to capture this rapid change in energy. Each frame is divided into K sub-frames, and for each sub-frame j we calculate its normalized sub-frame energy value

e_j = \frac{E_{\text{subframe}_j}}{\sum_k E_{\text{subframe}_k}}

and the entropy of that frame is

H(i) = -\sum_{j=1}^{K} e_j \log_2(e_j)
We further average H(i) over all frames to obtain the entropy measure. Energy and entropy are used for genre classification and emotion detection (Giannakopoulos et al. 2007).
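Both frame-level quantities can be sketched in a few lines of NumPy; the sub-frame count K = 8 below is an illustrative choice, not the value used in the paper:

```python
import numpy as np

def frame_energy(x):
    """Short-term energy of one frame: E = (1/L) * sum of x[n]^2."""
    return np.mean(x ** 2)

def frame_entropy(x, K=8):
    """Energy entropy of one frame: split into K sub-frames, normalize
    the sub-frame energies e_j, and compute H = -sum e_j * log2(e_j)."""
    sub = np.array_split(x, K)
    energies = np.array([np.sum(s ** 2) for s in sub])
    e = energies / np.sum(energies)
    e = e[e > 0]                    # treat 0 * log2(0) as 0
    return -np.sum(e * np.log2(e))

# A constant-amplitude frame spreads energy evenly over sub-frames,
# so its entropy is maximal: log2(K) = 3 bits for K = 8.
x = np.ones(800)
print(frame_energy(x))        # 1.0
print(frame_entropy(x, K=8))  # 3.0
```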
We then apply the discrete Fourier transform (DFT) to the signal and obtain the spectral coefficients of an audio frame, X_i(k), k = 1, ..., FL. Using the spectral coefficients, we calculate two measures, spectral centroid (brightness) and spectral entropy. The spectral centroid is defined as

C_i = \frac{\sum_{k=1}^{FL} k \, X_i(k)}{\sum_{k=1}^{FL} X_i(k)}

Research has shown that higher values of the spectral centroid correspond to brighter sounds, hence we call this measure brightness. Spectral entropy is analogous to entropy, and we omit the details here. Together, these frequency-domain measures have proved effective in genre classification, emotion detection, and speech-music discrimination (Misra et al. 2004).
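The spectral centroid follows directly from the DFT. A NumPy sketch, using the magnitudes |X(k)| of the coefficients (the usual convention; the toy signals are ours):

```python
import numpy as np

def spectral_centroid(x):
    """Spectral centroid C = sum(k * |X(k)|) / sum(|X(k)|),
    over the non-negative frequency bins of the frame."""
    X = np.abs(np.fft.rfft(x))
    k = np.arange(1, len(X) + 1)  # bins indexed k = 1..FL as in the text
    return np.sum(k * X) / np.sum(X)

# A higher-pitched tone concentrates energy in higher bins,
# so its centroid is larger: it sounds "brighter".
t = np.arange(2048)
low = np.sin(2 * np.pi * 20 * t / 2048)
high = np.sin(2 * np.pi * 200 * t / 2048)
print(spectral_centroid(low) < spectral_centroid(high))  # True
```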
As noted, these measures can be used for different audio analysis tasks ranging from speech-music classification to emotion detection. Because a typical video in our data comprises speech, music, and a mixture of emotions, we do not train a model to classify the audio content (which is not the focus of our research); we simply take these measures as controls for the audio content of a video.
Appendix C: Robustness Check Using Data from the Technology Category
To demonstrate the robustness of our visual measures, we replicate our analysis in the tech-
nology category of Kickstarter. The technology category is smaller than the music category
on Kickstarter. For the technology category, our data includes all technology projects in the
following six US states: California, Illinois, Massachusetts, New York, Texas, and Washing-
ton. It comprises all completed projects in these states from the inception of Kickstarter to
March 6, 2017. Together, we have 6,958 observations, among which 5,291 projects have a
video. Almost all technology projects aim to offer some new technology products (e.g.,
glasses, watches, cameras). All the measures and variables used here are essentially the same
variables we use for the music category (except for the CNN variables, where we replace mu -
sical instruments with computers, as computers are commonly featured in technology
projects).
Table C1 summarizes the logistic regression results for the technology category (with the same control variables used in the paper). The non-video-related control variables are omitted.
Table C1: Logistic regression results for the technology category

                                    Success
                              (1)           (2)
Video                      1.348***
                          (0.120)
Log(Duration)                           2.280***
                                       (0.586)
Log(Duration) - Squared                -0.221***
                                       (0.060)
Visual Variation                        3.102***
                                       (0.602)
Visual Variation - Squared             -1.301***
                                       (0.462)
ZCR                                     2.514
                                       (4.869)
Energy                                 -5.590***
                                       (2.040)
Entropy                                 0.732**
                                       (0.290)
Brightness                              0.443
                                       (2.327)
Spectral Entropy                       -0.441
                                       (0.542)
Observations               6,958        5,291
Log Likelihood            -2,963.649   -2,562.247
Akaike Inf. Crit.          5,979.298    5,192.493
Note: Column (1) uses data on all projects. Column (2) uses data on projects with a video. Regressions include target, project duration, menu length, creator experience, price, word count, sentiments, and genre. Standard errors in parentheses. **p<0.05 ***p<0.01
The linear effect of video duration is positive and significant, whereas the quadratic effect of video duration is negative and significant, suggesting that the marginal effect of duration is decreasing.
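For a quadratic specification, the turning point lies at Log(Duration) = -b1 / (2 * b2). As a back-of-the-envelope check using the Table C1 column (2) coefficients (the duration unit is an assumption on our part; the paper does not restate it here):

```python
import math

b1 = 2.280   # Log(Duration) coefficient, Table C1, column (2)
b2 = -0.221  # Log(Duration) - Squared coefficient

log_dur_star = -b1 / (2 * b2)  # vertex of the quadratic in Log(Duration)
print(round(log_dur_star, 2))  # 5.16

# The implied optimum on the original scale (unit is an assumption):
print(round(math.exp(log_dur_star)))  # 174
```

Beyond this point, the estimated success probability declines with additional duration, which is the "decreasing marginal effect" noted above.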
The effects of visual variation are similar. In Table C1, the coefficient on visual variation is positive and significant, whereas the coefficient on visual variation squared is negative and significant, again consistent with optimal stimulation level theory.
So far, we have discussed the effects of video duration and visual variation. These effects may depend on project and creator characteristics; in Appendix D, we discuss the effects of project size and creators' past crowdfunding experience.
Finally, we replicate the analysis of the CNN variables on the technology category to demonstrate robustness. As mentioned, we construct two CNN variables capturing the content of the videos. Again, we construct a dummy denoting whether or not a video features humans. Unlike music projects, technology projects do not feature musical instruments, so we do not use that variable. Instead, we use a similar tool in technology: computers (including both desktops and laptops). Creators often use computers to show the software or programs they have developed. As in the music category, the two CNN variables could potentially help creators communicate credibility to buyers. As can be seen from Table C2, the effects of both human and computer are significantly positive, supporting the credibility effect.
Table C2: The effects on project success in the technology category

                                       Success
                              (1)        (2)        (3)
Human                       0.498***              0.510***
                           (0.093)               (0.094)
Computer                               0.187**    0.207***
                                      (0.077)    (0.077)
Log(Duration)               2.050***   2.298***   2.063***
                           (0.581)    (0.586)    (0.582)
Log(Duration) - Squared    -0.200***  -0.223***  -0.202***
                           (0.060)    (0.060)    (0.060)
Visual Variation            2.959***   3.042***   2.886***
                           (0.605)    (0.604)    (0.607)
Visual Variation - Squared -1.194**   -1.263***  -1.148**
                           (0.464)    (0.463)    (0.466)
Observations                5,291      5,291      5,291
Log Likelihood             -2,547.636 -2,559.297 -2,544.046
Akaike Inf. Crit.           5,165.272  5,188.593  5,160.092
Note: This table uses data on projects with a video. Regressions include audio controls, target, project duration, menu length, creator experience, price, word count, sentiments, and genre. Standard errors in parentheses. **p<0.05 ***p<0.01
Appendix D: Additional Analysis for Crowdfunding Application
In the main text, we have described the main effects of video related variables. Now, we ex-
tend the analysis to study the interaction between visual information and other project con-
tent. We intend to show that the video measures also affect project success through other fac-
tors.
Effect of Visual Information on Project Success: Interaction with Project Size
In this section we consider the interaction between the video measures and project size. The rationale is that when evaluating a large project, the associated risk is usually higher, so potential buyers may value the information provided more and may be less likely to feel bored while watching the related video. Project size can be defined by the price of the offerings or the total target set by the creator. Because the majority of Kickstarter projects have multiple offerings (and prices), it is not easy to identify a unique price for each project to use in the analysis. For instance, how would we compare a project with two prices ($5 and $100) to a project with a single price ($20)? In contrast, each project has a unique target that can readily be used for analysis. In our data, the median target level is $5,000. We follow a median-split strategy and divide our sample of 6,822 projects into two types, creating a dummy variable, Large (= 1), to indicate projects that call for more than $5,000.³ Because many projects call for exactly $5,000, only 43.5% of the projects qualify as large (see Figure D1).
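The median split can be sketched in a few lines; the data and variable names below are illustrative, not our actual sample. Because Large requires a target strictly above the median and many projects ask for exactly the median amount, ties go to "small", which is why fewer than half (43.5% in our data) qualify as large:

```python
import statistics

# Toy targets; note the many projects asking for exactly $5,000.
targets = [1000, 5000, 5000, 5000, 8000, 20000, 3000, 5000]

median = statistics.median(targets)         # 5000 here
large = [int(t > median) for t in targets]  # strict inequality: ties -> 0
print(large)                  # [0, 0, 0, 0, 1, 1, 0, 0]
print(sum(large) / len(large))  # 0.25, i.e., fewer than half are "large"
```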
We then run the logistic regression including the interaction between project size and video duration, as well as the interaction between project size and visual variation, i.e., Video Duration × Large and Visual Variation × Large. The control variables are the same as in the basic model, except that we now replace the funding target with Large.
The results are summarized in Table D1 below. Table D1 shows that the interaction between project size and video duration is positive and significant (p < .01), suggesting that, for projects with bigger targets, the tedium effect is weaker and buyers value longer
³ The results are not qualitatively altered when we use the continuous variable Target instead of the dummy variable Large.
videos. In addition, the interaction between project size and visual variation is positive and significant, implying that the optimal stimulation level may be higher when buyers evaluate large projects.
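Under the interaction specification, the linear Log(Duration) term for a large project is the main coefficient plus the interaction. A worked check using the Table D1 column (2) estimates (the quadratic term is held aside here, so this is only the linear component, not the full marginal effect):

```python
b_dur = 1.885          # Log(Duration), Table D1, column (2)
b_dur_x_large = 0.341  # Log(Duration) x Large interaction

slope_small = b_dur                  # Large = 0
slope_large = b_dur + b_dur_x_large  # Large = 1
print(slope_small, round(slope_large, 3))  # 1.885 2.226
```

The larger linear component for large projects is what pushes their optimal video duration upward.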
Figure D1: The distribution of project targets. [Histogram: horizontal axis shows project target in $1,000-wide bins from $500 to $14,500 plus a "More than 15000" bin; vertical axis shows the number of projects, from 0 to 1,600.]
Table D1: Does the effect of visual information depend on project size?

                                         Success
                              (1)        (2)        (3)        (4)
Log(Duration)               1.669***   1.885***   1.703***   1.901***
                           (0.451)    (0.459)    (0.451)    (0.459)
Log(Duration) - Squared    -0.187***  -0.222***  -0.191***  -0.223***
                           (0.045)    (0.047)    (0.045)    (0.047)
Visual Variation            2.640***   2.651***   2.491***   2.526***
                           (0.445)    (0.446)    (0.449)    (0.449)
Visual Variation - Squared -1.579***  -1.595***  -1.682***  -1.681***
                           (0.415)    (0.416)    (0.416)    (0.417)
Large                      -0.990***  -2.750***  -1.290***  -2.874***
                           (0.066)    (0.518)    (0.141)    (0.523)
Log(Duration)×Large                    0.341***              0.316***
                                      (0.099)               (0.100)
Visual Variation×Large                            0.583**    0.487**
                                                 (0.242)    (0.244)
Observations                6,822      6,822      6,822      6,822
Log Likelihood             -3,645.645 -3,639.707 -3,642.724 -3,637.710
Akaike Inf. Crit.           7,371.291  7,361.415  7,367.447  7,359.421
Note: This table uses data on projects with a video. Regressions include audio controls, project duration, menu length, creator experience, price, word count, sentiments, genre, and gender. Standard errors in parentheses. *p<0.1 **p<0.05 ***p<0.01
We then replicate the analysis on the technology category and present the results in Table D2.

Table D2: Does the effect of a video ad depend on project size?

                                 Success
Log(Duration)                   2.206***
                               (0.563)
Visual Variation                2.646***
                               (0.579)
Large                          -3.343***
                               (0.693)
Log(Duration) - Squared        -0.233***
                               (0.059)
Visual Variation - Squared     -1.345***
                               (0.459)
Log(Duration)×Large             0.337**
                               (0.133)
Visual Variation×Large          0.741**
                               (0.311)
Observations                    5,291
Log Likelihood                 -2,724.642
Akaike Inf. Crit.               5,521.284
Note: This table uses data on projects with a video. Regressions include audio controls, project duration, menu length, creator experience, price, word count, sentiments, and genre. Standard errors in parentheses. **p<0.05 ***p<0.01
From Table D2, we see that the interaction between visual variation and project size remains positive and significant, whereas the interaction between video duration and project size is positive but not significant. As previously discussed, the learning effect is relatively stronger and the tedium effect relatively weaker in the technology category, and most videos are shorter than the optimal length; the moderating effect of project size on duration is therefore not significant. We also have fewer observations in the technology category.
Effect of Visual Information on Project Success: Interaction with Creators' Prior Crowdfunding Experience
Some creators are more experienced than others. On Kickstarter, if a creator has created projects before, these projects are displayed on the creator's personal webpage. When buyers already have prior knowledge about a creator, they can learn about the creator's new projects faster; as a result, the positive learning effect saturates more quickly. Moreover, when the musicians and their projects are not fresh to buyers, the boredom effect should be stronger. We therefore expect consumers to be less patient when watching videos produced by experienced creators.
In the empirical analysis, we incorporate into our logistic regression the interaction between creator experience and video duration, as well as the interaction between creator experience and visual variation. We summarize the results in Table D3.
Table D3: Does the effect of visual information depend on the creator's crowdfunding experience?

                                            Success
                                 (1)        (2)        (3)        (4)
Log(Duration)                  1.803***   1.900***   1.807***   1.910***
                              (0.466)    (0.465)    (0.466)    (0.465)
Log(Duration) - Squared       -0.198***  -0.199***  -0.198***  -0.200***
                              (0.047)    (0.046)    (0.047)    (0.046)
Visual Variation               2.722***   2.696***   2.684***   2.623***
                              (0.455)    (0.456)    (0.460)    (0.461)
Visual Variation - Squared    -1.633***  -1.609***  -1.630***  -1.601***
                              (0.425)    (0.425)    (0.425)    (0.425)
Experience                    -0.329***   1.834***  -0.416**    1.770***
                              (0.074)    (0.627)    (0.170)    (0.631)
Log(Duration)×Experience                 -0.419***             -0.439***
                                         (0.121)               (0.122)
Visual Variation×Experience                          0.170      0.323
                                                    (0.299)    (0.304)
Observations                   6,822      6,822      6,822      6,822
Log Likelihood                -3,541.517 -3,535.479 -3,541.355 -3,534.913
Akaike Inf. Crit.              7,163.034  7,152.959  7,164.709  7,153.826
Note: This table uses data on projects with a video. Regressions include audio controls, target, project duration, menu length, price, word count, sentiments, genre, and gender. Standard errors in parentheses. **p<0.05 ***p<0.01
The results in Table D3 show that all the main effects of video duration and visual variation remain after incorporating the interactions with creator experience. The results also show that the interaction between video duration and creator experience is negative and significant (p < 0.01), suggesting that, all else being equal, buyers prefer shorter videos when evaluating projects posted by experienced creators. It is worth noting that in our data, the average video duration of projects posted by experienced creators is not significantly different from that of projects posted by their inexperienced counterparts (p > 0.1). Thus, our results suggest that projects posted by experienced creators might benefit from abridged videos.
The interaction between visual variation and creator experience is positive but not statistically significant, suggesting that the optimal level of stimulation in a video does not depend on buyers' familiarity with the creator.
For the technology category, we replicate the analysis of the creator's past crowdfunding experience on the effectiveness of video ads; the results are summarized in Table D4.
Table D4: Does the effect of a video ad depend on the creator's crowdfunding experience?

                                         Success
Log(Duration)                           2.752***
                                       (0.613)
Visual Variation                        3.133***
                                       (0.620)
Creator Experience                      2.540***
                                       (0.700)
Log(Duration) - Squared                -0.257***
                                       (0.062)
Visual Variation - Squared             -1.288***
                                       (0.463)
Log(Duration)×Creator Experience       -0.378***
                                       (0.137)
Visual Variation×Creator Experience    -0.211
                                       (0.324)
Observations                            5,291
Log Likelihood                         -2,558.033
Akaike Inf. Crit.                       5,188.067
Note: This table uses data on projects with a video. Regressions include audio controls, target, project duration, menu length, price, word count, sentiments, and genre. Standard errors in parentheses. **p<0.05 ***p<0.01