TRANSCRIPT
Using “Distant Reading” to Explore Discussion Threads in
Online Courses
Digital Poster Session
Big XII Teaching and Learning Conference (3rd Annual)
Theme: Transformational Teaching and Learning
Kansas State University
June 2 – 3, 2016
“reading” related tags network on Flickr (1.5 deg.)
2
“Reading_(process)” article network on Wikipedia (1 deg.)
3
Overview
• In this age of mass data, “distant reading” has come to the fore as a way to deal with large amounts of text data—including text from student discussion threads in online courses. Kansas State University has a site license for NVivo 11 Plus, software that enables multimedia data curation and qualitative and mixed-methods data analysis. Two new features in NVivo—sentiment analysis and theme extraction (topic modeling)—enable users to “distant read” large amounts of text to extract some early insights.
• What are the expressed sentiments of learners when discussing a particular issue? Do these trend positive or negative?
• What topics or themes or concepts are brought up by students given a certain discussion thread prompt?
• What do the sentiment and topic insights suggest about where students are at with a particular issue? Are there latent (hidden) insights?
4
Overview (cont.)
• These new features, in combination with text frequency counts (with related text clustering), text searches, and other text data query capabilities (and related data visualization capabilities) in NVivo, enable distant reading for use in online courses. This digital slideshow will introduce NVivo 11 Plus (a local software tool with both Windows and Mac platform versions) and walk through how it may be applied to textual data extracted from an online course.
• Understanding what students are thinking is a critical part of transformational teaching and learning. Using computational means to listen and to hear is important to this end.
5
“Distant reading”
6
Origin of the “distant reading” concept
• Proposal to use computational means to process more than the local canon but also works of world literature, digitized texts, and born-digital contents in the Internet era
• Characterized classic reading as a solemn and serious endeavor, when what is needed is “a little pact with the devil”: reading texts with distance in order to study “devices, themes, tropes—or genres and systems” (n.p.)
• Franco Moretti (“Conjectures on World Literature,” 2000)
• An abundance of digital texts with many comprising “the great unread” and so requiring computational means to decode text / explore meaning / “distant read”
• Franco Moretti (Distant Reading, 2013)
7
Origin of the “distant reading” concept (cont.)
• Willingness to let computers / machines “distant read” for people?
• Reluctant readers in the present age?
• Low grade-level reading assumptions for the broader population?
  • Surveys written to the 3rd- and 5th-grade levels
  • Newspapers written to middle-school levels
• Moves to reading online and letting go of traditional news
• Inefficient human reading speeds
  • 200 words per minute (wpm) for close academic reading
  • 800 wpm for skimming and scanning
• Challenges with human attention, focus, comprehension / accuracy, and memory in terms of reading
8
Light history of statistical analysis of text
• Over 50 years of statistical analysis of text (but initially with human analysts)
• Three general methodologies
  • Thematic content analysis (by human judges early on)
  • Word pattern analysis (as part of artificial intelligence work)
  • Word count strategies (Pennebaker, Mehl, & Niederhoffer, 2003)
• All three methodologies extended and refined through computational means
9
General sequence of “distant reading”
1. Collection of relevant texts (challenges with access)
2. Data formatting and cleaning (data pre-processing, including file formatting, searchable text, de-duplication of files, spell checks, and other types of work)
3. Textual markup via tagging (sometimes with XML, for example)
4. Running data queries and analytics
5. Outputting relevant data visualizations (can start with exploration using data visualizations)
6. Analysis and interpretation
7. Finalization and presentation
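The cleaning and querying steps in the sequence above can be sketched in a few lines. This is a minimal Python illustration with hypothetical helper names, not NVivo’s internal process:

```python
import re
from collections import Counter

# Hypothetical helpers sketching steps 2 and 4 of the sequence above;
# real projects would also need spell checks, markup (step 3), and visualization.
def clean(raw_posts):
    """Step 2: strip markup, normalize case, de-duplicate posts."""
    seen, cleaned = set(), []
    for post in raw_posts:
        text = re.sub(r"<[^>]+>", " ", post).lower().strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

def query_word_counts(posts):
    """Step 4: one simple data query -- word frequency across the corpus."""
    return Counter(w for p in posts for w in re.findall(r"[a-z']+", p))

posts = ["<p>Reading is thinking.</p>", "reading is thinking.", "Distant reading scales."]
corpus = clean(posts)              # the duplicate post collapses to one
counts = query_word_counts(corpus)
```

Steps 5–7 (visualization, interpretation, presentation) then work from summary objects like `counts`.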
10
Comparisons and contrasts between “distant” and “close” reading
11
Some comparisons and contrasts between distant and close reading
Distant Reading
• Text interpretation based on computerized counts of words and n-grams
• Counts may be done by unsupervised machine learning
• Counts may be done by supervised machine learning
• Counts may be done by a combination of computational and human methods (usually via customized human tagging and counting)

Close Reading
• Text interpretation based on interactions between a human reader (with personality, intellect, emotions; life experiences; education; training) and a text / text corpus
• Evocative and descriptive details affect the reader and activate various parts of the brain…
  • Trigger the somatosensory cortex
  • Engage the reader’s imagination, mind, emotions, and prior knowledge
  • Engage reader empathy and sympathy
  • Challenge the reader’s worldview and sense of how things should be and how things are
12
Some comparisons and contrasts between distant and close reading (cont.)
Distant Reading
• Summary data of a text or text corpora
• Reductionist (of language data dimensionality)
• Indirect
• Scalable (but not to “big data” standards without access to cloud computing and special analytics approaches)
• Machine-efficient and fast
• Objectivist, reproducible, repeatable
• Comparable to other similar text documents and collections

Close Reading
• Activates the “inner language” that people use to think (vs. the “outer language” used to communicate with others)
• People “simulate to comprehend”
  • See Rawi Nanakul’s “Shaping Data Stories with Neuroscience,” May 24, 2016, in Data Science Central’s webinar series
• Informed by human-based proprioception (and the somatosensory system) / embodied reading
  • See David Gelernter’s comments on the import of proprioception in an AI article in Time magazine
13
Some comparisons and contrasts between distant and close reading (cont.)
Distant Reading
• Data mining for discovery
• Data exploration for patterns (at various levels of granularity)
• Specific data queries for specific text types and text sets
• Descriptions of genres (news texts, short story collections, letters, terror messaging, and others)

Close Reading
• Tends towards contextual awareness
• Assumption of a human interpretive lens with relativistic assumptions
• Is affected by who the reader is and what informs his / her literary or reading palate
• Goes beyond surface meanings of a text
  • Includes subtext
  • Includes a lot of understanding outside of and beyond the text (such as cultural references)
14
Some comparisons and contrasts between distant and close reading (cont.)
Distant Reading
• May be applied to anything in text format
• Built on the analysis of “natural language” (evolved vs. constructed language)

Close Reading
• Analytics presented with a personality framework (through people, through human mediation)
• Requires a certain level of human reading background and cognition to achieve certain types of high-level reading
• Engaging the mind (cognition, emotions) and human senses (sight, smell, taste, touch, hearing, and proprioception)
15
Some comparisons and contrasts between distant and close reading (cont.)
Distant Reading
• Comparisons across texts, corpora, and (cultural, time, and other) contexts
• Stylometric analysis of author writing
• “Disembodied” and machine-based decoding of text

Close Reading
• Intimate to the human experience
• Variant meanings depending on the reader and on context (texts may be “read into”)
• Conceptualized as both phonological (sound-based, as in read-aloud and “sounding out” language) and orthographic (written / printed symbol-based)
16
Some comparisons and contrasts between distant and close reading (cont.)
Distant reading
• Uses data visualizations to depict summary data
  • Word clouds
  • Word trees
  • Data tables
  • Cluster visualizations (2D, 3D, 4D)
  • Dendrograms
  • Tree maps
  • Intensity matrices (matrix factorizations)
  • Node-link diagrams
  • Topic histograms
  • Sociograms
  • Geographical maps, and others

Close reading
• Canons of text informed by collective processes of selection that are often serendipitous and not necessarily based on “merit” once texts reach a particular level of recognition (Klosterman, But What if We’re Wrong? 2016)
• Informed by social contexts, and subtle and not-so-subtle influences
• Informed by collective memory and collective forgetting
17
Some research enablements of both types of reading
Distant reading
• Empirical from-world (in vivo) approaches
  • Analysis of from-world text data sets
• Experimental lab (in vitro) approaches (control groups, interventions, outcomes)
  • Analysis of assigned types of writing
• and others

Close reading
• Literary analysis
• Cultural analytics
• Remote profiling
• Reviews
• and others
18
Strategic application of “distant” and “close” reading
19
Controversies related to “distant reading”
20
Controversy: “Not reading”
• Distant reading described as “non-reading” and “not-reading,” also as “data mining” and “machine learning”
• “Distant reading” without an inherent human sense of appreciation for reading
• Disembodied, decontextualized, and non-ordered computational text analysis (such as computation-based “bag of words” approaches)
• How does the “great unread” benefit from “distant reading” capabilities that do not involve clear human investment in the reading or clear appreciation for the writing? (argument)
• Counter-intuitive approach
21
Controversy: Computer incursion into human spaces
• Sense of computational incursions in distinctly human spaces
  • Human writing as a human activity done purely for human consumption and usage
• Over-reliance on computational means for text interpretation
  • Computational means seen to be highly mechanistic and rigid
• If distant reading now, distant algorithmic writing (like SCIgen) close behind?
  • Machine writing based on algorithms, sampling from other sources, rules of organization, emulation of genre and forms (may not make sense in a human-readable way)
• Human writing as individual and / or social, something “uniquely” human…?
22
Controversy: Computer incursion into human spaces (cont.)
• Changing up the standards of literary research
  • Completeness of text sets
  • Accuracy (vs. human bias and subjectivity)
  • Reliability
  • Reproducibility
  • Machine vs. human insight
• A summary of the argument: Computers “are too rigid, formulaic, and cold to pick up the subtle nuances, emotions, and creative brilliance manifested in literature” (Graesser, Dowell, & Moldovan, “A computer’s understanding of literature,” 2011, p. 3)
23
Controversy: Epistemology
• Challenges involved in the apparent positivism and research reproducibility assumptions of “distant reading” and concepts of epistemology (or where knowledge comes from)
• vs. human-focused interpretivism, social constructivism, and emphasis on uniqueness and originality…or many of the human aspects of qualitative research
• Goes against various traditions of text analytics: narratology research, various thematic analytical approaches, and others
24
Controversy: Hard work
• Hard analytical work to “distant read”
• High learning curve to understand various underlying text processing algorithms and XML tagging systems (depending on what is used)
• Challenges to acquire copyrighted texts
• Effortful digitization of paper-based texts in a searchable way
• Time-intensive requirements to clean and prepare data
• Need for careful human analysis of the extracted data
25
“Non-consumptiveness” in some types of computation-based text research
• “Non-consumptive” distant reading involves access to summary data only and no access to the underlying source (text) data
• Non-consumptiveness enables researchers to have some access to text, but not in a way that enables them to recreate the text computationally, because of the risks of contravening copyright of the accessed texts
• To protect copyright, “digital surrogates” and “shadow datasets” are made available
26
Examples of non-consumptive text research sources on the Web
• Example: Google Books Ngram Viewer only offers access to ngram counts, line graph visualizations, and links to seminal years (with some pages of potentially relevant information), but not the underlying source dataset (because of copyright and other issues); uses a shadow dataset for the Ngram Viewer queries
• Example: HathiTrust Digital Library offers opportunities for non-consumptive research
27
Controversy: Non-consumptiveness
Con non-consumptive research of text(s)
• Non-consumptiveness means that researchers do not have access to the underlying texts but can only capture summary patterns and data snippets.
• Researchers cannot explore the underlying data more directly to understand particular effects within the context of the document.
• Researchers have partial information; they are seeing through a glass darkly.

Pro non-consumptive research of text(s)
• Big data will disallow any real direct intimacy with the data from close reading.
• The on-ground reality is a de facto situation of essential non-consumptiveness as well.
• It is better to know something than not…even if the full text set is not available.
28
Controversy: Non-consumptiveness (cont.)
Con non-consumptive research of text(s)
• Non-consumptive protections assume that researchers will hack systems in order to break copyright instead of using the copyrighted texts for relevant research.

Pro non-consumptive research of text(s)
• Waiting until texts are fully in the public domain and not under copyright protection means loss of research opportunities.
29
Current state of affairs
• Since the early 1990s, advances in:
  • Computational linguistics
  • Corpus linguistics
  • Discourse processes
  • “statistical representations of world knowledge and meaning” (Graesser, Dowell, & Moldovan, “A computer’s understanding of literature,” 2011, p. 3)
• Enablements for the “computational science of literature” (Graesser, Dowell, & Moldovan, 2011, p. 4)
30
Current state of affairs (cont.)
• Definitions and practices: What exactly is “distant reading”? What are the most effective ways to achieve this?
• Room for both? Can distant and close reading complement and enhance each other?
• Distant reading will not replace close reading; close reading will not replace distant reading
• Each may use the same texts, but they cover different ground and surface different insights and provide different functions
• Neither camp (advocates for one or the other) claims to have the definitive approach to reading, the sole correct way of reading, or the whole story
31
Current state of affairs (cont.)
• Split practices: Some humanists going to the “digital humanities,” and others not ceding any traditional ground
• Others use multiple methods to read and use each to complement the other
32
Current state of affairs (cont.)
• Don’t call it “distant reading” but rather “computational text analysis”
• Early methods have been around since the 1990s
• Computational methods to model texts are necessary for handling large-scale document and text corpora to enhance findability and searchability
• Various types of statistical and computational methods are applied to identify hidden themes in textual (and visual) works to inferentially summarize contents in an unsupervised machine learning way
• Computational methods have been used to identify similar works—for recommender systems (to aid people in finding the research they are seeking)
33
Applications in the research literature
34
Some electronic sourcing of texts for “distant reading” analysis
• Social network wall postings
• Microblogging text streams and text networks
• Online journaling sites
• Content sharing social media platforms (and commenting streams / networks)
• Crowd-sourced online encyclopedia articles, article networks, and collections
• Web log (blog) entries
• Article repositories and databases
• Content descriptions from online repositories
• Articles from online news sites
• Email networks
• Slideshows
• Tags for user-generated imagery and / or videos
• FAQs (frequently asked questions)
35
Some electronic sourcing of texts for “distant reading” analysis (cont.)
• Descriptors for user-generated contents
• Massive open online course (MOOC) discussion boards and threads
• Downloadable online text datasets (such as from published research)
• Short messaging service (SMS) texts
• Chat room postings
• Online ads
• Social news services
• Stories, poems, scripts, plays, and other literary genre contents from online publishing sites
• Codes from online pastebins and open sites
• Documents, reports, memos, minutes, and other work-related contents
36
Some electronic sourcing of texts for “distant reading” analysis (cont.)
• Government cables (such as from leak sites)
• Book manuscripts on open sharing sites
• Informal gray literature shared on the Web
• Profiles on dating, professional, and other purpose-based social sites
• Song lyrics
• Recipes, and others

Also
• Anything digital (or analog) that is transcribe-able or able to be versioned into text
• Anything analog that is digitize-able (scannable) to searchable text
37
Some actual distant reading enablements in the academic research
• “Reading” to surface latent insights, such as the ability to…
  • Look for hidden patterns in texts and language use
  • Extract self-concept and social relationships
  • Predict friendships and length of romantic relationships (through textual mirroring)
  • Study social groups and interactions
  • Explore sentiment, emotion, and personality based on language use patterns
  • Profile writers based on their writing
  • Describe writing based on stylometry (quantitative metrics of writing style)
  • Pursue the answering of research questions through purposeful and tactical distant reading
  • Study collective social sentiment during social upheavals
38
Some actual distant reading enablements in the academic research (cont.)
• Ability to… (cont.)
  • Study stage-of-life issues
  • Explore benefits of writing to deal with health issues / social traumas / recovery from disasters
  • Detect deception and differentiate actual vs. faked works (with varying levels of confidence)
  • Differentiate between the writing of poets and songwriters who commit suicide vs. those who don’t (observed from-world data)
  • Predict various phenomena: future work-place performance, threat assessment and intentionality, sense of aggression, and others
  • Study and describe works from various cultures and sub-cultures
39
Some actual distant reading enablements in the academic research (cont.)
• “Reading” (textual interpretation / decoding) at different levels of analysis
• Ability to pull out particular aspects of texts
• Ability to extract insights that would not be possible otherwise
• “Reading” at computational speeds • “Reading” at massive scales
40
Some general practical examples
• Pulling out dialogue in order to understand characters and relationships
• Identifying collective discussions (based around themes) around a particular topic (in periods in history and contemporaneously)
• Studying how concepts / terms evolve over time
• Mapping a particular domain field for main directions and foci (as well as outlier research and concepts)
• Mapping time relationships in a text to understand plot
41
Some general practical examples (cont.)
• Mapping texts for geographical locations
• Studying symbolism across works by an author, texts in a genre, or written works in a culture
42
Distant reading applied to discussion board data
43
44
Some types of data in discussion boards
Content Data
• Communicators (by name)
• Textual content of communications
• Names of discussion threads
• Discussion threads
• Multimedia contents

Trace / Traffic Data
• Interaction data (replies, likes, and others)
• Time of communications
• Social networks
• Locational coordinates of user (or contents, like imagery or video)
• Technological specifications of files
45

Metadata
46
Distant reading (writ large) applied to discussion board data
Some Askable Questions
• What are dominant issues on a certain discussion thread(s)? What are the outlier issues on certain discussion threads (the topics that are part of the “long tail”)?
• What sorts of language are used in addressing particular topics? Why?
• Who is interacting with whom? Around what issues? What types of cliques may be observed in the online course social network?
• What are expressed sentiments based on a certain discussion thread(s)?
47
Distant reading (writ large) applied to discussion board data (cont.)
• How are particular terms used in various contexts in the discussion? What are the various gists and word senses?
• How does activity in discussion (message) boards change over time?
• What sorts of prompts are the most effective to encourage online discussions?
• What sorts of instructor interventions are most constructive (based on learner responses)? Which are least constructive? Why?
• What sorts of messages spark discussion? What sorts of messages shut down discussion? When is each type of message preferred?
48
Distant reading (writ large) applied to discussion board data (cont.)
• What are observable word relationships in a discussion board(s)? What do these (inter)relationships indicate?
• Are there hidden or latent topics in the discussions based on the observed words? What insights are there?
• What locations are indicated in uploaded images shared on a discussion board(s)? How do these map? Are there spatial patterns?
• And others…
49
In NVivo 11 Plus
A qualitative data analytics suite
50
Some “distant reading” features of NVivo 11 Plus
(1) word frequency counts
(2) text searches
(3) theme and sub-theme extractions (topic modeling)
(4) matrix queries
(5) sentiment analyses
(6) autocoding by existing pattern
(7) geolocational mapping
(8) sociograms
(9) time patterns
51
Additional notes about NVivo and distant reading
• NVivo 11 Plus enables a half-dozen base languages for projects
• Any language representable in the UTF-8 character set may be ingested into NVivo 11 Plus and accurately represented
  • These include all languages representable on the Web and Internet, both natural languages and constructed languages
• Some of the data will have to be pre-processed outside of NVivo 11 Plus before running particular data queries, autocoding, and data visualizations
• Some LMSes may allow more readily accessible data for easier download, but current-state LMSes apparently do not make this easy
52
Demos using NVivo 11 Plus
53
A Tweetstream dataset to stand in for discussion board data
• Publicly released information
• Has entities (social media users)
• Has commenting and replies
• Is based on @officeofedtech Tweetstream (an educational focus)
  • 2,796 captured records (including retweets) using NCapture of NVivo
  • 5,601 Tweets, 1,130 following, 109,955 followers, 687 likes, 9 lists, at the time of the data extraction; joined in July 2010
  • U.S. Department of Education (http://tech.ed.gov)
54
55
(1) word frequency counts
• Machine count of the 1000 (or fewer) top content words in the selected text set
  • May adjust the level of words
  • May add words to the stopwords list (at either the global project level or at the local run / local macro level)
  • Half-dozen base languages may be applied (but only one at a time)
• Also proportionality of a word’s appearance in relation to other counted words from a document, node, or text set
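Outside of NVivo, the same kind of stop-listed frequency count with proportionality can be sketched in plain Python (the stopword list here is an illustrative stand-in for a project-level list):

```python
import re
from collections import Counter

# Illustrative stand-in for a project-level stopword list
STOPWORDS = {"the", "a", "an", "and", "to", "of", "is", "in"}

def word_frequencies(texts, top_n=1000):
    """Count content words; report each word's share of all counted words."""
    tokens = [w for t in texts for w in re.findall(r"[a-z']+", t.lower())
              if w not in STOPWORDS]
    counts = Counter(tokens).most_common(top_n)
    total = sum(c for _, c in counts)
    return [(word, count, round(count / total, 3)) for word, count in counts]

rows = word_frequencies(["Reading the thread", "the thread is long"])
```

Each row carries the word, its raw count, and its proportion of all counted words.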
56
(1) word frequency counts (cont.)
Pros
• Identification of focuses of interest (more mentions)

Cons
• Word counts captured at word level, not phrase level; not bigrams, not trigrams, not four-grams; not “thought units,” etc.
• Word counts do not provide gist or context
• Word counts do not capture sentiment (positive or negative)
57
58
59
60
61
62
63
Conversely: Exploring the “long tail” with a “least frequent” word count
• Unclick the “1000 most frequent” selection in the word frequency count query
• Filter the findings by count (from least to greatest)
• Export the data table to Excel
• Clean out garble and non-words
• Identify outlier topics of interest, URLs, and other features
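The ascending “long tail” sort can likewise be approximated outside of NVivo. A minimal sketch (real runs would also need the “clean out garble” step from the steps above):

```python
import re
from collections import Counter

def long_tail(texts, max_count=1):
    """Return words occurring at most `max_count` times (the rare outliers)."""
    counts = Counter(w for t in texts for w in re.findall(r"[a-z']+", t.lower()))
    return sorted(w for w, c in counts.items() if c <= max_count)

# Hypothetical mini-corpus: "rubric" appears only once
rare = long_tail(["moodle moodle canvas", "canvas rubric"])
```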
64
65
66
(2) text searches
• Words, phrases, quotes, names, dates, and other content that may be searched for and identified in a word tree
• May click on the word tree phrases to capture context • May double click on phrases to go to the original reference • May set parameters of the search for the following (from specific to
general): exact matches, stemmed words, synonyms, specializations, generalizations
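A word tree is assembled from keyword-in-context (KWIC) hits; the underlying search can be sketched as follows (window size and sample sentence are illustrative; NVivo’s own matching also handles stems and synonyms, which this exact-match sketch does not):

```python
import re

def kwic(texts, target, window=3):
    """Keyword-in-context: a few words before and after each hit of `target`."""
    hits = []
    for text in texts:
        words = re.findall(r"\w+", text.lower())
        for i, w in enumerate(words):
            if w == target:
                hits.append((words[max(0, i - window):i], w,
                             words[i + 1:i + 1 + window]))
    return hits

hits = kwic(["I think distant reading scales well"], "reading")
```

Grouping hits by their shared left or right context is what produces the branching word-tree display.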
67
(2) text searches (cont.)
Pros
• Context of a few words prior and a few words after provides summary data

Cons
• Deeper exploration requires accessing original source contexts (so the data processing has to enable going up and down the chain of a discussion board thread)
• Multimedia files uploaded in discussion boards have to be annotated in order to query (using text)
• Tree is helpful for a limited set of occurrences but becomes too complex and unwieldy after a certain point
• Downloaded word tree is not interactive or navigable
68
69
(3) theme and sub-theme extractions
• Automated extraction of identified themes and sub-themes (concepts) from a text set, based on an undisclosed algorithm (or chain of algorithms) but classically based on word relatedness and similarity clustering: principal component analysis or factor analysis, probabilistic latent semantic indexing, latent Dirichlet allocation, non-negative matrix factorization, and other approaches
• Variance based on whether extracted themes are applied to sentences, paragraphs, or cells (the level of granularity of the coding)
• Subthemes (as extracted by NVivo 11 Plus) tend to be literally related words to the extracted theme (usually with occurrences of those key theme terms in the subthemes, which are generally noun phrases)
• Principal component analysis or factor analysis usually extract noisier topics that require some human interpretation to label (by contrast)
70
(3) theme and sub-theme extractions (cont.)
• “Topic modeling” extracts the latent, underlying semantic structure of a text or texts (using machine-based inductive logic)
• Process captures topic distributions based on occurrences of a particular term
• Extracted themes and sub-themes must be approved by the human user before they are coded to nodes in NVivo 11 Plus
• Themes may be semantic terms as well as verbs (in various stemmed forms) and other words
• Autocoded nodes (based on the extracted themes) may be edited and revised and even omitted
• Based on unsupervised machine learning
71
72
(3) theme and sub-theme extractions (cont.)
Pros
• Summary data based on themes and literally connected subthemes (subthemes show recurrence of the theme words)
• Objectivist basis for theme and sub-theme extractions (without theory-based selectivity)
• Requires exploration of the nodes to see what has been captured in the particular theme

Cons
• May include non-content, non-semantic words
• Does not offer tunable parameters like the setting of “k” topics
73
(3) theme and sub-theme extractions (cont.)
Pros
• Offers a topic-based histogram for source publications and combined text corpora, with latent topics identified (and latent topic frequency distributions)
• Theme text sets may be analyzed further (with the extraction of sentiments, or data-queried, etc.)
74
(3) theme and sub-theme extractions (cont.)
Pros
• Easy to export to Excel or other software tools for other data visualizations (like spider charts)
75
Machine extracted and human-vetted
76
77
78
79
80
Fast-read of document set with automated theme and sub-theme extraction
• Applied to a text corpus (or corpora) of documents
• A resulting matrix with identification of relevant documents per found themes and sub-themes (based on observed words)
• Result: a co-occurrence table of documents with extracted themes
• Extracted dimensions of the “domain” if the articles were a complete representation of the domain (but not in this example, and generally not, because of the prohibitive size of such document sets)
• Resembles a term-document matrix, which is a middle step in a classic latent semantic analysis (LSA) sequence
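The term-document matrix mentioned as the LSA middle step is simple to sketch in pure Python (the two-document mini-corpus is hypothetical; a real LSA pipeline would follow this with weighting and singular value decomposition):

```python
import re

def term_document_matrix(docs):
    """Rows = terms, columns = documents; each cell = count of term in doc."""
    tokenized = [re.findall(r"[a-z']+", d.lower()) for d in docs]
    terms = sorted({w for doc in tokenized for w in doc})
    matrix = [[doc.count(term) for doc in tokenized] for term in terms]
    return terms, matrix

terms, tdm = term_document_matrix(["reading reading online", "online courses"])
```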
81
Fast-read of document set with automated theme and sub-theme extraction (cont.)
• Can further create article histograms (of contents) by exporting top column headers and specific single rows (articles) (and mapping in Excel)
• Can run article histograms in the sequence of publication to capture a sense of the dynamism related to that topic over sequential time
• Can create topical summary tendencies for the article set
82
83
Example of an article histogram
84
Themes in the full set
85
(4) matrix queries
• Cross referencing sources, codes, respective coding by individuals, and other data
• May be queried across types of data
• May be used to identify recurring concepts in individual’s communications
• May be used to capture student interactions with each other
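A matrix coding query is essentially a cross-tabulation of coded segments; a minimal sketch (speaker names and code labels are hypothetical):

```python
from collections import defaultdict

def matrix_query(coded_segments):
    """Cross-tabulate coded segments: rows = speakers, columns = codes."""
    table = defaultdict(lambda: defaultdict(int))
    for speaker, code in coded_segments:
        table[speaker][code] += 1
    return {speaker: dict(codes) for speaker, codes in table.items()}

# Each pair is one coded segment of discussion-board text
segments = [("ana", "motivation"), ("ana", "motivation"), ("ben", "workload")]
table = matrix_query(segments)
```

Reading a row shows each individual’s recurring concepts; reading a column shows which students a code touches.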
86
(4) matrix queries (cont.)
Pros
• Enables identification of data patterns by cross-referencing various variables and sources from the uploaded data
• May query student work as sources and pull theme extractions from their work

Cons
• Requires coded data for matrix coding queries
• Requires proper data processing
• Requires proper ingestion of data for clear analytics
87
88
(5) sentiment analyses
• Sentiment is traditionally conceptualized as a positive-negative polarity (either as a binary positive or negative or as a continuum: “very negative, moderately negative, moderately positive, very positive” in NVivo)
• Assumption that sentiment will affect behavior (to some degree) • Depending on the structure of the textual data, the sentiment coding
may be applied at sentence, paragraph, or cell levels
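NVivo’s sentiment dictionary is internal and not published; as an illustration of the general lexicon-based approach and the four-level scale above, a toy scorer might look like this (lexicon weights and thresholds are invented for the example):

```python
# Toy lexicon -- NVivo's internal sentiment dictionary is proprietary;
# these weights and the thresholds below are invented for illustration.
LEXICON = {"great": 2, "helpful": 1, "confusing": -1, "terrible": -2}
LABELS = ["very negative", "moderately negative",
          "moderately positive", "very positive"]

def score_sentence(sentence):
    """Sum lexicon weights over the words of one coding unit (a sentence)."""
    return sum(LEXICON.get(w.strip(".,!?").lower(), 0) for w in sentence.split())

def label(score):
    """Map a raw score onto the four-level scale; zero stays uncoded."""
    if score == 0:
        return "neutral"
    if score <= -2:
        return LABELS[0]
    if score < 0:
        return LABELS[1]
    if score < 2:
        return LABELS[2]
    return LABELS[3]

result = label(score_sentence("The examples were great and helpful"))
```

The sketch also shows why such coding misses context: it scores words in isolation, which is exactly the limitation listed in the Cons below.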
89
(5) sentiment analyses (cont.)
Pros
• Sentiment coding (positive-negative) may provide some general insights
• Sentiment coding text sets may be accessed to see what text was coded where; uncoded or “neutral” text may be visualized in proportion to the sentiment coding in hierarchical graphs, but the underlying “neutral” text sets cannot be extracted
• Sentiment coding may be uncoded and re-coded
• Original source texts from each sentiment category may be accessed for further exploration

Cons
• Coding for sentiment does not capture double negatives, irony, sarcasm, humor, symbolism, tropes, and other nuances of communication
• Sentiment coding by machine may misread words that are highly positive or negative in the internal sentiment dictionary (without applying contextual awareness)
• Pre-coded sentiment dictionary (set) is not currently revisable in NVivo 11 Plus
90
91
92
93
94
95
(6) autocoding by existing pattern
• This “autocoding by existing pattern” enables machine emulation of human coding
• The human has to create the data nodes (categories) and provide sufficient exemplars for each category
• The “autocoding by existing pattern” should be set for the proper level of aggressiveness in the coding (by “NV” or NVivo)
• All autocoding should be double-checked for accuracy and corrections (such as by uncoding / recoding)
96
(6) autocoding by existing pattern (cont.)
Pros
• Captures the human-based coding of a text set
• Extends the researcher’s coding “fist” (style) to as-yet uncoded data

Cons
• Requires a sufficient amount of coded data
• Does not enable the creation of new nodes (beyond what the human coder has created)
• Requires human oversight
97
98
(7) geolocational mapping
• This feature places pins of various locations on a map • This requires disambiguated place information to work effectively
99
(7) geolocational mapping (cont.)
Pros
• Provides a geographical and spatial sense to texts
Cons
• Geographical information in texts can be ambiguous and noisy (not effective for specific locational mapping); computer-captured latitude-longitude data are more accurate
• Is a rough map (based on a MapQuest map understructure); would work better using Google Maps or ArcGIS maps
100
101
(8) sociograms
• Node-link diagrams that show individual entities (individuals) and their relationships to others based on interactivity (messaging)
• Sociograms align with online course discussion boards, which show instructors / teaching assistants / learners and their relational messaging
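The reply structure of a discussion board can be turned into basic sociogram data in a few lines. The (author, replied-to) record shape and the names below are assumptions for illustration:

```python
from collections import Counter, defaultdict

# Sketch: build sociogram edge data from discussion-board reply metadata.
# Each record is (author, replied_to_author); the field shape and names
# are illustrative assumptions, not an export format of any one LMS.
def build_sociogram(replies):
    edges = Counter()              # directed edge weights: (from, to) -> count
    neighbors = defaultdict(set)   # one-degree (direct ego) neighborhoods
    for author, target in replies:
        edges[(author, target)] += 1
        neighbors[author].add(target)
        neighbors[target].add(author)
    return edges, neighbors

replies = [("ana", "ben"), ("ben", "ana"), ("cal", "ana"), ("ana", "ben")]
edges, neighbors = build_sociogram(replies)
```

The `neighbors` mapping corresponds to the one-degree ego neighborhood NVivo captures; extending to 1.5 or 2 degrees means following those sets outward, which is where dedicated network tools become more convenient.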
102
(8) sociograms (cont.)
Pros
• Shows interrelationships (with "relating" defined as interaction / intercommunications / replies / retweets / shares / likes-dislikes)
• Enables capturing relationships at one degree (the direct ego neighborhood) in NVivo; relationships at 1.5 deg., 2 deg., and beyond may be captured using other tools
Cons
• Shows only general information about entities and relationships
• Does not include profile information
• Does not include the shared messaging (need to go to the underlying discussion board posts)
• Does not offer the more eye-catching sociogram visualizations that other tools can offer
103
104
105
(9) time patterns
• There are a variety of possible time patterns:
  • slice-in-time data (at a particular moment),
  • time periods or phases (different ways to slice time), and
  • continuous changes over time
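As a sketch of the "time periods or phases" pattern, posts can be binned into term weeks. The dates and the one-week bin size here are illustrative assumptions:

```python
from datetime import date

# Sketch: bin discussion posts into weekly phases to see when a board is
# most / least active across a term. Dates and bin size are illustrative.
def weekly_counts(post_dates, term_start):
    counts = {}
    for d in post_dates:
        week = (d - term_start).days // 7 + 1  # week 1 = first week of term
        counts[week] = counts.get(week, 0) + 1
    return counts

posts = [date(2016, 1, 20), date(2016, 1, 21), date(2016, 2, 2)]
print(weekly_counts(posts, term_start=date(2016, 1, 18)))  # {1: 2, 3: 1}
```

As the Cons on the next slide note, the work is mostly in getting timestamps out of the discussion board in a usable shape; once they are, a spreadsheet tool may serve just as well.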
106
(9) time patterns (cont.)
Pros
• Enables understandings of how learners interact on a discussion board during a term
Cons
• Requires data to be set up a particular way in order to represent time; may be more effective in a different tool than NVivo (such as Excel)
107
108
Application of distant reading tools to discussion boards
109
Application to discussion boards
(1) word frequency counts
• What are the common, popular words used in discussion boards based on certain topics, questions, image prompts, cases, or videos?
• What related word clusters may be found in terms of proxemic co-occurrence? Which terms are related, and why? Are these relations justified by facts, or are they a product of inaccurate assumptions?
(2) text searches
• What are the common expressed ideas used in discussion boards based on certain alphanumeric data (names, dates, phrases, events, and others)?
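A minimal sketch of queries (1) and (2) outside NVivo, assuming posts are plain strings; the stoplist and sample posts are illustrative:

```python
import re
from collections import Counter

# Sketch of the two simplest "distant reading" queries over a thread:
# a word frequency count (with a stoplist) and a regular-expression
# text search. The stoplist here is a small illustrative sample.
STOPLIST = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

def word_frequency(posts, top_n=10):
    """Most common non-stoplist words across all posts."""
    words = re.findall(r"[a-z']+", " ".join(posts).lower())
    return Counter(w for w in words if w not in STOPLIST).most_common(top_n)

def text_search(posts, pattern):
    """Return the posts matching a phrase, name, date, etc."""
    return [p for p in posts if re.search(pattern, p, re.IGNORECASE)]
```

A frequency list like this is the raw material behind word clouds and tree maps; the stoplist matters a great deal, since without one the top terms are almost all function words.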
110
Application to discussion boards (cont.)
(3) theme and sub-theme extractions
• What are the common machine-identified concepts and sub-concepts used in particular discussion boards / threads?
  • And what is not mentioned? What do we expect to see that we don't see (even after a precise "text search")? Why?
• How do these align with human-identified themes and sub-themes? Why?
• Based on the underlying text sets, how are these themes and sub-themes understood? How may misconceptions be addressed by the instructor and the learners together?
• In terms of clustering by word similarity, whose writings are most similar? Does that show who is talking to whom? Does word similarity across users (their sources) show potential emulation (or even potential plagiarism)?
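The "clustering by word similarity" question can be sketched with pairwise Jaccard similarity over each learner's combined writing. This is a rough stand-in for NVivo's word-similarity clustering, and the learner names and texts are hypothetical:

```python
# Sketch: "whose writings are most similar?" via pairwise Jaccard
# similarity of vocabularies. A rough stand-in for word-similarity
# clustering; an unusually high score flags a pair for a close read
# (possible emulation, or even plagiarism), not a conclusion.
def vocab(text):
    return {w.strip(".,!?").lower() for w in text.split()}

def jaccard(a, b):
    """Shared vocabulary as a fraction of combined vocabulary (0..1)."""
    va, vb = vocab(a), vocab(b)
    return len(va & vb) / len(va | vb)

def most_similar_pair(writings):
    """writings: {learner: combined posts} -> ((a, b), similarity)"""
    names = sorted(writings)
    pairs = [((a, b), jaccard(writings[a], writings[b]))
             for i, a in enumerate(names) for b in names[i + 1:]]
    return max(pairs, key=lambda p: p[1])
```

As with the machine-extracted themes above, a similarity score is a prompt for closer reading, not evidence on its own.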
111
Application to discussion boards (cont.)
(4) matrix queries
• Are there identifiable patterns between how different groups of students stand on various issues?
• Are there identifiable patterns between coded variables from the text sets?
(5) sentiment analyses
• What are the amounts of sentiment (very negative, moderately negative, moderately positive, and very positive) seen in the sentences / paragraphs of the discussion board interactions?
  • Why are certain discussions trending positive or negative in the machine-observed ways?
• What are the texts in each of the coded sentiment categories?
• What are some possible inferences from these observed expressed sentiments?
112
Application to discussion boards (cont.)
(6) autocoding by existing pattern
• What sorts of summary data assertions may be made about the text sets from the discussion boards based on the following:
  • Human emergent coding
  • Human a priori coding [based on theory/ies, model(s), framework(s), and other methods]
  • A mixture of emergent and a priori coding
(7) geolocational mapping
• Are there some geographical patterns to the expressed ideas or metadata from uploaded images / audio / video on the discussion boards?
113
Application to discussion boards (cont.)
(8) sociograms
• How do the learners in an online class form relationships, interact, and collaborate based on co-commenting on a discussion board? Replying?
• If there are identified cliques, how do these form? If there are dyadic, triadic, and other types of motif relationships, how do these form? How long do these ties last?
(9) time patterns
• When are discussion boards most active during the assignment period? When are they least active? Why?
• What sorts of prompts encourage more time-bursty communication behaviors? Why?
114
115
116
Discussion
117
Walking this back a little…
• Benefits to deep knowledge of the data set (attainable by close reading of some parts of the text) even with the affordances of distant reading
• Deep knowledge enables a more accurate understanding of the results of “distant reading”
• Benefits to have a human touch in supporting student learning in an age of computational affordances
• People respond to other people much more positively (and negatively) than to computers
118
Walking this back a little… (cont.)
• Benefits to wielding distant reading tools with a deep and accurate understanding of what is going on with the computer, the software, and the data…and how to read the outputs without being mechanistic or unthinking
• Is reading ease always desirable? • What is the writing context? The writing genre? Expectations?
• Benefits to having some baseline understandings of the types of text sets and typical quantitative readings
119
Some other “distant reading” tools
120
Some other “distant reading” tools, too
• There are a range of other software tools that enable various ways to "distant read":
  • LIWC (Linguistic Inquiry and Word Count, pronounced "luke," U of Texas)
  • AutoMap and ORA-NetScenes (word relationships in a document or a text corpus, CASOS, Carnegie Mellon University)
  • RapidMiner Studio (data mining and modeling, RapidMiner, Inc.)
  • the Natural Language Toolkit (NLTK) package in Python
  • natural language processing packages in R, and others
• A number of the above are free for non-commercial academic use, and one is available at very low cost for a perpetual license.
121
Some other “distant reading” tools, too (cont.)
• All software programs require study and practice to wield effectively.
• MS Word has a built-in readability calculator that reports Flesch Reading Ease and Flesch-Kincaid Grade Level.
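The Flesch Reading Ease formula itself is published, so a rough version can be sketched outside Word. The syllable counter below is a simple vowel-group heuristic, so scores may differ slightly from MS Word's:

```python
import re

def count_syllables(word):
    """Rough heuristic: count vowel groups; real syllable counters differ."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean easier text (90+ is very easy)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Published formula: 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))
```

As the deck's "walking this back" slides ask, a high score is not automatically desirable; the writing context and genre shape what a reasonable score looks like.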
122
Once upon a time…
…The End 123
Conclusion and contact
• Dr. Shalin Hai-Jew• iTAC, Kansas State University • 212 Hale / Farrell Library • 785-532-5262 • [email protected]
• Thanks to Dr. Jana R. Fallin and her team at the K-State Teaching & Learning Center for accepting this digital poster session for the Big XII Teaching and Learning Conference (2016)! Thanks to Dr. Rebecca Gould for supporting this work.
• The content network graphs on Slides 2 and 3 were created using NodeXL Basic, an open-source and free Excel add-on available on the Microsoft CodePlex site.
• The “long tail” image on Slide 61 is an open source image from Wikimedia Commons. The text overlays are mine.
• The presenter has no tie with QSR International, the makers of NVivo 11 Plus.
124