TRANSCRIPT
Using “Distant Reading” to Explore Discussion Threads in
Online Courses
Digital Poster Session
Big XII Teaching and Learning Conference (3rd Annual)
Theme: Transformational Teaching and Learning
Kansas State University
June 2 – 3, 2016
“reading” related tags network on Flickr (1.5 deg.)
2
“Reading_(process)” article network on Wikipedia (1 deg.)
3
Overview
• In this age of mass data, “distant reading” has come to the fore as a way to deal with large amounts of text data—including text from student discussion threads in online courses. Kansas State University has a site license for NVivo 11 Plus, software that enables multimedia data curation and qualitative and mixed-methods data analysis. Two new features in NVivo—sentiment analysis and theme extraction (topic modeling)—enable users to “distant read” large amounts of text to extract some early insights.
• What are the expressed sentiments of learners when discussing a particular issue? Do these trend positive or negative?
• What topics or themes or concepts are brought up by students given a certain discussion thread prompt?
• What do the sentiment and topic insights suggest about where students are at with a particular issue? Are there latent (hidden) insights?
4
Overview (cont.)
• These new features, in combination with text frequency counts (with related text clustering), text searches, and other text data query capabilities (and related data visualization capabilities) in NVivo, enable distant reading for use in online courses. This digital slideshow will introduce NVivo 11 Plus (a local software tool with both Windows and Mac platform versions) and walk through how it may be applied to textual data extracted from an online course.
• Understanding what students are thinking is a critical part of transformational teaching and learning. Using computational means to listen and to hear is important to this end.
5
“Distant reading”
6
Origin of the “distant reading” concept
• Proposal to use computational means to process more than the local canon but also works of world literature, digitized texts, and born-digital contents in the Internet era
• Characterized classic reading as a solemn and serious endeavor, when what is needed is “a little pact with the devil”: reading texts with distance in order to study “devices, themes, tropes—or genres and systems” (n.p.)
• Franco Moretti (“Conjectures on World Literature,” 2000)
• An abundance of digital texts with many comprising “the great unread” and so requiring computational means to decode text / explore meaning / “distant read”
• Franco Moretti (Distant Reading, 2013)
7
Origin of the “distant reading” concept (cont.)
• Willingness to let computers / machines “distant read” for people?
• Reluctant readers in the present age?
• Low grade-level reading assumptions for the broader population?
  • Surveys written to the 3rd- and 5th-grade levels
  • Newspapers written to middle-school levels
• Moves to reading online and letting go of traditional news
• Inefficient human reading speeds
  • 200 words per minute (wpm) for close academic reading
  • 800 wpm for skimming and scanning
• Challenges with human attention, focus, comprehension / accuracy, and memory in terms of reading
8
Light history of statistical analysis of text
• Over 50 years of statistical analysis of text (but initially with human analysts)
• Three general methodologies
  • Thematic content analysis (by human judges early on)
  • Word pattern analysis (as part of artificial intelligence work)
  • Word count strategies (Pennebaker, Mehl, & Niederhoffer, 2003)
• All three methodologies extended and refined through computational means
9
General sequence of “distant reading”
1. Collection of relevant texts (challenges with access)
2. Data formatting and cleaning (data pre-processing, including file formatting, searchable text, de-duplication of files, spell checks, and other types of work)
3. Textual markup via tagging (sometimes with XML, for example)
4. Running data queries and analytics
5. Outputting relevant data visualizations (can start with exploration using data visualizations)
6. Analysis and interpretation
7. Finalization and presentation
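The cleaning and querying steps in the sequence above can be sketched in a few lines. This is a minimal Python illustration with hypothetical helper names, not NVivo’s internal process:

```python
import re
from collections import Counter

# Hypothetical helpers sketching steps 2 and 4 of the sequence above;
# real projects would also need spell checks, markup (step 3), and visualization.
def clean(raw_posts):
    """Step 2: strip markup, normalize case, de-duplicate posts."""
    seen, cleaned = set(), []
    for post in raw_posts:
        text = re.sub(r"<[^>]+>", " ", post).lower().strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

def query_word_counts(posts):
    """Step 4: one simple data query -- word frequency across the corpus."""
    return Counter(w for p in posts for w in re.findall(r"[a-z']+", p))

posts = ["<p>Reading is thinking.</p>", "reading is thinking.", "Distant reading scales."]
corpus = clean(posts)              # the duplicate post collapses to one
counts = query_word_counts(corpus)
```

Steps 5–7 (visualization, interpretation, presentation) then work from summary objects like `counts`.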
10
Comparisons and contrasts between “distant” and “close” reading
11
Some comparisons and contrasts between distant and close reading
Distant Reading
• Text interpretation based on computerized counts of words and n-grams
• Counts may be done by unsupervised machine learning
• Counts may be done by supervised machine learning
• Counts may be done by a combination of computational and human methods (usually via customized human tagging and counting)

Close Reading
• Text interpretation based on interactions between a human reader (with personality, intellect, emotions; life experiences; education; training) and a text / text corpus
• Evocative and descriptive details affect the reader and activate various parts of the brain…
  • Trigger the somatosensory cortex
  • Engage the reader’s imagination, mind, emotions, and prior knowledge
  • Engage reader empathy and sympathy
  • Challenge the reader’s worldview and sense of how things should be and how things are
12
Some comparisons and contrasts between distant and close reading (cont.)
Distant Reading
• Summary data of a text or text corpora
• Reductionist (of language data dimensionality)
• Indirect
• Scalable (but not to “big data” standards without access to cloud computing and special analytics approaches)
• Machine-efficient and fast
• Objectivist, reproducible, repeatable
• Comparable to other similar text documents and collections

Close Reading
• Activates the “inner language” that people use to think (vs. the “outer language” used to communicate with others)
• People “simulate to comprehend”
  • See Rawi Nanakul’s “Shaping Data Stories with Neuroscience,” May 24, 2016, in Data Science Central’s webinar series
• Informed by human-based proprioception (and the somatosensory system) / embodied reading
  • See David Gelernter’s comments on the import of proprioception in an AI article in Time magazine
13
Some comparisons and contrasts between distant and close reading (cont.)
Distant Reading
• Data mining for discovery
• Data exploration for patterns (at various levels of granularity)
• Specific data queries for specific text types and text sets
• Descriptions of genres (news texts, short story collections, letters, terror messaging, and others)

Close Reading
• Tends towards contextual awareness
• Assumption of a human interpretive lens with relativistic assumptions
• Is affected by who the reader is and what informs his / her literary or reading palate
• Goes beyond surface meanings of a text
  • Includes subtext
  • Includes a lot of understanding outside of and beyond the text (such as cultural references)
14
Some comparisons and contrasts between distant and close reading (cont.)
Distant Reading
• May be applied to anything in text format
• Built on the analysis of “natural language” (evolved vs. constructed language)

Close Reading
• Analytics presented with a personality framework (through people, through human mediation)
• Requires a certain level of human reading background and cognition to achieve certain types of high-level reading
• Engaging the mind (cognition, emotions) and human senses (sight, smell, taste, touch, hearing, and proprioception)
15
Some comparisons and contrasts between distant and close reading (cont.)
Distant Reading
• Comparisons across texts, corpora, and (cultural, time, and other) contexts
• Stylometric analysis of author writing
• “Disembodied” and machine-based decoding of text

Close Reading
• Intimate to the human experience
• Variant meanings depending on the reader and on context (texts may be “read into”)
• Conceptualized as both phonological (sound-based, as in read-aloud and “sounding out” language) and orthographic (written / printed symbol-based)
16
Some comparisons and contrasts between distant and close reading (cont.)
Distant reading
• Uses data visualizations to depict summary data
  • Word clouds
  • Word trees
  • Data tables
  • Cluster visualizations (2D, 3D, 4D)
  • Dendrograms
  • Tree maps
  • Intensity matrices (matrix factorizations)
  • Node-link diagrams
  • Topic histograms
  • Sociograms
  • Geographical maps, and others

Close reading
• Canons of text informed by collective processes of selection that are often serendipitous and not necessarily based on “merit” once texts reach a particular level of recognition (Klosterman, But What if We’re Wrong? 2016)
• Informed by social contexts, and subtle and not-so-subtle influences
• Informed by collective memory and collective forgetting
17
Some research enablements of both types of reading
Distant reading
• Empirical from-world (in vivo) approaches
  • Analysis of from-world text data sets
• Experimental lab (in vitro) approaches (control groups, interventions, outcomes)
  • Analysis of assigned types of writing
• and others

Close reading
• Literary analysis
• Cultural analytics
• Remote profiling
• Reviews
• and others
18
Strategic application of “distant” and “close” reading
19
Controversies related to “distant reading”
20
Controversy: “Not reading”
• Distant reading described as “non-reading” and “not-reading,” also as “data mining” and “machine learning”
• “Distant reading” without an inherent human sense of appreciation for reading
• Disembodied, decontextualized, and non-ordered computational text analysis (such as computation-based “bag of words” approaches)
• How does the “great unread” benefit from “distant reading” capabilities that do not involve clear human investment in the reading or clear appreciation for the writing? (argument)
• Counter-intuitive approach
21
Controversy: Computer incursion into human spaces
• Sense of computational incursions in distinctly human spaces
  • Human writing as a human activity done purely for human consumption and usage
• Over-reliance on computational means for text interpretation
  • Computational means seen to be highly mechanistic and rigid
• If distant reading now, distant algorithmic writing (like SCIgen) close behind?
  • Machine writing based on algorithms, sampling from other sources, rules of organization, emulation of genre and forms (may not make sense in a human-readable way)
• Human writing as individual and / or social, something “uniquely” human…?
22
Controversy: Computer incursion into human spaces (cont.)
• Changing up the standards of literary research
  • Completeness of text sets
  • Accuracy (vs. human bias and subjectivity)
  • Reliability
  • Reproducibility
  • Machine vs. human insight
• A summary of the argument: Computers “are too rigid, formulaic, and cold to pick up the subtle nuances, emotions, and creative brilliance manifested in literature” (Graesser, Dowell, & Moldovan, “A computer’s understanding of literature,” 2011, p. 3)
23
Controversy: Epistemology
• Challenges involved in the apparent positivism and research reproducibility assumptions of “distant reading” and concepts of epistemology (or where knowledge comes from)
• vs. human-focused interpretivism, social constructivism, and emphasis on uniqueness and originality…or many of the human aspects of qualitative research
• Goes against various traditions of text analytics: narratology research, various thematic analytical approaches, and others
24
Controversy: Hard work
• Hard analytical work to “distant read”
• High learning curve to understand various underlying text processing algorithms and XML tagging systems (depending on what is used)
• Challenges to acquire copyrighted texts
• Effortful digitization of paper-based texts in a searchable way
• Time-intensive requirements to clean and prepare data
• Need for careful human analysis of the extracted data
25
“Non-consumptiveness” in some types of computation-based text research
• “Non-consumptive” distant reading involves access to summary data only and no access to the underlying source (text) data
• Non-consumptiveness enables researchers to have some access to text, but not in a way that enables them to recreate the text computationally, because of the risks of contravening copyright of the accessed texts
• To protect copyright, “digital surrogates” and “shadow datasets” are made available
26
Examples of non-consumptive text research sources on the Web
• Example: Google Books Ngram Viewer only offers access to ngram counts, line graph visualizations, and links to seminal years (with some pages of potentially relevant information), but not the underlying source dataset (because of copyright and other issues); uses a shadow dataset for the Ngram Viewer queries
• Example: HathiTrust Digital Library offers opportunities for non-consumptive research
27
Controversy: Non-consumptiveness
Con non-consumptive research of text(s)
• Non-consumptiveness means that researchers do not have access to the underlying texts but can only capture summary patterns and data snippets.
• Researchers cannot explore the underlying data more directly to understand particular effects within the context of the document.
• Researchers have partial information; they are seeing through a glass darkly.

Pro non-consumptive research of text(s)
• Big data will disallow any real direct intimacy with the data from close reading.
• The on-ground reality is a de facto situation of essential non-consumptiveness as well.
• It is better to know something than not…even if the full text set is not available.
28
Controversy: Non-consumptiveness (cont.)
Con non-consumptive research of text(s)
• Non-consumptive protections assume that researchers will hack systems in order to break copyright instead of using the copyrighted texts for relevant research.

Pro non-consumptive research of text(s)
• Waiting until texts are fully in the public domain and not under copyright protection means loss of research opportunities.
29
Current state of affairs
• Since the early 1990s, advances in:
  • Computational linguistics
  • Corpus linguistics
  • Discourse processes
  • “statistical representations of world knowledge and meaning” (Graesser, Dowell, & Moldovan, “A computer’s understanding of literature,” 2011, p. 3)
• Enablements for the “computational science of literature” (Graesser, Dowell, & Moldovan, 2011, p. 4)
30
Current state of affairs (cont.)
• Definitions and practices: What exactly is “distant reading”? What are the most effective ways to achieve this?
• Room for both? Can distant and close reading complement and enhance each other?
• Distant reading will not replace close reading; close reading will not replace distant reading
• Each may use the same texts, but they cover different ground and surface different insights and provide different functions
• Neither camp (advocates for one or the other) claims to have the definitive approach to reading, the sole correct way of reading, or the whole story
31
Current state of affairs (cont.)
• Split practices: Some humanists going to the “digital humanities,” and others not ceding any traditional ground
• Others use multiple methods to read and use each to complement the other
32
Current state of affairs (cont.)
• Don’t call it “distant reading” but rather “computational text analysis”
• Early methods have been around since the 1990s
• Computational methods to model texts are necessary for handling large-scale document and text corpora to enhance findability and searchability
• Various types of statistical and computational methods are applied to identify hidden themes in textual (and visual) works to inferentially summarize contents in an unsupervised machine learning way
• Computational methods have been used to identify similar works—for recommender systems (to aid people in finding the research they are seeking)
33
Applications in the research literature
34
Some electronic sourcing of texts for “distant reading” analysis
• Social network wall postings
• Microblogging text streams and text networks
• Online journaling sites
• Content sharing social media platforms (and commenting streams / networks)
• Crowd-sourced online encyclopedia articles, article networks, and collections
• Web log (blog) entries
• Article repositories and databases
• Content descriptions from online repositories
• Articles from online news sites
• Email networks
• Slideshows
• Tags for user-generated imagery and / or videos
• FAQs (frequently asked questions)
35
Some electronic sourcing of texts for “distant reading” analysis (cont.)
• Descriptors for user-generated contents
• Massive open online course (MOOC) discussion boards and threads
• Downloadable online text datasets (such as from published research)
• Short messaging service (SMS) texts
• Chat room postings
• Online ads
• Social news services
• Stories, poems, scripts, plays, and other literary genre contents from online publishing sites
• Codes from online pastebins and open sites
• Documents, reports, memos, minutes, and other work-related contents
36
Some electronic sourcing of texts for “distant reading” analysis (cont.)
• Government cables (such as from leak sites)
• Book manuscripts on open sharing sites
• Informal gray literature shared on the Web
• Profiles on dating, professional, and other purpose-based social sites
• Song lyrics
• Recipes, and others

Also
• Anything digital (or analog) that is transcribe-able or able to be versioned into text
• Anything analog that is digitize-able (scannable) to searchable text
37
Some actual distant reading enablements in the academic research
• “Reading” to surface latent insights, such as the ability to…
  • Look for hidden patterns in texts and language use
  • Extract self-concept and social relationships
  • Predict friendships and length of romantic relationships (through textual mirroring)
  • Study social groups and interactions
  • Explore sentiment, emotion, and personality based on language use patterns
  • Profile writers based on their writing
  • Describe writing based on stylometry (quantitative metrics of writing style)
  • Pursue the answering of research questions through purposeful and tactical distant reading
  • Study collective social sentiment during social upheavals
38
Some actual distant reading enablements in the academic research (cont.)
• Ability to… (cont.)
  • Study stage-of-life issues
  • Explore benefits of writing to deal with health issues / social traumas / recovery from disasters
  • Detect deception and differentiate actual vs. faked works (with varying levels of confidence)
  • Differentiate between the writing of poets and songwriters who commit suicide vs. those who don’t (observed from-world data)
  • Predict various phenomena: future work-place performance, threat assessment and intentionality, sense of aggression, and others
  • Study and describe works from various cultures and sub-cultures
39
Some actual distant reading enablements in the academic research (cont.)
• “Reading” (textual interpretation / decoding) at different levels of analysis
• Ability to pull out particular aspects of texts
• Ability to extract insights that would not be possible otherwise
• “Reading” at computational speeds • “Reading” at massive scales
40
Some general practical examples
• Pulling out dialogue in order to understand characters and relationships
• Identifying collective discussions (based around themes) around a particular topic (in periods in history and contemporaneously)
• Studying how concepts / terms evolve over time
• Mapping a particular domain field for main directions and foci (as well as outlier research and concepts)
• Mapping time relationships in a text to understand plot
41
Some general practical examples (cont.)
• Mapping texts for geographical locations
• Studying symbolism across works by an author, texts in a genre, or written works in a culture
42
Distant reading applied to discussion board data
43
44
Some types of data in discussion boards
Content Data
• Communicators (by name)
• Textual content of communications
• Names of discussion threads
• Discussion threads
• Multimedia contents

Trace / Traffic Data
• Interaction data (replies, likes, and others)
• Time of communications
• Social networks
• Locational coordinates of user (or contents, like imagery or video)
• Technological specifications of files
45

Metadata
46
Distant reading (writ large) applied to discussion board data
Some Askable Questions
• What are dominant issues on a certain discussion thread(s)? What are the outlier issues on certain discussion threads (the topics that are part of the “long tail”)?
• What sorts of language are used in addressing particular topics? Why?
• Who is interacting with whom? Around what issues? What types of cliques may be observed in the online course social network?
• What are expressed sentiments based on a certain discussion thread(s)?
47
Distant reading (writ large) applied to discussion board data (cont.)
• How are particular terms used in various contexts in the discussion? What are the various gists and word senses?
• How does activity in discussion (message) boards change over time?
• What sorts of prompts are the most effective to encourage online discussions?
• What sorts of instructor interventions are most constructive (based on learner responses)? Which are least constructive? Why?
• What sorts of messages spark discussion? What sorts of messages shut down discussion? When is each type of message preferred?
48
Distant reading (writ large) applied to discussion board data (cont.)
• What are observable word relationships in a discussion board(s)? What do these (inter)relationships indicate?
• Are there hidden or latent topics in the discussions based on the observed words? What insights are there?
• What locations are indicated in uploaded images shared on a discussion board(s)? How do these map? Are there spatial patterns?
• And others…
49
In NVivo 11 Plus
A qualitative data analytics suite
50
Some “distant reading” features of NVivo 11 Plus
(1) word frequency counts
(2) text searches
(3) theme and sub-theme extractions (topic modeling)
(4) matrix queries
(5) sentiment analyses
(6) autocoding by existing pattern
(7) geolocational mapping
(8) sociograms
(9) time patterns
51
Additional notes about NVivo and distant reading
• NVivo 11 Plus enables a half-dozen base languages for projects
• Any language representable in the UTF-8 character set may be ingested into NVivo 11 Plus and accurately represented
  • These include all languages representable on the Web and Internet, both natural languages and constructed languages
• Some of the data will have to be pre-processed outside of NVivo 11 Plus before running particular data queries, autocoding, and data visualizations
• Some LMSes may allow more readily accessible data for easier download, but current-state LMSes apparently do not make this easy
52
Demos using NVivo 11 Plus
53
A Tweetstream dataset to stand in for discussion board data
• Publicly released information
• Has entities (social media users)
• Has commenting and replies
• Is based on @officeofedtech Tweetstream (an educational focus)
  • 2,796 captured records (including retweets) using NCapture of NVivo
  • 5,601 Tweets, 1,130 following, 109,955 followers, 687 likes, 9 lists, at the time of the data extraction; joined in July 2010
  • U.S. Department of Education (http://tech.ed.gov)
54
55
(1) word frequency counts
• Machine count of the 1000 (or fewer) top content words in the selected text set
  • May adjust the level of words
  • May add words to the stopwords list (at either the global project level or at the local run / local macro level)
  • Half-dozen base languages may be applied (but only one at a time)
• Also proportionality of a word’s appearance in relation to other counted words from a document, node, or text set
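Outside of NVivo, the same kind of stop-listed frequency count with proportionality can be sketched in plain Python (the stopword list here is an illustrative stand-in for a project-level list):

```python
import re
from collections import Counter

# Illustrative stand-in for a project-level stopword list
STOPWORDS = {"the", "a", "an", "and", "to", "of", "is", "in"}

def word_frequencies(texts, top_n=1000):
    """Count content words; report each word's share of all counted words."""
    tokens = [w for t in texts for w in re.findall(r"[a-z']+", t.lower())
              if w not in STOPWORDS]
    counts = Counter(tokens).most_common(top_n)
    total = sum(c for _, c in counts)
    return [(word, count, round(count / total, 3)) for word, count in counts]

rows = word_frequencies(["Reading the thread", "the thread is long"])
```

Each row carries the word, its raw count, and its proportion of all counted words.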
56
(1) word frequency counts (cont.)
Pros
• Identification of focuses of interest (more mentions)

Cons
• Word counts captured at word level, not phrase level; not bigrams, not trigrams, not four-grams; not “thought units,” etc.
• Word counts do not provide gist or context
• Word counts do not capture sentiment (positive or negative)
57
58
59
60
61
62
63
Conversely: Exploring the “long tail” with a “least frequent” word count
• Unclick the “1000 most frequent” selection in the word frequency count query
• Filter the findings by count (from least to greatest)
• Export the data table to Excel
• Clean out garble and non-words
• Identify outlier topics of interest, URLs, and other features
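The ascending “long tail” sort can likewise be approximated outside of NVivo. A minimal sketch (real runs would also need the “clean out garble” step from the steps above):

```python
import re
from collections import Counter

def long_tail(texts, max_count=1):
    """Return words occurring at most `max_count` times (the rare outliers)."""
    counts = Counter(w for t in texts for w in re.findall(r"[a-z']+", t.lower()))
    return sorted(w for w, c in counts.items() if c <= max_count)

# Hypothetical mini-corpus: "rubric" appears only once
rare = long_tail(["moodle moodle canvas", "canvas rubric"])
```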
64
65
66
(2) text searches
• Words, phrases, quotes, names, dates, and other content that may be searched for and identified in a word tree
• May click on the word tree phrases to capture context • May double click on phrases to go to the original reference • May set parameters of the search for the following (from specific to
general): exact matches, stemmed words, synonyms, specializations, generalizations
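A word tree is assembled from keyword-in-context (KWIC) hits; the underlying search can be sketched as follows (window size and sample sentence are illustrative; NVivo’s own matching also handles stems and synonyms, which this exact-match sketch does not):

```python
import re

def kwic(texts, target, window=3):
    """Keyword-in-context: a few words before and after each hit of `target`."""
    hits = []
    for text in texts:
        words = re.findall(r"\w+", text.lower())
        for i, w in enumerate(words):
            if w == target:
                hits.append((words[max(0, i - window):i], w,
                             words[i + 1:i + 1 + window]))
    return hits

hits = kwic(["I think distant reading scales well"], "reading")
```

Grouping hits by their shared left or right context is what produces the branching word-tree display.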
67
(2) text searches (cont.)
Pros
• Context of a few words prior and a few words after provides summary data

Cons
• Deeper exploration requires accessing original source contexts (so the data processing has to enable going up and down the chain of a discussion board thread)
• Multimedia files uploaded in discussion boards have to be annotated in order to query (using text)
• Tree is helpful for a limited set of occurrences but becomes too complex and unwieldy after a certain point
• Downloaded word tree is not interactive or navigable
68
69
(3) theme and sub-theme extractions
• Automated extraction of identified themes and sub-themes (concepts) from a text set, based on an undisclosed algorithm (or chain of algorithms) but classically based on word relatedness and similarity clustering: principal component analysis or factor analysis, probabilistic latent semantic indexing, latent Dirichlet allocation, non-negative matrix factorization, and other approaches
• Variance based on whether extracted themes are applied to sentences, paragraphs, or cells (the level of granularity of the coding)
• Subthemes (as extracted by NVivo 11 Plus) tend to be literally related words to the extracted theme (usually with occurrences of those key theme terms in the subthemes, which are generally noun phrases)
• Principal component analysis or factor analysis usually extract noisier topics that require some human interpretation to label (by contrast)
70
(3) theme and sub-theme extractions (cont.)
• “Topic modeling” extracts the latent, underlying semantic structure of a text or texts (using machine-based inductive logic)
• Process captures topic distributions based on occurrences of a particular term
• Extracted themes and sub-themes must be approved by the human user before they are coded to nodes in NVivo 11 Plus
• Themes may be semantic terms as well as verbs (in various stemmed forms) and other words
• Autocoded nodes (based on the extracted themes) may be edited and revised and even omitted
• Based on unsupervised machine learning
71
72
(3) theme and sub-theme extractions (cont.)
Pros
• Summary data based on themes and literally connected subthemes (subthemes show recurrence of the theme words)
• Objectivist basis for theme and sub-theme extractions (without theory-based selectivity)
• Requires exploration of the nodes to see what has been captured in the particular theme

Cons
• May include non-content, non-semantic words
• Does not offer tunable parameters like the setting of “k” topics
73
(3) theme and sub-theme extractions (cont.)
Pros
• Offers a topic-based histogram for source publications and combined text corpora, with latent topics identified (and latent topic frequency distributions)
• Theme text sets may be analyzed further (with the extraction of sentiments, or data-queried, etc.)
74
(3) theme and sub-theme extractions (cont.)
Pros
• Easy to export to Excel or other software tools for other data visualizations (like spider charts)
75
Machine extracted and human-vetted
76
77
78
79
80
Fast-read of document set with automated theme and sub-theme extraction
• Applied to a text corpus (or corpora) of documents
• A resulting matrix with identification of relevant documents per found themes and sub-themes (based on observed words)
• Result: a co-occurrence table of documents with extracted themes
• Extracted dimensions of the “domain” if the articles were a complete representation of the domain (but not in this example, and generally not, because of the prohibitive size of such document sets)
• Resembles a term-document matrix, which is a middle step in a classic latent semantic analysis (LSA) sequence
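The term-document matrix mentioned as the LSA middle step is simple to sketch in pure Python (the two-document mini-corpus is hypothetical; a real LSA pipeline would follow this with weighting and singular value decomposition):

```python
import re

def term_document_matrix(docs):
    """Rows = terms, columns = documents; each cell = count of term in doc."""
    tokenized = [re.findall(r"[a-z']+", d.lower()) for d in docs]
    terms = sorted({w for doc in tokenized for w in doc})
    matrix = [[doc.count(term) for doc in tokenized] for term in terms]
    return terms, matrix

terms, tdm = term_document_matrix(["reading reading online", "online courses"])
```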
81
Fast-read of document set with automated theme and sub-theme extraction (cont.)
• Can further create article histograms (of contents) by exporting top column headers and specific single rows (articles) (and mapping in Excel)
• Can run article histograms in the sequence of publication to capture a sense of the dynamism related to that topic over sequential time
• Can create topical summary tendencies for the article set
82
83
Example of an article histogram
84
Themes in the full set
85
(4) matrix queries
• Cross referencing sources, codes, respective coding by individuals, and other data
• May be queried across types of data
• May be used to identify recurring concepts in individual’s communications
• May be used to capture student interactions with each other
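A matrix coding query is essentially a cross-tabulation of coded segments; a minimal sketch (speaker names and code labels are hypothetical):

```python
from collections import defaultdict

def matrix_query(coded_segments):
    """Cross-tabulate coded segments: rows = speakers, columns = codes."""
    table = defaultdict(lambda: defaultdict(int))
    for speaker, code in coded_segments:
        table[speaker][code] += 1
    return {speaker: dict(codes) for speaker, codes in table.items()}

# Each pair is one coded segment of discussion-board text
segments = [("ana", "motivation"), ("ana", "motivation"), ("ben", "workload")]
table = matrix_query(segments)
```

Reading a row shows each individual’s recurring concepts; reading a column shows which students a code touches.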
86
(4) matrix queries (cont.)
Pros
• Enables identification of data patterns by cross-referencing various variables and sources from the uploaded data
• May query student work as sources and pull theme extractions from their work

Cons
• Requires coded data for matrix coding queries
• Requires proper data processing
• Requires proper ingestion of data for clear analytics
87
88
(5) sentiment analyses
• Sentiment is traditionally conceptualized as a positive-negative polarity (either as a binary positive or negative or as a continuum: “very negative, moderately negative, moderately positive, very positive” in NVivo)
• Assumption that sentiment will affect behavior (to some degree) • Depending on the structure of the textual data, the sentiment coding
may be applied at sentence, paragraph, or cell levels
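NVivo’s sentiment dictionary is internal and not published; as an illustration of the general lexicon-based approach and the four-level scale above, a toy scorer might look like this (lexicon weights and thresholds are invented for the example):

```python
# Toy lexicon -- NVivo's internal sentiment dictionary is proprietary;
# these weights and the thresholds below are invented for illustration.
LEXICON = {"great": 2, "helpful": 1, "confusing": -1, "terrible": -2}
LABELS = ["very negative", "moderately negative",
          "moderately positive", "very positive"]

def score_sentence(sentence):
    """Sum lexicon weights over the words of one coding unit (a sentence)."""
    return sum(LEXICON.get(w.strip(".,!?").lower(), 0) for w in sentence.split())

def label(score):
    """Map a raw score onto the four-level scale; zero stays uncoded."""
    if score == 0:
        return "neutral"
    if score <= -2:
        return LABELS[0]
    if score < 0:
        return LABELS[1]
    if score < 2:
        return LABELS[2]
    return LABELS[3]

result = label(score_sentence("The examples were great and helpful"))
```

The sketch also shows why such coding misses context: it scores words in isolation, which is exactly the limitation listed in the Cons below.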
89
(5) sentiment analyses (cont.)
Pros
• Sentiment coding (positive-negative) may provide some general insights
• Sentiment coding text sets may be accessed to see what text was coded where; uncoded or “neutral” text may be visualized in proportion to the sentiment coding in hierarchical graphs, but the underlying “neutral” text sets cannot be extracted
• Sentiment coding may be uncoded and re-coded
• Original source texts from each sentiment category may be accessed for further exploration

Cons
• Coding for sentiment does not capture double negatives, irony, sarcasm, humor, symbolism, tropes, and other nuances of communication
• Sentiment coding by machine may misread words that are highly positive or negative in the internal sentiment dictionary (without applying contextual awareness)
• Pre-coded sentiment dictionary (set) is not currently revisable in NVivo 11 Plus
90
91
92
93
94
95
(6) autocoding by existing pattern
• This “autocoding by existing pattern” enables machine emulation of human coding
• The human has to create the data nodes (categories) and provide sufficient exemplars for each category
• The “autocoding by existing pattern” should be set for the proper level of aggressiveness in the coding (by “NV” or NVivo)
• All autocoding should be double-checked for accuracy and corrections (such as by uncoding / recoding)
96
(6) autocoding by existing pattern (cont.)
Pros
• Captures the human-based coding of a text set
• Extends the researcher’s coding “fist” (style) to as-yet uncoded data

Cons
• Requires a sufficient amount of coded data
• Does not enable the creation of new nodes (beyond what the human coder has created)
• Requires human oversight
97
98
(7) geolocational mapping
• This feature places pins of various locations on a map • This requires disambiguated place information to work effectively
99
(7) geolocational mapping (cont.)
Pros
• Provides a geographical and spatial sense to texts
Cons
• Geographical information in texts can be ambiguous and noisy (not effective for specific locational mapping); computer-captured latitude-longitude data are more accurate
• Is a rough map (based on a MapQuest map understructure); would work better using Google Maps or ArcGIS maps
100
101
(8) sociograms
• Node-link diagrams that show individual entities (individuals) and their relationships to others based on interactivity (messaging)
• Sociograms align with online course discussion boards, which show instructors / teaching assistants / learners and their relational messaging
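The reply structure of a discussion board can be turned into basic sociogram data in a few lines. The (author, replied-to) record shape and the names below are assumptions for illustration:

```python
from collections import Counter, defaultdict

# Sketch: build sociogram edge data from discussion-board reply metadata.
# Each record is (author, replied_to_author); the field shape and names
# are illustrative assumptions, not an export format of any one LMS.
def build_sociogram(replies):
    edges = Counter()              # directed edge weights: (from, to) -> count
    neighbors = defaultdict(set)   # one-degree (direct ego) neighborhoods
    for author, target in replies:
        edges[(author, target)] += 1
        neighbors[author].add(target)
        neighbors[target].add(author)
    return edges, neighbors

replies = [("ana", "ben"), ("ben", "ana"), ("cal", "ana"), ("ana", "ben")]
edges, neighbors = build_sociogram(replies)
```

The `neighbors` mapping corresponds to the one-degree ego neighborhood NVivo captures; extending to 1.5 or 2 degrees means following those sets outward, which is where dedicated network tools become more convenient.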
102
(8) sociograms (cont.)
Pros
• Shows interrelationships (with "relating" defined as interaction / intercommunications / replies / retweets / shares / likes-dislikes)
• Enables capturing relationships at one degree (the direct ego neighborhood) in NVivo; relationships at 1.5 deg., 2 deg., and beyond may be captured using other tools
Cons
• Shows only general information about entities and relationships
• Does not include profile information
• Does not include the shared messaging (need to go to the underlying discussion board posts)
• Does not offer the more eye-catching sociogram visualizations that other tools can offer
103
104
105
(9) time patterns
• There are a variety of possible time patterns:
  • slice-in-time data (at a particular moment),
  • time periods or phases (different ways to slice time), and
  • continuous changes over time
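As a sketch of the "time periods or phases" pattern, posts can be binned into term weeks. The dates and the one-week bin size here are illustrative assumptions:

```python
from datetime import date

# Sketch: bin discussion posts into weekly phases to see when a board is
# most / least active across a term. Dates and bin size are illustrative.
def weekly_counts(post_dates, term_start):
    counts = {}
    for d in post_dates:
        week = (d - term_start).days // 7 + 1  # week 1 = first week of term
        counts[week] = counts.get(week, 0) + 1
    return counts

posts = [date(2016, 1, 20), date(2016, 1, 21), date(2016, 2, 2)]
print(weekly_counts(posts, term_start=date(2016, 1, 18)))  # {1: 2, 3: 1}
```

As the Cons on the next slide note, the work is mostly in getting timestamps out of the discussion board in a usable shape; once they are, a spreadsheet tool may serve just as well.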
106
(9) time patterns (cont.)
Pros
• Enables understandings of how learners interact on a discussion board during a term
Cons
• Requires data to be set up a particular way in order to represent time; may be more effective in a different tool than NVivo (such as Excel)
107
108
Application of distant reading tools to discussion boards
109
Application to discussion boards
(1) word frequency counts
• What are the common, popular words used in discussion boards based on certain topics, questions, image prompts, cases, or videos?
• What related word clusters may be found in terms of proxemic co-occurrence? Which terms are related, and why? Are these relations justified by facts, or are they a product of inaccurate assumptions?
(2) text searches
• What are the common expressed ideas used in discussion boards based on certain alphanumeric data (names, dates, phrases, events, and others)?
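A minimal sketch of queries (1) and (2) outside NVivo, assuming posts are plain strings; the stoplist and sample posts are illustrative:

```python
import re
from collections import Counter

# Sketch of the two simplest "distant reading" queries over a thread:
# a word frequency count (with a stoplist) and a regular-expression
# text search. The stoplist here is a small illustrative sample.
STOPLIST = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

def word_frequency(posts, top_n=10):
    """Most common non-stoplist words across all posts."""
    words = re.findall(r"[a-z']+", " ".join(posts).lower())
    return Counter(w for w in words if w not in STOPLIST).most_common(top_n)

def text_search(posts, pattern):
    """Return the posts matching a phrase, name, date, etc."""
    return [p for p in posts if re.search(pattern, p, re.IGNORECASE)]
```

A frequency list like this is the raw material behind word clouds and tree maps; the stoplist matters a great deal, since without one the top terms are almost all function words.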
110
Application to discussion boards (cont.)
(3) theme and sub-theme extractions
• What are the common machine-identified concepts and sub-concepts used in particular discussion boards / threads?
  • And what is not mentioned? What do we expect to see that we don't see (even after a precise "text search")? Why?
• How do these align with human-identified themes and sub-themes? Why?
• Based on the underlying text sets, how are these themes and sub-themes understood? How may misconceptions be addressed by the instructor and the learners together?
• In terms of clustering by word similarity, whose writings are most similar? Does that show who is talking to whom? Does word similarity across users (their sources) show potential emulation (or even potential plagiarism)?
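The "clustering by word similarity" question can be sketched with pairwise Jaccard similarity over each learner's combined writing. This is a rough stand-in for NVivo's word-similarity clustering, and the learner names and texts are hypothetical:

```python
# Sketch: "whose writings are most similar?" via pairwise Jaccard
# similarity of vocabularies. A rough stand-in for word-similarity
# clustering; an unusually high score flags a pair for a close read
# (possible emulation, or even plagiarism), not a conclusion.
def vocab(text):
    return {w.strip(".,!?").lower() for w in text.split()}

def jaccard(a, b):
    """Shared vocabulary as a fraction of combined vocabulary (0..1)."""
    va, vb = vocab(a), vocab(b)
    return len(va & vb) / len(va | vb)

def most_similar_pair(writings):
    """writings: {learner: combined posts} -> ((a, b), similarity)"""
    names = sorted(writings)
    pairs = [((a, b), jaccard(writings[a], writings[b]))
             for i, a in enumerate(names) for b in names[i + 1:]]
    return max(pairs, key=lambda p: p[1])
```

As with the machine-extracted themes above, a similarity score is a prompt for closer reading, not evidence on its own.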
111
Application to discussion boards (cont.)
(4) matrix queries
• Are there identifiable patterns between how different groups of students stand on various issues?
• Are there identifiable patterns between coded variables from the text sets?
(5) sentiment analyses
• What are the amounts of sentiment (very negative, moderately negative, moderately positive, and very positive) seen in the sentences / paragraphs of the discussion board interactions?
  • Why are certain discussions trending positive or negative in the machine-observed ways?
• What are the texts in each of the coded sentiment categories?
• What are some possible inferences from these observed expressed sentiments?
112
Application to discussion boards (cont.)
(6) autocoding by existing pattern
• What sorts of summary data assertions may be made about the text sets from the discussion boards based on the following:
  • Human emergent coding
  • Human a priori coding [based on theory/ies, model(s), framework(s), and other methods]
  • A mixture of emergent and a priori coding
(7) geolocational mapping
• Are there some geographical patterns to the expressed ideas or metadata from uploaded images / audio / video on the discussion boards?
113
Application to discussion boards (cont.)
(8) sociograms
• How do the learners in an online class form relationships, interact, and collaborate based on co-commenting on a discussion board? Replying?
• If there are identified cliques, how do these form? If there are dyadic, triadic, and other types of motif relationships, how do these form? How long do these ties last?
(9) time patterns
• When are discussion boards most active during the assignment period? When are they least active? Why?
• What sorts of prompts encourage more time-bursty communication behaviors? Why?
114
115
116
Discussion
117
Walking this back a little…
• Benefits to deep knowledge of the data set (attainable by close reading of some parts of the text) even with the affordances of distant reading
• Deep knowledge enables a more accurate understanding of the results of “distant reading”
• Benefits to have a human touch in supporting student learning in an age of computational affordances
• People respond to other people much more positively (and negatively) than to computers
118
Walking this back a little… (cont.)
• Benefits to wielding distant reading tools with a deep and accurate understanding of what is going on with the computer, the software, and the data…and how to read the outputs without being mechanistic or unthinking
• Is reading ease always desirable? • What is the writing context? The writing genre? Expectations?
• Benefits to having some baseline understandings of the types of text sets and typical quantitative readings
119
Some other “distant reading” tools
120
Some other “distant reading” tools, too
• There are a range of other software tools that enable various ways to "distant read":
  • LIWC (Linguistic Inquiry and Word Count, pronounced "luke," U of Texas)
  • AutoMap and ORA-NetScenes (word relationships in a document or a text corpus, CASOS, Carnegie Mellon University)
  • RapidMiner Studio (data mining and modeling, RapidMiner, Inc.)
  • the Natural Language Toolkit (NLTK) package in Python
  • natural language processing packages in R, and others
• A number of the above are free for non-commercial academic use, and one is available at very low cost for a perpetual license.
121
Some other “distant reading” tools, too (cont.)
• All software programs require study and practice to wield effectively.
• MS Word has a built-in readability calculator that reports Flesch Reading Ease and Flesch-Kincaid Grade Level.
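The Flesch Reading Ease formula itself is published, so a rough version can be sketched outside Word. The syllable counter below is a simple vowel-group heuristic, so scores may differ slightly from MS Word's:

```python
import re

def count_syllables(word):
    """Rough heuristic: count vowel groups; real syllable counters differ."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean easier text (90+ is very easy)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Published formula: 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))
```

As the deck's "walking this back" slides ask, a high score is not automatically desirable; the writing context and genre shape what a reasonable score looks like.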
122
Once upon a time…
…The End 123
Conclusion and contact
• Dr. Shalin Hai-Jew• iTAC, Kansas State University • 212 Hale / Farrell Library • 785-532-5262 • [email protected]
• Thanks to Dr. Jana R. Fallin and her team at the K-State Teaching & Learning Center for accepting this digital poster session for the Big XII Teaching and Learning Conference (2016)! Thanks to Dr. Rebecca Gould for supporting this work.
• The content network graphs on Slides 2 and 3 were created using NodeXL Basic, an open-source and free Excel add-on available on the Microsoft CodePlex site.
• The “long tail” image on Slide 61 is an open source image from Wikimedia Commons. The text overlays are mine.
• The presenter has no tie with QSR International, the makers of NVivo 11 Plus.
124