Sentiment Analysis of a CHILDES Corpus

CS 631 Final Project, part 2

Grace Lawley

August 14th, 2018

The Dataset

Child Language Data Exchange System (CHILDES)
- Online repository of language acquisition data
- Used to study language development, second language acquisition, child directed speech
Why is CHILDES special?
- Corpora of children speaking North American English are very hard to come by

The Dataset

CHILDES → Eng-NA → Kuczaj Corpus
- Longitudinal case study
- 1 target child: Abe
- Ages ~2 to ~5 years
- 210 transcripts (average of 810 words long)

The Raw Data

Pulled raw utterances data down from the CHILDES database with the childesr package
Some raw utterances:

## [1] "okay that's a alligator he got a cigar"
## [2] "go away"                               
## [3] "camel pig and the donkey"              
## [4] "you go away"                           
## [5] "uhhuh eat"                             
## [6] "oh no"

Cleaned, processed, & tokenized the data
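
The pull-and-tokenize step could look roughly like the sketch below. It is an illustration, not the project's actual code: the `get_utterances()` arguments are an assumption about how the corpus was requested (the call is commented out because it needs network access), the `utterance_id` and `utterance` column names are made up for this example, and the cleaning steps are not shown. The six utterances are the ones printed above.

```r
library(tibble)
library(tidytext)

# Hypothetical request; it needs network access, so it is left commented out.
# raw <- childesr::get_utterances(corpus = "Kuczaj", target_child = "Abe")

# The six raw utterances shown on the slide.
utterances <- tibble(
  utterance_id = 1:6,
  utterance = c(
    "okay that's a alligator he got a cigar",
    "go away",
    "camel pig and the donkey",
    "you go away",
    "uhhuh eat",
    "oh no"
  )
)

# One row per word; unnest_tokens() lowercases the text by default
# and drops the original utterance column.
tokens <- unnest_tokens(utterances, output = word, input = utterance)

print(tokens)
```

Keeping one token per row is what makes the later join against a word-level lexicon straightforward.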

The Sentiment Analysis

Used the nrc Word-Emotion Association Lexicon in the tidytext package
- Classifies words into 10 different sentiment categories:
  - anger
  - disgust
  - fear
  - joy
  - negative
  - sadness
  - anticipation
  - surprise
  - trust
  - positive

Merged with tokens with dplyr::inner_join()
- Only kept tokens that occurred in both dataframes
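
A minimal sketch of the join step, using a made-up three-word lexicon in place of the NRC lexicon (which recent tidytext versions expose through `get_sentiments("nrc")` after a one-time download). The tokens and lexicon entries below are hypothetical; the point is only that `inner_join()` silently drops every token the lexicon does not contain.

```r
library(tibble)
library(dplyr)

# Hypothetical tokens, one word per row.
tokens <- tibble(
  word = c("oh", "no", "go", "away", "happy", "monster", "happy")
)

# Toy stand-in for the lexicon: NOT real NRC entries.
toy_lexicon <- tibble(
  word      = c("happy", "monster", "cry"),
  sentiment = c("joy", "fear", "sadness")
)

# Keep only tokens that appear in both data frames.
matched <- inner_join(tokens, toy_lexicon, by = "word")

# Share of tokens dropped by the join.
share_lost <- 1 - nrow(matched) / nrow(tokens)

print(matched)
print(share_lost)
```

Computing the dropped share right after the join makes the coverage of the lexicon visible early, before any plotting.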

Six sentiments

Positive
- trust
- joy
- anticipation

Negative
- sadness
- fear
- anger

The Original Plot

Problems

Visualization is difficult to explain
~86% of tokens were lost when filtering against the NRC Word-Emotion Association Lexicon
Transcript length varies a lot
Distribution of transcripts across the ages varies a lot

Normalization

Binned age into months:
- 30.13204, 30.19775, 30.32916, ..., 30.59200 → 30

For each age bin and each sentiment:
- n_percent = n_sentiment/n_tokens
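
The binning and normalization can be sketched as follows. This is an assumption-laden illustration rather than the project's code: `floor()` is inferred from the example above (30.59200 maps to 30, so ages are rounded down), `n_tokens` is taken here to be the number of sentiment-tagged tokens in the bin (the slides do not say whether it counts all tokens or only matched ones), and the sentiment labels attached to each age are invented.

```r
library(tibble)
library(dplyr)

# Hypothetical sentiment-tagged tokens; the first four ages come from the
# slide, the labels and the remaining ages are made up.
tagged <- tibble(
  age       = c(30.13204, 30.19775, 30.32916, 30.59200, 31.02, 31.40),
  sentiment = c("joy", "joy", "fear", "trust", "joy", "fear")
)

normalized <- tagged %>%
  mutate(age_bin = floor(age)) %>%                 # bin age into whole months
  count(age_bin, sentiment, name = "n_sentiment") %>%
  group_by(age_bin) %>%
  mutate(
    n_tokens  = sum(n_sentiment),                  # tagged tokens in this bin
    n_percent = n_sentiment / n_tokens
  ) %>%
  ungroup()

print(normalized)
```

Dividing by the per-bin total puts months with many long transcripts and months with few short ones on the same scale.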

Iterate!

Version 1

Version 2

Version 3

Version 4

Version 5

Version 5.1

The Final Version

Thank you!

Github Repository:
gracelawley/kuczaj-corpus

Write up & code available at:
grace.rbind.io/project/kuczaj_pt2/

Slides made with the R package xaringan
These slides - rendered & raw

Based on my CS 631 Final Visualization Project
Write up & code available at:
grace.rbind.io/project/final_vis/
