Sentiment Analysis of a CHILDES Corpus

CS 631 Final Project, part 2

Grace Lawley

August 14th, 2018

The Dataset

  • Child Language Data Exchange System (CHILDES)

    • Online repository of language acquisition data

    • Used to study language development, second language acquisition, child directed speech

  • Why is CHILDES special?

    • Corpora of children speaking North American English are very hard to come by

The Dataset

  • CHILDES → Eng-NA → Kuczaj Corpus

    • Longitudinal Case Study

    • 1 target child: Abe

    • ~2 - ~5 years old

    • 210 transcripts (averaging 810 words each)

The Raw Data

  • Pulled raw utterance data from the CHILDES database with the childesr package

  • Some raw utterances:

## [1] "okay that's a alligator he got a cigar"
## [2] "go away"
## [3] "camel pig and the donkey"
## [4] "you go away"
## [5] "uhhuh eat"
## [6] "oh no"
  • Cleaned, processed, & tokenized the data
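
The tokenization step might look roughly like the sketch below. It is only an illustration, not the project's actual code: the toy table stands in for the downloaded utterances, the `transcript_id` values are made up, and the `gloss` column name and the `childesr::get_utterances()` call in the comment are assumptions about how the data was fetched.

```r
library(tibble)
library(tidytext)

# Toy stand-in for the utterance table. With network access, the real data
# could presumably be fetched with something like (assumption):
#   childesr::get_utterances(corpus = "Kuczaj", target_child = "Abe")
utterances <- tibble(
  transcript_id = c(1L, 1L, 2L),  # hypothetical ids
  gloss = c("go away", "camel pig and the donkey", "oh no")
)

# One row per token; unnest_tokens() lowercases and strips punctuation by default
tokens <- unnest_tokens(utterances, output = word, input = gloss)

print(tokens)
```

The other columns (here `transcript_id`) are carried along, so each token stays linked to its transcript.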

The Sentiment Analysis

  • Used the NRC Word-Emotion Association Lexicon, accessed through the tidytext package

    • Classifies words into 10 different sentiment categories:

      • anger
      • disgust
      • fear
      • joy
      • negative
      • sadness
      • anticipation
      • surprise
      • trust
      • positive
  • Merged the lexicon with the tokens using dplyr::inner_join()

    • Only kept tokens that occurred in both data frames
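
A minimal sketch of the join, assuming the lexicon is a two-column table of `word` and `sentiment` (the shape `tidytext::get_sentiments("nrc")` returns). The lexicon rows below are made up for illustration and are not real NRC entries.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens, one row per word
tokens <- tibble(word = c("oh", "no", "happy", "the", "scared"))

# Made-up lexicon in the (word, sentiment) shape; NOT real NRC entries
toy_lexicon <- tibble(
  word      = c("happy", "happy", "scared", "scared"),
  sentiment = c("joy", "positive", "fear", "negative")
)

# inner_join() keeps only rows whose word appears in both tables;
# a word with several sentiments yields one row per sentiment
tagged <- inner_join(tokens, toy_lexicon, by = "word")

print(tagged)
```

Because one word can carry several sentiments, the joined table can have more rows per word than the token table, while unmatched words such as "the" disappear entirely.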

Six sentiments

Positive

  • trust

  • joy

  • anticipation

Negative

  • sadness

  • fear

  • anger
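
One way to narrow the ten categories down to these six and attach a polarity is a small lookup table, sketched below. The tagged tokens are placeholders, and the column name `polarity` is an assumption made for this example.

```r
library(dplyr)
library(tibble)

# Lookup table for the six sentiments and their grouping from the slide
six_sentiments <- tibble(
  sentiment = c("trust", "joy", "anticipation", "sadness", "fear", "anger"),
  polarity  = rep(c("positive", "negative"), each = 3)
)

# Placeholder tagged tokens (hypothetical words and labels)
tagged <- tibble(
  word      = c("w1", "w2", "w3", "w4"),
  sentiment = c("joy", "disgust", "fear", "positive")
)

# Joining on sentiment drops the four categories that are not in the lookup
kept <- inner_join(tagged, six_sentiments, by = "sentiment")

print(kept)
```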

The Original Plot

Problems

  • Visualization is difficult to explain

  • ~86% of tokens were lost when filtering against the NRC Word-Emotion Association Lexicon

  • Transcript length varies a lot

  • Distribution of transcripts across the ages varies a lot
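
The share of tokens lost to lexicon filtering can be measured along these lines. The data here is a toy example with a made-up lexicon, so the number it prints has nothing to do with the ~86% reported for the real corpus.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens and a made-up lexicon (not real NRC entries)
tokens <- tibble(word = c("you", "go", "away", "oh", "no", "happy", "scared"))
toy_lexicon <- tibble(
  word      = c("happy", "scared"),
  sentiment = c("joy", "fear")
)

# semi_join() keeps tokens with at least one lexicon match,
# without duplicating rows for words that have several sentiments
matched <- semi_join(tokens, toy_lexicon, by = "word")

prop_lost <- 1 - nrow(matched) / nrow(tokens)
print(prop_lost)
```

Using `semi_join()` instead of `inner_join()` for this count avoids inflating the matched total when a word maps to more than one sentiment.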

Normalization

  • Binned age into months:

    • 30.13204, 30.19775, 30.32916, ..., 30.59200 → 30
  • For each age bin and each sentiment:

    • n_percent = n_sentiment/n_tokens
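
The normalization could be sketched as follows. Two things are assumed here rather than taken from the slides: that ages are binned with `floor()`, and that `n_tokens` counts every token in the bin, including those without a lexicon match. The data is hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens: age in decimal months; sentiment is NA for unmatched tokens
tokens <- tibble(
  age       = c(30.13, 30.13, 30.20, 30.59, 31.02, 31.02, 31.40, 31.40),
  sentiment = c("joy", NA, "fear", "joy", NA, "trust", NA, NA)
)

binned <- tokens %>%
  mutate(age_bin = floor(age)) %>%   # assumption: whole-month bins via floor()
  group_by(age_bin) %>%
  mutate(n_tokens = n()) %>%         # all tokens in the bin
  ungroup() %>%
  filter(!is.na(sentiment)) %>%
  count(age_bin, sentiment, n_tokens, name = "n_sentiment") %>%
  mutate(n_percent = n_sentiment / n_tokens)

print(binned)
```

Dividing by the bin's token count puts months with many long transcripts and months with few short ones on the same scale.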

Iterate!

Version 1

Version 2

Version 3

Version 4

Version 5

Version 5.1

The Final Version

Thank you!

GitHub Repository:
gracelawley/kuczaj-corpus

Write up & code available at:
grace.rbind.io/project/kuczaj_pt2/

Slides made with the R package xaringan
These slides - rendered & raw

Based on my CS 631 Final Visualization Project
Write up & code available at:
grace.rbind.io/project/final_vis/
