
Welcome to the home of TED-plus and GenderListener

TED-plus is an enhanced dataset of all online TED talks that includes features from metadata, audio, transcripts, and gender labels based on speaker first name and speaker speech (using GenderListener). I created it so that you can effortlessly explore TED talks. Scroll down for the full list of features that I added.

GenderListener is a tool that generates speaker gender labels from audio. Use it to add gender labels to any audio files you have. Its goal is to make it easier for data scientists and social scientists to explore gender-related trends in speech audio data. GenderListener is based on a logistic regression classifier that I trained on 1,096 pre-labeled TED talks.
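For readers unfamiliar with logistic regression, here is a minimal sketch of how such a classifier learns to map numeric features to one of two labels. It is not the GenderListener code: the two features, their values, and the function names are made up for illustration, and labels 0 and 1 simply stand in for the two classes. It uses plain Python so it runs without extra packages.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    n = len(rows)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the score
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if p >= 0.5 else 0

# Hypothetical, already-standardized feature vectors (NOT real TED data).
rows = [[-1.2, 0.3], [-0.8, -0.1], [-1.0, 0.5],
        [0.9, -0.4], [1.1, 0.2], [0.7, 0.0]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic_regression(rows, labels)
predictions = [predict(weights, bias, x) for x in rows]
```

In practice a library implementation would normally be used instead of a hand-written training loop; the loop is spelled out here only to show what the classifier does.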

This repository contains all the code I used to build GenderListener and to produce the TED_plus.csv dataset.
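As a rough illustration of exploring the dataset, the sketch below counts how often a name-derived label and a sound-derived label agree. The column names (`gender_name`, `gender_audio`) and the three sample rows are assumptions made up for this example; check the header row of TED_plus.csv for the real names. An in-memory sample stands in for the file so the snippet runs on its own.

```python
import csv
import io
from collections import Counter

# Stand-in for: open("TED_plus.csv", newline="", encoding="utf-8")
# The column names and rows here are hypothetical, not taken from the dataset.
sample = io.StringIO(
    "title,gender_name,gender_audio\n"
    "Talk A,female,female\n"
    "Talk B,male,male\n"
    "Talk C,unknown,female\n"
)

def count_label_pairs(handle, name_col, audio_col):
    """Count (name-derived label, sound-derived label) pairs over all rows."""
    reader = csv.DictReader(handle)
    return Counter((row[name_col], row[audio_col]) for row in reader)

pairs = count_label_pairs(sample, "gender_name", "gender_audio")
for (name_label, audio_label), count in sorted(pairs.items()):
    print(name_label, audio_label, count)
```

To run this against the real file, replace `sample` with an opened file handle and pass the actual column names.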

TED_plus contains:

Here are some examples of how to use the code and datasets in this repository (in order from least involved to most involved).

If you want to:

In most cases, the main modification will be to change the input and output filenames.

Some downloads you may need:
Raw metadata and transcripts: Kaggle-Rounak_Banik
Audio recordings of TED talks: TEDLIUM II
pyAudioAnalysis3: Python 3 version
pyAudioAnalysis was created by Theodoros Giannakopoulos, a postdoctoral researcher at NCSR Demokritos, Athens, Greece.

TED-plus features:

Original metadata features (courtesy of Rounak Banik):

Enhanced metadata (courtesy of Cynthia Correa):

Topics:

Ratings:

34 audio features extracted using pyAudioAnalysis:

Name-derived gender labels from gender_guesser:

Sound-derived gender labels from gender_listener: