Tutorial: Natural language processing for social media

26 April, afternoon

Presenters: Kalina Bontcheva, Leon Derczynski

Tutorial contents

From a business and government point of view, there is an increasing need to interpret and act upon information from large-volume social media streams such as Twitter, Facebook, and forum posts. While natural language processing of newswire has been studied intensively over the past two decades, understanding social media content has only recently been addressed in NLP research.

Social media poses three major computational challenges, dubbed by Gartner the 3Vs of big data: volume, velocity, and variety. NLP methods, in particular, face further difficulties arising from the short, noisy, and strongly contextualised nature of social media. To address the 3Vs of social media, novel language technologies have emerged, e.g. using locality sensitive hashing to detect new stories in media streams (volume), predicting stock market movements from tweet sentiment (velocity), and recommending blogs and news articles based on users' own comments (variety).
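As an illustration of the first of these, a minimal sketch (our own, not the tutorial's implementation) of first-story detection with locality sensitive hashing: each tweet gets a MinHash signature over its word set, the signature is split into bands, and a tweet is flagged as a potential new story only if none of its bands collides with one seen before. The band/hash counts and the hash function are illustrative assumptions.

```python
# Sketch of LSH-based first-story detection (illustrative, not the
# tutorial's code): novel tweets are those whose MinHash bands have
# never been seen before.
import hashlib

NUM_BANDS = 4        # illustrative parameters, not from the tutorial
HASHES_PER_BAND = 2

def minhash_signature(tokens, num_hashes):
    # One min-hash value per seeded hash function over the token set.
    return [min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for seed in range(num_hashes)]

def is_new_story(text, seen_buckets):
    tokens = set(text.lower().split())
    sig = minhash_signature(tokens, NUM_BANDS * HASHES_PER_BAND)
    # A tweet is "near" an earlier one if any band of its signature
    # collides with a previously stored band.
    bands = [tuple(sig[i:i + HASHES_PER_BAND])
             for i in range(0, len(sig), HASHES_PER_BAND)]
    novel = not any((i, b) in seen_buckets for i, b in enumerate(bands))
    seen_buckets.update((i, b) for i, b in enumerate(bands))
    return novel

buckets = set()
print(is_new_story("earthquake hits city centre tonight", buckets))  # True
print(is_new_story("earthquake hits city centre tonight", buckets))  # False (duplicate)
```

Because only band collisions are checked, the cost per tweet is constant rather than linear in the number of tweets seen so far, which is what makes the approach viable at stream volume.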

The tutorial takes a detailed view of key NLP tasks for social media content: corpus annotation, linguistic pre-processing, information extraction, and opinion mining. After a short introduction to the challenges of processing social media, we will cover key NLP algorithms adapted to such content, discuss available evaluation datasets, and outline remaining challenges.

The tutorial will start by comparing several social media corpora against traditionally used news-based ones, demonstrating how differences in noisiness, brevity, diversity, and temporality affect the performance of state-of-the-art NLP algorithms.

The core of the tutorial will present NLP algorithms tailored to social media, and more specifically: language identification, tokenisation, normalisation, part-of-speech tagging, named entity recognition, entity linking, event recognition, opinion mining, and text summarisation.
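To make two of these steps concrete, here is a minimal sketch (our illustration, not the tutorial's code) of Twitter-aware tokenisation that keeps @mentions, #hashtags, and URLs intact, followed by dictionary-based normalisation. The regex and the tiny normalisation lexicon are illustrative assumptions; real systems curate or learn far larger mappings.

```python
# Sketch of tweet tokenisation and normalisation (illustrative only).
import re

TOKEN_RE = re.compile(
    r"https?://\S+"      # URLs stay as single tokens
    r"|[@#]\w+"          # @mentions and #hashtags
    r"|\w+(?:'\w+)?"     # words, with optional apostrophe
    r"|[^\w\s]")         # remaining punctuation

# Tiny illustrative normalisation lexicon.
NORMALISE = {"u": "you", "2moro": "tomorrow", "gr8": "great"}

def tokenise(tweet):
    return TOKEN_RE.findall(tweet)

def normalise(tokens):
    return [NORMALISE.get(t.lower(), t) for t in tokens]

tweet = "@nlproc gr8 news, see u 2moro! http://example.org"
print(normalise(tokenise(tweet)))
# ['@nlproc', 'great', 'news', ',', 'see', 'you', 'tomorrow', '!', 'http://example.org']
```

A generic newswire tokeniser would split the URL and the @mention into fragments, which is one reason the downstream taggers and recognisers above need social-media-specific adaptation.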

Since the lack of human-annotated NLP corpora of social media content is another major challenge, this tutorial will also cover crowdsourcing approaches used to collect training and evaluation data, including paid-for crowdsourcing with CrowdFlower, combined with expert-sourcing and games with a purpose. We will also briefly discuss practical and ethical considerations arising from gathering and mining social media content.

The last part of the tutorial will address applications, including summarisation of social media content, user modelling (geo-location, age, gender, and personality identification), media monitoring and information visualisation, and using social media to predict economic and political outcomes (e.g. stock price movements, voting intentions).


Kalina Bontcheva (University of Sheffield) is a senior research scientist and the holder of an EPSRC career acceleration fellowship, working on text summarisation of social media. Kalina received her PhD on the topic of adaptive hypertext generation from the University of Sheffield in 2001. Her main interests are information extraction, opinion mining, natural language generation, text summarisation, and software infrastructures for NLP. She has been a leading developer of GATE since 1999. Kalina Bontcheva coordinated the EC-funded TAO STREP project on transitioning applications to ontologies, and led the Sheffield teams in the TrendMiner, MUSING, SEKT, and MI-AKT projects.

Leon Derczynski (University of Sheffield) is a post-doctoral Research Associate, who completed a PhD in Temporal Information Extraction at the University of Sheffield in 2012 under an enhanced EPSRC doctoral training grant. His main interests are in data-intensive approaches to computational linguistics, specialising in information extraction, spatiotemporal semantics, semi-supervised learning, and handling noisy linguistic data, especially social media. He has been working in commercial and academic research on NLP and IR since 2003, with a focus on temporal relations, temporal and spatial information extraction, semantic annotation, usability, and social media. His commercial work included early introduction of Mechanical Turk for scaling marketing and linguistic discrimination tasks. His current work focuses on processing social media.