Bootcamp: Learning representations of text for NLP

Learn and implement end-to-end deep learning models for natural language processing.

19-20 May 2018, Bangalore

Think of your favorite NLP application that you wish to build: sentiment analysis, named entity recognition, machine translation, information extraction, summarization, or a recommender system, to name a few. A key step towards any of these tasks is using the right set of techniques to represent text in a form that a machine can easily understand.

Unlike images, where the pixel intensities are a natural way to represent the image, text has no such natural representation. No matter how good your ML algorithm is, it can do only so much unless there is a rich way to represent the underlying text data. Thus, whatever NLP application you are building, it’s imperative to find a good representation for your text data. Motivated by this, the subfield of representation learning of text for NLP has attracted a lot of research interest in the past few years.

In this bootcamp, we will cover the key concepts, maths, and code behind state-of-the-art techniques for text representation. Various representation learning techniques have been proposed in the literature, but there is still a dearth of comprehensive tutorials that provide full coverage, with mathematical explanations as well as implementation details of these algorithms, to a satisfactory depth.

This bootcamp aims to bridge this gap. It aims to demystify both the theory (key concepts, maths) and the practice (code) that go into building these representation schemes. By the end of the bootcamp, participants will have gained a fundamental understanding of these schemes and the ability to implement them on datasets of their interest.

Target Audience

  • Machine learning practitioners
  • Anyone (researcher, student, professional) learning NLP
  • Corporates and Start-ups looking to add NLP to their product or service offerings


Prerequisites

  • This is a hands-on course, so participants should be comfortable with programming. Familiarity with the Python data stack is ideal.
  • Prior knowledge of machine learning will be helpful. Participants should have some practice with basic NLP problems such as sentiment analysis.
  • While the deep learning concepts will be taught in an intuitive way, some prior knowledge of linear algebra and probability theory will be helpful.


The material for the bootcamp is hosted on GitHub. You can find slides for this workshop here.

This bootcamp is part of the speakers' popular series on NLP. Additional relevant material will be shared prior to the bootcamp.


This is a two-day, instructor-led, hands-on bootcamp to learn and implement end-to-end deep learning models for natural language processing.

  • Day 1 will cover an introduction to text representation and older ways of representing text, followed by a deep dive into embedding spaces and word vectors.
  • Day 2 will cover more advanced techniques for representing text, such as Paragraph2vec/Doc2vec and various architectures for Char2vec.

There will be four sessions of three hours each over the two days.

Session 1: Introduction to representation learning

  1. What is representation learning?
  2. Use cases in natural language processing.
  3. Old ways of representing text
    • One-hot encoding
    • Tf-idf
    • N-grams
  4. How to use pre-trained word embeddings
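
As a rough illustration of the older representations listed in Session 1, here is a minimal pure-Python sketch of one-hot encoding, n-grams, and tf-idf. The toy corpus and the function names (one_hot, ngrams, tf_idf) are assumptions made for this example and are not taken from the bootcamp material; the tf-idf variant shown (unsmoothed log(N/df)) is only one of several in common use.

```python
import math
from collections import Counter

# Toy corpus, invented for this illustration.
docs = [
    "the movie was good",
    "the movie was bad",
    "good food and good service",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
index = {w: i for i, w in enumerate(vocab)}


def one_hot(word):
    """Vector of vocabulary size with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def ngrams(tokens, n):
    """All contiguous runs of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(tokens):
    """Term frequency times log(N / document frequency), per vocabulary word."""
    counts = Counter(tokens)
    vec = [0.0] * len(vocab)
    for word, count in counts.items():
        tf = count / len(tokens)
        df = sum(1 for doc in tokenized if word in doc)
        idf = math.log(len(tokenized) / df)
        vec[index[word]] = tf * idf
    return vec


print(one_hot("good"))
print(ngrams(tokenized[0], 2))
print(tf_idf(tokenized[2]))
```

Note that every one-hot vector is as long as the vocabulary and any two distinct words are equally far apart, which is one motivation for the dense embeddings covered in the later sessions.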

Session 2: Word-vectors

  1. Introduction to word-vectors
  2. Different techniques of generating word-vectors
    • CBOW, Skip-gram model
    • GloVe model
  3. Detailed implementation of each of these models in TensorFlow
  4. Negative sampling, hierarchical softmax, t-SNE
  5. Fine-tuning pretrained embeddings
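
To make the difference between the CBOW and Skip-gram models in Session 2 concrete, the sketch below builds the training examples each model would see from a tokenized sentence. It covers only the data preparation step, not the neural network or the TensorFlow implementation; the function names and the fixed symmetric window are assumptions made for this illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: predict the center word from its surrounding context words."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


sentence = "the quick brown fox jumps".split()
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

Skip-gram yields one (center, context) pair per neighbor, while CBOW yields one (context list, target) example per position.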

Session 3: Sentence2vec/Paragraph2vec/Doc2vec

  1. Extending word vectors to represent sentences/paragraphs/documents
  2. Various techniques for training doc2vec
    • Doc2vec: (i) Distributed Memory (DM), (ii) Distributed Bag of Words (DBOW)
    • Skip-thoughts
  3. Detailed implementation of each of these models in TensorFlow

Session 4: Char2vec

  1. Building character embeddings
  2. Tweet2vec - character embeddings from social data
  3. CNN for character vectors
  4. fastText - character n-gram embeddings
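
As a small illustration of the character n-gram idea behind fastText, the sketch below extracts the subword units of a word after wrapping it in boundary markers. It shows only the n-gram extraction; the actual library additionally maps n-grams to vectors and combines them, which is not reproduced here. The function name char_ngrams and its default lengths (3 to 6) are assumptions for this example.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    marked = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


print(char_ngrams("where", min_n=3, max_n=3))
```

Because rare or unseen words still share n-grams with known words, a representation built from these units can be computed even for words that never appeared in the training data.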

Software Requirements

We will use the Python data stack, with Keras and TensorFlow for the deep learning component. Please install Anaconda for Python 3 for the bootcamp. Additional requirements will be communicated to participants.


Anuj Gupta

Director - Machine Learning, Huawei Technologies

Satyam Saxena

Applied Scientist - Machine Learning, Amazon