Commit message generator

Background

As NLP stands for ‘Natural Language Processing’, it’s not surprising that the field has traditionally dealt with ‘Natural Language’: conversational and written text produced by humans.

Recently though, the NLP field has been widening its scope to also include processing ‘Programming Language’, i.e. source code.

Clear indications of this interesting evolution are:

  • The rise of appropriate metrics (e.g. CodeBLEU)
  • The rise of enormous datasets (e.g. CodeSearchNet)
  • The rise of benchmarks (e.g. CodeXGLUE)
  • The rise of pretrained models (e.g. CodeT5, CodeBERT, PLBART, etc.)
  • The rise of commercial products (e.g. OpenAI Codex / GitHub Copilot)

So it is high time we investigate this emerging field and try to make our mark on it!

One way to make that mark is to build a commit message generator application. AI is especially useful for automating tasks that are not that hard, that add value, and that humans dislike or skip over. Creating meaningful commit messages definitely falls under this category!

The idea is to train or fine-tune a seq2seq model (a model that converts one text into another) to generate a commit message from a code diff.
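As a rough illustration of the kind of input/output pair such a model would learn from, the sketch below builds a unified diff with Python's standard difflib module and pairs it with a commit message. The file name, the code change and the message are all made up for this example; real training pairs would come from actual repository history.

```python
import difflib

# Hypothetical "before" and "after" versions of a small file.
before = [
    "def greet(name):\n",
    "    print('Hello ' + name)\n",
]
after = [
    "def greet(name):\n",
    "    print(f'Hello, {name}!')\n",
]

# Model input: a unified diff, similar in shape to `git diff` output.
diff_text = "".join(
    difflib.unified_diff(before, after, fromfile="a/greet.py", tofile="b/greet.py")
)

# Model target: a commit message a developer might have written (invented here).
example_pair = {
    "source": diff_text,
    "target": "Use an f-string in greet and add punctuation",
}

print(example_pair["source"])
print("->", example_pair["target"])
```

The seq2seq model would receive the `source` text and be trained to produce the `target` text.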

This, dear ML6 Intern agent, is your mission.

Goal

The internship encompasses various steps:

  1. Researching the existing field of NLP+Code
  2. Identifying relevant architectures to test
  3. Identifying or scraping relevant code datasets to use
  4. Finding a suitable way of representing the input diff data
  5. Training a successful seq2seq model
  6. Open-sourcing the obtained models (and datasets)
  7. Writing a blog post
  8. Creating a small cloud-based serverless application to demonstrate the model(s)
  9. [extension] Creating and open-sourcing a VS Code plugin that can be used by developers
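To make step 4 more concrete, here is one possible way of representing a diff as a flat input sequence. This is only a minimal sketch under stated assumptions, not a recommended design: the marker tokens `<add>`, `<del>` and `<keep>`, the function name `encode_diff`, the whitespace tokenization and the `max_tokens` cut-off are all hypothetical choices, and the code assumes a unified diff that touches a single file.

```python
def encode_diff(diff_text: str, max_tokens: int = 64) -> str:
    """Flatten a single-file unified diff into one line of marked tokens."""
    tokens = []
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("@@"):
            in_hunk = True  # a hunk header starts the actual changes
            continue
        if not in_hunk:
            continue  # skip the file headers that precede the first hunk
        if line.startswith("+"):
            marker = "<add>"
        elif line.startswith("-"):
            marker = "<del>"
        else:
            marker = "<keep>"  # unchanged context line
        tokens.append(marker)
        # Naive whitespace split; indentation is lost in this sketch.
        tokens.extend(line[1:].split())
    # Truncate so the sequence fits a fixed (assumed) input budget.
    return " ".join(tokens[:max_tokens])


sample = (
    "--- a/x.py\n"
    "+++ b/x.py\n"
    "@@ -1,2 +1,2 @@\n"
    " def f():\n"
    "-    return 1\n"
    "+    return 2\n"
)
print(encode_diff(sample))
# prints: <keep> def f(): <del> return 1 <add> return 2
```

Whether to keep context lines, how to preserve indentation, and how to truncate long diffs are exactly the kind of questions this step is meant to explore; in practice the pretrained model's own subword tokenizer would replace the whitespace split used here.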