Quantifying the semantic gap

The adoption of NLP models in real-world use cases has also meant the rise of niche, domain-specific language models. Do you need a language model for Polish legal text? Sure thing. Or how about a language model for Swedish medical texts? Look no further. In many real-world situations, however, this raises the question of when to explore a custom language model and when an “out-of-the-box” language model will suffice. In some fields, such as medicine, the answer is straightforward: most of the important words (e.g., names of diseases, Latin anatomical terms) are out-of-vocabulary for general-purpose language models. In other domains, such as legal texts or technical manuals, it is much less obvious.

The goal of this project is to estimate the expected effect of using a custom language model instead of an existing one, based on known heuristics. The current best approach is to measure the overlap between the n-grams in your domain-specific texts and those in the texts that the existing solution was trained on, and to impose some (arbitrary) cut-off (e.g., if the n-gram overlap is under 30%, we explore using a custom language model). However, this method is neither very quantitative nor very rigorous.
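As a rough illustration of the n-gram overlap heuristic described above, here is a minimal Python sketch. It rests on several assumptions that are not part of the project description: whitespace tokenization with lowercasing, bigrams as the default n-gram size, and overlap defined as the share of domain n-gram occurrences that also appear anywhere in the reference corpus. The function names (`ngram_overlap`, `needs_custom_model`) are hypothetical, and the 30% cut-off is the arbitrary example value mentioned in the text.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(domain_texts, reference_texts, n=2):
    """Share of domain n-gram occurrences that also occur in the reference texts.

    Assumption: simple lowercased whitespace tokenization. A real setup
    would likely use the tokenizer of the existing language model.
    """
    domain_counts = Counter()
    for text in domain_texts:
        domain_counts.update(ngrams(text.lower().split(), n))

    reference_ngrams = set()
    for text in reference_texts:
        reference_ngrams.update(ngrams(text.lower().split(), n))

    total = sum(domain_counts.values())
    if total == 0:
        # No n-grams in the domain texts, so there is nothing to compare.
        return 0.0

    shared = sum(
        count for gram, count in domain_counts.items() if gram in reference_ngrams
    )
    return shared / total


def needs_custom_model(overlap, cutoff=0.30):
    """Apply the (arbitrary) cut-off: below it, explore a custom model."""
    return overlap < cutoff


if __name__ == "__main__":
    domain = ["the court ruled on the appeal"]
    reference = ["the court ruled yesterday"]
    overlap = ngram_overlap(domain, reference, n=2)
    print(f"bigram overlap: {overlap:.2f}")
    print("explore custom model:", needs_custom_model(overlap))
```

The sketch also makes the weakness of the heuristic visible: the result depends on the tokenization, the choice of n, and the cut-off, none of which is tied to downstream model performance.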

Methodology / Tasks

During this internship, you will:

  • Gain experience leveraging state-of-the-art NLP techniques.
  • Conduct applied research to solve a concrete real-world problem.
  • Let your creative skills loose on a major problem for many companies in a wide range of domains.

Profile / Required skills

  • Strong analytical abilities, knowledge of different statistical methods, comfort with mathematics, and familiarity with research studies.
  • Strong interest in NLP [preferred].
  • Familiarity with statistical analysis languages and tools such as Python and SQL.
  • Excellent verbal and written communication in English.
  • You are currently pursuing a degree in computer science or a related field.

Internship Duration

The duration of the internship is flexible and depends on the candidate's preference and the project requirements. The typical duration is 6 to 8 weeks. The preferred duration for this specific project is 6 weeks.


Our internships and theses are linked to our chapters. A chapter is a cross-squad team of experts in a specific topic to enable knowledge building and sharing across projects. The chapters build knowledge by performing applied research and gathering learnings from projects. This internship falls under the NLP chapter.