Needle in the haystack

Background

Companies have a massive amount of information within reach, only a few clicks away. The challenge with these large volumes of information is that we, as human beings, are not able to digest them properly. Search engines like Google help to prioritise information based on your query and interests. However, with the trend towards open data, we noticed that not all data is indexed by Google, leaving multiple sources “undiscoverable”.

One such type of information is local, regional and federal political information. Vast amounts of reports, detailed research documents and similar material are available for mining, but it is difficult to extract value from them, since the information is “stuck” in PDFs or, as mentioned, is not indexed by Google. Many political decision documents have become more and more openly available. Whether on a local municipal level or on a district level, these data and metadata have found their way into Linked Open Data platforms and databases.

In this project, we wish to solve this challenge. We want to enable many companies to reach that information with ease, based on their interests in the political chatter available about their company, line of work or sector.

A traditional approach to this problem is to have people spend hours combing through government statements and meeting notes to sift out nuggets of information pertaining to a certain context. A more modern approach, however, is to use advanced NLP techniques to do this automatically on a large scale.
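To make the automated approach a bit more concrete, here is a minimal sketch of one very simple way to surface documents that match a company's interests: ranking them by TF-IDF overlap with a set of interest terms. It uses only the Python standard library. The function names, the sample documents and the query are all made up for illustration and are not part of the project design; a real solution would likely rely on proper NLP tooling rather than this toy scoring.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_documents(query, documents):
    """Return (index, score) pairs, highest TF-IDF overlap with the query first."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))

    scores = []
    for index, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term]:
                tf = counts[term] / len(tokens)
                # Smoothed IDF, so rarer terms weigh more than common ones.
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
                score += tf * idf
        scores.append((index, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data, invented for this sketch.
documents = [
    "The city council approved a new subsidy for solar panel installations.",
    "Minutes of the district meeting on road maintenance and parking.",
    "Regional report on renewable energy targets and solar capacity.",
]
ranking = rank_documents("solar energy subsidy", documents)
for index, score in ranking:
    print(f"{score:.3f}  {documents[index]}")
```

In this toy example the two documents that mention the interest terms end up at the top, and the meeting minutes about road maintenance, which share no term with the query, get a score of zero and sort last.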

Goal

We want to address this gap by creating an end-to-end application that enables companies:

  • To find the crucial and relevant topics in this vast amount of information: the needle in the haystack.
  • To get targeted insights into specifically important events.

Your mission, dear ML6 Intern agent, should you choose to accept it, is exactly this!

Functional solution

From a high-level technical perspective, the solution could look as follows:

Of course, things aren’t set in stone, and the finalisation of the functional design, as well as the translation into the technical design, is something that can happen in collaboration with senior engineers at ML6.

Technologies involved

On a machine learning level:

  • NLP
  • Keyword extraction
  • Named entity recognition
  • Extractive summarization

Engineering-wise:

  • Serverless backend applications
  • Microservice architecture
  • Apache Beam
  • Event-driven task architecture
  • Data Warehousing
  • Scraping (Scrapy framework)

General way of working:

  • Google Cloud Platform
  • Trello
  • Bitbucket

So if you are a person with a broad set of interests in Machine Learning, Data Engineering and Software Engineering: you are the agent for the job 😎!