What Is a Few-Shot Learning Molecular Dataset?

A short introduction to FS-Mol

Dec 19, 2021

I want to occasionally cover the more academic and technical side of the A.I. news cycle, and I think this is the Newsletter where I will do that, as opposed to AISupremacy.

So for more academic coverage of machine learning, subscribe to Data Science Learning Center.

This Newsletter was actually born from being defrauded by a site called Datascience Central, a supposed client who refused to pay me for eight articles I wrote in the second half of 2021.

Let’s get into it.

Microsoft has introduced a way to bring deep learning into drug discovery more efficiently. FS-Mol is a dataset and benchmark aimed at applying deep learning to early-stage drug discovery.

Microsoft Researchers Introduce ‘FS-MOL’, A Few-Shot Learning Molecular Dataset, To Bring Deep Learning in Early-Stage Drug Discovery

If you’ve never heard of few-shot learning, don’t worry, I hadn’t either.

Small datasets are ubiquitous in drug discovery as data generation is expensive and can be restricted for ethical reasons (e.g. in vivo experiments). A widely applied technique in early drug discovery to identify novel active molecules against a protein target is modelling quantitative structure-activity relationships (QSAR).

QSAR modelling is known to be extremely challenging, because the available measurements of compound activity typically number only in the low dozens or hundreds. However, many such related datasets exist, each with a small number of datapoints, which opens up the opportunity for few-shot learning after pre-training on a substantially larger corpus of data.

At the same time, many few-shot learning methods are currently evaluated only in the computer-vision domain. In a paper from August 2021, Microsoft Research proposes that expansion into a new application, as well as the possibility of using explicitly graph-structured data, will drive exciting progress in few-shot learning.

To that end, the researchers provide a few-shot learning dataset (FS-Mol) and a complementary benchmarking procedure. They define a set of tasks on which few-shot learning methods can be evaluated, with a separate set of tasks for use in pre-training. In addition, they implement and evaluate a number of existing single-task, multi-task, and meta-learning approaches as baselines for the community. They hope that the dataset, supporting code release, and baselines will encourage future work on this extremely challenging new domain for few-shot learning.

Drug Discovery using AI will go Mainstream in the 2020s

So why is this interesting? Drug discovery using AI will be big business.

The discovery, design, and testing phases of the drug development process are iterative. Historically, drugs were sourced from plants and found through trial and error. Thankfully, drug research today takes place in a lab, with each iteration of custom-designed chemicals yielding a more promising candidate. While much safer and more effective, this process takes a long time and costs a lot of money.

It can take over ten years to bring a single medicine from concept to market, at a cost of anywhere between $1 billion and $2 billion.

A lot of effort is invested during the repeated cycles of developing and synthesizing new candidate molecules, testing them, and determining which molecular features to improve before starting the process again. In a laboratory, the steps of synthesis and in vitro testing of molecular behavior are inherently slow.

Computational modeling is one way to speed up the drug-development process. Compounds can be prioritized in silico, that is, by computer simulation, even if they aren’t physically available, so that only the candidates most likely to succeed are synthesized and measured.
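
To make the idea of in silico prioritization concrete, here is a minimal Python sketch. It assumes a model has already assigned each candidate a predicted probability of being active; the candidate names, the scores, and the `prioritize` helper are all made up for illustration and are not part of FS-Mol.

```python
from typing import Dict, List, Tuple


def prioritize(predicted_activity: Dict[str, float], budget: int) -> List[Tuple[str, float]]:
    """Return the `budget` candidates with the highest predicted activity."""
    ranked = sorted(predicted_activity.items(), key=lambda item: item[1], reverse=True)
    return ranked[:budget]


# Hypothetical model outputs: predicted probability that each candidate is active.
scores = {"cand_A": 0.91, "cand_B": 0.12, "cand_C": 0.67, "cand_D": 0.45}

# Only the top-ranked candidates would go on to synthesis and lab measurement.
shortlist = prioritize(scores, budget=2)
print(shortlist)
```

The lab budget is the scarce resource here, so the ranking only has to be good enough to put the most promising molecules at the top of the list.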

To enable such a speedup, a machine learning model must predict chemical properties correctly, above all whether a proposed drug molecule will be active, that is, able to alter the protein target associated with the disease.

The AI of Biotechnology is in its Nascent Stage

When millions of data points are available, ML is known to be particularly good at spotting patterns in images and text. However, only a few dozen molecules are likely to have been measured in a laboratory during the early phases of the drug-discovery process. Since data generation is expensive and can be restricted for ethical reasons, small datasets are standard in drug research.

Microsoft is Investing in Drug Discovery AI

Healthcare being augmented by A.I. is also a corporate arms race. FS-Mol: A Few-Shot Learning Dataset of Molecules was developed by the Machine Intelligence team at Microsoft Research Cambridge in partnership with Novartis to address the problem of molecule-protein interaction prediction given a small amount of data. The goal is to help the ML and computational chemistry communities work together to solve this complex problem.

This will give BigTech generational power in the field of healthcare, genomics and the pharma industry. Monopolies could become factions in the future as the best A.I. is most likely to transform healthcare (and education) at scale.

Microsoft’s Breakthrough FS-Mol:

The researchers assembled a dataset made up of many small datasets for protein-ligand binding prediction, together with a principled strategy for using them in few-shot learning. Because no such benchmark existed, they also built an open-source evaluation framework that lets ML researchers evaluate their work and helps drug development professionals determine which computational modeling approaches are most promising for their specific goals.

Few-shot learning is well established in the computer vision and reinforcement learning communities. It involves training an ML model on data from a set of related tasks and then adapting it to a new task of interest with only a few relevant data points. The model is primed to pick up new information, similar to how a human brain learns to recognize an object it has only seen once. Access to millions of data points for each new task we may encounter is therefore not required.
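
For readers who prefer code, the adaptation step can be sketched with a toy nearest-centroid classifier, in the spirit of prototype-based few-shot methods. This is only an illustration and not the paper's implementation: the two-dimensional feature vectors stand in for whatever a pre-trained model would produce for each molecule, and every name and number below is an assumption.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Average a list of equal-length feature vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_prototypes(support: List[Tuple[Vector, int]]) -> Dict[int, List[float]]:
    """Build one prototype (mean feature vector) per label from the support set."""
    by_label: Dict[int, List[Vector]] = {}
    for features, label in support:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(prototypes: Dict[int, List[float]], features: Vector) -> int:
    """Assign the label whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(prototypes[label], features))


# Hypothetical support set for a new task: four labelled molecules
# (label 1 = active, 0 = inactive), each described by a made-up 2-D feature vector.
support = [([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.1, 0.9], 0), ([0.2, 0.7], 0)]
prototypes = fit_prototypes(support)

# Unlabelled query molecules from the same task.
query = [[0.85, 0.15], [0.15, 0.8]]
predictions = [predict(prototypes, q) for q in query]
print(predictions)
```

The point of the sketch is that four labelled examples are enough to make a prediction, provided the features already carry useful information. Producing such features is the job of pre-training on related tasks.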

It turns out simulating real-world barriers and obstacles is part of the key.

Pre-training techniques try to prepare an ML model for specialization by teaching it to identify the most significant properties. Multitask training is one such strategy: it trains a model to predict labels for molecules drawn from many tasks simultaneously. Self-supervised pre-training is another: models are trained to recover information that has been removed from or altered in the input.

Such approaches can be compared fairly only if all few-shot learners are given the same test problem and have access to the same information during the pre-training phase. However, prior to this effort there was no well-defined set of tasks and no clear testing strategy. The researchers created a dataset and evaluation procedure that mirror the real-world obstacles of early-stage drug development.
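
One practical ingredient of a fair comparison is that every method sees exactly the same split of a task into a small labelled support set and a held-out query set. Here is a minimal Python sketch of that idea using a fixed random seed. The `sample_episode` helper, the molecule identifiers, and the sizes are hypothetical and do not reflect the actual FS-Mol evaluation code.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (molecule identifier, activity label)


def sample_episode(
    task: List[Example], support_size: int, seed: int
) -> Tuple[List[Example], List[Example]]:
    """Split one task into a support set and a query set, reproducibly."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    shuffled = task[:]
    rng.shuffle(shuffled)
    return shuffled[:support_size], shuffled[support_size:]


# Hypothetical task with ten measured molecules and alternating labels.
task = [(f"mol_{i}", i % 2) for i in range(10)]

# Two methods evaluated with the same seed receive identical splits.
support_a, query_a = sample_episode(task, support_size=4, seed=0)
support_b, query_b = sample_episode(task, support_size=4, seed=0)
print(support_a == support_b, len(support_a), len(query_a))
```

Repeating this over several seeds and support-set sizes would show how each method behaves as the number of labelled molecules grows.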

The research shows that early-stage drug development is well posed as a few-shot learning problem, and that pre-training and, in particular, meta-learning approaches can significantly improve the quality of molecular property predictions. By sharing the dataset and evaluation framework along with these baseline results, the researchers have given the drug-discovery community a way to test state-of-the-art ML research on a truly realistic problem.

It appears, then, that Microsoft is well positioned to work with the pharma industry to monetize its A.I. research, just as Google’s DeepMind and OpenAI (Microsoft) are being monetized.

Paper: https://openreview.net/forum?id=701FtuyLlAd

Reference: https://www.microsoft.com/en-us/research/blog/fs-mol-bringing-deep-learning-to-early-stage-drug-discovery/

Machine Economy Press
