Microsoft BioGPT: Towards the ChatGPT of Life Science?

The use of language models (LMs) has exploded in recent years, and ChatGPT is just the tip of the iceberg. ChatGPT has been used to write code, recipes, and even sonnets and poems. All noble purposes, but there is also a huge body of scientific literature, so why not exploit this vast amount of textual data?

Microsoft recently unveiled BioGPT, which does exactly that. The new model has achieved state-of-the-art results on several tasks. Let’s find out together.

First, why is this important? Thousands of scientific publications come out every year, and it is difficult to keep up with the growing literature. At the same time, the scientific literature is essential for developing new drugs, designing new trials, creating new algorithms, and understanding the mechanisms of disease.

In fact, NLP pipelines can be used to extract information from large collections of scientific articles (named entities, relationships, classifications, and so on). However, general-purpose LMs tend to perform poorly in the biomedical area (they generalize badly to such a specialized domain). For this reason, researchers prefer to train models directly on scientific articles.

PubMed, screenshot by the Author

In general, PubMed (the main repository of biomedical articles) contains about 30M articles. So there are more than enough articles to train a model and then use this pre-trained model for downstream tasks.

Generally, two types of pre-trained models are used:

  • BERT-like models, trained with masked language modeling. Here, given a sequence of tokens (subwords) in which some are masked, the model must predict the masked tokens using the remaining ones (the context).
  • GPT-like models, trained with auto-regressive language modeling. The model learns to predict the next token in a sequence, given the tokens that precede it.
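The two objectives above can be illustrated with a toy sketch (purely for intuition; this is not how either model is actually implemented):

```python
# Toy illustration of the two pre-training objectives on a token sequence.
tokens = ["bio", "gpt", "is", "trained", "on", "pubmed", "abstracts"]

def masked_lm_example(tokens, mask_positions):
    """BERT-style: hide some tokens; the model sees the full (bidirectional)
    context and must recover the masked ones."""
    inputs = ["[MASK]" if i in mask_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in mask_positions}
    return inputs, targets

def causal_lm_examples(tokens):
    """GPT-style: at each position the model sees only the prefix and must
    predict the next token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

inputs, targets = masked_lm_example(tokens, {1, 5})
# inputs hides "gpt" and "pubmed"; targets maps their positions back to them.
pairs = causal_lm_examples(tokens)
# e.g. pairs[2] is (["bio", "gpt", "is"], "trained")
```

BERT-like training sees context on both sides of the blank, which helps understanding tasks; GPT-like training only ever looks left, which is what makes free-form generation natural.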

BERT has been used extensively, and in fact there are several variants dedicated to the biomedical world: BioBERT, PubMedBERT, and so on. These models have shown superior capabilities on understanding tasks compared to other models. GPT-like models, on the other hand, are superior on generative tasks but have been little explored in the biomedical field.

So, in summary, the authors of this paper used a GPT-like architecture:

we propose BioGPT, a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining. BioGPT follows the Transformer language model backbone, and is pre-trained on 15M PubMed abstracts from scratch. We apply BioGPT to six biomedical NLP tasks: end-to-end relation extraction on BC5CDR [13], KD-DTI [14] and DDI [15], question answering on PubMedQA [16], document classification on HoC [17], and text generation. (original article)

The authors tested BioGPT (and compared it with previous methods) on three main tasks:

  • Relation extraction. The goal is the joint extraction of entities and their relationships (e.g., drugs, diseases, proteins, and how they interact).
  • Question answering. The model must provide an appropriate answer given a context (reading comprehension).
  • Document classification. The model must assign a document a label (or more than one label).
  image source: preprint


As the authors point out, when training a model from scratch it is important to make sure the dataset comes from the target domain, is of high quality, and is large enough. In this case, they used 15M abstracts. In addition, the vocabulary must also be appropriate for the domain: here the vocabulary is learned with byte pair encoding (BPE). Next, the architecture must be chosen, and the authors chose GPT-2.
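The BPE idea itself is simple: count adjacent symbol pairs in the corpus and repeatedly merge the most frequent pair into a new vocabulary entry, so domain terms get their own subwords. A minimal, self-contained sketch on a toy corpus (not the authors' implementation):

```python
from collections import Counter

def most_frequent_pair(corpus):
    """corpus: list of words, each a tuple of symbols.
    Return the most frequent adjacent symbol pair."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Fuse every occurrence of `pair` into a single new symbol."""
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(tuple(out))
    return merged

corpus = [tuple("protein"), tuple("proteins"), tuple("proteomics")]
pair = most_frequent_pair(corpus)   # ("p", "r") in this tiny corpus
corpus = merge_pair(corpus, pair)   # "p","r" fused into "pr" everywhere
```

Run on biomedical text, repeated merges like these are what let the vocabulary represent terms such as gene or drug names compactly, instead of shattering them into generic English subwords.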

The authors specifically engineered the datasets to create training prompts (in which they provided both the prompt and the target). This was done to enable better fine-tuning for the biomedical domain.
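The exact templates are given in the paper; as a purely hypothetical illustration of the idea (the function and sentence format below are mine, not BioGPT's), a drug-target triple could be serialized into a prompt/target pair like this:

```python
def make_training_pair(abstract: str, triple: tuple) -> tuple:
    """Serialize a (drug, target, interaction) triple into a natural-language
    target sentence, paired with the source abstract as the prompt.
    Hypothetical format, for illustration only."""
    drug, target, interaction = triple
    target_text = f"the interaction between {drug} and {target} is {interaction}."
    return abstract, target_text

prompt, target = make_training_pair(
    "Dasatinib is a potent inhibitor of BCR-ABL kinase...",
    ("dasatinib", "BCR-ABL", "inhibitor"),
)
```

Phrasing the target as a fluent sentence rather than a raw structured triple is the point: it keeps the fine-tuning task close to the natural-language distribution the model was pre-trained on.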

  Framework of BioGPT when adapting to downstream tasks. image source: preprint


The authors evaluated the model on end-to-end relation extraction against REBEL, a recently published model based on BART (a sequence-to-sequence Transformer). They also used as a baseline the original GPT-2, which was not trained specifically on the medical domain. The table shows that BioGPT achieves state-of-the-art results on a chemical-disease dataset:

“Results on BC5CDR chemical-disease-relation extraction task. ’gt+pred’ means using ground truth NER information for training and using open-source NER tool to annotate NER for inference. ’pred+pred’ means using open-source NER tool for both training and inference. ’†’ means training on training and validation set.” image source: preprint

Drug-drug interaction is another type of interaction that is very useful in research. In fact, interactions between two different drugs are among the leading causes of adverse effects during treatment, so predicting potential interactions in advance is an asset for clinicians:

  image source: preprint


They also used a drug-target interaction dataset. This result is very interesting, because predicting drug-target interactions can be useful for developing new drugs.

  image source: preprint


On this task, too, the model achieved state-of-the-art results:

  image source: preprint


And even on document classification, BioGPT outperformed previous models:

  image source: preprint


As mentioned above, GPT models have generative capability: “they can continue to generate text that are syntactically correct and semantically smooth conditioning on the given text”. The authors decided to evaluate the model’s ability to generate biomedical text.

Specially, we extract all the entities within the triplets from the KD-DTI test set (i.e. drugs and targets). Then for each drug/target name, we provide it to the language model as the prefix and let the model generate text conditioned on it. We then investigate whether the generated text is meaningful and fluent. (original article)

The authors noted that the model worked well with known names as input, while with an unknown name the model either copies from an article (something seen in the training set) or fails to generate informative text.

  image source: preprint


In conclusion:

BioGPT achieves SOTA results on three endto-end relation extraction tasks and one question answering task. It also demonstrates better biomedical text generation ability compared to GPT-2 on the text generation task. (original article)

In the future, the authors would like to scale up with a bigger model and a bigger dataset:

For future work, we plan to train larger scale BioGPT on larger scale biomedical data and apply to more downstream tasks. (original article)

Microsoft is convinced the model can help biologists and scientists with scientific discovery. In the future it could be useful in the search for new drugs, as part of pipelines that analyze the scientific literature.

article: here, preprint: here, GitHub repository: here, HuggingFace page: here.

What do you think about it?
