ProteinChat: Towards Achieving ChatGPT-Like Functionalities on Protein
3D Structures
Abstract
The study of proteins is critical in various scientific disciplines, but
understanding their complex structure-function relationships remains
challenging. Recent advancements in large language models (LLMs) have
demonstrated their ability to comprehend task-specific knowledge,
suggesting the potential for specially trained ChatGPT-like systems to
accelerate protein research. In this work, we introduce ProteinChat, a
prototype model aimed at learning and understanding protein 3D
structures. ProteinChat enables users to upload proteins, ask questions,
and engage in interactive conversations to gain insights. The
ProteinChat system consists of three main components: a composite
encoder block, a projection layer, and an LLM. The protein is first
encoded into a protein embedding, which is then projected to match the
input space of the LLM. The LLM combines user questions with the embedding to
generate informative answers. To train ProteinChat, we curated the
RCSB-PDB Protein Description Dataset, comprising 143,508
protein-description pairs from publicly available sources. By leveraging
the capabilities of ProteinChat, researchers can potentially expedite
their investigations into protein functionalities and structures,
benefiting areas such as drug development and therapeutics as well as
food science and nutrition. This initial step lays
the foundation for further exploration and use of
ChatGPT-like systems in protein research. The code is available at
\url{https://github.com/UCSD-AI4H/proteinchat} and the
dataset can be downloaded from
\url{https://drive.google.com/file/d/1AeJW5BY5C-d8mKJjAULTax6WA4hzWS0N/view?usp=share_link}.
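
The three-component pipeline described in the abstract (encoder, projection layer, LLM) can be illustrated with a toy sketch. This is not the released implementation: every function name, the dimensions `ENCODER_DIM` and `LLM_DIM`, and the random stand-in features are assumptions made for illustration only. In particular, the encoder here consumes a residue string instead of a 3D structure, and the LLM is reduced to a hypothetical token-embedding lookup. The sketch only shows how a protein embedding might be projected to the LLM's width and concatenated with the embedded question.

```python
import random

# Hypothetical sizes, chosen only for illustration.
ENCODER_DIM = 8   # assumed width of the protein encoder output
LLM_DIM = 16      # assumed width of the LLM token embeddings


def encode_protein(sequence, dim=ENCODER_DIM):
    """Stand-in for the composite encoder: one toy vector per residue."""
    vectors = []
    for residue in sequence:
        # Seed by residue type so the same residue always gets the same vector.
        rng = random.Random(ord(residue))
        vectors.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
    return vectors


def make_projection(in_dim=ENCODER_DIM, out_dim=LLM_DIM, seed=0):
    """Create the weight matrix and bias of a linear projection layer."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(embedding, weight, bias):
    """Map each encoder vector into the LLM embedding space (y = Wx + b)."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b
         for row, b in zip(weight, bias)]
        for vec in embedding
    ]


def embed_text(text, dim=LLM_DIM):
    """Stand-in for the LLM's token embedding lookup (whitespace tokens)."""
    vectors = []
    for token in text.split():
        rng = random.Random(token)
        vectors.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
    return vectors


def build_llm_input(sequence, question, weight, bias):
    """Concatenate projected protein vectors with the embedded question."""
    protein_part = project(encode_protein(sequence), weight, bias)
    question_part = embed_text(question)
    return protein_part + question_part


if __name__ == "__main__":
    weight, bias = make_projection()
    llm_input = build_llm_input("MKT", "What does this protein do?", weight, bias)
    print(len(llm_input), "input vectors of width", len(llm_input[0]))
```

In a real system the concatenated sequence would be passed to the LLM, which generates the answer; that step is omitted here because it depends on the specific model used.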