Large language models generate functional protein sequences across diverse families

Koga, N. et al. Principles for designing ideal protein structures. Nature 491, 222–227 (2012).

Lin, Y.-R. et al. Control over overall shape and size in de novo designed proteins. Proc. Natl Acad. Sci. USA 112, E5478–E5485 (2015).

Huang, P.-S., Boyken, S. E. & Baker, D. The coming of age of de novo protein design. Nature 537, 320–327 (2016).

Huang, P.-S. et al. De novo design of a four-fold symmetric TIM-barrel protein with atomic-level accuracy. Nat. Chem. Biol. 12, 29–34 (2016).

Boyken, S. E. et al. De novo design of protein homo-oligomers with modular hydrogen-bond network–mediated specificity. Science 352, 680–687 (2016).

Lapedes, A. S., Giraud, B. G., Liu, L. & Stormo, G. D. Correlated mutations in models of protein sequences: phylogenetic and structural effects. Lect. Notes Monogr. Ser. 33, 236–256 (1999).

Russ, W. P. et al. An evolution-based model for designing chorismate mutase enzymes. Science 369, 440–445 (2020).

Hopf, T. A. et al. The EVcouplings Python framework for coevolutionary sequence analysis. Bioinformatics 35, 1582–1584 (2019).

Morcos, F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl Acad. Sci. USA 108, E1293–E1301 (2011).

Ovchinnikov, S., Kamisetty, H. & Baker, D. Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. eLife 3, e02030 (2014).

Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).

Wu, Z. et al. Signal peptides generated by attention-based neural networks. ACS Synth. Biol. 9, 2154–2161 (2020).

Shin, J.-E. et al. Protein design and variant prediction using autoregressive generative models. Nat. Commun. 12, 2403 (2021).

Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

Bryant, D. H. et al. Deep diversification of an AAV capsid protein by machine learning. Nat. Biotechnol. 39, 691–696 (2021).

Das, P. et al. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nat. Biomed. Eng. 5, 613–623 (2021).

Anishchenko, I. et al. De novo protein design by deep network hallucination. Nature 600, 547–552 (2021).

Moffat, L., Kandathil, S. M. & Jones, D. T. Design in the DARK: learning deep generative models for de novo protein design. Preprint at bioRxiv https://doi.org/10.1101/2022.01.27.478087 (2022).

Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 13, 4348 (2022).

Huang, B. et al. A backbone-centred energy function of neural networks for protein design. Nature 602, 523–528 (2022).

Leinonen, R. et al. UniProt archive. Bioinformatics 20, 3236–3237 (2004).

Bairoch, A. et al. The Universal Protein Resource (UniProt). Nucleic Acids Res. 33, D154–D159 (2005).

Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Res. 42, D222–D230 (2014).

Vaswani, A. et al. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS, 2017).

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT, 2019).

Brown, T. B. et al. Language models are few-shot learners. In 34th Conference on Neural Information Processing Systems (NeurIPS, 2020).

Zellers, R. et al. Defending against neural fake news. In 33rd Conference on Neural Information Processing Systems (NeurIPS, 2019).

Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C. & Socher, R. CTRL: a conditional transformer language model for controllable generation. Preprint at arXiv https://doi.org/10.48550/arXiv.1909.05858 (2019).

AlQuraishi, M. The future of protein science will not be supervised. Some Thoughts on a Mysterious Universe https://moalquraishi.wordpress.com/2019/04/01/the-future-of-protein-science-will-not-be-supervised/ (2019).

Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).

Elnaggar, A. et al. ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2021).

Peters, M. E. et al. Deep contextualized word representations. In Proceedings of North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT, 2018).

Howard, J. & Ruder, S. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL, 2018).

Radford, A., Narasimhan, K., Salimans, T. & Sutskever, I. Improving language understanding by generative pre-training. Preprint at https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (2018).

Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).

Pfaff, C. W. Constraints on language mixing: Intrasentential code-switching and borrowing in Spanish/English. Language 55, 291–318 (1979).

Poplack, S. Sometimes I’ll start a sentence in Spanish Y TERMINO EN ESPAÑOL: toward a typology of code-switching. Linguistics 18, 581–618 (1980).

Dathathri, S. et al. Plug and play language models: a simple approach to controlled text generation. In 8th International Conference on Learning Representations (ICLR, 2020).

Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442–451 (1975).

Broendum, S. S., Buckle, A. M. & McGowan, S. Catalytic diversity and cell wall binding repeats in the phage-encoded endolysins. Mol. Microbiol. 110, 879–896 (2018).

Love, M. J., Abeysekera, G. S., Muscroft-Taylor, A. C., Billington, C. & Dobson, R. C. J. On the catalytic mechanism of bacteriophage endolysins: opportunities for engineering. Biochim. Biophys. Acta. Proteins Proteom. 1868, 140302 (2020).

Martin, P. P. Potts Models And Related Problems In Statistical Mechanics (World Scientific, 1991).

Thomas, J., Ramakrishnan, N. & Bailey-Kellogg, C. Graphical models of residue coupling in protein families. IEEE/ACM Trans. Comput. Biol. Bioinform. 5, 183–197 (2008).

Weigt, M., White, R. A., Szurmant, H., Hoch, J. A. & Hwa, T. Identification of direct residue contacts in protein–protein interaction by message passing. Proc. Natl Acad. Sci. USA 106, 67–72 (2009).

Balakrishnan, S., Kamisetty, H., Carbonell, J. G., Lee, S.-I. & Langmead, C. J. Learning generative models for protein fold families. Proteins 79, 1061–1078 (2011).

Stein, R. R., Marks, D. S. & Sander, C. Inferring pairwise interactions from biological data using maximum-entropy probability models. PLoS Comput. Biol. 11, e1004182 (2015).

Mirdita, M., Steinegger, M., Breitwieser, F., Söding, J. & Levy Karin, E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics 37, 3029–3031 (2021).

Mooers, B. H. M., Tronrud, D. E. & Matthews, B. W. Evaluation at atomic resolution of the role of strain in destabilizing the temperature-sensitive T4 lysozyme mutant Arg 96 → His. Protein Sci. 18, 863–870 (2009).

Baase, W. A., Liu, L., Tronrud, D. E. & Matthews, B. W. Lessons from the lysozyme of phage T4. Protein Sci. 19, 631–641 (2010).

Kuroki, R., Weaver, L. H. & Matthews, B. W. A covalent enzyme–substrate intermediate with saccharide distortion in a mutant T4 lysozyme. Science 262, 2030–2033 (1993).

Mchaourab, H. S., Oh, K. J., Fang, C. J. & Hubbell, W. L. Conformation of T4 lysozyme in solution. Hinge-bending motion and the substrate-induced conformational transition studied by site-directed spin labeling. Biochemistry 36, 307–316 (1997).

Kim, J.-K. et al. BetaCavityWeb: a webserver for molecular voids and channels. Nucleic Acids Res. 43, W413–W418 (2015).

Rost, B. Twilight zone of protein sequence alignments. Protein Eng. 12, 85–94 (1999).

Pearson, W. R. An introduction to sequence similarity (‘homology’) searching. Curr. Protoc. Bioinforma. 3, 3.1 (2013).

Repecka, D. et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat. Mach. Intell. 3, 324–333 (2021).

Ruder, S., Peters, M. E., Swayamdipta, S. & Wolf, T. Transfer learning in natural language processing. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics (eds Burstein, J., Doran, C. & Solorio, T.) (Association for Computational Linguistics, 2019).

Huh, M., Agrawal, P. & Efros, A. A. What makes ImageNet good for transfer learning? Preprint at arXiv https://doi.org/10.48550/arXiv.1608.08614 (2016).

LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

Norn, C. et al. Protein sequence design by conformational landscape optimization. Proc. Natl Acad. Sci. USA 118, e2017228118 (2021).

Anand, N. et al. Protein sequence design with a learned potential. Nat. Commun. 13, 746 (2022).

Federhen, S. The NCBI Taxonomy database. Nucleic Acids Res. 40, D136–D143 (2012).

Pettit, L. D. The IUPAC stability constants database. Chem. Int. 28, 14–15 (2006).

Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000).

Bengio, Y., Ducharme, R., Vincent, P. & Janvin, C. A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003).

Madani, A. et al. ProGen: language modeling for protein generation. Preprint at bioRxiv https://doi.org/10.1101/2020.03.07.982272 (2020).

Vig, J. et al. BERTology meets biology: Interpreting attention in protein language models. In International Conference on Learning Representations (ICLR, 2020).

Goyal, K., Dyer, C. & Berg-Kirkpatrick, T. Exposing the implicit energy networks behind masked language models via metropolis–hastings. In 10th International Conference on Learning Representations (ICLR, 2022).

Bhattacharya, N. et al. Single layers of attention suffice to predict protein contacts. Preprint at bioRxiv https://doi.org/10.1101/2020.12.21.423882 (2020).

Ramsauer, H. et al. Hopfield Networks is All You Need. Preprint at arXiv https://doi.org/10.48550/arXiv.2008.02217 (2020).

Alley, E., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).

Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019).

Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at arXiv https://doi.org/10.48550/arXiv.1412.6980 (2014).

Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of training recurrent neural networks. In Proc. 30th International Conference on Machine Learning (eds. Dasgupta, S. & McAllester, D.) 1310–1318 (PMLR, 2013).

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).

Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The curious case of neural text degeneration. In 8th International Conference on Learning Representations (ICLR, 2020).

Goodfellow, I. J. et al. Generative adversarial networks. In 28th Conference on Neural Information Processing Systems (NIPS, 2014).

Koehn, P. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Machine Translation: From Real Users to Research 115–124 (Springer, 2004).

Sun, Z. Z. et al. Protocols for implementing an Escherichia coli based TX-TL cell-free expression system for synthetic biology. J. Vis. Exp. 16, e50762 (2013).

Kabsch, W. XDS. Acta Crystallogr. D Biol. Crystallogr. 66, 125–132 (2010).

McCoy, A. J. et al. Phaser crystallographic software. J. Appl. Crystallogr. 40, 658–674 (2007).

Kovalevskiy, O., Nicholls, R. A., Long, F., Carlon, A. & Murshudov, G. N. Overview of refinement procedures within REFMAC5: utilizing data from different sources. Acta Crystallogr. D Struct. Biol. 74, 215–227 (2018).

Terwilliger, T. C. et al. Iterative model building, structure refinement and density modification with the PHENIX AutoBuild wizard. Acta Crystallogr. D Biol. Crystallogr. 64, 61–69 (2008).

Hoh, S. W., Burnley, T. & Cowtan, K. Current approaches for automated model building into cryo-EM maps using Buccaneer with CCP-EM. Acta Crystallogr. D Struct. Biol. 76, 531–541 (2020).

Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. Features and development of Coot. Acta Crystallogr. D Biol. Crystallogr. 66, 486–501 (2010).

Afonine, P. V. et al. Towards automated crystallographic structure refinement with phenix.refine. Acta Crystallogr. D Biol. Crystallogr. 68, 352–367 (2012).

Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Preprint at arXiv https://doi.org/10.48550/arXiv.1910.10683 (2019).

Studier, F. W. Protein production by auto-induction in high density shaking cultures. Protein Expr. Purif. 41, 207–234 (2005).

Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
