Margaret Dayhoff (en)

This is a blog post by Amanda Clare, for International Women’s Day 2021. You can find the Welsh version here.

Margaret Dayhoff was a bioinformatician who, in the very early days of bioinformatics, collected biological sequence data and made it available to other scientists, first in books, and then online.

She wrote down amino acid sequences using a single-letter code that she had created, so S represented Serine, T represented Threonine, P represented Proline and so on for the 20 possible amino acids (I still always forget which of Leucine and Lysine are L and K). In this way, a protein could be written down simply as a string of characters, and compared with other proteins.
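As a small illustration of why this matters, here is a minimal Python sketch of writing a protein as a string and comparing it with another. The five letters in the table are the standard one-letter codes mentioned above; the two peptides and the function names `encode` and `count_differences` are made up for this example.

```python
# Part of the one-letter amino acid code (only the five residues named above).
ONE_LETTER = {
    "Serine": "S",
    "Threonine": "T",
    "Proline": "P",
    "Leucine": "L",
    "Lysine": "K",
}


def encode(residues):
    """Turn a list of amino acid names into a one-letter string."""
    return "".join(ONE_LETTER[name] for name in residues)


def count_differences(seq_a, seq_b):
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Two invented peptides, for illustration only.
peptide_a = encode(["Serine", "Threonine", "Proline", "Leucine"])
peptide_b = encode(["Serine", "Serine", "Proline", "Lysine"])

print(peptide_a, peptide_b, count_differences(peptide_a, peptide_b))
```

Once the proteins are plain strings, comparing them is just a matter of walking along both at once.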

The Atlas of Protein Sequence and Structure (1965) was a book in which she compiled and published 65 protein sequences, all of the sequences known at that time, along with chapters describing them. The sequences themselves were at the back of the book in a Data section. Ensuring the accuracy of the sequences in this book was an exacting task. Many further editions of the book followed, with the amount of sequence information doubling each year, and eventually it became the Protein Information Resource, a forerunner to the large database UniProt, which is in widespread use today.

With this collection of sequences, Dayhoff could compare them, align them and use them to estimate how likely amino acids are to change over time. She could create a matrix of possibilities: how often do we see an S changed for a T, or an L changed for a K? This lets us know more about how sequences evolve over time, and what might be expected. The Dayhoff matrices bear her name. She used these to make the first phylogenies (evolutionary trees) that were inferred by computer.
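The counting idea behind such a matrix can be sketched in a few lines of Python. This is only the first step, tallying which letters sit opposite each other in aligned sequences. It is not Dayhoff's full method, which worked from closely related proteins and turned the counts into scores. The aligned pairs and the function name `count_substitutions` are invented for this example.

```python
from collections import Counter


def count_substitutions(aligned_pairs):
    """Count how often each unordered pair of residues is seen aligned."""
    counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        if len(seq_a) != len(seq_b):
            raise ValueError("aligned sequences must have equal length")
        for a, b in zip(seq_a, seq_b):
            if a == "-" or b == "-":
                continue  # skip gap positions in the alignment
            # Sort the pair so that S/T and T/S are counted together.
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Invented aligned pairs, for illustration only.
pairs = [("STPLK", "TTPLK"), ("SSPKL", "STPLL")]
counts = count_substitutions(pairs)

print(counts[("S", "T")])  # how often S is seen aligned with T
print(counts[("K", "L")])  # how often L is seen aligned with K
```

With real data, pairs that swap often (such as S and T) end up with high counts, and pairs that rarely swap end up with low ones.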

Later she created one of the first databases of DNA, the Nucleic Acid Sequence Database, and in 1980 made this available to users around the world “on line”, via telephone dial-up access to her computer. This contained 500 nucleotide sequences, the longest of which had 16,569 characters. She also provided various useful commands that users could execute online, such as searching for substrings, getting codon usage tables and searching for potential gene-encoding regions, along with searches of the metadata for the sequences, such as the organism names or the author names. Over two hundred user groups requested access in the first year that her nucleic acid sequence demonstration database was available to query online. The field of bioinformatics now relies heavily on such databases.
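To give a flavour of two of those commands, here is a rough Python sketch of a substring search and a codon usage table. It is a modern illustration only, not a reconstruction of her software. The DNA string and the function names `find_all` and `codon_usage` are invented for this example.

```python
from collections import Counter


def find_all(sequence, pattern):
    """Return the 0-based start of every match, including overlapping ones."""
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start)
        start = sequence.find(pattern, start + 1)
    return positions


def codon_usage(coding_sequence):
    """Count codons read in frame from the first base.

    Any leftover one or two bases at the end are ignored.
    """
    usable = len(coding_sequence) - len(coding_sequence) % 3
    return Counter(coding_sequence[i:i + 3] for i in range(0, usable, 3))


# An invented coding sequence, for illustration only.
dna = "ATGGCTGCTAAATAA"

print(find_all(dna, "GCT"))
print(codon_usage(dna))
```

A search for potential gene-encoding regions is a bigger job, but it builds on the same two ideas: scanning a string for patterns and reading it three letters at a time.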

Margaret Dayhoff was a pioneer, both as a bioinformatician and as someone who wanted to encourage data sharing for better science.
