Bioinformatics

Bioinformatics is a field of science that brings together many different areas of learning. It helps scientists understand large amounts of information about living things, especially when there is too much data to handle by hand. This field uses ideas from biology, chemistry, physics, computer science, and math to study and explain biological data.

Early bioinformatics: computational alignment of experimentally determined sequences of a class of related proteins; see the Sequence analysis section for further information.

People use bioinformatics to look at the genes inside living things and find out how they work. This can help us learn about diseases, why some plants or animals are special, or what makes different groups of people unique. Bioinformatics also studies proteins, which are tiny parts of our bodies that do important jobs.

It helps scientists read and organize big sets of information, like the instructions inside our cells. By doing this, bioinformatics helps us understand how living things grow, change, and stay healthy. It even helps us see how tiny parts of our bodies work together in big, complicated ways.

History

The term bioinformatics was first used in 1970 by Paulien Hogeweg and Ben Hesper. It describes the study of information in living systems, similar to how biochemistry studies chemical processes in living things.

The field grew quickly in the mid-1990s, helped by the Human Genome Project and new technology for reading DNA. To understand biological data like DNA and protein sequences, scientists use special computer programs. These programs rely on ideas from many areas of science and math.

Sequences of genetic material are frequently used in bioinformatics and are easier to manage using computers than manually.

Since the Human Genome Project, sequencing has become much faster and cheaper. Some labs can now read over 100,000 billion DNA bases (the letters of the DNA code) each year, and a full genome can be read for $1,000 or less.

Computers became very important when scientists started sharing protein sequences in the 1950s. Comparing these sequences by hand was too hard, so scientists created databases and methods to compare them easily. Early leaders in this field helped build the first databases and ways to study these sequences.

Goals

Bioinformatics helps us understand how cells work, especially when something goes wrong and causes illness. To do this, scientists need to look at lots of information about living things. Bioinformatics makes it easier to study this information by using computers and math.

One big goal of bioinformatics is to better understand how living things function. It does this by creating programs and tools to handle big sets of data. For example, scientists use these tools to find genes in DNA, predict how proteins will look, and study how different pieces of DNA are related. Other tasks include designing medicines, learning how genes work together, and modeling how cells grow and change.

Sequence analysis

Main articles: Sequence alignment, Sequence database, and Alignment-free sequence analysis

Since 1977, when the DNA of a small virus was first decoded, scientists have uncovered the DNA codes of thousands of living things and stored them in databases. These codes help us learn about genes that make proteins, RNA, and other important parts of life. By comparing genes from the same or different species, we can discover similarities in how proteins work or how species are related.

Because there is so much data, it is not possible to look at all these DNA codes by hand. Instead, we use computer programs like BLAST to search through them. BLAST can search sequences from over 260,000 different organisms, containing more than 190 billion building blocks of DNA (nucleotides).
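To give a feel for how two sequences can be compared by a computer, here is a minimal sketch of local alignment scoring (the Smith-Waterman method). This is not how BLAST itself works inside; BLAST uses faster shortcuts to search huge databases. The function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    # Only the previous row of the scoring table is needed at any time.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for lining up a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A score never drops below 0, so a new local match can start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two made-up sequences that share the stretch "ACGT".
print(local_alignment_score("TTACGTTT", "GGACGTGG"))  # 8
```

A higher score means the two sequences share a longer or cleaner matching region.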

DNA sequencing

Main article: DNA sequencing

Before a DNA sequence can be studied, it is usually fetched from a database like GenBank. Reading DNA is still tricky because the raw data from sequencing machines can be noisy or unclear. Special computer methods, known as base calling, have been created to turn this raw data into DNA letters.

Sequence assembly

Main article: Sequence assembly

Most methods for reading DNA give us short pieces of the code that need to be put together to make complete sets of genes or genomes. One common way to do this is called shotgun sequencing, where many tiny pieces of DNA are read. These pieces overlap each other, and computer programs help line them up to rebuild the full genome. This method is quick but can be hard for very large genomes, like human DNA, which might take many days of computer time. Even then, there are often gaps that need more work to fill in.
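The idea of lining up overlapping pieces can be sketched with a simple "greedy" approach: keep joining the two pieces with the longest overlap. This is only a toy illustration. Real assemblers use more advanced methods and must cope with reading errors and repeated regions. The function names, the minimum overlap of 3 letters, and the example reads are all assumptions.

```python
def overlap(a, b, min_len=3):
    """Length of the longest end of a that matches the start of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more, so gaps remain between pieces
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three made-up reads that overlap by four letters each.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))  # ['ATGGCGTACCTGA']
```

When no reads overlap, the function returns several separate pieces, which mirrors the gaps mentioned above.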

Genome annotation

Main article: Gene prediction

In genomics, annotation means marking the start and end points of genes and other features in a DNA sequence. Many genomes are too big to annotate by hand, and as we learn more DNA codes faster than we can study them, this has become a big challenge in bioinformatics.

Genome annotation can be done at three levels: looking at the DNA building blocks, the proteins they make, and the processes they are part of. Finding genes is a big part of the DNA level. For complicated genomes, scientists use both computer predictions and comparisons with other organisms to find genes. At the protein level, the goal is to figure out what each protein does. Databases of protein codes and features help with this, but many proteins in new genomes still have unknown functions.

Understanding how genes and proteins work together in cells and organisms is the aim of the process level. One challenge here is that different systems use different names for things, but groups like the Gene Ontology Consortium are working to make things clearer.

The first full system for annotating a genome was described in 1995 for the bacterium Haemophilus influenzae. This system finds genes for proteins, transfer RNAs, and ribosomal RNAs to start understanding what they do. After the Human Genome Project ended in 2003, the ENCODE project began, using new technologies to find more details about the human genome.
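Gene finding at the DNA level can be illustrated with a very small sketch that looks for open reading frames: stretches that begin with the start codon ATG and end at a stop codon. Real gene finders are much more complex, especially for complicated genomes. The function name and the minimum length are assumptions, and this sketch only reads one strand of the DNA.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) positions of open reading frames on the forward strand.

    min_codons counts every codon in the frame, including the start and stop codons.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):  # a gene can begin in any of three reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # mark where a possible gene begins
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # mark where it ends
                start = None
    return orfs


# A made-up sequence with one short open reading frame.
sequence = "CCATGAAATTTTAGGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])  # 2 14 ATGAAATTTTAG
```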

Gene function prediction

Image: Sequencing analysis steps

While genome annotation often looks at how genes are similar, other features of genes can also help predict their functions. For example, looking at certain building blocks in proteins can show where they might be placed in cells. Scientists also use information about when genes are active, how proteins are shaped, and how they connect to each other to understand gene functions better.

Computational evolutionary biology

Further information: Computational phylogenetics

Evolutionary biology studies how species change over time. Computers help scientists in this field by allowing them to:

follow the changes in many organisms by looking at their DNA, instead of just physical traits,
compare whole genomes to study complex events like gene duplication and transfer between species,
create models to predict how populations will change over time,
keep track of information about many species.
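Following changes by looking at DNA, as in the first item above, often starts with measuring how different two sequences are. Here is a minimal sketch that computes the fraction of positions that differ between sequences of equal length (sometimes called the p-distance). The function names and the three short sequences are made up for illustration, and the sketch assumes the sequences are already lined up.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions where two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


def distance_table(sequences):
    """Compute the distance for every pair of named sequences."""
    return {
        (name1, name2): p_distance(sequences[name1], sequences[name2])
        for name1, name2 in combinations(sorted(sequences), 2)
    }


# Made-up sequences from three imaginary organisms.
toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACTTGCGA"}
for pair, distance in distance_table(toy).items():
    print(pair, distance)
# ('A', 'B') 0.125
# ('A', 'C') 0.375
# ('B', 'C') 0.25
```

Smaller distances suggest closer relatives. Here A and B differ the least.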

Comparative genomics

Main article: Comparative genomics

Comparative genomics looks at the differences and similarities between genomes of different organisms. Scientists make maps to see how genomes have changed over time. Many events, like small changes in DNA, large sections moving around, or whole genomes combining, shape how genomes evolve. Studying these changes helps scientists understand how life has developed and changed.

Pan genomics

Main article: Pan-genome

Pan genomics is a way to look at all the genes in a group of related organisms. It includes a core set of genes found in every organism and a flexible set that varies between them. Tools like BPGA can help study these gene sets in bacteria.
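The split between core and flexible genes can be shown with simple set operations. This is only a sketch with made-up strain and gene names. It is not how BPGA works inside, and it needs at least one genome as input.

```python
def split_pan_genome(genomes):
    """Split genes into the core set (in every genome) and the flexible set."""
    gene_sets = [set(genes) for genes in genomes.values()]
    pan = set.union(*gene_sets)          # every gene seen in any genome
    core = set.intersection(*gene_sets)  # genes shared by all genomes
    return {"pan": pan, "core": core, "accessory": pan - core}


# Hypothetical strains and gene names.
strains = {
    "strain_1": ["geneA", "geneB", "geneC"],
    "strain_2": ["geneA", "geneB", "geneD"],
    "strain_3": ["geneA", "geneB"],
}
result = split_pan_genome(strains)
print(sorted(result["core"]))       # ['geneA', 'geneB']
print(sorted(result["accessory"]))  # ['geneC', 'geneD']
```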

Genetics of disease

Main article: Genome-wide association studies

With new technologies, scientists can now find the causes of many human disorders. Some diseases follow simple patterns passed down in families, while others are more complex. Studies have found many small pieces of DNA linked to diseases like breast cancer and Alzheimer’s, but these only explain part of the risk. Rare changes in DNA might explain more, and large studies are looking at whole genomes to find these rare pieces. Tools help scientists analyze this data and understand which rare changes are important.
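One simple way to check whether a DNA change is linked to a disease is to count how often it appears in people with the disease (cases) and without it (controls). The sketch below computes an odds ratio and a chi-square value from such counts. The function names and the counts are made up, and real studies use more careful statistics across very many DNA positions.

```python
def odds_ratio(case_alt, case_ref, control_alt, control_ref):
    """How much more common the changed version is in cases than in controls."""
    return (case_alt * control_ref) / (case_ref * control_alt)


def chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square statistic for a 2x2 table of counts (no correction applied)."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Made-up counts: the change is seen 30 times in cases and 10 times in controls.
print(round(odds_ratio(30, 70, 10, 90), 2))  # 3.86
print(chi_square(30, 70, 10, 90))            # 12.5
```

An odds ratio near 1 means no link. A large chi-square value suggests the difference is unlikely to be due to chance alone. Both functions assume none of the counts is zero.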

Analysis of mutations in cancer

Main article: Oncogenomics

In cancer, the DNA of affected cells changes in many complicated ways. Scientists use special microarrays to find gains and losses of DNA segments (copy number changes) and to look for small changes in the DNA code that cause cancer. These methods create huge amounts of data, which can be messy, so scientists use computer models to find the real copy number changes.

Two key ideas help identify cancer through DNA changes. First, cancer happens because of changes that build up in genes. Second, some of these changes drive cancer growth, while others are just along for the ride.

Better bioinformatics tools could help classify cancer types by looking at these DNA changes. In the future, it might be possible to follow patients by looking at how their cancer changes over time. Scientists are also studying common damage found in many tumors to learn more about cancer.

Gene and protein expression

Analysis of gene expression

We can learn which genes are active by measuring messenger molecules called mRNA. Tools for this include microarrays, expressed sequence tag (EST) sequencing, serial analysis of gene expression (SAGE), massively parallel signature sequencing (MPSS), and RNA-Seq. These tools help scientists study many genes at once, but they can sometimes give unclear results. Researchers use special computer programs to make sense of the data and find which genes are involved in different conditions, for example by comparing cells from someone who is sick to cells from a healthy person.
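A common first step when comparing sick and healthy cells is the log2 fold change, which shows how many times a gene's activity has doubled or halved. The sketch below is a simplified illustration. The function names, the pseudocount of 1, the threshold, and the numbers are all assumptions, and real analyses also test whether a change is statistically reliable.

```python
import math


def log2_fold_change(sick, healthy, pseudocount=1.0):
    """Compare average activity in sick and healthy samples on a log2 scale."""
    mean_sick = sum(sick) / len(sick)
    mean_healthy = sum(healthy) / len(healthy)
    # The pseudocount avoids dividing by zero when a gene is switched off.
    return math.log2((mean_sick + pseudocount) / (mean_healthy + pseudocount))


def changed_genes(expression, threshold=1.0):
    """Keep genes whose activity at least doubled or halved (for threshold=1)."""
    changes = {gene: log2_fold_change(s, h) for gene, (s, h) in expression.items()}
    return {gene: c for gene, c in changes.items() if abs(c) >= threshold}


# Made-up measurements: gene name -> (sick samples, healthy samples).
toy_expression = {
    "gene_up": ([7, 7], [1, 1]),
    "gene_flat": ([3, 3], [3, 3]),
    "gene_down": ([0, 0], [3, 3]),
}
print(changed_genes(toy_expression))  # {'gene_up': 2.0, 'gene_down': -2.0}
```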

Microarray vs RNA-Seq

Analysis of protein expression

Tools like protein microarrays and mass spectrometry let scientists see which proteins are present in a sample. These methods have challenges, such as matching the data with known protein information and analyzing complex results. Scientists can also learn where proteins are located in tissues using special staining techniques and tissue microarrays.

Analysis of regulation

Gene regulation is how the body controls which genes are active. Signals like hormones can turn genes on or off. Bioinformatics helps study how genes are controlled, for example by looking at parts of DNA near genes called sequence motifs that affect how much mRNA is made. Scientists also study how distant parts of DNA called enhancer elements influence genes through special folding of the DNA. By comparing data from different conditions, scientists can find groups of genes that act together, using methods like clustering algorithms.
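Searching for a sequence motif can be sketched as a pattern search along the DNA. In the example below, single letters can stand for more than one base, following the standard IUPAC codes (for example, W means A or T). The function name, the example motif, and the DNA string are made up for illustration. Real motif finders usually use scoring tables rather than exact patterns.

```python
import re

# A subset of the IUPAC codes, where one letter can stand for several bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def find_motif(dna, motif):
    """Return every position where the motif occurs, including overlapping hits."""
    pattern = "".join(IUPAC[letter] for letter in motif.upper())
    # The lookahead (?=...) lets matches overlap each other.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", dna.upper())]


# A made-up stretch of DNA searched for the example motif TATAWA.
print(find_motif("GGTATAAAGGTATATAC", "TATAWA"))  # [2, 10]
```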

Analysis of cellular organization

Scientists have created ways to study where important parts like organelles, genes, and proteins are located inside cells. They use a Gene Ontology category called "cellular component" to organize this information in databases.

Microscopes help us see where organelles and molecules are, which can show us clues about diseases.

Knowing where proteins are helps us guess what they do. For example, proteins in the nucleus might help control genes, while proteins in mitochondria might help with energy production. There are tools and databases to help figure out where proteins are located.

Main article: Nuclear organization

Experiments like Hi-C and ChIA-PET give us details about how DNA is arranged inside the nucleus. This helps us understand how parts of the genome are grouped together in three-dimensional space.

Structural bioinformatics

Main articles: Structural bioinformatics and Protein structure prediction

See also: Structural motif and Structural domain

3-dimensional protein structures such as this one are common subjects in bioinformatic analyses.

Finding the shape of proteins is a key part of bioinformatics. There is an open contest called the Critical Assessment of Protein Structure Prediction (CASP) where research teams from around the world try to guess the shapes of unknown proteins.

The order of building blocks called amino acids in a protein is known as its primary structure. This order can be found from the code in DNA. In most proteins, this order decides the protein's 3D shape, which is important for its job. One example is hemoglobin, which carries oxygen in both humans and plants called legumes. Even though these hemoglobins have different amino acid orders, their shapes are very similar because they share the same job and ancestor.
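Finding the amino acid order from the DNA code can be sketched as reading the DNA three letters at a time and looking up each triplet (codon) in the standard genetic code. The function name and the example DNA are made up. The sketch assumes the sequence starts at the right place and contains only the letters A, C, G, and T.

```python
from itertools import product

# The standard genetic code, listed in the order TTT, TTC, TTA, TTG, TCT, ...
# Each letter is one amino acid, and "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Turn a DNA coding sequence into a string of one-letter amino acid codes."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break  # a stop codon ends the protein
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAATAA"))  # MAK
```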

Ways to predict protein shapes include homology modeling, which uses known shapes from related proteins, and physics-based modeling from scratch. In 2021, a powerful tool called AlphaFold, made by Google's DeepMind, was shown to predict protein shapes much better than older methods. It has predicted shapes for hundreds of millions of proteins.

Network and systems biology

Main articles: Computational systems biology, Biological network, and Interactome

Network analysis helps us understand how different parts in living things are connected, like how proteins work together or how genes interact. These connections can be between many types of molecules, such as proteins and small chemicals, all working together in ways we can study.
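A network like this can be stored as a list of which parts are connected to which. The sketch below builds such a network from pairs of interacting proteins and finds the most connected ones, often called hubs. The protein names and function names are made up for illustration.

```python
from collections import defaultdict


def build_network(interactions):
    """Store each protein together with the set of proteins it interacts with."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return network


def hubs(network, top=1):
    """Return the proteins with the most partners (ties are broken by name)."""
    return sorted(network, key=lambda p: (-len(network[p]), p))[:top]


# Hypothetical interactions between four proteins.
pairs = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
network = build_network(pairs)
print(len(network["P1"]))    # 3
print(hubs(network, top=2))  # ['P1', 'P2']
```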

Systems biology uses computer simulations to study small parts of cells, like how chemicals move and change, or how genes turn on and off. This helps scientists see how everything in a cell is connected. Some scientists even use computers to create simple artificial life to better understand how real life evolves.
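A very small example of such a simulation follows one gene's mRNA level over time. New mRNA is made at a steady rate, and existing mRNA breaks down in proportion to how much there is. The sketch uses small time steps (Euler's method). The function name and the rate values are assumptions chosen for illustration.

```python
def simulate_mrna(production, decay, steps, dt=0.01, start=0.0):
    """Follow an mRNA level over time using small time steps."""
    level = start
    history = [level]
    for _ in range(steps):
        # Change per step = (amount made - amount broken down) * step size.
        level += (production - decay * level) * dt
        history.append(level)
    return history


# With these made-up rates, the level settles near production / decay = 2.0.
history = simulate_mrna(production=2.0, decay=1.0, steps=10000)
print(round(history[-1], 3))  # 2.0
```

Setting production to 0 partway through would model the gene being turned off, after which the level slowly falls.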

Interactions between proteins are frequently visualized and analyzed using networks. This network is made up of protein–protein interactions from Treponema pallidum, the causative agent of syphilis and other diseases.

Molecular interaction networks

Main articles: Protein–protein interaction prediction and interactome

Scientists have discovered the shapes of many proteins using special tools. One big question is whether we can predict how proteins will interact just by looking at their shapes, without doing experiments. There are many methods to try to solve this, but there is still more work to do.

Other important interactions include how proteins bind to small molecules or other tiny pieces. By simulating how atoms move, scientists use special computer programs to study these interactions.

Biodiversity informatics

Main article: Biodiversity informatics

Biodiversity informatics helps us collect and study information about all the different kinds of plants, animals, and tiny living things, like those in microbiome data. Scientists use this information to understand how species are related, predict where they might live, and even identify them using parts of their DNA. This field also looks at how living things are affected by big changes in the world, like climate change.

Others

Literature analysis

Main articles: Text mining and Biomedical text mining

There is so much scientific writing that it's hard for one person to read it all. Literature analysis uses computer methods to help find important information in these writings. This can include recognizing short and long forms of biological terms, finding names of genes, and figuring out which proteins work together.

High-throughput image analysis

Computers help process and study lots of medical pictures. These tools can make analysis more accurate, fair, and fast. They are used for diagnosing diseases and for research. Examples include measuring tiny parts inside cells, studying shapes and sizes, watching how air moves in animals' lungs, and tracking how animals behave over time.

High-throughput single cell data analysis

Main article: Flow cytometry bioinformatics

Computers are used to study data from single cells, like information from flow cytometry. These methods help find groups of cells that are important for understanding diseases or experiments.

Ontologies and data integration

Biological ontologies are organized lists of agreed terms that help describe biological ideas so computers can study them easily. The OBO Foundry worked to make some of these lists standard. One well-known list is the Gene Ontology, which describes what genes do. There are also lists for describing traits of living things.
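In an ontology, each term can be linked to broader "parent" terms, so a computer can work out everything a term belongs to. The sketch below uses a tiny made-up hierarchy. It is not the real content of the Gene Ontology, and the function name is an assumption.

```python
def ancestors(term, parents):
    """Collect every broader term reachable from the given term."""
    found = set()
    to_visit = [term]
    while to_visit:
        current = to_visit.pop()
        for parent in parents.get(current, ()):
            if parent not in found:
                found.add(parent)
                to_visit.append(parent)
    return found


# A toy hierarchy: each term points to its broader parent terms.
toy_parents = {
    "mitochondrion": ["organelle"],
    "nucleus": ["organelle"],
    "organelle": ["cellular component"],
}
print(sorted(ancestors("mitochondrion", toy_parents)))
# ['cellular component', 'organelle']
```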

Databases

Main articles: List of biological databases and Biological database

Databases are very important for studying living things using computers. They store information about DNA, proteins, and many other parts of life. Some databases have real data from experiments, while others use data from other sources to make new guesses.

Some well-known databases include:

For studying DNA and proteins: GenBank, UniProt
For looking at protein shapes: Protein Data Bank (PDB)
For finding groups of proteins and special patterns: InterPro, Pfam
For storing raw data from newer, faster ways of reading DNA (next-generation sequencing): Sequence Read Archive
For studying how parts of cells work together: Metabolic Pathway Databases (KEGG, BioCyc), Interaction Analysis Databases, Functional Networks
For designing new tiny machines inside cells: GenoCAD

Software and tools

Software tools for bioinformatics help scientists study living things by providing different ways to work with data, like simple commands, fancy programs, or online services. These tools are created by special companies or public groups.

Many tools are free and open for anyone to use, and they have been around since the 1980s. These tools help scientists find new ways to study biology and share their work easily. Some well-known tools include Bioconductor, BioPerl, Biopython, BioJava, and BioJS.

Scientists can also use online services to run experiments and share data across the world. These services make it easier for everyone to access important tools without having to manage complicated software themselves. Some platforms that help scientists organize their work include Galaxy and UGENE.

Education platforms

Bioinformatics can be studied through online courses and special programs, not just in classrooms at universities. Many websites and tools help people learn bioinformatics, such as Rosalind and courses from the Swiss Institute of Bioinformatics Training Portal. The Canadian Bioinformatics Workshops share videos and slides from their workshops for free.

There are also big online classes called MOOCs that give certificates in bioinformatics. Examples include Coursera’s Bioinformatics Specialization from the University of California, San Diego, the Genomic Data Science Specialization from Johns Hopkins University, and edX’s Data Analysis for Life Sciences XSeries from Harvard University. Some projects, like 4273π, use simple computers such as the Raspberry Pi to teach these topics to students and adults alike.

Conferences

There are many big meetings where people talk about bioinformatics. Some important ones are the European Conference on Computational Biology (ECCB), Intelligent Systems for Molecular Biology (ISMB), Pacific Symposium on Biocomputing (PSB), and Research in Computational Molecular Biology (RECOMB). These conferences help scientists share their ideas and discoveries in this field.