The Three Lakes Biodiversity Research Project


Bioinformatics

Published: July 2025

Turning DNA sequences into ecological understanding

Bioinformatics is the process of turning large volumes of DNA sequencing data into meaningful biological information. In palaeoecology, it is essential for interpreting sedimentary ancient DNA (sedaDNA) and reconstructing past environments.

The trip to Tromsø was both interesting and overwhelming. I was plunged into the depths of sedimentary ancient DNA analysis and processing, everything in and around them, and all the jargon that goes with it. I learned a lot, though unfortunately only a fraction of the knowledge that surrounded me. Many of the papers presented were clearly breaking new ground, as were some of the procedures used to process both the sediments and the resulting data. Some of the most highly regarded studies were referred to time and again.

In 2022 an article was published in Nature which has been highly influential in its groundbreaking use of sedimentary ancient DNA (sedaDNA). The title of the article was "A 2-million-year-old ecosystem in Greenland uncovered by environmental DNA" and amongst the authors was Mikkel Winther Pedersen from the University of Copenhagen. Mikkel introduced, and also presented at, the bioinformatics workshop at the sedaDNA Scientific Society conference in Tromsø in June 2025.

This is an artist’s impression of the Kap København Formation two million years ago, at a time when the temperature was significantly warmer than in northernmost Greenland today. Artist: Beth Zaiken. Click on the image below to see the University of Cambridge webpage that details the fieldwork in a readable form.


Now there are lots of things about sedaDNA that are confusing, principally whole concepts and processes that are encompassed by single words and phrases.

Bioinformatics is one of them.

Bioinformatics is the way that data is handled and manipulated before, during, and after the extraction of sedimentary ancient DNA from a sample of sediment.

So I decided to translate some parts of the paper in question because it shows just how much work is done at the computing and information technology end of DNA analysis.

Abstract of the Paper

First though, here is the abstract, which details what the paper is all about.

Late Pliocene and Early Pleistocene epochs 3.6 to 0.8 million years ago had climates resembling those forecasted under future warming. Palaeoclimatic records show strong polar amplification with mean annual temperatures of 11–19 °C above contemporary values. The biological communities inhabiting the Arctic during this time remain poorly known because fossils are rare. Here we report an ancient environmental DNA (eDNA) record describing the rich plant and animal assemblages of the Kap København Formation in North Greenland, dated to around two million years ago. The record shows an open boreal forest ecosystem with mixed vegetation of poplar, birch and thuja trees, as well as a variety of Arctic and boreal shrubs and herbs, many of which had not previously been detected at the site from macrofossil and pollen records. The DNA record confirms the presence of hare and mitochondrial DNA from animals including mastodons, reindeer, rodents and geese, all ancestral to their present-day and late Pleistocene relatives. The presence of marine species including horseshoe crab and green algae support a warmer climate than today. The reconstructed ecosystem has no modern analogue. The survival of such ancient eDNA probably relates to its binding to mineral surfaces. Our findings open new areas of genetic research, demonstrating that it is possible to track the ecology and evolution of biological communities from two million years ago using ancient eDNA.

So let’s skip straight to the DNA extraction and processing. If you want to read the whole paper — and it is worth reading — it is here: A 2-million-year-old ecosystem in Greenland uncovered by environmental DNA.

Some Background Points

A couple of things to bear in mind. The sedaDNA they were hoping to find was going to be very old, around 2 million years, and hopefully preserved in sediment. It was therefore likely to be severely fragmented: DNA normally exists as long-chain molecules, but over time it degrades and breaks up into short lengths.

Also, the DNA that is present in the environment can come from various sources, depending on the organism. The three main types of DNA are:

  • Nuclear DNA – from the nucleus of a cell
  • Mitochondrial DNA – from the mitochondria of a cell
  • Plastid DNA – from structures such as chloroplasts

Not all organisms have all these, and the different types of DNA have very different properties. So there is not just one kind of DNA per organism.

Another thing to know is that the two ends of a DNA strand are different: one is called the 5′ end, the other the 3′ end. Known as 5 prime and 3 prime, these names refer to specific carbon atoms in the sugar molecule (deoxyribose) that forms part of the DNA backbone. The 5′ carbon is where the phosphate group is attached, and the 3′ carbon has a hydroxyl group (-OH), which is important for adding new nucleotides. This difference is one way that strands can be marked for tracking. It is possible to process several different samples together in one batch, which saves time and money. To do this, each sample is given a unique identifying sequence on the 3′ end.

It is appropriate to mention here that primers, although built from the four bases, can also include mixed base positions; such primers are called degenerate primers. The mixed positions are written using IUPAC nucleotide codes, so within these primers, as well as A, C, T and G, you might see R, Y, S, W, K, M, B, D, H, V and N. These act as wildcards. So for example:

  • N means allow any one of A, C, T or G
  • B means not A, so any of C, T or G

And so on.
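To make the wildcard idea concrete, here is a minimal sketch in Python that expands a degenerate primer into a pattern and checks sequences against it. The lookup table holds the standard IUPAC codes; the primer "ACNGBT" and the function name are made up purely for illustration and do not come from the paper.

```python
import re

# Standard IUPAC nucleotide codes mapped to the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def primer_to_regex(primer: str) -> re.Pattern:
    """Turn a degenerate primer into a regular expression."""
    parts = []
    for code in primer.upper():
        bases = IUPAC[code]
        # A plain base stays as it is; a wildcard becomes a character class.
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return re.compile("".join(parts))


# Hypothetical primer, invented for this example only.
pattern = primer_to_regex("ACNGBT")
print(pattern.pattern)                    # AC[ACGT]G[CGT]T
print(bool(pattern.fullmatch("ACTGCT")))  # True: N allows T, B allows C
print(bool(pattern.fullmatch("ACTGAT")))  # False: B does not allow A
```

Real primer design also has to account for things like melting temperature and the reverse strand, which this sketch ignores.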

In the fifteen points below, I have simplified the main thrust of the bioinformatics in this research and translated it into plain English.

1. Ancient DNA from Sediment Is Extremely Old and Very Damaged

The DNA the researchers hoped to find was 2 million years old. Over time DNA breaks into tiny fragments, so instead of long strands, ancient sediment DNA (sedaDNA) is usually just short, degraded pieces from many organisms.

2. DNA Can Come from Different Parts of a Cell

Organisms can contribute different types of DNA to sediments:

  • Nuclear DNA (main genome, in the nucleus)
  • Mitochondrial DNA (from mitochondria; abundant in animals)
  • Plastid DNA (from chloroplasts; abundant in plants)

Each kind behaves differently and preserves differently, so scientists must consider all of them.

3. Samples Need Unique ID Tags for Tracking

To analyse many samples at once, each sample is given a unique DNA barcode: a short artificial DNA sequence added to one end of the fragments so the sequencer knows which sample each read came from. This is called multiplexing, and it ensures efficient and economical use of the analysis process.
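The sorting step at the end of multiplexing can be sketched in a few lines of Python. This is a toy illustration only: the barcodes, their length, and the sample names are invented, and it assumes the barcode simply sits at the 3′ end of each read, which is a simplification of how real sequencers and demultiplexing software handle tags.

```python
from collections import defaultdict

# Hypothetical barcodes and sample names, invented for illustration.
BARCODES = {"ACGTAC": "sample_1", "TGCATG": "sample_2"}
BARCODE_LEN = 6


def demultiplex(reads):
    """Group reads by the barcode at their 3' end, stripping the barcode off."""
    by_sample = defaultdict(list)
    for read in reads:
        insert, tag = read[:-BARCODE_LEN], read[-BARCODE_LEN:]
        sample = BARCODES.get(tag)
        if sample is None:
            # Unknown tag: keep the whole read aside rather than guess.
            by_sample["unassigned"].append(read)
        else:
            by_sample[sample].append(insert)
    return dict(by_sample)


reads = ["GGATTCCAACGTAC", "TTAGGCATTGCATG", "CCCCCCCCAAAAAA"]
print(demultiplex(reads))
```

The first two reads end in known barcodes and are filed under their samples with the tag removed; the third ends in a tag nobody assigned, so it lands in "unassigned".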

4. Primers Can Include “Wildcard” Bases to Detect More Species

Primers are short DNA sequences used to match and amplify targets.

They sometimes include degenerate bases (like N, R, Y) which allow variation — helpful for catching many species that might differ slightly in sequence.

5. The Team Collected 41 Sediment Samples and Created 65 DNA Libraries

A DNA library is a prepared batch of DNA fragments ready for sequencing, each with barcodes. Libraries are needed because ancient samples usually contain extremely little usable DNA.

6. They Checked Whether Plant DNA Was Present Using Droplet Digital PCR

They targeted a chloroplast gene called psbD, which nearly all plants have.

Droplet digital PCR splits the sample into tens of thousands of droplets and runs PCR in each tiny droplet. A droplet glows if the target sequence is present, allowing very sensitive detection even for fragments only about 39 bp long.
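Counting glowing droplets gives a yes or no per droplet, so a little arithmetic is needed to turn that into an amount of DNA. The usual approach, as far as I understand it, is a Poisson correction, which allows for the chance that a glowing droplet held more than one copy. The sketch below assumes that approach; the droplet counts are made up and are not figures from the paper.

```python
import math


def copies_per_droplet(positive: int, total: int) -> float:
    """Estimate the mean number of target copies per droplet.

    Uses the Poisson relation: fraction of empty droplets = exp(-lambda),
    so lambda = -ln(1 - positive / total).
    """
    if total <= 0 or not 0 <= positive < total:
        raise ValueError("need total > 0 and 0 <= positive < total")
    return -math.log(1 - positive / total)


# Made-up droplet counts, for illustration only.
lam = copies_per_droplet(positive=500, total=20000)
print(f"naive fraction positive: {500 / 20000:.5f}")
print(f"Poisson-corrected copies per droplet: {lam:.5f}")
```

When very few droplets glow, the corrected value is almost the same as the simple fraction; the correction matters more as the sample gets richer in target DNA.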

7. They Also Looked for Grass DNA Using Another Chloroplast Gene, psbA

By designing primers specific to Poaceae (the grass family), they could detect grass-related DNA even in tiny, degraded pieces.

8. To Find Mammal DNA, They Used an “Arctic PaleoChip” Enrichment Method

This uses known Arctic mammal DNA fragments as “bait” to pull out matching pieces from each library. It works even if the match is not perfect — ideal for ancient, degraded remains.

9. All Samples Were Then Sequenced on High-Throughput Machines

Sequencing machines like HiSeq and NovaSeq read millions or billions of DNA fragments.

Out of 16.8 billion raw reads, about 2.87 billion high-quality reads remained after cleaning, with very short, low-quality, or duplicate sequences removed.
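The cleaning step is easy to picture as three filters applied to every read. Here is a minimal Python sketch; the cutoff values and the way quality is averaged are my assumptions for illustration, and real pipelines use dedicated tools that work on compressed files holding billions of reads.

```python
MIN_LENGTH = 30        # assumed length cutoff, in bases
MIN_MEAN_QUALITY = 30  # assumed quality cutoff, on the Phred scale


def clean_reads(reads):
    """Keep reads that are long enough, good enough, and not seen before.

    `reads` is an iterable of (sequence, qualities) pairs, where
    `qualities` is a list of per-base Phred scores.
    """
    seen = set()
    kept = []
    for seq, quals in reads:
        if len(seq) < MIN_LENGTH:
            continue  # too short to identify reliably
        if sum(quals) / len(quals) < MIN_MEAN_QUALITY:
            continue  # too many uncertain base calls
        if seq in seen:
            continue  # exact duplicate of a read already kept
        seen.add(seq)
        kept.append(seq)
    return kept


raw = [
    ("ACGT" * 10, [35] * 40),   # good read
    ("ACGT" * 10, [35] * 40),   # duplicate of the first
    ("ACGT" * 5, [35] * 20),    # too short
    ("TTGCA" * 8, [10] * 40),   # low quality
]
print(len(clean_reads(raw)))  # 1
```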

10. The Cleaned DNA Was Scanned for Short Patterns (“k-mers”)

A k-mer is a short DNA “word” of a fixed length k; 31 bases is a common choice.

Software such as Simka compares k-mers across samples to find shared patterns.

This does not assume any species beforehand, which makes it useful for discovering unexpected organisms.
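The idea of comparing samples by their shared k-mers can be shown with a toy example in Python. This is not how Simka itself is implemented; it is a sketch of the principle, using a simple shared-over-total score (the Jaccard index) and a tiny k of 4 so the output stays readable.

```python
def kmers(sequence: str, k: int = 31) -> set:
    """Return every overlapping k-letter word in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def sample_kmers(reads, k: int = 31) -> set:
    """Pool the k-mers from all reads in one sample."""
    found = set()
    for read in reads:
        found |= kmers(read, k)
    return found


def jaccard(a: set, b: set) -> float:
    """Shared k-mers divided by all k-mers seen in either sample."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Two made-up one-read "samples", with k=4 for readability.
sample_a = sample_kmers(["ACGTAC"], k=4)   # {ACGT, CGTA, GTAC}
sample_b = sample_kmers(["CGTACG"], k=4)   # {CGTA, GTAC, TACG}
print(jaccard(sample_a, sample_b))         # 0.5
```

Nothing here needs a reference database, which is the point: similar samples share many words, whatever organisms those words came from.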

11. Reads Were Identified Using HOLI and a Large Arctic Plant DNA Database

HOLI is a bioinformatics pipeline that tries to match each DNA read to the best-fitting species in a reference library of more than 1,500 Arctic and Boreal plants.

Because the ancient DNA might have mutated over 2 million years, they allowed a 95% similarity threshold — close enough to be meaningful, but not so strict that ancient changes would exclude matches.
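What a 95% threshold means in practice is simple arithmetic: a 40-base read may differ from the reference at two positions, but not three. The Python sketch below assumes the read has already been lined up against the reference with no gaps, which is a simplification; the function names are mine, not part of HOLI.

```python
def percent_identity(read: str, reference: str) -> float:
    """Percentage of positions where an aligned read matches the reference."""
    if not read or len(read) != len(reference):
        raise ValueError("expects two non-empty, equal-length aligned sequences")
    matches = sum(r == q for r, q in zip(read, reference))
    return 100.0 * matches / len(read)


def passes_threshold(read: str, reference: str, threshold: float = 95.0) -> bool:
    """True if the read is at least `threshold` percent identical."""
    return percent_identity(read, reference) >= threshold


reference = "ACGT" * 10                  # made-up 40-base reference
two_changes = "TT" + reference[2:]       # 38 of 40 match
three_changes = "TTC" + reference[3:]    # 37 of 40 match
print(percent_identity(two_changes, reference), passes_threshold(two_changes, reference))
print(percent_identity(three_changes, reference), passes_threshold(three_changes, reference))
```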

12. Taxonomic Assignment Used the “Lowest Common Ancestor” Method (ngsLCA)

Sometimes a short read matches several species.

ngsLCA assigns it to the most specific group all matches belong to, for example “deer family” rather than one exact species.

This avoids overclaiming and is safer for short, damaged DNA.
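The lowest common ancestor idea fits in a few lines of Python. In this sketch each match is written as a list of ranks from broad to narrow, and the function walks down until the matches disagree. It illustrates the principle only and is not ngsLCA's code; the three deer lineages are simplified examples I chose.

```python
def lowest_common_ancestor(lineages):
    """Return the most specific rank shared by every lineage.

    Each lineage is a list of names ordered from broad to narrow.
    Returns None if there are no lineages or nothing is shared.
    """
    if not lineages:
        return None
    shared = None
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # the matches disagree from here downwards
        shared = ranks[0]
    return shared


# One short read matching three different deer species.
hits = [
    ["Mammalia", "Artiodactyla", "Cervidae", "Rangifer", "Rangifer tarandus"],
    ["Mammalia", "Artiodactyla", "Cervidae", "Alces", "Alces alces"],
    ["Mammalia", "Artiodactyla", "Cervidae", "Cervus", "Cervus elaphus"],
]
print(lowest_common_ancestor(hits))  # Cervidae
```

If a read matches only one species, the function returns that species, so nothing is lost when the evidence really is specific.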

13. Researchers Tested Whether DNA Was Truly Ancient, Not Contamination

They used a tool called metaDMG to check for chemical damage typical of ancient DNA:

  • C → T changes at 5′ ends
  • G → A changes at 3′ ends

These occur due to cytosine deamination after death.

They only trusted taxa showing strong, statistically robust ancient damage patterns.
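The raw signal behind this check can be counted by hand: at each position from the 5′ end, how often does the reference have a C where the read shows a T? The Python sketch below does only that counting, on made-up read and reference pairs assumed to be already aligned without gaps. metaDMG goes much further and fits a statistical model to such counts, which is not attempted here.

```python
def ct_rate(alignments, position: int = 0) -> float:
    """Fraction of reference C's that appear as T in the reads.

    `alignments` is a list of (read, reference) pairs of equal length,
    written 5' to 3'; `position` is the distance from the 5' end.
    """
    c_sites = 0
    c_to_t = 0
    for read, ref in alignments:
        if position >= len(ref):
            continue
        if ref[position] == "C":
            c_sites += 1
            if read[position] == "T":
                c_to_t += 1
    return c_to_t / c_sites if c_sites else 0.0


# Made-up pairs: two of the three reads starting at a C show a T instead.
pairs = [("TACG", "CACG"), ("TTCA", "CTCA"), ("CGCA", "CGCA"), ("AGCT", "AGCT")]
print(ct_rate(pairs, position=0))  # two of three C sites changed
print(ct_rate(pairs, position=2))  # 0.0, no change away from the end
```

A high rate at the very end that falls away further into the read is the pattern expected from ancient DNA; modern contamination tends to show a flat, low rate throughout.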

14. Strict Filtering Removed Unreliable Taxa and Noisy Samples

To avoid false claims, they removed:

  • species with too few matching reads
  • samples with very low total DNA
  • taxa not appearing in at least three samples

Finally, they converted counts into proportions, letting samples be compared fairly.
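The filtering and the conversion to proportions can be sketched as one small Python function. The rule of at least three samples comes from the list above; the two read-count cutoffs and the example counts are hypothetical numbers chosen for illustration, not the values used in the paper.

```python
MIN_READS_PER_TAXON = 10    # hypothetical cutoff
MIN_READS_PER_SAMPLE = 50   # hypothetical cutoff
MIN_SAMPLES = 3             # taxa must appear in at least three samples


def filter_and_normalise(counts):
    """Apply the three filters, then turn read counts into proportions.

    `counts` maps sample name -> {taxon name -> number of reads}.
    """
    # 1. Drop taxa with too few matching reads in a sample.
    counts = {s: {t: n for t, n in taxa.items() if n >= MIN_READS_PER_TAXON}
              for s, taxa in counts.items()}
    # 2. Drop samples with very little DNA overall.
    counts = {s: taxa for s, taxa in counts.items()
              if sum(taxa.values()) >= MIN_READS_PER_SAMPLE}
    # 3. Keep only taxa seen in enough of the remaining samples.
    seen_in = {}
    for taxa in counts.values():
        for taxon in taxa:
            seen_in[taxon] = seen_in.get(taxon, 0) + 1
    keep = {t for t, n in seen_in.items() if n >= MIN_SAMPLES}
    # 4. Convert what is left into proportions per sample.
    result = {}
    for sample, taxa in counts.items():
        kept = {t: n for t, n in taxa.items() if t in keep}
        total = sum(kept.values())
        if total:
            result[sample] = {t: n / total for t, n in kept.items()}
    return result


example = {
    "s1": {"Betula": 60, "Populus": 40, "Salix": 5},
    "s2": {"Betula": 30, "Populus": 70},
    "s3": {"Betula": 90, "Populus": 10, "Dryas": 20},
    "s4": {"Betula": 12},
}
print(filter_and_normalise(example))
```

In this made-up example Salix is dropped for having too few reads, sample s4 for having too little DNA, and Dryas for appearing in only one sample, leaving Betula and Populus as proportions in three samples.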

15. The Final Dataset Represents a Rigorously Verified Ancient Ecosystem

Only DNA fragments that were:

  • frequently found
  • matched known species at high confidence
  • showed clear ancient damage
  • appeared across multiple sediment samples

...were accepted as real evidence.

This produced a remarkably reliable reconstruction of a 2-million-year-old Arctic ecosystem that no longer exists today.


Final Thoughts

I’ll stop here. The purpose of this was to take the bioinformatics part of the paper apart, explain it in a fairly jargon-free way, and show how relatively straightforward it is in concept.

The rest of the paper is certainly worth reading, whether you are interested in sedaDNA extraction and analysis specifically and how it can be used, or in the science behind the whole palaeoecological investigation, or simply out of curiosity about a completely new type of ecological community — a community that does not exist any more, dating from just before the start of the series of ice ages.