
Artificial intelligence is speeding up the cataloguing of insect collections
Researchers at the Museum für Naturkunde Berlin have teamed up with data scientists to develop a new method for extracting label information from digitised insect specimens in a largely automated process. The pipeline, called ELIE, uses artificial intelligence to reliably recognise and analyse printed labels. This eliminates much of the previously time-consuming manual transcription work – a significant step forward for the digitisation of natural history collections worldwide.
With over a million described species, insects represent the most species-rich group of all living organisms. Natural history collections worldwide preserve around 500 million insect specimens collected over the past three centuries. Each of these specimens bears labels containing key information such as the location, date of collection or collector’s name. This data forms an indispensable basis for research in the fields of taxonomy, evolutionary biology and ecology.
Despite modern high-throughput methods for digitising collection items, this label information has so far been transcribed predominantly by hand. Working with experts in digitisation and data science, researchers at the Museum für Naturkunde Berlin (MfN) have now developed a new pipeline that significantly simplifies and accelerates this process.
The ELIE (‘Entomological Label Information Extraction’) pipeline automates several steps of label analysis. Using image processing and machine learning techniques, ELIE identifies individual labels on digital images, aligns them and distinguishes between printed and handwritten text. Printed labels are read automatically using text recognition, while handwritten labels are set aside for later manual processing. In addition, the system groups labels with identical content, so that recurring information only needs to be checked once.
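The grouping step described above can be illustrated with a minimal sketch. It is not ELIE's actual implementation: the function names, the sample transcriptions and the choice to treat labels as identical when their text matches after simple normalisation are all assumptions made for illustration.

```python
from collections import defaultdict
import re
import unicodedata


def normalise(text: str) -> str:
    """Reduce a transcription to a comparison key (assumed rule, not ELIE's):
    unify Unicode variants, ignore case, collapse all whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def group_labels(transcriptions: dict[str, str]) -> dict[str, list[str]]:
    """Map each normalised label text to the image IDs that carry it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for image_id, text in transcriptions.items():
        groups[normalise(text)].append(image_id)
    return dict(groups)


# Hypothetical text-recognition output for three printed labels.
labels = {
    "img_001": "Berlin, Grunewald\n12.VI.1923",
    "img_002": "berlin,  grunewald 12.vi.1923",
    "img_003": "Potsdam 3.V.1931",
}

groups = group_labels(labels)
for text, image_ids in groups.items():
    # A reviewer would check each distinct text once, not once per image.
    print(f"{len(image_ids)} label(s): {text}")
```

In this example three label images collapse into two distinct texts, so only two checks are needed. The more often labels repeat in a collection, the larger the saving.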
“With ELIE, we are alleviating one of the biggest bottlenecks in the digitisation of entomological collections,” says Margot Belot, data manager at the Museum für Naturkunde Berlin. “The automated analysis of printed labels significantly reduces the workload for researchers and curators and enables us to make our collections available for research more quickly and systematically.”
The new pipeline was tested on several datasets, including 26,000 label images drawn from the approximately 650,000 insect specimens that the Museum für Naturkunde Berlin digitised between 2022 and 2023 using a high-speed digitisation line from Picturae. The analysis shows that, depending on the degree of label duplication, information can be extracted automatically from up to nearly 90 per cent of the printed labels. Further tests using datasets from the Smithsonian National Museum of Natural History in Washington and the Museum of Comparative Zoology at Harvard University demonstrate that ELIE can also be applied reliably to collections it has not previously encountered.
The results were published in the journal Methods in Ecology and Evolution. The researchers view ELIE as a key component for the future digitisation of natural history collections and a contribution to the better utilisation of these unique archives of biological diversity.
