
Artificial Intelligence Accelerates Access to Insect Collections


Press release

Researchers at the Museum für Naturkunde Berlin, together with data scientists, have developed a new method to largely automate the extraction of label information from digitized insect specimens. The pipeline, named ELIE, uses artificial intelligence to reliably detect and process printed labels. This significantly reduces the time-consuming manual transcription work and represents an important advance for the digitization of natural history collections worldwide.

With more than one million described species, insects represent the most diverse group of living organisms on Earth. Natural history collections worldwide house around 500 million insect specimens collected over the past three centuries. Each specimen carries labels containing essential information such as collection locality, date, and collector. These data form a crucial foundation for research in taxonomy, evolutionary biology, and ecology.

Despite the availability of high-throughput digitization workflows for collection objects, the transcription of label information is still largely performed manually. Researchers at the Museum für Naturkunde Berlin, working closely with experts in digitization and data science, have now developed a new pipeline that substantially simplifies and accelerates this process.

The pipeline, ELIE (“Entomological Label Information Extraction”), automates several steps of label processing. Using image analysis and machine learning techniques, ELIE detects individual labels in digital images, aligns them, and classifies them as either printed or handwritten. Printed labels are automatically processed using optical character recognition, while handwritten information is separated for targeted manual transcription. In addition, the system groups identical or highly similar labels, ensuring that recurring information only needs to be reviewed once.
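The grouping step can be illustrated with a small sketch. This is not ELIE's actual implementation, which the release does not describe in detail. It assumes the label text has already been produced by optical character recognition and uses Python's standard `difflib` as a stand-in similarity measure. The function names, the 0.9 threshold, and the sample labels are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial spacing differences do not matter."""
    return " ".join(text.lower().split())


def group_similar_labels(labels: list[str], threshold: float = 0.9) -> list[list[int]]:
    """Greedily group label indices whose normalized text is near-identical.

    Each label is compared with the first member (the representative) of every
    existing group and joins the first group whose similarity ratio reaches the
    threshold; otherwise it starts a new group. The threshold is an assumed value.
    """
    groups: list[list[int]] = []
    for index, label in enumerate(labels):
        candidate = normalize(label)
        for group in groups:
            representative = normalize(labels[group[0]])
            if SequenceMatcher(None, representative, candidate).ratio() >= threshold:
                group.append(index)
                break
        else:
            # No existing group was similar enough, so this label starts a new one.
            groups.append([index])
    return groups


# Hypothetical OCR output for four printed labels (invented for illustration).
ocr_labels = [
    "Berlin, Grunewald 12.VI.1932 leg. Schmidt",
    "Berlin,  Grunewald 12.VI.1932 leg. Schmidt",  # extra space
    "Berlin, Grunewald 12.VI.1932 leg. Schmldt",   # single OCR character error
    "Kamerun, Buea 3.II.1910 leg. Preuss",
]
groups = group_similar_labels(ocr_labels)
print(groups)
```

In this sketch the first three labels fall into one group and the fourth stands alone, so a reviewer would check two transcriptions instead of four. The more redundant a collection's labels are, the larger the saving.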

“With ELIE, we address one of the major bottlenecks in the digitization of entomological collections,” says Margot Belot, data manager at the Museum für Naturkunde Berlin. “Automating the transcription of printed labels significantly relieves researchers and curators and allows us to make our collections available for scientific use more quickly and systematically.”

The pipeline was tested on several datasets, including 26,000 label images drawn from the 650,000 insect specimens digitized at the Museum für Naturkunde Berlin (MfN) between 2022 and 2023 with a high-speed, conveyor-based imaging system developed by the company Picturae. The results show that, depending on the degree of label redundancy, information can be extracted automatically from nearly 90 percent of printed labels at best. Further tests with datasets from the Smithsonian National Museum of Natural History in Washington, D.C., and the Museum of Comparative Zoology at Harvard University demonstrate that ELIE can be reliably applied to previously unseen collections.

The results have been published in the journal Methods in Ecology and Evolution. The researchers see ELIE as an important building block for the future digitization of natural history collections and as a contribution to making these unique archives of biodiversity more accessible for research.

Collage of analogue labels from insect specimens © Picturae
