Structured Data on Commons and GLAM: open questions and fresh challenges

Originally published at: https://space.wmflabs.org/2020/02/28/structured-data-on-commons-and-glam-open-questions-and-fresh-challenges/

Since 2019, files on Wikimedia Commons can be enhanced with multilingual and machine-readable structured data. This addition brings many benefits for cultural institutions or GLAMs (Galleries, Libraries, Archives and Museums) partnering with Wikimedians, as GLAMs also store data about their collections in very structured ways.

In the past year, I have worked together with GLAM staff and Wikimedia community members to ‘test’ this new technology, and explore its potential, in a series of pilot projects. What does Structured Data on Commons make possible? Which new questions and challenges appear?

  1. We have imported the collection of a small museum to Wikimedia Commons and Wikidata, in order to experiment with data modeling, and to explore the potential of structured data to make small collections accessible online for the first time.
  2. Community members have developed a micro-contributions game (the ISA Tool), testing the potential of ‘easy’ structured data editing for newcomers.
  3. Wikimedians have described digitized books with structured data, to prepare them for transcription on Wikisource, investigating how Structured Data on Commons can help to avoid data duplication across Wikimedia projects and how it can make cross-wiki workflows more efficient.
  4. Various organizations researched data synchronization and roundtripping with external databases – a feature that many larger GLAM institutions ask for, and for which structured data on Wikimedia projects provides more advanced foundations.
  5. A fifth pilot project has brought together, and connected, the work of three generations of prominent Belgian silversmiths, exploring the potential of structured data to connect art collections around the world and to link them to their broader context. This pilot also highlighted the description of copyright and licenses in structured data.

Working on this set of pilot projects, what were some of the most common new challenges we discovered for GLAM-Wiki projects using structured data? In most of these projects, we spent a lot of time thinking about data modeling, and about the right place to put certain data. Read on for more!

Creative works and media files: does the difference matter?

As Structured Data on Commons was under development, the Wikimedia Commons community started exploring how to best describe (GLAM) files with structured data on Wikimedia Commons. How should the community use properties and items from Wikidata to indicate who created a file, when it was created, what can be seen in it? In August 2018, community members started brainstorming a potential data model in the so-called properties table.

Advice from the cultural sector

The cultural sector itself has a lot of experience in storing digital media files and describing them. I have asked several GLAM staff familiar with such data modeling to look at the emerging data model expressed in the properties table, asking for feedback.

What was the most common pattern in the feedback we received? When dealing with files on Wikimedia Commons that show a creative work (for instance a photo of a sculpture or a scan of a book) it’s very important to make a clear distinction between the creative work, and the file that shows this work. Antoine Isaac, R&D Manager at the cultural aggregator Europeana, compiled a document with extensive comments, which includes a clear warning to avoid a situation of ‘Leonardo da Vinci creating thousands of JPEGs.’

Was this JPEG file created by Michelangelo in the early 16th Century? Or is this a 2016 sculpture by Wikimedian Jörg Bittner Unna? David by Michelangelo, Florence, Gallery dell’Accademia, 1501-1504 by Jörg Bittner Unna, CC BY 3.0

George Bruseker, Research and Development Engineer at Foundation for Research and Technology – Hellas (FORTH) is specialized in the CIDOC Conceptual Reference Model, a metadata standard for the cultural sector which is supported by the International Council of Museums. He provided feedback on the draft properties table and mapped part of this emerging data model to CIDOC CRM. Quoting him:

It is extremely important to make a crisp distinction between the description of the digital object qua digital object, and the various information objects that it encodes/carries/incorporates. If this is not done properly, there will be a lot of confusion and mistakes in the metadata and the Commons community will run into problems in the future!

George Bruseker
Distinction between a creative work and a file that represents that work. Infographic by Sandra Fauconnier, reusing a photo by Philip Pikart, CC BY-SA 3.0, a photo by Glenn Ashton, CC BY-SA 3.0, and Pascal Dagnan-Bouveret: Une noce chez le photographe, 1878-79, Museum of Fine Arts of Lyon, Public Domain

As seen in the examples above, this distinction matters for correct attribution and copyright determination, in line with the GLAM sector’s and Wikimedia’s own values of rigor and correctness. Many Wikimedians familiar with cultural heritage and structured data understand it intuitively and already put it into practice. But it’s not intuitive to laypeople, and I’ve also heard some Wikimedians say ‘this is very hard’ and ‘this makes my head hurt’. Should we aim for precise descriptions that are more difficult to grasp, or for a system that is easy to understand but imprecise, with the risk of demotivating potential partners? In the longer run, as some Wikimedians have already suggested, it will be helpful if information templates and other user interface elements on Wikimedia Commons provide information and clues that make the distinction clear to laypeople at a glance.
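To make the ‘work versus file’ distinction concrete, here is a minimal Python sketch. It is an illustration only, not the data model the community has settled on. P170 (creator), P571 (inception) and P180 (depicts) are Wikidata properties; the `Entity` class, the `attribution` helper, the placeholder IDs and the choice to link file and work through ‘depicts’ are assumptions made for this example. The year 2016 is taken from the rhetorical question in the photo caption above.

```python
from dataclasses import dataclass, field

# Wikidata properties that can be used on both Wikidata and Commons.
CREATOR = "P170"
INCEPTION = "P571"
DEPICTS = "P180"


@dataclass
class Entity:
    """A simplified stand-in for a Wikibase entity: an ID, a label, statements."""
    entity_id: str
    label: str
    statements: dict = field(default_factory=dict)


# The creative work, described once. "Q_DAVID" is a placeholder, not a real item ID.
david = Entity(
    entity_id="Q_DAVID",
    label="David",
    statements={CREATOR: "Michelangelo", INCEPTION: "1501-1504"},
)

# The media file has its own creator and date, and points to the work it shows.
# "M_PHOTO" is a placeholder for the file's entity ID on Commons.
photo = Entity(
    entity_id="M_PHOTO",
    label="Photo of David",
    statements={
        CREATOR: "Jörg Bittner Unna",
        INCEPTION: "2016",
        DEPICTS: david.entity_id,
    },
)


def attribution(file: Entity, works: dict) -> str:
    """Credit the file's creator and the work's creator separately."""
    work = works[file.statements[DEPICTS]]
    return (
        f"{file.label} by {file.statements[CREATOR]} ({file.statements[INCEPTION]}), "
        f"showing {work.label} by {work.statements[CREATOR]} ({work.statements[INCEPTION]})"
    )


print(attribution(photo, {david.entity_id: david}))
```

Because the two creators live on two separate entities, Michelangelo is never recorded as the creator of a JPEG, and the photographer is never credited with the sculpture.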

New adventures in the land of federation, or: where should certain data be stored?

Structured Data on Commons is built on the concept of federation. What does this mean?

In a technical sense, a federated database system is a management system where multiple autonomous databases work together in a single, so-called federated, database. Wikibase Federation is implemented for Structured Data on Wikimedia Commons: it makes it possible to use entities (Items and Properties) defined on one Wikibase repository (i.e., Wikidata) on another Wikibase repository (i.e., Wikimedia Commons).

Structured Data on Commons – Project glossary

Wikimedia Commons is now Wikimedia’s first federated ‘structured data sister’ of Wikidata that is used very intensively – and we are starting to see some peculiar challenges around this: data lives in different places, but needs to make sense together.

Related to the above-mentioned issue of ‘work versus file’, GLAM-focused Wikimedians who work with structured data on Wikimedia Commons now face a new challenge: should the (separate) information about creative works be stored on Wikidata, or on Wikimedia Commons?

As of February 2020, Wikidata contains nearly 2 million data items for artworks. Since Wikidata was founded in 2012, quite a few GLAM-Wiki projects have used it as a general Linked Open Data store. To name just a few examples: Wikidata contains data about works in the collections of Brazilian and Flemish museums, of the National Library of Wales and the Metropolitan Museum of Art… Volunteer-driven projects like the Sum of All Paintings and Wiki Loves Monuments have also curated hundreds of thousands of Wikidata items of creative works and buildings. Wikimedia Commons volunteers have now started to connect files on Wikimedia Commons to these works on Wikidata, using structured data. The general process for Wiki Loves Monuments has recently been documented, and input is welcome.

  • Rietveld Schröder House in Utrecht, photo by ErikHonig, CC BY-SA 3.0
  • Some of the structured data of this photo, as stored on Wikimedia Commons.
  • … which points to data about the building itself, as stored on Wikidata.
Data stored in two places: an example of a photo of a building, submitted to the 2012 Wiki Loves Monuments competition, now described with structured data on Wikimedia Commons and Wikidata.
And a more confusing example from the Jakob Smitsmuseum pilot project: a file showing a painting in that museum depicting the Canadian painter Frederick Coburn. The file is described with structured data on Wikimedia Commons, the painting (and the person it depicts) is described on Wikidata.
18th Century piece of embroidery from the Caucasus. Image from the David Collection, CC0. Should works like this one also be described with a Wikidata item?

But does this mean that every single creative work with a file on Wikimedia Commons should have its own Wikidata item? What about individual relief sculptures in temple façades, individual pages or even illustrations in books, or seemingly ‘unglamorous’ objects like ceramics shards and pieces of fabric? Is Wikidata’s infrastructure technically able to store items for all these millions of works? And does it make sense?

A related question: what would be the advantages and disadvantages of ‘data duplication’? In the example of the portrait painting of Frederick Coburn above: what would it mean if information about the creator, creation date and sitter of the painting would also be available on Wikimedia Commons? This would certainly help to make this image directly discoverable on Wikimedia Commons (see below). But it also brings risks of data that is unclear and out of sync, with mistakes on either side that are less easy to discover, and two communities maintaining similar data in parallel.
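The ‘out of sync’ risk can be illustrated with a small Python sketch that compares a work’s description on Wikidata with a duplicated copy on Commons. Everything here is hypothetical: the `find_mismatches` helper, the plain-text property names and the sample values (including the creation years) are invented for illustration and do not reflect the actual data about this painting.

```python
def find_mismatches(wikidata_item: dict, commons_copy: dict) -> dict:
    """Return properties present on both sides whose values differ.

    Each result maps a property name to a (wikidata_value, commons_value) pair.
    """
    shared = wikidata_item.keys() & commons_copy.keys()
    return {
        prop: (wikidata_item[prop], commons_copy[prop])
        for prop in shared
        if wikidata_item[prop] != commons_copy[prop]
    }


# Invented sample data: the same painting described in two places.
on_wikidata = {"sitter": "Frederick Coburn", "creation date": "1900"}
on_commons = {"sitter": "Frederick Coburn", "creation date": "1901"}

print(find_mismatches(on_wikidata, on_commons))
```

A check like this only reports that the two copies disagree. It cannot tell which side is right, which is exactly the maintenance burden that duplication would place on both communities.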

The structured data-related GLAM pilot projects that I have mentored have indeed gone the ‘creative work on Wikidata’ route. However, it will be very useful to learn from other GLAM-Wiki projects that explore how to keep all structured data on Commons, without creating separate Wikidata items for creative works. Dominic Byrd-McDevitt, Data Fellow at the Digital Public Library of America, is already looking into that option and, at the time of writing, is experimenting with how to do that. Input is certainly very welcome. If, as a reader of this blog post, you run (or have run) such an experiment yourself, please list it on the overview of GLAM projects using Structured Data on Commons, in order to help and inspire others.

Technically, this ‘federated’ situation also poses interesting challenges around upload and discovery. If someone searches for images of, say, the Hindu deity Ganesha on Wikimedia Commons, it would be great if they also find results of images that depict a sculpture of Ganesha, even if the ‘depicts:Ganesha’ statement is on Wikidata, not on Commons. And how to build an easy-to-understand (batch) upload tool that adds the right data in the right place in a federated way?

Works on Wikidata that depict the deity Ganesha (query). As that data is already present on Wikidata, how can these images also be made discoverable on Wikimedia Commons, without needlessly duplicating data?
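One way to think about such federated discovery is a search that follows the link from a file to the work it shows, and then reads that work’s own ‘depicts’ statements. The Python sketch below runs on in-memory dictionaries only; it does not call any Wikimedia API, and it is not how Commons search is actually implemented. P180 (depicts) is a Wikidata property; the item IDs, file names and the `search_depicts` function are placeholders invented for this example.

```python
DEPICTS = "P180"

# Wikidata side: items for creative works and what each work depicts.
wikidata_items = {
    "Q_SCULPTURE_1": {DEPICTS: {"Q_GANESHA"}},
    "Q_PAINTING_2": {DEPICTS: {"Q_LANDSCAPE"}},
}

# Commons side: files and what each file depicts.
commons_files = {
    "File:Ganesha sculpture.jpg": {DEPICTS: {"Q_SCULPTURE_1"}},
    "File:Ganesha festival.jpg": {DEPICTS: {"Q_GANESHA"}},
    "File:Landscape painting.jpg": {DEPICTS: {"Q_PAINTING_2"}},
}


def search_depicts(target: str, files: dict, items: dict) -> list:
    """Find files that depict `target` directly, or via a work stored elsewhere."""
    hits = []
    for name, statements in files.items():
        direct = statements.get(DEPICTS, set())
        # Follow each depicted item one step and collect what it depicts in turn.
        via_work = set()
        for item_id in direct:
            via_work |= items.get(item_id, {}).get(DEPICTS, set())
        if target in direct or target in via_work:
            hits.append(name)
    return sorted(hits)


print(search_depicts("Q_GANESHA", commons_files, wikidata_items))
```

The photo of the sculpture is found even though ‘depicts: Ganesha’ is stated only on the sculpture’s item, so nothing has to be copied over to the file.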

These open questions and challenges are probably typical for a federated, distributed web of data at large. As the Wikimedia movement hopefully grows to become more diverse, we may pose these kinds of questions more often. What if more Wikimedia projects become powered by federated structured data? Wikibase, the software behind Wikidata, is increasingly used by institutions around the world, and this community is discussing the challenges around federation as well.

What’s next?

Millions of files on Wikimedia Commons are described with some structured data already; community members are adding more structured data to files every day. With each new GLAM-Wiki project, we can learn more as a community, and inspire each other. Please report on your projects in the This Month in GLAM newsletter, share your insights, and feel free to list your initiatives in the overview page of Structured Data on Commons and GLAM projects. Document and discuss your experiences in data modeling. You can ask questions on the dedicated Structured Data on Commons talk pages: general talk page, data modeling.

In order to help GLAM staff and Wikimedia communities get started, an introduction to structured data for GLAM-Wiki is now available on meta.wikimedia.org (work in progress!). Everyone is warmly invited to extend and improve this documentation, translate it, add examples to it…

Beginner-level documentation about structured data for GLAM-Wiki. Please help to improve!

And finally, GLAM-Wiki projects need solid technical infrastructure to contribute content to Wikimedia projects at scale – now also with support for structured data. Until now, GLAM-Wiki collaborations around the world have flourished thanks to many dedicated batch upload, statistics and curation tools built by Wikimedia volunteers. In order to make this technical infrastructure more sustainable, Wikimedia Sverige (WMSE) is currently growing its capacity to become a GLAM Hub for the Wikimedia movement, and this includes the development of GLAM-Wiki tools that support structured data. If you want to stay informed, keep an eye on the meta.wikimedia.org page about this initiative!

This is the last of a series of blog posts on Wikimedia Space about GLAM pilot projects with Structured Data on Wikimedia Commons. Earlier posts are:

I indeed welcome all responses, thoughts and feedback, and am watching this thread! :hugs:

You might want to read and contribute to the discussion at https://commons.wikimedia.org/wiki/Commons:Village_pump#Depicts where even some ardent fans of Wikidata are questioning the way that data is populating the structured data tool. They point out that it is anything but structured: an opportunity for some creative vandalism at worst, or a complete overload of irrelevant, unstructured data at best.

I am indeed following that discussion closely. It focuses on the impact of the computer-aided tagging tool, which I think is not very useful for GLAM purposes. GLAM-related projects indeed need high-quality, detailed structured data; that’s what the majority of GLAM pilot projects have looked at, and what I will continue to advise the GLAM-Wiki community to work with.

It’s pretty good for finding EXACTLY what you are looking for. But there are still a lot of files that need some structured data.