Structured Data on Commons - A Blog Series

Originally published at: https://space.wmflabs.org/2019/07/25/structured-data-on-commons-a-blog-series/

File:One_hell_of_a_mess….jpg (10 July 2004, 23:25:02) by Tom Cronin, CC-BY-SA-2.0. It's a picture of a big analog synthesiser…

Wikimedia Commons is the freely licensed media repository hosted by the Wikimedia Foundation. Started in 2004, Commons contains over 50 million files, all of which are meant to have educational value and to help illustrate other Wikimedia projects such as Wikipedia. As with all Wikimedia projects, the content is created, curated, and edited by volunteers. In addition to the content work on the wikis, the Commons community organizes and runs thematic media contribution campaigns around the world, such as Wiki Loves Monuments, Wiki Loves Food, and Wiki Loves Africa.

Structured Data on Commons (SDC) is a three-year software development project funded by the Sloan Foundation to provide the infrastructure for Wikimedia Commons volunteers to organize data about media files in a consistent, linked manner. The goals of the project are to make contributing to Commons easier by providing new ways to edit, curate, and write software for Commons, and to make general use of Commons easier by expanding capabilities in search and reuse. These goals will be served by improved support for multilingual content and ways of working on Commons. This is the first post in a blog series that will document the different parts of implementing SDC. It starts with this introduction to the project and brief outlines of the software involved in making it happen, each of which will be covered in more depth later.

File:Wikidata_for_Commons-logo.svg (uploaded 17 August 2014, 10:20:30), PD ineligible. Logo for d:Wikidata:WikiProject Commons.

Part One – an introduction to the software

Commons is built on MediaWiki, the same software used by the other Wikimedia projects. MediaWiki was primarily developed to host text. Because of this, information about files on Commons is stored in plain-text descriptions (wikitext, templates) and categories. The information includes at least the uploader, author, source, date, and license of a file, with many other optional items. These pieces of data are usually available in only one language, mostly English. Most importantly, they are not structured in a way that lets software developers write programs that consistently understand the data stored in file pages. Data that is structured in a consistent, understandable way is called “machine-readable,” and having machine-readable data is a primary goal for the Structured Data on Commons project.

To provide this consistent, machine-readable data, the information needs to be stored in a database instead of as plain text in MediaWiki. Wikibase is the software solution for that need. Developed by Wikimedia Deutschland to support Wikidata, Wikibase enables MediaWiki to store structured data or to access data that is stored in a structured data repository. The project needed a way to use Wikibase on other wikis and connect the information back to Wikidata, a capability that had recently been developed. Called Federated Wikibase, this software is crucial to organizing media information on Commons.

The next piece of software needed was Multi-Content Revisions (MCR). MCR lets a single wiki page be assembled from information that lives in different places and is stored in different ways. In other words, MCR can combine information from both MediaWiki and Wikibase so that it is displayed and managed together. Federated Wikibase and MCR will be covered in more detail in a future post in this series.
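To make the idea of a page built from several kinds of content more concrete, here is a minimal sketch in Python. It is not MediaWiki's actual implementation (MediaWiki is written in PHP); the `Slot` and `Revision` classes, the sample wikitext, and the sample caption are all hypothetical and exist only for illustration. The slot names `main` and `mediainfo` and the content model string are used here as plausible labels and should be treated as assumptions.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Slot:
    """One piece of a page revision, stored in its own format (hypothetical)."""
    role: str           # which part of the page this is, e.g. "main"
    content_model: str  # how the content should be interpreted
    content: str        # the serialized content itself


@dataclass
class Revision:
    """A single page revision assembled from several slots (hypothetical)."""
    rev_id: int
    slots: dict = field(default_factory=dict)

    def add_slot(self, slot: Slot) -> None:
        # Each role appears at most once per revision.
        self.slots[slot.role] = slot

    def content(self, role: str) -> str:
        return self.slots[role].content


# One revision of a file page: free text in one slot, structured data in another.
rev = Revision(rev_id=1)
rev.add_slot(Slot("main", "wikitext", "== Summary ==\nA black house cat."))
rev.add_slot(Slot("mediainfo", "wikibase-mediainfo",
                  json.dumps({"labels": {"en": "Black house cat"}})))

# Both kinds of content are saved, displayed, and versioned together.
structured = json.loads(rev.content("mediainfo"))
print(sorted(rev.slots))
print(structured["labels"]["en"])
```

The point of the sketch is only that the wikitext and the structured data belong to the same revision, so an edit history covers both.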

Once Federated Wikibase and MCR were ready for release, the Structured Data on Commons team produced the first user-facing feature built on the new underlying software: multilingual file captions. Captions, which are stored in Wikibase, serve a similar function to the description template used on file pages, which is stored in MediaWiki: both are supposed to say what is in the file. However, descriptions are not limited in length and may contain detail that is not needed to find the file, including wikilinks and external links. The template does support additional languages, but the process of adding them is not necessarily easy. Captions are limited in length, should describe the file only in a short, factual way, and offer an easier way to add other languages. This makes files with captions easier to find in search, in a structured, multilingual way, for humans and software programs alike.
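As a rough illustration of what "short, multilingual, and structured" can mean in practice, the sketch below models captions as a mapping from language code to text. The function names and the sample captions are hypothetical, and the length limit of 250 characters is an assumption chosen for the example, not a figure stated in this post.

```python
MAX_CAPTION_LENGTH = 250  # assumed limit, for illustration only


def set_caption(captions: dict, lang: str, text: str) -> dict:
    """Return a copy of `captions` with the caption for `lang` set."""
    text = text.strip()
    if not text:
        raise ValueError("caption must not be empty")
    if len(text) > MAX_CAPTION_LENGTH:
        raise ValueError("caption is too long")
    updated = dict(captions)
    updated[lang] = text
    return updated


def pick_caption(captions: dict, preferred_langs: list):
    """Return the first caption available in the reader's preferred languages."""
    for lang in preferred_langs:
        if lang in captions:
            return captions[lang]
    return None


captions = set_caption({}, "en", "Black house cat")
captions = set_caption(captions, "de", "Schwarze Hauskatze")

# A reader who prefers French, then German, gets the German caption.
print(pick_caption(captions, ["fr", "de", "en"]))
```

Because each caption is keyed by language, adding another language is a single small edit, and software can pick the right text without parsing a template.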

After releasing Wikibase and MCR to Commons, with captions as the first feature to confirm that everything worked, the development team released support for the first structured statement type, “depicts.” Depicts statements make simple, factual claims about the content of a file and link to their matching concept on Wikidata. To further develop depicts statements, support for qualifiers was released as well. Qualifiers allow depicts statements to carry more information about what is being depicted. For example, a picture of a black house cat can have the structured statement depicts: house cat[color:black]. Depicts statements are on a new tab that was introduced to the file page, “Structured data.” Aside from captions, all structured data is on this tab.

File:Depicts_with_a_qualifier.png (11 July 2019) by User:Keegan (WMF), CC-BY-SA-3.0. An example of a depicts statement with a qualifier.
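The house cat example can be sketched as a small data structure. This is a simplified, hypothetical representation, not the actual Wikibase JSON serialization. The identifiers are the Wikidata IDs as best understood at the time of writing (P180 for depicts, Q146 for house cat, P462 for color, Q23445 for black) and should be checked against Wikidata before being relied on. The `depicts` and `render` helpers and the `LABELS` table are made up for this example.

```python
# English labels for the identifiers used below (assumed; verify on Wikidata).
LABELS = {
    "P180": "depicts",
    "Q146": "house cat",
    "P462": "color",
    "Q23445": "black",
}


def depicts(item_id: str, qualifiers: dict = None) -> dict:
    """Build a simplified depicts statement pointing at a Wikidata item."""
    return {
        "property": "P180",
        "value": item_id,
        # Qualifiers map a property ID to an item ID.
        "qualifiers": dict(qualifiers or {}),
    }


def render(statement: dict, labels: dict) -> str:
    """Render a statement in the 'depicts: item[qualifier:value]' style."""
    text = f"{labels[statement['property']]}: {labels[statement['value']]}"
    quals = ",".join(
        f"{labels[prop]}:{labels[value]}"
        for prop, value in statement["qualifiers"].items()
    )
    return f"{text}[{quals}]" if quals else text


statement = depicts("Q146", {"P462": "Q23445"})
print(render(statement, LABELS))  # depicts: house cat[color:black]
```

Since the statement stores identifiers and not words, the same statement can be displayed in any language for which the linked Wikidata concepts have labels.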

After this short introduction, the SDC blog series will have further information about depicts and qualifiers, as well as support for making other types of statements about files.

Next: Part Two – Federated Wikibase and Multi-Content Revisions


@Joalpe This looks like excellent preparatory reading for this week’s Wikidata Lab.
