Monday 9 March 2020

2019 Carl Larsson, who is that? Sadly Europeana doesn't know --> #Metadatadebt

Update 2022-jul: Looks like something has been done!!! Carl Larsson = person/60886 has been cleaned and is now the same as Carl Larsson = WD Q187310 (435 items)

Asked a question on EFS-1261 about status

Update 2022-apr: Europeana now has a bug reporting system and we get unique IDs. This issue has ID EFS-1261, see also GitHub

Latest update 2022-may-1: Europeana doesn't see this as urgent --> Wikipedia will not link Europeana as the quality is too bad ;-(

The problem: 
We asked Europeana "show me the painter Carl Larsson" (link). The response is both the painter and another person, a photographer..... i.e. Europeana can't tell the difference between

  • painter Carl Larsson = Wikidata Q187310 
  • photographer Carl Larsson = Wikidata Q5937128 
  • Another example 
    • Jenny Lind

What should have been done is to use unique persistent identifiers. See how a Swedish runic researcher understood that in the year 1750, 250 years ago (link)

The painter Carl Larsson agent/base/60886
where most objects displayed have nothing to do with the painter

Can we trust Europeana metadata or is it fake metadata? 

  • Why can't a network like Europeana deliver good quality?
  • Why does no one care?
    • en:Wikipedia did, see T243764
    • are top-down networks like Europeana not suited for new technologies like Linked Data?
  • Europeana did a prototype in 2012 and today, in 2020, we see no result. Why? Instead en:Wikipedia takes an active decision that the quality is too bad to link Europeana....
    • is Europeana too academic? Speaking about RDF instead of quality-assuring the data delivered, and lacking the skills to communicate basic things, like that artist A in your museum needs to be declared the same as artist B in Wikipedia...
    • are museums in Europe not skilled enough?
      • in Sweden I see it's the same people who have worked with museums and linked data since 2012, and no one cares that they don't deliver. Why?

        Example: in Sweden we built a semantic layer called UGC-hub ("user-generated content") in 2012 and it has < 20 properties, compared with > 7000 in Wikidata. See the Jupyter Notebook checking the status of Swedish K-samsök/kulturnav
        • no users using it...
        • nearly no semantics...
      • new technologies need new skills. I guess everyone agrees about that but we don't see it with museums.... why?

Metadatadebt - the cost of making your data usable when good metadata is lacking.

This is an example of Metadatadebt caused by text-string matching of entity names --> the result is a database with very bad search precision, a lot of noise, and data that is not trustworthy. One reason for having good linked data is to uniquely identify a person/place/organisation/concept even if the names are the same. Not caring about or understanding this makes the data more or less unusable, and I as an end user can't understand what is presented. In this case it's a person's name, "Carl Larsson", that is assumed to be unique. As both Carl (also spelled Karl) and Larsson are on the lists of the most common given names and surnames in Sweden, it's asking for problems not to use Linked data and 5 star open data, i.e. link your data to other data to provide context..... NOT send text strings...
Pictures from 5stardata.info
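
A minimal sketch of the difference between matching on names and matching on identifiers. The three records and their item ids are invented for illustration; only the two Wikidata QIDs come from the text above.

```python
from collections import defaultdict

# Hypothetical records: item ids are made up, the QIDs are the two
# Carl Larssons mentioned above (painter and photographer).
records = [
    {"item": "item-1", "creator_name": "Carl Larsson", "creator_qid": "Q187310"},   # painter
    {"item": "item-2", "creator_name": "Carl Larsson", "creator_qid": "Q5937128"},  # photographer
    {"item": "item-3", "creator_name": "Carl Larsson", "creator_qid": "Q5937128"},  # photographer
]


def group_by(rows, key):
    """Group item ids by the value found under `key` in each record."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["item"])
    return dict(groups)


# Text-string matching: both persons end up in one bucket.
by_name = group_by(records, "creator_name")

# Identifier matching: the painter and the photographer stay separate.
by_qid = group_by(records, "creator_qid")

print(by_name)
print(by_qid)
```

With the name as key there is no way to tell afterwards which item belongs to whom; with the identifier as key the question never comes up.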

The cost of very bad metadata and creating new #fakemetadata

Other websites don't link to Europeana:
  • en:Wikipedia, with 112 billion yearly views, could with two lines of code add 160 000 links to Europeana using the links in Wikidata, but decided not to do that because of the lack of quality in Europeana, see T243764
    • interesting is that this was reported to Europeana on 6 Jan 2020 but we can't track Europeana's actions. Compare how a modern and successful project like Wikidata has workboards where you can track/comment/subscribe to changes
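
To show how little is needed once the identifiers are in Wikidata, here is a rough sketch of building such a link. The base URL is my assumption, and the small dictionary is a hypothetical stand-in for what would be read from Wikidata; only Q187310, Q5937128 and agent/base/60886 come from this post.

```python
# Assumption: base URL for Europeana entity URIs. Check it before relying on it.
EUROPEANA_BASE = "https://data.europeana.eu/"

# Hypothetical extract of Wikidata: QID -> Europeana entity identifier.
wikidata_europeana_ids = {
    "Q187310": "agent/base/60886",  # the painter Carl Larsson
}


def europeana_link(qid, ids=wikidata_europeana_ids):
    """Return a Europeana URL for a Wikidata item, or None if no identifier is stored."""
    entity_id = ids.get(qid)
    return EUROPEANA_BASE + entity_id if entity_id else None


painter_link = europeana_link("Q187310")
photographer_link = europeana_link("Q5937128")  # no identifier stored in this sketch

print(painter_link)
print(photographer_link)
```

The link is only as good as its target: if the Europeana entity mixes two persons, the reader who follows the link gets the mixed result.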

Quality issues Europeana T243764 --> decision: don't link them
112 B yearly page views en:Wikipedia
as we use Linked data it can easily scale to
all Wikipedias and 254B views in > 300 languages

Wikidata Change Stream
How Google in 20 minutes reads the Wikidata change stream,
adds this metadata to its knowledge graph
and delivers a better product --> tweet

Europeana <-> Wikidata

I am a big fan of Linked data and Wikidata and started to link a lot of Swedish institutions to get added value. The latest work I did was adding 160 000 connections with Europeana Collections to Wikidata (see links below). The problem is that a person like Carl Larsson = Wikidata Q187310 is mixed up with a photographer Carl Larsson = Wikidata Q5937128, from whom the Swedish museum Gävleborgs Länsmuseum has more than 1 000 000 items, many of which are uploaded to Europeana.

bigger picture / Europeana link and issue reported 2020 jan 6 T240809#5777845

I guess the reason we get this problem in Europeana is:

  • lack of entity management in Europeana
    • Items that are linked to an entity at the Swedish county museum Gävleborg are reduced to the text "Carl Larsson" when uploaded to Europeana
    • Europeana people then make a big mistake: they text-match everything called Carl Larsson and believe it is the same person
  • lack of feedback/error tracking
    • Europeana lacks basic change management. This error was reported in early 2020, see task T240809#5777845, and we have not yet got a helpdesk ID or an action plan.
  • no network of museums that speak with each other and care about quality in Europeana
    • compare en:Wikipedia, which reacted and in 2 weeks decided that we can't link Europeana as the quality is not good enough, see T243764
    • The cost of bad metadata: not getting linked 160 000 times from one of the biggest websites with > 112 Billion yearly views 

Quality issues  T243764

Carl Larsson agent/base/60886

Good metadata is converted
to text strings and then
Europeana starts guessing and adds new errors

Background: how Wikidata and Linked Data scale

As we add this connection Europeana <-> Wikidata we can also easily add links from different language versions of Wikipedia to Europeana, e.g.

How this looks in Russian Wikipedia for Carl Larsson = Ларссон, Карл Улоф

How this looks in Bulgarian Wikipedia for Carl Larsson = Карл Ларсон

How this looks in Spanish Wikipedia = Carl_Larsson

A guess why we have this problem with "Strings" in Europeana

  1. The Swedish museum Gävleborgs Länsmuseum is doing excellent work tracking objects that belong to the photographer Carl Larsson = Wikidata Q5937128
  2. Gävleborgs Länsmuseum uploads its data to Digitalmuseum and we can easily find the photographer Carl Larsson = Wikidata Q5937128 objects using the Authority 48fd203b-2b93-4b0e-89a4-64e0a4509ce0. We don't do name matching --> we have control of the data
  3. Objects from the Swedish Digitalmuseum database are uploaded to Europeana and during this process there is no entity management. Instead text strings are used, i.e. "Carl Larsson" does not uniquely identify whether a person is
    1. same as Wikidata Q5937128 
    2. or same as Wikidata Q187310

      they are both stored in the system as the text string "Carl Larsson" and we have a very big metadatadebt. Basic things like copyright can be based on the creator's death date, so losing control of who is who also means losing control of many other parameters, and you need to ask the original source if they have control.... (hopefully you can identify the record in the original system)

      In this case the Europeana data is useless for understanding "who is who"
  4. My guess is that the Europeana people have no understanding that in Sweden we can have several people with the same name Carl Larsson. "Merging" all Carl Larssons and calling them
    Europeana agent/base/60886 is something we are glad banks don't do ;-) --> we have a mess
    1. Small test of the Europeana data ...
      1. maybe filter on date as the painter Carl Larsson died January 22, 1919 and the photographer Carl Larsson had a company that delivered in his name later than that date
      2. Most of the items from Gävleborgs Länsmuseum are not by the painter, see test search with filter "proxy_dc_publisher:"Länsmuseet%20Gävleborg" --> 43,900 objects
        1. another test filter on aggregator

          could work, but as our two Carl Larssons were active during the same period and worked just 75 km from each other I guess the aggregator filtering will not help us....
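
The date filter idea from the small test above can be sketched like this. The three items and their dates are invented; only the painter's death date comes from the text. It is a partial check at best: it can rule the painter out for later items, but it says nothing about items made while both were active or items without a date.

```python
from datetime import date

# From the text above: the painter Carl Larsson died January 22, 1919.
PAINTER_DEATH = date(1919, 1, 22)


def cannot_be_painter(created):
    """True if the item was created after the painter died.

    An unknown date (None) rules nothing out, so it returns False.
    """
    return created is not None and created > PAINTER_DEATH


# Hypothetical items: ids and dates are made up for illustration.
items = [
    {"id": "a", "created": date(1905, 6, 1)},
    {"id": "b", "created": date(1931, 3, 15)},
    {"id": "c", "created": None},
]

ruled_out = [i["id"] for i in items if cannot_be_painter(i["created"])]
undecided = [i["id"] for i in items if not cannot_be_painter(i["created"])]

print(ruled_out)
print(undecided)
```

Everything in `undecided` still needs the identifier from the original source, which is the point of this post.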

The Solution

  1. Better system for communication and tracking errors
  2. Better skills when handling entities
As Wikipedia is now also moving the project Wikicommons in the direction of Linked data, this challenge will explode --> today organisations need to step up and add new skills if they want to avoid issues like the above and use the new technology. I guess we need

  • Entity change management
    • If we try to have Linked data roundtrip we also need to synchronize entities

About Structured Data on Commons: a solution with Linked data describing pictures

Some definitions
  • Linked data roundtrip: the process when a picture is uploaded to Wikicommons and linked data is added to the picture and you would like to feed this metadata back to the original uploading system.

    As we now have Linked data --> we also need entity data management and synchronization of changes to the linked data entities themselves
  • Metadatadebt: compare technical debt. When the metadata we manage lacks quality we have to pay a price for correcting it. In the case above I guess Europeana needs to reload all metadata and set up entity data management between Europeana and all Europeana aggregators and providers

    An example of the cost paid by Europeana is that in the example above Europeana missed the opportunity to get linked from 140 000 articles in en:Wikipedia because of entity metadatadebt. As en:Wikipedia has > 15 B monthly page views I guess it's an insane "debt" and lesson-learned fee the Europeana people have to pay.

    Today I feel the Europeana community tries to hide this problem as we see no emergency actions, see
    • T243764 "en:Wikipedia <-> Europeana Entity has problem with the quality of Europeana Linked data"
  • Entity data management 
    • When handling linked data we need to have change management also on the linked data objects.

      If we are to do Linked data roundtrips on uploaded pictures we need to find a way to synchronize objects used in the two linked data domains, or do ontology reuse. Wikicommons has chosen to reuse the Wikidata ontology, i.e. a change in Wikidata will directly be reflected in Wikicommons. See also the OCLC report "Creating Library Linked Data with Wikibase" and "Lesson 5: To populate knowledge graphs with library metadata, tools that facilitate the reuse and enhancement of data created elsewhere are recommended"

      Steps I think are getting more and more important as entity management will also be the de facto standard for metadata in pictures
      1. pictures need to have unique persistent identifiers
      2. when uploading a picture we need to be able to track it
        1. metadata added to a picture
        2. metadata deleted
        3. if this picture is downloaded to another platform
      • Example of how Wikicommons now uses Wikibase and has unique identifiers with linked metadata available in JSON
      • My guess is that with this new complexity of handling entity management and linked data between loosely coupled systems, the complexity for the end user will increase and we need to design better user interfaces "hiding" this complexity from the end users

        Today in Wikicommons we import pictures and retype the metadata
        • with metadata that consists of entities like Q87101582 Q30312943 it is not possible for the end user to retype this

          we need people designing digital archives to sit down and define an infrastructure that supports a user interface for the end user with "drag-and-drop" that takes care of
          1. controlling the copyright of the picture
          2. translating entities from one platform to the user's current platform
          3. telling the user that this picture is said to depict person xxx, which is yyy on the "old platform", and asking whether to create this entity as same as external ID yyy

  • tweet about this #Metadatadebt
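
The entity translation step described above could be sketched roughly like this. The function name, the returned fields and the second identifier are hypothetical; the authority id and the QID for the photographer are the ones mentioned earlier in this post.

```python
# Known "same as" links: identifier on the old platform -> entity on the current platform.
same_as = {
    "48fd203b-2b93-4b0e-89a4-64e0a4509ce0": "Q5937128",  # photographer Carl Larsson
}


def translate_entity(old_id, mapping):
    """Translate an entity id from the old platform to the current one.

    If no mapping exists, return a proposal the user interface can show to
    the user instead of silently guessing from the name.
    """
    target = mapping.get(old_id)
    if target is not None:
        return {"action": "reuse", "entity": target}
    return {"action": "ask_user_to_create", "external_id": old_id}


known = translate_entity("48fd203b-2b93-4b0e-89a4-64e0a4509ce0", same_as)
unknown = translate_entity("some-unmapped-id", same_as)  # made-up id

print(known)
print(unknown)
```

The design choice is that the fallback is a question to the user that keeps the old identifier, never a match on the text string.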
