Friday, 20 March 2020

Wikibase et al. FactGrid

FactGrid is a project running Wikibase

WARNING: it is somewhat complex

Wikicommons

has for some months also been running Wikibase, i.e. images can have structured data

Wikidata



The vision of Linked Data in 5 minutes: why it is good for a "memory" organisation



Linked Open Data - What is it? from Europeana on Vimeo.

A video I made about this



Wednesday, 18 March 2020

Where are we heading with sv:Wikipedia and Wikidata?

Prompted by a question on sv:Wikipedia about how external identifiers should be displayed in sv:Wikipedia, I am writing down some thoughts.

Listening to John at Wikimedia Sverige, there are big ideas, and their whole concept builds on Wikipedia being good, but I find Wikimedia Sverige very anonymous and visionless in their activities - correct me if I am wrong... 

Listening to John, he talks about the movement, but he and his colleagues are barely visible on sv:Wikipedia, and I also cannot see that the activity goals they set up in the LIBRISXL project amount to anything more than what has always been done, such as 
  • uploading a few images to Wikicommons
     
  • LIBRIS librarians writing a few wiki articles.
There should be bigger concrete ideas, where one piece of the puzzle is how external Wikidata identifiers should evolve and be shown (or not shown) in sv:Wikipedia...

I think Wikimedia Sverige should discuss how they see Wikipedia/Wikidata developing together with Swedish cultural heritage/museums/archives. 

Will Wikimedia Sverige roll out 100 Wikibase installations in Sweden, and how would that affect infrastructures and Wikidata identifiers?

In my LOD tests I see that Europeana does not get it together (see this blog post) because 
  • Linked data is hard and requires completely different competences than moving images or copying text
  • Wikidata/Wikipedia does not match well with what is in the museums' archives.

    Wikidata/Wikipedia has notability criteria, so local museums outside Stockholm and Gothenburg feel like they have little in common with the people being described.

    I did a small test with Länsmuseet in Gävle: of the people in their images who can be found in a cemetery in Gävle, fewer than 10 are in SBL or SKBL. Link: Gävle gamla kyrkogård

A vision


The elegant ecosystem I see in front of me is that
  1. a person exists in Wikidata
  2. Länsmuseet Gävleborg states that it has images of person xxx, who is the same as a Wikidata Q number
  3. Länsmuseet Gävleborg uploads its images to Digitalt museum
  4. From Digitalt museum the images are sent via K-samsök to Europeana, where Europeana sees that the referenced person is missing on its side and creates that person based on the metadata in Wikidata, and Europeana then updates Wikidata Europeana Entity (P7704) with the new agent id it created
  5. The Russian, Spanish, Bulgarian... Wikipedias that show Europeana Entity (P7704) in their authority-control boxes then get a link to Europeana and can see the images Länsmuseet Gävleborg uploaded seconds later....
After yesterday's failed deployment at Europeana we got in touch, and maybe we can get all the parts to work together. It still feels wrong that things are not clearly formulated. Europeana just released its strategy 2020-2025 (pdf), but when I see what they deliver today, and that they hope AI will solve parts of it, I start to doubt that they see the possibilities that exist, e.g. SDC and data roundtrips / "reuse Wikidata's ontology".
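The five numbered steps above could be sketched as a tiny simulation. Everything below (the function name, the agent store, the `agent/base/NEW-…` id format) is invented for illustration; only the Q number comes from the post.

```python
# Hypothetical sketch of the vision's steps 1-5; not how Europeana actually works.

# Steps 1-2: the museum links its local person record to a Wikidata Q number
museum_record = {"name": "Larsson, Carl", "same_as_wikidata": "Q5937128"}

# Step 4: Europeana's agent store, keyed by Wikidata Q number (hypothetical)
europeana_agents = {}

def ingest_at_europeana(record, agents):
    """If the referenced person is missing, create an agent from the
    Wikidata metadata and return the id that P7704 should point to."""
    qid = record["same_as_wikidata"]
    if qid not in agents:
        agents[qid] = {"id": f"agent/base/NEW-{len(agents) + 1}", "source": qid}
    return agents[qid]["id"]

# Steps 4-5: writing this value to P7704 on Wikidata is what lets every
# Wikipedia language version render a link to Europeana immediately.
p7704_value = ingest_at_europeana(museum_record, europeana_agents)
print(p7704_value)  # agent/base/NEW-1
```

The point of the sketch is the write-back in step 4: without the new agent id flowing back into P7704, step 5 (the Wikipedias linking to Europeana) never happens.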

Problem: not all persons qualify for Wikidata

  1. How should Swedish museums do the same for persons who do not qualify for Wikidata?
    1. Shouldn't GLAM Wikimedia Sverige have a Wikibase with room for everyone...

Tuesday, 17 March 2020

Example of a Europeana record that is wrongly connected


This is an attempt to understand the JSON for a record in Europeana.

1) Below, the picture that is wrongly connected to Europeana Entity agent/base/60886, 

which is the same as Wikidata Q187310; the correct person is Q5937128. See also the following blog post about the consequences.

The file I assume is sent to Europeana is http://kulturarvsdata.se/S-XLM/photo/XLMCL014128-3


==> I think we can see that the record is connected to http://data.europeana.eu/agent/base/60886 ==> Wikidata search ==> Wikidata  Q187310 


This is done by the Europeana proxy, and the data sent to Europeana seems to lack "same as" information:


  • dc:creator = Larsson, Carl
    • "same as" is missing
      • Wikidata P1248 KulturNav = 48fd203b-2b93-4b0e-89a4-64e0a4509ce0
      • Wikidata Q5937128
      • Wikidata P7704 Europeana Entity - I can see the following challenges
        • an entity is missing in Europeana for this person, even though a Wikidata item exists...... 
        • lessons learned when working with this Swedish museum (link)
          • Most people from this museum are "local people"
            • most of them have no authority record in KulturNav 
          • Most of the people are not in Wikidata, as they are local people
          • As Europeana started from a subset of DBpedia, which is a subset of Wikidata, many people have no agent in Europeana
            • See task T243907, which tracks how new entities can be created in Europeana
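The missing "same as" chain can be shown as two lookups. The two identifiers below are the real ones quoted above; the dictionaries and the `resolve` function are only an illustration of where the chain breaks.

```python
# Illustrative "same as" chain for this record: KulturNav -> Wikidata -> Europeana.
kulturnav_to_wikidata = {  # Wikidata P1248 (KulturNav-id) -> Q number
    "48fd203b-2b93-4b0e-89a4-64e0a4509ce0": "Q5937128",
}
wikidata_to_europeana = {  # Wikidata P7704 -> Europeana agent id
    # Q5937128 is missing here: no Europeana entity exists yet (see task T243907)
}

def resolve(kulturnav_id):
    """Follow the chain; None marks the broken link."""
    qid = kulturnav_to_wikidata.get(kulturnav_id)
    agent = wikidata_to_europeana.get(qid)
    return qid, agent

qid, agent = resolve("48fd203b-2b93-4b0e-89a4-64e0a4509ce0")
print(qid, agent)  # Q5937128 None -> the chain breaks at Europeana
```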

The source Digitalmuseum

  • JSON
    • Photographer =
      • type: null
      • role: "Fotograf"
      • name: "Larsson, Carl"
      • uuid: "0B628C53-4BEB-4911-B4AC-F235168E706E"

My (Magnus Sälgö) unskilled guess is that this picture IS NOT connected to Carl Larsson = Wikidata Q5937128 ==> we get problems in Europeana
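The difference between what happens today and identifier-based matching can be sketched as follows. The JSON fragment is the Photographer block from above; the authority table and both matching functions are hypothetical, not any real Europeana code.

```python
import json

# The Photographer block from the Digitalmuseum JSON above
photographer = json.loads("""{
  "type": null,
  "role": "Fotograf",
  "name": "Larsson, Carl",
  "uuid": "0B628C53-4BEB-4911-B4AC-F235168E706E"
}""")

# Illustrative authority table: only the KulturNav uuid quoted earlier
# is known to belong to the photographer Q5937128.
known_agents = {"48fd203b-2b93-4b0e-89a4-64e0a4509ce0": "Q5937128"}

def match_by_name(name):
    # What the pipeline effectively does today: any "Larsson, Carl"
    # is merged into the painter's agent (the mistake described above).
    return "Q187310" if name == "Larsson, Carl" else None

def match_by_uuid(uuid):
    # Identifier-based matching: no match unless the uuid is known,
    # so nothing is silently merged.
    return known_agents.get(uuid)

print(match_by_name(photographer["name"]))  # Q187310 - the wrong person
print(match_by_uuid(photographer["uuid"]))  # None - don't guess, ask the source
```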

The JSON in Kulturarvsdata S-XLM/photo/XLMCL014128-3



2) Another "better" example


 If we look at another example by the photographer Carl Larsson 


  • Europeana record/916124/S_XLM_photo_XLMCLP007303
    • JSON
      • the same person "Carl Larsson" is now "Larsson, Carl J." and is not matched,
        and it looks like the identifier for Carl Larsson Wikidata Q5937128 is not available in Europeana
        • id 48fd203b-2b93-4b0e-89a4-64e0a4509ce0
3) Another example, from Malmö Konstmuseum: Karin at the shore


This JSON has two Agent entities:
  1. viaf.org/viaf/4932213 comes from the provider
  2. data.europeana.eu/agent/base/60886 comes from Europeana
Conclusion: the Swedish aggregator K-samsök doesn't send good enough data to Europeana for it to make good decisions

Question: What is missing?

See also:







Monday, 9 March 2020

Carl Larsson - who is that? Sadly, Europeana doesn't know --> #Metadatadebt

Metadatadebt - the cost of making your data usable, caused by the lack of good metadata. 

This is an example of metadatadebt caused by text-string matching of entity names --> the result is a database with very bad search precision, a lot of noise, and little trustworthiness. One reason for having good linked data is to uniquely identify a person/place/organisation/concept even when the names are the same. Not caring about/understanding this makes the data more or less unusable, and I as an end user can't understand what is presented. In this case it is a person's name, "Carl Larsson",
that is assumed to be unique. As both Carl (also spelled Karl) and Larsson are on the lists of the most common given names and surnames in Sweden, not using linked data is asking for problems..... 
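The precision problem can be reduced to one question: how many distinct people does a field give you? Grouping the two real Q numbers from this post on the name string versus on the identifier makes the point (the records and the grouping helper are illustrative):

```python
from collections import defaultdict

# Illustrative records: two different people share the name "Carl Larsson"
records = [
    {"name": "Carl Larsson", "wikidata": "Q187310"},   # the painter
    {"name": "Carl Larsson", "wikidata": "Q5937128"},  # the photographer
]

def group_by(items, key):
    """Group records on one field; the number of groups shows whether
    that field can tell the two people apart."""
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item)
    return groups

by_name = group_by(records, "name")      # 1 group: the two are merged
by_qid = group_by(records, "wikidata")   # 2 groups: correctly kept apart
print(len(by_name), len(by_qid))  # 1 2
```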


Europeana <-> Wikidata

I am a big fan of linked data and Wikidata and have started to link a lot of Swedish institutions to get added value. The latest work I did was adding 160 000 connections with Europeana Collections to Wikidata (see links below). The problem is that a person like Carl Larsson = Wikidata Q187310 is mixed up with the photographer Carl Larsson = Wikidata Q5937128, from whom the Swedish museum Gävleborgs Länsmuseum has more than 1 000 000 items, many of which are uploaded to Europeana.



Bigger picture / Europeana link; the issue was reported 6 Jan 2020 in T240809#5777845

I guess the reasons we get this problem in Europeana are:

  • lack of entity management in Europeana
    • Items that are connected at the Swedish county museum Gävleborg are translated to the text "Carl Larsson" when uploaded to Europeana
    • Europeana then makes the big mistake of text-matching everything called Carl Larsson and believing it is all the same person
  • lack of feedback/error tracking
    • Europeana misses basic change management: this error was reported in early 2020 (see task T240809#5777845) and we have received no helpdesk id or action plan yet. 

Background: how Wikidata and linked data scale 

As we add this connection Europeana <-> Wikidata, we can easily also add links from the different language versions of Wikipedia to Europeana, e.g.

How this looks in Russian Wikipedia for Carl Larsson = Ларссон, Карл Улоф


How this looks in Bulgarian Wikipedia for Carl Larsson = Карл Ларсон


How this looks in Spanish Wikipedia = Carl_Larsson


A guess at why we have this problem with "strings" in Europeana

  1. The Swedish museum Gävleborgs Länsmuseum is doing excellent work tracking objects that belong to the photographer Carl Larsson = Wikidata Q5937128
     
  2. Gävleborgs Länsmuseum uploads its data to Digitalmuseum, and we can then easily find the photographer Carl Larsson = Wikidata Q5937128's objects using the authority 48fd203b-2b93-4b0e-89a4-64e0a4509ce0; we don't do name matching --> we have control of the data
     
  3. Objects from the Swedish Digitalmuseum database are uploaded to Europeana, and during this process there is no entity management; instead text strings are used, i.e. "Carl Larsson" does not uniquely identify whether a person is
     
    1. the same as Wikidata Q5937128 
    2. or the same as Wikidata Q187310

      They are both stored in the system as the text string "Carl Larsson", and we have a very big metadatadebt, as basic things like copyright can be based on the creator's death date. Losing control of who is who also means losing control of many other parameters, and you need to ask the original source whether they have control.... (hopefully you can identify the record in the original system)

      In this case the Europeana data is useless for understanding "who is who"
       
  4. My guess is that the Europeana people do not understand that in Sweden we can have several people with the same name Carl Larsson, and "merging" all the Carl Larssons and calling them
    Europeana agent/base/60886 is something we are glad banks don't do ;-) --> we have a mess
     
    1. Small test of the Europeana data ...
       
      1. maybe filter on date, as the painter Carl Larsson died 22 January 1919 and the photographer Carl Larsson had a company that delivered under his name later than that date
         
      2. Most of the items from Gävleborgs Länsmuseum are not by the painter; see a test search with the filter "proxy_dc_publisher:"Länsmuseet%20Gävleborg"" --> 43,900 objects
         
        1. another test: filter on the aggregator

          f[PROVIDER][]=Swedish+Open+Cultural+Heritage+|+K-samsök
          could work, but as our Carl Larssons were active during the same period and worked just 75 km from each other, I guess the aggregator filtering will not help us....   
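The two filters tried above can be combined into a single request against the Europeana Search API (v2). This is only a sketch of building the query URL: the API key is a placeholder, and whether these filters actually isolate the photographer is exactly what is in doubt above.

```python
from urllib.parse import urlencode

# Combine the publisher filter and the provider filter from the tests above
# into one Europeana Search API v2 request (wskey/query/qf are the documented
# parameter names; YOUR_API_KEY is a placeholder).
params = {
    "wskey": "YOUR_API_KEY",
    "query": '"Carl Larsson"',
    "qf": [
        'proxy_dc_publisher:"Länsmuseet Gävleborg"',
        'PROVIDER:"Swedish Open Cultural Heritage | K-samsök"',
    ],
}
url = "https://api.europeana.eu/record/v2/search.json?" + urlencode(params, doseq=True)
print(url)
```

`doseq=True` repeats the `qf` parameter once per filter, which is how the API expects multiple facet restrictions to be sent.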

The Solution

  1. A better system for communicating and tracking errors
  2. Better skills when handling entities
As Wikipedia is now also moving the Wikicommons project in the direction of linked data, this challenge will explode --> today organisations need to step up and add new skills if they want to avoid issues like the above and use the new technology. I guess we need

  • Entity change management
    • If we try to have a linked data roundtrip, we also need to synchronize entities



About Structured Data on Commons: a solution with linked data describing pictures




Some definitions
  • Linked data roundtrip: the process where a picture is uploaded to Wikicommons, linked data is added to the picture, and you would like to feed this metadata back to the original uploading system.

    As we now have linked data --> we also need entity data management and synchronization of changes to the linked data entities themselves
     
  • Metadatadebt: compare technical debt; when the metadata we manage lacks quality, we have to pay a price for correcting it. Above, I guess Europeana needs to reload all metadata and set up entity data management between Europeana, all the Europeana aggregators and the providers.

    An example of the cost paid by Europeana: in the example above, Europeana missed the opportunity to be linked from 140 000 articles in en:Wikipedia because of entity metadatadebt. As en:Wikipedia has > 15 B monthly page views, I guess it is an insane "debt" and lesson-learned fee that the Europeana people have to pay.

    Today I feel the Europeana community tries to hide this problem, as we see no emergency actions; see
     
    • T243764 "en:Wikipedia <-> Europeana Entity has problems with the quality of Europeana linked data"
       
  • Entity data management 
    • When handling linked data we need change management also for the linked data objects.

      If we are to do linked data roundtrips on uploaded pictures, we need to find a way to synchronize the objects used in the two linked data domains, or do ontology reuse. Wikicommons has chosen to reuse the Wikidata ontology, i.e. a change in Wikidata will directly be reflected in Wikicommons. See also the OCLC report "Creating Library Linked Data with Wikibase" and "Lesson 5: To populate knowledge graphs with library metadata, tools that facilitate the reuse and enhancement of data created elsewhere are recommended"

      Steps that I think are getting more and more important, as entity management will also become the de facto standard for metadata in pictures
       
      1. pictures need to have unique persistent identifiers
      2. when uploading a picture we need to be able to track
        1. metadata added to the picture
        2. metadata deleted
        3. whether the picture is downloaded to another platform
      • Example of how Wikicommons now uses Wikibase and has unique identifiers with linked metadata available in JSON
         
      • My guess is that with this new complexity of handling entity management and linked data between loosely coupled systems, the complexity for the end user will increase, and we need to design better user interfaces that "hide" this complexity from the end users

        Today in Wikicommons we import pictures and retype the metadata
         
        • with metadata that consists of entities like Q87101582 Q30312943, retyping is not possible for the end user

          we need the people designing digital archives to sit down and define an infrastructure that supports a "drag-and-drop" user interface for the end user that takes care of
          1. checking the copyright of the picture
          2. translating entities from one platform to the user's current platform
          3. telling the user that this picture is said to depict person xxx, who is yyy on the "old platform": do you want to create this entity as the same as external ID yyy?
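The tracking requirements above (a persistent identifier plus a log of metadata added, metadata deleted, and downloads to other platforms) could be modeled as an append-only event log per picture. This is entirely a hypothetical sketch, not an existing API; the id format is invented.

```python
from dataclasses import dataclass, field

@dataclass
class PictureLog:
    """Append-only change log for one picture (hypothetical design):
    the persistent identifier is requirement 1, the event log is requirement 2."""
    picture_id: str
    events: list = field(default_factory=list)

    def record(self, action, detail):
        # action is one of: "metadata_added", "metadata_deleted",
        # "downloaded_to" - the three things listed above that need tracking
        self.events.append({"action": action, "detail": detail})

log = PictureLog("sketch-persistent-id-001")  # id format is invented
log.record("metadata_added", {"depicts": "Q5937128"})
log.record("downloaded_to", "another-platform.example")
print(len(log.events))  # 2
```

An append-only log is one simple way to make roundtrips auditable: the receiving platform can replay the events instead of guessing what changed.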



  • tweet about this #Metadatadebt