ENH: Linked Datasets (RDF) #3402

Closed
@westurner

  • This is very much a meta ticket.
  • There are a number of bare links here.
  • They are for documentation

(UPDATE: see westurner/pandasrdf#1)

Use Case

So I:

  • retrieved some data
    • from somewhere
    • about a certain #topic
  • performed analysis
    • with certain transformations and aggregations
    • with certain versions of certain tools
    • confirmed/rejected a [null] hypothesis

and I want to share my findings so that others can find, review, repeat, reproduce, and verify (confirm/reject) a given conclusion.

User Story

As a data analyst, I would like to share or publish Series, DataFrames, Panels, and Panel4Ds as structured, hierarchical, RDF linked data ("DataSet").

Status Quo: Pandas IO

How do I go from a [CSV] to a DataFrame to something shareable with a URL?

http://pandas.pydata.org/pandas-docs/dev/io.html

  • Series (1D)
    • index
    • data
      • NumPy datatypes
  • DataFrame (2D)
    • index
    • column(s)
      • NumPy datatypes
  • Panel (3D)
  • Panel4D (4D)

Read or parse a data format into a DataSet:

Add metadata:

  • Add RDF metadata (RDFa, JSONLD)

Save or serialize a DataSet into a data format:

  • pandas.DataFrame.
    • to_csv
    • to_dict
    • to_excel
    • to_gbq
    • to_html
    • to_latex
    • to_panel
    • to_period
    • to_records
    • to_sparse
    • to_sql
    • to_stata
    • to_string
    • to_timestamp
    • to_wide
  • to_ RDF
  • to_ CSVW
  • to_ HTML + RDFa
  • to_ JSONLD

Share or publish a serialized DataSet with the internet:

Implementation

What changes would be needed for Pandas core to support this workflow?

  • .meta schema
  • to_rdf for Series, DataFrames, Panels, and Panel4Ds
  • read_rdf for Series, DataFrames, Panels, and Panel4Ds
  • ~@datastep process decorators
  • ~DataSet
  • ~DataCatalog of precomputed aggregations/views/slices.
  • Units support (.meta?)

.meta schema

It's easy enough to serialize a dict and a table to naive RDF.

For interoperability, it would be helpful to standardize with a common
set of terms/symbols/structures/schema for describing
the tabular, hierarchical data which pandas is designed to handle.

There is currently no standard method for storing columnar metadata
within Pandas (e.g. in .meta['columns'][colname]['schema'], or as a JSON-LD @context).
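
As a rough sketch of what such column metadata might look like, the following uses a plain dict shaped like `.meta['columns'][colname]['schema']` and derives a JSON-LD `@context` from it. The `meta` layout, the `meta_to_context` helper, and the example column names and term URIs are all assumptions for illustration; none of this is an existing pandas API.

```python
import json

# Hypothetical column metadata, keyed like .meta['columns'][colname]['schema'].
meta = {
    "columns": {
        "name": {"schema": {"@id": "http://schema.org/name"}},
        "height_cm": {
            "schema": {
                "@id": "http://example.com/vocab#heightCm",  # made-up term
                "@type": "http://www.w3.org/2001/XMLSchema#decimal",
            }
        },
    }
}


def meta_to_context(meta):
    """Build a JSON-LD @context mapping each column name to its term definition."""
    return {
        "@context": {
            colname: dict(col["schema"])
            for colname, col in meta["columns"].items()
        }
    }


context = meta_to_context(meta)
print(json.dumps(context, indent=2, sort_keys=True))
```

Keeping the per-column schema as JSON-LD term definitions means the same dict could serve both as in-memory metadata and as the `@context` of a serialized document.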

Ontology Resources
CSV2RDF (csvw)
W3C PROV (prov:)
schema.org (schema:)
  • http://schema.org
  • http://www.w3.org/wiki/WebSchemas
  • http://schema.rdfs.org/
  • https://schema.org/docs/full.html :
    • schema:Dataset -- A body of structured information describing some topic(s) of interest.
      • [schema:Thing, schema:CreativeWork]
      • distribution -- A downloadable form of this dataset, at a specific location, in a specific format (DataDownload)
      • spatial, temporal
      • catalog -- A data catalog which contains a dataset (DataCatalog)
    • schema:DataCatalog -- collection of Datasets
      • [schema:Thing, schema:CreativeWork]
      • dataset -- A dataset contained in a catalog. (Dataset)
    • schema:DataDownload -- A dataset in downloadable form.
      • [schema:Thing, schema:CreativeWork]
      • contentSize
      • contentUrl
      • uploadDate
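
To make the schema.org terms above concrete, here is a minimal sketch of a `DataCatalog` holding one `Dataset` with a `DataDownload` distribution, written as a JSON-LD document. The names, URLs, and the `encodingFormat` value are placeholders, not taken from a real catalog.

```python
import json

# Placeholder values throughout; only the schema.org types and
# property names are meant to be taken literally.
dataset = {
    "@type": "Dataset",
    "name": "Example dataset",
    "url": "http://example.com/datasets/#example",
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "http://example.com/datasets/example.csv",
        "encodingFormat": "text/csv",
    },
}

catalog = {
    "@context": "http://schema.org",
    "@type": "DataCatalog",
    "name": "Example catalog",
    "dataset": [dataset],
}

catalog_jsonld = json.dumps(catalog, indent=2)
print(catalog_jsonld)
```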
W3C RDF Data Cube (qb:)

to_rdf

http://pandas.pydata.org/pandas-docs/dev/io.html

Arguments:

  • output fmt
  • JSON-LD: compaction

  • Series.meta
  • Series.to_rdf()
  • DataFrame.meta
  • DataFrame.to_rdf()
  • Panel.meta
  • Panel.to_rdf()
  • Panel4D.meta
  • Panel4D.to_rdf()
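
A naive `to_rdf` could emit one triple per cell. The sketch below does that in N-Triples using only the standard library; `records_to_ntriples`, `literal`, and the example base and vocabulary URIs are hypothetical names chosen for illustration, and the input is a list of `(index, row_dict)` pairs rather than an actual DataFrame.

```python
from urllib.parse import quote

XSD = "http://www.w3.org/2001/XMLSchema#"


def literal(value):
    """Format a Python value as an N-Triples literal, typed where possible."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return '"%s"^^<%sboolean>' % (str(value).lower(), XSD)
    if isinstance(value, int):
        return '"%d"^^<%sinteger>' % (value, XSD)
    if isinstance(value, float):
        return '"%r"^^<%sdouble>' % (value, XSD)
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"%s"' % escaped


def records_to_ntriples(records, base_uri, vocab_uri):
    """Emit one triple per cell: <row URI> <column URI> "value" ."""
    lines = []
    for index, row in records:
        subject = "<%s%s>" % (base_uri, quote(str(index)))
        for column, value in row.items():
            predicate = "<%s%s>" % (vocab_uri, quote(str(column)))
            lines.append("%s %s %s ." % (subject, predicate, literal(value)))
    return "\n".join(lines)


# With pandas, `records` could come from:
#   list(df.to_dict(orient="index").items())
records = [(0, {"name": "a", "value": 1.5}), (1, {"name": "b", "value": 2})]
ntriples = records_to_ntriples(
    records, "http://example.com/datasets/demo#row", "http://example.com/vocab#"
)
print(ntriples)
```

This ignores the `.meta` schema entirely; a fuller version would presumably look up each column's predicate URI and datatype there instead of deriving them from the column name.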

read_rdf

http://pandas.pydata.org/pandas-docs/dev/remote_data.html

  • Series.read_rdf()
  • DataFrame.read_rdf()
  • Panel.read_rdf()
  • Panel4D.read_rdf()

Arguments to read_rdf would need to describe which dimensions of data to
read into 1D/2D/3D/4D form.
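
For the 2D case, one way to picture those arguments is a mapping from predicate URIs to column names, with subjects becoming the index. The following is a sketch under that assumption; `triples_to_table` and its `columns` parameter are hypothetical and not part of pandas.

```python
def triples_to_table(triples, columns):
    """Pivot (subject, predicate, object) triples into 2D form.

    `columns` maps predicate URI -> column name; other predicates are ignored.
    Returns (index, rows): subjects in first-seen order, one dict per subject.
    """
    rows = {}
    for s, p, o in triples:
        if p in columns:
            rows.setdefault(s, {})[columns[p]] = o
    index = list(rows)
    return index, [rows[s] for s in index]


VOCAB = "http://example.com/vocab#"
triples = [
    ("http://example.com/d#row0", VOCAB + "name", "a"),
    ("http://example.com/d#row0", VOCAB + "value", 1.5),
    ("http://example.com/d#row1", VOCAB + "name", "b"),
    ("http://example.com/d#row1", VOCAB + "comment", "ignored"),
]
index, rows = triples_to_table(
    triples, {VOCAB + "name": "name", VOCAB + "value": "value"}
)
# With pandas: pandas.DataFrame(rows, index=index)
print(index, rows)
```

Missing cells (here, `value` for the second row) are simply absent from the row dict, which a DataFrame constructor would turn into NaN.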

@datastep / PROV

  • Objective: Additive journal of transformations
  • Link to source script(s) URIs
  • Decorator for annotating data transformations with metadata.
  • Generate PROV metadata for data transformations
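
A minimal sketch of such a decorator, assuming a module-level append-only `JOURNAL` list and record keys loosely named after PROV terms (`startedAtTime`, `endedAtTime`). The `datastep` signature and the script URL are illustrative assumptions; no such decorator exists in pandas.

```python
import functools
from datetime import datetime, timezone

JOURNAL = []  # additive, append-only record of transformations


def datastep(description, source=None):
    """Annotate a transformation and journal each call to it."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = datetime.now(timezone.utc).isoformat()
            result = func(*args, **kwargs)
            JOURNAL.append(
                {
                    "activity": func.__name__,
                    "description": description,
                    "source": source,  # URI of the script defining the step
                    "startedAtTime": started,
                    "endedAtTime": datetime.now(timezone.utc).isoformat(),
                }
            )
            return result

        return wrapper

    return decorator


@datastep("Drop negative readings", source="http://example.com/scripts/clean.py")
def drop_negative(values):
    return [v for v in values if v >= 0]


cleaned = drop_negative([3, -1, 2])
print(cleaned, JOURNAL[-1]["activity"])
```

The journal entries could later be serialized as PROV activities; this sketch only collects them.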

Ten Simple Rules for Reproducible Computational Research (3, 4, 5, 7, 8, 10)

DataCatalog

A collection of Datasets.

  • DataCatalog = dict(that=df1, this=df1.groupby(...).apply(...), also_this=df2)
    • 'this is an aggregation of that'
      • 'this' has a URI
      • 'that' has a URI
  • What if there is no metadata for df2?

Units support

RDF Datatypes

JSON-LD (RDF in JSON)

Linked Data Primer

Linked Data Abstractions

  • Graphs are represented as triples of (s,p,o)
  • Subject, Predicate, Object
  • Queries are patterns with ?references
    • graph.triples((None, None, None))
    • SELECT ?s ?p ?o WHERE { ?s ?p ?o }
  • subjects are linked to objects by predicates
    • subjects and predicates are identified by URI 'keys'
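
To illustrate pattern queries over triples, here is a toy in-memory graph where `None` acts as a wildcard, in the same spirit as `graph.triples((None, None, None))`. The `Graph` class below is a stand-in written for this example, not rdflib's implementation, and the URIs are placeholders.

```python
class Graph:
    """A toy triple store: a set of (subject, predicate, object) tuples."""

    def __init__(self):
        self._triples = set()

    def add(self, triple):
        self._triples.add(tuple(triple))

    def triples(self, pattern):
        """Yield triples matching (s, p, o); None matches anything."""
        for triple in self._triples:
            if all(want is None or want == got for want, got in zip(pattern, triple)):
                yield triple


g = Graph()
g.add(("http://example.com/this", "http://www.w3.org/ns/prov#wasDerivedFrom", "http://example.com/that"))
g.add(("http://example.com/this", "http://schema.org/name", "this"))
g.add(("http://example.com/that", "http://schema.org/name", "that"))

# Everything: equivalent to SELECT ?s ?p ?o WHERE { ?s ?p ?o }
everything = sorted(g.triples((None, None, None)))
# All names: the predicate is fixed, subject and object are wildcards
names = sorted(o for _, _, o in g.triples((None, "http://schema.org/name", None)))
print(len(everything), names)
```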

URIs and URLs

  • a URI is like a URL
  • usually, we expect URLs to be 'dereferenceable' HTTP URIs
  • a URI may start with a different URI prefix
    • urn:
    • uuid:
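
A quick way to see the prefix (scheme) of a URI is the standard library's `urlparse`. The example URIs are placeholders; note that Python's `uuid` module renders UUIDs under the `urn:` scheme, as `urn:uuid:...`.

```python
import uuid
from urllib.parse import urlparse


def scheme_of(uri):
    """Return the URI scheme, i.e. the part before the first ':'."""
    return urlparse(uri).scheme


uris = [
    "http://example.com/datasets/#key",  # dereferenceable in principle
    "urn:isbn:0451450523",               # an identifier, not a location
    uuid.uuid4().urn,                    # e.g. urn:uuid:...
]
for uri in uris:
    print(scheme_of(uri), uri)
```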

SQL and Linked Data

  • there exist standard mappings for whole SQL tablesets
    • rdb2rdf
    • similar to application scaffolding
    • ACL support adds complexity
  • virtuoso supports SQL and RDF and SPARQL
  • rdflib-sqlalchemy maps RDF onto SQL tables
    • fairly inefficiently, when compared to native triplestores

Named Graphs

  • Quads: (g, s, p, o)
  • g: sometimes called the 'context' of a triple
  • Metadata about GRAPH ?g
  • Multiple named graphs in one file: TriX, TriG
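
As a small sketch of the quad model, the following groups `(g, s, p, o)` tuples by their graph name, which is roughly the structure formats like TriG write out. `group_by_graph` and the URIs are illustrative only.

```python
from collections import defaultdict


def group_by_graph(quads):
    """Split (g, s, p, o) quads into {graph name: [(s, p, o), ...]}."""
    graphs = defaultdict(list)
    for g, s, p, o in quads:
        graphs[g].append((s, p, o))
    return dict(graphs)


quads = [
    ("http://example.com/g/raw", "http://example.com/d#row0", "http://schema.org/name", "a"),
    ("http://example.com/g/raw", "http://example.com/d#row1", "http://schema.org/name", "b"),
    ("http://example.com/g/agg", "http://example.com/d#count", "http://schema.org/value", 2),
]
graphs = group_by_graph(quads)
for name, triples in graphs.items():
    print(name, len(triples))
```

Because each graph has its own URI, metadata such as provenance can be attached to the graph as a whole rather than to every triple.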

Linked Data Formats

  • NTriples
  • RDF/XML
    • TriX
  • Turtle, N3
    • TriG
  • JSON-LD

Choosing Schema

  • XSD, RDF, RDFS, DCTERMS
  • Which schema is most popular?
  • Which schema is a best fit for the data?
  • Which schema will search engines index for us?
  • What do the queries look like?
  • Years Later... What is OWL?
  • Why would we start with RDFS now?

Linked Data Process, Provenance, and Schema

DataSets have [implicit] URIs:

http://example.com/datasets/#<key>

Shared or published DataSets have URLs:

http://ckan.example.org/datasets/<key>

DataSets are about certain things:

e.g. URIs for #Tags, Categories, Taxonomy, Ontology

DataSets are derived from somewhere, somehow:

  • where and how was it downloaded? (digital sense)
  • how was it collected? (process control sense)

Datasets have structure:

  • Tabular, Hierarchical
  • 1D, 2D, 3D, 4D
  • Graph-based
    • Chains
    • Flows
  • Schema

5 ★ Open Data
http://5stardata.info/
http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data

☆ Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).
☆☆ Publish structured data on the Web in a machine-readable format (e.g., XML).
☆☆☆ Publish structured data on the Web in a documented, non-proprietary data format (e.g., CSV, KML).
☆☆☆☆ Publish structured data on the Web as RDF (e.g., Turtle, RDFa, JSON-LD, SPARQL).
☆☆☆☆☆ In your RDF, have the identifiers be links (URLs) to useful data sources.

https://en.wikipedia.org/wiki/Linked_Data
