Add validation command to CASE-Utilities-Python #21

Merged
merged 13 commits into from
Nov 2, 2021
3 changes: 3 additions & 0 deletions .gitmodules
@@ -1,3 +1,6 @@
[submodule "dependencies/CASE"]
path = dependencies/CASE
url = https://github.com/casework/CASE.git
[submodule "dependencies/CASE-Examples-QC"]
path = dependencies/CASE-Examples-QC
url = https://github.com/ajnelson-nist/CASE-Examples-QC.git
34 changes: 34 additions & 0 deletions CONTRIBUTE.md
@@ -0,0 +1,34 @@
# Contributing to CASE-Utilities-Python


## Deploying a new ontology version

1. After cloning this repository, ensure the CASE submodule is checked out. This can be done with any of `git submodule init && git submodule update`, `make .git_submodule_init.done.log`, or `make check`.
2. Update the CASE submodule pointer to the new tagged release.
3. The version of CASE is also hard-coded in [`case_utils/ontology/version_info.py`](case_utils/ontology/version_info.py). Edit the variable `CURRENT_CASE_VERSION`.
4. From the top source directory, run `make clean`. This guarantees a clean state of this repository as well as the ontology submodules.
5. Still from the top source directory, run `make`.
6. Any new `.ttl` files will be created under [`case_utils/ontology/`](case_utils/ontology/). Use `git add` to add each of them. (These built files are large, and their patch weight could overshadow manual revisions, so it is fine to commit the built files separately, after the manual changes are committed.)

Here is a sample sequence of shell commands to run the build:

```bash
# (Starting from fresh `git clone`.)
make check
pushd dependencies/CASE
git checkout master
git pull
popd
git add dependencies/CASE
# (Here, edits should be made to case_utils/ontology/version_info.py)
make
pushd case_utils/ontology
git add case-0.6.0.ttl # Assuming CASE 0.6.0 was just released.
# and/or
git add uco-0.8.0.ttl # Assuming UCO 0.8.0 was adopted in CASE 0.6.0.
popd
make check
# Assuming `make check` passes:
git commit -m "Update CASE ontology pointer to version 0.6.0" dependencies/CASE case_utils/ontology/version_info.py
git commit -m "Build CASE 0.6.0.ttl" case_utils/ontology/case-0.6.0.ttl
```
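
Before committing, it can help to confirm that the build produced the file named after the new version. The following is a minimal sketch, not part of this repository: the helper names `expected_ontology_filename` and `missing_built_files` are hypothetical, and the only assumption taken from the build is the `case-<version>.ttl` naming pattern.

```python
import pathlib
import typing


def expected_ontology_filename(case_version: str) -> str:
    """Name of the monolithic file expected for a CASE version (hypothetical helper)."""
    return "case-" + case_version + ".ttl"


def missing_built_files(
    ontology_dir: pathlib.Path, case_version: str
) -> typing.List[str]:
    """List the expected built files that are absent from ontology_dir."""
    expected = [expected_ontology_filename(case_version)]
    return [name for name in expected if not (ontology_dir / name).is_file()]
```

For example, `missing_built_files(pathlib.Path("case_utils/ontology"), "0.6.0")` would return an empty list once `case-0.6.0.ttl` is in place.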
50 changes: 47 additions & 3 deletions Makefile
@@ -15,7 +15,13 @@ SHELL := /bin/bash

PYTHON3 ?= $(shell which python3.9 2>/dev/null || which python3.8 2>/dev/null || which python3.7 2>/dev/null || which python3.6 2>/dev/null || which python3)

all:
case_version := $(shell $(PYTHON3) case_utils/ontology/version_info.py)
ifeq ($(case_version),)
$(error Unable to determine CASE version)
endif

all: \
.ontology.done.log

.PHONY: \
download
@@ -35,10 +41,28 @@ all:
	$(MAKE) \
	  --directory dependencies/CASE-Examples-QC/tests \
	  ontology_vocabulary.txt
	test -r dependencies/CASE/ontology/master/case.ttl \
	  || (git submodule init dependencies/CASE && git submodule update dependencies/CASE)
	test -r dependencies/CASE/ontology/master/case.ttl
	$(MAKE) \
	  --directory dependencies/CASE \
	  .git_submodule_init.done.log \
	  .lib.done.log
	touch $@

.ontology.done.log: \
  dependencies/CASE/ontology/master/case.ttl
	# Do not rebuild the current ontology file if it is already present. It is expected not to change once built.
	# touch -c: Do not create the file if it does not exist. This will convince the recursive make nothing needs to be done if the file is present.
	touch -c case_utils/ontology/case-$(case_version).ttl
	$(MAKE) \
	  --directory case_utils/ontology
	# Confirm the current monolithic file is in place.
	test -r case_utils/ontology/case-$(case_version).ttl
	touch $@

check: \
  .git_submodule_init.done.log
  .ontology.done.log
	$(MAKE) \
	  PYTHON3=$(PYTHON3) \
	  --directory tests \
@@ -49,12 +73,32 @@ clean:
	  --directory tests \
	  clean
	@rm -f \
	  .git_submodule_init.done.log
	  .*.done.log
	@# 'clean' in the ontology directory should only happen when testing and building new ontology versions. Hence, it is not called from the top-level Makefile.
	@test ! -r dependencies/CASE/README.md \
	  || $(MAKE) \
	    --directory dependencies/CASE \
	    clean
	@# Restore CASE validation output files that do not affect CASE build process.
	@test ! -r dependencies/CASE/README.md \
	  || ( \
	    cd dependencies/CASE \
	      && git checkout \
	        -- \
	        tests/examples \
	        || true \
	  )
	@#Remove flag files that are normally set after deeper submodules and rdf-toolkit are downloaded.
	@rm -f \
	  dependencies/CASE-Examples-QC/.git_submodule_init.done.log \
	  dependencies/CASE-Examples-QC/.lib.done.log

# This recipe guarantees timestamp update order, and is otherwise intended to be a no-op.
dependencies/CASE/ontology/master/case.ttl: \
  .git_submodule_init.done.log
	test -r $@
	touch $@

distclean: \
  clean
	@rm -rf \
32 changes: 29 additions & 3 deletions README.md
@@ -20,6 +20,33 @@ Installation is demonstrated in the `.venv.done.log` target of the [`tests/`](te
## Usage


### `case_validate`

This repository provides `case_validate` as an adaptation of the `pyshacl` command from [RDFLib's pySHACL](https://github.com/RDFLib/pySHACL). The command-line interface is adapted to run as though `pyshacl` were provided the full CASE ontology (and adopted full UCO ontology) as both a shapes and ontology graph. "Compiled" (or, "aggregated") CASE ontologies are in the [`case_utils/ontology/`](case_utils/ontology/) directory, and are installed with `pip`, so data validation can occur without requiring networking after this repository is installed.

To see a human-readable validation report of an instance-data file:

```bash
case_validate input.json
```

If `input.json` is not conformant, a report will be emitted, and `case_validate` will exit with status `1`. (This is a `pyshacl` behavior, where `0` and `1` report the validation result: `0` for conformant data and `1` for non-conformant data. A status greater than `1` is for other errors.)
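
A calling script can branch on that exit status. The sketch below is illustrative only: the function names `interpret_exit_status` and `run_case_validate` are hypothetical, and it assumes `case_validate` is installed and on the `PATH`.

```python
import subprocess
import typing


def interpret_exit_status(status: int) -> str:
    """Map a case_validate exit status to a short label."""
    if status == 0:
        return "conformant"
    if status == 1:
        return "non-conformant"
    # Anything greater than 1 signals an error other than a validation result.
    return "error"


def run_case_validate(in_graph: str) -> typing.Tuple[str, str]:
    """Run case_validate on a file; return (label, human-readable report text)."""
    completed = subprocess.run(
        ["case_validate", in_graph],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return interpret_exit_status(completed.returncode), completed.stdout
```

Calling `run_case_validate("input.json")` would then yield the label together with whatever report text was written to standard output.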

To produce the validation report as a machine-readable graph output, the `--format` flag can be used to modify the output format:

```bash
case_validate --format turtle input.json > result.ttl
```

To supplement the selected CASE version with one or more additional ontology files, use the `--ontology-graph` flag, more than once if desired:

```bash
case_validate --ontology-graph internal_ontology.ttl --ontology-graph experimental_shapes.ttl input.json
```
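
The repeated-flag behavior comes from `argparse`'s `append` action, which the `case_validate` source in this pull request uses for `--ontology-graph`. A reduced sketch of just that behavior follows; the wrapper function `parse_ontology_graphs` is hypothetical and omits every other flag.

```python
import argparse
import typing


def parse_ontology_graphs(argv: typing.List[str]) -> typing.List[str]:
    """Collect every --ontology-graph value given on a command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--ontology-graph", action="append")
    parser.add_argument("in_graph")
    args = parser.parse_args(argv)
    # With action="append", the attribute is None when the flag is never given.
    return args.ontology_graph or []
```

Values are collected in the order given, so the two files in the command above would be parsed into the ontology graph one after the other.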

Other flags are reviewable with `case_validate --help`.


### `case_file`

To characterize a file, including hashes:
@@ -86,10 +113,9 @@ This project follows [SEMVER 2.0.0](https://semver.org/) where versions are decl

## Ontology versions supported

This repository supports the ontology versions that are linked as submodules in the [CASE Examples QC](https://github.com/ajnelson-nist/CASE-Examples-QC) repository. Currently, the ontology versions are:
This repository supports the CASE ontology version that is linked as a submodule [here](dependencies/CASE). The CASE version is encoded as a variable (and checked in unit tests) in [`case_utils/ontology/version_info.py`](case_utils/ontology/version_info.py), and used throughout this code base, as `CURRENT_CASE_VERSION`.

* CASE - 0.4.0
* UCO - 0.6.0
For instructions on how to update the CASE version for an ontology release, see [`CONTRIBUTE.md`](CONTRIBUTE.md).


## Repository locations
191 changes: 191 additions & 0 deletions case_utils/case_validate/__init__.py
@@ -0,0 +1,191 @@
#!/usr/bin/env python3

# This software was developed at the National Institute of Standards
# and Technology by employees of the Federal Government in the course
# of their official duties. Pursuant to title 17 Section 105 of the
# United States Code this software is not subject to copyright
# protection and is in the public domain. NIST assumes no
# responsibility whatsoever for its use by other parties, and makes
# no guarantees, expressed or implied, about its quality,
# reliability, or any other characteristic.
#
# We would appreciate acknowledgement if the software is used.

"""
This script provides a wrapper to the pySHACL command line tool,
available here:
https://github.com/RDFLib/pySHACL

Portions of the pySHACL command line interface are preserved and passed
through to the underlying pySHACL validation functionality.

Other portions of the pySHACL command line interface are adapted to
CASE, specifically to support CASE and UCO as ontologies that store
subclass hierarchy and node shapes together (rather than as separate
ontology and shape graphs). More specifically to CASE, if no particular
ontology or shapes graph is requested, the most recent version of CASE
will be used. (That most recent version is shipped with this package as
a monolithic file; see case_utils.ontology if interested in further
details.)
"""

__version__ = "0.1.0"

import argparse
import importlib.resources
import logging
import os
import pathlib
import sys
import typing

import rdflib.util # type: ignore
import pyshacl # type: ignore

import case_utils.ontology

from case_utils.ontology.version_info import *

_logger = logging.getLogger(os.path.basename(__file__))

def main() -> None:
    parser = argparse.ArgumentParser(description="CASE wrapper to pySHACL command line tool.")

    # Configure debug logging before running parse_args, because there
    # could be an error raised before the construction of the argument
    # parser.
    logging.basicConfig(level=logging.DEBUG if ("--debug" in sys.argv or "-d" in sys.argv) else logging.INFO)

    case_version_choices_list = ["none", "case-" + CURRENT_CASE_VERSION]

    # Add arguments specific to case_validate.
    parser.add_argument(
        '-d',
        '--debug',
        action='store_true',
        help='Output additional runtime messages.'
    )
    parser.add_argument(
        "--built-version",
        choices=tuple(case_version_choices_list),
        default="case-"+CURRENT_CASE_VERSION,
        help="Monolithic aggregation of CASE ontology files at certain versions. Does not require networking to use. Default is most recent CASE release."
    )
    parser.add_argument(
        "--ontology-graph",
        action="append",
        help="Combined ontology (i.e. subclass hierarchy) and shapes (SHACL) file, in any format accepted by rdflib recognized by file extension (e.g. .ttl). Will supplement ontology selected by --built-version. Can be given multiple times."
    )

    # Inherit arguments from pyshacl.
    parser.add_argument(
        '--abort',
        action='store_true',
        help='(As with pyshacl CLI) Abort on first invalid data.'
    )
    parser.add_argument(
        '-w',
        '--allow-warnings',
        action='store_true',
        help='(As with pyshacl CLI) Shapes marked with severity of Warning or Info will not cause result to be invalid.',
    )
    parser.add_argument(
        "-f",
        "--format",
        choices=('human', 'turtle', 'xml', 'json-ld', 'nt', 'n3'),
        default='human',
        help="(ALMOST as with pyshacl CLI) Choose an output format. Default is \"human\". Difference: 'table' not provided."
    )
    parser.add_argument(
        '-im',
        '--imports',
        action='store_true',
        help='(As with pyshacl CLI) Allow import of sub-graphs defined in statements with owl:imports.',
    )
    parser.add_argument(
        '-i',
        '--inference',
        choices=('none', 'rdfs', 'owlrl', 'both'),
        default='none',
        help="(As with pyshacl CLI) Choose a type of inferencing to run against the Data Graph before validating. Default is \"none\".",
    )
    parser.add_argument(
        '-o',
        '--output',
        dest='output',
        nargs='?',
        type=argparse.FileType('x'),
        help="(ALMOST as with pyshacl CLI) Send output to a file. If absent, output will be written to stdout. Difference: If specified, file is expected not to exist. Clarification: Does NOT influence --format flag's default value of \"human\". (I.e., any machine-readable serialization format must be specified with --format.)",
        default=sys.stdout,
    )

    parser.add_argument("in_graph")

    args = parser.parse_args()

    data_graph = rdflib.Graph()
    data_graph.parse(args.in_graph)

    ontology_graph = rdflib.Graph()
    if args.built_version != "none":
        ttl_filename = args.built_version + ".ttl"
        _logger.debug("ttl_filename = %r.", ttl_filename)
        ttl_data = importlib.resources.read_text(case_utils.ontology, ttl_filename)
        ontology_graph.parse(data=ttl_data, format="turtle")
    if args.ontology_graph:
        for arg_ontology_graph in args.ontology_graph:
            _logger.debug("arg_ontology_graph = %r.", arg_ontology_graph)
            ontology_graph.parse(arg_ontology_graph)

    # Determine output format.
    # pySHACL's determination of output formatting is handled solely
    # through the -f flag. Other CASE CLI tools handle format
    # determination by output file extension. case_validate will defer
    # to pySHACL behavior, as other CASE tools don't (at the time of
    # this writing) have the value "human" as an output format.
    validator_kwargs : typing.Dict[str, str] = dict()
    if args.format != "human":
        validator_kwargs['serialize_report_graph'] = args.format

    validate_result : typing.Tuple[
        bool,
        typing.Union[Exception, bytes, str, rdflib.Graph],
        str
    ]
    validate_result = pyshacl.validate(
        data_graph,
        shacl_graph=ontology_graph,
        ont_graph=ontology_graph,
        inference=args.inference,
        abort_on_first=args.abort,
        allow_warnings=True if args.allow_warnings else False,
        debug=True if args.debug else False,
        do_owl_imports=True if args.imports else False,
        **validator_kwargs
    )

    # Relieve RAM of the data graph after validation has run.
    del data_graph

    conforms = validate_result[0]
    validation_graph = validate_result[1]
    validation_text = validate_result[2]

    # NOTE: The output logistics code is adapted from pySHACL's file
    # pyshacl/cli.py. This section should be monitored for code drift.
    if args.format == "human":
        args.output.write(validation_text)
    else:
        if isinstance(validation_graph, rdflib.Graph):
            raise NotImplementedError("rdflib.Graph expected not to be created from --format value %r." % args.format)
        elif isinstance(validation_graph, bytes):
            args.output.write(validation_graph.decode("utf-8"))
        elif isinstance(validation_graph, str):
            args.output.write(validation_graph)
        else:
            raise NotImplementedError("Unexpected result type returned from validate: %r." % type(validation_graph))

    sys.exit(0 if conforms else 1)

if __name__ == "__main__":
    main()
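
The `--output` flag in this module is declared with `argparse.FileType('x')`, Python's exclusive-creation mode, which is why its help text says the file is expected not to exist. The following reduced sketch isolates that one behavior; the function name `parse_output` is hypothetical, and the parser here carries only the `--output` flag.

```python
import argparse
import sys
import typing


def parse_output(argv: typing.List[str]) -> typing.IO[str]:
    """Parse only an --output flag declared as in case_validate."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-o",
        "--output",
        nargs="?",
        type=argparse.FileType("x"),  # "x": create the file, fail if it exists.
        default=sys.stdout,
    )
    return parser.parse_args(argv).output
```

With no `-o` flag the result is `sys.stdout`. With `-o report.ttl`, the file is created if absent. If it already exists, `argparse` reports the error and exits with status 2, so an earlier report is never overwritten.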