Background
Using high-level languages like OWL, SWRL, and SPARQL as schema or constraint languages for RDF and Linked Data has several advantages:
- Unifying the domain model with data quality rules
- Aligning the domain model and data quality rules with the integration model and language (i.e., RDF)
- Being able to query the domain model, data quality rules, integration model, mapping rules, etc. with SPARQL
- Being able to use automated reasoning about all of these things to ensure logical consistency, explain errors and problems, etc.
But OWL 2, SWRL, and SPARQL were not explicitly designed for this usage. In Stardog, you can use OWL, SWRL, or SPARQL as a constraint language for RDF and as a modeling or inference language. This chapter explains how.
If you are also interested in the theoretical background, please see the ICV specification, which has all the formal details.
Terminology
In the sections below, we explain the operational usage of Integrity Constraint Validation (ICV) in Stardog: it may be easier for you to understand if you read this section on terminology first.
- Schema, TBox
- A schema (or "terminology box", a.k.a. TBox) is a set of statements that define the relationships between data elements, including property and class names, their relationships, etc. In practical terms, schema statements for a Stardog database are RDF Schema and OWL 2 terms, axioms, and definitions.
- Data, ABox
- All of the triples in a Stardog database that aren't part of the schema are part of the data (or "assertional box" a.k.a. ABox).
- Integrity Constraint
- A declarative expression of some rule or constraint which data must conform to in order to be valid. Integrity Constraints are typically domain and application specific. They can be expressed in OWL 2 (any legal syntax), SWRL rules, or (a restricted form of) SPARQL queries.
- Constraints
- Constraints that have been associated with a Stardog database and which are used to validate the data it contains. Each Stardog database may optionally have one and only one set of constraints associated with it.
- ICV, Integrity Constraint Validation
- The process of checking whether some Stardog database is valid with respect to some integrity constraints. The result of ICV is a boolean value (true if valid, false if invalid) and, optionally, an explanation of constraint violations.
- Closed World Assumption, Closed World Reasoning
- Stardog ICV assumes a closed world with respect to data and constraints: that is, it assumes that all relevant data is known to it and included in a database to be validated. It interprets the meaning of Integrity Constraints in light of this assumption; if a constraint says a value must be present, the absence of that value is interpreted as a constraint violation and, hence, as invalid data.
- Open World Assumption, Open World Reasoning
- A legal OWL 2 inference may violate or satisfy an Integrity Constraint in Stardog. In other words, you get to have your cake (OWL as a constraint language) and eat it, too (OWL as modeling or inference language). This means that constraints are applied to a Stardog database with respect to an OWL 2 profile.
- Monotonicity
- OWL is a monotonic language: that means you can't ever add anything to a Stardog database that causes there to be fewer legal inferences. Or, put another way, the only way to decrease the number of legal inferences is to delete something.
- Monotonicity interacts with ICV in the following ways:
- Adding data to or removing it from a Stardog database may make it invalid.
- Adding schema statements to or removing them from a Stardog database may make it invalid.
- Adding new constraints to a Stardog database may make it invalid.
- Deleting constraints from a Stardog database cannot make it invalid.
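The difference between the open-world and closed-world readings of the same axiom can be sketched in a few lines of Python. This is a toy model, not the Stardog API: a database is assumed to be a set of (subject, predicate, object) tuples, and the function names `infer` and `violations` are hypothetical.

```python
TYPE = "rdf:type"

def infer(data):
    """Open-world reading: every Manager is inferred to be an Employee."""
    return data | {(s, TYPE, "Employee") for (s, p, o) in data
                   if p == TYPE and o == "Manager"}

def violations(data):
    """Closed-world reading: a Manager with no Employee type is an error."""
    return {s for (s, p, o) in data
            if p == TYPE and o == "Manager"
            and (s, TYPE, "Employee") not in data}

data = {("Alice", TYPE, "Manager")}

print(violations(data))         # {'Alice'}: the explicit data alone is invalid
print(violations(infer(data)))  # set(): the inferred type satisfies the constraint

# Monotonicity: adding a triple never removes an inference.
bigger = data | {("Bob", TYPE, "Manager")}
print(infer(data) <= infer(bigger))  # True
```

Used as an inference rule, the axiom adds information; used as a constraint, it reports what is missing. Checking the constraint against the inferred view is what lets a legal inference satisfy a constraint.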
Validating Linked Data
The use of high-level languages (OWL 2, SWRL, and SPARQL) to validate RDF data using closed world semantics is one of Stardog's unique capabilities.
ICV and OWL 2 Reasoning
An integrity constraint may be satisfied or violated in either of two ways: by an explicit statement in a Stardog database or by a statement that has been legally inferred by Stardog. This means that when ICV is enabled for a Stardog database, it has to be enabled relative to a reasoning type. The valid choices of reasoning type are any type or kind of reasoning supported by Stardog. As of 1.2.1, those types include:
- no reasoning
- RDFS
- OWL 2 QL
- OWL 2 EL
- OWL 2 RL
- OWL 2 DL
Future releases of Stardog will include RDF Schema and OWL 2 DL reasoning types.
The important implication here is that Integrity Constraint Validation is performed with three inputs: a Stardog database, a set of constraints, and a reasoning type (which may be, of course, no reasoning). This is the case because domain modelers, ontology developers, or integrity constraint authors must consider the interactions between explicit and inferred statements and how these are accounted for in integrity constraints.
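The three inputs can be made concrete with a small Python sketch. Everything here is an illustrative assumption rather than Stardog's implementation: the reasoning type is modeled as a plain function, the schema as a set of (subclass, superclass) pairs, and names such as `validate` and `subclass_reasoning` are hypothetical.

```python
TYPE = "rdf:type"

def no_reasoning(data, schema):
    """Validate against the explicit triples only."""
    return set(data)

def subclass_reasoning(data, schema):
    """Add types implied by (subclass, superclass) pairs until nothing changes."""
    closed = set(data)
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(closed):
            for (sub, sup) in schema:
                if p == TYPE and o == sub and (s, TYPE, sup) not in closed:
                    closed.add((s, TYPE, sup))
                    changed = True
    return closed

def ssn_requires_employee(data):
    """Constraint: only employees can have an SSN."""
    return {s for (s, p, o) in data
            if p == "SSN" and (s, TYPE, "Employee") not in data}

def validate(data, schema, constraints, reasoning=no_reasoning):
    """Valid when no constraint reports a violation on the chosen view."""
    view = reasoning(data, schema)
    return all(not check(view) for check in constraints)

schema = {("Supervisor", "Employee")}
data = {("Heidi", TYPE, "Supervisor"), ("Heidi", "SSN", "123-45-6789")}

print(validate(data, schema, [ssn_requires_employee]))                      # False
print(validate(data, schema, [ssn_requires_employee], subclass_reasoning))  # True
```

The same data and the same constraint give different answers depending on the reasoning function, which is why a constraint author has to know which reasoning type validation will run under.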
Security Implications
There is a security implication in this design choice that may not be obvious. Changing the reasoning type associated with a database and integrity constraint validation has serious security implications with respect to a Stardog database and may only be performed by a user role with sufficient privileges for that action.
How to Use ICV in Stardog
This section will describe how to use Stardog ICV via the SNARL APIs. For more information on using SNARL in general, please refer to the section on programming with Stardog in Java.
There is command-line interface support for many of the operations necessary for using ICV with a Stardog database; please see the Administration chapter for that documentation.
To use ICV in Stardog, one must:
- create some constraints
- associate those constraints with a Stardog database
Creating Constraints
Constraints can be created using the ConstraintFactory, which provides methods for creating integrity constraints from OWL axioms or from SPARQL select queries. ConstraintFactory expects your constraints, if they are defined as OWL axioms, as an RDF triple (or graph). To aid in authoring constraints in OWL, ExpressionFactory is provided for building the RDF equivalent of the OWL axioms of your constraint.
You can also write your constraints in OWL in your favorite editor, such as Protege, and load them into the database from your OWL file.
We recommend defining your constraints as OWL axioms, but you are free to define them using SPARQL SELECT queries. If you choose to define a constraint using a SPARQL SELECT query, keep in mind that any results the query returns are interpreted as violations of the integrity constraint.
An example of creating a simple constraint using the ExpressionFactory.
Adding Constraints to Stardog
The
This example shows how to add an integrity constraint to a Stardog database.
Here we show how to add a set of constraints as defined in a local OWL ontology.
IC Validation
Checking whether or not the contents of a database are valid is easy. Once you have an ICVConnection, you can simply call its isValid() method, which returns whether or not the contents of the database are valid with respect to the constraints associated with that database. Similarly, you can pass some constraints to the isValid() method to see whether the data in the database is valid for those specific constraints; they can be a subset of the constraints associated with the database, or they can be new constraints you are working on.
If the data is invalid for some constraints (either the explicit constraints in your database or a new set of constraints you have authored), you can get some information about what the violation was from the SNARL ICVConnection. ICVConnection.getViolationBindings() will return the constraints which are violated, and for each constraint, you can get the violations as the set of bindings that satisfied the constraint query. You can turn the bindings into the individuals which are in the violation using ICV.asIndividuals.
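The idea of "violations as bindings of a constraint query" can be sketched in plain Python. This is not SNARL code and does not reproduce the signatures of getViolationBindings() or ICV.asIndividuals; the helpers `violation_bindings` and `as_individuals` below are hypothetical stand-ins over a toy set of triples.

```python
TYPE = "rdf:type"

def ssn_domain(data):
    """Binding rows for subjects that have an SSN but no Employee type."""
    return [{"x": s} for (s, p, o) in sorted(data)
            if p == "SSN" and (s, TYPE, "Employee") not in data]

def managers_are_employees(data):
    """Binding rows for Managers that lack an Employee type."""
    return [{"x": s} for (s, p, o) in sorted(data)
            if p == TYPE and o == "Manager"
            and (s, TYPE, "Employee") not in data]

def violation_bindings(data, constraints):
    """Map each violated constraint name to its violating binding rows."""
    report = {}
    for name, query in constraints.items():
        rows = query(data)
        if rows:  # any result row is a violation
            report[name] = rows
    return report

def as_individuals(rows):
    """Collect the distinct individuals mentioned in a list of binding rows."""
    return sorted({value for row in rows for value in row.values()})

data = {("Bob", "SSN", "123-45-6789"),
        ("Alice", TYPE, "Manager"), ("Alice", TYPE, "Employee")}
constraints = {"ssn_domain": ssn_domain,
               "managers_are_employees": managers_are_employees}

report = violation_bindings(data, constraints)
print(report)                                # {'ssn_domain': [{'x': 'Bob'}]}
print(as_individuals(report["ssn_domain"]))  # ['Bob']
```

A database is valid exactly when the report is empty, so a boolean validity check and a detailed violation report can share the same queries.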
ICV and Transactions
In addition to using the ICVConnection as a data oracle to tell whether or not your data is valid with respect to some constraints, you can also use Stardog's ICV support to protect your database from invalid data by using ICV as a guard within transactions.
When guard mode for ICV is enabled in Stardog, each commit is inspected to ensure that the contents of the database are valid for the set of constraints that have been associated with it. Should someone attempt to commit data which violates one or more of the constraints defined for the database, the commit will fail and the data will not be added/removed from your database.
By default, reasoning is not used when you enable guard mode; however, you are free to specify any of the reasoning types supported by Stardog when enabling it. If you have provided a specific reasoning type for guard mode, it will be used during validation of the integrity constraints. This means you can author your constraints with the expectation that inference results can satisfy a constraint.
This illustrates how to create a persistent disk database with ICV guard mode enabled at the QL reasoning type. Guard mode can also be enabled when the database is created on the command line.
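Independently of the Stardog API, the behavior of guard mode can be modeled as a commit that validates the candidate state before applying it. The class and exception names below are hypothetical, and the sketch deliberately ignores reasoning, persistence, and concurrency.

```python
TYPE = "rdf:type"

class ConstraintViolation(Exception):
    """Raised when a commit would leave the database invalid."""

def managers_without_employee_type(data):
    """Constraint: managers must be employees."""
    return {s for (s, p, o) in data
            if p == TYPE and o == "Manager"
            and (s, TYPE, "Employee") not in data}

class GuardedDatabase:
    def __init__(self, constraints):
        self.triples = set()
        self.constraints = constraints

    def commit(self, add=(), remove=()):
        # Build the state the commit would produce, without touching the
        # current contents, and validate that candidate state first.
        candidate = (self.triples - set(remove)) | set(add)
        failed = [c.__name__ for c in self.constraints if c(candidate)]
        if failed:
            raise ConstraintViolation(", ".join(failed))
        self.triples = candidate

db = GuardedDatabase([managers_without_employee_type])
try:
    db.commit(add=[("Alice", TYPE, "Manager")])
except ConstraintViolation as err:
    print("rejected:", err)  # rejected: managers_without_employee_type

db.commit(add=[("Alice", TYPE, "Manager"), ("Alice", TYPE, "Employee")])
print(len(db.triples))  # 2
```

Because the whole candidate state is validated, a removal can be rejected just as an addition can: deleting Alice's Employee type in a later commit would fail in the same way.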
ICV Examples
Rather than discuss the formal semantics of ICV here, we will look at some examples. The following examples use OWL 2's Manchester syntax; and they assume a simple data schema, which is available as an OWL ontology and as a UML diagram. The examples also assume that the default namespace is <http://example.com/company.owl#> and that xsd: is bound to the standard, <http://www.w3.org/2001/XMLSchema#>.
We provide reference code for each of the following examples. This is also included in the examples directory in the Stardog distribution.
Subsumption Constraints
This kind of constraint guarantees certain subclass and superclass (i.e., subsumption) relationships exist between instances.
Managers must be employees
| Constraint | Class: Manager SubClassOf: Employee | |
| A | Individual: Alice Types: Manager | Invalid |
| B | Individual: Alice Types: Manager, Employee | Valid |
This constraint says that if an RDF individual is an instance of Manager, then it must also be an instance of Employee. In ontology A, the only instance of Manager, namely Alice, is not an instance of Employee; therefore, A is invalid. In B, Alice is an instance of both Manager and Employee; therefore, B is valid.
Domain-Range Constraints
These constraints control the types of domain and range instances for properties.
Only project leaders can be responsible for projects.
| Constraint | ObjectProperty: is_responsible_for Domain: Project_Leader Range: Project | |
| A | Individual: Alice Facts: is_responsible_for MyProject Individual: MyProject Types: Project | Invalid |
| B | Individual: Alice Types: Project_Leader Facts: is_responsible_for MyProject Individual: MyProject | Invalid |
| C | Individual: Alice Types: Project_Leader Facts: is_responsible_for MyProject Individual: MyProject Types: Project | Valid |
This constraint says that if an RDF instance i is related to an RDF instance j via the property is_responsible_for, then i must be an instance of Project_Leader and j must be an instance of Project. In ontology A, there is only one pair of individuals related via is_responsible_for, namely (Alice, MyProject), and MyProject is an instance of Project; but Alice is not an instance of Project_Leader. Therefore, A is invalid. In B, Alice is an instance of Project_Leader, but MyProject is not an instance of Project; therefore, B is not valid. In C, Alice is an instance of Project_Leader, and MyProject is an instance of Project; therefore, C is valid.
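As a rough illustration of this check, the following Python sketch evaluates a domain and range constraint over a toy set of triples. The function name and the tuple representation are assumptions made for the example, not part of Stardog.

```python
TYPE = "rdf:type"

def domain_range_violations(data, prop, domain, rng):
    """List (individual, side) pairs that break the domain or range of prop."""
    bad = []
    for (s, p, o) in sorted(data):
        if p != prop:
            continue
        if (s, TYPE, domain) not in data:
            bad.append((s, "domain"))
        if (o, TYPE, rng) not in data:
            bad.append((o, "range"))
    return bad

fact = ("Alice", "is_responsible_for", "MyProject")
a = {fact, ("MyProject", TYPE, "Project")}
b = {fact, ("Alice", TYPE, "Project_Leader")}
c = a | b

for name, ontology in (("A", a), ("B", b), ("C", c)):
    print(name, domain_range_violations(
        ontology, "is_responsible_for", "Project_Leader", "Project"))
# A [('Alice', 'domain')]
# B [('MyProject', 'range')]
# C []
```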
Only employees can have an SSN.
| Constraint | DataProperty: SSN Domain: Employee | |
| A | Individual: Bob Facts: SSN "123-45-6789" | Invalid |
| B | Individual: Bob Types: Employee Facts: SSN "123-45-6789" | Valid |
This constraint says that if an RDF instance i has a data assertion via the property SSN, then i must be an instance of Employee. In ontology A, Bob is not known to be an instance of Employee but has an SSN; therefore, A is invalid. In B, Bob is defined to be an instance of Employee; therefore, the ontology is valid.
Each date of birth must be a date.
| Constraint | DataProperty: DOB Range: xsd:date | |
| A | Individual: Bob Facts: DOB "1970-01-01" | Invalid |
| B | Individual: Bob Facts: DOB "1970-01-01"^^xsd:date | Valid |
This constraint says that if an RDF instance i is related to a literal l via the data property DOB, then l must have the XML Schema type xsd:date. In ontology A, Bob is related to the untyped literal "1970-01-01" via DOB so A is invalid. In B, the literal "1970-01-01" is properly typed so the ontology is valid.
Participation Constraints
These constraints control whether (or not) an RDF instance participates in some specified relationship.
Each supervisor must supervise at least one employee.
| Constraint | Class: Supervisor SubClassOf: supervises some Employee | |
| A | Individual: Alice | Valid |
| B | Individual: Alice Types: Supervisor | Invalid |
| C | Individual: Alice Types: Supervisor Facts: supervises Bob Individual: Bob | Invalid |
| D | Individual: Alice Types: Supervisor Facts: supervises Bob Individual: Bob Types: Employee | Valid |
This constraint says that if an RDF instance i is of type Supervisor, then i must be related to an individual j via the property supervises, and j must be an instance of Employee. In ontology A, Supervisor has no instances; therefore, A is vacuously valid. In B, the only instance of Supervisor, namely Alice, is related to no individual; therefore, B is invalid. In C, Alice is related to Bob via supervises, but Bob is not an instance of Employee; therefore, C is invalid. In D, Alice is related to Bob via supervises, and Bob is an instance of Employee; therefore, D is valid.
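A participation constraint of this shape amounts to an existence check under the closed world assumption. The sketch below is a hypothetical Python rendering over a toy set of triples; an individual with no assertions, as in ontology A, is modeled as an empty set.

```python
TYPE = "rdf:type"

def supervisor_violations(data):
    """Supervisors with no 'supervises' link to a known Employee."""
    employees = {s for (s, p, o) in data if p == TYPE and o == "Employee"}
    return sorted(
        s for (s, p, o) in data
        if p == TYPE and o == "Supervisor"
        and not any(s2 == s and p2 == "supervises" and o2 in employees
                    for (s2, p2, o2) in data))

a = set()                                   # Alice has no assertions at all
b = {("Alice", TYPE, "Supervisor")}
c = b | {("Alice", "supervises", "Bob")}    # Bob is not known to be an Employee
d = c | {("Bob", TYPE, "Employee")}

for name, ontology in (("A", a), ("B", b), ("C", c), ("D", d)):
    print(name, supervisor_violations(ontology))
# A []
# B ['Alice']
# C ['Alice']
# D []
```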
Each project must have a valid project number.
| Constraint | Class: Project SubClassOf: number some integer[> 0, < 5000] | |
| A | Individual: MyProject | Valid |
| B | Individual: MyProject Types: Project | Invalid |
| C | Individual: MyProject Types: Project Facts: number "23" | Invalid |
| D | Individual: MyProject Types: Project Facts: number "6000"^^integer | Invalid |
| E | Individual: MyProject Types: Project Facts: number "23"^^integer | Valid |
This constraint says that if an RDF instance i is of type Project, then i must be related via the property number to an integer greater than 0 and less than 5000 (both bounds exclusive). In ontology A, the individual MyProject is not known to be an instance of Project, so the constraint does not apply and the ontology is valid. In B, MyProject is an instance of Project but is not known to have any data assertions via number, so B is invalid. In C, MyProject does have a data property assertion via number, but the literal "23" is untyped (not an integer); therefore, the ontology is invalid. In D, MyProject is related to an integer via number, but it is out of range, so the ontology is invalid. Finally, in E, MyProject is related to the integer 23, which lies within the range, so this ontology is valid.
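The combination of a datatype check and a value range can be sketched as follows. The representation is an assumption for illustration only: each literal is modeled as a (lexical form, datatype) pair, with `None` standing for an untyped literal.

```python
TYPE = "rdf:type"
XSD_INTEGER = "xsd:integer"

def has_valid_number(data, subject):
    """True if subject has a 'number' that is an integer strictly between 0 and 5000."""
    for (s, p, o) in data:
        if s == subject and p == "number":
            lexical, datatype = o
            if datatype == XSD_INTEGER and 0 < int(lexical) < 5000:
                return True
    return False

def project_number_violations(data):
    """Projects that lack a valid project number."""
    return sorted(s for (s, p, o) in data
                  if p == TYPE and o == "Project"
                  and not has_valid_number(data, s))

project = ("MyProject", TYPE, "Project")
b = {project}
c = {project, ("MyProject", "number", ("23", None))}            # untyped literal
d = {project, ("MyProject", "number", ("6000", XSD_INTEGER))}   # out of range
e = {project, ("MyProject", "number", ("23", XSD_INTEGER))}

for name, ontology in (("B", b), ("C", c), ("D", d), ("E", e)):
    print(name, project_number_violations(ontology))
# B ['MyProject']
# C ['MyProject']
# D ['MyProject']
# E []
```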
Cardinality Constraints
These constraints control the number of various relationships or property values.
Employees mustn't work on more than 3 projects
| Constraint | Class: Employee SubClassOf: works_on max 3 Project | |
| A | Individual: Bob | Valid |
| B | Individual: Bob Types: Employee Facts: works_on MyProject Individual: MyProject Types: Project | Valid |
| C | Individual: Bob Types: Employee Facts: works_on MyProject, works_on MyProjectFoo, works_on MyProjectBar, works_on MyProjectBaz Individual: MyProject Types: Project Individual: MyProjectFoo Types: Project Individual: MyProjectBar Types: Project Individual: MyProjectBaz Types: Project | Invalid |
This constraint says that if an RDF instance i is an Employee, then i must not be related via the property works_on to more than 3 named individuals of class Project. In ontology A, Bob is not known to be an instance of Employee, so the constraint does not apply and the ontology is valid. In B, Bob is an instance of Employee but is known to work on only a single project, namely MyProject, so the ontology is valid. In C, Bob is related to 4 named individuals of class Project (namely, MyProject, MyProjectFoo, MyProjectBar, and MyProjectBaz) via works_on. Due to the weak Unique Name Assumption (UNA), these individuals are considered distinct, so the ontology is invalid.
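Counting under a weak Unique Name Assumption can be sketched as counting distinct names unless two names are known to denote the same individual. In the Python sketch below, the optional `same_as` mapping (name to canonical name) is a simplifying assumption used to model known equalities; it is not how Stardog represents them.

```python
TYPE = "rdf:type"

def max_cardinality_violations(data, cls, prop, filler, limit, same_as=None):
    """Map each instance of cls to its prop targets when they exceed limit."""
    same_as = same_as or {}
    violations = {}
    for (s, p, o) in data:
        if p == TYPE and o == cls:
            # Different names count as different individuals unless
            # same_as maps them to one canonical name.
            targets = {same_as.get(t, t) for (s2, p2, t) in data
                       if s2 == s and p2 == prop and (t, TYPE, filler) in data}
            if len(targets) > limit:
                violations[s] = sorted(targets)
    return violations

projects = ["MyProject", "MyProjectFoo", "MyProjectBar", "MyProjectBaz"]
data = {("Bob", TYPE, "Employee")}
data |= {("Bob", "works_on", name) for name in projects}
data |= {(name, TYPE, "Project") for name in projects}

# Four distinct names, so Bob exceeds the limit of 3.
print(max_cardinality_violations(data, "Employee", "works_on", "Project", 3))

# If two names were known to denote the same project, only 3 remain.
merged = {"MyProjectBaz": "MyProjectBar"}
print(max_cardinality_violations(data, "Employee", "works_on", "Project", 3, merged))
```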
Departments must have at least 2 employees.
| Constraint | Class: Department SubClassOf: inverse(works_in) min 2 Employee | |
| A | Individual: MyDepartment | Valid |
| B | Individual: MyDepartment Types: Department Individual: Bob Types: Employee Facts: works_in MyDepartment | Invalid |
| C | Individual: MyDepartment Types: Department Individual: Bob Types: Employee Facts: works_in MyDepartment Individual: Alice Types: Employee Facts: works_in MyDepartment | Valid |
This constraint says that if an RDF instance i is a Department, then there should exist at least 2 instances j and k of class Employee which are related to i via the property works_in (or, equivalently, i should be related to them via the inverse of works_in). In ontology A, the individual MyDepartment is not known to be an instance of Department so the constraint does not apply and the ontology is valid. In B, MyDepartment is an instance of Department but only one instance of Employee, namely Bob, is known to work in it, so the ontology is invalid. In C, MyDepartment is related to the individuals Bob and Alice, which are both instances of Employee and (due to the weak Unique Name Assumption that Stardog adopts for ICV), are distinct, so the ontology is valid.
Managers must manage exactly 1 department.
| Constraint | Class: Manager SubClassOf: manages exactly 1 Department | |
| A | Individual: Isabella | Valid |
| B | Individual: Isabella Types: Manager | Invalid |
| C | Individual: Isabella Types: Manager Facts: manages MyDepartment | Invalid |
| D | Individual: Isabella Types: Manager Facts: manages MyDepartment Individual: MyDepartment Types: Department | Valid |
| E | Individual: Isabella Types: Manager Facts: manages MyDepartment, MyDepartment1 Individual: MyDepartment Types: Department Individual: MyDepartment1 Types: Department | Invalid |
This constraint says that if an RDF instance i is a Manager, then it must be related to exactly 1 instance of Department via the property manages. In ontology A, the individual Isabella is not known to be an instance of Manager so the constraint does not apply and the ontology is valid. In B, Isabella is an instance of Manager but is not related to any instances of Department, so the ontology is invalid. In C, Isabella is related to the individual MyDepartment via the property manages but MyDepartment is not known to be an instance of Department, so the ontology is invalid. In D, Isabella is related to exactly one instance of Department, namely MyDepartment, so the ontology is valid. Finally, in E, Isabella is related to 2 distinct (again, because of weak UNA) instances of Department, namely MyDepartment and MyDepartment1, so the ontology is invalid.
Entities must not have more than one name.
| Constraint | DataProperty: name Characteristics: Functional | |
| A | Individual: MyDepartment | Valid |
| B | Individual: MyDepartment Facts: name "Human Resources" | Valid |
| C | Individual: MyDepartment Facts: name "Human Resources", name "Legal" | Invalid |
This constraint says that no RDF instance i can have more than 1 assertion via the data property name. In ontology A, the individual MyDepartment does not have any data property assertions, so A is valid. In B, MyDepartment has a single assertion via name, so the ontology is also valid. In C, MyDepartment is related to 2 literals, namely "Human Resources" and "Legal", via name, so the ontology is invalid.
Property Constraints
These constraints control how instances are related to one another via properties.
The manager of a department must work in that department.
| Constraint | ObjectProperty: manages SubPropertyOf: works_in | |
| A | Individual: Bob Facts: manages MyDepartment | Invalid |
| B | Individual: Bob Facts: manages MyDepartment, works_in MyDepartment | Valid |
This constraint says that if an RDF instance i is related to j via the property manages, then i must also be related to j via the property works_in. In ontology A, Bob is related to MyDepartment via manages, but not via works_in, so the ontology is invalid. In B, Bob is related to MyDepartment via both manages and works_in, so the ontology is valid.
Department managers must supervise all the department's employees.
| Constraint | ObjectProperty: is_supervisor_of SubPropertyChain: manages o inverse(works_in) | |
| A | Individual: Jose Facts: manages MyDepartment, is_supervisor_of Maria Individual: Maria Facts: works_in MyDepartment Individual: Diego Facts: works_in MyDepartment | Invalid |
| B | Individual: Jose Facts: manages MyDepartment, is_supervisor_of Maria, is_supervisor_of Diego Individual: Maria Facts: works_in MyDepartment Individual: Diego Facts: works_in MyDepartment | Valid |
This constraint says that if an RDF instance i is related to j via the property manages and k is related to j via the property works_in, then i must be related to k via the property is_supervisor_of. In ontology A, Jose is related to MyDepartment via manages, Diego is related to MyDepartment via works_in, but Jose is not related to Diego via any property, so the ontology is invalid. In B, Jose is related to Maria and Diego, who both are related to MyDepartment via works_in, via the property is_supervisor_of, so the ontology is valid.
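Read as a constraint, the property chain is a join: for every pair connected by manages followed by the inverse of works_in, the direct is_supervisor_of link must already be asserted. A minimal Python sketch over a toy set of triples, with a hypothetical function name, might look like this:

```python
def chain_violations(data):
    """Pairs (i, k) where i manages j and k works_in j, but i is_supervisor_of k is missing."""
    manages = [(s, o) for (s, p, o) in data if p == "manages"]
    works_in = [(s, o) for (s, p, o) in data if p == "works_in"]
    return sorted((i, k)
                  for (i, j) in manages
                  for (k, j2) in works_in
                  if j == j2 and (i, "is_supervisor_of", k) not in data)

a = {("Jose", "manages", "MyDepartment"),
     ("Jose", "is_supervisor_of", "Maria"),
     ("Maria", "works_in", "MyDepartment"),
     ("Diego", "works_in", "MyDepartment")}
b = a | {("Jose", "is_supervisor_of", "Diego")}

print(chain_violations(a))  # [('Jose', 'Diego')]
print(chain_violations(b))  # []
```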
Complex Constraints
These constraints are more complex, often including multiple conditions, etc.
Employee Constraints
Each employee either works on at least one project, supervises at least one employee that works on at least one project, or manages at least one department.
| Constraint | Class: Employee SubClassOf: works_on some Project or supervises some (Employee and works_on some Project) or manages some Department | |
| A | Individual: Esteban Types: Employee | Invalid |
| B | Individual: Esteban Types: Employee Facts: supervises Lucinda Individual: Lucinda Types: Employee | Invalid |
| C | Individual: Esteban Types: Employee Facts: supervises Lucinda Individual: Lucinda Types: Employee Facts: works_on MyProject Individual: MyProject Types: Project | Valid |
| D | Individual: Esteban Types: Employee Facts: manages MyDepartment Individual: MyDepartment Types: Department | Valid |
| E | Individual: Esteban Facts: manages MyDepartment, works_on MyProject Individual: MyDepartment Types: Department Individual: MyProject Types: Project | Valid |
This constraint says that if an individual i is an instance of Employee, then at least one of three conditions must be met: First, it is related to an instance of Project via the property works_on. Second, it is related to an instance j via the property supervises and j is an instance of Employee and also related to some instance of Project via the property works_on. Third, it is related to an instance of Department via the property manages.
Ontologies A and B are invalid because none of the conditions are met. C meets the second condition: Esteban (who is an Employee) is related to Lucinda via the property supervises, and Lucinda is both an Employee and related to MyProject, which is a Project, via the property works_on. D meets the third condition: Esteban is related to an instance of Department, namely MyDepartment, via the property manages. Finally, E meets the first and the third conditions because, in addition to managing a department, Esteban is also related to an instance of Project, namely MyProject, via the property works_on.
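The disjunction can be sketched as three separate existence checks combined with a logical or. The helper names below are hypothetical and the triples are a toy model. Note that the check applies to every Employee, so in ontology B it reports Lucinda as well as Esteban.

```python
TYPE = "rdf:type"

def related(data, subject, prop, cls):
    """Objects of prop for subject that are known instances of cls."""
    return [o for (s, p, o) in data
            if s == subject and p == prop and (o, TYPE, cls) in data]

def employee_violations(data):
    """Employees that satisfy none of the three alternatives."""
    bad = []
    for (s, p, o) in sorted(data):
        if p == TYPE and o == "Employee":
            works = bool(related(data, s, "works_on", "Project"))
            supervises = any(related(data, j, "works_on", "Project")
                             for j in related(data, s, "supervises", "Employee"))
            manages = bool(related(data, s, "manages", "Department"))
            if not (works or supervises or manages):
                bad.append(s)
    return bad

a = {("Esteban", TYPE, "Employee")}
b = a | {("Esteban", "supervises", "Lucinda"), ("Lucinda", TYPE, "Employee")}
c = b | {("Lucinda", "works_on", "MyProject"), ("MyProject", TYPE, "Project")}
d = a | {("Esteban", "manages", "MyDepartment"), ("MyDepartment", TYPE, "Department")}

for name, ontology in (("A", a), ("B", b), ("C", c), ("D", d)):
    print(name, employee_violations(ontology))
# A ['Esteban']
# B ['Esteban', 'Lucinda']
# C []
# D []
```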
Employees and US government funding
Only employees who are American citizens can work on a project that receives funds from a US government agency.
| Constraint | Class: Project and receives_funds_from some US_Government_Agency SubClassOf: inverse(works_on) only (Employee and nationality value "US") | |
| A | Individual: MyProject Types: Project Facts: receives_funds_from NASA Individual: NASA Types: US_Government_Agency | Valid |
| B | Individual: MyProject Types: Project Facts: receives_funds_from NASA Individual: NASA Types: US_Government_Agency Individual: Andy Types: Employee Facts: works_on MyProject | Invalid |
| C | Individual: MyProject Types: Project Facts: receives_funds_from NASA Individual: NASA Types: US_Government_Agency Individual: Andy Types: Employee Facts: works_on MyProject, nationality "US" | Valid |
| D | Individual: MyProject Types: Project Facts: receives_funds_from NASA Individual: NASA Types: US_Government_Agency Individual: Andy Types: Employee Facts: works_on MyProject, nationality "US" Individual: Heidi Types: Supervisor Facts: works_on MyProject, nationality "US" | Invalid |
| E | Individual: MyProject Types: Project Facts: receives_funds_from NASA Individual: NASA Types: US_Government_Agency Individual: Andy Types: Employee Facts: works_on MyProject, nationality "US" Individual: Heidi Types: Supervisor Facts: works_on MyProject, nationality "US" Class: Supervisor SubClassOf: Employee | Valid |
This constraint says that if an individual i is an instance of Project and is related to an instance of US_Government_Agency via the property receives_funds_from, then any individual j which is related to i via the property works_on must satisfy two conditions: First, it must be an instance of Employee. Second, it must be related to the literal "US" via the data property nationality.
Ontology A is valid because there is no individual related to MyProject via works_on, so the constraint is vacuously satisfied. Ontology B is invalid since Andy is related to MyProject via works_on, MyProject is an instance of Project and is related to an instance of US_Government_Agency, that is, NASA, via receives_funds_from, but Andy does not have any data property assertions. C is valid because both conditions are met. D is not valid because Heidi violates the first condition: she is related to MyProject via works_on but is not known to be an instance of Employee. Finally, this is fixed in ontology E, which states that every instance of Supervisor is an instance of Employee, so Heidi is inferred to be an instance of Employee and, consequently, the ontology is valid.
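The walk-through above can be replayed with a small Python sketch. It is a toy model with hypothetical names: the check runs over explicit triples only, and the effect of the subclass axiom in E is simulated by adding the triple that reasoning would infer.

```python
TYPE = "rdf:type"

def funding_violations(data):
    """Workers on US-funded projects who are not Employees with nationality "US"."""
    funded = {s for (s, p, o) in data
              if p == "receives_funds_from"
              and (s, TYPE, "Project") in data
              and (o, TYPE, "US_Government_Agency") in data}
    return sorted(w for (w, p, project) in data
                  if p == "works_on" and project in funded
                  and not ((w, TYPE, "Employee") in data
                           and (w, "nationality", "US") in data))

a = {("MyProject", TYPE, "Project"),
     ("MyProject", "receives_funds_from", "NASA"),
     ("NASA", TYPE, "US_Government_Agency")}
b = a | {("Andy", TYPE, "Employee"), ("Andy", "works_on", "MyProject")}
c = b | {("Andy", "nationality", "US")}
d = c | {("Heidi", TYPE, "Supervisor"), ("Heidi", "works_on", "MyProject"),
         ("Heidi", "nationality", "US")}
# E: the triple that Supervisor SubClassOf Employee would let a reasoner infer.
e = d | {("Heidi", TYPE, "Employee")}

for name, ontology in (("A", a), ("B", b), ("C", c), ("D", d), ("E", e)):
    print(name, funding_violations(ontology))
# A []
# B ['Andy']
# C []
# D ['Heidi']
# E []
```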
