ProvONE: A PROV Extension Data Model for Scientific Workflow Provenance

Unofficial draft 27 March 2014

Contributors:
Víctor Cuevas-Vicenttín, UC Davis/University of New Mexico
Bertram Ludäscher, UC Davis
Paolo Missier, Newcastle University
Khalid Belhajjame, PSL, Paris-Dauphine University, LAMSADE
Fernando Chirigati, New York University
Yaxing Wei, Oak Ridge National Laboratory
Saumen Dey, UC Davis
Parisa Kianmajd, UC Davis
David Koop, New York University
Shawn Bowers, Gonzaga University
Ilkay Altintas, UC San Diego

Abstract

Provenance describes the origin and processing history of an artifact. Data provenance is an important form of metadata that explains how a particular data product was generated, by detailing the steps in the computational process producing it. Provenance information brings transparency and helps to audit and interpret data products. State-of-the-art scientific workflow systems (e.g. Kepler, Taverna, VisTrails) provide environments for specifying and enacting complex computational pipelines, commonly referred to as scientific workflows. In such systems, provenance information is automatically captured in the form of execution traces. However, these systems often rely on proprietary formats that make the interchange of provenance information difficult. Furthermore, the workflow itself, which represents very useful information, may be disregarded in provenance traces. The evolution history of the workflow (i.e. its provenance) can likewise be missing. To address these shortcomings we propose ProvONE, a standard for scientific workflow provenance representation. ProvONE is defined as an extension of the W3C recommended standard PROV, aiming to capture the most relevant information concerning scientific workflow computational processes, and providing extension points to accommodate the specificities of particular scientific workflow systems.

This document specifies the ProvONE model and details how its constituting parts are related to the W3C PROV standard. The description provided is complemented by examples including queries on ProvONE data.

Status of This Document

This document specifies a potential standard published publicly for evaluation and possible adoption. However, it is neither associated with nor supported by any standards organisation.

Please Send Comments

This document was published by the DataONE Provenance Working Group as a proposal for a standard. If you wish to make comments regarding this document, please send them to dataone-provwg@googlegroups.com. All comments are welcome.

1. Introduction

Historically, one of the main uses of provenance has been to support claims of attribution and authenticity, and therefore of value, for material objects (e.g. works of art, manuscripts). In science, provenance is required to provide evidence in support of the experimental results that underpin scientific publications. The importance of provenance still applies in e-Science settings, where the data is obtained through computational methods. In these cases, the provenance of the experimental outcome is typically a graph-structured account of the individual computational steps, which is recorded automatically, at the level of detail specified by the system instrumentation. This form of provenance, suitably encoded for machine processing, can then be exploited using a variety of graph query and analysis tools.

This scenario, where each piece of scientific data obtained by a computational method is associated with its provenance, is becoming increasingly prevalent. Regarding scientific workflows, detailed execution traces are routinely collected by a number of broadly used Workflow Management Systems (WfMSs) including Taverna, Kepler, VisTrails, Galaxy, e-Science Central, Pegasus, and others. However, these systems often adopt proprietary models for encoding the provenance traces captured by workflow executions. Moreover, they adopt different models to specify the workflows themselves. Such heterogeneity makes it difficult for a scientist to analyze and compare provenance traces captured using the same or similar workflows that were specified and enacted using different systems. The absence of a standard model for representing workflow provenance also means that opportunities for stitching the traces produced by different workflows, and therefore assisting the scientist in her analysis, are likely to be missed.

This document presents ProvONE, a model for scientific workflow provenance that aims to fulfill the requirements of the desired standard. The name originates from its development in the context of the DataONE Project, which is creating a large scale and federated data infrastructure serving the Earth sciences community. Nevertheless, ProvONE is designed to support a large variety of WfMSs that in turn are used by numerous scientific communities.

1.1 Relation to other standards

The provenance community has made significant efforts in developing standard models that can be used for capturing and publishing provenance of artifacts and resources on the Web. These efforts resulted in, first, the Open Provenance Model (OPM) [MCF+11], and more recently, the W3C PROV model [PROV]. While such models are useful and are being adopted by academia and industry alike, as suggested by the number of PROV implementations, they do not suffice for encoding scientific workflow provenance. The reason is that both OPM and PROV were developed as minimal models meant to be used for tracking the provenance of resources on the Web regardless of their types. As such, they do not provide all the concepts that are necessary for specifying workflows and encoding the provenance of data products used and generated as a result of their execution. Consequently, many WfMSs adopt their own provenance models, resulting in the aforementioned loss of interoperability opportunities.

Thus the need arises for a new model that acts as a standard for encoding scientific workflow provenance. Instead of creating such a model from scratch, the W3C PROV model can be used as a starting point. A preliminary proposal following this direction was published in [MDB+13]; an independent extension of PROV for scientific workflows is also presented in [CSdO+13], as well as in [BKG+13] (focusing on workflow preservation). This document aims to incorporate and standardize the ideas of these works, as well as additional contributions, to derive an adequate standard that can be used by the scientific workflow community.

1.2 Aspects covered by ProvONE

ProvONE aims to provide the fundamental information required to understand and analyze scientific workflow-based computational experiments. Therefore, it covers the main aspects that have been identified as relevant in the provenance literature. These correspond to prospective and retrospective provenance [ZWF06] as well as process provenance [FSC+06]; additionally, some essential elements of data structure are also considered. Each of these aspects is described in Section 2.

1.3 Structure of this Document

Section 2 provides an overview of the ProvONE conceptual model, covering the aspects outlined in Section 1.2. The conceptual model of ProvONE is given using the Unified Modeling Language [UML].

Section 3 provides a detailed characterization of the various components of ProvONE, which is serialized as an OWL 2 ontology. It clarifies how the ProvONE concepts are related to the W3C PROV concepts, accompanying the descriptions with examples.

Section 4 gives references to additional resources that form part of the ProvONE standard.

1.4 Namespaces

The following namespaces and prefixes are used throughout this document.

Table 1: Prefixes and namespaces used in this specification
prefix    namespace IRI                                  definition
prov      http://www.w3.org/ns/prov#                     The PROV namespace [PROVO]
provone   http://purl.org/provone/provone#               The ProvONE namespace [ProvONE]
xsd       http://www.w3.org/2001/XMLSchema#              XML Schema namespace [XMLSCHEMA11-2]
rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#    The RDF namespace [RDF-CONCEPTS]
rdfs      http://www.w3.org/2000/01/rdf-schema#          The RDFS namespace [RDF-SCHEMA]
owl       http://www.w3.org/2002/07/owl#                 OWL 2 specification namespace [OWL2]
dcterms   http://purl.org/dc/terms/                      Dublin Core Metadata Elements namespace [DC-RDF]
wfms      http://www.wfms.org/registry.xsd#              Placeholder example WfMS namespace
:         http://example.com/                            Artificial namespace for examples
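
In Turtle, a prefixed name such as prov:used is expanded by appending its local part to the namespace IRI bound to the prefix, so a namespace IRI normally ends in a separator such as # or /. The following Python sketch illustrates that expansion. The expand helper is hypothetical, and the provone binding shown is an assumption taken from the namespace part of the term IRIs listed in Section 3.

```python
# Hypothetical prefix table; the provone binding is assumed from the
# term IRIs listed in Section 3 (e.g. .../provone#Process).
PREFIXES = {
    "prov": "http://www.w3.org/ns/prov#",
    "provone": "http://purl.org/provone/provone#",
    "dcterms": "http://purl.org/dc/terms/",
    "": "http://example.com/",
}


def expand(prefixed_name: str) -> str:
    """Expand 'prefix:local' by plain concatenation, as Turtle does."""
    prefix, _, local = prefixed_name.partition(":")
    return PREFIXES[prefix] + local


print(expand("provone:Process"))
print(expand(":process_1"))
```

Without the trailing separator, the same concatenation would yield an IRI that differs from the one listed for the term.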

2. ProvONE Conceptual Model Overview

This section introduces ProvONE informally through a UML class diagram representing its conceptual model and brief descriptions of each of the aspects covered by the model.

The ProvONE conceptual model is illustrated by the UML diagram of Figure 1. All classes have a corresponding PROV type denoted by a UML stereotype (e.g. «entity»), whereas only a subset of the associations do (e.g. «used»). Each of the aspects covered by ProvONE is briefly described next.

Figure 1: ProvONE Conceptual Model UML Diagram

Workflow Representation. The various tasks that form part of a workflow are represented by the Process class. Processes can be either atomic or composite, the latter case specified through the hasSubProcess self association. A given process can be distinguished as a Workflow. Each Process has a series of InputPorts and OutputPorts, while the ports from the various Processes are connected through DataLinks. Note that both input and output ports can be associated with multiple DataLinks, thus allowing workflow models in which a single output is copied and sent to multiple destinations as well as models in which tasks take inputs from different sources through a single input port.

In order to specify executable instances of a Workflow, default parameters can be defined for some of its constituting Processes. The default parameters are represented by Data elements that will be described shortly. A sequential control link, denoted by the SeqCtrlLink class, can be used to specify that the execution of a given Process can only start once the execution of a preceding Process terminates. Finally, a particular Process that specifies a workflow or part of it can be associated with a User that assumes responsibility for its creation.

Trace Representation. The execution traces associated with a given Workflow are represented in ProvONE through the ProcessExec class. Each ProcessExec instance represents the execution of a particular Process, which itself may be a Workflow. For the execution of a Process, a series of input Data items are read from the InputPorts and used to generate a series of output Data items sent through the OutputPorts. Whenever a Data item is sent from an OutputPort to an InputPort, this event is recorded through the dataOnLink association between the Data item and the DataLink between the two ports. In this manner, the graph structure that represents the provenance of the workflow results is generated.

Data Structure Representation. The data elements associated with workflow instances and traces are represented by the Data class. This class is defined to be generic and to represent data items of various types (e.g. XML, JSON, CSV files). In the ProvONE model, each Data instance is uniquely identifiable, even if it has the same value as another Data instance. Although specific data types are not covered directly in ProvONE, collections of Data items are represented through the Collection class. A Collection may in turn represent a set, bag, list or another variant of a group of items.
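
The distinction between the identity and the value of a Data item can be illustrated with a small Python sketch. The Data class below is a hypothetical stand-in and not part of ProvONE; it only shows two items that hold the same value yet remain distinct.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False keeps identity-based equality and hashing
class Data:
    """Hypothetical stand-in for a ProvONE Data item."""
    identifier: str
    value: str


a = Data("d1", "42")
b = Data("d2", "42")

# The two items share a value but are still different Data items.
print(a.value == b.value, a == b)
```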

Workflow Evolution Representation. The specific changes that are performed during the specification of a Workflow are not modeled directly in ProvONE, since these are expected to vary among different WfMSs. However, the different versions of a Workflow form a derivation tree that can be represented using PROV's wasDerivedFrom association, as is explained in the next section.

The ProvONE constructs are summarized in Table 2. The first column lists the aspects covered by ProvONE, serving to indicate the various constructs associated with each aspect. The second and third columns indicate the type of each construct as presented in the UML class diagram (class or association) and the construct name, respectively. The last column contains a link to each construct specification in Section 3.

Table 2: ProvONE Constructs
ProvONE Aspect    Construct type    Name                Specification
Workflow          Class             Process             Section 3.1.1
                                    InputPort           Section 3.1.2
                                    OutputPort          Section 3.1.3
                                    DataLink            Section 3.1.4
                                    SeqCtrlLink         Section 3.1.5
                                    Workflow            Section 3.1.6
                                    User                Section 3.1.7
                  Association       hasSubProcess       Section 3.1.8
                                    sourcePToCL         Section 3.1.9
                                    CLtoDestP           Section 3.1.10
                                    hasInPort           Section 3.1.11
                                    hasOutPort          Section 3.1.12
                                    hasDefaultParam     Section 3.1.13
                                    DLToInPort          Section 3.1.14
                                    outPortToDL         Section 3.1.15
                                    inPortToDL          Section 3.1.16
                                    DLToOutPort         Section 3.1.17
                                    wasAttributedTo     Section 3.1.18
                                    wasDerivedFrom      Section 3.1.19
Trace             Class             ProcessExec         Section 3.2.1
                  Association       dataOnLink          Section 3.2.2
                                    used                Section 3.2.3
                                    wasGeneratedBy      Section 3.2.4
                                    wasAssociatedWith   Section 3.2.5
                                    wasInformedBy       Section 3.2.6
                                    isPartOf            Section 3.2.7
Data Structure    Class             Data                Section 3.3.1
                                    Collection          Section 3.3.2
                  Association       wasDerivedFrom      Section 3.3.3
                                    hadMember           Section 3.3.4

3. ProvONE Model Specification

This section presents the specification of the various components of the ProvONE model outlined in the previous section, covering them as presented in Figure 1 and Table 2. The specification takes the form of an OWL 2 [OWL2] ontology that extends the W3C PROV-O ontology [PROVO].

The namespace for all ProvONE terms is http://purl.org/provone/provone#.

The encoding of the ProvONE ontology can be found under this link: provone.owl

3.1 ProvONE Workflow Specification

3.1.1 Process class

A Process represents a computational task that consumes and produces data through its input and output ports, respectively. It can be atomic or composite, the latter case represented by a possibly nested Process.

IRI:http://purl.org/provone/provone#Process

has super-class

is in domain of

is in range of
Example 1

The following RDF fragment specifies a Process identified within the RDF document by process_1.

1    @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
2    @prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
3    @prefix owl:     <http://www.w3.org/2002/07/owl#> .
4    @prefix dcterms: <http://purl.org/dc/terms/> .
5    @prefix prov:    <http://www.w3.org/ns/prov#> .
6    @prefix provone: <http://purl.org/provone/provone#> .
7    @prefix wfms:    <http://www.wfms.org/registry.xsd#> .
8    @prefix :        <http://example.com/> .
9    
10   :process_1 
11   
12      a provone:Process;
13      dcterms:identifier "e1"^^xsd:string;
14      dcterms:title "CDMSBoxfill"^^xsd:string;
15      wfms:package "gov.llnl.uvcdat.cdms"^^xsd:string;
16   
17   .
Line 12 specifies class membership. In order to specify additional attributes for the Process, we employ first the Dublin Core Metadata Elements [DC-RDF]. In particular, we assign the string e1 as an identifier, thus stating an identifier explicitly, which is independent of the RDF node identifier. In addition, a descriptive title is given in line 14. Additional attributes associated with the specific WfMS in use, in this case a placeholder example, can also be specified in a similar fashion. In this case the software package responsible for the execution of the Process within the WfMS is specified in line 15.

3.1.2 InputPort class

An InputPort enables a Process to receive input Data items, which in turn may be produced by other Processes and sent through DataLinks or specified as default parameters.

IRI:http://purl.org/provone/provone#InputPort

has super-class

is in domain of

is in range of
Example 2

The following RDF fragment specifies an InputPort identified within the RDF document by p1_ip1.

   
1    :p1_ip1 
2    
3       a provone:InputPort;
4       dcterms:identifier "e1_ip1"^^xsd:string;
5       dcterms:title "input_vars"^^xsd:string;
6       wfms:signature "gov.llnl.uvcdat.cdms:CDMSVariable"^^xsd:string;
7    
8    .
Again we make use of the Dublin Core Metadata Elements as well as of elements specific to an example placeholder WfMS. The InputPort is given an identifier and a descriptive title in lines 4 and 5, respectively. A signature denoting the type of data that the InputPort consumes is defined in line 6.

3.1.3 OutputPort class

An OutputPort enables a Process to produce output Data items, which may be forwarded to other Processes via DataLinks.

IRI:http://purl.org/provone/provone#OutputPort

has super-class

is in domain of

is in range of
Example 3

The following RDF fragment specifies an OutputPort identified within the RDF document by p1_op1.

  
1    :p1_op1 
2      
3       a provone:OutputPort;
4       dcterms:identifier "e1_op1"^^xsd:string;
5       dcterms:title "output_vars"^^xsd:string;
6       wfms:signature "gov.llnl.uvcdat.cdms:CDMSVariable"^^xsd:string;
7   
8   .
The OutputPort is given an identifier and a descriptive title in lines 4 and 5, respectively. A signature denoting the type of data that the OutputPort produces is defined in line 6.

3.1.6 Workflow class

A Workflow is a distinguished Process, which indicates that it is meant to represent a computational experiment in its entirety. It is also subject to versioning by prov:wasDerivedFrom through its super-class provone:Process.

IRI:http://purl.org/provone/provone#Workflow

has super-class

is in domain of

is in range of
Example 6

The following RDF fragment specifies a Workflow identified within the RDF document by workflow_1.

1    :workflow_1 
2    
3       a provone:Workflow;
4       dcterms:identifier "wf1"^^xsd:string;
5       dcterms:title "ModelComparison"^^xsd:string;
6    
7    .
The Workflow is given the string wf1 as an identifier in line 4. In addition, it is given the string ModelComparison as a descriptive title in line 5.

3.1.7 User class

A User is the person responsible for the specification of a Process, which in turn can be a Workflow. Its specification serves attribution and accountability purposes.

IRI:http://purl.org/provone/provone#User

has super-class

is in domain of
  • none

is in range of
Example 7

The following RDF fragment specifies a User identified within the RDF document by u1.

 
1    :u1 
2    
3       a provone:User;
4       dcterms:identifier "John"^^xsd:string;
5    
6    .
The User is given the string John as an identifier (line 4), corresponding to his username in the system.

3.1.8 hasSubProcess object property

hasSubProcess specifies the recursive composition of Processes: a parent Process includes a child Process as part of its specification.

IRI:http://purl.org/provone/provone#hasSubProcess

has domain

has range
Example 8

The following RDF fragment illustrates the use of the hasSubProcess object property by extending Example 1 of the Process class.

18    
19   :top_process 
20   
21      a provone:Process;
22      dcterms:identifier "main"^^xsd:string;
23   
24   .
25
26   :top_process provone:hasSubProcess :process_1 .
A Process identified within the document as top_process is given the identifier value main. Subsequently, line 26 specifies that the same Process top_process has as a sub-process the Process process_1, defined in Example 1.
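
Because hasSubProcess is recursive, listing all the Processes nested under a given Process requires following the property transitively. The Python sketch below shows one way to do this over plain (parent, child) pairs. The first pair comes from the example above, while process_1a and the sub_processes helper are hypothetical and introduced only to show a second nesting level.

```python
# (parent, child) pairs for provone:hasSubProcess. The first pair is from
# Example 8; process_1a is a hypothetical extra nesting level.
HAS_SUB_PROCESS = [
    ("top_process", "process_1"),
    ("process_1", "process_1a"),
]


def sub_processes(parent, edges):
    """Return all direct and nested sub-processes of parent, depth-first.

    Assumes the composition is acyclic.
    """
    found = []
    for p, child in edges:
        if p == parent:
            found.append(child)
            found.extend(sub_processes(child, edges))
    return found


print(sub_processes("top_process", HAS_SUB_PROCESS))
```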

3.1.9 sourcePToCL object property

sourcePToCL relates a SeqCtrlLink to its source Process.

IRI:http://purl.org/provone/provone#sourcePToCL

has domain

has range
Example 9

The following RDF fragment illustrates the use of the sourcePToCL object property by complementing Example 1 of the Process class and Example 5 of the SeqCtrlLink class.

1    
2    :process_1 provone:sourcePToCL :p1_p2CL .
3
Line 2 specifies that the Process process_1, defined in Example 1, is the source Process of the SeqCtrlLink p1_p2CL, defined in Example 5.

3.1.10 CLtoDestP object property

CLtoDestP relates a SeqCtrlLink to its destination Process.

IRI:http://purl.org/provone/provone#CLtoDestP

has domain

has range
Example 10

The following RDF fragment illustrates the use of the CLtoDestP object property by complementing Example 5 of the SeqCtrlLink class.

1
2    :process_2 
3   
4       a provone:Process;
5       dcterms:identifier "e2"^^xsd:string;
6       dcterms:title "TemporalStatistics"^^xsd:string;
7    .    
8
9    :p1_p2CL provone:CLtoDestP :process_2 .
10
A Process identified within the document as process_2 is given the identifier value e2 and the title TemporalStatistics. Line 9 states that the SeqCtrlLink p1_p2CL, defined in Example 5, has as its destination the Process process_2.
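
Taken together, the sourcePToCL and CLtoDestP statements of a SeqCtrlLink impose an ordering constraint between two Processes. As a minimal sketch, assuming the statements are available as (subject, object) pairs, a valid execution order can be derived with a topological sort. The execution_order helper is hypothetical, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Statements from Examples 9 and 10, as (subject, object) pairs.
SOURCE_P_TO_CL = [("process_1", "p1_p2CL")]
CL_TO_DEST_P = [("p1_p2CL", "process_2")]


def execution_order(source_to_cl, cl_to_dest):
    """Order Processes so that every link source precedes its destination."""
    dest_of = dict(cl_to_dest)
    sorter = TopologicalSorter()
    for source, link in source_to_cl:
        # add(node, *predecessors): the destination waits for the source
        sorter.add(dest_of[link], source)
    return list(sorter.static_order())


print(execution_order(SOURCE_P_TO_CL, CL_TO_DEST_P))
```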

3.1.11 hasInPort object property

hasInPort specifies the InputPorts of a particular Process.

IRI:http://purl.org/provone/provone#hasInPort

has domain

has range
Example 11

The following RDF fragment illustrates the use of the hasInPort object property by complementing Example 1 of the Process class and Example 2 of the InputPort class.

1    
2    :process_1 provone:hasInPort :p1_ip1 .
3
Line 2 specifies that the Process process_1, defined in Example 1, has an InputPort p1_ip1, defined in Example 2.

3.1.12 hasOutPort object property

hasOutPort specifies the OutputPorts of a particular Process.

IRI:http://purl.org/provone/provone#hasOutPort

has domain

has range
Example 12

The following RDF fragment illustrates the use of the hasOutPort object property by complementing Example 1 of the Process class and Example 3 of the OutputPort class.

1    
2    :process_1 provone:hasOutPort :p1_op1 .
3
Line 2 specifies that the Process process_1, defined in Example 1, has an OutputPort p1_op1, defined in Example 3.

3.1.13 hasDefaultParam object property

hasDefaultParam specifies that a given InputPort has a certain Data item as a default parameter. This makes it possible to specify specially configured Workflow instances.

IRI:http://purl.org/provone/provone#hasDefaultParam

has domain

has range
Example 13

The following RDF fragment illustrates the use of the hasDefaultParam object property by complementing Example 2 of the InputPort class and Example 27 of the Data class.

1    
2    :p1_ip1 provone:hasDefaultParam :data1 .
3
Line 2 specifies that the InputPort p1_ip1, defined in Example 2 and associated in Example 11 with Process process_1 of Example 1, has as a default parameter the Data item data1, defined in Example 27.

3.1.14 DLToInPort object property

DLToInPort connects a DataLink to an InputPort of a Process, while the same DataLink can be connected to an OutputPort of another Process.

IRI:http://purl.org/provone/provone#DLToInPort

has domain

has range
Example 14

The following RDF fragment illustrates the use of the DLToInPort object property by complementing Example 4 of the DataLink class.

1    
2    :p2_ip1 
3   
4       a provone:InputPort;
5       dcterms:identifier "e2_ip1"^^xsd:string;
6       dcterms:title "vars"^^xsd:string;
7       wfms:signature "gov.llnl.uvcdat.cdms:CDMSVariable"^^xsd:string;
8   
9    .
10    
11   :p1_p2DL provone:DLToInPort :p2_ip1 .
12
First, an InputPort p2_ip1 is defined beginning at line 2. This InputPort is compatible with the OutputPort defined in Example 3 since it adopts the same signature. Line 11 specifies that the DataLink p1_p2DL of Example 4 is connected to the InputPort p2_ip1 defined previously. Together with Example 15, the previous statement links the OutputPort of Example 3 with the InputPort of another Process.

3.1.15 outPortToDL object property

outPortToDL connects an OutputPort of a Process to a DataLink, which then can be linked to an InputPort of another Process.

IRI:http://purl.org/provone/provone#outPortToDL

has domain

has range
Example 15

The following RDF fragment illustrates the use of the outPortToDL object property by complementing Example 3 of the OutputPort class and Example 4 of the DataLink class.

1     
2    :p1_op1 provone:outPortToDL :p1_p2DL .
3
Line 2 specifies that the OutputPort p1_op1 of Example 3 is connected to the DataLink p1_p2DL of Example 4. Together with Example 14, the previous statement links the OutputPort of Example 3 with the InputPort of another Process.
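
The port-to-port connection described by Examples 14 and 15 can be recovered by joining the two properties on the shared DataLink. The following Python sketch assumes the statements are held as (subject, object) pairs; the port_connections helper is hypothetical.

```python
# Statements from Examples 15 and 14, as (subject, object) pairs.
OUT_PORT_TO_DL = [("p1_op1", "p1_p2DL")]
DL_TO_IN_PORT = [("p1_p2DL", "p2_ip1")]


def port_connections(out_to_dl, dl_to_in):
    """Join outPortToDL and DLToInPort on the shared DataLink."""
    return [
        (out_port, in_port)
        for out_port, link in out_to_dl
        for link2, in_port in dl_to_in
        if link == link2
    ]


print(port_connections(OUT_PORT_TO_DL, DL_TO_IN_PORT))
```

Since the join is computed over lists of pairs, an OutputPort attached to several DataLinks simply yields several connections.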

3.1.16 inPortToDL object property

inPortToDL connects the InputPort of a Process to a DataLink. In turn, the same DataLink can then be connected to the InputPort of a nested Process.

IRI:http://purl.org/provone/provone#inPortToDL

has domain

has range
Example 16

The following RDF fragment illustrates the use of the inPortToDL object property.

1    :pmain_ip1 
2       a provone:InputPort;
3       dcterms:identifier "e1_ip1"^^xsd:string;
4    .
5
6    :pa_ip1 
7       a provone:InputPort;
8       dcterms:identifier "a_ip1"^^xsd:string;
9    .
10
11   :spdl1 
12      a provone:DataLink;
13      dcterms:identifier "pmain_paDL1"^^xsd:string;
14   .
15
16   :pmain_ip1 provone:inPortToDL :spdl1 .
17
18   :spdl1 provone:DLToInPort :pa_ip1 .
First, two InputPorts pmain_ip1 and pa_ip1 are defined in lines 1-9; the first corresponds to the top process and the second to one of its subprocesses. Lines 11-14 specify the DataLink spdl1, which is then used to connect the two InputPorts by an inPortToDL statement in line 16 and a corresponding DLToInPort statement in line 18.

3.1.17 DLToOutPort object property

DLToOutPort connects a DataLink to the OutputPort of a Process. In turn, the same DataLink can be linked to the OutputPort of a nested Process.

IRI:http://purl.org/provone/provone#DLToOutPort

has domain

has range
Example 17

The following RDF fragment illustrates the use of the DLToOutPort object property.

1    :pmain_op1 
2       a provone:OutputPort;
3       dcterms:identifier "e1_op1"^^xsd:string;
4    .
5
6    :pa_op1 
7       a provone:OutputPort;
8       dcterms:identifier "a_op1"^^xsd:string;
9    .
10
11   :spdl2 
12      a provone:DataLink;
13      dcterms:identifier "pmain_paDL2"^^xsd:string;
14   .
15
16   :pmain_op1 provone:outPortToDL :spdl2 .
17
18   :spdl2 provone:DLToOutPort :pa_op1 .
Two OutputPorts pmain_op1 and pa_op1 are defined in lines 1-9; the first corresponds to the top process and the second to one of its subprocesses. Lines 11-14 specify the DataLink spdl2, which is then used to connect the two OutputPorts by an outPortToDL statement in line 16 and a corresponding DLToOutPort statement in line 18.

3.1.18 wasAttributedTo object property

prov:wasAttributedTo is adopted in ProvONE to relate a Process to the User who is responsible for its creation.

IRI:http://www.w3.org/ns/prov#wasAttributedTo

has domain

has range
Example 18

The following RDF fragment illustrates the use of the wasAttributedTo object property by complementing Example 1 of the Process class and Example 7 of the User class.

1     
2    :process_1 prov:wasAttributedTo :u1 .
3
Line 2 specifies that Process process_1 of Example 1 is attributed to User u1 of Example 7, that is, u1 is its author.

3.1.19 wasDerivedFrom object property

prov:wasDerivedFrom is adopted in ProvONE, in relation to workflow structure, to describe the evolution of processes and workflows.

IRI:http://www.w3.org/ns/prov#wasDerivedFrom

has domain

has range
Example 19

The following RDF fragment illustrates the use of the wasDerivedFrom object property by extending Example 6 of the Workflow class.

1     
2    :workflow_1update1 
3   
4       a provone:Workflow;
5       dcterms:identifier "wf1upd1"^^xsd:string;
6       dcterms:title "ModelComparison"^^xsd:string;
7    .
8    :workflow_1update1 prov:wasDerivedFrom :workflow_1 .
9
First, a Workflow workflow_1update1 is defined beginning at line 2. Line 8 specifies that Workflow workflow_1update1 was derived from Workflow workflow_1 of Example 6, which implies that it is a new version and the result of workflow evolution. Hence it is given the same title.
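
Since each version points to the version it was derived from, the history of a Workflow can be reconstructed by following prov:wasDerivedFrom backwards. The Python sketch below assumes that every version has at most one predecessor and that there are no cycles. The version_history helper and workflow_1update2 are hypothetical.

```python
# prov:wasDerivedFrom statements as (newer, older) pairs. The first pair is
# from Example 19; workflow_1update2 is a hypothetical later version.
WAS_DERIVED_FROM = [
    ("workflow_1update1", "workflow_1"),
    ("workflow_1update2", "workflow_1update1"),
]


def version_history(workflow, derivations):
    """Follow wasDerivedFrom from a version back to the original one."""
    parent_of = dict(derivations)
    history = [workflow]
    while history[-1] in parent_of:
        history.append(parent_of[history[-1]])
    return history


print(version_history("workflow_1update2", WAS_DERIVED_FROM))
```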

3.2 ProvONE Trace Specification

3.2.1 ProcessExec class

A ProcessExec represents the execution of a Process. If the Process in question is a Workflow, then the ProcessExec represents a trace of its execution.

IRI:http://purl.org/provone/provone#ProcessExec

has super-class

is in domain of

is in range of
Example 20

The following RDF fragment specifies a ProcessExec identified within the RDF document by process_1ex1.

1    :process_1ex1
2   
3       a provone:ProcessExec;
4       dcterms:identifier "e1_ex1"^^xsd:string;
5       prov:startedAtTime "2013-08-21T13:37:53"^^xsd:dateTime;
6       prov:endedAtTime "2013-08-21T13:37:53"^^xsd:dateTime;
7       wfms:cached "0"^^xsd:integer;
8       wfms:completed "1"^^xsd:integer; 
9
10   .
A ProcessExec is created with the string e1_ex1 as an identifier. Timestamps denoting the moments at which the execution begins and completes are specified through the prov:startedAtTime and prov:endedAtTime data properties in lines 5 and 6, respectively. The value 0 in line 7 indicates that the result was not obtained from a cache, while the value 1 in line 8 indicates that the execution was completed successfully.
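
The two timestamps make it possible to compute how long an execution took. As a minimal sketch, the hypothetical duration_seconds helper below parses both values with Python's datetime.fromisoformat, which accepts either a space or a T between the date and the time.

```python
from datetime import datetime


def duration_seconds(started_at: str, ended_at: str) -> float:
    """Elapsed seconds between two ISO 8601 style timestamps."""
    start = datetime.fromisoformat(started_at)
    end = datetime.fromisoformat(ended_at)
    return (end - start).total_seconds()


# Values taken from Example 20, where both timestamps coincide.
print(duration_seconds("2013-08-21 13:37:53", "2013-08-21 13:37:53"))
```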

3.2.3 used object property

prov:used is adopted in ProvONE to state that a ProcessExec made use of a particular Data item as input for its execution.

IRI:http://www.w3.org/ns/prov#used

has domain

has range
Example 22

The following RDF fragment illustrates the use of the used object property by complementing Example 20 of the ProcessExec class and Example 27 of the Data class.

1     
2    :process_1ex1 prov:used :data1 .
3
Line 2 specifies that ProcessExec process_1ex1 of Example 20 used as an input the Data item data1 of Example 27.

3.2.4 wasGeneratedBy object property

prov:wasGeneratedBy is adopted in ProvONE to state that a ProcessExec produced a particular Data item as output with its execution.

IRI:http://www.w3.org/ns/prov#wasGeneratedBy

has domain

has range
Example 23

The following RDF fragment illustrates the use of the wasGeneratedBy object property by complementing Example 20 of the ProcessExec class.

1     
2    :data2
3   
4       a provone:Data;
5       dcterms:identifier "cdms1"^^xsd:string;
6       rdfs:label "cdms_data"^^xsd:string;
7       wfms:type "gov.llnl.uvcdat.cdms:CDMSVariable"^^xsd:string;
8    .
9    :data2 prov:wasGeneratedBy :process_1ex1 .
10
First, a Data item data2 is defined beginning at line 2, which also appears in Example 21. Line 9 specifies that the Data item data2 was produced as an output of ProcessExec process_1ex1 of Example 20.
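
Combining wasGeneratedBy with used statements yields the dependency graph from which the lineage of a Data item can be traced. The Python sketch below walks that graph backwards over plain pairs taken from Examples 22 and 23; the upstream_data helper is hypothetical.

```python
# Statements from Examples 22 and 23.
USED = [("process_1ex1", "data1")]              # (ProcessExec, Data)
WAS_GENERATED_BY = [("data2", "process_1ex1")]  # (Data, ProcessExec)


def upstream_data(item, used, was_generated_by):
    """Return every Data item that item transitively depends on."""
    result = set()
    frontier = [item]
    while frontier:
        current = frontier.pop()
        for data, execution in was_generated_by:
            if data != current:
                continue
            # Every input of the generating execution is an upstream item.
            for exec2, source in used:
                if exec2 == execution and source not in result:
                    result.add(source)
                    frontier.append(source)
    return result


print(upstream_data("data2", USED, WAS_GENERATED_BY))
```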

3.2.5 wasAssociatedWith object property

prov:wasAssociatedWith is adopted in ProvONE to state that a ProcessExec represents the execution of a particular Process.

IRI:http://www.w3.org/ns/prov#wasAssociatedWith

has domain

has range
Example 24

The following RDF fragment illustrates the use of the wasAssociatedWith object property by complementing Example 20 of the ProcessExec class and Example 1 of the Process class.

1     
2     :process_1ex1 prov:wasAssociatedWith :process_1 .
3
Line 2 specifies that ProcessExec process_1ex1 of Example 20 is an execution of Process process_1 of Example 1.

3.2.6 wasInformedBy object property

prov:wasInformedBy is adopted in ProvONE to state that a ProcessExec communicates with another ProcessExec through an output-input relation, and thereby triggers its execution.

IRI:http://www.w3.org/ns/prov#wasInformedBy

has domain

has range
Example 25

The following RDF fragment illustrates the use of the wasInformedBy object property by complementing Example 20 of the ProcessExec class.

1     
2    :process_2ex1
3   
4      a provone:ProcessExec;
5      dcterms:identifier "e2_ex1"^^xsd:string;
6      prov:startedAtTime "2013-08-21T13:37:54"^^xsd:dateTime;
7      prov:endedAtTime "2013-08-21T13:37:54"^^xsd:dateTime;
8      wfms:cached "0"^^xsd:integer;
9      wfms:completed "1"^^xsd:integer; 
10
11   .
12
13   :process_2ex1 prov:wasInformedBy :process_1ex1 .
14
First, a ProcessExec process_2ex1 is defined beginning at line 2. Line 13 specifies that ProcessExec process_2ex1 received data from ProcessExec process_1ex1 of Example 20.
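
When a trace records only usage and generation, a wasInformedBy statement of this kind can be derived whenever one ProcessExec used a Data item that another one generated; PROV-CONSTRAINTS describes a similar inference. The Python sketch below is one possible way to compute it. The infer_was_informed_by helper is hypothetical, and the use of data2 by process_2ex1 is an assumption made for illustration.

```python
# data2 wasGeneratedBy process_1ex1 is from Example 23; the use of data2 by
# process_2ex1 is an assumption made for this sketch.
WAS_GENERATED_BY = [("data2", "process_1ex1")]  # (Data, ProcessExec)
USED = [("process_2ex1", "data2")]              # (ProcessExec, Data)


def infer_was_informed_by(used, was_generated_by):
    """Pair each consuming ProcessExec with the producer of its input."""
    producer_of = dict(was_generated_by)
    return {
        (consumer, producer_of[data])
        for consumer, data in used
        if data in producer_of
    }


print(infer_was_informed_by(USED, WAS_GENERATED_BY))
```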

3.2.7 isPartOf object property

provone:isPartOf specifies the nesting structure of ProcessExec instances: a parent ProcessExec (associated with a Workflow) has child ProcessExec instances (associated with Processes and subworkflows).

IRI:http://purl.org/provone/provone#isPartOf

has domain

has range
Example 26

The following RDF fragment illustrates the use of the isPartOf object property by complementing Example 6 of the Workflow class and Example 8 of the ProcessExec class.

1     
2    :workflow_1ex1
3   
4      a provone:ProcessExec;
5      dcterms:identifier "wf1_ex1"^^xsd:string;
6      prov:startTime "2013-08-21 13:37:54"^^xsd:string;
7      prov:endTime "2013-08-21 13:37:59"^^xsd:string;
8      wfms:completed "1"^^xsd:integer; 
9
10   .
11
12   :workflow_1ex1 prov:wasAssociatedWith :workflow_1  .
13
14   :process_1ex1 provone:isPartOf :workflow_1ex1 .
15
First, a ProcessExec workflow_1ex1 is defined beginning at line 2 and associated with Workflow workflow_1 of Example 6 in line 12. Line 14 specifies that ProcessExec process_1ex1 of Example 8 is part of ProcessExec workflow_1ex1.
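
Since isPartOf points from child to parent, recovering the children of a workflow execution means inverting the property. The following plain-Python sketch shows one way to do so; it writes the property in the provone namespace, as given by the IRI above, and it assumes for illustration only that process_2ex1 of Example 25 belongs to the same workflow execution, which this document does not state. The function children_by_parent is hypothetical.

```python
from collections import defaultdict

TRIPLES = [
    (":workflow_1ex1", "prov:wasAssociatedWith", ":workflow_1"),
    (":process_1ex1", "provone:isPartOf", ":workflow_1ex1"),
    # Assumption for illustration: process_2ex1 ran in the same workflow execution.
    (":process_2ex1", "provone:isPartOf", ":workflow_1ex1"),
]


def children_by_parent(triples):
    """Invert isPartOf: map each parent execution to its sorted child executions."""
    children = defaultdict(list)
    for child, predicate, parent in triples:
        if predicate == "provone:isPartOf":
            children[parent].append(child)
    return {parent: sorted(kids) for parent, kids in children.items()}


print(children_by_parent(TRIPLES))
# {':workflow_1ex1': [':process_1ex1', ':process_2ex1']}
```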

3.3 ProvONE Data Structure Specification

3.3.1 Data class

A Data item represents the basic unit of information consumed or produced by a Process. Multiple Data items may be grouped into a Collection.

IRI:http://purl.org/provone/provone#Data

has super-class

is in domain of

is in range of
Example 27

The following RDF fragment specifies a Data item identified within the RDF document by data1.

1    :data1
2   
3       a provone:Data;
4       dcterms:identifier "defparam1"^^xsd:string;
5       rdfs:label "filename"^^xsd:string;
6       prov:value "DLEM_NEE_onedeg_v1.0nc"^^xsd:string;
7       wfms:type "edu.sci.wfms.basic:File"^^xsd:string; 
8
9    .
10
A Data item is created with the string defparam1 as an identifier. It is also given the descriptive string filename through the rdfs:label data property. The prov:value data property specifies the actual value of the data item, namely DLEM_NEE_onedeg_v1.0nc. Finally, the type of the data item as defined by the WfMS is specified in line 7 to be edu.sci.wfms.basic:File.

3.3.2 Collection class

Instead of specifying a new class or subclass, ProvONE explicitly adopts PROV's prov:Collection class, whose description we cite below.

A Collection is an entity that provides a structure to some constituents, which are themselves entities. These constituents are said to be member of the collections.

IRI:http://www.w3.org/ns/prov#Collection

Example 28

The following RDF fragment specifies a Collection identified within the RDF document by col1.

 
1    :col1
2   
3       a prov:Collection;
4       dcterms:identifier "inputset1"^^xsd:string;
5
6    .
7
A Collection is created with the string inputset1 as an identifier.

3.3.3 wasDerivedFrom object property

prov:wasDerivedFrom is adopted in ProvONE, in relation to data structure, to describe dependencies between the Data items produced during workflow execution.

IRI:http://www.w3.org/ns/prov#wasDerivedFrom

has domain

has range
Example 29

The following RDF fragment illustrates the use of the wasDerivedFrom object property by extending Example 27 of the Data class.

1     
2    :data2
3   
4       a provone:Data;
5       dcterms:identifier "cdms1"^^xsd:string;
6       rdfs:label "cdms_data"^^xsd:string;
7       wfms:type "gov.llnl.uvcdat.cdms:CDMSVariable"^^xsd:string;
8
9
10   .
11
12   :data2 prov:wasDerivedFrom :data1 .
13
First, a Data item data2 is defined beginning at line 2, which also appears in Example 21. Line 12 specifies that the Data item data2 was derived from Data item data1 of Example 27.
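
A common use of these dependencies is to collect the full lineage of a Data item by following wasDerivedFrom repeatedly. The sketch below illustrates this in plain Python. Treating the chain as lineage is a choice made by the query, not something the property definition above asserts. The function lineage is hypothetical, and data3 is an invented item that appears in no example of this document.

```python
TRIPLES = [
    (":data2", "prov:wasDerivedFrom", ":data1"),
    # Hypothetical item, added only so that the chain has two steps.
    (":data3", "prov:wasDerivedFrom", ":data2"),
]


def lineage(triples, data_item):
    """Return all Data items reachable from data_item over wasDerivedFrom."""
    sources = {}
    for s, p, o in triples:
        if p == "prov:wasDerivedFrom":
            sources.setdefault(s, set()).add(o)

    found, stack = set(), [data_item]
    while stack:
        for source in sources.get(stack.pop(), ()):
            if source not in found:  # visit each item once, even if cycles occur
                found.add(source)
                stack.append(source)
    return found


print(sorted(lineage(TRIPLES, ":data3")))  # [':data1', ':data2']
```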

3.3.4 hadMember object property

prov:hadMember is adopted in ProvONE, in relation to data structure, to specify the Data items that form part of a Collection.

IRI:http://www.w3.org/ns/prov#hadMember

has domain

has range
Example 30

The following RDF fragment illustrates the use of the hadMember object property by extending Example 28 of the Collection class.

1     
2    :infile1
3   
4       a provone:Data;
5       dcterms:identifier "data_file1"^^xsd:string;
6       rdfs:label "file1"^^xsd:string;
7       prov:value "file1.dat"^^xsd:string;
8       wfms:type "edu.sci.wfms.basic:File"^^xsd:string; 
9
10   .
11
12   :infile2
13   
14       a provone:Data;
15       dcterms:identifier "data_file2"^^xsd:string;
16       rdfs:label "file2"^^xsd:string;
17       prov:value "file2.dat"^^xsd:string;
18       wfms:type "edu.sci.wfms.basic:File"^^xsd:string; 
19
20   .
21
22   :col1 prov:hadMember :infile1 .
23
24   :col1 prov:hadMember :infile2 .
25
Two Data items infile1 and infile2 are defined in lines 2-20. Line 22 specifies that Data item infile1 was a member of Collection item col1 of Example 28. Analogously, line 24 specifies Data item infile2 also as part of Collection item col1.
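
Once membership is recorded, the contents of a Collection can be listed together with the values of its members. The following plain-Python sketch assumes the relevant triples of this example are held as tuples, with literals reduced to bare strings; the function member_values is hypothetical.

```python
TRIPLES = [
    (":col1", "prov:hadMember", ":infile1"),
    (":col1", "prov:hadMember", ":infile2"),
    (":infile1", "prov:value", "file1.dat"),
    (":infile2", "prov:value", "file2.dat"),
]


def member_values(triples, collection):
    """Map each member of the collection to its prov:value, or None if absent."""
    members = [o for s, p, o in triples
               if s == collection and p == "prov:hadMember"]
    values = {s: o for s, p, o in triples if p == "prov:value"}
    return {member: values.get(member) for member in members}


print(member_values(TRIPLES, ":col1"))
# {':infile1': 'file1.dat', ':infile2': 'file2.dat'}
```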

A. References

[BKG+13]
Khalid Belhajjame, Graham Klyne, Daniel Garijo, Oscar Corcho, Esteban García-Cuesta, and Raul Palma. Wf4ever Research Object Model. 20 August 2013. URL: http://wf4ever.github.io/ro/
[CSdO+13]
Flavio Costa, Vítor Silva, Daniel de Oliveira, Kary Ocaña, Eduardo Ogasawara, Jonas Dias, and Marta Mattoso. Capturing and Querying Workflow Runtime Provenance with PROV: a Practical Approach. In Proceedings of the Joint EDBT/ICDT 2013 Workshops, EDBT'13, pages 282-289, New York, NY, USA, 2013. ACM.
[DC-RDF]
Dublin Core Metadata Initiative. DCMI term declarations represented in RDF schema language. 2012. URL: http://dublincore.org/schemas/rdfs/
[FSC+06]
Juliana Freire, Cláudio T. Silva, Steven P. Callahan, Emanuele Santos, Carlos E. Scheidegger, and Huy T. Vo. Managing Rapidly-Evolving Scientific Workflows. In Proceedings of the 2006 international conference on Provenance and Annotation of Data, IPAW'06, pages 10-18, Berlin, Heidelberg, 2006. Springer-Verlag.
[MCF+11]
Luc Moreau, Ben Clifford, Juliana Freire, Joe Futrelle, Yolanda Gil, Paul Groth, Natalia Kwasnikowska, Simon Miles, Paolo Missier, Jim Myers, Beth Plale, Yogesh Simmhan, Eric Stephan, and Jan Van den Bussche. The Open Provenance Model Core Specification (v1.1). Future Gener. Comput. Syst., 27(6):743-756, June 2011.
[MDB+13]
Paolo Missier, Saumen Dey, Khalid Belhajjame, Víctor Cuevas-Vicenttín, and Bertram Ludäscher. D-PROV: Extending the PROV Provenance Model with Workflow Structure. In Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance, TaPP '13, pages 9:1-9:7, Berkeley, CA, USA, 2013. USENIX Association.
[OWL2]
World Wide Web Consortium (W3C). OWL 2 Web Ontology Language Document Overview (Second Edition). W3C Recommendation, 11 December 2012. URL: http://www.w3.org/TR/owl2-overview/
[PROV]
World Wide Web Consortium (W3C). PROV-Overview: An Overview of the PROV Family of Documents. W3C Working Group Note, 30 April 2013. URL: http://www.w3.org/TR/prov-overview/
[PROVO]
World Wide Web Consortium (W3C). PROV-O: The PROV Ontology. W3C Recommendation, 30 April 2013. URL: http://www.w3.org/TR/prov-o/
[RDF-CONCEPTS]
Graham Klyne; Jeremy J. Carroll. Resource Description Framework (RDF): Concepts and Abstract Syntax. 10 February 2004. W3C Recommendation. URL: http://www.w3.org/TR/2004/REC-rdf-concepts-20040210
[RDF-SCHEMA]
Dan Brickley; Ramanathan V. Guha. RDF Vocabulary Description Language 1.0: RDF Schema. 10 February 2004. W3C Recommendation. URL: http://www.w3.org/TR/2004/REC-rdf-schema-20040210
[UML]
Object Management Group. Unified Modeling Language: Superstructure. Version 2.0, 2005. URL: http://www.omg.org/spec/UML/2.0/Superstructure/PDF/
[XMLSCHEMA11-2]
Henry S. Thompson et al. W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. 5 April 2012. W3C Recommendation. URL: http://www.w3.org/TR/2012/REC-xmlschema11-2-20120405/
[ZWF06]
Yong Zhao, Michael Wilde, and Ian Foster. Applying the Virtual Data Provenance Model. In Proceedings of the 2006 international conference on Provenance and Annotation of Data, IPAW'06, pages 148-161, Berlin, Heidelberg, 2006. Springer-Verlag.