Trustworthy Whole-System Provenance for the Linux Kernel

Adam Bates, Dave (Jing) Tian, and Kevin R.B. Butler, University of Florida; Thomas Moyer, MIT Lincoln Laboratory

https://www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/bates

This paper is included in the Proceedings of the 24th USENIX Security Symposium

August 12–14, 2015 • Washington, D.C.

ISBN 978-1-939133-11-3


Trustworthy Whole-System Provenance for the Linux Kernel

Adam Bates, Dave (Jing) Tian, Kevin R.B. Butler
University of Florida

{adammbates,daveti,butler}@ufl.edu

Thomas Moyer

MIT Lincoln Laboratory
[email protected]

Abstract

In a provenance-aware system, mechanisms gather

and report metadata that describes the history of each object being processed on the system, allowing users to understand how data objects came to exist in their present state. However, while past work has demonstrated the usefulness of provenance, less attention has been given to securing provenance-aware systems. Provenance itself is a ripe attack vector, and its authenticity and integrity must be guaranteed before it can be put to use.

We present Linux Provenance Modules (LPM), the first general framework for the development of provenance-aware systems. We demonstrate that LPM creates a trusted provenance-aware execution environment, collecting complete whole-system provenance while imposing as little as 2.7% performance overhead on normal system operation. LPM introduces new mechanisms for secure provenance layering and authenticated communication between provenance-aware hosts, and also interoperates with existing mechanisms to provide strong security assurances. To demonstrate the potential uses of LPM, we design a Provenance-Based Data Loss Prevention (PB-DLP) system. We implement PB-DLP as a file transfer application that blocks the transmission of files derived from sensitive ancestors while imposing just tens of milliseconds overhead. LPM is the first step towards widespread deployment of trustworthy provenance-aware applications.

1 Introduction

A provenance-aware system automatically gathers and reports metadata that describes the history of each object being processed on the system. This allows users to track, and understand, how a piece of data came to exist in its current state.

The Lincoln Laboratory portion of this work was sponsored by the Assistant Secretary of Defense for Research & Engineering under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

The application of provenance is presently of enormous interest in a variety of disparate communities including scientific data processing, databases, software development, and storage [43, 53]. Provenance has also been demonstrated to be of great value to security by identifying malicious activity in data centers [5, 27, 56, 65, 66], improving Mandatory Access Control (MAC) labels [45, 46, 47], and assuring regulatory compliance [3].

Unfortunately, most provenance collection mechanisms in the literature exist as fully-trusted user space applications [28, 27, 41, 56]. Even kernel-based provenance mechanisms [43, 48] and sketches for trusted provenance architectures [40, 42] fall short of providing a provenance-aware system for malicious environments. The problem of whether or not to trust provenance is further exacerbated in distributed environments, or in layered provenance systems, due to the lack of a mechanism to verify the authenticity and integrity of provenance collected from different sources.

In this work, we present Linux Provenance Modules (LPM), the first generalized framework for secure provenance collection on the Linux operating system. Modules capture whole-system provenance, a detailed record of processes, IPC mechanisms, network activity, and even the kernel itself; this capture is invisible to the applications for which provenance is being collected. LPM introduces a gateway that permits the upgrading of low integrity workflow provenance from user space. LPM also facilitates secure distributed provenance through an authenticated, tamper-evident channel for the transmission of provenance metadata between hosts. LPM interoperates with existing security mechanisms to establish a hardware-based root of trust to protect system integrity.

Achieving the goal of trustworthy whole-system provenance, we demonstrate the power of our approach by presenting a scheme for Provenance-Based Data Loss Prevention (PB-DLP). PB-DLP allows administrators to reason about the propagation of sensitive data and control its further dissemination through an expressive policy system, offering dramatically stronger assurances than existing enterprise solutions, while imposing just milliseconds of overhead on file transmission.


Figure 1: A provenance graph showing the attack footprint of a malicious binary. Edges encode relationships that flow backwards into the history of system execution, and writing to an object creates a second node with an incremented version number. Here, we see that the binary has rewritten /etc/rc.local, likely in an attempt to gain persistence after a system reboot.

To our knowledge, this work is the first to apply provenance to DLP.

Our contributions can thus be summarized as follows:

• Introduce Linux Provenance Modules (LPM). LPM facilitates secure provenance collection at the kernel layer, supports attested disclosure at the application layer, provides an authenticated channel for network transmission, and is compatible with the W3C Provenance (PROV) Model [59]. In evaluation, we demonstrate that provenance collection imposes as little as 2.7% performance overhead.

• Demonstrate secure deployment. Leveraging LPM and existing security mechanisms, we create a trusted provenance-aware execution environment for Linux. Through porting Hi-Fi [48] and providing support for SPADE [29], we demonstrate the relative ease with which LPM can be used to secure existing provenance collection mechanisms. We show that, in realistic malicious environments, ours is the first proposed system to offer secure provenance collection.

• Introduce Provenance-Based Data Loss Prevention (PB-DLP). We present a new paradigm for the prevention of data leakage that searches object provenance to identify and prevent the spread of sensitive data. PB-DLP is impervious to attempts to launder data through intermediary files and IPC. We implement PB-DLP as a file transfer application, and demonstrate its ability to query object ancestries in just tens of milliseconds.

2 Background

Data provenance, sometimes called lineage, describes the actions taken on a data object from its creation up to the present. Provenance can be used to answer a variety of historical questions about the data it describes. Such questions include, but are not limited to, "What processes and datasets were used to generate this data?"

and "In what environment was the data produced?" Conversely, provenance can also answer questions about the successors of a piece of data, such as "What objects on the system were derived from this object?" Although potential applications for such information are nearly limitless, past proposals have conceptualized provenance in different ways, indicating that a one-size-fits-all solution to provenance collection is unlikely to meet the needs of all of these audiences. We review these past proposals for provenance-aware systems in Section 8.

The commonly accepted representation for data provenance is a directed acyclic graph (DAG). In this work, we use the W3C PROV-DM specification [59] because it is pervasive and facilitates the exchange of provenance between deployments. An example PROV-DM graph of a malicious binary is shown in Figure 1. This graph describes an attack in which a binary running with root privilege reads several sensitive system files, then edits those files in an attempt to gain persistent access to the host. Edges encode relationships between nodes, pointing backwards into the history of system execution. Writing to an object triggers the creation of a second object node with an incremented version number. This particular provenance graph could serve as a valuable forensics tool, allowing system administrators to better understand the nature of a network intrusion.

2.1 Data Loss Prevention

Data Loss Prevention (DLP) is enterprise software that seeks to minimize the leakage of sensitive data by monitoring and controlling information flow in large, complex organizations [1].1 In addition to the desire to control intellectual property, another motivator for DLP systems is demonstrating regulatory compliance for personally-identifiable information (PII),2 as well as directives such as PCI,3 HIPAA,4 SOX,5 or E.U. Data Protection.6

1 Our overview of data loss prevention is based on review of publicly available product descriptions for software developed by Bit9, CDW, Cisco, McAfee, Symantec, and Titus.

2 See NIST SP 800-122


As encryption can be used to protect data at rest from unauthorized access, the true DLP challenge involves preventing leakage at the hands of authorized users, both malicious and well-meaning agents. This latter group is a surprisingly big problem in the fight to control an organization's intellectual property; a 2013 study conducted by the Ponemon Institute found that over half of companies' employees admitted to emailing intellectual property to their personal email accounts, with 41 percent admitting to doing so on a weekly basis [2]. It is therefore important for a DLP system to be able to exhaustively explain which pieces of data are sensitive, where that data has propagated to within the organization, and where it is (and is not) permitted to flow.

As DLP systems are proprietary and are marketed so as to abstract away the complex details of their internal workings, we cannot offer a complete explanation of their core features. However, some of the mechanisms in such systems are known. Many DLP products use a regular expression-based approach to identify sensitive data, operating similarly to a general-purpose version of Cornell's Spider.7 For example, in PCI compliance,3

DLP might attempt to identify credit card numbers in outbound emails by searching for 16 digit numbers that pass a Mod-10 validation check [39]. Other DLP systems use a label-based approach to identify sensitive data, tagging document metadata with security labels. The Titus system accomplishes this by having company employees manually annotate the documents that they create;8 plug-ins for applications (e.g., Microsoft Office) then prevent the document from being transmitted to or opened by other employees that lack the necessary clearance. In either approach, DLP software is difficult to configure and prone to failure, offering marginal utility at great price.
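To make the validation step concrete, the following is a minimal C sketch of the Mod-10 (Luhn) check that a regex-based DLP filter might apply to a 16-digit candidate before flagging it as a possible card number; it illustrates the general technique only and is not code from any of the products surveyed above.

```c
#include <stdbool.h>
#include <string.h>

/* Mod-10 (Luhn) check over a 16-digit candidate string.
 * A regex first finds 16-digit runs; this filter then discards
 * candidates whose check digit does not validate. */
static bool luhn_valid(const char *digits)
{
        int sum = 0;
        int len = (int)strlen(digits);

        if (len != 16)
                return false;

        for (int i = 0; i < len; i++) {
                int d = digits[len - 1 - i] - '0';

                if (d < 0 || d > 9)
                        return false;
                if (i % 2 == 1) {        /* double every second digit from the right */
                        d *= 2;
                        if (d > 9)
                                d -= 9;
                }
                sum += d;
        }
        return sum % 10 == 0;
}
```

For example, luhn_valid("4111111111111111") returns true, while a random 16-digit string usually fails the check, which is how such filters cut down on false matches from the regex alone.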

3 Linux Provenance Modules

To serve as the foundation for secure provenance-aware systems, we present Linux Provenance Modules (LPM). We provide a working definition for the provenance our system will collect in §3.1. In §3.2 we consider the capabilities and aims of a provenance-aware adversary, and identify security and design goals in §3.3. The LPM design is presented in §3.4, and in §3.5 we demonstrate its secure deployment. An expanded description of our system is available in our technical report [8].

3 See https://www.pcisecuritystandards.org
4 See http://www.hhs.gov/ocr/privacy
5 Short for the Sarbanes-Oxley Act, U.S. Public Law No. 107-204
6 See EU Directive 95/46/EC
7 See http://www2.cit.cornell.edu/security/tools
8 See http://www.titus.com

3.1 Defining Whole-System Provenance

In the design of LPM, we adopt a model for whole-system provenance9 that is broad enough to accommodate the needs of a variety of existing provenance projects. To arrive at a definition, we inspect four past proposals that collect broadly scoped provenance: SPADE [29], LineageFS [53], PASS [43], and Hi-Fi [48]. SPADE provenance is structured around primitive operations of system activities with data inputs and outputs. It instruments file and process system calls, and associates each call to a process ID (PID), user identifier, and network address. LineageFS uses a similar definition, associating process IDs with the file descriptors that the process reads and writes. PASS associates a process's output with references to all input files and the command line and process environment of the process; it also appends out-of-band knowledge such as OS and hardware descriptions, and random number generator seeds, if provided. In each of these systems, networking and IPC activity is primarily reflected in the provenance record through manipulation of the underlying file descriptors. Hi-Fi takes an even broader approach to provenance, treating non-persistent objects such as memory, IPC, and network packets as principal objects.

We observe that, in all instances, provenance-aware systems are exclusively concerned with operations on controlled data types, which are identified by Zhang et al. as files, inodes, superblocks, socket buffers, IPC messages, IPC message queues, semaphores, and shared memory [64]. Because controlled data types represent a superset of the objects tracked by system layer provenance mechanisms, we define whole-system provenance as a complete description of agents (users, groups) controlling activities (processes) interacting with controlled data types during system execution.

3.2 Threat Model & Assumptions

We consider an adversary that has gained remote access to a provenance-aware host or network. Once inside the system, the attacker may attempt to remove provenance records, insert spurious information into those records, or find gaps in the provenance monitor's ability to record information flows. A network attacker may also attempt to forge or strip provenance from data in transit. Because captured provenance can be put to use in other applications, the adversary's goal may even be to target the provenance monitor itself. The implications and methods of such an attack are domain-specific. For example:

9 This term is coined in [48], but not explicitly defined. We demonstrate the concrete requirements of a collection mechanism for whole-system provenance in this work.


• Scientific Computing: An adversary may wish to manipulate provenance in order to commit fraud, or to inject uncertainty into records to trigger a "Climategate"-like controversy [50].

• Access Control: When used to mediate access decisions [7, 45, 46, 47], an attacker could tamper with provenance in order to gain unauthorized access, or to perform a denial-of-service attack on other users by artificially escalating the security level of data objects.

• Networks: Provenance metadata can also be associated with packets in order to better understand network events in distributed systems [5, 65, 66]. Coordinating multiple compromised hosts, an attacker may attempt to send unauthenticated messages to avoid provenance generation and to perform data exfiltration.

We define a provenance trusted computing base (TCB) to be the kernel mechanisms, provenance recorder, and storage back-ends responsible for the collection and management of provenance. Provenance-aware applications are not considered part of the TCB.

We make the following assumption with regards to the TCB. In Linux, kernel modules have unrestricted access to kernel memory, meaning that there is no mechanism for protecting LPM from the rest of the kernel. The kernel code is therefore trusted; we assume that the stock kernel will not seek to tamper with the TCB. However, we do consider the possibility that the kernel could be compromised after installation through its interactions with user space applications. To facilitate host attestation in distributed environments, we also assume access to a Public Key Infrastructure (PKI) for provenance-aware hosts to publish their public signing keys.

3.3 System Goals

We set out to provide the following security assurances in the design of our system-layer provenance collection mechanism. McDaniel et al. liken the needs of a secure provenance monitor [42] to the reference monitor guarantees laid out by Anderson [4]: complete mediation, tamperproofness, and verifiability. We define these guarantees as follows:

G1 Complete. Complete mediation for provenance has been discussed elsewhere in the literature in terms of assuring completeness [32]: that the provenance record be gapless in its description of system activity. To facilitate this, LPM must be able to observe all information flows that pass through controlled data types.

G2 Tamperproof. As many provenance use cases involve enhancing system security, LPM will be an adversarial target. The TCB must therefore be impervious to disabling or manipulation by processes in user space.

Figure 2: Diagram of the LPM Framework. Kernel hooks report provenance to a recorder in user space, which uses one of several storage back-ends. The recorder is also responsible for evaluating the integrity of workflow provenance prior to storing it.

G3 Verifiable. The functionality of LPM must be verifiably correct. Additionally, local and remote users should be able to attest whether the host with which they are communicating is running the secured provenance-aware kernel.

Through surveying past work in provenance-aware systems, we identify the following additional goals to support whole-system provenance:

G4 Authenticated Channel. In distributed environments, provenance-aware systems must provide a means of assuring authenticity and integrity of provenance as it is communicated over open networks [7, 42, 48, 65]. While we do not seek to provide a complete distributed provenance solution in LPM, we do wish to provide the required building blocks within the host for such a system to exist. LPM must therefore be able to monitor every network message that is sent or received by the host, and reliably explain these messages to other provenance-aware hosts in the network.

G5 Attested Disclosure. Layered provenance, where additional metadata is disclosed from higher operational layers, is a desirable feature in provenance-aware systems, as applications are able to report workflow semantics that are invisible to the operating system [44]. LPM must provide a gateway for


upgrading low integrity user space disclosures before logging them in the high integrity provenance record. This is consistent with the Clark-Wilson Integrity model for upgrading or discarding low integrity inputs [17].

In order to bootstrap trust in our system, we have implemented LPM as a parallel framework to Linux Security Modules (LSM) [60, 61]. Building on these results, we show in Section 4 that this approach allows LPM to inherit the formal assurances that have been verified for the LSM architecture.

3.4 Design & Implementation

An overview of the LPM architecture is shown in Figure 2. The LPM patch places a set of provenance hooks around the kernel; a provenance module then registers to control these hooks, and also registers several Netfilter hooks; the module then observes system events and transmits information via a relay buffer to a provenance recorder in user space that interfaces with a datastore. The recorder also accepts disclosed provenance from applications after verifying their correctness using the Integrity Measurement Architecture (IMA) [52].

In designing LPM, we first considered using an experimental patch to the LSM framework that allows "stacking" of LSM modules.10 However, at this time, no standard exists for handling when modules make conflicting decisions, creating the potential for unpredictable behavior. We also felt that dedicated provenance hooks were necessary; by collecting provenance after LSM authorization routines, we ensure that the provenance history is an accurate description of authorized system events. If provenance collection occurred during authorization, as would be the case with stacked LSMs, it would not be possible to provide this property.

3.4.1 Provenance Hooks

The LPM patch introduces a set of hook functions in the Linux kernel. These hooks behave similarly to the LSM framework's security hooks in that they facilitate modularity, and default to taking no action unless a module is enabled. Each provenance hook is placed directly beneath a corresponding security hook. The return value of the security hook is checked prior to calling the provenance hook, thus assuring that the requested activity has been authorized prior to provenance capture; we consider the implications of this design in Section 4. A workflow for the hook architecture is depicted in Figure 3. The LPM patch places over 170 provenance hooks, one for each of the LSM authorization hooks. In addition to the hooks that correspond to existing security hooks, we also support Pohly et al.'s Hi-Fi [48] hook that is necessary to preserve Lamport timestamps on network messages [38].

10 See https://lwn.net/Articles/518345/

Figure 3: Hook Architecture for the open system call. Provenance is collected after DAC and LSM checks, ensuring that it accurately reflects system activity. LPM will only deny the operation if it fails to generate provenance for the event.
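As an illustration of the hook ordering described above and depicted in Figure 3, the following C sketch shows how a provenance hook can sit beneath the corresponding LSM authorization hook on an inode-access path. The function names do_dac_checks() and provenance_inode_permission() are hypothetical stand-ins; only the ordering (DAC, then LSM, then provenance) is taken from the text.

```c
/* Sketch of the hook ordering in Figure 3. do_dac_checks() and
 * provenance_inode_permission() are hypothetical stand-ins; the point is
 * that the provenance hook runs only after DAC and LSM authorization
 * succeed, and the operation is denied only if provenance for the event
 * cannot be generated. */
static int checked_inode_access(struct inode *inode, int mask)
{
        int err;

        err = do_dac_checks(inode, mask);                 /* DAC checks              */
        if (err)
                return err;

        err = security_inode_permission(inode, mask);     /* LSM authorization hook  */
        if (err)
                return err;

        return provenance_inode_permission(inode, mask);  /* LPM provenance hook     */
}
```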

3.4.2 Netfilter Hooks

LPM uses Netfilter hooks to implement a cryptographic message commitment protocol. In Hi-Fi, provenance-aware hosts communicated by embedding a provenance sequence number in the IP options field [49] of each outbound packet [48]. This approach allowed Hi-Fi to communicate as normal with hosts that were not provenance-aware, but unfortunately was not secure against a network adversary. In LPM, provenance sequence numbers are replaced with Digital Signature Algorithm (DSA) signatures, which are space-efficient enough to embed in the IP Options field. We have implemented full DSA support in the Linux kernel by creating signing routines to use with the existing DSA verification function. DSA signing and verification occur in the Netfilter inet_local_out and inet_local_in hooks. In inet_local_out, LPM signs over the immutable fields of the IP header, as well as the IP payload. In inet_local_in, LPM checks for the presence of a signature, then verifies the signature against a configurable list of public keys. If the signature fails, the packet is dropped before it reaches the recipient application, thus ensuring that there are no breaks in the continuity of the provenance log. The key store for provenance-aware hosts is obtained by a PKI and transmitted to the kernel during the boot process by writing to securityfs. LPM registers the Netfilter hooks with the highest priority levels,


such that signing occurs just before transmission (i.e., after all other IPTables operations), and signature verification occurs just after the packet enters the interface (i.e., before all other IPTables operations).
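A minimal sketch of how such Netfilter hooks could be registered at the extreme priorities follows, assuming the 2.6.32-era nf_register_hooks() interface matching the kernel used in our evaluation; lpm_sign_out() and lpm_verify_in() are hypothetical handlers standing in for the module's signing and verification routines.

```c
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>

/* Hypothetical handlers: sign outbound packets, verify (or drop) inbound ones. */
static unsigned int lpm_sign_out(unsigned int hooknum, struct sk_buff *skb,
                                 const struct net_device *in,
                                 const struct net_device *out,
                                 int (*okfn)(struct sk_buff *));
static unsigned int lpm_verify_in(unsigned int hooknum, struct sk_buff *skb,
                                  const struct net_device *in,
                                  const struct net_device *out,
                                  int (*okfn)(struct sk_buff *));

static struct nf_hook_ops lpm_nf_ops[] = {
        {       /* sign just before transmission, after all other IPTables operations */
                .hook     = lpm_sign_out,
                .pf       = NFPROTO_IPV4,
                .hooknum  = NF_INET_LOCAL_OUT,
                .priority = NF_IP_PRI_LAST,
        },
        {       /* verify before all other IPTables operations see the packet */
                .hook     = lpm_verify_in,
                .pf       = NFPROTO_IPV4,
                .hooknum  = NF_INET_LOCAL_IN,
                .priority = NF_IP_PRI_FIRST,
        },
};

static int __init lpm_nf_init(void)
{
        return nf_register_hooks(lpm_nf_ops, ARRAY_SIZE(lpm_nf_ops));
}
```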

3.4.3 Provenance Modules

Here, we introduce two of our own provenance modules (Provmon, SPADE), as well as briefly mention the work of our peers (UPTEMPO):

• Provmon. Provmon is an extended port of the Hi-Fi security module [48]. The original Hi-Fi code base was 1,566 lines of code, requiring 723 lines to be modified in the transition. Our extensions introduced 728 additional lines of code. The process of porting did not affect the module's functionality, although we have subsequently extended the Hi-Fi protocol to capture additional lineage information:

File Versioning. The original Hi-Fi protocol did not track version information for files, leading to uncertainty as to the exact contents of a file at the time it was read. Accurately recovering this information in user space was not possible due to race conditions between kernel events. Because versioning is necessary to break cycles in provenance graphs [43], we have added a version field to the provenance context for inodes, which is incremented on each write (a minimal sketch of this bookkeeping follows this list).

Network Context. Hi-Fi omitted remote host address information for network events, reasoning that source information could be forged by a dishonest agent in the network. These human-interpretable data points were replaced with an assigned random identifier for each packet. We found, however, that these identifiers could not be interpreted without remote address information, and incorporated the recording of remote IP addresses and ports into Provmon.

• SPADE. The SPADE system is an increasingly popular option for provenance auditing, but collecting provenance in user space limits SPADE's expressiveness and creates the potential for incomplete provenance. To address this limitation, we have created a mechanism that reports LPM provenance into SPADE's Domain-Specific Language pipe [29]. This permits the collection of whole-system provenance while simultaneously leveraging SPADE's existing storage, remote query, and visualization utilities.

• Using Provenance to Expedite MAC Policies (UPTEMPO). Using LPM as a collection mechanism, Moyer et al. investigate provenance analysis as a means of administrating Mandatory Access Control (MAC) policies [54]. UPTEMPO first observes system execution in a sterile environment, aggregating LPM

provenance in a centralized data store. It then recovers the implicit information flow policy through mining the provenance store to generate a MAC policy for the distributed system, decreasing both administrator effort and the potential for misconfiguration.
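The version bookkeeping mentioned under File Versioning above can be sketched as follows; the structure and field names are hypothetical, and only the rule that a write increments the inode's version comes from the text.

```c
#include <linux/types.h>

/* Hypothetical per-inode provenance context: writes bump a version counter
 * so that later reads can be attributed to exact file contents, which also
 * breaks cycles in the provenance graph. */
struct prov_inode_ctx {
        u64 node_id;    /* identifier of the inode's provenance node */
        u32 version;    /* incremented on every write                */
};

static void prov_on_write(struct prov_inode_ctx *ctx)
{
        ctx->version++;  /* a new graph node <node_id, version> is emitted */
}
```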

3.4.4 Provenance Recorders

LPM provides modular support for different storage through provenance recorders. To prevent an infinite provenance loop, recorders are flagged as provenance-opaque [48] using the security.provenance extended attribute, which is checked by LPM before creating a new event. Each recorder was designed to be as agnostic to the active LPM as possible, making them easy to adapt to new modules.
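For illustration, a recorder binary could be flagged through the standard extended-attribute interface as shown below; the path and attribute value are hypothetical, and only the attribute name security.provenance is taken from the text (setting security.* attributes generally requires administrative privilege).

```c
#include <stdio.h>
#include <sys/xattr.h>

int main(void)
{
        /* Flag the recorder as provenance-opaque so LPM skips events it causes.
         * The value "true" is illustrative; the real encoding is module-defined. */
        if (setxattr("/usr/local/bin/prov-recorder", "security.provenance",
                     "true", 4, 0) != 0) {
                perror("setxattr");
                return 1;
        }
        return 0;
}
```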

We currently provide provenance recorders that offer backend storage for Gzip, PostgreSQL, Neo4j, and SNAP. Commentary on our PostgreSQL and Neo4j reporters can be found in our technical report [8]. We make use of the Gzip and SNAP recorders during our evaluation in Section 6.

The Gzip recorder incurs low storage overheads and fast insertion speeds. On our test bed, we observed this recorder processing up to 400,000 events per second from the Provmon provenance stream. However, because the provenance is not stored in an easily queried form, this back-end is best suited for environments where queries are an offline process.

To create graph storage that was efficient enough for LPM, we used the SNAP graphing library11 to design a recorder that maintains an in-memory graph database that is fully compliant with the W3C PROV-DM Model [59]. We have observed insertion speeds of over 150,000 events per second using the SNAP recorder, and highly efficient querying as well. This recorder is further evaluated in Section 6.

3.4.5 Workflow Provenance

To support layered provenance while preserving our security goals, we require a means of evaluating the integrity of user space provenance disclosures. To accomplish this, we extend the LPM Provenance Recorder to use the Linux Integrity Measurement Architecture (IMA) [35, 52]. IMA computes a cryptographic hash of each binary before execution, extends the measurement into a TPM Platform Configuration Register (PCR), and stores the measurement in kernel memory. This set of measurements can be used by the Recorder to make a decision about the integrity of a Provenance-Aware Application (PAA) prior to accepting the disclosed provenance.

11 See http://snap.stanford.edu


Figure 4: A provenance graph of image conversion. Here, workflow provenance (WasDerivedFrom) encodes a relationship that more accurately identifies the output files' dependencies compared to only using kernel layer observations (Used, WasGeneratedBy).

When a PAA wishes to disclose provenance, it opens a new UNIX domain socket to send the provenance data to the Provenance Recorder. The Recorder uses its own UNIX domain socket to recover the process's pid, then uses the /proc filesystem to find the full path of the binary, then uses this information to look up the PAA in the IMA measurement list. The disclosed provenance is recorded only if the signature of the PAA matches a known-good cryptographic hash.
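A userspace sketch of this recorder-side check is shown below. SO_PEERCRED is one standard way for a UNIX-domain-socket server to recover the peer's pid; the comparison against the IMA measurement list is abbreviated into a hypothetical helper, since that lookup is specific to the recorder implementation.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: returns 1 if the binary at 'path' appears in the
 * IMA measurement list with a known-good hash, 0 otherwise. */
int ima_measurement_is_known_good(const char *path);

/* Decide whether a disclosure arriving on connected UNIX socket 'fd'
 * may enter the high-integrity provenance record. */
static int disclosure_allowed(int fd)
{
        struct ucred cred;
        socklen_t len = sizeof(cred);
        char link[64], exe[4096];
        ssize_t n;

        /* Recover the peer process's pid. */
        if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) != 0)
                return 0;

        /* Resolve the full path of the peer's binary via /proc. */
        snprintf(link, sizeof(link), "/proc/%d/exe", (int)cred.pid);
        n = readlink(link, exe, sizeof(exe) - 1);
        if (n < 0)
                return 0;
        exe[n] = '\0';

        /* Accept the disclosure only if IMA measured a known-good binary. */
        return ima_measurement_is_known_good(exe);
}
```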

As a demonstration of this functionality, we created a provenance-aware version of the popular ImageMagick utility.12 ImageMagick contains a batch conversion tool for image reformatting, mogrify. Shown in Figure 4, mogrify reads and writes multiple files during execution, leading to an overtainting problem: at the kernel layer, LPM is forced to conservatively assume that all outputs were derived from all inputs, creating false dependencies in the provenance record. To address this, we extended the Provmon protocol to support a new message, provmsg_imagemagick_convert, which links an input file directly to its output file. When the recorder receives this message, it first checks the list of IMA measurements to confirm that ImageMagick is in a good state. If successful, it then annotates the existing provenance graph, connecting the appropriate input and output objects with WasDerivedFrom relationships. Our instrumentation of ImageMagick demonstrates that LPM supports layered provenance at no additional cost over other provenance-aware systems [29, 43], and does so in a manner that provides assurance of the integrity of the provenance log.

3.5 Deployment

We now demonstrate how we used LPM in the deployment of a secure provenance-aware system. Additional background on the security technologies used in our deployment can be found in our technical report [8].

12 See http://www.imagemagick.org

3.5.1 Platform Integrity

We configured LPM to run on a physical machine with a Trusted Platform Module (TPM). The TPM provides a root of trust that allows for a measured boot of the system. The TPM also provides the basis for remote attestations to prove that LPM was in a known hardware and software configuration. The BIOS's core root of trust for measurement (CRTM) bootstraps a series of code measurements prior to the execution of each platform component. Once booted, the kernel then measures the code for user space components (e.g., provenance recorder) before launching them, through the use of the Linux Integrity Measurement Architecture (IMA) [52]. The result is then extended into TPM PCRs, which forms a verifiable chain of trust that shows the integrity of the system via a digital signature over the measurements. A remote verifier can use this chain to determine the current state of the system using TPM attestation.

We configured the system with Intel’s Trusted Boot(tboot),13 which provides a secure boot mechanism, pre-venting system from booting into the environment wherecritical components (e.g., the BIOS, boot loader andthe kernel) are modified. Intel tboot relies on the In-tel TXT14 to provide a secure execution environment.15 Additionally, we compiled support for IMA into theprovenance-aware kernel, which is necessary in order forthe LPM Recorder to be able to measure the integrity ofprovenance-aware applications.

3.5.2 Runtime Integrity

After booting into the provenance-aware kernel, the runtime integrity of the TCB (defined in §3.2) must also be assured. To protect the runtime integrity of the kernel, we deploy a Mandatory Access Control (MAC) policy, as implemented by Linux Security Modules. On our prototype deployments, we enabled SELinux's MLS policy, the security of which was formally modeled by Hicks et al. [33]. Refining the SELinux policy to prevent Access Vector Cache (AVC) denials on LPM components required minimal effort; the only denial we encountered was when using the PostgreSQL recorder, which was quickly remedied with the audit2allow tool. Preserving the integrity of LPM's user space components, such as the provenance recorder, was as simple as creating a new policy module. We created a policy module to protect the LPM recorder and storage back-end using the sepolicy utility. Uncompiled, the policy module was only 135 lines.

13 See http://sf.net/projects/tboot
14 See https://www.kernel.org/doc/Documentation/intel_txt.txt
15 For virtual environments, similar functionality can be provided on Xen via TPM sealing and the virtual TPM (vTPM), which is bound to the physical TPM of the host system.


4 Security

In this section, we demonstrate that our system meets all of the required security goals for trustworthy whole-system provenance. In this analysis, we consider an LPM deployment on a physical machine that was enabled with the Provmon module and has been configured to the conditions described in Section 3.5.

Complete (G1). We defined whole-system provenance as a complete description of agents (users, groups) controlling activities (processes) interacting with controlled data types during system execution (§3.1). LPM attempts to track these system objects through the placement of provenance hooks (§3.4.1), which directly follow each LSM authorization hook. The LSM's complete mediation property has been formally verified [20, 64]; in other words, there is an authorization hook prior to every security-sensitive operation. Because every interaction with a controlled data type is considered security-sensitive, we know that a provenance hook resides on all control paths to the provenance-sensitive operations. LPM is therefore capable of collecting complete provenance on the host.

It is important to note that, as a consequence of placing provenance hooks beneath authorization hooks, LPM is unable to record failed access attempts. However, inserting the provenance layer beneath the security layer ensures accuracy of the provenance record. Moreover, failed authorizations are a different kind of metadata than provenance because they do not describe processed data; this information is better handled at the security layer, e.g., by the SELinux Access Vector Cache (AVC) Log.

Tamperproof (G2). The runtime integrity of the LPM trusted computing base is assured via the SELinux MLS policy, and we have written a policy module that protects the LPM user space components (§3.5.2). Therefore, the only way to disable LPM would be to reboot the system into a different kernel; this action can be disallowed through secure boot techniques,13 and is detectable by remote hosts via TPM attestation (§3.5.1).

Verifiable (G3). While we have not conducted an independent formal verification of LPM, our argument for its correctness is as follows. A provenance hook follows each LSM authorization hook in the kernel. The correctness of LSM hook placement has been verified through both static and dynamic analysis techniques [20, 25, 34]. Because an authorization hook exists on the path of every sensitive operation to controlled data types, and LPM introduces a provenance hook behind each authorization hook, LPM inherits LSM's formal assurance of complete mediation over controlled data types. This is sufficient to ensure that LPM can collect the provenance of every sensitive operation on controlled data types in the kernel (i.e., whole-system provenance).

Authenticated Channel (G4). Through use of Netfilter hooks [57], LPM embeds a DSA signature in every outbound network packet. Signing occurs immediately prior to transmission, and verification occurs immediately after reception, making it impossible for an adversary-controlled application running in user space to interfere. For both transmission and reception, the signature is invisible to user space. Signatures are removed from the packets before delivery, and LPM feigns ignorance that the options field has been set if get_options is called. Hence, LPM can enforce that all applications participate in the commitment protocol.

Prior to implementing our own message commitment protocol in the kernel, we investigated a variety of existing secure protocols. The integrity and authenticity of provenance identifiers could also be protected via IPsec [36], SSL tunneling,16 or other forms of encapsulation [5, 65]. We elected to move forward with our approach because 1) it ensures the monitoring of all processes and network events, including non-IP packets, 2) it does not change the number of packets sent or received, ensuring that our provenance mechanism is minimally invasive to the rest of the Linux network stack, and 3) it preserves compatibility with non-LPM hosts. Alternatives to DSA signing would be HMAC [9], which offers better performance but requires pairwise keying and sacrifices non-repudiation; BLS, which approaches the theoretical maximum security parameter per byte of signature [12]; or online/offline signature schemes [15, 23, 26, 55].

Authenticated Disclosures (G5). We make use of IMA to protect the channel between LPM and provenance-aware applications wishing to disclose provenance. IMA is able to prove to the provenance recorder that the application was unmodified at the time it was loaded into memory, at which point the recorder can accept the provenance disclosure into the official record. If the application is known to be correct (e.g., through formal verification), this is sufficient to establish the runtime integrity of the application. However, if the application is compromised after execution, this approach is unable to protect against provenance forgery.

A separate consideration for all of the above security properties is Denial of Service (DoS) attacks. DoS attacks on LPM do not break its security properties. If an attacker launches a resource exhaustion attack in order to prevent provenance from being collected, all kernel operations will be disallowed and the host will cease to function. If a network attacker tampers with a packet's provenance identifier, the packet will not be delivered to the recipient application. In all cases, the provenance record remains an accurate reflection of system events.

16 See http://docs.oracle.com/cd/E23823_01/html/816-5175/kssl-5.html


Algorithm 1 Summarizes a's propagation through the system.

Require: a is an entity
 1: procedure Report(a)
 2:     Locations = [ ]                          ▷ Assigns an empty list.
 3:     for each s in {a} ∪ FindSuccessors(a) do
 4:         if s.type is File then
 5:             Locations.Add(<s.disk, s.directory>)
 6:         else if s.type is Network Packet then
 7:             Locations.Add(<s.remote_ip, s.port>)
 8:         end if
 9:     end for
10:     return Locations
11: end procedure

5 LPM Application: Provenance-Based Data Loss Prevention

To further demonstrate the power of LPM, we now introduce a set of mechanisms for Provenance-Based Data Loss Prevention (PB-DLP) that offer dramatically simplified administration and improved enforcement over existing DLP systems. A provenance-based approach is a novel and effective means of handling data loss prevention; to our knowledge, we are the first in the literature to do so. The advantage of our approach when compared to existing systems is that LPM-based provenance-aware systems already perform system-wide capture of information flows between kernel objects. Data loss prevention in such a system therefore becomes a matter of preventing all derivations of a sensitive source entity, e.g., a Payment Card Industry (PCI) database, from being written to a monitored destination entity (e.g., a network interface).

We begin by defining a policy format for PB-DLP. Individual rules take the form

<Srcs = [src1, src2, ..., srcn], dst>

where Srcs is a list of entities representing persistent data objects, and dst is a single entity representing either a persistent data object such as a file or interface or an abstract entity such as a remote host. The goal for PB-DLP is as follows: an entity e1 with ancestors A is written to entity e2 if and only if A ⊉ Srcs for all rules in the rule set where e2 = dst. The reason that sources are expressed as sets is that, at times, the union of information is more sensitive than its individual components. For example, sharing a person's last name or birthdate may be permissible, while sharing the last name and birthdate is restricted as PII.2

Below, we define the functions that realize this goal. First, we define two provenance-based functions as the basis for a DLP monitoring phase, which allows administrators to learn more about the propagation of sensitive data on their systems. Then, we define mechanisms for a DLP enforcement phase.

Algorithm 2 Mediates request to write e to d given Rules.

Require: e, d are entities
Require: Rules is a PB-DLP policy
 1: procedure ProvWrite(e, d, Rules)
 2:     for each rule in Rules do
 3:         if d = rule.dst then
 4:             A = FindAncestors(e)
 5:             NumSrcs = length(rule.Srcs)
 6:             for each src in rule.Srcs do
 7:                 if src in A then
 8:                     NumSrcs--
 9:                 end if
10:             end for
11:             if NumSrcs = 0 then              ▷ A ⊇ Srcs, deny.
12:                 return PB-DLP_DENY
13:             end if
14:         end if
15:     end for
16:     return PB-DLP_PERMIT                     ▷ A ⊉ Srcs, permit.
17: end procedure
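Algorithm 2 can be rendered in C roughly as follows, assuming a hypothetical provenance-graph API (entity_t, set_t, find_ancestors(), set_contains()); the sketch simply checks, for each rule guarding the destination, whether the candidate's ancestry covers every source in that rule.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical provenance-graph API assumed by this sketch. */
typedef struct entity entity_t;
typedef struct set set_t;
set_t *find_ancestors(const entity_t *e);
bool   set_contains(const set_t *s, const entity_t *e);

struct rule {
        const entity_t **srcs;   /* sensitive source entities   */
        size_t           nsrcs;
        const entity_t  *dst;    /* guarded destination entity  */
};

enum { PB_DLP_PERMIT, PB_DLP_DENY };

/* Mediate a request to write entity e to destination d under the rule set. */
static int prov_write(const entity_t *e, const entity_t *d,
                      const struct rule *rules, size_t nrules)
{
        for (size_t i = 0; i < nrules; i++) {
                size_t matched = 0;

                if (rules[i].dst != d)
                        continue;

                set_t *ancestors = find_ancestors(e);
                for (size_t j = 0; j < rules[i].nsrcs; j++)
                        if (set_contains(ancestors, rules[i].srcs[j]))
                                matched++;

                if (matched == rules[i].nsrcs)   /* ancestry covers all Srcs: deny */
                        return PB_DLP_DENY;
        }
        return PB_DLP_PERMIT;                    /* no rule fully covered: permit  */
}
```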

5.1 Monitoring Phase

The goal of monitoring is to allow administrators to reason about how sensitive data is stored and put to use on their systems. The end product of the monitor phase is a set of rules (a policy) that restrict the permissible flows for sensitive data sources. Monitoring is an ongoing process in DLP, where administrators attempt to iteratively improve protection against data leakage. The first step is to identify the data that needs protection. Identifying the source of such information is often quite simple; for example, a database of PCI or PII data. However, reliably finding data objects that were derived from this source is extraordinarily complicated using existing solutions, but is simple now with LPM. To begin, we define a helper function for system monitoring:

1. FindSuccessors(Entity): This function performs a provenance graph traversal to obtain the list of data objects derived from Entity.

FindSuccessors can then be used as the basis for a function that summarizes the spread of sensitive data:

2. Report(Entity): List the locations to which a target object and its successors have propagated. This function is defined in Algorithm 1.

The information provided by Report is similar to the data found in the Symantec DLP Dashboard [1], and could be used as the backbone of a PB-DLP user interface. Administrators can use this information to write a PB-DLP policy or revise an existing one.

5.2 Enforcement Phase

Possessing a PB-DLP policy, the goal of the enforcement phase is to prevent entities that were derived from sensitive sources from being written to restricted locations. To


do so, we need to inspect the object's provenance to discover the entities from which it was derived. We define the following helper function:

3. FindAncestors(Entity): This function performs a provenance graph traversal to obtain the list of data objects used in the creation of Entity.

FindAncestors can then be used as the basis for a function that prevents the spread of sensitive data:

4. ProvWrite(Entity, Destination, Rules): Write the target entity to the destination if and only if it is valid with respect to the provided rule set, as defined in Algorithm 2.

5.3 File Transfer Application

In many enterprise networks that are isolated from the Internet via firewalls and proxies, it is desirable to share files with external users. File transfer services are one way to achieve this, and provide a single entry/exit point to the enterprise network where files being transferred can be examined before being released.17 In the case of incoming files, scans can check for known malware, and in some cases, check for other types of malicious behavior from unknown malware.

We implemented PB-DLP as a file transfer application for provenance-aware systems using LPM's Provmon module. The application interfaced with LPM's SNAP recorder using a custom API. Before permitting a file to be transmitted to a remote host, the application ran a query that traversed WasDerivedFrom edges to return a list of the file's ancestors, permitting the transfer only if the file was not derived from a restricted source. PB-DLP allows internal users to share data, while ensuring that sensitive data is not exfiltrated in the process.

Because provenance graphs continue to grow indefinitely over time, in practice the bottleneck of this application is the speed of provenance querying. We evaluate the performance of PB-DLP queries in Section 6.3.

5.4 PB-DLP Analysis

Below, we select two open source systems that approximate label-based and regular expression (regex) based DLP solutions, and compare their benefits to PB-DLP.

5.4.1 Label-Based DLP

The SELinux MLS policy [31] provides information flow security through a label-based approach, and could be used to approximate a DLP solution without relying on commercial products.

17 Two examples of vendors that provide this capability are FireEye (http://www.fireeye.com) and Accellion (http://www.accellion.com/).

Proprietary label-based DLP systems rely on manual annotations provided by users, requiring them to provide correct labeling based on their knowledge of data content. Using SELinux as an exemplar labeling system is therefore an extremely conservative approach to analysis.

Within an MLS system, each subject and object is assigned a classification level and categories, or compartments. Consider an example system with classification levels {A, B}, with A dominating B, and compartments {α, β}. We can model our policy as a lattice, where each node in the lattice is a tuple of the form {<level>, {compartments}}. Once the policy is defined, it is possible to enforce the simple and *-properties. If a user has access to data with classification level A and compartment α, he cannot read anything in compartment {β} (no read-up). Furthermore, when data is accessed in A,{α}, the user cannot write anything to B,{α} (no write-down).
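For intuition, the dominance relation underlying the simple and *-properties can be sketched as a level comparison plus a compartment-superset test; this is a generic Bell-LaPadula illustration, not SELinux's internal representation.

```c
#include <stdbool.h>
#include <stdint.h>

/* An MLS label: a sensitivity level plus a bitmask of compartments
 * (e.g., bit 0 = alpha, bit 1 = beta). */
struct mls_label {
        int      level;
        uint32_t compartments;
};

/* a dominates b iff a's level is at least b's and a's compartments
 * are a superset of b's. */
static bool dominates(struct mls_label a, struct mls_label b)
{
        return a.level >= b.level &&
               (a.compartments & b.compartments) == b.compartments;
}

/* Simple security property (no read-up): subject s may read object o
 * only if dominates(s, o).  *-property (no write-down): s may write o
 * only if dominates(o, s). */
```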

In order to use SELinux's MLS enforcement as a DLP solution, the administrator configures the policy to enforce the constraint that no data of specified types can be sent over the network. However, this is difficult in practice. Consider an example system that processes PII. The users of the system may need to access information, such as last names, and send these to the payroll department to ensure that each employee receives a paycheck. Separately, the user may need to send a list of birthdays to another user in the department to coordinate birthday celebrations for each month. Either of these activities is acceptable (Figure 5, Decision Condition 2). However, it is common practice for organizations to have stricter sharing policies for data that contains multiple forms of PII, so while either of these identifiers could be transmitted in isolation, the two pieces of information combined could not be shared (Figure 5, Decision Condition 3).

The MLS policy cannot easily handle this type of data fusion. In order to provide support for correctly labeling fused data, an administrator would need to define the power set of all compartments within the MLS policy. In the example above, the administrator would define the following compartments: {}, {α}, {β}, {α,β}. In the default configuration SELinux supports 256 unique categories, meaning an SELinux DLP policy could only support eight types of data. Furthermore, the MLS policy does not support defining multiple categories within a single sensitivity level.18 This implies that the MLS policy cannot support having a security level for A,{α} and for A,{α,β}. Instead, the most restrictive labeling must be defined to protect the data on the system. In contrast, PB-DLP can support an arbitrary number of data fusions.

18 See the definition of level statements at http://selinuxproject.org/page/MLSStatements


Figure 5: A provenance graph of PII data objects that are first fused and then transformed. The numbers mark DLP decision conditions. Objects marked by green circles should not be restricted, while red octagons should be restricted. Label-Based DLP correctly handles data resembling PII (1, 2) and data transformations (4), but struggles with data fusions (3). Regex-Based DLP correctly identifies data fusions (3), but is prone to incorrect handling of data resembling PII (1) and fails to identify data transformations (4). PB-DLP correctly handles all conditions.

5.4.2 Regex-Based DLP

The majority of DLP software relies on pattern matching techniques to identify sensitive data. While enterprise solutions offer greater sophistication and customizability, their fundamental approach resembles that of Cornell's Spider,7 a forensics tool for identifying sensitive personal data (e.g., credit card or social security numbers). Because it is open source, we make use of Spider as an exemplar application for regex-based DLP.

Regex approaches are prone to false positives. Spider is pre-configured with a set of regular expressions for identifying potential PII, e.g., (\d{3}-\d{2}-\d{4}) identifies a social security number. However, it is common practice for developers to generate and distribute training datasets to aid in software testing (Figure 5, Decision Condition 1). Spider is oblivious to information flows, instead searching for content that bears structural similarity to PII, and therefore would be unable to distinguish between true PII and training data. PB-DLP tracks the propagation of data from its source onwards, and could trivially differentiate between true PII and training sets.
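A content-only matcher of the kind Spider uses can be sketched with POSIX regular expressions as shown below; the point is that it inspects bytes rather than information flows, which is exactly why it cannot tell training data from real PII.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if 'text' contains something shaped like an SSN
 * (ddd-dd-dddd). Purely syntactic: provenance is never consulted. */
static bool looks_like_ssn(const char *text)
{
        regex_t re;
        bool hit = false;

        if (regcomp(&re, "[0-9]{3}-[0-9]{2}-[0-9]{4}",
                    REG_EXTENDED | REG_NOSUB) == 0) {
                hit = (regexec(&re, text, 0, NULL, 0) == 0);
                regfree(&re);
        }
        return hit;
}
```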

Regex approaches are also prone to false negatives. Even after the most trivial data transformations, PII and PCI data is no longer identifiable to the Spider system (Figure 5, Decision Condition 4), permitting its exfiltration. To demonstrate, we generated a file full of random valid social security numbers that Spider was able to identify. We then ran gzip on the file and stored it in a second file. Spider was unable to identify the second file, but PB-DLP correctly identified both files as PII since the gzip output was derived from a sensitive input.

6 Evaluation

We now evaluate the performance of LPM. Our benchmarks were run on a bare metal server machine with 12 GB memory and 2 Intel Xeon quad core CPUs. The Red Hat 2.6.32 kernel was compiled and installed under 3 different configurations: all provenance disabled (Vanilla), LPM scaffolding installed but without an enabled module (LPM), and LPM installed with the Provmon module enabled (Provmon).

Test Type            Vanilla    LPM            Provmon
Process tests, times in µseconds (smaller is better)
null call            0.14       0.14 (0%)      0.14 (0%)
null I/O             0.21       0.21 (0%)      0.32 (52%)
stat                 1.57       1.6 (2%)       2.8 (78%)
open/close file      2.75       2.42 (-12%)    3.91 (42%)
signal install       0.25       0.25 (0%)      0.25 (0%)
signal handle        1.37       1.29 (-6%)     1.39 (1%)
fork process         380        396 (4%)       401 (6%)
exec process         873        879 (1%)       911 (4%)
shell process        2990       3000 (0%)      3113 (4%)
File and memory latencies in µseconds (smaller is better)
file create (0k)     11.5       11.2 (-3%)     15.8 (37%)
file delete (0k)     8.51       8.12 (-5%)     11.8 (39%)
file create (10k)    23.4       21.6 (-8%)     28.8 (23%)
file delete (10k)    12.5       12 (-4%)       14.7 (18%)
mmap latency         1062       1053 (-1%)     1120 (5%)
protect fault        0.32       0.3 (-6%)      0.346 (8%)
page fault           0.016      0.016 (0%)     0.016 (0%)
100 fd select        1.53       1.53 (0%)      1.53 (0%)

Table 1: LMBench measurements for LPM kernels. All times are in microseconds. Percent overhead for modified configurations is shown in parentheses.

6.1 Collection Performance

We used LMBench to microbenchmark LPM's impact on system calls as well as file and memory latencies. Table 1 shows the averaged results over 10 trials for each kernel, with a percent overhead calculation against Vanilla. For most measures, the performance differences between LPM and Vanilla are negligible. Comparing Vanilla to Provmon, there are several measures in which overhead is noteworthy: stat, open/close, file creation and deletion. Each of these benchmarks involves LMBench manipulating a temporary file that resides in /usr/tmp/lat_fs/. Because an absolute path is provided, before each system call occurs LMBench first traverses the path to the file, resulting in the creation of 3 different provenance events in Provmon's inode_permission hook, each of which is transmitted to user space via the kernel relay.


Figure 6: Growth of provenance storage overheads during kernel compilation. (Storage Cost (GBytes) vs. Time (Minutes); series: Raw Prov. Stream, Gzip Recorder, SNAP Recorder.)

Figure 7: Performance of ancestry queries for objects created during kernel compilation. (Cumulative Density vs. Response Time (Milliseconds).)

Figure 8: LPM network overhead can be reduced with batch signature schemes. (Iperf Performance: Throughput (Mbps) for Vanilla, LPM, Provmon, and Batch Sig.)

While the overheads seem large for these operations, the logging of these three events only imposes approximately 1.5 microseconds per traversal. Moreover, the overhead for opening and closing is significantly higher than the overhead of reads and writes (Null I/O); thus, the higher open/close costs are likely to be amortized over the course of regular system use. A provenance module could further reduce this overhead by maintaining state about past events within the kernel, then blocking the creation of redundant provenance records.
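A rough sketch of that deduplication idea follows; this is not code from the Provmon module, and the key fields and exec-triggered invalidation are illustrative assumptions about the state such a module could keep.

# Emit a provenance record only the first time a (subject, inode, operation)
# event is observed for the current program image. An actual module would keep
# this state in kernel memory.
seen = set()

def report_once(pid, inode, op, emit):
    key = (pid, inode, op)
    if key in seen:
        return                  # redundant event: already in the record
    seen.add(key)
    emit(pid, inode, op)        # hand the event to the recorder

def on_exec(pid):
    # A new program image begins a new provenance subject, so cached
    # events for this pid no longer apply.
    seen.difference_update({k for k in seen if k[0] == pid})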

Test             Vanilla   Provmon   Overhead
Kernel Compile   598 sec   614 sec   2.7%
Postmark         25 sec    27 sec    7.5%
Blast            376 sec   390 sec   4.8%

Table 2: Benchmarking results. Our provenance module imposed just 2.7% overhead on kernel compilation.

To gain a more practical sense of the costs of LPM, we also performed multiple benchmark tests that represent realistic system workloads. Each trial was repeated 5 times to ensure consistency. The results are summarized in Table 2. For the kernel compile test, we recompiled the kernel source (in a fixed configuration) while booted into each of the kernels. Each compilation occurred on 16 threads. The LPM scaffolding (without an enabled module) is not included, because in both tests it imposed less than 1% overhead. In spite of seemingly high overheads for file I/O, Provmon imposes just 2.7% overhead on kernel compilation, or 16 seconds. The Postmark test simulates the operation of an email server. It was configured to run 15,000 transactions with file sizes ranging from 4 KB to 1 MB in 10 subdirectories, with up to 1,500 simultaneous transactions. The Provmon module imposed just 7.5% overhead on this task. To estimate LPM's overhead for scientific applications, we ran the BLAST benchmarks (see footnote 19), which simulate biological sequence workloads obtained from analysis of hundreds of thousands of jobs from the National Institutes of Health.

19 See http://fiehnlab.ucdavis.edu/staff/kind/Collector/Benchmark/Blast_Benchmark

For kernel compile and Postmark, Provmon outperforms the PASS system, which exacted 15.6% and 11.5% overheads on kernel compilation and Postmark, respectively [43]. Provmon introduces comparable kernel compilation overhead to Hi-Fi (2.8%) [48]. It is difficult to compare our Blast results to SPADE and PASS, as both used a custom workload instead of a publicly available benchmark. SPADE reports an 11.5% overhead on a large workload [29], while PASS reports just a 0.7% overhead. Taken as a whole, though, LPM collection either meets or exceeds the performance of past systems, while providing additional security assurances.

6.2 Storage Overhead

A major challenge to automated provenance collection is the storage overhead incurred. We plotted the growth of provenance storage using different recorders during the kernel compilation benchmark, shown in Figure 6. LPM generated 3.7 GB of raw provenance. This required only 450 MB of storage with the Gzip recorder, but provenance cannot be efficiently queried in this format. The SNAP recorder builds an in-memory provenance graph. We approximated the storage overhead by polling the virtual memory consumed by the recorder process in the /proc filesystem. The SNAP graph required 1.6 GB of storage; the reduction from the raw provenance stream is due to the fact that redundant events did not lead to the creation of new graph components. In contrast, the PASS system generates 1.3 GB of storage overhead during kernel compilation while collecting less information (e.g., shared memory). LPM's storage overheads are thus comparable to other provenance-aware systems.
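The virtual-memory polling used to approximate the SNAP recorder's footprint can be sketched as follows. The pid, sampling interval, and choice of the VmSize field are assumptions; only the /proc/<pid>/status mechanism comes from the text above.

import time

def vm_kb(pid, field="VmSize"):
    # Read one memory field from /proc/<pid>/status (values are reported in kB).
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(f"{field} not found for pid {pid}")

def poll(pid, duration_s=600, interval_s=10):
    # Sample the recorder's virtual memory at a fixed interval.
    samples = []
    for _ in range(duration_s // interval_s):
        samples.append(vm_kb(pid))
        time.sleep(interval_s)
    return samples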

6.3 Query Performance (PB-DLP)

We evaluated query performance using our exemplar PB-DLP application and LPM's SNAP recorder. The provenance graph was populated using the routine from the kernel compile benchmark. This yielded a raw provenance stream of 3.7 GB, which was translated by the recorder into a graph of 6,513,398 nodes and 6,754,059 edges.


We were then able to use the graph to issue ancestry queries, in which the subgraphs were traversed to find the data object ancestors of a given file. Because we did not want ephemeral objects with limited ancestries to skew our results, we only considered the results of objects with more than 50 ancestors.

In the worst case, which was a node that had 17,696 ancestors, the query returned in just 21 milliseconds. Effectively, we were able to query object ancestries at line speed for network activity. We are confident that this approach can scale to large databases through pre-computation of expensive operations at data ingest, making it a promising strategy for provenance-aware distributed systems; however, we note that these results are highly dependent on the size of the graph. Our test graph, while large, would inevitably be dwarfed by the size of the provenance on long-lived systems. Fortunately, there are also a variety of techniques for reducing these costs. Bates et al. show that the results from provenance graph traversals can be extensively cached when using a fixed security policy, which would allow querying to amortize to a small constant cost [7]. LPM could also be extended to support deduplication [63, 62] and policy-based pruning [6, 13], both of which could further improve performance by reducing the size of provenance graphs.
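For reference, the core of such an ancestry query is a backwards traversal over derived-from edges. The sketch below is illustrative only: it uses a plain adjacency map rather than the SNAP recorder's actual in-memory representation.

from collections import deque

def ancestors(node, edges):
    # Walk backwards over derived-from edges until the frontier is exhausted.
    seen, frontier = set(), deque([node])
    while frontier:
        current = frontier.popleft()
        for parent in edges.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen

def is_sensitive(node, edges, sensitive_labels):
    # PB-DLP-style decision: block transmission if any ancestor is sensitive.
    return any(a in sensitive_labels for a in ancestors(node, edges))

# Example: C was derived from B, which was derived from sensitive source A.
edges = {"C": ["B"], "B": ["A"]}
print(is_sensitive("C", edges, {"A"}))   # True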

6.4 Message Commitment Protocol

Under each kernel configuration, we performed iperf benchmarking to discover LPM's impact on TCP throughput. iperf was launched in both client and server mode over localhost. The client was launched with 16 threads, two per CPU. Our results can be found in Figure 8. The Vanilla kernel reached 4490 Mbps. While the LPM framework imposed negligible cost compared to the vanilla kernel (4480 Mbps), Provmon's DSA-based message commitment protocol reduced throughput by an order of magnitude (482 Mbps). Through use of printk instrumentation, we found that the average overhead per packet signature was 1.2 ms.

This result is not surprising when compared to IPsec performance. IPsec's Authentication Header (AH) mode uses an HMAC-based approach to provide similar guarantees as our protocol. AH has been shown to reduce throughput by as much as half [16]. An HMAC approach is a viable alternative to establish integrity and data origin authenticity and would also fit into the options field, but would require the negotiation of IPsec security associations. Our message commitment protocol has the benefit of being fully interoperable with other hosts, and does not require a negotiation phase before communication occurs. Another option for increasing throughput would be to employ CPU instruction extensions [30] and a security co-processor [19] to accelerate the speed of DSA.

Yet another approach to reducing our impact on network performance would be to employ a batch signature scheme [11]. We tested this by transmitting a signature over every 10 packets during TCP sessions, and found that throughput increased by 3.3 times, to approximately 1600 Mbps. Because this overhead may not be suitable for some environments, Provmon can be configured to use Hi-Fi identifiers [48], which are vulnerable to network attack but impose negligible overhead. LPM's impact on network performance is specific to the particular module, and can be tailored to meet the needs of the system.
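A minimal sketch of the batching idea follows. The digest choice and the abstract sign() callback are assumptions, not the Provmon implementation, which signs with DSA and carries its commitments in the IP options field.

import hashlib

BATCH = 10   # one signature per 10 packets, matching the experiment above

class BatchSigner:
    def __init__(self, sign):
        self.sign = sign          # e.g., a DSA signing function (abstract here)
        self.digests = []

    def on_packet(self, payload: bytes):
        # Accumulate per-packet digests; sign only when a full batch is ready.
        self.digests.append(hashlib.sha1(payload).digest())
        if len(self.digests) == BATCH:
            batch_digest = hashlib.sha1(b"".join(self.digests)).digest()
            self.digests.clear()
            return self.sign(batch_digest)   # one expensive signature per batch
        return None                          # no signature for this packet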

7 Discussion

Without the aid of provenance-aware applications, LPM will struggle to accurately track dependencies through workflow layer abstractions. The most obvious example of such an abstraction is the copy/paste buffer in windowing applications like Xorg. This is a known side channel for kernel layer security mechanisms, one that has been addressed by the Trusted Solaris project [10], Trusted X [21, 22], the SELinux-aware X window system [37], SecureView (see footnote 20), and General Dynamics' TVE (see footnote 21). Without provenance-aware windowing, LPM will conservatively assume that all files opened for reading are dependencies of the paste buffer, leading to false dependencies. LPM is also unable to observe system side channels, such as timing channels or L2 cache measurements [51], a limitation shared by many other security solutions [18].

Although we have not presented a secure distributed provenance-aware system in this work, LPM provides the foundation for the creation of such a system. In the presented modules, provenance is stored locally by the host and retrieved on an as-needed basis from other hosts. This raises availability concerns as hosts inevitably begin to fail. Availability could be improved with minimal performance and storage overheads through Gehani et al.'s approach of duplicating provenance at k neighbors with a limited graph depth d [27, 28].

Finally, LPM does not address the matter of provenance confidentiality; this is an important challenge that is explored elsewhere in the literature [14, 46]. LPM's Recorders provide interfaces that can be used to introduce an access control layer onto the provenance store.

8 Related Work

While myriad provenance-aware systems have been proposed in the literature, the majority disclose provenance within an application [32, 41, 65] or workflow [24, 58].

20 See http://www.ainfosec.com/secureview
21 See http://gdc4s.com/tve.html


It is difficult or impossible to obtain complete provenance in this manner. This is because system events that occur outside of the application, but still affect its execution, will not appear in the provenance record.

The alternative to disclosed systems is automatic provenance-aware systems, which collect provenance transparently within the operating system. Gehani et al.'s SPADE is a multi-platform system for eScience and grid computing audiences, with an emphasis on low latency and availability in distributed environments [29]. SPADE's provenance reporters make use of familiar application layer utilities to generate provenance, such as polling ps for process information and lsof for network information. This gives rise to the possibility of incomplete provenance due to race conditions. The PASS project collects the provenance of system calls at the virtual filesystem (VFS) layer. PASSv1 provides base functions for provenance collection that observe processes' file I/O activity [43]. Because these basic functions are manually placed around the kernel, there is no clear way to extend PASSv1 to support additional collection hooks; we address this limitation in the modular design of LPM. PASSv2 introduces a Disclosed Provenance API for tighter integration between provenance collected at different layers of abstraction, e.g., at the application layer [44]. PASSv2 assumes that disclosing processes are benign, while LPM provides a secure disclosure mechanism for attesting the correctness of provenance-aware applications. Both SPADE and PASS are designed for benign environments, making no attempt to protect their collection mechanisms from an adversary.

Previous work has considered the security of provenance under relaxed threat models relative to LPM's. In SProv, Hasan et al. introduce provenance chains, cryptographic constructs that prevent the insertion or deletion of provenance inside of a series of events [32]. SProv effectively demonstrates the authentication properties of this primitive, but is not intended to serve as a secure provenance-aware system; attackers can still append false records to the end of the chain, delete the whole chain, or disable the library altogether. Zhou et al. consider provenance corruption an inevitability, and show that provenance can detect some malicious hosts in distributed environments provided that a critical mass of correct hosts still exists [65]. They later strengthen these assurances through use of provenance-aware software-defined networking [5]. These systems consider only network events, and are unable to speak to the internal state of hosts. Lyle and Martin sketch the design for a secure provenance monitor based on trusted computing [40]. However, they conceptualize provenance as a TPM-aided proof of code execution, overlooking interprocess communication and other system activity that could inform execution results, and therefore offer information that is too coarse-grained to meet the needs of some applications. Moreover, to the best of our knowledge their system is unimplemented.

The most promising model to date for secure provenance collection is Pohly et al.'s Hi-Fi system [48]. Hi-Fi is a Linux Security Module (LSM) that collects whole-system provenance that details the actions of processes, IPC mechanisms, and even the kernel itself (which does not exclusively use system calls). Hi-Fi attempts to provide a provenance reference monitor [42], but remains vulnerable to the provenance-aware adversary that we describe in Section 3.2. Enabling Hi-Fi blocks the installation of other LSMs, such as SELinux or Tomoyo, or requires a third party patch to permit module stacking. This blocks the installation of MAC policy on the host, preventing runtime integrity assurances. Hi-Fi is also vulnerable to adversaries in the network, who can strip the provenance identifiers from packets in transit, resulting in irrecoverable provenance. Unlike LPM, Hi-Fi does not attempt to provide layered provenance services, and therefore does not consider the integrity and authenticity of provenance-aware applications.

Provenance collection is a form of information flow monitoring that is related to, but fundamentally distinct from, past areas of study. Due to space constraints, our discussion of Information Flow Control (IFC) systems has been relegated to our technical report [8].

9 Conclusion

In this work, we have presented LPM, a platform for the creation of trusted provenance-aware execution environments. Our system imposes as little as 2.7% performance overhead on normal system operation, and can respond to queries about data object ancestry in tens of milliseconds. We have used LPM as the foundation of a provenance-based data loss prevention system that can scan file transmissions to detect the presence of sensitive ancestors in just tenths of a second. The Linux Provenance Module Framework is an exciting step forward for both provenance- and security-conscious communities.

Acknowledgements

We would like to thank Rob Cunningham, Alin Dobra, Will Enck, Jun Li, Al Malony, Patrick McDaniel, Daniela Oliveira, Nabil Schear, Micah Sherr, and Patrick Traynor for their valuable comments and insight, as well as Devin Pohly for his sustained assistance in working with Hi-Fi, and Mugdha Kumar for her help developing LPM SPADE support. This work was supported in part by the US National Science Foundation under grant numbers CNS-1118046, CNS-1254198, and CNS-1445983.


Availability

The LPM code base, including all user space utilities and patches for both Red Hat and the mainline Linux kernels, is available at http://linuxprovenance.org.

References

[1] Symantec Data Loss Prevention Customer Brochure. http://www.symantec.com/data-loss-prevention.
[2] What's Yours is Mine: How Employees are Putting Your Intellectual Property at Risk. https://www4.symantec.com/mktginfo/whitepaper/WP_WhatsYoursIsMine-HowEmployeesarePuttingYourIntellectualPropertyatRisk_dai211501_cta69167.pdf.
[3] R. Aldeco-Pérez and L. Moreau. Provenance-based Auditing of Private Data Use. In Proceedings of the 2008 International Conference on Visions of Computer Science, VoCS'08, Sept. 2008.
[4] J. P. Anderson. Computer Security Technology Planning Study. Technical Report ESD-TR-73-51, Air Force Electronic Systems Division, 1972.
[5] A. Bates, K. Butler, A. Haeberlen, M. Sherr, and W. Zhou. Let SDN Be Your Eyes: Secure Forensics in Data Center Networks. In NDSS Workshop on Security of Emerging Network Technologies, SENT, Feb. 2014.
[6] A. Bates, K. R. B. Butler, and T. Moyer. Take Only What You Need: Leveraging Mandatory Access Control Policy to Reduce Provenance Storage Costs. In 7th Workshop on the Theory and Practice of Provenance, TaPP'15, July 2015.
[7] A. Bates, B. Mood, M. Valafar, and K. Butler. Towards Secure Provenance-based Access Control in Cloud Environments. In Proceedings of the 3rd ACM Conference on Data and Application Security and Privacy, CODASPY '13, pages 277–284, New York, NY, USA, 2013. ACM.
[8] A. Bates, D. Tian, K. R. B. Butler, and T. Moyer. Linux Provenance Modules: Trustworthy Whole-System Provenance for the Linux Kernel. Technical Report REP-2015-578, University of Florida CISE Dept, 2015.
[9] M. Bellare, R. Canetti, and H. Krawczyk. Keyed Hash Functions and Message Authentication. In Proceedings of Crypto'96, volume 1109 of LNCS, pages 1–15, 1996.
[10] M. Bellis, S. Lofthouse, H. Griffin, and D. Kucukreisoglu. Trusted Solaris 8 4/01 Security Target. 2003.
[11] A. Bittau, D. Boneh, M. Hamburg, M. Handley, D. Mazieres, and Q. Slack. Cryptographic Protection of TCP Streams (tcpcrypt). https://tools.ietf.org/html/draft-bittau-tcp-crypt-01.
[12] D. Boneh, B. Lynn, and H. Shacham. Short Signatures from the Weil Pairing. In C. Boyd, editor, Advances in Cryptology – ASIACRYPT 2001, volume 2248 of Lecture Notes in Computer Science, pages 514–532. Springer Berlin Heidelberg, 2001.
[13] U. Braun, S. L. Garfinkel, D. A. Holland, K.-K. Muniswamy-Reddy, and M. I. Seltzer. Issues in Automatic Provenance Collection. In International Provenance and Annotation Workshop, pages 171–183, 2006.
[14] U. Braun and A. Shinnar. A Security Model for Provenance. Technical Report TR-04-06, Harvard University Computer Science Group, 2006.
[15] D. Catalano, M. Di Raimondo, D. Fiore, and R. Gennaro. Off-line/On-line Signatures: Theoretical Aspects and Experimental Results. In PKC'08: Proceedings of the 11th International Conference on Practice and Theory in Public Key Cryptography, pages 101–120, Berlin, Heidelberg, 2008. Springer-Verlag.
[16] S. Chaitanya, K. Butler, A. Sivasubramaniam, P. McDaniel, and M. Vilayannur. Design, Implementation and Evaluation of Security in iSCSI-based Network Storage Systems. In Proceedings of the Second ACM Workshop on Storage Security and Survivability, StorageSS '06, pages 17–28, New York, NY, USA, 2006. ACM.
[17] D. D. Clark and D. R. Wilson. A Comparison of Commercial and Military Computer Security Policies. In IEEE S&P, Oakland, CA, USA, Apr. 1987.
[18] D. Cock, Q. Ge, T. Murray, and G. Heiser. The Last Mile: An Empirical Study of Some Timing Channels on seL4. In ACM Conference on Computer and Communications Security, pages 570–581, Scottsdale, AZ, USA, Nov. 2014.
[19] J. Dyer, M. Lindemann, R. Perez, R. Sailer, L. van Doorn, and S. Smith. Building the IBM 4758 Secure Coprocessor. Computer, 34(10):57–66, Oct. 2001.
[20] A. Edwards, T. Jaeger, and X. Zhang. Runtime Verification of Authorization Hook Placement for the Linux Security Modules Framework. In Proceedings of the 9th ACM Conference on Computer and Communications Security, CCS'02, 2002.
[21] J. Epstein and J. Picciotto. Trusting X: Issues in Building Trusted X Window Systems -or- What's Not Trusted About X. In Proceedings of the 14th Annual National Computer Security Conference, 1991.
[22] J. Epstein and M. Shugerman. A Trusted X Window System Server for Trusted Mach. In USENIX MACH Symposium, pages 141–156, 1990.
[23] S. Even, O. Goldreich, and S. Micali. On-line/Off-line Digital Signatures. In Proceedings on Advances in Cryptology, CRYPTO '89, pages 263–275, New York, NY, USA, 1989. Springer-Verlag New York, Inc.
[24] I. T. Foster, J.-S. Vöckler, M. Wilde, and Y. Zhao. Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation. In Proceedings of the 14th Conference on Scientific and Statistical Database Management, SSDBM'02, July 2002.
[25] V. Ganapathy, T. Jaeger, and S. Jha. Automatic Placement of Authorization Hooks in the Linux Security Modules Framework. In Proceedings of the 12th ACM Conference on Computer and Communications Security, CCS '05, pages 330–339, New York, NY, USA, 2005. ACM.
[26] C.-z. Gao and Z.-a. Yao. A Further Improved Online/Offline Signature Scheme. Fundam. Inf., 91:523–532, August 2009.
[27] A. Gehani, B. Baig, S. Mahmood, D. Tariq, and F. Zaffar. Fine-grained Tracking of Grid Infections. In Proceedings of the 11th IEEE/ACM International Conference on Grid Computing, GRID'10, Oct. 2010.
[28] A. Gehani and U. Lindqvist. Bonsai: Balanced Lineage Authentication. In Proceedings of the 23rd Annual Computer Security Applications Conference, ACSAC'07, Dec. 2007.
[29] A. Gehani and D. Tariq. SPADE: Support for Provenance Auditing in Distributed Environments. In Proceedings of the 13th International Middleware Conference, Middleware '12, Dec. 2012.
[30] S. Gueron and V. Krasnov. Speed Up Big-Number Multiplication Using Single Instruction Multiple Data (SIMD) Architectures, June 7 2012. US Patent App. 13/491,141.


[31] C. Hanson. SELinux and MLS: Putting The Pieces Together. In Proceedings of the 2nd Annual SELinux Symposium, 2006.
[32] R. Hasan, R. Sion, and M. Winslett. The Case of the Fake Picasso: Preventing History Forgery with Secure Provenance. In Proceedings of the 7th USENIX Conference on File and Storage Technologies, FAST'09, San Francisco, CA, USA, Feb. 2009.
[33] B. Hicks, S. Rueda, L. St.Clair, T. Jaeger, and P. McDaniel. A Logical Specification and Analysis for SELinux MLS Policy. ACM Trans. Inf. Syst. Secur., 13(3):26:1–26:31, July 2010.
[34] T. Jaeger, A. Edwards, and X. Zhang. Consistency Analysis of Authorization Hook Placement in the Linux Security Modules Framework. ACM Trans. Inf. Syst. Secur., 7(2):175–205, May 2004.
[35] T. Jaeger, R. Sailer, and U. Shankar. PRIMA: Policy-reduced Integrity Measurement Architecture. In Proceedings of the 11th ACM Symposium on Access Control Models and Technologies, SACMAT '06, pages 19–28, New York, NY, USA, 2006. ACM.
[36] S. Kent and R. Atkinson. RFC 2406: IP Encapsulating Security Payload (ESP). 1998.
[37] D. Kilpatrick, W. Salamon, and C. Vance. Securing the X Window System with SELinux. Technical report, Jan. 2003.
[38] L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Commun. ACM, 21(7):558–565, July 1978.
[39] H. Luhn. Computer for Verifying Numbers, Aug. 23 1960. US Patent 2,950,048.
[40] J. Lyle and A. Martin. Trusted Computing and Provenance: Better Together. In 2nd Workshop on the Theory and Practice of Provenance, TaPP'10, Feb. 2010.
[41] P. Macko and M. Seltzer. A General-purpose Provenance Library. In 4th Workshop on the Theory and Practice of Provenance, TaPP'12, June 2012.
[42] P. McDaniel, K. Butler, S. McLaughlin, R. Sion, E. Zadok, and M. Winslett. Towards a Secure and Efficient System for End-to-End Provenance. In Proceedings of the 2nd Conference on Theory and Practice of Provenance, San Jose, CA, USA, Feb. 2010. USENIX Association.
[43] K. Muniswamy-Reddy, D. A. Holland, U. Braun, and M. Seltzer. Provenance-Aware Storage Systems. In Proceedings of the 2006 USENIX Annual Technical Conference, 2006.
[44] K.-K. Muniswamy-Reddy, U. Braun, D. A. Holland, P. Macko, D. Maclean, D. Margo, M. Seltzer, and R. Smogor. Layering in Provenance Systems. In Proceedings of the 2009 Conference on USENIX Annual Technical Conference, ATC'09, June 2009.
[45] D. Nguyen, J. Park, and R. Sandhu. Dependency Path Patterns as the Foundation of Access Control in Provenance-aware Systems. In Proceedings of the 4th USENIX Conference on Theory and Practice of Provenance, TaPP'12, pages 4–4, Berkeley, CA, USA, 2012. USENIX Association.
[46] Q. Ni, S. Xu, E. Bertino, R. Sandhu, and W. Han. An Access Control Language for a General Provenance Model. In Secure Data Management, Aug. 2009.
[47] J. Park, D. Nguyen, and R. Sandhu. A Provenance-Based Access Control Model. In Proceedings of the 10th Annual International Conference on Privacy, Security and Trust (PST), pages 137–144, 2012.
[48] D. Pohly, S. McLaughlin, P. McDaniel, and K. Butler. Hi-Fi: Collecting High-Fidelity Whole-System Provenance. In Proceedings of the 2012 Annual Computer Security Applications Conference, ACSAC '12, Orlando, FL, USA, 2012.
[49] J. Postel. RFC 791: Internet Protocol. 1981.
[50] A. C. Revkin. Hacked E-mail is New Fodder for Climate Dispute. New York Times, 20, 2009.
[51] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage. Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds. In Proceedings of the 16th ACM Conference on Computer and Communications Security (CCS'09), pages 199–212, Chicago, IL, USA, Oct. 2009. ACM.
[52] R. Sailer, X. Zhang, T. Jaeger, and L. van Doorn. Design and Implementation of a TCG-based Integrity Measurement Architecture. In SSYM'04: Proceedings of the 13th Conference on USENIX Security Symposium, pages 16–16, Berkeley, CA, USA, 2004. USENIX Association.
[53] C. Sar and P. Cao. Lineage File System. http://crypto.stanford.edu/~cao/lineage.html.
[54] J. Seibert, G. Baah, J. Diewald, and R. Cunningham. Using Provenance To Expedite MAC Policies (UPTEMPO) (Previously Known as IPDAM). Technical Report USTC-PM-015, MIT Lincoln Laboratory, October 2014.
[55] A. Shamir and Y. Tauman. Improved Online/Offline Signature Schemes. In J. Kilian, editor, Advances in Cryptology — CRYPTO 2001, volume 2139 of Lecture Notes in Computer Science, pages 355–367. Springer Berlin / Heidelberg, 2001.
[56] D. Tariq, B. Baig, A. Gehani, S. Mahmood, R. Tahir, A. Aqil, and F. Zaffar. Identifying the Provenance of Correlated Anomalies. In Proceedings of the 2011 ACM Symposium on Applied Computing, SAC '11, Mar. 2011.
[57] The Netfilter Core Team. The Netfilter Project: Packet Mangling for Linux 2.4. http://www.netfilter.org/, 1999.
[58] J. Widom. Trio: A System for Integrated Management of Data, Accuracy, and Lineage. Technical Report 2004-40, Stanford InfoLab, Aug. 2004.
[59] World Wide Web Consortium. PROV-Overview: An Overview of the PROV Family of Documents. http://www.w3.org/TR/prov-overview/, 2013.
[60] C. Wright, C. Cowan, J. Morris, S. Smalley, and G. Kroah-Hartman. Linux Security Module Framework. In Ottawa Linux Symposium, page 604, 2002.
[61] C. Wright, C. Cowan, S. Smalley, J. Morris, and G. Kroah-Hartman. Linux Security Modules: General Security Support for the Linux Kernel. In Proceedings of the 11th USENIX Security Symposium, pages 17–31, Berkeley, CA, USA, 2002. USENIX Association.
[62] Y. Xie, K.-K. Muniswamy-Reddy, D. Feng, Y. Li, and D. D. E. Long. Evaluation of a Hybrid Approach for Efficient Provenance Storage. Trans. Storage, 9(4):14:1–14:29, Nov. 2013.
[63] Y. Xie, K.-K. Muniswamy-Reddy, D. D. E. Long, A. Amer, D. Feng, and Z. Tan. Compressing Provenance Graphs, June 2011.
[64] X. Zhang, A. Edwards, and T. Jaeger. Using CQUAL for Static Analysis of Authorization Hook Placement. In Proceedings of the 11th USENIX Security Symposium, 2002.
[65] W. Zhou, Q. Fei, A. Narayan, A. Haeberlen, B. T. Loo, and M. Sherr. Secure Network Provenance. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles, SOSP'11, Oct. 2011.
[66] W. Zhou, M. Sherr, T. Tao, X. Li, B. T. Loo, and Y. Mao. Efficient Querying and Maintenance of Network Provenance at Internet-Scale. In ACM SIGMOD International Conference on Management of Data (SIGMOD), June 2010.
