this message will self-destruct: self-easing covert

Purdue University Purdue University

Purdue e-Pubs Purdue e-Pubs

Department of Computer Science Technical Reports Department of Computer Science

2007

This message will self-destruct: Self-easing covert This message will self-destruct: Self-easing covert

communication communication

Mikhail J. Atallah Purdue University, [email protected]

Mercan Topkara

Umut Topkara

Report Number: 07-011

Atallah, Mikhail J.; Topkara, Mercan; and Topkara, Umut, "This message will self-destruct: Self-easing covert communication" (2007). Department of Computer Science Technical Reports. Paper 1675. https://docs.lib.purdue.edu/cstech/1675

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information.

https://docs.lib.purdue.edu/

https://docs.lib.purdue.edu/cstech

https://docs.lib.purdue.edu/cstech

https://docs.lib.purdue.edu/comp_sci

THIS MESSAGE WILL SELF-DESTRUCT: SELF-ERASING COVERT COMMUNICATION

Mikhail J. Atallah Mercan Topkara Umut Topkara

Department of Computer Science Purdue University

West Lafayette, IN 47907

CSD TR #07-011 May 2007

TillS MESSAGE WILL SELF-DESTRUCT:SELF-ERASING COVERT COMMUNICATION

Mikhail J. AtallahMercan TopkaraUmut Topkara

Department of Computer SciencePurdue University

West Lafayette, IN 47907

CSD TR #07-011May 2007

This message will self-destruct: Self-erasing covert communication*

Mikhail J. Atallah Mercan Topkara Umut Topkara Department of Computer Sciences

Purdue University West Lafayette, IN, 47906, USA

mja,mkarahan,utopkara @ cs.purdue.edu

May 1 1,2007

Abstract

The WWW increasingly allows people to create and update content for public access. Some of this information is collaboratively owned (created and maintained), while other information is privately owned and maintained (but still publicly accessible). Whereas it is unethical to modify the former for covert communication, it is quite legitimate to do so with the latter, and this paper gives a design for doing so while achieving both plausible deniability and automatic perishability of the covert message (the message disappears unless periodically refreshed by the encoder). Traditional information-hiding has looked at the problem of embedding a message in a static version of an online document, the problem of doing so for rapidly evolving document collections has not been considered in the past. This paper shows that it is possible to do so, and in a manner that actually makes use of the rapidly evolving nature of the documents to achieve the above-mentioned property of evanescence: That the message decays over time and eventually becomes completely erased unless it is refreshed. Therefore the mark needs to be continuously maintained as the document evolves, in a manner that prevents the adversary from knowing who is doing the refreshing yet that allows the intended reader of the mark to recover it without any form of explicit communication. One advantage of our scheme is that the mark's reach is now unbounded: It can be read by any authorized entity on the web (anyone with the secret key), and the reading of it is indistinguishable from normal web access patterns. Another advantage is the "hiding in the crowd" effect: Many people are updating the documents, thereby providing a cover for the one person surreptitiously injecting and refreshing the mark, or replacing it with another mark message. We have also demonstrated the feasibility of the proposed technique, and shown that remarkably little effort is required to implement our scheme over today's web.

"Portions of this work were supported by Grants IIS- 0325345, IIS-0219560, IIS-0312357, and IIS-0242421 from the National Science Foundation, and by sponsors of the Center for Education and Research in Information Assurance and Security.

This message will self-destruct: Self-erasing covert communication*

Mikhail J. Atallah Mercan Topkara Umut TopkaraDepartment of Computer Sciences

Purdue UniversityWest Lafayette, IN, 47906, USA

mja,mkarahan,[email protected]

May 11,2007

Abstract

The WWW increasingly allows people to create and update content for public access. Some ofthis information is collaboratively owned (created and maintained), while other information is privatelyowned and maintained (but still publicly accessible). Whereas it is unethical to modify the former forcovert communication, it is quite legitimate to do so with the latter, and this paper gives a design fordoing so while achieving both plausible deniability and automatic perishability of the covert message(the message disappears unless periodically refreshed by the encoder). Traditional information-hidinghas looked at the problem of embedding a message in a static version of an online document, the problemof doing so for rapidly evolving document collections has not been considered in the past. This papershows that it is possible to do so, and in a manner that actually makes use of the rapidly evolving natureof the documents to achieve the above-mentioned property of evanescence: That the message decaysover time and eventually becomes completely erased unless it is refreshed. Therefore the mark needsto be continuously maintained as the document evolves, in a manner that prevents the adversary fromknowing who is doing the refreshing yet that allows the intended reader of the mark to recover it withoutany form of explicit communication. One advantage of our scheme is that the mark's reach is nowunbounded: It can be read by any authorized entity on the web (anyone with the secret key), and thereading of it is indistinguishable from normal web access patterns. Another advantage is the "hidingin the crowd" effect: Many people are updating the documents, thereby providing a cover for the oneperson surreptitiously injecting and refreshing the mark, or replacing it with another mark message. Wehave also demonstrated the feasibility of the proposed technique, and shown that remarkably little effortis required to implement our scheme over today's web.

'Portions of this work were supported by Grants IIS- 0325345, IIS-0219560, IIS-0312357, and IIS-0242421 from the NationalScience Foundation, and by sponsors of the Center for Education and Research in Information Assurance and Security.

1 Introduction

Although messages that self-destruct were featured in spy movies, their potential usefulness is not limited to the original purpose of their self-destruction in such movies, which was to prevent their being read by a hostile adversary after they had served their useful purpose of communicating the next mission to Mr Phelps. The case for making such messages perishable includes many possible reasons: (i) Such messages can become stale and thereby convey misleading information to their intended recipient (e.g., the message "Alice and Bob are both doing fine" after one of them ceases to be doing fine) - more generally, because information is perishable and becomes stale, useless, or even dangerous as the world changes, it stands to reason that stealthy messages that convey that information should be similarly perishable and self-efface; (ii) if the message involves a secret key, then the longer it lingers, the more it is likely that it (and the key used to hide it) may eventually be compromised by an adversary who has enough computational resources (or will have such resources in the future - systems have to be resilient not only against the computational power of today but also against that of the future); (iii) the desired updating of the message (either refreshing it or replacing it with another message) by the person who wrote it may become infeasible through that person's accidental loss of the secret key, loss of access to the online world, or other physical inability to take such action; (iv) the automatic disappearance of the message can be used to communicate the very fact that (iii) has occurred; (v) the person in charge of removing the message may be negligent, or may erroneously believe that someone else was supposed to do the removal, etc.

Providing privacy preserving Web-based communication is an active research area [2 ,7, 151. Achieving this is hard because many players (such as authoritarian governments, aggressive marketers, etc) want to have a complete profile of Web users and a log of their actions on the Web. We propose a private communication channel, that ensures plausible deniability, and automatic perishability of the messages. We achieve this goal through the use of collaborative web content available on the Internet, without unethically interfering with the functionality of these valuable services, and without any need for modification of publicly available data (defined as data not owned by the sender of the message). In a nutshell, our scheme is based on pairing a privately owned web page with a collaboratively owned web page, and to use this pair as a cover document. The embedding changes are only performed on the privately owned web page. A more detailed summary of this scheme is given in Section 3.

There are many challenges involved in designing such a system:

how to use, for marking purposes, the content that we cannot modify either because we have no control over it (e.g., news portals), or because it is unethical to use the ability to modify it for marking purposes (e.g., wikis, forums)

heterogeneous content; most published marking schemes assume one type of content, whereas we are now faced with a semi-structured collection of different content types (text, images, audio, video, annotations, etc)

not interfering with the proper functioning of the publicly owned covers used

providing controlled perishability

being stealthy while using a publicly accessible cover document (as is customarily required in information hiding)

providing plausible deniability of the covert communication

providing high covert communication bandwidth (especially challenging when the document consists of data that has low embedding capacity, such as text)

1 Introduction

Although messages that self-destruct were featured in spy movies, their potential usefulness is not limitedto the original purpose of their self-destruction in such movies, which was to prevent their being read bya hostile adversary after they had served their useful purpose of communicating the next mission to MrPhelps. The case for making such messages perishable includes many possible reasons: (i) Such messagescan become stale and thereby convey misleading information to their intended recipient (e.g., the message"Alice and Bob are both doing fine" after one of them ceases to be doing fine) - more generally, becauseinformation is perishable and becomes stale, useless, or even dangerous as the world changes, it stands toreason that stealthy messages that convey that information should be similarly perishable and self-efface; (ii)ifthe message involves a secret key, then the longer it lingers, the more it is likely that it (and the key used tohide it) may eventually be compromised by an adversary who has enough computational resources (or willhave such resources in the future - systems have to be resilient not only against the computational powerof today but also against that of the future); (iii) the desired updating of the message (either refreshing it orreplacing it with another message) by the person who wrote it may become infeasible through that person'saccidental loss of the secret key, loss of access to the online world, or other physical inability to take suchaction; (iv) the automatic disappearance of the message can be used to communicate the very fact that (iii)has occurred; (v) the person in charge of removing the message may be negligent, or may erroneouslybelieve that someone else was supposed to do the removal, etc.

Providing privacy preserving Web-based communication is an active research area [2, 7, 15]. Achievingthis is hard because many players (such as authoritarian governments, aggressive marketers, etc) want tohave a complete profile of Web users and a log of their actions on the Web. We propose a private communication channel, that ensures plausible deniability, and automatic perishability of the messages. We achievethis goal through the use of collaborative web content available on the Internet, without unethically interfering with the functionality of these valuable services, and without any need for modification of publiclyavailable data (defined as data not owned by the sender of the message). In a nutshell, our scheme is basedon pairing a privately owned web page with a collaboratively owned web page, and to use this pair as a coverdocument. The embedding changes are only performed on the privately owned web page. A more detailedsummary of this scheme is given in Section 3.

There are many challenges involved in designing such a system:

• how to use, for marking purposes, the content that we cannot modify either because we have nocontrol over it (e.g., news portals), or because it is unethical to use the ability to modify it for markingpurposes (e.g., wikis, forums)

• heterogeneous content; most published marking schemes assume one type of content, whereas weare now faced with a semi-structured collection of different content types (text, images, audio, video,annotations, etc)

• not interfering with the proper functioning of the publicly owned covers used

• providing controlled perishability

• being stealthy while using a publicly accessible cover document (as is customarily required in information hiding)

• providing plausible deniability of the covert communication

• providing high covert communication bandwidth (especially challenging when the document consistsof data that has low embedding capacity, such as text)

2

The difficulty of these challenges is exacerbated by the increasing power that a potential adversary can muster - the repertoire of information sources for such an adversary now includes forum or blog boards (most or least accessed pages, most active member etc.), web-bots, ISP logs, search engines, web page tracking engines[9,6], to mention a few.

Our system fulfills most of the requirements listed for a Web based publishing system listed by Wald- man et al. [15]: censorship-resistant, tamper evident, source anonymous, updatable, deniable, fault tolerant, persistent (i.e., no expiration date), extensible, freely available. As will become clear later, the only requirement we do not provide is persistency over time (which is inherently incompatible with a self-destructing message).

Previously proposed private communication systems use a third party distributor (e.g., e-mail services [2]) to store and distribute the message to intended receivers. In our system, the sender does not use a third party distributor and therefore has a greater degree of control. The sender also has the option of privately storing the cover document (until it is cached by another Internet company [ l 1, 81) if shehe wishes to maintain the privately owned web page that is used as part of the cover document.

Refer to Section 5 for more information about available systems. In Section 2, we introduce the model of Web content used in the communication channel described in

this paper. The system overview is provided in Section 3. The experimental framework and results are discussed in Section 4.

2 Document Models

Before going into the details, we give a brief overview of how our system works. In what follows, by referring to something as "secret" we mean that knowing it requires the key that is shared by the encoder and decoder of the message.

Perishability is achieved by a secret pairing of every encoder-controlled document (say, d) with a document p(d) that is outside of the encoder's control; we consider a document to be outside the encoder's control if it is unethical or against the accepted etiquette for the encoder to use it for stealthy communication (so although a user may, for example, physically be capable of modifying collaborative content, it would be inappropriate to modify such content for the purpose of encoding). A document d that is within the encoder's control contains elements that we refer to as e l , . . . , el,; if d contains too many elements for us to use, then we choose only a number l d of them (less than the total available). What determines an ei7s role (e.g., pair-selection, mark-encoding, etc) is the keyed hash of it, denoted as H(ei) . Some types of elements ei have a payload, e.g., for a mark-encoding ei the payload consists of the mark's bits that it encodes. Let el be to p(d) what ei is to d (actually there is more to e!, than that, but we defer this discussion to the section 3.2 on implementation details). If ei has a payload, then that payload is determined by the keyed hash of the concatenation of ei with with el: H(ei, el). A change in enough of the elements of p(d)'s will, over time, erase the message. See Figure 1.

Encodability is achieved through the latitude we have to modify the individual ei's so they encode both their intended functionality and (if applicable) their appropriate payloads. This latitude includes, for rich content, a simple selection and re-ordering of content types, that typically can achieve 22 bits of encoding even without any modification to any of these rich types (see Section 2.1.1 for more details on this). If subset selection and ordering information for rich types is not enough, or not feasible because some sites disallow rich content or enforce rigid templates for it, then we resort to modification of the ei's actual contents rather then their presence and sequence order: We modify some pixels of an image [4], replace words by synonyms using robust synonym substitution [14], judiciously inject typos [12] in domains where they are common enough (blogs, newsgroups), etc. We will later discuss desirable properties for p(d).

There are 2 models for obtaining the ei's from d, each applicable in a different domain. In free-format

The difficulty of these challenges is exacerbated by the increasing power that a potential adversary canmuster - the repertoire of information sources for such an adversary now includes forum or blog boards(most or least accessed pages, most active member etc.), web-bots, ISP logs, search engines, web pagetracking engines[9, 6], to mention a few.

Our system fulfills most of the requirements listed for a Web based publishing system listed by Waldman et al. [15]: censorship-resistant, tamper evident, source anonymous, updatable, deniable, fault tolerant,persistent (i.e., no expiration date), extensible, freely available. As will become clear later, the only requirement we do not provide is persistency over time (which is inherently incompatible with a self-destructingmessage).

Previously proposed private communication systems use a third party distributor (e.g., e-mail services [2])to store and distribute the message to intended receivers. In our system, the sender does not use a third partydistributor and therefore has a greater degree of control. The sender also has the option of privately storingthe cover document (until it is cached by another Internet company [11,8]) if she/he wishes to maintain theprivately owned web page that is used as part of the cover document.

Refer to Section 5 for more information about available systems.In Section 2, we introduce the model of Web content used in the communication channel described in

this paper. The system overview is provided in Section 3. The experimental framework and results arediscussed in Section 4.

2 Document Models

Before going into the details, we give a brief overview of how our system works. In what follows, byreferring to something as "secret" we mean that knowing it requires the key that is shared by the encoderand decoder of the message.

Perishability is achieved by a secret pairing of every encoder-controlled document (say, d) with a document p(d) that is outside of the encoder's control; we consider a document to be outside the encoder'scontrol if it is unethical or against the accepted etiquette for the encoder to use it for stealthy communication(so although a user may, for example, physically be capable of modifying collaborative content, it wouldbe inappropriate to modify such content for the purpose of encoding). A document d that is within theencoder's control contains elements that we refer to as el, ... ,el

d; if d contains too many elements for us to

use, then we choose only a number ld of them (less than the total available). What determines an e/s role(e.g., pair-selection, mark-encoding, etc) is the keyed hash of it, denoted as H(ei)' Some types of elementsei have a payload, e.g., for a mark-encoding ei the payload consists of the mark's bits that it encodes. Let e~

be to p(d) what ei is to d (actually there is more to e~ than that, but we defer this discussion to the section 3.2on implementation details). If ei has a payload, then that payload is determined by the keyed hash of theconcatenation of ei with with e~: H(ei, eD. A change in enough of the elements of p(d)'s will, over time,erase the message. See Figure 1.

Encodability is achieved through the latitude we have to modify the individual ei's so they encode boththeir intended functionality and (if applicable) their appropriate payloads. This latitude includes, for richcontent, a simple selection and re-ordering of content types, that typically can achieve 22 bits of encodingeven without any modification to any of these rich types (see Section 2.1.1 for more details on this). If subsetselection and ordering information for rich types is not enough, or not feasible because some sites disallowrich content or enforce rigid templates for it, then we resort to modification of the ei's actual contents ratherthen their presence and sequence order: We modify some pixels of an image [4], replace words by synonymsusing robust synonym substitution [14], judiciously inject typos [12] in domains where they are commonenough (blogs, newsgroups), etc. We will later discuss desirable properties for p(d).

There are 2 models for obtaining the ei's from d, each applicable in a different domain. In free-format

3

Pairing Relation \ Cover Document Paired Document

Stego Message

Figure 1 : Pairing the elements of privately owned web page with the collaboratively updated web page for information hiding.

domains we can choose and modify the ei's dynamically so as to maintain the desired pairing and encoding properties. But in append-only domains (e.g., forum posts, opinion pieces, etc) no modifications are allowed of the old versions of content: In these, we achieve the net effect of modifying a now-obsolete ei roughly as follows. We append a new element e to d, making sure that H(e) encodes for e the exact same functionality as ei, thereby signaling to the decoder that the chronologically prior ei is to be ignored and e used in its stead. That is, if successive hashes of all the e j 's of a particular d give more than one of the same functionality as another, the decoder always breaks the tie in favor of the more recent one.

As stated earlier, what constitutes the sequence of ei elements is the chronological sequence of (possibly append-only, possibly update-able) posts, comments, essays, etc; so the next item posted would be any one of these logical units, let's call it ei. Even in the most constrained case for us, when the ei do not have rich content (e.g., only text and tags) we were always able to make that ei (and, if applicable, its concatenation with p ( d ) ) hash into what we want; in some sense, we "torture ei until it confesses". In rare cases we resort to typos, but these are very common (almost expected) in such text-only informal material.

There might be cases (relatively rare) where the append operations have no provision for a chronological date and time cannot be handled as above; this occurs if, for example, free-flowing text is added without any information about which paragraph was added on which day (so that the web site shows only the current snapshot of the document). In that case we use a fixed number of items as our definition of what constitutes the next ei, e.g., it could be the next k basic items from the available categories, where a basic item is a paragraph, or an image, etc. (so the next ei could consist of the next k paragraphs of texts that we append, if only text is added). A too small value for k runs the risk of not being able to carry out enough modifications on ei, whereas a too large value is wasteful.

\.~---~ ----~)yPaired Document

p(d)

····EB························------Ii

} an element

.---Pairing Relation---m49'\

~~

Cover Documentd

Stego Message

Figure I: Pairing the elements of privately owned web page with the collaboratively updated web page forinformation hiding.

domains we can choose and modify the ei's dynamically so as to maintain the desired pairing and encodingproperties. But in append-only domains (e.g., forum posts, opinion pieces, etc) no modifications are allowedof the old versions of content: In these, we achieve the net effect of modifying a now-obsolete ei roughly asfollows. We append a new element e to d, making sure that H (e) encodes for e the exact same functionalityas ei, thereby signaling to the decoder that the chronologically prior ei is to be ignored and e used in its stead.That is, if successive hashes of all the ej's of a particular d give more than one of the same functionality asanother, the decoder always breaks the tie in favor of the more recent one.

As stated earlier, what constitutes the sequence of ei elements is the chronological sequence of (possiblyappend-only, possibly update-able) posts, comments, essays, etc; so the next item posted would be anyoneof these logical units, let's call it ei. Even in the most constrained case for us, when the ei do not have richcontent (e.g., only text and tags) we were always able to make that ei (and, if applicable, its concatenationwith p(d)) hash into what we want; in some sense, we "torture ei until it confesses". In rare cases we resortto typos, but these are very common (almost expected) in such text-only informal material.

There might be cases (relatively rare) where the append operations have no provision for a chronologicaldate and time cannot be handled as above; this occurs if, for example, free-flowing text is added without anyinformation about which paragraph was added on which day (so that the web site shows only the currentsnapshot of the document). In that case we use a fixed number of items as our definition of what constitutesthe next ei, e.g., it could be the next k basic items from the available categories, where a basic item is aparagraph, or an image, etc. (so the next ei could consist of the next k paragraphs of texts that we append, ifonly text is added). A too small value for k runs the risk of not being able to carry out enough modificationson ei, whereas a too large value is wasteful.

4

2.1 Embedding Channels

As mentioned above, an ei has an encoding capacity through making changes to it that modify H(e i ) and (if it is used) H(ei , eb). We now briefly review what kinds of changes to ei are made. This depends on the domain, specifically whether it allows free-format rich content or not. We begin with the former.

2.1.1 Rich Content

Wikis are a popular way for collaborative content creation, which allow rich content creation. Inside a wilu website usually users can both modify the existing content of individual pages, and add new pages. The collective review and editing through such liberal content modification allows wiki sites to produce quality documents.

A rich content creation web site may offer a wide range of element types to contribute content:

text in different styles

headers of different levels

figures

quotes

lists

external links

internal links

tables

references

tags or categories

In high quality wikis there are style guides that sketch the general good editing practices. Clearly a wiki page that consists of several types of content can be typeset in a variety of ways while complying with the respective style guidelines. A good wiki page can consist of one of the more than 222 different ordered permutations of the 10 types of content that were listed above. The 22 bits that were encoded while staying strictly within quality standards of the wiki can be used as a covert channel.

Another popular way of rich content creation is blogs. The blogs can be hosted by a large collection web site like wiki pages, but unlike wiki pages blogs are usually not strictly monitored for their content and style. Moreover individual blogs are usually under the control of one user.

This kind of content creation web sites provide an extra latitude for covert communication, since it is possible to communicate 22 bits of information just through the use of simple selection and re-ordering of content types even without any modification to any of these rich types.

2.1.2 Restricted Content

Some online content creation sites offer only one or a very limited number of different content types of contribution:

tags (product tags)

text (forum messages, comments, etc.)

2.1 Embedding Channels

As mentioned above, an ei has an encoding capacity through making changes to it that modify H( ei) and(if it is used) H(ei, e~). We now briefly review what kinds of changes to ei are made. This depends on thedomain, specifically whether it allows free-format rich content or not. We begin with the former.

2.1.1 Rich Content

Wikis are a popular way for collaborative content creation, which allow rich content creation. Inside a wikiwebsite usually users can both modify the existing content of individual pages, and add new pages. Thecollective review and editing through such liberal content modification allows wiki sites to produce qualitydocuments.

A rich content creation web site may offer a wide range of element types to contribute content:

• text in different styles

• headers of different levels

• figures

• quotes

• lists

• external links

• internal links

• tables

• references

• tags or categories

In high quality wikis there are style guides that sketch the general good editing practices. Clearly a wikipage that consists of several types of content can be typeset in a variety of ways while complying with therespective style guidelines. A good wiki page can consist of one of the more than 222 different orderedpermutations of the 10 types of content that were listed above. The 22 bits that were encoded while stayingstrictly within quality standards of the wiki can be used as a covert channel.

Another popular way of rich content creation is blogs. The blogs can be hosted by a large collectionweb site like wiki pages, but unlike wiki pages blogs are usually not strictly monitored for their content andstyle. Moreover individual blogs are usually under the control of one user.

This kind of content creation web sites provide an extra latitude for covert communication, since it ispossible to communicate 22 bits of information just through the use of simple selection and re-ordering ofcontent types even without any modification to any of these rich types.

2.1.2 Restricted Content

Some online content creation sites offer only one or a very limited number of different content types ofcontribution:

• tags (product tags)

• text (forum messages, comments, etc.)

5

images (photographs)

audio

links (related links, bookmarks)

These sites are usually in the form of collections, and they restrict the users to appending new content and do not allow modification of existing content. Examples of these include online forums, newsgroups, product reviews, and product tags. History pages (pages that list the log of modifications on a page) of individual pages in rich content web sites (e.g. wikis) also exhibit an append-only characteristic, since older snapshots of the pages are fixed, and these history pages monotonically grow in size as new changes are made to the respective documents.

The append-only characteristic of these sites constitute an interesting challenge for covert communication. In Section 3 we explore this model further and give an algorithm for encoding and decoding using a covert channel based on it.

3 System Overview

The system consists of an algorithm that determines which of a document's n links are relevant to the mark and how they encode their relevant bit(s) of the mark (e.g. links could be different personal pages in a community web site). In our implementation, each message bit is encoded in a pair of key-selected links, only one of which is controlled by the message encoder: This pairing is necessary for the simultaneous achievement of the desired properties of encodability and perishability (self-effacing mark). Encoding a mark's bit using only a link under the control of the encoder would not achieve the desired perishability property, whence this "mixing" of what is under the encoder's control with a sibling that is beyond such control. Moreover, the rate of the perishability (does the mark vanish in days, or in weeks?) can be controlled by selecting a sibling that is rapidly changing (for faster decay) or slow changing (for slower decay). See Section 3.2 for further discussion on controlling the speed of message decay.

For the sake of definiteness we describe the algorithm for B bits per pair. In this section, we will use the append-only content creation model to describe our system. However it is

possible to generalize it tofree-jorm modiJication, where the contributors are allowed to delete and modify existing content (see Section 3.4).

The algorithm takes as input (i) a document that has a set of n links (call this set U), m of which are in D and are designated (as being under the encoder's control), m 5 n/2; (ii) a keyed hash function H; and (iii) a r-bit mark message w = bl . . . b,. Each link points to a document d with el, elements (their tags, content or combination thereof). The algorithm outputs a judicious modification of the documents pointed to by designated links, such that a pairing operation using H only results in m pairs of selected links: One link of each pair is a designated link from our set, the other is its sibling, p ( d ) , chosen from one of the n - m non-designated links. A decoder who has H but does not know the designated links can find out the same pairing, by applying H to all links in the document. The process of producing these pairs by the encoder takes place as follows (keep in mind that the pairing has to be producible by the decoder as well, who has the keyed hash function but otherwise has no a priori knowledge of which links are designated and which are not):

1. All of the designated links are made to acquire new elements from categories SELECTION, PAIR- ING, EMBEDDING and CANCEL according to their roles. The categorization of elements can be achieved by partitioning the 2 most significant bits of the hash values of the elements, e.g. 00 for SELECTION, 01 for PAIRING, etc.

• images (photographs)

• audio

• links (related links, bookmarks)

These sites are usually in the form of collections, and they restrict the users to appending new content and donot allow modification of existing content. Examples of these include online forums, newsgroups, productreviews, and product tags. History pages (pages that list the log of modifications on a page) of individualpages in rich content web sites (e.g. wikis) also exhibit an append-only characteristic, since older snapshotsof the pages are fixed, and these history pages monotonically grow in size as new changes are made to therespective documents.

The append-only characteristic of these sites constitute an interesting challenge for covert communication. In Section 3 we explore this model further and give an algorithm for encoding and decoding using acovert channel based on it.

3 System Overview

The system consists of an algorithm that determines which of a document's n links are relevant to the markand how they encode their relevant bites) of the mark (e.g. links could be different personal pages in acommunity web site). In our implementation, each message bit is encoded in a pair of key-selected links,only one of which is controlled by the message encoder: This pairing is necessary for the simultaneousachievement of the desired properties of encodability and perishability (self-effacing mark). Encoding amark's bit using only a link under the control of the encoder would not achieve the desired perishabilityproperty, whence this "mixing" of what is under the encoder's control with a sibling that is beyond suchcontrol. Moreover, the rate of the perishability (does the mark vanish in days, or in weeks?) can be controlledby selecting a sibling that is rapidly changing (for faster decay) or slow changing (for slower decay). SeeSection 3.2 for further discussion on controlling the speed of message decay.

For the sake of definiteness we describe the algorithm for B bits per pair.In this section, we will use the append-only content creation model to describe our system. However it is

possible to generalize it to free-form modification, where the contributors are allowed to delete and modifyexisting content (see Section 3.4).

The algorithm takes as input (i) a document that has a set of n links (call this set U), m of which are inD and are designated (as being under the encoder's control), m :S n/2; (ii) a keyed hash function H; and(iii) a r-bit mark message w = bI ... br . Each link points to a document d with el

delements (their tags,

content or combination thereof). The algorithm outputs a judicious modification of the documents pointedto by designated links, such that a pairing operation using H only results in m pairs of selected links: Onelink of each pair is a designated link from our set, the other is its sibling, p(d), chosen from one of the n - mnon-designated links. A decoder who has H but does not know the designated links can find out the samepairing, by applying H to all links in the document. The process of producing these pairs by the encodertakes place as follows (keep in mind that the pairing has to be producible by the decoder as well, who hasthe keyed hash function but otherwise has no a priori knowledge of which links are designated and whichare not):

1. All of the designated links are made to acquire new elements from categories SELECTION, PAIRING, EMBEDDING and CANCEL according to their roles. The categorization of elements can beachieved by partitioning the 2 most significant bits of the hash values of the elements, e.g. 00 forSELECTION, 01 for PAIRING, etc.

6

In this section, for the sake of simplicity, we use a concatenation of the least significant bits of elements from the same type to generate larger bit strings. In practice we use more sophisticated information hiding methods to embed larger amounts of information into one element (e.g., image, audio, video [4], text [13, 121).

2. The subset T of IT1 = T / B links out of D are marked by adding new SELECTION elements (until the probability that a non-designated link contains more SELECTION elements before the message decays is within acceptable limits). These new SELECTION elements signal the selected subset of links in U that encode the message to a decoder that only has the information of H. Let the set of non-selected links be F = U - T .

3. We modify PAIRING elements of each link 1 in T , and select a sibling for 1 from non designated links with the bit string they generate (i.e., by using the concatenation of the least significant bits of their hash value as a pointer or id). If a link in T is involved in a collision between its choice of sibling and the choice made by another link in T , we change some of the PAIRING elements until collision disappears and another desired sibling is selected.

4. The sorted order of the hash of concatenation of the SELECTION elements of links in T also provides a way of knowing which ones will encode which bits: The smallest in the sorted order will encode the first B bits of the mark, the second will encode the next B bits, etc, until all B x IT1 bits of the mark are encoded (we consider the redundancy and error correction as being included in these B x IT( bits).

5. Go through each selected link in T according to the above-mentioned sorted order. For each of them do the following: add a set of EMBEDDING elements to the link so that the concatenation of the least significant bits of H(ei , el) (i.e., hash of an EMBEDDING element ei and a counterpart element e: of sibling) is equal to the B bits of the mark that it seeks to encode.

6. Later, if the sibling of a link 1 of T unexpectedly turns out to have the wrong mean-time-betweenmodifications characteristics (too small for a desired slow decay, or too large for a desired fast decay due to changes in public interest to the sibling document), then there are two options: i) the hash of the PAIRING elements are modified by appending new PAIRING elements such that their hash, as a seed, selects a suitable item from F (one that has the desired characteristic). Of course this has to be done subject to not creating a collision with the choice of another link in T . ii) if reusing 1 to accommodate a new pairing will cause exceeding a stealthiness threshold for its number of elements, 1 is de-selected using CANCEL elements. A new designated link can be selected to replace its place in encoding the bits of the message using the same SELECTION tags of 1 if desired.

The decoder operation consists of the following steps:

1. Compute T and F using the secret key and H, and then compute the sibling information (i.e pairing) p(d) for each d in T .

2. Sort the documents in T according to the hash of their SELECTION elements.

3. Use the sorted order to read from each of the selected documents B bits that it encodes jointly with its sibling by computing H(ei , el) for each EMBEDDING element.

SELECTION determines which of the given n links are used for encoding. PAIRING determines which one of the non-designated documents is paired with each document selected by SELECTION process. EM- BEDDING determines the encoded message bit string by the selected link and its pair. CANCELING is used when there is a need for signaling the un-selected status of a previously selected document.

In this section, for the sake of simplicity, we use a concatenation of the least significant bits of elements from the same type to generate larger bit strings. In practice we use more sophisticated information hiding methods to embed larger amounts of information into one element (e.g., image, audio,video [4], text [13, 12]).

2. The subset T of ITI = T / B links out of D are marked by adding new SELECTION elements (untilthe probability that a non-designated link contains more SELECTION elements before the messagedecays is within acceptable limits). These new SELECTION elements signal the selected subset oflinks in U that encode the message to a decoder that only has the information of H. Let the set ofnon-selected links be F = U - T.

3. We modify PAIRING elements of each link l in T, and select a sibling for l from non designated linkswith the bit string they generate (i.e., by using the concatenation of the least significant bits of theirhash value as a pointer or id). If a link in T is involved in a collision between its choice of siblingand the choice made by another link in T, we change some of the PAIRING elements until collisiondisappears and another desired sibling is selected.

4. The sorted order of the hash of concatenation of the SELECTION elements of links in T also providesa way of knowing which ones will encode which bits: The smallest in the sorted order will encode thefirst B bits of the mark, the second will encode the next B bits, etc, until all B x ITI bits of the markare encoded (we consider the redundancy and error correction as being included in these B x ITI bits).

5. Go through each selected link in T according to the above-mentioned sorted order. For each of themdo the following: add a set of EMBEDDING elements to the link so that the concatenation of the leastsignificant bits of H(ei, e~) (i.e., hash of an EMBEDDING element ei and a counterpart element e~ ofsibling) is equal to the B bits of the mark that it seeks to encode.

6. Later, if the sibling of a link l of T unexpectedly turns out to have the wrong mean-time-betweenmodifications characteristics (too small for a desired slow decay, or too large for a desired fast decaydue to changes in public interest to the sibling document), then there are two options: i) the hash ofthe PAIRING elements are modified by appending new PAIRING elements such that their hash, asa seed, selects a suitable item from F (one that has the desired characteristic). Of course this hasto be done subject to not creating a collision with the choice of another link in T. ii) if reusing l toaccommodate a new pairing will cause exceeding a stealthiness threshold for its number of elements,l is de-selected using CANCEL elements. A new designated link can be selected to replace its placein encoding the bits of the message using the same SELECTION tags of l if desired.

The decoder operation consists of the following steps:

1. Compute T and F using the secret key and H, and then compute the sibling information (i.e pairing)p(d) for each d in T.

2. Sort the documents in T according to the hash of their SELECTION elements.

3. Use the sorted order to read from each of the selected documents B bits that it encodes jointly withits sibling by computing H(ei' e~) for each EMBEDDING element.

SELECTION determines which of the given n links are used for encoding. PAIRING determines whichone of the non-designated documents is paired with each document selected by SELECTION process. EMBEDDING determines the encoded message bit string by the selected link and its pair. CANCELING isused when there is a need for signaling the un-selected status of a previously selected document.

7

3.1 How the Mark Decays

Note the following characteristic of our system: Extensive changes to a non-designated link does not do more damage to the mark than more modest changes to that same link. Rather, the damage occurs from modifications done to many links (even if each of these modifications is modest) that, over time, eventually overwhelm the built-in redundancy and error-correction (which are needed for a practical operational consideration discussed below). The primary mechanism we use for controlling the lifetime of the mark (when it is not refreshed) is therefore not the amount of the redundancy and error-correction we introduce, but rather our choice of the characteristics of the sibling links - the frequency with which they are likely to be updated based on past observations of their behavior. What if that previously observed behavior changes against our expectation, e.g., a link that was rapidly changing when we selected it as sibling becomes largely dormant after a period of time. To avoid this damaging the perishability property of our mark, we refresh the mark so that a new sibling gets selected instead (one with a more appropriate mean-time-between-updates value).

3.2 Controlling the pace of perishability

It is quite likely that we may not find public documents that undergo changes at exactly the rate we need to achieve our targeted pace of perishability. This subsection discusses how we overcome a possible mismatch between our needs and what is available. We use two basic techniques: A slowdown technique for the case where the paired documents are too fast-changing for our needs, and a speedup technique for the opposite case where the paired documents are too slow-changing for our needs. We now discuss each of these in turn.

We made the assumption that changes to paired non-designated pages will localize at element level (e.g., a paragraph of text). This assumption is supported by analysis of Buriol et. al. [ I] that the amount of changes to a page in Wikipedia have stayed roughly the same throughout the Wikipedia history at the level of one short paragraph at a time.

3.2.1 The slowdown technique

We temporarily assume that the number of embedding elements of d is no greater than lp (d) (the number of elements in p ( d ) ) ; although this is a reasonable assumption in view of the fact that public documents tend to be larger than private ones (e.g., Wikipedia grows over time, whereas a student's web page tends to stay relatively small), we will later on discuss how to avoid this assumption.

Each of the (say, t) embedding eis of d will be paired with the concatenation of one or more elements of p ( d ) , and it is this concatenation that we called e: in the previous sections. The number of elements (call it A) of p ( d ) that make up one e;, is determined by the desired slowdown factor s, by t, and also by the total number of elements l P ( d ) in p ( d ) . Specifically we choose

By way of example, assume the number of embedding eis in d is t = 5. Suppose we have a p ( d ) that changes 10 times faster than we like, and that it consists of (say) lp (d) = 50 elements (e.g., paragraphs). In that case we need a A of

Using the keyed hash function of the 5 embedding ei's in d, we select for each ei a single element of p ( d ) to act as its e:. As only 5 of the 50 elements of p ( d ) impact the embedded message, the probability that a change in p ( d ) causes damage to the message is only 0.1. Notice that the use of the keyed hash function to

3.1 How the Mark Decays

Note the following characteristic of our system: Extensive changes to a non-designated link does not domore damage to the mark than more modest changes to that same link. Rather, the damage occurs frommodifications done to many links (even if each of these modifications is modest) that, over time, eventually overwhelm the built-in redundancy and error-correction (which are needed for a practical operationalconsideration discussed below). The primary mechanism we use for controlling the lifetime of the mark(when it is not refreshed) is therefore not the amount of the redundancy and error-correction we introduce,but rather our choice of the characteristics of the sibling links - the frequency with which they are likely tobe updated based on past observations of their behavior. What if that previously observed behavior changesagainst our expectation, e.g., a link that was rapidly changing when we selected it as sibling becomes largelydormant after a period of time. To avoid this damaging the perishability property of our mark, we refresh themark so that a new sibling gets selected instead (one with a more appropriate mean-time-between-updatesvalue).

3.2 Controlling the pace of perishability

It is quite likely that we may not find public documents that undergo changes at exactly the rate we need toachieve our targeted pace of perishability. This subsection discusses how we overcome a possible mismatchbetween our needs and what is available. We use two basic techniques: A slowdown technique for the casewhere the paired documents are too fast-changing for our needs, and a speedup technique for the oppositecase where the paired documents are too slow-changing for our needs. We now discuss each of these in tum.

We made the assumption that changes to paired non-designated pages will localize at element level (e.g.,a paragraph of text). This assumption is supported by analysis of Buriol et. al. [1] that the amount of changesto a page in Wikipedia have stayed roughly the same throughout the Wikipedia history at the level of oneshort paragraph at a time.

3.2.1 The slowdown technique

We temporarily assume that the number of embedding elements of d is no greater than lp(d) (the number ofelements in p(d)); although this is a reasonable assumption in view of the fact that public documents tendto be larger than private ones (e.g., Wikipedia grows over time, whereas a student's web page tends to stayrelatively small), we will later on discuss how to avoid this assumption.

Each of the (say, t) embedding eis of d will be paired with the concatenation of one or more elementsof p(d), and it is this concatenation that we called e~ in the previous sections. The number of elements (callit ).) of p(d) that make up one e~, is determined by the desired slowdown factor s, by t, and also by the totalnumber of elements lp(d) in p(d). Specifically we choose

). = lp(d)

tsBy way of example, assume the number of embedding eiS in d is t = 5. Suppose we have a p(d) that

changes 10 times faster than we like, and that it consists of (say) lp(d) = 50 elements (e.g., paragraphs). Inthat case we need a ). of

50).=--=1

5 * 10Using the keyed hash function of the 5 embedding ei's in d, we select for each ei a single element of

p(d) to act as its e~. As only 5 of the 50 elements of p(d) impact the embedded message, the probability thata change in p(d) causes damage to the message is only 0.1. Notice that the use of the keyed hash function to

8

select the eis has the effect that, irrespective of the likelihood of change in the 50 elements of p(d) (e.g., even if that distribution is nonuniform), the probability that a change in an element of p(d) impacts our embedded message is now 0.1, i.e., we have achieved the desired slowdown of a factor of 10 not only through the judicious selection of A, but crucially through the randomization done by the use of the keyed hash.

3.2.2 The speedup technique

Suppose that in the previous sections, instead of H(e l , e;) . . . H(e t , ei) being used as the encoding mechanism for the pair d,p(d), it was the following:

Although this would not change the probability that a modification in one of the elements of p(d) will damage the mark, it does amplify t-fold the number of mark bits damaged in that case. This is indeed one mechanism for achieving a speedup of the decay.

The speedup can be further enhanced by increasing the size of the portion of p(d) that can affect the mark until in eventually includes all lp (d ) elements of p(d).

The above mechanism is probably enough in many situations, but it could be the case that an even faster decay is needed than what can be achieved by the above damage-amplification process. In that case we resort to the following method. We use a one-to-many mapping for the pairing function p(.), i.e., p(d) is no longer a single document but a concatenation of multiple documents. In fact this method can even be used in the slowdown technique in case the assumption o f t < lp (d ) does not hold: It can be forced to hold by using a large enough multiplicity for the p(.) function (although as we stated above, this will occur very rarely).

3.3 System Operation Issues

A burst of updates could be done to many nodes not under the control of the encoder, just prior to a read attempt by the decoder and before the encoder has a chance to react to the change and refresh the mark. Although redundancy and error-correction can mitigate the effects of this (i.e., the mark can survive), it is possible that the sudden updates are so extensive as to overwhelm the mark. What happens in such a case ? The reader (i) recovers the wrong mark, that he is therefore unable to decrypt into something that makes sense (it decrypts to random gibberish); then (ii) the reader backs off from the read attempt and tries to read again at a later time. We believe this to be preferable to a solution that involves a rendez-vous time between the encoder and the decoder (e.g., one refreshes at 4:55PM and the other reads at 4:57PM), which would be awkward and unnecessarily constrain the system's operation. Such a rendez-vous mechanism has its uses, however, if one wants to deliberately create an evanescent mark, which is a mark that fleetingly appears then is promptly read and erased immediately thereafter by the encoder. Such a rendez-vous mechanism also increases the communication bandwidth because the encoder relies on fleeting changes that will not have to stand up to the scrutiny of the other readers of the public information forum being used (the "problem" introduced in encoding gets fixed relatively fast, before the complaints roll in).

3.4 The Case of Free-Form Modification

Recall that this is when we are not constrained to operate in append-only mode when modifying d, i.e., we can modify the individual ei's so as to cany out the modifications we seek.

Recall that the functionality of an ei is governed by the two most significant bits of H(ei), whereas its payload is governed by the !least significant bits of H (ei,p(d)) wherep(d) is the sibling of d and ! depends on the functionality (e.g., if ei encodes 20 bits of the mark then ! = 20). We distinguish two cases.

select the e~s has the effect that, irrespective of the likelihood of change in the 50 elements of p(d) (e.g., evenif that distribution is nonuniform), the probability that a change in an element of p(d) impacts our embeddedmessage is now 0.1, i.e., we have achieved the desired slowdown of a factor of 10 not only through thejudicious selection of >., but crucially through the randomization done by the use of the keyed hash.

3.2.2 The speedup technique

Suppose that in the previous sections, instead of H(el' e~) ... H(et, e~) being used as the encoding mechanism for the pair d, p(d), it was the following:

Although this would not change the probability that a modification in one of the elements of p(d) willdamage the mark, it does amplify t-fold the number of mark bits damaged in that case. This is indeed onemechanism for achieving a speedup of the decay.

The speedup can be further enhanced by increasing the size of the portion of p(d) that can affect themark until in eventually includes alllp(d) elements of p(d).

The above mechanism is probably enough in many situations, but it could be the case that an even fasterdecay is needed than what can be achieved by the above damage-amplification process. In that case weresort to the following method. We use a one-to-many mapping for the pairing function p(.), i.e., p(d) is nolonger a single document but a concatenation of multiple documents. In fact this method can even be used inthe slowdown technique in case the assumption of t < lp(d) does not hold: It can be forced to hold by usinga large enough multiplicity for the p(.) function (although as we stated above, this will occur very rarely).

3.3 System Operation Issues

A burst of updates could be done to many nodes not under the control of the encoder, just prior to a readattempt by the decoder and before the encoder has a chance to react to the change and refresh the mark.Although redundancy and error-correction can mitigate the effects of this (i.e., the mark can survive), it ispossible that the sudden updates are so extensive as to overwhelm the mark. What happens in such a case? The reader (i) recovers the wrong mark, that he is therefore unable to decrypt into something that makessense (it decrypts to random gibberish); then (ii) the reader backs off from the read attempt and tries to readagain at a later time. We believe this to be preferable to a solution that involves a rendez-vous time betweenthe encoder and the decoder (e.g., one refreshes at 4:55PM and the other reads at 4:57PM), which would beawkward and unnecessarily constrain the system's operation. Such a rendez-vous mechanism has its uses,however, if one wants to deliberately create an evanescent mark, which is a mark that fleetingly appears thenis promptly read and erased immediately thereafter by the encoder. Such a rendez-vous mechanism alsoincreases the communication bandwidth because the encoder relies on fleeting changes that will not haveto stand up to the scrutiny of the other readers of the public information forum being used (the "problem"introduced in encoding gets fixed relatively fast, before the complaints roll in).

3.4 The Case of Free-Form Modification

Recall that this is when we are not constrained to operate in append-only mode when modifying d, i.e., wecan modify the individual ei's so as to carry out the modifications we seek.

Recall that the functionality of an ei is governed by the two most significant bits of H(ei), whereas itspayload is governed by the £least significant bits of H(ei,p(d)) where p(d) is the sibling of d and £dependson the functionality (e.g., if ei encodes 20 bits of the mark then £ = 20). We distinguish two cases.

9

The first case is when the update is not supposed to change the functionality of an ei, only its payload. This is the most frequent update and happens when, e.g., we still want ei's functionality to be an encoding one, but we want to change ei so that the e least significant bits of H(ei , e:) are restored to their original value, which they now deviate from because of a modification that occurred to p(d). Recall that such changes that occur to p(d) are beyond the encoder's control - the encoder merely responds to them by modifying ei so as to restore the correct value to the e least significant bits of H(ei, e!,). The encoder will, on average, need to apply 22+e modifications to ei before he manages to give their target value to the 2 f e bits that he is trying to control. The encoder achieves that capacity by a multiplicity of methods. One of these involves trying different subsets of rich content types for inclusion in the new ei, and for each subset trying all possible orderings of these types. This usually gives the power to carry out the job, without the need to apply different modifications to these types. Specifically, if there are x types, then the number of different orderings of all their different subsets is

5 ! / ( x - i)!

For typical rich content, the log2 of the above gives 22.5, hence an encoding capacity of around 22 bits, making possible an e = 20 without the need to modify each individual type when refreshing the mark (capacity can be much increased if we are willing to make such changes to text, images, etc).

We have implemented our system to work as a plugin through a browser based interface, where the user is provided with a Javascript based code that can be run by cliclung on a bookmark in the toolbar (See the bookmark labeled as "WaneMark" in the screen shot shown in Figure 2). When the bookmark is clicked a text area is dynamically created inside the browser window. The user enters the secret key, the paired URL (In the current implementation this is just a label of a Wikipedia page.), and a secret message. Encoding is performed in three steps. At every step user highlights a part of the text in the text preview window, and hits the 'Try Encode" button. WaneMark modifies the highlighted text. The current system leaves the trailing piece of the highlighted text that does not contribute to the encoding intact; the remaining portion can be used for encoding other parts of the message. The user approves the changes by hitting the "Accept Changes" button and proceeds to the next step.

See Figure 2 for a screen shot of the current system operating on a third party blogging interface. [More detailed information about our experiments will be provided in the camera ready copy of our paper including the traffic analysis of dynamically changing pages. We can characterize the updates to non-designated links by their mean-time-between-updates.].

A demo of our implemented system is available at http://www.cs.purdue.edu/homes/utopkara~wanemark.

Note that WaneMark can use different steganography algorithms as plugins. In our proof-of-concept implementation, we have used a typographical-error based information hiding technique that mimics errors of a human for modifying the text elements on web pages [12]. However, there are many other methods available for performing the same task [14, 13,3, 10, 17, 161. Non-text elements can be used for information carrying using image, audio and video steganography techniques available in the literature [4].

The current implementation is a bookmarklet, and does not have permissions to operate on sites other than the test site.

We use 16-bit labels to mark SELECTION, PAIRING and EMBEDDING elements.

The first case is when the update is not supposed to change the functionality of an ei, only its payload.This is the most frequent update and happens when, e.g., we still want ei's functionality to be an encodingone, but we want to change ei so that the £ least significant bits of H(ei, eD are restored to their originalvalue, which they now deviate from because of a modification that occurred to p(d). Recall that such changesthat occur to p(d) are beyond the encoder's control- the encoder merely responds to them by modifying eiso as to restore the correct value to the £ least significant bits of H(ei, eD. The encoder will, on average,need to apply 22+£ modifications to ei before he manages to give their target value to the 2 + £ bits thathe is trying to control. The encoder achieves that capacity by a multiplicity of methods. One of theseinvolves trying different subsets of rich content types for inclusion in the new ei, and for each subset tryingall possible orderings of these types. This usually gives the power to carry out the job, without the need toapply different modifications to these types. Specifically, if there are x types, then the number of differentorderings of all their different subsets is

x

I: x!j(x - i)!i=l

For typical rich content, the log2 of the above gives 22.5, hence an encoding capacity of around 22bits, making possible an £ = 20 without the need to modify each individual type when refreshing the mark(capacity can be much increased if we are willing to make such changes to text, images, etc).

4 Experiments:Making of WaneMark

We have implemented our system to work as a plugin through a browser based interface, where the useris provided with a Javascript based code that can be run by clicking on a bookmark in the toolbar (See thebookmark labeled as "WaneMark" in the screen shot shown in Figure 2). When the bookmark is clicked atext area is dynamically created inside the browser window. The user enters the secret key, the paired URL(In the current implementation this is just a label of a Wikipedia page.), and a secret message. Encodingis performed in three steps. At every step user highlights a part of the text in the text preview window,and hits the "Try Encode" button. WaneMark modifies the highlighted text. The current system leaves thetrailing piece of the highlighted text that does not contribute to the encoding intact; the remaining portioncan be used for encoding other parts of the message. The user approves the changes by hitting the "AcceptChanges" button and proceeds to the next step.

See Figure 2 for a screen shot of the current system operating on a third party blogging interface. [Moredetailed information about our experiments will be provided in the camera ready copy of our paper includingthe traffic analysis of dynamically changing pages. We can characterize the updates to non-designated linksby their mean-time-between-updates.].

A demo of our implemented system is available athttp://www.cs.purdue.edu/homes/utopkara/wanemark.

Note that WaneMark can use different steganography algorithms as plugins. In our proof-of-conceptimplementation, we have used a typographical-error based information hiding technique that mimics errorsof a human for modifying the text elements on web pages [12]. However, there are many other methodsavailable for performing the same task [14, 13,3, 10, 17, 16]. Non-text elements can be used for informationcarrying using image, audio and video steganography techniques available in the literature [4].

The current implementation is a bookmarklet, and does not have permissions to operate on sites otherthan the test site.

We use 16-bit labels to mark SELECTION, PAIRING and EMBEDDING elements.

10

Ble Ed4 View Hisory Bookmark. 1001s Help bletr G m l Nekr DM

-13- -, E@I c B -'1~:..., . 9' ~~f111010101100111

~ a u ~ ~ ~ j ~ s e n u r: My Homemade Blog ~ e s s a ~ c ~ l m o o i l m l a l m o I*:

of t h i s mfDrmecion 15 ~ 0 1 1 e b l ) I a c ~ v e l y omed (c rea ted end TNEnmds .qeiea%iuMle ocher ~nEom&tlOn 15 p r 1 v ~ C e l y omed and aemcalned (bu t

-1yac~esslhle). Whereas z t 1s une th ica l co m d l f y rhe former tor ~sageloaded21344;l~~t1~n, ~t 1s wire leplt lmace t o do s o vlth the l a t c e r , end us

I . . a deslgn f o r d o n g so u b l e achlwmg both plausible d?mabxllcy end aucomauc perl?hab:L~ty of the cover t message ( t h e message disappears unless periodically refreshed by the encoder).

I Tradlclonal infomaclon-hiding has looked a t the problem of embedding a message I n a s t a t i c versloe Of en onllne document, t h e Droblem of dolnq s o f o r rap ld lv ' & The VlWW mcreasmgly allows peope ro crealc and updare conlmt for pubhc a c s s Some of tlus m f o m u o n IS

cohboratlvcy owned (creakd and m t m c d ) . wrule o tha m f o m o n 1s pnvatcij o o m d and m m e d (but slrll pubhcly access~ble) Whu%s 11 1s uneUucal lo modfy the formu for covat commurncauon, n:s guu l c g m l e lo do so h thc tatla. and &us DaDU sves a deslen for doma so wMe achnnnn both vlauslble dauabllv and automauc omshabltv of the covert message (the message d ippears unless p e n o d d l y regeshedby the mcoder) '

Tradtlonal ulfonnatan-hhg h s looked at the p rob lw of embed@ a messag m a statlc verston of an o h e docunmt, theproblem of domg so for rapdly wolvmg docwnmr collectms has not hem considered mthe past T h s paper shows Lhat 11 is posslble to do so, and m a manner Lhat actualy d e s use of the rapidly evolmg nature of the docummts to acheve t h e a b o v e - m o n e d propaiy of w e s c m c e That the message decays o v a clme and wmtua!ly becomes completely aased unless ~t 1s refreshed Therefore the mark needs to be contmuously m t m e d as the document evolves. m a manna Lhat prwmts the advasar j fmm knovnng who u domg the refislung yet h t allows the mtended readu of the mark to IeCOVK d mthout any form of exphcit commmcatlon

One advantage of our s c h w e IS that the nark's rmch a now unbounded It can be read by any authorzed mklty on the web (anyone m i h the scret key), and the r e a h g of 11 is mdstmwshable from n o d web a c s s pattans Anotha ad6anIage a thc h d n g m the crowo" c f fec~ Many peoplc are "idatlogthe docwnmts, theeby provl~lvl a c l v a for thc one pcrjon surtphuously mechug and rehestununa the mark. or rcptacmg 11 d anotha nrark message We bave also danonslrated the feaslbity of the proposed tec&ue, and shown Lhat&mrkably htUe effort 1s re&red to m p l w m t oru s c h w r o v e tohy's web

S(aunlr-1

Done 0 310s @a OpepNoiebook Jdbl~i(

Figure 2: Screen shot of the WaneMark system operating on a test interface. Each paragraph helps carrying 16 bits for SELECTION, PAIRING and EMBEDDING.

5 Related Work

Our system aims to mitigate the problem of establishing a truly private web based communication channel. As also mentioned by Butler et al., users of web based communication services have the right to expect that their messages will only be read by the intended recipients, and should be guaranteed of their privacy [2]. As a solution, they propose a scheme that uses several communication channels (different online e-mail services) simultaneously to deliver a steganographically marked text message that carries the secret message embedded in it and the cryptographic keys that are required to decode and decipher the message. Similarly, a form of private communication can be provided by the use of secure email tools such as Privacy Enhanced Mail (PEM), Secure Multipurpose Internet Mail Extension (S/MIME) and Pretty Good Privacy (PGP) [5]. But none of these tools can hide the fact that there is a communication going on between the sender and the receiver.

Waldman et al. [15] listed the required properties of a Web based publishing system as follows: censorship- resistant, tamper evident, source anonymous, updatable, deniable, fault tolerant, persistent (i.e. no expiration date), extensible, freely available. In the same paper, they introduce Publius, a web based publishing system that provides these requirements. Publius spreads the data of an anonymous publisher to several servers, such as each server receives an encrypted Publius content (some random looking data) and one of the key shares that was generated by splitting the publisher's key, K, into several keys. Any k of n key shares can reproduce the original K. The publishing process produces a special URL that is used to recover the data

My HomemadeBlog~~~""·'W':eaSint;11Y 8.110'L13 people co c:r:eate and update cont.ent for: public

f this infpI:ID.e.t1on .is collebo:r:atively. owned (ct:eat.ed end ll".:

"Ie othe:r: informat.ion is privately otmed and ma.int.ained (but :i." .:..£'~.!;I,~JEA!;,h ~ereas it is un,et:h1cal to modify the,' fo:t:mer -for:

"t is quite leoitim.ate to do 30 'With the latter:, and this ~

.. a design fot: doing -30 while achieving ~oth..p.laU3ible ~,~.~~.~.H.~Y andlautomatic P_~.:r:;.;.~~~_:j.J1~Y of the covet t m.e33age (the me5S8.g'e disappea.I:S unlege; ~pe:r:iodicelly :r:efreshed by the ene-odet).

1<4~)0;,.@~

KPair

¥.

TJ::e.ditiona! 1nform.ation-:-h1ding has looked at. the problem. of embeddinq e. m.essage·in a st.atic veI:3ion of an online docume,nt •. the problem. of d.oinq so for t".8.pidl.V

:.p.~ ..1The WWW increasingly allows peope ~ c~eate and.up~te content for public acess .. Some of this information iscollaboratively owned (cr.ea~ and maintaIned). while other information is pri~lely ~_wiledand ~tained (but __sitll publiclyaccessible)., 'Whereas it is' unethical to Inodify the fonner fOr'·covertconununication. it is quite legitimate to. do so with thelatter~ 'and this Pap~-~ves a design fOf doing so ·wmte achieving both plausible deciability and automatic'penshability of thecov'ert message (the.message.disappea:rs unless periodically refreshed by the encoder).

Traditiotlal infcirma.t.On.hiding has looke.d at the problem of embedding a messag in a static version, of an online dcicunent;the problem ~f doing so for rapidly evolving document·collectins has not been __considered in the past. This paper showsthat·it is possible to do so, ,and in a manner that actualy makes use of the rapidly e;rol~g.nature of the documents toachieve the above·.mentiom:d propert~l of evanescence: That the'message decays over tiaie and eventually becomescornpletelyerasedunless itis refreshed. Therefore the mark.ne.eds to:be contin~ousty maintained as.the doc~tevo1ves'.in· a mariner that prevents ·the adversary from knowing who is doing· the refreshing yet that allows tb.e intended reader ofthe mark to recover it without any foim of eJqIlicit comm~cation.

One ad.vahtage of our-scheme is that the. nark's r~ch is' now 1lnbounded: It can' be read by any authorized ents~ty on theweb (an}Torie 'with the streL key). and the·reading of it is indistinguishable from nonnal web acess patterns'. Anotheradvantage is the "hiding in.the crQV{d" e~fect IvIanyjleopl~ are updatin_g_~e documents. theeby,providin a cover for the~ne person Sllt¢ptitiouSIy injecting and refreshing the'maik, or replacing -It willi .3?0ther mark message. We have alsodemonstrated the feasibility of the proposed technique, and shown that remarkably little effort· is required to implement ourscheme over today's web. .

t~GBm~A

Figure 2: Screen shot of the WaneMark system operating on a test interface. Each paragraph helps carrying16 bits for SELECTION, PAIRING and EMBEDDING.

5 Related Work

Our system aims to mitigate the problem of establishing a truly private web based communication channel.As also mentioned by Butler et aI., users of web based communication services have the right to expect thattheir messages will only be read by the intended recipients, and should be guaranteed of their privacy [2].As a solution, they propose a scheme that uses several communication channels (different online e-mailservices) simultaneously to deliver a steganographically marked text message that carries the secret messageembedded in it and the cryptographic keys that are required to decode and decipher the message. Similarly,a form of private communication can be provided by the use of secure email tools such as Privacy Enhanced¥ail (PEM), Secure Multipurpose Internet Mail Extension (SIMIME) and Pretty Good Privacy (PGP) [5].But none of these tools can hide the fact that there is a communication going on between the sender and thereceiver.

Waldman et al. [15] listed the required properties of a Web based publishing system as follows: censorshipresistant, tamper evident, source anonymous, updatable, deniable, fault tolerant, persistent (i.e. no expirationdate), extensible, freely available. In the same paper, they introduce Publius, a web based publishing systemthat provides these requirements. Publius spreads the data of an anonymous publisher to several servers,such as each server receives an encrypted Publius content (some random looking data) and one of the keyshares that was generated by splitting the publisher's key, K, into several keys. Any k of n key shares canreproduce the original K. The publishing process produces a special URL that is used to recover the data

11

and the shares. To browse content, a retriever must get the encrypted Publius content from some server and k of the key shares. Any modification to the encrypted content or the special URL, that is cryptographically tied to the content, will make it impossible to retrieve the unencrypted data. Publius also provides a mechanism for the publishers to update or delete their content. Unlike Publius, where the persistence of the messages and reaching to public are driving goals, our system is based on the design goal of perishability of the messages, and providing a secure communication only with the intended recipient of the message.

Readers are referred to [7] for a survey of privacy enhancing technologies for the Internet.

6 Future Work

There is room for wringing some inefficiencies out of our scheme. We briefly discuss some of these below. In the append-only mode, we need to design a finer granularity of modifications for the ei that have

an encoding functionality. The difficulty is that, if we were to do so in the natural way (by spreading the encoding task of ei into a number of smaller elements) then we need a mechanism for specifying which one of the smaller elements (and possibly even which specific bit in it) we are modifying. This "random access" feature achieves a substantial saving only if the number of bits encoded is large enough. It also complicates implementation.

In the case of free-flowing text without chronological time-stamp information, the choice of k (how many basic units form an ei) can be made more flexible and dependent on the nature of the basic units appended rather than their number. This would recognize the fact that some items (like images) have larger potential for modification than others (like tags or text). One could assign a set of weights to each such type and I% would be the threshold accumulated weight at which an ei is considered to have been created. In such a scheme an image could, literally, be worth a thousand words.

7 Conclusion

Steganograpy has a long and distinguished history, dating back to five centuries BC. It is useful whenever the very existence of the secret message is to remain hidden, as in authoritarian countries that severely limit and censor speech. This paper describes a keyed scheme for posting secret messages on the web such that anyone with the secret key can retrieve the messages, but even a computationally powerful adversary (e.g., a government with supercomputers) who does not have the key is not able to know (let alone prove) the message's existence: The message depends on web-wide content, not only on what is under the sender's control. If the sender is prevented from erasing the message, the actions of others on the web automatically take care of doing the erasing job on the sender's behalf. Moreover, the sender can control, a priori, the likely life of the secret message before it automatically disappears, by a judicious design of the "pairing" function with the web's sibling information.

Our scheme is primarily destined for thwarting censorship. There have been many rumors, none substantiated, of evil-doers using steganography for communication - investigations in that direction have turned up no evidence of this. This is probably because evil entities like organized crime (including drug cartels and terror networks) have other and more effective ways of communicating, ways that are not available to an isolated individual, with limited resources, living in a repressive country.

References

[ l ] L. Buriol, C. Castillo, D. Donato, S. Leonardi, and S. Millozzi. Temporal evolution of the wikigraph. In Proceedings of Web Intelligence, pages 45-5 1 , Hong Kong, December 2006. IEEE CS Press.

and the shares. To browse content, a retriever must get the encrypted Publius content from some server andk of the key shares. Any modification to the encrypted content or the special URL, that is cryptographically tied to the content, will make it impossible to retrieve the unencrypted data. Publius also provides amechanism for the publishers to update or delete their content. Unlike Publius, where the persistence of themessages and reaching to public are driving goals, our system is based on the design goal of perishability ofthe messages, and providing a secure communication only with the intended recipient of the message.

Readers are referred to [7] for a survey of privacy enhancing technologies for the Internet.

6 Future Work

There is room for wringing some inefficiencies out of our scheme. We briefly discuss some of these below.In the append-only mode, we need to design a finer granularity of modifications for the ei that have

an encoding functionality. The difficulty is that, if we were to do so in the natural way (by spreading theencoding task of ei into a number of smaller elements) then we need a mechanism for specifying which oneof the smaller elements (and possibly even which specific bit in it) we are modifying. This "random access"feature achieves a substantial saving only if the number of bits encoded is large enough. It also complicatesimplementation.

In the case of free-flowing text without chronological time-stamp information, the choice of k (howmany basic units form an ei) can be made more flexible and dependent on the nature of the basic unitsappended rather than their number. This would recognize the fact that some items (like images) have largerpotential for modification than others (like tags or text). One could assign a set of weights to each such typeand k would be the threshold accumulated weight at which an ei is considered to have been created. In sucha scheme an image could, literally, be worth a thousand words.

7 Conclusion

Steganograpy has a long and distinguished history, dating back to five centuries BC. It is useful wheneverthe very existence of the secret message is to remain hidden, as in authoritarian countries that severely limitand censor speech. This paper describes a keyed scheme for posting secret messages on the web such thatanyone with the secret key can retrieve the messages, but even a computationally powerful adversary (e.g.,a government with supercomputers) who does not have the key is not able to know (let alone prove) themessage's existence: The message depends on web-wide content, not only on what is under the sender'scontrol. If the sender is prevented from erasing the message, the actions of others on the web automaticallytake care of doing the erasing job on the sender's behalf. Moreover, the sender can control, a priori, thelikely life of the secret message before it automatically disappears, by a judicious design of the "pairing"function with the web's sibling information.

Our scheme is primarily destined for thwarting censorship. There have been many rumors, none substantiated, of evil-doers using steganography for communication - investigations in that direction have turnedup no evidence of this. This is probably because evil entities like organized crime (including drug cartelsand terror networks) have other and more effective ways of communicating, ways that are not available toan isolated individual, with limited resources, living in a repressive country.

References

[1] L. Buriol, C. Castillo, D. Donato, S. Leonardi, and S. Millozzi. Temporal evolution of the wikigraph.In Proceedings of Web Intelligence, pages 45-51, Hong Kong, December 2006. IEEE CS Press.

12

[2] K. Butler, W. Enck, J. Plasten; P. Traynor, and P. McDaniel. Privacy-preserving web-based email. In International Conference on Information Systems Security, 2006.

[3] M. Chapman and G. Davida. Hiding the hidden: A software system for concealing ciphertext in innocuous text. In Proceedings of the International Conference on Information and Communications Security, volume LNCS 1334, Beijing, China, 1997.

[4] I. Cox, M. Miller, and J. A. Bloom. Digital Watermaking. Morgan Kaufmann Publishers, 2002.

[5] C. Ellison and B. Schneier. Inside risks: Risks of PKI: secure email. Communications of the ACM, 43(1): 160, 2000.

[6] Extreme Tracking. http://extremetracking.com.

[7] I. Goldberg. Privacy-enhancing technologies for the internet ii: Five years later. In Workshop on Privacy-Enhancing Technologies, 2002.

[8] Google. http://www.google.com/.

[9] Google Analytics. http://www.google.com/analytics.

[lo] R. Stutsman, M. Atallah, C. Grothoff, and K. Grothoff. Lost in just the translation. In Proceedings of the 2006 ACM Symposium on Applied Computing, pages 338-345. ACM, 4 2006. steganography translation machine statistical information hiding text natural language.

[ 1 I] The Internet Archive. http://www.archive.org .

[12] M. Topkara, U. Topkara, and M. J. Atallah. Information hiding through errors: A confusing approach. In Proceedings of the SPIE International Conference on Security, Steganography, and Watermarking of Multimedia Contents, 2007.

[13] M. Topkara, U. Topkara, and M. J. Atallah. Words are not enough: Sentence level natural language watermarking. In Proceedings of ACM Workshop on Content Protection and Security (in conjuction with ACM Multimedia), Santa Barbara, CA, October 27, 2006.

[14] U. Topkara, M. Topkara, and M. J. Atallah. The hiding virtues of ambiguity: Quantifiably resilient watermarking of natural language text through synonym substitutions. In Proceedings of ACM Multi- media and Security Workshop, Geneva, Switzerland, September 26-27, 2006.

[15] M. Waldman, A. D. Rubin, and L. F. Cranor. Publius: A robust, tamper-evident, censorship-resistant, web publishing system. In Proceedings of 9th USENIX Security Symposium, pages 59-72, August 2000.

[16] P. Wayner. Mimic functions. CRYPTOLOGIA, XVI(3): 193-214, July 1992.

[17] K. Winstein. Lexical steganography through adaptive modulation of the word choice hash. In http://www. imsa. a d d keithw/tlex/, 1998.

[2] K. Butler, W. Enck, J. Plasterr, P. Traynor, and P. McDaniel. Privacy-preserving web-based email. InInternational Conference on Information Systems Security, 2006.

[3] M. Chapman and G. Davida. Hiding the hidden: A software system for concealing ciphertext ininnocuous text. In Proceedings of the International Conference on Information and CommunicationsSecurity, volume LNCS 1334, Beijing, China, 1997.

[4] I. Cox, M. Miller, and J. A. Bloom. Digital Watermaking. Morgan Kaufmann Publishers, 2002.

[5] C. Ellison and B. Schneier. Inside risks: Risks of PKI: secure email. Communications of the ACM,43(1):160,2000.

[6] Extreme Tracking. http://extremetracking.com.

[7] I. Goldberg. Privacy-enhancing technologies for the internet ii: Five years later. In Workshop onPrivacy-Enhancing Technologies, 2002.

[8] Google. http://www.google.com/.

[9] Google Analytics. http://www.google.com/analytics.

[10] R. Stutsman, M. Atallah, C. Grothoff, and K. Grothoff. Lost in just the translation. In Proceedingsof the 2006 ACM Symposium on Applied Computing, pages 338-345. ACM, 42006. steganographytranslation machine statistical information hiding text natural language.

[11] The Internet Archive. http://www.archive.org.

[12] M. Topkara, U. Topkara, and M. 1. Atallah. Information hiding through errors: A confusing approach.In Proceedings of the SPIE International Conference on Security, Steganography, and WatermarkingofMultimedia Contents, 2007.

[13] M. Topkara, U. Topkara, and M. J. Atallah. Words are not enough: Sentence level natural languagewatermarking. In Proceedings of ACM Workshop on Content Protection and Security (in conjuctionwith ACM Multimedia), Santa Barbara, CA, October 27, 2006.

[14] U. Topkara, M. Topkara, and M. J. Atallah. The hiding virtues of ambiguity: Quantifiably resilientwatermarking of natural language text through synonym substitutions. In Proceedings ofACM Multimedia and Security Workshop, Geneva, Switzerland, September 26-27,2006.

[15] M. Waldman, A. D. Rubin, and L. F. Cranor. Publius: A robust, tamper-evident, censorship-resistant,web publishing system. In Proceedings of 9th USENIX Security Symposium, pages 59-72, August2000.

[16] P. Wayner. Mimic functions. CRYPTOLOGIA, XVI(3):193-2l4, July 1992.

[17] K. Winstein. Lexical steganography through adaptive modulation of the word choice hash. Inhttp://www.imsa.edu/ keithw/tlex/, 1998.

13

this message will self-destruct: self-easing covert

Documents