
Automating Hoarding

Prasun Dewan

Department of Computer Science University of North Carolina

[email protected]

2

Manual Hoarding

Per-user, per-workstation hoard profiles, specifying:
  Files to be added or deleted
  Current and future (+) children (c) or descendants (d)
  Priority

Personal files:
  a /coda/usr/jjk d+
  a /coda/usr/jjk/papers 100:d+

Executables:
  a /usr/X11/bin/xterm
  a /usr/X11/bin/xinit

Source files:
  a /coda/src/venus 100:c+
  a /coda/include 100:c+
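The profile entries above can be read mechanically. The following is a minimal sketch that assumes a grammar inferred only from the six example lines (an operation letter, a path, and an optional `priority:scope+` field); the names `HoardEntry` and `parse_entry` are hypothetical and this is not the actual Coda parser.

```python
import re
from typing import NamedTuple, Optional


class HoardEntry(NamedTuple):
    op: str                  # "a" = add, as in the slide examples
    path: str
    priority: Optional[int]  # e.g. 100, or None when omitted
    scope: Optional[str]     # "c" = children, "d" = descendants, None = file only
    future: bool             # True when "+" (current and future) is present


# Assumed shape of the optional third field: [priority:][c|d][+]
_SPEC = re.compile(r"^(?:(\d+):)?([cd])?(\+)?$")


def parse_entry(line: str) -> HoardEntry:
    parts = line.split()
    op, path = parts[0], parts[1]
    spec = parts[2] if len(parts) > 2 else ""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"unrecognised hoard spec: {spec!r}")
    priority, scope, plus = match.groups()
    return HoardEntry(op, path, int(priority) if priority else None,
                      scope, plus is not None)


entry = parse_entry("a /coda/usr/jjk/papers 100:d+")
print(entry)
```

An entry without a third field, such as `a /usr/X11/bin/xterm`, parses to a single-file entry with no priority.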

3

LRU

Works well when:
  Activity remains the same
    Not if a context switch occurs after cache fill
    Not if a context switch occurs after disconnection
    Applications used in the disconnected state may not be the same as the ones running when hoarding occurs
  Misses can be afforded
    In the disconnected state, the entire working set must be kept
    Some dynamically accessed files may not have been referenced recently
Problems addressed by the program trace approach

4

Per-program Hoarding

Fixing the activity switch problem
  Per-program traces
  User specifies which programs will be used

[Figure: trace of program /a/b/c, with edges labelled exec() and open() leading to /a/b/c1, /a/b/c2, /a/b/c3, /a/b/x/f, /a/b/x/g, /me/file1, and /me/file2]

5

Uniting Traces

Fixing the cache miss problem
  Look at multiple executions of program (< n)
  Unite accesses of all traces
    Chance of a dynamically accessed file being missed is lowered
    May not want to do so for (execution-specific) data
      • Distinguish data from program
      • Data is in a directory with a different root directory and has a different extension

[Figure: traces of program /a/b/c, covering /a/b/c1, /a/b/c2, /a/b/c3, /a/b/x/f, /a/b/x/g, /me/file1, and /me/file2]
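Trace unification with data filtering can be sketched as follows. This is an illustration under stated assumptions, not the scheme's implementation: only the "different root directory" half of the data heuristic is applied (the extension test is omitted), data files are kept from the most recent trace only, and the names `root_of` and `unite_traces` are hypothetical.

```python
from pathlib import PurePosixPath
from typing import List, Set


def root_of(path: str) -> str:
    """First path component below '/', e.g. 'a' for '/a/b/c'."""
    parts = PurePosixPath(path).parts
    return parts[1] if len(parts) > 1 else ""


def unite_traces(program: str, traces: List[List[str]]) -> Set[str]:
    """Union program files over all traces; keep data files of the last trace.

    Assumption: a file is 'data' when its root directory differs from the
    root directory of the program itself.
    """
    prog_root = root_of(program)
    program_files = {f for trace in traces for f in trace
                     if root_of(f) == prog_root}
    recent_data = {f for f in traces[-1] if root_of(f) != prog_root}
    return program_files | recent_data


traces = [
    ["/a/b/c1", "/a/b/x/f", "/me/file1"],             # older execution
    ["/a/b/c2", "/a/b/c3", "/a/b/x/g", "/me/file2"],  # most recent execution
]
hoard = unite_traces("/a/b/c", traces)
print(sorted(hoard))
```

Here `/a/b/x/f` is hoarded even though the latest execution did not touch it, while the execution-specific `/me/file1` is left out.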

6

Aggregation Choice

Possible to choose:
  Most recent trace
  Trace unification

7

Data File Choice

Possible to choose:
  Data files of all executions
  Data files of all executions by specific user
  Data files of most recent execution by specific user

8

Multi-Program Activities

Bookends
  Snapshot spying
  User specifies start and end of spying period
  Associates it with a bookend name
For each bookend, can ask for hoarding of:
  All accesses recorded
  Accesses in traces of each program executed
    Data file filtering

9

Program Trace Limitations

User involvement
  Bookend definition
  Hoarding decisions
    Data file filtering
    Most recent vs. aggregated
Fixed by semantic distance approach

10

Example of Program Trace Limitations

Wish to hoard all chapters of a book written using tex
Define a bookend for this project
  Get all files accessed by programs (tex) executed during bookend definition
  Scheme will get tex and all dynamic files accessed by it
Data file choices:
  Get all my data files accessed during bookend spying
    Must access all chapters in snapshot
  Get all of my data files accessed by tex
    Will get more than I want
  Get all of my data files accessed during last trace of tex
    May not have accessed book recently

11

Semantic Distance Concept

Between files
  Low if they belong to same project
  High if they do not
Use it to determine files in a project
Hoard all or no files of a project (working set)

12

Temporal Semantic Distance

Clock time elapsed between most recent opens/execs of the files
Clock time is not a good indicator
  Coffee break between references to related files

13

Sequence-based Semantic Distance

Number of intervening references (including open of first file) between the most recent opens/execs of the files

[Figure: reference sequence A (source file), B (include), C (include), B?; labels 1, 2, 3, open, close]

Non-commutative
Looks only at first reference time (open)
  Files accessed during reference lifetime (open to close) should have equal semantic distance

14

Lifetime-based Semantic Distance

SD(F1, F2):
  0 if F2 opened before F1 closed
  # intervening opens otherwise
Consider an exec as an open immediately followed by a close
Considers only last reference
  Dynamic linking conditional

[Figure: reference sequence A (source file), B (include), C (include), B; labels 0, 0, 3]
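The lifetime-based definition can be tried on a small event trace. The sketch below rests on assumptions the slide does not settle: the count of intervening opens includes the open of F2 itself, the pair considered is the most recent open of F2 and the latest open of F1 before it, and `None` is returned when F1 was never opened before F2. The trace and the name `lifetime_sd` are hypothetical.

```python
from typing import List, Optional, Tuple

Event = Tuple[str, str]  # ("open" | "close", file name)


def lifetime_sd(events: List[Event], f1: str, f2: str) -> Optional[int]:
    # Most recent open of f2.
    i2 = max((i for i, (op, f) in enumerate(events)
              if op == "open" and f == f2), default=None)
    if i2 is None:
        return None
    # Latest open of f1 that precedes that open of f2.
    i1 = max((i for i, (op, f) in enumerate(events[:i2])
              if op == "open" and f == f1), default=None)
    if i1 is None:
        return None
    # Distance is 0 while f1 is still open when f2 is opened.
    closed = any(op == "close" and f == f1 for op, f in events[i1 + 1:i2])
    if not closed:
        return 0
    # Otherwise count the opens after f1's open, up to and including f2's.
    return sum(1 for op, _ in events[i1 + 1:i2 + 1] if op == "open")


# A source file A includes B and C, is closed, and B is opened again later.
trace = [("open", "A"), ("open", "B"), ("close", "B"),
         ("open", "C"), ("close", "C"), ("close", "A"), ("open", "B")]

print(lifetime_sd(trace, "A", "C"))  # C opened inside A's lifetime
print(lifetime_sd(trace, "A", "B"))  # most recent open of B is after A closed
print(lifetime_sd(trace, "C", "A"))  # A was never opened after C
```

An exec can be fed in as an `open` event immediately followed by a `close` event for the same file. The last call illustrates that the measure is not commutative.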

15

Aggregation-based Semantic Distance

Take arithmetic mean of SD(F1i, F2i), 1 <= i <= number of references to F1
  1, 1, 1498 vs. 500, 500, 500
  Take geometric mean
Efficiency:
  O(N^2) storage
    Track n (20) closest neighbours
  O(N) cost per reference
    Update SDs of files accessed in the last m (100) references
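The two sample sequences show why the arithmetic mean is a poor aggregate: both average to 500, although the first pair of files was referenced back to back twice. A minimal sketch follows; the helper names are hypothetical, and it assumes strictly positive distances, since a distance of 0 would need separate handling before taking logarithms.

```python
import math
from typing import Sequence


def arithmetic_mean(distances: Sequence[float]) -> float:
    return sum(distances) / len(distances)


def geometric_mean(distances: Sequence[float]) -> float:
    # Assumes every distance is > 0; computed in log space.
    return math.exp(sum(math.log(d) for d in distances) / len(distances))


close_pair = [1, 1, 1498]     # usually adjacent, one outlier
far_pair = [500, 500, 500]    # consistently far apart

print(arithmetic_mean(close_pair), arithmetic_mean(far_pair))
print(geometric_mean(close_pair), geometric_mean(far_pair))
```

The geometric mean of the first sequence is the cube root of 1498, a little over 11, so the two pairs are clearly separated.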

16

Clustering Goal

Cluster files into projects based on SDs
Difficulties
  No objective measure of goodness of clustering
  Need overlapping clusters
    Common header files
  SD not commutative

17

Distance-based Threshold

F1, F2 in same cluster if SD(F1, F2) <= p or SD(F2, F1) <= p
Size of project not considered
  For any p, one can imagine a project with > p files
Combine clusters if they have overlapping files
  f1, f2, ..., fp combined with fp, fp+1, ..., fl
  All files will become one cluster
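The collapse into a single cluster can be reproduced with a small experiment. The sketch assumes a toy distance in which file i and file j are |i - j| references apart; the function name `threshold_clusters` and the use of union-find for merging overlapping clusters are illustrative choices, not taken from the slides.

```python
from itertools import combinations
from typing import Callable, Hashable, List, Sequence, Set


def threshold_clusters(files: Sequence[Hashable],
                       sd: Callable[[Hashable, Hashable], float],
                       p: float) -> List[Set[Hashable]]:
    parent = {f: f for f in files}

    def find(f):
        # Follow parent links to the cluster representative.
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    # Same cluster if either direction is within the threshold;
    # clusters sharing a file are merged transitively.
    for a, b in combinations(files, 2):
        if sd(a, b) <= p or sd(b, a) <= p:
            parent[find(a)] = find(b)

    clusters = {}
    for f in files:
        clusters.setdefault(find(f), set()).add(f)
    return list(clusters.values())


files = list(range(10))
chained = threshold_clusters(files, lambda a, b: abs(a - b), p=1)
print(chained)
```

Even with p = 1, each file is within the threshold of its successor, so the chain pulls all ten files into one cluster.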

18

Common Neighbours-based Threshold

Based on the n (nearest) neighbours
Look at # common neighbours, c
Two thresholds: kf (far) < kn (near)
  kn <= c: clusters combined into one
  kf <= c < kn: files inserted in each other’s clusters
  c < kf: no action
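Counting common neighbours and applying the two thresholds can be sketched as below. The neighbour lists, file names, and threshold values are made up for illustration, and a count exactly equal to kn is treated as "combine".

```python
from typing import Dict, List


def common_neighbours(neighbours: Dict[str, List[str]],
                      f1: str, f2: str) -> int:
    """Number of files that appear in both nearest-neighbour lists."""
    return len(set(neighbours[f1]) & set(neighbours[f2]))


def relation(c: int, kn: int, kf: int) -> str:
    if not kf < kn:
        raise ValueError("expected kf (far) < kn (near)")
    if c >= kn:
        return "combine"   # clusters combined into one
    if c >= kf:
        return "insert"    # files inserted in each other's clusters
    return "none"          # no action


# Hypothetical nearest-neighbour lists for two source files.
neighbours = {
    "a.c": ["a.h", "util.h", "b.c"],
    "b.c": ["a.h", "util.h", "x.h"],
}
c = common_neighbours(neighbours, "a.c", "b.c")
print(c, relation(c, kn=2, kf=1))
```

Because the measure is a count of shared neighbours, it is symmetric even though the underlying SD is not.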

19

Combining Phase

[Figure: files A, B, C, D, E, F, G connected by edges labelled kn or kf]

Clusters after each combining step:
  {A, B}
  {A, B, C}
  {A, B, C} {D, E}
  {A, B, C} {D, E} {F, G}
  {A, B, C} {D, E, F, G}

20

Insertion Phase

[Figure: files A, B, C, D, E, F, G connected by edges labelled kn or kf]

Clusters before insertion:
  {A, B, C} {D, E, F, G}
Clusters after insertion:
  {A, B, C, D} {C, D, E, F, G}
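The two phases can be sketched as a combining pass followed by an insertion pass. The exact edges of the figure are not recoverable from the text, so the shared-neighbour counts below are assumed values chosen to yield the same final clusters; `two_phase_clusters`, kn = 5, and kf = 3 are likewise hypothetical.

```python
from typing import Dict, List, Tuple


def two_phase_clusters(files: str,
                       shared: Dict[Tuple[str, str], int],
                       kn: int, kf: int) -> List[List[str]]:
    # Every file starts in its own cluster; 'home' maps a file to the
    # (shared) set object that represents its cluster.
    home = {f: {f} for f in files}

    # Combining phase: pairs with at least kn common neighbours merge clusters.
    for (a, b), c in shared.items():
        if c >= kn and home[a] is not home[b]:
            merged = home[a] | home[b]
            for f in merged:
                home[f] = merged

    # Insertion phase: pairs between kf and kn are inserted into each
    # other's clusters, which makes the clusters overlap without merging.
    for (a, b), c in shared.items():
        if kf <= c < kn:
            home[a].add(b)
            home[b].add(a)

    unique = {id(s): s for s in home.values()}
    return sorted(sorted(s) for s in unique.values())


# Assumed counts: kn-strength links inside each group, one kf-strength link C-D.
shared = {("A", "B"): 5, ("B", "C"): 5, ("C", "D"): 3,
          ("D", "E"): 5, ("E", "F"): 5, ("F", "G"): 5}
result = two_phase_clusters("ABCDEFG", shared, kn=5, kf=3)
print(result)
```

Running insertion only after all combining is done keeps a far link such as C-D from dragging the two projects into one cluster.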

21

Other Correlating Factors

Directory membership
  Files with common ancestor directories are related
File naming conventions
  Source and header files have the same prefix
Other relations
  #include files, import statements, common words
Ancestor level is automatically recorded and subtracted from shared neighbours
External investigator generates a relationship weight, which is added to shared neighbours

22

Another Option

Add/subtract from SD
  SD is asymmetric
  Directly modifying shared neighbour count has more impact

23

Searching Programs

Example: find
  Opens all files in a subtree
  Destroys LRU and SD information
Accesses of meaningless programs ignored
  Program accessing > d% of possible directory members
Important to detect meaningless phase rather than program
  Get working directory (getcwd)
    Does exhaustive search
  Accesses during search ignored, rather than those of the entire program calling getcwd

24

Shared Libraries

Accessed by all programs
  All clusters will be combined via them
Files involved in more than a certain percentage (1%) of accesses are ignored and always put in hoard set
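The frequency cut-off can be sketched with a simple counter over the access log. This is an assumed reading of the rule: a file is "frequent" when its share of all recorded accesses exceeds the threshold. The function name `split_frequent` and the sample paths are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple


def split_frequent(accesses: List[str],
                   threshold: float = 0.01) -> Tuple[Set[str], List[str]]:
    """Return (always-hoarded files, accesses left for SD computation)."""
    counts = Counter(accesses)
    total = len(accesses)
    frequent = {f for f, n in counts.items() if n / total > threshold}
    remaining = [f for f in accesses if f not in frequent]
    return frequent, remaining


# 50 accesses to one shared library among 1000 accesses in total.
log = ["/lib/libc.so"] * 50 + [f"/src/f{i}.c" for i in range(950)]
always_hoard, for_clustering = split_frequent(log)
print(always_hoard, len(for_clustering))
```

The library is removed from the input to clustering, so it cannot link unrelated projects, and it is placed in the hoard set unconditionally.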

25

Temporary Files

Not important by definition
But may have small semantic distance to other files
System disregards files in certain directories

26

Rarely Accessed Critical Files

Hardly accessed but important
  Bootstrapping
  Suspend/resume files
User-specified lists
System-specific heuristics
  Dot (".") files in Unix

27

Non-Files

Can be critical
  Device file
Access to them may not be recorded
  Symbolic link points to actual file
Non-directories take no space
  Always hoarded
Directories may be needed to do offline file-name translation
  Replication system makes decisions regarding them

28

Handling Hoard Miss

If hoard miss
  Add file and its project to hoard set
  Record it for goodness measure

29

Goodness Measure

Caching
  Cache miss rate
Hoarding
  Time to first cache miss
    Does not take into account working set size vs. hoard size
      • Working set << hoard size -> no miss
      • Working set ~ hoard size -> high miss rate
  Miss-free hoard size

30

Miss-free Hoard Size Under LRU

Look at references before most recent disconnection
  F4 F3 F1 F2 F1 F5
Keep only most recent reference to each file
  F4 F3 F2 F1 F5
Mark files accessed since disconnection
  F3 F5
Locate the first marked file in sorted list
  F3
Sum the size of all files between this file and end of sorted list
  F3 + F2 + F1 + F5
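The steps above can be sketched directly. The file sizes are made-up values, the function name `miss_free_hoard_size` is hypothetical, and files accessed after disconnection that were never referenced before it are simply ignored here, which is an assumption.

```python
from typing import Dict, Iterable, List


def miss_free_hoard_size(refs_before: List[str],
                         accessed_after: Iterable[str],
                         size: Dict[str, int]) -> int:
    # Keep only the most recent reference to each file.
    last_seen = {}
    for position, name in enumerate(refs_before):
        last_seen[name] = position
    lru_order = sorted(last_seen, key=last_seen.get)  # least recent first

    # Locate the first (least recently used) file that was needed while
    # disconnected; LRU must keep it and everything more recent.
    needed = set(accessed_after)
    for i, name in enumerate(lru_order):
        if name in needed:
            return sum(size[f] for f in lru_order[i:])
    return 0


refs = ["F4", "F3", "F1", "F2", "F1", "F5"]
sizes = {"F1": 10, "F2": 20, "F3": 30, "F4": 40, "F5": 50}  # assumed sizes
total = miss_free_hoard_size(refs, ["F3", "F5"], sizes)
print(total)  # size of F3 + F2 + F1 + F5
```

F4 is the only file LRU could have evicted without causing a miss, so it is the only one excluded from the sum.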

31

Live Usage

Gathered user traces of activities
Few hoard misses in actual usage

32

Comparison Experiments

Gathered user traces of activities
Replayed each trace, simulating disconnection durations of
  24 hours
  7 days
Assumed infinitesimal reconnection only for re-hoarding
Mode of traced activities
  Connected
    Can do activities normally not done in disconnected mode
      • Web access
    Access patterns remain same
  Disconnected mode
    Actual hoard misses could influence activities
    But misses were few anyway
Semantic distance leads to hoard size slightly bigger than WS
  Much better than LRU

33

Unresolved Issues

Hoarding of fine-grained data
