Page 1: Automating Hoarding Prasun Dewan Department of Computer Science University of North Carolina dewan@unc.edu

Automating Hoarding

Prasun Dewan

Department of Computer Science University of North Carolina

dewan@unc.edu

Manual Hoarding

Per-user, per-workstation hoard profiles, specifying
  • Files to be added or deleted
  • Current and future (+) children (c) or descendants (d)
  • Priority

Personal files
  a /coda/usr/jjk d+
  a /coda/usr/jjk/papers 100:d+

Executables
  a /usr/X11/bin/xterm
  a /usr/X11/bin/xinit

Source files
  a /coda/src/venus 100:c+
  a /coda/include 100:c+
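The profile entries above can be read mechanically. The following is a minimal sketch of a parser for lines of this shape, assuming only what the slide shows: an operation letter, a path, and an optional attribute made of a priority, a scope letter (c or d) and a trailing +. The names `HoardEntry` and `parse_entry` are hypothetical and are not part of any hoarding tool.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class HoardEntry:
    op: str                  # operation letter as written in the profile, e.g. "a"
    path: str
    priority: Optional[int]  # None when the entry gives no priority
    scope: Optional[str]     # "c" (children), "d" (descendants), or None
    future: bool             # True when the attribute ends in "+"


# Assumed attribute shape: [priority:][c|d][+]
_ATTR = re.compile(r"^(?:(\d+):)?([cd])?(\+)?$")


def parse_entry(line: str) -> HoardEntry:
    parts = line.split()
    if len(parts) < 2:
        raise ValueError(f"expected an operation and a path: {line!r}")
    attr = parts[2] if len(parts) > 2 else ""
    match = _ATTR.match(attr)
    if match is None:
        raise ValueError(f"unrecognised attribute: {attr!r}")
    priority, scope, plus = match.groups()
    return HoardEntry(parts[0], parts[1],
                      int(priority) if priority else None,
                      scope, plus is not None)


entry = parse_entry("a /coda/usr/jjk/papers 100:d+")
# entry.priority is 100, entry.scope is "d", entry.future is True
```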

LRU

Works well when
  • activity remains the same; not if
    • a context switch occurs after cache fill
    • a context switch occurs after disconnection
    • the applications used in the disconnected state are not the same as the ones running when hoarding occurs
  • can afford to have misses
    • need to keep the entire working set in the disconnected state
    • some dynamically accessed files may not have been referenced recently

Problems addressed by the program trace approach

Per-program Hoarding

Fixing activity switch problem
  • Per-program traces
  • User specifies which programs will be used

[Figure: trace of program /a/b/c, showing exec() and open() references to /a/b/c1, /a/b/c2, /a/b/c3, /a/b/x/f, /a/b/x/g, /me/file1 and /me/file2]

Uniting Traces

Fixing cache miss problem
  • Look at multiple executions of program (< n)
  • Unite accesses of all traces
    • chance of a dynamically accessed file being missed is lowered
    • may not want to do so for (execution-specific) data
      • distinguish data from program
      • data is in a directory with a different root directory and has a different extension

[Figure: traces of program /a/b/c referencing /a/b/c1, /a/b/c2, /a/b/c3, /a/b/x/f, /a/b/x/g, /me/file1 and /me/file2]
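Trace unification can be sketched as a set union over program files, with data files handled separately. This is only an illustration under stated assumptions: the two traces are made up from the paths on the slide, the data test looks only at the root directory (the extension test is left out), and data files are taken from the most recent trace. The function names are hypothetical.

```python
from pathlib import PurePosixPath


def root_dir(path: str) -> str:
    # First component below "/", e.g. "a" for "/a/b/c".
    return PurePosixPath(path).parts[1]


def is_data_file(path: str, program: str) -> bool:
    # Simplified assumption: data lives under a different root than the program.
    return root_dir(path) != root_dir(program)


def unite_traces(program: str, traces: list[list[str]]):
    """Return (program files united over all traces, data files of the last trace)."""
    program_files = set()
    for trace in traces:
        program_files.update(p for p in trace if not is_data_file(p, program))
    data_files = {p for p in traces[-1] if is_data_file(p, program)}
    return program_files, data_files


# Hypothetical traces of two executions of /a/b/c.
trace1 = ["/a/b/c", "/a/b/c1", "/a/b/c2", "/me/file1"]
trace2 = ["/a/b/c", "/a/b/c1", "/a/b/c3", "/a/b/x/f", "/a/b/x/g", "/me/file2"]
program_files, data_files = unite_traces("/a/b/c", [trace1, trace2])
```

With these inputs /a/b/c2 is kept even though the latest execution did not touch it, while only /me/file2 is kept as data.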

Aggregation Choice

Possible to choose:
  • Most recent trace
  • Trace unification

Data File Choice

Possible to choose:
  • Data files of all executions
  • Data files of all executions by a specific user
  • Data files of the most recent execution by a specific user

Multi-Program Activities

Bookends
  • Snapshot spying
  • User specifies start and end of spying period
  • Associates it with a bookend name

For each bookend, can ask for hoarding of:
  • All accesses recorded
  • Accesses in traces of each program executed
    • Data file filtering

Program Trace Limitations

User involvement
  • Bookend definition
  • Hoarding decisions
    • Data file filtering
    • Most recent vs. aggregated

Fixed by semantic distance approach

Example of Program Trace Limitations

Wish to hoard all chapters of a book written using tex

Define a bookend for this project
  • Get all files accessed by programs (tex) executed during bookend definition
  • Scheme will get tex and all dynamic files accessed by it

Data file choices:
  • Get all my data files accessed during bookend spying
    • Must access all chapters in snapshot
  • Get all of my data files accessed by tex
    • Will get more than I want
  • Get all of my data files accessed during last trace of tex
    • May not have accessed book recently

Semantic Distance Concept

Between files
  • Low if they belong to the same project
  • High if they do not

Use it to determine files in a project

Hoard all or no files of a project (working set)

Temporal Semantic Distance

Clock time elapsed between most recent opens/execs of the files

Clock time is not a good indicator
  • Coffee break between references to related files

Sequence-based Semantic Distance

Number of intervening references (including open of first file) between the most recent opens/execs of the files
  • Non-commutative
  • Looks only at first reference time (open)
    • Files accessed during reference lifetime (open to close) should have equal semantic distance

[Figure: reference sequence A (source file), B (include), C (include), B, with A's open-to-close lifetime and distances 1, 2, 3 marked]

Lifetime-based Semantic Distance

SD(F1, F2)
  • 0 if F2 opened before F1 closed
  • # intervening opens otherwise

Consider an exec as open immediately followed by close

Considers only last reference
  • Dynamic linking conditional

[Figure: reference sequence A (source file), B (include), C (include), B, with distances 0, 0, 3 marked]
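The lifetime-based definition can be sketched over a list of open/close events. The counting convention here is an assumption: opens after F1's open, up to and including F2's open, are counted, which yields the 0, 0, 3 values if the figure is read as "A opens, B and C are opened while A is open, A closes, B is opened again". The event list and the function name `lifetime_sd` are hypothetical.

```python
def lifetime_sd(events, f1, f2):
    """Distance from the most recent open of f1 to the most recent open of f2.

    events is a list of ("open" | "close", filename) pairs in time order.
    An exec would be recorded as an open immediately followed by a close.
    Returns None when f2 was never opened after an open of f1.
    """
    opens = [i for i, (kind, _) in enumerate(events) if kind == "open"]
    i2 = max((i for i in opens if events[i][1] == f2), default=None)
    if i2 is None:
        return None
    i1 = max((i for i in opens if events[i][1] == f1 and i < i2), default=None)
    if i1 is None:
        return None
    closed = any(kind == "close" and name == f1
                 for kind, name in events[i1 + 1:i2])
    if not closed:
        return 0  # f2 was opened during f1's lifetime
    # Assumed convention: count opens after f1's open, including f2's own open.
    return sum(1 for i in opens if i1 < i <= i2)


events = [
    ("open", "A"),
    ("open", "B"), ("close", "B"),
    ("open", "C"), ("close", "C"),
    ("close", "A"),
    ("open", "B"), ("close", "B"),
]
print(lifetime_sd(events[:3], "A", "B"))  # 0: B opened while A was open
print(lifetime_sd(events, "A", "C"))      # 0: C opened while A was open
print(lifetime_sd(events, "A", "B"))      # 3: only the last reference to B counts
```

The result is not symmetric: in this event list there is no open of A after an open of B, so the distance in the other direction is undefined here.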

Aggregation-based Semantic Distance

Take arithmetic mean of SD(F1i, F2i), 1 <= i <= number of references to F1
  • 1, 1, 1498 vs. 500, 500, 500

Take geometric mean

Efficiency:
  • O(N^2) storage
    • Track n (20) closest neighbours
  • O(N) cost per reference
    • Update SDs of files accessed in the last m (100) references
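The two distance series on the slide show why the choice of mean matters. A short check, using only the numbers given above (the helper names are hypothetical):

```python
import math


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    # Assumes every value is positive; a distance of 0 would need special handling.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


mostly_close = [1, 1, 1498]   # usually adjacent, one distant reference
always_far = [500, 500, 500]  # never close

print(arithmetic_mean(mostly_close), arithmetic_mean(always_far))  # 500.0 500.0
print(geometric_mean(mostly_close))  # roughly 11.4
print(geometric_mean(always_far))    # roughly 500
```

The arithmetic mean cannot tell the two pairs apart, while the geometric mean keeps the pair that is usually adjacent much closer.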

Clustering Goal

Cluster files into projects based on SDs

Difficulties
  • No objective measure of goodness of clustering
  • Need overlapping clusters
    • Common header files
  • SD not commutative

Distance-based Threshold

F1, F2 in same cluster if
  • SD(F1, F2) <= p or
  • SD(F2, F1) <= p

Size of project not considered
  • For any p, one can imagine a project with > p files

Combine clusters if they have overlapping files
  • f1, f2 .. fp combined with fp, fp+1 .. fl
  • All files will become one cluster
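The collapse into a single cluster can be reproduced with a small sketch. It assumes a made-up distance in which file fi and file fj are |i - j| apart, and it merges clusters transitively with a union-find structure; both the distance and the function name are illustrative assumptions.

```python
def threshold_clusters(files, sd, p):
    """Group files so that any pair with sd <= p (in either direction) shares a cluster."""
    parent = {f: f for f in files}

    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]  # path halving
            f = parent[f]
        return f

    for a in files:
        for b in files:
            if a != b and (sd(a, b) <= p or sd(b, a) <= p):
                parent[find(a)] = find(b)

    clusters = {}
    for f in files:
        clusters.setdefault(find(f), set()).add(f)
    return list(clusters.values())


files = [f"f{i}" for i in range(1, 7)]


def chain_sd(a, b):
    # Hypothetical distance: position difference in a chain f1, f2, ..., f6.
    return abs(int(a[1:]) - int(b[1:]))


print(len(threshold_clusters(files, chain_sd, p=1)))  # 1
```

Even though f1 and f6 are 5 apart, every neighbouring pair is within the threshold, so overlap-based combining chains all six files into one cluster.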

Common Neighbours-based Threshold

Based on the n (nearest) neighbours

Look at # common neighbours, c

Two thresholds: kf (far) < kn (near)
  • kn <= c: clusters combined into one
  • kf <= c < kn: files inserted in each other’s clusters
  • c < kf: no action

Combining Phase

[Figure: files A, B, C, D, E, F, G with pairwise relationships labelled kn or kf]

Clusters formed as kn relationships are processed:
  • {A, B}
  • {A, B, C}
  • {A, B, C} {D, E}
  • {A, B, C} {D, E} {F, G}
  • {A, B, C} {D, E, F, G}

Insertion Phase

[Figure: files A, B, C, D, E, F, G with pairwise relationships labelled kn or kf]

Clusters before and after kf relationships are processed:
  • {A, B, C} {D, E, F, G}
  • {A, B, C, D} {C, D, E, F, G}
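The two phases can be sketched as follows. The shared-neighbour counts and the threshold values are hypothetical, chosen only so that the result matches the cluster sets shown on the slides; the actual pairwise labels in the figure are not recoverable from the transcript.

```python
def two_phase_clusters(files, common, kf, kn):
    """common maps a pair (a, b) to its number of shared neighbours c."""
    clusters = [{f} for f in files]

    def find(f):
        return next(c for c in clusters if f in c)

    # Combining phase: kn <= c merges the two clusters into one.
    for (a, b), c in common.items():
        if c >= kn:
            ca, cb = find(a), find(b)
            if ca is not cb:
                clusters.remove(cb)
                ca |= cb

    # Insertion phase: kf <= c < kn inserts each file into the other's cluster.
    # Membership is looked up in a snapshot taken after the combining phase.
    base = [set(c) for c in clusters]
    for (a, b), c in common.items():
        if kf <= c < kn:
            for src, dst in ((a, b), (b, a)):
                for i, members in enumerate(base):
                    if dst in members:
                        clusters[i].add(src)
    return clusters


# Hypothetical shared-neighbour counts with kf = 2 and kn = 5.
common = {
    ("A", "B"): 6, ("B", "C"): 6,
    ("D", "E"): 6, ("F", "G"): 6, ("E", "F"): 6,
    ("C", "D"): 3,  # between kf and kn: overlap, but no merge
}
result = two_phase_clusters("ABCDEFG", common, kf=2, kn=5)
print(sorted(sorted(c) for c in result))
# [['A', 'B', 'C', 'D'], ['C', 'D', 'E', 'F', 'G']]
```

Keeping insertion separate from combining is what allows C and D to sit in both clusters without pulling the two projects together.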

Other Correlating Factors

Directory membership
  • Files with common ancestor directories are related

File naming conventions
  • Source and header files have the same prefix

Other relations
  • #include files, import statements, common words

Ancestor level automatically recorded and subtracted from shared neighbours

External investigator generates a relationship weight, which is added to shared neighbours

Another Option

Add/subtract from SD
  • SD is asymmetric
  • Directly modifying shared neighbour count has more impact

Searching Programs

Example: find
  • Opens all files in a subtree
  • Destroys LRU and SD information

Accesses of meaningless program ignored
  • Program accessing > d% of possible directory members

Important to detect meaningless phase rather than program
  • Get working directory
    • Does exhaustive search
    • Accesses during search ignored rather than entire program calling getcwd
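The d% test for a searching program can be written as a simple coverage check. This sketch assumes the system knows which directory members exist and which of them a program touched; the function name and the sample values are hypothetical.

```python
def looks_like_search(accessed, directory_members, d):
    """True when more than d percent of the directory's members were accessed."""
    members = set(directory_members)
    if not members:
        return False
    touched = len(members & set(accessed))
    return 100.0 * touched / len(members) > d


members = [f"/proj/file{i}" for i in range(10)]
print(looks_like_search(members[:9], members, d=50))  # True: 90% touched
print(looks_like_search(members[:2], members, d=50))  # False: 20% touched
```

The same check could be applied to a window of recent accesses instead of a whole program run, which is closer to detecting a meaningless phase.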

Shared Libraries

Accessed by all programs
  • All clusters will be combined via them

Files involved in more than a certain percentage (1%) of accesses are ignored and always put in the hoard set.
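The percentage rule can be sketched with a frequency count over the reference stream. The reference list below is made up, and `frequent_files` is a hypothetical helper; only the 1% figure comes from the slide.

```python
from collections import Counter


def frequent_files(references, percent=1.0):
    """Files that account for more than `percent` of all references."""
    counts = Counter(references)
    total = len(references)
    return {f for f, n in counts.items() if 100.0 * n / total > percent}


# Hypothetical stream: one shared library among 950 files referenced once each.
references = ["libc.so"] * 50 + [f"file{i}" for i in range(950)]
always_hoarded = frequent_files(references)
print(always_hoarded)  # {'libc.so'}
```

Files in the returned set would be left out of the clustering input and added to the hoard set unconditionally.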

Temporary Files

Not important by definition

But may have small semantic distance to other files

System disregards files in certain directories

Rarely Accessed Critical Files

Hardly accessed but important
  • Bootstrapping
  • Suspend/resume files

User-specified lists

System-specific heuristics
  • "." files in Unix

Non Files

Can be critical
  • Device file

Access to them may not be recorded
  • Symbolic link points to actual file

Non-directories take no space
  • Always hoarded

Directories may be needed to do offline file-name translation
  • Replication system makes decisions regarding them

Handling Hoard Miss

If hoard miss
  • Add file and its project to hoard set
  • Record it for goodness measure

Goodness Measure

Caching
  • Cache miss rate

Hoarding
  • Time to first cache miss
    • Does not take into account working set size vs. hoard size
      • working set << hoard size -> no miss
      • working set ~ hoard size -> high miss rate
  • Miss-free hoard size

Miss-free Hoard Size Under LRU

Look at references before most recent disconnection
  F4 F3 F1 F2 F1 F5

Keep only most recent reference to each file
  F4 F3 F2 F1 F5

Mark files accessed since disconnection
  F3 F5

Locate the first marked file in sorted list
  F3

Sum the size of all files between this file and end of sorted list
  F3 + F2 + F1 + F5
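The steps above can be sketched directly. The reference sequences are the ones on the slide; the file sizes are hypothetical, and the function name is an assumption. Files touched while disconnected that were never referenced beforehand are ignored in this sketch, since no LRU hoard size could have covered them.

```python
def miss_free_hoard_size(before, during, size):
    """Smallest LRU hoard size (in size units) that avoids every miss.

    before: references made before the disconnection, oldest first
    during: references made while disconnected
    size:   mapping from file name to file size
    """
    # Keep only the most recent reference to each file, oldest first.
    lru = []
    for f in before:
        if f in lru:
            lru.remove(f)
        lru.append(f)

    # Find the least recently used file that was needed while disconnected,
    # then sum it and everything more recent than it.
    accessed = set(during)
    for i, f in enumerate(lru):
        if f in accessed:
            return sum(size[g] for g in lru[i:])
    return 0


before = ["F4", "F3", "F1", "F2", "F1", "F5"]
during = ["F3", "F5"]
size = {"F1": 10, "F2": 20, "F3": 30, "F4": 40, "F5": 50}  # hypothetical sizes
print(miss_free_hoard_size(before, during, size))  # 110 = F3 + F2 + F1 + F5
```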

Live Usage

Gathered user traces of activities

Few hoard misses in actual usage

Comparison Experiments

Gathered user traces of activities

Replayed each trace, simulating disconnection durations of
  • 24 hours
  • 7 days

Assumed infinitesimal reconnection only for re-hoarding

Mode of traced activities
  • Connected
    • Can do activities normally not done in disconnected mode
      • Web access
    • Access patterns remain the same
  • Disconnected mode
    • Actual hoard misses could influence activities
    • But misses were few anyway

Semantic distance leads to hoard size slightly bigger than WS
  • Much better than LRU

Unresolved Issues

Hoarding of fine-grained data