Automating Hoarding
Prasun Dewan
Department of Computer Science University of North Carolina
2
Manual Hoarding
Per-user, per-workstation hoard profiles, specifying:
  Files to be added or deleted
  Current and future (+) children (c) or descendants (d)
  Priority
Personal files
a /coda/usr/jjk d+
a /coda/usr/jjk/papers 100:d+
Executables
a /usr/X11/bin/xterm
a /usr/X11/bin/xinit
Source files
a /coda/src/venus 100:c+
a /coda/include 100:c+
3
LRU
Works well when:
  activity remains the same
    not if a context switch occurs after cache fill
    not if a context switch occurs after disconnection
    applications used in the disconnected state may not be the same as the ones running when hoarding occurs
  can afford to have misses
    need to keep the entire working set in the disconnected state
    some dynamically accessed files may not have been referenced recently
Problems addressed by the program trace approach
4
Per-program Hoarding
Fixing the activity switch problem
  Per-program traces
  User specifies which programs will be used
[Figure: trace of program /a/b/c, recording exec() and open() calls on /a/b/c1, /a/b/c2, /a/b/c3, /a/b/x/f, /a/b/x/g, /me/file1, /me/file2]
5
Uniting Traces
Fixing the cache miss problem
  Look at multiple executions of program (< n)
  Unite accesses of all traces
    chance of a dynamically accessed file being missed is lowered
    may not want to do so for (execution-specific) data
    • distinguish data from program
    • data is in a directory with a different root directory and has a different extension
[Figure: traces of program /a/b/c over /a/b/c1, /a/b/c2, /a/b/c3, /a/b/x/f, /a/b/x/g, with data files /me/file1 and /me/file2]
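The trace-uniting step can be sketched as follows. This is a minimal illustration, not the system's implementation: the names `root_dir`, `is_data_file` and `unite_traces` are hypothetical, each trace is assumed to be a plain list of absolute paths, and the data-file test uses only the "different root directory" half of the heuristic because the example paths on the slide carry no extensions.

```python
from pathlib import PurePosixPath


def root_dir(path):
    """Return the first directory component of an absolute path ('' if none)."""
    parts = PurePosixPath(path).parts
    return parts[1] if len(parts) > 1 else ""


def is_data_file(path, program):
    # Assumption: a file is execution-specific data when it lives under a
    # different root directory than the program (extension test omitted).
    return root_dir(path) != root_dir(program)


def unite_traces(program, traces):
    """Union the program files of all traces; keep data files per trace."""
    program_files = set()
    data_per_trace = []
    for trace in traces:
        data = {p for p in trace if is_data_file(p, program)}
        program_files |= set(trace) - data
        data_per_trace.append(data)
    return program_files, data_per_trace


# Two hypothetical executions of /a/b/c, built from the paths on the slide.
traces = [
    ["/a/b/c", "/a/b/c1", "/a/b/c2", "/me/file1"],
    ["/a/b/c", "/a/b/c3", "/a/b/x/f", "/a/b/x/g", "/me/file2"],
]
program_files, data_per_trace = unite_traces("/a/b/c", traces)
```

Keeping the data files separate per trace leaves open the later choice between hoarding the data of all executions or only of the most recent one.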
6
Aggregation Choice
Possible to choose:
  Most recent trace
  Trace unification
7
Data file choice
Possible to choose:
  Data files of all executions
  Data files of all executions by specific user
  Data files of most recent execution by specific user
8
Multi-Program Activities
Bookends
  Snapshot spying
  User specifies start and end of spying period
  Associates it with a bookend name
For each bookend, can ask for hoarding of:
  All accesses recorded
  Accesses in traces of each program executed
    Data file filtering
9
Program Trace Limitations
User involvement
  Bookend definition
  Hoarding decisions
    Data file filtering
    Most recent vs. aggregated
Fixed by semantic distance approach
10
Example of Program Trace Limitations
Wish to hoard all chapters of a book written using tex
Define a bookend for this project
  Get all files accessed by programs (tex) executed during bookend definition
  Scheme will get tex and all dynamic files accessed by it
Data file choices:
  Get all my data files accessed during bookend spying
    Must access all chapters in snapshot
  Get all of my data files accessed by tex
    Will get more than I want
  Get all of my data files accessed during last trace of tex
    May not have accessed book recently
11
Semantic Distance Concept
Between files
  Low if they belong to same project
  High if they do not
Use it to determine files in a project
Hoard all or no files of a project (working set)
12
Temporal Semantic Distance
Clock time elapsed between most recent opens/execs of the files
Clock time is not a good indicator
  Coffee break between references to related files
13
Sequence-based Semantic Distance
Number of intervening references (including open of first file) between the most recent opens/execs of the files
Non-commutative
Looks only at first reference time (open)
  Files accessed during reference lifetime (open to close) should have equal semantic distance
[Figure: timeline from open to close of A (source file), with references to B (include), C (include) and a later reference marked "B?"; distances labelled 1, 2, 3]
14
Lifetime-based Semantic Distance
SD(F1, F2) =
  0 if F2 opened before F1 closed
  # intervening opens otherwise
Consider an exec as an open immediately followed by a close
Considers only last reference
  Dynamic linking conditional
[Figure: A (source file), B (include), C (include) and a later reference to B; distances labelled 0, 0, 3]
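The lifetime-based definition can be sketched over an event stream as follows. This is a minimal illustration under stated assumptions: the function name `lifetime_distances` and the `(op, file)` event format are hypothetical, and "# intervening opens" is assumed to count every open after the open of F1 up to and including the open of F2.

```python
def lifetime_distances(events):
    """Return {(f1, f2): SD} using the most recent open of each file.

    events is a list of (op, file) pairs with op in {"open", "close", "exec"}.
    """
    open_count = 0   # number of opens/execs seen so far
    opened_at = {}   # file -> value of open_count at its most recent open
    is_open = set()  # files opened and not yet closed
    sd = {}
    for op, f in events:
        if op in ("open", "exec"):
            open_count += 1
            for g, at in opened_at.items():
                if g == f:
                    continue
                # 0 while g's lifetime is still running, else opens since g.
                sd[(g, f)] = 0 if g in is_open else open_count - at
            opened_at[f] = open_count
            if op == "open":
                is_open.add(f)  # an exec is an open followed at once by a close
        elif op == "close":
            is_open.discard(f)
    return sd


# Hypothetical trace: A stays open while B and C are used, D comes later.
events = [
    ("open", "A"), ("open", "B"), ("close", "B"),
    ("open", "C"), ("close", "C"), ("close", "A"),
    ("open", "D"), ("close", "D"),
]
sd = lifetime_distances(events)
```

In this trace B and C are at distance 0 from A because they are opened during A's lifetime, while D, opened after A is closed, is at distance 3. No entry is recorded from B back to A, which reflects that the measure is not commutative.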
15
Aggregation-based Semantic Distance
Take arithmetic mean of SD(F1_i, F2_i), 1 <= i <= number of references to F1
  1, 1, 1498 vs. 500, 500, 500 (both have arithmetic mean 500)
  Take geometric mean instead
Efficiency:
  O(N^2) storage
    Track n (20) closest neighbours
  O(N) cost per reference
    Update SDs of files accessed in the last m (100) references
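The difference between the two means on the slide's numbers can be checked with a short sketch. The helper names are hypothetical, and the geometric mean here assumes strictly positive distances; a distance of 0 would need separate handling, which this sketch does not attempt.

```python
import math


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    # Computed via logarithms; assumes every value is > 0.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


mostly_close = [1, 1, 1498]     # two adjacent references and one outlier
always_far = [500, 500, 500]

am_close, am_far = arithmetic_mean(mostly_close), arithmetic_mean(always_far)
gm_close, gm_far = geometric_mean(mostly_close), geometric_mean(always_far)
```

Both lists have arithmetic mean 500, so that measure cannot tell them apart. The geometric mean of the first list is the cube root of 1498, between 11 and 12, while the second stays at 500, so the pair that is usually referenced together comes out as much closer.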
16
Clustering Goal
Cluster files into projects based on SDs
Difficulties
  No objective measure of goodness of clustering
  Need overlapping clusters
    Common header files
  SD not commutative
17
Distance-based Threshold
F1, F2 in same cluster if
  SD(F1, F2) <= p or
  SD(F2, F1) <= p
Size of project not considered
  For any p, one can imagine a project with > p files
Combine clusters if they have overlapping files
  f1, f2, ..., fp combined with fp, fp+1, ..., fl
  All files will become one cluster
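The collapse into a single cluster can be reproduced with a small sketch. It assumes the distances are given as a dictionary from ordered file pairs to SD values; the function name `threshold_clusters` and the file names are hypothetical, and a union-find structure is just one convenient way to combine overlapping clusters.

```python
def threshold_clusters(files, sd, p):
    """Cluster files whose SD in either direction is <= p, merging overlaps."""
    parent = {f: f for f in files}

    def find(f):
        # Follow parent links to the cluster representative (path halving).
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    # sd holds directed distances, so scanning every entry covers both
    # SD(F1, F2) <= p and SD(F2, F1) <= p.
    for (a, b), d in sd.items():
        if d <= p:
            parent[find(a)] = find(b)

    clusters = {}
    for f in files:
        clusters.setdefault(find(f), set()).add(f)
    return list(clusters.values())


# A chain in which only consecutive files are close to each other.
files = ["f1", "f2", "f3", "f4", "f5", "f6"]
sd = {(files[i], files[i + 1]): 1 for i in range(len(files) - 1)}
sd[("f1", "f6")] = 400  # the two ends of the chain are far apart

clusters = threshold_clusters(files, sd, p=1)
```

Even though f1 and f6 are far apart, the chain of close neighbours pulls every file into one cluster, which is the weakness the slide points out.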
18
Common Neighbours-based Threshold
Based on the n (nearest) neighbours
Look at # common neighbours, c
Two thresholds: kf (far) < kn (near)
  kn <= c: Clusters combined into one
  kf <= c < kn: Files inserted in each other's clusters
  c < kf: No action
19
Combining Phase
[Figure: files A B C D E F G, with file pairs linked at the kn or kf level]
Clusters as kn-level pairs are combined:
  {A, B}
  {A, B, C}
  {A, B, C} {D, E}
  {A, B, C} {D, E} {F, G}
  {A, B, C} {D, E, F, G}
20
Insertion Phase
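[Figure: files A B C D E F G, with file pairs linked at the kn or kf level]
Clusters after the combining phase:
  {A, B, C} {D, E, F, G}
Clusters after kf-level files are inserted in each other's clusters:
  {A, B, C, D} {C, D, E, F, G}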
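The two phases can be sketched together as follows. This is an illustration only: the function name `cluster_files`, the threshold values and the shared-neighbour counts are hypothetical, chosen so that the result matches the clusters listed above. Only the final clusters come from the slides; which pairs are linked at which level is an assumption.

```python
def cluster_files(files, shared, kn, kf):
    """Two-phase clustering on shared-neighbour counts (requires kf < kn)."""
    # Combining phase: pairs with c >= kn have their clusters merged.
    label = {f: i for i, f in enumerate(files)}
    for (a, b), c in shared.items():
        if c >= kn and label[a] != label[b]:
            old, new = label[b], label[a]
            for f in files:
                if label[f] == old:
                    label[f] = new
    clusters = {}
    for f in files:
        clusters.setdefault(label[f], set()).add(f)

    # Insertion phase: pairs with kf <= c < kn are inserted into each
    # other's clusters, which produces overlapping clusters.
    for (a, b), c in shared.items():
        if kf <= c < kn:
            clusters[label[a]].add(b)
            clusters[label[b]].add(a)
    return sorted(clusters.values(), key=sorted)


files = list("ABCDEFG")
shared = {  # hypothetical shared-neighbour counts
    ("A", "B"): 6, ("B", "C"): 5, ("D", "E"): 7,
    ("F", "G"): 5, ("E", "F"): 6,  # kn-level pairs
    ("C", "D"): 3,                 # kf-level pair
    ("A", "G"): 1,                 # below kf: no action
}
result = cluster_files(files, shared, kn=5, kf=3)
```

Running the insertion phase only after all combining is done keeps a kf-level pair such as C and D from merging the two projects outright; each file is merely added to the other's cluster.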
21
Other Correlating Factors
Directory membership
  Files with common ancestor directories are related
File naming conventions
  Source and header files have same prefix
Other relations
  #include files, import statements, common words
Ancestor level automatically recorded and subtracted from shared neighbours
External investigator generates a relationship weight, which is added to shared neighbours
22
Another Option
Add/subtract from SD
  SD is asymmetric
  Directly modifying shared neighbour count has more impact
23
Searching Programs
Example: find
  Opens all files in a subtree
  Destroys LRU and SD information
Accesses of meaningless program ignored
  Program accessing > d% of possible directory members
Important to detect meaningless phase rather than program
  Get working directory (getcwd)
    Does exhaustive search
    Accesses during search ignored rather than entire program calling getcwd
24
Shared Libraries
Accessed by all programs
  All clusters will be combined via them
Files involved in more than a certain percentage (1%) of accesses are ignored and always put in the hoard set
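The frequency filter can be sketched as follows. The function name `split_frequent`, the flat list of accessed paths and the example paths are hypothetical; only the 1% figure comes from the slide.

```python
from collections import Counter


def split_frequent(accesses, threshold=0.01):
    """Separate files that account for more than `threshold` of all accesses.

    Returns (always_hoard, remaining_accesses); only the remaining accesses
    would be passed on to the semantic-distance computation.
    """
    total = len(accesses)
    if total == 0:
        return set(), []
    counts = Counter(accesses)
    always_hoard = {f for f, n in counts.items() if n / total > threshold}
    remaining = [f for f in accesses if f not in always_hoard]
    return always_hoard, remaining


# Hypothetical log: one shared library in 5% of accesses, 950 other files
# accessed once each (0.1% apiece).
accesses = ["/lib/libc.so"] * 50 + [f"/proj/f{i}" for i in range(950)]
always_hoard, remaining = split_frequent(accesses)
```

Removing such files before clustering keeps them from linking unrelated projects, while adding them to the hoard set unconditionally keeps them available when disconnected.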
25
Temporary Files
Not important by definition
But may have small semantic distance to other files
System disregards files in certain directories
26
Rarely Accessed Critical Files
Hardly accessed but important
  Bootstrapping
  Suspend/resume files
User-specified lists
System-specific heuristics
  . files in Unix
27
Non Files
Can be critical
  Device file
Access to them may not be recorded
  Symbolic link points to actual file
Non-directories take no space
  Always hoarded
Directories may be needed to do offline file-name translation
  Replication system makes decisions regarding them
28
Handling Hoard Miss
If hoard miss
  Add file and its project to hoard set
  Record it for goodness measure
29
Goodness Measure
Caching
  Cache miss rate
Hoarding
  Time to first cache miss
    Does not take into account working set size vs. hoard size
    • working set << hoard size -> no miss
    • working set ~ hoard size -> high miss rate
  Miss-free hoard size
30
Miss-free Hoard Size Under LRU
Look at references before most recent disconnection
  F4 F3 F1 F2 F1 F5
Keep only most recent reference to each file
  F4 F3 F2 F1 F5
Mark files accessed since disconnection
  F3 F5
Locate the first marked file in sorted list
  F3
Sum the size of all files between this file and end of sorted list
  F3 + F2 + F1 + F5
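The steps above can be sketched as a short function. The name `miss_free_hoard_size` and the file sizes are hypothetical; references are assumed to be listed oldest first, and files accessed after disconnection that were never referenced before it are ignored, since this sketch gives LRU no way to have hoarded them.

```python
def miss_free_hoard_size(refs_before, accessed_after, size):
    """Smallest LRU hoard size that avoids misses on previously seen files."""
    # Keep only the most recent reference to each file, oldest first.
    seen = set()
    lru = []
    for f in reversed(refs_before):
        if f not in seen:
            seen.add(f)
            lru.append(f)
    lru.reverse()

    # The first (least recently used) marked file fixes how far back the
    # hoard must reach; everything more recent has to fit as well.
    for i, f in enumerate(lru):
        if f in accessed_after:
            return sum(size[g] for g in lru[i:])
    return 0


refs_before = ["F4", "F3", "F1", "F2", "F1", "F5"]
accessed_after = {"F3", "F5"}
size = {"F1": 10, "F2": 20, "F3": 30, "F4": 40, "F5": 50}  # made-up sizes

needed = miss_free_hoard_size(refs_before, accessed_after, size)
```

With these sizes the result is 30 + 20 + 10 + 50 = 110, the sum F3 + F2 + F1 + F5 from the slide. F2 and F1 are counted although they are never used while disconnected, because LRU would evict F3 before either of them.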
31
Live usage
Gathered user traces of activities
Few hoard misses in actual usage
32
Comparison Experiments
Gathered user traces of activities
Replayed each trace simulating disconnection duration of
  24 hours
  7 days
Assumed infinitesimal reconnection only for re-hoarding
Mode of traced activities
  Connected
    Can do activities normally not done in disconnected mode
    • Web access
    Access patterns remain same
  Disconnected mode
    Actual hoard misses could influence activities
    But misses were few anyway
Semantic distance leads to hoard size slightly bigger than WS
  Much better than LRU
33
Unresolved Issues
Hoarding of fine-grained data