Using Database Technology to Improve Performance of Web Proxy Servers
K. Cheng¹, Y. Kambayashi¹, M. Mohania²
¹Kyoto University, Japan
²Western Michigan University, USA
24-25 May 2001, WebDB'2001, Santa Barbara, CA
Caching on web proxy servers
- Improve throughput of proxy servers
- Improve response times for end users
- Bridge the bandwidth gap between WAN and LAN
- Distribute workload away from web servers
[Diagram: clients on a higher-bandwidth LAN reach web servers on a lower-bandwidth WAN through a proxy server; direct access is crossed out.]
Characteristics of proxy caching

                       Traditional caching    Proxy caching
Storage                Memory-based           Disk-based
Cache size             Small                  Huge
Object survival time   Short                  Long
Algorithm              Simple                 Can be complex
Who uses it?           Programmed processes   People with specific interests
Limitations of current caching schemes: case 1
1. Tom found a very good page "P1" about car models.
2. John was also looking for pages of that kind, but he found only "P2".
3. Both "P1" and "P2" were cached, but Tom did not know about "P2" and John did not know about "P1".
4. After several days, however, both pages were evicted because there were no further visits.
5. As a result, Tom missed "P2", John missed "P1", and the cache missed two hits.

State-of-the-art caching schemes cannot handle this case!
Limitations of current caching schemes: case 2
1. Suppose the users of a proxy server are mostly interested in "XML" but rarely favor "Fuzzy".
2. Suppose some clients retrieved pages "P1" and "P2".
3. After checking the contents of "P1" and "P2", we know "P1" is an "XML" page and "P2" is a "Fuzzy" page.

Should we prefer to cache "P1" or "P2"?
Why can't current schemes handle these cases?
- Physical-object-based cache management: content transparency leads to a low utilization rate (case 1)
  - Approximately 60% of the data in the cache is never used
  - Approximately 90% of the data in the cache is rarely used
- Usage-based object replacement: irrelevant contents stay in the cache needlessly long (case 2)
Our solution
- We propose a hierarchical data model for managing web data (physical pages, logical pages, and topics).
- Object replacement based on:
  - Link structure ("logical pages")
  - Semantic similarity with other objects ("topics")
- Facilitate active access to cache contents
A hierarchical model for web data
[Diagram: three layers, each with its own manager. The physical page manager handles physical pages (p1..p6), which map to logical pages (L1..L3) under the logical page manager, which in turn map to topics (T1, T2) under the topic manager. Users browse physical pages, navigate logical pages, and search topics.]
Physical pages
- Physical page "A": http://www.difa.unibas.it/webdb2001/instructionsPage/index.html
- Physical page "B": ../icons/webdblogo.gif
Logical page
[Diagram: physical pages "A" and "B" combined into one logical page.]
Managing physical pages
- A physical page is one of:
  - an HTML/plain-text file (.html, .txt)
  - an embedded media file (.gif, .png, .wav, .mp3)
  - an application-generated file (.pdf, .ps, .doc)
- Physical pages are managed based on:
  - URL (protocol, IP, port, path)
  - Physical properties (e.g. size, retrieval cost)
  - Usage (frequency, recency)
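A minimal sketch of a physical-page record carrying the management attributes the slide names (URL components, physical properties, usage); the field names and class layout are illustrative, not from the paper:

```python
import time
from dataclasses import dataclass, field
from urllib.parse import urlsplit

@dataclass
class PhysicalPage:
    url: str
    size: int                      # physical property: bytes on disk
    cost: float                    # physical property: retrieval cost
    frequency: int = 0             # usage: access count
    last_access: float = field(default_factory=time.time)  # usage: recency

    def key(self):
        """Split the URL into the (protocol, host, port, path) tuple the slide lists."""
        parts = urlsplit(self.url)
        return (parts.scheme, parts.hostname, parts.port or 80, parts.path)

page = PhysicalPage(
    "http://www.difa.unibas.it/webdb2001/instructionsPage/index.html",
    size=4096, cost=1.0)
print(page.key())
# → ('http', 'www.difa.unibas.it', 80, '/webdb2001/instructionsPage/index.html')
```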
Constructing logical pages
- Basic logical page: a single multimedia document, i.e. one HTML file plus its embedded media files (1..*)
- Extended logical page: several closely related, directly linked pages, e.g. an HTML paper whose sections are different multimedia documents
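Constructing a basic logical page amounts to grouping an HTML file with the media it embeds. A sketch using the standard-library HTML parser (the tag set scanned and the dict layout are assumptions):

```python
from html.parser import HTMLParser

class EmbeddedMediaFinder(HTMLParser):
    """Collect URLs of media embedded in an HTML physical page."""
    def __init__(self):
        super().__init__()
        self.media = []

    def handle_starttag(self, tag, attrs):
        if tag in ("img", "audio", "video", "embed"):
            for name, value in attrs:
                if name == "src":
                    self.media.append(value)

def basic_logical_page(html_url, html_text):
    """Basic logical page = the HTML file plus the media files it embeds."""
    finder = EmbeddedMediaFinder()
    finder.feed(html_text)
    return {"html": html_url, "embedded": finder.media}

lp = basic_logical_page(
    "http://www.difa.unibas.it/webdb2001/instructionsPage/index.html",
    '<html><body><img src="../icons/webdblogo.gif"></body></html>')
print(lp["embedded"])  # → ['../icons/webdblogo.gif']
```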
Managing topics
- Defining a topic:
  - Topic = <id, name, criteria, popularity, date, ...>
  - Popularity = f(F, R, P, U), where
    - F: access frequency of the topic
    - R: time interval between the last access time and the current time
    - P: number of logical pages belonging to the topic
    - U: number of users accessing the topic
- Deciding the membership of a logical page in a topic:
  - IR approaches (k-NN, ...)
  - ML approaches (e.g. support vector machines, SVM)
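The slide only names the four parameters of the popularity function; one illustrative choice of f (exponential recency decay times the additive bulk terms, with an assumed half-life) could look like:

```python
import math

def popularity(F, R, P, U, half_life=86400.0):
    """Illustrative Popularity = f(F, R, P, U); the weighting is an assumption.

    F: access frequency of the topic
    R: seconds between the last access and now
    P: number of logical pages belonging to the topic
    U: number of users accessing the topic
    """
    recency = math.exp(-R / half_life)  # decays as the topic goes unaccessed
    return recency * (F + P + U)

print(popularity(F=10, R=0, P=3, U=2))  # → 15.0
```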
Definitions
We use the term "priority" for object replacement. It is a function of several parameters, e.g. access frequency (F), time interval (R), object size (S), retrieval cost (C), and significance (G).
Significance: the importance of the topic
Caching policy: LRU-SP+
- Topic management: Priority = f(F, R, G)
- Logical page management (basic logical pages only): Priority = g(F, R)
- Physical page management: LRU-SP, a size-adjusted and popularity-aware LRU (K. Cheng et al., COMPSAC'00): Priority = h(F, R, S)
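The three levels can be sketched as three priority functions over the parameters each level uses. The paper only states which parameters appear at each level; the concrete formulas below (frequency times an exponential recency decay, size-adjusted at the physical level) are assumptions for demonstration:

```python
import math
import time

def recency(last_access, now=None, half_life=3600.0):
    """Exponential decay of the time interval R since the last access."""
    now = time.time() if now is None else now
    return math.exp(-(now - last_access) / half_life)

def topic_priority(F, last_access, G, now=None):
    """Topic level: Priority = f(F, R, G) — significance-weighted."""
    return F * recency(last_access, now) * G

def logical_page_priority(F, last_access, now=None):
    """Logical page level: Priority = g(F, R)."""
    return F * recency(last_access, now)

def physical_page_priority(F, last_access, S, now=None):
    """Physical page level (LRU-SP): Priority = h(F, R, S), size-adjusted."""
    return F * recency(last_access, now) / max(S, 1)

now = 1000.0
print(topic_priority(F=4, last_access=now, G=2.0, now=now))  # → 8.0
```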
Evaluate & add new objects
[Diagram: topics T1 and T2 map to logical pages L1-L3, which map to physical pages P10-P42; objects are ordered from higher to lower priority. A new object "D" is evaluated and placed according to its priority; here "D" is of higher priority.]
Replace an object
1. Choose a candidate topic (T1).
2. T1 has one logical page (L1); choose L1.
3. L1 has three physical pages (P10, P11, P12), where P12 is shared with L2.
4. Choose a victim P* from P10 and P11.
5. Replace P* with the new page.
[Diagram: the same topic/logical-page/physical-page hierarchy as before, highlighting T1, L1, and P10-P12.]
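The replacement walk above can be sketched as follows; the data layout, function names, and the use of a single priority function at every level are illustrative, not from the paper:

```python
def choose_victim(topics, priority):
    """Walk topic -> logical page -> physical page to pick an eviction victim.

    topics: {topic: {logical_page: [physical pages]}}
    priority: callable returning the replacement priority of any node;
    a physical page shared by another logical page must not be evicted.
    """
    # 1. Candidate topic: the one with the lowest priority.
    topic = min(topics, key=priority)
    # 2. Within it, the lowest-priority logical page.
    lpage = min(topics[topic], key=priority)
    # Pages referenced by any other logical page are protected.
    shared = {p for t in topics for l in topics[t]
              for p in topics[t][l] if (t, l) != (topic, lpage)}
    # 3-4. Victim: lowest-priority unshared physical page of that logical page.
    candidates = [p for p in topics[topic][lpage] if p not in shared]
    return min(candidates, key=priority)

# Mirrors the slide: P12 is shared by L1 and L2, so the victim
# must come from P10 or P11.
topics = {"T1": {"L1": ["P10", "P11", "P12"]},
          "T2": {"L2": ["P12", "P20"]}}
prio = {"T1": 1, "T2": 2, "L1": 1, "L2": 2,
        "P10": 3, "P11": 5, "P12": 1, "P20": 4}
print(choose_victim(topics, prio.get))  # → P10
```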
Preliminary experiments
- Replay access logs of our proxy server (Squid)
  - 30 clients, 30 days
  - 873,824 requests, 21.30 GB of data
  - 7 topics, priority in [1..5]
- Significance factor (in [0, 2]): measures the significance of each topic
- Hit rate (HR): percentage of requests satisfied by the cache
- Profit rate (PR), where g_i is the significance of the topic of requested object d_i:

  PR = Σ_{i=1}^{N} g_i · y_i / Σ_{i=1}^{N} g_i,
  with y_i = 1 if d_i is in the cache, and y_i = 0 otherwise
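A small sketch of computing both metrics from a replayed trace; the trace layout is illustrative, and the normalization of PR by the total significance follows the formula above:

```python
def hit_and_profit_rates(requests, cached, significance):
    """Compute hit rate (HR) and profit rate (PR) over a request trace.

    requests: list of (object_id, topic) pairs
    cached: set of object ids satisfied by the cache
    significance: {topic: g} with g in [0, 2]
    PR weights each hit by the significance of its topic.
    """
    hits = sum(1 for obj, _ in requests if obj in cached)
    total_g = sum(significance[t] for _, t in requests)
    hit_g = sum(significance[t] for obj, t in requests if obj in cached)
    return hits / len(requests), hit_g / total_g

reqs = [("p1", "XML"), ("p2", "Fuzzy"), ("p3", "XML"), ("p4", "XML")]
hr, pr = hit_and_profit_rates(reqs, cached={"p1", "p3"},
                              significance={"XML": 2.0, "Fuzzy": 0.5})
print(hr, pr)  # → 0.5 and 4.0/6.5 ≈ 0.615
```

Because "XML" carries more significance than "Fuzzy", caching the XML pages yields PR > HR, which is the behavior the significance factor is meant to reward.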
Baseline algorithm: LRV (Rizzo et al., 1998)
- A physical-page-based algorithm
- Uses size (S) to predict further accesses to incoming objects
- Parameters considered:
  - Access frequency (F)
  - Time interval (R)
  - Object size (S)
Results: hit rates (20% up)
[Chart: hit rate (0 to 0.25) vs. cache space in % of total unique data (0.5, 3, 6, 10) for LRV and LRU-SP+; LRU-SP+ improves the hit rate by about 20%.]
Results: profit rates (30% up)
[Chart: profit rate (0 to 0.5) vs. cache space in % of total unique data (0.5, 3, 6, 10) for LRV and LRU-SP+; LRU-SP+ improves the profit rate by about 30%.]
Conclusion and future work
- The performance of caching proxies can be remarkably improved when cache contents are well organized and managed
- Proposed a hierarchical model and a cache management scheme based on that model
- Future work:
  - Tuning various parameters for better performance (logical page clustering, a priority balancing significance and popularity, etc.)
  - More experiments