uportal 2 status – 9/19/05

24
uPortal 2 Status – 9/19/05 Dan Mindler Enterprise Systems & Services [email protected]

Upload: glyn

Post on 19-Mar-2016

71 views

Category:

Documents


3 download

DESCRIPTION

uPortal 2 Status – 9/19/05. Dan Mindler Enterprise Systems & Services dmindler @rutgers.edu. UPortal 2: Releases. uPortal 2.5.x 2.5.0 GA 5/26/05 2.5.1 RC1 available 7/05 2.5.1 GA ? uPortal 2.4.x 2.4.3 GA 8/12/05 - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: uPortal 2 Status – 9/19/05

uPortal 2 Status – 9/19/05

Dan MindlerEnterprise Systems & [email protected]

Page 2: uPortal 2 Status – 9/19/05

9/19/05 2

UPortal 2:Releases

uPortal 2.5.x 2.5.0 GA 5/26/05 2.5.1 RC1 available 7/05 2.5.1 GA ?

uPortal 2.4.x 2.4.3 GA 8/12/05

Quick Starts for 2.5.0 and 2.4.3 made available 9/4/05 (contributed by Vincent Mathieu [email protected])

Page 3: uPortal 2 Status – 9/19/05

9/19/05 3

uPortal 2.x Memory/Performance:Status As of 8/31/05, all known memory

leaks in myRutgers have been resolved

Finalizers did play a part Many leaks have been plugged Performance improvements have

been made But, there is room for improvement…

Page 4: uPortal 2 Status – 9/19/05

9/19/05 4

uPortal 2.x Memory/Performance:Background

Scott Battaglia presented overview on 3/21/05 (http://jasigch.princeton.edu:9000/display/UPC/March+2005+uPortal+Developers+Meeting+Minutes#March2005uPortalDevelopersMeetingMinutes-24PerfMem)

myRutgers has 10-15k unique users per day, with 30-40k total logins. Load balance per session to 4 machines

Symptoms indicated memory leak Saw in production since 11/04 Utilized scripting to monitor JVM heap and mod_jk

connections; restart before the out-of-memory occurred

Used tools (e.g., YourKit) to inspect the heap

Page 5: uPortal 2 Status – 9/19/05

9/19/05 5

uPortal 2.x Memory/Performance:Initial Changes Fixes before March, 2005:

Removed caching of IPersons from PersonDirectory

CError and CSecureInfo now pass events to wrapped channels

Restrict access to ChannelFactory’s channel cache, synchronized instantiateChannel method

Guest sessions created on time out AbstractMultithreadedChannels were not

cleaning out their channel state maps (2 of them)

Page 6: uPortal 2 Status – 9/19/05

9/19/05 6

uPortal 2.x Memory/Performance:More Changes Fixes discussed March, 2005:

Switch homegrown ThreadPool to Backport Concurrent

Remove finalizer in UBC_Webmail Update to AuthorizationImpl

Page 7: uPortal 2 Status – 9/19/05

9/19/05 7

uPortal 2.x Memory/Performance:More leaks? Since March, 2005 changes, behavior

better, but myRutgers JVMs still get bounced in anywhere from 1 to 3 hours

Finalizer queue still indicates many unfinalized Objects

Question: Why can’t the finalizer thread keep up?

Page 8: uPortal 2 Status – 9/19/05

9/19/05 8

uPortal 2.x Memory/Performance:Finalization Problems If an Object implements public void finalize(), after

no longer referenced gets placed on finalizer queue

A finalizer thread will sweep this list at its’ leisure (i.e., no guarantee of when)

Perhaps the GC thread is too busy, many temporary Objects created

Target: finalizers, leaks and temporary objects

Page 9: uPortal 2 Status – 9/19/05

9/19/05 9

uPortal 2.x Memory/Performance:Finalizers, Leaks and Temporary Objects Target finalizer() methods:

org.jasig.portal.MultipartDataSource Xalan XRTreeFrag (now in Xalan 2.7) Upgrade to JavaMail 1.3.2 and modify local copy to remove finalizers

Target Leaks/Performance: org.jasig.portal.ChannelManager.java channelTable (Hashtable) not

cleaned properly org.jasig.portal.car.CarResources.java not closing stream Turn off dynamic class reloading in web.xml:

<Context reloadable="false"/> Target excessive temporary Objects: (

http://jasigch.princeton.edu:9000/pages/viewpage.action?pageId=9271): org.jasig.portal.serialize.HTMLdtd.fromChar(int) – based on Xerces, but

now deprecated – removed temp object creation org.jasig.portal.utils.SubstitutionIntegerFilter – reduced temp object

creation org.jasig.portal.MediaManager – reduced number of MediaManager

objects created

Page 10: uPortal 2 Status – 9/19/05

9/19/05 10

uPortal 2.x Memory/Performance:Still Memory Problems Server behaving better; some servers requiring restart

in 3 hours, others in 3 days Heap dumps still point to Objects on Finalizer queue –

primarily JDK sockets (plain and SSL) Though still running out of a memory:

Unable to replicate in QA – indicates testing does not reflect usage

Unpredictable nature of heap growth - servers running similar traffic/load should show similar pattern in heap usage (not one failing in 3 hours and another in 3 days)

Tools no longer point to leaked Objects (other than on Finalizer queue)

So, why can’t the finalizer thread keep up!?!?

Page 11: uPortal 2 Status – 9/19/05

9/19/05 11

uPortal 2.x Memory/Performance:Out Of Memory Graph

Page 12: uPortal 2 Status – 9/19/05

9/19/05 12

uPortal 2.x Memory/Performance:Try, Try Again Too many open files (configured for 8k)

Thought it was network sockets During production run, got a snapshot of lsof (list of

open files), indicating hundreds of:java 6843 tomcat 30r VREG  85,7 409 4271365 /u01/app/jakarta-tomcat-5.0.25_load_test/work/Catalina/localhost/portallt/loader/com/swabunga/spell/engine/configuration.properties

Email uses Jazzy (http://sourceforge.net/projects/jazzy) was not explicitly closing files (com.swabunga.spell.engine.PropertyConfiguration.java)

Wrote/Deployed a web-app that given parameters: spawns a thread to run System.gc() every configured

milliseconds spawns a thread to run System.runFinalization() every

configured milliseconds – not guaranteed to run finalization, but system will make an effort

Page 13: uPortal 2 Status – 9/19/05

9/19/05 13

uPortal 2.x Memory/Performance:A Watched Pot…Boils Patch solved problem of too many open

files Periodic run of finalization appeared ok:

2005-08-16 07:26:03.591: Thread-1: DEBUG: Requesting finalization 2005-08-16 07:26:03.593: Thread-1: DEBUG: Finalization request complete

2005-08-16 07:26:03.593: Thread-1: DEBUG: Finalization thread sleeping 900 seconds

Monitored tenured area of heap:Heap def new generation total 873856K, used 62635K [0x75000000, 0xb5000000, 0xb5000000) eden space 699136K, 0% used [0x75000000, 0x75000000, 0x9fac0000) from space 174720K, 35% used [0x9fac0000, 0xa37eadb0, 0xaa560000) to space 174720K, 0% used [0xaa560000, 0xaa560000, 0xb5000000) tenured generation total 1048576K, used 287178K [0xb5000000, 0xf5000000, 0xf5000000) the space 1048576K, 27% used [0xb5000000, 0xc6872810, 0xc6872a00, 0xf5000000) compacting perm gen total 32768K, used 24113K [0xf5000000, 0xf7000000, 0xf9000000) the space 32768K, 73% used [0xf5000000, 0xf678c538, 0xf678c600, 0xf7000000)

Page 14: uPortal 2 Status – 9/19/05

9/19/05 14

uPortal 2.x Memory/Performance:Bingo! Noticed the tenured space growing after

successive GC’s: heap leak occurring Scanned the log for finalization messages:

2005-08-16 12:07:04.466: Thread-2: DEBUG: Requesting finalization...

No debug print of finalization complete!!!!! AHAA!!!

Finalization request not returning at the same time the tenured area of heap starts growing… Finalization thread is blocked

Issue a “kill -3” to obtain a Java Thread Dump

Page 15: uPortal 2 Status – 9/19/05

9/19/05 15

uPortal 2.x Memory/Performance:Blocked Finalizer Threads Finalizer threads stack traces:

"Secondary finalizer" prio=5 tid=0x011ef348 nid=0x1477 runnable [d8000000..d80019c0]at java.lang.Object.wait(Native Method)- waiting on <0xeaf54f08> (a com.sshtools.j2ssh.transport.TransportProtocolAlgorithmSync)at com.sshtools.j2ssh.transport.TransportProtocolAlgorithmSync.lock(Unknown Source)- locked <0xeaf54f08> (a com.sshtools.j2ssh.transport.TransportProtocolAlgorithmSync)at com.sshtools.j2ssh.transport.TransportProtocolOutputStream.sendMessage(Unknown Source)- locked <0xeaf55810> (a com.sshtools.j2ssh.transport.TransportProtocolOutputStream)at com.sshtools.j2ssh.transport.TransportProtocolCommon.sendMessage(Unknown Source)- locked <0xeaf49ef0> (a com.sshtools.j2ssh.transport.TransportProtocolClient)at com.sshtools.j2ssh.transport.TransportProtocolCommon.sendDisconnect(Unknown Source)at com.sshtools.j2ssh.transport.TransportProtocolCommon.disconnect(Unknown Source)at com.sshtools.j2ssh.SshClient.disconnect(Unknown Source)at edu.columbia.filesystem.impl.SFTPFileSystemImpl.disconnect(SFTPFileSystemImpl.java:89)at edu.columbia.filesystem.impl.RemoteFileSystemImpl.finalize(RemoteFileSystemImpl.java:715)at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:83)at java.lang.ref.Finalizer.access$100(Finalizer.java:14)at java.lang.ref.Finalizer$2.run(Finalizer.java:131)at java.lang.Thread.run(Thread.java:534)

Page 16: uPortal 2 Status – 9/19/05

9/19/05 16

uPortal 2.x Memory/Performance:Investigation Thread dump indicated two finalization threads, both

blocked at the same point in code Points to Briefcase channel, in the low level code to

handle secure connections to ftp server (J2SSH http://sourceforge.net/projects/sshtools)

A lock release is NOT in a finally clause A user who opens their briefcase, then proactively

closes it causes the problem when the user logs/times out: Proactive closing sends a close command over the

socket A finalizer in the HyperContent (uses J2SSH) issues

close (throws Exception, not releasing lock) A finalizer in briefcase issues close through

HyperContent, which waits on the lock

Page 17: uPortal 2 Status – 9/19/05

9/19/05 17

uPortal 2.x Memory/Performance:Wash, Rinse, Repeat Need to re-create problem in QA Why not caught in QA?

The briefcase channel utilizes sftp to an individual account, but a test sftp server does not exist, so a “real” user must be used – not part of the Test Scenario

A new test script is written to simulate one user logging into the portal and using the Briefcase

Problem is replicated consistently within 20 minutes using a 256MB heap

Page 18: uPortal 2 Status – 9/19/05

9/19/05 18

uPortal 2.x Memory/Performance:Final Resolution Though the J2SSH code is no longer actively

maintained, a contribution is found to address this issue – made available 6/04

A HyperContent class is extended and a finalizer method is overridden to only close if the connection is open

The finalizer in the briefcase is removed QA test now passes

Page 19: uPortal 2 Status – 9/19/05

9/19/05 19

uPortal 2.x Memory/Performance:Install Fix in Production Final set of fixes rolled into production

8/31/05 Servers have not run out of memory

since release has been introduced But…

Page 20: uPortal 2 Status – 9/19/05

9/19/05 20

uPortal 2.x Memory/Performance:Increase in Threads Monitoring of servers show a gradual

increase in the number of threads (from initial 180 to over 1k)

kill -3 reveals hundreds of threads in JavaMail:

"JavaMail-EventQueue" daemon prio=5 tid=0x002da400 nid=0x534 in Object.wait() [aa881000..aa8819c0]at java.lang.Object.wait(Native Method)- waiting on <0xba6476f0> (a javax.mail.EventQueue)at java.lang.Object.wait(Object.java:429)at javax.mail.EventQueue.dequeue(Unknown Source)- locked <0xba6476f0> (a javax.mail.EventQueue)at javax.mail.EventQueue.run(Unknown Source)at java.lang.Thread.run(Thread.java:534)

Modified JavaMail to restore needed finalization() methods

Patch fixes thread problem, but…

Page 21: uPortal 2 Status – 9/19/05

9/19/05 21

uPortal 2.x Memory/Performance:Heavy Traffic New semester brings much more traffic

(and more problems): Increase in portal logins (40k in one day,

3600/hour) overwhelms authentication infrastructure, causing mod_jk threads blocking resulting in restarts (and more logins, etc…)

Channel time out – traced to heavy load and use of SoftReference caches; worst case scenario caching

Heavy memory usage – many temporary objects; over 600 concurrent sessions results in over 700MB temporary objects every second

Page 22: uPortal 2 Status – 9/19/05

9/19/05 22

uPortal 2.x Memory/Performance:Triage Address immediate concerns for

heavy load*: Introduce LDAP connection pooling Modify caching of layouts to share across

users Tune caching of heavily used channels

* Modifications not yet submitted to uPortal codebase

Page 23: uPortal 2 Status – 9/19/05

9/19/05 23

uPortal 2.x Memory/Performance:Going Forward… Focus on heavily used channels (Email) Revisit statistics to capture better data for

monitoring/reporting/forecasting production servers Implement better caching algorithms Monitor/tune/implement pooling:

LDAP DB HTTP client connections

Migrate to JDK 1.5 Continue to monitor/tune heap

Large new area for temporary objects Go on a Temporary Object Diet!

Page 24: uPortal 2 Status – 9/19/05

9/19/05 24

uPortal 2.x Memory/Performance:Lessons Learned QA should reflect production usage Java “Memory Leaks” becomes a catch-

phrase for many different types of problems Useful tools:

Java Heap/CPU analyzer (e.g., YourKit) lsof (ftp://lsof.itap.purdue.edu/pub/tools/unix/lsof/) kill -3 thread dumps

Experience was invaluable: Solved many memory leaks/performance issues

in codebase contributing to a more stable portal Skills can be applied to other Java-based apps