slides by carl kingsford jan. 13, 2014 kt chapter 1example v: side-chain positioning chazelle et...
TRANSCRIPT
02-713 Introduction
Slides by Carl Kingsford
Jan. 13, 2014
KT Chapter 1
Objective of this Course
To study general computational problems and their algorithms,with a focus on the principles used to design those algorithms.
After passing this class, you should be able to:
1. Design algorithms using several common techniques
2. Prove a worst-case running time for many algorithms
3. Prove a problem is probably hard (NP-complete)
Example Problems
Example I: Low-cost network design
Example II: Finding closest pair of points
Given a set of points {p1, . . . , pn} find the pair of points {pi , pj}that are closest together.
Example III: RNA folding
GAUGGCAAAUGCUAAGGCCU... →
G C
CG
UA
A U
U A
C G
CG
AU
A UG
G
G
U
A
A
A
A G C C
GGCUU A
AA
G A
C
C
G
GU
CU
UU
ACC
CC
GG
A
UA
U
G
CC
C
C
A
A
Example IV: Scheduling k planes
Pacific Time4:00
Mountain Time 5:00 Central Time 6:00 Eastern Time 7:00Hawaii-Aleutian Time 2:00
Alaska Time 3:00
CANADA
Delta Air Lines/Delta Connection Route
Future Route Service
Route served by Alaska Airlines/Horizon Air
Destination served by Delta / Delta Connection
Destination served by one of Delta’s Worldwide Codeshare Partners
Time zone (Legislated standard time zones shown. Observed time may differ.)
Effective January 2013. Select routes are seasonal. Some future services subject to government approval. Service may be operated by one of Delta’s codeshare partner airlines or one of Delta’s Connection Carriers. Flights are subject to change without notice.
T E X A S
LOUISIANAMISSISSIPPI
A L A B A M A
G E O R G I A
F LO R I D A
S O U T H C A R O L I N A
N O R T H C A R O L I N A
V I R G I N I A
K E N T U C K Y
O H I O
PE N N SYLVAN IA
N E W YO R K
I N D I A N A
I L L I N O I S
MICHIGAN
T E N N E S S E EO K L A H O M A
M I S S O U R I
I O WA
K A N S A S
N E B R A S K A
S O U T H D A KOTA
N O R T H D A KOTA M I N N E S OTA
WISCONSINW YO M I N G
M O N TA N A
I D A H OO R E G O N
W A S H I N G T O N
C A L I F O R N I A
N E VA D A
U TA H
A R I Z O N AN E W M E X I C O
C O LO R A D O
A R K A N S A S
W E S T V I R G I N I A
V T
N H
M A I N E
M A S S .
C O N N.RI
N J
D E L .MARYLAND
M E X I C O
C A N A D A
BAHAMAS
CUBADOMINICAN
REPUBLIC
PacificOcean
Atlantic Ocean
Gulfof Mexico
H AWA I I
Pacific Ocean
A L A S K A
Miami
Orlando
West Palm Beach
Portland
Seattle/Tacoma
Boise
San Jose
Las Vegas
Burbank
San Diego
San Francisco
Oakland
Denver
Sacramento
Salt Lake City
Tucson
Phoenix/Scottsdale
Sarasota/Bradenton
Albuquerque
Charleston
Colorado Springs
Greenville/Spartanburg
Huntsville/ Decatur
Pensacola
Savannah
Baltimore
Birmingham
Chicago (ORD, MDW)
Houston(IAH, HOU)
Louisville
Memphis
Milwaukee
Philadelphia
San Antonio
St. Louis
Tampa/St. Petersburg
Charlotte
Cleveland
Dallas/Ft. Worth (DFW)
Detroit
Jacksonville
Kansas City
New Orleans
New York (JFK, LGA)
Norfolk/Virginia Beach
Omaha
Baton Rouge
Albany
Atlanta
Austin
Boston
Columbia
Columbus
Jackson
Little Rock
Nashville
Oklahoma City
Raleigh/Durham
Richmond
Tallahassee
Washington, D.C. (DCA, IAD)
Hartford/Springfield
Cincinnati
Bozeman
Orange County
Portland
Providence
Newark
Greensboro/High Point/Winston-Salem
Lexington
Grand Rapids
Ft. Lauderdale/Hollywood
Syracuse
Buffalo/Niagara Falls
KnoxvilleTulsa
Daytona Beach
El Paso/Ciudad Juárez
Manchester
Melbourne
Mobile
Ontario
Ft. Myers/Naples
Ft. Walton Beach
Indianapolis
Minneapolis/St. Paul
Dayton
Lafayette
Alexandria
Harrisburg
Madison
Shreveport
Pittsburgh
Appleton/Fox Cities
Binghamton
Bangor
Billings
Bloomington
Burlington
Cedar Rapids/ Iowa City
Des Moines
Elmira/Corning
Erie
Ft. Wayne
White PlainsLansing
Midland/Saginaw
Moline/ Quad Cities
Rochester
Springfield/Branson
Great Falls
Sioux Falls
Spokane
Wichita
Lincoln
Montgomery
Panama City
Fayetteville/Northwest Arkansas
Missoula
Rapid City
Reno/Tahoe
Asheville
Charleston
Akron/Canton
Gulfport/Biloxi
Chattanooga
Newburgh
Roanoke
Gainesville
Kalispell
Evansville
Monroe
Tri-Cities
Casper
Jackson Hole
Pasco/Richland/Kennewick
Los Angeles Newport News/Williamsburg
South Bend
Helena
Flint
Idaho Falls
Allentown
Kalamazoo/Battle Creek
Duluth
Myrtle Beach
Juneau
KahuluiHonolulu
Kona
Long Beach
La CrosseRochester
Killeen/Ft. Hood
Montrose/Telluride
Traverse City
Hayden/Steamboat Springs
Eagle/Vail/Beaver Creek
Nantucket
Lihue
Gillette
Rock Springs
Fargo
Williston
Green BayWausau
Ithaca
Peoria
West Yellowstone
Anchorage
Fairbanks
Ketchikan
Elko
Palm Springs
Fresno/Yosemite
Sun Valley
TwinFalls
Pocatello
St. George
Columbus/Starkville/West Point
Albany
Augusta
Wilkes-Barre/Scranton
Brunswick
Charlottesville
Columbus/Ft. Benning
Dothan
Key West
Fayetteville/Ft. Bragg
Valdosta
State College
Wilmington
Cody
EugeneLewiston
Medford
Redmond/Bend
Grand Junction
Jacksonville/Camp LejeuneNew Bern
Cedar City
Butte
Bismarck
Aberdeen
Alpena
Sault Ste. Marie
Escanaba
Grand Forks
Iron Mountain
Brainerd Marquette
Columbia
BemidjiChisholm/Hibbing
International FallsMinot
Pellston/Mackinac Island
Santa Rosa
Dallas Love Field (DAL)
Ft. Smith
BellinghamAbbotsford
Wenatchee
Yakima
Walla Walla
Sitka
Mammoth Lakes
Pullman
Maui
Hawaii
Kauai
Oahu
Vancouver
Edmonton
Winnipeg
Saskatoon
Regina
Calgary/Banff
Toronto
Montreal
Québec
Ottawa
Halifax
VictoriaKelowna
Ft. McMurrayGrande PrairiePrince George
Charlottetown
St. John’s, NL
Martha’s Vineyard
Hilo
Thunder Bay
Moncton
Rhinelander
Delta.com
LGA → YVR, 9am - 7pmPIT → KOA, 6am - 8pmDFW → MSY 4pm - 6pm
…
Example V: Side-chain positioning
Chazelle et al.: A Semidefinite Programming Approach to Side Chain Positioning with New Rounding StrategiesINFORMS Journal on Computing 16(4), pp. 380–392, © 2004 INFORMS 387
consider uniform random graphs and then considera family of randomly generated graphs that bettermodel the interaction graphs observed in proteins.The semidefinite programs were solved using ver-
sion 6.0 of the SDPA (Fujisawa et al. 1997) package, animplementation of an infeasible primal-dual interior-point method. The linear programs were solved usingthe dual simplex method with AMPL (Fourer et al.2002) and CPLEX 7.1 (ILOG CPLEX 2000). The SDPsolutions were rounded using the projection andPerron-Frobenius methods described above.We compare our SDP with the LP obtained by
relaxing the integrality constraints from (4). The LPsolutions were rounded by choosing node u withprobability xuu. For problems of this size, optimalintegral solutions (denoted by OPT) can be foundusing formulation (4) and the integer-programmingoption of CPLEX. This allows us to compute the rel-ative gap of each rounded solution, as computed by!!x−OPT"/OPT!, where x is the value of the solution.For both the protein-design problems and the simu-lated data, the SDP rounding schemes perform well,with significantly better average relative gaps.
4.1. Cold-Shock ProteinWe applied the SDP method to the problem ofredesigning the core of the Bacillus caldolyticus cold-shock protein (Mueller et al. 2000) (PDB code: 1c9o).Core residues were defined as having less than 1% oftheir surface area exposed to solvent, and determinedby the program Surfv (Nicholls et al. 1991). The fol-lowing eight residues were found: Val6, Gly16, Ile18,Val28, Leu41, Val47, Phe49, and Val63. The hydropho-bic core positions (i.e., all positions listed above exceptposition 16 with Gly) were varied, and allowed toassume any rotamer of the hydrophobic amino acidsAla, Val, Ile, Leu, Met, and Phe that occurred inthe backbone-dependent rotamer library (Dunbrack
Leu41
Val6
Val47
Val28
Ile18
Phe49
Val63
Leu41
Ile6
Ile47
Ile28
Ile18
Phe49
Val63
Side Chains in Nature(b) Positioning of the(a) Full Protein (c) Solution Returned by the SDP
Rounding Schemes (the Optimal)
Figure 2 Cold-Shock Protein (1c9o)
and Karplus 1993). This yields 55 rotamers per posi-tion. The protein is shown in Figure 2a, with vari-able atoms shown as black spheres and the axis ofthe #-barrel vertical. The native positions of the sidechains are shown in Figure 2b, where the proteinis rotated so that we are looking down the axis ofthe barrel.The resulting problem had 385 nodes, seven posi-
tions, and 63,313 nonzero cost-matrix entries. Simplepairwise DEE of Goldstein (1994), a polynomial-timerule for throwing out rotamers that cannot possiblybe in the optimal solution, was applied to the prob-lem until no more nodes could be eliminated. Thisreduced the problem to 137 nodes, seven positions,and 7,865 nonzero cost-matrix entries.The LP solution was rounded 1,000 times using
the simple LP-rounding scheme. The SDP solutionwas rounded 1,000 times with both the projection andPerron-Frobenius-rounding schemes. The minimum-energy solution found is a good measure of howwell one would do in practice, but this minimumenergy may be influenced by the moderate search-space size. The average energy of a rounded solutionis a better indicator of the distribution obtained fromrounding the relaxations. The best value over 1,000roundings and the empirical average objective valuein the limit are:
Method Best Average
LP −217$2880 4058$6651Projection −238$4218 −102$2822Perron-Frobenius −238$4218 −209$3617
The optimum solution (determined by the integer-programming option of CPLEX) is −238$4218. BothSDP rounding schemes find the optimum solution;this appears to correspond to a well-packed and plau-sible structure, as shown in Figure 2c. The aver-
Chazelle et al.: A Semidefinite Programming Approach to Side Chain Positioning with New Rounding StrategiesINFORMS Journal on Computing 16(4), pp. 380–392, © 2004 INFORMS 387
consider uniform random graphs and then considera family of randomly generated graphs that bettermodel the interaction graphs observed in proteins.The semidefinite programs were solved using ver-
sion 6.0 of the SDPA (Fujisawa et al. 1997) package, animplementation of an infeasible primal-dual interior-point method. The linear programs were solved usingthe dual simplex method with AMPL (Fourer et al.2002) and CPLEX 7.1 (ILOG CPLEX 2000). The SDPsolutions were rounded using the projection andPerron-Frobenius methods described above.We compare our SDP with the LP obtained by
relaxing the integrality constraints from (4). The LPsolutions were rounded by choosing node u withprobability xuu. For problems of this size, optimalintegral solutions (denoted by OPT) can be foundusing formulation (4) and the integer-programmingoption of CPLEX. This allows us to compute the rel-ative gap of each rounded solution, as computed by!!x−OPT"/OPT!, where x is the value of the solution.For both the protein-design problems and the simu-lated data, the SDP rounding schemes perform well,with significantly better average relative gaps.
4.1. Cold-Shock ProteinWe applied the SDP method to the problem ofredesigning the core of the Bacillus caldolyticus cold-shock protein (Mueller et al. 2000) (PDB code: 1c9o).Core residues were defined as having less than 1% oftheir surface area exposed to solvent, and determinedby the program Surfv (Nicholls et al. 1991). The fol-lowing eight residues were found: Val6, Gly16, Ile18,Val28, Leu41, Val47, Phe49, and Val63. The hydropho-bic core positions (i.e., all positions listed above exceptposition 16 with Gly) were varied, and allowed toassume any rotamer of the hydrophobic amino acidsAla, Val, Ile, Leu, Met, and Phe that occurred inthe backbone-dependent rotamer library (Dunbrack
Leu41
Val6
Val47
Val28
Ile18
Phe49
Val63
Leu41
Ile6
Ile47
Ile28
Ile18
Phe49
Val63
Side Chains in Nature(b) Positioning of the(a) Full Protein (c) Solution Returned by the SDP
Rounding Schemes (the Optimal)
Figure 2 Cold-Shock Protein (1c9o)
and Karplus 1993). This yields 55 rotamers per posi-tion. The protein is shown in Figure 2a, with vari-able atoms shown as black spheres and the axis ofthe #-barrel vertical. The native positions of the sidechains are shown in Figure 2b, where the proteinis rotated so that we are looking down the axis ofthe barrel.The resulting problem had 385 nodes, seven posi-
tions, and 63,313 nonzero cost-matrix entries. Simplepairwise DEE of Goldstein (1994), a polynomial-timerule for throwing out rotamers that cannot possiblybe in the optimal solution, was applied to the prob-lem until no more nodes could be eliminated. Thisreduced the problem to 137 nodes, seven positions,and 7,865 nonzero cost-matrix entries.The LP solution was rounded 1,000 times using
the simple LP-rounding scheme. The SDP solutionwas rounded 1,000 times with both the projection andPerron-Frobenius-rounding schemes. The minimum-energy solution found is a good measure of howwell one would do in practice, but this minimumenergy may be influenced by the moderate search-space size. The average energy of a rounded solutionis a better indicator of the distribution obtained fromrounding the relaxations. The best value over 1,000roundings and the empirical average objective valuein the limit are:
Method Best Average
LP −217$2880 4058$6651Projection −238$4218 −102$2822Perron-Frobenius −238$4218 −209$3617
The optimum solution (determined by the integer-programming option of CPLEX) is −238$4218. BothSDP rounding schemes find the optimum solution;this appears to correspond to a well-packed and plau-sible structure, as shown in Figure 2c. The aver-
Design of algorithms
General techniques:
I Greedy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . (Chapter 4)
I Divide & conquer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . (Chapter 5)
I Dynamic programming . . . . . . . . . . . . . . . . . . . . . . . . . (Chapter 6)
I Network flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . (Chapter 7)
I Linear and integer programming . . . . . . . . . (Sections 11.6-11.7)
Not all algorithms fit into these categories, but a very largefraction do.
Analysis of algorithms
I Prove correctness(the algorithm always returns the right answer)
I Discuss how to implement(what data structures do we need to implement thealgorithm?).
I Prove worst-case running time(no matter the input, it will never run slower that we expect).
I Prove no algorithm can do better(theory of computational complexity).
Tentative Schedule
1. Introduction, Minimum Spanning Tree case study, andAsymptotic analysis
2. Elementary algorithms: divide & conquer and graphalgorithms
I Closest pair of pointsI Matrix multiplicationI Fast Fourier TransformI Graph search: Breadth first, depth first, topological sortingI Shortest path algorithmsI A* search
3. Advanced data structures (e.g. splay trees, suffix trees)
4. Advanced algorithmic design techniquesI Dynamic programmingI Network flowI Linear and integer programmingI NP-completenessI Randomized algorithms (e.g. hashing, mincut)
Homeworks
I Near-weekly homeworks
I 10% of your grade
I Encouraged to discuss homeworks with other students in class
I MUST WRITE UP HOMEWORKS ON YOUR OWN
I You must list, at the top of your homework, those people withwhom you discussed the problems & any sources you used
I Homework answers must be typeset and submitted online(instructions will be on website)
I A few homeworks will consist of programming in Python
What does “on your own” mean?
You cannot, for example:
I look at another person’s homework
I have them look at yours to see if it is correct
I take notes from a discussion and edit them into yourhomework
I sit in a group and continue discussing the homework whilewriting it up
Intent: you can gather around a whiteboard with your fellowstudents and discuss how to solve the problems. Then you must allwalk away and write the answers up separately.
Exams
Two non-cumulative midterm exams, each 25% of grade:
I Friday, February 14th, 2014
I Friday, March 28th, 2014
A cumulative final exam:
I According to the official university exam schedule.
Why Python?
Pros:
I Expressive, math-like syntax
I Support for modern programming paradigms(object orientation, some functional programming)
I Scripting language avoids compilation
I Extensive on-line help and documentation
I Extensive libraries(graphs, matlab functions, numerical methods)
I Widely used in bioinformatics & other disciplines
Cons:
I Can be slower than other languages (especially loops)
I Less memory efficient than other languages
Homework 0: Survey
Complete the survey at
http://www.cs.cmu.edu/~ckingsf/class/02713-s14/survey.html
Due by 11:59pm on Tuesday, Jan 14.