02-713 Introduction

Slides by Carl Kingsford

Jan. 13, 2014

KT Chapter 1


Objective of this Course

To study general computational problems and their algorithms, with a focus on the principles used to design those algorithms.

After passing this class, you should be able to:

1. Design algorithms using several common techniques

2. Prove a worst-case running time for many algorithms

3. Prove a problem is probably hard (NP-complete)


Example Problems


Example I: Low-cost network design


Example II: Finding closest pair of points

Given a set of points {p_1, ..., p_n}, find the pair of points {p_i, p_j} that are closest together.
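As a baseline for this problem, the definition can be turned directly into a brute-force search over all pairs. The sketch below is only that baseline, taking O(n^2) time, and is not the divide & conquer algorithm the course schedule lists for this problem. The function name `closest_pair` and the representation of points as coordinate tuples are assumptions made for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+


def closest_pair(points):
    """Return the pair of points with the smallest Euclidean distance.

    Brute force: examines all n*(n-1)/2 pairs, so it runs in O(n^2) time.
    `points` is a sequence of coordinate tuples, e.g. [(0, 0), (1, 1)].
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    return min(combinations(points, 2), key=lambda pair: dist(pair[0], pair[1]))


if __name__ == "__main__":
    sample = [(0, 0), (5, 5), (1, 1), (9, 0)]
    print(closest_pair(sample))
```

Checking every pair is simple to reason about and useful for testing a faster algorithm against.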


Example III: RNA folding

GAUGGCAAAUGCUAAGGCCU... → [Figure: folded secondary structure of the RNA sequence, showing paired bases]


Example IV: Scheduling k planes

[Figure: Delta Air Lines / Delta Connection route map of North America with time zones, effective January 2013. Source: Delta.com]

LGA → YVR, 9am - 7pm
PIT → KOA, 6am - 8pm
DFW → MSY, 4pm - 6pm


Example V: Side-chain positioning

Chazelle et al.: A Semidefinite Programming Approach to Side Chain Positioning with New Rounding Strategies. INFORMS Journal on Computing 16(4), pp. 380–392, © 2004 INFORMS (p. 387).

consider uniform random graphs and then consider a family of randomly generated graphs that better model the interaction graphs observed in proteins.

The semidefinite programs were solved using version 6.0 of the SDPA (Fujisawa et al. 1997) package, an implementation of an infeasible primal-dual interior-point method. The linear programs were solved using the dual simplex method with AMPL (Fourer et al. 2002) and CPLEX 7.1 (ILOG CPLEX 2000). The SDP solutions were rounded using the projection and Perron-Frobenius methods described above.

We compare our SDP with the LP obtained by relaxing the integrality constraints from (4). The LP solutions were rounded by choosing node u with probability $x_{uu}$. For problems of this size, optimal integral solutions (denoted by OPT) can be found using formulation (4) and the integer-programming option of CPLEX. This allows us to compute the relative gap of each rounded solution, as computed by $|(x - \mathrm{OPT})/\mathrm{OPT}|$, where x is the value of the solution. For both the protein-design problems and the simulated data, the SDP rounding schemes perform well, with significantly better average relative gaps.
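The two quantities described in this passage, the randomized LP rounding and the relative gap, can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code. The names `round_lp` and `relative_gap`, the dictionary layout, and the choice to draw exactly one node per position with probability proportional to its LP value are all assumptions; the excerpt only says that node u is chosen with probability x_uu.

```python
import random


def relative_gap(value, opt):
    """Relative gap |(value - OPT) / OPT| of a rounded solution."""
    return abs((value - opt) / opt)


def round_lp(x_diag, positions, rng=random):
    """Pick one node (rotamer) per position from fractional LP values.

    x_diag:    dict mapping node -> fractional LP value x_uu
    positions: dict mapping position -> list of nodes at that position
    rng:       the `random` module or a random.Random instance

    Assumption: within each position, node u is drawn with probability
    proportional to x_diag[u].
    """
    choice = {}
    for pos, nodes in positions.items():
        weights = [x_diag[u] for u in nodes]
        choice[pos] = rng.choices(nodes, weights=weights, k=1)[0]
    return choice


if __name__ == "__main__":
    positions = {"p1": ["a", "b"], "p2": ["c", "d"]}
    x_diag = {"a": 0.7, "b": 0.3, "c": 0.5, "d": 0.5}
    print(round_lp(x_diag, positions, random.Random(0)))
    print(relative_gap(-217.2880, -238.4218))
```

Repeating `round_lp` many times and keeping the lowest-energy assignment mirrors the repeated rounding described in the excerpt.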

4.1. Cold-Shock Protein

We applied the SDP method to the problem of redesigning the core of the Bacillus caldolyticus cold-shock protein (Mueller et al. 2000) (PDB code: 1c9o). Core residues were defined as having less than 1% of their surface area exposed to solvent, and determined by the program Surfv (Nicholls et al. 1991). The following eight residues were found: Val6, Gly16, Ile18, Val28, Leu41, Val47, Phe49, and Val63. The hydrophobic core positions (i.e., all positions listed above except position 16 with Gly) were varied, and allowed to assume any rotamer of the hydrophobic amino acids Ala, Val, Ile, Leu, Met, and Phe that occurred in the backbone-dependent rotamer library (Dunbrack and Karplus 1993). This yields 55 rotamers per position. The protein is shown in Figure 2a, with variable atoms shown as black spheres and the axis of the β-barrel vertical. The native positions of the side chains are shown in Figure 2b, where the protein is rotated so that we are looking down the axis of the barrel.

[Figure 2: Cold-Shock Protein (1c9o). (a) Full Protein. (b) Positioning of the Side Chains in Nature (labels: Leu41, Val6, Val47, Val28, Ile18, Phe49, Val63). (c) Solution Returned by the SDP Rounding Schemes (the Optimal) (labels: Leu41, Ile6, Ile47, Ile28, Ile18, Phe49, Val63).]

The resulting problem had 385 nodes, seven positions, and 63,313 nonzero cost-matrix entries. Simple pairwise DEE of Goldstein (1994), a polynomial-time rule for throwing out rotamers that cannot possibly be in the optimal solution, was applied to the problem until no more nodes could be eliminated. This reduced the problem to 137 nodes, seven positions, and 7,865 nonzero cost-matrix entries.

The LP solution was rounded 1,000 times using the simple LP-rounding scheme. The SDP solution was rounded 1,000 times with both the projection and Perron-Frobenius-rounding schemes. The minimum-energy solution found is a good measure of how well one would do in practice, but this minimum energy may be influenced by the moderate search-space size. The average energy of a rounded solution is a better indicator of the distribution obtained from rounding the relaxations. The best value over 1,000 roundings and the empirical average objective value in the limit are:

Method              Best         Average
LP                  −217.2880    4058.6651
Projection          −238.4218    −102.2822
Perron-Frobenius    −238.4218    −209.3617

The optimum solution (determined by the integer-programming option of CPLEX) is −238.4218. Both SDP rounding schemes find the optimum solution; this appears to correspond to a well-packed and plausible structure, as shown in Figure 2c. The aver-
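The excerpt mentions dead-end elimination (DEE) as a polynomial-time rule for discarding rotamers that cannot be in the optimal solution, applied until nothing more can be removed. A rough Python sketch of one common statement of the Goldstein criterion follows. Rotamer r at position i is dropped if some other rotamer t at i satisfies E(i_r) − E(i_t) + Σ_{j≠i} min_s [E(i_r, j_s) − E(i_t, j_s)] > 0. Whether this is exactly the variant used in the paper is an assumption, and the function name and data layout are hypothetical.

```python
def goldstein_dee(rotamers, self_energy, pair_energy):
    """Repeatedly remove rotamers that another rotamer at the same position dominates.

    rotamers:    dict mapping position -> list of rotamer ids
    self_energy: dict mapping (position, rotamer) -> float
    pair_energy: function (pos_i, r, pos_j, s) -> float, the pairwise energy
    Returns a dict mapping position -> list of surviving rotamer ids.
    """
    alive = {p: list(rs) for p, rs in rotamers.items()}
    changed = True
    while changed:  # keep sweeping until a full pass removes nothing
        changed = False
        for i in alive:
            for r in list(alive[i]):  # iterate over a copy; alive[i] may shrink
                for t in alive[i]:
                    if t == r:
                        continue
                    # Energy advantage of t over r in the best case for r.
                    gap = self_energy[i, r] - self_energy[i, t]
                    for j in alive:
                        if j != i:
                            gap += min(
                                pair_energy(i, r, j, s) - pair_energy(i, t, j, s)
                                for s in alive[j]
                            )
                    if gap > 0:  # r can never beat t, so r is a dead end
                        alive[i].remove(r)
                        changed = True
                        break
    return alive
```

A rotamer is only removed when a competitor at the same position survives, so every position keeps at least one rotamer.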



Design of algorithms

General techniques:

• Greedy (Chapter 4)

• Divide & conquer (Chapter 5)

• Dynamic programming (Chapter 6)

• Network flow (Chapter 7)

• Linear and integer programming (Sections 11.6-11.7)

Not all algorithms fit into these categories, but a very large fraction do.


Analysis of algorithms

• Prove correctness (the algorithm always returns the right answer).

• Discuss how to implement (what data structures do we need to implement the algorithm?).

• Prove worst-case running time (no matter the input, it will never run slower than we expect).

• Prove no algorithm can do better (theory of computational complexity).


Tentative Schedule

1. Introduction, Minimum Spanning Tree case study, and Asymptotic analysis

2. Elementary algorithms: divide & conquer and graph algorithms

   • Closest pair of points
   • Matrix multiplication
   • Fast Fourier Transform
   • Graph search: Breadth first, depth first, topological sorting
   • Shortest path algorithms
   • A* search

3. Advanced data structures (e.g. splay trees, suffix trees)

4. Advanced algorithmic design techniques

   • Dynamic programming
   • Network flow
   • Linear and integer programming
   • NP-completeness
   • Randomized algorithms (e.g. hashing, mincut)


Homeworks

• Near-weekly homeworks

• 10% of your grade

• Encouraged to discuss homeworks with other students in class

• MUST WRITE UP HOMEWORKS ON YOUR OWN

• You must list, at the top of your homework, those people with whom you discussed the problems & any sources you used

• Homework answers must be typeset and submitted online (instructions will be on website)

• A few homeworks will consist of programming in Python


What does “on your own” mean?

You cannot, for example:

• look at another person’s homework

• have them look at yours to see if it is correct

• take notes from a discussion and edit them into your homework

• sit in a group and continue discussing the homework while writing it up

Intent: you can gather around a whiteboard with your fellow students and discuss how to solve the problems. Then you must all walk away and write the answers up separately.


Exams

Two non-cumulative midterm exams, each 25% of grade:

• Friday, February 14th, 2014

• Friday, March 28th, 2014

A cumulative final exam:

• According to the official university exam schedule.


Why Python?

http://xkcd.com/353/


Why Python?

Pros:

• Expressive, math-like syntax

• Support for modern programming paradigms (object orientation, some functional programming)

• Scripting language avoids compilation

• Extensive on-line help and documentation

• Extensive libraries (graphs, matlab functions, numerical methods)

• Widely used in bioinformatics & other disciplines

Cons:

• Can be slower than other languages (especially loops)

• Less memory efficient than other languages
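To make the "math-like syntax" point concrete, here is a small illustrative snippet (not from the slides) showing how set-builder notation and a summation translate almost symbol for symbol into Python. The variable names are arbitrary.

```python
# Set-builder notation: { (i, j) : 0 <= i < j < 4 }
pairs = {(i, j) for i in range(4) for j in range(4) if i < j}

# Summation: the sum of k^2 for k = 1, ..., n
n = 10
sum_of_squares = sum(k * k for k in range(1, n + 1))

if __name__ == "__main__":
    print(sorted(pairs))
    print(sum_of_squares)
```

Built-ins such as `sum` run their loop inside the interpreter, which is one common way to soften the loop-speed drawback listed under the cons.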


Homework 0: Survey

Complete the survey at

http://www.cs.cmu.edu/~ckingsf/class/02713-s14/survey.html

Due by 11:59pm on Tuesday, Jan 14.