
Page 1

Disks: Structure and Scheduling

COMS W4118, Prof. Kaustubh R. Joshi, [email protected]

http://www.cs.columbia.edu/~krj/os

4/8/13 · COMS W4118, Spring 2013, Columbia University. Instructor: Dr. Kaustubh Joshi, AT&T Labs.

References: Operating Systems Concepts (9e), Linux Kernel Development, previous W4118s. Copyright notice: care has been taken to use only those web images deemed by the instructor to be in the public domain. If you see a copyrighted image on any slide and are the copyright owner, please contact the instructor. It will be removed.

Page 2

Outline  

• Basic: enough to understand FS (now)
  – Disk characteristics
  – Disk scheduling

• Advanced (after studying file systems)
  – Disk Reliability
  – RAID
  – SSDs/Flash


Page 3

Disk Structure

• Range from 30 GB to 3 TB per drive
• Aluminum platters with magnetic coating
• Commonly, 2-5 platters per drive
• Common platter sizes: 3.5", 2.5", and 1.8"
• Magnetic heads


Page 4

The First Commercial Disk Drive

1956: the IBM RAMAC computer included the IBM Model 350 disk storage system
• 5 M (7-bit) characters
• 50 x 24" platters
• Access time < 1 second


Page 5

Disk Interface

• From the FS perspective, the disk is addressed as a one-dimensional array of logical sectors
• The disk controller maps each logical sector to a physical sector identified by track #, surface #, and sector #
• Note: old drives allowed direct C/H/S (cylinder/head/sector) addressing by the OS. Modern drives export LBA (logical block address) and do the mapping to C/H/S internally.

[Figure: disk platter diagram with sectors numbered 0-23]

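To make the "one-dimensional array of logical sectors" view concrete, here is a minimal Python sketch that reads a sector by its LBA from a raw block device. The device path and the 512-byte logical sector size are illustrative assumptions (many current drives expose 4 KiB sectors), not something stated on the slide.

```python
import os

SECTOR_SIZE = 512        # assumed logical sector size; some drives expose 4 KiB
DEVICE = "/dev/sdb"      # hypothetical raw block device; opening it needs root

def read_logical_sector(fd: int, lba: int) -> bytes:
    """Read one logical sector by LBA: the file system just computes a byte
    offset into a flat array of sectors and never sees tracks or cylinders."""
    return os.pread(fd, SECTOR_SIZE, lba * SECTOR_SIZE)

fd = os.open(DEVICE, os.O_RDONLY)
try:
    first = read_logical_sector(fd, 0)    # LBA 0 is simply the first element
    print(f"read {len(first)} bytes from LBA 0")
finally:
    os.close(fd)
```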

Page 6

Disk Latencies

• Rotational delay: rotate the disk to get to the right sector
• Seek time: move the disk arm to get to the right track
• Transfer time: get the bits off the disk


Page 7

Seek Time

• Must move the arm to the right track
• Can take a while (e.g., 5-10 ms)
  – Acceleration, coasting, settling (can be significant, e.g., 2 ms)

[Figure: disk platter diagram with sectors numbered 0-23]


Page 8

Transfer Time

• Transfer bits out of the disk
• Actually pretty fast (e.g., 125 MB/s)

[Figure: disk platter diagram with sectors numbered 0-23]


Page 9

I/O Time (T) and Rate (R)

• T = rotational delay + seek time + transfer time
• R = size of transfer / T

• Workload 1: large sequential accesses?
• Workload 2: small random accesses?


Page 10

Example: I/O Time and Rate

                          Barracuda    Cheetah 15K.5
Capacity                  1 TB         300 GB
Rotational speed          7200 RPM     15,000 RPM
Rotational latency (ms)   4.2          2.0
Avg seek (ms)             9            4
Max transfer              105 MB/s     125 MB/s
Platters                  4            4
Connects via              SATA         SCSI

• Random 4 KB read
  – Barracuda: T = 13.2 ms, R = 0.31 MB/s
  – Cheetah: T = 6 ms, R = 0.66 MB/s

• Sequential 100 MB read
  – Barracuda: T = 950 ms, R = 105 MB/s
  – Cheetah: T = 800 ms, R = 125 MB/s

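The numbers above follow directly from T = rotational delay + seek time + transfer time and R = size / T. A small sketch that recomputes them from the table; the outputs differ from the slide only by rounding, and the slide's sequential figures ignore the one-time seek and rotation.

```python
# Recompute the slide's example from T = rotation + seek + transfer, R = size / T.
# Drive parameters are copied from the table above.
DRIVES = {
    "Barracuda":     dict(rot_ms=4.2, seek_ms=9.0, xfer_mb_s=105.0),
    "Cheetah 15K.5": dict(rot_ms=2.0, seek_ms=4.0, xfer_mb_s=125.0),
}

def io_time_ms(d: dict, size_mb: float) -> float:
    """T = rotational delay + seek time + transfer time, in milliseconds."""
    return d["rot_ms"] + d["seek_ms"] + size_mb / d["xfer_mb_s"] * 1000.0

def io_rate_mb_s(d: dict, size_mb: float) -> float:
    """R = size of transfer / T, in MB/s."""
    return size_mb / (io_time_ms(d, size_mb) / 1000.0)

for name, d in DRIVES.items():
    print(f"{name}: random 4 KB  T = {io_time_ms(d, 0.004):6.1f} ms, "
          f"R = {io_rate_mb_s(d, 0.004):5.2f} MB/s")
    print(f"{name}: seq 100 MB   T = {io_time_ms(d, 100.0):6.0f} ms, "
          f"R = {io_rate_mb_s(d, 100.0):5.1f} MB/s")
```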

Page 11

Design Tip: Use Disks Sequentially!

• Disk performance differs by a factor of 200 to 300 between random and sequential access
• When possible, access disks sequentially


Page 12

Mapping of Logical Sectors to Physical

• Logical sector 0: the first sector of the first (outermost) track of the first surface
• Logical sector addresses increment within a track, then across tracks within a cylinder, then across cylinders, from outermost to innermost

• Track skew

[Figure: disk platter diagram with sectors numbered 0-23]

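A minimal sketch of this default ordering, assuming an idealized geometry with a fixed number of sectors per track and surfaces per cylinder. Real drives vary sectors per track by zone, apply track skew, and remap bad sectors, so the firmware's actual mapping is more involved; the constants below are made up for illustration.

```python
# Idealized logical-to-physical mapping following the ordering above:
# fill a track, then the other surfaces of the same cylinder, then move inward.
SECTORS_PER_TRACK = 12       # assumed constant (real drives have more on outer tracks)
SURFACES_PER_CYLINDER = 4    # e.g., 2 platters x 2 surfaces

def logical_to_physical(lsn: int) -> tuple[int, int, int]:
    """Return (cylinder, surface, sector) for a logical sector number."""
    sector   = lsn % SECTORS_PER_TRACK               # offset within the track
    track    = lsn // SECTORS_PER_TRACK              # which track overall
    surface  = track % SURFACES_PER_CYLINDER         # surfaces of a cylinder fill first...
    cylinder = track // SURFACES_PER_CYLINDER        # ...then the arm moves inward
    return cylinder, surface, sector

assert logical_to_physical(0)  == (0, 0, 0)   # first sector, outermost track, first surface
assert logical_to_physical(13) == (0, 1, 1)   # second surface of cylinder 0
assert logical_to_physical(48) == (1, 0, 0)   # first sector of cylinder 1
```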

Page 13

Parallel Reading from Heads

• All heads point to the same place on their tracks
  – Why not read from all heads in parallel?
• Would need perfectly aligned heads
  – Hard to do in practice
  – Heads are not perfectly aligned because of:
    • Mechanical vibrations
    • Thermal gradients
    • Mechanical imperfections
  – High density makes the problem worse
  – Would also need high-throughput read/write circuitry
• Consequence: most drives have a single active head at a time


Page 14

Pros and Cons of the Default Mapping

• Pros
  – Simple to program
  – The default mapping reduces seek time for sequential access
• Cons
  – The FS can't see the mapping precisely
  – Reverse-engineering the mapping in the OS is difficult:
    • # of sectors per track changes
    • The disk silently remaps bad sectors


Page 15

Disk Cache

• Internal memory (8 MB-32 MB) used as a cache

• Read-ahead: the "track buffer"
  – Read the contents of the entire track into memory during the rotational delay

• Write caching with volatile memory
  – Write back (a.k.a. immediate reporting): claim the data is written to disk when it is not
    • Faster, but data could be lost on power failure
  – Write through: acknowledge only after the data is written to the platter

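From the OS side, the usual defense against volatile write caching is an explicit flush. A minimal sketch, assuming a POSIX system: fsync forces the kernel to push the data toward the drive, while whether the drive's own volatile cache is also flushed depends on the drive and on kernel/mount settings.

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the kernel to push it to stable storage before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)   # block until the kernel reports the data as durable
    finally:
        os.close(fd)

durable_write("/tmp/commit.log", b"commit record\n")   # hypothetical example file
```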

Page 16

Disk Scheduling

• Goal: minimize positioning time
  – Performed by both the OS and the disk itself
  – Why?
• The OS can control:
  – The sequence of workload requests
• The disk knows:
  – Geometry, accurate positioning times


Page 17

FCFS Disk Scheduling

Total head movement of 640 cylinders

Schedule requests in the order received (FCFS)
• Advantage: fair
• Disadvantage: high seek and rotation cost

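The 640-cylinder total matches the classic textbook example: request queue 98, 183, 37, 122, 14, 124, 65, 67 with the head starting at cylinder 53. The queue itself is not reproduced in this transcript, so treat it as an assumption; with it, a minimal FCFS sketch gives the same figure.

```python
# FCFS: service requests strictly in arrival order and sum the seek distances.
def fcfs_head_movement(start: int, requests: list[int]) -> int:
    total, pos = 0, start
    for cyl in requests:          # no reordering at all
        total += abs(cyl - pos)
        pos = cyl
    return total

queue = [98, 183, 37, 122, 14, 124, 65, 67]   # assumed textbook queue, head at 53
print(fcfs_head_movement(53, queue))          # -> 640 cylinders
```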

Page 18

SSTF: Shortest Seek Time First

• Shortest seek time first (SSTF):
  – A form of Shortest Job First (SJF) scheduling
  – Handle the nearest cylinder next
  – Advantage: reduces arm movement (seek time)
  – Disadvantage: unfair, can starve some requests

Total head movement of 236 cylinders

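With the same assumed queue and starting position as the FCFS sketch above, a greedy nearest-cylinder-next sketch reproduces the 236-cylinder total.

```python
# SSTF: repeatedly service the pending request closest to the current head position.
def sstf_head_movement(start: int, requests: list[int]) -> tuple[int, list[int]]:
    pending, pos, total, order = list(requests), start, 0, []
    while pending:
        nxt = min(pending, key=lambda c: abs(c - pos))   # greedy choice: nearest cylinder
        total += abs(nxt - pos)
        pos = nxt
        pending.remove(nxt)
        order.append(nxt)
    return total, order

total, order = sstf_head_movement(53, [98, 183, 37, 122, 14, 124, 65, 67])
print(total, order)   # -> 236 [65, 67, 37, 14, 98, 122, 124, 183]
```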

Page 19

Elevator (a.k.a. SCAN or C-SCAN)

• The disk arm sweeps across the disk
• If a request arrives for a block already serviced in this sweep, queue it for the next sweep

[Figure: disk platter diagram with sectors numbered 0-23]


Page 20

SCAN (Elevator) Disk Scheduling

Make up and down passes across all cylinders
• Pros: efficient, simple
• Cons: unfair; the oldest requests (furthest away) also wait the longest

Total head movement of 208 cylinders

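A sketch of the sweep on the same assumed queue. Reversing at the lowest pending request (the LOOK-style behavior most scheduling figures actually draw) yields the 208 cylinders quoted above; sweeping all the way down to cylinder 0 before reversing would give 236 instead.

```python
# SCAN-style sweep: service everything at or below the head while moving down,
# then reverse and service the remaining requests on the way up. The arm reverses
# at the lowest pending request here (LOOK), which matches the 208-cylinder figure.
def scan_head_movement(start: int, requests: list[int]) -> int:
    down = sorted((c for c in requests if c <= start), reverse=True)
    up   = sorted(c for c in requests if c > start)
    total, pos = 0, start
    for cyl in down + up:        # downward pass, then upward pass
        total += abs(cyl - pos)
        pos = cyl
    return total

print(scan_head_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))   # -> 208
```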

Page 21

C-SCAN

• Provides a more uniform wait time than SCAN
• The head moves from one end of the disk to the other, servicing requests as it goes
  – When it reaches the other end, it immediately returns to the beginning of the disk without servicing any requests on the return trip
• Treats the cylinders as a circular list that wraps around from the last cylinder to the first one
• Total number of cylinders?


Page 22

C-SCAN (Elevator) Scheduling

• The head reads in one direction only
• Wrap around without reading when the end is reached (like a circular linked list)
• Provides a more uniform wait time than SCAN


Page 23

C-LOOK Scheduling

• In practice, there is no need to scan to the ends of the disk
• Wrap around when there are no more requests in the current direction
• Wrap to the earliest outstanding request

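A sketch of the circular variant on the same assumed queue: service in one direction only (upward here), then wrap back to the lowest outstanding request and continue. Whether the wrap-around jump is counted in the movement total is a bookkeeping convention; the sketch counts it.

```python
# C-LOOK: one service direction only; after the highest outstanding request,
# wrap to the lowest outstanding request and keep servicing upward.
def clook_order(start: int, requests: list[int]) -> list[int]:
    up      = sorted(c for c in requests if c >= start)   # serviced on the way up
    wrapped = sorted(c for c in requests if c < start)    # serviced after the wrap
    return up + wrapped

def head_movement(start: int, order: list[int]) -> int:
    total, pos = 0, start
    for cyl in order:
        total += abs(cyl - pos)
        pos = cyl
    return total

order = clook_order(53, [98, 183, 37, 122, 14, 124, 65, 67])
print(order)                      # [65, 67, 98, 122, 124, 183, 14, 37]
print(head_movement(53, order))   # 130 up + 169 wrap + 23 = 322, counting the jump
```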

Page 24

Modern Disk Scheduling Issues

• Elevator (or SSTF) ignores rotation!
• Shortest positioning time first (SPTF): account for both seek and rotation
• The OS and the disk work together to implement it

[Figure: disk platter diagram with sectors numbered 0-23]


Page 25

I/O Scheduling in Practice

• Simple: the Linus Elevator scheduler
  – Default through kernel 2.4
  – A variant of the C-LOOK algorithm
  – Merge a new request with an existing one where possible
  – Otherwise, insert it in sorted order between existing requests
  – If no suitable location is found, insert at the queue tail

• In practice the situation is more complicated due to:
  – Interactions with the filesystem (data layout)
  – Interactions with caches (write-back vs. write-through)
  – Write and read requests having different priority
  – Delay-sensitive applications such as multimedia
  – Additional algorithms: Deadline, Completely Fair Queuing (CFQ)

• Will look at these in a bit more depth later

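A minimal sketch of the insertion policy summarized above: try to merge with an adjacent request, otherwise insert in sector-sorted order, otherwise append at the tail. This is an illustration of the policy only, not the kernel code; the request representation is made up, and the real elevator also includes an age check so that old requests are not starved by endless insertions ahead of them.

```python
from dataclasses import dataclass

@dataclass
class Request:
    sector: int      # starting sector (LBA)
    nsectors: int    # length in sectors

def enqueue(queue: list[Request], new: Request) -> None:
    # 1) Merge with an existing request if the sector ranges are adjacent.
    for r in queue:
        if r.sector + r.nsectors == new.sector:          # new extends r at its end
            r.nsectors += new.nsectors
            return
        if new.sector + new.nsectors == r.sector:        # new sits just before r
            r.sector, r.nsectors = new.sector, r.nsectors + new.nsectors
            return
    # 2) Otherwise insert in sector-sorted order between existing requests.
    for i, r in enumerate(queue):
        if new.sector < r.sector:
            queue.insert(i, new)
            return
    # 3) No suitable location found: insert at the queue tail.
    queue.append(new)

q: list[Request] = []
for req in (Request(100, 8), Request(200, 8), Request(108, 8), Request(50, 8)):
    enqueue(q, req)
print([(r.sector, r.nsectors) for r in q])   # [(50, 8), (100, 16), (200, 8)]
```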

Page 26

Disk Technology Trends

• Data → more dense
  – More bits per square inch
  – Disk head closer to the surface
  – Smaller disks with the same capacity
• Disk geometry → smaller
  – Spin faster → increased bandwidth, reduced rotational delay
  – Faster seeks
  – Lighter weight
• Disk price → cheaper
• Density is improving faster than speed (mechanical limitations)


Page 27

New Mass Storage Technologies

• New memory-based mass storage technologies avoid seek time and rotational delay
  – No moving parts: more reliable, shock resistant
  – NAND flash: ubiquitous in mobile devices
  – Battery-backed DRAM (NVRAM)

• Disadvantages
  – Price: more expensive than a disk of the same capacity
  – Reliability: more likely to lose data
  – Other significant quirk: can't rewrite in place easily

• Open research question: how to effectively use flash in commercial storage systems

• Will look at these in more depth later
