lecture 11: external sorting - github pages · external merge sort algorithm (2-way sort) disk main...
TRANSCRIPT
Lecture11:ExternalSorting
Lecture11
Announcements
1. MidtermReview:ThisFriday!
2. ProjectPart#2isout.ImplementCLOCK!
3. MidtermMaterial:EverythinguptoBuffermanagement.1. Today’slectureisnofairgame.2. Donotforgettogoovertheactivitiesaswell!
4. OurusualWednesdayupdate….
2
Lecture11
Lecture11:ExternalSorting
Lecture11
Whatyouwilllearnaboutinthissection
1. ExternalMerge(ofsortedfiles)
2. ExternalMerge- Sort
4
Lecture11
1.ExternalMerge
5
Lecture11
Challenge:MergingBigFileswithSmallMemory
Howdoweefficientlymergetwosortedfileswhenbotharemuchlargerthanourmainmemorybuffer?
Lecture11>Section1>ExternalMerge
ExternalMergeAlgorithm
• Input:2sorted listsoflengthMandN
• Output: 1sortedlistoflengthM+N
• Required:Atleast3BufferPages
• IOs:2(M+N)
Lecture11>Section1>ExternalMerge
Key(Simple)IdeaTofindanelementthatisnolargerthanallelementsintwolists,one
onlyneedstocompareminimumelementsfromeachlist.
Lecture11>Section1>ExternalMerge
If:𝐴" ≤ 𝐴$ ≤ ⋯ ≤ 𝐴&𝐵" ≤ 𝐵$ ≤ ⋯ ≤ 𝐵(
Then:𝑀𝑖𝑛(𝐴", 𝐵") ≤ 𝐴/𝑀𝑖𝑛(𝐴", 𝐵") ≤ 𝐵0
fori=1….Nandj=1….M
ExternalMergeAlgorithm
Lecture11>Section1>ExternalMerge
7,11 20,31
23,24 25,30
Input:Twosortedfiles
Output:Onemergedsortedfile
Disk
MainMemory
Buffer1,5
2,22
F1
F2
Lecture11>Section1>ExternalMerge
7,11 20,31
23,24 25,30
Disk
MainMemory
Buffer
1,5 2,22Input:Twosortedfiles
Output:Onemergedsortedfile
F1
F2
ExternalMergeAlgorithm
Lecture11>Section1>ExternalMerge
7,11 20,31
23,24 25,30
Disk
MainMemory
Buffer
5 22 1,2Input:Twosortedfiles
Output:Onemergedsortedfile
F1
F2
ExternalMergeAlgorithm
Lecture11>Section1>ExternalMerge
7,11 20,31
23,24 25,30
Disk
MainMemory
Buffer
5 22
1,2
Input:Twosortedfiles
Output:Onemergedsortedfile
F1
F2
ExternalMergeAlgorithm
Lecture11>Section1>ExternalMerge
20,31
23,24 25,30
Disk
MainMemory
Buffer
522
1,2
Thisisallthealgorithm“sees”…Whichfiletoloadapagefromnext?
Input:Twosortedfiles
Output:Onemergedsortedfile
F1
F2
7,11
ExternalMergeAlgorithm
Lecture11>Section1>ExternalMerge
20,31
23,24 25,30
Disk
MainMemory
Buffer
522
1,2
WeknowthatF2 onlycontainsvalues≥ 22…soweshouldloadfromF1!
Input:Twosortedfiles
Output:Onemergedsortedfile
F1
F2
7,11
ExternalMergeAlgorithm
Lecture11>Section1>ExternalMerge
20,31
23,24 25,30
Disk
MainMemory
Buffer
522
1,2
Input:Twosortedfiles
Output:Onemergedsortedfile
F1
F27,11
ExternalMergeAlgorithm
Lecture11>Section1>ExternalMerge
20,31
23,24 25,30
Disk
MainMemory
Buffer
5,722
1,2
Input:Twosortedfiles
Output:Onemergedsortedfile
F1
F211
ExternalMergeAlgorithm
Lecture11>Section1>ExternalMerge
20,31
23,24 25,30
Disk
MainMemory
Buffer
5,7
22
1,2
Input:Twosortedfiles
Output:Onemergedsortedfile
F1
F211
ExternalMergeAlgorithm
Lecture11>Section1>ExternalMerge
23,24 25,30
Disk
MainMemory
Buffer
5,7
22
1,2
Input:Twosortedfiles
Output:Onemergedsortedfile
F1
F211
20,31
Andsoon…
ExternalMergeAlgorithm
Wecanmerge2listsofarbitrarylengthwithonly 3bufferpages.
IflistsofsizeMandN,thenCost: 2(M+N)IOs
Eachpageisreadonce,writtenonce
WithB+1bufferpages,canmergeBlists.How?
Lecture11>Section1>ExternalMerge
2.ExternalMergeSort
20
Lecture11>Section2
Whatyouwilllearnaboutinthissection
1. Externalmergesort(2-waysort)
2. Externalmergesortonlargerfiles
3. Optimizationsforsorting
21
Lecture11>Section2
ExternalMergeAlgorithm
• Supposewewanttomergetwosorted filesbothmuchlargerthanmainmemory(i.e.thebuffer)
•Wecanusetheexternalmergealgorithm tomergefilesofarbitrarylength in2*(N+M)IOoperationswithonly3bufferpages!
Ourfirstexampleofan“IOaware”algorithm/costmodel
Lecture11>Section2>ExternalMergeSort
WhyareSortAlgorithmsImportant?
• DatarequestedfromDBinsortedorderisextremelycommon• e.g.,findstudentsinincreasing GPA order
•Whynotjustusequicksortinmainmemory??• Whataboutifweneedtosort1TBofdatawith1GBofRAM…
Aclassicproblemincomputerscience!
Lecture11>Section2>ExternalMergeSort
Morereasonstosort…
• Sortingusefulforeliminatingduplicatecopiesinacollectionofrecords(Why?)
• SortingisfirststepinbulkloadingB+treeindex.
• Sort-merge joinalgorithminvolvessorting
Comingup…
Comingup…
Lecture11>Section2>ExternalMergeSort
Dopeoplecare?
Sortbenchmarkbearshisname
http://sortbenchmark.org
Lecture11>Section2>ExternalMergeSort
ExternalMergeSort
Lecture11>Section2>ExternalMergeSort
Sohowdowesortbigfiles?
1. Splitintochunkssmallenoughtosortinmemory(“runs”)
2. Merge pairs(orgroups)ofrunsusingtheexternalmergealgorithm
3. Keepmerging theresultingruns(eachtime=a“pass”)untilleftwithonesortedfile!
Lecture11>Section2>ExternalMergeSort
ExternalMergeSortAlgorithm(2-waysort)
27,24 3,1
Example:• 3Bufferpages• 6-pagefile
Disk MainMemory
Buffer
18,22F
33,12 55,3144,10
1. Splitintochunkssmallenoughtosortinmemory
Lecture11>Section2>ExternalMergeSort
Orangefile=unsorted
ExternalMergeSortAlgorithm(2-waysort)
27,24 3,1
Disk MainMemory
Buffer
18,22
F1
F2
33,12 55,3144,10
1. Splitintochunkssmallenoughtosortinmemory
Example:• 3Bufferpages• 6-pagefile
Lecture11>Section2>ExternalMergeSort
Orangefile=unsorted
ExternalMergeSortAlgorithm(2-waysort)
27,24 3,1
Disk MainMemory
Buffer
18,22
F1
F233,12 55,3144,10
1. Splitintochunkssmallenoughtosortinmemory
Example:• 3Bufferpages• 6-pagefile
Lecture11>Section2>ExternalMergeSort
Orangefile=unsorted
ExternalMergeSortAlgorithm(2-waysort)
27,24 3,1
Disk MainMemory
Buffer
18,22
F1
F231,33 44,5510,12
Example:• 3Bufferpages• 6-pagefile
1. Splitintochunkssmallenoughtosortinmemory
Lecture11>Section2>ExternalMergeSort
Orangefile=unsorted
ExternalMergeSortAlgorithm(2-waysort)
Disk MainMemory
BufferF1
F2
31,33 44,5510,12
AndsimilarlyforF2
27,24 3,118,2218,22 24,271,3
1. Splitintochunkssmallenoughtosortinmemory
Example:• 3Bufferpages• 6-pagefile
Lecture11>Section2>ExternalMergeSort
Eachsortedfileisacalledarun
ExternalMergeSortAlgorithm(2-waysort)
Disk MainMemory
BufferF1
F2
2.Nowjustruntheexternalmerge algorithm&we’redone!
31,33 44,5510,12
18,22 24,271,3
Example:• 3Bufferpages• 6-pagefile
Lecture11>Section2>ExternalMergeSort
CalculatingIOCost
For3bufferpages,6pagefile:
1. Splitintotwo3-pagefiles andsortinmemory1. =1R+1Wforeachfile=2*(3+3)=12IOoperations
2. Merge eachpairofsortedchunksusingtheexternalmergealgorithm1. =2*(3+3)=12IOoperations
3. Totalcost=24IO
Lecture11>Section2>ExternalMergeSort
RunningExternalMergeSortonLargerFiles
Disk
31,33 44,5510,12
18,43 24,2745,38
Assumewestillonlyhave3 bufferpages(Buffernotpictured)
31,33 47,5510,12
18,22 23,2041,3
31,33 39,5542,46
18,23 24,271,3
48,33 44,4010,12
18,22 24,2716,31
Lecture11>Section2>ExternalMergeSort:Largerfiles
RunningExternalMergeSortonLargerFiles
Disk
31,33 44,5510,12
18,43 24,2745,38
31,33 47,5510,12
18,22 23,2041,3
31,33 39,5542,46
18,23 24,271,3
48,33 44,4010,12
18,22 24,2716,31
1.Splitintofilessmallenoughtosortinbuffer…
Assumewestillonlyhave3 bufferpages(Buffernotpictured)
Lecture11>Section2>ExternalMergeSort:Largerfiles
RunningExternalMergeSortonLargerFiles
Disk
31,33 44,5510,12
27,38 43,4518,24
31,33 47,5510,12
20,22 23,413,18
39,42 46,5531,33
18,23 24,271,3
33,40 44,4810,12
22,24 27,3116,18
1.Splitintofilessmallenoughtosortinbuffer…andsort
Assumewestillonlyhave3 bufferpages(Buffernotpictured)
Lecture11>Section2>ExternalMergeSort:Largerfiles
Calleachofthesesortedfilesarun
RunningExternalMergeSortonLargerFiles
Disk
31,33 44,5510,12
27,38 43,4518,24
31,33 47,5510,12
20,22 23,413,18
39,42 46,5531,33
18,23 24,271,3
33,40 44,4810,12
22,24 27,3116,18
2.Nowmergepairsof(sorted)files…theresultingfileswillbesorted!
Disk
18,24 27,3110,12
43,44 45,5533,38
12,18 20,223,10
33,41 47,5523,31
18,23 24,271,3
39,42 46,5531,33
16,18 22,2410,12
33,40 44,4827,31
Assumewestillonlyhave3 bufferpages(Buffernotpictured)
Lecture11>Section2>ExternalMergeSort:Largerfiles
RunningExternalMergeSortonLargerFiles
Disk
31,33 44,5510,12
27,38 43,4518,24
31,33 47,5510,12
20,22 23,413,18
39,42 46,5531,33
18,23 24,271,3
33,40 44,4810,12
22,24 27,3116,18
3.Andrepeat…
Disk
18,24 27,3110,12
43,44 45,5533,38
12,18 20,223,10
33,41 47,5523,31
18,23 24,271,3
39,42 46,5531,33
16,18 22,2410,12
33,40 44,4827,31
Disk
10,12 12,183,10
22,23 24,2718,20
33,33 38,4131,31
45,47 55,5543,44
10,12 16,181,3
23,24 24,2718,22
31,33 33,3927,31
44,46 48,5540,42
Assumewestillonlyhave3 bufferpages(Buffernotpictured)
Lecture11>Section2>ExternalMergeSort:Largerfiles
Calleachofthesestepsapass
RunningExternalMergeSortonLargerFiles
Disk
31,33 44,5510,12
27,38 43,4518,24
31,33 47,5510,12
20,22 23,413,18
39,42 46,5531,33
18,23 24,271,3
33,40 44,4810,12
22,24 27,3116,18
4.Andrepeat!
Disk
18,24 27,3110,12
43,44 45,5533,38
12,18 20,223,10
33,41 47,5523,31
18,23 24,271,3
39,42 46,5531,33
16,18 22,2410,12
33,40 44,4827,31
Disk
10,12 12,183,10
22,23 24,2718,20
33,33 38,4131,31
45,47 55,5543,44
10,12 16,181,3
23,24 24,2718,22
31,33 33,3927,31
44,46 48,5540,42
Disk
3,10 10,101,3
12,16 18,1812,12
20,22 22,2318,18
24,24 27,2723,24
31,31 31,3327,31
33,38 39,4033,33
43,44 44,4541,42
48,55 55,5546,47
Lecture11>Section2>ExternalMergeSort:Largerfiles
Simplified3-pageBufferVersionAssumeforsimplicitythatwesplitanN-pagefileintoNsingle-pagerunsandsortthese;then:
• Firstpass:MergeN/2pairsofrunseach oflength1page
• Secondpass:MergeN/4pairsofrunseachoflength2pages
• Ingeneral,forN pages,wedo 𝒍𝒐𝒈𝟐 𝑵 passes• +1fortheinitialsplit&sort
• Eachpassinvolvesreadingin&writingoutallthepages=2NIO
Unsortedinputfile
Split&sort
Merge
Merge
Sorted!
à 2N*( 𝒍𝒐𝒈𝟐 𝑵 +1)totalIOcost!
Lecture11>Section2>ExternalMergeSort:Largerfiles
UsingB+1bufferpagestoreduce#ofpasses
SupposewehaveB+1bufferpagesnow;wecan:
1. Increaselengthofinitialruns.SortB+1atatime!Atthebeginning,wecansplittheNpagesintorunsoflengthB+1andsorttheseinmemory
Lecture11>Section2>Optimizationsforsorting
2𝑁( log$ 𝑁 + 1)
IOCost:
Startingwithrunsoflength1
2𝑁( log$𝑵
𝑩 + 𝟏 + 1)
StartingwithrunsoflengthB+1
UsingB+1bufferpagestoreduce#ofpasses
SupposewehaveB+1bufferpagesnow;wecan:
2.PerformaB-waymerge.Oneachpass,wecanmergegroupsofBrunsatatime(vs.mergingpairsofruns)!
Lecture11>Section2>Optimizationsforsorting
IOCost:
2𝑁( log$ 𝑁 + 1) 2𝑁( log$𝑵
𝑩 + 𝟏 + 1)
Startingwithrunsoflength1
StartingwithrunsoflengthB+1
2𝑁( log@𝑵
𝑩 + 𝟏 + 1)
PerformingB-waymerges
Repacking
Lecture11>Section2>Optimizationsforsorting
Repackingforevenlongerinitialruns
• WithB+1bufferpages,wecannowstartwithB+1-lengthinitialruns(anduseB-waymerges)toget2𝑁( log@
𝑵𝑩A𝟏
+ 1) IOcost…
• Canwereducethiscostmorebygettingevenlongerinitialruns?
• Userepacking- producelongerinitialrunsby“merging”inbufferaswesortatinitialstage
Lecture11>Section2>Optimizationsforsorting
Repacking Example:3pagebuffer
• Startwithunsortedsingleinputfile,andload2pages
57,24 3,98
DiskMainMemory
Buffer18,22F1
10,33 44,5531,12
Lecture11>Section2>Optimizationsforsorting
F2
Repacking Example:3pagebuffer
• Taketheminimumtwovalues,andputinoutputpage
57,24 3,98
DiskMainMemory
Buffer18,22F1
10,33
44,55
31,12
Lecture11>Section2>Optimizationsforsorting
F2 31 33 10,12
m=12
Alsokeeptrackofmax(last)valueincurrentrun…
Repacking Example:3pagebuffer
• Next,repack
57,24 3,98
DiskMainMemory
BufferF1
33
Lecture11>Section2>Optimizationsforsorting
F2 31 31,3310,12
m=1244,55
18,22
Repacking Example:3pagebuffer
• Next,repack,thenloadanotherpageandcontinue!
57,24 3,98
DiskMainMemory
BufferF1
Lecture11>Section2>Optimizationsforsorting
F2 31,3310,12
m=1244,55
m=33
18,22
Repacking Example:3pagebuffer
• Now,however,thesmallestvaluesarelessthanthelargest(last)inthesortedrun…
3,98
DiskMainMemory
BufferF1
Lecture11>Section2>Optimizationsforsorting
F2 31,3310,12
m=33
18,2218,22
Wecallthesevaluesfrozen becausewecan’taddthemtothisrun…
44,55
57,24
Repacking Example:3pagebuffer
• Now,however,thesmallestvaluesarelessthanthelargest(last)inthesortedrun…
DiskMainMemory
BufferF1
Lecture11>Section2>Optimizationsforsorting
F2 31,3310,12
m=55
44,55 57,24 18,22
3,98
Repacking Example:3pagebuffer
• Now,however,thesmallestvaluesarelessthanthelargest(last)inthesortedrun…
DiskMainMemory
BufferF1
Lecture11>Section2>Optimizationsforsorting
F2 31,3310,12
m=55
44,55 57,24 18,22 3,98
Repacking Example:3pagebuffer
• Now,however,thesmallestvaluesarelessthanthelargest(last)inthesortedrun…
DiskMainMemory
BufferF1
Lecture11>Section2>Optimizationsforsorting
F2 31,3310,12
m=55
44,55 3,24 18,22 57,98
Repacking Example:3pagebuffer• Onceallbufferpageshaveafrozenvalue,orinputfileisempty,startnewrunwiththefrozenvalues
DiskMainMemory
BufferF1
Lecture11>Section2>Optimizationsforsorting
F2 31,3310,12
m=0
44,55 3,24 18,22
57,98
F3
Repacking Example:3pagebuffer• Onceallbufferpageshaveafrozenvalue,orinputfileisempty,startnewrunwiththefrozenvalues
DiskMainMemory
BufferF1
Lecture11>Section2>Optimizationsforsorting
F2 31,3310,12
m=0
44,55
57,98
F3
3,18 22,24
Repacking
• Notethat,forbufferwithB+1pages:• Ifinputfileissortedà nothingisfrozenà wegetasingle run!• Ifinputfileisreversesorted(worstcase)à everythingisfrozenà wegetrunsoflengthB+1
• Ingeneral,withrepackingwedonoworse thanwithoutit!
• Whatifthefileisalreadysorted?
• Engineer’sapproximation:runswillhave~2(B+1)length
~2𝑁( log@𝑵
𝟐(𝑩 + 𝟏) + 1)
Lecture11>Section2>Optimizationsforsorting
Summary
• BasicsofIOandbuffermanagement.
• WeintroducedtheIOcostmodelusingsorting.• SawhowtodomergeswithfewIOs,• Worksbetterthanmain-memorysortalgorithms.
• Describedafewoptimizationsforsorting
Lecture11