MogileFS: A Simple and Reliable Storage Solution
TRANSCRIPT
![Page 1: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/1.jpg)
MogileFS: A Simple and Reliable Storage Solution
TWJUG Meetup Nov. 2016
kaif@kaif (member of mogilefs-moji)
![Page 2: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/2.jpg)
Outline
• MogileFS
• Moji
• State of the art in MogileFS reliability
![Page 3: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/3.jpg)
Quick facts
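“Open source distributed object storage” – a.k.a. cloud storage, software-defined storage…
• High availability and horizontal scaling
• Multi-replica file storage and repair
• Simple architecture, easy to use
• A long track record of production deployments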
![Page 4: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/4.jpg)
Brad Fitzpatrick
• Golang
• OpenID
• LiveJournal
– Memcached
– MogileFS
– …
![Page 5: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/5.jpg)
Simplicity
![Page 6: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/6.jpg)
Easy-to-use
• Command line tool
• Config file
![Page 7: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/7.jpg)
Easy-to-use
• Admin tool
![Page 8: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/8.jpg)
[Diagram: client, trackers, storage nodes (store), MySQL]

1. Create open (client to tracker)
create_open domain=toast&class=triple&debug_profile=0&fid=0&multi_dest=1&key=qoo3
OK path_1=http://127.0.0.20:7500/dev2/0/000/000/0000000014.fid&path_3=http://127.0.0.25:7500/dev3/0/000/000/0000000014.fid&devid_1=2&devid_3=3&fid=14&path_2=http://127.0.0.25:7500/dev4/0/000/000/0000000014.fid&dev_count=3&devid_2=4

2. Write data (WebDAV, client to store)
PUT /dev208/0/068/050/0068050934.fid HTTP/1.0
Content-length: 9

some data
200 OK

3. Create close (client to tracker)
create_close domain=toast&fid=14&devid=2&path=http://127.0.0.20:7500/dev2/0/000/000/0000000014.fid&size=1048576&key=qoo3&devid_2=3&path_2=http://127.0.0.25:7500/dev3/0/000/000/0000000014.fid&multi_dest=1
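The `create_open` reply shown above is a single line of `key=value` pairs joined by `&`. As a minimal sketch of how a client might pull the candidate destination paths out of such a reply, the class below parses the pairs and returns `path_1` through `path_N`, where N is `dev_count`. The class and method names (`CreateOpenReply`, `parseArgs`, `destinationPaths`) are hypothetical and are not Moji's or MogileFS's API; only the reply fields come from the slide.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Moji or MogileFS.
class CreateOpenReply {

    // URL-decodes one key or value using UTF-8.
    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Splits "k1=v1&k2=v2" into a map; pairs without '=' are skipped.
    static Map<String, String> parseArgs(String args) {
        Map<String, String> out = new HashMap<>();
        for (String pair : args.split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue;
            }
            out.put(decode(pair.substring(0, eq)), decode(pair.substring(eq + 1)));
        }
        return out;
    }

    // Returns path_1..path_N in index order, where N is dev_count.
    static List<String> destinationPaths(String reply) {
        if (!reply.startsWith("OK ")) {
            throw new IllegalArgumentException("not an OK reply: " + reply);
        }
        Map<String, String> args = parseArgs(reply.substring(3).trim());
        String countText = args.get("dev_count");
        int count = countText == null ? 0 : Integer.parseInt(countText);
        List<String> paths = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            String path = args.get("path_" + i);
            if (path != null) {
                paths.add(path);
            }
        }
        return paths;
    }

    public static void main(String[] unused) {
        String reply = "OK path_1=http://127.0.0.20:7500/dev2/0/000/000/0000000014.fid"
                + "&devid_1=2&fid=14"
                + "&path_2=http://127.0.0.25:7500/dev4/0/000/000/0000000014.fid"
                + "&dev_count=2&devid_2=4";
        for (String path : destinationPaths(reply)) {
            System.out.println(path);
        }
    }
}
```

The client would then PUT the data to one or more of these paths and report the ones that succeeded in `create_close`.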
![Page 9: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/9.jpg)
Availability
![Page 10: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/10.jpg)
1WNR, memcached…
Scalability
![Page 11: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/11.jpg)
User testimonials
![Page 12: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/12.jpg)
KKBOX
![Page 13: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/13.jpg)
KKBOX
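• More than 30 million songs (files)
• More than 75 storage servers
• More than 2,300 hard drives in total
• More than 10 PB of total storage
• 8 racks in use
(Source: "KKBOX 的音樂檔案儲存技術" [KKBOX's music file storage technology], posted on August 2, 2016 by Chris Yuan)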
![Page 14: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/14.jpg)
My production experience
• File count: KKBOX × 10 × N
• Node count: 10^2 × N
• Complex workloads (backup, streaming, IoT, web, logs… orz)
• Java ♥
![Page 15: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/15.jpg)
Moji
• A file-like MogileFS client for Java developers
• Production-ready features
– Connection pooling, load balancing, fault-tolerant…
• Quality
– Spring friendly, integration tests, well documented, actively developed…
https://github.com/mogilefs-moji/moji
![Page 16: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/16.jpg)
Configuration
• Using plain-old-Java
SpringMojiBean moji = new SpringMojiBean();
moji.setAddressesCsv("192.168.0.1:7001,192.168.0.2:7001");
moji.setDomain("testdomain");
moji.initialise();
moji.setTestOnBorrow(true);
• Using the Spring framework
moji.tracker.address=192.168.0.1:7001,192.168.0.2:7001
moji.domain=testdomain
<import resource="moji-context.xml" />
![Page 17: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/17.jpg)
Usage
• Create/update a remote file
MojiFile rickRoll = moji.getFile("rick-astley");
moji.copyToMogile(new File("never-gonna-give-you-up.mp3"), rickRoll);
• Download a remote file
rickRoll.copyToFile(new File("foo-fighters.mp3"));
![Page 18: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/18.jpg)
Usage
• IO stream
MojiFile fooFighters = moji.getFile("stacked-actors");

// Reading: try-with-resources closes the stream, and avoids calling
// close() on a null reference if getInputStream() throws
try (InputStream in = fooFighters.getInputStream()) {
  // Do something streamy
  // in.read();
}

// Writing: flush before the stream is closed
try (OutputStream out = fooFighters.getOutputStream()) {
  // Do something streamy
  // out.write(...);
  out.flush();
}
![Page 19: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/19.jpg)
Call to action!
• Quickstart feat. Docker
docker run -d --name mogile-node jeffutter/mogile-node
docker run -it --link mogile-node:mogile-node hrchu/mogile-moji
• Set up the environment manually
– MogileFS
https://code.google.com/p/mogilefs/wiki/QuickStartGuide
– Maven dependency
<dependency>
  <groupId>fm.last</groupId>
  <artifactId>moji</artifactId>
  <version>2.0.0</version>
</dependency>
![Page 20: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/20.jpg)
Let's talk about reliability
![Page 21: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/21.jpg)
MogileFS reliability measures
• Single copy ACK
• Multiple host replication policy
• MD5 checksum
• Basic disk health check
• Multiple zone plugin
• Reaper/fsck
![Page 22: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/22.jpg)
And the files lived happily ever after~
… ?
![Page 23: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/23.jpg)
Possible directions for improving reliability
• Multiple sites
• Scrubber
• Modern durable write
![Page 24: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/24.jpg)
Multiple Sites
• MogileFS::Network plugin
• Assign a different subnet to each data center
• Map each zone to its subnet
• Replication policy
![Page 25: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/25.jpg)
Multiple Sites
• Given a network of: 10.10.0.0/16
• All of your machines are configured to have a netmask of 10.10.0.0/16. When assigning IP addresses to machines, pick them from 10.10.5.0/24
• IP assignment
– web1: 10.10.5.1 (netmask 255.255.0.0 or /16)
– web2: 10.10.5.2
– tracker1: 10.10.5.3
– tracker2: 10.10.5.4
– storage node 1: 10.10.5.5
– storage node 2: 10.10.5.6
– storage node 3: 10.10.8.1
• MogileFS zones, you configure:
– near=10.10.5.0/24 far=10.10.8.0/24
[Diagram: web1, web2, tracker1, tracker2, node1, node2 and node3 laid out across the near and far zones]
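The near/far mapping above amounts to matching a host's IP address against each zone's CIDR block. The sketch below shows that matching in isolation. It is an illustration only: the MogileFS::Network plugin is written in Perl and its internals are not shown in these slides, and the `ZoneLookup` class and its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of zone-by-subnet matching (IPv4 only).
class ZoneLookup {

    // zone name -> { network address, netmask }, both as 32-bit ints
    private final Map<String, int[]> zones = new LinkedHashMap<>();

    // Converts dotted-quad IPv4 text such as "10.10.5.1" to a 32-bit int.
    static int toInt(String ipv4) {
        String[] parts = ipv4.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ipv4);
        }
        int value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("bad octet in: " + ipv4);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    // Registers a zone from CIDR text such as "10.10.5.0/24".
    void addZone(String name, String cidr) {
        String[] parts = cidr.split("/");
        int prefix = Integer.parseInt(parts[1]);
        int mask = prefix == 0 ? 0 : -1 << (32 - prefix);
        zones.put(name, new int[] { toInt(parts[0]) & mask, mask });
    }

    // Returns the first zone whose subnet contains the address, or null.
    String zoneOf(String ipv4) {
        int address = toInt(ipv4);
        for (Map.Entry<String, int[]> entry : zones.entrySet()) {
            int[] zone = entry.getValue();
            if ((address & zone[1]) == zone[0]) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] unused) {
        ZoneLookup lookup = new ZoneLookup();
        lookup.addZone("near", "10.10.5.0/24");
        lookup.addZone("far", "10.10.8.0/24");
        System.out.println("storage node 1 -> " + lookup.zoneOf("10.10.5.5"));
        System.out.println("storage node 3 -> " + lookup.zoneOf("10.10.8.1"));
    }
}
```

With the addresses from the slide, storage nodes 1 and 2 fall in `near` and storage node 3 falls in `far`, which is what lets a replication policy place copies in different zones.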
![Page 26: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/26.jpg)
Scrubber
• Make use of routine FSCK as scrubber
mogadm fsck status | grep " Yes " || (mogadm fsck reset; mogadm fsck clearlog; mogadm fsck start) >/var/log/mogadm.fsck 2>&1
• Modified algorithm
– Remove exhaustive search
– Improve performance at large scale
https://github.com/mogilefs/MogileFS-Network/blob/master/lib/MogileFS/ReplicationPolicy/HostsPerNetwork.pm#L84
![Page 27: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/27.jpg)
Modern durable write
• AS-IS
[Diagram: client, trackers, storage nodes (store), MySQL]
4. Write other copies asynchronously
Assume that a file should have at least three replicas in the system to meet the durability requirement
![Page 28: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/28.jpg)
Modern durable write
• TO-BE
[Diagram: client, trackers, storage nodes (store), MySQL]
2. Write at least two copies before ACK
4. Write other copies asynchronously
Assume that a file should have at least three replicas in the system to meet the durability requirement
mogilefs-moji#25
mogilefs/MogileFS-Server#39
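A rough sketch of the TO-BE idea, acknowledging only after a minimum number of copies has been written, could look like the following. It is not the design proposed in the issues referenced above, whose details are not shown in the slides. `DurableWrite`, `writeCopies`, and the `putToStore` callback (standing in for the HTTP PUT to a storage node) are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of "write at least N copies before ACK".
class DurableWrite {

    /**
     * Tries the destinations in order until minCopies writes have succeeded.
     * Returns the destinations that were written; throws if too few succeed,
     * in which case the caller should not acknowledge the upload.
     */
    static List<String> writeCopies(List<String> destinations, int minCopies,
                                    Predicate<String> putToStore) {
        List<String> written = new ArrayList<>();
        for (String destination : destinations) {
            if (written.size() >= minCopies) {
                break; // enough copies; the rest is left to async replication
            }
            if (putToStore.test(destination)) {
                written.add(destination);
            }
        }
        if (written.size() < minCopies) {
            throw new IllegalStateException("only " + written.size() + " of "
                    + minCopies + " required copies were written");
        }
        return written;
    }

    public static void main(String[] unused) {
        List<String> candidates = Arrays.asList(
                "http://127.0.0.20:7500/dev2/0/000/000/0000000014.fid",
                "http://127.0.0.25:7500/dev4/0/000/000/0000000014.fid",
                "http://127.0.0.25:7500/dev3/0/000/000/0000000014.fid");
        // Stand-in for a real PUT: pretend every store accepts the write.
        List<String> written = writeCopies(candidates, 2, destination -> true);
        System.out.println("written before ACK: " + written);
    }
}
```

The returned list corresponds to the destinations a client would report back to the tracker on close; any remaining replicas are still created asynchronously, as in step 4.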
![Page 29: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/29.jpg)
Analysis
• Disk failure pattern
– MTTF?
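– Poisson distribution?
• Mark-out: the window before a failure is detected
• Rep latency: the window while asynchronous replication is still pending
• Disk size and file size also affect the results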
![Page 30: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/30.jpg)
Analysis
• Combinatorial analysis model
– Assume that each disk fails independently
– Assume that after x hours of operation each block survives with probability P(xi) = p
– Probability of failure q = 1 - p.
– A naive formula for replication: 1 - q^n
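As a worked example of the naive formula on this slide, read as "at least one of n independent replicas survives", take an illustrative failure probability q = 0.01 over the period of interest. The value of q is an assumption chosen for the arithmetic and is not a figure from the talk.

```latex
% n independent replicas, each lost with probability q over the period
P_{\text{survive}}(n) = 1 - q^{n}

% Illustrative value q = 0.01 (assumed, not from the slides)
P_{\text{survive}}(2) = 1 - (0.01)^{2} = 1 - 10^{-4} = 0.9999
P_{\text{survive}}(3) = 1 - (0.01)^{3} = 1 - 10^{-6} = 0.999999
```

Each extra replica multiplies the loss probability by q, which is why the formula is naive: it ignores repair, detection delay, and correlated failures.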
![Page 31: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/31.jpg)
Analysis
• When we also consider
– Non-Recoverable Errors (NREs)
– drive failure events are Poisson
– site failures (e.g. due to regional disasters)
– rep latency, mark-out time
– …
• Analysis of system durability is commonly done with Markov models
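For orientation, here is the textbook continuous-time Markov model for two replicas, which may differ from the model behind the charts in this talk. The assumptions are: a constant per-disk failure rate λ, a constant re-replication rate μ, independent failures, and no NREs. T_k denotes the mean time to data loss (MTTDL) when starting with k healthy replicas.

```latex
% States: 2 = both replicas healthy, 1 = one replica left, 0 = data lost.
% From state 2 the system leaves at rate 2\lambda (either disk fails).
% From state 1 it is repaired at rate \mu or loses data at rate \lambda.
\begin{align}
T_2 &= \frac{1}{2\lambda} + T_1 \\
T_1 &= \frac{1}{\lambda + \mu} + \frac{\mu}{\lambda + \mu}\, T_2 \\
\Rightarrow\quad
T_1 &= \frac{2\lambda + \mu}{2\lambda^{2}}, \qquad
T_2 = \frac{3\lambda + \mu}{2\lambda^{2}}, \qquad
T_2 - T_1 = \frac{1}{2\lambda}
\end{align}
```

In this simplified model, starting with two copies instead of one adds 1/(2λ) to the MTTDL, which is 250,000 hours for a mean disk life of 500K hours. The model uses the same rate μ for initial replication and for repair, so it does not capture every effect listed above.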
![Page 32: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/32.jpg)
Analysis
• Example of durable write
– Assume mean disk life is 500K hrs
– 2 replicas, no NRE
[Chart: diff of MTTDL in hr (y-axis, 249960 to 250080) versus mu (x-axis: 1, 0.041666667, 0.020833333, 0.013888889); series: diff disk life 5]
The lower the replication rate, the larger the improvement from durable write
![Page 33: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/33.jpg)
Analysis
• Example of probability of data loss
[Chart: probability of data loss (y-axis, 0.000000E+00 to 8.000000E-05) over x = 1 to 14; series: P of data loss 72, P of data loss 48, P of data loss 24, P of data loss 1]
![Page 34: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/34.jpg)
Recap
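Storage and architecture: the requirements of the deployment determine the choice of storage architecture
When sensitive data, client requirements, cost, or legacy constraints come into play, MogileFS may be a suitable choice of storage architecture~
About MogileFS, what I want to say is… it is a simple, scalable storage system for unstructured data
For a Java stack, pairing it with Moji is recommended
If your business is large and well funded and you can bring in specialists or consultants, Ceph/Swift are more advanced, more complex options!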
![Page 35: Mogilefs, 簡約可靠的儲存方案](https://reader034.vdocuments.net/reader034/viewer/2022042423/58a7103b1a28ab02678b4625/html5/thumbnails/35.jpg)
Thank you~
[About me]
https://kaif.io/u/kaif
https://github.com/hrchu
[About Moji]
https://github.com/mogilefs-moji/moji
FIN~