Introduction to Scientific Data Management
[email protected] 2015
http://www.cism.ucl.ac.be/training
Goal of this session:
“Share tools, tips and tricks related to the storage, transfer, and sharing
of scientific data”
1. Data storage
Block – File – Object
Databases
Data storage
Medium (Flash, Hard Drives, Tapes, DRAM)
LVM
RAID – JBOD – Erasure coding
software RAID
Local filesystem – Block storage
Attachment (IDE, SAS, SATA, iSCSI, ATAoE, FC)
RDBMS – Object store
Global filesystem – NoSQL
Schema – Serialization (file formats, etc.)
Storage Medium Technologies
http://www.slideshare.net/IMEXresearch/ss-ds-ready-for-enterprise-cloud
Storage performances
http://www.slideshare.net/IMEXresearch/ss-ds-ready-for-enterprise-cloud
Storage safety (RAID)
https://en.wikipedia.org/wiki/Standard_RAID_levels
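As a rough illustration of how parity-based RAID levels (such as RAID 5) protect data, the sketch below XORs data blocks into a parity block and rebuilds a lost block from the survivors. The block contents and the three-disk layout are hypothetical, and a real RAID implementation works on disk stripes rather than Python byte strings.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks, as if striped across three disks.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block that an additional disk would hold.
parity = xor_blocks(data_blocks)

# Simulate losing the second disk and rebuilding its block from the rest.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, any single missing block can be recomputed from the remaining blocks and the parity, which is why such a layout survives the loss of one disk.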
Storage abstraction levels
(local) Filesystems
http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/
Generation 0: No system at all. There was just an arbitrary stream of data. Think punchcards, data on audiocassette, Atari 2600 ROM carts.
Generation 1: Early random access. Here, there are multiple named files on one device with no folders or other metadata. Think Apple ][ DOS (but not ProDOS!) as one example.
Generation 2: Early organization (aka folders). When devices became capable of holding hundreds of files, better organization became necessary. We're referring to TRS-DOS, Apple //c ProDOS, MS-DOS FAT/FAT32, etc.
Generation 3: Metadata—ownership, permissions, etc. As the user count on machines grew higher, the ability to restrict and control access became necessary. This includes AT&T UNIX, Netware, early NTFS, etc.
Generation 4: Journaling! This is the killer feature defining all current, modern filesystems—ext4, modern NTFS, UFS2, you name it. Journaling keeps the filesystem from becoming inconsistent in the event of a crash, making it much less likely that you'll lose data, or even an entire disk, when the power goes off or the kernel crashes.
Generation 5: Copy on Write snapshots, Per-block checksumming, Volume management, Far-future scalability, Asynchronous incremental replication, Online compression. Generation 5 filesystems are Btrfs and ZFS.
Network filesystem
NAS: e.g. NFS
SAN: e.g. GFS
One source, many consumers
Pictures from https://www.redhat.com/magazine/008jun05/features/gfs_nfs/
Parallel filesystem
e.g. Lustre, GPFS, BeeGFS, GlusterFS
Many sources, many consumers
Pictures from https://www.redhat.com/magazine/008jun05/features/gfs_nfs/
Special filesystems – in memory
Filesystems
What filesystem for what usage
● Home (NFS) : Small size, Small I/Os
● Global scratch (parallel FS) : Large size, Large I/Os
● Local scratch (local FS): Medium size, Large I/Os
● In-memory (tmpfs): Small size, Large I/Os
● Mass storage (NFS): Large size, Small I/Os
Text File Formats – JSON, YAML, XML
Text File Formats – CSV, TSV
Binary File Formats – CDF, HDF
http://pro.arcgis.com/en/pro-app/help/data/multidimensional/fundamentals-of-netcdf-data-storage.htm
https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/
What file format for what usage
● Meta data
– Configuration file: INI, YAML
– Result with context information: JSON
● Data
– Small data (kBs): CSV, TSV
– Medium data (MBs): compressed CSV
– Large data (GBs): netCDF, HDF5, XDMF
– Huge data (TBs): Database, Object store (“loss of innocence”)
Use dedicated libraries to write and read them
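A minimal sketch of the advice above, using only modules from the Python standard library: `configparser` for an INI configuration file, `json` for a result with context information, and `csv` (optionally through `gzip`) for small and medium data. The file names, keys, and values are hypothetical. Large data in netCDF or HDF5 would need a third-party library and is not shown.

```python
import configparser
import csv
import gzip
import json
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())  # hypothetical working directory

# Configuration file (INI): parameters of a hypothetical run.
config = configparser.ConfigParser()
config["simulation"] = {"steps": "1000", "timestep": "0.01"}
with open(workdir / "run.ini", "w") as handle:
    config.write(handle)

# Result with context information (JSON): the value plus how it was obtained.
result = {"energy": -1.25, "units": "eV", "steps": 1000}
(workdir / "result.json").write_text(json.dumps(result, indent=2))

# Small data (CSV), written with the csv module rather than by hand.
rows = [("time", "value"), (0.0, 1.0), (0.1, 0.9)]
with open(workdir / "series.csv", "w", newline="") as handle:
    csv.writer(handle).writerows(rows)

# Medium data (compressed CSV): same writer on a gzip-compressed text stream.
with gzip.open(workdir / "series.csv.gz", "wt", newline="") as handle:
    csv.writer(handle).writerows(rows)

# Reading back goes through the same libraries; CSV values come back as text.
with gzip.open(workdir / "series.csv.gz", "rt", newline="") as handle:
    read_back = list(csv.reader(handle))
```

Going through the libraries rather than string formatting leaves quoting, escaping, and encoding to code that already handles the corner cases.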
Object storage
● Object: data (e.g. file) + meta data
● Often built on erasure coding
● Scale out easily
● Useful for web applications
● Examples:
– OpenStack Swift
– Ceph
NoSQL
Pictures from http://www.tomsitpro.com/articles/rdbms-sql-cassandra-dba-developer,2-547-2.html
RDBMS
Pictures from http://www.ibm.com/developerworks/library/x-matters8/
SQL: SELECT AuthorName, Title FROM Books JOIN Authors ON Authors.AuthorID=Books.AuthorID
● Mostly needed for categorical data and alphanumerical data (not suited for matrices, but good for end-results)
● Indexes make finding a data element very fast (as well as computing sums, maxima, etc.)
● Encodes relations between data (constraints, etc.)
● Atomicity, Consistency, Isolation, and Durability
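The JOIN query quoted earlier can be tried with Python's built-in `sqlite3` module. This is only a sketch: the `Books` and `Authors` column names come from that query, while the `BookID` column, the index name, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Authors (
        AuthorID INTEGER PRIMARY KEY,
        AuthorName TEXT NOT NULL
    );
    CREATE TABLE Books (
        BookID INTEGER PRIMARY KEY,
        Title TEXT NOT NULL,
        AuthorID INTEGER NOT NULL REFERENCES Authors(AuthorID)
    );
    -- An index on the join column speeds up lookups by author.
    CREATE INDEX idx_books_author ON Books(AuthorID);
""")

# Insert inside a transaction: either every row is stored or none is.
with conn:
    conn.execute("INSERT INTO Authors VALUES (1, 'A. Author')")
    conn.execute("INSERT INTO Authors VALUES (2, 'B. Writer')")
    conn.executemany(
        "INSERT INTO Books (Title, AuthorID) VALUES (?, ?)",
        [("First Book", 1), ("Second Book", 1), ("Third Book", 2)],
    )

# The query from the slide, with an ORDER BY added for a stable result.
rows = conn.execute(
    "SELECT AuthorName, Title FROM Books "
    "JOIN Authors ON Authors.AuthorID = Books.AuthorID "
    "ORDER BY Title"
).fetchall()
```

The `with conn:` block commits on success and rolls back if an exception is raised, which is a small-scale view of the atomicity listed above.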
● When to use a database:
– when you have a large number of small files
– when you perform a lot of direct writes in a large file
– when you want to keep structure/relations between data
– when software crashes have a non-negligible probability
– when files are updated by several processes
● When not to use:
– only sequential access
– simple matrices/vectors, etc.
– direct access on fixed-size records and no structure
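For the last case, direct access on fixed-size records, a flat binary file is often enough: the position of record number n is simply n times the record size. The sketch below assumes a hypothetical record layout (a 32-bit integer followed by a 64-bit float) and uses the standard `struct` module.

```python
import struct
import tempfile

# Hypothetical layout: little-endian int32 id followed by a float64 value.
RECORD = struct.Struct("<id")


def write_records(handle, records):
    """Write (id, value) pairs back to back as fixed-size records."""
    for record_id, value in records:
        handle.write(RECORD.pack(record_id, value))


def read_record(handle, index):
    """Jump straight to record number index and decode it."""
    handle.seek(index * RECORD.size)
    return RECORD.unpack(handle.read(RECORD.size))


with tempfile.TemporaryFile() as handle:
    write_records(handle, [(i, i * 0.5) for i in range(1000)])
    record = read_record(handle, 742)
```

No index or query engine is involved: a single `seek` reaches any record, which is why a database adds little in this situation.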
2. Data transfer
faster and less secure
parallel transfers
scp -c cipher ...
http://blog.famzah.net/2010/06/11/openssh-ciphers-performance-benchmark/
Fastest: No encryption at all
● Need friendly firewall (choose direction accordingly)
● Only over trusted networks
● If rsh is installed: rcp instead of scp
● If rsh is not installed: nc on both ends
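What `nc` on both ends amounts to is a bare TCP stream with no encryption and no authentication. The following Python sketch reproduces that idea on the loopback interface so it can run on a single machine. The payload and the function names are hypothetical, and in a real transfer the receiver would run on the destination host behind a firewall that allows the connection.

```python
import socket
import threading

CHUNK = 64 * 1024


def receive(server, sink):
    """Accept one connection and append everything received to sink."""
    conn, _ = server.accept()
    with conn:
        while True:
            chunk = conn.recv(CHUNK)
            if not chunk:  # the sender closed the connection
                break
            sink.extend(chunk)


def send(host, port, payload):
    """Send payload over a plain, unencrypted TCP connection."""
    with socket.create_connection((host, port)) as client:
        client.sendall(payload)


payload = bytes(range(256)) * 4096  # 1 MiB of hypothetical data
received = bytearray()

# Listen on the loopback interface; port 0 asks the OS for any free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

receiver = threading.Thread(target=receive, args=(server, received))
receiver.start()
send("127.0.0.1", port, payload)
receiver.join()
server.close()
```

Anyone on the network path can read or alter such a stream, hence the restriction to trusted networks.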
Resuming transfers
● When nothing changed but the transfer was interrupted
– rsync --size-only: skip files whose size matches on both sides; do not perform a byte-level file comparison
– rsync --append: do not re-check partially transmitted files; resume each transfer where it was abandoned, assuming the first attempt was made with scp or with rsync --inplace
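The append idea can be illustrated locally: trust the bytes already present in the destination, seek past them in the source, and copy only the missing tail. This is a sketch of the principle on local files, not of how rsync implements it; the file names and sizes are hypothetical.

```python
import os
import tempfile


def resume_copy(source, destination, chunk_size=1024 * 1024):
    """Append the missing tail of source to destination; return bytes copied."""
    offset = os.path.getsize(destination) if os.path.exists(destination) else 0
    copied = 0
    with open(source, "rb") as src, open(destination, "ab") as dst:
        src.seek(offset)  # skip what the first attempt already delivered
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            copied += len(chunk)
    return copied


workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "data.bin")
destination = os.path.join(workdir, "data.partial")

with open(source, "wb") as handle:
    handle.write(os.urandom(10_000))

# Simulate an interrupted transfer that only delivered the first 4000 bytes.
with open(source, "rb") as src, open(destination, "wb") as dst:
    dst.write(src.read(4000))

copied = resume_copy(source, destination)
```

The shortcut is only safe under the stated condition that nothing changed: if the source was modified after the first attempt, the existing prefix is stale and the result is silently wrong.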
Parallel data transfer: bbcp
● Better use of the bandwidth than SCP
● Needs to be installed on both sides (easy to install)
● Needs friendly firewalls
Parallel data transfers: parsync
http://moo.nac.uci.edu/~hjm/parsync/
Parallel data transfers: sbcast
Transferring ZOT files
● Zillions Of Tiny files
● More meta-data than data → large overhead for rsync
● Solution: Pre-tar or tar on the fly
● Needs friendly firewall
● Also avoid 'ls' and '*' as they sort the output. Favor 'find'
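The "tar on the fly" idea can be sketched with the standard `tarfile` module in streaming mode, combined with `os.scandir`, which yields directory entries without sorting them. Here an in-memory buffer stands in for the pipe or socket to the remote side, and the directory contents are hypothetical.

```python
import io
import os
import tarfile
import tempfile


def iter_files(directory):
    """Yield file paths in directory order, without sorting them."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                yield entry.path


def pack(directory, stream):
    """Write every file of directory into one tar stream; return the count."""
    count = 0
    with tarfile.open(fileobj=stream, mode="w|") as archive:
        for path in iter_files(directory):
            archive.add(path, arcname=os.path.basename(path))
            count += 1
    return count


# Create a hypothetical directory full of tiny files.
source_dir = tempfile.mkdtemp()
for i in range(500):
    with open(os.path.join(source_dir, f"tiny_{i}.dat"), "w") as handle:
        handle.write(str(i))

stream = io.BytesIO()  # stands in for a pipe or socket to the remote side
packed = pack(source_dir, stream)

# On the receiving side, unpack the stream sequentially.
stream.seek(0)
target_dir = tempfile.mkdtemp()
with tarfile.open(fileobj=stream, mode="r|") as archive:
    archive.extractall(target_dir)
```

The receiver sees one continuous byte stream instead of hundreds of per-file exchanges, which is where the saving comes from.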
3. Data sharing
with other users (Unix permissions, Encryption)
with external users (Owncloud)
Data sharing
Data sharing with other users
Sharing with all other users
Sharing with the group
Sharing and hiding
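The slide images for these three cases are not available, so the following is only a plausible sketch of how they map onto Unix permission bits. It assumes that "sharing and hiding" refers to a directory that others may enter but not list. The directory names are hypothetical, and the same modes could be set with `chmod` from the shell.

```python
import os
import stat
import tempfile

base = tempfile.mkdtemp()  # stands in for a hypothetical project directory

shared_all = os.path.join(base, "public")
shared_group = os.path.join(base, "team")
hidden = os.path.join(base, "dropbox")
for path in (shared_all, shared_group, hidden):
    os.mkdir(path)

# Sharing with all other users: everyone may list and enter (rwxr-xr-x).
os.chmod(shared_all, 0o755)

# Sharing with the group: only owner and group may list and enter (rwxr-x---).
os.chmod(shared_group, 0o750)

# Assumed meaning of "sharing and hiding": others may enter but not list
# the directory (rwx--x--x), so a file inside is reachable only by someone
# who already knows its name.
os.chmod(hidden, 0o711)


def mode_string(path):
    """Return the permissions in ls -l notation, e.g. 'drwxr-x---'."""
    return stat.filemode(os.stat(path).st_mode)
```

Files inside each directory still need read permission for the intended audience, and every parent directory on the path must be searchable by them.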
Sharing and encrypting
Data sharing
Data sharing with external users
● owncloud
CISM login
Dropbox-like
External SFTP connectors
My home on Manneback
Can create a share URL
And distribute it
Summary:
Storage: choose the right filesystem and the right file format
Transfer: use the parallel tools when possible and limit encryption in favor of throughput
Sharing: use all the potential of the UNIX permissions and try Owncloud
http://www.cism.ucl.ac.be/training