Contents
Links
- queue:
- theory:
- column databases:
- time series (stream) databases:
Kx;
- time series tools:
- caches:
- document stores:
- cluster file systems:
- miscellaneous:
- articles:
Column store
- comments:
http://developers.slashdot.org/comments.pl?sid=290021&cid=20496021:
Column stores are great (better than a row store) if you're just reading tons of data, but they're much more costly than a row store if you're writing tons of data.
Therefore, pick your method depending on your needs. Are you storing massive amounts of data? Column stores are probably not for you... Your application will run better on a row store, because writing to a row store is a simple matter of adding one more record to the file, whereas writing to a column store is often a matter of writing a record to many files... Obviously more costly.
http://developers.slashdot.org/comments.pl?sid=290021&cid=20504275:
The column major approach simply keeps the indexes, if you will, and discards the rows. This allows for FAST operations if you are doing LOADS of reads and few changes. That is PERFECT for data warehousing.
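a toy Python sketch of this write/read trade-off (in-memory lists stand in for row files and column files; all names are made up, not any particular engine):
    class RowStore:
        def __init__(self, columns):
            self.columns = columns
            self.rows = []                    # all columns kept together

        def insert(self, record):
            self.rows.append(record)          # one append per record

        def scan(self, column):
            i = self.columns.index(column)
            return [r[i] for r in self.rows]  # must read through whole rows

    class ColumnStore:
        def __init__(self, columns):
            self.data = {c: [] for c in columns}  # one "file" per column

        def insert(self, record):
            for c, v in zip(self.data, record):   # one write per column file
                self.data[c].append(v)

        def scan(self, column):
            return self.data[column]              # touches only this column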
Hypertable
http://podcasts.networkworld.com/linuxcast/013008-linuxcast.mp3;
- model:
- multi-layer key-value store;
- a key is composed of row-id, column family, column qualifier, and timestamp (see the sketch after this list);
the model is the same as BigTable's;
data distribution follows the same scheme as the Google File System;
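a minimal Python sketch of that composite key, assuming a plain sorted list as the map (the real key is a packed byte encoding; negating the timestamp so newer revisions sort first is an assumption borrowed from BigTable-style designs):
    import bisect

    table = []  # sorted ((row, column_family, column_qualifier, -timestamp), value)

    def put(row, family, qualifier, timestamp, value):
        bisect.insort(table, ((row, family, qualifier, -timestamp), value))

    put("house42", "reading", "power", 2, b"354")
    put("house42", "reading", "power", 1, b"350")
    assert table[0][1] == b"354"  # newest revision of the cell sorts first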
- advantages (in the context of the Dehems project):
tables are broken into contiguous ranges and delegated to different physical machines (sketched after this list);
a form of sharding (cf. Shard (database architecture))?
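a sketch of the range-to-machine lookup, assuming invented split keys and server names:
    import bisect

    split_points = ["g", "p"]          # hypothetical range boundaries
    servers = ["rs1", "rs2", "rs3"]    # ranges (-inf,"g"], ("g","p"], ("p",+inf)

    def server_for(row_key):
        # first range whose upper boundary is >= the row key
        return servers[bisect.bisect_left(split_points, row_key)]

    assert server_for("apple") == "rs1"
    assert server_for("melon") == "rs2"
    assert server_for("zebra") == "rs3"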
- disadvantages (in the context of the Dehems project):
there is no typing for data in cells;
all data is stored as uninterpreted byte strings;
optionally, all revisions of the data are stored, tagged with a timestamp;
all the writes go through a commit log;
there is a background process -- Heap Merge -- that aggregates the key/value pairs in the cell cache and stores them in sorted order;
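a generic LSM-style sketch of this write path (simplified; the class name and log format are invented):
    class RangeServerSketch:
        def __init__(self, log_path):
            self.log = open(log_path, "ab")  # commit log: every write lands here first
            self.cell_cache = {}             # in-memory key -> value map

        def write(self, key, value):
            # values are uninterpreted byte strings
            self.log.write(key + b"\t" + value + b"\n")
            self.log.flush()
            self.cell_cache[key] = value

        def flush_sorted(self):
            # stand-in for the background merge: emit the cache in key order
            run = sorted(self.cell_cache.items())
            self.cell_cache.clear()
            return run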
- implementation notes:
all of the data inserted into a Hypertable gets compressed at least twice, and on average three times: once when writing to the commit log, once during minor compaction, and then once for every merging or major compaction;
column-oriented systems (like Hypertable) are better for "read-mostly" situations;
Hypertable is built on top of a Distributed File System;
- data model:
the first dimension of a table is the row key;
the second dimension is the column family;
the third dimension is the column qualifier;
the fourth and final dimension is the timestamp;
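a sketch of reading back through those four dimensions (same toy sorted-list layout as in the earlier sketch; newest-first timestamp order is an assumption):
    from bisect import bisect_left, insort

    table = []  # sorted ((row, family, qualifier, -timestamp), value)

    def put(row, family, qualifier, ts, value):
        insort(table, ((row, family, qualifier, -ts), value))

    def get_latest(row, family, qualifier):
        # the newest revision is the first entry with this cell's prefix
        i = bisect_left(table, ((row, family, qualifier, float("-inf")), b""))
        if i < len(table) and table[i][0][:3] == (row, family, qualifier):
            return table[i][1]
        return None

    put("row1", "tag", "color", 1, b"red")
    put("row1", "tag", "color", 2, b"blue")
    assert get_latest("row1", "tag", "color") == b"blue"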
- physical data layout:
key/value pairs are stored in files -- cell-stores;
key/value pairs are stored as a sequence of compressed blocks inside a cell-store;
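a sketch of that layout, with zlib standing in for the block compressor and an in-memory list standing in for the cell-store file (block size and record format are invented; assumes keys/values contain no tabs or newlines):
    import zlib
    from bisect import bisect_left

    def write_cell_store(sorted_pairs, block_size=4):
        # pack sorted key/value pairs into compressed blocks; keep the
        # last key of each block as a lookup index
        blocks, index = [], []
        for i in range(0, len(sorted_pairs), block_size):
            chunk = sorted_pairs[i:i + block_size]
            payload = b"\n".join(k + b"\t" + v for k, v in chunk)
            blocks.append(zlib.compress(payload))
            index.append(chunk[-1][0])
        return blocks, index

    def lookup(blocks, index, key):
        i = bisect_left(index, key)  # first block whose last key >= key
        if i == len(blocks):
            return None
        for line in zlib.decompress(blocks[i]).split(b"\n"):
            k, _, v = line.partition(b"\t")
            if k == key:
                return v
        return None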
- programming examples:
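a hedged Python example over the ThriftBroker interface; the module path, the method names (namespace_open, hql_query), and the result fields are recalled from the Hypertable Python bindings and should be checked against the installed client:
    # assumes a running Hypertable ThriftBroker on localhost:38080 and the
    # hypertable.thriftclient Python bindings; API names are from memory
    # and may differ across versions
    from hypertable.thriftclient import ThriftClient

    client = ThriftClient("localhost", 38080)
    ns = client.namespace_open("/")

    client.hql_query(ns, "CREATE TABLE SensorReadings (reading, meta)")
    client.hql_query(ns, "INSERT INTO SensorReadings VALUES "
                         "('house42', 'reading:power', '354')")
    result = client.hql_query(ns, "SELECT * FROM SensorReadings WHERE ROW = 'house42'")
    for cell in result.cells:
        print(cell.key.row, cell.key.column_family, cell.value)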