Big Data Mindmap

Accumulo (Apache)
BigTable, Key/Value, Distributed, Petabyte scale, fault tolerance
Sorted, schema-less, graph support
On top of Hadoop, ZooKeeper, Thrift
B-trees, Bloom filters, compression
Uses ZooKeeper and MR; has a CLI
Clustered. Linear scalability. Failover master
Per-cell ACL + table ACL
Built-in user DB; plaintext on the wire
Iterators can be inserted in the read path and in table-maintenance code
Advanced UI
HBase (Apache)
On top of HDFS
BigTable, Key/Value, Distributed, Petabyte scale, fault tolerance
Clustered. Linear scalability. Failover master
Sorted, RESTful
FB, eBay, Flurry, Adobe, Mozilla, TrendMicro
B-trees, Bloom filters, compression
Uses ZooKeeper and MR; has a CLI
Column family ACL
Advanced security and integration with Hadoop
Triggers, stored procedures, cluster-management hooks
HBase is 4-5 times slower than HDFS in batch contexts; HBase is for random access
Stores key/value pairs in columnar fashion
(columns are grouped together as column families).
Provides low latency access to small amounts of
data from within a large data set.
Provides flexible data model.
HBase is not optimized for classic transactional
applications or even relational analytics.
HBase is CPU- and memory-intensive with
sporadic large sequential I/O access
HBase supports massively parallelized processing via
MapReduce for using HBase as both source and sink
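The columnar, sorted key/value layout described above can be sketched in plain Python. This is a toy model for illustration only, not the HBase API: rows are kept in row-key order, each cell lives under a column family, and point reads by row key are cheap.

```python
# Toy model of HBase's storage layout (illustration only, not the HBase API):
# rows sorted by row key, cells grouped under column families.
class ToyHTable:
    def __init__(self, families):
        self.families = set(families)
        self.rows = {}  # row_key -> {family: {qualifier: value}}

    def put(self, row, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row, {}).setdefault(family, {})[qualifier] = value

    def get(self, row, family, qualifier):
        # Random access by row key: O(1) here, a memstore/HFile lookup in HBase.
        return self.rows.get(row, {}).get(family, {}).get(qualifier)

    def scan(self, start, stop):
        # Rows come back in sorted row-key order, like an HBase scan.
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, self.rows[key]

t = ToyHTable(families=["cf"])
t.put("row2", "cf", "col", b"b")
t.put("row1", "cf", "col", b"a")
print([k for k, _ in t.scan("row0", "row9")])  # ['row1', 'row2']
```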
Distributed Filesystems
HDFS (Apache)
Hadoop part
Optimized for streaming access of large files.
Follows write once read many ideology.
Doesn't support random read/write.
Good for MapReduce jobs that are primarily I/O-bound
with fixed memory and sporadic CPU.
Tachyon (AMPLab)
Memory-centric
Hadoop/Spark-compatible without changes
High performance by leveraging lineage
information and using memory aggressively
40 contributors from over 15 institutions,
including Yahoo, Intel, and Red Hat
High throughput writes
Reads 200x and writes 300x faster than HDFS
Latency 17x lower than HDFS
Uses memory (instead of disk) and recomputation (instead of replication) to
produce a distributed, fault tolerant, and high throughput file system.
Data sets are in the gigabytes or terabytes
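The "recomputation instead of replication" idea above can be sketched in a few lines of plain Python. This is a toy model of lineage-based fault tolerance, not the real system: derived blocks remember how they were produced, so a lost in-memory copy is recomputed rather than read from a replica.

```python
# Toy sketch of lineage-based fault tolerance (recomputation instead of
# replication): remember how each block was derived, recompute it on loss.
class LineageStore:
    def __init__(self):
        self.data = {}     # name -> value held in memory
        self.lineage = {}  # name -> (function, input names)

    def put(self, name, value):
        self.data[name] = value  # source data: assumed durable elsewhere

    def derive(self, name, fn, *inputs):
        self.lineage[name] = (fn, inputs)
        self.data[name] = fn(*(self.get(i) for i in inputs))

    def evict(self, name):
        del self.data[name]  # simulate losing the in-memory copy

    def get(self, name):
        if name not in self.data and name in self.lineage:
            fn, inputs = self.lineage[name]
            self.data[name] = fn(*(self.get(i) for i in inputs))  # recompute
        return self.data[name]

store = LineageStore()
store.put("raw", [3, 1, 2])
store.derive("sorted", sorted, "raw")
store.evict("sorted")        # lose the derived block...
print(store.get("sorted"))   # ...and recompute it from lineage: [1, 2, 3]
```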
Query Engine
Drill (Apache)
Low latency
JSON-like internal data model to represent and process data
Dynamic schema discovery. Can query self-describing
data formats such as JSON, NoSQL, Avro, etc.
SQL queries against a petabyte or more of data distributed
across 10,000 plus servers
Drillbit instead of MapReduce
Full ANSI SQL:2003
Low latency distributed execution engine
Unknown schema on HBase,
Cassandra, MongoDB
Fault tolerant
Memory intensive
INPUT: RDBMS, NoSQL, HBase, Hive tables, JSON, Avro, Parquet, FS etc.
Not mature enough?
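The "dynamic schema discovery" point above is the interesting part of Drill's model: the schema is inferred from self-describing records at read time instead of being declared up front. A minimal pure-Python illustration of that idea (not Drill itself, and `discover_schema` is a made-up helper):

```python
import json

# Toy illustration of schema-on-read: infer a schema from self-describing
# JSON records at query time instead of requiring one up front.
records = [json.loads(line) for line in [
    '{"name": "a", "qty": 1}',
    '{"name": "b", "qty": 2, "tag": "new"}',  # schema can vary per record
]]

def discover_schema(rows):
    schema = {}
    for row in rows:
        for field, value in row.items():
            schema.setdefault(field, type(value).__name__)
    return schema

print(discover_schema(records))  # {'name': 'str', 'qty': 'int', 'tag': 'str'}
```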
Impala (Cloudera)
Low latency
HiveQL and SQL92
INPUT: Can use HDFS or HBase without move/transform. Can use Hive tables via the metastore
Integrated with Hadoop, queries Hadoop
Requires specific format for performance
Memory-intensive (when joining two tables, one of them needs to fit in cluster memory)
Claimed to be 30x faster than Hive for queries with multiple MR jobs, but that comparison predates Hive 0.13/Stinger
No fault tolerance
Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet
Fine-grained, role-based authorization with Sentry. Supports Hadoop security
Presto (Facebook)
Real time
Petabytes of data
INPUT: Can use Hive, HDFS, HBase, Cassandra, Scribe
Similar to Impala and Hive/Stinger
Own engine
Declared to be more CPU-efficient than Hive/MapReduce
Only recently open-sourced; community still small
Not fault-tolerant
Read only
Requires specific format for performance
Hive 0.13 with Stinger
A lot of MR micro jobs
Can use HBase
Spark SQL
SQL, HQL or Scala to run on Spark
Can use Hive data also
Spark part
Shark: replaced with Spark SQL
Hive < 0.13
Non-interactive
Long jobs
Based on Hadoop and MR
High latency
Event Input
Kafka (Apache)
Distributed publish-subscribe messaging system
High throughput
Persistent scalable messaging for
parallel data load to Hadoop
Compression for performance,
mirroring for HA and scalability
Usually used for clickstream
Pull output
Use Kafka if you need a highly reliable and scalable enterprise messaging system
to connect many systems, one of which is Hadoop.
Main use case is to ingest data into Hadoop
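The "pull output" point above is the defining trait of this messaging model: consumers pull from their own offset in a persistent log, so many readers can replay the same stream independently. A toy in-process sketch of that idea (not the Kafka protocol or client API):

```python
from collections import defaultdict

# Toy model of a Kafka-style log: producers append to a topic, consumers
# *pull* from their own offset. Illustration only, not the Kafka client API.
class ToyLog:
    def __init__(self):
        self.topics = defaultdict(list)

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, topic, offset, max_records=10):
        # Pull semantics: the consumer asks for records after its offset.
        records = self.topics[topic][offset:offset + max_records]
        return records, offset + len(records)

log = ToyLog()
log.produce("clicks", {"url": "/a"})
log.produce("clicks", {"url": "/b"})

offset = 0
batch, offset = log.poll("clicks", offset)
print(batch, offset)  # both records delivered, offset advances to 2
```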
Flume (Apache)
Distributed system
Collecting data from many sources
Mostly log processing
Push output
A lot of pre-built collectors
OUTPUT: Writes to HDFS, HBase, Cassandra, etc.
INPUT: Can use Kafka
Use Flume if you have a non-relational data source, such as
log files, that you want to stream into Hadoop.
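The push model above follows a source -> channel -> sink pipeline: events are pushed through a buffering channel into a sink (HDFS in real deployments). A toy sketch of that shape, with made-up class names, not the Flume API:

```python
from collections import deque

# Toy sketch of a source -> channel -> sink pipeline: events are *pushed*
# through a buffering channel into a sink. Illustration only, not Flume's API.
class Channel:
    def __init__(self):
        self.buffer = deque()

class MemorySink:
    def __init__(self):
        self.stored = []

    def drain(self, channel):
        while channel.buffer:
            self.stored.append(channel.buffer.popleft())

def tail_source(lines, channel):
    for line in lines:  # e.g. tailing a web-server log file
        channel.buffer.append(line)

channel, sink = Channel(), MemorySink()
tail_source(["GET /a 200", "GET /b 404"], channel)
sink.drain(channel)
print(sink.stored)  # events arrive at the sink in order
```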
Machine Learning
MLlib (Spark)
Spark implementation of common machine-learning
algorithms and utilities, including classification, regression,
clustering, collaborative filtering, dimensionality reduction
Initial contribution from AMPLab, UC Berkeley,
shipped with Spark since version 0.8
Spark part
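One of the algorithm families listed above, clustering, can be sketched with a plain-Python k-means. In the Spark library this runs distributed over RDDs; this toy runs in one process and only illustrates the algorithm:

```python
# Plain-Python sketch of k-means clustering (1-D). The distributed version
# partitions the points across a cluster; the algorithm is the same.
def kmeans(points, centers, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(sorted(kmeans(points, centers=[0.0, 10.0])))  # roughly [1.0, 9.0]
```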
Mahout (Apache)
On top of Hadoop
Using MR
Lucene part
Oozie (Apache)
Schedule Hadoop jobs
Combines multiple jobs sequentially
into unit of work
Integrated with Hadoop stack
Supports MR, Pig, Hive, and Sqoop jobs,
plus system apps, Java, and shell actions
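The core idea above, combining multiple jobs sequentially into one unit of work, can be sketched as a tiny workflow runner. This is an illustration of the concept only; the real workflows are XML definitions executed by a server:

```python
# Toy sketch of chaining jobs sequentially into one unit of work:
# the workflow stops at the first failing step.
def run_workflow(steps):
    completed = []
    for name, action in steps:
        if not action():  # each action reports success/failure
            return completed, f"workflow failed at: {name}"
        completed.append(name)
    return completed, "succeeded"

steps = [
    ("import",  lambda: True),   # e.g. a Sqoop import
    ("cleanse", lambda: True),   # e.g. a Pig script
    ("analyze", lambda: False),  # e.g. a Hive query that fails
]
print(run_workflow(steps))  # (['import', 'cleanse'], 'workflow failed at: analyze')
```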
Hue (Cloudera)
UI for Hadoop and satellites (HDFS, MR,
Hive, Oozie, Pig, Impala, Solr etc.)
Upload files to HDFS, submit Hive queries, etc.
Data analysis
Pig (Apache)
High level scripting language
Can be invoked from code in Java, Ruby, etc.
Can get data from files, streams or other sources
Output to HDFS
Pig scripts are translated into a series of MR jobs
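What a script compiles down to is the classic map, shuffle, reduce sequence. A word count in plain Python shows the three phases a GROUP/COUNT script would turn into (an illustration of the execution model, not actual compiler output):

```python
from collections import defaultdict

# The three MR phases a GROUP ... GENERATE group, COUNT(...) script maps onto.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word, 1  # emit a (key, value) pair per word

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)  # group values by key across mappers
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big deal"])))
print(counts)  # {'big': 2, 'data': 1, 'deal': 1}
```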
Data transfer
Sqoop (Apache)
Transfers bulk data between Apache Hadoop
and structured datastores such as relational databases
Two way replication with both
snapshots and incremental updates.
Import between external datastores,
HDFS, Hive, HBase, etc.
Works with relational databases such as
Teradata, Netezza, Oracle, MySQL, Postgres, etc.
INPUT: Can access data in Hadoop via Hive, Impala, Spark SQL,
Drill, Presto, or any ODBC source in Hortonworks, Cloudera,
DataStax, MapR distributions
OUTPUT: reports, UI web, UI client
Clustered. Nearly linear scalability
Can access traditional DB
Can explore and visualize data
Knox (Apache)
Provides single point of authentication
and access to services in Hadoop
Graph analytics
GraphX (Spark)
Spark part
API for graphs and graph-parallel computation
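"Graph-parallel computation" here means the Pregel style of execution: vertices exchange messages along edges, then aggregate what they receive. A toy single superstep computing each vertex's in-degree, purely for illustration:

```python
from collections import defaultdict

# Toy sketch of one Pregel-style superstep: each vertex "sends" a message
# along its out-edges, and messages are aggregated at the target vertex.
# Here the aggregate is a count, i.e. each vertex's in-degree.
edges = [("a", "b"), ("a", "c"), ("b", "c")]

def superstep(edges):
    inbox = defaultdict(int)
    for src, dst in edges:
        inbox[dst] += 1  # message from src arrives at dst and is summed
    return dict(inbox)

print(superstep(edges))  # {'b': 1, 'c': 2}
```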