Big Data, NoSQL, and Hadoop terms explained

What is Big Data?

Big Data refers to technologies and initiatives that involve data that is too diverse, fast-changing or massive for conventional technologies, skills and infrastructure to address efficiently. Said differently, the volume, velocity or variety of the data is too great for conventional tools to handle.

‘Big Data’ describes data sets so large and complex they are impractical to manage with traditional software tools.

Volume: A Boeing 737 is often cited as generating 240 terabytes of flight data during a single flight across the US.
Velocity: Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
Variety: It isn’t just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.

Relational databases

The architecture behind RDBMS is such that data is organized in a highly-structured manner, following the relational model.

In a relational database, individual records (e.g., ‘orders’) are stored as rows in tables, with each column storing a specific piece of data about that record (e.g., ‘order_no’, ‘order_date’, etc.), much like a spreadsheet. Related data is stored in separate tables and joined together when more complex queries are executed. For example, ‘dept’ might be stored in one table and ‘emp’ in another. When a user wants to find the department number of an employee, the database engine joins the ‘emp’ and ‘dept’ tables to get all the necessary information.
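The ‘emp’ and ‘dept’ join described above can be sketched with Python's built-in sqlite3 module. The column names (empno, ename, deptno, dname) and the sample rows are assumptions made up for this illustration.

```python
import sqlite3

# In-memory database; table contents are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (deptno INTEGER PRIMARY KEY, dname TEXT);
    CREATE TABLE emp  (empno INTEGER PRIMARY KEY, ename TEXT,
                       deptno INTEGER REFERENCES dept(deptno));
    INSERT INTO dept VALUES (10, 'ACCOUNTING');
    INSERT INTO dept VALUES (20, 'RESEARCH');
    INSERT INTO emp  VALUES (1, 'SMITH', 20);
    INSERT INTO emp  VALUES (2, 'CLARK', 10);
""")


def department_of(ename):
    """Join emp and dept to find an employee's department number and name."""
    return conn.execute(
        "SELECT d.deptno, d.dname "
        "FROM emp e JOIN dept d ON e.deptno = d.deptno "
        "WHERE e.ename = ?",
        (ename,),
    ).fetchone()


print(department_of("SMITH"))
```

The employee row stores only the department number; the department name lives in ‘dept’ and is pulled in by the join at query time.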

Features Of RDBMS

1. SQL databases are table-based databases.

2. Data is stored in rows and columns.

3. Each row contains a unique instance of data for the categories defined by the columns.

4. A primary key is provided to uniquely identify each row.
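As a small illustration of the primary key feature, the sketch below uses Python's sqlite3 module and a hypothetical ‘orders’ table. Inserting a second row with an existing key is rejected by the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# order_no is the primary key, so each value may appear only once.
conn.execute(
    "CREATE TABLE orders (order_no INTEGER PRIMARY KEY, order_date TEXT)"
)
conn.execute("INSERT INTO orders VALUES (1, '2024-01-15')")


def try_insert(order_no, order_date):
    """Return True if the row was inserted, False if the key already exists."""
    try:
        conn.execute(
            "INSERT INTO orders VALUES (?, ?)", (order_no, order_date)
        )
        return True
    except sqlite3.IntegrityError:
        # Raised when the primary key constraint is violated.
        return False


new_key = try_insert(2, "2024-01-16")        # a fresh key is accepted
duplicate_key = try_insert(1, "2024-01-17")  # key 1 is already taken
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```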

Limitations of SQL databases

Scalability: Relational databases are usually scaled up on powerful servers that are expensive and difficult to manage. To scale out, the database has to be distributed across multiple servers, and handling tables that span different servers is difficult. This makes it hard to meet the requirements of Big Data.

Complexity: In a SQL database, data has to fit into tables. If your data doesn’t fit naturally into tables, you need to design a database structure that is complex and, again, difficult to manage. As a result, it is not agile enough to meet the changing requirements of today’s businesses.

SQL databases emphasize the ACID properties (Atomicity, Consistency, Isolation and Durability).
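To make the "A" in ACID concrete, here is a minimal sketch of an atomic transfer using Python's sqlite3 module. The ‘account’ table, the names, and the balances are invented for the example. The point is only that a failed transaction leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(src, dst, amount):
    """Move money atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit
            # above is rolled back as well.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances():
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM account"))


ok = transfer("alice", "bob", 30)       # succeeds
failed = transfer("alice", "bob", 500)  # overdraws alice, so it is undone
```

After the failed transfer, bob's balance does not include the 500 that was briefly credited, because the whole transaction was rolled back.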

Examples: Oracle, MySQL, DB2, SQL Server

What is NoSQL?

NoSQL (commonly read as “Not Only SQL”) represents a different family of databases that allows for high-performance, agile processing of information at the massive scale that Big Data requires.

NoSQL databases do not impose a fixed structure and are able to accept all types of data: structured, semi-structured, and unstructured.

NoSQL centers on the concept of distributed databases, where unstructured data may be stored across multiple processing nodes, and often across multiple servers, which helps performance. This distributed architecture allows NoSQL databases to scale horizontally: as data continues to grow, more servers can be added without compromising performance.
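One simple way to picture spreading data across nodes is hash-based placement, sketched below. The node names and the modulo scheme are assumptions chosen for brevity, not how any particular NoSQL product works. Plain modulo placement moves most keys when the node count changes, which is why real systems commonly use schemes such as consistent hashing instead.

```python
import hashlib


def node_for(key, nodes):
    """Pick a node for a key by hashing the key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer.
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


# Hypothetical cluster of three nodes.
nodes = ["node-a", "node-b", "node-c"]

# Distribute 1000 illustrative keys and record where each one lands.
placement = {}
for i in range(1000):
    key = f"user:{i}"
    placement.setdefault(node_for(key, nodes), []).append(key)

for node in sorted(placement):
    print(node, len(placement[node]))
```

Because the placement depends only on the key, any client can compute which node holds a record without asking a central coordinator.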

The relational data model is based on defined relationships between tables, which themselves are defined by a fixed column structure, all of which is explicitly organized in a database schema. It is all very strict and uniform. A NoSQL database is schema-less, so data can be inserted without any predefined schema, and the format or data model can be changed at any time without disrupting the application.
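The schema-less idea can be sketched in a few lines of Python. The in-memory list below merely stands in for a NoSQL collection, and the record fields are made up. What matters is that the second record introduces new fields without any schema change.

```python
import json

# Stand-in for a schema-less collection (an assumption for illustration).
collection = []


def insert(record):
    """Store any JSON-serializable record; no schema is checked."""
    collection.append(json.dumps(record))


# The first record has two fields.
insert({"name": "Ada", "email": "ada@example.com"})
# A later record adds new and nested fields without altering anything else.
insert({"name": "Lin", "phones": ["555-0100"], "address": {"city": "Pune"}})


def fields_in_use():
    """Return every top-level field name that appears in the collection."""
    names = set()
    for raw in collection:
        names.update(json.loads(raw).keys())
    return names


print(sorted(fields_in_use()))
```

In a relational table, the second insert would first require altering the table to add the new columns.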

Simply put, NoSQL is a non-relational and largely distributed database system that enables rapid, ad-hoc organization and analysis of extremely high-volume, disparate data types.

The Benefits of NoSQL

When compared to relational databases, NoSQL databases are often more scalable and can provide better performance at scale, and their data model addresses several issues that the relational model is not designed to address:

Large volumes of rapidly changing structured, semi-structured, and unstructured data

Agile sprints, quick schema iteration, and frequent code pushes

Object-oriented programming that is easy to use and flexible

Geographically distributed scale-out architecture instead of expensive, monolithic architecture


What is Hadoop?

Hadoop is not a type of database, but rather a software ecosystem that allows for massively parallel computing. It is an enabler of certain types of NoSQL distributed databases (such as HBase), which can allow data to be spread across thousands of servers with little reduction in performance.

Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop application runs in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale from a single server to thousands of machines, each offering local computation and storage.

A staple of the Hadoop ecosystem is MapReduce, a computational model that takes intensive data processing jobs and spreads the computation across a large number of servers (generally referred to as a Hadoop cluster).
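The MapReduce model can be illustrated with the classic word count, sketched here in plain Python. This runs in a single process and does not use Hadoop's API. In a real cluster the map and reduce steps would run as separate tasks on different machines, with the framework handling the shuffle between them. The function names are chosen for this example only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

Each call to map_phase depends on only one line, and each call to reduce_phase on only one key, which is what makes the work easy to spread across many servers.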


Types of NoSQL databases

Key-Value Store

A key-value store is essentially one big hash table of keys and values.

The key-value type uses a hash table in which there is a unique key and a pointer to a particular item of data. A bucket is a logical group of keys, but buckets don’t physically group the data. Identical keys can exist in different buckets.

Performance is greatly enhanced by the cache mechanisms that accompany the mappings. To read a value you need to know both the key and the bucket, because the real key is a hash of (bucket + key).

This type of database allows clients to read and write values using a key as follows:

  • Get(key), returns the value associated with the provided key.
  • Put(key, value), associates the value with the key.
  • Multi-get(key1, key2, .., keyN), returns the list of values associated with the list of keys.
  • Delete(key), removes the entry for the key from the data store.
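The four operations listed above, together with the bucket idea, can be sketched as a small in-memory class. This is a toy built on a Python dictionary, not the API of Riak or any other product, and the method signatures are assumptions for the illustration.

```python
class KeyValueStore:
    """Toy key-value store whose real key is the (bucket, key) pair."""

    def __init__(self):
        self._data = {}  # maps (bucket, key) -> value

    def put(self, bucket, key, value):
        """Associate the value with the key inside the given bucket."""
        self._data[(bucket, key)] = value

    def get(self, bucket, key):
        """Return the value for the key, or None if it is absent."""
        return self._data.get((bucket, key))

    def multi_get(self, bucket, *keys):
        """Return the values for several keys, in the order requested."""
        return [self.get(bucket, key) for key in keys]

    def delete(self, bucket, key):
        """Remove the entry for the key; a missing key is ignored."""
        self._data.pop((bucket, key), None)


store = KeyValueStore()
# The same key "42" lives independently in two different buckets.
store.put("users", "42", "Ada")
store.put("orders", "42", "Laptop")
print(store.get("users", "42"), store.get("orders", "42"))
```

Note that the store can only look values up by key. It has no way to ask "which users are named Ada?", which is the gap that document stores address.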

Examples are Riak and Amazon S3 (Dynamo).

Document Store NoSQL Database

A document store is quite similar to a key-value store: the data is still a collection of key-value pairs. The difference is that the stored values (referred to as “documents”) carry some structure and encoding of the managed data. XML, JSON (JavaScript Object Notation), and BSON (a binary encoding of JSON objects) are common standard encodings.

One key difference between a key-value store and a document store is that the latter embeds attribute metadata associated with the stored content, which essentially provides a way to query the data based on its contents.
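Querying by content can be sketched as follows. The documents, their fields, and the find helper are all hypothetical. A real document store would typically use indexes or views rather than scanning every document, so this only shows the idea that the database can see inside the stored values.

```python
import json

# Documents are stored as JSON text under an ID (illustrative data).
documents = {
    "doc1": json.dumps({"type": "book", "title": "Dune", "year": 1965}),
    "doc2": json.dumps({"type": "film", "title": "Dune", "year": 2021}),
    "doc3": json.dumps({"type": "book", "title": "Emma", "pages": 474}),
}


def find(field, value):
    """Return the IDs of documents whose field equals the given value."""
    matches = []
    for doc_id, raw in documents.items():
        doc = json.loads(raw)
        # Documents lacking the field simply do not match.
        if doc.get(field) == value:
            matches.append(doc_id)
    return sorted(matches)


print(find("type", "book"))
print(find("title", "Dune"))
```

Documents with different fields (doc3 has pages but no year) coexist in the same collection and can still be searched together.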

Apache CouchDB is an example of a document store. CouchDB uses JSON to store data.

Column Store NoSQL Database

In a column-oriented NoSQL database, data is stored in cells grouped in columns rather than in rows. Columns are logically grouped into column families. A column family can contain a virtually unlimited number of columns, which can be created at runtime or in the schema definition. Reads and writes are done using columns rather than rows.

In comparison, most relational DBMSs store data in rows. The benefit of storing data in columns is fast search, access and data aggregation. A relational database stores a single row as a contiguous disk entry, and different rows are stored in different places on disk. A columnar database stores all the cells of a column as a contiguous disk entry, which makes search and access over that column faster.
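The difference between the two layouts can be sketched with ordinary Python lists. The sample records are invented, and in-memory lists only approximate what happens on disk. The point is which values have to be touched to aggregate one column.

```python
# Row layout: each record is kept together (illustrative data).
rows = [
    {"id": 1, "city": "Pune", "amount": 120},
    {"id": 2, "city": "Oslo", "amount": 80},
    {"id": 3, "city": "Lima", "amount": 200},
]

# Column layout: all values of one column are kept together.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_from_rows():
    """Aggregate by visiting every record and picking out one field."""
    return sum(row["amount"] for row in rows)


def total_from_columns():
    """Aggregate by reading a single contiguous column."""
    return sum(columns["amount"])


print(total_from_rows(), total_from_columns())
```

Both functions return the same total, but the column version never looks at the id or city values, which is where the speedup for aggregation comes from.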

The best known examples are Google’s BigTable, and HBase and Cassandra, which were inspired by BigTable.

Graph-Based NoSQL Database

In a graph-based NoSQL database, you will not find the rigid format of SQL or the tables-and-columns representation. A flexible graph representation is used instead, which is well suited to addressing scalability concerns. Graph structures are built from nodes, edges and properties, and provide index-free adjacency: each node holds direct references to its neighbors, so following a relationship does not require an index lookup. Data can also be easily transformed from one model to another using a graph-based NoSQL database.
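Nodes, edges and properties can be sketched with a small Python class in which every node keeps direct references to its neighbors. The Node class, the FRIEND label, and the sample people are hypothetical and do not reflect the API of any graph database.

```python
class Node:
    """A graph node with properties and direct references to neighbors."""

    def __init__(self, name, **properties):
        self.name = name
        self.properties = properties
        self.edges = []  # list of (label, target Node) pairs

    def connect(self, label, target):
        """Add a labeled, directed edge from this node to the target."""
        self.edges.append((label, target))

    def neighbors(self, label):
        """Follow edges with the given label; no global index is consulted."""
        return [target for edge_label, target in self.edges
                if edge_label == label]


def friends_of_friends(node):
    """Return names reachable in two FRIEND hops, excluding direct friends."""
    direct = node.neighbors("FRIEND")
    found = set()
    for friend in direct:
        for candidate in friend.neighbors("FRIEND"):
            if candidate is not node and candidate not in direct:
                found.add(candidate.name)
    return sorted(found)


alice = Node("alice", city="Pune")
bob = Node("bob")
carol = Node("carol")
alice.connect("FRIEND", bob)
bob.connect("FRIEND", alice)
bob.connect("FRIEND", carol)

print(friends_of_friends(alice))
```

In a relational database the same question would typically need a self-join on a friendship table for each hop, while here each hop is just a walk along stored references.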

InfoGrid and InfiniteGraph are examples of graph-based databases.