With the constantly increasing volume of information generated in the world, people have realized the potential of exploring and analyzing these data in order to extract useful information. There are multiple sources of data out there; some of the most important are:
- transactional data, including everything from stock prices to bank data and individuals' purchase histories;
- social network content (which is highly unstructured, free-form information);
- sensor data, much of it coming from what is commonly referred to as the Internet of Things (IoT). This sensor data might be anything from measurements taken by robots on an auto maker's manufacturing line, to location data on a cell phone network, to instantaneous electrical usage in homes and businesses, to passenger boarding information on a transit system.
In the past, most of these data (those that existed then) were simply archived. Their sheer size raised performance and storage issues, and at some point the archived data were discarded. Only a small part of it, the structured portion, was stored in relational databases and analyzed using traditional Business Intelligence tools.
Now, information is at the center of new opportunities, and companies need deeper insights. Being able to analyze different types of data, structured or unstructured, at rest or in motion as they are produced, reveals trends and patterns that can influence the way companies do business.
Big Data means exactly that: extracting insight from a high volume, variety, and velocity of data in a timely and cost-effective manner. Let's examine the three components:
- volume: how big is big? Big means anywhere from terabytes to zettabytes;
- variety: managing and benefiting from diverse data types and data structures;
- velocity: analyzing streaming data and large volumes of persistent data.
Due to the three components mentioned above, traditional ways of analyzing data are not applicable. Most of the data is unstructured and does not fit into relational structures that could be analyzed with traditional Business Intelligence tools. Moreover, even if we managed to structure the data, no existing database or data warehouse solution would be able to handle such volumes. More importantly, trying to structure the data might lose valuable information, because everything must be forced to fit a schema defined by strict rules.
In order to use big data, you need tools that span multiple physical and/or virtual machines working in concert to process all of the data in a reasonable span of time. Moreover, the processing should be moved to where the data is, rather than trying to centralize the data; no centralized system would be able to handle it.
One of the best-known methods for turning raw data into useful information is MapReduce. MapReduce is a method for taking a large data set and performing computations on it across multiple computers, in parallel. In essence, MapReduce consists of two parts. The Map function does sorting and filtering, taking data and placing it inside categories so that it can be analyzed. The Reduce function provides a summary of this data by combining it all together. While largely credited to research that took place at Google, MapReduce is now a generic term and refers to a general model used by many technologies.
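The two phases can be sketched in miniature with the classic word-count example. This is a hypothetical, single-machine Python sketch: a real MapReduce framework would run many map and reduce tasks in parallel across a cluster, but the data flow is the same.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort: group the pairs by key; Reduce: sum each group's counts.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = dict(reduce_phase(map_phase(docs)))
print(counts["the"])  # "the" appears once in each of the 3 documents -> 3
print(counts["fox"])  # "fox" appears in 2 documents -> 2
```

The sort between the two phases stands in for the "shuffle" step a real framework performs: it guarantees that all pairs sharing a key end up at the same reducer.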
There are many tools for analyzing big data, but one of the most influential and established is Apache Hadoop. Apache Hadoop is a framework for storing and processing data at a large scale, and it is completely open source. Hadoop can run on commodity hardware, making it easy to use with an existing data center, or even to conduct analysis in the cloud. Hadoop is broken into four main parts:
- The Hadoop Distributed File System (HDFS), which is a distributed file system designed for very high aggregate bandwidth;
- YARN, a platform for managing Hadoop's resources and scheduling programs which will run on the Hadoop infrastructure;
- MapReduce, as described above, a model for doing big data processing;
- A common set of libraries for other modules to use.
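To see how the MapReduce part fits alongside HDFS and YARN, consider Hadoop's Streaming interface, which lets the map and reduce steps be written as plain scripts reading lines on stdin and writing tab-separated key/value pairs on stdout. The sketch below mimics that protocol for a hypothetical sensor-data job (finding each sensor's maximum reading); the record format and the in-memory driver are illustrative, since in a real cluster HDFS supplies the input splits and YARN schedules the tasks.

```python
def mapper(lines):
    # Map step: emit "sensor_id<TAB>reading" for each input record,
    # the tab-separated format Hadoop Streaming expects.
    out = []
    for line in lines:
        sensor_id, reading = line.split(",")
        out.append(f"{sensor_id}\t{reading}")
    return out

def reducer(sorted_lines):
    # Reduce step: pairs arrive sorted by key, so each sensor's readings
    # are consecutive and the maximum can be tracked in a single pass.
    results = {}
    for line in sorted_lines:
        sensor_id, reading = line.split("\t")
        value = float(reading)
        if sensor_id not in results or value > results[sensor_id]:
            results[sensor_id] = value
    return results

# In-memory driver standing in for the cluster: the sort plays the
# role of the shuffle phase between map and reduce.
records = ["s1,21.5", "s2,19.0", "s1,23.1", "s2,18.4"]
max_per_sensor = reducer(sorted(mapper(records)))
```

Because the mapper and reducer only touch stdin-style lines, the same logic could be packaged as two standalone scripts and submitted to a Hadoop cluster, with the framework handling distribution, sorting, and fault tolerance.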
On top of Hadoop, many applications have been developed to support data processing and analysis. Tools like Apache HBase, Pig, Spark, Hive, Tez, and Mahout offer much more flexibility in data processing and analysis, using the Hadoop APIs while providing users a friendlier interface, specialized query languages, and a more productive programming environment for development.