The term “Big Data” refers to enormous volumes of data, whether structured, semi-structured or unstructured. Organizing this data is of prime importance: on the way to organizing it, great volumes of it must be analyzed, and the results feed into your business strategy and workflow management. Though the term “big data” is relatively new, it is gaining more and more weight in server management.
Big Data is usually characterized by three attributes, commonly known as the 3Vs: volume, velocity and variety.
“Volume” refers to the sheer amount of data. Any web application stores more data every day. Take social media, for example, where photos, videos, messages and comments are uploaded and downloaded every second; it is hard even to imagine how much space all this content occupies. Terabytes of data are added constantly, and they certainly need to be organized.
Velocity is the measure of how fast the data is coming in. Again, social media platforms receive enormous amounts of content and somehow have to deal with it. Computing power must be sufficient to process information as fast as it arrives, not merely to store its huge volume. Organizations therefore try to apply compute resources to big data rationally in order to achieve the desired velocity; it can take thousands of servers working collaboratively to keep applications performing well. Velocity also matters greatly for machine learning and artificial intelligence, since analytical processes mimic perception by finding and using patterns in the collected data.
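The idea of many servers working collaboratively on one dataset can be illustrated with a minimal Python sketch. This is not Hadoop code: it simply splits a dataset into chunks and aggregates them with a pool of workers standing in for the servers, with the chunk size and worker count chosen arbitrarily for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(chunk):
    """Simulate one server aggregating its share of the data."""
    return sum(chunk)

def parallel_total(values, workers=4):
    """Split the data across workers and combine the partial results."""
    size = max(1, len(values) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(chunk_sum, chunks))
```

The final result is the same as summing the data on one machine; the point is that each worker only ever touches its own chunk, which is what lets the approach scale to thousands of nodes.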
Variety refers to the heterogeneity of the data. In practice the data varies widely and is often unstructured: text documents, email, video, audio and so on. This wide variety calls for different approaches, and different techniques, to store all the raw data.
Once data exhibits the 3Vs, it needs a platform to analyze it, and Hadoop is the usual choice. Hadoop is maintained by the Apache Software Foundation, and it is worth mentioning that companies such as IBM, Oracle, Microsoft and SAP use it; today it is practically the standard for Big Data systems. The framework was born in 2006, first at Yahoo, inspired in part by ideas Google had set out in its technical papers. Today, Hadoop has become a complex ecosystem of components and tools, some of which are packaged in commercial distributions of the framework.
Hadoop lets you run applications on clusters with thousands of hardware nodes and manage thousands of terabytes of data. Its distributed file system provides fast data transfer rates between nodes and allows the system to continue functioning when a node fails. This approach reduces the risk of total system failure and unexpected data loss, even if a large number of nodes become inoperative. As a result, Hadoop rapidly emerged as a base for large-scale data processing tasks, such as scientific analysis, business and sales planning, and working with huge volumes of sensor data, including from the Internet of Things.
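Surviving node failure works because each block of a file is stored on several nodes at once. The following is a toy Python sketch of that replication idea, not HDFS itself: the block size and node names are invented for the example (real HDFS defaults to 128 MB blocks and a replication factor of 3).

```python
from itertools import cycle

BLOCK_SIZE = 4    # toy value; real HDFS blocks default to 128 MB
REPLICATION = 3   # HDFS's default replication factor

def place_blocks(data, nodes, block_size=BLOCK_SIZE, replicas=REPLICATION):
    """Split data into blocks and assign each block to `replicas` distinct nodes."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    ring = cycle(nodes)
    placement = {}
    for idx in range(len(blocks)):
        targets = set()
        while len(targets) < min(replicas, len(nodes)):
            targets.add(next(ring))
        placement[idx] = targets
    return blocks, placement

def readable_after_failure(placement, failed_node):
    """A file survives a node failure if every block has a surviving replica."""
    return all(targets - {failed_node} for targets in placement.values())
```

With three replicas per block, any single node can drop out and every block still has two live copies, which is why the cluster keeps functioning.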
By running on commodity server clusters, Hadoop offers a low-cost, high-performance way to implement an architecture dedicated to analytical processing. As these capabilities gained popularity, the framework spread across the industry to support reporting and analytics applications that combine structured data with newer semi-structured and unstructured forms.
Organizations can deploy Hadoop components and supporting software in their own data centers. However, most big data projects depend on short-term use of substantial IT resources, a usage pattern best suited to highly scalable cloud services such as Amazon Web Services (AWS), Google Cloud Platform and Microsoft Azure. Public cloud providers often support Hadoop components through basic services such as AWS Elastic Compute Cloud and Simple Storage Service, but there are also services designed specifically for Hadoop workloads, such as AWS Elastic MapReduce, Google Cloud Dataproc and Microsoft Azure HDInsight.
The Hadoop framework includes various open source modules, all built around a set of core modules for capturing, processing, managing and analyzing large amounts of data.
These modules are:
Hadoop Common. A set of libraries and utilities on which the other components are based.
Hadoop Distributed File System (HDFS). This file system presents a conventional hierarchical directory structure but distributes files across a set of storage nodes in a Hadoop cluster.
Hadoop YARN. This module handles job scheduling, allocates cluster resources to running applications, and arbitrates when there is a resource conflict. It also monitors the execution of jobs.
Hadoop MapReduce. A programming model and execution framework for processing batch applications in parallel.
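The MapReduce model behind that last module can be sketched in plain Python. This is the classic word-count example, with the map, shuffle and reduce phases written as ordinary functions; in real Hadoop the framework distributes these phases across the cluster, and the sample sentences here are invented for illustration.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each key's values into a final count."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data needs big tools", "hadoop processes big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

Because each map call sees only one line and each reduce call sees only one word's values, both phases can run on thousands of nodes at once, which is exactly what makes the model suitable for batch processing at Hadoop's scale.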
The evolution of Hadoop into a vast ecosystem has also created a new market that has transformed analytics, extending both the range of potential applications and the types of data companies can now use in them.