Friday, October 23, 2015

Tools For the Big Data Frontier, Part 2

Alfred Jacob Miller, "Rendezvous on the Green River", 1837 (source:

And this I believe: that the free, exploring mind of the individual human is the most valuable thing in the world. And this I would fight for: the freedom of the mind to take any direction it wishes, undirected. And this I must fight against: any idea, religion, or government which limits or destroys the individual. - John Steinbeck

There are few symbols as iconic to a culture as the Mountain Man is to America. Fiercely independent and highly individualistic, these intrepid souls forsook the comfort and safety of civilized life in order to be free of any restraints or shackles and struck out into the wilds, depending only on their wits, raw courage and fieldcraft to survive against the elements, ferocious predators and a broad selection of mutually antagonistic native tribes.

At the height of the Mountain Man era during the opening of the American West, an annual summer gathering was organized that brought together fur trading companies and trappers to trade accumulated pelts, skins and furs for needed supplies & equipment. Held between 1825 and 1840 during the apex of the fur trade, the Mountain Man Rendezvous was almost always staged in a well-known valley in Northern Utah or Western Wyoming.

Though animal pelts (especially those of the northern beaver) commanded attractive prices on the eastern seaboard and across the Atlantic, the mountain men needed an expensive collection of gear to harvest nature's bounty. The list of tools and implements, considered leading technology for the era, is almost endless - a Hawken muzzle loading rifle of 50 caliber or larger, lead shot, bars of lead with specialized tools to allow the trapper to make his own shot as needed, a powder horn with gunpowder, a toolkit for loading and maintaining the weapon, one or more single shot pistols, an assortment of knives, axes and hatchets, a sharpening stone, multiple iron spring traps for a variety of game, a hodgepodge of tools such as saws, needles, scrapers, awls, pliers, hammers and mauls, nails, a plane for shaping wood, augers, a flint with a piece of iron, shovels, picks, canteens or water skins, pots and pans for cooking, along with basic supplies including salt, sugar, coffee, cornmeal, tobacco and such.


In the first installment of this series, we discussed Hadoop and its suitability for exploring the new Big Data frontier. Today we will examine Hadoop more thoroughly - specifically, how it serves as a framework for an extensive assortment of specialized software tools that support data scientists as they explore massive datasets to discover hidden insights and value.

The Apache Hadoop Distribution

Just as a rifle and traps were the most vital pieces of equipment in a mountain man's kit, there are certain programs and utilities which are core to the effectiveness of Hadoop. This set of software consists of the following tools:

Hadoop Common - a collection of libraries, utilities and tools developed to keep Hadoop functionality and programming as simple, easy to use and as far from 'assembly' level as possible. Included are file system, serialization and RPC (remote procedure call) libraries. The RPC library is particularly robust, as it must support client demands to run a given program on one server which requires a subroutine that runs on another. Such diffusion of functions can be a problem in terms of execution, and the Apache distribution has taken this concern well into account.
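
To make the idea of a remote procedure call concrete, here is a minimal in-process sketch in Python. It is not Hadoop's RPC library (which is written in Java and uses its own wire format); the JSON encoding and the names PROCEDURES, handle_request and call_remote are assumptions made purely for illustration. The caller serializes a procedure name and its arguments to bytes, the "remote" side decodes them, runs the procedure and serializes the result back.

```python
import json

# Procedures the "remote" node is willing to run (hypothetical examples).
PROCEDURES = {
    "add": lambda a, b: a + b,
    "word_count": lambda text: len(text.split()),
}


def handle_request(raw: bytes) -> bytes:
    """Server side: decode the request, run the procedure, encode the reply."""
    request = json.loads(raw.decode("utf-8"))
    func = PROCEDURES.get(request["method"])
    if func is None:
        reply = {"error": "unknown method " + repr(request["method"])}
    else:
        reply = {"result": func(*request["args"])}
    return json.dumps(reply).encode("utf-8")


def call_remote(method: str, *args):
    """Client side: serialize the call, 'send' it, and decode the reply."""
    raw_request = json.dumps({"method": method, "args": list(args)}).encode("utf-8")
    raw_reply = handle_request(raw_request)  # stands in for a network round trip
    reply = json.loads(raw_reply.decode("utf-8"))
    if "error" in reply:
        raise RuntimeError(reply["error"])
    return reply["result"]


if __name__ == "__main__":
    print(call_remote("add", 2, 3))
    print(call_remote("word_count", "rendezvous on the green river"))
```

In a real cluster the call to handle_request would be replaced by a network hop to another server, which is exactly where the robustness concerns mentioned above come in.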

Hadoop Distributed File System (HDFS) - we briefly touched on this in the previous editorial. HDFS distributes data across a server cluster and makes more than one copy. The default settings are 64MB data blocks (128MB as of Hadoop 2.x) replicated 3 times. The file system has been developed to be highly scalable and, of course, fault tolerant through its redundancy and virtualization. The programming interface is marvelously flexible, supporting Java, C and many other languages.
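
As a rough illustration of what those defaults imply, the Python sketch below computes how many blocks a file occupies and how much raw cluster storage its replicas consume. The function name hdfs_footprint and the 1GB example file are assumptions for illustration, and the default arguments simply mirror the 64MB, 3-replica figures quoted above; real clusters may be configured differently.

```python
MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Return (blocks, block replicas, raw bytes stored) for one file.

    The final block of a file only occupies as much space as it holds,
    so raw storage is the file size times the replication factor.
    """
    # Integer ceiling division: a partly filled last block still counts.
    num_blocks = -(-file_size_bytes // block_size_bytes)
    return num_blocks, num_blocks * replication, file_size_bytes * replication


if __name__ == "__main__":
    blocks, replicas, raw = hdfs_footprint(1024 * MB)  # a hypothetical 1GB file
    print(blocks, "blocks,", replicas, "replicas,", raw // MB, "MB of raw storage")
```

The point of the arithmetic is that fault tolerance is paid for in disk: every byte stored costs three bytes of cluster capacity at the default replication factor.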

MapReduce - a highly flexible framework for distributed computation, this tool is exceptionally powerful and requires a detailed explanation of its functions and duties. MapReduce organizes a server cluster hierarchically as one Master node and multiple Worker nodes, with 'node' being equivalent to 'server.' When a client requests support for a job, MapReduce uses the Master node as the Job Tracker, with each Worker node becoming an individual Task Tracker. The Job Tracker copies the client program to each node, parcels tasks amongst the Task Trackers based on proximity to the data and puts them to work. As each Task Tracker finishes its assignment, it reports back "all done, boss," and the Job Tracker then aggregates results.
HDFS follows this hierarchy in parallel, recognizing the Master node as a 'Name node' that holds the metadata for the entire file system, with individual Data nodes beneath it. At the commencement of a job, HDFS responds to the client query by finding the closest data block for the job, with the Data node responding back to the client. If data changes in that block, the first or 'primary' data node forces all the other data nodes (secondaries) to update their blocks. Then the primary Data node reports back to the client. Any changes to data block distribution or metadata are updated in the Name node.
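
The division of labor between the Job Tracker and its Task Trackers can be sketched in a few lines of Python. This is a single-process simulation, not Hadoop's API: the function names (map_task, shuffle, reduce_task, run_job) and the word-count job are illustrative assumptions, and each "Task Tracker" is just a function call on one split of the data.

```python
from collections import defaultdict


def map_task(split):
    """One map task: emit a (word, 1) pair for every word in its split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped_outputs):
    """Group intermediate values by key between the map and reduce phases."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """One reduce task: sum the counts for a single word."""
    return key, sum(values)


def run_job(splits):
    """Plays the Job Tracker: hand out tasks, then aggregate the results."""
    mapped = [map_task(split) for split in splits]  # one map task per split
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    splits = [
        ["the beaver trap", "the rifle"],  # lines held by worker 1
        ["the trap line"],                 # lines held by worker 2
    ]
    print(run_job(splits))
```

In a real cluster the map tasks run in parallel on the nodes that already hold each split, which is why the Job Tracker assigns work by proximity to the data.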

The hierarchy of servers and operations for both HDFS and MapReduce is illustrated in the diagram below.


The tools and utilities described above are part of an already far-flung and growing ecosystem, pictured below:

Source: Philippe Julio, "Big Data Analytics with Hadoop" (

One can readily discern the hierarchical nature of Hadoop. Nevertheless, the architecture is not a mess of latency-adding protocols and interfaces but is quite efficient. Granted - there are vulnerabilities in such an approach. But one must accept that while Hadoop is clearly fault-tolerant, there is no such thing as a software architecture which is fault-immune.

As is evident from the above ecosystem diagram, there are quite a few other tools in the distribution essential to the proper operation of Hadoop. Below is a description of some of those tools (please note: as the Hadoop developer community is quite dynamic, the list and the individual capabilities of the tools continue to expand, and as of this writing there are at least another 20 development projects underway).

The common theme among the tools is that Hadoop developers are instinctively demonstrating the same flexibility, practicality and resilience as the frontiersmen, in that they completely eschew the insecurity-based egotism of NIH (Not Invented Here) development and freely adopt good ideas from outside sources. As we can see below, Hadoop community developers also share the frontiersman's knack for inventing colorful names for people, places and things:

Data Access Tools

HBase - a clone, if you will, of Google's "Bigtable" database. The software is written in Java, but is also accessible from Ruby, C++ and other languages. Like Bigtable, it is column-oriented and distributed. Hardware fault tolerance is coded in. In the software stack, HBase is layered over HDFS.
Though it is described as a database tool, HBase is not a classic RDBMS. It was developed with unstructured data in mind. The DB is, as you might expect, highly scalable. ROOT and META tables assist clients in navigation.
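
One way to picture a column-oriented store of this kind is as a sorted map of row keys, each holding columns (named "family:qualifier") with timestamped versions of every cell. The Python sketch below models that shape with plain dictionaries. It illustrates the data model only and is not the HBase client API; the class TinyTable, its methods and the sample row are hypothetical.

```python
from collections import defaultdict


class TinyTable:
    """Toy model: row key -> 'family:qualifier' -> {timestamp: value}."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp):
        """Store one version of a cell; older versions are kept."""
        self._rows[row_key][column][timestamp] = value

    def get(self, row_key, column):
        """Return the most recent version of a cell, or None if absent."""
        versions = self._rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield row keys in sorted order within [start_row, stop_row)."""
        for key in sorted(self._rows):
            if start_row <= key < stop_row:
                yield key


if __name__ == "__main__":
    table = TinyTable()
    table.put("trapper#001", "info:name", "J. Smith", timestamp=1)
    table.put("trapper#001", "pelts:beaver", 40, timestamp=1)
    table.put("trapper#001", "pelts:beaver", 55, timestamp=2)
    print(table.get("trapper#001", "pelts:beaver"))
    print(list(table.scan("trapper#000", "trapper#999")))
```

Note that rows need not share the same columns, which is part of what makes this layout comfortable for loosely structured data.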

Hive - this tool rides on top of MapReduce and HDFS. A highly flexible utility that can be used with PHP, Python and Java, Hive allows clients to summarize data and perform ad-hoc queries using a SQL-like language, HiveQL.

PIG - a rather amusingly named programming language and framework for MapReduce. It was developed to handle much of the intimate detail for clients in order to simplify programming for jobs that benefit from parallelism.

HCatalog - this is a table and storage management service. The tool offers a table abstraction so that clients need not concern themselves with how or where data is stored.

Data Transfer Tools

SQOOP - for importing existing formatted RDBMS data into Hadoop and vice versa.

FLUME - collects, aggregates and moves large volumes of log data. A fault tolerant utility, it supports batching, compression, filtering and transformation.
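
The four operations named above (filtering, transformation, batching and compression) chain together naturally. The Python sketch below shows one possible shape of such a pipeline on a handful of log lines. It is not Flume's configuration or API: the function names, the "ERROR" keyword and the sample lines are all assumptions for illustration.

```python
import gzip


def filter_events(lines, keyword):
    """Filtering: keep only log lines containing the keyword."""
    return (line for line in lines if keyword in line)


def batch(events, size):
    """Batching: group events into lists of at most `size` items."""
    current = []
    for event in events:
        current.append(event)
        if len(current) == size:
            yield current
            current = []
    if current:  # flush the final, partly filled batch
        yield current


def compress(batch_of_events):
    """Compression: join a batch with newlines and gzip it for transport."""
    return gzip.compress("\n".join(batch_of_events).encode("utf-8"))


def pipeline(lines, keyword, batch_size):
    """Filter, transform (strip whitespace), batch, then compress."""
    events = (line.strip() for line in filter_events(lines, keyword))
    return [compress(group) for group in batch(events, batch_size)]


if __name__ == "__main__":
    logs = ["ERROR disk full ", "INFO ok", "ERROR node lost", "ERROR timeout"]
    for payload in pipeline(logs, "ERROR", batch_size=2):
        print(len(payload), "compressed bytes:", gzip.decompress(payload).decode("utf-8").split("\n"))
```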

Management Tools

OOZIE - for batching and coordinating workflows.

CHUKWA - a data collection tool which repurposes HDFS and MapReduce to monitor large distributed systems.

ZOOKEEPER - detailed system management, including configuration information, synchronization, naming and a variety of other functions.

There are quite a few other tools available under the Apache Hadoop 'umbrella', many of which have been developed by commercial interests. These offerings tend to focus on optimizing, refining and enhancing the existing management/administration features of Hadoop and its current tools & utilities, improving its ability to handle Big Data problems, supporting parallelization in computing and facilitating the development of new tools. Below is a partial list, along with descriptions as applicable.

Mahout - the extension of Hadoop, with its virtualization and parallelism, to AI applications is an intuitively natural one. This tool was developed to support Machine Learning on Hadoop systems. It includes a variety of capabilities, including:
 - Clustering of text documents by topic
 - Classification of new documents
 - Frequent 'itemset mining', wherein items that tend to appear together are grouped based on query activities by clients
 - 'Recommendation mining', where the tool proposes items to a client based on their behavior
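
The last two capabilities can be illustrated with a small co-occurrence count. The Python sketch below is not Mahout's implementation or API (Mahout's algorithms are designed to run at cluster scale); the function names, the minimum-support threshold and the sample "baskets" are assumptions chosen for illustration. It counts how often pairs of items appear together, keeps the frequent pairs, and recommends items that co-occur with what a client already has.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how often each unordered pair of items appears together."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def frequent_pairs(baskets, min_support):
    """Itemset mining, pairs only: keep pairs seen in at least min_support baskets."""
    return {pair: n for pair, n in pair_counts(baskets).items() if n >= min_support}


def recommend(baskets, owned, top_n=3):
    """Recommendation: rank items by how often they co-occur with owned items."""
    scores = Counter()
    for (a, b), n in pair_counts(baskets).items():
        if a in owned and b not in owned:
            scores[b] += n
        elif b in owned and a not in owned:
            scores[a] += n
    return [item for item, _ in scores.most_common(top_n)]


if __name__ == "__main__":
    baskets = [
        ["rifle", "powder", "lead"],
        ["rifle", "powder"],
        ["trap", "knife"],
        ["rifle", "lead", "knife"],
    ]
    print(frequent_pairs(baskets, min_support=2))
    print(recommend(baskets, owned={"powder"}))
```

Counting pairs is the simplest case; the same idea extends to larger itemsets, at which point the parallelism Hadoop provides starts to matter.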

WHIRR - supports the deployment and management of cloud services.

AMBARI - a web-based utility for managing Hadoop clusters.

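Cloudera Impala - Cloudera's massively parallel SQL query engine for Hadoop. It runs low-latency SQL queries directly against data stored in HDFS and HBase rather than going through MapReduce. (Cloudera's own Hadoop distribution, with enhanced interfaces, security and other optimizations, is CDH.)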

HUE - an acronym for Hadoop User Experience, begun by Cloudera and since turned over to the open source community. Fundamentally, it boils down to a web-based user interface.

Stinger - a community project that is making Hive more SQL-friendly and improving its ability to handle petabytes of data volume (yes, you read that right.)

POLYBASE - a Microsoft tool (yes, the Dark Side wants to play in this sandbox, too) which provides a framework for SQL work on both RDBMS and non-RDBMS data that bypasses MapReduce.

The above only scratches the surface of all the projects and endeavors undertaken on behalf of Hadoop. There are many other capabilities being created, including backups for the Name node, incorporation of the Cassandra File System (CFS), disaster recovery, security, archiving improvements, data compression and so on. The opportunities for innovation appear to be endless.

A Blue Sky and a Range on the Horizon

Albert Bierstadt, "The Rocky Mountains, Lander's Peak," 1863 (source:

In Hadoop, we are seeing the story of Linux repeating itself - an industry-changing OS being developed by contributors not for a narrow commercial interest, but for the sake of advancing technology itself. I am far from being a collectivist, and despite the contradiction in their general voting patterns, very few scientists and engineers are either. There are things that we do as individuals in High Tech which evoke the same spirit that drove most Mountain Men - that fierce longing to explore a new frontier, regardless of the risk and hardship, simply because it's there.

There will be further editorials on Big Data over the coming weeks and months, including explorations of what it is that data scientists do and what kinds of topics are drawing their attention. Our adventures across the frontier are far from over, dear readers. :-)