Friday, October 9, 2015

Tools for the Big Data Frontier, Part 1

Thomas Moran, "The Grand Canyon of the Yellowstone", 1872 (source:

Go West, young man, and grow up with the country. - John Babson Lane Soule, in an 1851 editorial for the Terre Haute Express

Our first exploration of Big Data began what seems almost a lifetime ago back in July:

In that essay we discovered that the common thread underlying all the leading sectors of High Technology development - the IoT, Robotics, AI, Datacenters, Mobile Computing, Consumer Electronics, Semiconductors and even Automotive - was their mutual convergence on Big Data. At the time, it seemed an apt analogy to depict the immense and growing volume of Big Data as an unbounded ocean. 

In retrospect, however, there is more to Big Data than even the vast sweep of a watery, windy, wave-tossed expanse can properly encompass. There are details within the data that form a tremendous variety of landscapes waiting to be discovered - potentially endless numbers of unique canyons, mountains, valleys, rivers, plains, forests, tundras and deserts. We are literally gazing upon a new frontier in High Tech - an uncharted land the likes of which hasn't been encountered since the first explorers set out from the British and French colonies of North America in the 1600's, followed by the American frontiersmen of the 1700's and early 1800's.

Whether they were mountain men, prospectors, fur trappers, woodsmen, buffalo hunters, scouts & guides for settlers or pioneers themselves, many of their tools were identical - a piece of flint for striking, an axe or adze, a variety of knives for preparing meat, skinning game, shaving and such, a sewing kit with needles and awls for both cloth and hide, basic carpentry instruments for sawing and shaping wood to build log cabins, fences, corrals, fishing baskets and/or poles, simple traps, furniture and many other things, and of course a selection of weaponry - a musket and flintlock pistol with powder, ball and various contrivances for cleaning and maintaining these early firearms, which later on were replaced by repeating rifles and revolvers manufactured by famous company names such as Colt, Winchester, Henry, Sharps and others.

Jim Baker, a Scout for Jim Bridger (source:

There can be no doubt that these wilderness adventurers were a hardy, thrifty bunch who excelled at ingeniousness, adaptability, flexibility and imagination. Nonetheless, the most successful pioneers and explorers had a knack for picking the tools that would best serve to meet the demands of surviving in open country. 

That same challenge of the Wild West is facing today's data scientists. Both the volume of data and the variety of forms in which it comes present problems. When exploring the data, how do you know that you have the right kind for your needs? Can you effectively access it? Once you have it, what are you able to do with it?

Fortunately for today's data scientists, there is no need to visit a dozen or so trading posts, blacksmiths, gunsmiths and other artisans or sources to assemble all the necessary gear for an expedition into the wilderness. All the tools that a data frontiersman needs can be found as a complete and complementary kit with a highly suggestive and particularly appropriate name.

Apache Hadoop

Herman Wendelborg Hansen, "Apache Scouts Trailing", c. 1901 (source:

A very great vision is needed, and the man who has it must follow it as the eagle seeks the deepest blue of the sky. - Crazy Horse, War Leader of the Oglala, one of the seven tribes of the Sioux nation

Hadoop is an open source collection of tools and utilities available thru an Apache license. It is a scalable, fault tolerant system for data storage and processing that has been adopted and enhanced by many companies with truly enormous Big Data issues such as Facebook and Yahoo. It is their vision, spawned from the astounding aggregate of data which these firms have to handle on even an hourly basis, which has guided much of the development of this toolset.

Originally, Hadoop was characterized as a publicly available virtualization tool for datacenters. The description is both right and wrong. Hadoop is indeed distributed over the servers of a datacenter as a JVM virtualization that operates as a layer above a given computing node with its own dedicated linux or windows operating system and particular hardware. However, a true virtualization software is oriented towards splitting a specific 'hard' server into multiple 'virtual' servers and sharing the computing resources. Hadoop, by contrast, lumps all the hard computing nodes together and treats them as one gigantic server. One could, in fact, think of Hadoop as an overlay on a datacenter software stack that functions as that datacenter's OS.

Built from the ground up as an open source data management software running on commodity hardware, there are cost and operational benefits to Hadoop that are unequaled. Its advantages over conventional systems are legion.

The standard practice of IT data management has been to collect data, format it for an RDBMS and then put it in a storage server (or even a storage medium such as tape which is then removed and filed away in a physical library) for safekeeping. The ordered data can then be retrieved and analyzed 'at leisure.' 

This doesn't work at all in the Big Data era, however. Assuming to begin with that you have the appropriate data for the task at hand, its volume could be so staggering that the translation into a database format becomes a formidable task, even to the point that it interferes with further data collection. Such floods of data can bog down an enterprise network, further aggravating problems of formatting, managing, using and storing it. Once stored, retrieval and use can actually be very burdensome on IT resources and involve significant delays (on the order of weeks sometimes), potentially rendering any analysis results useless or of marginal value. There are further issues as well stemming from the potential corruption of raw/unstructured data that has been formatted for a database. 

All of these problems reflect on the lack of flexibility and scalability of the standard approach. Hadoop, though, was developed precisely with these requirements in mind. It supports a distributed file system (known rather unoriginally as the Hadoop Distributed File System or HDFS) and a scheduler which 'thinks' in parallel terms. Raw/unstructured data is collected and stored 'as is', without any conversion that slows down access or collection. This also preserves the data in its original 'pure' form, enhancing its intrinsic value and making it amenable to multiple divergent client needs. 

With Hadoop, scalability is intrinsic to the software architecture. There is no longer a need to continually rewrite code/algorithms. In fact, one can write simple algorithms that keep pace with an increased flow of data, whereas complex algorithms will eventually succumb to a rising data flood and get swamped. This scalability also preserves the original data and reduces the need to archive.

The HDFS (Hadoop Distributed File System) takes in unstructured data, breaks it into blocks (typically 64MB), replicates it a few times (three copies is set as a default) and distributes it across the data center. With the distributed file system loaded up, the data is now durable, supports improved throughput and is very accessible.

With Hadoop, computing and storage are often contiguous (meaning that they are in the same place or 'box.') This inevitably leads to faster computing for the client, as latency is greatly reduced by eliminating the need to use a LAN for data traffic.Furthermore, since HDFS scatters at least three copies of a given dataset, a client no longer has to share a SAN/RAID server with others, wait for their turn and go fishing deep into the storage box for the required data - activities which all sequentially increase latency in a cascade.

In addition to virtualizing multiple servers into one, Hadoop supports searching, log processing, recommendation systems, analytics, video/image analysis and data retention. One can immediately discern how this makes Hadoop particularly well-disposed towards Robotics, AI and the IoT - all future sources of a rising storm of data - with such a mix of capabilities.

With Hadoop, you can use PERL, Java, Python, C++, Ruby, and a variety of other languages. There’s are other, newer languages as well that are specifically oriented to Big Data applications, including Java MapReduce, Pig Latin, Crunch, Hive and others. This flexibility in programming approaches permits clients to make tradeoffs between ease of use, performance and flexibility at different levels of abstraction.

There are some who believe that all of this concern with software to virtualize an manage datacenters, networks and data is, as the sage of Stratford-on-Avon once wrote, 'much ado about nothing.' It is their contention that an increasing prevalence of ultra high speed & wide bandwidth physical 'pipes' at 40, 100 and perhaps eventually 400Gbps will solve all of these data throughput problems. Yet the experience of social networking, finance, media and retail firms in Big Data to date empirically indicate that there is already so much data on storage and processing nodes that it will overwhelm even the fastest networks and datacenters that rely on conventional data management standards. Stated differently: it appears that there is no way to simply 'muscle' Big Data. The need for an intelligent OS like Hadoop that is built for this problem is at this point imperative.

There are other capabilities inherent to Hadoop which take advantage of redundancy, physical nearness, task parallelization and very large sets of data. In order to understand those capabilities, though, we will have to delve into the plethora of utilities, tools and languages dedicated to the Hadoop ecosystem. And that story, dear readers, will of necessity have to wait for a future post.  ;-)