Learning Hadoop

Recently, I have been learning more about Hadoop in my spare time. Two of its concepts really stood out to me. One, most processing is done on plain text files. Two, work is done by chaining (piping) command-line commands together, following Google's concept of Map/Reduce: data is first mapped into the format you want, and then reduced. In SQL terms, the map step is roughly a custom SELECT statement and the reduce step is a GROUP BY aggregation.

Here's a simple example of a Hadoop streaming pipeline using Python:

cat input.tsv | ./mapper.py | sort | ./reducer.py > output.tsv

The process is referred to as a streaming pipeline because the data is streamed (via standard input and standard output) from one command to the next. The input file (tab-separated values) is piped to the Python mapper script, which pipes its results to the sort command, which pipes its results to the Python reducer script, whose output is redirected to a new output.tsv file. The sort step matters: it places identical keys next to each other so the reducer can aggregate each key in a single pass.
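
To make the mapper and reducer roles concrete, here is a minimal Python sketch. It is not taken from a real Hadoop job: the function names and the choice to count rows per first column are assumptions made for illustration, and the pipeline is simulated in-process rather than through the shell.

```python
import itertools


def mapper(lines):
    """Map step: emit a 'key<TAB>1' line for each input row.

    Assumption: the first tab-separated column is the key to count.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key = line.split("\t")[0]
        yield f"{key}\t1"


def reducer(sorted_lines):
    """Reduce step: sum the counts for each run of identical keys.

    Relies on the input already being sorted by key, which is the job
    of the sort command in the shell pipeline.
    """
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for key, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}"


def run_pipeline(lines):
    """In-process stand-in for: cat | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))
```

In an actual streaming setup each function would live in its own script, reading from sys.stdin and printing to standard output, so that the shell's pipes and the sort command do the work that run_pipeline does here.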

Hadoop has other aspects worth mentioning. Because the map and reduce steps operate on independent chunks of data, Hadoop can work through those chunks in parallel. Paired with elastic cloud computing, it has become a de facto standard for big data.
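
The idea of processing chunks independently can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, threads on one machine stand in for Hadoop's worker nodes, and Python threads will not actually speed up CPU-bound work.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def map_chunk(lines):
    """Map step for one chunk: count rows per key (first TSV column)."""
    return Counter(line.split("\t")[0] for line in lines)


def parallel_count(lines, chunk_size=2, workers=4):
    """Map each chunk independently, then reduce by merging partial counts."""
    chunks = split_into_chunks(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # adds counts key by key
    return dict(total)
```

The point of the sketch is that no chunk needs to know about any other chunk until the final merge, which is what lets the map work be spread across many workers.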

Borrowing from Hadoop

Using full-blown Hadoop for my needs may be overkill. I'm not quite at the big-data level yet. However, I found its text-file pipeline pattern to be elegant and simple, especially since it uses batteries-included software (cat, sort, Python, etc.).

The typical database-connected application has many software layers between your application code and the database. For example, a C# app might have:

C# code → C# ORM code → ODBC.Net code → ODBC (C code) → Win32 or Win64 C code.

In my case, it's:

Ruby code → DBI lib (Ruby/C code) → ODBC (C code) → Win32 or Win64 C code.

Using text files obviates the need for a database and database drivers in most of the pipeline. You are only at the mercy of the datastore during the initial pull of data, not during the many stages afterwards. If the data warehouse becomes unavailable after I pull the initial data, it doesn't hold me back, since processing is done on text files. Most database-related code can be eliminated.
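
As a rough sketch of that separation, the Python below writes the pulled rows to a tab-separated file once, and every later stage reads only from that file. The function names and the two-column (key, amount) layout are assumptions for illustration; the rows argument stands in for whatever a real database cursor would return.

```python
import csv


def extract_to_tsv(rows, path):
    """Write rows (any iterable of sequences) to a tab-separated file.

    This is the only step that would need the datastore to be available.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerows(rows)


def read_tsv(path):
    """Yield rows from the extracted file, one list of strings at a time."""
    with open(path, newline="", encoding="utf-8") as handle:
        yield from csv.reader(handle, delimiter="\t")


def total_by_key(path):
    """A later stage that works purely from the file, not the database."""
    totals = {}
    for key, amount in read_tsv(path):
        totals[key] = totals.get(key, 0) + int(amount)
    return totals
```

Once extract_to_tsv has run, total_by_key can be rerun as often as needed, even if the database is down or mid-load.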

Using command-line tools keeps the solution lightweight and free of many external dependencies. Most command-line environments already ship with the commands the pipeline uses, such as cat (for concatenating files), sort (for sorting files), and sometimes grep (for finding lines that match a pattern within files). These commands have been around for many years; they are tried and true.

Utilizing Map/Reduce in Ruby

I recently refactored some Ruby project code to use a pipeline approach (without text files) after it hit an out-of-memory error. Initially, I was retrieving the data, saving it to CSV, then saving that as a worksheet in an Excel file. This approach held too much data in memory. With the pipelined approach, I read in one row of data and send it along the pipeline for processing, one row at a time. The pipeline for the report now looks like this:

# Each row flows through normalization and straight into the worksheet.
reducer.each_with_index do |row, row_index|
  row_normalizer.normalize(row) do |normalized_row|
    xls_worksheet_writer.write(normalized_row, row_index)
  end
end
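
The same row-at-a-time idea can be sketched with Python generators, shown here in Python to match the mapper.py and reducer.py scripts from earlier. Every name below is hypothetical: the normalizer simply strips whitespace, and the writer just records rows in a list where a real one would write to a worksheet.

```python
def read_rows(source):
    """Yield one row at a time instead of loading everything into memory."""
    for row in source:
        yield row


def normalize_rows(rows):
    """Strip whitespace from every cell; a stand-in for the row normalizer."""
    for row in rows:
        yield [str(cell).strip() for cell in row]


class ListWorksheetWriter:
    """Hypothetical writer that records rows instead of writing a worksheet."""

    def __init__(self):
        self.written = []

    def write(self, row, row_index):
        self.written.append((row_index, row))


def run_report(source, writer):
    """Pull rows through the pipeline one at a time; return the row count."""
    count = 0
    for row_index, row in enumerate(normalize_rows(read_rows(source))):
        writer.write(row, row_index)
        count += 1
    return count
```

Because each stage is a generator, only the row currently in flight is held by the pipeline itself, which is the property that avoids the out-of-memory problem.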

Going forward

I have another large, enterprise Ruby application which could take advantage of this pipeline pattern (with text files) as well. Many of its issues are related to Netezza. While loads are running, the database does not guarantee ACID behavior. Insertions and deletions are accepted by the server but are not guaranteed to execute in a timely manner. This causes data duplication, and is why I ultimately had to switch to GUIDs as primary keys. Just yesterday, I noticed that one insertion request, made during the loads, somehow turned into 149 insertions. I've also encountered many occasions when key data tables were empty because a load was deleting everything before reloading.

If I switch to a file-based approach, then only the initial data pull is concerned with the database. The rest of the processing steps are free to continue. This also means debugging will be easier, since the process will be more modular.

In the new map/reduce system, I could break out the pieces more. In a file-based pipeline, any monolithic SQL would be replaced with mappers and reducers which deal with text files. There would be no need to create temporary tables or delete them afterwards. I also find general-purpose programming languages such as Python or Ruby much easier to debug than SQL.

We'll see how it goes...