
Data Structures and Operations

Let us look at the following problem: given a word, we need to find all the words that can appear next to it in a word ladder. The list of words is given in a text file. A word ladder is a sequence of words in which each word is formed by changing one letter of the previous word. For example, the sequence below is a word ladder:

cold -> cord -> card -> ward -> warm

We need to find out the following:
1) Given a word, list all the words that can appear next to it in a word ladder
2) Which word has the longest ladder?
3) How many words can be reached, from a source word, in n or fewer steps?

Let us try to come up with a naive algorithm that, given a word, finds all the words that can appear next to it. The code would look something like the sketch below.
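A minimal sketch of this in Java (the original gists are not reproduced in this post, so this and the later blocks are reconstructions; getDiffCount() is the helper named in the post, while the class and the other names are illustrative):

    import java.util.ArrayList;
    import java.util.List;

    public class WordLadder {

        // Number of positions at which two equal-length words differ.
        // Words of different lengths can never be ladder neighbours.
        static int getDiffCount(String a, String b) {
            if (a.length() != b.length()) return Integer.MAX_VALUE;
            int diff = 0;
            for (int i = 0; i < a.length(); i++) {
                if (a.charAt(i) != b.charAt(i)) diff++;
            }
            return diff;
        }

        // Naive approach: scan the entire list and keep every word that is
        // exactly one letter away from the given word.
        static List<String> getNextWords(String word, List<String> words) {
            List<String> next = new ArrayList<>();
            for (String candidate : words) {
                if (getDiffCount(word, candidate) == 1) next.add(candidate);
            }
            return next;
        }
    }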

The above naive algorithm requires O(n) time: for a given word, we scan the entire list to check whether the difference between each pair of words is 1, and hence the O(n) time. [The function in the sketch above, getDiffCount(), gives the number of differing letters between a pair of words.]

Now, let us try to find the word that has the longest ladder. Making use of the code we have written earlier, we can write a function like the one below.
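Continuing the WordLadder sketch (the method name is illustrative):

    // For every word, count its next words - an O(n) scan per word - and
    // remember the word with the most.
    static String wordWithLongestLadder(List<String> words) {
        String best = null;
        int bestCount = -1;
        for (String word : words) {
            int count = getNextWords(word, words).size();
            if (count > bestCount) {
                bestCount = count;
                best = word;
            }
        }
        return best;
    }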

The runtime of the above code is O(n^2), because we get the next words (an O(n) operation) for each of the n words.

The importance of Data Structures
What if these operations were not one-off but had to be repeated often? Using the naive functions above for them would be sub-optimal. This is where a good data structure can help us. We need to come up with a DS that maintains the state of the program, so that we do not have to re-create that state afresh on every operation. (In the naive functions above, re-creating the state was more expensive than the other work we did.)

Operations define a Data Structure
How do we choose which data structure to use? The golden rule is to choose one that easily supports the operations we plan to perform during the life of the program.

Let us come up with a data structure for the above operations. Looking at the operations we are going to perform, let us represent each word as an element in the DS, as sketched below.
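A sketch of that element in Java, with the members the post describes:

    import java.util.ArrayList;
    import java.util.List;

    class Element {
        String word;                              // the word, stored as a string
        int noOfLadders;                          // number of ladders for this word
        List<Element> links = new ArrayList<>();  // words one letter away
        boolean visited;                          // traversal aid for operation 3
        int distance;                             // distance from source, for operation 3
    }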

Here, the given word is stored as a string in the element. noOfLadders holds the number of ladders for that particular element. links holds the list of all the words that can replace the element's word in a word ladder. The last two members of the element are there to aid us in traversing the DS, specifically to support operation 3 (how many words can be reached, from a source word, in n or fewer steps?).

The data structure (similar to an adjacency list) can be shown diagrammatically as below:

image001.png

Each node holds an element, and its links member holds the list of the words that can replace the current word.

The code to initialize the vector of elements is given below.
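A sketch of that initialization (this also uses java.util.Map; buildElements is an illustrative name):

    static List<Element> buildElements(List<String> words,
                                       Map<String, Integer> wordToVectorIndex) {
        List<Element> elements = new ArrayList<>();
        for (String word : words) {
            Element e = new Element();
            e.word = word;
            e.visited = false;                    // marked as not visited
            e.distance = 0;                       // distance from source set to 0
            wordToVectorIndex.put(word, elements.size());
            elements.add(e);
        }
        return elements;
    }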

Above, wordToVectorIndex is a map that maps a word to a vector index. Each element is marked as not visited and its distance from the source is set to 0. Once the words are collected in the vector, we need to build the ladder DS, where we set the links to point to the list of words that can replace the element's word in a word ladder. The code to do that is given below.
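A sketch (buildLadder is an illustrative name):

    // Compare every pair of elements and link the pairs that differ
    // by exactly one letter.
    static void buildLadder(List<Element> elements) {
        for (int i = 0; i < elements.size(); i++) {
            for (int j = i + 1; j < elements.size(); j++) {
                Element a = elements.get(i);
                Element b = elements.get(j);
                if (getDiffCount(a.word, b.word) == 1) {
                    a.links.add(b);               // store references, not copies
                    b.links.add(a);
                }
            }
        }
    }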

Here, we link a pair of words in their respective links if the difference between the words is 1. Note that a link contains a pointer (reference) to the element, not a copy of it. The built structure is similar to the figure shown earlier.

After building the ladder data structure, we need to update the number of ladders (links) for each element. The code to do that is given below,
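A sketch of that update:

    // An element's number of ladders is simply the size of its links list.
    static void updateLadderCounts(List<Element> elements) {
        for (Element e : elements) {
            e.noOfLadders = e.links.size();
        }
    }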

Now, let us try to come up with the runtimes of the operations using our new data structure.

To find the list of words that can appear next to a word in a word ladder, we can use the following code,
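A sketch (nextWords is an illustrative name; wordToVectorIndex is the map built earlier):

    // Walk the prebuilt links of the word's element - no scan of the whole
    // word list is needed any more.
    static List<String> nextWords(String word, List<Element> elements,
                                  Map<String, Integer> wordToVectorIndex) {
        List<String> next = new ArrayList<>();
        Element e = elements.get(wordToVectorIndex.get(word));
        for (Element link : e.links) {
            next.add(link.word);
        }
        return next;
    }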

The runtime of the above code block is O(m), where m is the number of words that can replace the given word in a ladder. This is a big improvement over the naive solution seen earlier.

Similarly, finding the longest ladder in the set of words is very easy, since we have stored the state in each element. The runtime of the operation is O(n), where n is the number of words. The code to find the longest ladder is given below.
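A sketch of that scan:

    // One O(n) pass over the stored state; no pairwise comparisons needed.
    static String longestLadder(List<Element> elements) {
        Element best = null;
        for (Element e : elements) {
            if (best == null || e.noOfLadders > best.noOfLadders) best = e;
        }
        return (best == null) ? null : best.word;
    }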

Now, how do we find the words that can be reached, from a source word, in n or fewer steps? Using the ladder data structure, we can simply do a depth-limited BFS walk from the source down to the given depth. The code to do it is given below.
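A sketch of that walk (this also uses java.util.ArrayDeque and java.util.Deque; reachableWithin is an illustrative name):

    // Depth-limited BFS from the source word, counting every word reached
    // within n steps. The visited and distance members do the bookkeeping.
    static int reachableWithin(String source, int n, List<Element> elements,
                               Map<String, Integer> wordToVectorIndex) {
        for (Element e : elements) {              // reset traversal state
            e.visited = false;
            e.distance = 0;
        }
        Element start = elements.get(wordToVectorIndex.get(source));
        start.visited = true;
        Deque<Element> queue = new ArrayDeque<>();
        queue.add(start);
        int reached = 0;
        while (!queue.isEmpty()) {
            Element current = queue.poll();
            if (current.distance == n) continue;  // do not expand past depth n
            for (Element next : current.links) {
                if (!next.visited) {
                    next.visited = true;
                    next.distance = current.distance + 1;
                    reached++;
                    queue.add(next);
                }
            }
        }
        return reached;
    }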

The runtime of the above code block is O(V + E), where V is the number of words within that depth and E is the number of relationships between those words.

As you can see, we have really improved our runtimes using the ladder data structure we have built. Remember this - you need a data structure that can help you with your future operations; hence, operations always define the data structure and not the other way around.

Let me know if anything can be improved.

If you have read this far, you should follow me on twitter.

Also feel free to follow my projects on github.

Multiplication Algorithm and Recurrences

Multiplication of two numbers is a well-defined computational problem. It is such a basic step that we do not give a thought to the algorithms behind it. In this post we shall look at various implementations of the multiplication algorithm. We will only be looking at the basic algorithms; for the complete set, refer here. This will also be an introductory post on analyzing recurrence relations and finding the upper bound on the running time of a recursive algorithm.

Let us try to multiply two numbers X and Y. Let us use the basic algorithm that we all studied in our younger days.

567 X 236

---------------

3402 + 17010 + 113400

---------------

133812

---------------

Here each digit in Y is multiplied against each digit in X, and the partial products are shifted accordingly. Let us try to frame pseudo code for the above algorithm.
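A sketch of that pseudo code in Java (length() is the helper the post assumes; pow10() and digitAt() are illustrative helpers):

    public class Multiplication {

        // Length of a number; for example, for the number 567 it returns 3.
        static int length(long n) {
            int len = 0;
            do { len++; n /= 10; } while (n > 0);
            return len;
        }

        static long pow10(int n) {                // 10 raised to the power n
            long p = 1;
            while (n-- > 0) p *= 10;
            return p;
        }

        static long digitAt(long n, int i) {      // i-th digit from the right, 0-based
            return (n / pow10(i)) % 10;
        }

        // Grade-school multiplication: each digit of y against each digit
        // of x, shifted by the combined digit positions and accumulated.
        static long multiply(long x, long y) {
            long result = 0;
            for (int i = 0; i < length(y); i++)
                for (int j = 0; j < length(x); j++)
                    result += digitAt(y, i) * digitAt(x, j) * pow10(i + j);
            return result;
        }
    }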

In the above code block, we assume that there is a function, length(), that returns the length of the number. For example, for the number 567, it returns 3. The running time of the above algorithm is easy to guess: since each digit of one number is multiplied against each digit of the other, we need two loops to do the work. Hence the upper bound on the above algorithm is O(n^2) (n-squared), where n is the number of digits.

Let us try to do better than the n-squared algorithm we have seen above, using a recursive algorithm to multiply two numbers. The basic idea is to break the input down into smaller sub-problems, solve them and then combine the solutions.

The working of the algorithm is as follows:
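Split each n-digit number into halves: x = a * 10^(n/2) + b and y = c * 10^(n/2) + d. Then

         x * y = (a * c) * 10^n + (a * d + b * c) * 10^(n/2) + (b * d)

so the product is built from the four half-size products a*c, a*d, b*c and b*d.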

In the algorithm above, we divide both numbers into halves of size n/2 each and then make four recursive calls to multiply them (which might make more recursive calls) to get the final result. The base case is when we get single-digit numbers, where we simply return their product.
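Continuing the Java sketch (recMultiply is an illustrative name):

    // Divide-and-conquer multiplication with four recursive calls.
    static long recMultiply(long x, long y) {
        if (x < 10 && y < 10) return x * y;       // base case: single digits
        int half = Math.max(length(x), length(y)) / 2;
        long p = pow10(half);
        long a = x / p, b = x % p;                // x = a * 10^half + b
        long c = y / p, d = y % p;                // y = c * 10^half + d
        long ac = recMultiply(a, c);
        long ad = recMultiply(a, d);
        long bc = recMultiply(b, c);
        long bd = recMultiply(b, d);
        return ac * pow10(2 * half) + (ad + bc) * p + bd;
    }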

Now let us try to come up with the upper bound on the running time of the recursive algorithm above. The recurrence in the algorithm is of the form,

T(n) = 4T(n/2) + f(n)

Here, there are four sub-problems and the size of each sub-problem is half of the original problem. And f(n) represents the work done outside the recursive calls - splitting the numbers and adding up the shifted partial products - which is linear in the number of digits, i.e. cn. (These are loose bounds, and we assume that multiplying single-digit numbers is an elementary operation.)

The solution of the recurrence can easily be found using the master theorem. But let us analyze it using the recursion tree method to understand it intuitively.

The recursion tree for the above recurrence can be shown as below,

image001_s.png

In the above recursion tree, each call except the leaves makes four other recursive calls. To come up with the upper bound on the running time, we need to calculate the cost at each level and add these costs up along the height of the tree. To aid us in our calculation, let us assume that n is a power of 2.

The number of sub-problems at level j is equal to (4 ^ j), and the size of each sub-problem at level j is equal to n/(2 ^ j). Hence the cost accrued at each level of the recursion tree is (4 ^ j) * c * (n/(2 ^ j)), which comes to (2 ^ j) * cn, the cost of one level in the tree. The height of the tree is log n to the base 2, because the sub-problem size halves at every level. (Strictly there are log n + 1 levels, but we can be sloppy about this; it does not affect the analysis.) So let us sum the cost at each level over the entire tree. Hence,

         T(n) = ∑ (j = 0 to log n) (4 ^ j) * c * (n / (2 ^ j))

Above, we are trying to sum the cost of each level over the entire recursion tree. This gives us the upper bound on the runtime of the algorithm. So continuing,

         T(n) = cn * ∑ (j = 0 to log n) (2 ^ j)

This sum is a geometric series and can be bounded as follows (the base of the logarithm above is 2):

         T(n) ≤ cn * (2 ^ (log n + 1)) = 2cn * (2 ^ log n)

Since 2 ^ log n = n ^ (log 2), and log 2 (to the base 2) is 1, we get

         T(n) ≤ 2cn * n = O(n ^ 2)

So we have used a more complex algorithm to multiply two numbers without any improvement in the runtime: the runtime of the recursive algorithm is the same as that of the "naive" algorithm we saw earlier. Let us see if we can improve upon it. Note that the recursive algorithm makes four recursive multiplications at each step. Can we reduce the number of recursive calls, and thereby the running time of the algorithm? There is a clever algorithm called the Karatsuba algorithm which does exactly this. It makes use of a trick due to Gauss to make three recursive calls instead of four. The working of the algorithm is as follows:
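The middle term a*d + b*c does not need two separate multiplications, because

         (a + b) * (c + d) = a*c + (a*d + b*c) + b*d

so a*d + b*c = (a + b) * (c + d) - a*c - b*d, and only the three products a*c, b*d and (a + b) * (c + d) have to be computed recursively.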

The recurrence of the above algorithm is given as,

T(n) = 3T(n/2) + f(n)

Here, there are three sub-problems and the size of each sub-problem is half of the original problem. And f(n) again represents the linear work done outside the recursive calls.

The recursion tree for the above recurrence can be shown as below,

image006_s.png

The number of sub-problems at level j is equal to (3 ^ j), and the size of each sub-problem at level j is equal to n/(2 ^ j). Hence the cost accrued at each level of the recursion tree is (3 ^ j) * c * (n/(2 ^ j)), which comes to ((3/2) ^ j) * cn, the cost of one level in the tree. The height of the tree is again log n to the base 2, since the sub-problem size still halves at every level. So let us sum the cost at each level over the entire tree. Hence,

         T(n) = ∑ (j = 0 to log n) (3 ^ j) * c * (n / (2 ^ j))

Above, we are again trying to sum the cost of each level over the entire recursion tree. So continuing,

         T(n) = cn * ∑ (j = 0 to log n) ((3/2) ^ j)

This sum is a geometric series and can be bounded as follows (the base of the logarithm is 2):

         T(n) ≤ 3cn * ((3/2) ^ log n) = 3cn * (3 ^ log n) / (2 ^ log n) = 3c * (3 ^ log n)

Using the logarithmic identity 3 ^ (log n) = n ^ (log 3), this simplifies to

         T(n) = O(n ^ log 3) = O(n ^ 1.585)

We have actually reduced the running time from n-squared to n ^ (log 3) (about n ^ 1.585), which is a slower growing function than n-squared, as shown in the figure below.

image011_s.png

The code to do the multiplication is sketched below.
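Continuing the Java sketch (getOrderOfNumber() is the helper named below; pow10() is from the earlier sketch):

    // Number of digits in a number, e.g. getOrderOfNumber(567) returns 3.
    static int getOrderOfNumber(long n) {
        int order = 0;
        do { order++; n /= 10; } while (n > 0);
        return order;
    }

    // Karatsuba multiplication: three recursive calls instead of four.
    static long karatsuba(long x, long y) {
        if (x < 10 && y < 10) return x * y;       // base case: single digits
        int half = Math.max(getOrderOfNumber(x), getOrderOfNumber(y)) / 2;
        long p = pow10(half);
        long a = x / p, b = x % p;                // x = a * 10^half + b
        long c = y / p, d = y % p;                // y = c * 10^half + d
        long ac = karatsuba(a, c);
        long bd = karatsuba(b, d);
        long adPlusBc = karatsuba(a + b, c + d) - ac - bd;   // Gauss's trick
        return ac * pow10(2 * half) + adPlusBc * p + bd;
    }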

The function getOrderOfNumber() returns the number of digits in a number. The code makes three recursive calls as we had described earlier and returns the result.

Note: We can confirm the upper bound we calculated with the recursion tree using the easier master theorem. The details are given below.

The first step in using the master theorem is to calculate the number of leaves in the recursion. The leaves are given by,

         number of leaves = 3 ^ (log n) = n ^ (log 3)

Since the work at the internal nodes, f(n) = cn, grows more slowly than the n ^ (log 3) work at the leaves, the recursion is leaf-heavy and the master theorem gives T(n) = O(n ^ log 3), confirming the bound from the recursion tree.

See you in the comments.

If you have read this far, you should follow me on twitter.

Also feel free to follow my projects on github.

Fibonacci Numbers & Their Implementations

The Fibonacci sequence is defined by the recurrence,

            F(0) = 0
            F(1) = 1
            F(i) = F(i - 2) + F(i - 1)         [For all i ≥ 2]

The series that results from the above recurrence is

            0, 1, 1, 2, 3, 5, 8, 13, …

The Fibonacci series appears throughout nature. The rate at which rabbits multiply, the branching patterns of trees, the arrangement of a pinecone, etc. are some practical examples.

There are various implementations of the Fibonacci series. Let's look at some of them along with their running times.
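(The gists are not reproduced in this post, so the blocks below are minimal Java sketches with illustrative names.)

    // Recursive implementation, straight from the recurrence.
    static long fib(int n) {
        if (n < 2) return n;                      // F(0) = 0, F(1) = 1
        return fib(n - 2) + fib(n - 1);
    }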

The above sketch is the recursive implementation of the Fibonacci series. This is by far the worst implementation of the sequence. The running time of the above algorithm is O(2^n), where n is the input to the recurrence F in the definition above. This algorithm might suffice when n is small, but when n is large, the time taken by the program grows exponentially.
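A sketch of the iterative version:

    // Iterative implementation: walk the sequence upwards in n steps.
    static long fibIterative(int n) {
        long prev = 0, curr = 1;                  // F(0) and F(1)
        for (int i = 0; i < n; i++) {
            long next = prev + curr;
            prev = curr;
            curr = next;
        }
        return prev;                              // after n steps, prev holds F(n)
    }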

The above sketch is the iterative implementation of the algorithm. The running time of this algorithm is O(n). It is simple and faster than the previous algorithm. Most of the time, this algorithm is fast enough to compute the Fibonacci sequence. But if you still want to push the boundary, there are a few techniques available.

The Fibonacci sequence is closely related to the golden ratio (ϕ), which is given by the following formula:

         ϕ = (1 + √5) / 2

And the i-th Fibonacci number is given by

         F(i) = round(ϕ ^ i / √5)          [ϕ ^ i / √5 rounded to the nearest integer]

Using the above formula, we can calculate the i-th Fibonacci number directly. The running time of this method is O(1). But it is advisable not to use this method in non-mathematical software, because rounding and overflow in the floating-point computation might leave you with the wrong result.

The fastest way then, in normal circumstances, to find the i-th Fibonacci number is to use a matrix of a special form and square it recursively. It turns out that

         | 1 1 |^n   =   | F(n+1) F(n)   |
         | 1 0 |         | F(n)   F(n-1) |

The Fibonacci number is present in the upper right corner and in the lower left corner of the result matrix. For example,

         | 1 1 |^4   =   | 5 3 |
         | 1 0 |         | 3 2 |          [F(5) = 5, F(4) = 3, F(3) = 2]

The initial matrix is of a constant size (2x2), and when you multiply the matrix by itself, you get another 2x2 matrix. Only four numbers are involved, which keeps the algorithm simple. The running time of this algorithm is O(lg n), which is very fast, and this is the algorithm mostly used to calculate F(i).
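A sketch of the matrix method in Java (the names are illustrative):

    // Raise [[1,1],[1,0]] to the n-th power by repeated squaring; F(n) then
    // sits in the off-diagonal corners of the result. Runs in O(lg n) time.
    static long fibMatrix(int n) {
        long[][] result = {{1, 0}, {0, 1}};       // identity matrix
        long[][] base = {{1, 1}, {1, 0}};
        while (n > 0) {
            if ((n & 1) == 1) result = multiply2x2(result, base);
            base = multiply2x2(base, base);       // square the matrix
            n >>= 1;
        }
        return result[0][1];                      // upper right corner holds F(n)
    }

    static long[][] multiply2x2(long[][] a, long[][] b) {
        return new long[][] {
            { a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1] },
            { a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1] }
        };
    }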

If you have read this far, you should follow me on twitter.

Also feel free to follow my projects on github.

Graph Theory

(This is the first post of a two-part series giving an introduction to Graph Theory and its applications. This post contains more of the theory of graphs; the next one will concentrate on its applications.)

The different types of graphs discussed here are - Directed Graphs, Undirected Graphs & Trees (which are similar to graphs as we shall see)

A directed graph or digraph G is a pair (V, E), where V is a finite set of vertices and E is a set of edges, which is a binary relation on V. Given below is an example of a digraph.
image001_s.png

In an undirected graph G = (V, E), the edge set E consists of unordered pairs of vertices: the edge (u, v) is the same as (v, u). No self-loops are allowed, although they are possible in a digraph. Given below is an example of an undirected graph.

image002_s.png
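Both kinds of graph can be represented in code with an adjacency list; here is a minimal Java sketch (the post itself gives no code, so the names are illustrative):

    import java.util.ArrayList;
    import java.util.List;

    class Graph {
        private final List<List<Integer>> adj = new ArrayList<>();
        private final boolean directed;

        Graph(int vertices, boolean directed) {
            this.directed = directed;
            for (int i = 0; i < vertices; i++) adj.add(new ArrayList<>());
        }

        // Adds the edge (u, v); in an undirected graph (u, v) is the same
        // as (v, u), so the edge is recorded in both directions.
        void addEdge(int u, int v) {
            adj.get(u).add(v);
            if (!directed) adj.get(v).add(u);
        }

        List<Integer> neighbours(int u) { return adj.get(u); }
    }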

Terminologies used in graphs:

Related to edges of a graph

  • If (u, v) is an edge in a directed graph G = (V, E), we say that it is incident from u and incident to v. For example, in the above directed graph, the edge (1, 2) is incident from 1 and incident to 2
  • If (u, v) is an edge in an undirected graph G = (V, E), we say that it is incident on vertices u and v. For example, in the above undirected graph, the edge (1, 2) is incident on vertices 1 and 2

Related to vertices of a graph

  • If (u, v) is an edge in the graph G = (V, E), we say that vertex v is adjacent to vertex u. This relation is symmetric if the graph is undirected. In the directed graph, w.r.t. the edge (2, 3), vertex 3 is adjacent to vertex 2; and in the first undirected graph, w.r.t. the edge (2, 4), vertex 2 is adjacent to vertex 4 and vertex 4 is also adjacent to vertex 2
  • The degree of a vertex in an undirected graph G = (V, E) is the number of edges incident to it. A vertex is said to be isolated if its degree is 0. In the first undirected graph, the degree of vertex 4 is four (In fact the graph is a complete graph with all vertices of degree four). And in the second undirected graph, vertex 2 is isolated and has a degree of zero
  • The degree of a vertex in a directed graph G = (V, E) is the sum of its in-degree (the number of edges incident to the vertex) and its out-degree (the number of edges incident from the vertex). In the above directed graph, the in-degree of vertex 4 is three and the out-degree is two
  • A path from a vertex u to a vertex v in a graph G = (V, E) is a sequence <V0, V1, V2, V3, … , Vn> where V0 = u, Vn = v and v is reachable from u. A path is simple if all the vertices in it are unique. An example path in the directed graph is 1, 2, 3, 5
  • A subpath of a path P is a contiguous subsequence of its vertices. An example subpath in the directed graph is 2, 3, 4 of the complete path 1, 2, 3, 4
  • A path forms a cycle if one or more of its vertices repeat along the path. A path with no cycle is called a simple path. A graph with no cycles is called acyclic, and a dag is a directed acyclic graph. An example of a cycle in the directed graph is 1, 2, 3, 4, 3, 5
  • A connected graph is a graph G = (V, E) in which every vertex is reachable from every other vertex by a path. The first undirected graph in the above illustration is a connected graph, whereas the second undirected graph is not connected. In the directed graph, too, all the vertices form a connected graph
  • The strongly connected components of a directed graph are the sets of vertices in which each pair of vertices is reachable from each other. In the above directed graph, all the vertices form a strongly connected component

Trees are related to graphs but use some slightly different notions. A tree is a connected, acyclic, undirected graph. Many algorithms that work on graphs also work on trees. Below are some illustrations of trees.
image003_s.png

Terminologies used in trees:

  • A root is the node at the base of the tree. No edges come into it; edges only go out of it. In the illustrations above, 15 and 1 are examples of roots
  • If x is a node in the tree with root r, then any node y on the path from r to x is called an ancestor of x, and x is a descendant of y. In the first example, 12 is a descendant of 15 and 22 is an ancestor of 3
  • If there is an edge (x,y), then x is called the parent of y and y is called the child of x. In the second example, 2 is the parent of 5 and 5 is a child of 2
  • The length of the path from the root r to a node x is the depth of x in the tree. The maximum depth in the first illustration is 4 and in the second it is 2 (note: depth starts from 0)
  • The height of the tree is the depth of the deepest node from the root of the tree. The height of the first tree is 4 and of the second one is 2
  • A binary tree is a tree in which every node has at most two children; formally, it is either empty or made up of three disjoint sets of nodes - a root, a left subtree and a right subtree. The first example is not a binary tree whereas the second one is
  • A node to the left of the root is called the left child of the root and a node to the right of the root is called the right child of the root
  • A full binary tree is a tree where each node is a leaf or has two children. The second illustration is a full binary tree
  • A complete binary tree is a tree in which all the leaves are at the same level and each non-leaf node has two children. The second illustration is again a complete binary tree

Some properties of graphs and trees:

  • In any undirected graph, the sum of the degrees of all the vertices is equal to 2 times the number of edges in the graph, i.e. for v ∈ V, ∑ degree(v) = 2 | E |
  • In any connected, undirected graph G = (V, E), | E | ≥ | V | - 1
  • Any two vertices in a tree are connected by a unique simple path
  • In a connected tree, | E | = | V | - 1
  • A complete binary tree has (2 ^ (h + 1)) - 1 nodes, where h is the height of the tree
  • A non-empty binary tree with n nodes has height at least floor(lg n)

If you have read this far, you should follow me on twitter.

Also feel free to follow my projects on github.

Parallel Processing Using the Map Reduce Programming Model

Let’s say we want to find the highest rated movie of each year. As input to this program, we can download movie data as a text file from IMDB’s alternative interface. We need to work on it a bit before we use it. For now, let us focus on English movies alone, so we shall remove all foreign films. Let us also get rid of unwanted columns and do other formatting tasks. Finally, we prepare a tab-delimited file of the format [<Total Votes> <Rating> <Movie Name> <Year>]. (If you want to use the worked file for your own mashup, you can get it from here)

We can then write a perl script to read the above file and give us the highest rated movie of each year as the output.

The script maintains hashes for the movie, rating and votes, keyed by year. We iterate line by line to find the highest rated movie for each particular year. We only consider movies whose total votes exceed 30,000; this removes movies whose ratings are ‘fixed’. In case of a tie in the rating, we break it using the total votes, as this favours the movies that are more popular and well known. The output is of the format [<Year> <Movie Name> <Rating>].
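For illustration, here is a minimal sketch of that logic (in Java rather than perl, since the original script is not reproduced here; the class name and the parsing details are assumptions):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    public class Ratings {
        public static void main(String[] args) throws Exception {
            // year -> {votes, rating, movie} of the best movie seen so far
            Map<String, String[]> best = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split("\t");        // votes, rating, movie, year
                    long votes = Long.parseLong(f[0]);
                    double rating = Double.parseDouble(f[1]);
                    if (votes <= 30000) continue;         // drop 'fixed' ratings
                    String[] b = best.get(f[3]);
                    // Higher rating wins; a tie is broken by total votes.
                    if (b == null || rating > Double.parseDouble(b[1])
                            || (rating == Double.parseDouble(b[1])
                                && votes > Long.parseLong(b[0]))) {
                        best.put(f[3], new String[] { f[0], f[1], f[2] });
                    }
                }
            }
            best.forEach((year, b) ->
                System.out.println(year + "\t" + b[2] + "\t" + b[1]));
        }
    }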

We can then run the program by issuing the following command (the command is the same under Windows):

         raj@ubuntu:~$perl ratings.pl ratings_worked.list

This should print out the results to the terminal. If we need the output in the file, we could redirect using the ‘>’ operator. The sample output for the last five lines is given below.

         2007   No Country for Old Men   8.2
         2008   The Dark Knight          8.9
         2009   Inglourious Basterds     8.4
         2010   Inception                8.9
         2011   X-Men: First Class       8

Next, let us assume we have a much larger dataset which consists of movies in all languages and their details. After running the new file through the same program, we find that it takes a lot of time, and we want it to run faster. There are two ways to go about this. One, we hack the hell out of perl. Although this sounds promising, it might not be the best path. Or, we can try to run parts of the program in parallel, thereby speeding up the execution. But going this route, we might open up a Pandora’s Box of issues that accompany parallel processing.

How many processes are we going to run? How do we co-ordinate the processes? How are we going to combine the output of all the processes? What if processes fail? How are we going to control access to a shared resource? What if the dataset outgrows our single machine? These are some of the questions that need answering before we start writing parallel programs. Although we can get the program to run faster, it is often messy in real life.

What if there were a programming model that abstracts away these details and lets us concentrate only on writing our business logic? A model that automatically takes care of all the finer details of parallelism, freeing us from thinking about them? It turns out there is such a model.

Enter: MapReduce.

MapReduce is a programming model for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Although this might sound constraining, lots of real world problems can be expressed this way.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines (We will be running our program on a single node). The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

MapReduce consists of two phases - the map phase and the reduce phase. Each phase takes a <key, value> as input and gives a <key, value> as output. As a programmer, you need to implement the map and the reduce interface in your program. Then you create a new job for your program and pass this job on to the framework. The framework runs the job and returns the results.

The MapReduce programming model is very useful since it can run on top of the Hadoop framework (I will not be going into Hadoop’s details in this post). So you can store huge amounts of data on Hadoop’s distributed file system and then run MapReduce programs to process them. And Hadoop is designed to run on a variety of machines - from your commodity cluster to Amazon EC2 instances. Hadoop & MapReduce have combined to solve problems in various domains like CRM, machine learning, social analysis etc. For more problems solved using Hadoop/MapReduce, check here. (Although Hadoop/MapReduce is powerful, as they say in investing, please read the fine print as to what it can and cannot do.)

If you want to install & configure Hadoop on a single machine, check out this post. If you want a VM with Hadoop installed and configured, Cloudera’s VM should be alright. Both use Ubuntu as their OS, but if you want to install Hadoop on Windows, read the instructions on this page.

Now coming back to the problem, we need to write our map and reduce interfaces. The input to our map interface is the file that contains the movie details. We read the file line by line as a text value: the key is the line number (which we will never use in our analysis) and the value is the line of text. The part of the code that checks whether a movie has registered more than 30,000 votes can be put into the mapping function. This greatly reduces the data going into the reduce function.

To see how the mapper and reducer interface work, consider the following lines as the input (format - [<Votes> <Rating> <Movie Name> <Year>]) to the MapReduce program:

         193718   8     Casino Royale    2006
         313960   8.3   Batman Begins    2005
         90848    7.7   Munich           2005
         270071   8.2   V for Vendetta   2006
         168557   8     Shutter Island   2010
         313004   8.5   The Departed     2006

So the input to the map interface would be

         1   193718   8     Casino Royale    2006
         2   313960   8.3   Batman Begins    2005
         3   90848    7.7   Munich           2005
         4   270071   8.2   V for Vendetta   2006
         5   168557   8     Shutter Island   2010
         6   313004   8.5   The Departed     2006

The keys to the map interface are the line numbers, and the values are the lines of text. The output of the map interface is of the form [<Year>, <Votes> <Rating> <Movie Name>] (all the movies are included, since each has more than 30,000 votes):

         [2006,   193718   8     Casino Royale]
         [2005,   313960   8.3   Batman Begins]
         [2005,   90848    7.7   Munich]
         [2006,   270071   8.2   V for Vendetta]
         [2010,   168557   8     Shutter Island]
         [2006,   313004   8.5   The Departed]

The output of the map interface is processed midway by the framework, which orders and groups the <key, value> pairs. The ordered and grouped pairs are then sent to the reducer interface. The input passed to the reduce interface is:

         [2005,   ([313960   8.3   Batman Begins], [90848   7.7   Munich])]
         [2006,   ([193718   8   Casino Royale], [270071   8.2   V for Vendetta], [313004   8.5   The Departed])]
         [2010,   ([168557   8   Shutter Island])]

When the reduce interface gets the above input, it just needs to iterate over the movies for each year and find the highest rated one. The final output of the MapReduce program is the output of the reduce interface. For the above example it would be:

         2005   Batman Begins    8.3
         2006   The Departed     8.5
         2010   Shutter Island   8

The Java code (other languages are not covered in this post) that implements the mapper and reducer interfaces of MapReduce is sketched below.
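(The gist itself is not reproduced here; the sketch below is reconstructed from the description that follows, using the classic Hadoop MapReduce API. Details beyond those described, such as the string parsing, are assumptions.)

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class imdbrating {

        // Input:  <line number, "votes \t rating \t movie \t year">
        // Output: <year, "votes \t rating \t movie">
        public static class imdbmapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split("\t");
                if (fields.length < 4) return;               // skip malformed lines
                long votes = Long.parseLong(fields[0].trim());
                if (votes <= 30000) return;                  // filter low-vote movies
                context.write(new Text(fields[3].trim()),
                        new Text(fields[0] + "\t" + fields[1] + "\t" + fields[2]));
            }
        }

        // Input:  <year, list of "votes \t rating \t movie">
        // Output: <year, "movie \t rating"> for the highest rated movie that year.
        public static class imdbreducer extends Reducer<Text, Text, Text, Text> {
            @Override
            public void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                double bestRating = -1;
                long bestVotes = -1;
                String bestMovie = null;
                for (Text value : values) {
                    String[] fields = value.toString().split("\t");
                    long votes = Long.parseLong(fields[0].trim());
                    double rating = Double.parseDouble(fields[1].trim());
                    // Higher rating wins; ties are broken by total votes.
                    if (rating > bestRating
                            || (rating == bestRating && votes > bestVotes)) {
                        bestRating = rating;
                        bestVotes = votes;
                        bestMovie = fields[2].trim();
                    }
                }
                if (bestMovie != null)
                    context.write(key, new Text(bestMovie + "\t" + bestRating));
            }
        }

        public static void main(String[] args) throws Exception {
            if (args.length != 2) {                          // input path and output dir
                System.err.println("Usage: imdbrating <input path> <output dir>");
                System.exit(-1);
            }
            Job job = new Job(new Configuration(), "imdb rating");
            job.setJarByClass(imdbrating.class);             // entry point of the job
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
            job.setMapperClass(imdbmapper.class);
            job.setReducerClass(imdbreducer.class);
            job.setOutputKeyClass(Text.class);               // same types for map & reduce
            job.setOutputValueClass(Text.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }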

The main class (imdbrating) is the class that implements the mapper and reducer interfaces of Hadoop. Inside the main class, you have the static imdbmapper, which extends the Mapper class of Hadoop. The input types of the mapper (<LongWritable, Text>) and its output types (<Text, Text>) are passed as parameters to the Mapper class. Types like LongWritable and Text are Hadoop-specific types that map to Java types: LongWritable maps to a Java long and Text maps to a String. Hadoop types are optimized for network serialization; details about the various Hadoop types are given here. imdbmapper needs to implement the map interface of the Mapper class. The parameters to the interface are the key and value, along with a Context object. The Context object collects the output of the interface and passes it on to the next stage in the framework. Here, we do not do anything in the mapper except filter out the movies that have 30,000 votes or fewer. The output of the map interface is a <Text, Text>.

We follow the same process for implementing the reducer. We create a static class imdbreducer, which extends the Reducer class of Hadoop. The input types from the previous map interface (<Text, Text>) and the output types of the reducer (<Text, Text>) are passed as parameters to the Reducer class. imdbreducer needs to implement the reduce interface of the Reducer class. The parameters to the interface are again a key and value, along with the Context object. The Context object collects the output of the interface and stores it in a file as the final output. The whole business logic is contained in the reducer: we iterate through the input received at this stage, break it up by tabs and compare against the maximum for that year. In the case of a tie, it is again broken by comparing the total votes.

In the main function, we first check whether the right number of arguments has been passed to the program. Then we create a new MapReduce job for this task and set the entry point of the program using the setJarByClass() method. We must next set up the paths to the input and the output. Since paths in the local file system and in Hadoop’s distributed file system are different, we need to construct a new Path specific to Hadoop. One thing to note is that the input should be a path to a file, while the output should be a directory - and that directory must not already exist, otherwise Hadoop will spit out an error when running the job. We set the mapper and reducer classes using setMapperClass() and setReducerClass() respectively. We then specify the output key class and output value class; since these are the same for both the mapper and the reducer, we can set them using setOutputKeyClass() and setOutputValueClass(). (If they are not the same, the map output types can be set using setMapOutputKeyClass() and setMapOutputValueClass().) Finally, we wait for the MapReduce job to complete and then exit.

Before running the MapReduce job on Hadoop, we need to prepare the environment for execution. First we need to make a jar out of our class files. This can be done by issuing the following command (assuming the class files are in the same directory):

        raj@ubuntu:~$jar cvf imdbrating.jar *.class

We now need to start Hadoop’s services. We shall start all of them for now. We can issue the following command to do so:

        raj@ubuntu:/usr/local/hadoop$ bin/start-all.sh

You can check if all the services are running by issuing the command:

        raj@ubuntu:/usr/local/hadoop$ jps

If some of the services are not running, check the log files for errors. They should be present under $HADOOP_DIR/logs, where $HADOOP_DIR is the base directory of the Hadoop installation. Next we need to copy the input dataset (txt file) to Hadoop’s directory. This can be done by issuing the following command:

        raj@ubuntu:~$hadoop dfs -copyFromLocal ratings_worked.list /user/HADOOP_USER_NAME/

Above, /user/HADOOP_USER_NAME is the root directory for that user. Hadoop supports some basic UNIX commands along with its own set; -copyFromLocal copies a local file to the given Hadoop file system. After finishing the above tasks, we can run the MapReduce job on Hadoop. This can be done by:

        raj@ubuntu:~/Projects/Java/Hadoop$ hadoop jar imdbrating.jar imdbrating /user/HADOOP_USER_NAME/ratings_worked.list /user/HADOOP_USER_NAME/imdbrating-output

imdbrating.jar is the jar that we created above, and imdbrating is the entry point into the job. The input is the dataset, ratings_worked.list, given with the right path. The output is a directory that does not already exist on Hadoop. You should get a sample output similar to the one shown below:

         11/08/05 14:23:04 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
         11/08/05 14:23:04 INFO input.FileInputFormat: Total input paths to process : 1
         11/08/05 14:23:04 INFO mapred.JobClient: Running job: job_201108051418_0001
         11/08/05 14:23:05 INFO mapred.JobClient:  map 0% reduce 0%
         11/08/05 14:23:20 INFO mapred.JobClient:  map 100% reduce 0%
         11/08/05 14:23:35 INFO mapred.JobClient:  map 100% reduce 100%
         11/08/05 14:23:40 INFO mapred.JobClient: Job complete: job_201108051418_0001
         11/08/05 14:23:40 INFO mapred.JobClient: Counters: 25
         11/08/05 14:23:40 INFO mapred.JobClient:   Job Counters
         11/08/05 14:23:40 INFO mapred.JobClient:     Launched reduce tasks=1
         11/08/05 14:23:40 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=11406
         11/08/05 14:23:40 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
         11/08/05 14:23:40 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
         11/08/05 14:23:40 INFO mapred.JobClient:     Launched map tasks=1
         11/08/05 14:23:40 INFO mapred.JobClient:     Data-local map tasks=1
         11/08/05 14:23:40 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=12695
         11/08/05 14:23:40 INFO mapred.JobClient:   File Output Format Counters
         11/08/05 14:23:40 INFO mapred.JobClient:     Bytes Written=2270
         11/08/05 14:23:40 INFO mapred.JobClient:   FileSystemCounters
         11/08/05 14:23:40 INFO mapred.JobClient:     FILE_BYTES_READ=48090
         11/08/05 14:23:40 INFO mapred.JobClient:     HDFS_BYTES_READ=7791294
         11/08/05 14:23:40 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=137855
         11/08/05 14:23:40 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2270
         11/08/05 14:23:40 INFO mapred.JobClient:     Bytes Read=7791175
         11/08/05 14:23:40 INFO mapred.JobClient:   Map-Reduce Framework
         11/08/05 14:23:40 INFO mapred.JobClient:     Reduce input groups=76
         11/08/05 14:23:40 INFO mapred.JobClient:     Map output materialized bytes=48090
         11/08/05 14:23:40 INFO mapred.JobClient:     Combine output records=0
         11/08/05 14:23:40 INFO mapred.JobClient:     Map input records=233997
         11/08/05 14:23:40 INFO mapred.JobClient:     Reduce shuffle bytes=48090
         11/08/05 14:23:40 INFO mapred.JobClient:     Reduce output records=76
         11/08/05 14:23:40 INFO mapred.JobClient:     Spilled Records=2806
         11/08/05 14:23:40 INFO mapred.JobClient:     Map output bytes=45278
         11/08/05 14:23:40 INFO mapred.JobClient:     Combine input records=0
         11/08/05 14:23:40 INFO mapred.JobClient:     Map output records=1403
         11/08/05 14:23:40 INFO mapred.JobClient:     SPLIT_RAW_BYTES=119
         11/08/05 14:23:40 INFO mapred.JobClient:     Reduce input records=140

The above output means the MapReduce job ran successfully without any problems. If the job is not successful, you can check the terminal and the logs for the exact nature of the error, or check the web interface of the MapReduce job tracker - http://localhost:50030/ (localhost and port 50030 are the defaults) - for any errors and their descriptions. The output of the MapReduce job should be present in /user/raj/imdbrating-output/part-r-00000. Issuing the following command should give us the same results as the perl output above (showing the last five entries):

         raj@ubuntu:~$hadoop dfs -cat /user/HADOOP_USER_NAME/imdbrating-output/part-r-00000 | tail

The MapReduce job should return the output in the same format as the above perl script ([<Year> <Movie Name> <Rating>]). You can transfer the output file to your local file system by running the following command:

         raj@ubuntu:~$hadoop dfs -copyToLocal /user/HADOOP_USER_NAME/imdbrating-output/part-r-00000 Output.list

That is it. If you have any issues in running the program, leave a comment or email/tweet me. And sorry for the long post.

If you read this far, you should follow me on twitter.

Also feel free to follow my projects on github. You can get the above code from the same place.