Big Data

Hadoop Pseudo-Distributed Mode

Hadoop can also be run on a single node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process. In contrast to standalone mode, HDFS is available to MapReduce jobs. This is useful for testing on a single node before scaling to a cluster.

Configuration

Copy your Hadoop distribution into a new folder (e.g., hadoop-1.0.3-dist) for testing pseudo-distributed mode. Replace the configuration files in the conf directory with the provided configuration files for pseudo-distributed mode: core-site.xml, hdfs-site.xml, and mapred-site.xml.
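
For reference, the standard minimal settings for pseudo-distributed mode (as given in the Hadoop 1.x single-node setup documentation) are shown below; the files provided for this lab may differ in details such as host names or ports.

core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>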

Now check that you can ssh to localhost without a passphrase:

$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
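
If ssh still asks for a password after this, the permissions on the key files may be too open for sshd to accept them; tightening them usually helps:

$ chmod 0600 ~/.ssh/authorized_keys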

Execution

Format a new distributed filesystem:

$ bin/hadoop namenode -format

This command will create a distributed filesystem located at /tmp/hadoop-USERNAME.
You can change the default location by defining the hadoop.tmp.dir property in core-site.xml.
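
For example, to keep the HDFS data in a directory that is not cleared on reboot, add a property like the following to core-site.xml (the path is a placeholder; re-run the format command after changing it):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/USERNAME/hadoop-tmp</value>
</property>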

Start the Hadoop daemons:

$ bin/start-all.sh

This starts NameNode, SecondaryNameNode, and DataNode processes for HDFS, and JobTracker and TaskTracker processes for MapReduce on your machine. You can check them using the jps command:

$ jps
79875 SecondaryNameNode
80024 TaskTracker
80047 Jps
79937 JobTracker
79789 DataNode
79703 NameNode

The Hadoop daemon log output is written to the directory ${HADOOP_LOG_DIR} (defaults to ${HADOOP_HOME}/logs).
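
For example, to follow the NameNode log while the daemons are running (Hadoop 1.x names its log files hadoop-USERNAME-DAEMON-HOSTNAME.log; the exact name on your machine may differ):

$ tail -f logs/hadoop-$USER-namenode-$(hostname).log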

Browse the web interfaces for the NameNode (HDFS master node) and the JobTracker (MapReduce master node); by default they are available at:

NameNode: http://localhost:50070/
JobTracker: http://localhost:50030/

Using the web interface, check that one live node is running.

Copy the input folder (which contains a weather sample file) into the distributed filesystem:

$ bin/hadoop fs -put input /input

You can check the content of a folder on the distributed filesystem:

$ bin/hadoop fs -ls /input

Note: remember that the folder paths on HDFS used here start with /.
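
A few other fs subcommands are often handy; /somedir below is a placeholder path, and the HDFS Shell Guide referenced at the end lists them all:

$ bin/hadoop fs -mkdir /somedir   # create a folder
$ bin/hadoop fs -du /somedir      # show space used per entry
$ bin/hadoop fs -rmr /somedir     # delete a folder recursively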

Run the Maximum Temperature example provided previously on pseudo-distributed mode:

$ bin/hadoop jar maxtemp.jar \
  heigvd.bda.labs.weather.MaxTemperature /input /output
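
Note that Hadoop refuses to start a job whose output folder already exists, so if you rerun the example, delete the old output first:

$ bin/hadoop fs -rmr /output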

Examine the output files in one of two ways.

Copy them from the distributed filesystem to the local filesystem and examine them:

$ bin/hadoop fs -get /output output
$ cat output/*

or

View them directly on the distributed filesystem:

$ bin/hadoop fs -cat /output/*

When you're done, stop the daemons with:

$ bin/stop-all.sh
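
Running jps again should now show only the Jps process itself, confirming that all daemons have stopped:

$ jps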

Reference

HDFS Shell Guide