Hadoop Pseudo-Distributed Mode
Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process. In contrast to standalone mode, HDFS is available to MapReduce jobs. This is useful for testing on a single node before scaling to a cluster.
Configuration
Copy your Hadoop distribution into a new folder (e.g., hadoop-1.0.3-dist) for testing pseudo-distributed mode. Replace the configuration files in the conf directory with the given configuration files for pseudo-distributed mode. The folder contains configurations for core-site.xml, hdfs-site.xml, and mapred-site.xml.
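If you need to write these files yourself, the typical Hadoop 1.x pseudo-distributed settings look as follows. This is a sketch based on the standard Hadoop single-node setup, not necessarily identical to the files provided for the lab; the ports 9000 and 9001 are the conventional choices and can be changed if they clash with other services:

```xml
<!-- conf/core-site.xml: URI of the default filesystem -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: single node, so keep only one block replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: address of the JobTracker -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```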
Now check that you can ssh to localhost without a passphrase:
$ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Execution
Format a new distributed filesystem:
$ bin/hadoop namenode -format
This command creates a distributed filesystem located at /tmp/hadoop-USERNAME. You can change the default location by setting the hadoop.tmp.dir property in core-site.xml.
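For example, to keep HDFS data out of /tmp (which is often cleared on reboot), you could add a property along these lines to core-site.xml; the path /home/USERNAME/hadoop-data is only a placeholder:

```xml
<!-- core-site.xml: relocate Hadoop's working directory out of /tmp -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/USERNAME/hadoop-data</value>
</property>
```

Note that after changing this property you must format the filesystem again, since the namenode metadata lives under this directory.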
Start the Hadoop daemons:
$ bin/start-all.sh
This will start a NameNode, a SecondaryNameNode, and a DataNode process for HDFS, and a JobTracker and a TaskTracker process for MapReduce on your machine. You can check them using the jps command:
$ jps
79875 SecondaryNameNode
80024 TaskTracker
80047 Jps
79937 JobTracker
79789 DataNode
79703 NameNode
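To automate this check, the jps output can be filtered for the expected daemon names. A minimal sketch, assuming a helper of our own (check_daemons is hypothetical, not part of Hadoop); it reads jps output on stdin so it can also be tried without a running cluster:

```shell
# check_daemons: read 'jps' output on stdin and report any of the five
# expected Hadoop daemons that is not listed.
check_daemons() {
  listing=$(cat)
  for d in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
    echo "$listing" | grep -qw "$d" || echo "missing: $d"
  done
}
```

Run it as `jps | check_daemons`; no output means all five daemons are up.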
The Hadoop daemon log output is written to the directory ${HADOOP_LOG_DIR} (defaults to ${HADOOP_HOME}/logs).
Browse the web interfaces for the NameNode (HDFS master node) and the JobTracker (MapReduce master node); by default they are available at:
- NameNode - http://localhost:50070/
- JobTracker - http://localhost:50030/
Using the web interfaces, check that there is one node running.
Copy the input folder (that contains a weather sample file) into the distributed filesystem:
$ bin/hadoop fs -put input /input
You can check the content of a folder on the distributed filesystem:
$ bin/hadoop fs -ls /input
Note: Remember that folder paths on HDFS start with /.
Run the Maximum Temperature example provided previously in pseudo-distributed mode:
$ bin/hadoop jar maxtemp.jar \
heigvd.bda.labs.weather.MaxTemperature /input /output
Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hadoop fs -get /output output
$ cat output/*
or
View the output files on the distributed filesystem:
$ bin/hadoop fs -cat /output/*
When you're done, stop the daemons with:
$ bin/stop-all.sh