Thursday, December 29, 2011

Pig: store data in compressed format

It's really easy to compress data when storing it with Pig, using the bz2 format. All you need to do is add a '.bz2' extension to the output directory. For example:
STORE data INTO 'compressedoutput.bz2' USING SomeStore();
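
For a fuller sketch (the paths, schema, and relation names here are made up for illustration; PigStorage is Pig's built-in load/store function):

-- load tab-separated input (hypothetical path and schema)
raw = LOAD 'input/logs' USING PigStorage('\t') AS (user:chararray, count:int);
-- the '.bz2' suffix on the output directory tells Pig to bzip2-compress each part file
STORE raw INTO 'output/logs.bz2' USING PigStorage('\t');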


Taadaa!!! You are done!

Another advantage of bz2 is that the format is splittable. So if some other subsequent Map/Reduce job is going to read this data, the input can be split, giving more parallelism.
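
For instance (a hypothetical follow-on job, reusing the output path from the sketch above), the compressed output can be loaded straight back and Pig will decompress and split it for you:

-- Pig recognizes the .bz2 part files and decompresses them transparently
compressed = LOAD 'output/logs.bz2' USING PigStorage('\t') AS (user:chararray, count:int);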

Awesomely easy, isn't it?!

Monday, August 1, 2011

wget with proxy

Very simple, but I forget the syntax every time I want to use it. So I'm making it available for people like me who quickly want to get to the answer.

How to use a proxy with wget:


export http_proxy=http://proxyserver:port

Then use wget as usual:

wget -c http://url-u-want-to-wget-with-proxy

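Putting the two together (the proxy host, port, and URL below are placeholders):

# point wget at the proxy; set https_proxy too if the URL is https
export http_proxy=http://proxy.example.com:8080
export https_proxy=http://proxy.example.com:8080
# -c resumes a partially downloaded file instead of starting over
wget -c http://example.com/some-large-file.tar.gz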

HTH

Tuesday, May 17, 2011

Hadoop issue: jobtracker.info could only be replicated to 0 nodes, instead of 1

After wasting 36 hours of my life, I finally found a way to bypass the problem.
Instead of using hadoop/bin/start-all.sh, run each of the commands separately:
* ./bin/hadoop-daemon.sh start namenode
* ./bin/hadoop-daemon.sh start jobtracker
* ./bin/hadoop-daemon.sh start tasktracker
* ./bin/hadoop-daemon.sh start datanode
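
To verify that each daemon actually came up, jps (which ships with the JDK) lists the running Java processes. On a working single-node setup you'd expect to see something like the class names below (the PIDs are illustrative and will differ):

jps
# 4211 NameNode
# 4389 JobTracker
# 4467 TaskTracker
# 4540 DataNode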



There may still be an issue with the dfs dir. Try deleting the dfs dir and cleaning up everything of the form /tmp/hadoop*.
Then do a hadoop namenode -format.

Note: make sure that before doing all these things you have killed/stopped all the Hadoop processes running on your machine.
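
As a sketch of the full cleanup (this assumes the default /tmp layout and a HADOOP_HOME environment variable pointing at your install; it wipes all HDFS data, so use it only on a throwaway dev cluster):

# stop every Hadoop daemon first
$HADOOP_HOME/bin/stop-all.sh
# remove the dfs dir and leftover temp files (destructive!)
rm -rf /tmp/hadoop*
# reformat the namenode, then restart the daemons one by one as above
$HADOOP_HOME/bin/hadoop namenode -format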