Pydoop is a Python interface to Hadoop that allows you to write MapReduce applications in pure Python:
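
    import pydoop.mapreduce.api as api

    class Mapper(api.Mapper):
        def map(self, context):
            # Emit (word, 1) for every word in the input line.
            for w in context.value.split():
                context.emit(w, 1)

    class Reducer(api.Reducer):
        def reduce(self, context):
            # Sum the counts collected for this word.
            context.emit(context.key, sum(context.values))
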
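To see the data flow of the word count example without a Hadoop cluster, the map, shuffle and reduce steps can be imitated in plain Python. This is only a sketch: `FakeContext` and `run_local` are hypothetical helpers, not part of Pydoop, and the classes here do not inherit from `api.Mapper` / `api.Reducer` so that the snippet runs on its own.

```python
from collections import defaultdict


class FakeContext:
    """Hypothetical stand-in for the context object (not a Pydoop class)."""

    def __init__(self, key=None, value=None, values=None):
        self.key = key
        self.value = value
        self.values = values
        self.emitted = []

    def emit(self, key, value):
        # Record the pair instead of sending it to Hadoop.
        self.emitted.append((key, value))


class Mapper:
    def map(self, context):
        for w in context.value.split():
            context.emit(w, 1)


class Reducer:
    def reduce(self, context):
        context.emit(context.key, sum(context.values))


def run_local(lines):
    """Map each line, group values by key (the shuffle), reduce each group."""
    mapper, reducer = Mapper(), Reducer()
    groups = defaultdict(list)
    for line in lines:
        ctx = FakeContext(value=line)
        mapper.map(ctx)
        for k, v in ctx.emitted:
            groups[k].append(v)
    counts = {}
    for k in sorted(groups):
        ctx = FakeContext(key=k, values=groups[k])
        reducer.reduce(ctx)
        counts.update(ctx.emitted)
    return counts
```

Calling `run_local(["a b a", "b c"])` counts each word across both lines. In a real job the framework, not user code, performs the grouping step.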
Pydoop offers several features not commonly found in other Python libraries for Hadoop:
- a rich HDFS API;
- a MapReduce API that lets you write pure Python record readers / writers, partitioners and combiners;
- transparent Avro (de)serialization;
- easy, installation-free usage.
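For readers new to these terms, the roles of a combiner and a partitioner can be sketched in plain Python. The functions below illustrate the concepts only; `combine` and `partition` are hypothetical names and do not reflect Pydoop's actual class interfaces.

```python
import zlib
from collections import Counter


def combine(pairs):
    """Combiner: pre-aggregate (key, count) pairs on the map side
    so that less data has to be sent to the reducers."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, n_reducers):
    """Partitioner: choose which reducer receives a key.
    A stable hash (CRC32 here, an arbitrary choice) sends every
    occurrence of the same key to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % n_reducers
```

A combiner is only safe when the reduce operation is associative and commutative, as summation is in the word count example.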
Pydoop enables MapReduce programming through a Python client for Hadoop Pipes that is pure Python except for a performance-critical serialization section. HDFS access is provided by an extension module based on libhdfs.
To get started, read the tutorial. Full docs, including installation instructions, are listed below.