A Naive Implementation of MapReduce in C#
After reading Ralf Lämmel's paper on MapReduce, I was inspired to code my own version in C#. Well, I didn't quite have the time to do a fully distributed and reliable version, so I just implemented the simplest, sequential, strongly-typed, and moderately efficient version I could think of.
I decided to utilize the IEnumerable`1 interface for the majority of my "collections" so that it could more efficiently handle very large data sets. That is, the IEnumerable`1 interface does not require that you store all elements in memory, but that you simply hand them out as requested. This seems to achieve what the Google authors mentioned in their original paper:
In my version, you will see plenty of use of C#'s iterators in an attempt to emulate this feature:
As an example of its usage, I present the magical word counter:
I would like to end this entry with a reference to Christopher Diggins' (the inventor of the Cat's) implementation in Cat (a stack-based functional language). It's short, insanely short in fact:
For an explanation of this function, please read Christopher's writeup.
I decided to utilize the IEnumerable`1 interface for the majority of my "collections" so that it could more efficiently handle very large data sets. That is, the IEnumerable`1 interface does not require that you store all elements in memory, but that you simply hand them out as requested. This seems to achieve what the Google authors mentioned in their original paper:
The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.
In my version, you will see plenty of use of C#'s iterators in an attempt to emulate this feature:
class MapReduce
{
public delegate IEnumerable<KeyValuePair<TOK, TOV>> Map<TIK, TIV, TOK, TOV>(TIK key, TIV value);
public delegate IEnumerable<KeyValuePair<TOK, TOV>> Reduce<TIK, TIV, TOK, TOV>(TIK key, IEnumerable<TIV> value);
public static IEnumerable<KeyValuePair<TOK, TOV>> Run<TIK, TIV, TTK, TTV, TOK, TOV>(
Map<TIK, TIV, TTK, TTV> map,
Reduce<TTK, TTV, TOK, TOV> reduce,
IEnumerable<KeyValuePair<TIK, TIV>> input)
{
//
// Map and group all at once
//
Dictionary<TTK, List<TTV>> tempStorage = new Dictionary<TTK, List<TTV>>();
foreach (KeyValuePair<TIK, TIV> inputPair in input)
{
foreach (KeyValuePair<TTK, TTV> tempPair in map(inputPair.Key, inputPair.Value))
{
List<TTV> vals;
if (!tempStorage.TryGetValue(tempPair.Key, out vals))
{
vals = new List<TTV>();
tempStorage.Add(tempPair.Key, vals);
}
vals.Add(tempPair.Value);
}
}
//
// Reduce
//
foreach (KeyValuePair<TTK, List<TTV>> tempPair in tempStorage)
{
foreach (KeyValuePair<TOK, TOV> outPair in reduce( tempPair.Key, tempPair.Value))
{
yield return outPair;
}
}
}
}As an example of its usage, I present the magical word counter:
class Program
{
private static IEnumerable<KeyValuePair<string, int>> WordCountMap(
string docId,
string doc)
{
char[] splitChars = {' ', '\r', '\n', '\t', '\x0085'};
foreach (string word in doc.Split(splitChars, StringSplitOptions.RemoveEmptyEntries))
{
yield return new KeyValuePair<string, int>(word, 1);
}
}
private static IEnumerable<KeyValuePair<string, int>> WordCountReduce(
string word,
IEnumerable<int> counts)
{
int wordCount = 0;
foreach (int count in counts)
{
wordCount += count;
}
yield return new KeyValuePair<string, int>(word, wordCount);
}
private static IEnumerable<KeyValuePair<string, string>> AllFilesInDirectory(
string path,
string searchPattern)
{
foreach (string fileName in Directory.GetFiles(path, searchPattern))
{
yield return new KeyValuePair<string, string>(fileName, File.ReadAllText(fileName));
}
}
private static void Main(string[] args)
{
foreach (
KeyValuePair<string, int> wordAndCount in MapReduce.Run<string, string, string, int, string, int>(WordCountMap,
WordCountReduce,
AllFilesInDirectory(args[0], "*.cs")))
{
Console.WriteLine("{0}: {1}", wordAndCount.Value, wordAndCount.Key);
}
}
}I would like to end this entry with a reference to Christopher Diggins' (the inventor of the Cat's) implementation in Cat (a stack-based functional language). It's short, insanely short in fact:
define map_reduce { [map flatten self_join] dip map }
For an explanation of this function, please read Christopher's writeup.

Reader Comments
Post a Comment