Frequency analysis with Parallel LINQ and regex

One of the most basic exercises when you learn a new programming language is to count the frequency of all the words in a text file. Google has many C# word-counting examples, most of which are rather verbose and/or dubious.

Just how compact and fast can we do this? Behold the magic of Parallel LINQ and regular expressions:


public Dictionary<String, Int32> FrequencyAnalysis()
    // This enables a lazy line-by-line sequence from 
    // the specified text file. We're doing this to 
    // avoid loading the whole text file at once.
    // Under the hood, this is a UTF-8 StreamReader
    // with a fixed buffer size of 1024 bytes, and 
    // the iterator returns IEnumerable<String>.
    var allLines = File.ReadLines(this.FilePathName);
    // Compact code, and not big-O(scary). There are more 
    // performant ways to do this, but PLINQ splits and 
    // counts all the words in Shakespeare's combined  
    // works (5.5 MB of text) in under 1 second.
    // The AsParallel() option doubles the performance on 
    // my quad-core dev machine - YMMV.
    // In this case, we're keeping case-sensitivity.
    this.WordFrequency = allLines.AsParallel()
                                 .SelectMany(line => Regex.Split(line, @"\W+"))
                                 .GroupBy(word => word)
                                 .ToDictionary(word => word.Key, word => word.Count());
    // Remove all empty strings.
    return this.WordFrequency;

And now the result of running this code on a 5.5 MB text file containing Shakespeare's combined works:

MSDN has good documentation on the "\W+" regex expression. It matches all the normal Unicode characters representing letters, plus 10 connector characters such as underscore.

The AsParallel() option roughly doubles the performance on my quad-core hardware when using that text file. But 5.5 MB of text is relatively small in today's world. What happens when we go much bigger? Let's try a 282 MB text (technically XML) file:

From a scaling perspective, we increased the size of the problem by roughly 51 times, and it took roughly 48 times longer. That looks pretty linear to me. One issue that developers have with using declarative "functional style" code such as PLINQ is that it can be hard to understand the performance aspects without seeing the underlying implementation. But the reality is that you need to profile the performance anyway - it's hard to understand the performance characteristics of an implementation just by looking at the code.