A while ago I was asked, as a pre-interview task for another company, to write some code in any language that counted word frequencies in these newsgroup articles. I recently came across the Ruby and Scala scripts I wrote and thought it would be fun to post them.
First, here is the wordcounts.rb script. It assumes the newsgroup files have already been downloaded and is executed by running ruby wordcounts.rb
.
#Change rootDir to the location of your newsgroup files
rootDir = "/home/zcox/dev/20_newsgroups"
raise rootDir + " does not exist" unless File.directory? rootDir
#Iterates over all files under rootDir, opens each one and passes it to the block
def files(rootDir)
Dir.foreach(rootDir) do |dir|
if dir != "." && dir != ".."
puts "Processing " + dir
Dir.foreach(rootDir + "/" + dir) do |file|
if file != "." && file != ".."
open(rootDir + "/" + dir + "/" + file) do |f|
yield(f)
end
end
end
end
end
end
t1 = Time.now
counts = Hash.new(0) #0 will be the default value for non-existent keys
files(rootDir) do |file|
file.read.scan(/\w+/) { |word| counts[word.downcase] += 1 }
end
puts "Writing counts in decreasing order"
open("counts-descreasing-ruby", "w") do |out|
counts.sort { |a, b| b[1] <=> a[1] }.each { |pair| out << "#{pair[0]}\t#{pair[1]}\n" }
end
puts "Writing counts in alphabetical order"
open("counts-alphabetical-ruby", "w") do |out|
counts.sort { |a, b| a[0] <=> b[0] }.each { |pair| out << "#{pair[0]}\t#{pair[1]}\n" }
end
t2 = Time.now
puts "Finished in " + (t2 - t1).to_s + " seconds"
The code just iterates over each file, splits each file into words and aggregates the word counts using a Hash object. It then sorts the Hash in two different ways and writes each to a separate file. I think Ruby makes this code very easy to read and Ruby’s core library support makes this task pretty simple to accomplish.
Next, here is the wordcounts.scala version. Again it assumes the newsgroup articles are already downloaded and can be run by executing scala wordcounts.scala
.
//Change rootDir to the location of your newsgroup files
import java.io._
val rootDir = new File("/home/zcox/dev/20_newsgroups")
if (!rootDir.exists) throw new IllegalArgumentException(rootDir + " does not exist")
/** Iterates over all files under rootDir, opens each one and passes it to the function */
def files(rootDir: File)(process: File => Unit) {
for (dir <- rootDir.listFiles; if dir.isDirectory) {
println("Processing" + dir)
for (file <- dir.listFiles; if file.isFile) {
process(file)
}
}
}
val t1 = System.currentTimeMillis
var counts = Map.empty[String, Int].withDefaultValue(0)
files(rootDir) { file =>
file.split("""\W+""").foreach { word => counts = counts(word.toLowerCase) += 1 }
}
println("Writing counts in decreasing order")
write(counts, "counts-descreasing-scala") {_._2 > _._2}
println("Writing counts in alphabetical order")
write(counts, "counts-alphabetical-scala") {_._1 < _._1}
val t2 = System.currentTimeMillis
println("Finished in " + ((t2 - t1)/1000.0) + " seconds");
/** Writes the specified map to the specified file in tab-delimited format, sorted accordingly. */
def write[K, V](map: Map[K, V], file: String)(sort: (Tuple2[K, V], Tuple2[K, V]) => Boolean) {
using (new PrintWriter(new FileWriter(file))) { out =>
map.toList.sort(sort).foreach { pair => out.println(pair._1 + "\t" + pair._2) }
}
}
/** Converts a File to a String. */
implicit def file2String(file: File): String = {
val builder = new StringBuilder
using (new BufferedReader(new FileReader(file))) { reader =>
var line = reader.readLine
while (line != null) {
builder.append(line).append('\n')
line = reader.readLine
}
}
builder.toString
}
/** Performs some operation on the specified closeable object and then ensures it gets closed. */
def using[Closeable <: {def close(): Unit}, B](closeable: Closeable)(getB: Closeable => B): B =
try {
getB(closeable)
} finally {
closeable.close()
}
Notice that Scala code is statically typed and compiled software that can be run as a simple script - Scala really does scale from small one-off scripts to large enterprise systems. It mirrors the Ruby script fairly closely, with the addition of some helper functions at the end to make up for some features that the core Scala library doesn’t provide. Overall I think it’s just as readable as the Ruby code, but comes with static type-checking and it runs on the JVM.
I have included some timing info as well. On my 2GHz Core2 Duo with 3GB RAM, the Ruby script averages about 60 seconds while the Scala script averages about 30 seconds. The Scala version uses immutable Map objects to store the word counts; thus a new Map object is created for each word. I switched it to a mutable Map (which required only minor code changes) and it dropped to about 15 seconds. Hardly scientific, I’m sure a lot of optimizations could be done on both scripts, but this shows the performance gains provided that can be achieved on the JVM using Scala.
If anyone can spot any improvements to make to these scripts, please share them with comments!