Sunday, February 23, 2014

Scala: Encoded Polyline, an Example in Implementing a Custom Collection Type

This week I'm in chapter 25 of Programming in Scala, 2nd ed., the chapter titled The Architecture of Scala Collections.

As an exercise, I decided I'd implement a custom collection for encoded latitude/longitude points using Google's polyline encoding method, described here if you're interested. I've written encoding/decoding code for this format for the FuelMyRoute Android app I've been working on, so I'm fairly familiar with it. The example the authors implement in the book is an RNA collection where the bases are stored as bits in an Array of Ints. In trying to come up with a different data structure to implement, I likewise thought of one where the data is stored in a different, more compact format.

So I started by creating a case class to represent a latitude/longitude point, LatLng:

case class LatLng(lat: Double, lng: Double)

Then I created a Polyline class that extends LinearSeq:

import scala.collection.LinearSeq

class Polyline private (val encoding: String, val initLatLng: LatLng)
  extends LinearSeq[LatLng] {
  ...
}

Why a LinearSeq? Well, the polyline encoding format works like this: you encode the value of the first latitude/longitude point as a character string; for the second point, you encode the difference from the first point; for the third point, the difference from the second; and so on. So no efficient random access is possible. Getting the first point (head) is cheap, but in general getting the point at some index takes time proportional to the length of the polyline. This all strongly suggests LinearSeq as the right abstraction for an encoded polyline. Scala's List has the same performance characteristics, for example, and is also a LinearSeq.
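
For example, given the points (38.5, -120.2) and (40.7, -120.95), the encoder stores (38.5, -120.2) followed by the delta (2.2, -0.75); recovering the second point requires decoding the first, and so on down the line.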

There are two abstract methods that must be implemented by subclasses of LinearSeq: apply and length. Unfortunately, these too take time proportional to the length of the polyline. I'm implementing them here in terms of the iterator (LinearSeq already provides a default iterator built from isEmpty, head, and tail via LinearSeqLike, so it comes for free once those three are implemented below):

def apply(idx: Int): LatLng = {
    val it = iterator
    var i = 0
    while (it.hasNext && i <= idx) {
        val latlng = it.next()
        if (i == idx)
            return latlng
        i += 1
    }
    throw new IndexOutOfBoundsException("No LatLng at index " + idx)
}

def length: Int = {
    val it = iterator
    var i = 0
    while (it.hasNext) {
        it.next()
        i += 1
    }
    i
}
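
So with the example polyline decoded later in the post (pl), pl(2) walks the iterator past the first two points and returns LatLng(43.252,-126.453), and pl.length walks all three points; both are linear-time operations.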

But for a LinearSeq, what you want to do is provide efficient implementations of isEmpty, head, and tail. The private constructor for Polyline is given the encoding as a String, as well as initial values for latitude and longitude, which are 0 when starting from a complete encoded polyline string. So for isEmpty we can simply check the length of the encoding string:

override def isEmpty: Boolean = encoding.length == 0

head will lazily decode the first latitude/longitude point and return it.

override def head: LatLng = {
    if (isEmpty) {
        throw new NoSuchElementException("Collection is empty.")
    }

    computeFirstLatLng()
    firstLatLng.get
}

computeFirstLatLng sets firstLatLng, which is of type Option[LatLng].
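
Neither firstLatLng nor the index cursor used further down appears in the post's listings, so here is a minimal sketch of the private state these methods assume (the declarations are mine, inferred from how the fields are used):

private var firstLatLng: Option[LatLng] = None  // memoized first point
private var index = 0                           // cursor into the encoded string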

The way tail will work is this: it will get the first latitude/longitude point and the remainder of the encoded string and then create a new Polyline instance from those values. Recall that when a Polyline is created it is given an initial value for the first latitude/longitude point. We can pass the first LatLng as the initial value for a new Polyline built from the substring of the encoding that starts at the second point.

override def tail: Polyline = {
    if (isEmpty) {
        throw new UnsupportedOperationException("Collection is empty.")
    }

    computeFirstLatLng()
    new Polyline(encoding.substring(index), firstLatLng.get)
}

Where did index come from? That's a private var in class Polyline that keeps track of the position within the encoded string. It only ever points at the first point (index = 0) or the second point, since traversing this Polyline consists of calling head, then tail, then head, and so on, until the returned tail isEmpty. Here's how index is used and how computeFirstLatLng is implemented:

private def computeFirstLatLng(): Unit = {
    if (firstLatLng.isEmpty && encoding.length > 0) {
        val lat = initLatLng.lat + getNextNumber()
        val lng = initLatLng.lng + getNextNumber()

        firstLatLng = Some(LatLng(lat, lng))
    }
}

private def getNextNumber(): Double = {
    var b = 0
    var shift = 0
    var result = 0

    // Each character carries 5 payload bits (after subtracting 63),
    // least-significant chunk first; bit 0x20 means another chunk follows.
    do {
        b = encoding.charAt(index) - 63
        index += 1
        result |= (b & 0x1f) << shift
        shift += 5
    } while (b >= 0x20)

    // The low bit carries the sign (zig-zag); undo it and rescale by 1e5.
    (if ((result & 1) != 0) ~(result >> 1) else (result >> 1)) / 1e5
}
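
As a sanity check, the chunk "_p~iF" decodes like this: subtracting 63 from each character gives 32, 49, 63, 42, 7, whose 5-bit payloads (0, 17, 31, 10, 7, least-significant first) assemble into 7700000; the low bit is 0, so the result is (7700000 >> 1) / 1e5 = 38.5, the first latitude of the example string used below.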

Now to create a companion object that can construct a Polyline instance:

object Polyline {

    def empty: Polyline = new Polyline("", LatLng(0, 0))
    def fromEncoding(encoding: String): Polyline = new Polyline(encoding, LatLng(0, 0))
    def fromSeq(latlngs: Seq[LatLng]): Polyline =
        ???  // TODO: implement encoding LatLngs into a polyline string
}

Now we can create and iterate over a Polyline instance:

scala> val pl = Polyline.fromEncoding("_p~iF~ps|U_ulLnnqC_mqNvxq`@")
pl: Polyline = (LatLng(38.5,-120.2), LatLng(40.7,-120.95), LatLng(43.252,-126.453))

Builders

This all works well enough, but what happens if you apply methods like take or filter?

scala> pl.take(2)
res0: scala.collection.LinearSeq[LatLng] = List(LatLng(38.5,-120.2), LatLng(40.7,-120.95))

The type of the result is List, not Polyline as we would like. To have your custom collection type return results of the same type, you need to provide a Builder that can construct instances of your custom collection type.

What's a Builder? A Builder can add elements to, and then produce, a particular kind of collection. Also useful is that a Builder provides a mapResult method that lets you convert the output of one Builder into a different collection type. Here's a simplified version of the interface:

trait Builder[-Elem, +To] {

    def +=(elem: Elem): this.type
    def result(): To
    def clear()
    def mapResult[NewTo](f: To => NewTo): Builder[Elem, NewTo]
}

The power of mapResult is that we can use an existing Builder to build a collection of one type and then map the result to the Polyline collection type. In our case we'll use ArrayBuffer, which conveniently is a Builder for itself.
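
To see mapResult in isolation, here's a small example of my own (not from the book): since an ArrayBuffer is a Builder[Elem, ArrayBuffer[Elem]], mapping its result gives a Builder for a different type.

import scala.collection.mutable.ArrayBuffer

// A Builder[Int, ArrayBuffer[Int]] becomes a Builder[Int, List[Int]]
val listBuilder = new ArrayBuffer[Int] mapResult (_.toList)
listBuilder += 1
listBuilder += 2
listBuilder.result()  // List(1, 2)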

So what do you need to do to supply a custom Builder?

  1. Extend the "Like" trait and supply the result type of your custom collection as the second type parameter of the trait. Every Seq type has a corresponding "Like" type that contains all of the implementation. For example, LinearSeq's Like type is LinearSeqLike.
  2. Override newBuilder to return a Builder that can construct instances of your custom collection type.

For Polyline it would work like this:

import scala.collection.{LinearSeq, LinearSeqLike}
import scala.collection.mutable.{ArrayBuffer, Builder}

class Polyline private (val encoding: String, val initLatLng: LatLng)
    extends LinearSeq[LatLng] with LinearSeqLike[LatLng, Polyline] {

    override def newBuilder: Builder[LatLng, Polyline] =
        new ArrayBuffer[LatLng] mapResult Polyline.fromSeq
    ...
}

Note: I'd need to actually implement Polyline.fromSeq for this to work. I ran out of time, but hopefully you can see how the Builder would work and actually implementing fromSeq is just a detail at this point.
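
For what it's worth, here's roughly what fromSeq could look like. It just runs getNextNumber in reverse: scale each coordinate by 1e5, delta-encode against the previous point, move the sign into the low bit, and emit 5-bit chunks with 0x20 as the continuation flag. This is my sketch of the standard encoding scheme, not code from the original post, and the encodeValue helper is a name I made up:

def fromSeq(latlngs: Seq[LatLng]): Polyline = {
    val sb = new StringBuilder
    var prevLat = 0L
    var prevLng = 0L
    for (ll <- latlngs) {
        val lat = math.round(ll.lat * 1e5)
        val lng = math.round(ll.lng * 1e5)
        encodeValue((lat - prevLat).toInt, sb)
        encodeValue((lng - prevLng).toInt, sb)
        prevLat = lat
        prevLng = lng
    }
    new Polyline(sb.toString, LatLng(0, 0))
}

// Hypothetical helper: encodes one scaled delta as 5-bit chunks.
private def encodeValue(delta: Int, sb: StringBuilder): Unit = {
    var v = if (delta < 0) ~(delta << 1) else delta << 1  // zig-zag the sign bit
    while (v >= 0x20) {
        sb.append((((v & 0x1f) | 0x20) + 63).toChar)  // set the continuation bit
        v >>= 5
    }
    sb.append((v + 63).toChar)
}

If this is right, round-tripping the earlier example should reproduce it: Polyline.fromSeq(pl.toSeq).encoding would equal "_p~iF~ps|U_ulLnnqC_mqNvxq`@".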

CanBuildFrom

Overriding newBuilder isn't quite enough for our custom collection type to always be returned (at least when returning it is possible). For example, the collection concatenation operator, ++, needs to be able to return a sequence whose element type is the most specific common supertype of the element types of the left-hand and right-hand collections. It does this via an implicit parameter of type CanBuildFrom. The signature of ++ is:

def ++[B >: A, That](that: GenTraversableOnce[B])
    (implicit bf: CanBuildFrom[LinearSeq[A], B, That]): That

CanBuildFrom is a trait defined like this:

trait CanBuildFrom[-From, -Elem, +To] {

    def apply(from: From): Builder[Elem, To]
    def apply(): Builder[Elem, To]
}

So basically the Scala compiler will bind the most specific implicit CanBuildFrom that matches the types of the arguments. LinearSeq defines a very general CanBuildFrom that can be used for any element type. To have ++ concatenate two Polylines and return a Polyline instance, we have to define a more specific CanBuildFrom, which can be done in the companion object:

import scala.collection.generic.CanBuildFrom
import scala.collection.mutable.{ArrayBuffer, Builder}

object Polyline {

    //...

    // We'll move the newBuilder method to the companion object and the
    // Polyline class will reference this one
    def newBuilder: Builder[LatLng, Polyline] =
        new ArrayBuffer[LatLng] mapResult Polyline.fromSeq

    implicit def canBuildFrom: CanBuildFrom[Polyline, LatLng, Polyline] =
        new CanBuildFrom[Polyline, LatLng, Polyline] {

            def apply(): Builder[LatLng, Polyline] = newBuilder
            def apply(from: Polyline): Builder[LatLng, Polyline] = newBuilder
        }
}
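
With this implicit in scope (and fromSeq implemented, e.g. as sketched earlier), concatenating two Polylines should stay in our type instead of falling back to a general LinearSeq. The expected shape of the REPL session, not actual output:

scala> pl ++ pl
res1: Polyline = (LatLng(38.5,-120.2), LatLng(40.7,-120.95), LatLng(43.252,-126.453), LatLng(38.5,-120.2), LatLng(40.7,-120.95), LatLng(43.252,-126.453))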

Monday, February 17, 2014

Scala: What's the difference between Traversable and Iterable?

I'm currently working through the book Programming in Scala, 2nd ed. I'm in chapter 24, reading about the Traversable and Iterable traits and the difference between them.

Traversable is a trait with one abstract method, foreach. Its signature is

def foreach[U](f: A => U): Unit

Defining this one method for a Traversable allows your collection class to gain several useful methods like

  • map
  • flatMap
  • take
  • drop
  • filter
  • foldLeft, foldRight
  • etc.

Iterable is a trait that extends Traversable and also has one abstract method, iterator. Its signature is

def iterator: Iterator[A]

Iterable defines foreach in terms of this iterator method:

def foreach[U](f: A => U): Unit = {
    val it = iterator
    while (it.hasNext) f(it.next())
}

If we can implement foreach in terms of iterator, why have Traversable as a separate trait?

The book provides one reason: that for some collections, implementing iterator may not be easy or may not be efficient. To illustrate this the book uses a Tree data structure as an example.

The Tree data structure is defined as a case class hierarchy:

sealed abstract class Tree
case class Branch(left: Tree, right: Tree) extends Tree
case class Node(elem: Int) extends Tree

I'm a Scala newbie so I couldn't get their Traversable Tree code example to work. Instead I came up with this:

sealed abstract class Tree extends Traversable[Int]
case class Branch(left: Tree, right: Tree) extends Tree {
  def foreach[U](f: Int => U) = { left foreach f; right foreach f }
}
case class Node(elem: Int) extends Tree {
  def foreach[U](f: Int => U) = f(elem)
}
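
As a quick sanity check, all of the derived Traversable methods now work on a Tree; for example (my own REPL session):

scala> Branch(Node(1), Branch(Node(2), Node(3))).toList
res0: List[Int] = List(1, 2, 3)

scala> Branch(Node(1), Branch(Node(2), Node(3))).sum
res1: Int = 6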

For the Iterable version I came up with this:

sealed abstract class Tree extends Iterable[Int]
case class Branch(left: Tree, right: Tree) extends Tree {
  def iterator: Iterator[Int] = left.iterator ++ right.iterator
}
case class Node(elem: Int) extends Tree {
  def iterator: Iterator[Int] = Iterator.single(elem)
}

Both fairly straightforward. The problem with the Iterable version is with the efficiency of visiting nodes in the tree. In particular, the implementation of ++ is such that it has to do an extra indirection at every call to next to determine whether to pull from the left or the right iterator.

I think it helps to see what a naive implementation of ++ would look like:

def ++[A](left:Iterator[A], right:Iterator[A]):Iterator[A] =
    return new Iterator[A] {

        def hasNext: Boolean = left.hasNext || right.hasNext

        def next: A =
            if (left.hasNext)
                left.next
            else
                right.next
    }

Every call to next has to first check left.hasNext before returning a value. In a balanced binary tree with N leaf nodes, the depth of the tree is approximately log(N), so reaching a leaf through the chain of concatenated iterators requires an extra log(N) calls to left.hasNext. Visiting all of the leaf nodes through the iterator therefore takes O(N log N).

But, that's with the naive implementation of ++. The actual implementation, shown below, optimizes by keeping a reference, cur, to the current Iterator. This allows it to avoid needing to check hasNext on the first Iterator on each call to next.

def ++[B >: A](that: => GenTraversableOnce[B]): Iterator[B] = new AbstractIterator[B] {
    // optimize a little bit to prevent n log n behavior.
    private var cur : Iterator[B] = self
    private var selfExhausted : Boolean = false
    // since that is by-name, make sure it's only referenced once -
    // if "val it = that" is inside the block, then hasNext on an empty
    // iterator will continually reevaluate it. (ticket #3269)
    lazy val it = that.toIterator
    // the eq check is to avoid an infinite loop on "x ++ x"
    def hasNext = cur.hasNext || (!selfExhausted && {
        it.hasNext && {
            cur = it
            selfExhausted = true
            true
        }
    })
    def next() = { hasNext; cur.next() }
}

So, to a certain degree, this example with the Tree is a little artificial. It's just as easy to implement Tree as an Iterable as it is to implement it as a Traversable and there is no fundamental reason why the performance of visiting all of the leaf nodes needs to be worse than a Traversable implementation.

I think it is reasonable to assume that there might be data structures where implementing an Iterator over their elements could be challenging (either fundamentally so or with regards to efficiency). I just wish I could think of a more compelling example.
