CS 240: Lecture 3 - Searching
Dear students:
Last lecture we encountered some algorithms whose execution took longer than we'd like. Today, we formalize this notion of evaluating an algorithm's cost, use it to judge a couple of search algorithms, and then drop the costs into a fixed set of buckets called complexity classes.
Measuring Algorithms
We have an arsenal of tools at our disposal, and each tool has its strengths and weaknesses. To be able to choose which tool to use, we need a way of judging them. Here are a few possible ways in which we can measure algorithms:
- Correctness. A tool passes muster if it produces the right result. Correctness is certainly necessary for any tool, but on its own it's too weak a measure. If a tool takes years to produce a correct answer, it's not a good choice.
- Simplicity. A tool passes muster if it's easy for humans to understand. This too is a naive criterion. A simple tool might also take years to produce the desired result. Additionally, simplicity is generally something we are willing to sacrifice in exchange for other benefits.
- Space efficiency. A tool passes muster if it uses less memory or disk than other tools. This is an important measure used in some situations, but this class does not focus on space efficiency.
- Time efficiency. A tool passes muster if it completes a computation in less time than other tools. This is the measure that we focus on in this class.
With both the list experiment and the biggest difference experiment from last lecture, we used execution time to measure time efficiency. Execution time turns out not to be the best measurement for several reasons:
- Measuring execution time on a single machine measures more than the algorithm or data structure. It also measures your hardware and compiler.
- Comparing executions on different machines is not scientific because too many variables change.
- Comparing executions even on the same machine is not scientific. Caching and interactions with other processes will muddy the execution time.
A better measure is the number of steps. A step is loosely defined as a single operation. An algorithm's number of operations is expressed as a function of \(n\), the number of elements being processed; we'll call this function \(T\).
Consider this algorithm:
total = 0;
for (int i = 0; i < numbers.length; ++i) {
  total += numbers[i];
}
How many steps are taken? The += happens \(n\) times. So does the array indexing. So does ++i. So does the comparison. We could define \(T\) for this algorithm in this way:
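\[ T(n) = 4n \]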
But don't forget the assignment that happens before the loop:
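\[ T(n) = 4n + 1 \]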
Even then, this isn't the code that ultimately gets executed. Rather, the code gets compiled down to a bunch of machine instructions. The array index might expand into five machine instructions. The comparison might expand into three. So, \(T\) might more realistically be defined this way:
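Suppose, just for illustration, that += expands into three instructions and ++i into one. Together with the five for the array index and the three for the comparison, we'd have something like:

\[ T(n) = (5 + 3 + 3 + 1)n + 1 = 12n + 1 \]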
While this may be more accurate, the definition of \(T\) now depends on the machine again, which will make comparing algorithms on different machines difficult. For this reason, we define \(T\) more loosely in terms of some constant coefficients:
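\[ T(n) = c_1 n + c_2 \]

Here \(c_1\) and \(c_2\) stand for whatever the per-iteration and one-time costs turn out to be on a given machine. All algorithms whose number of steps can be expressed this way are said to have linear time.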
How should we define \(T\) for this algorithm?
total = 0;
for (int i = 0; i < numbers.length; ++i) {
  for (int j = 0; j < numbers.length; ++j) {
    total += numbers[i];
  }
}
The outer loop executes \(n\) times. On each iteration, the inner loop executes \(n\) times. That gives us this loose definition of \(T\):
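\[ T(n) = c_1 n^2 + c_2 n + c_3 \]

The \(n^2\) term counts the work in the inner loop's body, the \(n\) term counts the outer loop's bookkeeping, and the constant counts the setup before the loops.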
All algorithms whose number of steps can be expressed this way are said to have quadratic time. Would you rather have a linear time algorithm or a quadratic time algorithm?
How should we define \(T\) for this algorithm?
public static <T extends Comparable<T>> T bigger(T a, T b) {
  if (a.compareTo(b) >= 0) {
    return a;
  } else {
    return b;
  }
}
This method has neither a loop nor a collection, so there's no \(n\). Its cost therefore doesn't vary; it is always some constant:
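\[ T(n) = c \]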
Now let's examine the cost functions of two classic search algorithms.
Searching
Even before computers, we humans collected massive amounts of data. We made dictionaries and phonebooks, for example. Finding an entry in our records can be difficult. One option is the linear search, which looks like this:
to search(target)
  for each record
    if record matches target
      return record
  return null
In the worst case \(T(n) = cn\).
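In Java, the same idea might look like this sketch for an array of ints (the indexOf name and the int[] signature are our own choices for illustration, not part of the pseudocode above):

public static int indexOf(int[] xs, int target) {
  // Examine each element in turn.
  for (int i = 0; i < xs.length; ++i) {
    if (xs[i] == target) {
      return i;
    }
  }
  // The target isn't anywhere in the array.
  return -1;
}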
We can find a target faster if we sort the records first. If we pick a record from the list and compare it to the target, we know how to proceed. One of three things will have happened:
- We found the target.
- The target appears before the record.
- The target appears after the record.
We throw away the portion of the list in which we know the target doesn't appear. Throwing away big chunks of the list is what makes binary search faster than linear search.
Which element should we pick? To minimize the number of steps in the worst case, we pick the element in the exact middle of the list. That means we will eliminate half of the list on every check. We don't physically halve the list, as that would be expensive. Instead, we'll maintain a pair of indices that mark a window within the list in which the target might appear.
This contains method uses a binary search to see if a certain number is within an array:
public static boolean contains(int[] xs, int target) {
  int lo = 0;
  int hi = xs.length - 1;
  while (lo <= hi) {
    // Compare the target to the element in the middle of the window.
    int mid = (lo + hi) / 2;
    if (target < xs[mid]) {
      hi = mid - 1;
    } else if (target > xs[mid]) {
      lo = mid + 1;
    } else {
      return true;
    }
  }
  return false;
}
What happens if you pass an unsorted array to this method? Try it.
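For example, here's an illustrative check, assuming the contains method above is in scope:

int[] sorted = {2, 3, 5, 8, 13, 21};
System.out.println(contains(sorted, 8));    // true
System.out.println(contains(sorted, 7));    // false

int[] unsorted = {21, 3, 13, 2, 8, 5};
System.out.println(contains(unsorted, 8));  // may wrongly report false

The method's promise only holds when the array is sorted.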
The cost function of the binary search is proportional to the number of iterations of the loop. Each iteration halves the list. How many times can a number be halved? The flip side of this problem is how many times 2 must be doubled until you reach \(n\):
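\[ 2^x = n \]

Solving for \(x\) gives \(x = \log_2 n\).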
The answer is the base-2 logarithm. The binary search then has this cost function:
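\[ T(n) = c \log_2 n \]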
Compare these two functions in Desmos. Binary search is a clear winner. It scales to very large collections. However, remember that the list must be sorted before you can achieve these gains.
Big-O and Big-Theta
As we encounter algorithms and data structures, we will judge them by their cost functions. To ease their comparison, we will place them on the Ladder of Complexity. This ladder categorizes the cost functions into a small number of families. These families are formed by considering only the most significant term and dropping constants. Here are a few of those families:
- Constant time (\(1\)): the cost is fixed and doesn't depend on the input size.
- Logarithmic time (\(\log_2 n\)): one extra datum is hardly a blip on the cost.
- Linear time (\(n\)): one extra datum means one extra unit of cost.
- Quadratic time (\(n^2\)): the cost grows more and more with each additional datum.
- Cubic time (\(n^3\)): like quadratic, but worse.
- Exponential time (\(2^n\)): one extra datum doubles the cost.
- Factorial time (\(n!\)): one extra datum more than doubles the cost.
We'll draw this ladder with high costs at the top and low costs at the bottom. Each family is called a complexity class.
To say that an algorithm or data structure is within some family on this ladder, we use a notation called Big-Theta. These cost functions are Big-Theta of these families:
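- \(T(n) = 3n + 2\) is \(\Theta(n)\).
- \(T(n) = 5n^2 + n + 7\) is \(\Theta(n^2)\).
- \(T(n) = 10\) is \(\Theta(1)\).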
To say that an algorithm or data structure is within or below some family on this ladder, we use a notation called Big-O. These cost functions are Big-O of these families:
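- \(T(n) = 3n + 2\) is \(O(n)\), but it's also \(O(n^2)\) and \(O(2^n)\).
- \(T(n) = 5n^2 + n + 7\) is \(O(n^2)\) but not \(O(n)\).
- \(T(n) = 10\) is \(O(1)\), and also \(O(n)\).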
Big-O is a weaker description of the cost function's placement on the ladder. It only gives an upper bound. Big-Theta puts the cost function in a very specific family. We use both notations because sometimes we want to be specific and sometimes we don't. Big-O appears on a kid's menu for 10-and-unders or a ride that only permits riders under 75 pounds. Big-Theta appears on socks for shoe sizes 6–12 and on jars of salsa. A salsa labeled spicy is spicy; the label doesn't mean it might be spicy, medium, or mild.
TODO
You've got a few things to do before we meet again:
See you next time!
Sincerely,
P.S. It's time for a haiku!