Finding array pivot points and measuring complexity

Abstract

First, the news: I’m now on the job market and C++ beckons. Secondly, some fun problems to consider.


This week has been dominated by the job hunt, with a long run of interviews. I am happy to say that one of the first places I heard back from offered me a job on the spot, and they are easily in my top three or four choices among the companies I applied to (with none I strictly prefer). So it looks probable I will be working on an enterprise server product here in Cambridge next year.

The interviews have been dragging me back down to earth and forcing me to brush up on my rusty programming skills. C++ in particular has been a triumph in theory, and troublesome in practice. I present here two interesting problems encountered in the last week.

1. A fun game

The most accessible problem comes first. There is a hat containing slips of paper numbered 1 to 100. Drawing a slip costs £1 and gives you the choice of claiming the amount on it as winnings or putting it back. Once you claim your winnings, the game is over. What is your strategy, and how much do you expect to win? I make the average winnings exactly 1135/13 (about £87.31), though I am not confident that all my floors and ones are in the right place. (Knowing you can ignore the special case of the pay-off being an integer might make your sketch easier.) On average, you accept the 50/7-th number drawn.
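
If you would rather sanity-check the arithmetic than trust my floors and ones, a quick Monte Carlo sketch does the job. It assumes refused slips go back in the hat, that every draw (including the one you finally accept) costs £1, and that the best play is a fixed acceptance threshold, which it simply sweeps rather than trusting any particular value:

#include <cstdlib>
#include <ctime>
#include <iostream>
using namespace std;

// Monte Carlo sketch of the hat game. Assumptions: draws are uniform on
// 1..100 with replacement, every draw costs 1, and the strategy is a
// fixed threshold: accept the first slip whose value is at least T.
int main() {
  srand(time(NULL));
  const int GAMES = 100000;
  double best = -1e9;
  int bestT = 0;
  for (int T = 1; T <= 100; ++T) {
    double total = 0;
    for (int g = 0; g < GAMES; ++g) {
      int cost = 0, slip;
      do {
        slip = rand() % 100 + 1;
        ++cost;
      } while (slip < T);
      total += slip - cost;
    }
    double avg = total / GAMES;
    if (avg > best) {
      best = avg;
      bestT = T;
    }
  }
  cout << "best threshold " << bestT
       << ", average net winnings about " << best << endl;
}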

2. Finding array pivots

Google had a very interesting process: they sent me an online test at a certain website rhyming with agility (to prevent Google picking this up; the terminology here is mine too). The problems were pretty easy, but I fiffed and faffed with the STL so much that I nearly ran out of thinking time. The most interesting problem was as follows, because it spawned some thoughts and questions.

Posed: given an array $A[0], \ldots, A[n-1]$, determine the array pivots, the points around which the array is already pivoted; that is, the indices $k$ such that $A[j] \le A[k]$ for all $0 \le j \le k$ and $A[j] \ge A[k]$ for all $n > j \ge k$. For example, in $[3, 1, 4, 5]$ the pivots are indices 2 and 3. This might be an extra, Google-provided question, because it is not on the list of questions ***ility says they provide to clients on their site. At any rate, working out and implementing a good algorithm in thirty minutes takes a bit of effort, especially when five of those minutes are spent working out how to initialise a vector<> from iterators, and so on. If you know the various STL containers and their methods reasonably well, you will do fine (I could certainly score much higher now, having written 100 lines of C++ in the last twelve months instead of 0).
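
For the record, the incantation I was hunting for is nothing exotic: every standard container has a range constructor taking a pair of iterators. A trivial reminder (the names here are just for illustration):

#include <vector>
#include <set>
using namespace std;

// Range constructors: build one container from any pair of iterators.
// (Assumes A is non-empty, as in the pivot problem.)
void rangeConstructorReminder(const vector<int> &A) {
  vector<int> copy(A.begin(), A.end());       // a copy of the whole array
  multiset<int> tail(A.begin() + 1, A.end()); // everything after A[0], kept sorted
}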

I present my code here, for various reasons.

#include <vector>
#include <set>
#include <list>
#include <sstream>
#include <iostream>
#include "cycle-min.h"
#include <ctime>
#include <cstdlib>
using namespace std;

/*
 * Our naive method. Not great: does a load of repetitive
 * searching and ends up with O(n^2) complexity.
 */
int arrayPivotNaive(vector<int> &A) {
  int n = A.size();
  if (n < 1) return -1;
  
  int lmax = A[0];
  for (int i = 0; i < n; ++i) {
    // Only a new prefix maximum can possibly be a pivot.
    if (A[i] < lmax) continue;
    lmax = A[i];
    // Check that everything to the right is at least A[i].
    bool rmin = true;
    for (int j = i+1; j < n; ++j) {
      if (A[i] > A[j]) {
        rmin = false;
        break;
      }
    }
    if (rmin)
      return i;
  }
  return -1;
}

/*
 * Roughly what I coded up in the test (the error I made there is
 * fixed below). This still has O(n log n) performance, which I
 * thought was about right.
 */
int arrayPivotBetter(vector<int> &A) {
  int n = A.size();
  if (n < 1) return -1;

  int lmax = A[0];
  multiset<int> rmin(A.begin() + 1, A.end()); // n log n right there.
  for (int i = 0; i < n - 1; ++i) {
    if (A[i] >= lmax) {
      lmax = A[i];
      // A pivot is a prefix maximum that is no bigger than anything
      // still to its right.
      if (A[i] <= *rmin.begin()) return i;
    }
    // A[i+1] moves to the left of the cursor next time round, so drop
    // it from the right-hand multiset. log n; this runs n times.
    rmin.erase(rmin.find(A[i + 1]));
  }

  if (A[n - 1] >= lmax) return n - 1;

  return -1;
}

/*
 * Clearly the optimal answer: O(n) time
 */
int arrayPivotBest(vector<int> &A) {
  int n = A.size();
  if (n < 1) return -1;

  // Right-to-left pass: collect the indices that are suffix minima,
  // i.e. candidates no bigger than anything to their right.
  list<int> possPivots;

  int rmin = A[n - 1];
  for (int i = n - 1; i >= 0; --i) {
    if (A[i] > rmin) continue;
    rmin = A[i];
    possPivots.push_front(i);
  }

  // Left-to-right pass: the first candidate that is also a prefix
  // maximum is the first pivot.
  int lmax = A[0];
  for (int i = 0; i < n; ++i) {
    bool poss = false;
    if (possPivots.front() == i) {
      poss = true;
      possPivots.pop_front();
    }
    if (A[i] < lmax) continue;
    lmax = A[i];
    if (poss) return i;
  }

  return -1;
}

void addTest(vector< vector<int> > &tests, string test) {
  stringstream ss(test);
  vector<int> t;
  int next;
  while (ss >> next)
    t.push_back(next);
  tests.push_back(t);
}

void testComplexity(int (*fun)(vector<int>&)) {
  vector< pair<int, double> > samples;
  long length = 1;
  double t;
  clock_t ts;
  srand(time(NULL));
  do {
    length *= 2;
    vector<int> test;
    // It is hard to pick random input which is not 'kind'; by
    // inspection, the vector [1,2,...,n-1,0] gives worst-case time
    // for the naive algorithm, and unsorted input hits the multiset
    // one badly.
    for (int i = 0; i < length; ++i)
      test.push_back(i == length - 1 ? 0 : i + 1); // or rand() % length + 1
    ts = clock();
    ticks tick0 = getticks();
    (*fun)(test);
    t = elapsed(getticks(), tick0);
    ts = clock() - ts;
    samples.push_back(pair<int, double>(length, t));
    cout << length << "," << t << endl;
  } while (length < 67108865 && ts < 30 * CLOCKS_PER_SEC);
}

int main() {
  vector< vector<int> > tests;
  
  // Some basic conformance tests:
  addTest(tests, "");
  addTest(tests, "1");
  addTest(tests, "1 2");
  addTest(tests, "2 1");
  addTest(tests, "3 5 4");
  addTest(tests, "3 4 2");
  addTest(tests, "3 4 3");
  addTest(tests, "3 4 4");
  addTest(tests, "3 4 5");
  addTest(tests, "3 3 2");
  addTest(tests, "3 3 3");
  addTest(tests, "3 3 4");
  addTest(tests, "3 2 1");
  addTest(tests, "3 2 2");
  addTest(tests, "3 2 3");
  addTest(tests, "3 2 4");
  addTest(tests, "3 1 2");
  cout << "***** Conformance tests *****\n*****\n";
  for (vector< vector<int> >::iterator it = tests.begin(),
                      end = tests.end(); it != end; ++it) {
    cout << (*it).size() << ": ";
    for (vector<int>::iterator val = (*it).begin(),
               valend = (*it).end(); val != valend; ++val) {
      cout << *val << " ";
    }
    cout << " | "
         << arrayPivotNaive(*it) << " "
         << arrayPivotBetter(*it) << " "
         << arrayPivotBest(*it) << " "
         << endl;
  }
  
  // Now we want to check the complexity:
  cout << "\n\n***** Complexity tests *****\n*****\n";
  testComplexity(&arrayPivotNaive);
  testComplexity(&arrayPivotBetter);
  testComplexity(&arrayPivotBest);
}

To test the others, I wrote up the obvious answer in a few minutes; no particular C++ knowledge is required, since it is just a pair of for loops. I don't know why I fixated on the multiset algorithm, but it is the one I went for during the test itself. The third, linear-time version came to me just afterwards and is optimal.

One of the nice things about ***ility is that it marks your code according to various criteria, one of which is how the algorithm scales. As practice, I started to write a test harness to see if I could do it too.

(Incidentally, as Google fodder: I had to hunt for a while to find an accurate way of timing how long a piece of C++ takes to execute, using high-resolution timers rather than the ordinary ctime functions. The answer is extremely easy: Boost has it, or better still there is a file cycle.h to download, which does the magic of detecting your compiler and reading the most accurate cycle counter your processor offers.)

Now, this is the open point: how do you actually work out empirically the performance of an algorithm? It takes more than just random data to work out the worst case running time (because a random array is very unlikely to have any pivots, and the naïve algorithm works this out in linear time). Even assuming you could find a way to heuristically stress test some code as a black box and work out what sort of input it favoured, there is another question I am considering: given some data, can we actually tell the difference between $O(n)$ and $O(n \log n)$?

The different polynomial times can be distinguished very easily, but as discussed here, $\log n$ for our purposes does not change very much. I have done some heavy fiddling around with the curves, to see if perhaps we can distinguish the $O(n)$ and $O(n \log n)$ fits by correlation coefficient, or more sneakily by guessing that the power of $n$ is probably round and using a heuristic to see whether we can identify the shape of the curve that way. My guess, though, is that the real-world difference is too small to give a reliable test. I would be interested to hear if anyone thinks there is a practical way of black-box determining complexity. For reference, these are my data and analysis.
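
To make the curve-fitting idea concrete, here is a minimal sketch of the kind of comparison I have been trying: fit both t = c*n and t = c*n*log(n) to the (length, time) pairs by least squares and see which model leaves the smaller residual. It reads the CSV lines that testComplexity prints; the helper names are mine, and nothing here should be taken as a reliable verdict on the $O(n)$ versus $O(n \log n)$ question.

#include <cmath>
#include <iostream>
#include <vector>
using namespace std;

// Least-squares fit of t ~ c*f(n) (closed form: c = sum(t*f)/sum(f*f));
// returns the sum of squared residuals, so smaller means a better fit.
double fitResidual(const vector<double> &n, const vector<double> &t,
                   double (*f)(double)) {
  double num = 0, den = 0;
  for (size_t i = 0; i < n.size(); ++i) {
    num += t[i] * f(n[i]);
    den += f(n[i]) * f(n[i]);
  }
  double c = num / den, rss = 0;
  for (size_t i = 0; i < n.size(); ++i) {
    double r = t[i] - c * f(n[i]);
    rss += r * r;
  }
  return rss;
}

double linear(double n) { return n; }
double nlogn(double n) { return n * log(n); }

int main() {
  // Feed in the "length,time" lines that testComplexity() prints.
  vector<double> n, t;
  double length, secs;
  char comma;
  while (cin >> length >> comma >> secs) {
    n.push_back(length);
    t.push_back(secs);
  }
  cout << "residual assuming O(n):       " << fitResidual(n, t, linear) << endl;
  cout << "residual assuming O(n log n): " << fitResidual(n, t, nlogn) << endl;
}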