Writing functional code in C++ III – Performance of different allocation schemes

-

Now we know how to rep­re­sent records and we know how to op­er­ate on them us­ing a nice F# like syn­tax. But how do we store our record in a data struc­ture in the first place?

Code for this post is here. Thanks to Andy Sawyer and Steve Bower for re­view­ing this.

As it is of­ten the case, C++ gives you many op­tions that are not avail­able in stan­dard func­tional lan­guages. A mixed bless­ing of some sort.

  1. You can store them by value or by pointer
  2. If you store them by pointer, you can use normal pointers or smart pointers
  3. If you store them by smart pointer, you can use different ones

Well, stor­ing them by value is the sim­plest op­tion, but it has two short­com­ings:

  1. You pay the price of a full record copy whenever the object need to be copied (i.e. when you add/remove records to/from a vector and whenever the vector needs to resize itself)
  2. If you store a record in two containers, you get two different copies of it. If you modify one, the other won’t be modified

The sec­ond is­sue is not a prob­lem in our sce­nario as records, by de­f­i­n­i­tion, are im­mutable and struc­turally equiv­a­lent (aka equal­ity for them does­n’t mean pointer equal­ity). The first is­sue is more tricky. Our records get a copy con­struc­tor by de­fault which does a memcpy’ of the ob­ject. That should be fast. But fast enough?

To an­swer this ques­tion (and more), I em­barked in a lit­tle per­for­mance test that en­com­passes most ways to store a pod type in an vec­tor like data struc­ture. For all my test I used this struct:

struct Record {
    int Id;
    char k1[2];
    char k2[2];
    char k3[2];
    char k4[2];
    char k5[2];
    char k6[2];
    char mem[bigBlock];
    void Lock() {}
    void Unlock() {}
};

By re­com­pil­ing with dif­fer­ent val­ues for bigBlock’ I can test what hap­pens when the record size grows. In all tests a record is ini­tial­ized with this func­tion:

void record_init(Record& r, int i) {
    r.Id = i;
    strcpy(r.k1, "0");
    strcpy(r.k2, "0");
    strcpy(r.k3, "0");
    strcpy(r.k4, "0");
    strcpy(r.k5, "0");
    strcpy(r.k6, "0");
    memset(r.mem, '-', bigBlock);
}

The tests are spe­cific to the sce­nario I care about: per­form­ing func­tional-like op­er­a­tions on a Range:

struct to_int {
    typedef int result_type;
    int operator() (const Record& r) const {
        return r.Id;
    };
};
function<bool (const Record&)> filter_f = [](const Record& r) {return r.Id < filterNo;};
template <class Range>
int accumulate_filter(const Range& r) {
    return boost::accumulate(
        r | filtered(filter_f) | transformed(to_int()),
        0,
        plus<int>());
}

The us­age of a func­tion and  a func­tor is a bit odd. I don’t re­call why I did it that way, but it does­n’t mat­ter as it is the same for each test. What the test does is just fil­ter­ing a bunch of record, trans­form­ing them (map) to ints and sum these ints.

How many rep­e­ti­tions of each test are used, how big is the record, how many records the Range con­tains is spec­i­fied in these con­stants:

const int repetitions = 1000;
const int bigBlock = 1000;
const int howMany = 1000;

Time is kept us­ing this clock that wraps boost::chrono:

typedef boost::chrono::process_cpu_clock the_clock;
struct timer {
    timer(): clock_(the_clock::now()) {}
    the_clock::times elapsed() {
        return (the_clock::now() - clock_).count();
    }
    the_clock::time_point clock_;
};

I have tested the fol­low­ing con­fig­u­ra­tions. Adding records by value:

int normal() {
    vector<Record> v;
    for (int i = 0; i < howMany; ++i) {
        Record r;
        record_init(r, i);
        v.push_back(r);
    }
    return accumulate_filter(v);
}

I then tested adding records us­ing a pod_vec­tor. This is a data struc­ture de­scribed here and in Imperfect C++”. It is a vec­tor that uses as an au­to_buffer as the un­der­ly­ing stor­age. An au­to_buffer is a class that uses stack mem­ory if it needs more than a cer­tain num­ber of bytes spec­i­fied at com­pile time, oth­er­wise it uses heap mem­ory. It works well if you know at com­pile time that most al­lo­ca­tions for some­thing take at most N bytes, but some might need more. This sit­u­a­tion is sur­pris­ingly fre­quent. Unfortunately, your ob­jects need to be POD to use the pod_vec­tor.

I tested it in the case where the stack space is big enough (podVector<howMany*2>) and when it is not (podVector<howMany/10>).

int podVector() {
    pod_vector<Record, size> v;
    for (int i = 0; i < howMany; ++i) {
        Record r;
        record_init(r, i);
        v.push_back(r);
    }
    return accumulate_filter(v);
}

I then tested just al­lo­cat­ing the mem­ory, with­out free­ing it and us­ing boost::pool in it’s local’ form, which means that mem­ory is freed when it goes out of scope (stack-like). The main draw­back of the lat­ter is that you can­not re­turn the con­tainer/​range.

int boostallocator(WhichOne which) {
    boost::pool<> p(sizeof(Record));
    vector<Record*> v;
    Record* r;
    for (int i = 0; i < howMany; ++i) {
        if(which == Boost)
            r = (Record*)p.malloc(); // memory freed at function exit
        else
            r = new Record; // oops, memory leak
        record_init(*r, i);
        v.push_back(r);
    }
    return accumulate_filter(v | indirected);
}

Indirected is needed be­cause we are not talk­ing about point­ers. I also tested var­i­ous vari­a­tions of shared point­ers. First a nor­mal shared_ptr, then a shared_ptr ini­tial­ized with the boost::sin­gle­ton_pool and fi­nally a pointer ini­tial­ized with make_shared.

int pointers(WhichOne which) {
    vector<shared_ptr<Record>> v;
    for (int i = 0; i < howMany; ++i) {
        shared_ptr<Record> r;
        if(which == Normal)
            r = shared_ptr<Record>(new Record);
        else if(which == Boost)
            r = shared_ptr<Record>((Record*)record_pool::malloc(), [](void* p) {record_pool::free(p);});
        else if(which == Make)
            r = make_shared<Record>();
        else throw "This kind of pointer doesn't exist";
        record_init(*r, i);
        v.push_back(r);
    }
    return accumulate_filter(v | indirected);
}

Finally, I used a Loki smart pointer. This is a very el­e­gantly de­signed smart point­ers from  Modern C++ de­sign. You pass as tem­plate pa­ra­me­ters the poli­cies you want your smart pointer to have (aka how it should be­have). I tested it like so:

typedef Loki::SmartPtr<Record,
                 Loki::RefCounted,
                 Loki::DisallowConversion,
                 Loki::AssertCheck,
                 Loki::DefaultSPStorage,
                 LOKI_DEFAULT_CONSTNESS> RecordPtr;

And us­ing the fol­low­ing, slightly more con­vo­luted code:

int lokipointers(WhichOne) {
    vector<RecordPtr> v;
    for (int i = 0; i < howMany; ++i) {
        RecordPtr r = RecordPtr(new Record());
        record_init(*r, i);
        v.push_back(r);
    }
    auto ret = accumulate_filter(v | transformed(RecordPtr::GetPointer) | indirected);
    return ret;
}

The re­sult of my tests are in this table and at this link. I used VS10 and gcc 4.6.2 on a Windows 7 box. The first num­ber in the (S, N) pair at the top of each col­umn rep­re­sents  the size of the record, the sec­ond one rep­re­sents the num­ber of ob­jects in the vec­tor. To ob­tain rea­son­able num­bers, the tests have been run with a dif­fer­ent num­ber of it­er­a­tion for each pair, which means that you can com­pare the re­sults ver­ti­cally, but not hor­i­zon­tally.

Results

There are a lot of in­ter­est­ing things in this table. Here are my take­aways. They are ob­vi­ously spe­cific to this sin­gle sce­nario. You might have dif­fer­ent sce­nar­ios or dif­fer­ent take­aways:

So here you have it. A small de­tour on the per­for­mance of dif­fer­ent schemes for al­lo­cat­ing pod types. As it is of­ten the case in C++, it de­pends …

BTW: Andy Sawyer told me about his rough al­go­rithm to de­cide which STL con­tainer to use. I re­pro­duce it here:

BEGIN

A de­ci­sion tree for se­lect­ing a se­quence con­tainer:

Or, to put it an­other way, use std::deque  un­less there’s a good rea­son not to”.  That’s prob­a­bly an overly-strong state­ment, but it’s a rea­son­able start­ing point.

Naturally, every case will be dif­fer­ent - and no set of rules of thumb’ is go­ing to work every time (this is one of the things which dis­tin­guish rules of thumb from laws of physics). Commonly, we don’t know ahead of time which of those cri­te­ria is go­ing to have the most sig­nif­i­cant im­pact on what we’re do­ing; when I find my­self in that sit­u­a­tion, I’ll use std::deque; once I have a firmer idea of the cri­te­ria, I’ll go back and re­visit that choice (which — quite of­ten — is as sim­ple as chang­ing a type­def). And of course - as with all optimisations’ - mea­sure, mea­sure and mea­sure again!

END

Next time, we’ll go back to more tra­di­tional func­tional pro­gram­ming top­ics.

Tags