BPNN++ is a C++ implementation of a feed-forward back-propagation neural network with at most 2 hidden layers. It ships as a single C++ header file with no external dependencies (other than the STL), and it makes heavy use of C++ templates to be both fast and memory efficient.
The interface exposes 2 methods: one for training the network (train()) and one for querying it (feed_forward()). Here is an example of how to train the network to learn the XOR function:

      #include "bpnn++.hpp"
      ...
      // Try to learn the XOR function
      // network structure: 2 inputs, 1 hidden layer (containing 4 neurons) and one output
      bpnn_1H<2, 4, 1> net(0.2 /* learning rate */, 0.8 /* momentum */); 
      // training data: (ins, outs)
      // note: (double[N]){ ... } compound literals are a GNU extension,
      // so these assignments require GCC or a compatible compiler
      std::vector<row<double, 2> > ins(4);
      ins[0] = (double[2]) { 0, 0 };
      ins[1] = (double[2]) { 0, 1 };
      ins[2] = (double[2]) { 1, 0 };
      ins[3] = (double[2]) { 1, 1 };
      
      std::vector<row<double, 1> > outs(4);
      outs[0] = (double[1]) { 0 };
      outs[1] = (double[1]) { 1 };
      outs[2] = (double[1]) { 1 };
      outs[3] = (double[1]) { 0 };
      
      int epochs = 0;
      double error = 0;
      do {
          // run one training epoch over the whole data set
          error = net.train(ins, outs);
          std::cout << error << std::endl;
          ++epochs;
      } while (error > 0.01);
      std::cout << "Number of epochs: " << epochs << std::endl;
      
      // Now that the neural network is trained, we can test its output
      row<double, 1> o = net.feed_forward(ins[0]);
      o.print(std::cout);
      
      net.print(std::cout); // prints out the weights of the network
bpnn++ is available for download from its SourceForge project page.

Performance Evaluation

The main drawback of using neural networks in machine learning is their slow, time-consuming training phase. For this reason, performance has been (and will remain) the main aspect driving the development of bpnn++. The following table reports measurements of the execution time of a single epoch for several network topologies and training-set sizes (contributions are welcome):

# of Inputs | Hidden Layers | # of Outputs | # of Instances | Compiler           | CPU Clock        | Exec. Time (1 epoch)
19          | 15            | 6            | 3136           | GCC 4.4.0 -O3 -m64 | 2.86 GHz (Intel) | 2 msecs.
19          | 15, 13        | 6            | 3136           | GCC 4.4.0 -O3 -m64 | 2.86 GHz (Intel) | 3 msecs.
...         | ...           | ...          | ...            | ...                | ...              | ...

Known Issues and Bugs