My Bachelor Thesis Sat, Mar 29. 2014
This PDF is my Bachelor thesis (in German); the grade for the thesis is 1.3 (see this entry in Wikipedia for a comparison of grading systems). I am pretty proud of it. Diving into evolutionary algorithms, implementing them from scratch in CUDA C/C++, and applying them within a specific problem domain (portfolio optimization) was a lot of fun.
It was a great opportunity to use the latest tech, C/C++ and CUDA: I was able to achieve a speedup of 22 on a GPU (GTX 560 Ti) compared to a single CPU core. I used LaTeX and the great Texmaker IDE for the text, and Dia and Inkscape for the drawings. Some calculations and plots were created with R and its packages. I used libeigen3 for vector math on a single CPU core (on the GPU I wrote those operations myself).
Now I am just waiting for my actual degree to arrive.

CUDA Single-GPU Debugging (Breakpoints) Fri, Oct 4. 2013
I use a simple GTX 560 Ti and the CUDA 5.5 SDK to write my CUDA code, on Ubuntu Linux 12.04 LTS.
You can develop your CUDA apps and use cuPrintf for debugging (or even printf with the latest CUDA SDKs). That's actually good enough for 90% of all use cases.
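For illustration, here is a minimal sketch of in-kernel printf (the kernel, array and sizes are made up for this example; in-kernel printf needs a device of compute capability 2.0 or higher):

#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: each thread doubles one element and reports it via
// in-kernel printf (compute capability >= 2.0 required).
__global__ void scale(float* v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        v[i] *= 2.0f;
        printf("thread %d: v[%d] = %f\n", i, i, v[i]);
    }
}

int main()
{
    const int n = 8;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    float* dev = 0;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<1, n>>>(dev, n);
    cudaDeviceSynchronize(); // also flushes the kernel's printf buffer

    cudaFree(dev);
    return 0;
}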
However, if you want to set a breakpoint and inspect variables using NVIDIA Nsight on Linux, the debugger can't break into your CUDA kernel code, since your X display is using the graphics card (you can still set breakpoints outside the kernel).
My Solution
I run my X server on another machine and connect to the target machine via XDMCP. I make sure the NVIDIA graphics card isn't being used on the target machine by starting Xvfb (which renders on the CPU) instead of Xorg. As an X server for Windows I recommend VcXsrv. Since I use Window Maker, everything stays pretty lightweight.
The following assumes you are using lightdm as your display manager:
• Enable XDMCP for lightdm via lightdm.conf.
• Install Xvfb.
• Modify your lightdm.conf so it doesn't try to open a local display, but instead just uses Xvfb, by setting a custom xserver-command (see below).
• Now start your X server on Windows (or Linux) and connect to the machine via XDMCP.
This is my lightdm.conf:

[XDMCPServer]
enabled=true

[SeatDefaults]
greeter-session=unity-greeter
user-session=ubuntu
xserver-command=/etc/X11/xinit/xserverrc2

And this is the xserverrc2 file I use:

#!/bin/sh
exec Xvfb :0 -screen 0 1280x1024x24
Incidentally, Nsight for Visual Studio 2010 is pretty cool, but single-GPU debugging doesn't work as expected there either: the display flickers, the debugger breaks on the first breakpoint, but then you can't step, and eventually Windows 7 resets the display driver. I guess most people use a dual-GPU configuration anyway. I am pretty happy I can work at this level using a consumer-level GPU.
Have fun
My BSc Thesis Wed, Sep 11. 2013
Hi everybody,
I have almost completed my BSc Informatik studies (BSc Computer Science), and I am now writing my BSc thesis.
It is called
Parallelisierung von Genetischen Algorithmen für Anwendungen der Finanzwirtschaft
which can be roughly translated to
Parallelization of Genetic Algorithms for Applied Finance
More specifically, the financial application domain is Modern Portfolio Theory and its extensions (integer constraints, transaction costs). See Luenberger, Investment Science, for an overview of the topic, and Maringer, Portfolio Management with Heuristic Optimization, for the extensions.
The target platforms will be "traditional" shared-memory processors (x86) using C/C++, and GPUs using CUDA.
I have three months to complete my thesis, and I am starting right now.
Maven2 version ranges gotcha Thu, May 17. 2012
I was wondering today why my homebrew Java apps didn't compile; the version range resolution somehow didn't work as expected.
This discussion @ Stack Overflow demonstrates the problem.
Are version ranges a bad thing, then? IMHO version ranges are fine in certain cases, e.g. if you require a certain level of API compatibility for your code to work.
But shouldn't 3.0.0-m1 (-m2, -m3) at least match [3.0,)? As it turns out, as soon as a version string "violates" the Maven 2 naming scheme, the whole version string is interpreted as a qualifier:
Failed to resolve artifact.
Couldn't find a version in [2.0.0.m1, 1.0.2, 1.0.3, 1.0.4, 1.0.5, 1.1.0,
1.1.1, 1.1.2, 1.1.3, 1.1.4, 1.1.5, 1.1.6, 2.0.0-m2, 2.0.0-m3, 2.0.0-m4,
2.0.0-release, 2.0.1, 2.0.2, 2.0.3, 2.0.4, 2.0.5, 2.1.0-m1, 2.1.0-m2,
2.1.0-m3, 2.1.0-release, 2.1.1, 2.1.2, 2.1.3, 2.1.4, 2.2.0-m1, 2.2.0-m2,
2.2.0-m3, 2.2.0-release, 2.2.1, 2.2.2, 2.2.3, 3.0.0-m1, 3.0.0-m2, 3.0.0-m3]
to match range [3.0,)
org.datanucleus:datanucleus-core:jar:null
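For reference, a range dependency in the pom.xml looks roughly like this (a minimal sketch, using the artifact from the error above):

<dependency>
  <groupId>org.datanucleus</groupId>
  <artifactId>datanucleus-core</artifactId>
  <!-- [3.0,) means "any version from 3.0 upwards"; note that the
       milestone versions 3.0.0-m1..m3 above do not satisfy it -->
  <version>[3.0,)</version>
</dependency>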
No, I don't want to jump around in a "dirndl" ...yet! Wed, May 9. 2012
Yesterday I blogged about a quicksort rendition as folk dance; today I received spam about "Trachtenmode" and Dirndls. This must be the internet era!
That's great, that's exactly what a bald-headed, 100kg computer-infested programmer / nerd would love to do!! Let's go
(click for the complete image from the "html newsletter")
I can see a Perl script here...
I can even smell it
At least a Perl script is reading my blog.
Quick-sort with Hungarian folk dance Tue, May 8. 2012
That's UBER-fantastic.
After watching this, you can probably easily figure out a way to parallelize this algorithm.
And now you can measure quicksort in calories burnt.
Remember O(n log n)? You'd better...
I am now a Certified Qt Developer ... Fri, May 4. 2012
Yeah, today I passed the Qt Essentials Exam, and therefore I am a Nokia Certified Qt Developer now.
Please, stay seated!
Alan Watt's 3D Computer Graphics: Errors Tue, Nov 22. 2011
I really like Alan Watt's book as a general introduction to the topic; however, there's a big error on page 7 (3rd edition), where the basics of matrix operations are described. Rotation around an arbitrary point in 2D space is done by translating that point to the origin (T1), rotating (R), then translating back (T2). Since Watt's book uses column vectors (like OpenGL, as opposed to the row vectors of DirectX), that's basically T2 R T1 x for a given vector x. Watt's book, however, shows T1 R T2 x.
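In homogeneous coordinates (my notation, not the book's), with rotation point p and angle \theta, the correct composition is:

x' = T_2\, R\, T_1\, x, \qquad
T_1 = \begin{pmatrix} 1 & 0 & -p_x \\ 0 & 1 & -p_y \\ 0 & 0 & 1 \end{pmatrix}, \quad
R = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
T_2 = \begin{pmatrix} 1 & 0 & p_x \\ 0 & 1 & p_y \\ 0 & 0 & 1 \end{pmatrix}

Reading right to left (column vectors!): T_1 moves p to the origin, R rotates, and T_2 moves everything back.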
Still a good book.

"Beating" the linux standard quicksort (glibc) Thu, Nov 17. 2011
I am programming parallel sorting algorithms with MPI / C++ on computer clusters right now, and therefore had to implement qsort in a serial fashion first (before creating a parallelized version of it).
For arrays of integers, the following "pedestrian" C-code implementation beats the built-in quicksort implementation (the qsort defined by ISO C) on Linux by being 58% faster:
#include <cstdlib> // for rand()

inline void swap(int* p1, int* p2) {
    int tmp = *p1;
    *p1 = *p2;
    *p2 = tmp;
}

// Lomuto-style partition: places the pivot element at its final
// position and returns that position (relative to start).
inline int divide(int* start, int* end, int pivotIndex) {
    int len = end - start + 1;
    int pivot = start[pivotIndex];
    int storeIndex = 0;
    swap(&start[pivotIndex], &start[len-1]); // park the pivot at the end
    for (int i = 0; i < len-1; i++) {
        if (start[i] < pivot) {
            swap(&start[i], &start[storeIndex++]);
        }
    }
    swap(&start[storeIndex], &start[len-1]); // move the pivot into place
    return storeIndex;
}

inline void mysort(int* start, int* end) {
    int len = end - start + 1;
    if (start >= end || len == 1) {
        return;
    }
    int pivotIndex = rand() % len; // random pivot guards against degenerate inputs
    int newPivotIndex = divide(start, end, pivotIndex);
    mysort(&start[0], &start[newPivotIndex-1]); // sort the left part
    mysort(&start[newPivotIndex+1], end);       // sort the right part
}
These are my results for a randomized array of integers:
Sorting 57 MB
My sort: 3694 msec
Quicksort: 5861 msec
Intel(R) Pentium(R) D CPU 3.00GHz (Presler)
2GB RAM
Linux 2.6.32-5-amd64 x86_64 GNU/Linux
g++ (Debian 4.4.5-8) 4.4.5
glibc 2.11.2
Not bad. The reason? Inlining. Or rather, missing inlining. For glibc's qsort, the comparison function (which I provide) can't be inlined, since the code for qsort has already been compiled; it lives in glibc. In my implementation, by contrast, the comparison sits directly inside the divide function.
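For contrast, here is a minimal sketch of the glibc path: the comparison function crosses a compilation boundary as a function pointer, so qsort has to call it indirectly for every single comparison.

#include <stdlib.h>

/* Comparator for qsort: invoked through a function pointer,
 * so the compiler cannot inline it into glibc's sorting loop. */
static int compare_ints(const void* a, const void* b)
{
    int x = *(const int*)a;
    int y = *(const int*)b;
    return (x > y) - (x < y); /* avoids the overflow risk of x - y */
}

/* usage: qsort(array, len, sizeof(int), compare_ints); */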
The complete file, for those interested: main_sort.cpp. For large arrays, be sure to set the maximum stack size accordingly (e.g. via ulimit -s).
On Windows, by the way, the qsort provided by VS 2010 is twice as fast as my implementation!
PS: I know this isn't real "beating", because it doesn't improve the algorithmic complexity.
Latest NVIDIA 285.58 WHQL + Quadcore + Borderlands == Deadlock Sun, Nov 13. 2011
Hi Folks,
Problem
I downloaded the latest NVIDIA drivers for my 8800GT. Everything worked fine until I tried to start Borderlands.
I was pretty busy lately - as always - and wanted to relax for an hour or so.
It hung at the splash screen. Damn. It really hung. CPU at 0%, no IO being done. Nothing.
Argh...
I attached my VS.NET 2010 debugger to that process and took a glimpse.

Hmm, nothing interesting. WaitForSingleObject basically means waiting for ownership of a mutex or a Win32 kernel event. A deadlock? It seems some optimizations in the latest NVIDIA driver exposed a vulnerability to deadlocks in the game code. It's probably not NVIDIA's fault.
Anyway, the other threads didn't provide any useful info. And actually, I had no time; I just wanted to take an hour or two off and relax. So I decided to work around it.
This is my quick hack
I decided to forcibly serialize / "single-thread" the execution:
I simply started Borderlands from Steam, and as soon as the process appeared in the task manager, I changed the affinity for that process to a single CPU core.
This won't make the execution single-threaded, but it reduces the probability of the deadlock, because there is no hardware parallelism anymore with respect to CPU cores.
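The same trick can also be done programmatically; here is a minimal Win32 sketch, assuming you already know the PID (e.g. from the task manager) - pin_to_first_core is my made-up helper name:

#include <windows.h>

/* Pin a running process to the first logical CPU. */
int pin_to_first_core(DWORD pid)
{
    HANDLE h = OpenProcess(PROCESS_SET_INFORMATION | PROCESS_QUERY_INFORMATION,
                           FALSE, pid);
    if (h == NULL)
        return -1;
    /* Affinity mask 0x1: only CPU 0 may run this process. */
    BOOL ok = SetProcessAffinityMask(h, 0x1);
    CloseHandle(h);
    return ok ? 0 : -1;
}

/* Once the game is past the splash screen, restore the full mask:
 * query the system mask via GetProcessAffinityMask and set it back. */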

You have to be quick:
If the splash screen appears, it's too late.

Then the game started normally and made it past the splash screen... I immediately activated all cores again.
If you have an SSD, this may be difficult :-/ Anyway, I could continue this way.
And off I went, blasting away some psychos.
REMEMBER: If it took more than one shot, you weren't using a Jakobs
Hadoop vs HPC. Tue, Oct 18. 2011
I took a quick glance at the 2009 sortbenchmark.org results, where Hadoop took first place.
The benchmark category is called GraySort
Metric: Sort rate (TBs / minute) achieved while sorting a very large amount of data (currently 100 TB minimum).
This is how it looked:

73 times more nodes!?
(the raw CPU power of the nodes is roughly comparable - both dual quad-core)
Even though the benchmarks themselves aren't 100% identical (Triton was also sorting 100 terabytes, though), I think it's remarkable.
Sometimes you just have to look at the numbers.
Normally, the database isn't the bottleneck. Tue, Oct 18. 2011
A lot of people like to point out that code efficiency for webapps isn't relevant nowadays; code can be slow, even glacially slow, given that the webapp mostly waits for the database anyway.
This, quite frankly, is wrong almost all of the time.
Let's take a look at writes (the reading problem can be solved by some kind of caching strategy; once you have almost everything cached, you are measuring dictionary/map performance).
I invite you to turn on statement logging on your database and capture the SQL DML that's being emitted by your favourite webapp framework for a specific write use case (including the transaction boundaries); turn logging off again afterwards for maximum performance. Typically your web framework will be capable of handling 5-20 non-cacheable user requests/sec that result in direct write requests to the database (measure with httperf or whatever). Now shut down your webapp and run those queries directly against your database, using the correct transaction boundaries. Then run the queries in parallel. Get some solid numbers.
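For example, replaying a captured write transaction in psql could look like this (a minimal sketch; PostgreSQL is assumed, and the orders/customers tables are hypothetical):

-- Enable per-statement timing in psql:
\timing on

-- Replay the captured transaction verbatim:
BEGIN;
INSERT INTO orders (customer_id, total) VALUES (42, 19.99);
UPDATE customers SET last_order_at = now() WHERE id = 42;
COMMIT;

Run several of these in parallel (e.g. from a few shells) and you have your baseline.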
You may discover that the transaction and its queries complete in a few milliseconds, and that you get a lot more requests/s out of the DB right away, from a plain DB perspective on a developer machine. Now who's bottlenecking? Certainly not your DB.
If your queries really do run slowly, the first step is to ensure correct DB design and configuration. Your DB design must be sane (2NF/3NF or BCNF) and indices must be set correctly. Technical issues matter too: full table locks with MyISAM, string operators and functions in general, LIKE queries, etc. And yes, of course you will have to be able to optimize or circumvent really "hard" queries. At the end of the day, with sane database design, you can get really decent performance on a totally normal, average box. And the database will prevail: a whole lot of other things will die before that CUSTOMERS table ceases to exist.
Of course, disc IO is the upper limit for DB IO (if the DB really fsyncs on every transaction), but even with standard 7200 RPM discs on developer machines you can get fast performance with correct DB design.
But don't tell me your average webapp is only capable of handling just a few requests per second because of your database.
It's highly probable that everything else is preventing you from achieving higher throughput...
...Your ORM that thinks it's an object runtime
...Your favourite programming language
...Your beloved ultimate framework
...Your own application code
...A missing caching strategy
...And a whole lot more
PS
I am not talking about ultra-high-traffic sites here, like Facebook; I am talking about normal webapps with moderate traffic. It makes me wonder why everybody tries to solve the problems Facebook has to solve (millions of users and gazillions of hits)...
PPS
If in doubt, measure.
Howto: Make a clear statement in a FAQ Mon, Oct 17. 2011
Vym (View your Mind) Mindmapping tool for Windows Sat, Oct 15. 2011
Hi Folks,
I wanted to do some mindmapping on Windows and liked vym. There is no official Windows port available, and the unofficial build I found wasn't able to save my file to the Desktop (I hadn't read the known issues: no spaces allowed in the path). So I had to use Freemind, which is just your typical huge, clunky Java "app". I bit the bullet and created the 2 diagrams I needed for my parallel programming course at FernUni Hagen. After studying I went to work, came back home in the evening, downloaded the vym sources, and built a version with the current Qt SDK. Some minor problems with a QDBus dependency, fixing stuff while compiling: 10 minutes.
Then I converted my 2 diagrams and erased Freemind from my HD.
Problem. Solved.
A page with further info and a download (including GPL source code) can be found here:
Vym (View your mind) for Windows
Have fun!