Shayan's Software & Technology

My adventure as a Software Engineer continues..,

Parallel Programming: Profiling the Open-Source Image Processing Application CImg

Overview

The purpose of this experiment is to determine if the “Filled Triangles” function inside the CImg_demo.cpp file which can be found in the CImg Open Source Image Processing Library is suitable and worthwhile to optimize. The optimization will involve using the GPU to handle some tasks and by using Parallel Programming Techniques. This optimization would be done using nVidia’s CUDA Framework on a CUDA enabled GPU.

The advantage the GPU has over the CPU is the fact that it has many cores. Modern CPU’s only have about 7(Intel Core i7 comes to mind). The efficiency is utilizing the many core chip which may have over 100s of cores to run a specialized thread for each core and thus performing tasks in parallel and increasing overall efficiency of the application. This reduces waiting time of threads to execute which is the case in most serial applications. Determining what can be executed in parallel requires some analysis which is what we will be doing shortly.

Here’s a Link to the Open Source CImg Image Processing Framework:

http://cimg.sourceforge.net/

Here’s a Link to my CImg GitHub Repository:

https://github.com/ShayanZafar/CImg

The Code – Filled Triangles

This function when executed will produce colorful rotating and moving Triangles of various sizes on the screen.

In order to determine whether or not an application is eligible for an upgrade we need to determine the current running time of the application and apply Amdahl’s Law to determine if any speedup is achievable based on the results. The function I will be looking at is locating in Cimg/examples/CImg_demo.cpp and it is called ‘Filled Triangles’. I have modified this code to generate the triangles for 1000 iterations of the while loop. The primary reason for this was due to the fact that this application originally required user input to determine when to start and stop. I had to modify it to work automatically when executed and terminate after this given condition is met. Here is a version of that code, I have modified it to extract some basic timing data:


// Include static image data, so that the exe does not depend on external image files.
#include "img/CImg_demo.h"
#include
#include

using namespace std;
// Include CImg library header.
#include "CImg.h"
using namespace cimg_library;
int main() {

// start timing
time_t ts, te, l1s, l1e, l2s;
ts = time(NULL);

// Create a colored 640x480 background image which consists of different color shades.
CImg background(640,480,1,3);
cimg_forXY(background,x,y) background.fillC(x,y,0,
x*std::cos(6.0*y/background.height()) + y*std::sin(9.0*x/background.width()),
x*std::sin(8.0*y/background.height()) - y*std::cos(11.0*x/background.width()),
x*std::cos(13.0*y/background.height()) - y*std::sin(8.0*x/background.width()));
background.normalize(0,180);

// Init images and create display window.
CImg img0(background), img;
unsigned char white[] = { 255, 255, 255 }, color[100][3];
CImgDisplay disp(img0,"[#6] - Filled Triangles (Click to shrink)");

// Define random properties (pos, size, colors, ..) for all triangles that will be displayed.
float posx[100], posy[100], rayon[100], angle[100], veloc[100], opacity[100];
int num = 1;
std::srand((unsigned int)time(0));

l1s = time(NULL);
for (int k = 0; k<100; ++k) {
posx[k] = (float)(cimg::rand()*img0.width());
posy[k] = (float)(cimg::rand()*img0.height());
rayon[k] = (float)(10 + cimg::rand()*50);
angle[k] = (float)(cimg::rand()*360);
veloc[k] = (float)(cimg::rand()*20 - 10);
color[k][0] = (unsigned char)(cimg::rand()*255);
color[k][1] = (unsigned char)(cimg::rand()*255);
color[k][2] = (unsigned char)(cimg::rand()*255);
opacity[k] = (float)(0.3 + 1.5*cimg::rand());
}
l1e = time(NULL);
// elapsed time
cout << setprecision(3);
cout << "Elapsed time : " << difftime(l1e, l1s) << endl;

// measuring time it takes for triangle animations in 1000 iterations
int i = 0;

l2s = time(NULL);
// Start animation loop.
while (!disp.is_closed() && !disp.is_keyQ() && !disp.is_keyESC() && i < 1000) {
img = img0;

i++;
  // Draw each triangle on the background image.
        for (int k = 0; k<num; ++k) {
            const int
            x0 = (int)(posx[k] + rayon[k]*std::cos(angle[k]*cimg::PI/180)),
            y0 = (int)(posy[k] + rayon[k]*std::sin(angle[k]*cimg::PI/180)),
            x1 = (int)(posx[k] + rayon[k]*std::cos((angle[k] + 120)*cimg::PI/180)),
            y1 = (int)(posy[k] + rayon[k]*std::sin((angle[k] + 120)*cimg::PI/180)),
            x2 = (int)(posx[k] + rayon[k]*std::cos((angle[k] + 240)*cimg::PI/180)),
            y2 = (int)(posy[k] + rayon[k]*std::sin((angle[k] + 240)*cimg::PI/180));
            if (k%10) img.draw_triangle(x0,y0,x1,y1,x2,y2,color[k],opacity[k]);
            else img.draw_triangle(x0,y0,x1,y1,x2,y2,img0,0,0,img0.width()-1,0,0,img.height()-1,opacity[k]);
            img.draw_triangle(x0,y0,x1,y1,x2,y2,white,opacity[k],~0U);

            // Make the triangles rotate, and check for mouse click event.
            // (to make triangles collapse or join).
            angle[k]+=veloc[k];
            if (disp.mouse_x()>0 && disp.mouse_y()>0) {
                float u = disp.mouse_x() - posx[k], v = disp.mouse_y() - posy[k];
                if (disp.button()) { u = -u; v = -v; }
                posx[k]-=0.03f*u, posy[k]-=0.03f*v;
                if (posx[k]<0 || posx[k]>=img.width()) posx[k] = (float)(cimg::rand()*img.width());
                if (posy[k]<0 || posy[k]>=img.height()) posy[k] = (float)(cimg::rand()*img.height());
            }
        }

// Display current animation framerate, and refresh display window.
img.draw_text(5,5,"%u frames/s",white,0,0.5f,13,(unsigned int)disp.frames_per_second());
img0.resize(disp.display(img).resize(false).wait(20));
if (++num>100) num = 100;

// Allow the user to toggle fullscreen mode, by pressing CTRL+F.
if (disp.is_keyCTRLLEFT() && disp.is_keyF()) disp.resize(640,480,false).toggle_fullscreen(false);
}
te = time(NULL);

// elapsed time
cout << setprecision(3);
cout << "Drawing Triangles Loop: "<< difftime(te, l2s) << "Elapsed time : " << difftime(te, ts) << endl;

return 0;
}

Potential Candidates

Upon analyzing this function I discovered two possible areas where I could optimize the code using threads sent to the GPU. The first is a for loop which sets the attributes for 100 triangles in serial. This task can be done in parallel using 100 threads on the GPU.

for (int k = 0; k<100; ++k) {
        posx[k] = (float)(cimg::rand()*img0.width());
        posy[k] = (float)(cimg::rand()*img0.height());
        rayon[k] = (float)(10 + cimg::rand()*50);
        angle[k] = (float)(cimg::rand()*360);
        veloc[k] = (float)(cimg::rand()*20 - 10);
        color[k][0] = (unsigned char)(cimg::rand()*255);
        color[k][1] = (unsigned char)(cimg::rand()*255);
        color[k][2] = (unsigned char)(cimg::rand()*255);
        opacity[k] = (float)(0.3 + 1.5*cimg::rand());
    }

The second instance where this is possible is a bit tricky. It involves another serial for loop. The purpose of this loop is to draw each of the triangles on the screen and manipulate them later on. I am not 100 percent sure this can be done in parallel in practice but in theory it should be possible because the application is drawing out each triangle one by one.

  // Draw each triangle on the background image.
        for (int k = 0; k<num; ++k) {
            const int
            x0 = (int)(posx[k] + rayon[k]*std::cos(angle[k]*cimg::PI/180)),
            y0 = (int)(posy[k] + rayon[k]*std::sin(angle[k]*cimg::PI/180)),
            x1 = (int)(posx[k] + rayon[k]*std::cos((angle[k] + 120)*cimg::PI/180)),
            y1 = (int)(posy[k] + rayon[k]*std::sin((angle[k] + 120)*cimg::PI/180)),
            x2 = (int)(posx[k] + rayon[k]*std::cos((angle[k] + 240)*cimg::PI/180)),
            y2 = (int)(posy[k] + rayon[k]*std::sin((angle[k] + 240)*cimg::PI/180));
            if (k%10) img.draw_triangle(x0,y0,x1,y1,x2,y2,color[k],opacity[k]);
            else img.draw_triangle(x0,y0,x1,y1,x2,y2,img0,0,0,img0.width()-1,0,0,img.height()-1,opacity[k]);
            img.draw_triangle(x0,y0,x1,y1,x2,y2,white,opacity[k],~0U);

            // Make the triangles rotate, and check for mouse click event.
            // (to make triangles collapse or join).
            angle[k]+=veloc[k];
            if (disp.mouse_x()>0 && disp.mouse_y()>0) {
                float u = disp.mouse_x() - posx[k], v = disp.mouse_y() - posy[k];
                if (disp.button()) { u = -u; v = -v; }
                posx[k]-=0.03f*u, posy[k]-=0.03f*v;
                if (posx[k]<0 || posx[k]>=img.width()) posx[k] = (float)(cimg::rand()*img.width());
                if (posy[k]<0 || posy[k]>=img.height()) posy[k] = (float)(cimg::rand()*img.height());
            }
        }

Profiling The Application

In order to profile I will be using the gprof GNU profiler. In order to profile this application I need to insert ‘-pg -g’ into the compilation statement of G++. and “-pg” into the linker ld call. It should be noted that gprof will not work on a mac machine with an Intel processor. I found this out the hard and annoying way.

After the application has compiled. It is time to Run the program, I named my executable CImg_demo. After running the program run the following command:

gprof -b -p <name of exe> > <name of new flat file to be generated>.flt

This is my profiling output:

Flat profile:

Each sample counts as 0.01 seconds.

  %   cumulative   self              self     total           

 time   seconds   seconds    calls  us/call  us/call  name    

 82.26      2.55     2.55  4820368     0.53     0.53  frame_dummy

 12.26      2.93     0.38                           draw_line

  2.58      3.01     0.08    10965     7.30     7.33  draw_image

  1.94      3.07     0.06                           draw_triangle

  0.32      3.08     0.01   298115     0.03     0.03  cimg_library::CImg

  0.32      3.09     0.01                             item_3d_reflection()

  0.32      3.10     0.01                             fillC

  0.00      3.10     0.00    14040     0.00     0.00  assign

  0.00      3.10     0.00    13270     0.00     0.00  assign

  0.00      3.10     0.00     9053     0.00     0.00  cimg_library::cimg::X11_attr()

  0.00      3.10     0.00     7130     0.00     0.00  ~CImg()

  0.00      3.10     0.00     2305     0.00     0.00move_to(cimg_library::CImg<float>&)

  0.00      3.10     0.00     1793     0.00     0.00  assign

  0.00      3.10     0.00     1024     0.00     0.00  cimg_library::CImg

  0.00      3.10     0.00     1000     0.00     0.53  render

  0.00      3.10     0.00     1000     0.00    80.40  _draw_text

  0.00      3.10     0.00     1000     0.00     0.00  assign

  0.00      3.10     0.00     1000     0.00    81.11  cimg_library::

  0.00      3.10     0.00      769     0.00     0.70  cimg_library::CImg

  0.00      3.10     0.00      769     0.00     0.00  cimg_library::CImg

  0.00      3.10     0.00      769     0.00     0.70  cimg_library::CImg

  0.00      3.10     0.00      768     0.00     0.00  cimg_library::CImg

  0.00      3.10     0.00      702     0.00     0.00  cimg_library::CImg

  0.00      3.10     0.00      513     0.00     0.00  cimg_library::CImg

  0.00      3.10     0.00      512     0.00     0.00  cimg_library::CImg

  0.00      3.10     0.00      189     0.00     0.00  cimg_library::CImg

  0.00      3.10     0.00      189     0.00     0.00  cimg_library::CImg

  0.00      3.10     0.00       67     0.00     0.00  cimg_library::CImg

  0.00      3.10     0.00        6     0.00     0.00  cimg_library::CImgList

  0.00      3.10     0.00        3     0.00     0.35  cimg_library::CImgDisplay

  0.00      3.10     0.00        2     0.00     0.00  cimg_library::CImgList
  0.00      3.10     0.00        2     0.00     0.00  cimg_library::CImgList

  0.00      3.10     0.00        1     0.00     0.00_GLOBAL__sub_I__Z22item_blurring_gradientv

  0.00      3.10     0.00        1     0.00     0.00 CImgDisplay::_map_window()

  0.00      3.10     0.00        1     0.00     0.00  CImgDisplay::_assign

  0.00      3.10     0.00        1     0.00     0.00  CImg<unsigned char>::~CImg()

  0.00      3.10     0.00        1     0.00   136.13 CImgList<float>::font

  0.00      3.10     0.00        1     0.00   136.13  CImgList<float>::_font

  0.00      3.10     0.00        1     0.00     0.00  CImgList<float>& cimg_library::CImgList

  0.00      3.10     0.00        1     0.00     0.00  CImgList<float>::CImgList

  0.00      3.10     0.00        1     0.00   136.13  CImgList

Summary of Findings

The execution of the program takes roughly 3.10 – 20 seconds (depending on how long you are measuring the calculations of triangle animations). it should be noted that this application initially was an application that relied upon user input for execution and for termination. I have modified this initial behavior by ensuring the while loop (which generates the triangles) executes only for a maximum of 1000 iterations. The time measured in this assignment is for every 1000 iterations of this loop.

 Profiling Results

The results if the initial profile shows that the execution time is most greatly consumed when drawing the triangles out to the screen one at a time. It seems like this can be optimized by offloading this drawing to n threads based on n triangles to be drawn. But this is subject to change because of any additional complexity that may be introduced that may include interoperability with both the GPU and CPU.

There is another for loop which sets the dimensions for each triangle one by one in linear time O(n ). This process can also be out-sourced to the GPU in n threads for n triangles. I would need to determine if this process also involves interoperability between the CPU and GPU.

The complexity of the entire program is O(n^3). There is a for loop for setup, a while loop for accepting user input and another for loop for drawing the triangles.

Also the times recorded can be increase if the maximum loop iterations increase ie: 10000,100000,1000000. This will identify the same relationship but with higher task time.

Amdahls Law Calculations

Amdahls Law measures the potential efficiency that can be achieved by adding numerous cores to an existing application that only uses 1 core in the CPU.

Since there are 100 Triangles generated then we can theoretically create 100 threads for each triangle. The draw_line, draw_triangle, and draw_image functions take up 16 percent(0.38 + 0.08 + 0.06 / 3.10) of the execution time of the application. Plugging that into the equation using 100 cores we get:

S100 = 1/ 1 – 0.16 + 0.16 / 100

= 1.18 or 1.2 speedup is theoretically achievable rounded up PER 1000 iterations of the while loop to draw these triangles.

Will I work on this Project? If I can optimize this function or any other function within the CImg library I will continue with this project. If it is not possible to optimize this project within the given time of the course then it will be difficult to continue on with this project and I will have to work with someone else’s project. But my initial plan is to continue with this project unless I am told otherwise.

 Issues Encountered

The profiling tool gprof does not work on the macbooks with an Intel processor installed (I have Intel Core i5). This was verified by numerous internet resources and annoying personal experience.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: