Friday, March 21, 2014

Mutex Vs Binary Semaphore

Most of the time people get confused between these two. So here I will try to explain the difference, using this answer at Stack Overflow.

A mutex is a lock that has an owner association, meaning only whoever holds the mutex lock is able to unlock it. So a mutex is used when you want to lock a resource.

A binary semaphore, on the other hand, is used as a signalling mechanism; it doesn't lock the resource. Any other thread can unlock (post) the semaphore your thread is waiting on. In other words, it signals your thread to do something.
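
Here is a minimal POSIX sketch of both ideas in C (the function names critical_section, producer and consumer are just for illustration):

#include <pthread.h>
#include <semaphore.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t ready;                 /* initialise once with sem_init(&ready, 0, 0) */

/* Mutex: the thread that takes the lock must be the one to release it. */
static void critical_section(void)
{
    pthread_mutex_lock(&lock);      /* this thread now owns the lock */
    /* ... use the shared resource ... */
    pthread_mutex_unlock(&lock);    /* only the owner should unlock */
}

/* Binary semaphore: one thread signals, a different thread proceeds. */
static void producer(void)
{
    sem_post(&ready);               /* signal: data is ready */
}

static void consumer(void)
{
    sem_wait(&ready);               /* block until some other thread posts */
}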

Thursday, March 20, 2014

GDB : Debugger for a simple C file

Hi,

This may be an off-track article for the kernel, but it can be handy if you have a single file.
GDB is a debugger utility on Linux for debugging C/C++ programs.

Step 1 : First, enable your program for debugging:
In your compilation command, use the -g option (it adds debug symbols), for example:
gcc -g program.c -o program
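
As a running example for the later steps, here is a tiny, made-up program.c (your own file will obviously differ, as will the line numbers referenced below):

/* program.c */
#include <stdio.h>

int square(int n)
{
    return n * n;                         /* a good spot for a break point */
}

int main(void)
{
    int i;
    for (i = 3; i >= 0; i--)              /* try a watchpoint on i */
        printf("%d squared is %d\n", i, square(i));
    return 0;
}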

Step 2 : Now to run the utility, just type gdb on the shell. It will give you a (gdb) prompt.

Step 3 : Load the file which you want to debug (note that file takes the compiled executable, not the .c source):
             (gdb) file program

Step 4 : Put a break point at a line:
             (gdb) break filename.c:linenumber
             Or put a break point at a function:
             (gdb) break func_name

Step 5 : Type run. It will run till it reaches your first break point.

Step 6 : Now suppose you want to check the value of some variable; use print:
             (gdb) print x

Step 7 : If you want to go straight to the next break point, just type
             (gdb) continue

Step 8 : If you want to go step by step, use either the next or the step command.
             (gdb) next
                 or
             (gdb) step
             The only difference is that next treats any sub-routine call as a single instruction (it steps over it), while step goes into the sub-routine.

Apart from break points there is one more concept, the watch point. It works on a variable instead of a function or line number: it stops execution when the value of the variable on which the watchpoint is set changes. The syntax is
(gdb) watch variable_name

There are other commands also:

1. backtrace - produces a stack trace of the function calls that led to a seg fault (should remind you of Java exceptions)
2. where - same as backtrace; you can think of this version as working even when you're still in the middle of the program
3. finish - runs until the current function is finished
4. delete - deletes a specified breakpoint
5. info breakpoints - shows information about all declared breakpoints

We can also use conditions in breakpoints. For example, if you want to break only when i < 1, do it like this:
(gdb) break program.c:7 if i < 1

Tuesday, March 18, 2014

Build kernel from source code

If you are a newbie to the kernel, it is only a matter of time before you have to download the kernel source and build it. Although I have been working in the Linux kernel domain for 2 years, I never did it earlier, so today I finally did it because of eudyptula-challenge.org.

So here are the steps you should follow to build your kernel.

Step 1 : Download the source code from git.

Step 2 : Set up the config file. It is the file which tells the build system which configuration options need to be enabled. The easiest way to create a config file (based on the modules currently loaded) is: sudo make localmodconfig

Step 3 : Further, if you want to change anything in the config file, use sudo make menuconfig. Now change any configuration you want; you can select y (built-in), n (disabled) or m (module).

Step 4 : Now it's time to build your kernel. You can use either make, or make deb-pkg to get installable Debian packages.

Step 5 : To install the image, 
- If you used make earlier, then call sudo make install.
- If you used make deb-pkg, then install the image first:
sudo dpkg -i linux-image-3.14.0-rc6-00145-ga4ecdf8_3.14.0-rc6-00145-ga4ecdf8-8_i386.deb
Then install the headers: sudo dpkg -i linux-headers-3.14.0-rc6-00145-ga4ecdf8_3.14.0-rc6-00145-ga4ecdf8-8_i386.deb

You can replace the version accordingly in the above commands.

After a reboot, you will be able to see the new Linux kernel in your GRUB list.
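
For reference, the whole sequence looks like this, with <version> standing in for whatever version string your build produces:

sudo make localmodconfig
sudo make menuconfig            (optional tweaks)
make deb-pkg
sudo dpkg -i linux-image-<version>.deb
sudo dpkg -i linux-headers-<version>.deb
sudo reboot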




ARM 3-Stage Pipeline

What is Pipeline?
A pipeline is used to improve the performance of the overall system by allowing multiple instructions (each in a different stage) to proceed in parallel. So first understand the 3 stages of the pipeline:

Fetch --------> Decode --------> Execute

As per their names, in Fetch we fetch the instruction, in Decode we decode it, and finally we execute it in the Execute stage.

Keeping these 3 stages in mind: whenever instruction 1 is in the Execute stage (as it started first), instruction 2 will be in the Decode stage and instruction 3 will be in the Fetch stage, as shown below.
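
To make the overlap concrete, here is how three instructions flow through the pipeline, cycle by cycle:

Cycle:        1        2        3        4        5
Instr 1:    Fetch    Decode   Execute
Instr 2:             Fetch    Decode   Execute
Instr 3:                      Fetch    Decode   Execute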

So if you look at it this way, there is no improvement for a single instruction, as every instruction still takes the same 3 cycles to complete. But the overall throughput is improved, because once the pipeline is full, one instruction finishes every cycle.

Problem with pipeline:
Like all other good things in the world, the pipeline also has its share of cons. A pipeline creates a problem when there is a branch instruction, because when the branch reaches the Execute stage, the processor has to go to some other instruction instead of the instructions it fetched and decoded while the branch was still in the Fetch/Decode stages. I know it may be confusing, so let's understand this with an example:

[As I don't know assembly language, the instruction syntax may differ from what you expect]
1. ADD R1,R2,R3 [Add R2, R3 and keep it in R1]
2. ADD R2,R3,R4
3. SUB R3, R4,R5
4. JMP X
5. ADD R2,R3,R4
6. SUB R3, R4,R5
7. X:
8. ADD R1,R2,R3


So you can see that when instruction 4 (JMP) is in the Execute stage, instruction 5 will be in the Decode stage and instruction 6 in the Fetch stage. But as instruction 4 is a jump statement, there is no use in executing instructions 5 and 6.
           In that case the pipeline will be flushed, and the next 2 cycles will be wasted: the instruction at label X still needs to go through Fetch and Decode, and as there is no eligible instruction in the pipeline in the meantime, nothing completes during those cycles.

So it is not guaranteed that a pipeline will always improve performance.

Branch Prediction:
To overcome this problem caused by the pipeline, architects came up with branch prediction. It tries to predict which instruction will be executed next. There are different ways to achieve this; I am not covering those here. You can refer to Wikipedia or the ARM Information Center.

Sunday, March 16, 2014

Cache Management in ARM

What is Cache?
Cache is a place where a processor can keep data and instructions for fast access.

Why we need Cache?
As we know, accessing memory to fetch instructions and data is much slower than the processor clock, so fetching any instruction from memory can take multiple clock cycles. To improve this scenario, ARM provides the concept of a cache: a small, fast block of memory which sits between the processor core and main memory and holds copies of items in main memory. That way we have an intermediate small memory which is faster than main memory.

How it works?
If we have a cache in our system, there must be a cache controller. The cache controller is hardware which manages the cache without the knowledge of the programmer. Whenever the processor wants to read or write anything, it first checks the cache; this is called a cache lookup. If the item is there (a cache hit), the result is returned to the processor immediately. If the required instruction/data is not there, it is known as a cache miss, and the request is forwarded to main memory.
                            Apart from the cache itself, there is one more thing in the core: a write buffer. The write buffer saves further time for the processor. Suppose the processor wants to execute a store instruction. It can put the relevant information into the write buffer, such as which location to write, which data to copy, and the size of the data being copied; after that the processor is free to take up the next instruction. The write buffer will then put this data into memory.

Levels of Cache:
There are generally two caches (L1 and L2); some architectures have only one cache (L1).
L1 caches are typically connected directly to the core logic that fetches instructions and handles load and store instructions. These are Harvard caches, that is, there are separate caches for instructions and for data, which effectively appear as part of the core. Generally these have a very small size, 16KB or 32KB; the size is limited by the need to provide single-cycle access at a core speed of 1GHz or more.
             The other is the L2 cache. These are larger in size, but slower and unified in nature: unified because a single cache is used for both instructions and data. The L2 cache can be part of the core or an external block.

Cache Coherency:
As every core has its own L1 cache, we need a mechanism to maintain coherency between all the caches, because if one cache is updated and another is not, we get data inconsistency.
There are 2 ways to handle this:

1. Hardware-managed coherency : This is the most efficient solution. Data which is shared among the caches will always be kept up to date, so everything in the sharing domain will always see the exact same value for that shared data.
2. Software-managed coherency : Here the software, usually device drivers, must clean or flush dirty data from the caches, and invalidate old data to enable sharing with other processors or masters in the system. This takes processor cycles, bus bandwidth, and power (see the sketch below).
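
In the Linux kernel, this software-managed coherency is what the DMA mapping API does for a driver. A minimal sketch, assuming a hypothetical driver that receives data from a device into a buffer buf (the helper name receive_from_device is invented for illustration):

#include <linux/dma-mapping.h>
#include <linux/device.h>
#include <linux/errno.h>

/* Hand 'buf' to the device for a device-to-memory transfer and then
 * reclaim it for the CPU; the API performs the needed cache clean and
 * invalidate operations on non-coherent systems. */
static int receive_from_device(struct device *dev, void *buf, size_t size)
{
    dma_addr_t handle = dma_map_single(dev, buf, size, DMA_FROM_DEVICE);

    if (dma_mapping_error(dev, handle))
        return -ENOMEM;

    /* ... program the device to DMA into 'handle' and wait for it ... */

    /* Give the buffer back to the CPU: stale cache lines are
     * invalidated so the CPU reads what the device actually wrote. */
    dma_unmap_single(dev, handle, size, DMA_FROM_DEVICE);
    return 0;
}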

Wednesday, March 12, 2014

why to use request_threaded_irq?

I know that for many this function may be very clear, but I found it really difficult to digest. So let's start:

Why we need it ?
From the start, the kernel has tried to reduce the time for which the processor stays in interrupt context. To solve this, the top half / bottom half concept was introduced initially, in which the time-critical work is kept in the top half and the rest of the work is deferred to the bottom half. While the processor is executing the interrupt handler it is in interrupt context with interrupts disabled on that line, which is not good because if it is a shared line, other interrupts won't be handled during this time, which in turn affects overall system latency.
               To reduce this time further, kernel developers came up with the request_threaded_irq() method.

How it works?
Before going further with the functionality, let's check the function definition first:

int request_threaded_irq(unsigned int irq,        /* interrupt line to allocate */
                         irq_handler_t handler,   /* primary handler, runs in interrupt context */
                         irq_handler_t thread_fn, /* threaded handler, runs in process context */
                         unsigned long irqflags,  /* IRQF_* flags */
                         const char *devname,     /* name shown in /proc/interrupts */
                         void *dev_id);           /* cookie passed back to both handlers */



The difference between this function and the usual request_irq() function is the extra thread_fn.
Now let's understand the functionality. request_threaded_irq() breaks the handler code into two parts: a handler and a thread function. The main job of the handler is to acknowledge the interrupt to the hardware and wake up the thread function. As soon as the handler finishes, the processor leaves interrupt context and is free to receive new interrupts, while the rest of the work runs in the thread function in process context. This improves overall latency.
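
Here is a minimal sketch of how a driver might use it (the mydev names and handler bodies are hypothetical):

#include <linux/interrupt.h>

/* Primary handler: runs in interrupt context, does the bare minimum. */
static irqreturn_t mydev_handler(int irq, void *dev_id)
{
    /* ... acknowledge the interrupt in the hardware ... */
    return IRQ_WAKE_THREAD;       /* ask the core to wake thread_fn */
}

/* Thread function: runs in process context, so it is allowed to sleep. */
static irqreturn_t mydev_thread_fn(int irq, void *dev_id)
{
    /* ... do the heavy lifting, e.g. slow bus transactions ... */
    return IRQ_HANDLED;
}

/* typically called from the driver's probe() */
static int mydev_setup_irq(unsigned int irq, void *mydev)
{
    return request_threaded_irq(irq, mydev_handler, mydev_thread_fn,
                                0 /* flags */, "mydev", mydev);
}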

Misc:
Some driver code uses request_threaded_irq() with NULL as the value for handler; in that scenario the kernel invokes a default handler which simply wakes up the thread function. (The kernel requires the IRQF_ONESHOT flag in that case, so that the line stays masked until the thread function has run.)

When to use request_threaded_irq instead of bottom halves?
The answer lies in the driver's requirements: if the handler code needs to sleep, put that code in the thread function and use the threaded interrupt.

Thursday, March 6, 2014

DMA : Direct Memory Access

Why we use it ?
In the normal scenario, the processor has to be involved whenever a data transfer is done. If there is a lot of data, this can eat up a lot of CPU time. To overcome this issue, the DMA concept came into the picture: using DMA, devices can access system memory without involving the processor.
Earlier, when DMA was not there, the CPU was busy for the whole duration of the transfer; now with DMA, the CPU can do other operations while the transfer is happening.

Misc:
A DMA controller has its own configuration, which describes what it will be transferring: from where it takes the data, where it puts it, and what the size is. There are many other settings too. There is usually an internal buffer as well, which takes data from the source and then puts it at the destination.

When we want to start the transfer, there is a control bit associated with it; when we write that bit, it initiates the transfer.
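
As a rough illustration, programming such a controller from C could look like this; the register names, offsets and base address are entirely made up and will differ on real hardware:

#include <stdint.h>

#define DMA_BASE   0x40001000u                                  /* hypothetical base address */
#define DMA_SRC    (*(volatile uint32_t *)(DMA_BASE + 0x00))    /* source address */
#define DMA_DST    (*(volatile uint32_t *)(DMA_BASE + 0x04))    /* destination address */
#define DMA_LEN    (*(volatile uint32_t *)(DMA_BASE + 0x08))    /* transfer size in bytes */
#define DMA_CTRL   (*(volatile uint32_t *)(DMA_BASE + 0x0C))    /* control register */
#define DMA_START  (1u << 0)                                    /* the "start transfer" bit */

static void dma_copy(uint32_t src, uint32_t dst, uint32_t len)
{
    DMA_SRC  = src;                 /* where to read from */
    DMA_DST  = dst;                 /* where to write to */
    DMA_LEN  = len;                 /* how many bytes */
    DMA_CTRL = DMA_START;           /* write the start bit to kick off the transfer */
    /* The CPU is now free to do other work; completion is typically
     * signalled by an interrupt or a status bit. */
}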