[TECH] What does computing the edit distance between a string and its reverse tell us?

•April 19, 2008 • Leave a Comment

I came across a problem recently which asks for the following “Given a string find out the minimum number of characters to be inserted to make it a palindrome, and also give the string”. There might be several ways to solve this problem I have a simple algorithm to solve this problem in O(n2). The observation is the following reccurence.

Let LCSr[x,y] = longest common sub sequence between S1…x and Sr1…y , where Sr is the reverse of
the string S.

Let X=x1x2….xk be the be the palindrome with minimum # of inserted characters to make it a palindrome, then X has to be formed from S as follows.

1. Find a index k in S such that the cost of transforming S[1..k] to S[k+1...n] is minimum and the palindrome would be
X = S’[1...k]S’r[k+1...n] or X = S’[1..k-1]S[k]S’r[k+1..n]

2. The index k such that LCSr[k][len-k] is maximum.



/*Builds the LCS between the forward and backwards and finds
a index 'p' which will give minimum # of operations to transform
forward string to backward string.*/
void FindLCSReverse(char *str,unsigned int len){
	int i,j;
	int current_min,max_lcs;
	int imax,jmax,palindex,palindex_r;
	unsigned int operations;
	char path=0;
	for(i=0;iLCS[i][j-1])?LCS[i-1][j]:LCS[i][j-1];
			}
		}
	}
	/*Compute the p which gives minimum inserts to transform
	 *forward string to backward
	 */
	max_lcs=0;
	imax=len-1;jmax=0;
	for(i=len-1;i>=1;i--){
		if(LCS[i][len-i-1] > max_lcs){
			max_lcs = LCS[i][len-i-1];
			imax = i;
			jmax = len-i-1;
		}
		if(i>0 && LCS[i-1][len-i-1] >= max_lcs){
			max_lcs = LCS[i-1][len-i-1];
			imax = i-1;
			jmax = len-i-1;
		}
	}
	/*Now find the actual string*/
	i=imax;
	j=jmax; 

	palindex_r=0;
	while(i!=0 || j!=0){
		if(i>0 && j>0){
			if(buf[i] == buf[len-j]){
				palindrome_rev[palindex_r++] = buf[i];
				i--; j--;
				continue;
			}
			current_min = (LCS[i-1][j] > LCS[i][j-1])?LCS[i-1][j]:LCS[i][j-1];
			if(current_min == LCS[i-1][j]){
				if(current_min == LCS[i][j-1] && buf[i] < buf[len-j]){
					palindrome_rev[palindex_r++] = buf[len-j];
					j--;
				}else{
					palindrome_rev[palindex_r++] = buf[i];
					i--;
				}
			}else if(current_min == LCS[i][j-1]){
				if(current_min == LCS[i-1][j] && buf[len-j] < buf[i]){
					palindrome_rev[palindex_r++] = buf[i];
					i--;
				}else{
					palindrome_rev[palindex_r++] = buf[len-j];
					j--;
				}
			}
		}else if(j==0){
			palindrome_rev[palindex_r++] = buf[i];
			i--;
		}else if(i==0){
			palindrome_rev[palindex_r++] = buf[len-j];
			j--;
		}
		/*path=1 (left) path=2 (down) path=3 (diag)*/
	}

	for(i=0;i<palindex_r;i++){
		palindrome[palindex_r-1-i] = palindrome_rev[i];
	}
	palindrome_rev[palindex_r] = '';
	palindrome[palindex_r] = '';

	/*printf("imax = %u jmax = %u len=%u \n",imax,jmax,len);*/
	if(imax+jmax==len-1){
		printf("%s%s\n",palindrome,palindrome_rev);
	}else{
		printf("%s%c%s\n",palindrome,buf[imax+1],palindrome_rev);
	}
}

Download this here

[TECH] Reverse Engineering and Creating Crawler BOTS.

•April 16, 2008 • Leave a Comment

I have my share of pleasure reverse engineering the underlying details of SCOPUS. SCOPUS as you might know is the most popular scholarly database used for citation searching. I was trying to solve this problem “Given a research paper X produce a set of research papers in the same connected component of the CITATION GRAPH”, Let me define a CITATION GRAPH (CITE(V,E)) , V = {set of all research papers} and E={(i,j)} set of all directed edges from research paper ‘i’ to ‘j’ such that paper ‘j’ refers paper ‘i’ in its references. This edge information comes from SCOPUS however SCOPUS gives only one level (depth 1) in the connected component of all related papers, my goal is to get all the related papers (related in the sense fall in the same connected component of the CITE graph).

The reverse engineering the underlying comes handy when we want to automate the process of searching all these from the browser ourself.

I’m too tired to explain the details of the program which I had written using perl+LWP to create a CRAWLER BOT which gets all the related papers but if you need similar stuff sure the code can help click here

Unfortunately I don’t get enough time to write blogs but in past few weeks I had some very interesting technical stuff I want to write.

[TECH] A simple algorithm to find the coefficient’s of characteristic polynomial.

•March 13, 2008 • 1 Comment

Its often the case that the we need coefficients of the characteristic polynomial (xA = λx) rather than just the Eigen values and Eigen vectors of the matrix A, The following simple algorithm which can determine the coefficients of the polynomial from its roots (Eigen values) would be extremely handy, its a simple dynamic programming algorithm and also illustrates how quickly we can code
dynamic programming algorithms in Matlab because the matrices resize automatically.



%
% eigen_poly_coeff: input 'A' should be a square matrix
% The program finds the eigen values of A, and constructs
% the caracteristic polynomial
%
% output is the order of coefficients in the increasing
% degree order.
%
function retval = eigen_poly_coeff (A)
v = eig(A)';
B = size(v);
coeff_matrix = eye(1,(B(1,2)+1));
coeff_matrix(1,1) = v(1,1)*-1;
coeff_matrix(1,2) = 1;

for i=3:1:(B(1,2)+1)
   coeff_matrix(1,i) = 1;
   for j=(i-1):-1:1
    coeff_matrix(1,j) = (coeff_matrix(1,j)*
                v(1,(i-1))*-1);
    if(j-1 >=1)
      coeff_matrix(1,j) =
             coeff_matrix(1,j)+coeff_matrix(1,j-1);
    end
   end
end
 retval = coeff_matrix;
endfunction


[TECH] Algorithmic details of UNIX Sort command.

•March 10, 2008 • Leave a Comment

I happened to look at the algorithmic details of UNIX Sort, a LINUX version of
the classic UNIX sort is a part of GNU coreutils-6.9.90. This is classic
example of the standard External R-Way merge , to sort a data of size N bytes with a main memory size of M so it creates N/M runs and merges R at a time, the number of passes through the data is log(N/M)/log(R)
passes.In fact the lower bound(runtime) for external sorting is Ω((N/M)log(N/M)/log(R)). All the external memory sorting algorithms provided in the literature are optimal so the fight here is minimizing the constant before the number of passes.

UNIX sort treats keys are lines (strings), the algorithm followed by
unix sort is in fact the R-Way merge. Let the input file size be
IN_SIZE.

1. Choosing Run Size:
——————————–
The sizes of the initial runs are chosen from the total physical
memory (TOTAL_PHY) and available memory (AVAIL_PHY).
RUN_SIZE = (MAX(TOTAL_PHY/8,AVAIL_PHY))/2

maximum of 1/8th of TOTAL_PHY and AVAIL_PHY and divided by 2.
See function “default_sort_size (void)” in the code.

2. Creating Runs:
————————-

Unix sort creates a temporary file for every run. So it creates
IN_SIZE/RUN_SIZE (celing) temporary files. Internally it uses merge
sort to sort internally it uses an optimization mentioned in Knuth
volume 3 (2nd edition), problem 5.2.4-23.

3. Merging:
—————-

The number of runs merged at any time is hard coded in the program
see macro NMERGE , NMERGE is defined to be 16 so it merges
exactly 16 runs at any time.

[TECH] Sorting partially sorted sequences.

•March 6, 2008 • Leave a Comment

Partially Sorted Sequence: A sequence k1,k2,…kn of n keys such that any key kj differs from its sorted position by atmost d as an example for d=3 we have 3,5,6,1,2,4
I came across this problem recently, turns out that we can solve this problem in several ways in O(nlog(d)) time. I found an interesting and simple way to solve this problem, a simple observation reveals that the smallest element
in the entire sequence of n keys exists in the first d keys.


/*Build a heap(min) H with first d elements
 *in O(d) time
 */
 for(i=d;i<n;i++){
    DeleteMin(H); /*output this key*/
    InsertHeap(H,ki);
 }
/*Note the Heap will have atmost d keys in it
at any time so the Insert and Delete can be done in
O(log(d)) worst case.
*/

(n-d)log(d) = O(nlog(d))

[TECH] Converting an Las Vegas algorithm from an Expected Time Bound to High Probability bound.

•March 4, 2008 • Leave a Comment

Some times its not always possible to derive the High Probability bounds (1-n)
for Las Vegas algorithms, I found an interesting way to convert any Las Vegas algorithm
with expected bounds on the resources to the High Probability bounds. Assume that Tn be the expected runtime(or any resource) for a Las Vegas algorithm. The
plain vanilla Las Vegas algorithm have the following structure.


while(1){
  /*Select a random sample*/
  if(answer found) quit;
}

Basically any analysis of the above randomized algorithm should tell how long
should we run the loop. So let Tn gives the expected time on how
long this loop runs. What we can do is as follows

  • Let the algorithm run for exactly 2Tn steps, if the algorithm
    quits before that we are fine, if the algorithm does not quit after Tn
    times then stop and restart.


counter = 0;
while(1){
 /*Pick a random sample*/
 if(counter++ == 2Tn){
    counter=0; /*restarting*/
 }
 if(answer found) quit;
}

We can now show that the above algorithm will now give high probability bounds. Let
X be the random variable which indicates the runtime of the algorithm. Then using Markov inequality P[X ≥ aTn] ≤ 1/a

  • P[X ≥ 2Tn] ≤ 1/2
  • Lets assume that the algorithm runs for k ≥ 2Tn time steps, then we should have done a reset exactly k/(2Tn) times.
  • The probability of doing a reset equals P[X ≥ 2Tn] = P[reset].
  • Lets assume all the events are independent then the probability of resetting k/(2Tn) P[reset k/(2Tn) times] = P[algorithm running k time steps]
  • P[reset k/(2Tn)] ≤ (1/2)*(1/2)*(1/2)…k/(2Tn) times
  • P[reset k/(2Tn)] ≤ (1/2)k/(2Tn)
  • we want the above probability to be very small (n), to
    make that (1/2)k/(2Tn) = n). So
    the value of k for which this happens is 2Tnαlog(n)

  • So with high probability after 2Tnαlog(n) time steps
    if we quit the loop then we give the correct answer.

  • So we just have taken an expected bound Tn and converted into ∼O(log(n)Tn) time algorithm with high probability rather than
    expected bounds.

[TECH] Finding max and min in exactly 3n/2-2 Element Comparisions.

•March 1, 2008 • Leave a Comment

We know that the lower bound on the minimum number of comparisons to find max
and min in an array of n. In fact the recurrence T(n) = 2T(n/2)+2
solves exactly to 3n/2-2 the following is an interesting way to find max
and min in exactly 3n/2-2 comparisons non-recursively(assume ‘n’ is even). Also note by comparison we mean the element comparisons as they are significant.


current_min = a[0]; current_max=a[1];
/* 1 comparison here */
if(current_min < current_max){
  current_min = a[1]; current_max = a[0];
}
/*(n-1)-2+1 times*/
for(i=2;i<n-1;i+=2){
  /*3 Element Comparisons here */
  if(a[i] < a[i+1]){
      if(a[i] < current_min) current_min = a[i];
      if(a[i+1] > current_max) current_max = a[i+1];
  }else{
      if(a[i+1] < current_min) current_min = a[i+1];
      if(a[i] > current_max) current_max = a[i];
  }
}
/*return current_max current_min*/

Analysis: Total Comparisions = 1 + (((n-1)-2+1)/2)*3 = 3*n/2-2.

[TECH] Standard I/O Buffers won’t get flushed when program crashes…

•February 12, 2008 • Leave a Comment

Today I had a interesting problem, assume that you have a program printing out some
details on to the screen and suddenly the program crashes and you want to capture what
ever it has printed what would you do? you would try to redirect the output to a file and look at that file and what if the contents of the redirected file is empty? how do you explain this? try the following program and try to redirect what ever it prints using ‘>’ or a ‘|’ (“./a.out > out” or “./a.out | more”)


char *crash_me = "crash this";
int main(){
    int i;  

    for(i=0;i out
#3$./a.out | more

We can see that the ‘out’ file and ‘more’ don’t show any thing which was in fact printed if we just run the program normally, how can we explain this? what exactly is
happened? actually this is what has happened in the first case the terminal output is
line buffered when ever terminal sees ‘\n’ it prints that but in the next two cases
the output is written to a file which is not line buffered (but gets written when
the buffer is full) and since the program is crashed before the buffer gets full nothing is printed in case of #2 and #3. I guess one should definitely have a signal
handler for SIGSEGV and other signals which end the program and do a explicit flush as below.


/*The buffers should be flushed before
 *program crashes.
 **/
void FlushBuffers(int sig){
    fprintf(stderr,"Segmentation Fault\n");
    fflush(stdout);fflush(stderr);
    exit(1);

}
char *crash_me = "crash this";
int main(){
    int i;
    assert(signal(SIGSEGV,FlushBuffers)!=SIG_ERR);
    for(i=0;i<10;i++){
        printf("Before crash line %d\n",i+1);
        if(i==9){
            crash_me[6] = 'C';
        }
    }
}

I’m thinking of making a list of this kind of problems on UNIX

[TECH] SafeRead and SafeWrite

•February 11, 2008 • Leave a Comment

‘read’ and ‘write’ are most frequently used system calls, its really a blunder to have something like the following in the code.


/*Read and Write blunders*/
if(read(fd,buffer,len) < 0 ){
  perror("Read ERROR:");
}

if(write(fd,buffer,len) < 0){
  perror("Write ERROR:");
}

The above code involving ‘read’ and ‘write’ syscalls may seem perfectly fine but unfortunately thats not true, NEVER IGNORE THE RETURN VALUE of a syscall, its very
common to functions like ‘read’ and ‘write’ to return values less than ‘len’ in several situations like program interrupted by a signal or timeout or if the ‘len’ is
very large say in ‘Mb’ make sure you have the following ‘SafeRead’ and ‘SafeWrite’ in
your coding libraries.


ssize_t SafeWrite(int fd,void *buf,size_t wlen){
 ssize_t len; char *bbuf = (char *)buf;
 size_t writeln=0;
 while(writeln < wlen){
   len = write(fd,&(bbuf[writeln]),wlen-writeln);
   if(len<=0) return len;
   writeln += len;
 }
 return writeln;
}

ssize_t SafeRead(int fd,void *buf,size_t rlen){
 ssize_t len;char *bbuf = (char *)buf;
 size_t readln=0;
 while(readln < rlen){
   len = read(fd,&(bbuf[readln]),rlen-readln);
   if(len<=0) return len;
   readln += len;
 }
 return readln;
}

Its quite often that the while loop only executes once, but its always possible that due to some reason the ‘read’ and ‘write’ syscalls may return less than the number of bytes you read or write.

[TECH] A Contradictory gcc message.

•January 31, 2008 • Leave a Comment

I have been experimenting on some of my PDM (parallel disk model) sorting code
and wanted to create huge files with billions of keys I wanted to create a file
with 1 billion integers (1024*1024*1024*4), the program stopped after some time saying
that it “File size limit exceeded” , however in my shell (csh) the ‘limit’ command showed unlimited.



[vamsik@abadon PDMSorting]$ ./pdm_sort
Setting RAND_SEED to 1201763856
Filesize limit exceeded
[vamsik@abadon PDMSorting]$ ls -al key_file.txt
---------- 1 vamsik fuse 2147483647 Jan 31 02:18 key_file.txt
[vamsik@abadon PDMSorting]$

I saw that the file was created without any permissions, so I tweaked
my ‘umask’ but nothing changed (I was doing a open with (O_WRONLY|O_CREAT)). You might be wondering what exactly I’m trying to say in this post, in fact the story just started, rather than setting my ‘umask’ to ‘umask 22′, I have set it to
‘umask 755′ and as usual I was doing a build with ‘make’ this is what happened.


[vamsik@abadon PDMSorting]$ make
gcc  -O2  -I../myutil/            -c ExternalSort.c
ExternalSort.c:1: fatal error: can't open /tmp/cct9kuSr.s for writing: Permission denied
compilation terminated.
The bug is not reproducible, so it is likely a hardware or OS problem.
make: *** [ExternalSort.o] Error 1
[vamsik@abadon PDMSorting]$

Looks strange right? , see the funny thing it says “The bug is not reproducible, so it is likely a hardware or OS problem.” , its really a stupid error message
how can it say it cannot be reproduces? , just set ‘umask 755′ and do a ‘gcc’ on any
file its going to say the same thing, in fact the message is a utter contradiction because we can reproduce this by changing the ‘umask’. I’m going to report this to the ‘gcc’ maintainers or submit a patch to the maintainers.