I am a keen fan of algorithmic puzzles, so I decided to create some myself and put them on my blog. I will post them one at a time.
You will find the first puzzle down below, and the solution will be posted in the comments section over the weekend. The puzzles will gradually get more difficult ;)
Algo Puzzle #1
Write an algorithm that sums the integers from 1 to n in O(1) time. More specifically, find an algorithm that computes the sum using only three operations, regardless of n.
Wednesday, 23 November 2011
Friday, 11 November 2011
Swapping elements in an array without a temporary variable
The common idiom for swapping two elements in an array is to use a temporary variable (typically allocated on the stack), which looks like the following:
int[] elements = {1,2,3,4,5,6,7,8,9,10};
int tmp = elements[0];
elements[0] = elements[9];
elements[9] = tmp;
In that example I was swapping the first and last elements. However, this is not necessarily the most optimised way, as the elements can be swapped without creating that temporary variable.
In true Blue Peter style, here's one I created earlier:
elements[0] = elements[0] ^ elements[9];
elements[9] = elements[0] ^ elements[9];
elements[0] = elements[0] ^ elements[9];
It works in the same way as rearranging an algebraic equation to make a different variable the subject:
a = b - c
b = a + c
c = b - a
The bitwise operators only work on integers, not on floating-point numbers, so the arithmetic operators must be used instead to swap real numbers in an array:
float[] data = {1.1f, 2.234f, 4.5454f, 1.110f, 9.999f, 3.1562f};
data[0] = data[0] - data[5];
data[5] = data[5] + data[0];
data[0] = data[5] - data[0];
Using these algorithms provides at best a micro-optimization, at the expense of clarity; the arithmetic version can also lose floating-point precision.
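As a sanity check, here is a minimal, self-contained sketch of both techniques (the class and method names are my own). Note the XOR version must guard against the two indices being equal, otherwise the element is XOR'd with itself and zeroed out:

```java
public class SwapDemo {
    // XOR swap: no temporary variable, integers only.
    static void xorSwap(int[] a, int i, int j) {
        if (i == j) return; // a[i]^a[i] would zero the element out
        a[i] = a[i] ^ a[j];
        a[j] = a[i] ^ a[j]; // a[j] now holds the original a[i]
        a[i] = a[i] ^ a[j]; // a[i] now holds the original a[j]
    }

    // Arithmetic swap for floats; beware of rounding error.
    static void arithSwap(float[] a, int i, int j) {
        a[i] = a[i] - a[j];
        a[j] = a[j] + a[i]; // a[j] now holds (approximately) the original a[i]
        a[i] = a[j] - a[i]; // a[i] now holds (approximately) the original a[j]
    }

    public static void main(String[] args) {
        int[] elements = {1,2,3,4,5,6,7,8,9,10};
        xorSwap(elements, 0, 9);
        System.out.println(elements[0] + " " + elements[9]); // 10 1

        float[] data = {1.1f, 2.234f, 4.5454f, 1.110f, 9.999f, 3.1562f};
        arithSwap(data, 0, 5);
        System.out.println(data[0] + " " + data[5]);
    }
}
```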
Tuesday, 20 September 2011
Debugging failed record inserts into kdb+
Don't expect a lot of help if there is a type mismatch error on an insert into a kdb+ database. All Q returns is the infamous `type error, which gives no indication of its cause: a type mismatch between the data and the columns it is being written to, for instance a column expecting a string but receiving a boolean.
Here's what you usually get. Let's say I have an orders table like this and try to insert the first record into it.
q)orders:([]client:`$(); sym:`$(); side:`$(); price:`float$(); qty:`int$())
q)`orders insert (`CLIENT1; "USDEUR"; `A; 1.56; 56.6);
`type
This sucks, as it doesn't tell you where the type mismatches are, and you are forced to manually check every column type against each element's type, which could take a while if there are many columns. After a production issue related to this, I decided to write a Q function that returns the list of columns with a type mismatch; if everything matches, it returns `good.
inspect:{ [host;port;table;data]
t:"bxhijefcsmdzuvt"!1 4 5 6 7 8 9 10 11 13 14 15 17 18 19h;
h:hopen `$":",host,":",string port;
m:h"meta ",table;
vt:abs type each data;
ct:abs t[exec t from value m];
r:(exec c from key m) where vt {x<>y}' ct;
$[0 < count r; r;`good]}
This can now be used to find why the insert is failing.
q) inspect["host"; 8888; "orders"; (`CLIENT1; "USDEUR"; `A; 1.56; 56.6)]
`sym`qty
q) inspect["host"; 8888; "orders"; (`CLIENT1; `USDEUR; `A; 1.56; 56)]
`good
There you go; it has computed which columns have the type mismatch, and on further scrutiny one can indeed see a string where a symbol was expected for sym, and a float where an int was expected for qty.
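The same validate-before-insert idea carries over to any typed store. Here is a minimal Java sketch of it (the class, method, and column names are my own, purely for illustration): compare each value's runtime type against the expected column type and report the mismatched columns, or "good" if everything lines up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InsertInspector {
    // Mirrors the q inspect function: walk the schema in column order and
    // collect the names of columns whose value has the wrong runtime type.
    static List<String> inspect(LinkedHashMap<String, Class<?>> schema, Object[] row) {
        List<String> bad = new ArrayList<>();
        int i = 0;
        for (Map.Entry<String, Class<?>> col : schema.entrySet()) {
            if (!col.getValue().isInstance(row[i++])) bad.add(col.getKey());
        }
        return bad.isEmpty() ? List.of("good") : bad;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Class<?>> orders = new LinkedHashMap<>();
        orders.put("client", String.class);
        orders.put("sym", String.class);
        orders.put("price", Double.class);
        orders.put("qty", Integer.class);
        // qty receives a double where an int is expected, so it is flagged.
        System.out.println(inspect(orders, new Object[]{"CLIENT1", "USDEUR", 1.56, 56.6})); // [qty]
        System.out.println(inspect(orders, new Object[]{"CLIENT1", "USDEUR", 1.56, 56}));   // [good]
    }
}
```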
Thursday, 1 September 2011
Constituent clarification
I noticed some confusion between two colleagues over the use of time.ss versus time.second in their queries, so I thought I would blog about it briefly.
q)t:09:00:02.000
q)t.ss
2
q)t.second
09:00:02
select count i by time.ss from ... groups records purely by the seconds component (0-59) of their timestamps, whereas select count i by time.second from ... groups by the timestamp truncated to the whole second, i.e. ignoring milliseconds.
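The same distinction exists in Java's time API, which may make it clearer: getting the seconds component versus truncating the whole timestamp to second precision (a small sketch, names mine):

```java
import java.time.LocalTime;
import java.time.temporal.ChronoUnit;

public class SecondsDemo {
    public static void main(String[] args) {
        LocalTime t = LocalTime.of(9, 0, 2, 500_000_000); // 09:00:02.500
        // time.ss analogue: just the seconds component, 0-59
        System.out.println(t.getSecond());                     // 2
        // time.second analogue: the timestamp truncated to whole seconds
        System.out.println(t.truncatedTo(ChronoUnit.SECONDS)); // 09:00:02
    }
}
```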
Tuesday, 30 August 2011
Put a sock(et) in it
If you have a Java program that connects to a kdb+ database, ensure you turn off Nagle's algorithm: an optimization, enabled by default, that avoids sending payloads smaller than the TCP headers. For instance, sending a 1-byte payload with 20 bytes of TCP headers would be inefficient, so the stack waits to add more data to the payload before sending. For latency-sensitive applications, however, this efficiency is not desirable. It can be turned off using the following snippet of code:
kx.c c = ...;
c.s.setTcpNoDelay(true);
c.k(".test.c:20j");
c.k(".test.c");
Bingo! This turns off the packet-coalescing algorithm (at the expense of more packets on the wire), but for real-time work it is best to send a payload straight away rather than delay it. This advice applies to any latency-sensitive messaging in a Java program.
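Outside of the kx client, the same switch lives on the plain java.net.Socket API. A minimal self-contained sketch (the loopback listener exists purely so the client socket has something to connect to):

```java
import java.net.ServerSocket;
import java.net.Socket;

public class NoDelayDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0); // any free port on loopback
             Socket client = new Socket("localhost", server.getLocalPort())) {
            client.setTcpNoDelay(true); // disable Nagle's algorithm on this socket
            System.out.println(client.getTcpNoDelay()); // true
        }
    }
}
```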
Sunday, 28 August 2011
Arithmetic overflow in Q
Given the dynamic typing of Q, I would have thought that it would promote variables when arithmetic overflow occurs, but instead it silently wraps around. Come on Arthur, don't be shy: take advantage of your type system and promote the type behind the scenes.
Let me give an example:
q) x:2147483646*2147483646
q) x
4
q)type x
-6h
What it should really do is this:
q) x:2147483646*2147483646
q) x
4611686009837453316j
q)type x
-7h
The only way to do this is explicitly promoting to long before the arithmetic.
q) x:2147483646j*2147483646j
q) x
4611686009837453316j
q)type x
-7h
Which one do you prefer?
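For what it's worth, Java's 32-bit int arithmetic behaves exactly the same way; this small sketch mirrors the two q sessions above (the wrapped product is 4 because (2^31-2)^2 = 2^62 - 2^33 + 4, and everything but the 4 vanishes modulo 2^32):

```java
public class OverflowDemo {
    public static void main(String[] args) {
        int x = 2147483646 * 2147483646;    // 32-bit multiply wraps modulo 2^32
        System.out.println(x);              // 4

        long y = 2147483646L * 2147483646L; // promote to 64-bit first
        System.out.println(y);              // 4611686009837453316
    }
}
```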
Monday, 8 August 2011
Covariance matrices in kdb+
A covariance matrix is at the heart of many data analysis and time series models, as it is a measure of how two data sets change together. In Q, the natural data structure is a vector, which can be nested to represent a matrix. A covariance function (cov) is already provided that takes two vectors and computes their covariance, but as a scalar, not as a matrix.
Here is a function I wrote that creates a covariance matrix given a set of n vectors; it returns an n×n matrix:
q) covar:{{x[y] cov/: x}[x;] each til (x#:)}
Here is how to use it:
q) a:1+til 10
q) b:10-til 10
q) c:2*1+til 10
q) d:2*1-til 10
q) e:4*1+til 10
q) f:4*1-til 10
q) covar (a;b;c;d;e;f)
8.25 -8.25 16.5 -16.5 33 -33
-8.25 8.25 -16.5 16.5 -33 33
16.5 -16.5 33 -33 66 -66
-16.5 16.5 -33 33 -66 66
33 -33 66 -66 132 -132
-33 33 -66 66 -132 132
The next blog post will show how to compute its eigenvectors and eigenvalues.
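For readers who don't speak q, here is a rough Java sketch of the same computation (names are my own). Note that cov divides by n, matching q's population covariance, which is why cov(a,a) for a = 1..10 gives 8.25 as in the first cell of the matrix above:

```java
public class Covar {
    // Population covariance of two equal-length vectors (divide by n, as q's cov does).
    static double cov(double[] x, double[] y) {
        double mx = 0, my = 0;
        for (int i = 0; i < x.length; i++) { mx += x[i]; my += y[i]; }
        mx /= x.length; my /= y.length;
        double s = 0;
        for (int i = 0; i < x.length; i++) s += (x[i] - mx) * (y[i] - my);
        return s / x.length;
    }

    // n x n covariance matrix of n input vectors.
    static double[][] covar(double[][] rows) {
        int n = rows.length;
        double[][] m = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                m[i][j] = cov(rows[i], rows[j]);
        return m;
    }

    public static void main(String[] args) {
        double[] a = new double[10], b = new double[10];
        for (int i = 0; i < 10; i++) { a[i] = 1 + i; b[i] = 10 - i; }
        double[][] m = covar(new double[][]{a, b});
        System.out.println(m[0][0] + " " + m[0][1]); // 8.25 -8.25
    }
}
```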
Wednesday, 20 July 2011
Factorials in Q
I wanted to compute a binomial coefficient for a program I am working on that helps poker players during play, and noticed that a factorial function is not built into the Q language. I would have thought that "!" would be overloaded to support factorials, so I decided to write my own:
.q.fctrl:{prd 1+til x}
Voila! Now you have a factorial function that takes a scalar and computes its factorial:
q)fctrl 5
120
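The same pair of functions in Java, as a rough sketch (names mine). One design point worth noting: the binomial coefficient is built incrementally rather than as fctrl(n)/(fctrl(k)*fctrl(n-k)), because the factorial form overflows a 64-bit long for n as small as 21, while a 52-card deck needs n = 52:

```java
public class Fctrl {
    // prd 1+til x : the product of 1..x. Overflows long beyond x = 20.
    static long fctrl(int x) {
        long p = 1;
        for (int i = 1; i <= x; i++) p *= i;
        return p;
    }

    // Binomial coefficient "n choose k", the original poker motivation.
    // Each intermediate r is itself a binomial coefficient, so the division is exact.
    static long choose(int n, int k) {
        long r = 1;
        for (int i = 1; i <= k; i++) r = r * (n - k + i) / i;
        return r;
    }

    public static void main(String[] args) {
        System.out.println(fctrl(5));      // 120
        System.out.println(choose(52, 2)); // 1326 two-card starting hands
    }
}
```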
Monday, 11 July 2011
IPC Port Handles
A word to the wise about the scope of an IPC handle variable: always keep the handle as a local variable, as storing it in global scope is dangerous. If the underlying remote Q process is bounced, the handle will not rebind to the new process, leaving it dead! Be wary of keeping it global, catching the exception and rebinding it yourself, as this is a premature optimization.
Thursday, 9 June 2011
Use meta with judiciousness
The common usage of meta in KDB is to provide the metadata of a table; however, do not use it against a partitioned table, as it is surprisingly expensive!
To empirically verify this for yourself, do this:
\ts meta partitionedtable
The best way to retrieve the columns of a partitioned table is to do something like this, depending on how you are partitioning:
columns:{[d]key exec from partitionedtable where date=d, i=0}
Best way to learn Q / KDB programming
Developers with only an imperative programming background (C, C++, Java, etc.) face the steepest learning curve. To overcome it, there are two programming paradigms that need to be understood:
1. Functional programming: higher-order functions, projections (currying), closures, etc.
2. Array processing: right-to-left evaluation, vector algebra
Once a good grasp of these paradigms has been developed then the rest should be a nice, relaxing downhill descent.
Passing a table as an argument to a remote function in KDB
Executing queries against a historical (disk-based) KDB database can be computationally expensive, both in time and space. Rather than repeatedly executing the query, it is best to cache the result if the query needs to be executed again. I wanted to pass the result (a kdb+ table) to a remote function in another Q process on another server. Passing a table over an IPC handle is trivial, but passing the table as an argument to a remote function is tricky.
In true Blue Peter style, here is one I made earlier:
h:hopen `someserver:3333
tbl:select avg price from ([]price:10?1.10)
h "add ",-3!`int$-8!tbl
This code serializes the table, then casts the serialized binary data to a list of integers to be passed to the remote function called add.
On the other side of the IPC handle, the add function will look like:
add:{[data] tbl: -9!4h$data; ... }
Enjoy!
Friday, 29 April 2011
Rounding in KDB
kdb+ doesn't have a reserved keyword for rounding real numbers to the nearest integer, which is rather strange given that it is quite a common computation. One way of doing it is by casting:
q) "i"$0.1
0
q) "i"$0.9
1
I decided to write a function that can be used for rounding, which is more terse than casting.
.q.round:{floor x+0.5}
q) round 0.1
0
q) round 0.6
1
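The same floor(x + 0.5) idea in Java, as a small sketch (names mine):

```java
public class RoundDemo {
    // Same idea as the q function: floor(x + 0.5).
    static long round(double x) {
        return (long) Math.floor(x + 0.5);
    }

    public static void main(String[] args) {
        System.out.println(round(0.1));  // 0
        System.out.println(round(0.6));  // 1
        System.out.println(round(-0.5)); // 0: halves always round toward +infinity
    }
}
```

Java's own Math.round(double) is documented as essentially this same floor(x + 0.5) computation, which is a nice vote of confidence in the approach.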
Sunday, 13 March 2011
Xor in KDB
I wrote an exclusive-or operator for KDB after failing to find a built-in one. Here is what I wrote:
.q.xor:{((x&y)~:)&x|y}
Let's give some examples of how you can use it with both lists and scalars:
q) 00001111b xor 11110000b
11111111b
q) 1b xor 1b
0b
q) 1b xor 00001111b
11110000b
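The definition reads as "(not (x and y)) and (x or y)", i.e. NAND combined with OR. The same construction in Java, for the scalar case (a sketch, names mine):

```java
public class XorDemo {
    // Mirrors the q definition: xor = (not (x and y)) and (x or y).
    static boolean xor(boolean x, boolean y) {
        return !(x && y) && (x || y);
    }

    public static void main(String[] args) {
        System.out.println(xor(true, true));   // false
        System.out.println(xor(true, false));  // true
        System.out.println(xor(false, false)); // false
    }
}
```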
Friday, 4 March 2011
Bloom Filters
A common and severe bottleneck in distributed systems is I/O. Many applications in industry store data in a relational database that persists to a disk-based file system, and the data is retrieved using many non-contiguous disk reads. The most common solution is to put a cache in front so the data is read from memory, avoiding disk. The improvements are significant.
What if you wanted to detect the presence of a record in the database without I/O? You could use a cache, but that is not space-efficient, as you would need to keep growing the cache as data is added to the database. A much better solution is a space-efficient Bloom filter, which uses a fixed amount of memory no matter how large the data grows (at the cost of a rising false-positive rate).
A Bloom filter is a probabilistic data structure that detects the presence of an element in a set in constant space and time, making it extremely scalable as the data tends to petabytes. The Wikipedia article will do a better job of describing it: http://en.wikipedia.org/wiki/Bloom_filter. The most important property is that it can give false positives but never false negatives: it will never falsely tell you an element is absent from the set it represents, but it may falsely tell you one is present.
Let's give a real-world use case. Say we are storing orders in an order management system, which persists all orders to a relational database. We could use a cache to store the primary key of each order, but that runs into problems if the cache runs out of memory from storing too many keys. A much more efficient solution is to use a Bloom filter to detect whether an order is in the database; if it is, we can then retrieve it. It may give false positives but will never give a false negative.
I will be open sourcing my implementation that I use for my own startup project.
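In the meantime, here is a minimal sketch of the idea (not my production implementation; all names, the sizing, and the double-hashing scheme are illustrative assumptions): k bit positions are derived per element, add sets them, and a lookup only answers "definitely absent" or "probably present".

```java
import java.util.BitSet;

public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Double hashing: derive the i-th index from two base hashes of the element.
    private int index(Object o, int i) {
        int h1 = o.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9; // crude second hash
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(Object o) {
        for (int i = 0; i < hashes; i++) bits.set(index(o, i));
    }

    // false => definitely absent; true => probably present (false positives possible).
    boolean mightContain(Object o) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(index(o, i))) return false;
        return true;
    }

    public static void main(String[] args) {
        BloomFilter f = new BloomFilter(1 << 16, 4);
        f.add("ORDER-1001");
        f.add("ORDER-1002");
        System.out.println(f.mightContain("ORDER-1001")); // true: no false negatives
    }
}
```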
Approximating Pi using a Monte Carlo algorithm in Q
Probabilistic data structures and algorithms have special properties that complement distributed systems and mathematical modelling. I will be writing more blog posts to elaborate on these properties and their applications in high-performance computing.
I decided to write a function in Q that uses Monte Carlo simulation to approximate Pi, as Q is very concise and highly optimized for vector algebra. The result is just one line of code, and it is blazing fast.
pi:{[n]x:n?1.0;y:n?1.0;4*(count where 1>=sqrt (x*x)+y*y)%n}
Voila! A Pi approximation in one line of Q.
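The method: of n random points in the unit square, the fraction landing inside the quarter circle of radius 1 approaches pi/4. A loop-based Java equivalent of the q one-liner, for comparison (names mine; seeded so the result is reproducible):

```java
import java.util.Random;

public class MonteCarloPi {
    // Estimate pi as 4 * (points inside the quarter circle) / n.
    static double pi(int n, long seed) {
        Random r = new Random(seed);
        int inside = 0;
        for (int i = 0; i < n; i++) {
            double x = r.nextDouble(), y = r.nextDouble();
            if (x * x + y * y <= 1.0) inside++;
        }
        return 4.0 * inside / n;
    }

    public static void main(String[] args) {
        System.out.println(pi(1_000_000, 42)); // approximately 3.14
    }
}
```

The standard error of the estimate shrinks as 1/sqrt(n), so a million samples typically lands within a few thousandths of pi.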