PezBlog

Wednesday, 31 October 2012

Computing logarithms to arbitrary bases in Java

I noticed that Java's math API does not provide a method to compute logarithms for arbitrary bases so I decided to write a function to provide this ability. As an API design, you cannot expect developers to know how to compute logarithms of arbitrary bases using the natural logarithm. It is quite common to compute logarithms to specific bases like 2, 8 and 16.

public static double log(double x, double b) {
return log(x) / log(b)
}

All I am doing is taking the natural log of x over the natural log of the base value. This is not the optimal solution as it would be best to write a native method but it is close enough.

Simples!

Friday, 3 August 2012

Grouping in kdb using datetime

This blog post could also be called "becareful what you group for"

The original timestamp of kdb is based on a floating point number, which they call a datetime, where the date and time is on the left and right hand side of the decimal point respectively. This causes a lot of issues when comparing timestamps for equality and especially when grouping by timestamps. A colleague who adjusts risk of the portfolio got bitten by this so it is worth blogging.

The issues are not isolated to kdb but it is the inherent nature of floating point numbers as it has to make discrete approximations to the real number line. The most simple example is trying to represent 0.1 in base 2,as it is an irrational number so it has to make an approximation that is close to 0.1. Two variables representing 0.1 could have two different approximation values, and this can cause a lot of issues if you analysing data based on datetimes as it compares the approximation values and not by 0.1.

The proof is in the pudding, so I will demonstrate with an example.

/P 0q) x:1.0-0.90.1q)y:1.0-0.90.1
/P 1q) x:1.0-0.90.99999987q) y:1.0-0.90.99999789 As you can see when the precision is set to true, x and y are not the same values. Just think about this situation: what will happen if we wanted to group by datetimes but the same logical timestamp could have two (approximation) values. If we are grouping by timestamps then it is not guaranteed that it will group correctly because a logical datetime could have different approximation values.

The lesson to come away from this blog is not to use datetimes for equality logic both implicitly and explicitly.
The new timestamps that kdb introduced are based on 64 bit unsigned integers (computer scientists call them longs) so are reliable to use for grouping or logical equality.

Monday, 21 May 2012

Measuring latency in Java

I noticed a common pattern of Java code that attempts to measure latency within a Java process, which will give false measurements of elapsed time.

long start = System.currentTimeMillis();
doSomething();
long end = System. currentTimeMillis();
long time = end - start;

The problem with this snippet of code is that it measures latency by adjusted wall clock time. Wall clock time will continously drift from correct time and the operating system periodically sends a delta to skew it to keep it synchronized with real time; so can potentially skew your measurement.

The code can be changed by replacing System.currentTimeMillis() with System.nanoTime() as the former will include the drift adjustments and the latter gives measurements without including the skew. For example, it could skew the wall clock time by +20 milliseconds; thus adding an illusionary added 20 units of latency to the measurement.

The lesson from this is if you need to measure elapsed time in Java, ALWAYS use System.nanoTime()

long start = System.nanoTime();
doSomething();
long time = System.nanoTime() - start;

Wednesday, 18 April 2012

Sampling moments in Q

I was checking over a colleague's Q code that computes moments of a sample distribution taken from a statistical population. I realized that population moments were being computed against the sample - very naughty! The lesson of this blog is to be careful using out-of-the-box Q functions like var against a sample distribution.

Let's take a look at an example

q)sample:3 4 6 7 10
q)var sample
6f

This is the population variance being computed against the sample - this is an inaccurate moment! Instead create your own high-order function to compute sample moments.

q)svar:{(sum d*d:x-avg x)%-1+count x}
q)svar sample
7.5

Simples!

Wednesday, 23 November 2011

Algo Puzzle #1

I am a jovial fan of algorithmic puzzles so I decided to create some myself and put them on my blog. I will post them one at a time.

You will find the first puzzle down below and the solution will be posted in the comments section over the weekend. The puzzles will gradually get more difficult ;)

Algo Puzzle #1

Write an algorithm that sums a set where each number is from 1 to n in O(1) time. More specifically find an algorithm that computes it with only 3 operations regardless of the set's size.

Friday, 11 November 2011

Swapping elements in an array without a temporary variable

The common idiom to swap two elements in an array is to use a temporary variable that is consequently pushed on the stack, which looks like the following:

int[] elements = {1,2,3,4,5,6,7,8,9,10}
int tmp = elements[0];
elements[0] = elements[9];
elements[9] = tmp;

In that example I was swapping the first and last elements. However this is not the most optimized way as the elements can be swapped without that temporary variable being created.

In true blue peter style, here's one I created earlier:

elements[0] = elements[0]^[elements[9];
elements[9] = elements[0]^[elements[9];
elements[0] = elements[0]^[elements[9];

It works in the same way that algebra works, where equality can be changed w.r.t the variable.

a = b - c
b = a + c
c = b - a

The bitwise operators only work on integers and not all real numbers (more speficially, floating point numbers) so the arithmetic operators must be used instead to swap real numbers in an array:

float[] data = {1.1f, 2.234f, 4.5454f, 1.110f, 9.999f, 3.1562f};
data[0] = data[0] - data[5];
data[0] = data[5] + data[0];
data[5] = data[0] - data[5];

Using these algorithms will provide micro-optimization but at the expense of clarity.

Tuesday, 20 September 2011

Debugging failed record inserts into kdb+

Don't expect a lot of help if there is an type mismatch error with an insert into kdb+ database. All Q returns is the infamous `type error; which doesn't give you any indication on its cause. This error is due to a type mismatch between the data and the columns it is trying to write to. For instance it is expecting a String and it receives a boolean.

Here's what you usually get. Let's say I have a orders table like this and try to insert the first record into it.

q)orders:(time:`time$(); sym:`$(); client:`$(); side:`$();price:`real$(); qty:`int$());
q)`orders insert (`CLIENT1; "USDEUR"; `A; 1.56; 56.6);
`type

This sucks as it doesn't tell you where the type mismatches are and forced to manually check every column type against the element's type; which could take a while if there are many columns to check. If a production issue occurred relating to this issue, I decided to write a Q function which provides a list of columns that have a type mismatch and if it its good, it returns `good.

inspect:{ [host;port;table;data]
t:"bxhijefcsmdzuvt"!1 4 5 6 7 8 9 10 11 13 14 15 17 18 19h;
h:hopen `$":",host,":",string port;
m:h"meta ",table;
vt:abs type each data;
ct:abs t[exec t from value m];
r:(exec c from key m) where vt {x<>y}' ct;
$[0 < count r; r;`good]}

This can now be used to find why the insert is failing.

q) insert["host"; 8888; "orders"; (`CLIENT1; "USDEUR"; `A; 1.56; 56.6)]
`sym`qty

q) insert["host"; 8888; "orders"; (`CLIENT1; `USDEUR; `A; 1.56; 56)]
`good

There you go; it has computed which columns have the type mismatch and under further scrutiny, one can indeed see a blatant type mismatch between symbol->string and `int->`real for sym and qty respectively.