Fleece’ Performance
Fleece is a Lua to JSON converter that is an order of magnitude faster than other libraries. In the following I would like to take you through a quick tour on where the performance comes from. This could be instructive about the Lua core, to better utilize Fleece, or help making Fleece better.
Speed is an enabler, just as design is. For the Eonblast Eonbeam server, Fleece makes quite a difference and allows for a technical path that would otherwise have to be ruled out: namely, to hold the main state in Lua, instead of Erlang. There are a number of implications from that but a main point is that the overall architecture gains in clarity.
So the point of Fleece is something practical and from that point of departure no holds are barred. It’s not a style competition but an exercise in speed.
Performance
Part of Fleece’ speed advantage over other JSON packages is derived from the fact that it accesses the Lua table data at the core C level, below the Lua C API, which frees a significant number of cycles. It also uses a string buffer that is custom designed for the specific task of building a JSON string. Fleece uses custom programmed, faster float and integer conversion. And at the expense of some memory consumption, it keeps some string and lookup buffers around once initialized, for the next encoding. Performance can be tuned by tweaking buffer sizes at compile time, by setting defines in src/fleece-intern.h. The defaults are set for Fleece to produce standard JSON reasonably fast.
First, for a real quick try out, you could do:
git clone git://github.com/Eonblast/fleece-lite.git fleece cd fleece make linux # (or macosx or macosx-old) make test make bench make bench2
To benchmark vs a C library, build LuaJSON that comes bundled with Fleece in etc/:
make linux-test # (or macosx-test or macosx-old-test) make bench3
Here is what you would see:
---------------------------------------------------------------------------------
1000 elements in t[i]=i
100x luajson.stringify(t) 3000ns/element [1,2,3,4,5,6,7,8,9,1..
100x json4.encode(t) 7800ns/element [1,2,3,4,5,6,7,8,9,1..
100x fleece.json(t) 300ns/element 10%, 3% [1,2,3,4,5,6,7,8,9,1..
---------------------------------------------------------------------------------
1000 elements in t['x'..i]=i
100x luajson.stringify(t) 4700ns/element {"x97":97,"x366":366..
100x json4.encode(t) 10600ns/element {"x97":97,"x366":366..
100x fleece.json(t) 400ns/element 8%, 3% {"x518":518,"x744":7..
---------------------------------------------------------------------------------
1000 elements in t[i]=randstr(i)
100x luajson.stringify(t) 2800ns/element ["d","lnfbrryjnvabnr..
100x json4.encode(t) 9000ns/element ["otkxtbnyemkrqdcyhh..
100x fleece.json(t) 200ns/element 7%, 2% ["fgyqrqmmvkuouatzye..
---------------------------------------------------------------------------------
1000 elements in t[randstr(i)]=randstr(i)
100x luajson.stringify(t) 5300ns/element {"bbwd":"kkpgojbhnor..
100x json4.encode(t) 12600ns/element {"wvubdmsybdgtycqjg"..
100x fleece.json(t) 500ns/element 9%, 3% {"hfaxoxd":"yyndsgxy..
---------------------------------------------------------------------------------
Fleece is around 10 times faster than LuaJSON and sometimes even 100x faster than the native implementation JSON4. More details on the benchmarks:
http://eonblast.github.com/fleece-lite/test.html
Core Data Access
If done using the Lua API, Iterating over the data of a Lua table to make a JSON string from them is a read-only task that goes through a lot of moves designed for broader application. The API uses the stack a lot and for every value produces overhead structures and copies of data resides in the table. To avoid this cost, Fleece iterates over the table data the same way that the Lua core does for re-hashing tables, which is a frequent, optimized task.
Below you see a abbreviated listing of the core of the Fleece source that traverses over a Lua table. Internally, a Lua table is composed of two parts, a C array part and a hash part. This makes sense, because a Lua table is used both for strict arrays and also associative arrays that may have any type of object as keys. Intuitively, the array data of a table could be expected to fully end up in the C array, but that’s not true for all of it. There are reasons for Lua array data ending up in the C hash part. E.g. ‘holes’ in the array, or indices < 1.
Consequently, the table traversal shown below has two parts, one takes care of the internal C array, the other of the internal hash. To make it more readable, I have cut some counters and flags from the source that take care of the decisions of wether to produce a JSON array, or a JSON object from a Lua table. I have also cut lines that deal with 'rewinding' a JSON array creation in cases where it becomes clear 'on the fly' that a JSON object must be created instead. You find the current, full source here:
https://github.com/Eonblast/fleece-lite/blob/master/src/fleece-stringify.c
The first part below, using variables lg, ttlg, i and lim deals with finding the 'slices' of data in the C array of the Lua table implementation. It is verbatim Lua core and part of the bare necessities to traverse the C array part.
Note the !ttisnil(v) condition in both array and hash part. This is the place where the internal 'empty' table slots are being skipped. Unlike when programming in Lua, underneath, there can be 'invisble' entries in tables that hold nils. They have to be treated as if they were not there. It is, of course, not appropriate to make them into JSON null entries per default. This comparing to nil, and skipping, is also exactly what the Lua core does to count entries in tables.
Lua Table Traversion fleece-stringify.c
#define stringify_value_macro(_ctrl, _o) \
{ \
...
else if (ttistable(_o)) \
{ \
buffer_add_char(_ctrl, '{'); \
...
stringify_array_part(_ctrl, hvalue(_o), &count, &pure, &tried, force); \
...
stringify_hash_part(_ctrl, hvalue(_o), &count, &pure); \
...
*(dynp(_ctrl->parts_list_last) -1)='}'; \
...
} \
...
/* This is Lua 5.1.4 specific */
void stringify_array_part (insp_ctrl *ctrl, const Table *t, size_t *count,
int *pure, int *tried, int force)
{
int lg;
int ttlg; /* 2^lg */
int i = 1; /* count to traverse all array keys */
...
for (lg=0, ttlg=1; lg<=MAXBITS; lg++, ttlg*=2) { /* for each slice */
int lim = ttlg;
if (lim > t->sizearray) {
lim = t->sizearray; /* adjust upper limit */
if (i > lim)
break; /* no more elements */
}
/* elements in range (2^(lg-1), 2^lg] */
for (; i <= lim; i++) {
TValue * v = &t->array[i-1];
if(!ttisnil(v)) {
if(force || *count != i) {
...
buffer_assert(ctrl, 14);
size_t len;
itoa(i, dynp(ctrl->parts_list_last), &len);
ctrl->parts_list_last->len += len;
ctrl->total_len += len;
buffer_add_char(ctrl, ':');
...
}
...
stringify_value_macro(ctrl, v);
buffer_add_char(ctrl, ','); // N.B. can be overwritten with }, ]
}
}
}
}
/* This is Lua 5.1.4 specific */
void stringify_hash_part (insp_ctrl *ctrl, const Table *t, size_t *count, int *ppure)
{
int i = sizenode(t);
...
while (i--) {
Node *node = &t->node[i];
if(!ttisnil(key2tval(node)) && !ttisnil(gval(node))) {
...
stringify_value_macro(ctrl, key2tval(node));
buffer_add_char(ctrl, ':');
stringify_value_macro(ctrl, gval(node));
buffer_add_char(ctrl, ',');
...
}
}
...
}
There is one additional function in play that was borrowed from the Lua core source, index2adr(). It is a brief conversion to access table keys directly, as a value. It is not exported in Lua 5.1.4, so it had to be copied directly into the fleece library.
Obviously the above will not be compatible with future versions of Lua, or LuaJIT. But it’s not a lot to adapt and the gain is significant.
How much it saves depends heavily on the the contents of the table to be encoded but accounts roughly for a third of the speed gains.
String Buffers
Another source of performance gains was the introduction of a highly specialized string buffer. It is custom made for producing a JSON string and was improved incrementally in a sufficiently empiric approach. The path was to:
Ideally, you want to allocate one buffer beforehand that is large enought to hold the whole eventual JSON string. In practice you can’t know how big that should be unless you go and count first. But for many tables, that one-two punch turns out to be way slower than simply allocating a modest buffer and give it a go, adding more when necessary.
So the basic buffer structure is just that, a potentially more lavish first buffer (called ‘lucky’ in the source because you are lucky, performance-wise, if that buffer turns out big enough). And then smaller buffers that are allocated when you need them. You can tweak their size in fleece-intern.h for now. There should be parameters for that at some point.
The aim is to touch each bit of data only once, or twice if need be. The second touching being the eventual concatenating of the buffers to return the result back to Lua. If the ‘lucky’ buffer was lucky and big enough, even that step is skipped.
After the buffer mechanism had reached a stable condition, half of it was turned into macros to ‘inline’ them. A function call basically only takes place when a new, additional buffer has to be created. All writing usually takes place using macros, copying directly onto the current buffer.
To reduce repetitive calculations on remaining buffer sizes, the buffer is operated on the assumption that it must always have three reserve characters left, on top of what other space reserve it is tested for at any time. These tests are taking place before a Lua value’s string representation is written into the buffer. The reserve space can then be ‘blindly’ used up for two of ‘,:”"‘. Only two of these ever appear consecutively in a JSON string, before the next actual value, or a bracket is written. During which the next reserve space is taken care of, for the next ‘punctuation’ characters. To maintain this tacit assurance saves a lot of cycles, because the buffer size calculations before writing ‘,:”"‘ can be dropped altogether.
Replacing C memcpy by a native C variant showed gains on a slower machine but Fleece is to be optimized for servers. What turned out to be efficient though, is a Duff’s Device-like switch that casts memory areas into structures and by that tries to make C optimize the moving of small memory blocks at compile time. This looks funny but works quite well. It can come to bear quite heavily since especially string table keys are often short. For longer blocks, the C memcpy implementation, as was to be expected, turns out to be the most efficient. You can find the complete buffer handling source at:
https://github.com/Eonblast/fleece-lite/blob/master/src/fleece-buffers.c and https://github.com/Eonblast/fleece-lite/blob/master/src/fleece-buffers.h
Faster Memcpy fleece-buffers.h
struct char2 { char mem[2]; }; typedef struct char2 *buf2;
struct char3 { char mem[3]; }; typedef struct char3 *buf3;
struct char4 { char mem[4]; }; typedef struct char4 *buf4;
struct char5 { char mem[5]; }; typedef struct char5 *buf5;
struct char6 { char mem[6]; }; typedef struct char6 *buf6;
struct char7 { char mem[7]; }; typedef struct char7 *buf7;
struct char8 { char mem[8]; }; typedef struct char8 *buf8;
struct char9 { char mem[9]; }; typedef struct char9 *buf9;
struct char10 { char mem[10]; }; typedef struct char10 *buf10;
struct char11 { char mem[11]; }; typedef struct char11 *buf11;
struct char12 { char mem[12]; }; typedef struct char12 *buf12;
struct char13 { char mem[13]; }; typedef struct char13 *buf13;
struct char14 { char mem[14]; }; typedef struct char14 *buf14;
struct char15 { char mem[15]; }; typedef struct char15 *buf15;
struct char16 { char mem[16]; }; typedef struct char16 *buf16;
#define fast_memcpy_macro(_dest, _src, _count) { \
\
if(_count<=16) { \
switch(_count) { \
case 1: *(char*)_dest = *(char*)_src; break; \
case 2: *(buf2)_dest = *(buf2)_src; break; \
case 3: *(buf3)_dest = *(buf3)_src; break; \
case 4: *(buf4)_dest = *(buf4)_src; break; \
case 5: *(buf5)_dest = *(buf5)_src; break; \
case 6: *(buf6)_dest = *(buf6)_src; break; \
case 7: *(buf7)_dest = *(buf7)_src; break; \
case 8: *(buf8)_dest = *(buf8)_src; break; \
case 9: *(buf9)_dest = *(buf9)_src; break; \
case 10: *(buf10)_dest = *(buf10)_src; break; \
case 11: *(buf11)_dest = *(buf11)_src; break; \
case 12: *(buf12)_dest = *(buf12)_src; break; \
case 13: *(buf13)_dest = *(buf13)_src; break; \
case 14: *(buf14)_dest = *(buf14)_src; break; \
case 15: *(buf15)_dest = *(buf15)_src; break; \
case 16: *(buf16)_dest = *(buf16)_src; break; \
} \
} \
else { \
memcpy(_dest, _src, _count); \
} \
}
For completeness, the faster mempy replacement, for a slow machine (Apple PPC) was:
void fast_memcpy(char *dest, const char* src, size_t count) {
char* dst8 = (char*)dest;
char* src8 = (char*)src;
--src8;
--dst8;
while (count--)
*++dst8 = *++src8;
}
I am afraid I lost the reference where I got it from. There is some intelligent reasoning behind it, as to why it is done this way. In any event, it is not used in Fleece, as for faster CPUs it seemed to be slower than gcc memcpy.
Another measure that resulted in significant gains was to pull the counting and copying of a string into one. Since counting becomes only relevant when characters are present that need to be escaped (otherwise destination length will equal source length), the idea was that the loop over the string could be used to both count and copy. In the case that too many control characters are encountered, the copying is canceled, only the counting is continued to the end. Sufficient buffer is then allocated to cope with the string to be added and the copying is restarted. This time, without counting. This is separated into individual maximally fast functions, which pays off for this really hot spot. It is actually completely optimized only in the x86 inline assembler version, which presents the final optimization step.
Number conversion
Finally, number conversion algorithms play a role in JSON string creation.
The overall strategy concerning numbers is:
The standard way to produce the string representation of a number is using sprintf(). As could be expected, that introduces overhead that is not needed and not wanted at such a central place in the loop. If you encode a million numbers, obviously every cycle saved is worth a million cycles, so this deserved attention.
Fleece comes with four additional functions that can emulate sprintf(). But for compatibility, you can also make it use sprintf() and be sure to get a more standard conformant result. At present, you can change the default precision only at compile time by changing the constant in fleece-intern.h.
Don't give too much credit to the standard here: standards in floating point conversions, and their internal representation, are a bit screwed up. So not following the standard in this case may be less daring than it may sound.
Fleece also takes care of floats that have no decimal fraction, by casting them into ints and then using the faster integer to ASCII conversion.
And it creates two internal look up tables of numbers 1 through 999 to avoid divisions in the conversion of integers in the range of - to +1,000,000. The tables allow for the replacement of up to three costly divisions by one single, much faster pointer addition. Fleece allocates and fills two 4K tables for this and leaves these tables up once created. In a future version there should be an option to switch the use or the keeping of these tables off. For integers outside the range of -+1M, a 'normal' division strategy is applied.
The integer conversion looks like the following code snippet. I trimmed some comments and assertions. You can find the original source here:
https://github.com/Eonblast/fleece-lite/blob/master/src/fleece-numbers.c
Integer Conversion fleece-numbers.c
void itoa(int pvalue, char* pstr, size_t *rlen) {
static char num[] = "0123456789";
int value = pvalue;
char *str = pstr;
// take care of sign
if(value<0) { *str++ = '-'; value = -value; }
/* -999 - +999 */
if (value<1000) {
if(!mux) initmux();
char *q = mux + value*4;
char qq;
*str++=*q++; if(qq=*q++) { *str++=qq; if(qq=*q++) { *str++=qq; } }
*str=0;
*rlen = str - pstr;
goto ret;
}
if (value<1000000) {
if(!mux) initmux();
char *q = mux + (value/1000*4);
char qq;
*(str++)=*q++; if(qq=*q++) { *(str++)=qq; if(qq=*q++) { *(str++)=qq; } }
value %= 1000;
q = mux0 + (4*value);
*str++=*q++;
*str++=*q++;
*str++=*q++;
*str=0;
*rlen = str - pstr;
goto ret;
}
/* #4 any length
Source: Stuart Lowe's optimization of the Kernighan & Ritchie sample
at http://www.strudel.org.uk/itoa/ */
char* wstr=str;
int sign;
div_t res;
// Conversion. Number is reversed.
do {
res = div(value,10);
*wstr++ = num[res.rem];
} while((value=res.quot));
if(sign<0) *wstr++='-';
*wstr='\0';
*rlen = wstr - pstr;
strreverse(str,wstr-1);
ret:;
...
return;
}
This concludes our trip through the inner workings of Fleece. Thanks for your interest, I'll be happy to hear back from you.
Henning
[1] Duff's Device: http://en.wikipedia.org/wiki/Duff's_device