reference, declarationdefinition
definition → references, declarations, derived classes, virtual overrides
reference to multiple definitions → definitions
unreferenced
    1
    2
    3
    4
    5
    6
    7
    8
    9
   10
   11
   12
   13
   14
   15
   16
   17
   18
   19
   20
   21
   22
   23
   24
   25
   26
   27
   28
   29
   30
   31
   32
   33
   34
   35
   36
   37
   38
   39
   40
   41
   42
   43
   44
   45
   46
   47
   48
   49
   50
   51
   52
   53
   54
   55
   56
   57
   58
   59
   60
   61
   62
   63
   64
   65
   66
   67
   68
   69
   70
   71
   72
   73
   74
   75
   76
   77
   78
   79
   80
   81
   82
   83
   84
   85
   86
   87
   88
   89
   90
   91
   92
   93
   94
   95
   96
   97
   98
   99
  100
  101
  102
  103
  104
  105
  106
  107
  108
  109
  110
  111
  112
  113
  114
  115
  116
  117
  118
  119
  120
  121
  122
  123
  124
  125
  126
  127
  128
  129
  130
  131
  132
  133
  134
  135
  136
  137
  138
  139
  140
  141
  142
  143
  144
  145
  146
  147
  148
  149
  150
  151
  152
  153
  154
  155
  156
  157
  158
  159
  160
  161
  162
  163
  164
  165
  166
  167
  168
  169
  170
  171
  172
  173
  174
  175
  176
  177
  178
  179
  180
  181
  182
  183
  184
  185
  186
  187
  188
  189
  190
  191
  192
  193
  194
  195
  196
  197
  198
  199
  200
  201
  202
  203
  204
  205
  206
  207
  208
  209
  210
  211
  212
  213
  214
  215
  216
  217
  218
  219
  220
  221
  222
  223
  224
  225
  226
  227
  228
  229
  230
  231
  232
  233
  234
  235
  236
  237
  238
  239
  240
  241
  242
  243
  244
  245
  246
  247
  248
  249
  250
  251
  252
  253
  254
  255
  256
  257
  258
  259
  260
  261
  262
  263
  264
  265
  266
  267
  268
  269
  270
  271
  272
  273
  274
  275
  276
  277
  278
  279
  280
  281
  282
  283
  284
  285
  286
  287
  288
  289
  290
  291
  292
  293
  294
  295
  296
  297
  298
  299
  300
  301
  302
  303
  304
  305
  306
  307
  308
  309
  310
  311
  312
  313
  314
  315
  316
  317
  318
  319
  320
  321
  322
  323
  324
  325
  326
  327
  328
  329
  330
  331
  332
  333
  334
  335
  336
  337
  338
  339
  340
  341
  342
  343
  344
  345
  346
  347
  348
  349
  350
  351
  352
  353
  354
  355
  356
  357
  358
  359
  360
  361
  362
  363
  364
  365
  366
  367
  368
  369
  370
  371
  372
  373
  374
  375
  376
  377
  378
  379
  380
  381
  382
  383
  384
  385
  386
  387
  388
  389
  390
  391
  392
  393
  394
  395
  396
  397
  398
  399
  400
  401
======================================
XRay Flight Data Recorder Trace Format
======================================

:Version: 1 as of 2017-07-20

.. contents::
   :local:


Introduction
============

When gathering XRay traces in Flight Data Recorder mode, each thread of an
application will claim buffers to fill with trace data, which at some point
is finalized and flushed.

A goal of the profiler is to minimize overhead, so the flushed data directly
corresponds to the buffer.

This document describes the format of a trace file.


General
=======

Each trace file corresponds to a sequence of events in a particular thread.

The file has a header followed by a sequence of discriminated record types.

The endianness of byte fields matches the endianess of the platform which
produced the trace file.


Header Section
==============

A trace file begins with a 32 byte header.

+-------------------+-----------------+----------------------------------------+
| Field             | Size (bytes)    | Description                            |
+===================+=================+========================================+
| version           | ``2``           | Anticipates versioned  readers. This   |
|                   |                 | document describes the format when     |
|                   |                 | version == 1                           |
+-------------------+-----------------+----------------------------------------+
| type              | ``2``           | An enumeration encoding the type of    |
|                   |                 | trace. Flight Data Recorder mode       |
|                   |                 | traces have type == 1                  |
+-------------------+-----------------+----------------------------------------+
| bitfield          | ``4``           | Holds parameters that are not aligned  |
|                   |                 | to bytes. Further described below.     |
+-------------------+-----------------+----------------------------------------+
| cycle_frequency   | ``8``           | The frequency in hertz of the CPU      |
|                   |                 | oscillator used to measure duration of |
|                   |                 | events in ticks.                       |
+-------------------+-----------------+----------------------------------------+
| buffer_size       | ``8``           | The size in bytes of the data portion  |
|                   |                 | of the trace following the header.     |
+-------------------+-----------------+----------------------------------------+
| reserved          | ``8``           | Reserved for future use.               |
+-------------------+-----------------+----------------------------------------+

The bitfield parameter of the file header is composed of the following fields.

+-------------------+----------------+-----------------------------------------+
| Field             | Size (bits)    | Description                             |
+===================+================+=========================================+
| constant_tsc      | ``1``          | Whether the platform's timestamp        |
|                   |                | counter used to record ticks between    |
|                   |                | events ticks at a constant frequency    |
|                   |                | despite CPU frequency changes.          |
|                   |                | 0 == non-constant. 1 == constant.       |
+-------------------+----------------+-----------------------------------------+
| nonstop_tsc       | ``1``          | Whether the tsc continues to count      |
|                   |                | despite whether the CPU is in a low     |
|                   |                | power state. 0 == stop. 1 == non-stop.  |
+-------------------+----------------+-----------------------------------------+
| reserved          | ``30``         | Not meaningful.                         |
+-------------------+----------------+-----------------------------------------+


Data Section
============

Following the header in a trace is a data section with size matching the
buffer_size field in the header.

The data section is a stream of elements of different types.

There are a few categories of data in the sequence.

- ``Function Records``: Function Records contain the timing of entry into and
  exit from function execution. Function Records have 8 bytes each.

- ``Metadata Records``: Metadata records serve many purposes. Mostly, they
  capture information that may be too costly to record for each function, but
  that is required to contextualize the fine-grained timings. They also are used
  as markers for user-defined Event Data payloads. Metadata records have 16
  bytes each.

- ``Event Data``: Free form data may be associated with events that are traced
  by the binary and encode data defined by a handler function. Event data is
  always preceded with a marker record which indicates how large it is.

- ``Function Arguments``: The arguments to some functions are included in the
  trace. These are either pointer addresses or primitives that are read and
  logged independently of their types in a high level language. To the tracer,
  they are all simply numbers. Function Records that have attached arguments
  will indicate their presence on the function entry record. We only support
  logging contiguous function argument sequences starting with argument zero,
  which will be the "this" pointer for member function invocations. For example,
  we don't support logging the first and third argument.

A reader of the memory format must maintain a state machine. The format makes no
attempt to pad for alignment, and it is not seekable.


Function Records
----------------

Function Records have an 8 byte layout. This layout encodes information to
reconstruct a call stack of instrumented function and their durations.

+---------------+--------------+-----------------------------------------------+
| Field         | Size (bits)  | Description                                   |
+===============+==============+===============================================+
| discriminant  | ``1``        | Indicates whether a reader should read a      |
|               |              | Function or Metadata record. Set to ``0`` for |
|               |              | Function records.                             |
+---------------+--------------+-----------------------------------------------+
| action        | ``3``        | Specifies whether the function is being       |
|               |              | entered, exited, or is a non-standard entry   |
|               |              | or exit produced by optimizations.            |
+---------------+--------------+-----------------------------------------------+
| function_id   | ``28``       | A numeric ID for the function. Resolved to a  |
|               |              | name via the xray instrumentation map. The    |
|               |              | instrumentation map is built by xray at       |
|               |              | compile time into an object file and pairs    |
|               |              | the function ids to addresses. It is used for |
|               |              | patching and as a lookup into the binary's    |
|               |              | symbols to obtain names.                      |
+---------------+--------------+-----------------------------------------------+
| tsc_delta     | ``32``       | The number of ticks of the timestamp counter  |
|               |              | since a previous record recorded a delta or   |
|               |              | other TSC resetting event.                    |
+---------------+--------------+-----------------------------------------------+

On little-endian machines, the bitfields are ordered from least significant bit
bit to most significant bit. A reader can read an 8 bit value and apply the mask
``0x01`` for the discriminant. Similarly, they can read 32 bits and unsigned
shift right by ``0x04`` to obtain the function_id field.

On big-endian machine, the bitfields are written in order from most significant
bit to least significant bit. A reader would read an 8 bit value and unsigned
shift right by 7 bits for the discriminant. The function_id field could be
obtained by reading a 32 bit value and applying the mask ``0x0FFFFFFF``.

Function action types are as follows.

+---------------+--------------+-----------------------------------------------+
| Type          | Number       | Description                                   |
+===============+==============+===============================================+
| Entry         | ``0``        | Typical function entry.                       |
+---------------+--------------+-----------------------------------------------+
| Exit          | ``1``        | Typical function exit.                        |
+---------------+--------------+-----------------------------------------------+
| Tail_Exit     | ``2``        | An exit from a function due to tail call      |
|               |              | optimization.                                 |
+---------------+--------------+-----------------------------------------------+
| Entry_Args    | ``3``        | A function entry that records arguments.      |
+---------------+--------------+-----------------------------------------------+

Entry_Args records do not contain the arguments themselves. Instead, metadata
records for each of the logged args follow the function record in the stream.


Metadata Records
----------------

Interspersed throughout the buffer are 16 byte Metadata records. For typically
instrumented binaries, they will be sparser than Function records, and they
provide a fuller picture of the binary execution state.

Metadata record layout is partially record dependent, but they share a common
structure.

The same bit field rules described for function records apply to the first byte
of MetadataRecords. Within this byte, little endian machines use lsb to msb
ordering and big endian machines use msb to lsb ordering.

+---------------+--------------+-----------------------------------------------+
| Field         | Size         | Description                                   |
+===============+==============+===============================================+
| discriminant  | ``1 bit``    | Indicates whether a reader should read a      |
|               |              | Function or Metadata record. Set to ``1`` for |
|               |              | Metadata records.                             |
+---------------+--------------+-----------------------------------------------+
| record_kind   | ``7 bits``   | The type of Metadata record.                  |
+---------------+--------------+-----------------------------------------------+
| data          | ``15 bytes`` | A data field used differently for each record |
|               |              | type.                                         |
+---------------+--------------+-----------------------------------------------+

Here is a table of the enumerated record kinds.

+--------+---------------------------+
| Number | Type                      |
+========+===========================+
| 0      | NewBuffer                 |
+--------+---------------------------+
| 1      | EndOfBuffer               |
+--------+---------------------------+
| 2      | NewCPUId                  |
+--------+---------------------------+
| 3      | TSCWrap                   |
+--------+---------------------------+
| 4      | WallTimeMarker            |
+--------+---------------------------+
| 5      | CustomEventMarker         |
+--------+---------------------------+
| 6      | CallArgument              |
+--------+---------------------------+


NewBuffer Records
-----------------

Each buffer begins with a NewBuffer record immediately after the header.
It records the thread ID of the thread that the trace belongs to.

Its data segment is as follows.

+---------------+--------------+-----------------------------------------------+
| Field         | Size (bytes) | Description                                   |
+===============+==============+===============================================+
| thread_Id     | ``2``        | Thread ID for buffer.                         |
+---------------+--------------+-----------------------------------------------+
| reserved      | ``13``       | Unused.                                       |
+---------------+--------------+-----------------------------------------------+


WallClockTime Records
---------------------

Following the NewBuffer record, each buffer records an absolute time as a frame
of reference for the durations recorded by timestamp counter deltas.

Its data segment is as follows.

+---------------+--------------+-----------------------------------------------+
| Field         | Size (bytes) | Description                                   |
+===============+==============+===============================================+
| seconds       | ``8``        | Seconds on absolute timescale. The starting   |
|               |              | point is unspecified and depends on the       |
|               |              | implementation and platform configured by the |
|               |              | tracer.                                       |
+---------------+--------------+-----------------------------------------------+
| microseconds  | ``4``        | The microsecond component of the time.        |
+---------------+--------------+-----------------------------------------------+
| reserved      | ``3``        | Unused.                                       |
+---------------+--------------+-----------------------------------------------+


NewCpuId Records
----------------

Each function entry invokes a routine to determine what CPU is executing.
Typically, this is done with readtscp, which reads the timestamp counter at the
same time.

If the tracing detects that the execution has switched CPUs or if this is the
first instrumented entry point, the tracer will output a NewCpuId record.

Its data segment is as follows.

+---------------+--------------+-----------------------------------------------+
| Field         | Size (bytes) | Description                                   |
+===============+==============+===============================================+
| cpu_id        | ``2``        | CPU Id.                                       |
+---------------+--------------+-----------------------------------------------+
| absolute_tsc  | ``8``        | The absolute value of the timestamp counter.  |
+---------------+--------------+-----------------------------------------------+
| reserved      | ``5``        | Unused.                                       |
+---------------+--------------+-----------------------------------------------+


TSCWrap Records
---------------

Since each function record uses a 32 bit value to represent the number of ticks
of the timestamp counter since the last reference, it is possible for this value
to overflow, particularly for sparsely instrumented binaries.

When this delta would not fit into a 32 bit representation, a reference absolute
timestamp counter record is written in the form of a TSCWrap record.

Its data segment is as follows.

+---------------+--------------+-----------------------------------------------+
| Field         | Size (bytes) | Description                                   |
+===============+==============+===============================================+
| absolute_tsc  | ``8``        | Timestamp counter value.                      |
+---------------+--------------+-----------------------------------------------+
| reserved      | ``7``        | Unused.                                       |
+---------------+--------------+-----------------------------------------------+


CallArgument Records
--------------------

Immediately following an Entry_Args type function record, there may be one or
more CallArgument records that contain the traced function's parameter values.

The order of the CallArgument Record sequency corresponds one to one with the
order of the function parameters.

CallArgument data segment:

+---------------+--------------+-----------------------------------------------+
| Field         | Size (bytes) | Description                                   |
+===============+==============+===============================================+
| argument      | ``8``        | Numeric argument (may be pointer address).    |
+---------------+--------------+-----------------------------------------------+
| reserved      | ``7``        | Unused.                                       |
+---------------+--------------+-----------------------------------------------+


CustomEventMarker Records
-------------------------

XRay provides the feature of logging custom events. This may be leveraged to
record tracing info for RPCs or similarly trace data that is application
specific.

Custom Events themselves are an unstructured (application defined) segment of
memory with arbitrary size within the buffer. They are preceded by
CustomEventMarkers to indicate their presence and size.

CustomEventMarker data segment:

+---------------+--------------+-----------------------------------------------+
| Field         | Size (bytes) | Description                                   |
+===============+==============+===============================================+
| event_size    | ``4``        | Size of preceded event.                       |
+---------------+--------------+-----------------------------------------------+
| absolute_tsc  | ``8``        | A timestamp counter of the event.             |
+---------------+--------------+-----------------------------------------------+
| reserved      | ``3``        | Unused.                                       |
+---------------+--------------+-----------------------------------------------+


EndOfBuffer Records
-------------------

An EndOfBuffer record type indicates that there is no more trace data in this
buffer. The reader is expected to seek past the remaining buffer_size expressed
before the start of buffer and look for either another header or EOF.


Format Grammar and Invariants
=============================

Not all sequences of Metadata records and Function records are valid data. A
sequence should be parsed as a state machine. The expectations for a valid
format can be expressed as a context free grammar.

This is an attempt to explain the format with statements in EBNF format.

- Format := Header ThreadBuffer* EOF

- ThreadBuffer := NewBuffer WallClockTime NewCPUId BodySequence* End

- BodySequence := NewCPUId | TSCWrap | Function | CustomEvent

- Function := (Function_Entry_Args CallArgument*) | Function_Other_Type

- CustomEvent := CustomEventMarker CustomEventUnstructuredMemory

- End := EndOfBuffer RemainingBufferSizeToSkip


Function Record Order
---------------------

There are a few clarifications that may help understand what is expected of
Function records.

- Functions with an Exit are expected to have a corresponding Entry or
  Entry_Args function record precede them in the trace.

- Tail_Exit Function records record the Function ID of the function whose return
  address the program counter will take. In other words, the final function that
  would be popped off of the call stack if tail call optimization was not used.

- Not all functions marked for instrumentation are necessarily in the trace. The
  tracer uses heuristics to preserve the trace for non-trivial functions.

- Not every entry must have a traced Exit or Tail Exit. The buffer may run out
  of space or the program may request for the tracer to finalize toreturn the
  buffer before an instrumented function exits.