This is a working draft, check the "New In Version 1.1" section for the main changes, please feel free to email any suggestions to the above address, if you have any.
This document assumes a familiarity with the CSV escaping rules as specified in RFC 4180 ,specifically noting that leading and traling spaces within a field should NOT be trimmed when parsing into a string. (Wikipedia)
The CSV and CSVX formats include quotes as a special character therfore when quoting text italicised text is used in their place. For example Hello is a string not containing quotes but "Hello" contains quotes. Where a literal linebreak is needed to be quoted a tilda is used e.g. Line 1~Line 2.
The act of reading a CSVX byte-stream and converting it to a representative in-memory object is referred to as "Parsing", the act of converting that object into a new CSVX byte-stream is referred to as "Rendering".
CSVX, Comma-Seperated-Values-eXtended is an extension to the de-facto CSV standard to include meta data and columnar data typing. As it is an extension to CSV the standard is designed to exhibit backwards and forwards compatibility with the following requirements:
A CSVX Stream is delimited into Blocks, each being a section of the Stream having a particular purpose and laid out and parsed according to stRict criteria. This specification dryly explains each block and the layout and parsing rules for each block, with examples where neccesary. These blocks are marked within the stream with the following headers.
Each block is marked in the Stream by it's ASCII literal byte-sequence between two newlines - with the exception of the CSVX header which begins a stream. Parsers may localise newlines but this specification recommends that LF (0x0A) be used. In order to prevent a stream from containing character sequences which mimic the block headings from breaking the Stream the following rules are mandated:
Where NAME corresponds to any of CSVX,META,USER,HEAD and DATA. This allows strings to contain the headers whilst preserving the original CSV escape rules unchanged. It also allows nesting of headers, i.e. [HEAD] will be rendered as [[HEAD]].
The CSVX block contains no internal information save for a version number, this version number is mandatory and a parser encountering a CSVX stream lacking a version number MUST either reject the Stream or fall-back to a basic CSV mode of operation.
00 | CSVX 01 | Version Number
The version number is a string and must be parsed and compared as such (i.e. not as a decimal number), it is expected that future extensions of the specification introduce new version numbers and that CSVX parsers are able to respond to this.
As no other information is enclosed in the CSVX Block the following line in the Stream should be one of the other section headers.
Whilst all sections are optional each section, if included, must still respect the ordering of the above list.
The META section contains information about the Stream itself, it is represented as rows of key,value pairs - CSV escaped where needed. Each key MUST have at least one associated value, a conforming parser should show an error alert when encountering an orphaned key. Keys in the META section that are NULL or not needed can be exluded when rendering a Stream.
The data stored in the META section is specific to the CSVX specification, however a parser should store and reform any unknown key/value pairs to preserve forward-compatibility. To prevent unauthorised specification extension any parser storing such key/value pairs should store them silently and NOT expose them, either by appending them to the USER collection or via any other means. E.g. a parser encountering a 1.0 CSVX Stream with 1.1 Page directives.
CSVX 1.1 META Title,My [CSVX] Stream DateCreated,2008-01-01
A basic CSVX META block, note that the Title parses as My CSVX Stream
All META keys are optional and currently only the following are supported:
| Key | Data Type | Min / Max | Description |
|---|---|---|---|
| Title | string | Max: 64 chars | Short name of the Stream.* |
| Author | string | Max: 64 chars | Name of the Stream creator. |
| Description | string | Max: 256 chars | Descriptive name of the Stream |
| DateCreated | datetime | Date the CSVX Stream was originally formed | |
| DateModified | datetime | Date of most recent change to the stream | |
| UID | string | Max: 256 chars | Unique identifier for the stream - When used for static published data this may be a URL pointing to the origin of the stream itself. |
| Session | string | Max: 256 chars | A UID specific to the current user session if used in a client-server system. |
* May only contain ASCII Chars a-z,A-Z,0-9 and _. May not start with a number. May also include a . which parsers may treat as a namespace seperator
Further key/value pairs are available for inclusion in the META section and are namespaced with the inclusion of a .. Parsers which Represent CSVX Streams as native objects may sub-parse these into similar Class/object heirarchies or simply internally represent the . with a _.
The Page Namespace allows you to send or request data in a client-server setup in a controlled, page-able manner. Processing rules for policing these - e.g. what to do if a client requests too-high a page number - in such a setup is outside the scope of this specification. One way to handle this would be to re-send the stream containing only the Page values to remind the client of its available scope.
| Page.Count | integer | Min: 1 | Total number of pages of data that exist, taking Page.Size into account.
If set to zero or null then the parser may assume it has all the data in the set on the current page. |
| Page.Size | integer | Min: 1 | Number of data rows in the DATA section.
If null or zero all data from the represented data set must be included. |
| Page.Current | integer | Min: 1 | 1-based Current Page Number.
If set to zero or null only the HEAD section should be sent with no correspongind DATA. This can then be used for representing schema information. Clients may also request the last page by sending -1 here. |
CSVX 1.1 META Table,tblCustomers Page.Current,0
Very simple paging example, this is a client requesting that the server send the META information for the Customers table but no DATA.
CSVX 1.1 META Table,tblCustomers Page.Count,10 Page.Size,20 Page.Current,0 HEAD CustomerID,Name,CountryCode u,s32,c2
This is the servers response, with a default page size.
CSVX 1.1 META Table,tblCustomers Page.Size,50 Page.Current,1
This is a client requesting the first page of results, but asking for 50 results per page instead of the default 20.
The USER section allows extra data regarding the stream to be stored in key,value pairs. The format and rules are as for the META but a key may hold a NULL value by simply including nothing to the right of the first comma.
The USER key/values can be in any format, provided they are CSV escaped where needed. However parsers should store their contents in a string collection and not attempt to determine the datatypes of each value.
CSVX 1.1 META Title,UserData Description,A [CSVX] Stream containing [USER] entries USER Draft Version,2 Edited By,"John,Dave,Chris" Note To John,How do these numbers grab you?
USER Example. Note that the value of 2 shown here would be interpreted as a string.
The HEAD Block containins lines describing the nature of each field / column in the datatable. This expands on the key/value nature of the META and USER sections by allowing multiple rows for each "key". If a row is to be represented without an earlier one being present that row must be included but left blank. Trailing unused rows need not be included
00 | HEAD 01 | Field Names 02 | Field Types 03 | Field Flags
The Field Names line is a CSV formatted list of names, these act as the Keys to identify each column, so must be unique in the row. Field names must not begin with a digit or an underscore, these must be encoded in square brackets, [_FIELD_NAME_] is therefore a valid Field Name entry and should be parsed and exposed as _FIELD_NAME_.
CSVX 1.1 HEAD ID,Name,Age,Gender,Registered DATA 1,John,20,M,1 2,Jane,30,F,1 3,Dave,40,M,0An example CSVX stream with named columns
The Field Types line is a list of data types for each Field.
They are currently formatted as a single type-defining character followed by the number of bytes (NOT characters, in the case of UTF strings) to be used according to the following rules:
Field Types:
| Char | Default Bytecount | Description | Valid DATA Block Characters / Format |
|---|---|---|---|
| b | - | Bit (True / False) | [10] |
| c | - | Currency (Decimal) | [-0123456789.] |
| d | - | Date | ccyy-MM-dd |
| e | - | Epoch (Date + Time) | ccyy-MM-ddTHH:mm:ss.sss |
| f | - | Float | [-0123456789.E] |
| i | 4 | Integer | [-0123456789] |
| s | 32767 | String | CSV escaped |
| t | - | Time | HH:mm:ss.sss |
| u | 4 | Unsigned Integer | [0123456789] |
Field Type Examples
| Code | Description |
|---|---|
| s | A 32767 (max) character string |
| s0 | A 32767 (max) character string |
| s8 | An 8 (max) character string |
| u1 | A one byte unsigned integer (0 to 255) |
| i2 | A two byte signed integer (-32768 to 32767) |
Example:
CSVX 1.1 META Title,Customers HEAD ID,Name,Registered,Country u,s32,b,c2 DATA 1,John,1,GB 2,Jane,,DE 3,Dave,0,DENote that Jane's Registered flag is NULL.
Each field can have flags associated with it to describe the constraints placed upon the data in that column, these are primarily concerned with the policing of duplicate and NULL vaules and marking out primary keys. If a stream does not contain any field flags then each field must be assumed to be Nullable and Not-Unique.
| Flag | Title | Description |
|---|---|---|
| a | Auto Increment | When adding new rows this should be auto-incremented from the highest available value in the set, or 1 for an empty stream. |
| n | Nullable | The data in this column is allowed to be NULL |
| p | Primary Key | A primary key column, must be included when Delta encoding, more than one "p" column is allowed and each combined grouping in a row must be unique. |
| r | Random | When adding new rows this should be auto-assigned a random 64-bit number. |
| u | Unique | No two rows may share the same value |
DRAFT NOTES: We want these to be powerful, but we don't want too many permutations. What's missing or needed? How about t for Timestamp, for example? Suggestions to CSVX@cyclomedia.co.uk with CSVX in the subject please.
The data section should be a straightforward CSV formatted table of data whose columns and data types are formatted as specified in the HEAD section if included.
Data should only use the "valid" characters as listed in that table and as such floating points should be expressed in the fashion 1.234E5 where the number of digits would be prohibitive to write. A CSVX stream may be parsed and re-rendered to show different encodings without preserving the original text provided no information is lost. The previous example may, therefore, be re-rendered as 123400 whilst still remaining conformant.
A CSVX Stream may contain Delta information, with each row and column flagged as Updated, Deleted or Inserted accordingly. A delta encoded CSVX stream requires at least one column to be flagged as a primary key column so that the recieving party knows how to apply the delta'ed data.
To trasmit a delta stream you need to include a column named __DELTA__, for clear human-readibility would ideally be the first/leftmost column in the stream but this is not required. Each row would then have an entry depicting its state from the following table.
| Entry | SQL Equivalent | Description |
|---|---|---|
| No change to this row (ideally this row would not included in a Delta-Stream!) | ||
| + | INSERT | A New row, with (a) temporary Primary Key(s) |
| = | UPDATE | An existing row having had an entry changed |
| - | DELETE | An existing row, to be deleted |
Using these it would therefore be possible for a user in a client-server system to modify a data table on the server sending only minimal information. In order for the server to acknowledge INSERTs the server-end of the conversation needs to be able to update Primary Key values. To do so the Primary Key(s) should be included unchanged alongside a second column of their new values braced by double underscores.
The following example returns an update of the Customers table to the server which then in turn returns an acknowledgement
CSVX 1.1 META Table,tblCustomers HEAD ID,Name,Registered,Country u,s32,b,c2 p,,, DATA 1,John,1,GB 2,Jane,,DE 3,Dave,0,DEInitial stream sent from server to client
CSVX 1.1 META Table,tblCustomers HEAD __DELTA__,ID,Name ,u,s32 ,p, DATA -,2,Jane =,3,David +,4,BillStream returned from client to server with changes in place. John is unchanged, Jane has been deleted, Dave has been renamed to David and Bill has been added.
CSVX 1.1 META Table,tblCustomers HEAD __DELTA__,ID,__ID__ ,u,u ,p,p DATA =,4,5Acknowledgement from the server. The client's Primary Key for Bill has been corrected by the server to the value assigned there.
CSVX 1.1 META Table,tblCustomers HEAD ID,Name,Registered,Country u,s32,b,c2 p,,, DATA 1,John,1,GB 3,David,0,DE 5,Bill,,Final state of the client's internal data.
Note: Concurrency policing is outside the scope of this specification, which is only concerned with the transportation of the raw data.
CSVX (in relation to computer data storage and transmission formats) and the contents of this document are copyright © 2008 Ric Hardacre BSc.
The author (Ric Hardacre) is not aware of any existing copyright or patents in any country which may be relevant in relation to this specification or the methods involved in the implementation of this specification. Therefore it is expected that this specification is freely implimentable by any developer for use in any software, commercial or otherwise. However the author does not offer protection against legal action arising due to implimentation of this specification nor does the author offer any warranties against data loss, damage or any other losses incurred by any means due to any implementation of this standard.
There are no licensing fees associated with using CSVX in any software. However the author reserves the right to require software that advertises an ability to process or handle CSVX data to comply with the specification as laid out here and may therefore require the removal of said advertising associated with said product if this is not demonstrably the case.
I don't believe that there are any copyrights or patents that cover CSVX, but I may be wrong, so please email me and let me know if you find any. In any case if you use CSVX and Bad Things happen to you, your data, property or pets then please don't blame me! You can write software that uses CSVX and yell it from the rooftops and plaster it all over your website, and it's OK if there's a bug in there that means it doesn't get things quite right - so long as you get around to fixing it. But if you attempt to Embrace and Extend CSVX I can tell you not to use the term "CSVX" anywhere in association with that software.