1LOGIWEB(5) File formats and protocols LOGIWEB(5)
2
3
4
6 logiweb - Logiweb file formats and protocols
7
9 Logiweb is a system for storing, locating, and transmitting Logiweb
10 pages. Logiweb pages may contain free text mixed with machine intelli‐
11 gible objects like computer programs, testsuites, and formal proofs.
12
13 Logiweb defines a referencing scheme in which each Logiweb page has a
14 unique Logiweb reference. A Logiweb reference is typically around 30
15 bytes long. A Logiweb reference contains, among other, a RIPEMD-160
16 hash key of the referenced page.
17
18 The Logiweb standard comprises the following file formats and network
19 protocols.
20
21 The Logiweb vector format
22 The Logiweb rack format
23 The Logiweb protocol
24
25 The vector format is a condensed representation suited for transmission
26 over the Internet. Furthermore, the authenticity of a Logiweb page in
27 vector format can be verified using the RIPEMD-160 hash function. Logi‐
28 web vectors can be retrieved from untrusted repositories and transmit‐
29 ted over untrusted networks.
30
31 The rack format is a comprehensive, precompiled representation suited
32 for storing on a local disk. Furthermore, the rack format is a good
33 starting point for conversion to other formats. The Logiweb Abstract
34 Machine lgwam(1) needs pages in rack format in order to execute pro‐
35 grams defined on the pages.
36
37 Earlier versions of Logiweb supported XML, MathML, Lisp S-expressions
38 and several other formats. The current version leaves such support to
39 the user through the capabilities for user defined rendering of the
40 Logiweb compiler lgc(1). Only support for latex(1), bibtex(1), makein‐
41 dex(1), dvipdfm(1), and cc(1) has survived in the present version.
42
43 The Logiweb protocol allows to locate a Logiweb page given its Logiweb
44 reference. More specificially, the Logiweb protocol allows to translate
45 a Logiweb reference into an http URL. Then http may be used for
46 retrieving the referenced page and RIPEMD-160 may be used for verifying
47 the authenticity of page.
48
49 Logiweb version 0.1.1 to 0.1.10 included a Logiweb server which imple‐
50 mented the Logiweb protocol. From Version 1.0.1, the Logiweb server is
51 excluded from the distribution. That is because installation of the
52 Logiweb server requires considerable privileges and network insight.
53
54 For Version 1.0.1 a Logiweb server does not even exist, and users of
55 the Logiweb compiler lgc(1) are refered to PATH based translation of
56 references to http URLs. Furthermore, the Logiweb protocol will change.
57 That is because it has proved to be too difficult to get an assigned
58 UDP port and thus the protocol will be modified to be channeled via
59 http. That will reduce efficiency but also has some benefits.
60
61 The Logiweb server will be distributed separately at a later stage.
62 When Logiweb servers are installed in the future, repositories which
63 run a server will appear to lgc(1) users to contain all Logiweb pages
64 in the world. The present but obsolete Logiweb protocol is documented
65 in logiweb(7).
66
68 The file formats and protocols are byte oriented, so we first define
69 bytes and vectors (where 'vector' means 'byte vector').
70
71 bit ::= "0" | "1"
72 byte ::= bit*8
73 vector ::= byte*
74
75 A byte denotes a value in 0..255. We refer to the least significant bit
76 of a byte as bit zero and to the most significant bit as bit seven.
77
78 Bytes of vectors are given in network order which is identical to the
79 order in which they are stored on disk. We refer to the first byte of a
80 vector as byte zero.
81
82 We write vectors in mixed endian binary: each byte is written with the
83 most significant bit first and bytes a written in network order.
84
85 We refer to bit m of byte n of a vector as bit m+8*n of the vector.
86 Thus, the bits of a two byte vector are given in this order: 7, 6, 5,
87 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8. This order gives rise to
88 the term 'mixed endian'.
89
90 Mixed endian notation resembles Internet dot notation. Internet dot
91 notation specifies bytes to be written in network order and each byte
92 to be written with the most significant digit first.
93
94 As an example, mixed endian binary 0000 0010 0000 0011 denotes a two
95 byte followed by a three byte.
96
98 We shall refer to non-negative integers as cardinals. We represent car‐
99 dinals and byte strings thus:
100
101 septet ::= bit*7
102 middle-septet ::= "1" septet
103 end-septet ::= "0" septet
104 cardinal ::= middle-septet* end-septet
105 string ::= length vector
106 length ::= cardinal
107
108 A middle or end septet denotes a value in 0..127. As examples, the mid‐
109 dle septet 1000 0010 and the end septet 0000 0010 both denote two.
110
111 Cardinals are expressed little endian base 128. As an example,
112
113 1000 0011 0000 0010
114
115 denotes 3+128*2 = 3+256 = 259.
116
117 A string encodes a sequence of bytes. The length field of the string
118 indicates the number of bytes in the string.
119
120 The grammar is *not* context free because the strings cannot be
121 expressed in a context free manner.
122
124 The vector format of Logiweb pages has the following syntax:
125
126 pagevector ::= bibliography dictionary body
127 bibliography ::= reference reference* zero
128 reference ::= string ; the string must be non-empty
129 zero ::= cardinal ; the cardinal must denote zero
130 dictionary ::= entry* zero
131 entry ::= index arity
132 index ::= nonzero
133 nonzero ::= cardinal ; the cardinal must be non-zero
134 arity ::= cardinal
135 body ::= vector
136
137 The indexes of the dictionary must be strictly decreasing.
138
139 We refer to the first reference of the bibliography as reference zero.
140 Reference zero must be the reference of the page itself.
141
142 Above, references are described as strings. The following definition is
143 more precise:
144
145 reference ::= length scheme ripemd timestamp
146 scheme ::= "0000 0001"
147 ripemd ::= byte*20
148 timestamp ::= mantissa exponent
149 mantissa ::= cardinal
150 exponent ::= cardinal
151
152 The length must equal the total number of bytes in scheme, ripemd, and
153 timestamp.
154
155 Whenever a Logiweb page is published, the time of publication is
156 included as a timestamp in the reference of the page. The timestamp is
157 given in Logiweb time as M*10^(-E) where M and E are the values of man‐
158 tissa and exponent, respectively.
159
160 Logiweb time indicates the number of seconds which have elapsed accord‐
161 ing to International Atomic Time (TAI) since TAI:00:00:00 of Modified
162 Julian Day (MJD) Zero. MJD-0 corresponds to the Gregorian date
163 GRD-1858-11-17. In 1858, UTC lacked 10 seconds behind TAI (even thought
164 they did not know at that time). Hence, Logiweb time measures the num‐
165 ber of seconds according to International Atomic Time since
166 UTC:23:59:50 of GRD-1858-11-16.
167
168 The ripemd code consists of 20 bytes which corresponds to 160 bits. The
169 ripemd code of a page must be computed using the RIPEMD-160 hash func‐
170 tion applied to all bytes following ripemd, including the timestamp of
171 the page itself.
172
173 The body represents a parse tree whose nodes may be Logiweb symbols or
174 strings. A Logiweb symbol comprises a Logiweb reference and an index
175 where the index is a cardinal. To be a valid symbol, the index of the
176 symbol must occur in the dictionary of the referenced page. The arity
177 of a symbol is the arity assigned to the symbol by that dictionary.
178
179 Transformation of the body given as a vector to a parse tree is called
180 'unpacking'. Reference one of a page may define an unpacker, i.e. a
181 user defined unpacking function. See the 'base' and 'lgc' Logiweb pages
182 which come with the distribution for details on this.
183
184 If reference one of a page does not define an unpacker then the body is
185 in default format which is as follows:
186
187 body ::= node*
188 node* ::= symbol | zero string
189 symbol ::= nonzero
190
191 The nodes of the body are given in Polish prefix. The arities of each
192 symbol is taken from dictionaries of the relevant page. Strings have
193 arity zero.
194
195 Each symbol has value 1+R+n*i where i is the index of the symbol, n is
196 the number of references in the bibliography, and R is the relative
197 reference of the symbol. The relative reference R if between 0 and n-1.
198 The reference of a symbol is the R'th element of the bibliography.
199
201 The rack of a page is a structure which represents a compiled version
202 of the page. See the 'base' and 'lgc' pages which come with the distri‐
203 bution for more on this.
204
205 A rack is an inhomogeneous array, i.e. and array whose elements may
206 have different types. Indexes of a rack are called hooks. Racks and
207 hooks are to Logiweb what records and record fields are to languages
208 like C.
209
210 Racks may reside in memory or may be stored to disk. Racks in memory
211 contain more data than racks stored to disk. Hence, racks are pruned
212 when written to disk and the missing parts have to be reconstructed
213 when reading the racks back into memory.
214
215 Pruning affects three hooks of a rack.
216
217 First, a rack in memory has a code hook which contains compiled ver‐
218 sions of value definitions. To keep disk racks architecture indepen‐
219 dent, the code hook is removed when writing a rack to disk and value
220 definitions have to be recompiled when reading the rack back. Using
221 racks still represent a great saving in time since no parsing, unpack‐
222 ing, and macro expansion is needed when loading a rack.
223
224 Second, a rack in memory has a cluster hook which contains racks of all
225 transitively referenced pages. This hook is removed since it contains a
226 lot of data which is fast and easy to reconstruct from the disk ver‐
227 sions of transitively referenced pages.
228
229 Third and finaly, a rack in memory has a diagnose hook. The value of
230 that hook is 'maptagged'. A maptag allows to delay computation of the
231 diagnose in an otherwise eager process. When writing a rack to disk,
232 computation of the diagnose is forced and the maptag is removed. When
233 reading the rack back from disk, a maptag is added to the diagnose
234 hook. Again, see the 'base' and 'lgc' pages that come with the distri‐
235 bution for details.
236
237 Once a rack is pruned for code hook, cluster hook, and diagnose maptag,
238 it is a data structure which is built up from only three data types:
239 cons, cardinals, and a special value named T which corresponds to NULL
240 in C.
241
242 Pruned racks are stored in the following format:
243
244 rack ::= racknode* count
245 count ::= cardinal ; three plus the number of nodes
246 racknode ::= card1 | card2 | cons
247 card1 ::= zero cardinal
248 card2 ::= two string
249 two ::= cardinal ; the cardinal must represent 2
250 cons ::= head tail
251 head ::= cardinal
252 tail ::= cardinal
253
254 A rack with no nodes represents the value T. Racks with at least one
255 node represents the value represented by the last node. Each node rep‐
256 resents a value as indicated in the following.
257
258 A card1 node represents the given cardinal.
259
260 A card2 node represents a cardinal as follows: Take the given string.
261 Discard the length field to obtain a vector. Append a byte with value
262 one to the vector. Then interpret the vector as a little endian cardi‐
263 nal base 256.
264
265 A cons represents the pair of the given head and tail. A head whose
266 value is one represents T. A head whose value H is greater than 2 rep‐
267 resents the value of node H-3. The tail is interpretted similarly. The
268 head and tail must equal one or reference nodes that appear before the
269 cons.
270
272 As can be seen, cardinals can be represented in two ways. All cardinals
273 can be represented as a card1 whereas only some cardinals can be repre‐
274 sented as a card2. When writing racks to disk, one should use the card2
275 representation when possible and the card1 representation otherwise.
276
277 The representation allows to represent sharing in that a node can be
278 referenced from more than one cons node. When writing a rack to disk,
279 one should traverse the rack left-to-right, root-to-leaves, and depth
280 first. One should maximize sharing such that no two nodes represent the
281 same value.
282
283 If followed, the rack writing conventions ensure that the same rack
284 will be written the same way on distinct systems. That simplifies test‐
285 ing but is not necessary for reading racks back to memory.
286
288 Klaus Grue, http://logiweb.eu/
289
291 lgc(1), lgwam(1), lgc(5), lgc.conf(5), logiweb(7)
292
293
294
295Logiweb JULY 2009 LOGIWEB(5)