Patternist -- an XQuery 1.0, XSL-T 2.0 and XPath 2.0 Implementation
Patternist is an XPath 2.0, XQuery 1.0 and XSL-T 2.0 implementation, licensed under the GNU LGPL license.
Development priorities are:
- Conformance and interoperability
- The HCI aspect: that it is user friendly and has good usability
- Clean, compact implementation that compiles without warnings and is well documented
- Licensed under the GNU LGPL license.
- Has Trolltech's Qt 4 QtCore library as a dependency.
- Mostly uses reference-counted structures via smart pointers.
- At the time of writing, sloccount reports a code size of about 44000 lines of code, which includes testing utilities.
- The parser is constructed with Bison 2.x.
- The code aims to be thread safe.
Running Patternist can be tricky because it is not exposed through user-oriented utilities and APIs, only through development and debugging interfaces (see KXQTS). Patternist is currently being integrated with Trolltech's Qt library, and will therefore be exposed for user-oriented usage through that. When Trolltech publishes snapshots and the like, that will be announced at http://englich.wordpress.com/.
The source is currently in KDE's Subversion repository, and can be downloaded with the UNIX command:
svn co svn://anonsvn.kde.org/home/kde/trunk/kdenonbeta/kdom/patternist/
Alternatively, it can be browsed via the web interface.
The documentation that you are reading right now can be generated by running doxygen without arguments inside the top-level patternist/ directory. The generated documentation can subsequently be browsed from patternist/html/index.html.
- Warning:
- Patternist is incomplete, under continuous development, and therefore not conformant. The current behavior, and which constructs are currently supported, is undefined. The current pass rate is about 89% of W3C's XQuery Test Suite.
- See also:
- XML Path Language (XPath) 2.0, F Conformance
Understanding XPath 2.0, especially at the level required for implementing it, is a big task. I learned XPath 2.0 from the specifications, and that was not the most clever thing to do. The comfort is that the complexity does end (time counted in weeks for me), and that it is possible.
Reading articles about XPath/XSL-T 2.0 at xml.com is probably a good idea, as is testing with Saxon (http://saxon.sf.net) and studying the following specifications, roughly in order of importance:
Less important documents, depending on what one works on in particular, are:
In addition, these articles and papers can be of interest:
Patternist's coding style is as follows.
- An indentation of four (4) spaces.
- No whitespace at the end of lines.
- This Vim mode line must appear at the end of every C++ and XML file: vim: et:ts=4:sw=4:sts=4. Mode lines for other editors can appear too.
- Brackets on their own lines. An example:
int myFunction(bool &mysetter)
{
    for(int i = 0; i < 10; ++i)
    {
    }

    return 0;
}
- No space around parentheses, as in the example above. This is an example of invalid style:
void wrongFunction( const QString &argument );
- Naming tries to stick to the definitions of the specifications, since one tends to read those as much as the code.
- The concepts are confusing enough, and the APIs large enough, that identifiers are not compressed into cryptic names such as DynCtxt or AcctMngr, but spelled out such that they are clear: DynamicContext and AccountManager, for example.
- Use class declarations instead of struct declarations. They are the same except for the default access level, and sticking to one kind makes things simpler.
- When assigning enum values, indent the values such that they nicely line up. This makes them easy to read in the Doxygen browser as well as in the code. Example:
enum MyEnum
{
    EnableThis  = 1,
    Nothing     = 2,
    ItDepends   = 4
};
- Keeping a one-to-one mapping between classes and files is practiced. However, there are exceptions to this, such as BuiltinAtomicTypes.h or many of the function implementations.
- Function definitions are not put in the header even if they are small. The only case where an implementation is put in the header is when it is inlined, and it must in that case be explicitly marked with the inline keyword.
- No use of C-style casts. For example, instead of writing:
const int myInt = (int)myDouble;
write:
const int myInt = static_cast<int>(myDouble);
static_cast and const_cast are safer since they have narrower intentions. Also, C-style casts sometimes lead to invalid code being generated.
- Never use dynamic_cast. It's not used so far, there is no need for it, and it has a speed penalty.
- Use constructor initialization instead of initializing the members in the constructor body. For example, instead of writing:
MyClass::MyClass()
{
    m_myInt = -1;
}
write:
MyClass::MyClass() : m_myInt(-1)
{
}
This yields more efficient code (no temporary objects) and safer code (because the members can be declared const). Admittedly this does not matter for POD types, but it's practiced for them as well for the sake of consistency.
- Use the constructor for setting members, instead of setters. Matthias Ettrich, in his discussion of the Qt API, recommends the exact opposite, and that's because Qt's classes are mostly worked with when used, as opposed to when being developed, and therefore gain by the improved readability. However, Patternist's classes are, a large part of the time, being developed, and therefore gain by having a simpler implementation, which they have when their internal variables cannot be mutated and the number of possible states is thereby reduced.
- Try to avoid getters. Add them only if an explicit requirement exists. Reducing getters reduces the exposure of the class's internals.
- Keep reentrancy in mind. The code is aiming towards being thread safe, and is close to being so.
- Avoid temporary objects. For example, instead of writing:
Expression::Ptr myPtr = m_operands.first();
write:
const Expression::Ptr myPtr(m_operands.first());
- Don't introduce compiler warnings. Work is simply incomplete until it compiles without warnings. Patternist's build system enables a vastly increased number of compiler warnings in order to spot mistakes. Currently the code compiles with no warnings (with GCC 4.0, at least).
- Keep const correctness in mind. Declare variables in functions const where sensible, declare class members const and initialize them with constructor initialization, and even declare function arguments of an enum, boolean, or numeric type const. The latter catches attempts by the function definition to write to an argument (which really is strange), and provides visual consistency in the function declaration.
- Keep line length sensible. Anyone should be able to edit the code, not only those who are privileged to have large displays.
- Remember that the Doxygen comments must at some point be completed. Perhaps right now is a good time. If for some reason the comments are not written now (the author has done that many times), add a @todo and scribble down what is important and good to know about the code. Then one can clean up the notes at a later point, without missing too much important information.
- Qt's STL iterators are faster when pre-increments are done instead of post-increments. For consistency and to avoid mistakes, pre-increment operators are always used. Example:
for(int i = 0; i < 10; ++i)
{
}
- In cases where bugs can be detected with Q_ASSERT and Q_ASSERT_X, add them. This can be that QRegExp and QUrl instances are valid, that pointer arguments to functions are never null, that integer variables are always within a certain range, and so forth. Currently over 400 asserts are in use, acting as an ICE (internal compiler error) system.
- The names of private and protected data members start with "m_".
- When including Qt headers, the 4.0 style is used. For example, qlist.h isn't included, but QList is. This makes code easier to read, and it follows the Qt documentation.
- The header inclusions are sorted alphabetically and are grouped in the following order: Qt, Patternist, and finally, the header file corresponding to the cpp file, if applicable.
- Let every line of code do one thing well; every line should be "atomic". For example, this code:
return m_variable = result;
should be written over two lines, in order to increase readability, and make it easier to debug. A similar case which needs to be rewritten is:
- If it is strictly needed to use a macro to simplify code, remember to undefine it with #undef theMacroName, such that it doesn't cause trouble when compiling the make final target, for example.
- Don't use C++'s % operator, it's not portable. Instead, use std::div or std::ldiv. See gcc-help at gcc dot gnu dot org, thread "Remainder ( % ) operator and GCC".
- If the data members of a class are more than two, indent the names such that they nicely line up:
private:
    const DayTimeDuration::Ptr  m_zoneOffset;
    Item::Vector                m_rangeVariables;
    Expression::Vector          m_expressionVariables;
    Item::Iterator::Vector      m_positionIterators;
- When declaring functions static inline, write static inline rather than inline static, because GCC sometimes complains about the latter. However, private, static, inline class members are in either case preferred over global functions.
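Several of the rules above (constructor initialization, const members set through the constructor instead of setters, std::div in place of the % operator, and assertions) can be sketched together as follows. This is only an illustration: the class MinuteOffset and its members are hypothetical and not taken from Patternist, and the standard assert() macro stands in for Q_ASSERT so that the sketch does not depend on Qt.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical class, not taken from Patternist. It exists only to
// illustrate the style rules.
class MinuteOffset
{
public:
    // The members are set through the constructor; there are no setters.
    MinuteOffset(const int totalMinutes);

    // Getters exist here only because the usage of the sketch needs them.
    int hours() const;
    int minutes() const;

private:
    const int m_hours;
    const int m_minutes;
};

// Constructor initialization allows the members to be const. std::div
// is used in place of the / and % operators; it truncates towards zero.
MinuteOffset::MinuteOffset(const int totalMinutes) : m_hours(std::div(totalMinutes, 60).quot),
                                                     m_minutes(std::div(totalMinutes, 60).rem)
{
    // assert() stands in for Q_ASSERT to keep the sketch free of Qt.
    assert(m_minutes > -60 && m_minutes < 60);
}

int MinuteOffset::hours() const
{
    return m_hours;
}

int MinuteOffset::minutes() const
{
    return m_minutes;
}
```

With this sketch, MinuteOffset(-90) holds -1 hours and -30 minutes, since std::div truncates the quotient towards zero and gives the remainder the sign of the dividend.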
Doxygen conventions are as follows.
- @returns and @param paragraphs are terminated with a period.
- When XPath or XQuery expressions/queries appear in the Doxygen comments, wrap them in the tt HTML tag.
- Classes and free-standing functions should have an @author tag, specifying who the main author is.
- No code examples should appear directly in the Doxygen comments; they should be included with @include or @dontinclude. Put the code in docs/. This document, which demonstrates invalid code, is an exception though.
- The following terms are marked with @c or the tt HTML tag:
  - NaN
  - true and false, when referred to as boolean values
  - All QNames and item types. For example, item() and xs:string. Remember to use the tt HTML tag in these cases in order to include non-trivial characters such as parentheses.
  - null
  - stderr, stdout, and stdin
The current Doxygen comments do not in all cases adhere to this, but the idea is to harmonize in that direction over time.
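As a rough illustration of the Doxygen conventions, a comment for a free-standing function might look like the sketch below. The function isBooleanLexical() and the author line are hypothetical placeholders, not part of Patternist; only the comment layout is the point.

```cpp
#include <string>

/**
 * Determines whether @p lexical is a valid lexical representation
 * of an <tt>xs:boolean</tt> value.
 *
 * For example, the XPath expression <tt>xs:boolean("1")</tt>
 * evaluates to @c true.
 *
 * @param lexical the string to check.
 * @returns @c true if @p lexical is a valid lexical representation
 * of <tt>xs:boolean</tt>, otherwise @c false.
 * @author Author Name <author@example.org>
 */
bool isBooleanLexical(const std::string &lexical)
{
    // The four lexical forms that XML Schema allows for xs:boolean.
    return lexical == "true"
           || lexical == "false"
           || lexical == "1"
           || lexical == "0";
}
```

The @param and @returns paragraphs end with a period, the item type and the XPath expression are wrapped in the tt HTML tag, and the boolean values are marked with @c.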
Regression testing and debugging the code is done in the following ways:
- QTestLib unit tests. These are found in tests/ sub-directories and they regression test the API on a low level. Add tests when deemed appropriate, or when something is delicate and can easily become broken. Run the tests by running make check in xpath/.
- KXQTS, a set of "KDE XQuery Test Suite" tools. It is a collection of programs for debugging the code.
KXQTS, located in kxqts/, is documented in the KXQTS Doxygen module.
The following individuals, appearing in no particular order, have contributed to Patternist and significantly improved it. A big thank-you extends to them for their efforts:
- Stefan Wachter, for discussing and providing suggestions on how to implement casting and operator detection with visitor and double dispatch patterns
- Maksim Orlovich, for general advice and initial XPath tokenizer code
- Michael Kay, for discussions on implementation approaches and interpretations of the specifications
- Stefan Monov, for inventing the name, Patternist
- Author:
- Frans Englich <englich@kde.org>
Generated on Thu Feb 8 14:54:18 2007 for Patternist by Doxygen 1.5.1