Technical Status Report - 95Q2
Ian Davis
University of Waterloo

Progress Report

In the period July 1st - September 30th 1994 our primary focus was on development of a model requirements statement (110), design of the DML (150), improvement of the paper entitled 'Text/Relational Database Management Systems: Harmonizing SQL and SGML' (120 & 150), and software development which will lead to the creation of a hybrid query processor (320, 331, 341).

On September 28th the Hybrid Query Processor Working Group reviewed our first draft of a model requirement statement (110) and was presented with a revised and improved version of our SQL/SGML paper (120 & 150). At that meeting we indicated the intent to provide release 0 versions of validators (161), which accept our proposed DML language extensions, to other members of CSSC within the next two weeks.

A SQL/HQP execution engine which supports all standard nested SQL constructs (except recursive unions) and most standard data types, expressions and functions was designed, developed and documented (331). This engine can only perform selection, but within this constraint it handles arbitrary predicates, projection, set operations, sorting, aggregation, key joins, merge joins, theta joins, etc. It can handle nulls, arbitrary length data, exact and inexact numeric values, dates, and times. Items of these types can be sorted on keys of arbitrary length, and on an arbitrary number of keys.

The SQL/HQP engine has all the necessary hooks to operate within a federated database environment (341). It currently provides no explicit support for text, and makes no attempt to optimise subqueries by using appropriate buffering strategies. It accepts a plan expressed in a low level functional language, which may be encoded either as human readable expressions (capable of being entered by a developer, stored and retrieved) or as a graph capable of being operated on by the engine.
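To make the two encodings of a plan concrete, the following is a minimal sketch in Python. It is an illustration only and not the SQL/HQP engine: this report does not give the syntax of the plan language, so the operator names (scan, select, project, join, sort, eq, gt), the Node class and the sample tables are all hypothetical. The sketch parses a nested expression into a graph, executes that graph over in-memory rows, and converts the graph back into a character string.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One vertex of the plan graph: an operator name and its operands."""
    op: str
    args: list = field(default_factory=list)


TOKEN = re.compile(r"\s*([A-Za-z_]\w*|\d+|[(),])")


def parse(text):
    """Convert a nested expression such as f(g(x), 3) into a graph of Nodes."""
    tokens = TOKEN.findall(text)
    if "".join(tokens) != re.sub(r"\s+", "", text):
        raise ValueError("unexpected character in plan")
    pos = 0

    def term():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok.isdigit():
            return int(tok)                  # integer literal
        if tok in ("(", ")", ","):
            raise ValueError(f"unexpected {tok!r}")
        if pos == len(tokens) or tokens[pos] != "(":
            return tok                       # bare name: a table or a column
        pos += 1                             # consume "("
        args = []
        while tokens[pos] != ")":
            args.append(term())
            if tokens[pos] == ",":
                pos += 1
            elif tokens[pos] != ")":
                raise ValueError("expected ',' or ')'")
        pos += 1                             # consume ")"
        return Node(tok, args)

    try:
        plan = term()
    except IndexError:
        raise ValueError("plan ended unexpectedly") from None
    if pos != len(tokens):
        raise ValueError("trailing text after plan")
    return plan


def to_text(plan):
    """Display a plan graph as a nested expression in a character string."""
    if isinstance(plan, Node):
        return f"{plan.op}({', '.join(to_text(a) for a in plan.args)})"
    return str(plan)


def value(term, row):
    """Evaluate a column name, a literal or a comparison against one row."""
    if isinstance(term, Node):
        left, right = (value(a, row) for a in term.args)
        if term.op == "eq":
            return left == right
        if term.op == "gt":
            return left > right
        raise ValueError(f"unknown predicate {term.op!r}")
    return row[term] if isinstance(term, str) else term


def execute(plan, tables):
    """Run a plan graph; every operator here is a hypothetical example."""
    op, args = plan.op, plan.args
    if op == "scan":
        return list(tables[args[0]])
    if op == "select":
        return [r for r in execute(args[0], tables) if value(args[1], r)]
    if op == "project":
        return [{c: r[c] for c in args[1:]} for r in execute(args[0], tables)]
    if op == "join":                         # nested-loop theta join
        left, right = execute(args[0], tables), execute(args[1], tables)
        return [{**a, **b} for a in left for b in right
                if value(args[2], {**a, **b})]
    if op == "sort":
        return sorted(execute(args[0], tables),
                      key=lambda r: [r[c] for c in args[1:]])
    raise ValueError(f"unknown operator {op!r}")


# Hypothetical sample data standing in for tuples from two different engines.
TABLES = {
    "docs": [{"doc_id": 1, "title": "SGML primer"},
             {"doc_id": 2, "title": "SQL notes"}],
    "loans": [{"loan_doc": 2, "days": 30}, {"loan_doc": 1, "days": 3},
              {"loan_doc": 2, "days": 14}],
}
PLAN = ("project(sort(join(scan(docs), select(scan(loans), gt(days, 7)), "
        "eq(doc_id, loan_doc)), days), title, days)")

plan = parse(PLAN)
print(to_text(plan))
print(execute(plan, TABLES))
```

The sketch keeps the graph as a simple tree of Node objects and uses a nested-loop join for brevity; an engine of the kind described above would choose among key, merge and theta joins and would also have to deal with nulls, dates and arbitrary length data, none of which is attempted here.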
A testbed was developed which parsed expressions, converted them into graphs, and then executed the resulting plan. Software was also developed which allowed an arbitrary plan to be displayed visually (or stored) as a nested expression contained within a character string.

Software to provide a primitive distributed ODBC interface to both Oracle and Fulcrum's Search Server Database system was implemented using Sun RPC. This interface facilitates retrieval of tuples from both engines concurrently, and reports errors occurring at any layer within it up to the application using it. This ODBC interface was then integrated with the SQL execution engine (341). Using this interface it was then possible to extract tuples from Fulcrum's Search Server engine (located on a Unix Sun platform) and tuples from Oracle (located on an AIX platform), and join these tuples using our SQL/HQP execution engine (located on an Alpha platform).

Work on parsing the relevant sections of the SQL-92 grammar (containing our text extensions) continued. SQL statements were converted into parse trees. These parse trees were then converted into operational structures. Work has begun on refining these operational structures, on performing some optimization within them, on exploring how these operational structures can be decomposed into queries which can (after translation back into SQL expressions) be directed at underlying engines, and on translating the conceptual operations described in these operational structures into a concrete plan which can then be submitted to the SQL/HQP execution engine. All of the above work is still very preliminary and exploratory in nature.

Achievements

The major progress in our (100) activities has been the development of a draft model requirement statement, refinement of the proposed DML language, and enhancements to the SQL/SGML paper. These papers were distributed to other members/participants of the HQP Working Group.
The major progress in our (300) activities has been the development of a functional SQL/HQP engine, and the integration of this engine with Fulcrum's Search Server engine and Oracle's RDBMS engine.

Robert Arn's visit of September 15 was constructive, and his willingness to assist us is much appreciated.

The HQP Working Group meeting on September 28th went well, and minutes of that meeting will be distributed shortly.

Problem review

Various issues were raised at the HQP Working Group which must now be addressed. There is a requirement for a formal data model (which we have agreed to produce) even though it is not explicitly included in our statement of work. There is a recognition that one can expect (in some circumstances) that a document manager might be required to interface between application software being developed and the HQP. No one, however, has the mandate to explore this issue, or to develop document management software should this be required.

Current schedules indicate that DDL and DML design are to be completed by this December. Given the amount of work still to be done, and the concerns raised at the most recent HQP Working Group meeting which must be addressed, this deadline will be difficult to meet unless rapid progress is made prior to our next HQP Working Group meeting in October.

Work on producing a Hybrid Query Processor requirement statement has (at the request of the HQP Working Group) been postponed until a more general statement of our overall functional requirements, including a formal data model for text, has been tabled. This decision will necessarily delay production of a Hybrid Query Processor requirement statement, but not necessarily the production of the processor.

Some of the issues raised by the HQP Working Group on the design of the DML may pertain to the API. There is a need to begin focusing on the API used both to retrieve and to navigate within text.
Coming events plan

Weekly meetings of the research group within the University will continue throughout the next period.

It is expected that a release 0 validator will be made available to members of CSSC within the next two weeks. This validator will validate a SQL select statement containing the proposed language extensions.

The next meeting of the HQP Working Group will be October 25th. The venue has not yet been decided.

Project Status

Id    Item  Activity                          Start    End      Complete
1     100   Model Requirements Statement      94Jan01  94Dec30  85%
6     110   Model Requirements Statement      94May01  94Sep30  100%
7     110   Formal Data Model                 94Oct01  94Dec30  0%
8     120   DDL Design                        94Jan01  94Dec30  30%
21    120   DDL Design Pre-release            94Jan01  94Jun30  25%
22    120   DDL Design Revisions              94Jul01  94Sep30  5%
23    120   DDL Design finalised              94Oct01  94Dec30  0%
24    130   DDL Proof of Concept              94Jan01  94Dec30  5%
25    130   DDL Proof of concept              94Oct01  94Dec30  5%
26    131   DDL Interface Validator           94Oct01  94Dec30  10%
29    140   DDL Specification                 95Jul01  96Jun30  0%
32    140   DDL Specification                 95Jul01  96Jun30  0%
33    150   DML Design                        94Jan01  94Dec30  80%
40    150   DML Design pre-release            94Jan01  94Jun30  100%
41    150   DML Design revisions              94Jul01  94Sep30  100%
42    150   DML Design finalised              94Oct01  94Dec30  0%
43    160   DML Proof of Concept              95Jan01  96Dec30  2%
44    161   DML API pre-release Version 0     95Jan01  95Jun30  10%
45a   161   DML Core Interface Validator      95Jul01  96Jan31  0%
45b   161   DML Interface Validator           96Feb01  96Jun30  0%
45c   161   HQP-Application Integration       96Jul01  96Dec30  0%
48    170   DML Specification                 95Jul01  96Jun30  0%
51    170   DML Specification                 95Jul01  96Jun30  0%
52    180   API Design                        94Oct01  95Dec30  0%
55    180   API preliminary activities        94Oct01  94Dec30  0%
56a   180   API Design for review             95Jan01  95Sep30  0%
56b   180   API Design after review           95Oct01  95Dec30  0%
57    190   API Specification                 96Jan01  96Dec30  0%
60    190   API Specification                 96Jan01  96Dec30  0%
61    195   Hybrid Data Model Review          94Jan01  96Dec30  0%
      200   Text Retrieval Engine Extensions  94Jul01  96Dec30  N/A
93    310   HQP Requirements Statement        94Jan01  94Dec30  10%
98    310   HQP Requirements Statement        94Oct01  94Dec30  0%
99    310   HQP Requirement Revisions         95Jan01  95Mar30  0%
100   320   HQP Design                        94Jan01  95Mar30  25%
105   320   Federated engine design           94Jan01  94Dec30  30%
106   320   HQP Design                        95Jan01  95Mar30  0%
107   330   HQP Prototypes                    94Jan01  96Jun30  5%
108   330   HQP Background activity           94Jan01  94Jun30  100%
109   330   HQP Version 0 pre-release         94Jan01  95Jun30  5%
110a  330   HQP Version 1 release             95Jul01  96Jan31  0%
110b  330   HQP Version 2 after review        96Feb01  96Jun30  0%
115   340   HQP Integration with Agents       95Apr01  96Jun30  10%
116a  341   HQP/Oracle Integration            95Apr01  95Sep30  10%
116b  341   HQP/Fulcrum Integration           95Oct01  96Jun30  5%
116c  341   HQP/Pat Integration               95Oct01  96Jun30  2%
119   350   HQP User Interface                94Jan01  96Jun30  N/A
125   395   HQP Review                        94Jan01  96Jun30  0%
      400   Inter Application Protocol        94Jan01  96Mar30  N/A
      500   IAP Prototype Applications        95Jul01  96Sep30  N/A
      600   Tool Prototypes                   95Apr01  96Sep30  N/A
      700   Formal Testing                    96Jan01  96Sep30  N/A
      800   Beta Testing                      96Jul01  96Dec30  N/A
      900   Project Coordination              93Apr01  97Mar31  15%

Comments on Project Status

It is impossible to provide an accurate statement which relates the amount of work completed to that remaining, both because this remaining effort is not well understood and because it is subject to a review and refinement process that is largely outside of our control. The best guess estimates given should be considered to be exactly that.

The project schedules are also somewhat subject to interpretation. Many of the milestones used in the production of this schedule pertained to deliverables and delivery dates. Attempting to use these same milestones to imply termination of well defined sub-activities (as is currently being done) is likely to lead to confusion.
This is particularly true in an environment where our mandate is to develop prototype software (which by implication is necessarily incomplete) and where we are attempting to release this software as soon as it is possible to do so, rather than when some activity associated with the development of that software is deemed 100% complete. This is what is implied by the term pre-release.

For these reasons it is essential that delivery dates and completion dates be better distinguished within the project plans. It is not clear that these two models for monitoring progress are really very compatible. Either partially completed activities are released for review or use as and when required by the schedule, or the entire schedule and thus delivery dates fluctuate as changes are made in order to reflect real progress as a percentage of our overall activities.

If there is considered to be any need to further sub-divide the activities defined in our project schedule, then the criteria used to determine that these sub-activities are complete need to be defined, and the potential impacts on our already tight project schedules considered. The desire to monitor progress must not interfere with our ability to make progress.

Clarification of project schedules

As per discussions with the CSSC project office today I am attempting to clarify our activities as outlined within the most recent project plans submitted to me by the CSSC project office, dated 9/10/94. Concern was expressed that these project plans do not adequately distinguish the subactivities identified by id, and that some of the individual items span too long a time frame. I refer to the activities below by id.

6. Model Requirement Statement[A]

The intent was to release a first version of the model requirement statement for review.
This version was to reflect our understanding of what a model requirement statement was, and to reflect what we considered to be the requirements which needed to be imposed on a data model that attempted to integrate SQL and SGML. Such a model requirement statement was presented to the HQP Working Group September 20th, 1994.

7. Model Requirement Statement[B]

It was the intent to revise the model requirement statement as requested by CSSC, and finalise production of this document by December 1994. We are instead considering the model requirement statement as distributed to be final, and moving on to produce a document describing the formal data model we intend to use.

8. DDL Design

We are (as the project plans show) addressing DDL and DML design in parallel. The DML, which seems to be harder to design, is currently receiving the most attention. The DDL design has not yet been adequately addressed. It is not clear how significant this is.

21 and 40. DDL/DML Design[A]

These items reflect design work leading to pre-release of a working paper describing proposed DDL and DML language extensions to SQL. We submitted a first draft covering some aspects of a possible DML Design for discussion some months ago.

22 and 41. DDL/DML Design[B]

These items reflected an intent to release a first draft of a DDL/DML design by the end of September 1994. This document was expected to address concerns raised about the earlier draft of this paper. We submitted a second draft of a paper describing the DDL/DML design as it then stood to the HQP Working Group on September 28th. We are expecting other members involved in the HQP Working Group to provide feedback on the directions we are taking before anything is released to a wider audience. This second draft does not address update, or the design of the DDL. Until we receive feedback on this second draft of our DML proposals it is hard to judge whether the work we have undertaken in this area can be considered complete.
It is, I think, fair to state that we are in compliance with our earlier schedules in terms of deliverables. I am however concerned about the number of issues which remain unresolved within our overall DML design (e.g. update) and our ability to resolve all of these outstanding issues by December 31, 1994.

23 and 42. DDL/DML Design[C]

These items reflect our intention to finalise the design of the DML and the DDL by the end of this year. An interim meeting of the HQP Working Group has been scheduled for October 25th, to assist in achieving this goal.

25. DDL Interface Validator Programs[A]

This item reflected our intent to pre-release validators accepting our proposed DDL language extensions by the end of the third quarter of this year. I have indicated in my status report that I consider us to be behind schedule in addressing DDL extensions, and that we have not yet produced a DDL validator. However, we have indicated that we intend to pre-release (as soon as a mechanism for releasing software is established) a DML validator which performs syntax checking of any SELECT statement. Pre-release of this DML validator is considerably ahead of schedule, since work on a DML validator was not scheduled to begin until 1995.

26. DDL Interface Validator Programs[B]

It is our intent that the validator released at the end of this year will validate (at the syntactic level) all SQL (both DML and DDL) later to be accepted by a HQP. It may not validate field names, the typing of these fields, etc.

32, 51 and 60. DDL/DML/API Specification

These items reflect our willingness to assist in the process of producing specifications to be submitted to formal bodies. However, since it was not our intent to provide the lead in this area, we cannot comment on any sub-activities possibly involved in this exercise.

44 and 45. DML Interface Validator Programs[A]
We have indicated an intent to release shortly a DML validator which validates the syntax of a SELECT statement containing our currently proposed DML extensions. Our statement of work indicates that we will develop validators for all interfaces which at a minimum perform syntax checking for each interface. We are assuming that our activities in this area will also involve extending our validators so that this software allows applications to interface effectively with the HQP. Providing access to the HQP system tables would also legitimately fall under the scope of this work.

The schedules as currently drafted suggest that a preliminary version of the HQP will be available by June 1995. This is an extremely ambitious objective. Since the API will contain HQP components within it, the API will be released as object code. We assume that those wishing to interface to the HQP have access to appropriate platforms on which to test our interfaces to it. This interface will use Sun RPC as its transport layer.

Because of the stated desire to subdivide lengthy items, item 45 has been subdivided into 45a, 45b, and 45c, each taking approximately the same effort. On completion, 45a would provide as a deliverable a core API (as for example ODBC does) and support for selection. 45b would address additional API functionality and potentially provide support for update. Item 45c recognises the need for ongoing support and development of software interfaces as work continues on both the development of applications above the HQP, and the TREs below the HQP.

55. API Design[A]

This item indicates an intent to begin considering the API this quarter. It seemed dangerous not to address the issue of API design at all until the DDL and DML design is complete, when there may be some dependencies between the DDL/DML design and the API design. No external results or deliverables are to be associated with our preliminary research into this issue.

56. API Design[B]
This item indicates an intent to continue API design throughout 1995. We cannot reasonably expect to finish this activity until some time after item 44 has at least been reviewed. Our work in this area will be in cooperation with Open Text and Fulcrum. It is our intent to provide a finalised definition of the API by the end of 1995.

98 and 99. HQP Requirements Statement [A and B]

It was our intention to submit a HQP Requirement statement for review by the end of this quarter. A very preliminary draft of a possible HQP Requirement Statement was written. However, at the request of the HQP Working Group, the production of this statement has been postponed until we as a group have a clearer idea of our formal data model for text, and a better understanding of what it is we wish at a high level to accomplish.

105 and 106. HQP Design [A and B]

We are actively developing a design for a basic HQP which provides support both for a hybrid table capable of containing text and other types of relational data, and for a federated system of relational databases. Much of the design of the HQP will necessarily (given our schedules) proceed concurrently with development of this HQP. Item 106 is not a separate activity. I believe item 106 was added merely to highlight the fact that the design of the HQP is at least partially dependent on the HQP Requirement statement, which is not intended to be finalised until the end of December 1994.

108. HQP Prototype by UW [A]

This item implied no stated goal or deliverable. It merely reflected very preliminary research being conducted which pertained to the development of a HQP. Since the current objective appears to be to include only activities within the project plan which imply externally visible goals or deliverables, this item should either be removed from the schedule or merged with 109.

109. HQP Prototype by UW [B]

Completion of 109 is scheduled to occur at the same time as completion of item 44.
It is to be expected that this release of the HQP will support simple select statements, joins etc. It is not expected that this release of the HQP will include very much support for SGML text. It is likely to interface with both Oracle and Fulcrum's Search Server Engine. It may have a limited ability to interface with Open Text's Pat engine. Any high level interface to Pat will be SQL. This interface may depend on a lower level interface which uses the native Pat language.

110. HQP Prototype by UW [C]

In the schedules that we submitted to the project office, and which can be found on page 36 of the Minutes of the CSSC Management Committee of August 30, 1994, we indicated the intent to provide version 1 of the HQP for release 96Jan31 and to provide version 2 of the HQP for review 96Jun30. These two activities were merged into one item by the project office. This is why this item spanned such a long period. This item indicates an intent to complete work on the HQP by the end of June 1996. The completion date is timed to coincide with completion of the work of specifying (in a manner suitable for submission to standards bodies) the DDL and DML used within the HQP. The HQP will act as a proof of concept for these standards.

116. HQP Integration with RDBMS

It is a fundamental requirement that we integrate our HQP with both a RDBMS and at least one TRE. It is our intent to attempt to integrate the HQP with Oracle, Search Server provided by Fulcrum, and the Pat Engine provided by Open Text. It is clear that we are being very ambitious in attempting this since currently Pat has little or no support for SQL, Search Server has limited support for SGML, and Oracle has no ODBC interface that we can immediately use. Since we have no involvement in 200 series activities we are expecting cooperation from both Fulcrum and Open Text whenever integration necessitates extensions to the underlying TREs.
We intend to attempt integration of both of these engines within the specified timeframe. While our statement of work would suggest that we intend to integrate our HQP only with Oracle, the schedule needs to reflect our intent to attempt to integrate our HQP with three different database engines.

200, 350, 400-900 series items

It is not our intent to provide a lead in any area of the project plan dated 9/10/94 not described above. The University of Waterloo is not directly responsible for the development of the HQP User Interface; in the CSSC Statement of Work Revision 2 of 13 Apr 94 the development of this software is identified as work to be undertaken by InContext and Softquad. We have no direct involvement in 200 series activities and no intention to provide more than a review of 400, 500 and 600 series activities. We may or may not be involved in 700 series activities. Since it is not the intent to perform any beta test of our implementation of a HQP, it is unlikely that we will be providing a leadership role in 800 series activities.

900. Project Coordination

We expect to provide assistance in both Project Coordination and Project Administration throughout the lifetime of this project. We are directly involved in coordination efforts through management, executive, and various working group meetings. We intend to comply with all requests from the CSSC project office, and to provide status reports documenting progress every three months.

I hope that this adequately clarifies our activities. Please do not hesitate to contact me if further clarification of our currently scheduled activities is required.

Ian Davis
October 8th, 1994