Technical Status Report -- 94Q4

Frank Wm. Tompa
University of Waterloo

Progress report

In the period January 1 - March 31, 1994, primary effort continued on defining the data manipulation language extensions to SQL (160), starting implementation of validators for the data definition and data manipulation languages (130), and continuing design of the hybrid query processor (310 and 320).

In defining the semantics of notational extensions to SQL that permit convenient access to fragments of text data described by SGML, we had defined a set of virtual tables that represent the nodes of an SGML parse tree, and we had defined a procedural language (the "marking" language) to select nodes identifying data that was to be converted from text form to tuple values. As mentioned in the previous progress report, this approach was semantically sound, but not well-matched to SQL's algebraic nature.

During the last fiscal quarter, we designed and adopted a far better approach. The structure of SGML-based text fields can be semantically represented by three virtual tables:

    TEXT_NODES (nodeid, genid, content)
    TEXT_ATTRIBUTES (nodeid, attr, value)
    TEXT_STRUCTURE (a_nodeid, d_nodeid)

where TEXT_NODES contains one tuple per node in the parse tree, consisting of a unique nodeid, the type of node ("generic identifier" in SGML), and the TEXT content subsumed by that node; TEXT_ATTRIBUTES relates SGML's attribute-value pairs to corresponding nodes; and TEXT_STRUCTURE contains nodeid pairs representing all ancestor-descendant relationships in the parse tree. We then replaced the marking language with pure SQL, which makes it significantly easier to explain the semantics of text manipulation to a user familiar with relational manipulation.

We implemented a parser for most of the extended SQL data manipulation language, such that it can serve as a syntactic validator as well as the basis of translators for the application-HQP interface language and the HQP-TRE interface language.
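
As a rough illustration of the three virtual tables described above, the following Python sketch materializes them in an in-memory SQLite database and selects text with ordinary SQL. It is a sketch under stated assumptions, not the project's implementation: the sample document, the helper names (load_node, build_example, titles_in_section), the column types, and the choice of SQLite are all hypothetical, and in the report the tables are virtual rather than stored. Only the table and column names come from the report.

```python
import sqlite3

# Hypothetical sample document (not from the report). Each node is
# (nodeid, genid, attributes, body); body is leaf text or a list of children.
SAMPLE_DOC = (1, "report", {}, [
    (2, "title", {}, "Status"),
    (3, "section", {"id": "s1"}, [
        (4, "title", {}, "Progress"),
        (5, "para", {}, "Design continued."),
    ]),
])

# Table and column names follow the report; the column types are assumptions.
SCHEMA = """
CREATE TABLE TEXT_NODES (nodeid INTEGER PRIMARY KEY, genid TEXT, content TEXT);
CREATE TABLE TEXT_ATTRIBUTES (nodeid INTEGER, attr TEXT, "value" TEXT);
CREATE TABLE TEXT_STRUCTURE (a_nodeid INTEGER, d_nodeid INTEGER);
"""

# Titles that are descendants of a section carrying a given id attribute.
QUERY = """
SELECT t.content
FROM TEXT_NODES t
JOIN TEXT_STRUCTURE s ON s.d_nodeid = t.nodeid
JOIN TEXT_NODES sec ON sec.nodeid = s.a_nodeid
JOIN TEXT_ATTRIBUTES a ON a.nodeid = sec.nodeid
WHERE t.genid = 'title'
  AND sec.genid = 'section'
  AND a.attr = 'id'
  AND a."value" = ?
ORDER BY t.nodeid
"""


def load_node(conn, node, ancestors=()):
    """Insert one parse-tree node and its subtree; return the node's content."""
    nodeid, genid, attrs, body = node
    if isinstance(body, str):
        content = body
    else:
        # The content of an inner node is the text subsumed by its children.
        content = " ".join(
            load_node(conn, child, ancestors + (nodeid,)) for child in body
        )
    conn.execute(
        "INSERT INTO TEXT_NODES VALUES (?, ?, ?)", (nodeid, genid, content)
    )
    conn.executemany(
        "INSERT INTO TEXT_ATTRIBUTES VALUES (?, ?, ?)",
        [(nodeid, attr, value) for attr, value in attrs.items()],
    )
    # One pair per ancestor, so all ancestor-descendant pairs are recorded.
    conn.executemany(
        "INSERT INTO TEXT_STRUCTURE VALUES (?, ?)",
        [(ancestor, nodeid) for ancestor in ancestors],
    )
    return content


def build_example():
    """Create the three tables and load the hypothetical sample document."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    load_node(conn, SAMPLE_DOC)
    return conn


def titles_in_section(conn, section_id):
    """Return the content of title nodes under the section with this id."""
    return [row[0] for row in conn.execute(QUERY, (section_id,))]


if __name__ == "__main__":
    print(titles_in_section(build_example(), "s1"))
```

Because TEXT_STRUCTURE holds every ancestor-descendant pair rather than only parent-child pairs, the query needs a single join to reach nodes at any depth, with no recursion.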

We continued to design the hybrid query processor, concentrating first on support of a simple SELECT-FROM-WHERE statement involving selections and projections from conventional relational fields as well as from TEXT fields. For the first implementation, the dot notation to access subtexts from a structured text field will be supported, but direct application access to TEXT using the above virtual tables will not be permitted until later implementations.

We also began the design and implementation of the interface(s) between the hybrid query processor and a text retrieval engine. In this interface the extended SQL data manipulation language must be interpreted in terms of more conventional text access commands. To accommodate a variety of text engine capabilities, one layer will be built to manipulate text with knowledge of SGML-like grammars, a second layer will interpret grammar-based access in terms of independently defined regions, and a third layer will interpret region-based access in terms of flat text manipulations. Thus three interfaces will actually be made available: one for grammar-aware text engines, one for region-based text engines, and one for text engines that can handle unstructured text only.

Achievements and problems review

The major achievements have been the refined design of DML extensions to SQL that will be able to accommodate SGML and the implementation of a validator for the language.

Two of the problems remain from the start of the quarter, but both are about to be addressed:

1. It is difficult to design language extensions that will be convenient for typical applications in the absence of real users' needs. With the funding of the Consortium now confirmed, we expect extensive interactions with CSSC partners to obtain valuable insights into real customers' needs.

3. Significant progress within the University of Waterloo requires the attention of a full-time research project manager.
During this quarter, Ian Davis was hired to fulfill this role, effective April 18, 1994.

Coming events plan

A paper entitled "Text / Relational Database Management Systems: Harmonizing SQL and SGML" was written, submitted, and accepted for presentation at ADB-94 (Applications of Databases), to be held June 21-23 in Sweden. This will be a first public disclosure of our design of extensions to SQL to accommodate structured text.

Weekly meetings of the research group within the University will continue throughout the next period.

A meeting involving Paul Cotton (Fulcrum), Saman Farazdaghi (Open Text), Frank Tompa (UW), and Ian Davis (UW) is planned for May 9 to begin collaboration as the working group responsible for areas 100 through 300.

On May 10, Paul Cotton will meet with the complete UW research group to discuss interfaces to Fulcrum's text engine and implications for the design of the hybrid query processor.