The TCE Corporate Technical Memory: Groupware on the Cheap


Mark L. Fisher
mark dash fisher at mindspring dot com
(formerly: fisherm@tce.com at Thomson, Inc.)

History

2003/06/10:
Revised for my personal web page. Note: this paper does not cover the current architecture of CTM, as CTM has moved to a Perl5 module with an Oracle database for metadata.
1999/03/02:
Fixed up some links which had decayed.
1998:
Slightly reformatted for inclusion in the Corporate Technical Memory documentation set.
1997:
First published in the International Journal of Human-Computer Studies.

Abstract

The Thomson Consumer Electronics Corporate Technical Memory is an electronic reference document repository used to store locally developed technical know-how as a set of files that can be browsed as well as searched. Implemented as a World-Wide Web application, CTM is not constrained to a limited set of accepted file formats or restricted in its indexing of binary data, due respectively to the use of a Web browser as the client software and the incorporation of 'document abstracts' (brief descriptive HTML files automagically linked to their corresponding binary files). CTM attempts to solve the problem of finding and distributing local technical expertise in a system-independent fashion.

Corporate Technical Memory Specifications

Here are the original specifications of the TCE Corporate Technical Memory, reprinted from the Information Tools QLP Team Corporate Technical Memory charter:

System Goal

The basic system goal is to create an electronic repository of technical knowledge for use by the engineering community at TCE. This repository would improve the communication of technical developments and knowledge within TCE and reduce redundant work.

In order for such a system to be most helpful it must be easily accessible. This implies access from our widely available PCs.

QLP Team Goal

It is recognized that the implementation of this system is beyond the capabilities of the QLP team. The goal of our team is to develop an ideal system definition and present a proposal to R&D management. If such a proposal was accepted, the QLP team could provide guidance during the design and implementation of the system.

Inputs to System

System Definition

Access

Technical Merit

History

Part of continuous process improvement at Thomson Consumer Electronics is the Quality Leadership Process (QLP). QLP teams are drawn from employees in the same department, trained in quality improvement techniques, then unleashed to find and fix quality problems. The Information Tools QLP Team, following through with the departmental mandate to "stop re-inventing the wheel", created the specifications for the Corporate Technical Memory in January, 1992.

As the Corporate Technical Memory is a document-management application, commercial IBM PC-based document management systems were the logical choice at TCE for the foundation technology. Existing systems, however, were both too inflexible (still tightly tied to their origins as legal department document managers) and too expensive ($180,000US+) for a system that would not have an immediate bottom-line impact. Total system cost was driven up by the typical licensing scheme of document management application programs, which require a license for each potential user. Since the set of potential users was the TCE technical community (around 600 people at the time), even if only 5 people used the Corporate Technical Memory concurrently, licenses for all 600 people would still have had to be purchased, at a cost of around $180,000US (600 licenses at $300US each).

In April, 1993, TV Product Development requested that our department, Design Automation, research the use of Lotus Notes in building an internal asynchronous discussion application like Usenet News or VAX Notes (the predecessor to Lotus Notes still in use at that time at TCE). This led to a Lotus Notes pilot implementation of the TCE Corporate Technical Memory. Once again, high total system cost because of per-potential-user licensing was the showstopper, with Lotus Notes at that time licensed at $295US per potential user.

September, 1993 saw the arrival at TCE of the InfoMagic Internet Tools CD-ROM containing World-Wide Web (WWW) and Wide Area Information Servers (WAIS) implementations for UNIX. WAIS technology appeared as if it could provide the full-text search engine needed for the Corporate Technical Memory once a suitable document delivery mechanism was found. The release of Cornell University's Cello World-Wide Web browser in a Microsoft Windows Sockets version provided the client portion of the delivery mechanism.

The first alpha version of CTM was then created in December, 1993 using Cello and a custom document submission program as the client subsystem, and a combination of the CERN httpd Web server, the freeWAIS-0.202 WAIS server, and a Perl daemon script as the CTM server subsystem. During 1994 the Corporate Technical Memory was brought into the Technical Excellence Committee (a non-managerial ongoing committee devoted to promoting technical excellence) as a project of the TEC Technical Library subcommittee. Along the way, the Netscape Web browser was chosen as the production browser because of the end of Cello development. After extensive testing and several delays due to the non-"mission-critical" nature of this application, production release was August 15, 1995.

Foundation: The Web and WAIS

The World-Wide Web

The World-Wide Web [Berners-Lee, Cailliau, Luotonen, Nielsen, and Secret (1994)], [The World Wide Web Consortium (1996b)], originally developed by Tim Berners-Lee while he was at CERN (the European Laboratory for Particle Physics) in Switzerland, is a transparently distributed client-server TCP/IP-based hypermedia system, used both over the Internet and inside company Intranets. Web clients (called "browsers") can use the native Web protocol, HTTP (Hypertext Transfer Protocol) [Berners-Lee, Fielding, and Nielsen (1996)], as well as many of the other TCP/IP application protocols, including Gopher [Anklesaria, McCahill, Lindner, Johnson, Torrey, and Alberti (1993)], FTP [Postel and Reynolds (1985)], and NNTP (Network News Transfer Protocol) [Kantor and Lapsley (1986)]. HTTP was designed to efficiently transfer documents written in HTML, the Hypertext Markup Language [Berners-Lee and Connolly (1995)], [The World Wide Web Consortium (1996a)], an ASCII markup language that is a document type of SGML, the Standard Generalized Markup Language [Marchal, Benoît (1996)]. Whether a hypertext link is within the same document or to a document halfway across the globe, the same procedure is used to follow the link (usually just a single click in a GUI environment). This link transparency, along with the built-in support of older TCP/IP application protocols in Web browsers, allows HTML documents to act as the "glue" for a system; for example, a vendor Web site that links to the vendor's anonymous FTP server for software drivers and patches.

By the use of MIME (Multipurpose Internet Mail Extensions) [Borenstein, Nathaniel S. and Freed, Ned (1993)] content-type names like "application/postscript", Web browsers can handle not only the small number of file formats the browsers can display in-line (usually ASCII text, HTML, and the GIF graphics format), but can potentially handle any type of file by handing the file contents off to the appropriate viewer program (known as a "helper application" on the Web). Document file format extensibility of this magnitude is uncommon among commercial document-management packages, but necessary for the Corporate Technical Memory, as many specialized file formats are used in engineering applications.
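The content-type dispatch described above can be sketched roughly as follows. This is a minimal illustration, not any browser's actual logic: the HELPERS table, the viewer names in it, and the dispatch function are all hypothetical.

```python
# Hypothetical helper table: MIME content type -> viewer command.
# The viewer names are placeholders, not CTM's actual configuration.
HELPERS = {
    "application/postscript": "ghostview",
    "application/msword": "winword",
}

# Types a mid-1990s browser would typically display in-line.
INLINE_TYPES = {"text/plain", "text/html", "image/gif"}


def dispatch(content_type: str) -> str:
    """Decide how a browser would handle a response of this content type."""
    # Ignore parameters such as "; charset=us-ascii" and normalize case.
    base = content_type.split(";", 1)[0].strip().lower()
    if base in INLINE_TYPES:
        return "inline"
    if base in HELPERS:
        return "helper:" + HELPERS[base]
    # No viewer configured: fall back to saving the file.
    return "save-to-disk"
```

Supporting a new engineering file format then amounts to adding one entry to the helper table, which is the kind of extensibility the paragraph above relies on.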

Interaction with the user is handled by HTML fill-out forms fed to the Common Gateway Interface (CGI) [National Center for Supercomputing Applications (1996)]. HTML forms employ standard GUI dialog elements such as text entry boxes and radio buttons. Once the form is filled out, the contents are sent to the specified program invoked by CGI. The CGI program then responds to the form input data, sending data of some form (often HTML) back to the browser, just as if the CGI program's functionality were built into the Web server. HTML forms provide the interface to the various CGI search engines of the Corporate Technical Memory.
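As a rough illustration of the form-to-CGI round trip, the sketch below decodes URL-encoded form data and builds the response a CGI program would write to its standard output. The field name "terms" and the function name are assumptions made for the example; the CTM scripts themselves were written in Perl.

```python
import html
from urllib.parse import parse_qs


def handle_search_form(query_string: str) -> str:
    """Turn URL-encoded form data into a complete CGI response."""
    fields = parse_qs(query_string)
    # parse_qs maps each field name to a list of values; take the first.
    terms = fields.get("terms", [""])[0]
    body = "<html><body><p>You searched for: {}</p></body></html>".format(
        html.escape(terms)  # never echo user input into HTML unescaped
    )
    # A CGI response is a header block, a blank line, then the body.
    return "Content-Type: text/html\r\n\r\n" + body
```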

Wide Area Information Servers (WAIS)

Full-text searching and indexing in the TCE Corporate Technical Memory is accomplished through the use of the Wide Area Information Servers (WAIS) [searchtools.com>Site Search Tools Product Reports>WAIS], originally developed by Brewster Kahle while at Thinking Machines Corp. WAIS is a TCP/IP client-server distributed information retrieval protocol based on a superset of the ANSI/NISO Z39.50-1988 information retrieval protocol. Version 0.3 of the freeWAIS public domain implementation of WAIS supports creating and updating document indexes as well as relevance-ranked searches by whole words. Word stemming (where, for example, a search for "enable" would also turn up "enabling technology") is supported by freeWAIS version 0.3 but is not used in the TCE Corporate Technical Memory because of a bug in the freeWAIS word stemming algorithm that removes the final "e" from words ending in "e". Custom code enables full-text indexing of Microsoft Word for Windows files (a binary format not directly supported by freeWAIS-0.3). ASCII (MIME "text/plain"), HTML, and SGML are indexed directly. Binary files may nonetheless be indexed by means of 'document abstracts' (see below).

Document relevance is "scored based on the log of the query word frequency, the number of occurrences of the word throughout the entire database, and the size of the given document" (from the freeWAIS documentation). Note that full-text searching (given enough disk space for the index, and the use of word stemming, a thesaurus, and relevance feedback) was proven to be better than keyword-only searching as far back as 1971 by Gerard Salton [Salton, Gerard (1971)].

CTM Extensions to the Web

File Upload

When the Web implementation of the Corporate Technical Memory started, RFC1867, "Form-based File Upload in HTML", was over a year in the future [Masinter and Nebel (1995)]. Since FTP was by far the most widely deployed file transfer protocol on the Internet, it was chosen as the file upload protocol for CTM. Because no royalty-free FTP library was available at that time for Microsoft Windows, a C++ class was written to drive the Windows FTP program via a file of FTP commands. This level of coupling is too loose for a mission-critical application, but was acceptable for CTM, particularly for the initial implementation where both the clients and the server were on the same LAN (thus dodging the issue of WAN-specific problems).
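The idea of driving a command-line FTP client from a command file can be sketched as below. This is not the original C++ class; the host name, spool directory, and file names used in the usage note are placeholders, and the command sequence is an assumption about what such a file would plausibly contain.

```python
def build_ftp_script(host, email, spool_dir, filenames):
    """Return the text of a command file for a command-line FTP client."""
    lines = [
        "open " + host,
        "anonymous",         # user name for anonymous FTP
        email,               # by convention, the email address is the password
        "binary",            # documents may be binary (e.g. Word files)
        "cd " + spool_dir,
    ]
    lines += ["put " + name for name in filenames]
    lines.append("quit")
    return "\n".join(lines) + "\n"
```

On Windows, text produced by a call such as build_ftp_script("ctm.example.com", "user@example.com", "incoming", ["guide.doc", "guide.ctl"]) could be written to a file and handed to the stock client with its -s:filename option. The caller learns little about whether the transfer succeeded, which is the loose coupling noted above.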

The next major upgrade to the TCE Corporate Technical Memory will be document submission by use of RFC1867 file upload. This upgrade is scheduled for July, 1996.

Document Submission

Corporate Technical Memory document submissions are processed by the Corporate Technical Memory daemon, ctmd. Documents are first spooled via anonymous FTP to a specific write-only directory, along with a control file containing information about the document. The daemon (1800 lines of Perl) polls the directory once a minute, looking for new control files. When a new control file is spotted, the daemon moves the document file(s) into place, then updates the various document indexes. Results are then emailed back to the user via SMTP [Postel, Jonathan B. (1982)]. Emailing responses is appropriate because of the relative slowness of the WAIS indexer, which takes a few minutes to update the full-text document index (all index entries must be recalculated when a document is added under freeWAIS-0.3). Although FTP to a spool directory is not the most modern method for file upload, it would have been simple to implement cross-platform (if that had been needed); additionally, the use of FTP spooling will make it easy to modernize CTM to use HTML file upload.
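One pass of the spool-directory polling described above might look like the following sketch. The ".ctl" extension for control files, the `seen` set, and the `handle` callback are assumptions made for illustration; the real ctmd is a Perl daemon whose control-file format is not shown here.

```python
from pathlib import Path


def poll_once(spool_dir, seen, handle):
    """Scan the spool directory once and process unseen control files.

    `seen` is a set of control-file names already processed; `handle`
    is a callback that would move the document into place and update
    the indexes.
    """
    processed = []
    for control in sorted(Path(spool_dir).glob("*.ctl")):
        if control.name in seen:
            continue
        handle(control)
        seen.add(control.name)
        processed.append(control.name)
    return processed
```

A daemon would call poll_once in a loop with a 60-second sleep between passes, and would mail the outcome to the submitter once indexing finishes.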

Document Management

Documents are stored in subdirectories by the author's initials, then by filename, so that both "Mark Fisher" and "Dan Field" can have documents with the filename guide.doc without fear of filename collisions. URLs (Uniform Resource Locators) [Berners-Lee, Masinter, and McCahill (1994)] in CTM look like (as an example) http://rdis-sun.indy.tce.com/ctm/mlf1/ctm_back.doc, where ctm is the logical document tree root for Corporate Technical Memory documents, mlf1 is the subdirectory for Mark Leighton Fisher's documents (the first person with the initials mlf), and ctm_back.doc is the document filename.
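The naming scheme can be expressed as a pair of small functions. The base URL is taken from the example above; the function names, and the idea of passing the sequence number explicitly, are assumptions of this sketch.

```python
BASE_URL = "http://rdis-sun.indy.tce.com/ctm"  # document tree root from the example


def author_directory(initials: str, sequence: int) -> str:
    """Directory name: initials plus a number separating people who share them."""
    return "{}{}".format(initials.lower(), sequence)


def document_url(initials: str, sequence: int, filename: str) -> str:
    """URL of the current revision of a document."""
    return "{}/{}/{}".format(BASE_URL, author_directory(initials, sequence), filename)
```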

As noted in the Corporate Technical Memory specifications, document revision control is exercised, so that documents may be updated but are never deleted by the Corporate Technical Memory system. When an update occurs, revision control is achieved by ctmd renaming the older versions of the document so that, for example, x.doc would become old/x.doc-001, old/x.doc-001 would become old/x.doc-002, and so on. Though this method depends on certain UNIX-specific filesystem features (long filenames and an unbounded number of directory entries), it works for all types of files, whereas most freeware (and many commercial) revision control systems work only with ASCII text files. (The URL for the previous revision of the example document above would then be http://rdis-sun.indy.tce.com/ctm/mlf1/old/ctm_back.doc-001.)
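The renaming scheme can be sketched as follows. This is an illustration of the technique rather than ctmd's Perl code; the function name is hypothetical, and the sketch assumes three-digit revision suffixes as in the example.

```python
import re
from pathlib import Path


def archive_current(doc: Path) -> None:
    """Move doc to old/<name>-001, shifting older revisions up by one."""
    old_dir = doc.parent / "old"
    old_dir.mkdir(exist_ok=True)
    pattern = re.compile(re.escape(doc.name) + r"-(\d{3})")
    numbers = []
    for entry in old_dir.iterdir():
        match = pattern.fullmatch(entry.name)
        if match:
            numbers.append(int(match.group(1)))
    # Rename the highest revision first so that nothing is overwritten.
    for n in sorted(numbers, reverse=True):
        src = old_dir / "{}-{:03d}".format(doc.name, n)
        src.rename(old_dir / "{}-{:03d}".format(doc.name, n + 1))
    doc.rename(old_dir / (doc.name + "-001"))
```

After archive_current runs, the new revision can be written to the original path, so the URL of the current document never changes while -001 always names the most recent previous revision.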

Document attribute databases (the list of authors and the list of an author's documents) are simple formatted ASCII text files parseable by Perl regular expressions. Another enhancement scheduled for late 1996 is to move the document attribute data into an Oracle relational database, simplifying error recovery by ctmd while adding flexibility to the retrieval process.

Document Abstracts

Most documents in the Corporate Technical Memory are single files, but this is not always the case. Abstracts are optional HTML files associated with a main document file, providing hypermedia facilities to non-HTML documents as well as greatly expanding the potential number of index entries for binary documents. As a result of the mathematical nature of television engineering, many TCE internal documents require much more sophisticated equations than can be expressed currently in HTML without just including bitmap graphics of the equations (which precludes automatic processing of the equations), so Microsoft Word for Windows is used for most technical writing within Thomson. Document abstracts then add the transparently distributed hypermedia capabilities lacking in Word. Document abstracts are also (and perhaps more importantly) indexed in place of their associated binary document files, facilitating much more complete index entries (without an abstract file, just the filename and title are indexed for a binary document file).

Browsing by Index Pages

Several different kinds of index pages are available for browsing documents in the Corporate Technical Memory. Users can browse index pages of documents listed by author, title, title (of documents with abstracts), and main subject area classification. Main subject areas are a short flat list of broad document classifications, like Deflection and Power Supply or Networking Software and Hardware. This simple classification scheme should serve well for quite some time, as more detailed searches can be performed by the full-text search engine. Index pages for browsing are automatically updated by ctmd when a document is created or updated.

Full-text Searches using WAIS

A custom CGI script supplies the Corporate Technical Memory's interface to the WAIS full-text search engine. Search results include the document title, document URL, relevance score (normalized to 1000 for the highest score), and the document file size in bytes.
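The score normalization mentioned above is a simple rescaling. The sketch below assumes raw scores arrive as a mapping from document to a non-negative number; how the CTM script actually receives scores from WAIS is not specified here.

```python
def normalize_scores(raw_scores):
    """Scale raw relevance scores so that the best hit scores 1000."""
    top = max(raw_scores.values(), default=0)
    if top <= 0:
        return {doc: 0 for doc in raw_scores}
    return {doc: round(1000 * score / top) for doc, score in raw_scores.items()}
```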

Updating a freeWAIS Full-text Search Index

Revising documents poses a problem to the CTM's full-text search capabilities, as there is no method within freeWAIS-0.3 to update a document's index entries. To handle this, all documents are re-indexed each night (a process that currently takes around 18 minutes). Also, the full-text search script detects duplicate documents, returning only one document (with the higher score of the two versions).
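Duplicate suppression of this kind can be sketched as below, under the assumption that duplicates are recognized by having the same URL (the text above does not say how the CTM script identifies them).

```python
def drop_duplicates(hits):
    """Keep one hit per URL, preferring the higher score.

    `hits` is an iterable of (url, score) pairs; the result is sorted
    with the best score first.
    """
    best = {}
    for url, score in hits:
        if url not in best or score > best[url]:
            best[url] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```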

Searches for Titles, Authors, and Abstracts

CTM also provides facilities for searching the document database for author names, document titles, and titles of documents with abstracts. Searches can be case-sensitive or case-insensitive. Perl regular expressions can be used in a search when needed.
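A regular-expression search with a case-sensitivity switch can be sketched in a few lines. The function name and the list-of-titles input are assumptions; note also that Python's regular expression syntax is close to, but not identical with, Perl's.

```python
import re


def search_titles(titles, pattern, case_sensitive=False):
    """Return the titles that match a regular expression."""
    flags = 0 if case_sensitive else re.IGNORECASE
    regex = re.compile(pattern, flags)
    return [title for title in titles if regex.search(title)]
```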

Security

User authentication is not currently handled in CTM, so a user may claim to be anyone (but see "Improved Security" later). Document protection is secured by the filesystem security of the CTM server (a Sun box running Solaris 2.3) as well as by CTM document management, since documents are never deleted, only updated. CTM CGI scripts all pass the Perl "taint" checks, thereby preventing users from accidentally (or on purpose) entering search strings that cause untoward effects on the CTM filesystem ("Let's see, how is it that you delete some but not all files under UNIX? Hmmm..."). Perl's strict mode ("use strict;") is enforced on CTM CGI scripts, thus preventing the use of Perl variables that have not been declared.
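Python has no taint mode, but the untainting idiom that Perl's taint checks force on a script (validate input against an allowlist pattern before using it) can be sketched as follows. The allowed character set and length limit are assumptions chosen for the example, not CTM's actual rules.

```python
import re

# Assumed allowlist: letters, digits, spaces, and a few harmless marks.
SAFE_TERMS = re.compile(r"[A-Za-z0-9 ._-]{1,200}")


def untaint(user_input: str) -> str:
    """Return the input only if every character is on the allowlist."""
    if SAFE_TERMS.fullmatch(user_input) is None:
        raise ValueError("search string contains disallowed characters")
    return user_input
```

Only a value returned by untaint would then be passed on to the search engine or the filesystem.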

In The Future

Isite

The Center for Networked Information Discovery and Retrieval (CNIDR) [The Center for Networked Information Discovery and Retrieval (1996a)] not only maintains freeWAIS, but has now created a successor to freeWAIS called Isite [The Center for Networked Information Discovery and Retrieval (1996b)] based on the ANSI/NISO Z39.50-1992 standard. Isite incorporates many advantages over freeWAIS, including continued development (freeWAIS is no longer under development), Boolean searches, wildcards (right truncation), and document updating. Replacing freeWAIS with Isite in the Corporate Technical Memory is planned for later in 1996.

Word Stemming and a Thesaurus

Although not planned by CNIDR at the time of this writing, for maximum document searching power Isite should be enhanced to include a thesaurus and word stemming, as per [Salton, Gerard (1971)]. If these requested changes are not made in a timely fashion, publicly-available code may be used by TCE to enhance Isite with these features.

Improved Security

For industrial-strength security, four enhancements are planned for CTM:

  1. Oracle username/password pairs will be used to authenticate Corporate Technical Memory document submitters.
  2. A Web server incorporating the Secure Sockets Layer [Netscape Corporation (1996)] protocol will be used for all updates to the Corporate Technical Memory.
  3. Document submission files will no longer be uploaded to the anonymous FTP or Web directory trees.
  4. Document submission files will be readable only by the Web server userid and placed in a directory readable only by the Web server userid.
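The fourth item above (files and directory readable only by one userid) can be sketched with standard POSIX permission bits. The function names are hypothetical, and the sketch assumes it runs as the Web server userid on a UNIX-like system.

```python
import os
import stat


def make_private_spool(path: str) -> None:
    """Create a spool directory accessible only by the owning userid."""
    os.makedirs(path, exist_ok=True)
    os.chmod(path, stat.S_IRWXU)  # mode 0700


def write_private_file(path: str, data: bytes) -> None:
    """Write a submission file readable and writable only by its owner."""
    # O_EXCL refuses to overwrite an existing file of the same name.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    # Set the mode explicitly in case the process umask altered it.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
```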

Public-key encryption of document submission control files has been considered but rejected, as under a standard (non-military security grade) UNIX or Windows NT system, if the data from a file that is readable only by the owner can be read, it is reasonably safe to assume that the process doing the reading already has the ability to impersonate the file owner, thus that process can discover the private key used for decryption.

Authority Section

The paper predecessor to the Corporate Technical Memory was the Authority File, a filing cabinet in the TCE Technical Library containing copies of journal articles with significant technical merit that had been reviewed by the manager in charge of that technical area. A similar mechanism will be created in CTM, such that "authoritative" papers can be searched or browsed. Since only "authorities" have contributed to the Corporate Technical Memory so far, this has not been seen as a high priority.

Automatic previous version retrieval

Currently, only the most recent version of a CTM document can be retrieved automatically. Viewing earlier versions requires extensive knowledge of how CTM handles document versioning by renaming documents. Automatic previous version retrieval will likely use either the URL of the current document version, or the document title and author, to return a page with links to all previous document versions and the dates of those versions.
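One way such a lookup could work, given the old/<name>-NNN renaming convention described earlier, is sketched below. The function name and the use of file modification times as version dates are assumptions; the planned CTM feature may work differently.

```python
import re
import time
from pathlib import Path


def previous_versions(doc: Path):
    """Return (revision, path, date) tuples, most recent revision first.

    Revision 1 (suffix -001) is the most recent previous version.
    """
    pattern = re.compile(re.escape(doc.name) + r"-(\d{3})")
    found = []
    old_dir = doc.parent / "old"
    if old_dir.is_dir():
        for entry in old_dir.iterdir():
            match = pattern.fullmatch(entry.name)
            if match:
                stamp = time.localtime(entry.stat().st_mtime)
                found.append((int(match.group(1)), entry, time.strftime("%Y-%m-%d", stamp)))
    return sorted(found)
```

A CGI script could turn the returned list directly into a page of links, one per earlier revision.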

Alerts (Selective Dissemination of Information)

To support long-term information needs (longer than a single search or browsing session with the Corporate Technical Memory), "alerts" [Belkin and Croft (1992)], [Oard (1996)] will be added to CTM. An alert is an automatic per-user notification of new documents on a particular subject or subjects. A simple but functional first implementation will likely consist of per-user profiles containing one or more Isite search strategies, with nightly execution of these strategies against the document database. Results (if any) would then be emailed to the user. If alert use becomes widespread, CTM alerts would need to be re-implemented using more conventional information filtering techniques.
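The simple first implementation outlined above might be structured like this sketch. The `matches` callable stands in for the search engine (Isite in the plan above), and all names are hypothetical.

```python
def run_alerts(profiles, new_documents, matches):
    """Map each user to the new documents matched by any saved query.

    profiles:      {user: [query, ...]}
    new_documents: documents added since the last nightly run
    matches:       callable (query, document) -> bool, standing in
                   for the search engine
    """
    results = {}
    for user, queries in profiles.items():
        hits = [doc for doc in new_documents
                if any(matches(query, doc) for query in queries)]
        if hits:  # send nothing when there is nothing to report
            results[user] = hits
    return results
```

A nightly job would call run_alerts and email each user the list of hits for that user.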

Annotations

Annotations in CTM [Gramlich (1994)] will be the electronic equivalent of notes scratched in the margins of textbooks, except that these notes would be available to all users. Document management systems that provide for annotations facilitate a commentary free-for-all, where any user can amplify or comment on another's work. Although annotations were provided for in the earliest versions of the World-Wide Web software [W3C (1999)], there is currently no standard method for providing annotations. The Corporate Technical Memory entity-relationship diagram (figure 1) specifies annotations as consisting of world-readable plain text that is tracked by author as well as the date and time of the annotation.
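The annotation record described above (plain text, author, date and time) could be modeled as in the sketch below. The field names and the link to a document URL are assumptions, since the entity-relationship diagram is not reproduced here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Annotation:
    document_url: str   # assumed way of linking a note to its document
    author: str
    created: datetime   # date and time of the annotation
    text: str           # plain text, readable by all users


def annotations_for(annotations, document_url):
    """Annotations on one document, oldest first."""
    return sorted(
        (note for note in annotations if note.document_url == document_url),
        key=lambda note: note.created,
    )
```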

TCEDOC: CTM document management technology

TCEDOC will be a generalization of the Corporate Technical Memory's document management technology. Generalizing the CTM document management technology will be a simple but tedious process of parameterizing all references to CTM in code, CGI-generated HTML, and documentation. Setup of TCEDOC for a particular document database will then require setting the name and location of the document database in a parameter file or files.
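A parameter file of the kind described might look like the example embedded in the sketch below. The section and key names and the document_root path are invented for illustration; only the base URL comes from the text above.

```python
import configparser

# Hypothetical parameter file for one document database.
EXAMPLE = """
[repository]
name = Corporate Technical Memory
short_name = ctm
document_root = /export/ctm/docs
base_url = http://rdis-sun.indy.tce.com/ctm
"""


def load_settings(text: str) -> dict:
    """Parse a parameter file and return the repository settings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["repository"])
```

Code, generated HTML, and documentation would then refer to settings such as short_name instead of the literal string "ctm".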

Summary and Conclusions

The TCE Corporate Technical Memory is an electronic reference document repository, implemented as a World-Wide Web application. Use of the Web makes CTM a client-server application with universal clients (available on all of TCE's platforms -- Microsoft Windows (3.1/95/NT), UNIX, and Macintosh) that can be configured to view any file format (subject to the availability of a file format viewer for that platform). Using WAIS and some Perl code (around 2800 lines of Perl at the time of this writing) adds to the Corporate Technical Memory's base-level Web functionality: full-text and regular expression searches, version control, and browsing by author, title, and main subject area.

Several conclusions can be made based on the CTM experience:

For a relatively small expenditure of time and money, it has proven possible using the Web, WAIS, and a small amount of additional code to create a general-purpose information sharing mechanism for the technical community at Thomson Consumer Electronics -- the Corporate Technical Memory. The existence of CTM has spurred all Intranet and Internet efforts at TCE. As a testbed for new information delivery technologies (a paper now being revised for inclusion into CTM will use Java [JavaSoft (1996)] for interactive mathematical simulations), CTM will likely continue to do so in the future.

Acknowledgments

I would like to thank my boss, Charles Brombaugh, his boss, Mike Renfro, and his boss, Frank Dittrich, as well as the many members of the Corporate Technical Memory beta-testing team, for their support and encouragement towards the development of the Corporate Technical Memory. I would also like to thank my wife, Melinda, for her support and encouragement, as well as for putting up with a sometimes overly enthusiastic and detailed engineer of a husband.

References

Anklesaria, Farhad; McCahill, Mark; Lindner, Paul; Johnson, David; Torrey, Daniel; and Alberti, Bob (1993). The Internet Gopher Protocol (a distributed document search and retrieval protocol). RFC 1436, <URL:ftp://ds.internic.net/rfc/rfc1436.txt>

Belkin, Nicholas J. and Croft, W. Bruce (1992). Information Filtering and Information Retrieval: Two Sides of the Same Coin? Communications of the ACM, v 35 p 29-38, December 1992

Berners-Lee, Tim; Cailliau, Robert; Luotonen, Ari; Nielsen, Henrik Frystyk; and Secret, Arthur (1994). The World-Wide Web. Communications of the ACM, v 37, p 76-82, August 1994

Berners-Lee, Tim and Connolly, Daniel W. (1995). Hypertext Markup Language - 2.0. RFC 1866, <URL:ftp://ds.internic.net/rfc/rfc1866.txt>

Berners-Lee, Tim; Fielding, Roy T.; and Nielsen, Henrik Frystyk (1996). Hypertext Transfer Protocol -- HTTP/1.0. RFC 1945, <URL:ftp://ds.internic.net/rfc/rfc1945.txt>

Berners-Lee, Tim; Masinter, Larry; and McCahill, Mark (1994). Uniform Resource Locators (URL). RFC 1738, <URL:ftp://ds.internic.net/rfc/rfc1738.txt>

Borenstein, Nathaniel S. and Freed, Ned (1993). MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies. RFC 1521, <URL:ftp://ds.internic.net/rfc/rfc1521.txt>

The Center for Networked Information Discovery and Retrieval (1996a). The Center for Networked Information Discovery and Retrieval. <URL:http://www.cnidr.org/>

The Center for Networked Information Discovery and Retrieval (1996b). CNIDR Isite. <URL:http://vinca.cnidr.org/software/Isite/Isite.html>

searchtools.com>Site Search Tools Product Reports>WAIS. <URL:http://www.searchtools.com/tools/wais.html>

Gramlich, Wayne C. (1994). Public Annotation Systems. <URL:http://www2.jps.net/~gramlich/projects/public_annotations/index.html>

W3C (1999). W3C>Collaboration, Knowledge Representation and Automatability. <URL:http://www.w3.org/Collaboration/>

JavaSoft (1996). Java(tm) - Programming for the Internet. <URL:http://java.sun.com/>

Kantor, Brian and Lapsley, Phil (1986). Network News Transfer Protocol. RFC 977, <URL:ftp://ds.internic.net/rfc/rfc977.txt>

Marchal, Benoît (1996). An Introduction to SGML: SGML in Plain English. <URL:http://www.pineapplesoft.com/reports/sgml/index.html>

Masinter, Larry and Nebel, Ernesto (1995). Form-based File Upload in HTML. RFC 1867, <URL:ftp://ds.internic.net/rfc/rfc1867.txt>

National Center for Supercomputing Applications (1996). The Common Gateway Interface. <URL:http://hoohoo.ncsa.uiuc.edu/cgi/>

Netscape Corporation (1996). The SSL Protocol. <URL:http://home.netscape.com/eng/security/SSL_2.html>

Oard, Doug (1996). Information Filtering Resources. University of Maryland, <URL:http://www.enee.umd.edu/medlab/filter/filter.html>

Postel, Jonathan B. (1982). Simple Mail Transfer Protocol. RFC 821, <URL:ftp://ds.internic.net/rfc/rfc821.txt>

Postel, Jonathan B. and Reynolds, Joyce (1985). File Transfer Protocol. RFC 959, <URL:ftp://ds.internic.net/rfc/rfc959.txt>

Salton, Gerard (1971). A New Comparison Between Conventional Indexing (MEDLARS) and Automatic Text Processing (SMART). Cornell University Computer Science Technical Report CS TR71-115, <URL:http://cs-tr.cs.cornell.edu:80/Dienst/UI/2.0/Describe/ncstrl.cornell%2fTR71-115>

Stein, Lincoln D. (1996). CGI.pm - a Perl5 CGI Library. <URL:http://www-genome.wi.mit.edu/ftp/pub/software/WWW/cgi_docs.html>

The World Wide Web Consortium (1996a). HyperText Markup Language (HTML). <URL:http://www.w3.org/pub/WWW/MarkUp/>

The World Wide Web Consortium (1996b). The World Wide Web Consortium. <URL:http://www.w3.org/>


Last modified June 6, 2003.