Artronic Development (ARDE)

Website Tips and Examples

David W. Eaton
Artronic Development
4848 E. Cactus Rd. - Suite 505-224
Scottsdale, Arizona 85254
602-953-0336
dwe@arde.com
http://www.arde.com
Original: March 1999

Introduction

In previous InterWorks conference presentations we have looked at business case, maintenance, specific techniques, and productivity and task automation aspects of Web sites. This year I will provide some specific tips, examples, and scripts to help your Web site. Specifically, I will discuss:

Some sample scripts
Methods for providing dynamic content
Site architecture and mirroring considerations

While you may decide to use some of the techniques or scripts described here, it is more likely they will serve as fodder for improved ideas and approaches at your own site.

Some sample scripts

As with most other aspects of computing, Web site and page maintenance often consists of repetitive tasks. Using scripts can speed up these tasks and improve the reliability of the job at hand. Some examples follow. The scripts discussed here will be made available to conference attendees. This code is provided under the provisions of the Perl "Artistic License" and should be considered "examples", not "product". It is likely you will need or want to alter it to suit the needs of your site.

Transform strings in Web pages (fixstring)

It is frequently necessary to swap one string of characters for another on all (or many) pages of a Web site. For instance, you may need to change the copyright year notation or your company phone number. It is easy enough to locate all such cases with an appropriate grep or a tool such as Jeffrey Friedl's search script. However, if you have a large number of files to change, editing each one by hand can be quite time-consuming and error-prone. Using a script to do the work for you is much faster and more reliable.

I created a Perl script named fixstring which I use as a template. I modify the search and swap strings as needed for each change that is required. This lets me make several changes in each file or use compound conditions under which the change gets made. The rest of the script remains unchanged from use to use so the same core services are provided each time.

The script accepts the pathname of a file as an argument. It scans through the specified file looking for the search string. If it locates the string, it swaps the replacement string into the file, remembers that it made a change, and continues through the entire file. When completed, it checks to see if any changes were made. If so, it writes the new content to the same file name but with a ".new" extension, then removes the original file and renames the ".new" version. This limits the amount of time the file is unavailable to other readers and ensures a "good write" completes before the original file is disturbed.
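The flow just described can be sketched as follows. This is an illustration in Python, not the Perl fixstring script itself. The search and replacement strings are hypothetical, and the sketch assumes the search string never spans a line break.

```python
import os

SEARCH = "Copyright 1998"     # hypothetical search string
REPLACE = "Copyright 1999"    # hypothetical replacement string


def fix_string(path, search=SEARCH, replace=REPLACE):
    """Swap `search` for `replace` in the file at `path`.

    Returns True if the file was rewritten, False if it was left alone.
    """
    changed = False
    new_lines = []
    # latin-1 with newline="" round-trips every byte and line ending as-is.
    with open(path, "r", encoding="latin-1", newline="") as src:
        for line in src:
            if search in line:
                line = line.replace(search, replace)
                changed = True      # remember that a change was made
            new_lines.append(line)

    if not changed:
        return False                # the original file is never touched

    new_path = path + ".new"
    with open(new_path, "w", encoding="latin-1", newline="") as dst:
        dst.writelines(new_lines)
    # Only after the ".new" copy is completely written is the original
    # disturbed. os.replace renames over the original in one step, which
    # differs slightly from the remove-then-rename sequence in the text.
    os.replace(new_path, path)
    return True
```

Calling fix_string once for each path reported by grep would cover a whole site. Files that do not contain the search string keep their original modification time, since they are never rewritten.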

Replace blocks of lines in Web pages (fixblock)

Sometimes a simple string substitution as discussed for fixstring above is not sufficient and multiple lines must be checked and swapped in unison. This may occur if a new navigational scheme is to be used and all page footers need to be swapped at once.

The fixblock script requires that you have architected your pages in such a way that the block needing to be swapped has a definitive start and end designation (which may be the start or end of the file). Normally, it proceeds to pass input directly to output. When it locates the (potential) start of a defined block, it begins appending each input line to an internal variable rather than sending it to output. When the end of the block is reached, the "gathered" lines are checked to see if they meet the criteria of the block(s) to be swapped. If so, the gathered lines are discarded and the replacement ones are sent to output. (Of course, the same technique could be used to perform complex checks on each block and swap several strings within the block rather than replacing it entirely if so desired.)

As with fixstring, the output is written to the same file name but with a ".new" extension, then renamed once the write is confirmed.
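The gather-and-check approach might look like the sketch below, again written in Python for illustration rather than taken from the fixblock script. The marker strings, the test function, and the replacement lines are all supplied by the caller and are hypothetical here.

```python
def replace_blocks(lines, start_marker, end_marker, should_replace, replacement):
    """Copy `lines` to the output, swapping any block that qualifies.

    lines          -- iterable of input lines
    start_marker   -- text found on the first line of a block
    end_marker     -- text found on the last line of a block
    should_replace -- function given the gathered lines, returns True/False
    replacement    -- list of lines emitted in place of a qualifying block
    """
    output = []
    gathered = None                      # None means "not inside a block"
    for line in lines:
        if gathered is None and start_marker not in line:
            output.append(line)          # normal case: input goes to output
            continue
        if gathered is None:
            gathered = []                # (potential) start of a block
        gathered.append(line)            # hold the line instead of sending it
        if end_marker in line:
            output.extend(replacement if should_replace(gathered) else gathered)
            gathered = None
    if gathered is not None:
        # The end of the file also ends a block.
        output.extend(replacement if should_replace(gathered) else gathered)
    return output
```

A block that does not meet the criteria is passed through unchanged, so running the function over pages that have already been converted is harmless.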

Page cleanup (demsify)

Frequently, content for Web pages is supplied in electronic form by people whose machines and tools use character sets that do not conform to the Universal Character Set (UCS) defined in ISO 10646. According to the HTML 4.0 Specification from the World Wide Web Consortium, this is equivalent to Unicode 2.0. For example, Microsoft tools tend to include characters in the undefined code range 128-159. These usually display as strange characters on non-Microsoft platforms.

The demsify script checks each line of the specified file for such characters and either provides a warning of the line numbers on which they exist (so you can edit them in your favorite editor) or swaps them for predetermined characters so at least the results will be predictable on multiple platforms. In addition, it removes carriage-return/linefeed and linefeed/carriage-return line endings (swapping them for a single newline) and wraps long lines to a predetermined length, making it easier to look at a page via your browser's "view source" function.
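The warning, swapping, and line-ending steps could be sketched along these lines. This is a Python illustration, not the demsify script. The replacement table is an assumption covering a few common Windows-1252 punctuation bytes, since the table used by the real script is not shown here, and line wrapping is left out.

```python
# Assumed replacement table for a few Windows-1252 bytes in the range
# 128-159; the real demsify table is not given in this paper.
REPLACEMENTS = {
    0x85: b"...",   # ellipsis
    0x91: b"'",     # left single quote
    0x92: b"'",     # right single quote
    0x93: b'"',     # left double quote
    0x94: b'"',     # right double quote
    0x96: b"-",     # en dash
    0x97: b"--",    # em dash
}


def normalize_newlines(data):
    """Turn CR/LF and LF/CR pairs into a single newline."""
    return data.replace(b"\r\n", b"\n").replace(b"\n\r", b"\n")


def find_suspect_lines(data):
    """Return the 1-based numbers of lines holding bytes in 128-159."""
    lines = normalize_newlines(data).split(b"\n")
    return [number for number, line in enumerate(lines, 1)
            if any(128 <= byte <= 159 for byte in line)]


def swap_characters(data, unknown=b"?"):
    """Replace each byte in 128-159 with its predetermined stand-in."""
    out = bytearray()
    for byte in data:
        if 128 <= byte <= 159:
            out += REPLACEMENTS.get(byte, unknown)
        else:
            out.append(byte)
    return bytes(out)
```

Working on bytes rather than decoded text avoids guessing at the file's encoding. Bytes in the range that have no entry in the table become "?" so they remain easy to spot.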

Summarize monthly lines from wwwstat (logrollup)

Many sites use wwwstat to analyze their log files and prepare reports. These are often chained together so that previous months' reports may be viewed. However, with each month on its own page, it is not easy to get a higher level picture of the month to month trend of Web activity.

The logrollup script locates all the available old reports (by assuming they are located in a particular directory and may be identified by a particular naming convention). Then it extracts certain high-level information such as total bytes and requests from the report heading area, and creates a simple report with each month on its own line. With modifications to gwstat (see below) the values in this new report may be used to create a graph of the activity over the past few months. When run from a periodic scheduler such as cron, it is easy to provide site managers with a Web page which contains a simple report of site activity.
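The extraction step might be sketched as below, in Python for illustration rather than as the logrollup script. The wording of the two heading lines and the month labels are assumptions. Check them against the reports your own wwwstat installation produces and adjust the patterns to match.

```python
import re

# Assumed wording of the heading lines that carry the monthly totals.
REQUESTS_RE = re.compile(r"Requests Received During Summary Period\s+(\d+)")
BYTES_RE = re.compile(r"Bytes Transmitted During Summary Period\s+(\d+)")


def summarize_report(text):
    """Return (requests, bytes) from one report, or None if not found."""
    requests = REQUESTS_RE.search(text)
    transmitted = BYTES_RE.search(text)
    if not (requests and transmitted):
        return None
    return int(requests.group(1)), int(transmitted.group(1))


def roll_up(reports):
    """Build one summary line per month.

    reports -- dict mapping a month label such as "1999-02" (hypothetical
               naming convention) to the text of that month's report
    """
    lines = []
    for month in sorted(reports):
        totals = summarize_report(reports[month])
        if totals is not None:           # skip reports without totals
            lines.append("%s %10d requests %14d bytes" % ((month,) + totals))
    return lines
```

The dictionary of reports would be filled by reading every file in the report directory whose name matches the site's naming convention. The resulting lines can be written to a page by the same cron job that runs the monthly analysis.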

Combine multiple domain log files (munglog)

Interex is using a technique (see pickpage below) to display a different welcome page for each of several domain names, but permit access to the entire Web site under any of these names. As a result, when analyzing the log files, they needed to be able to distinguish between each of the different welcome pages, yet consider all accesses together for each page of the rest of the site.

The solution was to have each virtual domain write its accesses to its own named log file. Then the munglog script locates each of these log files, translates each access to "/" (the welcome page of that domain) into one that carries the full domain name, and concatenates the logs into one large, combined log file. It is this combined log file which is read by the metrics analysis routines, such as wwwstat, to produce the daily and monthly reports of Web site use.
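A rough Python sketch of the translation and concatenation follows. It is not the munglog script. It assumes Common Log Format lines, and the exact form of the rewritten request (here an absolute URL for the domain) is a guess at what "a full domain name" looks like in the combined log.

```python
import re

# Matches a request for exactly "/" inside a Common Log Format line.
ROOT_REQUEST = re.compile(r'"(GET|HEAD|POST) / ')


def mung_line(line, domain):
    """Rewrite a request for "/" so it names the domain that served it."""
    return ROOT_REQUEST.sub(
        lambda match: '"%s http://%s/ ' % (match.group(1), domain),
        line, count=1)


def combine_logs(logs):
    """Return one combined list of log lines.

    logs -- dict mapping a domain name to the lines of that domain's log
    """
    combined = []
    for domain in sorted(logs):
        combined.extend(mung_line(line, domain) for line in logs[domain])
    return combined
```

Requests for any page other than "/" pass through untouched, so they are counted together regardless of the domain used. The sketch simply appends one log after another. If the analysis program expects entries in time order, the combined lines would need to be sorted by timestamp as well.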

A useful related script named ipsetup is discussed in the presentation Scripts for Linux and HP-UX. It may be used on Linux (and other) systems to run a series of ifconfig commands to start and stop multiple virtual domain servicing on a single machine.

Other places to look

When looking for scripts to assist you, don't forget to scan the Comprehensive Perl Archive Network (CPAN) as well as past InterWorks Conference presentations for a wealth of material. Other helpful places include your ISP's technical support area and Matt's Script Archive.

Methods for providing dynamic content

A variety of techniques may be used to provide dynamic content. The trade magazines are full of "Java" and "JavaScript" examples and many discuss methods for extracting information from databases to build pages on demand. However, there are some standard, fairly straightforward methods which you may not have considered.

Select template page based on domain name (pickpage)

Interex registered a number of different domain names to serve as entry points for particular services and events it offers. Some of these entry points were for HP World, ERP World, ERP News, and the NT Forum. However, to ensure that all Interex information is always available, regardless of the entry mechanism used, the entire site's pages are accessible with the same sub-URL under any of these domain names. This is accomplished using a technique (mentioned in the discussion of munglog above) which determines the domain name used to access the welcome page. Then the domain name is used to select the name of a template page within the Interex Web site which should be displayed for that domain name.

Since the document root for each of the different domains is defined to the Web server as the same directory path, every other page within the Interex Web site is available using any of the registered domain names. For example, http://www.hpworld.org/conference/iworks99/ and http://www.erpworld.org/conference/iworks99/ yielded the same InterWorks 99 Conference page as was found at http://www.interworks.org/conference/iworks99/ (all now retired). (The www.interworks.org version was actually delivered from an Iowa mirror of the California Interex site, but that is a topic for a different discussion.)

The pickpage script utilizes a configuration file which contains the translation table from domain name to template file name. Thus, changes to the defined list may be made without disturbing the code which processes it. The format of the configuration file is:

<domainname> <pagepath> # comment

where <pagepath> must end in ".html" or be a directory name. <pagepath> is assumed to be relative to the directory in which the processing script resides unless <pagepath> begins with a "/", in which case it is assumed to be relative to the Web server's DOCUMENT_ROOT environment variable. Sample lines from the configuration file are:

www.interex.org /index.html # Main organization page
www.hpworld.org /hpworld/index.html # Dual conference and magazine page

As long as the template pages are written such that all image and link references are from the document root, or "/" level of the site, rather than relative to their local directory, they will function correctly from both their normal location in the Web site's hierarchy and from the special access as a welcome page.

If the specified page cannot be found, the script looks for "index.html" in the script's directory and tries to display that. If still not found, a canned error page is displayed.
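The lookup and fallback rules above might be sketched as follows. This is Python for illustration, not the pickpage script. The host name is passed in by the caller (in a CGI setting it would presumably come from an environment variable such as SERVER_NAME or HTTP_HOST, which the description above does not specify), and treating a directory entry as that directory's index.html is an assumption.

```python
import os


def parse_config(text):
    """Return {domain: pagepath} from '<domain> <pagepath> # comment' lines."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()     # drop any comment
        fields = line.split()
        if len(fields) == 2:                     # ignore blank or odd lines
            domain, pagepath = fields
            table[domain.lower()] = pagepath
    return table


def pick_page(table, host, script_dir, document_root):
    """Return the file to display for `host`, or None for the error page."""
    candidates = []
    pagepath = table.get(host.lower())
    if pagepath is not None:
        if pagepath.startswith("/"):
            # A leading "/" means relative to the server's DOCUMENT_ROOT.
            path = os.path.join(document_root, pagepath.lstrip("/"))
        else:
            path = os.path.join(script_dir, pagepath)
        if not path.endswith(".html"):
            # Assumption: a directory entry means its index.html.
            path = os.path.join(path, "index.html")
        candidates.append(path)
    # Fallback: index.html in the script's own directory.
    candidates.append(os.path.join(script_dir, "index.html"))
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None                                  # caller shows the canned error
```

Keeping the table in a separate file means a new domain can be added by editing one line, without touching the code that processes it.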

To implement the script, the Web server must be configured to allow new default pages and the script name must precede the normal static default. On the Interex system, we chose to rename the pickpage script "index.pl". Thus the Apache server configuration entry is:

DirectoryIndex index.pl index.html

When a directory URL is presented to the server, it first looks for an instance of the "index.pl" script. If it does not find it, it reverts to the standard "index.html" file.

It is also necessary to configure the server to allow execution of the script:

AddHandler cgi-script .pl

and to allow execution in the directory in which the script resides:

Options ... ExecCGI

The script is coded so that it could be used from locations other than the welcome page of a Web site, if desired.

Select template from CGI

Even if you have only a single domain name to service with your Web server, a modification of the technique described for pickpage may be used to display alternate pages. The script could be modified to select a different page template based upon the hour or the day of the week or even based upon a random number. This could be used to cycle through product offerings or tips. Alternatively, the script could be modified to use the input from a form submission to determine what page would be best to display. A multitude of other possibilities exist and are left to the reader's imagination.
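As one illustration of these variations, the selection step could be as small as the Python sketch below. The template file names and the daytime hours are hypothetical.

```python
import random
import time

# Hypothetical template names, one per day of the week, Monday first.
DAILY_TEMPLATES = [
    "tip-mon.html", "tip-tue.html", "tip-wed.html", "tip-thu.html",
    "tip-fri.html", "tip-sat.html", "tip-sun.html",
]


def template_for_day(now=None):
    """Pick a template by day of the week (tm_wday is 0 for Monday)."""
    if now is None:
        now = time.localtime()
    return DAILY_TEMPLATES[now.tm_wday]


def template_for_hour(now=None):
    """Pick a daytime or nighttime template by the local hour."""
    if now is None:
        now = time.localtime()
    return "day.html" if 6 <= now.tm_hour < 18 else "night.html"


def random_template(templates, rng=random):
    """Pick any one of the given templates at random."""
    return rng.choice(templates)
```

Whichever function is used, the chosen name can then be handed to the same page-display code that serves the domain-based templates.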

Regenerate images used on static pages

Normally, static Web pages are crafted to display particular images and these do not change. However, the page content can be made to appear dynamic by replacing the image file with a different one. Once this is done, new invocations of the page will show the new graphic (subject to the usual image cache problems, of course).

The image might be re-computed and re-generated as the result of form input via a CGI program or it might be the result of a periodic task running under cron. The latter approach is found frequently on sites where a program such as gwstat is used in conjunction with wwwstat to generate graphic reports of Web site use. (Note that the version of gwstat I have used requires gr to generate the graphics; however, gr is no longer under development. An alternative might be to try grace, a descendant of ACE/gr.)

Site architecture and mirroring considerations

Site architecture aids site construction and maintenance as well as readership. Both physical (file and directory) architecture and logical (cross links) architecture should be considered and planned carefully.

Provide an index page for each directory

This may be obvious, but it is very important to provide a default page (check your server, often it is index.html) in each subdirectory which is displayed when just the directory name is passed from the browser to the server. This allows visitors to simply erase the file name from their URL and "back up" to what they expect to be your index or pointer into the subject matter. This may be because they found information which is "close" to, but not quite, what they want on one of your pages. Or it may be because they have a bad or outdated URL which no longer points to a real page -- backing up to the index may provide the pointers they need to locate the information desired.

Use of a default index page also prevents visitors from viewing your site's directory listing and selecting a page which may be under construction and not yet ready for public viewing. (Of course, this assumes your site's search capability follows your page links rather than your directory structures when building its inventory of possible pages.)

Provide a directory for each topic

Proper segmentation into directories of related information makes it easier to locate similar information and provides a simpler mechanism for granting access to related files, either for content editors (via file permissions) or for site visitors (via server authorization files, passwords, etc.). In addition, it facilitates analysis of site metrics if you are using a program which can "roll up" the counts for all pages beneath each directory.

It is best to keep each directory addressing a single topic. Try not to mix multiple topics together, even if they are relatively small. As soon as they begin to grow (and they usually do), it is likely you will want to have separate directories. However, by then the various page URLs will be known by search engines, referenced by other sites' pages, and contained in visitors' bookmarks. Changing your site architecture at that point causes considerable disruption to its usefulness or requires extensive use of Redirect in the server configuration files.

Remember to provide adequate cross-reference links both to other pages within the same topic and to other topics which may be related. The easier you can make it for visitors to navigate your logical site architecture, the more valuable your site will be.

A directory for Images

While it is possible to intersperse your images with the text files, separating them into their own commonly named directories provides a number of benefits. The separate directories "unclutter" your text directories, making maintenance easier. Tools which manipulate text, check links, and analyze logs can be configured or written to avoid descending into or considering known image directories, saving time and simplifying reporting.
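A maintenance tool can take advantage of this convention with very little code. The Python sketch below lists the HTML files of a site without ever descending into image directories. The directory name "images" is a hypothetical choice of common name.

```python
import os

IMAGE_DIRS = {"images"}      # hypothetical common name for image directories


def text_files(top, skip=IMAGE_DIRS):
    """Yield paths of .html files under `top`, skipping image directories."""
    for dirpath, dirnames, filenames in os.walk(top):
        # Editing dirnames in place tells os.walk not to descend into
        # the directories that were removed from the list.
        dirnames[:] = sorted(d for d in dirnames if d not in skip)
        for name in sorted(filenames):
            if name.endswith(".html"):
                yield os.path.join(dirpath, name)
```

Because the image directories all share one name, a single entry in the skip set prunes them at every level of the site.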

Mirroring implications

Proper segregation of your files into directories as described above will make it easier to mirror only portions of your site. (Of course in this instance, you must be very careful to avoid site-relative links which cross between the mirrored and non-mirrored segments of your site.)

Slash or relative links (vs domain-specific)

If you are mirroring your site, in general you will want to be careful to use links which do not contain the site's domain name. This keeps all cross references working on the mirror site at which the visitor first began reading your pages. If you use a domain-specific link, they will be shuttled to that particular domain (or physical machine) and remain there for all future site-relative links. This may mean that someone trying to use your European mirror suddenly will be making multiple trans-Atlantic accesses to a US based site and suffering the related latency or expense involved with such links.
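Links of this kind are easy to introduce by accident, so a periodic check can help. The Python sketch below reports links in a page that name one of the site's own domains. The regular expression is a simplification that looks only at href and src attributes, and the list of domains is supplied by the caller.

```python
import re

# Captures the whole absolute URL and, separately, its host name.
LINK_RE = re.compile(
    r'(?:href|src)\s*=\s*["\']?(https?://([^/"\'\s>]+)[^"\'\s>]*)',
    re.IGNORECASE)


def domain_specific_links(html, site_domains):
    """Return absolute URLs in `html` that point at the site's own domains."""
    own = {domain.lower() for domain in site_domains}
    return [url for url, host in LINK_RE.findall(html)
            if host.lower() in own]
```

Links to other sites are ignored, as are slash and relative links, so anything reported is a candidate for rewriting unless it is one of the deliberate exceptions.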

Places non-domain-specific links hurt

Despite what we just covered above, there are times when domain-specific links are not only desired, but required. For example, if you have a CGI application which submits or alters data or pages, such manipulation needs to take place on the "master" site. Otherwise, only that particular mirror will contain the change, and only for a short time at that. Subsequent mirror operations will cover up the change with the original content, causing it to be lost.

As you architect your site, watch for these instances and be certain to craft the links accordingly. You may want to consider placing comments into the HTML to warn future maintainers that a particular link must remain domain-specific.

Unmirrored subdirectories

There are some directories and types of data you probably do not want to mirror. While you may want to store the analyzed reports of site access on all mirrors, it is unlikely you want to mirror the actual raw log files. Private files, certain password protected files, and user-specific files are other categories you should consider when designing your mirror.

Each server will normally have its own configuration files. When functional configuration changes are made (e.g. access restriction and page redirection) these changes will need to be made in each mirror's configuration files. This can be facilitated by mirroring the master configuration files, but saving them in a side directory on the mirror, not into the location actually used by the mirror for its own configuration. Then, the changes can be discovered and implemented by hand, or automated in full or in part by a cron job.

Permissions

Since the mirror process may not run as the same user or group as the Web server or the site maintainers, particular attention must be paid to file and directory permissions. It is all too easy for a maintainer to make a local copy of a file but leave it with permissions preventing the mirror process from reading it (even if the Web server can). This then results in extraneous error log entries which must be reconciled. Worse yet, the page may later be determined to be something of value and commissioned for public viewing on the master site, but be unavailable on the mirror. (The error was logged, but it became one of those "expected" errors and was ignored when the page "went public".)
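A periodic check on the master site can catch such files before they become "expected" errors. The Python sketch below assumes a Unix-style file system and assumes the mirror process reads files as an "other" user, neither the owner nor in the owning group. Adjust the permission bits tested if your mirror runs under a known group.

```python
import os
import stat


def mirror_unreadable(top):
    """Return paths under `top` that lack access for "other" users.

    Files need the read bit. Directories need both the read and the
    search (execute) bits before their contents can be mirrored.
    """
    problems = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in sorted(dirnames):
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            if not (mode & stat.S_IROTH and mode & stat.S_IXOTH):
                problems.append(path)
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if not os.stat(path).st_mode & stat.S_IROTH:
                problems.append(path)
    return problems
```

Run from cron, the list of problem paths can be mailed to the site maintainers so that permissions are corrected before a page is announced.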

Common path access to supporting files

Several Web functions normally need to locate particular files within the file system. Access override files (e.g. .htaccess) must be able to point to their associated master password and group files. CGI programs may need to locate configuration, log, library, or other files of information. If these resources cannot be located via the same path names on each mirror, it is unlikely that all services will be available. If there is some reason that such files cannot be stored permanently in the same locations everywhere, consider utilizing a series of "mock" directories and soft links to provide the appearance that they do.

For example, if the Web site's perl scripts define perl's location as:

#!/opt/perl/bin/perl

but on one mirror perl is still installed at /usr/local/bin/perl, you can create the /opt/perl/bin directory structure, then provide links such as:

ln -s /usr/local/bin/perl /opt/perl/bin/perl
ln -s /usr/local/lib/perl5 /opt/perl/lib

so that the scripts will continue to work.

I have found it handy to define the master Web location as /opt/web, then either provide the directories such as htdocs and cgi-bin directly in that structure or as links to their actual locations. For that matter, /opt/web itself could be a soft link to a different location on a different disk, but common access is still provided and is independent of the particulars of the brand or version of the Web server being used.

If the mirrors are not all running the same manufacturer's OS, you may discover that what are normally considered common system programs are located in different places. I have found this to be true with programs such as sendmail. This was compounded by the fact that some servers may not provide a PATH variable which includes the directory in which programs such as sendmail may be found. The solution is to change the PATH or use links so the required services may be located.

Compatible versions of installed programs

Finally, once all required information may be found on each of the mirrors, it is equally important that compatible revisions of system and application software be available. For example, it may not be important whether perl 5.003 or 5.004 is available, but it probably must be a version of 5.x.

If you have control of each of the mirrors, it should be a simple matter to upgrade certain programs to be of a compatible version. However, if you are sharing a mirror machine with another department or company, that may not be possible. In certain circumstances, it may be necessary to modify your Web site so that it may be accommodated on all mirrors.

Where is this paper?

This presentation was prepared by Artronic Development for the 1999 InterWorks Conference. An outline and an abstract are available for your convenience. This is an HTML document located at URL "http://www.arde.com/Papers/WebTips/paper.html".

Where are these scripts?

At this writing, the scripts may be obtained via http from the Artronic Development site:

webscutils-1.2.tar.gz (35.6kb)

To learn more about the Web, please refer to the other papers which are also on this Web site.

(Dave Eaton has been creating and maintaining Web sites since mid-1994. He provides assistance on the Interex Web site and is the "Webmaster" for numerous other sites.)