David W. Eaton
Artronic Development
4848 E. Cactus Rd. - Suite 505-224
Scottsdale, Arizona 85254
602-953-0336
dwe@arde.com
http://www.arde.com
Original: March 1999
In previous InterWorks conference presentations we have looked at business case, maintenance, specific techniques, and productivity and task automation aspects of Web sites. This year I will provide some specific tips, examples, and scripts to help your Web site. Specifically, I will discuss scripts for site maintenance, techniques for providing dynamic content, site architecture, and considerations for mirroring a site.
As with most other aspects of computing, Web site and page maintenance often consists of repetitive tasks. Using scripts can help speed up these tasks as well as improve the reliability of the job at hand. Some examples follow. The scripts discussed here will be made available to conference attendees. This code is provided under the provisions of the Perl "Artistic license" and should be considered "examples", not "product". It is likely you will need or want to alter it at your site to suit your needs.
It is frequently necessary to swap one string of characters for another on all (or many) pages of a Web site. For instance, you may need to change the copyright year notation or to change your company phone number. It is easy enough to locate all such cases with an appropriate grep or use of a tool such as Jeffrey Friedl's search script. However, if you have a large number of files to change, the manual process to edit each can be quite time-consuming and error-prone. Using a script to do the work for you is much faster and more reliable.
I created a Perl script named fixstring which I use as a template. I modify the search and swap strings as needed for each change that is required. This lets me make several changes in each file or use compound conditions under which the change gets made. The rest of the script remains unchanged from use to use so the same core services are provided each time.
The script accepts the pathname of a file as an argument. It scans through the specified file looking for the search string. If it locates the string, it swaps the replacement string into the file, remembers that it made a change, and continues through the entire file. When completed, it checks to see if any changes were made. If so, it writes the new content to the same file name but with a ".new" extension, then removes the original file and renames the ".new" version. This limits the amount of time the file is unavailable to other readers and ensures a "good write" completes before the original file is disturbed.
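The steps just described can be sketched as follows. This is a minimal Python illustration, not the author's Perl fixstring script, which is not reproduced here. The search and replacement strings are hypothetical placeholders, and os.replace is used to combine the "remove the original, then rename" steps into one operation.

```python
import os
import sys

# Hypothetical search and replacement strings; edit these for each change.
SEARCH = "Copyright 1998"
REPLACE = "Copyright 1999"


def fix_string(path, search=SEARCH, replace=REPLACE):
    """Swap `search` for `replace` in the file at `path`.

    Returns True if the file was changed, False otherwise.
    """
    changed = False
    new_lines = []
    # latin-1 can decode any byte; newline="" keeps line endings as they are.
    with open(path, "r", encoding="latin-1", newline="") as src:
        for line in src:
            if search in line:
                line = line.replace(search, replace)
                changed = True          # remember that a change was made
            new_lines.append(line)

    if not changed:
        return False                    # leave untouched files alone

    # Write the complete new content beside the original first, so the
    # original is not disturbed until a good write has finished.
    new_path = path + ".new"
    with open(new_path, "w", encoding="latin-1", newline="") as dst:
        dst.writelines(new_lines)

    # Swap the ".new" version into place under the original name.
    os.replace(new_path, path)
    return True


if __name__ == "__main__":
    for name in sys.argv[1:]:
        if fix_string(name):
            print("changed:", name)
```

Because the search and replacement values are ordinary parameters, the same skeleton can be reused for each new change, in the spirit of the template approach described above.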
Sometimes a simple string substitution as discussed for fixstring above is not sufficient and multiple lines must be checked and swapped in unison. This may occur if a new navigational scheme is to be used and all page footers need to be swapped at once.
The fixblock script requires that you have architected your pages in such a way that the block needing to be swapped has a definitive start and end designation (which may be the start or end of the file). Normally, it proceeds to pass input directly to output. When it locates the (potential) start of a defined block, it begins appending each input line to an internal variable rather than sending it to output. When the end of the block is reached, the "gathered" lines are checked to see if they meet the criteria of the block(s) to be swapped. If so, the gathered lines are discarded and the replacement ones are sent to output. (Of course, the same technique could be used to perform complex checks on each block and swap several strings within the block rather than replacing it entirely if so desired.)
As with fixstring, the output is written to the same file name but with a ".new" extension, then renamed once the write is confirmed.
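The gather-and-check loop can be sketched like this. It is a Python illustration of the technique rather than the fixblock script itself. The comment markers, the "oldnav.html" criterion, and the replacement footer are all hypothetical; real markers depend on how your pages were laid out.

```python
# Hypothetical block delimiters, match criterion, and replacement block.
START = "<!-- footer start -->"
END = "<!-- footer end -->"
OLD_MARK = "oldnav.html"          # a gathered block must contain this
NEW_BLOCK = [
    "<!-- footer start -->\n",
    '<a href="/newnav.html">Site map</a>\n',
    "<!-- footer end -->\n",
]


def fix_block(lines):
    """Return (new_lines, changed) with matching blocks replaced."""
    out = []
    gathered = None               # None means "not inside a block"
    changed = False
    for line in lines:
        if gathered is None:
            if START in line:
                gathered = [line]         # start holding lines back
            else:
                out.append(line)          # normal case: input to output
        else:
            gathered.append(line)
            if END in line:
                if any(OLD_MARK in g for g in gathered):
                    out.extend(NEW_BLOCK)     # discard gathered lines
                    changed = True
                else:
                    out.extend(gathered)      # not our block; keep it
                gathered = None
    if gathered is not None:
        out.extend(gathered)      # block never closed; emit it untouched
    return out, changed
```

Writing the result to a ".new" file and renaming it would follow the same pattern as for fixstring. Note that this sketch does not handle a start and end marker on the same line.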
Frequently, content for Web pages is supplied in electronic form by people whose machines and tools use character sets that do not conform to the Universal Character Set (UCS) defined in ISO 10646. According to the HTML 4.0 Specification from the World Wide Web Consortium, this is equivalent to Unicode 2.0. For example, Microsoft tools tend to include characters in the undefined code range 128-159. These usually display as strange characters on non-Microsoft platforms.
The demsify script checks each line of the specified file for such characters and either provides a warning of the line numbers on which they exist (so you can edit them in your favorite editor tool) or swaps them for predetermined characters so at least the results will be predictable on multiple platforms. In addition, it removes carriage-return/linefeed and linefeed/carriage-return line endings (swapping them for a single newline) and wraps long lines to a predetermined length, making it easier to look at a page via your browser's "view source" function.
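A rough Python sketch of both modes (warn and swap) follows. The replacement table is an assumption for illustration, covering a few common Windows-1252 punctuation bytes; the table used by the actual demsify script is not shown in this paper. Any other byte in the 128-159 range becomes "?" here, and the 78-column wrap width is likewise an arbitrary choice.

```python
import textwrap

# Assumed replacements for a few Windows-1252 bytes in the 128-159 range.
SWAPS = {
    0x85: "...",   # ellipsis
    0x91: "'",     # left single quote
    0x92: "'",     # right single quote
    0x93: '"',     # left double quote
    0x94: '"',     # right double quote
    0x96: "-",     # en dash
    0x97: "--",    # em dash
}


def normalize_newlines(text):
    """Turn CR/LF and LF/CR pairs into a single newline."""
    return text.replace("\r\n", "\n").replace("\n\r", "\n")


def find_suspects(data):
    """Return 1-based line numbers containing bytes in the range 128-159."""
    text = normalize_newlines(data.decode("latin-1"))
    return [n for n, line in enumerate(text.split("\n"), start=1)
            if any(128 <= ord(ch) <= 159 for ch in line)]


def demsify(data, width=78):
    """Swap suspect bytes, normalize line endings, and wrap long lines."""
    text = normalize_newlines(data.decode("latin-1"))
    text = "".join(SWAPS.get(ord(ch), "?") if 128 <= ord(ch) <= 159 else ch
                   for ch in text)
    wrapped = []
    for line in text.split("\n"):
        # textwrap.wrap returns [] for an empty line, so keep those by hand.
        wrapped.extend(textwrap.wrap(line, width, break_long_words=False,
                                     break_on_hyphens=False) or [""])
    return "\n".join(wrapped)
```

Wrapping with textwrap also collapses tabs and trims trailing spaces, so a production version would need to leave preformatted sections (such as the contents of a PRE element) alone.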
Many sites use wwwstat to analyze their log files and prepare reports. These are often chained together so that previous months' reports may be viewed. However, with each month on its own page, it is not easy to get a higher level picture of the month to month trend of Web activity.
The logrollup script locates all the available old reports (by assuming they are located in a particular directory and may be identified by a particular naming convention). Then it extracts certain high-level information such as total bytes and requests from the report heading area, and creates a simple report with each month on its own line. With modifications to gwstat (see below) the values in this new report may be used to create a graph of the activity over the past few months. When run from a periodic scheduler such as cron, it is easy to provide site managers with a Web page which contains a simple report of site activity.
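The roll-up idea might look like the Python sketch below. Both the file naming convention (stats-YYYY-MM.html) and the heading labels matched by the regular expressions are assumptions made for illustration; check them against the reports your own installation produces before relying on anything like this.

```python
import re
from pathlib import Path

# Assumed naming convention and heading labels; adjust for your reports.
NAME_RE = re.compile(r"^stats-(\d{4})-(\d{2})\.html$")
REQUESTS_RE = re.compile(r"Requests Received During Summary Period\s+([\d,]+)")
BYTES_RE = re.compile(r"Bytes Transmitted During Summary Period\s+([\d,]+)")


def roll_up(report_dir):
    """Return one (month, requests, bytes) tuple per report, oldest first."""
    rows = []
    for path in sorted(Path(report_dir).iterdir()):
        name = NAME_RE.match(path.name)
        if not name:
            continue                      # not a monthly report
        text = path.read_text(encoding="latin-1")
        requests = REQUESTS_RE.search(text)
        nbytes = BYTES_RE.search(text)
        if requests and nbytes:
            rows.append((f"{name.group(1)}-{name.group(2)}",
                         int(requests.group(1).replace(",", "")),
                         int(nbytes.group(1).replace(",", ""))))
    return rows


def format_report(rows):
    """Render the rows as plain text, one month per line."""
    return "\n".join(f"{month} {requests:>10} {nbytes:>14}"
                     for month, requests, nbytes in rows)
```

The one-line-per-month output is easy to wrap in a PRE element for a Web page or to feed to a graphing tool.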
Interex is using a technique (see pickpage below) to display a different welcome page for each of several domain names, but permit access to the entire Web site under any of these names. As a result, when analyzing the log files, they needed to be able to distinguish between each of the different welcome pages, yet consider all accesses together for each page of the rest of the site.
The solution was to have each virtual domain write its accesses to its own named log file. Then the munglog script locates each of these log files, translates each access to "/" (the welcome page of each domain) to a full domain name, and concatenates the logs into one large, combined log file. It is this combined log file which is read by the metrics analysis routines, such as wwwstat, to produce the daily and monthly reports of Web site use.
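As a sketch of the translation step, the Python below rewrites requests for "/" in Common Log Format lines. The exact form munglog gives the translated entry is not shown in this paper; writing it as "http://domain/" is an assumption, as is passing the logs in as a dictionary keyed by domain name.

```python
import re

# Matches the request field of a Common Log Format line when the path is "/".
ROOT_REQUEST = re.compile(r'"(GET|HEAD|POST) / ')


def mung_line(line, domain):
    """Rewrite a request for "/" so it names the domain that served it."""
    return ROOT_REQUEST.sub(
        lambda m: f'"{m.group(1)} http://{domain}/ ', line, count=1)


def mung_logs(logs_by_domain):
    """Combine per-domain logs (domain -> list of lines) into one list."""
    combined = []
    for domain, lines in logs_by_domain.items():
        combined.extend(mung_line(line, domain) for line in lines)
    return combined
```

Requests for any other path pass through unchanged, so pages in the rest of the site are counted together regardless of the domain used to reach them.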
A useful related script named ipsetup is discussed in the presentation Scripts for Linux and HP-UX. It may be used on Linux (and other) systems to run a series of ifconfig commands to start and stop multiple virtual domain servicing on a single machine.
When looking for scripts to assist you, don't forget to scan the Comprehensive Perl Archive Network (CPAN) as well as past InterWorks Conference presentations for a wealth of material. Other helpful places include your ISP's technical support area and Matt's Script Archive.
There are a variety of techniques which may be used to provide dynamic content. The trade magazines are full of "Java" and "JavaScript" examples and many discuss methods for extracting information from databases to build pages on demand. However, there are some standard, fairly straightforward methods which you may not have considered.
Interex registered a number of different domain names to serve as entry points for particular services and events it offers. Some of these entry points were for HP World, ERP World, ERP News, and the NT Forum. However, to ensure that all Interex information is always available, regardless of the entry mechanism used, the entire site's pages are accessible with the same sub-URL under any of these domain names. This is accomplished using a technique (mentioned in the discussion of munglog above) which determines the domain name used to access the welcome page. Then the domain name is used to select the name of a template page within the Interex Web site which should be displayed for that domain name.
Since the document root for each of the different domains is defined to the Web server as the same directory path, every other page within the Interex Web site is available using any of the registered domain names. For example, http://www.hpworld.org/conference/iworks99/ and http://www.erpworld.org/conference/iworks99/ yielded the same InterWorks 99 Conference page as was found at http://www.interworks.org/conference/iworks99/ (all now retired). (The www.interworks.org version was actually delivered from an Iowa mirror of the California Interex site, but that is a topic for a different discussion.)
The pickpage script utilizes a configuration file which contains the translation table from domain name to template file name. Thus, changes to the defined list may be made without disturbing the code which processes it. The format of the configuration file is:
<domainame> <pagepath> # comment
where <pagepath> must end in ".html" or be a directory name. <pagepath> is assumed to be relative to the directory in which the processing script resides unless <pagepath> begins with a "/", in which case it is assumed to be relative to the Web server's DOCUMENT_ROOT environment variable. Sample lines from the configuration file are:
www.interex.org /index.html # Main organization page
www.hpworld.org /hpworld/index.html # Dual conference and magazine page
As long as the template pages are written such that all image and link references are from the document root, or "/" level of the site rather than relative to their local directory, they will function correctly from both their normal location in the Web site's hierarchy and from the special access as a welcome page.
If the specified page cannot be found, the script looks for "index.html" in the script's directory and tries to display that. If still not found, a canned error page is displayed.
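The lookup and fallback rules above can be sketched in Python as follows. This is an illustration of the described behavior, not the pickpage source. Treating a directory entry as "that directory's index.html" is an assumption, and in a real CGI program the host name would typically come from an environment variable such as SERVER_NAME or HTTP_HOST.

```python
import os


def parse_config(text):
    """Return {domain: pagepath} from '<domain> <pagepath> # comment' lines."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        if len(fields) == 2:
            table[fields[0].lower()] = fields[1]
    return table


def pick_page(host, table, script_dir, document_root):
    """Return the file to display for `host`, or None for the error page."""
    pagepath = table.get(host.lower())
    if pagepath is not None:
        if pagepath.startswith("/"):
            # A leading "/" means relative to the server's document root.
            path = os.path.join(document_root, pagepath.lstrip("/"))
        else:
            # Otherwise relative to the directory holding the script.
            path = os.path.join(script_dir, pagepath)
        if os.path.isdir(path):
            # Assumption: a directory entry means its index.html.
            path = os.path.join(path, "index.html")
        if os.path.isfile(path):
            return path
    # Configured page missing or host unknown: try the local index.html.
    fallback = os.path.join(script_dir, "index.html")
    return fallback if os.path.isfile(fallback) else None
```

Keeping the table in a separate file, as the paper describes, means that adding a new domain is a one-line configuration change with no code edits.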
To implement the script, the Web server must be configured to allow new default pages and the script name must precede the normal static default. On the Interex system, we chose to rename the pickpage script "index.pl". Thus the Apache server configuration entry is:
DirectoryIndex index.pl index.html
When a directory URL is presented to the server, it first looks for an instance of the "index.pl" script. If it does not find it, it reverts to the standard "index.html" file.
It is also necessary to configure the server to allow execution of the script:
AddHandler cgi-script .pl
and to allow execution in the directory in which the script resides:
Options ... ExecCGI
The script is coded so that it could be used from locations other than the welcome page of a Web site, if desired.
Even if you have only a single domain name to service with your Web server, a modification of the technique described for pickpage may be used to display alternate pages. The script could be modified to select a different page template based upon the hour or the day of the week or even based upon a random number. This could be used to cycle through product offerings or tips. Alternatively, the script could be modified to use the input from a form submission to determine what page would be best to display. A multitude of other possibilities exist and are left to the reader's imagination.
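A few of these selection rules are sketched below in Python. The function names, the template file names in the usage note, and the 06:00 to 18:00 "day" window are all hypothetical choices made for illustration.

```python
import datetime
import random


def pick_by_weekday(now, templates):
    """Cycle through `templates` by day of the week (Monday is 0)."""
    return templates[now.weekday() % len(templates)]


def pick_by_hour(now, day_page, night_page, day_start=6, day_end=18):
    """Choose one page for daytime hours and another for the night."""
    return day_page if day_start <= now.hour < day_end else night_page


def pick_random(templates, rng=random):
    """Choose a template at random, for rotating tips or offers."""
    return rng.choice(templates)
```

For instance, pick_by_weekday(datetime.datetime.now(), ["tip1.html", "tip2.html", "tip3.html"]) would rotate three tip pages through the week. The chosen name could then be resolved to a file in the same way as a configured page path.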
Normally, static Web pages are crafted to display particular images and these do not change. However, the page content can be made to appear dynamic by replacing the image file with a different one. Once this is done, new invocations of the page will show the new graphic (subject to the usual image cache problems, of course).
The image might be re-computed and re-generated as the result of form input via a CGI program or it might be the result of a periodic task running under cron. The latter approach is found frequently on sites where a program such as gwstat is used in conjunction with wwwstat to generate graphic reports of Web site use.
(Note that the version of gwstat I have used requires gr to generate the graphics; however, gr is no longer under development. An alternative might be to try grace, a descendant of ACE/gr.)
Site architecture aids site construction and maintenance as well as readership. Both physical (file and directory) architecture and logical (cross links) architecture should be considered and planned carefully.
This may be obvious, but it is very important to provide a default page (check your server, often it is index.html) in each subdirectory which is displayed when just the directory name is passed from the browser to the server. This allows visitors to simply erase the file name from their URL and "back up" to what they expect to be your index or pointer into the subject matter. This may be because they found information on one of your pages which is "close" to, but not quite, what they want. Or it may be because they have a bad or outdated URL which no longer points to a real page -- backing up to the index may provide the pointers they need to locate the information desired.
Use of a default index page also prevents visitors from viewing your site's directory listing and selecting a page which may be under construction and not yet ready for public viewing. (Of course, this assumes your site's search capability follows your page links rather than your directory structures when building its inventory of possible pages.)
Proper segmentation into directories of related information makes it easier to locate similar information and it provides a simpler mechanism for granting access to related files either for content editors (via file permissions) or for site visitors (via server authorization files, passwords, etc.). In addition, it facilitates analysis of site metrics if you are using a program which can "roll-up" the counts for all pages beneath each directory.
It is best to keep each directory addressing a single topic. Try not to mix multiple topics together, even if they are relatively small. As soon as they begin to grow (and they usually do), it is likely you will want to have separate directories. However, by then the various page URLs will be known by search engines, be referenced by other sites' pages, and be contained in visitors' bookmarks. Changing your site architecture at that point causes considerable disruption to its usefulness or requires extensive use of Redirect in the server configuration files.
Remember to provide adequate cross-reference links both to other pages within the same topic and to other topics which may be related. The easier you can make it for visitors to navigate your logical site architecture, the more valuable your site will be.
While it is possible to intersperse your images with the text files, separating them into their own commonly named directories provides a number of benefits. The separate directories "unclutter" your text directories, making maintenance easier. Tools which manipulate text, check links, and analyze logs can be configured or written to avoid descending into or considering known image directories, saving time and simplifying reporting.
Proper segregation of your files into directories as described above will make it easier to mirror only portions of your site. (Of course in this instance, you must be very careful to avoid site-relative links which cross between the mirrored and non-mirrored segments of your site.)
If you are mirroring your site, in general you will want to be careful to use links which do not contain the site's domain name. This keeps all cross references working on the mirror site at which the visitor first began reading your pages. If you use a domain-specific link, they will be shuttled to that particular domain (or physical machine) and remain there for all future site-relative links. This may mean that someone trying to use your European mirror suddenly will be making multiple trans-Atlantic accesses to a US-based site and suffering the related latency or expense involved with such links.
Despite what we just covered above, there are times when domain-specific links are not only desired, but required. For example, if you have a CGI application which submits or alters data or pages, such manipulation needs to take place on the "master" site. Otherwise, only that particular mirror will contain the change, and only for a short time at that. Subsequent mirror operations will cover up the change with the original content, causing it to be lost.
As you architect your site, watch for these instances and be certain to craft the links accordingly. You may want to consider placing comments into the HTML to warn future maintainers that a particular link must remain domain-specific.
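One way to watch for such links is to list every link in a page that names one of the site's own hosts, then review the list by hand to confirm that each one is intentional. The Python sketch below does that with the standard library HTML parser. The function and class names are hypothetical, and only the href, src, and action attributes are examined.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkFinder(HTMLParser):
    """Collect the values of href, src, and action attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src", "action") and value:
                self.links.append(value)


def domain_specific_links(html, site_hosts):
    """Return links in `html` whose host is one of `site_hosts`."""
    finder = LinkFinder()
    finder.feed(html)
    # Site-relative links such as "/images/logo.gif" have no hostname.
    return [link for link in finder.links
            if urlparse(link).hostname in site_hosts]
```

Links reported for pages that should work on any mirror are candidates for conversion to site-relative form; links reported for CGI applications that must run on the master site are the ones worth marking with an HTML comment.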
There are some directories and types of data you probably do not want to mirror. While you may want to store the analyzed reports of site access on all mirrors, it is unlikely you want to mirror the actual raw log files. Private files, certain password protected files, and user-specific files are other categories you should consider when designing your mirror.
Each server will normally have its own configuration files. When functional configuration changes are made (e.g. access restriction and page redirection) these changes will need to be made in each mirror's configuration files. This can be facilitated by mirroring the master configuration files, but saving them in a side directory on the mirror, not into the location actually used by the mirror for its own configuration. Then, the changes can be discovered and implemented by hand, or automated in full or in part by a cron job.
Since the mirror process may not run as the same user or group as the Web server or the site maintainers, particular attention must be paid to file and directory permissions. It is all too easy for a maintainer to make a local copy of a file but leave it with permissions preventing the mirror process from reading it (even if the Web server can). This then results in extraneous error log entries which must be reconciled. Worse yet, the page may later be determined to be something of value and commissioned for public viewing on the master site, but be unavailable on the mirror. (The error was logged, but it became one of those "expected" errors and was ignored when the page "went public".)
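A periodic check can catch such files before they turn into "expected" errors. The Python sketch below walks a directory tree on a POSIX system and reports anything lacking the "other" read bit (or, for directories, the read and search bits). It assumes the mirror process falls into the "other" permission class; if it runs under a known group, the group bits would be the ones to test instead.

```python
import os
import stat

FILE_BITS = stat.S_IROTH                   # others may read
DIR_BITS = stat.S_IROTH | stat.S_IXOTH     # others may list and enter


def unreadable_by_others(top):
    """Yield paths under `top` that the "other" class cannot read."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue                   # skip symbolic links
            if (os.stat(path).st_mode & DIR_BITS) != DIR_BITS:
                yield path
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            if (os.stat(path).st_mode & FILE_BITS) != FILE_BITS:
                yield path
```

Run from cron on the master site, the resulting list can be mailed to the maintainers so permissions are corrected before the next mirror pass.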
Several Web functions normally need to locate particular files within the file system. Access override files (e.g. .htaccess) must be able to point to their associated master password and group files. CGI programs may need to locate configuration, log, library, or other files of information. If these resources cannot be located via the same path names on each mirror, it is unlikely that all services will be available. If there is some reason that such files cannot be stored permanently in the same locations everywhere, consider utilizing a series of "mock" directories and soft links to provide the appearance that they do.
For example, if the Web site's perl scripts define perl's location as:
#!/opt/perl/bin/perl
but on one mirror perl is still installed at /usr/local/bin/perl, you can create the /opt/perl/bin directory structure, then provide links such as:
ln -s /usr/local/bin/perl /opt/perl/bin/perl
ln -s /usr/local/lib/perl5 /opt/perl/lib
so that the scripts will continue to work.
I have found it handy to define the master Web location as /opt/web, then either provide the directories such as htdocs and cgi-bin directly in that structure or as links to their actual locations. For that matter, /opt/web itself could be a soft link to a different location on a different disk, but common access is still provided and is independent of the particulars of the brand or version of the Web server being used.
If the mirrors are not all running the same manufacturer's OS, you may discover that what are normally considered common system programs are located in different places. I have found this to be true with programs such as sendmail. This was compounded by the fact that some servers may not provide a PATH variable which includes the directory in which programs such as sendmail may be found. The solution is to change the PATH or use links so the required services may be located.
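Where a script must cope with this itself, it can search a few extra directories in addition to whatever PATH the server provides. The Python sketch below shows the idea. The directory list is an assumption based on common sendmail locations, and find_program is a hypothetical helper name.

```python
import os
import shutil

# Directories where sendmail is commonly installed; adjust for your systems.
EXTRA_DIRS = ["/usr/sbin", "/usr/lib"]


def find_program(name, extra_dirs=EXTRA_DIRS):
    """Look for `name` on PATH first, then in `extra_dirs`; None if absent."""
    dirs = [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
    dirs.extend(extra_dirs)
    # shutil.which returns the first executable match, or None.
    return shutil.which(name, path=os.pathsep.join(dirs))
```

A CGI program could call find_program("sendmail") once at startup and report a clear error if it returns None, rather than failing later when it tries to send mail.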
Finally, once all required information may be found on each of the mirrors, it is equally important that compatible revisions of system and application software be available. For example, it may not be important whether perl 5.003 or 5.004 is available, but it probably must be a version of 5.x.
If you have control of each of the mirrors, it should be a simple matter to upgrade certain programs to be of a compatible version. However, if you are sharing a mirror machine with another department or company, that may not be possible. In certain circumstances, it may be necessary to modify your Web site so that it may be accommodated on all mirrors.
This presentation was prepared by Artronic Development for the 1999 InterWorks Conference. An outline and an abstract are available for your convenience. This is an HTML document located at URL "http://www.arde.com/Papers/WebTips/paper.html".
At this writing, the scripts may be obtained via http from the Artronic Development site.
To learn more about the Web, please refer to the other papers which are also on this Web site.
(Dave Eaton has been creating and maintaining Web sites since mid-1994. He provides assistance on the Interex Web site and is the "Webmaster" for numerous other sites.)