# Web::Chain project: Web/Definitions.pm
package Web::Definitions;
# doom@kzsu.stanford.edu
# August 03, 2004 # Rev: October 5, 2004
=head1 NAME
Web::Definitions - defines constants (now in the form of exported
variables) in use throughout the doomfiles tools
=head1 SYNOPSIS
use Web::Definitions qw($DF_DESTINATION_RULE);
unless( $name =~ $DF_DESTINATION_RULE ) {
die "$name is not a well-formed DF node name";
}
=head1 DESCRIPTION
Set of constant definitions for doomfiles work.
=cut
use 5.006;
use strict;
use warnings;
use Carp;
require Exporter;
our @ISA = qw(Exporter);
=head2 EXPORT
This module uses a system that automatically adds
candidates for export to the EXPORT_OK list.
This system exports:
1. all constants
2. all *_RULE (or *_rule) variables (declared with 'our')
3. all UPPERCASE variables (declared with 'our')
The :all tag can be used to import all exports from this file:
use Web::Definitions qw(:all);
though this is not recommended.
It's better to just import the ones you need, chosen from the following:
=over
=cut
our @export_list;
sub BEGIN {
  my $filename = (caller)[1];
  # three-argument open with a lexical filehandle
  open my $me, '<', $filename or die "Can't open $filename for input: $!";
  while (<$me>) {
    # stop at the end of the code: the POD after __END__ contains sample "our" lines
    last if m{^ __END__ \s* $}x;
    # get constant names from "use constant" lines
    if ( m{^ \s* use \s+ constant \s+ (.*?) \s }x ) {
      push @export_list, $1;
    } elsif ( m{^ \s* our \s+ (.*?_RULE) \s }ix ) { # *_rule or *_RULE variable names
      push @export_list, $1;
    } elsif ( m{^ \s* our \s* ( \$[A-Z0-9_]+ ) \s* = }x ) { # $UPPER_CASE vars
      push @export_list, $1 unless ($1 eq '$VERSION'); # skip $VERSION
    }
  }
  close $me;
}
# Additional items manually exported: (currently, none)
push @export_list, qw(
);
our %EXPORT_TAGS = ( 'all' => \@export_list );
our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );
=item * B<$DF_VERSION> - project wide version number.
=cut
our $DF_VERSION = '0.08';
our $VERSION = $DF_VERSION;
=item * B<$DEBUG> - project wide debug flag, turns on lots
of excessively verbose reporting, all sent to STDERR.
=cut
our $DEBUG=0;
=item * B<$DOOM_HOME> - home directory of user doom (sometimes
this is better than playing with $ENV{'HOME'} because it
doesn't change if the script is run as root).
=cut
our $DOOM_HOME = '/home/doom';
#--------
# Web oriented locations and patterns
=item * B<$DF_LOC> - location of DF html in staging area (the finished product)
=cut
our $DF_LOC = $DOOM_HOME . '/End/Stage/Mirthless/doomfiles';
=item * B<$DOOM_THOUGHTS> - default source location of DF nodes in progress (rawtext)
The alternate name $DF_RAWTEXT is also available.
=cut
our $DOOM_THOUGHTS = $DOOM_HOME . '/End/Thought';
our $DF_RAWTEXT = $DOOM_THOUGHTS;
=item * B<$DF_PUSH_RSYNCH_TARGET> - rsync target for pushing the entire
project directory out to the public site
=cut
our $DF_PUSH_RSYNCH_TARGET = 'mirthles@shell.grin.net:/usr/home/mirthles/public_html';
# Example: $cmd = "rsync -avz -e ssh $doom_loc mirthles\@shell.grin.net:/usr/home/mirthles/public_html";
=item * B<$DF_PUSH_SCP_TARGET> - target for individual file transfer (via scp, or "rsync *.html"
rather than "rsync <dir>").
=cut
our $DF_PUSH_SCP_TARGET = "mirthles\@shell.grin.net:/usr/home/mirthles/public_html/doomfiles";
# Example: `scp $doomdir/$fpat.html mirthles\@shell.grin.net:/usr/home/mirthles/public_html/doomfiles`;
=item * B<$DF_THOUGHTS_TEMP_LOC> - bone pile for tossing used rawtext files
after adding them to the df html site.
=cut
our $DF_THOUGHTS_TEMP_LOC = $DOOM_THOUGHTS . '/Out';
=item * B<$DF_TOPNODE_NAME> - standard name for the first node in the linked-list
of DF files.
=cut
our $DF_TOPNODE_NAME = 'TOP';
=item * B<$DF_BOTNODE_NAME> - standard name for the last node in the linked-list
of DF files.
=cut
our $DF_BOTNODE_NAME = 'FIN';
# some other special nodes (generated and/or linked-to automatically)
=item * B<$DF_CONTENTS_NODE_NAME> - another standard name for a "special node":
a table of contents of all existing nodes in the DF project (this is an
automatically generated node).
=cut
our $DF_CONTENTS_NODE_NAME = 'CONTENTS';
=item * B<$DF_WHATSNEW_NODE_NAME> - another standard name for a "special node":
this is a reverse-chronological change log. Additions to it are generated
automatically, but manual annotations are expected.
=cut
our $DF_WHATSNEW_NODE_NAME = 'WHATSNEW';
=item * B<$DF_WHATSNEW_NOW_MARKER> - a special marker string
that should always exist in a comment near the top of
WHATSNEW.html. This marker is searched for
when automatically adding new entries to the file.
=cut
our $DF_WHATSNEW_NOW_MARKER = '===NOW MARKER===';
=item * B<$DF_NEW_NODES_LOG> - standard file that new additions to the
DF project were once logged to (this practice is being
phased out).
=cut
our $DF_NEW_NODES_LOG = $DOOM_HOME . '/tmp/doomfile_nodes.log';
#--------
# Web processing oriented patterns
#
=item * B<$df_node_name_quantified_pat> - this variable is
essentially the central definition of what constitutes a valid
"doomfiles node" name. These names must be at least three
characters long, and typically will be in all UPPER_CASE with
underscores as separators though hyphens and numerics are allowed,
as well as the lower-case 'c' (which allows names like McCLAUREN).
This variable is not exported as is, but is used by a number of
regexp rules below which are.
Example allowed node name:
McCELTIC-AMERICAN_SOUL_7
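
As a rough illustration of what the pattern accepts, the sketch below copies
the two definitions from this file into a standalone script and pins them at
both ends. The helper name is_node_name and the sample names other than
McCELTIC-AMERICAN_SOUL_7 are made up for the example.

```perl
use strict;
use warnings;

# Copied from the definitions in this module, so the sketch runs on its own.
my $df_node_name_char_class_pat = '[0-9cA-Z_-]';
my $df_node_name_quantified_pat = $df_node_name_char_class_pat . '{3,}';

# Hypothetical helper: true if the whole string is a valid node name.
sub is_node_name {
    my ($name) = @_;
    return $name =~ m{ ^ $df_node_name_quantified_pat $ }x ? 1 : 0;
}

for my $name ( 'McCELTIC-AMERICAN_SOUL_7', 'TOP', 'AB', 'MacDUFF' ) {
    printf "%-26s %s\n", $name, is_node_name($name) ? 'ok' : 'rejected';
}
```

Here 'AB' fails the three-character minimum, and 'MacDUFF' fails because
'a' is not in the character class (only the lower-case 'c' is).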
=cut
our $df_node_name_char_class_pat = '[0-9cA-Z_-]';
our $df_node_name_quantified_pat = $df_node_name_char_class_pat . '{3,}';
=item * B<$DF_NODE_NAME_RULE> - doomfiles node name pattern,
without any pinning at the beginning or end (with '^' or '$').
This is the regexp rule equivalent of $df_node_name_quantified_pat.
=cut
our $DF_NODE_NAME_RULE = qr{ $df_node_name_quantified_pat }x;
=item * B<$DF_NODE_NAME_PINNED_RULE> - doomfiles node name pattern,
pinned at both ends (used for verifying that a string contains
a valid node name and nothing else).
=cut
our $DF_NODE_NAME_PINNED_RULE = qr{ ^ $df_node_name_quantified_pat $ }x; # Note, begin and end pinning
=item * B<$DF_DESTINATION_RULE> - doomfiles node name pattern, pinned at the
beginning of the string via '^'. Used to extract a node name from
a body of text when it is up against the left margin. The right side
is pinned with a zero-width lookahead for \s or for '$' (the eol).
This right side pinning is better than simple greedy matching, because
it avoids a minor problem with false positives: in the case of
"FALSE*NAMES", it should not capture "FALSE", but instead report that
it doesn't see a valid match there.
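
A minimal sketch of the behavior described above, assuming a condensed
one-line copy of the rule. The helper destination_label and the sample
lines are hypothetical, not part of this module.

```perl
use strict;
use warnings;

# Condensed copy of the rule from this module, so the sketch is self-contained.
my $df_node_name_quantified_pat = '[0-9cA-Z_-]{3,}';
my $DF_DESTINATION_RULE = qr{ ^ ( $df_node_name_quantified_pat ) (?= \s | $ ) }x;

# Hypothetical helper: returns the label at the left margin, or undef.
sub destination_label {
    my ($line) = @_;
    return $line =~ $DF_DESTINATION_RULE ? $1 : undef;
}

my @lines = ( 'DREAMS', 'DREAMS  some trailing text', 'FALSE*NAMES', '  INDENTED' );
for my $line (@lines) {
    my $label = destination_label($line);
    print defined $label ? "label: $label\n" : "no label in: $line\n";
}
```

The first two lines yield DREAMS. The other two yield no label: the '*' is
neither whitespace nor end of line, and the indented name is not at the
left margin.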
=cut
#our $doomfiles_node_name_pat = "^[0-9cA-Z_-]{3,}"; # Note: half-pinned with '^'. Do not pin this with '$'
#our $doomfiles_node_name_pat = "^[0-9cA-Z_-]{3,}(?=\\s|\$)"; # Note: half-pinned with '^'. Do not pin this with '$'
#our $DF_DESTINATION_RULE = qr{ $doomfiles_node_name_pat }x;
our $DF_DESTINATION_RULE = qr{ ^ # Labels are up against left margin
( # capture 1
$df_node_name_quantified_pat # i.e. [0-9cA-Z_-]{3,}
)
(?= \s | $ ) # space or EOL to pin the pattern
}x;
=item * B<$DF_GENERAL_NAME_RULE> - A simple, very liberal rule, matches
both links and destination labels
=cut
our $DF_GENERAL_NAME_RULE = qr{
\b
( $df_node_name_quantified_pat ) # doomfile node name, captured to 1
\b
}smx;
=item * B<$doomfiles_thoughts_node_separator_pat> - detects a line that
consists of a bar of equal signs beginning in the first column
(two or more, e.g. '==='). This is used in the "rawtext" doomfiles format to indicate
the end of a node. Used by the $DF_END_RULE below.
=cut
our $doomfiles_thoughts_node_separator_pat = '^==+\s*$';
#--------
# Web::Chain::IO::Output::Html
# (via txt2html in Web::Pro::HtmlOutput)
# (((TODO Why is the above label here? The following has nothing
# to do with it... straighten out these comments.))
#--------
# Web::Chain::IO::Input::Rawtext
#
=item * B<$DF_THOUGHTS_LINK_RULE> - Tries to identify links embedded in
the rawtext source files (sometimes called "Thoughts") without
getting confused by incidental use of uppercase strings in the
text. Doomfile-style links are distinguished by the whitespace
that surrounds them: roughly, at least two spaces before and
after, where the end of the line (and possibly the beginning?)
can be thought of as a chunk of virtual spaces.
(Getting this to work right on all corner cases is a
surprisingly difficult problem.)
This version is a quickie that has bugs in identifying a link
near the end of the line without trailing spaces.
It keeps things simple by capturing leading and trailing spaces
as well as the link, so using it in a s/// requires building a
replacement version with $1 $2 $3.
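
The following sketch shows one way such a replacement might be built from
the three captures. The rule is copied from this module; the helper linkify
and the anchor markup it emits are assumptions made for the example, not
the project's actual output format.

```perl
use strict;
use warnings;

# Copied from this module so the sketch is self-contained.
my $df_node_name_quantified_pat = '[0-9cA-Z_-]{3,}';
my $DF_THOUGHTS_LINK_RULE = qr{
    ( [\x20\t][\x20\t] )               # two leading blanks, capture 1
    ( $df_node_name_quantified_pat )   # node name, capture 2
    ( [\x20\t][\x20\t]                 # two blanks, or
    | [\x20\t] $                       #   one blank then eol, or
    | $                                #   eol (all capture 3)
    )
}msx;

# Hypothetical helper: wrap each link in an anchor, keeping the whitespace.
sub linkify {
    my ($text) = @_;
    $text =~ s{$DF_THOUGHTS_LINK_RULE}{$1<A HREF="$2.html">$2</A>$3}g;
    return $text;
}

print linkify("see also  TO_HELL  for details\n");
print linkify("the road leads  TO_HELL\n");
```

Because the surrounding whitespace is consumed by the match, it has to be
put back via the first and third captures, or the spacing of the text
would change.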
=cut
our $doomfiles_thoughts_link_pat = '([\ ]{2,})(' . $df_node_name_quantified_pat . ')([\ ]{2,}|$)';
# our $DF_THOUGHTS_LINK_RULE = qr{ $doomfiles_thoughts_link_pat }x; # $1 and $3: whitespace, $2: node_name
# TODO BUG The above, which is in "production"
# doesn't work with the eol case at all (that's TO_HELL
# in the test case in the *.t).
# Does this fix it?
our $DF_THOUGHTS_LINK_RULE = # $1 and $3: whitespace, $2: node_name
qr{
( [\x20\t][\x20\t] ) # two leading spaces, capture 1
( $df_node_name_quantified_pat ) # doomfile node name, capture 2
( [\x20\t][\x20\t] # 2 spaces or... \
| [\x20\t] $ # 1 space then eol or... > capture 3
| $ # eol /
)
}msx;
=item * B<$DF_EMBEDDED_LINK_SINGLE_CAPTURE_RULE> - Tries to identify
links embedded in the rawtext source files (sometimes called
"Thoughts") without getting confused by incidental use of uppercase
strings in the text.
Doomfile-style links are distinguished by the whitespace
that surrounds them: roughly, at least two spaces before and
after, where the end of the line (and possibly the beginning?)
can be thought of as a chunk of virtual spaces.
(Getting this to work right on all corner cases is a
surprisingly difficult problem: I'm willing to compromise
on the left hand boundary, and say "no links allowed
without at least two spaces from the left margin",
but getting the right side to work right could be a problem.)
This is much like $DF_THOUGHTS_LINK_RULE, except that
this version is an attempt at capturing *only* the link itself
to $1, using zero-width patterns to identify the whitespace.
Also, this pattern is intended to handle more cases.
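
A small sketch of the single-capture approach, with the rule copied from
this module. The helper embedded_links and the sample text are
hypothetical. Since the lookarounds consume no whitespace, two links
separated by only two spaces can both be found in one pass.

```perl
use strict;
use warnings;

# Copied from this module so the sketch is self-contained.
my $df_node_name_quantified_pat = '[0-9cA-Z_-]{3,}';
my $DF_EMBEDDED_LINK_RULE = qr{
    (?<= [\x20\t][\x20\t] )            # lookbehind: two blanks
    ( $df_node_name_quantified_pat )   # node name, capture 1
    (?= [\x20\t][\x20\t]               # lookahead: two blanks, or
      | [\x20\t] $                     #   one blank then eol, or
      | $                              #   eol
    )
}smx;

# Hypothetical helper: list every link found in a chunk of rawtext.
sub embedded_links {
    my ($text) = @_;
    my @links = $text =~ m{$DF_EMBEDDED_LINK_RULE}g;
    return @links;
}

my $text = "Compare  DREAMS  COUCH  and then go\nstraight  TO_HELL\n";
print join( ', ', embedded_links($text) ), "\n";    # prints: DREAMS, COUCH, TO_HELL
```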
=cut
our $DF_EMBEDDED_LINK_SINGLE_CAPTURE_RULE = qr{
(?<= [\x20\t][\x20\t] ) # zero-width pos lookbehind for two spaces
# Note: requires "fixed string" (char class okay)
( $df_node_name_quantified_pat ) # doomfile node name, captured to 1
(?= # zero-width positive lookahead for...
[\x20\t][\x20\t] # 2 spaces or...
| [\x20\t] $ # 1 space then eol or...
| $ # eol
)
}smx;
our $DF_EMBEDDED_LINK_RULE = $DF_EMBEDDED_LINK_SINGLE_CAPTURE_RULE;
our $DF_NODE_RULE = qr{ $df_node_name_quantified_pat }x; # Note no '^' pinning # on its way to being deprecated
# Note in the following $DF_START_RULE
# the first non-doomfiles_node_name_pat character terminates
# the name, greedy matching ensures the whole name will be captured.
# In theory, anything could follow the name (a comment?) and it would be ignored (unused feature).
# our $DF_START_RULE = qr{ ^ # link destinations have labels at start of line
# ($DF_NODE_RULE) # capture name to $1
# }x;
our $DF_START_RULE = $DF_DESTINATION_RULE;
our $DF_END_RULE = qr{ $doomfiles_thoughts_node_separator_pat }x;
#--------
# Web::Chain::IO::Input::Html
# Html Format Web crunching
#
=item * B<$DF_EXTRACT_NEXT_NODE_RULE>, B<$DF_EXTRACT_PREV_NODE_RULE>, B<$DF_EXTRACT_BODY_RULE> -
These rules are used to scrape the "next node" and "previous node" and the main body
of content out of the finished DF html files.
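
A hedged sketch of how these three rules might be applied together. The
rules are copied from this module (with comments stripped), the sample
page is the DREAMS.html example shown in the comments of this file, and
the helper scrape_node is hypothetical.

```perl
use strict;
use warnings;

# Copied from this module so the sketch is self-contained.
my $df_node_name_quantified_pat = '[0-9cA-Z_-]{3,}';
my $df_extract_next_node_pat = '">\[NEXT\ -\ (' . $df_node_name_quantified_pat . ')\]</A>';
my $df_extract_prev_node_pat = '">\[PREV\ -\ (' . $df_node_name_quantified_pat . ')\]</A>';
my $DF_EXTRACT_NEXT_NODE_RULE = qr{ $df_extract_next_node_pat }x;
my $DF_EXTRACT_PREV_NODE_RULE = qr{ $df_extract_prev_node_pat }x;
my $DF_EXTRACT_BODY_RULE = qr{ </H1> \s* <PRE> \s* $
                               (.*?)
                               ^--+ \s*
                               (?: <A \s+ HREF [^[]+ \[NEXT \s+ - | </PRE> )
                             }msx;

# Hypothetical helper: returns (prev node, next node, body text).
sub scrape_node {
    my ($page) = @_;
    my ($prev) = $page =~ $DF_EXTRACT_PREV_NODE_RULE;
    my ($next) = $page =~ $DF_EXTRACT_NEXT_NODE_RULE;
    my ($body) = $page =~ $DF_EXTRACT_BODY_RULE;
    return ( $prev, $next, $body );
}

my $html = <<'END_HTML';
<HTML><HEAD>
<TITLE>The doomfiles - DREAMS.html</TITLE>
</HEAD><BODY>
<PRE> <A HREF="MAGIC.html">[PREV - MAGIC]</A> <A HREF="TOP.html">[TOP]</A></PRE>
<H1>DREAMS</H1>

<PRE>

I don't remember looking in
their direction either.

--------

<A HREF="COUCH.html">[NEXT - COUCH]</A>
</PRE></BODY></HTML>
END_HTML

my ( $prev, $next, $body ) = scrape_node($html);
print "prev: $prev, next: $next\n";
print $body;
```

A page with no PREV or NEXT link (such as TOP or FIN) leaves the
corresponding value undefined, so real callers would need to check for
that.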
=cut
our $df_extract_next_node_pat = '">\[NEXT\ -\ (' . $df_node_name_quantified_pat . ')\]</A>';
our $DF_EXTRACT_NEXT_NODE_RULE = qr{ $df_extract_next_node_pat }x;
our $df_extract_prev_node_pat = '">\[PREV\ -\ (' . $df_node_name_quantified_pat . ')\]</A>';
our $DF_EXTRACT_PREV_NODE_RULE = qr{ $df_extract_prev_node_pat }x;
our $DF_EXTRACT_BODY_RULE = qr{ </H1> \s* <PRE> \s* $ # Starts with <PRE> block after title (</H1>)
(.*?) # Capture all text up to
^--+ \s* # The line of hyphens before...
(?:
<A \s+ HREF [^[]+ \[NEXT \s+ - # the NEXT link or ...
| </PRE> # the closing </PRE> tag (FIN.html has no NEXT)
)
}msx;
# An example of the field of html that the above rule works on:
#
# <HTML><HEAD>
# <TITLE>The doomfiles - DREAMS.html</TITLE>
# </HEAD><BODY>
# <PRE> <A HREF="MAGIC.html">[PREV - MAGIC]</A> <A HREF="TOP.html">[TOP]</A></PRE>
# <H1>DREAMS</H1>
#
#
# <PRE>
#
# I don't remember looking in
# their direction either.
#
#
# --------
#
# <A HREF="COUCH.html">[NEXT - COUCH]</A>
# </PRE></BODY></HTML>
#--------
# Patterns to extract <TITLE> and <H1> strings
=item * B<$DF_EXTRACT_TITLE_RULE>, B<$DF_EXTRACT_H1_RULE> -
These are patterns to extract <TITLE> and <H1> strings,
out of finished DF html files.
=cut
our $DF_EXTRACT_TITLE_RULE = qr{<TITLE.*?>(.*?)</TITLE>}i;
our $DF_EXTRACT_H1_RULE = qr{<H1.*?>(.*?)</H1>}i;
#--------
# A splicing technique
#
# Add these patterns too?
# # Change PREV unless at TOP
# $text =~
# s{<A HREF="([^\.]*)\.html">\[PREV - \1\]</A>}
# {<A HREF="$prevnode.html">[PREV - $prevnode]</A>}
# unless ($file eq $top);
# # Change NEXT unless at BOT
# $text =~
# s{<A HREF="([^\.]*)\.html">\[NEXT - \1\]</A>}
# {<A HREF="$nextnode.html">[NEXT - $nextnode]</A>}
# unless ($file eq $bot);
#--------
# More Rawtext processing
#
=item * B<$DF_THOUGHTS_NODE_HEADER_RULE> -
Using this in a m//msg match should extract all the new doomfiles
nodes in a Rawtext (aka 'Thoughts') file. Note that this is a
plain pattern string, not a qr{} object. (Currently not used.)
=cut
our $DF_THOUGHTS_NODE_HEADER_RULE =
$doomfiles_thoughts_node_separator_pat .
'(^\s*$)' .
'(' . $df_node_name_quantified_pat . ')';
1;
__END__
=back
=head1 DISCUSSION
=head2 STYLE
(1) Avoid creating a lot of built-up definitions to be
exported. E.g. something like this isn't a good idea:
our $DF_TOPNODE = $DF_LOC . '/' . $DF_TOPNODE_NAME . '.html';
Code that uses such a $DF_TOPNODE can't be tested
very well, because it always accesses the live location $DF_LOC.
Better to do this sort of build-up in the code that
uses these definitions.
(2) Use names like "*_pat" for strings that contain regular
expressions, and "*_RULE" for actual regexp objects created
with qr{}. Export all "*_RULE"s automatically.
(3) Actually, you should note that this module automatically
exports all variables with names all in uppercase.
This lessens the chance of collision with other variables
and makes the exports visually resemble constants even though
they're not.
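
As an illustration of point (1), here is one way the build-up might be done
in the calling code. The helper node_file and the /tmp/df_test directory
are hypothetical; the two values are stand-ins copied from this module so
the sketch runs on its own.

```perl
use strict;
use warnings;

# Stand-ins for the exported values (copied from this module).
my $DF_LOC          = '/home/doom/End/Stage/Mirthless/doomfiles';
my $DF_TOPNODE_NAME = 'TOP';

# Hypothetical helper: the caller chooses the location, defaulting to the live one.
sub node_file {
    my ( $node_name, $loc ) = @_;
    $loc = $DF_LOC unless defined $loc;
    return "$loc/$node_name.html";
}

print node_file($DF_TOPNODE_NAME), "\n";                      # live location
print node_file( $DF_TOPNODE_NAME, '/tmp/df_test' ), "\n";    # test fixture
```

A test script can then pass its own directory and never touch the live
staging area.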
=head2 CONSTANT IRRITATION
Perl constants have little going for them: their only
real advantage is the compiler can optimize them to inlines.
Well, there's also the fact that they're I<constant> and
putting a value into that kind of straight-jacket might help
prevent a shoot-yourself-in-the-foot problem, but unlike
most such things in perl, there is no easy way to escape the
straight-jacket later if you really need to. There is no:
{no strict 'constants' ... }
It's an odd thought, but it isn't really unusual to want to
temporarily change a "constant" e.g. over-ride a
project-level $DEBUG flag, setting it temporarily just for
the current module.
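
With an exported variable rather than a constant, that kind of temporary
override can be done with local. The sketch below uses a made-up package
My::Definitions as a stand-in for this module, and the sub names are
hypothetical.

```perl
use strict;
use warnings;

package My::Definitions;    # hypothetical stand-in for Web::Definitions
our $DEBUG = 0;

package main;

sub debug_state { return $My::Definitions::DEBUG ? 'on' : 'off' }

sub noisy_section {
    # Override the project-wide flag for the dynamic scope of this sub only.
    local $My::Definitions::DEBUG = 1;
    return debug_state();
}

print 'inside: ',  noisy_section(), "\n";    # prints: inside: on
print 'outside: ', debug_state(),   "\n";    # prints: outside: off
```

The original value comes back automatically when the enclosing block
exits, which is the escape hatch that a "use constant" value lacks.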
Also on the negative side, constants don't really interpolate,
not even if you try fugly tricks like:
print "My constant: &CONSTANT() \n";
Though, if you want to get even fuglier, you I<can> do this:
print "My constant: @{[ CONSTANT ]} \n";
Why use perl at all if you're going to, uh, constantly
do things like this:
print "My constant: " . CONSTANT . "\n";
So: I've stopped using (and exporting) constants.
=head1 SEE ALSO
L<Project Documentation|Web::Project>
=head1 AUTHOR
Joseph Brenner, E<lt>doom@kzsu.stanford.eduE<gt>
=head1 COPYRIGHT AND LICENSE
Copyright (C) 2004 by Joseph Brenner
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself, either Perl version 5.8.2 or,
at your option, any later version of Perl 5 you may have available.
=head1 BUGS
None reported... yet.
=cut
Joseph Brenner,
Sat Nov 6 17:04:11 2004