Michael Rash, Security Researcher

Website Editing from the Perl Command Line

The website is completely maintained by a set of custom perl scripts for automatic editing of HTML pages, verifcation of page structure with XML::Simple, building release tarballs and RPM files of the software projects, and deploying the site to both staging and production web servers. Because a key requirement for this site is to run as few pieces of code as possible for security reasons, CGI scripts are kept to a minimum. All of the blog postings and pages dedicated to the four software projects available here (psad, fwknop, fwsnort, and gpgdir) are pure HTML that is edited and updated by a set of perl scripts. Note that there are excellent content management solutions out there such as Plone, Drupal, and Joomla for building complete database-driven websites for online communities and the like, but I do not yet need such a comprehensive solution.

Sometimes I run into a situation where I need to apply a global change to all of the HTML pages, and at the same time it is not necessary to update my perl scripts to account for the change because it affects content that only changes once. In this situation a single perl command with a judiciously chosen regular expression can do the trick.

For example, I include the meta tag revisit-after at the top of many of the HTML pages. Initially, I set the revisit time interval to 31 days like so: <meta name="revisit-after" content="31 days" /> However, because the site usually changes significantly more rapidly than this, I decided to shorten the time interval to two days. With a single perl command combined with the find and xargs command, every .html page in a directory structure (including recursively in all subdirectories) can be updated: $ find /path/to/webroot -name '*.html' | xargs perl -p -i -e 's|revisit-after"\s+content="31\s+days"|revisit-after" content="2 days"|' All of the pages are stored within a Subversion repository, so checking that the above command did the right thing is easy: $ svn diff 14.html
Index: 14.html
--- 14.html (revision 590)
+++ 14.html (working copy)
@@ -4,7 +4,7 @@
 <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
 <meta name="robots" content="all" />
-<meta name="revisit-after" content="31 days" />
+<meta name="revisit-after" content="2 days" />

 <title>Website Editing from the Perl Command Line<title>
One could argue that this information should be kept in a global header that is applied programmatically to all HTML pages. This way the "revisit-after" attribute would only have to be changed in one place for it to be applied to all HTML pages. This is a good argument for this particular class of change, but let us consider a more complex example. At one point I decided that I would like all links within the body of a blog posts (such as the one you are reading now) to be highlighted in bold characters. I had been doing this for some links by manually including "<b>" and "</b>" (bold) tags within the text portion of the standard "<a href=...>" tags. But, after deciding to make all links in bold characters, a simple change to the stylesheet used by could accomplish this without using the bold tags. However, all of the existing blog posts still contained links with the bold tags embedded, so I needed a quick way to remove them. Again, perl to the rescue: find /path/to/webroot/ -name '*.html' |xargs perl -p -i -e 'undef $/; s|<a\s+href=(.*?)> \s*<b>\s*(.*?)\s*</b>\s*</a>|<a href=$1>$2</a>|gs' This time things are a bit more complicated. Note the usage of "undef $/" so that each .html file is slurped into a single string so that the regex can match links that span multiple lines. Also, note the usage of the "?" which turns the .* into a non-greedy match so that only the minimal text that qualifies as valid link and descriptive text is matched for each link. A quick check against Subversion is in order to make sure the command worked properly: $ svn diff 2003/10/01.html Index: 2003/10/01.html =================================================================== --- 2003/10/01.html (revision 589) +++ 2003/10/01.html (working copy) @@ -42,7 +42,7 @@  <a class="link" href="/gpgdir/download/">download<a>  <br/><br/>  <div class="createdate">Posted by Michael Rash on 2003/10/01 -| <a href="/blog/categories/software-releases.html"><b> Software Releases</b></a> +| <a href="/blog/categories/software-releases.html">Software Releases</a>  <div>  <span class="article_separator">nbsp;<span>  <div>