Search RCS and CVS ",v" files with rcsgrep
A few years ago I was doing research comparing how large software
distributions handled shared object libraries, and studied Debian,
FreeBSD, and Ubuntu. Extracting data about Debian packages was easy
thanks to Peter Palfrader’s
snapshot.debian.org service, which
provides a machine-usable interface to Debian’s package
history. FreeBSD’s data is equally accessible, albeit in a less
pleasant format: their ports tree was stored in
CVS until July 2012. One
could easily rsync a copy of the ports tree’s CVS repository to a
local machine to analyze the data. This left you with a local tree
full of ,v files, each holding the history of the file at that
location. I needed to extract all kinds of data from these files, such
as which revisions contained lines matching a regex, when those
revisions were checked in, and any tags associated with them. To make
things easier, it also helped to know the line numbers of the matching
lines. Hence the birth of rcsgrep.
rcsgrep is a Python script that makes use of Paul McGuire’s fabulous
pyparsing library. It allows you to search an RCS file (the ,v file
format used by RCS and CVS to store revision history) using a Python
regex. The output format is customizable, so you can print only the
information you need: the revision number, the line number, the
matching line, the line’s author, the date it appeared, any tags
associated with the line, and (useful when running over a large number
of files) the file name. To make machine parsing (using AWK of course)
easier, you can also specify the column separator.
For example, I entered the lines “The quick brown”, “fox jumped over”,
“the lazy dog. Woof!” into the file abc, checking in the changes after
each line. The invocation ./rcsgrep -s ' ' -f rlLda '.*' abc,v uses
spaces for column separation and the format options r for revision,
l for line number, L for line contents, d for date, and a for author.
It outputs:
1.3 1 The quick brown 2013.02.20.14.24.09 ryan
1.3 2 jumped over the 2013.02.20.14.24.09 ryan
1.3 3 lazy dog. Woof! 2013.02.20.14.24.09 ryan
1.2 1 The quick brown 2013.02.20.14.23.48 ryan
1.2 2 jumped over the 2013.02.20.14.23.48 ryan
1.1 1 The quick brown 2013.02.20.14.23.25 ryan
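Rows in this shape can also be consumed from Python instead of AWK. The following is a minimal sketch, not part of rcsgrep: the function name parse_row is made up for illustration, and it assumes the rlLda field order with a single-space separator, an author name without spaces, and dates in the Y.m.d.H.M.S layout shown above. Because the line contents may themselves contain spaces, it peels the fixed fields off both ends of the row.

```python
from datetime import datetime


def parse_row(row):
    """Split one 'rlLda' row (space-separated) into its five fields.

    Hypothetical helper: assumes the author and date contain no spaces,
    so everything between the line number and the date is line contents.
    """
    revision, lineno, rest = row.split(" ", 2)
    contents, date, author = rest.rsplit(" ", 2)
    return {
        "revision": revision,
        "lineno": int(lineno),
        "contents": contents,
        "date": datetime.strptime(date, "%Y.%m.%d.%H.%M.%S"),
        "author": author,
    }


row = parse_row("1.3 2 jumped over the 2013.02.20.14.24.09 ryan")
print(row["revision"], row["lineno"], repr(row["contents"]), row["author"])
```

If the matched lines might contain the separator in awkward places, choosing a separator that cannot occur in the data (a tab, say) with -s would make plain splitting sufficient.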
I’m particularly proud of my grep() function in rcsfile.py, which goes
through each revision, tracking additions and deletions, but only
keeping the lines matching the regex in memory. In any case, rcsgrep
is licensed under the ISC license and can be found on GitHub.
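To give a feel for what walking the revisions involves, here is a simplified sketch. It is not the code from rcsfile.py: the names apply_rcs_delta and grep_trunk are invented for illustration, and unlike the real grep() it keeps every line of each revision in memory. It relies on the fact that an RCS file stores the newest trunk revision in full and each older one as an edit script in diff -n format, where "dN M" deletes M lines starting at line N and "aN M" adds the M lines that follow after line N. The delta scripts in the usage example are what I would expect for the abc example above, written out by hand rather than read from a real ,v file.

```python
import re

EDIT_CMD = re.compile(r"([ad])(\d+) (\d+)")


def apply_rcs_delta(lines, delta):
    """Apply a diff -n style edit script to a list of lines.

    Line numbers in the script refer to the text before the edit.
    """
    result = []
    pos = 0  # index of the next original line not yet copied or deleted
    i = 0
    while i < len(delta):
        m = EDIT_CMD.fullmatch(delta[i])
        if m is None:
            raise ValueError(f"bad edit command: {delta[i]!r}")
        cmd, start, count = m.group(1), int(m.group(2)), int(m.group(3))
        i += 1
        if cmd == "d":
            # Copy untouched lines before the deletion, then skip `count` lines.
            result.extend(lines[pos:start - 1])
            pos = start - 1 + count
        else:
            # Copy through original line `start`, then append the new lines.
            result.extend(lines[pos:start])
            pos = max(pos, start)
            result.extend(delta[i:i + count])
            i += count
    result.extend(lines[pos:])
    return result


def grep_trunk(head_text, older, pattern):
    """Return (revision, line number, line) for matches, newest revision first.

    `older` is a list of (revision, delta) pairs ordered from newest to oldest.
    """
    regex = re.compile(pattern)
    text = head_text[1]
    matches = []
    for rev, delta in [(head_text[0], None)] + older:
        if delta is not None:
            text = apply_rcs_delta(text, delta)
        for n, line in enumerate(text, 1):
            if regex.search(line):
                matches.append((rev, n, line))
    return matches


# Hand-written stand-in for the contents of abc,v (assumed, not extracted).
head = ("1.3", ["The quick brown", "jumped over the", "lazy dog. Woof!"])
older = [("1.2", ["d3 1"]), ("1.1", ["d2 1"])]
for rev, n, line in grep_trunk(head, older, ".*"):
    print(rev, n, line)
```

Branches work the other way around (their deltas run forward from the branch point), which this sketch does not attempt to handle.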
Addendum: I learned after the fact that O’Reilly’s
“UNIX Power Tools”
offers something similar by the same name, except that it runs
several processes, such as co, grep and sed, as opposed to a
single Python script.
3 Comments
mirabilos
With my cvs-in-Debian maintainer hat on, do you wish for this script to be included with the cvs package?
Ryan Kavanagh
That would be great. Why don’t I give my script a final look through and let you know once I’m done?
HipJiveGuy
So are you running this on a local copy of a repo, or directly on the CVS root? Is there anything destructive in here? Running it at the CVS root would be sweet, as I could watch for changes without having to manually do updates or compares all the time…