Scan Text Files for Control Characters

Posted On Tue, 16 Apr 2013

Filed under Batch

A large part of programming in Batch is taken up with processing text files. By “text file”, I mean a plain text file with Windows line endings. And “plain text” means no nasty control characters such as Control-Z or the infamous Null Character.

For example, the former is used by copy /a and type as the end-of-file marker, while set and echo interpret the latter as the end of input.

So it’s always a good idea to scan any text files of unknown origin for these troublesome characters before doing anything else. Which is why I wrote the ctrlscan.cmd program described below…

Program

The program accepts a list of filenames. Filenames may contain wildcards. Enter ctrlscan /? for basic usage info.

@if (@X)==(@Y) @goto Dummy @end /* Batch begins
@echo off & setlocal enableextensions
if "%~1"=="" call :usage && goto end
(set lf=^

)
set nl=^^^%lf%%lf%^%lf%%lf%
for /f %%h in (^"/?%nl%/h%nl%/he%nl%/hel%nl%/help^") ^
do if /i "%~1"=="%%h" call :usage && goto end

(call;)

set "ctrls=%tmp%\ctrls.tmp"
if not exist "%ctrls%" call :createCtrls

:loop
for %%f in ("%~1") do (set "match=1"
if exist "%%~f\" (>&2 echo("%%~f" is a folder& (call) & goto break
) else if not exist "%%~f" (>&2 echo(file "%%~f" not found& (call)
goto break)
if %%~zf gtr 0 (findstr /mlg:"%ctrls%" "%%~f" >nul ^
&& for /f "delims=:" %%l in ('findstr /nlg:"%ctrls%" "%%~f"') do (
>&2 echo(ctrl char(s^) found on line %%l of file "%%~f"& (call))
) else (>&2 echo(file "%%~f" is empty& (call))
)

:break
if not defined match (>&2 echo("%~1" did not match any files
(call)) else set "match="
shift /1
if "%~1" neq "" goto loop

:end
endlocal & goto :EOF

:createCtrls
:: xp users uncomment next line
:: fsutil file createnew "%ctrls%" 1 >nul
:: xp users comment out next 2 lines
>"%ctrls%" echo(00
certutil -f -decodehex "%ctrls%" "%ctrls%" >nul

>>"%ctrls%" echo(
cscript //nologo //e:jscript "%~dpf0" >>"%ctrls%"
exit /b

:usage
set ^"\n=^^^%lf%%lf%^%lf%%lf%^^"
cls & echo(Scans multiple text files for the presence of control ^
characters.%nl%%\n%
Usage:%nl%%\n%
  %~n0 filename [filename ...]%nl%%\n%
where filename specifies one or more text files.%nl%%\n%
Notes:%\n%
- Doesn't check for Tab (HT^), Line Feed (LF^), or Carriage Return^
 (CR^).%\n%
- Not suitable for Unicode or text files that use CR as line^
 terminator.%\n%
- Wildcards (* and ?^) in filenames are permitted.%\n%
- Multiple filenames may be specified on command line.%\n%
- No line length limit.
exit /b 0

JavaScript begins */

var ctrls = new Array(1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 14, 15, 16,
  17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 127);

for (var i=0; i<ctrls.length; i++) {
    WScript.Echo(String.fromCharCode(ctrls[i]));
}

Discussion

Two things worth noting at this point are:

  1. The program is a hybrid script. A few lines of JavaScript are used to create a file of control characters used later on in the program.

  2. Queue revealed in this DosTips topic that (call;) sets the dynamic variable errorlevel to 0 and (call) sets it to 1. A great tip and so useful! I’ve made extensive use of it throughout the program, so if you’re wondering what all those empty call statements are about, now you know. 🙂

Anyways, the location for the file of control characters is stored in the ctrls variable and is defined as %tmp%\ctrls.tmp by default, but feel free to change it to suit your own needs. If the file does not exist, a new one will be created.
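
To picture what ends up in that file, here is a rough JavaScript sketch that rebuilds its contents as a string. It is not part of ctrlscan.cmd: the function names buildCtrlCodes and buildPatternFile are made up for illustration, and the sketch assumes the NUL character is written first and that every pattern sits on its own line ending in CR LF.

```javascript
// Rebuild the list of character codes that make up the pattern file.
// Assumption: NUL (0) comes first, because the batch part writes it
// before the JScript part appends the rest.
function buildCtrlCodes() {
    var skipped = { 9: true, 10: true, 13: true }; // HT, LF and CR are left alone
    var codes = [0];
    for (var c = 1; c <= 31; c++) {
        if (!skipped[c]) {
            codes.push(c);
        }
    }
    codes.push(127); // DEL
    return codes;
}

// One pattern per line, each line terminated by CR LF.
function buildPatternFile(codes) {
    var out = "";
    for (var i = 0; i < codes.length; i++) {
        out += String.fromCharCode(codes[i]) + "\r\n";
    }
    return out;
}

var codes = buildCtrlCodes();
console.log(codes.length);                   // number of search strings
console.log(buildPatternFile(codes).length); // size of the file in characters
```

Generating the list in a loop rather than typing it out makes it easy to see which three codes are skipped.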

Btw, Windows XP users should comment and uncomment the lines indicated in the :createCtrls subroutine.

The main for loop expands any wildcards (eg, *.htm?) in the command line parameter (%1) into a list of filenames, and assigns each one in turn to the %%f loop variable.

If a filename passes all the usual validity checks, findstr performs a literal search (the /l switch) on the file for every string listed in the %ctrls% file (that’s what /g:"%ctrls%" does), and the /m switch ensures only the filename is sent to output if there is a match. If nothing is found, the program moves on to the next file.

If there is a match, the file is scanned a second time by a similar findstr command, only this time it uses /n to number all lines containing any offending characters. The line numbers are captured by a surrounding for /f loop and stored in its %%l variable. Finally, the line number and filename are displayed in an error message.
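
The two-pass idea can be sketched in plain JavaScript as well. This is only a model of the logic, not what findstr does internally: the names CTRL_PATTERN, hasCtrlChars and findCtrlLines are invented for the example, the text is assumed to be already loaded into a string, and lines are assumed to end at LF.

```javascript
// Control characters the scan looks for: NUL to BS, VT, FF, SO to US, and DEL.
// HT (9), LF (10) and CR (13) are deliberately absent.
var CTRL_PATTERN = /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/;

// First pass: does the text contain any control character at all?
function hasCtrlChars(text) {
    return CTRL_PATTERN.test(text);
}

// Second pass: 1-based numbers of the lines that contain one.
function findCtrlLines(text) {
    var hits = [];
    if (!hasCtrlChars(text)) {
        return hits; // clean text: skip the per-line pass
    }
    var lines = text.split("\n"); // split on LF only
    for (var i = 0; i < lines.length; i++) {
        if (CTRL_PATTERN.test(lines[i])) {
            hits.push(i + 1);
        }
    }
    return hits;
}

// Line 2 holds a Control-Z and line 4 a NUL; the Tab on line 3 is ignored.
var sample = "ok\r\nbad\x1Ahere\r\nfine\tTab\r\n\x00";
console.log(findCtrlLines(sample)); // lines 2 and 4 are reported
```

The cheap yes/no check comes first so that clean files, presumably the common case, never pay for the line-by-line pass.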

The main for loop exits when all the expanded filenames have been processed. The next command line parameter is read in (shift /1), overwriting the current value of %1. If the new %1 is not empty, the program goes back to the :loop label and starts all over again. Otherwise, the program terminates.
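
As a simplified model of that control flow, the sketch below walks an argument list the same way. Everything in it is hypothetical: processArgs is an invented name, and expand and scanFile are stand-in callbacks for wildcard expansion and the per-file scan, so this shows only the shape of the loop, not the batch code itself.

```javascript
// expand:   hypothetical callback that turns one argument (possibly with
//           wildcards) into a list of matching filenames.
// scanFile: hypothetical callback invoked once per expanded filename.
// Returns the arguments that did not match any file.
function processArgs(args, expand, scanFile) {
    var queue = args.slice();        // work on a copy of the argument list
    var unmatched = [];
    while (queue.length > 0) {       // like: if "%~1" neq "" goto loop
        var current = queue.shift(); // plays the role of shift /1
        var files = expand(current);
        if (files.length === 0) {
            unmatched.push(current); // the "did not match any files" case
        }
        for (var i = 0; i < files.length; i++) {
            scanFile(files[i]);
        }
    }
    return unmatched;
}

// Usage with a fake directory listing standing in for the file system.
var listing = { "*.txt": ["a.txt", "b.txt"], "*.log": [] };
var scanned = [];
var missing = processArgs(
    ["*.txt", "*.log"],
    function (pattern) { return listing[pattern] || []; },
    function (file) { scanned.push(file); }
);
console.log(scanned); // files that were handed to the scanner
console.log(missing); // arguments that matched nothing
```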

The program exits with goto :EOF rather than exit /b 0 in order to preserve the value of errorlevel.

Please note that ctrlscan.cmd:

  • does not search for the control characters Tab (HT), Line Feed (LF), or Carriage Return (CR).

  • will work with Unix text files as well as files with Windows line endings. But files from systems that use CR as line terminator (such as Mac OS 9 or earlier) will be treated as one long line.

  • is not suitable for Unicode text files.

  • has no limit on line length.
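
The line-ending and Unicode notes above can be illustrated with a short JavaScript sketch. It assumes Node.js (for Buffer), the helper names countLines and containsNul are made up, and the Unicode part rests on an assumption rather than anything tested here: that UTF-16 files are a problem because plain ASCII text stored that way is full of NUL bytes.

```javascript
// Count lines the way the notes describe: only LF ends a line.
function countLines(text) {
    if (text.length === 0) {
        return 0;
    }
    var parts = text.split("\n");
    // A trailing LF terminates the last line rather than starting a new one.
    if (parts[parts.length - 1] === "") {
        parts.pop();
    }
    return parts.length;
}

// True if the byte sequence contains a NUL (0x00) byte.
function containsNul(bytes) {
    for (var i = 0; i < bytes.length; i++) {
        if (bytes[i] === 0) {
            return true;
        }
    }
    return false;
}

console.log(countLines("one\r\ntwo\r\n")); // Windows endings: 2
console.log(countLines("one\ntwo\n"));     // Unix endings: 2
console.log(countLines("one\rtwo\r"));     // CR only: 1 (one long line)

// Node.js only: Buffer.from encodes the string into the named encoding.
console.log(containsNul(Buffer.from("Hi", "utf16le"))); // true
console.log(containsNul(Buffer.from("Hi", "latin1")));  // false
```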

Well, that’s about it for now. The next version of this program will not only find control characters, but remove them as well! A program like that would be in real danger of being useful. 😉

Watch this space for updates. And in the meantime, feel free to leave a comment with any thoughts or suggestions you might have on the subject.
