Filter and convert characters
Applying Erlang Binary syntax to get fast character manipulation.
- Author: Lloyd R. Prentice
- Co-Author: Andreas Stenius
Why
Erlang bit syntax is extraordinarily powerful and well worth learning.
Review the documentation here: http://www.erlang.org/doc/programming_examples/bit_syntax.html
Here’s an experiment. Follow along in your handy Erlang shell:
1> A = "The cat on the mat".
"The cat on the mat"
2> B = <<"The cat on the mat">>.
<<"The cat on the mat">>
3> io:format(A,[]).
The cat on the matok
4> io:format(B,[]).
The cat on the matok
So, big deal. What’s the difference?
Memory consumption, that’s what. A is represented internally as a linked list: each character occupies a memory cell holding the character and a pointer to the next cell. That is two machine words per character, or 16 bytes on a 64-bit system, so the list format is a memory hog. List representation of text strings is both bane and blessing for Erlang programmers. B, on the other hand, is laid out in memory as a contiguous array of bytes, one byte per character here, and so consumes much less memory.
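To see the difference for yourself, here is a minimal sketch. The module name string_memory and both function names are made up for this illustration, and erts_debug is an internal debugging module rather than a documented API, so treat the numbers as indicative only.

```erlang
-module(string_memory).
-export([list_bytes/1, binary_bytes/1]).

%% Heap footprint of a list string, in bytes.
%% erts_debug:flat_size/1 reports the size in machine words
%% (erts_debug is an internal module; assumed available in your release).
list_bytes(String) when is_list(String) ->
    erts_debug:flat_size(String) * erlang:system_info(wordsize).

%% Payload size of a binary, in bytes (the small fixed header is not counted).
binary_bytes(Bin) when is_binary(Bin) ->
    byte_size(Bin).
```

Calling string_memory:list_bytes("The cat on the mat") and string_memory:binary_bytes(<<"The cat on the mat">>) in the shell shows how far apart the two representations are for the same 18 characters.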
For details on working with Erlang Binaries, look here: http://www.erlang.org/doc/man/binary.html
When you want to filter and convert characters in an Erlang Binary, however, things get dicey. It’s all there in the Erlang documentation, but examples are few and far between.
The purpose of this particular recipe is to prepare text for conversion from a *.txt file to *.tex, a file marked up for input to the truly wonderful open source typesetting program LaTeX.
In our case, we don’t control what gets typed into the source text file. So we need both to filter out unwanted characters and sequences and to convert other characters and sequences into LaTeX markup.
What does this have to do with Zotonic?
You’ll just have to wait and see.
Assumptions
All you’ll need here is a text editor to copy and paste an Erlang module and an Erlang shell.
A passing understanding of Erlang list comprehensions will also be useful. Find more here: http://www.erlang.org/doc/programming_examples/list_comprehensions.html
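As a quick refresher, a list comprehension combines a generator with optional filters. The sketch below is only an illustration; the module name listcomp_demo and the function strip_spaces/1 are hypothetical and not part of the recipe.

```erlang
-module(listcomp_demo).
-export([strip_spaces/1]).

%% Generator: C <- String walks the characters of the list string.
%% Filter:    C =/= $\s keeps every character that is not a space.
strip_spaces(String) ->
    [C || C <- String, C =/= $\s].
```

For example, listcomp_demo:strip_spaces("The cat on the mat") returns "Thecatonthemat". Binary comprehensions use the same generator-plus-filter shape.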
How
Study latexFilter/1 below:
-module(filterz).
-export([latexFilter/1]).

%% Binary filters
%% with many thanks to Andreas Stenius

latexFilter(B) ->
    %% Delete control characters and extended ASCII: keep only
    %% LF (10), CR (13) and printable ASCII (32..126)
    B1 = << <<C>> || <<C>> <= B, (C == 10) or (C == 13) or (C >= 32),
            (C =< 126) >>,
    %% Filter binary for conversion to *.tex format
    %% NOTE: Partially tested, but needs more
    %% In a replacement, "\\\\" yields one literal backslash and
    %% "&" yields the matched text
    F = [{"\r\n?", "\n"},                 % Convert returns (CR or CRLF) to new lines
         {"\n *", "\n"},                  % Delete spaces following newline
         {"^\\s*", ""},                   % Delete all whitespace characters at beginning of manuscript
         {"^\n*", ""},                    % Delete new lines at beginning of manuscript
         {"\n+$", "\n"},                  % Delete excess new lines at end of manuscript
         {" +", " "},                     % Delete successive spaces
         {"\n{3,}", "\n\\\\bigskip\n"},   % Convert three or more newlines to LaTeX bigskip
         {"[&$#%~^{}]", "\\\\&"},         % Escape reserved LaTeX character
         {"<i>", "\\\\emph{"},            % Convert HTML tag to LaTeX tag
         {"</i>", "}"},                   % Convert HTML tag to LaTeX tag
         {"\"(.*?\")", "``\\1"},          % Convert opening double quotes (") to LaTeX conventions (``)
         {" '", " `"}],                   % Convert opening single quotes (') to LaTeX conventions (`)
    B2 = lists:foldl(fun({Pattern, Replacement}, Subject) ->
                         re:replace(Subject, Pattern, Replacement,
                                    [global, {return, binary}])
                     end,
                     B1, F),
    {ok, B2}.

The first thing that happens in latexFilter/1 is an Erlang binary comprehension that deletes pesky control and extended ASCII characters:
B1 = << <<C>> || <<C>> <= B, (C == 10) or (C == 13) or (C >= 32), (C =< 126) >>,
Note that the syntax is very similar to list comprehensions. The comma between the two filters acts as a logical and, so a byte survives only if it passes both tests. As you can see, a single binary comprehension can chug out a lot of work. Define a few binaries in your Erlang shell and play around with the conditionals. You’ll catch on pretty quickly.
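If you would like a starting point for that experimenting, here is a small sketch. The module bincomp_demo and its functions are invented for illustration; they show filtering only, converting only, and a different filter on the same pattern.

```erlang
-module(bincomp_demo).
-export([printable/1, upcase/1, digits/1]).

%% Filter only: keep LF, CR and printable ASCII (32..126).
printable(Bin) ->
    << <<C>> || <<C>> <= Bin,
       C =:= $\n orelse C =:= $\r orelse (C >= 32 andalso C =< 126) >>.

%% Convert only: every byte is kept, but a..z is mapped to A..Z.
upcase(Bin) ->
    << <<(to_upper(C))>> || <<C>> <= Bin >>.

to_upper(C) when C >= $a, C =< $z -> C - 32;
to_upper(C) -> C.

%% Two filters separated by a comma: both must hold.
digits(Bin) ->
    << <<C>> || <<C>> <= Bin, C >= $0, C =< $9 >>.
```

For instance, bincomp_demo:upcase(<<"The cat">>) returns <<"THE CAT">>, and bincomp_demo:digits(<<"Year: 2024!">>) returns <<"2024">>.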
For our purposes, we also need to search and replace a bunch of substrings. For this we’ve enlisted lists:foldl/3, a powerful list function. Here it walks the list of pattern and replacement pairs and passes the binary along as the accumulator. See: http://www.erlang.org/doc/man/lists.html#foldl-3
Expect a headache when you study this function. It’s not that easy to understand.
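A small sketch may ease the headache. The module fold_demo and its functions are hypothetical names chosen for this example. order/0 makes the traversal order visible, and apply_rules/2 is a stripped-down version of the rule pipeline, assuming each rule is a {Pattern, Replacement} pair as in the module above.

```erlang
-module(fold_demo).
-export([order/0, apply_rules/2]).

%% Prepending each item to the accumulator records the visiting order
%% in reverse, which makes the two directions easy to tell apart.
order() ->
    Prepend = fun(Item, Acc) -> [Item | Acc] end,
    Left  = lists:foldl(Prepend, [], [a, b, c]),  % visits a, then b, then c
    Right = lists:foldr(Prepend, [], [a, b, c]),  % visits c, then b, then a
    {Left, Right}.

%% Thread a binary through a list of {Pattern, Replacement} rules.
%% The binary is the accumulator; each rule produces the next version.
apply_rules(Bin, Rules) ->
    lists:foldl(fun({Pattern, Replacement}, Subject) ->
                        re:replace(Subject, Pattern, Replacement,
                                   [global, {return, binary}])
                end, Bin, Rules).
```

fold_demo:order() returns {[c,b,a],[a,b,c]}. Because foldl applies the rules from first to last, rule order matters: fold_demo:apply_rules(<<"a   b\r">>, [{"\r", "\n"}, {" +", " "}]) returns <<"a b\n">>.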
The fold function exists in two versions, foldl and foldr (for left and right, described shortly). It takes a list and calls a function for each item in the list, along with an accumulator, or state. The function can operate on the item and state, producing a new state for the next round. The fold function also takes an initial state to use for the first item. The left and right mentioned previously determine the order in which the list is traversed. So foldl moves through the list from left to right, i.e. it takes the head off the list on each iteration and moves towards the tail, whereas foldr starts at the tail and moves towards the head, i.e. from right to left.
Regular expressions can also get hairy, but they’re invaluable if you’re working with text strings. For more information: