String and Character Array Alternative in Java
A String in Java is a quasi-primitive type made for passing around sequences of Unicode characters such as those found in spoken languages, and the class does a good job at this. But what if you want to indicate certain characters without regard to order, such as the characters allowed in a username? Most programmers will use a String for this task as well, while others will resort to an array of char. Both approaches have their drawbacks, and I've come up with a solution superior to both, an approach whose usefulness has surprised even me.
Let's say you want to trim characters from the end of a string, and would like to specify which characters should be candidates for trimming. Most programmers would create a method with a signature something like this:
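public static String trimEnd(String string, String characters)
{
  int i=string.length()-1;
  while(i>=0)
  {
    if(characters.indexOf(string.charAt(i))<0)
    {
      break; //found a character that should not be trimmed
    }
    i--; //this character is trimmable; move to the preceding one
  }
  return string.substring(0, i+1);
}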
There are ways to make this method even more efficient, but it's sufficient to point out that the approach of using a String to represent a set of characters has several drawbacks:
- It's possible someone could pass in duplicate characters in the characters parameter. This wouldn't break the algorithm, but it would make it less efficient.
- It creates confusion as to the semantics of the method. What do we do when we want to trim an exact sequence of characters (e.g. "ed" or "ing" from a verb) from the end of a string? Naturally we would wind up with the same method signature, requiring that the trim-character method be renamed to trimEndAnyOf(String, String) or something similar.
- The String class has a multitude of methods unnecessary for the task of merely transporting characters, making it a heavy solution to a lightweight problem.
A naive alternative might be to switch to an admittedly more semantically appropriate approach:
public static String trimEnd(String string, char[] characters);
This second approach alleviates all but the first drawback of the first approach, but it has one even greater drawback that more than negates any benefits: arrays in Java are not immutable! Where do we store the characters we want to trim? Good programming practice calls for defining the characters in some reusable place rather than hard-coding them inline. We would therefore use something like this in some definitions class:
public static final char[] WHITESPACE_CHARS=new char[]{' ', '\t', '\r', '\n'};
While final prevents the variable itself from being reassigned, any renegade piece of code could at any time modify the definition of whitespace with a simple WHITESPACE_CHARS[0]='X'; statement. This is extremely dangerous; in programming, write code that trusts no one, not even the original programmer.
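The danger is easy to reproduce. The following is a minimal sketch; the class name MutableArrayDemo and the helper isWhitespace() are hypothetical, introduced only for this illustration:

```java
/** Hypothetical definitions class, named here only for illustration. */
final class MutableArrayDemo
{
  public static final char[] WHITESPACE_CHARS=new char[]{' ', '\t', '\r', '\n'};

  /** Hypothetical helper: reports whether the character appears in the shared array. */
  static boolean isWhitespace(final char character)
  {
    for(final char c : WHITESPACE_CHARS)
    {
      if(c==character)
      {
        return true;
      }
    }
    return false;
  }

  public static void main(final String[] args)
  {
    System.out.println(isWhitespace(' ')); //true
    WHITESPACE_CHARS[0]='X'; //compiles: final protects the reference, not the elements
    System.out.println(isWhitespace(' ')); //false: the shared definition has changed
    System.out.println(isWhitespace('X')); //true
  }
}
```

Every caller that relies on the shared array sees the altered definition from that point on.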
So what do we do? There is no way to make an array immutable in Java, but we can make a thin and smart wrapper class around a character array that guards all access to the array. When we do this, we find that we gain a multitude of other, unexpected benefits over both String and char[]. This is because we are using a class built for the specific purpose, rather than trying to ride on the coattails of classes meant for other applications.
Let's call this new class Characters. Under the covers it will have an array of characters, but access to that array will be strictly controlled. We'll make the class final (analogous to String and other quasi-primitive classes) to prevent subterfuge by subclasses:
public final class Characters
{
  private final char[] chars;
  private final int minChar;
  private final int maxChar;
  public Characters(char... characters)
  {
    …
Now that all designation of the characters passes through the constructor, we can do any preprocessing we want, such as:
- Discard any duplicate characters.
- Sort the characters in order of Unicode code point.
- Make a note of the character with the least and greatest Unicode code point.
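The preprocessing steps above could be sketched as follows. This is only one possible approach, not the actual constructor: the class name CharactersSketch, the accessor methods, and the use of -1 as the bound for an empty set are all assumptions made for illustration.

```java
import java.util.Arrays;

/** A sketch only; the real Characters class may differ. */
final class CharactersSketch
{
  private final char[] chars; //sorted by code unit, without duplicates
  private final int minChar; //assumption: -1 if there are no characters
  private final int maxChar; //assumption: -1 if there are no characters

  CharactersSketch(final char... characters)
  {
    final char[] sorted=characters.clone(); //defensive copy; the caller's array is untouched
    Arrays.sort(sorted);
    int count=0; //number of distinct characters kept so far
    for(final char c : sorted)
    {
      if(count==0 || c!=sorted[count-1]) //in a sorted array, duplicates are adjacent
      {
        sorted[count++]=c;
      }
    }
    chars=Arrays.copyOf(sorted, count);
    minChar=count>0 ? chars[0] : -1;
    maxChar=count>0 ? chars[count-1] : -1;
  }

  /** Hypothetical accessor returning a copy, so the internal array never escapes. */
  char[] toCharArray()
  {
    return chars.clone();
  }

  /** Hypothetical accessor for the smallest character, or -1 if empty. */
  int getMinChar()
  {
    return minChar;
  }

  /** Hypothetical accessor for the largest character, or -1 if empty. */
  int getMaxChar()
  {
    return maxChar;
  }
}
```

Copying the array on the way in and on the way out is what keeps outside code from ever holding a reference to the internal state.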
With all this pre-knowledge of the characters we have, a method such as Characters.contains() can be very efficient. Rather than walking the entire set as String.indexOf(char) would, for example, we can first check to see if the given character falls outside the known bounds. If so, we don't even need to check any characters:
public boolean contains(char character)
{
  if(character < minChar || character > maxChar)
  {
    return false;
  }
  for(final char c : chars)
  {
    if(c == character)
    {
      return true;
    }
    else if(c > character)
    {
      return false;
    }
  }
  return false;
}
Notice also that, because we sorted the characters in our array in the constructor, if we find a character greater than our given character, we know that there can be no later characters that match our character, preventing the need to check the rest of the array.
With only these simple improvements, the Characters class already provides immense value, in readability, semantics, and even algorithm efficiency:
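public static String trimEnd(String string, Characters characters)
{
  int i=string.length()-1;
  while(i>=0)
  {
    if(!characters.contains(string.charAt(i)))
    {
      break; //found a character that should not be trimmed
    }
    i--; //this character is trimmable; move to the preceding one
  }
  return string.substring(0, i+1);
}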
The method signature will no longer clash with a trimEnd(String, String), which, as you would expect, will trim a given string from the end of the input string, not individual characters.
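To illustrate how the two overloads can sit side by side, here is a minimal sketch. The outer class TrimSketch and the stripped-down nested Characters stand-in are hypothetical, and the assumption that trimEnd(String, String) removes the suffix only once is mine.

```java
/** Illustrative only: a minimal stand-in for Characters, plus the two overloads. */
final class TrimSketch
{
  /** Minimal stand-in with only what this example needs; no sorting or bounds. */
  static final class Characters
  {
    private final char[] chars;

    Characters(final char... characters)
    {
      chars=characters.clone();
    }

    boolean contains(final char character)
    {
      for(final char c : chars)
      {
        if(c==character)
        {
          return true;
        }
      }
      return false;
    }
  }

  /** Trims any of the given characters from the end of the string. */
  static String trimEnd(final String string, final Characters characters)
  {
    int i=string.length()-1;
    while(i>=0 && characters.contains(string.charAt(i)))
    {
      i--;
    }
    return string.substring(0, i+1);
  }

  /** Assumption: trims the exact suffix, once, if the string ends with it. */
  static String trimEnd(final String string, final String suffix)
  {
    return string.endsWith(suffix) ? string.substring(0, string.length()-suffix.length()) : string;
  }

  public static void main(final String[] args)
  {
    System.out.println(trimEnd("singing", "ing")); //sing
    System.out.println(trimEnd("singing", new Characters('i', 'n', 'g'))); //s
  }
}
```

The argument type alone tells the reader which behavior to expect: a sequence to remove, or a set of characters to strip.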
We can now safely store our whitespace characters in the global definition without fear of it being modified:
public static final Characters WHITESPACE_CHARACTERS=new Characters(' ', '\t', '\r', '\n');
We can add all sorts of builder methods to Characters to assist in defining sets of characters. For instance, a Characters.add(char...) method would produce a new instance of Characters containing the additional supplied characters. The following example creates one set of characters to represent control characters, and then creates another instance containing all control characters as well as the space character:
public static final Characters CONTROL_CHARACTERS=new Characters('\t', '\r', '\n');
public static final Characters WHITESPACE_CHARACTERS=CONTROL_CHARACTERS.add(' ');
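One way such an add(char...) method might be written is sketched below. The outer class AddSketch and the simplified nested Characters are hypothetical stand-ins; the real method may be implemented differently.

```java
import java.util.Arrays;

/** Illustrative only: a simplified Characters with a non-mutating add(). */
final class AddSketch
{
  static final class Characters
  {
    private final char[] chars;

    Characters(final char... characters)
    {
      chars=characters.clone(); //the full class would also sort and deduplicate here
    }

    /** Returns a new instance with the extra characters; this instance is unchanged. */
    Characters add(final char... characters)
    {
      final char[] combined=Arrays.copyOf(chars, chars.length+characters.length);
      System.arraycopy(characters, 0, combined, chars.length, characters.length);
      return new Characters(combined);
    }

    boolean contains(final char character)
    {
      for(final char c : chars)
      {
        if(c==character)
        {
          return true;
        }
      }
      return false;
    }
  }

  static final Characters CONTROL_CHARACTERS=new Characters('\t', '\r', '\n');
  static final Characters WHITESPACE_CHARACTERS=CONTROL_CHARACTERS.add(' ');
}
```

Because add() builds a fresh instance instead of touching the existing array, CONTROL_CHARACTERS remains safe to share after WHITESPACE_CHARACTERS is derived from it.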
After creating the initial implementation of the Characters class, I started integrating it into my String manipulation functions and parsing routines. While I knew it was an implementation fit for its purpose, even I have been surprised at its usefulness, and at the extent to which it makes code more elegant and more efficient. If you need to pass around a "set of characters" as opposed to a sequence of letters, try the Characters class. The latest version of Characters is freely downloadable via Subversion, and is distributed under the open-source Apache License, Version 2.0. You'll probably want to grab the entire Maven-buildable globalmentor-core library source code while you're there.