Compare and sort Strings

Comparing and sorting Strings is trickier than it looks, and should be done with some care. This is particularly true when the text is displayed to the end user, or when working with localized text.

There are two fundamentally different ways of comparing strings:

simple Unicode ordering - used by String
localized ordering (the kind expected by an end user) - used by Collator

This causes problems, because:

the two styles disagree only occasionally, so a mistake can easily go unnoticed
it's easy to forget to make the distinction when it's needed

Commonly used String methods such as:

String.equalsIgnoreCase(String)
String.compareTo(String)

can be dangerous to use, depending on the context. The reason is that programmers tend to apply them to tasks they really aren't meant for, simply out of habit.
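
As a rough illustration of how these methods can surprise, here is a minimal sketch. The class and helper names are hypothetical and are not part of the original examples. It shows compareTo ordering text by numeric char value, and equalsIgnoreCase treating accented text as different, next to a Collator set to PRIMARY strength. Accented letters are written as Unicode escapes only so that the source compiles under any file encoding.

```java
import java.text.Collator;
import java.util.*;

/** Illustrative sketch: where String methods differ from what a user expects. */
final class StringMethodPitfalls {

  /** Hypothetical helper: sorted copy using String.compareTo (numeric char values). */
  static List<String> unicodeSorted(String... words) {
    List<String> result = new ArrayList<>(Arrays.asList(words));
    Collections.sort(result);
    return result;
  }

  /** Hypothetical helper: true if a Collator at PRIMARY strength ranks the texts as equal. */
  static boolean sameBaseLetters(String a, String b, Locale locale) {
    Collator collator = Collator.getInstance(locale);
    collator.setStrength(Collator.PRIMARY); // differences in accent and case are ignored
    return collator.equals(a, b);
  }

  public static void main(String... args) {
    // 'Z' is 90 and 'a' is 97, so compareTo returns 90 - 97
    System.out.println("Zebra".compareTo("apple")); // -7
    // e-acute (U+00E9, value 233) lands after 'z' (122)
    System.out.println(unicodeSorted("zebra", "Zebra", "apple", "\u00e9cole"));

    String accented = "R\u00e9sum\u00e9"; // "Resume" with two acute accents
    System.out.println("resume".equalsIgnoreCase("RESUME")); // true
    System.out.println("resume".equalsIgnoreCase(accented)); // false
    System.out.println(sameBaseLetters("resume", accented, Locale.ENGLISH));
  }
}
```

Whether PRIMARY strength is the right choice depends on the task: it also treats text differing only by accent as equal, which may or may not be what the user expects.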

The fundamental difference is that localized comparison depends on Locale, while String is largely ignorant of Locale. Here is a quote from The Java Programming Language by Arnold, Gosling, and Holmes:

"You should be aware that internationalization and localization issues of full Unicode strings are not addressed with [String] methods. For example, when you're comparing two strings to determine which is 'greater', characters in strings are compared numerically by their Unicode values, not by their localized notion of order."

The only robust way of doing localized comparison or sorting of Strings, in the manner expected by an end user, is to use a Collator, not the methods of the String class.
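
The difference can be seen by sorting the same words both ways. The following is only a sketch: the sortFor helper and the sample words are assumptions made for illustration, and the exact ordering produced by a Collator can depend on the locale data shipped with the JDK in use.

```java
import java.text.Collator;
import java.util.*;

/** Illustrative sketch: Unicode order versus a Collator for a given Locale. */
final class LocalizedSort {

  /** Hypothetical helper: returns a sorted copy, ordered by the rules of the given Locale. */
  static List<String> sortFor(Locale locale, List<String> words) {
    List<String> result = new ArrayList<>(words);
    // Collator implements Comparator, so it can be passed to any sort method
    result.sort(Collator.getInstance(locale));
    return result;
  }

  public static void main(String... args) {
    // zebre (grave accent), ete (acute accents), abricot, Eau
    List<String> words = Arrays.asList("z\u00e8bre", "\u00e9t\u00e9", "abricot", "Eau");

    List<String> unicode = new ArrayList<>(words);
    Collections.sort(unicode); // natural order of String: numeric char values
    System.out.println("Unicode order:   " + unicode);

    System.out.println("French Collator: " + sortFor(Locale.FRENCH, words));
  }
}
```

In Unicode order the capitalized word comes first and the word starting with an accented letter comes last, while the Collator places the words in the alphabetical order a reader would expect.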

Example 1 - Unicode Ordering

Here's an example of simple Unicode ordering of Strings, ignoring case. Note the use of String.CASE_INSENSITIVE_ORDER, an implementation of Comparator that takes no account of Locale.

Reminder - the following items are important with any form of comparison or sorting:

the Comparator and Comparable interfaces
the various sort methods of Collections and Arrays

import java.util.*;

/** Sorting Strings in Unicode order. */
public final class SortStringsNoLocale {

  public static void main(String... args){
    List<String> insects = Arrays.asList("Wasp", "ant", "", "Bee");
    log("Original:");
    log(insects);
    log("Sorted:");
    sortList(insects);
    log(insects);
    log("");

    Map<String,String> capitals = new LinkedHashMap<>();
    capitals.put("finland", "Helsinki");
    capitals.put("United States", "Washington");
    capitals.put("Mongolia", "Ulan Bator");
    capitals.put("Canada", "Ottawa");
    log("Original:");
    log(capitals);
    log("Sorted:");
    log(sortMapByKey(capitals));
  }

  private static void sortList(List<String> items){
    Collections.sort(items, String.CASE_INSENSITIVE_ORDER);
  }

  private static void log(Object thing){
    System.out.println(Objects.toString(thing)); 
  }

  private static Map<String, String> sortMapByKey(Map<String, String> items){
    TreeMap<String, String> result = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    result.putAll(items);
    return result;
  }
}

The class outputs the following:

Original:
[Wasp, ant, , Bee]
Sorted:
[, ant, Bee, Wasp]

Original:
{finland=Helsinki, United States=Washington, Mongolia=Ulan Bator, Canada=Ottawa}
Sorted:
{Canada=Ottawa, finland=Helsinki, Mongolia=Ulan Bator, United States=Washington}
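
For comparison, the same map can be keyed by a Collator, since a Collator is also a Comparator. This is a sketch under stated assumptions: the class name and the extra Locale parameter are hypothetical, and the Collator is left at its default strength, so keys differing only in case would remain distinct entries.

```java
import java.text.Collator;
import java.util.*;

/** Illustrative sketch: a TreeMap whose keys are ordered by a Collator. */
final class SortMapWithCollator {

  /** Hypothetical variant of sortMapByKey that orders keys by the rules of a Locale. */
  static Map<String, String> sortMapByKey(Map<String, String> items, Locale locale) {
    Collator collator = Collator.getInstance(locale);
    // Keys that the Collator ranks as equal would collapse into a single entry
    Map<String, String> result = new TreeMap<String, String>(collator);
    result.putAll(items);
    return result;
  }

  public static void main(String... args) {
    Map<String, String> capitals = new LinkedHashMap<>();
    capitals.put("finland", "Helsinki");
    capitals.put("United States", "Washington");
    capitals.put("Mongolia", "Ulan Bator");
    capitals.put("Canada", "Ottawa");
    System.out.println(sortMapByKey(capitals, Locale.ENGLISH));
  }
}
```

For this data the key order matches the case-insensitive result above. The two approaches start to differ once the keys contain accented or other locale-sensitive characters.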

Example 2 - Localized Ordering

Here's an example of using a Collator to perform localized sorting and comparison of Strings. Note the importance of Collator 'strength' for fine-tuning the comparison. To ignore case, for example, either PRIMARY or SECONDARY strength can be used.

package hirondelle.jp.util;

import java.text.Collator;
import java.util.*;

/** 
 Use Collator to sort and compare text.
*/
public final class SimpleCollator {

  /** Simple harness to exercise the code.  */
  public static void main (String... aArguments) {
    //This data is based on an example in Java Class Libraries, 
    //by Chan, Lee, and Kramer
    List<String> words = Arrays.asList(
      "Äbc", "äbc", "Àbc", "àbc", "Abc", "abc", "ABC"
    );
    
    log("Different 'Collation Strength' values give different sort results: ");
    log(words + " - Original Data");
    sort(words, Strength.Primary);
    sort(words, Strength.Secondary);
    sort(words, Strength.Tertiary);
    
    log(EMPTY_LINE);
    log("Case kicks in only with Tertiary Collation Strength  : ");
    List<String> wordsForCase = Arrays.asList("cache", "CACHE", "Cache");
    log(wordsForCase + " - Original Data");
    sort(wordsForCase, Strength.Primary);
    sort(wordsForCase, Strength.Secondary);
    sort(wordsForCase, Strength.Tertiary);
    
    log(EMPTY_LINE);
    log("Accents kick in with Secondary Collation Strength.");
    log("Compare with no accents present: ");
    compare("abc", "ABC", Strength.Primary);
    compare("abc", "ABC", Strength.Secondary);
    compare("abc", "ABC", Strength.Tertiary);
    
    log(EMPTY_LINE);
    log("Compare with accents present: ");
    compare("abc", "ÀBC", Strength.Primary);
    compare("abc", "ÀBC", Strength.Secondary);
    compare("abc", "ÀBC", Strength.Tertiary);
  }

  // PRIVATE //
  private static final String EMPTY_LINE = "";
  private static final Locale TEST_LOCALE = Locale.FRANCE;
  
  /** Transform some Collator 'int' consts into an equivalent enum. */
  private enum Strength {
    Primary(Collator.PRIMARY), //base char
    Secondary(Collator.SECONDARY), //base char + accent
    Tertiary(Collator.TERTIARY), // base char + accent + case
    Identical(Collator.IDENTICAL); //base char + accent + case + bits
    
    int getStrength() { return fStrength; }
    
    private final int fStrength;
    private Strength(int aStrength){
      fStrength = aStrength;
    }
  }
  
  private static void sort(List<String> aWords, Strength aStrength){
    Collator collator = Collator.getInstance(TEST_LOCALE);
    collator.setStrength(aStrength.getStrength());
    Collections.sort(aWords, collator);
    log(aWords.toString() + " " + aStrength);
  }
  
  private static void compare(String aThis, String aThat, Strength aStrength){
    Collator collator = Collator.getInstance(TEST_LOCALE);
    collator.setStrength(aStrength.getStrength());
    int comparison = collator.compare(aThis, aThat);
    if ( comparison == 0 ) {
      log("Collator sees them as the same : " + aThis + ", " + aThat + " - " + aStrength);
    }
    else {
      log("Collator sees them as DIFFERENT: " + aThis + ", " + aThat + " - " + aStrength);
    }
  }
  
  private static void log(String aMessage){
    System.out.println(aMessage);
  }
}

This class outputs the following:

Different 'Collation Strength' values give different sort results: 
[Äbc, äbc, Àbc, àbc, Abc, abc, ABC] - Original Data
[Äbc, äbc, Àbc, àbc, Abc, abc, ABC] Primary
[Abc, abc, ABC, Àbc, àbc, Äbc, äbc] Secondary
[abc, Abc, ABC, àbc, Àbc, äbc, Äbc] Tertiary

Case kicks in only with Tertiary Collation Strength  : 
[cache, CACHE, Cache] - Original Data
[cache, CACHE, Cache] Primary
[cache, CACHE, Cache] Secondary
[cache, Cache, CACHE] Tertiary

Accents kick in with Secondary Collation Strength.
Compare with no accents present: 
Collator sees them as the same : abc, ABC - Primary
Collator sees them as the same : abc, ABC - Secondary
Collator sees them as DIFFERENT: abc, ABC - Tertiary

Compare with accents present: 
Collator sees them as the same : abc, ÀBC - Primary
Collator sees them as DIFFERENT: abc, ÀBC - Secondary
Collator sees them as DIFFERENT: abc, ÀBC - Tertiary