2012-09-17

Removing accented characters

Here are two ways to replace accented characters with their unaccented equivalents in Java:

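A basic solution that probably performs poorly, since every replaceAll call compiles a regular expression and makes another full pass over the string: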
public static String removeAccents(String s) {
    s = s.replaceAll("[áàâãä]","a");
    s = s.replaceAll("[éèêë]","e");
    s = s.replaceAll("[íìîï]","i");
    s = s.replaceAll("[óòôõö]","o");
    s = s.replaceAll("[úùûü]","u");
    s = s.replaceAll("ç","c");

    s = s.replaceAll("[ÁÀÂÃÄ]","A");
    s = s.replaceAll("[ÉÈÊË]","E");
    s = s.replaceAll("[ÍÌÎÏ]","I");
    s = s.replaceAll("[ÓÒÔÕÖ]","O");
    s = s.replaceAll("[ÚÙÛÜ]","U");
    s = s.replaceAll("Ç","C");

    return s;
}

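Another one, using the JDK's java.text.Normalizer. It does not appear to work properly on the Mac implementation, so test carefully first. Note that the final replaceAll drops every remaining non-ASCII character, including letters that have no decomposition (such as ß), not just the accents.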

public static String removeAccentsAlt(String s) {
    String temp = java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFD);
    return temp.replaceAll("[^\\p{ASCII}]","");
}
credits: http://www.rgagnon.com/javadetails/java-0456.html
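
If dropping all non-ASCII characters is too aggressive, a variant is to delete only the combining marks that NFD decomposition produces. This is a minimal sketch, not part of the credited source: the class name AccentStripper and the method name stripMarks are made up for illustration, and the string literals use \u escapes so the file compiles regardless of source encoding.

```java
import java.text.Normalizer;

final class AccentStripper {

    // Decompose to NFD, then delete only combining marks (Unicode category M).
    // Characters without a canonical decomposition are left untouched.
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "Creme brulee, Sao Paulo" written with its accents as \u escapes
        System.out.println(stripMarks("Cr\u00e8me br\u00fbl\u00e9e, S\u00e3o Paulo"));
        // German "Strasse" spelled with sharp s (\u00df): printed unchanged
        System.out.println(stripMarks("Stra\u00dfe"));
    }
}
```

The first line printed is "Creme brulee, Sao Paulo". The second string comes back unchanged because ß has no decomposition, whereas the ASCII filter above would delete the ß entirely.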

For comparing text without modifying it, there is java.text.Collator. At PRIMARY strength, accent and case differences are ignored in most locales:

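// requires: import java.text.Collator; locale is any java.util.Locale, e.g. Locale.FRENCH
Collator collator = Collator.getInstance(locale);
collator.setStrength(Collator.PRIMARY);                  // only base letters count
collator.setDecomposition(Collator.FULL_DECOMPOSITION);  // treat decomposed variants alike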
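
To show how such a collator might be used, here is a small sketch that wraps the settings above in a helper. The class name AccentInsensitiveCompare and the method sameIgnoringAccents are hypothetical names chosen for this example, and Locale.ENGLISH is just an assumed locale; in some locales an accented letter is a distinct letter of the alphabet and will not compare equal to its base letter.

```java
import java.text.Collator;
import java.util.Locale;

final class AccentInsensitiveCompare {

    // Returns true when the two strings differ at most by accents or case,
    // according to the collation rules of the given locale.
    static boolean sameIgnoringAccents(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        collator.setDecomposition(Collator.FULL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "cote" with a circumflex on the o (\u00f4) versus plain "cote"
        System.out.println(sameIgnoringAccents("c\u00f4te", "cote", Locale.ENGLISH));
        // Different base letters are still a primary difference
        System.out.println(sameIgnoringAccents("cote", "coat", Locale.ENGLISH));
    }
}
```

The first call prints true and the second prints false. Because the strings themselves are never altered, this approach suits searching and sorting better than producing cleaned-up output.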
