Organizational Research By

Surprising Reserch Topic

is there a way to get rid of accents and convert a whole string to regular lette

is there a way to get rid of accents and convert a whole string to regular lette  using -'java,string,diacritics'

Is there a better way for getting rid of accents and making those letters regular apart from using String.replaceAll() method and replacing letters one by one?

Input:  orčpžsíáýd

Output: orcpzsiayd

It doesn't need to include all letters with accents like the Russian alphabet or the Chinese one.

asked Oct 19, 2015 by sandeep bhadauria
0 votes

Related Hot Questions

7 Answers

0 votes

Use java.text.Normalizer to handle this for you.

string = Normalizer.normalize(string, Normalizer.Form.NFD);

This will separate all of the accent marks from the characters. Then, you just need to compare each character against being a letter and throw out the ones that aren't.

string = string.replaceAll("[^\\p{ASCII}]", "");

If your text is in unicode, you should use this instead:

string = string.replaceAll("\\p{M}", "");

For unicode, \\P{M} matches the base glyph and \\p{M} (lowercase) matches each accent.

Thanks to GarretWilson for the pointer and for the great unicode guide.

answered Oct 19, 2015 by sachin valanju
0 votes

The solution by @virgo47 is very fast, but approximate. The accepted answer uses Normalizer and a regular expression. I wondered what part of the time was taken by Normalizer versus the regular expression, since removing all the non-ASCII characters can be done without a regex:

import java.text.Normalizer;

public class Strip {
    public static String flattenToAscii(String string) {
        StringBuilder sb = new StringBuilder(string.length());
        string = Normalizer.normalize(string, Normalizer.Form.NFD);
        for (char c : string.toCharArray()) {
            if (c <= '\u007F') sb.append(c);
        return sb.toString();

Small additional speed-ups can be obtained by writing into a char[] and not calling toCharArray(), although I'm not sure that the decrease in code clarity merits it:

public static String flattenToAscii(String string) {
    char[] out = new char[string.length()];
    string = Normalizer.normalize(string, Normalizer.Form.NFD);
    int j = 0;
    for (int i = 0, n = string.length(); i < n; ++i) {
        char c = string.charAt(i);
        if (c <= '\u007F') out[j++] = c;
    return new String(out);

This variation has the advantage of the correctness of the one using Normalizer and some of the speed of the one using a table. On my machine, this one is about 4x faster than the accepted answer, and 6.6x to 7x slower that @virgo47's (the accepted answer is about 26x slower than @virgo47's on my machine).

answered Oct 19, 2015 by patilkiran.101
0 votes

EDIT: If you're not stuck with Java <6 and speed is not critical and/or translation table is too limiting, use answer by David. The point is to use Normalizer (introduced in Java 6) instead of translation table inside the loop.

While this is not "perfect" solution, it works well when you know the range (in our case Latin1,2), worked before Java 6 (not a real issue though) and is much faster than the most suggested version (may or may not be an issue):

 * Mirror of the unicode table from 00c0 to 017f without diacritics.
private static final String tab00c0 = "AAAAAAACEEEEIIII" +
    "DNOOOOO\u00d7\u00d8UUUUYI\u00df" +
    "aaaaaaaceeeeiiii" +
    "\u00f0nooooo\u00f7\u00f8uuuuy\u00fey" +
    "AaAaAaCcCcCcCcDd" +
    "DdEeEeEeEeEeGgGg" +
    "GgGgHhHhIiIiIiIi" +
    "IiJjJjKkkLlLlLlL" +
    "lLlNnNnNnnNnOoOo" +
    "OoOoRrRrRrSsSsSs" +
    "SsTtTtTtUuUuUuUu" +

 * Returns string without diacritics - 7 bit approximation.
 * @param source string to convert
 * @return corresponding string without diacritics
public static String removeDiacritic(String source) {
    char[] vysl = new char[source.length()];
    char one;
    for (int i = 0; i < source.length(); i++) {
        one = source.charAt(i);
        if (one >= '\u00c0' && one <= '\u017f') {
            one = tab00c0.charAt((int) one - '\u00c0');
        vysl[i] = one;
    return new String(vysl);

Tests on my HW with 32bit JDK show that this performs conversion from àèéľšťč89FDČ to aeelstc89FDC 1 million times in ~100ms while Normalizer way makes it in 3.7s (37x slower). In case your needs are around performance and you know the input range, this may be for you.

Enjoy :-)

answered Oct 19, 2015 by kotmus2002
0 votes

As of 2011 you can use Apache Commons StringUtils.stripAccents(input) (since 3.0):

    String input = StringUtils.stripAccents("Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ");
    // Prints "This is a funky String"
answered Oct 19, 2015 by ukohale
0 votes
System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));

worked for me. The output of the snippet above gives "aee" which is what I wanted, but

System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", ""));

didn't do any substitution.

answered Oct 19, 2015 by sachin valanju
0 votes

Depending on the language, those might not be considered accents (which change the sound of the letter), but diacritical marks

"Bosnian and Croatian have the symbols č, ć, đ, š and ž, which are considered separate letters and are listed as such in dictionaries and other contexts in which words are listed according to alphabetical order."

Removing them might be inherently changing the meaning of the word, or changing the letters into completely different ones.

answered Oct 19, 2015 by patelnikul321
0 votes

@David Conrad solution is the fastest I tried using the Normalizer, but it does have a bug. It basically strips characters which are not accents, for example Chinese characters and other letters like æ, are all stripped. The characters that we want to strip are non spacing marks, characters which don't take up extra width in the final string. These zero width characters basically end up combined in some other character. If you can see them isolated as a character, for example like this `, my guess is that it's combined with the space character.

public static String flattenToAscii(String string) {
    char[] out = new char[string.length()];
    String norm = Normalizer.normalize(string, Normalizer.Form.NFD);

    int j = 0;
    for (int i = 0, n = norm.length(); i < n; ++i) {
        char c = norm.charAt(i);
        int type = Character.getType(c);

        //by Ricardo, modified the character check for accents, ref: a/5697575/689223
        if (type != Character.NON_SPACING_MARK){
            out[j] = c;
    //Log.d(TAG,"normalized string:"+norm+"/"+new String(out));
    return new String(out);
answered Oct 19, 2015 by rajnipancholi