PHP: convert a Unicode code point to UTF-8



I have my data in this format: U+597D or U+6211. I want to convert these code points to UTF-8 (the original characters are 好 and 我). How can I do it?
    

asked Sep 29, 2015 by android_master
0 votes
27 views




5 Answers

0 votes
$utf8string = html_entity_decode(preg_replace("/U\+([0-9A-F]{4,6})/", "&#x\\1;", $string), ENT_NOQUOTES, 'UTF-8');

is probably the simplest solution. The {4,6} quantifier also covers code points above U+FFFF, which need more than four hex digits.
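
On PHP 7.2 or later, where mb_chr() is built in, a similar replacement can probably be done without the detour through HTML entities. This is only a minimal sketch, assuming the mbstring extension is loaded; the helper name decode_u_plus() is made up for illustration and is not part of the answer above:

```php
<?php
// Hypothetical helper: replaces every U+XXXX token (4 to 6 hex digits)
// in $text with the UTF-8 encoding of that code point.
function decode_u_plus(string $text): string
{
    return preg_replace_callback(
        '/U\+([0-9A-Fa-f]{4,6})/',
        function (array $m): string {
            // mb_chr() returns false for code points it cannot encode
            // (for example surrogates); such tokens are left unchanged.
            $char = mb_chr(hexdec($m[1]), 'UTF-8');
            return $char === false ? $m[0] : $char;
        },
        $text
    );
}

echo decode_u_plus('U+597D U+6211'), "\n";
```

Because the pattern is greedy, a token that is immediately followed by further hex digits (such as "U+597DEF") would be read as one longer code point, so this assumes the tokens are clearly delimited in the input.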

answered Sep 29, 2015 by yogeshplv
0 votes
// Encode a Unicode code point as a UTF-8 byte string (counterpart of chr()).
function utf8($num)
{
    if($num<=0x7F)       return chr($num);
    if($num<=0x7FF)      return chr(($num>>6)+192).chr(($num&63)+128);
    if($num<=0xFFFF)     return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
    // 0x10FFFF is the highest valid Unicode code point
    if($num<=0x10FFFF)   return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128).chr(($num&63)+128);
    return '';
}

// Decode the first UTF-8 character of $c to its code point (counterpart of ord()).
// Uses $c[0] rather than $c{0}: the curly-brace offset syntax was removed in PHP 8.0.
function uniord($c)
{
    $ord0 = ord($c[0]); if ($ord0>=0   && $ord0<=127) return $ord0;
    $ord1 = ord($c[1]); if ($ord0>=192 && $ord0<=223) return ($ord0-192)*64 + ($ord1-128);
    $ord2 = ord($c[2]); if ($ord0>=224 && $ord0<=239) return ($ord0-224)*4096 + ($ord1-128)*64 + ($ord2-128);
    $ord3 = ord($c[3]); if ($ord0>=240 && $ord0<=247) return ($ord0-240)*262144 + ($ord1-128)*4096 + ($ord2-128)*64 + ($ord3-128);
    return false;
}

utf8() and uniord() try to mirror PHP's chr() and ord() functions:

echo utf8(0x6211)."\n";
echo uniord(utf8(0x6211))."\n";
echo "U+".dechex(uniord(utf8(0x6211)))."\n";

//In your case:
$wo='U+6211';
$hao='U+597D';
echo utf8(hexdec(str_replace("U+","", $wo)))."\n";
echo utf8(hexdec(str_replace("U+","", $hao)))."\n";

output:

我
25105
U+6211
我
好
answered Sep 29, 2015 by amit.gupta
0 votes
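echo mb_convert_encoding(
    preg_replace("/U\+([0-9A-F]+)/"
        ,"&#x\\1;"
        ,'U+597DU+6211'
    )
    ,"UTF-8"
    ,"HTML-ENTITIES"
);

works fine, too. Note that the "HTML-ENTITIES" encoding is deprecated as of PHP 8.2, so this variant is best suited to older PHP versions.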

answered Sep 29, 2015 by vijaygupta1980
0 votes

With the aid of the table at

http://en.wikipedia.org/wiki/UTF-8#Description

it can't be simpler :)

Simply mask the bits of the Unicode code point according to the range it falls in.
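
As a worked illustration of that table: U+6211 falls in the range U+0800 to U+FFFF, so its 16 bits are split 4 + 6 + 6 across the pattern 1110xxxx 10xxxxxx 10xxxxxx. A small sketch of just that three-byte case (the variable names are arbitrary):

```php
<?php
$cp = 0x6211; // binary 0110 0010 0001 0001

// Three-byte pattern: 1110xxxx 10xxxxxx 10xxxxxx
$b1 = 0xE0 | ($cp >> 12);          // top 4 bits
$b2 = 0x80 | (($cp >> 6) & 0x3F);  // middle 6 bits
$b3 = 0x80 | ($cp & 0x3F);         // low 6 bits

printf("%02X %02X %02X\n", $b1, $b2, $b3); // E6 88 91
echo chr($b1) . chr($b2) . chr($b3), "\n";  // 我
```

The other ranges work the same way, only with a different lead-byte prefix and a different number of 10xxxxxx continuation bytes.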

answered Sep 29, 2015 by kinnari
0 votes

I just wrote a polyfill for the missing multibyte versions of ord and chr, with the following in mind:

  • It defines the functions mb_ord and mb_chr only if they don't already exist. If they do exist in your framework or your PHP version (PHP 7.2 and later ship them natively), the polyfill is ignored.

  • It uses the widely used mbstring extension to do the conversion. If the mbstring extension is not loaded, it uses the iconv extension instead.

I also added functions for encoding to and decoding from HTML entities and JSON format, as well as some demo code showing how to use these functions.


Code :

if (!function_exists('codepoint_encode')) {

    function codepoint_encode($str) {
        return substr(json_encode($str), 1, -1);
    }

}

if (!function_exists('codepoint_decode')) {

    function codepoint_decode($str) {
        return json_decode(sprintf('"%s"', $str));
    }

}

if (!function_exists('mb_internal_encoding')) {

    function mb_internal_encoding($encoding = NULL) {
        // iconv fallback: read or set iconv's internal encoding
        return ($encoding === NULL)
            ? iconv_get_encoding('internal_encoding')
            : iconv_set_encoding('internal_encoding', $encoding);
    }

}

if (!function_exists('mb_convert_encoding')) {

    function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
        return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
    }

}

if (!function_exists('mb_chr')) {

    function mb_chr($ord, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            return pack("N", $ord);
        } else {
            return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
        }
    }

}

if (!function_exists('mb_ord')) {

    function mb_ord($char, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
            return $ord;
        } else {
            return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
        }
    }

}

if (!function_exists('mb_htmlentities')) {

    function mb_htmlentities($string, $hex = true, $encoding = 'UTF-8') {
        return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) use ($hex) {
            return sprintf($hex ? '&#x%X;' : '&#%d;', mb_ord($match[0]));
        }, $string);
    }

}

if (!function_exists('mb_html_entity_decode')) {

    function mb_html_entity_decode($string, $flags = null, $encoding = 'UTF-8') {
        return html_entity_decode($string, ($flags === NULL) ? ENT_COMPAT | ENT_HTML401 : $flags, $encoding);
    }

}

How to use :

echo "\nGet string from numeric DEC value\n";
var_dump(mb_chr(25105));
var_dump(mb_chr(22909));

echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0x6211));
var_dump(mb_chr(0x597D));

echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('我'));
var_dump(mb_ord('好'));

echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('我')));
var_dump(dechex(mb_ord('好')));

echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('我好', false));
var_dump(mb_html_entity_decode('&#25105;&#22909;'));

echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('我好'));
var_dump(mb_html_entity_decode('&#x6211;&#x597D;'));

echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("我好"));
var_dump(codepoint_decode('\u6211\u597d'));

Output :

Get string from numeric DEC value
string(3) "我"
string(3) "好"

Get string from numeric HEX value
string(3) "我"
string(3) "好"

Get numeric value of character as DEC int
int(25105)
int(22909)

Get numeric value of character as HEX string
string(4) "6211"
string(4) "597d"

Encode / decode to DEC based HTML entities
string(16) "&#25105;&#22909;"
string(6) "我好"

Encode / decode to HEX based HTML entities
string(16) "&#x6211;&#x597D;"
string(6) "我好"

Use JSON encoding / decoding
string(12) "\u6211\u597d"
string(6) "我好"
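
The JSON helpers expect \uXXXX escapes rather than the U+XXXX notation from the question. One way to bridge the two is sketched here, under the assumption that the input consists only of U+XXXX tokens with exactly four hex digits and contains no quotes or backslashes; the variable names are illustrative:

```php
<?php
$input = 'U+6211U+597D';

// Rewrite each U+XXXX token as a JSON \uXXXX escape.
// '\\\\u$1' is the PHP literal for the replacement \\u$1,
// which preg_replace() turns into a single backslash plus uXXXX.
$escaped = preg_replace('/U\+([0-9A-Fa-f]{4})/', '\\\\u$1', $input);

// Wrap the result in double quotes so it is a valid JSON string, then decode it.
$decoded = json_decode('"' . $escaped . '"');

var_dump($escaped);
var_dump($decoded);
```

Code points above U+FFFF would have to be written as surrogate pairs in JSON, so for those the other approaches in this thread are likely a better fit.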


answered Sep 29, 2015 by jekbishnoi

...