Wednesday, September 12, 2018

Removing Symbols from a UTF-8 String and Retaining the Characters in PHP

You will often have strings that contain symbols, when all you want is the text itself.

Using a regular expression (regex) in PHP, you can do this quite easily:

$string = "12345-hello, 汉语!!!";
$string = preg_replace('/[^\p{L}\p{N}\s]/u', '', $string);
print($string); // output is: 12345hello 汉语
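If you need this in more than one place, the call can be wrapped in a small function. The following is only a sketch: the name strip_symbols() is made up for illustration, and throwing an exception on failure is one possible choice, not the only one. It relies on the fact that preg_replace() returns null when the input is not valid UTF-8 and the u modifier is in use.

```php
<?php
// Hypothetical helper; the name strip_symbols() is an assumption for this example.
function strip_symbols(string $text): string
{
    // Remove every character that is not a letter, a number, or whitespace.
    $result = preg_replace('/[^\p{L}\p{N}\s]/u', '', $text);

    // With the u modifier, preg_replace() returns null on malformed UTF-8 input.
    if ($result === null) {
        throw new InvalidArgumentException(
            'preg_replace failed, error code: ' . preg_last_error()
        );
    }

    return $result;
}

print(strip_symbols("12345-hello, 汉语!!!")); // output is: 12345hello 汉语
```

Checking for null matters because an unchecked null would otherwise be treated as an empty string further down the line.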

Each part of the preg_replace here can be explained without much effort.

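/regex/u - The two slashes are delimiters that enclose the pattern. The trailing lowercase u is a modifier that tells PHP to treat both the pattern and the string as UTF-8. (An uppercase U is a different modifier, which makes quantifiers ungreedy.)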

[class] - The square brackets define a single character class: a set of characters, any one of which can match at a given position.

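[^values] - A caret placed first inside the brackets negates the class, so we are looking for characters that are not in values. No parentheses are needed for grouping here; inside a character class, ( and ) are ordinary literal characters.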

\p{L} - This matches any Unicode letter, in any script.

\p{N} - This matches any Unicode numeric character, not just the ASCII digits 0-9.

\s - This matches any whitespace character (space, tab, newline, and so on).

So, anything that isn't a letter, a number, or whitespace (i.e., anything matched by /[^\p{L}\p{N}\s]/u) is replaced with '', the empty string.
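One detail is worth seeing in action: parentheses inside a character class are literal characters, so wrapping the class contents in ( and ) adds both of them to the set of characters that are kept. The short comparison below is a sketch with a made-up input string and made-up variable names.

```php
<?php
// Sample input chosen for illustration only.
$input = "(hello) [world] 汉语!";

// Parentheses inside the class are literal, so ( and ) survive the replacement.
$with_parens = preg_replace('/[^(\p{L}\p{N}\s)]/u', '', $input);

// Without them, only letters, numbers, and whitespace survive.
$without_parens = preg_replace('/[^\p{L}\p{N}\s]/u', '', $input);

print($with_parens . "\n");    // output is: (hello) world 汉语
print($without_parens . "\n"); // output is: hello world 汉语
```

If you actually do want to keep certain symbols, listing them inside the brackets like this is the way to do it.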
