PHP Revolution: Removing Symbols from a UTF-8 String and Retaining the Characters in PHP

Wednesday, September 12, 2018

You will often have strings that contain symbols, where you only want to extract just the text component of this string.

Using a regular expression (regex) in PHP, you can do this quite easily...

$string = "12345-hello, 汉语!!!"; $string = preg_replace('/[^(\p{L}\p{N}\s)]/u', '', $string); print($string); // output is: 12345hello 汉语

Each part of the preg_replace here can be explained without much effort.

/regex/u - The two slashes indicate what is matched against. The U indicates UTF-8.

[class] - Within this, the two brackets indicate the one single class of values considered valid.

^(values) - We are looking for things that do not match values.

\p{L} - This value is all UTF-8 characters.

\p{N} - This value is all numbers.

\s - This value is all white spacing.

So, anything that doesn't match a letter, a number, or a space (i.e., /[^(\p{L}\p{N}\s)]/), is replaced with '', or the empty string.

PHP Revolution