Showing posts with label utf8. Show all posts

Monday, September 17, 2018

Removing Windows ^M Characters From Linux Vim Editor

If you open a document in Vim and see it littered with ^M characters, the file was most likely saved with Windows-style line endings (a carriage return before each line feed), typically by an editor on Microsoft Windows.

But the problem (the ^M characters, not Microsoft) is easily fixed!

Simply typing this with the caret (^) and M keys will not be enough...

:%s/^M//g

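The ^M must be entered as a single control character: press Ctrl+V, and then press Ctrl+M. Alternatively, \r in a Vim search pattern matches the same carriage return, so :%s/\r//g does the same job without the special key sequence.

Once that command is executed with the correct control-key sequence, you will have a document freshly cleaned of ^M characters.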
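
If you would rather script the cleanup than do it in the editor, the same idea fits in a few lines of PHP. This is only a minimal sketch: the function name strip_carriage_returns is made up for illustration, and it assumes the text is already loaded into a string.

```php
<?php
// Hypothetical helper: remove every carriage return (the byte Vim shows as ^M).
function strip_carriage_returns(string $text): string
{
    return str_replace("\r", '', $text);
}

$windowsText = "first line\r\nsecond line\r\n";
$unixText = strip_carriage_returns($windowsText);

var_dump($unixText === "first line\nsecond line\n"); // bool(true)
```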

Wednesday, September 12, 2018

Removing Symbols from a UTF-8 String and Retaining the Characters in PHP

You will often have strings that contain symbols when all you want is the text itself.

Using a regular expression (regex) in PHP, you can do this quite easily...

$string = "12345-hello, 汉语!!!";
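$string = preg_replace('/[^\p{L}\p{N}\s]/u', '', $string);
print($string); // output is: 12345hello 汉语

Each part of the preg_replace here can be explained without much effort.

/regex/u - The two slashes delimit the pattern. The trailing lowercase u modifier tells PHP to treat both the pattern and the subject string as UTF-8.

[class] - The two brackets define one single character class, which matches any one character from the set listed inside.

[^values] - A caret directly after the opening bracket negates the class, so it matches any character that is not one of the values. Do not wrap the values in parentheses: inside a character class, ( and ) are literal characters, so they would simply join the set of characters that are kept.

\p{L} - This value is any Unicode letter, in any script.

\p{N} - This value is any Unicode number character.

\s - This value is any whitespace character.

So, anything that isn't a letter, a number, or whitespace (i.e., anything matching /[^\p{L}\p{N}\s]/u) is replaced with '', the empty string.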
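
To reuse this, the pattern can be wrapped in a small function. The sketch below is one way to do it, not the only one: the name strip_symbols is hypothetical, and the character class is written without parentheses, because parentheses do no grouping inside a class and would be treated as literal characters to keep. It also checks for null, which preg_replace returns when the input is not valid UTF-8.

```php
<?php
// Hypothetical helper: keep only Unicode letters, numbers, and whitespace.
function strip_symbols(string $text): string
{
    // The class lists the three things to keep; the leading ^ negates it.
    $clean = preg_replace('/[^\p{L}\p{N}\s]/u', '', $text);

    // preg_replace() returns null on failure, e.g. when $text is not valid UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $clean;
}

print(strip_symbols("12345-hello, 汉语!!!") . "\n"); // 12345hello 汉语
print(strip_symbols("(hi) 汉语?") . "\n");            // hi 汉语
```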

Tuesday, September 11, 2018

Converting Accented Characters to Non-Accented Equivalents in PHP

For some reason or other, you may want to convert a string of accented characters to its unaccented equivalent, such as "hëllo" to "hello".

Doing this in PHP is easy. You will have some input like this...

$input = "hëllo";

Then you can convert and see the output like this...

$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);
print($output);

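Output here is "hello", and our problem is solved very easily! Keep in mind that //TRANSLIT results depend on the system's iconv implementation and the current locale, so check the output on your own platform.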

Of course, for the most part, you should not need to do this. Accepting the entirety of UTF-8's range in diversity and color is usually a good thing for any code solution.
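
If iconv gives different results on your system, a different technique is Unicode normalization: decompose each accented character into a base letter plus combining marks, then remove the marks. The sketch below swaps that in for iconv. It assumes the intl extension (which provides the Normalizer class) is installed, and the function name strip_accents is hypothetical. Unlike //TRANSLIT, it only removes accents; it does not turn non-Latin letters into ASCII.

```php
<?php
// Hypothetical helper; assumes the intl extension (Normalizer class) is loaded.
function strip_accents(string $text): string
{
    // NFD splits "ë" into a base "e" plus a combining diaeresis (U+0308).
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    // \p{Mn} matches nonspacing combining marks; dropping them leaves the base letters.
    return preg_replace('/\p{Mn}/u', '', $decomposed);
}

if (class_exists('Normalizer')) {
    print(strip_accents("hëllo") . "\n"); // hello
}
```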

Friday, August 24, 2018

How to Detect the Character Set Encoding of a MySQL Database

You may want to know if your MySQL DB is encoded in latin1 or UTF-8, or maybe some other charset.

You can see the default character set encoding and collation with these commands...

SELECT @@character_set_database;
SELECT @@collation_database;

But this is just for the current database. There are various other default encodings and collations at different levels, which you will only be able to see with commands like these...

show variables like 'char%';
show variables like 'collation%';

If you want to see the encoding for a particular table, you can find its encoding with...

SHOW CREATE TABLE MyTable;

To see ONLY the encoding of a particular column in a table, you can search the INFORMATION_SCHEMA tables...

SELECT CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.columns
WHERE TABLE_SCHEMA = 'MySchema'
AND TABLE_NAME = 'MyTable'
AND COLUMN_NAME = 'MyField';
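
MySQL collation names begin with the name of their character set (latin1_swedish_ci belongs to latin1, for example), so a collation value is enough to tell you the encoding. A minimal PHP sketch of reading it off, assuming you already have the collation as a string; the function name charset_from_collation is hypothetical.

```php
<?php
// Hypothetical helper: derive the character set from a MySQL collation name.
function charset_from_collation(string $collation): string
{
    // Everything before the first underscore is the character set name.
    $prefix = strstr($collation, '_', true);

    // The "binary" collation has no underscore; it is its own character set.
    return $prefix === false ? $collation : $prefix;
}

print(charset_from_collation('latin1_swedish_ci') . "\n");  // latin1
print(charset_from_collation('utf8mb4_unicode_ci') . "\n"); // utf8mb4
```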

Thursday, August 23, 2018

Full UTF-8 Web Application Development - How to Do It Right

With UTF-8 support, text in virtually any language will work with your code. But there are two fundamental things that you need to make sure happen:

First, convert your database to UTF-8. This is the data part of it.

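Converting your database requires changing the default settings and converting the tables themselves. Note that MySQL's utf8 character set stores at most three bytes per character, so it cannot hold every Unicode character (emoji, for example). The utf8mb4 character set, available since MySQL 5.5.3, is the full UTF-8 encoding, so it is used here. You can change the default settings with...

MyDB@sys mysql> ALTER DATABASE MyDatabase DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_general_ci;

And you can convert your tables with the following. The first statement changes the table's default for new columns, and the second converts an existing column...

MyDB@sys mysql> ALTER TABLE MyTable DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
MyDB@sys mysql> ALTER TABLE MyTable MODIFY MyColumn VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL DEFAULT '';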

Second, send all web pages with a UTF-8 header. This is the networking part of it.

For example, in PHP, to send the UTF-8 header, run this command before sending any output to the browser...

header('Content-type: text/html; charset=utf-8');
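
Because header() only works before any output has been sent, it can help to guard the call. Here is a minimal sketch of that, assuming a plain PHP page; the helper name content_type_header is made up for illustration.

```php
<?php
// Hypothetical helper that builds the Content-Type header line.
function content_type_header(string $charset = 'utf-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() must run before any output, so check headers_sent() first.
if (!headers_sent()) {
    header(content_type_header());
}

// From here on, the page body is sent as UTF-8 text.
echo "<p>hëllo 汉语</p>\n";
```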

Every program, protocol, etc., has its exceptions and variations, so these are just the two basics you need to handle.