The One Thing You Forget When Internationalizing Your Application

The hidden bugs waiting to be found by your international users

While internationalizing our applications, we focus on the things we can see: text, tooltips, error messages, and the like. But hidden in our code are places that also require internationalization, and they tend to be missed until our international users find them and report them as bugs.

Here’s a big one: regular expressions. You likely use these handy, flexible, programming features to parse text entered by users. If your regular expressions are not internationalized, more specifically, if they are not written to handle Unicode characters, they will fail in subtle ways.

Here’s an example: imagine a commenting system in your application that allows users to type at-mentions of other users or user groups. People who are at-mentioned are notified that the comment needs their attention. Your system may require an at-mention to follow a fixed format, such as an ‘@’ immediately followed by the username.

Writing a regular expression to find and parse the usernames out of these strings is the most direct way for handling this. In Java, JavaScript, and other languages, the regular expression might look like this:
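
As a minimal sketch in Java, assuming the username rules spelled out in the next paragraph, such a pattern could be written as follows. The class name `AsciiMentions`, the method `findMentions`, and the sample comment text are illustrative assumptions, not taken from the original application.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AsciiMentions {
    // Assumed reconstruction: '@', one ASCII letter or digit, one or more
    // ASCII letters, digits, dots, underscores, or dashes, then a final
    // ASCII letter or digit. The parentheses capture the username.
    static final Pattern MENTION =
        Pattern.compile("@([A-Za-z0-9][A-Za-z0-9._-]+[A-Za-z0-9])");

    // Returns the captured username (without the '@') for every mention found.
    static List<String> findMentions(String comment) {
        List<String> names = new ArrayList<>();
        Matcher m = MENTION.matcher(comment);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical ASCII-only comment text.
        System.out.println(findMentions("Thanks @david and @dev-team, see @j.smith_2"));
    }
}
```

For this ASCII-only input the sketch returns `david`, `dev-team`, and `j.smith_2`.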

This expression specifies that we’re looking for an ‘@’ followed by a letter or number, followed by one or more letters, numbers, dashes, underscores, or dots, and ending with a letter or number. The parentheses tell the expression to capture this string and return it to us.

We can test it using the regex101 tester:

So that regex works great! But now let’s test it against some comment text containing Unicode characters:

“This comment mentions @Adriàn, @François, @Noël, @David, and @ひなた”

Unicode characters are not matched, so we either get incomplete usernames or no username at all.
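
The failure can be reproduced with a short Java sketch. It assumes an ASCII-only pattern built from the rules described earlier, and it assumes the à in the sample comment is stored as a plain a followed by a combining grave accent while ç and ë are stored as single characters. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AsciiMentionsOnUnicode {
    // Assumed ASCII-only pattern: only A-Z, a-z, and 0-9 count as letters or digits.
    static final Pattern MENTION =
        Pattern.compile("@([A-Za-z0-9][A-Za-z0-9._-]+[A-Za-z0-9])");

    // The sample comment, written with escapes so the encoding is explicit:
    // a + U+0300 (combining grave), U+00E7 (c with cedilla),
    // U+00EB (e with diaeresis), and three hiragana characters.
    static final String COMMENT =
        "This comment mentions @Adria\u0300n, @Fran\u00e7ois, @No\u00ebl, "
        + "@David, and @\u3072\u306a\u305f";

    static List<String> findMentions(String comment) {
        List<String> names = new ArrayList<>();
        Matcher m = MENTION.matcher(comment);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(findMentions(COMMENT));
    }
}
```

Under these assumptions only three names come back: `Adria` and `Fran` are cut off at the first non-ASCII character, `David` is intact, and the other two mentions are not found at all.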

The solution:

“Unicode is a character set that aims to define all characters and glyphs from all human languages, living and dead.”

http://www.regular-expressions.info/unicode.html

It would seem incredibly difficult to write a regular expression that covers the Unicode mission quoted above, but it’s fairly straightforward. To match a single Unicode letter (any character in the Unicode Letter category), we use the \p{L} notation.

Updating our regex to use this Unicode friendly notation for letters, we get:
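
One way the updated pattern might look in Java is sketched below. It assumes the same username rules as before, with every ASCII letter range swapped for \p{L}; digits are still ASCII-only at this stage. The class name, method name, and escaped sample string are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeLetterMentions {
    // \p{L} matches any code point in the Unicode Letter category.
    // Backslashes are doubled because this is a Java string literal.
    static final Pattern MENTION =
        Pattern.compile("@([\\p{L}0-9][\\p{L}0-9._-]+[\\p{L}0-9])");

    // Same sample comment as in the article; the accent in "Adrian" is
    // assumed to be stored as a separate combining mark (U+0300).
    static final String COMMENT =
        "This comment mentions @Adria\u0300n, @Fran\u00e7ois, @No\u00ebl, "
        + "@David, and @\u3072\u306a\u305f";

    static List<String> findMentions(String comment) {
        List<String> names = new ArrayList<>();
        Matcher m = MENTION.matcher(comment);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(findMentions(COMMENT));
    }
}
```

With this sketch, François, Noël, David, and ひなた are captured in full, while the first mention still stops at `Adria`.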

Let’s try it out in the regex101 tester:

Close! But @Adriàn is not getting fully parsed. In fact, the string returned from the capture group is ‘Adria’, so we’ve got an incomplete username and lost the grave accent over the a. What’s going on?

To understand this, let’s take a look at how single characters rendered on a screen or page are represented in Unicode. The à is actually two Unicode characters, U+0061 representing the a and U+0300 representing the grave accent above the a. The grave accent is a combining mark. A character can be followed by any number of combining marks which will be assembled together when rendered.
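
The two-character representation can be inspected directly. The following Java sketch uses the standard `java.text.Normalizer` class to split a single-character à into its base letter and combining mark, then checks each character's Unicode category. The class and helper names are illustrative only.

```java
import java.text.Normalizer;

class CombiningMarkDemo {
    // Decomposes text into base characters followed by their combining marks
    // (Unicode normalization form NFD).
    static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    // True if the code point belongs to one of the Unicode mark categories
    // (Mn, Mc, or Me), which is what \p{M} matches in a regex.
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    public static void main(String[] args) {
        String decomposed = decompose("\u00e0"); // single-character a with grave
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            System.out.printf("U+%04X mark=%b%n", (int) c, isCombiningMark(c));
        }
    }
}
```

Running it prints `U+0061 mark=false` followed by `U+0300 mark=true`, which is why a pattern that accepts only letters stops right after the a.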

Fortunately, our regex can look for combining marks as well with the \p{M} specifier. This matches on a Unicode character that is a combining mark. Our usernames as defined will never start with a combining mark, but we do need to check for them in the middle and at the end of the strings. The new regex looks like this:
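
A sketch of that pattern in Java, under the same assumptions as before: \p{M} is allowed in the middle and final positions but not in the first one. The names and the escaped sample string are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkAwareMentions {
    // First character: a letter or ASCII digit (never a combining mark).
    // Middle and last characters may also be combining marks (\p{M}).
    static final Pattern MENTION =
        Pattern.compile("@([\\p{L}0-9][\\p{L}\\p{M}0-9._-]+[\\p{L}\\p{M}0-9])");

    // Sample comment with the grave accent stored as U+0300 after a plain 'a'.
    static final String COMMENT =
        "This comment mentions @Adria\u0300n, @Fran\u00e7ois, @No\u00ebl, "
        + "@David, and @\u3072\u306a\u305f";

    static List<String> findMentions(String comment) {
        List<String> names = new ArrayList<>();
        Matcher m = MENTION.matcher(comment);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(findMentions(COMMENT));
    }
}
```

All five usernames are now captured in full, including the combining grave accent in the first one.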

Testing it:

Success!

One detail worth knowing is that some combined characters like the à can also be represented in Unicode by a single character (U+00E0 in this case). But with our regex, it doesn’t matter. If the character has a single-character representation, the \p{L} specifier matches it; if it is a combination of two characters, \p{L} matches the base letter and \p{M} matches the combining mark.
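
To illustrate, the sketch below runs one assumed mark-aware pattern against both spellings of the same username. The class name and the helper `firstMention` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BothFormsDemo {
    // Assumed pattern: letters via \p{L}, combining marks via \p{M}.
    static final Pattern MENTION =
        Pattern.compile("@([\\p{L}0-9][\\p{L}\\p{M}0-9._-]+[\\p{L}\\p{M}0-9])");

    // Returns the first captured username, or null if there is no mention.
    static String firstMention(String comment) {
        Matcher m = MENTION.matcher(comment);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String single = firstMention("cc @Adri\u00e0n");     // one character: U+00E0
        String combined = firstMention("cc @Adria\u0300n");  // two characters: a + U+0300
        System.out.println(single.length());
        System.out.println(combined.length());
    }
}
```

Both mentions are captured whole, but the results are 6 and 7 characters long respectively, so the two strings are not equal as stored. Code that looks up the user by name would likely need to normalize both sides to the same form first.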

As long as we’re internationalizing, let’s deal with the digits as well. Unicode regex handling gives us a safe way to match any representation of the digits 0 through 9 using the \p{Nd} specifier. Using it, we get our final internationalized regular expression for matching and returning usernames in the body of a comment’s text:
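
A possible Java form of that final pattern is sketched here, again assuming the username rules described earlier, with \p{Nd} replacing the ASCII digit range. The class name, method name, and the sample usernames are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InternationalMentions {
    // \p{L}  = any letter, \p{M} = any combining mark,
    // \p{Nd} = any decimal digit (ASCII 0-9 and digits from other scripts).
    static final Pattern MENTION = Pattern.compile(
        "@([\\p{L}\\p{Nd}][\\p{L}\\p{M}\\p{Nd}._-]+[\\p{L}\\p{M}\\p{Nd}])");

    static List<String> findMentions(String comment) {
        List<String> names = new ArrayList<>();
        Matcher m = MENTION.matcher(comment);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical usernames: one ends in the Arabic-Indic digits
        // U+0661 U+0662 U+0663, the other in ASCII digits.
        System.out.println(findMentions("ping @user\u0661\u0662\u0663 and @No\u00ebl42"));
    }
}
```

Both usernames are returned complete, regardless of which script their digits come from.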

The exact details for handling Unicode in regular expressions can vary from language to language, so be sure to check how your language’s regex engine behaves. The site regular-expressions.info is an excellent source of regular expression information for all programming languages and is what led me to the solution I described in this article.
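
As one example of such a language-specific detail, in Java the shorthand \w matches only ASCII word characters by default and becomes Unicode-aware when the pattern is compiled with the `Pattern.UNICODE_CHARACTER_CLASS` flag. The short sketch below shows the difference; the class and helper names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordClassDemo {
    // Default behavior: \w is limited to [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    // With the flag, \w follows the Unicode definition of a word character.
    static final Pattern UNICODE_WORD =
        Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    // Returns the first match of the pattern in the text, or null if none.
    static String firstWord(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String name = "No\u00ebl"; // e with diaeresis stored as U+00EB
        System.out.println(firstWord(ASCII_WORD, name).length());   // 2
        System.out.println(firstWord(UNICODE_WORD, name).length()); // 4
    }
}
```

The default pattern stops after `No`, while the flagged pattern matches the whole name.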
