xuzhezhaozhao's Column

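I wanted to write a JSON parser with lex & yacc. JSON strings can contain Unicode characters, but the lexer generator Lex does not directly support matching Unicode characters. So how can Unicode characters be matched? I found a very good answer on Stack Overflow: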

The basic idea is to write byte-level patterns that match UTF-8 encoded Unicode characters:

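ASC     [\x00-\x7f]
ASCN    [\x00-\t\v-\x7f]
U       [\x80-\xbf]
U2      [\xc2-\xdf]
U3      [\xe0-\xef]
U4      [\xf0-\xf4]

UANY    {ASC}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
UANYN   {ASCN}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
UONLY   {U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}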

The patterns above mean the following:

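UANY: matches any character, whether Unicode (multi-byte UTF-8) or ASCII
UANYN: like UANY, except that it does not match the newline character
UONLY: matches only Unicode (multi-byte UTF-8) characters, not ASCII characters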
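To make the byte classes concrete, here is a minimal C sketch that hand-codes the same matching logic as UANY, UANYN and UONLY. The function names (match_uany, match_uanyn, match_uonly) are made up for illustration and are not part of lex or of the Stack Overflow answer; in a real scanner, lex generates the equivalent code from the definitions.

```c
#include <stddef.h>

/* Byte classes mirroring the lex definitions ASC, U, U2, U3, U4. */
static int is_asc(unsigned char c) { return c <= 0x7f; }
static int is_u(unsigned char c)   { return c >= 0x80 && c <= 0xbf; } /* continuation byte */
static int is_u2(unsigned char c)  { return c >= 0xc2 && c <= 0xdf; } /* lead byte, 2-byte form */
static int is_u3(unsigned char c)  { return c >= 0xe0 && c <= 0xef; } /* lead byte, 3-byte form */
static int is_u4(unsigned char c)  { return c >= 0xf0 && c <= 0xf4; } /* lead byte, 4-byte form */

/* Hypothetical helper: byte length of the UANY match at the start of
 * s[0..n), or 0 if the bytes there do not match UANY. */
size_t match_uany(const unsigned char *s, size_t n)
{
    if (n >= 1 && is_asc(s[0])) return 1;
    if (n >= 2 && is_u2(s[0]) && is_u(s[1])) return 2;
    if (n >= 3 && is_u3(s[0]) && is_u(s[1]) && is_u(s[2])) return 3;
    if (n >= 4 && is_u4(s[0]) && is_u(s[1]) && is_u(s[2]) && is_u(s[3])) return 4;
    return 0;
}

/* UANYN: same as UANY, but a newline byte does not match. */
size_t match_uanyn(const unsigned char *s, size_t n)
{
    if (n >= 1 && s[0] == '\n') return 0;
    return match_uany(s, n);
}

/* UONLY: only the multi-byte alternatives of UANY. */
size_t match_uonly(const unsigned char *s, size_t n)
{
    size_t len = match_uany(s, n);
    return len > 1 ? len : 0;
}
```

For example, match_uany applied to the three bytes e4 b8 ad (the UTF-8 encoding of U+4E2D) returns 3, while a lone continuation byte such as 80 returns 0. Like the lex patterns, this sketch only checks byte ranges, so it also accepts some sequences that are not valid UTF-8, such as the overlong form e0 80 af.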

DISCLAIMER: Note that the scanner’s rules use a function called utf8_dup_from to convert the yytext to wide character strings containing Unicode codepoints. That function is robust; it detects problems like overlong sequences and invalid bytes and properly handles them. I.e. this program is not relying on these lex rules to do the validation and conversion, just to do the basic lexical recognition. These rules will recognize an overlong form (like an ASCII code encoded using several bytes) as valid syntax, but the conversion function will treat them properly. In any case, I don’t expect UTF-8 related security issues in the program source code, since you have to trust source code to be running it anyway (but data handled by the program may not be trusted!). If you’re writing a scanner for untrusted UTF-8 data, take care!
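The kind of check the disclaimer asks for can be sketched in C as follows. This is not the utf8_dup_from function mentioned above, whose source is not shown here; utf8_decode_strict is a hypothetical name, and the sketch assumes that the caller wants one code point decoded at a time and wants overlong forms, surrogates and values above U+10FFFF rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical strict decoder (illustration only, not utf8_dup_from).
 * Decodes one code point from s[0..n) into *out.
 * Returns the number of bytes consumed, or 0 if the input is invalid. */
size_t utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code point that legitimately needs 2, 3 or 4 bytes. */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t len, i;
    uint32_t cp;

    if (n == 0) return 0;
    if (s[0] <= 0x7f) { *out = s[0]; return 1; }          /* ASCII */
    else if (s[0] >= 0xc2 && s[0] <= 0xdf) { len = 2; cp = s[0] & 0x1f; }
    else if (s[0] >= 0xe0 && s[0] <= 0xef) { len = 3; cp = s[0] & 0x0f; }
    else if (s[0] >= 0xf0 && s[0] <= 0xf4) { len = 4; cp = s[0] & 0x07; }
    else return 0;                                        /* invalid lead byte */

    if (n < len) return 0;                                /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80) return 0;              /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3f);
    }

    if (cp < min_cp[len]) return 0;                       /* overlong encoding */
    if (cp >= 0xd800 && cp <= 0xdfff) return 0;           /* UTF-16 surrogate */
    if (cp > 0x10ffff) return 0;                          /* beyond Unicode range */

    *out = cp;
    return len;
}
```

With this sketch, the bytes e4 b8 ad decode to U+4E2D, whereas e0 80 af, which the lex patterns accept as a three-byte character, is rejected as an overlong encoding of "/".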
