Bilingual word representation

What is a bilingual word representation?

For an explanation of word representations, see the blog post 《Deep Learning in NLP （一）词向量和语言模型》 and the ACL 2010 paper Word representations: A simple and general method for semi-supervised learning. The ACL paper explains it like this:

A word representation is a mathematical object associated with each word, often a vector. Each dimension’s value corresponds to a feature and might even have a semantic or grammatical interpretation, so we call it a word feature.

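What does that mean? Put simply, we usually turn a word into numbers by assigning it a vector, in the hope that operations on those vectors capture properties of the language. Simpler still: we want words that are close in meaning to also be close in vector space. How to vectorize, and what each dimension of the vector means, is a separate topic that I won't go into here.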
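To make "close in vector space" concrete, here is a minimal sketch that compares word vectors with cosine similarity. The three-dimensional vectors are invented for illustration only; real word representations are learned from a corpus and have many more dimensions.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors, made up for this example (not trained on anything).
vectors = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

sim_cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
sim_cat_car = cosine_similarity(vectors["cat"], vectors["car"])

# With these toy values, "cat" is much closer to "dog" than to "car".
print(sim_cat_dog > sim_cat_car)
```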

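Most work focuses on a single language, so word representations are modeled monolingually. For cross-lingual problems we need bilingual word representations. Here's the problem: the monolingual word representations were each trained on a monolingual corpus, so when we cross languages, how do we convert between each language's word representations?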

1. Train each language separately, then map directly

Representative work: Tomas Mikolov's Exploiting similarities among languages for machine translation

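Mikolov's work is simple and intuitive, but it easily invites skepticism from other researchers. His experiments show empirically that a linear transformation maps the source language's word representations onto the target language's quite well. What does that mean? We want to train a linear model such that, during training, feeding in the word representation of the Chinese “猫” produces something as close as possible to the word representation of the English “cat”; and at test time, feeding in the word representation of “狗” yields something close to that of the English “dog”. Put in a cat, get a cat; put in a dog, get a dog.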

This approach needs a dictionary, which is used when training the model.
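The linear-mapping idea can be sketched roughly as follows. This is a toy under stated assumptions, not the paper's code: the "embeddings" are random synthetic data, the dictionary is simply the row alignment between train_src and train_tgt, and the matrix W is fitted with a closed-form least-squares solve (np.linalg.lstsq) swapped in for whatever optimizer one would use at scale. The names true_W, translate, and the w0..w4 labels are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_src, dim_tgt = 4, 3

# Synthetic stand-ins for two monolingual embedding spaces.
# Row i of train_src and row i of train_tgt play the role of a dictionary pair.
true_W = rng.normal(size=(dim_src, dim_tgt))
train_src = rng.normal(size=(50, dim_src))
train_tgt = train_src @ true_W + 0.01 * rng.normal(size=(50, dim_tgt))

# Fit W by minimizing sum_i ||x_i W - z_i||^2 over the dictionary pairs.
W, *_ = np.linalg.lstsq(train_src, train_tgt, rcond=None)

def translate(src_vec, tgt_matrix, tgt_words):
    """Map a source vector with W, then return the nearest target word by cosine."""
    mapped = src_vec @ W
    sims = tgt_matrix @ mapped / (
        np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(mapped)
    )
    return tgt_words[int(np.argmax(sims))]

# Held-out "words" that were not in the training dictionary.
test_src = rng.normal(size=(5, dim_src))
test_tgt = test_src @ true_W
test_words = [f"w{i}" for i in range(5)]
predictions = [translate(test_src[i], test_tgt, test_words) for i in range(5)]
print(predictions)
```

The held-out step mirrors the "put in a dog, get a dog" test: the pairs used to fit W never include the words we translate afterwards.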

2. Train each language separately, then map both into a third semantic space

Representative work: Manaal Faruqui's Improving vector space word representations using multilingual correlation

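Faruqui takes a different route: he maps both word representations into a third space and wants, in that space, the mapped “猫” to sit next to the mapped “cat” and the mapped “狗” next to the mapped “dog”. Learning this model amounts to maximizing the similarity, after mapping, between the mutually translated words in our parallel data (the dictionary); that is, training pulls “猫” and “cat” as close together as possible in the third space.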
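One standard way to learn two projections that maximize the correlation of paired vectors in a shared space is canonical correlation analysis (CCA), which as far as I know is the tool this paper relies on. Below is a minimal NumPy sketch of CCA on synthetic data; the function name cca, the small ridge term reg, and the random "embeddings" are assumptions made for this example, not anything taken from the paper.

```python
import numpy as np

def cca(X, Y, k, reg=1e-8):
    """Return projections A, B and the top-k canonical correlations.

    Row i of X and row i of Y are assumed to be a translation pair.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])  # ridge keeps it invertible
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    A = Kx @ U[:, :k]      # projects language-1 vectors into the shared space
    B = Ky @ Vt.T[:, :k]   # projects language-2 vectors into the shared space
    return A, B, s[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                                   # "Chinese" vectors
Y = X @ rng.normal(size=(4, 3)) + 0.01 * rng.normal(size=(200, 3))  # "English" vectors

A, B, corrs = cca(X, Y, k=2)
X_shared = (X - X.mean(axis=0)) @ A
Y_shared = (Y - Y.mean(axis=0)) @ B
print(corrs)
```

After projection, row i of X_shared and row i of Y_shared should land near each other, which is the "猫 next to cat" behavior described above.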

*****************

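Both of the works above separate the step of learning monolingual word representations from the step of learning bilingual ones: learn monolingually first, then find a way to relate the learned monolingual word representations to each other. The natural question is whether we can do bilingual learning directly from the start. In other words, while learning the word representations, also learn the relationship between the two languages (which really just means that words that translate each other should sit close together, and words that don't should stay apart). The two works below start from this idea.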

*****************
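The parenthetical above (translations close, non-translations apart) can be written down as a loss term. The sketch below uses a simple margin-based hinge over squared distances; it is a hypothetical illustration of that constraint only, and is not the objective used by either of the papers listed next. In joint training, a term like this would be added to each language's own monolingual loss. The function name and the margin value are assumptions.

```python
import numpy as np

def cross_lingual_margin_loss(src, tgt, margin=1.0):
    """Hinge loss: each translation pair (src[i], tgt[i]) should be closer,
    by at least `margin`, than src[i] is to any non-translation tgt[j]."""
    n = src.shape[0]
    diff = src[:, None, :] - tgt[None, :, :]
    dist = (diff ** 2).sum(axis=-1)       # dist[i, j] = ||src[i] - tgt[j]||^2
    pos = np.diag(dist)                   # distances of the true translation pairs
    hinge = np.maximum(0.0, margin + pos[:, None] - dist)
    np.fill_diagonal(hinge, 0.0)          # a pair is not its own negative
    return float(hinge.sum() / (n * (n - 1)))

# Toy 2-d embeddings: row i of each matrix is meant to be a translation pair.
src = np.array([[0.0, 0.0], [5.0, 5.0]])
tgt_aligned = np.array([[0.0, 0.0], [5.0, 5.0]])
tgt_swapped = np.array([[5.0, 5.0], [0.0, 0.0]])

loss_aligned = cross_lingual_margin_loss(src, tgt_aligned)
loss_swapped = cross_lingual_margin_loss(src, tgt_swapped)
print(loss_aligned, loss_swapped)
```

When translations coincide and non-translations are far away, the loss is zero; when the pairs are swapped, the loss is large, which is the signal a joint trainer would use to move the vectors.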

3. Train both languages jointly

Representative work: Stephan Gouws's BilBOWA: Fast Bilingual Distributed Representations without Word Alignments

Sarath Chandar A P's An Autoencoder Approach to Learning Bilingual Word Representations
