eBay User Feedback Clustering in R

Author: Zhao, Kevin

Abstract

Understanding eBay users’ feedback is important for improving our site service. However, extracting useful information from a large volume of user feedback is not easy. Some PMs and site analysts scratch their heads, sample hundreds of feedback entries, and read them one by one to get a general idea of what users are saying. This is extremely time-consuming and inefficient.

To deal with this problem, we use machine learning and natural language processing (NLP) methods to cluster user feedback into groups and then generate the main topics of each cluster. We also built a service tool/user interface that shows the main topics of each cluster, basic statistics for each experiment, and sampled user feedback. With this tool, site analysts and PMs can get a clear, general idea of what our users are talking about in just 5 seconds. A new feature of the tool is a word cloud section: users can see which words are mentioned most after an experiment launches, which helps them better understand the experiment’s effects.

This article explains the clustering algorithm and the natural language processing method, and introduces some useful functions of our feedback clustering tool.

Keywords: nlp, k-means clustering, feedback, word cloud

User Story and Request

PMs and site analysts often complain that it is very hard to digest the comments of eBay site users: some users talk about shipping, others about item image problems, and others mention bad search results. After launching an experiment, people need to go through a lot of user feedback to get a general idea of which feature makes users unhappy or which part users complain about most. This is tedious and time-consuming.

Methodology

Based on this real problem, our basic idea is to cluster user feedback into groups and find the topics within each group, then sample data from each group and expose it to PMs and analysts. The methodology is explained in two main sections: NLP processing of user feedback, and the k-means method for clustering.

1. NLP processing of user feedback

A. Dump user feedback from Oracle DB into R

The first step is to pull all user feedback from the Oracle DB into R. We chose R to implement our machine learning algorithms because it is commonly used in both academia and industry, and it has many useful packages that we can use directly to process data.

R package: RJDBC

Description: the RJDBC package connects R to the Oracle DB. We can write SQL directly in R, and all the data can be loaded into R with a few lines of R code.

The R code is provided here; you can apply it whenever you want to dump data from a Teradata DB/Oracle DB into R.

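library(RJDBC)  # depends on rJava (which provides .jinit) and DBI

## Path to the Oracle JDBC driver jar
cp <- c("classes12-10.2.jar")
.jinit(classpath = cp)
drv <- JDBC("oracle.jdbc.driver.OracleDriver")
conn <- dbConnect(drv, "jdbc:oracle:thin:@qudb.vip.arch.ebay.com:1521/QUDB", "name", "password")

## Keep only responses whose Q2 (feedback text) is not blank
sql <- "select * from survey_response_detail_v where survey_id=5000001476 and Q2 <> ' ' "
res <- dbSendQuery(conn, statement = sql)

## n = -1 fetches all rows; the cleaning code refers to this object as datasheet
datasheet <- fetch(res, n = -1)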

B. Clean data and generate feature matrix using NLP

After we have the feedback data, we need to clean it before using it for clustering, for example by removing stop words and stemming. R packages: tm and Snowball

Description of packages: tm and Snowball provide many existing functions for cleaning natural-language text in R.

We clean data in 7 steps:

1. Put all words into lower case

We convert all words to lower case.

2. Remove stop words

We remove all stop words, such as “I’m” and “who’s”.

3. Remove punctuation

We remove all punctuation. We can create another feature column for some special punctuation marks like “!” and “?”, because these are also important for reflecting the user’s mood.

4. Remove numbers

We don’t think numbers are useful for clustering user feedback.

5. Do stemming

We keep only the stem of each word. For example, “running” and “runs” are both reduced to “run”, so that feedback containing those words can be grouped together during clustering.

6. Eliminate extra white space

This step makes the user feedback text cleaner.

7. Generate the data matrix for the next clustering step

Finally, we generate a document-term matrix in R, with one row per feedback and one column per term.
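Step 3 mentions keeping “!” and “?” as extra features before punctuation is stripped. A minimal sketch of that idea in base R follows; the sample strings, the helper count_char, and the column names are hypothetical and only illustrate one way to build such columns.

```r
# Toy feedback strings (hypothetical, not real survey data)
feedback <- c("Search results are terrible!!!",
              "Where is my item? Why so slow?",
              "Shipping was fine.")

# Count how many times a literal character occurs in each string.
# gregexpr returns -1 when there is no match, so only positions > 0 count.
count_char <- function(x, ch) {
  hits <- gregexpr(ch, x, fixed = TRUE)
  vapply(hits, function(m) sum(m > 0), integer(1))
}

# One row per feedback; these columns could later be bound to the term matrix
punct_features <- data.frame(
  n_exclaim  = count_char(feedback, "!"),
  n_question = count_char(feedback, "?")
)
print(punct_features)
```

These counts have to be taken from the raw text, since they are lost once punctuation is removed.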

R code is provided here for your reference:

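library(tm)  # stemDocument also needs a Snowball stemmer package installed

## datasheet is an R object that contains all user feedback; the Q2 column holds the feedback sentences
reuters <- Corpus(VectorSource(datasheet$Q2))

## Put all words into lower case (recent tm versions expect content_transformer around base functions)
reuters <- tm_map(reuters, content_transformer(tolower))

## Remove stop words
reuters <- tm_map(reuters, removeWords, stopwords("english"))

## Remove punctuation
reuters <- tm_map(reuters, removePunctuation)

## Remove numbers
reuters <- tm_map(reuters, removeNumbers)

## Stemming
reuters <- tm_map(reuters, stemDocument)

## Eliminate extra white space
reuters <- tm_map(reuters, stripWhitespace)

## Look at the first few raw feedback sentences
head(datasheet$Q2)

## Build the document-term matrix used in the clustering step
dtm <- DocumentTermMatrix(reuters)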
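To make the shape of the resulting matrix concrete, here is a small base-R sketch that builds the same kind of document-term count matrix by hand. The three cleaned sentences and the names cleaned, vocab, and toy_dtm are made up for illustration; the real matrix comes from DocumentTermMatrix.

```r
# Toy, already-cleaned feedback (hypothetical): lower case, stemmed, no punctuation
cleaned <- c("ship slow ship late",
             "search result bad",
             "ship late search bad")

# Split each feedback into terms on whitespace
tokens <- strsplit(cleaned, " +")

# Vocabulary = sorted unique terms across all feedback
vocab <- sort(unique(unlist(tokens)))

# One row per feedback, one column per term, cell = how often the term occurs
toy_dtm <- t(vapply(tokens,
                    function(tk) as.integer(table(factor(tk, levels = vocab))),
                    integer(length(vocab))))
dimnames(toy_dtm) <- list(paste0("doc", seq_along(cleaned)), vocab)
print(toy_dtm)
```

With tm, as.matrix(dtm) gives a dense matrix of the same layout, which is the form that kmeans expects as input.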

2. K-means clustering method to cluster user feedback

A. Find the appropriate number of clusters
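One common heuristic for this step, not necessarily the one used in this tool, is the elbow method: run k-means for several values of k and look at where the total within-cluster sum of squares stops dropping sharply. The sketch below uses synthetic two-dimensional data as a stand-in for the feedback feature matrix; the data, the seed, and the range of k are all assumptions.

```r
set.seed(42)  # kmeans uses random starts; fix the seed for repeatability

# Synthetic stand-in for the feature matrix: 3 well-separated groups of 30 rows
centers <- rbind(c(0, 0), c(6, 6), c(0, 6))
x <- do.call(rbind, lapply(1:3, function(i)
  cbind(rnorm(30, centers[i, 1], 0.5), rnorm(30, centers[i, 2], 0.5))))

# Total within-cluster sum of squares for k = 1..6
ks  <- 1:6
wss <- vapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss,
              numeric(1))

print(data.frame(k = ks, wss = round(wss, 1)))
# plot(ks, wss, type = "b")  # look for the "elbow" where the curve flattens
```

On data like this the curve falls steeply up to k = 3 and flattens afterwards; on real feedback the elbow is usually less clear, so the choice of k remains a judgment call.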
