This webpage contains supplementary material for the paper "Deconstructing domain names to reveal latent topics" by Cheryl J. Flynn, Kenneth E. Shirley, and Wei Wang.

In the paper we fit three types of topic models with various numbers of topics to two collections of domain names. There was not enough room in the paper, of course, to summarize all the topics in these model fits, so here we share information about the model that was fit to one of the two data sets, the DMOZ data. This corpus of documents (where each segmented domain name is treated as a single document) contains D = 443,266 documents, N = 1,047,109 total tokens, and a vocabulary of W = 17,636 unique terms. Here we share the fit of the Biterm Topic Model (BTM) to these data, with the number of topics set to K = 250.

We share three supplementary data sets:

  1. data.csv: This comma-separated file contains 443,266 rows and four columns (plus a header row with column names). There is one row for each document in the DMOZ sample, and the four columns contain each document's raw domain name, tokenized domain name, most probable topic (an integer from 1 to 250), and the probability of that topic.
                                 Raw                 Tokenized Main BTM Topic Probability
    1          chasefurniture.com.au           chase,furniture             94       0.954
    2               flamingolake.com             flamingo,lake            208       0.706
    3                    tvguide.com                  tv,guide            135       0.930
    4  all-about-cyprus-yachting.com all,about,cyprus,yachting             29       0.402
    5                  newadvent.org                new,advent             26       0.445
    6      winecountrysequential.com              wine,country            115       0.240
    7      distinctivedirections.com    distinctive,directions            127       0.902
    8                 murphyship.com               murphy,ship             74       0.968
    9         laperlaranchresort.com           la,ranch,resort            159       0.413
    10           mikethurston.org.uk             mike,thurston            234       0.977
    
  2. vocab.txt: This file contains the vocabulary of the DMOZ topic model, with one term per line, and 17,636 lines.
    chase
    furniture
    flamingo
    lake
    tv
    guide
    all
    about
    cyprus
    yachting
    
  3. topics.csv: This comma-separated file contains 250 * 30 = 7,500 rows and five columns (and a header row). The rows contain the 30 most relevant terms for each of the 250 topics, where the relevance of a term to a topic was calculated as in Sievert and Shirley (2014):

    relevance(term w, topic k) = lambda * log(p(w | k)) + (1 - lambda) * log(p(w | k) / p(w)),

    and we set lambda = 0.6. This measure is designed to rank terms within topics by a roughly equally weighted average of the term's frequency and its exclusivity to that particular topic. The columns contain, for each term, the BTM topic ID, the term's relevance rank within that topic, the term itself, the probability of the term within the topic, and the relevance of the term.
    BTM.topic Relevance.rank  Term Probability Relevance
             1              1   phi       0.071     0.914
             1              2 sigma       0.072     0.858
             1              3 alpha       0.066     0.518
             1              4 gamma       0.030     0.387
             1              5  beta       0.035     0.327
             1              6 kappa       0.025     0.274
             1              7 delta       0.048     0.259
             1              8   chi       0.040     0.203
             1              9 theta       0.017     0.027
             1             10   psi       0.022    -0.091
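
As a sketch of working with data.csv (not code from the paper), the file can be read with Python's standard csv module. The column names below are assumed from the sample shown in item 1; the actual header in the file may differ. A two-row inline excerpt stands in for the real file:

```python
import csv
import io

# Inline stand-in for data.csv; real usage would open the file instead.
# Column names are assumed from the sample above, not confirmed.
sample = """Raw,Tokenized,Main BTM Topic,Probability
chasefurniture.com.au,"chase,furniture",94,0.954
flamingolake.com,"flamingo,lake",208,0.706
"""

# The Tokenized field is itself comma-delimited, so it is quoted;
# DictReader handles the quoting and we split it into tokens.
rows = list(csv.DictReader(io.StringIO(sample)))
tokens = rows[0]["Tokenized"].split(",")
print(tokens)  # ['chase', 'furniture']
print(int(rows[0]["Main BTM Topic"]), float(rows[0]["Probability"]))
```

To read the real file, replace `io.StringIO(sample)` with `open("data.csv", newline="")`.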
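
The relevance score above can be computed directly from a topic's term distribution p(w | k) and the corpus-wide term distribution p(w). The following sketch uses made-up probabilities for illustration (they are not values from the DMOZ fit); it shows how the lift term log(p(w | k) / p(w)) rewards terms that are exclusive to a topic:

```python
import math

def relevance(p_w_given_k, p_w, lam=0.6):
    """Relevance of term w to topic k (Sievert & Shirley, 2014):
    lam * log p(w|k) + (1 - lam) * log(p(w|k) / p(w))."""
    return lam * math.log(p_w_given_k) + (1 - lam) * math.log(p_w_given_k / p_w)

# Hypothetical probabilities: a frequent but non-exclusive term versus
# a rarer term that is concentrated in this topic.
common = relevance(p_w_given_k=0.05, p_w=0.04)       # low lift, penalized
exclusive = relevance(p_w_given_k=0.02, p_w=0.0005)  # high lift, boosted
print(common, exclusive)  # the exclusive term ranks higher
```

With lam = 1 the score reduces to log p(w | k), i.e. ranking purely by within-topic frequency; with lam = 0 it ranks purely by lift, which tends to surface rare terms.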