This webpage contains supplementary material for the paper "Deconstructing domain names to reveal latent topics" by Cheryl J. Flynn, Kenneth E. Shirley, and Wei Wang.
In the paper we fit three types of topic models, each with various numbers of topics, to two collections of domain names. There was not enough room in the paper to summarize all the topics in these model fits, so here we share information about the model fit to one of the two data sets, the DMOZ data. This corpus (where each segmented domain name is treated as a single document) contains D = 443,266 documents, N = 1,047,109 total tokens, and a vocabulary of W = 17,636 unique terms. Specifically, we share the fit of the Biterm Topic Model (BTM) to this corpus with the number of topics set to K = 250.
We share three supplementary data sets:
    Raw                            Tokenized                  Main BTM Topic  Probability
1   chasefurniture.com.au          chase,furniture            94              0.954
2   flamingolake.com               flamingo,lake              208             0.706
3   tvguide.com                    tv,guide                   135             0.930
4   all-about-cyprus-yachting.com  all,about,cyprus,yachting  29              0.402
5   newadvent.org                  new,advent                 26              0.445
6   winecountrysequential.com      wine,country               115             0.240
7   distinctivedirections.com      distinctive,directions     127             0.902
8   murphyship.com                 murphy,ship                74              0.968
9   laperlaranchresort.com         la,ranch,resort            159             0.413
10  mikethurston.org.uk            mike,thurston              234             0.977
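As a small illustration of how the corpus statistics D, N, and W are defined, the following Python sketch computes them for just the ten sample rows above. The variable names are ours and purely illustrative; this is not the code used for the paper, and the counts apply only to this ten-row sample, not to the full DMOZ corpus.

```python
from collections import Counter

# Tokenized column of the ten sample rows shown above.
tokenized = [
    "chase,furniture",
    "flamingo,lake",
    "tv,guide",
    "all,about,cyprus,yachting",
    "new,advent",
    "wine,country",
    "distinctive,directions",
    "murphy,ship",
    "la,ranch,resort",
    "mike,thurston",
]

# Each segmented domain name is treated as one document (a list of tokens).
documents = [row.split(",") for row in tokenized]

# Count how often each term occurs across all documents.
term_counts = Counter(token for doc in documents for token in doc)

D = len(documents)             # number of documents
N = sum(term_counts.values())  # total number of tokens
W = len(term_counts)           # number of unique terms (vocabulary size)

print(D, N, W)
```

In this small sample every token is distinct, so N and W coincide; in the full corpus N is far larger than W because terms repeat across domain names.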
chase furniture flamingo lake tv guide all about cyprus yachting
relevance(term w, topic k) = lambda * log(p(w | k)) + (1 - lambda) * log(p(w | k) / p(w)),
and we set lambda = 0.6. This is designed to rank terms within each topic by a roughly equally weighted average of the term's frequency within the topic and the term's exclusivity to that topic. The columns contain, for each term, the BTM Topic ID, the relevance rank within this topic, the term itself, the probability of the term within the topic, and the relevance of the term.
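To make the formula concrete, here is a minimal Python sketch of the relevance computation. It assumes natural logarithms (the base is not stated here), and the function name `relevance` and the input probabilities are hypothetical values chosen for illustration, not values from the fitted model.

```python
import math


def relevance(p_w_given_k, p_w, lam=0.6):
    """Relevance of term w for topic k, following the formula above.

    p_w_given_k: probability of the term within the topic, p(w | k)
    p_w:         overall probability of the term, p(w)
    lam:         weight lambda on the log-probability term
    Assumption: natural logarithm; the source does not state the base.
    """
    log_prob = math.log(p_w_given_k)        # frequency within the topic
    log_lift = math.log(p_w_given_k / p_w)  # exclusivity to the topic
    return lam * log_prob + (1 - lam) * log_lift


# Hypothetical inputs: two terms equally probable within a topic,
# one common across the whole corpus and one rare outside this topic.
common = relevance(0.05, 0.04)
exclusive = relevance(0.05, 0.0005)

print(common, exclusive)
```

With p(w | k) held fixed, the term with the smaller overall probability p(w) receives the higher relevance, which is the exclusivity effect the second term of the formula is meant to capture.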
BTM.topic  Relevance.rank  Term   Probability  Relevance
1          1               phi    0.071        0.914
1          2               sigma  0.072        0.858
1          3               alpha  0.066        0.518
1          4               gamma  0.030        0.387
1          5               beta   0.035        0.327
1          6               kappa  0.025        0.274
1          7               delta  0.048        0.259
1          8               chi    0.040        0.203
1          9               theta  0.017        0.027
1          10              psi    0.022        -0.091
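The sample rows for topic 1 show how ranking by relevance differs from ranking by probability alone. The sketch below, in Python with illustrative variable names of our own, sorts the ten rows both ways using only the values printed above.

```python
# (term, probability, relevance) for the ten rows of topic 1 shown above.
rows = [
    ("phi", 0.071, 0.914),
    ("sigma", 0.072, 0.858),
    ("alpha", 0.066, 0.518),
    ("gamma", 0.030, 0.387),
    ("beta", 0.035, 0.327),
    ("kappa", 0.025, 0.274),
    ("delta", 0.048, 0.259),
    ("chi", 0.040, 0.203),
    ("theta", 0.017, 0.027),
    ("psi", 0.022, -0.091),
]

# Order the terms by relevance (the order used in the table).
by_relevance = [term for term, _, _ in
                sorted(rows, key=lambda r: r[2], reverse=True)]

# Order the same terms by within-topic probability alone.
by_probability = [term for term, _, _ in
                  sorted(rows, key=lambda r: r[1], reverse=True)]

print(by_relevance)
print(by_probability)
```

Among these ten rows, sigma has a slightly higher probability than phi within the topic, yet phi ranks first by relevance; similarly, delta and chi move up when sorting by probability alone.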