topmost

Package Contents

Preprocess

BasicDataset

RawDataset

CrosslingualDataset

DynamicDataset

BasicTrainer

BERTopicTrainer

FASTopicTrainer

LDAGensimTrainer

LDASklearnTrainer

NMFGensimTrainer

NMFSklearnTrainer

CrosslingualTrainer

DynamicTrainer

DTMTrainer

HierarchicalTrainer

HDPGensimTrainer

ProdLDA

Autoencoding Variational Inference For Topic Models. ICLR 2017

CombinedTM

DecTM

Discovering Topics in Long-tailed Corpora with Causal Intervention. ACL 2021 Findings.

ETM

Topic Modeling in Embedding Spaces. TACL 2020

NSTM

Neural Topic Model via Optimal Transport. ICLR 2021

TSCTM

Mitigating Data Sparsity for Short Text Topic Modeling by Topic-Semantic Contrastive Learning. EMNLP 2022

ECRTM

Effective Neural Topic Modeling with Embedding Clustering Regularization. ICML 2023

NMTM

Learning Multilingual Topics with Neural Variational Inference. NLPCC 2020.

InfoCTM

InfoCTM: A Mutual Information Maximization Perspective of Cross-lingual Topic Modeling. AAAI 2023

DETM

The Dynamic Embedded Topic Model. 2019

CFDTM

Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion. ACL 2024 Findings

SawETM

Sawtooth Factorial Topic Embeddings Guided Gamma Belief Network. ICML 2021.

HyperMiner

HyperMiner: Topic Taxonomy Mining with Hyperbolic Embedding. NeurIPS 2022.

TraCo

On the Affinity, Rationality, and Diversity of Hierarchical Topic Modeling. AAAI 2024

download_dataset(dataset_name[, cache_path])

class Preprocess(tokenizer=None, test_sample_size=None, test_p=0.2, stopwords='English', min_doc_count=0, max_doc_freq=1.0, keep_num=False, keep_alphanum=False, strip_html=False, no_lower=False, min_length=3, min_term=0, vocab_size=None, seed=42, verbose=True)
Parameters:
  • test_sample_size – Size of the test set.

  • test_p – Proportion of the test set. This helps sample the train set based on the size of the test set.

  • stopwords – List of stopwords to exclude.

  • min_doc_count – Exclude words that occur in fewer than this number of documents.

  • max_doc_freq – Exclude words that occur in more than this proportion of documents.

  • keep_num – Keep tokens made of only numbers.

  • keep_alphanum – Keep tokens made of a mixture of letters and numbers.

  • strip_html – Strip HTML tags.

  • no_lower – Do not lowercase text.

  • min_length – Minimum token length.

  • min_term – Minimum term number

  • vocab_size – Size of the vocabulary (the most common words in the union of the train and test sets, after the exclusions above).

  • seed – Random integer seed (only relevant for choosing test set)
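The filtering parameters above can be illustrated with a small standalone sketch. It does not call `Preprocess` and is not the library's implementation; the helper name `build_vocab` is hypothetical, and ranking words by document count when truncating to `vocab_size` is an assumption.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_doc_count=0, max_doc_freq=1.0,
                min_length=3, vocab_size=None):
    """Illustrative vocabulary filter (hypothetical helper, not library code)."""
    num_docs = len(tokenized_docs)
    # Document frequency: number of documents in which each token appears.
    doc_counts = Counter()
    for tokens in tokenized_docs:
        doc_counts.update(set(tokens))

    kept = [
        (word, count) for word, count in doc_counts.items()
        if len(word) >= min_length               # min_length
        and count >= min_doc_count               # min_doc_count
        and count / num_docs <= max_doc_freq     # max_doc_freq
    ]
    # Most common first; ties broken alphabetically so the result is stable.
    kept.sort(key=lambda item: (-item[1], item[0]))
    if vocab_size is not None:
        kept = kept[:vocab_size]
    return sorted(word for word, _ in kept)


docs = [
    ["topic", "model", "the"],
    ["topic", "word", "the"],
    ["neural", "topic", "the", "ai"],
    ["neural", "model", "the"],
]
vocab = build_vocab(docs, min_doc_count=2, max_doc_freq=0.9)
small_vocab = build_vocab(docs, min_doc_count=2, max_doc_freq=0.9, vocab_size=2)
```

Here "the" is dropped because it appears in every document, and "word" and "ai" are dropped because they appear in only one.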

test_sample_size = None
min_doc_count = 0
max_doc_freq = 1.0
min_term = 0
test_p = 0.2
vocab_size = None
seed = 42
parse(texts, vocab)
preprocess_jsonlist(dataset_dir, label_name=None, pretrained_WE=False)
convert_labels(train_labels, test_labels)
preprocess(raw_train_texts, train_labels=None, raw_test_texts=None, test_labels=None, pretrained_WE=False)
save(output_dir, vocab, train_texts, train_bow, word_embeddings=None, train_labels=None, test_texts=None, test_bow=None, test_labels=None)
class BasicDataset(dataset_dir, batch_size=200, read_labels=False, as_tensor=True, contextual_embed=False, doc_embed_model='all-MiniLM-L6-v2', device='cpu')
vocab_size = 0
load_data(path, read_labels)
class RawDataset(docs, preprocess=None, batch_size=200, device='cpu', as_tensor=True, contextual_embed=False, pretrained_WE=False, doc_embed_model='all-MiniLM-L6-v2', embed_model_device=None, verbose=False)
train_data
train_texts
vocab
vocab_size
class CrosslingualDataset(dataset_dir, lang1, lang2, dict_path, device='cpu', batch_size=200, as_tensor=True)
batch_size = 200
train_size_en = 0
train_size_cn = 0
vocab_size_en = 0
vocab_size_cn = 0
pretrained_WE_en
pretrained_WE_cn
Map_en2cn
Map_cn2en
move_to_device(bow, device)
read_data(dataset_dir, lang)
parse_dictionary(dict_path)
get_Map(trans_matrix, bow)
class DynamicDataset(dataset_dir, batch_size=200, read_labels=False, device='cpu', as_tensor=True)
vocab_size = 0
train_size
num_times
train_time_wordfreq
load_data(path, read_labels)
get_time_wordfreq(bow, times)
download_dataset(dataset_name, cache_path='~/.topmost')
class BasicTrainer(model, dataset, num_top_words=15, epochs=200, learning_rate=0.002, batch_size=200, lr_scheduler=None, lr_step_size=125, log_interval=5, verbose=False)
model
dataset
num_top_words = 15
epochs = 200
learning_rate = 0.002
batch_size = 200
lr_scheduler = None
lr_step_size = 125
log_interval = 5
verbose = False
make_optimizer()
make_lr_scheduler(optimizer)
train()
test(bow)
get_beta()
get_top_words(num_top_words=None)
export_theta()
class BERTopicTrainer(dataset, num_topics=50, num_top_words=15)
model
dataset
train()
test(texts)
get_beta()
get_top_words()
export_theta()
class FASTopicTrainer(dataset, num_topics=50, num_top_words=15, preprocess=None, epochs=200, DT_alpha=3.0, TW_alpha=2.0, theta_temp=1.0, verbose=False)
dataset
num_top_words = 15
model
epochs = 200
train()
test(texts)
get_beta()
get_top_words(num_top_words=None)
export_theta()
class LDAGensimTrainer(dataset, num_topics=50, num_top_words=15, max_iter=1, alpha='symmetric', eta=None, verbose=False)
dataset
num_topics = 50
vocab_size
max_iter = 1
alpha = 'symmetric'
eta = None
verbose = False
num_top_words = 15
train()
test(bow)
get_beta()
get_top_words(num_top_words=None)
export_theta()
class LDASklearnTrainer(model, dataset, num_top_words=15, verbose=False)
model
dataset
num_top_words = 15
verbose = False
train()
test(bow)
get_beta()
get_top_words(num_top_words=None)
export_theta()
class NMFGensimTrainer(dataset, num_topics=50, num_top_words=15, max_iter=1)
dataset
num_topics = 50
num_top_words = 15
vocab_size
max_iter = 1
train()
test(bow)
get_beta()
get_top_words(num_top_words=None)
export_theta()
class NMFSklearnTrainer(model, dataset, num_top_words=15)
model
dataset
num_top_words = 15
train()
test(bow)
get_beta()
get_top_words(num_top_words=None)
export_theta()
class CrosslingualTrainer(model, dataset, num_top_words=15, epochs=500, learning_rate=0.002, batch_size=200, lr_scheduler=None, lr_step_size=125, log_interval=5, verbose=False)
model
dataset
num_top_words = 15
epochs = 500
learning_rate = 0.002
batch_size = 200
lr_scheduler = None
lr_step_size = 125
log_interval = 5
make_optimizer()
make_lr_scheduler(optimizer)
train()
test(bow_en, bow_cn)
infer_theta(bow, lang)
get_beta()
get_top_words(num_top_words=None)
export_theta()
class DynamicTrainer(model, dataset, num_top_words=15, epochs=200, learning_rate=0.002, batch_size=200, lr_scheduler=None, lr_step_size=125, log_interval=5, verbose=False)
model
dataset
num_top_words = 15
epochs = 200
learning_rate = 0.002
batch_size = 200
lr_scheduler = None
lr_step_size = 125
log_interval = 5
verbose = False
make_optimizer()
make_lr_scheduler(optimizer)
train()
test(bow, times)
get_beta()
get_top_words(num_top_words=None)
export_theta()
class DTMTrainer(dataset, num_topics=50, num_top_words=15, alphas=0.01, chain_variance=0.005, passes=10, lda_inference_max_iter=25, em_min_iter=6, em_max_iter=20, verbose=False)
dataset
vocab_size
num_topics = 50
num_top_words = 15
alphas = 0.01
chain_variance = 0.005
passes = 10
lda_inference_max_iter = 25
em_min_iter = 6
em_max_iter = 20
verbose = False
train()
test(bow)
get_theta()
get_beta()
get_top_words(num_top_words=None)
export_theta()
class HierarchicalTrainer(model, dataset, num_top_words=15, epochs=200, learning_rate=0.002, batch_size=200, lr_scheduler=None, lr_step_size=125, log_interval=5, verbose=False)
model
dataset
num_top_words = 15
epochs = 200
learning_rate = 0.002
batch_size = 200
lr_scheduler = None
lr_step_size = 125
log_interval = 5
verbose = False
make_optimizer()
make_lr_scheduler(optimizer)
train()
test(bow)
get_phi()
get_beta()
get_top_words(num_top_words=None, annotation=False)
export_theta()
class HDPGensimTrainer(dataset, num_top_words=15, max_chunks=None, max_time=None, chunksize=256, kappa=1.0, tau=64.0, K=15, T=150, alpha=1, gamma=1, eta=0.01, scale=1.0, var_converge=0.0001, verbose=False)
dataset
num_top_words = 15
vocab_size
max_chunks = None
max_time = None
chunksize = 256
kappa = 1.0
tau = 64.0
K = 15
T = 150
alpha = 1
gamma = 1
eta = 0.01
scale = 1.0
var_converge = 0.0001
verbose = False
train()
test(bow)
get_beta()
get_top_words(num_top_words=None)
export_theta()
class ProdLDA(vocab_size, num_topics=50, en_units=200, dropout=0.4)

Bases: torch.nn.Module

Autoencoding Variational Inference For Topic Models. ICLR 2017

Akash Srivastava, Charles Sutton.

num_topics = 50
a
mu2
var2
fc11
fc12
fc21
fc22
mean_bn
logvar_bn
decoder_bn
fc1_drop
theta_drop
fcd1
get_beta()
get_theta(x)
reparameterize(mu, logvar)
encode(x)
decode(theta)
forward(x)
loss_function(x, recon_x, mu, logvar)
class CombinedTM(vocab_size, contextual_embed_size, num_topics=50, en_units=200, dropout=0.4)

Bases: torch.nn.Module

vocab_size
num_topics = 50
a
mu2
var2
fc_contextual
fc11
fc12
fc21
fc22
mean_bn
logvar_bn
decoder_bn
fc1_drop
theta_drop
fcd1
get_beta()
get_theta(x)
reparameterize(mu, logvar)
encode(x)
decode(theta)
forward(x)
loss_function(x, recon_x, mu, logvar)
class DecTM(vocab_size, num_topics=50, en_units=200, dropout=0.4)

Bases: torch.nn.Module

Discovering Topics in Long-tailed Corpora with Causal Intervention. ACL 2021 Findings.

Xiaobao Wu, Chunping Li, Yishu Miao.

num_topics = 50
a
mu2
var2
fc11
fc12
fc21
fc22
mean_bn
logvar_bn
decoder_bn
fc1_drop
theta_drop
beta
get_beta()
get_theta(x)
reparameterize(mu, logvar)
encode(x)
decode(theta)
forward(x)
loss_function(x, recon_x, mu, logvar)
class ETM(vocab_size, embed_size=200, num_topics=50, en_units=800, dropout=0.0, pretrained_WE=None, train_WE=False)

Bases: torch.nn.Module

Topic Modeling in Embedding Spaces. TACL 2020

Adji B. Dieng, Francisco J. R. Ruiz, David M. Blei.

topic_embeddings
encoder1
fc21
fc22
reparameterize(mu, logvar)
encode(x)
get_theta(x)
get_beta()
forward(x, avg_loss=True)
loss_function(x, recon_x, mu, logvar, avg_loss=True)
class NSTM(vocab_size, num_topics=50, en_units=200, dropout=0.25, pretrained_WE=None, train_WE=True, embed_size=200, recon_loss_weight=0.07, sinkhorn_alpha=20)

Bases: torch.nn.Module

Neural Topic Model via Optimal Transport. ICLR 2021

He Zhao, Dinh Phung, Viet Huynh, Trung Le, Wray Buntine.

recon_loss_weight = 0.07
sinkhorn_alpha = 20
e1
e2
e_dropout
mean_bn
topic_embeddings
get_beta()
get_theta(input)
forward(input)
class TSCTM(vocab_size, num_topics=50, en_units=200, temperature=0.5, weight_contrast=1.0)

Bases: torch.nn.Module

Mitigating Data Sparsity for Short Text Topic Modeling by Topic-Semantic Contrastive Learning. EMNLP 2022

Xiaobao Wu, Anh Tuan Luu, Xinshuai Dong.

Note: This implementation does not include TSCTM with augmentations. For augmentations, see https://github.com/BobXWu/TSCTM.

fc11
fc12
fc21
mean_bn
decoder_bn
fcd1
topic_dist_quant
contrast_loss
get_beta()
encode(inputs)
decode(theta)
get_theta(inputs)
forward(inputs)
loss_function(recon_x, x)
class ECRTM(vocab_size, num_topics=50, en_units=200, dropout=0.0, pretrained_WE=None, embed_size=200, beta_temp=0.2, weight_loss_ECR=100.0, sinkhorn_alpha=20.0, sinkhorn_max_iter=1000)

Bases: torch.nn.Module

Effective Neural Topic Modeling with Embedding Clustering Regularization. ICML 2023

Xiaobao Wu, Xinshuai Dong, Thong Thanh Nguyen, Anh Tuan Luu.

num_topics = 50
beta_temp = 0.2
a
mu2
var2
fc11
fc12
fc21
fc22
fc1_dropout
theta_dropout
mean_bn
logvar_bn
decoder_bn
word_embeddings
topic_embeddings
ECR
get_beta()
reparameterize(mu, logvar)
encode(input)
get_theta(input)
compute_loss_KL(mu, logvar)
get_loss_ECR()
pairwise_euclidean_distance(x, y)
forward(input)
class NMTM(Map_en2cn, Map_cn2en, vocab_size_en, vocab_size_cn, num_topics=50, en_units=200, dropout=0.0, lam=0.8)

Bases: torch.nn.Module

Learning Multilingual Topics with Neural Variational Inference. NLPCC 2020.

Xiaobao Wu, Chunping Li, Yan Zhu, Yishu Miao.

num_topics = 50
lam = 0.8
Map_en2cn
Map_cn2en
a
mu2
var2
decoder_bn_en
decoder_bn_cn
fc11_en
fc11_cn
fc12
fc21
fc22
fc1_drop
z_drop
mean_bn
logvar_bn
phi_en
phi_cn
reparameterize(mu, logvar)
encode(x, lang)
get_theta(x, lang)
get_beta()
decode(theta, lang)
forward(x_en, x_cn)
loss_function(recon_x, x, mu, logvar)
class InfoCTM(trans_e2c, pretrain_word_embeddings_en, pretrain_word_embeddings_cn, vocab_size_en, vocab_size_cn, num_topics=50, en_units=200, dropout=0.0, temperature=0.2, pos_threshold=0.4, weight_MI=30.0)

Bases: torch.nn.Module

InfoCTM: A Mutual Information Maximization Perspective of Cross-lingual Topic Modeling. AAAI 2023

Xiaobao Wu, Xinshuai Dong, Thong Nguyen, Chaoqun Liu, Liangming Pan, Anh Tuan Luu

num_topics = 50
encoder_en
encoder_cn
a
mu2
var2
decoder_bn_en
decoder_bn_cn
phi_en
phi_cn
TAMI
get_beta()
get_theta(x, lang)
decode(theta, beta, lang)
forward(x_en, x_cn)
compute_loss_TM(recon_x, x, mu, logvar)
class DETM(vocab_size, num_times, train_size, train_time_wordfreq, num_topics=50, train_WE=True, pretrained_WE=None, en_units=800, eta_hidden_size=200, rho_size=300, enc_drop=0.0, eta_nlayers=3, eta_dropout=0.0, delta=0.005, theta_act='relu', device='cpu')

Bases: torch.nn.Module

The Dynamic Embedded Topic Model. 2019

Adji B. Dieng, Francisco J. R. Ruiz, David M. Blei

num_topics = 50
num_times
vocab_size
eta_hidden_size = 200
rho_size = 300
enc_drop = 0.0
eta_nlayers = 3
t_drop
eta_dropout = 0.0
delta = 0.005
train_WE = True
train_size
rnn_inp
device = 'cpu'
theta_act = 'relu'
mu_q_alpha
logsigma_q_alpha
q_theta
mu_q_theta
logsigma_q_theta
q_eta_map
q_eta
mu_q_eta
logsigma_q_eta
decoder_bn
get_activation(act)
reparameterize(mu, logvar)

Returns a sample from a Gaussian distribution via reparameterization.
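
As a rough illustration of Gaussian reparameterization, the sketch below draws `z = mu + sigma * eps` with `eps ~ N(0, 1)`. It uses plain Python lists rather than tensors, and it assumes `logvar` holds log-variances; it is not the method's actual code.

```python
import math
import random


def reparameterize(mu, logvar, rng=random):
    """Element-wise draw z = mu + sigma * eps, with eps ~ N(0, 1).

    Assumption: logvar holds log-variances, so sigma = exp(0.5 * logvar).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


# Empirical check with mean 1 and variance 4 (seeded for repeatability).
rng = random.Random(0)
samples = [reparameterize([1.0], [math.log(4.0)], rng)[0] for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

Because the randomness enters only through `eps`, the sample is a differentiable function of `mu` and `logvar`, which is what allows gradients to flow through the sampling step in a tensor implementation.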

get_kl(q_mu, q_logsigma, p_mu=None, p_logsigma=None)

Returns KL( N(q_mu, q_logsigma) || N(p_mu, p_logsigma) ).
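
The closed form for the KL divergence between two diagonal Gaussians can be sketched as follows. This is a standalone illustration with a hypothetical function name; it assumes the log terms are log-variances and that a missing prior means a standard normal, neither of which is stated in this reference.

```python
import math


def gaussian_kl(q_mu, q_logvar, p_mu=None, p_logvar=None):
    """KL(q || p) for diagonal Gaussians, summed over dimensions.

    Assumptions: the log terms are log-variances, and an omitted prior
    is a standard normal (zero mean, unit variance).
    """
    dim = len(q_mu)
    p_mu = [0.0] * dim if p_mu is None else p_mu
    p_logvar = [0.0] * dim if p_logvar is None else p_logvar
    total = 0.0
    for mq, lq, mp, lp in zip(q_mu, q_logvar, p_mu, p_logvar):
        # 0.5 * [log(var_p / var_q) + (var_q + (mu_q - mu_p)^2) / var_p - 1]
        total += 0.5 * (lp - lq
                        + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp)
                        - 1.0)
    return total


kl_same = gaussian_kl([0.3, -1.0], [0.2, 0.5], [0.3, -1.0], [0.2, 0.5])
kl_shifted = gaussian_kl([1.0], [0.0])  # unit variance, mean shifted by 1
```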

get_alpha()
get_eta(rnn_inp)
get_theta(bows, times, eta=None)

Returns the topic proportions.

property word_embeddings
property topic_embeddings
get_beta(alpha=None)

Returns the topic matrix beta of shape T x K x V.

get_NLL(theta, beta, bows)
forward(bows, times)
init_hidden()

Initializes the first hidden state of the RNN used as inference network for eta.

class CFDTM(vocab_size, train_time_wordfreq, num_times, pretrained_WE=None, num_topics=50, en_units=100, temperature=0.1, beta_temp=1.0, weight_neg=10000000.0, weight_pos=10.0, weight_UWE=1000.0, neg_topk=15, dropout=0.0, embed_size=200)

Bases: torch.nn.Module

Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion. ACL 2024 Findings

Xiaobao Wu, Xinshuai Dong, Liangming Pan, Thong Nguyen, Anh Tuan Luu.

num_topics = 50
beta_temp = 1.0
train_time_wordfreq
encoder
a
mu2
var2
decoder_bn
topic_embeddings
ETC
UWE
get_beta()
pairwise_euclidean_dist(x, y)
get_theta(x, times=None)
get_KL(mu, logvar)
get_NLL(theta, beta, x, recon_x=None)
decode(theta, beta)
forward(x, times)
class SawETM(vocab_size, num_topics_list, device='cpu', embed_size=100, hidden_size=256, pretrained_WE=None)

Bases: torch.nn.Module

Sawtooth Factorial Topic Embeddings Guided Gamma Belief Network. ICML 2021.

Zhibin Duan, Dongsheng Wang, Bo Chen, Chaojie Wang, Wenchao Chen, Yewen Li, Jie Ren, Mingyuan Zhou.

https://github.com/ZhibinDuan/SawETM

device = 'cpu'
gam_prior
real_min
theta_max
wei_shape_min
wei_shape_max
num_topics_list
num_hiddens_list
num_layers
alpha
h_encoder
q_theta
log_max(x)
reparameterize(shape, scale, sample_num=50)

Returns a sample from a Weibull distribution via reparameterization.
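
A Weibull draw can be reparameterized by inverting the CDF on a uniform variate. The sketch below shows that idea for scalars only; the function names are hypothetical, and the `sample_num` averaging suggested by the signature above is not modeled.

```python
import math
import random


def weibull_from_uniform(shape, scale, u):
    """Map u in [0, 1) to a Weibull(shape, scale) draw by inverting the CDF."""
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


def sample_weibull(shape, scale, rng=random):
    """Draw one Weibull sample; all randomness comes from the uniform variate."""
    return weibull_from_uniform(shape, scale, rng.random())


rng = random.Random(0)
draws = [sample_weibull(2.0, 3.0, rng) for _ in range(20000)]
empirical_mean = sum(draws) / len(draws)
# Analytic mean of Weibull(shape, scale) is scale * Gamma(1 + 1 / shape).
analytic_mean = 3.0 * math.gamma(1.0 + 1.0 / 2.0)
```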

kl_weibull_gamma(wei_shape, wei_scale, gam_shape, gam_scale)

Returns the Kullback-Leibler divergence between a Weibull distribution and a Gamma distribution.
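
For reference, the Weibull-to-Gamma KL divergence has a closed form, sketched below for scalars. The function name is hypothetical, and the Gamma distribution is parameterized here by shape and rate; whether the `gam_scale` argument above is a scale or a rate is not stated in this reference, so treat the mapping as an assumption.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def weibull_gamma_kl(wei_shape, wei_scale, gam_shape, gam_rate):
    """KL(Weibull(k, lam) || Gamma(a, b)) with b a rate parameter (assumed)."""
    k, lam, a, b = wei_shape, wei_scale, gam_shape, gam_rate
    return (EULER_GAMMA * a / k
            - a * math.log(lam)
            + math.log(k)
            + b * lam * math.gamma(1.0 + 1.0 / k)
            - EULER_GAMMA
            - 1.0
            - a * math.log(b)
            + math.lgamma(a))


# Weibull(1, 2) and Gamma(1, rate=0.5) are the same exponential distribution.
kl_identical = weibull_gamma_kl(1.0, 2.0, 1.0, 0.5)
kl_different = weibull_gamma_kl(2.0, 1.0, 1.0, 1.0)
```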

get_nll(x, x_reconstruct)

Returns the negative Poisson log-likelihood of observational count data.
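
A minimal sketch of a Poisson negative log-likelihood over count data follows. It works on flat Python lists, uses a hypothetical function name and `eps` guard, and applies no normalization by batch size; the library method may differ on all of these points.

```python
import math


def poisson_nll(x, x_reconstruct, eps=1e-10):
    """Negative Poisson log-likelihood of counts x under rates x_reconstruct."""
    log_likelihood = 0.0
    for count, rate in zip(x, x_reconstruct):
        # log Poisson(count | rate) = count * log(rate) - rate - log(count!)
        log_likelihood += (count * math.log(rate + eps)
                           - rate
                           - math.lgamma(count + 1.0))
    return -log_likelihood


nll_at_truth = poisson_nll([3.0], [3.0])
nll_too_low = poisson_nll([3.0], [2.0])
nll_too_high = poisson_nll([3.0], [4.0])
```

For a single observed count, the value is smallest when the reconstructed rate equals the count.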

property bottom_word_embeddings
property topic_embeddings_list
get_phis()

Returns the factor loading matrix by utilizing sawtooth connection.
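
One way to read the sawtooth connection, following the SawETM paper rather than this library's code, is that the loading matrix between two adjacent layers is a column-wise softmax over inner products of their embeddings. The sketch below illustrates that reading with plain lists; the function name and the toy embeddings are hypothetical.

```python
import math


def sawtooth_phi(lower_embeddings, upper_embeddings):
    """Phi[i][k] = softmax over i of <lower_i, upper_k> (illustrative only)."""
    scores = [[sum(a * b for a, b in zip(low, up)) for up in upper_embeddings]
              for low in lower_embeddings]
    phi = [[0.0] * len(upper_embeddings) for _ in lower_embeddings]
    for k in range(len(upper_embeddings)):
        column = [row[k] for row in scores]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in column]
        normalizer = sum(exps)
        for i, value in enumerate(exps):
            phi[i][k] = value / normalizer
    return phi


# Three lower-level items (e.g. words) and two upper-level topics, 2-d embeddings.
lower = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
upper = [[2.0, 0.0], [0.0, 2.0]]
phi = sawtooth_phi(lower, upper)
column_sums = [sum(row[k] for row in phi) for k in range(len(upper))]
```

Each embedding layer is shared by the loading matrices directly below and above it, which is where the "sawtooth" name comes from.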

get_beta()
get_phi_list()
get_theta(x)
forward(x)

Forward pass: compute the kl loss and data likelihood.

class HyperMiner(vocab_size, num_topics_list, device='cpu', manifold='PoincareBall', clip_r=None, curvature=-0.01, embed_size=50, hidden_size=300, pretrained_WE=None)

Bases: topmost.models.hierarchical.SawETM.SawETM.SawETM

HyperMiner: Topic Taxonomy Mining with Hyperbolic Embedding. NeurIPS 2022.

Yishi Xu, Dongsheng Wang, Bo Chen, Ruiying Lu, Zhibin Duan, Mingyuan Zhou.

https://github.com/NoviceStone/HyperMiner

manifold
clip_r = None
feat_clip(x)
property bottom_word_embeddings
property topic_embeddings_list
get_phi()

Returns the factor loading matrix by utilizing sawtooth connection.

get_beta()
get_phi_list()
get_theta(x)
forward(x)

Forward pass: compute the kl loss and data likelihood.

class TraCo(vocab_size, num_topics_list=[10, 50, 200], en_units=300, dropout=0.0, embed_size=200, bias_topk=20, bias_p=5.0, beta_temp=0.1, weight_loss_TPD=20.0, sinkhorn_alpha=20.0, sinkhorn_max_iter=1000)

Bases: torch.nn.Module

On the Affinity, Rationality, and Diversity of Hierarchical Topic Modeling. AAAI 2024

Xiaobao Wu, Fengjun Pan, Thong Nguyen, Yichao Feng, Chaoqun Liu, Cong-Duy Nguyen, Anh Tuan Luu.

num_topics_list = [10, 50, 200]
weight_loss_TPD = 20.0
beta_temp = 0.1
num_layers
bottom_word_embeddings
topic_embeddings_list
TPD
CDDecoder
encoder
get_beta()
get_phi_list()
get_theta(input_bow)
forward(input_bow)
compute_loss_KL(mu, logvar, mu_prior=None)