topmost

Package Contents

`Preprocess`
`BasicDataset`
`RawDataset`
`CrosslingualDataset`
`DynamicDataset`
`BasicTrainer`
`BERTopicTrainer`
`FASTopicTrainer`
`LDAGensimTrainer`
`LDASklearnTrainer`
`NMFGensimTrainer`
`NMFSklearnTrainer`
`CrosslingualTrainer`
`DynamicTrainer`
`DTMTrainer`
`HierarchicalTrainer`
`HDPGensimTrainer`
`ProdLDA`	Autoencoding Variational Inference For Topic Models. ICLR 2017
`CombinedTM`
`DecTM`	Discovering Topics in Long-tailed Corpora with Causal Intervention. ACL 2021 Findings.
`ETM`	Topic Modeling in Embedding Spaces. TACL 2020
`NSTM`	Neural Topic Model via Optimal Transport. ICLR 2021
`TSCTM`	Mitigating Data Sparsity for Short Text Topic Modeling by Topic-Semantic Contrastive Learning. EMNLP 2022
`ECRTM`	Effective Neural Topic Modeling with Embedding Clustering Regularization. ICML 2023
`NMTM`	Learning Multilingual Topics with Neural Variational Inference. NLPCC 2020.
`InfoCTM`	InfoCTM: A Mutual Information Maximization Perspective of Cross-lingual Topic Modeling. AAAI 2023
`DETM`	The Dynamic Embedded Topic Model. 2019
`CFDTM`	Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion. ACL 2024 Findings
`SawETM`	Sawtooth Factorial Topic Embeddings Guided Gamma Belief Network. ICML 2021.
`HyperMiner`	HyperMiner: Topic Taxonomy Mining with Hyperbolic Embedding. NeurIPS 2022.
`TraCo`	On the Affinity, Rationality, and Diversity of Hierarchical Topic Modeling. AAAI 2024

download_dataset(dataset_name[, cache_path])

class Preprocess(tokenizer=None, test_sample_size=None, test_p=0.2, stopwords='English', min_doc_count=0, max_doc_freq=1.0, keep_num=False, keep_alphanum=False, strip_html=False, no_lower=False, min_length=3, min_term=0, vocab_size=None, seed=42, verbose=True)¶

Parameters:

test_sample_size – Size of the test set.
test_p – Proportion of the data used as the test set. The train set is sampled to match this proportion given the size of the test set.
stopwords – List of stopwords to exclude.
min_doc_count – Exclude words that occur in fewer than this number of documents.
max_doc_freq – Exclude words that occur in more than this proportion of documents.
keep_num – Keep tokens made of only numbers.
keep_alphanum – Keep tokens made of a mixture of letters and numbers.
strip_html – Strip HTML tags.
no_lower – Do not lowercase text.
min_length – Minimum token length.
min_term – Minimum number of terms.
vocab_size – Size of the vocabulary (the most common words in the union of the train and test sets, after the exclusions above).
seed – Random integer seed (only relevant for choosing the test set).
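
The document-frequency parameters above can be illustrated with a small standalone sketch. It is not the code used by `Preprocess`: the helper name `build_vocab` is hypothetical, and ranking words by document count before applying `vocab_size` is an assumption made for this example.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_doc_count=0, max_doc_freq=1.0, vocab_size=None):
    """Illustrative document-frequency filter (hypothetical helper, not topmost's code)."""
    num_docs = len(tokenized_docs)

    # For each word, count the number of documents that contain it.
    doc_counts = Counter()
    for tokens in tokenized_docs:
        doc_counts.update(set(tokens))

    # Drop words that are too rare (min_doc_count) or too common (max_doc_freq).
    kept = [
        (word, count)
        for word, count in doc_counts.items()
        if count >= min_doc_count and count / num_docs <= max_doc_freq
    ]

    # Keep the most common words first; ties are broken alphabetically.
    kept.sort(key=lambda item: (-item[1], item[0]))
    if vocab_size is not None:
        kept = kept[:vocab_size]

    return sorted(word for word, _ in kept)


docs = [
    ["topic", "model", "text"],
    ["topic", "word", "text"],
    ["topic", "rare"],
]
vocab = build_vocab(docs, min_doc_count=2, max_doc_freq=0.9)
print(vocab)
```

Here "topic" appears in every document and is removed by `max_doc_freq=0.9`, while words that appear in a single document are removed by `min_doc_count=2`.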

test_sample_size = None¶

min_doc_count = 0¶

max_doc_freq = 1.0¶

min_term = 0¶

test_p = 0.2¶

vocab_size = None¶

seed = 42¶

parse(texts, vocab)¶

preprocess_jsonlist(dataset_dir, label_name=None, pretrained_WE=False)¶

convert_labels(train_labels, test_labels)¶

preprocess(raw_train_texts, train_labels=None, raw_test_texts=None, test_labels=None, pretrained_WE=False)¶

save(output_dir, vocab, train_texts, train_bow, word_embeddings=None, train_labels=None, test_texts=None, test_bow=None, test_labels=None)¶

class BasicDataset(dataset_dir, batch_size=200, read_labels=False, as_tensor=True, contextual_embed=False, doc_embed_model='all-MiniLM-L6-v2', device='cpu')¶

vocab_size = 0¶

load_data(path, read_labels)¶

class RawDataset(docs, preprocess=None, batch_size=200, device='cpu', as_tensor=True, contextual_embed=False, pretrained_WE=False, doc_embed_model='all-MiniLM-L6-v2', embed_model_device=None, verbose=False)¶

train_data¶

train_texts¶

vocab¶

vocab_size¶

class CrosslingualDataset(dataset_dir, lang1, lang2, dict_path, device='cpu', batch_size=200, as_tensor=True)¶

batch_size = 200¶

train_size_en = 0¶

train_size_cn = 0¶

vocab_size_en = 0¶

vocab_size_cn = 0¶

pretrained_WE_en¶

pretrained_WE_cn¶

Map_en2cn¶

Map_cn2en¶

move_to_device(bow, device)¶

read_data(dataset_dir, lang)¶

parse_dictionary(dict_path)¶

get_Map(trans_matrix, bow)¶

class DynamicDataset(dataset_dir, batch_size=200, read_labels=False, device='cpu', as_tensor=True)¶

vocab_size = 0¶

train_size¶

num_times¶

train_time_wordfreq¶

load_data(path, read_labels)¶

get_time_wordfreq(bow, times)¶

download_dataset(dataset_name, cache_path='~/.topmost')¶

class BasicTrainer(model, dataset, num_top_words=15, epochs=200, learning_rate=0.002, batch_size=200, lr_scheduler=None, lr_step_size=125, log_interval=5, verbose=False)¶

model¶

dataset¶

num_top_words = 15¶

epochs = 200¶

learning_rate = 0.002¶

batch_size = 200¶

lr_scheduler = None¶

lr_step_size = 125¶

log_interval = 5¶

verbose = False¶

make_optimizer()¶

make_lr_scheduler(optimizer)¶

train()¶

test(bow)¶

get_beta()¶

get_top_words(num_top_words=None)¶

export_theta()¶

class BERTopicTrainer(dataset, num_topics=50, num_top_words=15)¶

model¶

dataset¶

train()¶

test(texts)¶

get_beta()¶

get_top_words()¶

export_theta()¶

class FASTopicTrainer(dataset, num_topics=50, num_top_words=15, preprocess=None, epochs=200, DT_alpha=3.0, TW_alpha=2.0, theta_temp=1.0, verbose=False)¶

dataset¶

num_top_words = 15¶

model¶

epochs = 200¶

train()¶

test(texts)¶

get_beta()¶

get_top_words(num_top_words=None)¶

export_theta()¶

class LDAGensimTrainer(dataset, num_topics=50, num_top_words=15, max_iter=1, alpha='symmetric', eta=None, verbose=False)¶

dataset¶

num_topics = 50¶

vocab_size¶

max_iter = 1¶

alpha = 'symmetric'¶

eta = None¶

verbose = False¶

num_top_words = 15¶

train()¶

test(bow)¶

get_beta()¶

get_top_words(num_top_words=None)¶

export_theta()¶

class LDASklearnTrainer(model, dataset, num_top_words=15, verbose=False)¶

model¶

dataset¶

num_top_words = 15¶

verbose = False¶

train()¶

test(bow)¶

get_beta()¶

get_top_words(num_top_words=None)¶

export_theta()¶

class NMFGensimTrainer(dataset, num_topics=50, num_top_words=15, max_iter=1)¶

dataset¶

num_topics = 50¶

num_top_words = 15¶

vocab_size¶

max_iter = 1¶

train()¶

test(bow)¶

get_beta()¶

get_top_words(num_top_words=None)¶

export_theta()¶

class NMFSklearnTrainer(model, dataset, num_top_words=15)¶

model¶

dataset¶

num_top_words = 15¶

train()¶

test(bow)¶

get_beta()¶

get_top_words(num_top_words=None)¶

export_theta()¶

class CrosslingualTrainer(model, dataset, num_top_words=15, epochs=500, learning_rate=0.002, batch_size=200, lr_scheduler=None, lr_step_size=125, log_interval=5, verbose=False)¶

model¶

dataset¶

num_top_words = 15¶

epochs = 500¶

learning_rate = 0.002¶

batch_size = 200¶

lr_scheduler = None¶

lr_step_size = 125¶

log_interval = 5¶

make_optimizer()¶

make_lr_scheduler(optimizer)¶

train()¶

test(bow_en, bow_cn)¶

infer_theta(bow, lang)¶

get_beta()¶

get_top_words(num_top_words=None)¶

export_theta()¶

class DynamicTrainer(model, dataset, num_top_words=15, epochs=200, learning_rate=0.002, batch_size=200, lr_scheduler=None, lr_step_size=125, log_interval=5, verbose=False)¶

model¶

dataset¶

num_top_words = 15¶

epochs = 200¶

learning_rate = 0.002¶

batch_size = 200¶

lr_scheduler = None¶

lr_step_size = 125¶

log_interval = 5¶

verbose = False¶

make_optimizer()¶

make_lr_scheduler(optimizer)¶

train()¶

test(bow, times)¶

get_beta()¶

get_top_words(num_top_words=None)¶

export_theta()¶

class DTMTrainer(dataset, num_topics=50, num_top_words=15, alphas=0.01, chain_variance=0.005, passes=10, lda_inference_max_iter=25, em_min_iter=6, em_max_iter=20, verbose=False)¶

dataset¶

vocab_size¶

num_topics = 50¶

num_top_words = 15¶

alphas = 0.01¶

chain_variance = 0.005¶

passes = 10¶

lda_inference_max_iter = 25¶

em_min_iter = 6¶

em_max_iter = 20¶

verbose = False¶

train()¶

test(bow)¶

get_theta()¶

get_beta()¶

get_top_words(num_top_words=None)¶

export_theta()¶

class HierarchicalTrainer(model, dataset, num_top_words=15, epochs=200, learning_rate=0.002, batch_size=200, lr_scheduler=None, lr_step_size=125, log_interval=5, verbose=False)¶

model¶

dataset¶

num_top_words = 15¶

epochs = 200¶

learning_rate = 0.002¶

batch_size = 200¶

lr_scheduler = None¶

lr_step_size = 125¶

log_interval = 5¶

verbose = False¶

make_optimizer()¶

make_lr_scheduler(optimizer)¶

train()¶

test(bow)¶

get_phi()¶

get_beta()¶

get_top_words(num_top_words=None, annotation=False)¶

export_theta()¶

class HDPGensimTrainer(dataset, num_top_words=15, max_chunks=None, max_time=None, chunksize=256, kappa=1.0, tau=64.0, K=15, T=150, alpha=1, gamma=1, eta=0.01, scale=1.0, var_converge=0.0001, verbose=False)¶

dataset¶

num_top_words = 15¶

vocab_size¶

max_chunks = None¶

max_time = None¶

chunksize = 256¶

kappa = 1.0¶

tau = 64.0¶

K = 15¶

T = 150¶

alpha = 1¶

gamma = 1¶

eta = 0.01¶

scale = 1.0¶

var_converge = 0.0001¶

verbose = False¶

train()¶

test(bow)¶

get_beta()¶

get_top_words(num_top_words=None)¶

export_theta()¶

class ProdLDA(vocab_size, num_topics=50, en_units=200, dropout=0.4)¶

Bases: torch.nn.Module

Autoencoding Variational Inference For Topic Models. ICLR 2017

Akash Srivastava, Charles Sutton.

num_topics = 50¶

a¶

mu2¶

var2¶

fc11¶

fc12¶

fc21¶

fc22¶

mean_bn¶

logvar_bn¶

decoder_bn¶

fc1_drop¶

theta_drop¶

fcd1¶

get_beta()¶

get_theta(x)¶

reparameterize(mu, logvar)¶

encode(x)¶

decode(theta)¶

forward(x)¶

loss_function(x, recon_x, mu, logvar)¶

class CombinedTM(vocab_size, contextual_embed_size, num_topics=50, en_units=200, dropout=0.4)¶

Bases: torch.nn.Module

vocab_size¶

num_topics = 50¶

a¶

mu2¶

var2¶

fc_contextual¶

fc11¶

fc12¶

fc21¶

fc22¶

mean_bn¶

logvar_bn¶

decoder_bn¶

fc1_drop¶

theta_drop¶

fcd1¶

get_beta()¶

get_theta(x)¶

reparameterize(mu, logvar)¶

encode(x)¶

decode(theta)¶

forward(x)¶

loss_function(x, recon_x, mu, logvar)¶

class DecTM(vocab_size, num_topics=50, en_units=200, dropout=0.4)¶

Bases: torch.nn.Module

Discovering Topics in Long-tailed Corpora with Causal Intervention. ACL 2021 Findings.

Xiaobao Wu, Chunping Li, Yishu Miao.

num_topics = 50¶

a¶

mu2¶

var2¶

fc11¶

fc12¶

fc21¶

fc22¶

mean_bn¶

logvar_bn¶

decoder_bn¶

fc1_drop¶

theta_drop¶

beta¶

get_beta()¶

get_theta(x)¶

reparameterize(mu, logvar)¶

encode(x)¶

decode(theta)¶

forward(x)¶

loss_function(x, recon_x, mu, logvar)¶

class ETM(vocab_size, embed_size=200, num_topics=50, en_units=800, dropout=0.0, pretrained_WE=None, train_WE=False)¶

Bases: torch.nn.Module

Topic Modeling in Embedding Spaces. TACL 2020

Adji B. Dieng, Francisco J. R. Ruiz, David M. Blei.

topic_embeddings¶

encoder1¶

fc21¶

fc22¶

reparameterize(mu, logvar)¶

encode(x)¶

get_theta(x)¶

get_beta()¶

forward(x, avg_loss=True)¶

loss_function(x, recon_x, mu, logvar, avg_loss=True)¶

class NSTM(vocab_size, num_topics=50, en_units=200, dropout=0.25, pretrained_WE=None, train_WE=True, embed_size=200, recon_loss_weight=0.07, sinkhorn_alpha=20)¶

Bases: torch.nn.Module

Neural Topic Model via Optimal Transport. ICLR 2021

He Zhao, Dinh Phung, Viet Huynh, Trung Le, Wray Buntine.

recon_loss_weight = 0.07¶

sinkhorn_alpha = 20¶

e1¶

e2¶

e_dropout¶

mean_bn¶

topic_embeddings¶

get_beta()¶

get_theta(input)¶

forward(input)¶

class TSCTM(vocab_size, num_topics=50, en_units=200, temperature=0.5, weight_contrast=1.0)¶

Bases: torch.nn.Module

Mitigating Data Sparsity for Short Text Topic Modeling by Topic-Semantic Contrastive Learning. EMNLP 2022

Xiaobao Wu, Anh Tuan Luu, Xinshuai Dong.

Note: This implementation does not include TSCTM with augmentations. For augmentations, see https://github.com/BobXWu/TSCTM.

fc11¶

fc12¶

fc21¶

mean_bn¶

decoder_bn¶

fcd1¶

topic_dist_quant¶

contrast_loss¶

get_beta()¶

encode(inputs)¶

decode(theta)¶

get_theta(inputs)¶

forward(inputs)¶

loss_function(recon_x, x)¶

class ECRTM(vocab_size, num_topics=50, en_units=200, dropout=0.0, pretrained_WE=None, embed_size=200, beta_temp=0.2, weight_loss_ECR=100.0, sinkhorn_alpha=20.0, sinkhorn_max_iter=1000)¶

Bases: torch.nn.Module

Effective Neural Topic Modeling with Embedding Clustering Regularization. ICML 2023

Xiaobao Wu, Xinshuai Dong, Thong Thanh Nguyen, Anh Tuan Luu.

num_topics = 50¶

beta_temp = 0.2¶

a¶

mu2¶

var2¶

fc11¶

fc12¶

fc21¶

fc22¶

fc1_dropout¶

theta_dropout¶

mean_bn¶

logvar_bn¶

decoder_bn¶

word_embeddings¶

topic_embeddings¶

ECR¶

get_beta()¶

reparameterize(mu, logvar)¶

encode(input)¶

get_theta(input)¶

compute_loss_KL(mu, logvar)¶

get_loss_ECR()¶

pairwise_euclidean_distance(x, y)¶

forward(input)¶

class NMTM(Map_en2cn, Map_cn2en, vocab_size_en, vocab_size_cn, num_topics=50, en_units=200, dropout=0.0, lam=0.8)¶

Bases: torch.nn.Module

Learning Multilingual Topics with Neural Variational Inference. NLPCC 2020.

Xiaobao Wu, Chunping Li, Yan Zhu, Yishu Miao.

num_topics = 50¶

lam = 0.8¶

Map_en2cn¶

Map_cn2en¶

a¶

mu2¶

var2¶

decoder_bn_en¶

decoder_bn_cn¶

fc11_en¶

fc11_cn¶

fc12¶

fc21¶

fc22¶

fc1_drop¶

z_drop¶

mean_bn¶

logvar_bn¶

phi_en¶

phi_cn¶

reparameterize(mu, logvar)¶

encode(x, lang)¶

get_theta(x, lang)¶

get_beta()¶

decode(theta, lang)¶

forward(x_en, x_cn)¶

loss_function(recon_x, x, mu, logvar)¶

class InfoCTM(trans_e2c, pretrain_word_embeddings_en, pretrain_word_embeddings_cn, vocab_size_en, vocab_size_cn, num_topics=50, en_units=200, dropout=0.0, temperature=0.2, pos_threshold=0.4, weight_MI=30.0)¶

Bases: torch.nn.Module

InfoCTM: A Mutual Information Maximization Perspective of Cross-lingual Topic Modeling. AAAI 2023

Xiaobao Wu, Xinshuai Dong, Thong Nguyen, Chaoqun Liu, Liangming Pan, Anh Tuan Luu

num_topics = 50¶

encoder_en¶

encoder_cn¶

a¶

mu2¶

var2¶

decoder_bn_en¶

decoder_bn_cn¶

phi_en¶

phi_cn¶

TAMI¶

get_beta()¶

get_theta(x, lang)¶

decode(theta, beta, lang)¶

forward(x_en, x_cn)¶

compute_loss_TM(recon_x, x, mu, logvar)¶

class DETM(vocab_size, num_times, train_size, train_time_wordfreq, num_topics=50, train_WE=True, pretrained_WE=None, en_units=800, eta_hidden_size=200, rho_size=300, enc_drop=0.0, eta_nlayers=3, eta_dropout=0.0, delta=0.005, theta_act='relu', device='cpu')¶

Bases: torch.nn.Module

The Dynamic Embedded Topic Model. 2019

Adji B. Dieng, Francisco J. R. Ruiz, David M. Blei

num_topics = 50¶

num_times¶

vocab_size¶

eta_hidden_size = 200¶

rho_size = 300¶

enc_drop = 0.0¶

eta_nlayers = 3¶

t_drop¶

eta_dropout = 0.0¶

delta = 0.005¶

train_WE = True¶

train_size¶

rnn_inp¶

device = 'cpu'¶

theta_act = 'relu'¶

mu_q_alpha¶

logsigma_q_alpha¶

q_theta¶

mu_q_theta¶

logsigma_q_theta¶

q_eta_map¶

q_eta¶

mu_q_eta¶

logsigma_q_eta¶

decoder_bn¶

get_activation(act)¶

reparameterize(mu, logvar)¶: Returns a sample from a Gaussian distribution via reparameterization.
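
As a rough illustration of Gaussian reparameterization, the sketch below draws `z = mu + sigma * eps` with `eps ~ N(0, 1)`. It uses plain Python lists rather than tensors and assumes that `logvar` holds log-variances, so `sigma = exp(0.5 * logvar)`. The optional `rng` argument is an addition for this example and is not part of the documented signature.

```python
import math
import random


def reparameterize(mu, logvar, rng=random):
    """Sample z = mu + sigma * eps elementwise, with eps ~ N(0, 1).

    Assumes logvar is the log-variance, so sigma = exp(0.5 * logvar).
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, logvar)
    ]


rng = random.Random(0)

# With a very negative log-variance, sigma is tiny and samples stay near mu.
near_mu = reparameterize([1.0, -2.0], [-40.0, -40.0], rng)

# With logvar = 0 (unit variance), the sample mean approaches mu.
draws = [reparameterize([3.0], [0.0], rng)[0] for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
```

Because the randomness enters only through `eps`, the sample is a deterministic function of `mu` and `logvar`, which is what allows gradients to flow through the sampling step in a tensor implementation.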

get_kl(q_mu, q_logsigma, p_mu=None, p_logsigma=None)¶: Returns KL( N(q_mu, q_logsigma) || N(p_mu, p_logsigma) ).
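
For reference, the KL divergence between two diagonal Gaussians has a closed form, sketched below in plain Python. Two assumptions are made here and are not confirmed by the text above: the `logsigma` arguments are treated as log-variances (renamed `logvar` in the sketch), and a missing prior is taken to be the standard normal. The function name `gaussian_kl` is hypothetical.

```python
import math


def gaussian_kl(q_mu, q_logvar, p_mu=None, p_logvar=None):
    """KL( N(q_mu, exp(q_logvar)) || N(p_mu, exp(p_logvar)) ), summed over dimensions.

    Assumption: a missing prior defaults to the standard normal N(0, 1).
    """
    if p_mu is None:
        p_mu = [0.0] * len(q_mu)
    if p_logvar is None:
        p_logvar = [0.0] * len(q_mu)

    total = 0.0
    for qm, qlv, pm, plv in zip(q_mu, q_logvar, p_mu, p_logvar):
        # 0.5 * [ log(var_p / var_q) + (var_q + (mu_q - mu_p)^2) / var_p - 1 ]
        total += 0.5 * (
            plv - qlv + (math.exp(qlv) + (qm - pm) ** 2) / math.exp(plv) - 1.0
        )
    return total


print(gaussian_kl([1.0], [0.0]))  # unit-variance Gaussian shifted by 1 vs. N(0, 1)
```

If the arguments actually hold log standard deviations, they would need to be doubled before being passed to this sketch.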

get_alpha()¶

get_eta(rnn_inp)¶

get_theta(bows, times, eta=None)¶: Returns the topic proportions.

property word_embeddings¶

property topic_embeddings¶

get_beta(alpha=None)¶: Returns the topic matrix beta of shape T x K x V

get_NLL(theta, beta, bows)¶

forward(bows, times)¶

init_hidden()¶: Initializes the first hidden state of the RNN used as inference network for eta.

class CFDTM(vocab_size, train_time_wordfreq, num_times, pretrained_WE=None, num_topics=50, en_units=100, temperature=0.1, beta_temp=1.0, weight_neg=10000000.0, weight_pos=10.0, weight_UWE=1000.0, neg_topk=15, dropout=0.0, embed_size=200)¶

Bases: torch.nn.Module

Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion. ACL 2024 Findings

Xiaobao Wu, Xinshuai Dong, Liangming Pan, Thong Nguyen, Anh Tuan Luu.

num_topics = 50¶

beta_temp = 1.0¶

train_time_wordfreq¶

encoder¶

a¶

mu2¶

var2¶

decoder_bn¶

topic_embeddings¶

ETC¶

UWE¶

get_beta()¶

pairwise_euclidean_dist(x, y)¶

get_theta(x, times=None)¶

get_KL(mu, logvar)¶

get_NLL(theta, beta, x, recon_x=None)¶

decode(theta, beta)¶

forward(x, times)¶

class SawETM(vocab_size, num_topics_list, device='cpu', embed_size=100, hidden_size=256, pretrained_WE=None)¶

Bases: torch.nn.Module

Sawtooth Factorial Topic Embeddings Guided Gamma Belief Network. ICML 2021.

Zhibin Duan, Dongsheng Wang, Bo Chen, Chaojie Wang, Wenchao Chen, Yewen Li, Jie Ren, Mingyuan Zhou.

https://github.com/ZhibinDuan/SawETM

device = 'cpu'¶

gam_prior¶

real_min¶

theta_max¶

wei_shape_min¶

wei_shape_max¶

num_topics_list¶

num_hiddens_list¶

num_layers¶

alpha¶

h_encoder¶

q_theta¶

log_max(x)¶

reparameterize(shape, scale, sample_num=50)¶: Returns a sample from a Weibull distribution via reparameterization.
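
A Weibull sample can be written as a deterministic transform of uniform noise by inverting the CDF: `x = scale * (-log(1 - u)) ** (1 / shape)` with `u ~ Uniform(0, 1)`. The sketch below shows a single scalar draw in plain Python. The helper name `sample_weibull` and the `rng` argument are hypothetical, and the sketch does not model the `sample_num` parameter of the documented method.

```python
import math
import random


def sample_weibull(shape, scale, rng=random):
    """Draw one Weibull(shape, scale) sample via the inverse CDF."""
    u = rng.random()  # uniform in [0, 1), so 1 - u lies in (0, 1]
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


rng = random.Random(0)
draws = [sample_weibull(2.0, 1.0, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)

# The analytic mean of Weibull(shape, scale) is scale * Gamma(1 + 1 / shape).
expected_mean = math.gamma(1.0 + 1.0 / 2.0)
```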

kl_weibull_gamma(wei_shape, wei_scale, gam_shape, gam_scale)¶: Returns the Kullback-Leibler divergence between a Weibull distribution and a Gamma distribution.
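
The KL divergence from a Weibull to a Gamma distribution has a closed form, sketched below for scalars. The sketch parameterizes the Gamma by shape and rate. That choice is an assumption for this example (the fourth argument is named `gam_rate` here, unlike `gam_scale` in the signature above), and it is not confirmed which convention the documented method uses.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def kl_weibull_gamma(wei_shape, wei_scale, gam_shape, gam_rate):
    """KL( Weibull(k, lam) || Gamma(a, rate=b) ) for scalar parameters."""
    k, lam, a, b = wei_shape, wei_scale, gam_shape, gam_rate
    return (
        EULER_GAMMA * a / k
        - a * math.log(lam)
        + math.log(k)
        + b * lam * math.gamma(1.0 + 1.0 / k)
        - EULER_GAMMA
        - 1.0
        - a * math.log(b)
        + math.lgamma(a)
    )


# Weibull(1, lam) and Gamma(1, rate=1/lam) are the same exponential
# distribution, so the divergence between them vanishes.
same = kl_weibull_gamma(1.0, 2.0, 1.0, 0.5)
different = kl_weibull_gamma(2.0, 1.0, 1.0, 1.0)
```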

get_nll(x, x_reconstruct)¶: Returns the negative Poisson log-likelihood of observed count data.
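
As a point of reference, the negative log-likelihood of counts under a Poisson model with rates `x_reconstruct` can be sketched as follows. This is a plain-Python illustration, not the method's implementation: the function name `poisson_nll`, the `eps` stabilizer, and summing over all entries are assumptions.

```python
import math


def poisson_nll(x, x_reconstruct, eps=1e-10):
    """Sum over entries of -log Poisson(count | rate)."""
    total = 0.0
    for count, rate in zip(x, x_reconstruct):
        # -log p(count | rate) = rate - count * log(rate) + log(count!)
        total += rate - count * math.log(rate + eps) + math.lgamma(count + 1.0)
    return total


nll_zero = poisson_nll([0.0], [1.0])  # -log(e^-1)
nll_two = poisson_nll([2.0], [2.0])   # -log(2^2 * e^-2 / 2!)
```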

property bottom_word_embeddings¶

property topic_embeddings_list¶

get_phis()¶: Returns the factor loading matrix by utilizing sawtooth connection.

get_beta()¶

get_phi_list()¶

get_theta(x)¶

forward(x)¶: Forward pass: computes the KL loss and the data likelihood.

class HyperMiner(vocab_size, num_topics_list, device='cpu', manifold='PoincareBall', clip_r=None, curvature=-0.01, embed_size=50, hidden_size=300, pretrained_WE=None)¶

Bases: topmost.models.hierarchical.SawETM.SawETM.SawETM

HyperMiner: Topic Taxonomy Mining with Hyperbolic Embedding. NeurIPS 2022.

Yishi Xu, Dongsheng Wang, Bo Chen, Ruiying Lu, Zhibin Duan, Mingyuan Zhou.

https://github.com/NoviceStone/HyperMiner

manifold¶

clip_r = None¶

feat_clip(x)¶

property bottom_word_embeddings¶

property topic_embeddings_list¶

get_phi()¶: Returns the factor loading matrix by utilizing sawtooth connection.

get_beta()¶

get_phi_list()¶

get_theta(x)¶

forward(x)¶: Forward pass: computes the KL loss and the data likelihood.

class TraCo(vocab_size, num_topics_list=[10, 50, 200], en_units=300, dropout=0.0, embed_size=200, bias_topk=20, bias_p=5.0, beta_temp=0.1, weight_loss_TPD=20.0, sinkhorn_alpha=20.0, sinkhorn_max_iter=1000)¶

Bases: torch.nn.Module

On the Affinity, Rationality, and Diversity of Hierarchical Topic Modeling. AAAI 2024

Xiaobao Wu, Fengjun Pan, Thong Nguyen, Yichao Feng, Chaoqun Liu, Cong-Duy Nguyen, Anh Tuan Luu.

num_topics_list = [10, 50, 200]¶

weight_loss_TPD = 20.0¶

beta_temp = 0.1¶

num_layers¶

bottom_word_embeddings¶

topic_embeddings_list¶

TPD¶

CDDecoder¶

encoder¶

get_beta()¶

get_phi_list()¶

get_theta(input_bow)¶

forward(input_bow)¶

compute_loss_KL(mu, logvar, mu_prior=None)¶