topmost¶
Package Contents¶
- ProdLDA: Autoencoding Variational Inference For Topic Models. ICLR 2017.
- DecTM: Discovering Topics in Long-tailed Corpora with Causal Intervention. ACL 2021 Findings.
- ETM: Topic Modeling in Embedding Spaces. TACL 2020.
- NSTM: Neural Topic Model via Optimal Transport. ICLR 2021.
- TSCTM: Mitigating Data Sparsity for Short Text Topic Modeling by Topic-Semantic Contrastive Learning. EMNLP 2022.
- ECRTM: Effective Neural Topic Modeling with Embedding Clustering Regularization. ICML 2023.
- NMTM: Learning Multilingual Topics with Neural Variational Inference. NLPCC 2020.
- InfoCTM: A Mutual Information Maximization Perspective of Cross-lingual Topic Modeling. AAAI 2023.
- DETM: The Dynamic Embedded Topic Model. 2019.
- CFDTM: Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion. ACL 2024 Findings.
- SawETM: Sawtooth Factorial Topic Embeddings Guided Gamma Belief Network. ICML 2021.
- HyperMiner: Topic Taxonomy Mining with Hyperbolic Embedding. NeurIPS 2022.
- TraCo: On the Affinity, Rationality, and Diversity of Hierarchical Topic Modeling. AAAI 2024.
- class Preprocess(tokenizer=None, test_sample_size=None, test_p=0.2, stopwords='English', min_doc_count=0, max_doc_freq=1.0, keep_num=False, keep_alphanum=False, strip_html=False, no_lower=False, min_length=3, min_term=0, vocab_size=None, seed=42, verbose=True)¶
- Parameters:
test_sample_size – Size of the test set.
test_p – Proportion of the data assigned to the test set; the train set is sampled according to the resulting test set size.
stopwords – List of stopwords to exclude.
min_doc_count – Exclude words that occur in fewer than this number of documents.
max_doc_freq – Exclude words that occur in more than this proportion of documents.
keep_num – Keep tokens made of only numbers.
keep_alphanum – Keep tokens made of a mixture of letters and numbers.
strip_html – Strip HTML tags.
no_lower – Do not lowercase text.
min_length – Minimum token length.
min_term – Minimum number of terms.
vocab_size – Size of the vocabulary (the most common words in the union of the train and test sets, after the exclusions above).
seed – Random integer seed (only relevant for choosing the test set).
- test_sample_size = None¶
- min_doc_count = 0¶
- max_doc_freq = 1.0¶
- min_term = 0¶
- test_p = 0.2¶
- vocab_size = None¶
- seed = 42¶
- parse(texts, vocab)¶
- preprocess_jsonlist(dataset_dir, label_name=None, pretrained_WE=False)¶
- convert_labels(train_labels, test_labels)¶
- preprocess(raw_train_texts, train_labels=None, raw_test_texts=None, test_labels=None, pretrained_WE=False)¶
- save(output_dir, vocab, train_texts, train_bow, word_embeddings=None, train_labels=None, test_texts=None, test_bow=None, test_labels=None)¶
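The vocabulary thresholds documented for Preprocess can be illustrated with a small standalone sketch. It is not the library's implementation: the helper `build_vocab` is hypothetical, and ranking the surviving words by document frequency when `vocab_size` is set is an assumption, since the reference only says "most common".

```python
from collections import Counter


def build_vocab(tokenized_docs, min_doc_count=0, max_doc_freq=1.0,
                min_length=3, vocab_size=None):
    """Hypothetical helper: select a vocabulary with document-frequency thresholds."""
    num_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each token.
    doc_counts = Counter()
    for tokens in tokenized_docs:
        doc_counts.update(set(tokens))

    kept = [
        word for word, count in doc_counts.items()
        if len(word) >= min_length            # min_length
        and count >= min_doc_count            # min_doc_count
        and count / num_docs <= max_doc_freq  # max_doc_freq
    ]
    # Assumption: "most common" is read as highest document frequency.
    kept.sort(key=lambda w: (-doc_counts[w], w))
    if vocab_size is not None:
        kept = kept[:vocab_size]
    return sorted(kept)


docs = [
    ["topic", "model", "the"],
    ["topic", "word", "the"],
    ["neural", "topic", "the", "ab"],
    ["neural", "model", "the"],
]
vocab = build_vocab(docs, min_doc_count=2, max_doc_freq=0.75, min_length=3)
print(vocab)
```

Here "the" is dropped because it appears in every document, "word" because it appears in only one, and "ab" because it is shorter than `min_length`.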
- class BasicDataset(dataset_dir, batch_size=200, read_labels=False, as_tensor=True, contextual_embed=False, doc_embed_model='all-MiniLM-L6-v2', device='cpu')¶
- vocab_size = 0¶
- load_data(path, read_labels)¶
- class RawDataset(docs, preprocess=None, batch_size=200, device='cpu', as_tensor=True, contextual_embed=False, pretrained_WE=False, doc_embed_model='all-MiniLM-L6-v2', embed_model_device=None, verbose=False)¶
- train_data¶
- train_texts¶
- vocab¶
- vocab_size¶
- class CrosslingualDataset(dataset_dir, lang1, lang2, dict_path, device='cpu', batch_size=200, as_tensor=True)¶
- batch_size = 200¶
- train_size_en = 0¶
- train_size_cn = 0¶
- vocab_size_en = 0¶
- vocab_size_cn = 0¶
- pretrained_WE_en¶
- pretrained_WE_cn¶
- Map_en2cn¶
- Map_cn2en¶
- move_to_device(bow, device)¶
- read_data(dataset_dir, lang)¶
- parse_dictionary(dict_path)¶
- get_Map(trans_matrix, bow)¶
- class DynamicDataset(dataset_dir, batch_size=200, read_labels=False, device='cpu', as_tensor=True)¶
- vocab_size = 0¶
- train_size¶
- num_times¶
- train_time_wordfreq¶
- load_data(path, read_labels)¶
- get_time_wordfreq(bow, times)¶
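The reference does not say how `get_time_wordfreq(bow, times)` aggregates counts. One plausible reading, sketched below under that assumption, is the average bag-of-words vector of the documents in each time slice. The function name `time_word_frequency` and the explicit `num_times` argument are hypothetical.

```python
def time_word_frequency(bow, times, num_times):
    """Assumed behavior: mean word counts of the documents in each time slice."""
    vocab_size = len(bow[0]) if bow else 0
    totals = [[0.0] * vocab_size for _ in range(num_times)]
    doc_counts = [0] * num_times
    for row, t in zip(bow, times):
        doc_counts[t] += 1
        for j, count in enumerate(row):
            totals[t][j] += count
    # Divide by the number of documents in the slice; empty slices stay at zero.
    return [
        [value / doc_counts[t] if doc_counts[t] else 0.0 for value in totals[t]]
        for t in range(num_times)
    ]


bow = [[2, 0, 1], [0, 1, 1], [4, 0, 0]]  # three documents, vocabulary of three words
times = [0, 0, 1]                         # time-slice index of each document
freq = time_word_frequency(bow, times, num_times=2)
print(freq)
```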
- download_dataset(dataset_name, cache_path='~/.topmost')¶
- class BasicTrainer(model, dataset, num_top_words=15, epochs=200, learning_rate=0.002, batch_size=200, lr_scheduler=None, lr_step_size=125, log_interval=5, verbose=False)¶
- model¶
- dataset¶
- num_top_words = 15¶
- epochs = 200¶
- learning_rate = 0.002¶
- batch_size = 200¶
- lr_scheduler = None¶
- lr_step_size = 125¶
- log_interval = 5¶
- verbose = False¶
- make_optimizer()¶
- make_lr_scheduler(optimizer)¶
- train()¶
- test(bow)¶
- get_beta()¶
- get_top_words(num_top_words=None)¶
- export_theta()¶
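The trainers expose `get_beta()` and `get_top_words(num_top_words=None)`. As a rough illustration of how the two relate, the sketch below ranks vocabulary entries by their weight in each row of a topic-word matrix. The standalone function `top_words` and its space-joined return format are assumptions, not the library's code.

```python
def top_words(beta, vocab, num_top_words=15):
    """Hypothetical helper: highest-weighted words per topic, joined by spaces."""
    topics = []
    for row in beta:
        # Indices of vocabulary entries, sorted by descending topic-word weight.
        ranked = sorted(range(len(vocab)), key=lambda j: row[j], reverse=True)
        topics.append(" ".join(vocab[j] for j in ranked[:num_top_words]))
    return topics


vocab = ["apple", "ball", "court", "fruit", "game"]
beta = [
    [0.40, 0.05, 0.05, 0.45, 0.05],  # topic 0
    [0.05, 0.35, 0.25, 0.05, 0.30],  # topic 1
]
words = top_words(beta, vocab, num_top_words=2)
print(words)
```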
- class BERTopicTrainer(dataset, num_topics=50, num_top_words=15)¶
- model¶
- dataset¶
- train()¶
- test(texts)¶
- get_beta()¶
- get_top_words()¶
- export_theta()¶
- class FASTopicTrainer(dataset, num_topics=50, num_top_words=15, preprocess=None, epochs=200, DT_alpha=3.0, TW_alpha=2.0, theta_temp=1.0, verbose=False)¶
- dataset¶
- num_top_words = 15¶
- model¶
- epochs = 200¶
- train()¶
- test(texts)¶
- get_beta()¶
- get_top_words(num_top_words=None)¶
- export_theta()¶
- class LDAGensimTrainer(dataset, num_topics=50, num_top_words=15, max_iter=1, alpha='symmetric', eta=None, verbose=False)¶
- dataset¶
- num_topics = 50¶
- vocab_size¶
- max_iter = 1¶
- alpha = 'symmetric'¶
- eta = None¶
- verbose = False¶
- num_top_words = 15¶
- train()¶
- test(bow)¶
- get_beta()¶
- get_top_words(num_top_words=None)¶
- export_theta()¶
- class LDASklearnTrainer(model, dataset, num_top_words=15, verbose=False)¶
- model¶
- dataset¶
- num_top_words = 15¶
- verbose = False¶
- train()¶
- test(bow)¶
- get_beta()¶
- get_top_words(num_top_words=None)¶
- export_theta()¶
- class NMFGensimTrainer(dataset, num_topics=50, num_top_words=15, max_iter=1)¶
- dataset¶
- num_topics = 50¶
- num_top_words = 15¶
- vocab_size¶
- max_iter = 1¶
- train()¶
- test(bow)¶
- get_beta()¶
- get_top_words(num_top_words=None)¶
- export_theta()¶
- class NMFSklearnTrainer(model, dataset, num_top_words=15)¶
- model¶
- dataset¶
- num_top_words = 15¶
- train()¶
- test(bow)¶
- get_beta()¶
- get_top_words(num_top_words=None)¶
- export_theta()¶
- class CrosslingualTrainer(model, dataset, num_top_words=15, epochs=500, learning_rate=0.002, batch_size=200, lr_scheduler=None, lr_step_size=125, log_interval=5, verbose=False)¶
- model¶
- dataset¶
- num_top_words = 15¶
- epochs = 500¶
- learning_rate = 0.002¶
- batch_size = 200¶
- lr_scheduler = None¶
- lr_step_size = 125¶
- log_interval = 5¶
- make_optimizer()¶
- make_lr_scheduler(optimizer)¶
- train()¶
- test(bow_en, bow_cn)¶
- infer_theta(bow, lang)¶
- get_beta()¶
- get_top_words(num_top_words=None)¶
- export_theta()¶
- class DynamicTrainer(model, dataset, num_top_words=15, epochs=200, learning_rate=0.002, batch_size=200, lr_scheduler=None, lr_step_size=125, log_interval=5, verbose=False)¶
- model¶
- dataset¶
- num_top_words = 15¶
- epochs = 200¶
- learning_rate = 0.002¶
- batch_size = 200¶
- lr_scheduler = None¶
- lr_step_size = 125¶
- log_interval = 5¶
- verbose = False¶
- make_optimizer()¶
- make_lr_scheduler(optimizer)¶
- train()¶
- test(bow, times)¶
- get_beta()¶
- get_top_words(num_top_words=None)¶
- export_theta()¶
- class DTMTrainer(dataset, num_topics=50, num_top_words=15, alphas=0.01, chain_variance=0.005, passes=10, lda_inference_max_iter=25, em_min_iter=6, em_max_iter=20, verbose=False)¶
- dataset¶
- vocab_size¶
- num_topics = 50¶
- num_top_words = 15¶
- alphas = 0.01¶
- chain_variance = 0.005¶
- passes = 10¶
- lda_inference_max_iter = 25¶
- em_min_iter = 6¶
- em_max_iter = 20¶
- verbose = False¶
- train()¶
- test(bow)¶
- get_theta()¶
- get_beta()¶
- get_top_words(num_top_words=None)¶
- export_theta()¶
- class HierarchicalTrainer(model, dataset, num_top_words=15, epochs=200, learning_rate=0.002, batch_size=200, lr_scheduler=None, lr_step_size=125, log_interval=5, verbose=False)¶
- model¶
- dataset¶
- num_top_words = 15¶
- epochs = 200¶
- learning_rate = 0.002¶
- batch_size = 200¶
- lr_scheduler = None¶
- lr_step_size = 125¶
- log_interval = 5¶
- verbose = False¶
- make_optimizer()¶
- make_lr_scheduler(optimizer)¶
- train()¶
- test(bow)¶
- get_phi()¶
- get_beta()¶
- get_top_words(num_top_words=None, annotation=False)¶
- export_theta()¶
- class HDPGensimTrainer(dataset, num_top_words=15, max_chunks=None, max_time=None, chunksize=256, kappa=1.0, tau=64.0, K=15, T=150, alpha=1, gamma=1, eta=0.01, scale=1.0, var_converge=0.0001, verbose=False)¶
- dataset¶
- num_top_words = 15¶
- vocab_size¶
- max_chunks = None¶
- max_time = None¶
- chunksize = 256¶
- kappa = 1.0¶
- tau = 64.0¶
- K = 15¶
- T = 150¶
- alpha = 1¶
- gamma = 1¶
- eta = 0.01¶
- scale = 1.0¶
- var_converge = 0.0001¶
- verbose = False¶
- train()¶
- test(bow)¶
- get_beta()¶
- get_top_words(num_top_words=None)¶
- export_theta()¶
- class ProdLDA(vocab_size, num_topics=50, en_units=200, dropout=0.4)¶
Bases: torch.nn.Module
Autoencoding Variational Inference For Topic Models. ICLR 2017
Akash Srivastava, Charles Sutton.
- num_topics = 50¶
- a¶
- mu2¶
- var2¶
- fc11¶
- fc12¶
- fc21¶
- fc22¶
- mean_bn¶
- logvar_bn¶
- decoder_bn¶
- fc1_drop¶
- theta_drop¶
- fcd1¶
- get_beta()¶
- get_theta(x)¶
- reparameterize(mu, logvar)¶
- encode(x)¶
- decode(theta)¶
- forward(x)¶
- loss_function(x, recon_x, mu, logvar)¶
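Several models on this page pair `reparameterize(mu, logvar)` with `get_theta(x)`. The sketch below shows the usual Gaussian reparameterization trick followed by a softmax, written in plain Python rather than PyTorch. It assumes `logvar` is the log-variance and that topic proportions come from a softmax over the sample; neither detail is stated in this reference.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), elementwise."""
    std = [math.exp(0.5 * lv) for lv in logvar]  # sigma = exp(logvar / 2)
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, std)]


def softmax(z):
    """Map an unconstrained vector to proportions that sum to one."""
    shift = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
mu = [0.5, -1.0, 2.0]
logvar = [-2.0, -2.0, -2.0]
z = reparameterize(mu, logvar, rng)
theta = softmax(z)  # assumed form of the topic proportions
print(theta)
```

Because the noise enters only through `eps`, the sample stays differentiable with respect to `mu` and `logvar` when the same computation is written with tensors.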
- class CombinedTM(vocab_size, contextual_embed_size, num_topics=50, en_units=200, dropout=0.4)¶
Bases: torch.nn.Module
- vocab_size¶
- num_topics = 50¶
- a¶
- mu2¶
- var2¶
- fc_contextual¶
- fc11¶
- fc12¶
- fc21¶
- fc22¶
- mean_bn¶
- logvar_bn¶
- decoder_bn¶
- fc1_drop¶
- theta_drop¶
- fcd1¶
- get_beta()¶
- get_theta(x)¶
- reparameterize(mu, logvar)¶
- encode(x)¶
- decode(theta)¶
- forward(x)¶
- loss_function(x, recon_x, mu, logvar)¶
- class DecTM(vocab_size, num_topics=50, en_units=200, dropout=0.4)¶
Bases: torch.nn.Module
Discovering Topics in Long-tailed Corpora with Causal Intervention. ACL 2021 Findings.
Xiaobao Wu, Chunping Li, Yishu Miao.
- num_topics = 50¶
- a¶
- mu2¶
- var2¶
- fc11¶
- fc12¶
- fc21¶
- fc22¶
- mean_bn¶
- logvar_bn¶
- decoder_bn¶
- fc1_drop¶
- theta_drop¶
- beta¶
- get_beta()¶
- get_theta(x)¶
- reparameterize(mu, logvar)¶
- encode(x)¶
- decode(theta)¶
- forward(x)¶
- loss_function(x, recon_x, mu, logvar)¶
- class ETM(vocab_size, embed_size=200, num_topics=50, en_units=800, dropout=0.0, pretrained_WE=None, train_WE=False)¶
Bases: torch.nn.Module
Topic Modeling in Embedding Spaces. TACL 2020
Adji B. Dieng, Francisco J. R. Ruiz, David M. Blei.
- topic_embeddings¶
- encoder1¶
- fc21¶
- fc22¶
- reparameterize(mu, logvar)¶
- encode(x)¶
- get_theta(x)¶
- get_beta()¶
- forward(x, avg_loss=True)¶
- loss_function(x, recon_x, mu, logvar, avg_loss=True)¶
- class NSTM(vocab_size, num_topics=50, en_units=200, dropout=0.25, pretrained_WE=None, train_WE=True, embed_size=200, recon_loss_weight=0.07, sinkhorn_alpha=20)¶
Bases: torch.nn.Module
Neural Topic Model via Optimal Transport. ICLR 2021
He Zhao, Dinh Phung, Viet Huynh, Trung Le, Wray Buntine.
- recon_loss_weight = 0.07¶
- sinkhorn_alpha = 20¶
- e1¶
- e2¶
- e_dropout¶
- mean_bn¶
- topic_embeddings¶
- get_beta()¶
- get_theta(input)¶
- forward(input)¶
- class TSCTM(vocab_size, num_topics=50, en_units=200, temperature=0.5, weight_contrast=1.0)¶
Bases: torch.nn.Module
Mitigating Data Sparsity for Short Text Topic Modeling by Topic-Semantic Contrastive Learning. EMNLP 2022
Xiaobao Wu, Anh Tuan Luu, Xinshuai Dong.
Note: This implementation does not include TSCTM with augmentations. For augmentations, see https://github.com/BobXWu/TSCTM.
- fc11¶
- fc12¶
- fc21¶
- mean_bn¶
- decoder_bn¶
- fcd1¶
- topic_dist_quant¶
- contrast_loss¶
- get_beta()¶
- encode(inputs)¶
- decode(theta)¶
- get_theta(inputs)¶
- forward(inputs)¶
- loss_function(recon_x, x)¶
- class ECRTM(vocab_size, num_topics=50, en_units=200, dropout=0.0, pretrained_WE=None, embed_size=200, beta_temp=0.2, weight_loss_ECR=100.0, sinkhorn_alpha=20.0, sinkhorn_max_iter=1000)¶
Bases: torch.nn.Module
Effective Neural Topic Modeling with Embedding Clustering Regularization. ICML 2023
Xiaobao Wu, Xinshuai Dong, Thong Thanh Nguyen, Anh Tuan Luu.
- num_topics = 50¶
- beta_temp = 0.2¶
- a¶
- mu2¶
- var2¶
- fc11¶
- fc12¶
- fc21¶
- fc22¶
- fc1_dropout¶
- theta_dropout¶
- mean_bn¶
- logvar_bn¶
- decoder_bn¶
- word_embeddings¶
- topic_embeddings¶
- ECR¶
- get_beta()¶
- reparameterize(mu, logvar)¶
- encode(input)¶
- get_theta(input)¶
- compute_loss_KL(mu, logvar)¶
- get_loss_ECR()¶
- pairwise_euclidean_distance(x, y)¶
- forward(input)¶
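ECRTM lists `pairwise_euclidean_distance(x, y)` alongside `word_embeddings` and `topic_embeddings`. A minimal sketch of a pairwise distance matrix between two sets of vectors follows. It returns squared distances, which is an assumption, since the reference does not say whether the library takes a square root.

```python
def pairwise_squared_euclidean(x, y):
    """Return D with D[i][j] = ||x[i] - y[j]||^2 for two lists of vectors."""
    return [
        [sum((a - b) ** 2 for a, b in zip(xi, yj)) for yj in y]
        for xi in x
    ]


topics = [[0.0, 0.0], [1.0, 1.0]]             # two toy topic embeddings
words = [[0.0, 1.0], [1.0, 1.0], [3.0, 4.0]]  # three toy word embeddings
cost = pairwise_squared_euclidean(topics, words)
print(cost)
```

A matrix of this shape (topics by words) is the kind of transport cost that a Sinkhorn-style regularizer would consume.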
- class NMTM(Map_en2cn, Map_cn2en, vocab_size_en, vocab_size_cn, num_topics=50, en_units=200, dropout=0.0, lam=0.8)¶
Bases: torch.nn.Module
Learning Multilingual Topics with Neural Variational Inference. NLPCC 2020.
Xiaobao Wu, Chunping Li, Yan Zhu, Yishu Miao.
- num_topics = 50¶
- lam = 0.8¶
- Map_en2cn¶
- Map_cn2en¶
- a¶
- mu2¶
- var2¶
- decoder_bn_en¶
- decoder_bn_cn¶
- fc11_en¶
- fc11_cn¶
- fc12¶
- fc21¶
- fc22¶
- fc1_drop¶
- z_drop¶
- mean_bn¶
- logvar_bn¶
- phi_en¶
- phi_cn¶
- reparameterize(mu, logvar)¶
- encode(x, lang)¶
- get_theta(x, lang)¶
- get_beta()¶
- decode(theta, lang)¶
- forward(x_en, x_cn)¶
- loss_function(recon_x, x, mu, logvar)¶
- class InfoCTM(trans_e2c, pretrain_word_embeddings_en, pretrain_word_embeddings_cn, vocab_size_en, vocab_size_cn, num_topics=50, en_units=200, dropout=0.0, temperature=0.2, pos_threshold=0.4, weight_MI=30.0)¶
Bases: torch.nn.Module
InfoCTM: A Mutual Information Maximization Perspective of Cross-lingual Topic Modeling. AAAI 2023
Xiaobao Wu, Xinshuai Dong, Thong Nguyen, Chaoqun Liu, Liangming Pan, Anh Tuan Luu
- num_topics = 50¶
- encoder_en¶
- encoder_cn¶
- a¶
- mu2¶
- var2¶
- decoder_bn_en¶
- decoder_bn_cn¶
- phi_en¶
- phi_cn¶
- TAMI¶
- get_beta()¶
- get_theta(x, lang)¶
- decode(theta, beta, lang)¶
- forward(x_en, x_cn)¶
- compute_loss_TM(recon_x, x, mu, logvar)¶
- class DETM(vocab_size, num_times, train_size, train_time_wordfreq, num_topics=50, train_WE=True, pretrained_WE=None, en_units=800, eta_hidden_size=200, rho_size=300, enc_drop=0.0, eta_nlayers=3, eta_dropout=0.0, delta=0.005, theta_act='relu', device='cpu')¶
Bases: torch.nn.Module
The Dynamic Embedded Topic Model. 2019
Adji B. Dieng, Francisco J. R. Ruiz, David M. Blei
- num_topics = 50¶
- num_times¶
- vocab_size¶
- rho_size = 300¶
- enc_drop = 0.0¶
- eta_nlayers = 3¶
- t_drop¶
- eta_dropout = 0.0¶
- delta = 0.005¶
- train_WE = True¶
- train_size¶
- rnn_inp¶
- device = 'cpu'¶
- theta_act = 'relu'¶
- mu_q_alpha¶
- logsigma_q_alpha¶
- q_theta¶
- mu_q_theta¶
- logsigma_q_theta¶
- q_eta_map¶
- q_eta¶
- mu_q_eta¶
- logsigma_q_eta¶
- decoder_bn¶
- get_activation(act)¶
- reparameterize(mu, logvar)¶
Returns a sample from a Gaussian distribution via reparameterization.
- get_kl(q_mu, q_logsigma, p_mu=None, p_logsigma=None)¶
Returns KL( N(q_mu, q_logsigma) || N(p_mu, p_logsigma) ).
- get_alpha()¶
- get_eta(rnn_inp)¶
- get_theta(bows, times, eta=None)¶
Returns the topic proportions.
- property word_embeddings¶
- property topic_embeddings¶
- get_beta(alpha=None)¶
Returns the topic matrix beta of shape T x K x V.
- get_NLL(theta, beta, bows)¶
- forward(bows, times)¶
Initializes the first hidden state of the RNN used as inference network for eta.
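The `get_kl` entry above describes a KL divergence between two Gaussians with optional prior parameters. The plain-Python sketch below computes that quantity for diagonal Gaussians. Two assumptions are made and are not confirmed by this reference: the second argument of each pair is treated as a log-variance, and an omitted prior is taken to be a standard normal.

```python
import math


def gaussian_kl(q_mu, q_logvar, p_mu=None, p_logvar=None):
    """KL( N(q_mu, exp(q_logvar)) || N(p_mu, exp(p_logvar)) ), summed over dimensions."""
    if p_mu is None or p_logvar is None:
        # Assumption: a missing prior means N(0, I).
        p_mu = [0.0] * len(q_mu)
        p_logvar = [0.0] * len(q_mu)
    total = 0.0
    for qm, ql, pm, pl in zip(q_mu, q_logvar, p_mu, p_logvar):
        variance_ratio = math.exp(ql - pl)
        mean_term = (qm - pm) ** 2 / math.exp(pl)
        total += 0.5 * (variance_ratio + mean_term - 1.0 + pl - ql)
    return total


print(gaussian_kl([1.0], [0.0]))                # shifted mean against N(0, 1)
print(gaussian_kl([1.0], [0.0], [1.0], [0.0]))  # identical distributions
```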
- class CFDTM(vocab_size, train_time_wordfreq, num_times, pretrained_WE=None, num_topics=50, en_units=100, temperature=0.1, beta_temp=1.0, weight_neg=10000000.0, weight_pos=10.0, weight_UWE=1000.0, neg_topk=15, dropout=0.0, embed_size=200)¶
Bases: torch.nn.Module
Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion. ACL 2024 Findings
Xiaobao Wu, Xinshuai Dong, Liangming Pan, Thong Nguyen, Anh Tuan Luu.
- num_topics = 50¶
- beta_temp = 1.0¶
- train_time_wordfreq¶
- encoder¶
- a¶
- mu2¶
- var2¶
- decoder_bn¶
- topic_embeddings¶
- ETC¶
- UWE¶
- get_beta()¶
- pairwise_euclidean_dist(x, y)¶
- get_theta(x, times=None)¶
- get_KL(mu, logvar)¶
- get_NLL(theta, beta, x, recon_x=None)¶
- decode(theta, beta)¶
- forward(x, times)¶
- class SawETM(vocab_size, num_topics_list, device='cpu', embed_size=100, hidden_size=256, pretrained_WE=None)¶
Bases: torch.nn.Module
Sawtooth Factorial Topic Embeddings Guided Gamma Belief Network. ICML 2021.
Zhibin Duan, Dongsheng Wang, Bo Chen, Chaojie Wang, Wenchao Chen, Yewen Li, Jie Ren, Mingyuan Zhou.
https://github.com/ZhibinDuan/SawETM
- device = 'cpu'¶
- gam_prior¶
- real_min¶
- theta_max¶
- wei_shape_min¶
- wei_shape_max¶
- num_topics_list¶
- num_layers¶
- alpha¶
- h_encoder¶
- q_theta¶
- log_max(x)¶
- reparameterize(shape, scale, sample_num=50)¶
Returns a sample from a Weibull distribution via reparameterization.
- kl_weibull_gamma(wei_shape, wei_scale, gam_shape, gam_scale)¶
Returns the Kullback-Leibler divergence between a Weibull distribution and a Gamma distribution.
- get_nll(x, x_reconstruct)¶
Returns the negative Poisson likelihood of observational count data.
- property bottom_word_embeddings¶
- property topic_embeddings_list¶
- get_phis()¶
Returns the factor loading matrix by utilizing sawtooth connection.
- get_beta()¶
- get_phi_list()¶
- get_theta(x)¶
- forward(x)¶
Forward pass: compute the kl loss and data likelihood.
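SawETM's `reparameterize(shape, scale, sample_num=50)` is documented as sampling from a Weibull distribution via reparameterization. The sketch below shows the standard inverse-CDF form, x = scale * (-ln(1 - u))^(1/shape) with u uniform on [0, 1). Averaging over `sample_num` draws is an assumption about what that parameter does, and the function name `weibull_sample` is hypothetical.

```python
import math
import random


def weibull_sample(shape, scale, rng, sample_num=50):
    """Average of sample_num reparameterized Weibull draws (averaging is assumed)."""
    draws = []
    for _ in range(sample_num):
        u = rng.random()  # uniform noise in [0, 1), independent of shape and scale
        draws.append(scale * (-math.log(1.0 - u)) ** (1.0 / shape))
    return sum(draws) / sample_num


rng = random.Random(0)
estimate = weibull_sample(shape=2.0, scale=1.0, rng=rng, sample_num=20000)
expected_mean = math.gamma(1.0 + 1.0 / 2.0)  # Weibull mean: scale * Gamma(1 + 1/shape)
print(estimate, expected_mean)
```

As with the Gaussian case, the randomness is isolated in `u`, so gradients can flow to `shape` and `scale` when the expression is evaluated on tensors.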
- class HyperMiner(vocab_size, num_topics_list, device='cpu', manifold='PoincareBall', clip_r=None, curvature=-0.01, embed_size=50, hidden_size=300, pretrained_WE=None)¶
Bases: topmost.models.hierarchical.SawETM.SawETM.SawETM
HyperMiner: Topic Taxonomy Mining with Hyperbolic Embedding. NeurIPS 2022.
Yishi Xu, Dongsheng Wang, Bo Chen, Ruiying Lu, Zhibin Duan, Mingyuan Zhou.
https://github.com/NoviceStone/HyperMiner
- manifold¶
- clip_r = None¶
- feat_clip(x)¶
- property bottom_word_embeddings¶
- property topic_embeddings_list¶
- get_phi()¶
Returns the factor loading matrix by utilizing sawtooth connection.
- get_beta()¶
- get_phi_list()¶
- get_theta(x)¶
- forward(x)¶
Forward pass: compute the kl loss and data likelihood.
- class TraCo(vocab_size, num_topics_list=[10, 50, 200], en_units=300, dropout=0.0, embed_size=200, bias_topk=20, bias_p=5.0, beta_temp=0.1, weight_loss_TPD=20.0, sinkhorn_alpha=20.0, sinkhorn_max_iter=1000)¶
Bases: torch.nn.Module
On the Affinity, Rationality, and Diversity of Hierarchical Topic Modeling. AAAI 2024
Xiaobao Wu, Fengjun Pan, Thong Nguyen, Yichao Feng, Chaoqun Liu, Cong-Duy Nguyen, Anh Tuan Luu.
- num_topics_list = [10, 50, 200]¶
- weight_loss_TPD = 20.0¶
- beta_temp = 0.1¶
- num_layers¶
- bottom_word_embeddings¶
- topic_embeddings_list¶
- TPD¶
- CDDecoder¶
- encoder¶
- get_beta()¶
- get_phi_list()¶
- get_theta(input_bow)¶
- forward(input_bow)¶
- compute_loss_KL(mu, logvar, mu_prior=None)¶