============ Quick Start ============ Install TopMost ----------------- Install topmost with ``pip`` as .. code-block:: console $ pip install topmost ------------------------------------------- We try FASTopic_ to get the top words of discovered topics, ``topic_top_words`` and the topic distributions of documents, ``doc_topic_dist``. The preprocessing steps are configurable. See our documentations. .. code-block:: python from topmost import RawDataset, Preprocess, FASTopicTrainer from sklearn.datasets import fetch_20newsgroups docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data'] preprocess = Preprocess(vocab_size=10000) dataset = RawDataset(docs, preprocess, device="cuda") trainer = FASTopicTrainer(dataset, verbose=True) top_words, doc_topic_dist = trainer.train() new_docs = [ "This is a document about space, including words like space, satellite, launch, orbit.", "This is a document about Microsoft Windows, including words like windows, files, dos." ] new_theta = trainer.test(new_docs) print(new_theta.argmax(1)) .. _FASTopic: https://arxiv.org/pdf/2405.17978 ============ Usage ============ Download a preprocessed dataset ----------------------------------- .. code-block:: python import topmost topmost.download_dataset('20NG', cache_path='./datasets') Train a model ----------------------------------- .. code-block:: python device = "cuda" # or "cpu" # load a preprocessed dataset dataset = topmost.BasicDataset("./datasets/20NG", device=device, read_labels=True) # create a model model = topmost.ProdLDA(dataset.vocab_size) model = model.to(device) # create a trainer trainer = topmost.BasicTrainer(model, dataset) # train the model top_words, train_theta = trainer.train() Evaluate ----------------------------------- .. code-block:: python from topmost import eva # topic diversity and coherence TD = eva._diversity(top_words) TC = eva._coherence(dataset.train_texts, dataset.vocab, top_words) # get doc-topic distributions of testing samples test_theta = trainer.test(dataset.test_data) # clustering clustering_results = eva._clustering(test_theta, dataset.test_labels) # classification cls_results = eva._cls(train_theta, test_theta, dataset.train_labels, dataset.test_labels) Test new documents ----------------------------------- .. code-block:: python import torch from topmost import Preprocess new_docs = [ "This is a new document about space, including words like space, satellite, launch, orbit.", "This is a new document about Microsoft Windows, including words like windows, files, dos." ] preprocess = Preprocess() new_parsed_docs, new_bow = preprocess.parse(new_docs, vocab=dataset.vocab) new_theta = trainer.test(torch.as_tensor(new_bow.toarray(), device=device).float())