Infer.NET

Infer.NET user guide : Tutorials and examples : Latent Dirichlet Allocation

Inference and Prediction

This page shows to perform inference and prediction with your LDA model. Example test code for each of the usages is available in the project’s TestLDA project.

Inference

Both the LDAModel class and the LDAShared class derive from the ILDA interface which has an inference method and also an Engine property of type InferenceEngine. To learn the mixture distributions in the LDA model, call the Infer method on an LDA instance. The code can be seen in LDAModel.cs and LDAShared.cs. In this usage, we observe the words in the documents, and want to infer the distributions over topic and word mixtures. You will need to provide the following arguments to the Infer method.

Argument	Type	Description	Example
wordsInDoc	`Dictionary<int,int>[]`	Array of dictionaries giving word counts for each document. Each dictionary is keyed by word index and the corresponding value is the document count for the word.	See the LDA test program for an example.
alpha	`double`	Parameterises ThetaPrior.	150 / numTopics
beta	`double`	Parameterises PhiPrior.	0.1

On return, the posteriors are available as follows:

Result	Type	Description	Comments
Log evidence	`double`	The log of the model evidence. This is the return value of the method, and can be used to tune hyper-parameters, and to compare models	The log evidence values can vary quite widely across data sets. You may want to normalise this by dividing by the sum of document lengths as shown in the example program.
postTheta	`Dirichlet[]`	The per-document posterior marginal distributions over topic mixtures	This can be used along with postPhi to create a predictive distribution for words in each of these documents.
postPhi	`Dirichlet[]`	The per-topic posterior marginal distributions over word mixtures	This can be used with the LDATopicModelInference class to infer the topic distribution of a new document.

If you are running inference on an LDAModel instance, you can set the NumberOfIterations (and other parameters) on the Engine property of the class as normal. However, if you are running inference on an LDAShared instance, the NumberOfIterations is ignored. Instead there is a special property “IterationsPerPass” which is of type int[], and allows the number of passes, and the number of iterations per pass to be set. Refer to Page 3 for details on setting this, and on other differences between LDAModel and LDAShared.

Prediction

One way of testing your model is to hold out some data for each training document, and see how likely the held-out words are - you can do this using evaluation measures such as Perplexity which can also be used to compare different models. Prediction on an LDA model is achieved by running the Predict method on an instance of the LDAPredictionModel class. This returns the predictive distributions over words that you need to calculate Perplexity. In this usage we observe the mixture distributions over topics and words (as inferred by the Infer method on the LDAModel or LDAShared class) and infer the predictive distributions over words. Internally the calculation is done one document at a time; this has the effect of keeping the per-topic word mixture distributions fixed, and also provides scalability with the number of documents.

You will need to provide the following arguments to the Predict method:

Argument	TType	Description	Example
postTheta	`Dirichlet[]`	The posterior marginal distributions over topic mixtures for each document.	These are available from the call to the Infer method on `LDAModel` or `LDAShared` instances.
postPhi	`Dirichlet[]`	The posterior marginal distributions over word mixtures for each topic.	These are available from the call to the Infer method on `LDAModel` or `LDAShared` instances.

There is a single output from the Predict method:

RResult	Type	Description	Comments
Word distributions	`Discrete[]`	The predictive distributions over words for each document.	This can be used to calculate perplexity.

Topic Inference

The prediction method only operates on documents for which we know the posterior distributions over topics. Another mode of operation is to infer topics for a new document whilst fixing the learnt distributions for the per-topic word mixtures.. This is provided by the InferTopics method on an instance of the LDATopicModelInference class. In this usage we observe the per-topic word mixture distributions, and the words in the test documents, and infer the topic mixture distributions for each test document.

You will need to provide the following arguments to the InferTopics method.

Argument	Type	Description	Example
alpha	`double`	Parameterises ThetaPrior.	150 / numTopics
postPhi	`Dirichlet[]`	The posterior marginal distributions over word mixtures for each topic.	These are available from the call to the Infer method on LDAModel or LDAShared instances.
wordsInDoc	`int[][]`	Array of dictionaries giving word counts for each document. Each dictionary is keyed by word index and the corresponding value is the document count for the word.	See the LDA test program for an example.

There is a single output from the InferTopics method:

Result	Type	Description	Comments
Topic mixture distributions	`Dirichlet[]`	The distributions over topic mixtures for each document.	Relating latent topics to true topics may not be straightforward.

Page 1 | Page 2 | Page 3