Skip to main content

Infer.NET user guide : Tutorials and examples

Strings Tutorial 1: Hello, Strings!

This tutorial introduces the basics of performing inference over string variables in Infer.NET. It shows how to define a generative process that includes strings and how to reason about variables involved in that process.

You can run the code in this tutorial either using the Examples Browser or by opening the Tutorials solution in Visual Studio and editing RunMe.cs to execute HelloStrings.cs.

A generative model of text

The probabilistic models considered so far had variables of numeric types only, i.e. integers, floats and booleans. In principle, however, there is no reason to restrict model variables to these domains: as long as the inference engine is able to handle variables of a certain type, such variables should be allowed in the modelling code. The Infer.NET inference engine, in particular, also supports variables of collection types, such as strings or lists. This tutorial focuses on string variables.

One way to define a string random variable is to specify a prior distribution over it:

Variable<string> str1 = Variable.StringUniform().Named("str1");  
Variable<string> str2 = Variable.StringUniform().Named("str2");

See also: Creating variables

Variable.StringUniform creates a string random variable from a uniform distribution over all possible strings. So, both str1 and str2 can potentially take any value, and all values are equally likely. It should be noted that since the number of all possible strings is infinite, this distribution is improper. However, in many models improper priors don’t constitute a problem since the posterior distribution over variables with an improper prior can still be proper.

Another way to obtain a string random variable is to invoke an operation that produces a string, such as, for instance, concatenation:

Variable<string> text = (str1 + " " + str2).Named("text");

So, text is defined to be a concatenation of str1, a string containing a single space, and str2.

The model just defined can be thought of as the following generative process: take any two strings and concatenate them, putting a space in between. Another possible interpretation is a parsing process that accepts only strings that contain at least one space. If you are familiar with regular expressions, such a process can be represented by an expression of the form “.* .*”.

Uncertain segmentation

Now that you have a model, you are ready to observe some data and make an inference about its variables. In particular, you can observe the value of text and try to figure out what the values of str1 and str2 are. To observe the value of text, set its ObservedValue property:

text.ObservedValue = "Hello uncertain world";

Note that it’s not clear from the value of text what str1 and str2 are: the whitespace between str1 and str2 can correspond to either the first or the second space in the observed string. The segmentation of text into str1 and str2 is, therefore, subject to uncertainty. And this is precisely the conclusion that the Infer.NET inference engine will reach if you run it on this model.

var engine = new InferenceEngine();  

Console.WriteLine("str1: {0}", engine.Infer(str1));  
Console.WriteLine("str2: {0}", engine.Infer(str2));

Running this code will produce the following output:

str1: Hello▪ uncertain
str2: [uncertain world|w➥]

This reflects the underlying representation of the string distribution and is a compact way of writing:

str1: Hello[ uncertain]  
str2: [uncertain world|world]

This says that, given the observed text, the value of str1 used to produce it could have been “Hello” or “Hello uncertain”, while str2 could have been “world” or “uncertain world”. When a distribution over strings is printed to the console (or ToString is called on it), the result is a compact representation of the set of all strings that are possible under that distribution, also known as the support of the distribution. The distribution is sometimes printed using Unicode symbols so make sure that your console supports them.

StringAutomaton

Inspecting the support of the posterior is not, however, usually sufficient. For a more detailed analysis, say, seeing how likely a particular string is under the distribution, one needs to obtain a distribution object. For string random variables the corresponding object is always of type StringDistribution. The StringDistribution class is an implementation of a distribution over strings that represents uncertainty via a weighted finite state automaton internally. As with other distribution classes, it has methods for sampling and retrieving the probability of a given string. Thus, you can write the following code:

var distOfStr1 = engine.Infer<StringDistribution>(str1);  
foreach (var s in new[] { "Hello", "Hello uncertain", "Hello uncertain world" })  
{  
    Console.WriteLine("P(str1 = '{0}') = {1}", s, distOfStr1.GetProb(s));  
}

And, as expected, it will produce this output:

P(str1 = 'Hello') = 0.5  
P(str1 = 'Hello uncertain') = 0.5  
P(str1 = 'Hello uncertain world') = 0

The algorithms behind string inference are described in String inference.

You are now ready to move to the next tutorial, where you will see some other supported operations over strings and use them to define a more sophisticated model.