This is the test class code that I need help with:
@Test
public void testTask3example()
{
MarkovModel model = new MarkovModel(2,”aabcabaacaac”);
ModelMatcher match = new ModelMatcher(model,”aabbcaac”);
assertEquals(0,1); //TODO replace with test code
}
}
The details for this task and code are below:
Task 3 – Matching test strings to a model (ModelMatcher)
Link to task 1: https://www.chegg.com/homework-help/questions-and-answers/task-1-analysing-n-grams-sample-text-ngramanalyser-task-need-complete-ngramanalyser-class–q22251885
Link to task 2: https://www.chegg.com/homework-help/questions-and-answers/task-2-building-markov-model-text-sample-markovmodel-task-need-complete-markovmodel-class–q22255515
For this task, you will need to complete the ModelMatcher class, which determines which of two Markov models better matches a test string. Given two Markov models built from text samples taken from two different sources, we can use them to estimate which source it is more likely a test string was drawn from. Even using only zero-th order models constructed using English and Russian text, we should be able to tell, for instance, that the string
“Борщ безвкусен”
is more likely to be Russian than English.
We will computer a measure of fit of test strings against models, called the likelihood of a sequence under the model. The likelihood of a sequence under a kk-th order Markov model is calculated as follows:
For each symbol c in the sequence, compute the probability of observing c under the model, given its k-letter context p (assuming the sequence to “wrap around”, as described above), using the Laplace-smoothed estimate of probability we used for the MarkovModel class.
Compute the likelihood of the entire sequence as the product of the likelihoods of each character.
In mathematical notation:
Let ss be an input sequence of length nn, and let MM be a kk-th order Markov model. In order to calculate the likelihood of the sequence ss under the model MM, for each symbol cici in ss (where 1≤i≤n1≤i≤n), let pipi be the kk-length context of the symbol cici (assuming wraparound). The likelihood of the sequence ss under the model is
n∏i=1laplace(ci)∏i=1nlaplace(ci)
where laplacelaplace is the Laplace-smoothed probability of cici occurring (given its context) as described in the previous task.
The probability we obtain from this calculation may be very small – in fact, potentially so small as to be indistinguishable from zero when using Java’s built-in floating-point arithmetic. Therefore, we will calculate and express the likelihood using log probabilities, which do not suffer from this problem. (A weakness of log probabilities is that they cannot straightforwardly represent probabilities of zero, but our use of Laplace smoothing allows us to avoid this problem.) The product of two probabilities pp and qq can be calculated by adding their log probabilities, logplogp and logqlogq.
By way of example, suppose we have constructed a 2nd-order Markov model using the input string “aabcabaacaac”, as described in the example for Task 2. If we are then given a test string “aabbcaac”, we can compute its log likelihood as follows:
For each character in the test string, obtain its length-2 context (assuming wraparound). Note that the tri-gram “caa” occurs twice in the (wrapped) test string.
Context | Character | Frequency |
---|---|---|
“aa” | ‘b’ | 1 |
“ab” | ‘b’ | 1 |
“bb” | ‘c’ | 1 |
“bc” | ‘a’ | 1 |
“ca” | ‘a’ | 2 |
“aa” | ‘c’ | 1 |
“ac” | ‘a’ | 1 |
For each character in the test string, calculate its Laplace-smoothed probability of occurring (given its context), and then take the log of this.
Context | Character | Laplace estimate | log10log10 of laplace estimate |
---|---|---|---|
“ac” | ‘a’ | 2+12+32+12+3 | -0.2218 |
“aa” | ‘c’ | 2+13+32+13+3 | -0.3010 |
“bc” | ‘a’ | 1+11+31+11+3 | -0.3010 |
“aa” | ‘b’ | 1+13+31+13+3 | -0.4771 |
“bb” | ‘c’ | 0+10+30+10+3 | -0.4771 |
“ca” | ‘a'(2x) | 2+13+32+13+3 | -0.6021 |
“ab” | ‘b’ | 0+12+30+12+3 | -0.6990 |
Then
total log likelihood = sum of log probabilities = -3.0792, and
average log likelihood = sum of log probabilities = -0.3849
If we have two Markov models, and one test string, then we can select the best model as being the one that maximizes the likelihood of the string under the model – i.e. we choose the model with the higher average log likelihood. Notice that the example above lists ngrams from that most in agreement with the model (highest log likelihood) to the least likely. This helps us to explain why a particular test string matches a model or not.
You will need to write code in the ModelMatcher class that takes as input one Markov model and a “test string” for which we want to identify how well it matches the given model. Complete the ModelMatcher class as follows. You may wish to complete sub-task (c) before sub-task (b).
Fill in the @author and @version fields in the header comments with all authors’ names, student numbers and the date.
Complete the constructor, which calculates and sets the fields in the class. Getter methods for these fields are already complete.
Complete the totalLogLikelihood method, a private helper method which sums a map containing log likelihoods for strings, and the averageLogLikelihood helper method, which calculates the average.
Complete the toString method for the ModelMatcher class. This method should give a summary of the results. You may choose how to arrange the string, but it should include at least the name, averageLogLikelihood and a table of the loglikelihood probabilities as shown in the example above.
Ensure you complete Javadoc comments for each method and class including @param, @returns and @throws fields as required.
import java.util.HashMap;
import java.util.Collection;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
public class ModelMatcher
{
/** log likelihoods for a teststring under a given model */
private HashMap<String,Double> logLikelihoodMap;
/** summary statistic for this setting */
private double averageLogLikelihood;
/**
* Constructor to initialise the fields for the log likelihood map for
* a test string and a given Markov model and
* the average log likelihood summary statistic
* @param MarkovModel model a given Markov model object
* @param String teststring
*/
public ModelMatcher(MarkovModel model, String testString) {
int modelOrder = model.getK();
this.logLikelihoodMap = new HashMap<String, Double>();
double laplaceEstimate;
double logLikelihood;
String sequence;
NgramAnalyser stringNgram = new NgramAnalyser(modelOrder + 1,testString);
Set<String> distinctNgrams = stringNgram.getDistinctNgrams();
for (String ngram : distinctNgrams) {
laplaceEstimate = model.laplaceEstimate(ngram);
//Use change of base formula to find log(10) likelihood
logLikelihood = Math.log(laplaceEstimate) / Math.log(10);
this.logLikelihoodMap.put(ngram, logLikelihood);
}
this.averageLogLikelihood =
this.averageLogLikelihood(this.logLikelihoodMap, stringNgram.getNgramCount());
}
/** Helper method that calculates the average log likelihood statistic
* given a HashMap of strings and their Laplace probabilities
* and the total number of ngrams in the model.
*
* @param logs map of ngram strings and their log likelihood
* @param ngramCount int number of ngrams in the original test string
* @return average log likelihood: the total of loglikelihoods
* divided by the ngramCount
*/
private double averageLogLikelihood(HashMap<String,Double> logs, int ngramCount) {
double logSum = this.totalLogLikelihood(logs);
return (logSum / ngramCount);
}
/** Helper method to calculate the total log likelihood statistic
* given a HashMap of strings and their Laplace probabilities
* and the total number of ngrams in the model.
*
* @param logs map of ngram strings and their log likelihood
* @return total log likelihood: the sum of loglikelihoods in logs
*/
private double totalLogLikelihood(HashMap<String,Double> logs) {
double logSum = 0;
for (Map.Entry<String, Double> entry : logs.entrySet()) {
logSum += entry.getValue();
}
return (logSum);
}
/**
* @return the average log likelihood statistic
*/
public double getAverageLogLikelihood() {
return averageLogLikelihood;
}
/**
* @return the log likelihood value for a given ngram from the input string
*/
public double getLogLikelihood(String ngram) {
return (logLikelihoodMap.get(ngram));
}
/**
* Make a String summarising the log likelihood map and its statistics
* @return Header lines containing the testString, the averageLogLikelihood
* and a sorted table of ngrams and their loglikelihoods
* The likelihood table should be ordered from highest to lowest likelihood
*/
public String toString() {
String returnString = “”;
returnString = returnString + Double.toString(this.averageLogLikelihood);
returnString = returnString + NgramAnalyser.printHashMap(this.logLikelihoodMap);
return returnString;
}
}
ProjectTest.java
import static org.junit.Assert.*;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import java.util.Set;
public class ProjectTest
{
/**
* Default constructor for test class ProjectTest
*/
public ProjectTest() {
}
/**
* Sets up the test fixture.
*
* Called before every test case method.
*/
@Before
public void setUp() {
}
/**
* Tears down the test fixture.
*
* Called after every test case method.
*/
@After
public void tearDown() {
}
//TODO add new test cases from here include brief documentation
/**
* Test if the number of lines in a string output from Ngram.toString()
* is valid (i.e equal to the size of the alphabet of that Ngram)
* Also ensures that the sort, splice and constructor functions work
* as required to produce the required comparison
*/
@Test(timeout=1000)
public void testSensibleToStringSize() {
String[] stringsToTest = {“Hello my friend”,
“be”,
“Have a nice day you filthy animal”,
“asdfghjkl$$sdfghj%%”,
“2”,
“adadadadaaaaa”,
” “};
Integer[] ngramSizesToTest = {1, 2, 3, 4, 5};
NgramAnalyser analysis;
String analysisString;
int i = ngramSizesToTest[0];
String s = stringsToTest[5];
if (i > s.length()) {
try {
analysis = new NgramAnalyser(i, s);
} catch (IllegalArgumentException e) {
assertEquals(0, 0);
}
} else {
analysis = new NgramAnalyser(i, s);
analysisString = analysis.toString();
//Number of lines is equal to the number of n’s plus 1
int numberofLines = analysisString.length() –
analysisString.replace(“n”, “”).length() + 1;
assert(numberofLines >= analysis.getAlphabetSize());
}
}
/**
* Tests various aspects of the getDistinctNgrams function
* inlcuding set length, …
*/
@Test(timeout=1000)
public void testGetDistinctNgrams() {
String[] stringsToTest = {
“123!@#123!@#”,
“adadadadadadadad”,
“cadadcdaadcdbed”,
“aaaaaa”,
“HOWWEYVUFXBINEF”
};
String stringToTest = stringsToTest[0];
int ngramSize = 2;
NgramAnalyser analysis = new NgramAnalyser(ngramSize, stringToTest);
Set<String> distinctNgrams = analysis.getDistinctNgrams();
int distinctNgramCount = analysis.getDistinctNgramCount();
int totalNgramCount = analysis.getNgramCount();
//Test that there are fewer or equal distinct Ngrams than total Ngrams
assert(distinctNgramCount <= totalNgramCount);
//Test that the alphabet size is smaller than
//or equal to the number of distinct NGrams
assert(analysis.getAlphabetSize() <= distinctNgramCount);
}
@Test(timeout=1000)
public void testLaplaceExample() {
assertEquals(0,1); //TODO replace with test code
}
@Test(timeout=1000)
public void testSimpleExample() {
assertEquals(0,1); //TODO replace with test code
}
@Test
public void testTask3example()
{
MarkovModel model = new MarkovModel(2,”aabcabaacaac”);
ModelMatcher match = new ModelMatcher(model,”aabbcaac”);
assertEquals(0,1); //TODO replace with test code
}