Package com.articulate.sigma.wordNet
Class WordNet
java.lang.Object
com.articulate.sigma.wordNet.WordNet
- All Implemented Interfaces:
Serializable
This program finds and displays SUMO terms that are related in meaning to the English
expressions that are entered as input. Note that this program uses four WordNet data
files, "NOUN.EXC", "VERB.EXC", etc., as well as four WordNet-to-SUMO
mapping files called "WordNetMappings-nouns.txt", "WordNetMappings-verbs.txt", etc.
The main part of the program prompts the user for an English term and then
returns associated SUMO concepts. The two primary public methods are initOnce() and page().
- Author:
- Ian Niles, Adam Pease
Field Summary
Fields
Modifier and Type, Field, Description
static final int ADJECTIVE
static final int ADJECTIVE_SATELLITE
adjectiveDocumentationHash
adjectiveSUMOHash
adjectiveSynsetHash
static final int ADVERB
adverbDocumentationHash
adverbSUMOHash
adverbSynsetHash
static String baseDir
static File baseDirFile
caseMap
static boolean debug
static boolean disable
exceptionNounHash list of irregular plural forms where the key is the plural, singular is the value.
exceptionNounPluralHash
exceptionVerbHash
exceptionVerbPastHash
exceptionVerbPastProgHash
exceptVerbProgHash
ignoreCaseSynsetHash
static boolean initNeeded
maxNounSynsetID
maxVerbSynsetID
multiWords
static final int NOUN
nounDocumentationHash
nounSUMOHash
nounSynsetHash
OMW A HashMap with language name keys and HashMap<String,String> values.
origMaxNounSynsetID
origMaxVerbSynsetID
static Pattern[] regexPatterns This array contains all of the compiled Pattern objects that will be used by methods in this file.
static final String[] regexPatternStrings This array contains all of the regular expression strings that will be compiled to Pattern objects for use in the methods in this file.
relations Keys are POS-prefixed synsets, values are ArrayList(s) of AVPair(s) in which the attribute is a pointer type according to http://wordnet.princeton.edu/man/wninput.5WN.html#sect3 and the value is a POS-prefixed synset. See WordNetUtilities.convertWordNetPointer.
reverseSenseIndex A HashMap where the keys are 9 digit POS prefixed WordNet synset byte offsets, and the values are of the form word_POS_sensenum (alpha POS like "VB").
senseFrequencies a HashMap where the key is a 9-digit POS-prefixed sense and the value is the number of times that sense occurs in the Brown corpus.
senseIndex A HashMap where the keys are of the form word_POS_sensenum (alpha POS like "VB") and values are 8 digit WordNet synset byte offsets.
senseKeys A HashMap where the keys are of the form word%POS:lex_filenum:lex_id (numeric POS) and values are 8 digit WordNet synset byte offsets.
stopwords English "stop words" such as "a", "at", "them", which have no or little inherent meaning when taken alone.
SUMOHash Keys are SUMO terms, values are ArrayLists(s) of POS-prefixed 9-digit synset String(s) meaning that the part of speech code is prepended to the synset number.
synsetsToWords Keys are String POS-prefixed synsets.
static final int VERB
verbDocumentationHash
verbFrames A HashMap where keys are 8 digit WordNet synset byte offsets or synsets appended with a dash and a specific word such as "12345678-foo" or in the case where the frame applies to the entire synset, it's just the synset number.
VerbFrames
verbSUMOHash
verbSynsetHash
static WordNet wn
wordCoFrequencies a HashMap of HashMaps where the key is a word sense of the form word_POS_num signifying the word, part of speech and number of the sense in WordNet.
wordFrequencies a HashMap of HashMaps where the key is a word and the value is a HashMap of 9-digit POS-prefixed senses which is the value of the AVPair, and the number of times that sense occurs in the Brown corpus, which is the key of the AVPair
wordsToSenseKeys A HashMap with words as keys and ArrayList as values.
-
Constructor Summary
Constructors -
Method Summary
Modifier and Type, Method, Description
void addToWordFreq(String word, com.articulate.sigma.utils.AVPair avp) Add an entry to the wordFrequencies list, checking whether it has a valid count and synset pair.
static void checkWordsToSenses()
collectCountedWordSenses(String sentence) Collect all the synsets that represent the best guess at meanings for all the words in a sentence.
void compileRegexPatterns() This method compiles all of the regular expression pattern strings in regexPatternStrings and puts the resulting compiled Pattern objects in the Pattern[] regexPatterns.
boolean containsWord(String word) Does WordNet contain the given word.
boolean containsWord(String word, int pos) Does WordNet contain the given word.
boolean containsWordIgnoreCase(String word) Does WordNet contain the given word, ignoring case.
static <T> T decoder()
displayByKey(String sumokbname, String key, String params)
displaySynset(String sumokbname, String synset, String params)
static void encoder
generateNounSynsetID Generate a new eight digit noun synset ID that doesn't have an existing hash
generateSynsetID Generate a new 8 digit synset ID that doesn't have an existing hash
generateVerbSynsetID Generate a new eight digit verb synset ID that doesn't have an existing hash
getDocumentation(String synset)
static void getEntailments()
getMultiWords
getSenseKeysFromWord(String word) Get all the synsets for a given word.
getSUMOMapping(String synset) Get the SUMO mapping for a POS-prefixed synset
getSynsetsFromWord(String word) Get all the synsets for a given word.
getTransitivity(String synset, String word) Frame transitivity: intransitive - 1,2,3,4,7,23,35; transitive - everything else; ditransitive - 15,16,17,18,19
getWnFile Returns the WordNet File object corresponding to key.
getWordsFromSynset(String synset)
getWordsFromTerm(String SUMOterm) Get the words and synsets corresponding to a SUMO term.
static void initOnce() Read the WordNet files only on initialization of the class.
boolean isFile
boolean isHyponym
boolean isHyponymRecurse(String synset, String hypo, List<String> visited)
boolean isStopWord(String word) Check whether the word is a stop word
static void loadSerialized() Loads the most recently saved serialized version.
static void main A main method, used only for testing.
void mergeWordCoFrequencies Merge a new set of word co-occurrence statistics into the existing set.
nounRootForm(String mixedCase, String input) Return the root form of the noun, or null if it's not in the lexicon.
nounSynsetFromTermFormat(String tf, String SUMOterm, KB kb) Generate a new noun synset from a termFormat
page This is the regular point of entry for this class.
prependPOS(Set<String> synsets, String POS) Prepend a POS number to a set of 8 digit synsets
protected boolean processNounLine(String line)
processPrologString Double any single quotes that appear.
void readSenseCount() Read word sense frequencies into a HashMap of PriorityQueues containing AVPairs where the value is a word and the attribute (on which PriorityQueue is sorted) is an 8 digit String representation of an integer count.
void readSenseIndex(String filename) Note that WordNet forces all these words to lowercase in the index.xxx files
void readStopWords()
void readWordCoFrequencies() Return a HashMap of HashMaps where the key is a word sense of the form word_POS_num signifying the word, part of speech and number of the sense in WordNet.
removeStopWords(String sentence) Remove stop words from a sentence.
removeStopWords(List<String> sentence) Remove stop words from a sentence.
senseKeyPOS(String senseKey)
static void serialize() save serialized version.
static boolean serializedExists()
static boolean serializedOld() Check whether sources are newer than serialized version.
protected void setMaxNounSynsetID(String synset)
protected void setMaxVerbSynsetID(String synset)
static void showHelp()
splitToArrayList Return an ArrayList of the string split by spaces.
splitToArrayListSentence Return an ArrayList of the string split by periods.
sumoFileDisplay(String pathname, String counter, String params) A routine which takes a full pathname as input and returns a sentence by sentence display of sense and sentiment analysis
sumoSentenceDisplay(String input, String context, String params) A routine which looks up a given list of words in the hashtables to find the relevant word definitions and SUMO mappings.
sumoSentimentDisplay(String sentence) A routine that uses computeSentiment in DB.java to display a sentiment score for a single sentence as well as the individual scores of scored descriptors.
void synsetFromTermFormat(Formula form, String tf, String SUMOterm, KB kb) Generate a new synset from a termFormat statement
void termFormatsToSynsets Generate a new synset from a termFormat
static void testProcessPointers() A method used only for testing.
static void testWordFreq() A method used only for testing.
verbRootForm(String mixedCase, String input) Return the present tense singular form of the verb, or null if it's not in the lexicon.
verbSynsetFromTermFormat(String tf, String SUMOterm, KB kb) Generate a new verb synset from a termFormat
void writeProlog(KB kb)
static void writeWordCoFrequencies Write a HashMap of HashMaps where the key is a word sense of the form word_POS_num signifying the word, part of speech and number of the sense in WordNet.
void writeWordNetG()
void writeWordNetHyp()
void writeWordNetProlog
void writeWordNetS() Write WordNet data to a prolog file with a single kind of clause in the following format: s(Synset_ID, Word_No_in_the_Synset, Word, SS_Type, Synset_Rank_By_the_Word,Tag_Count)
void writeXML()
-
Field Details
-
disable
public static boolean disable -
debug
public static boolean debug -
wn
-
baseDir
-
baseDirFile
-
initNeeded
public static boolean initNeeded -
regexPatterns
This array contains all of the compiled Pattern objects that will be used by methods in this file. -
nounSynsetHash
-
verbSynsetHash
-
adjectiveSynsetHash
-
adverbSynsetHash
-
ignoreCaseSynsetHash
-
verbDocumentationHash
-
adjectiveDocumentationHash
-
adverbDocumentationHash
-
nounDocumentationHash
-
nounSUMOHash
-
verbSUMOHash
-
adjectiveSUMOHash
-
adverbSUMOHash
-
maxNounSynsetID
-
maxVerbSynsetID
-
origMaxNounSynsetID
-
origMaxVerbSynsetID
-
SUMOHash
Keys are SUMO terms, values are ArrayLists(s) of POS-prefixed 9-digit synset String(s) meaning that the part of speech code is prepended to the synset number. -
synsetsToWords
Keys are String POS-prefixed synsets. Values are ArrayList(s) of String(s) which are words. Note that the order of words in the file is preserved. -
exceptionVerbHash
-
exceptionVerbPastProgHash
-
exceptionVerbPastHash
-
exceptVerbProgHash
-
exceptionNounHash
list of irregular plural forms where the key is the plural, singular is the value. -
exceptionNounPluralHash
-
relations
Keys are POS-prefixed synsets, values are ArrayList(s) of AVPair(s) in which the attribute is a pointer type according to http://wordnet.princeton.edu/man/wninput.5WN.html#sect3 and the value is a POS-prefixed synset. See WordNetUtilities.convertWordNetPointer. -
wordCoFrequencies
a HashMap of HashMaps where the key is a word sense of the form word_POS_num signifying the word, part of speech and number of the sense in WordNet. The value is a HashMap of words and the number of times that word occurs in sentences with the word sense given in the key. -
wordFrequencies
a HashMap of HashMaps where the key is a word and the value is a HashMap of 9-digit POS-prefixed senses (the value of the AVPair) and the number of times that sense occurs in the Brown corpus (the key of the AVPair). -
caseMap
-
senseFrequencies
a HashMap where the key is a 9-digit POS-prefixed sense and the value is the number of times that sense occurs in the Brown corpus. -
stopwords
English "stop words" such as "a", "at", "them", which have no or little inherent meaning when taken alone. -
senseIndex
A HashMap where the keys are of the form word_POS_sensenum (alpha POS like "VB") and values are 8 digit WordNet synset byte offsets. Note that all words are from index.sense, which reduces all words to lower case -
senseKeys
A HashMap where the keys are of the form word%POS:lex_filenum:lex_id (numeric POS) and values are 8 digit WordNet synset byte offsets. Note that all words are from index.sense, which reduces all words to lower case -
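To illustrate the word%POS:lex_filenum:lex_id key form described for senseKeys, here is a minimal sketch that splits such a key into its word and numeric POS. The class and method names are hypothetical and are not part of WordNet.java; the sample key is only an example of the format, and no input validation is attempted.

```java
final class SenseKeySketch {

    /** Returns the word part, i.e. the text before the '%' separator. */
    static String lemma(String senseKey) {
        return senseKey.substring(0, senseKey.indexOf('%'));
    }

    /** Returns the numeric POS, i.e. the digits between '%' and the first ':'. */
    static int numericPos(String senseKey) {
        int percent = senseKey.indexOf('%');
        int colon = senseKey.indexOf(':', percent);
        return Integer.parseInt(senseKey.substring(percent + 1, colon));
    }
}
```

With the key "run%2:38:00::", lemma yields "run" and numericPos yields 2, which corresponds to the verb code listed for page().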
reverseSenseIndex
A HashMap where the keys are 9 digit POS prefixed WordNet synset byte offsets, and the values are of the form word_POS_sensenum (alpha POS like "VB"). Note that all words are from index.sense, which reduces all words to lower case -
verbFrames
A HashMap where keys are 8 digit WordNet synset byte offsets or synsets appended with a dash and a specific word such as "12345678-foo" or in the case where the frame applies to the entire synset, it's just the synset number. Values are ArrayList(s) of String verb frame numbers. -
wordsToSenseKeys
A HashMap with words as keys and ArrayList as values. The ArrayList contains word senses which are Strings of the form word_POS_num (alpha POS like "VB") signifying the word, part of speech and number of the sense in WordNet. Note that all words are from index.sense, which reduces all words to lower case -
multiWords
-
NOUN
public static final int NOUN
-
VERB
public static final int VERB
-
ADJECTIVE
public static final int ADJECTIVE
-
ADVERB
public static final int ADVERB
-
ADJECTIVE_SATELLITE
public static final int ADJECTIVE_SATELLITE
-
OMW
A HashMap with language name keys and HashMap<String,String> values. The interior HashMap has String keys which are PWN30 synsets, written as an 8-digit synset, a dash, and then an alphabetic part of speech character. Values are words in the target language. -
regexPatternStrings
This array contains all of the regular expression strings that will be compiled to Pattern objects for use in the methods in this file. -
VerbFrames
-
-
Constructor Details
-
WordNet
public WordNet()
-
-
Method Details
-
getMultiWords
-
compileRegexPatterns
public void compileRegexPatterns()
This method compiles all of the regular expression pattern strings in regexPatternStrings and puts the resulting compiled Pattern objects in the Pattern[] regexPatterns. -
getWnFile
Returns the WordNet File object corresponding to key.
- Parameters:
key- A descriptive literal String that maps to a regular expression pattern used to obtain a WordNet file.
- Returns:
- A File object
-
splitToArrayList
Return an ArrayList of the string split by spaces. -
splitToArrayListSentence
Return an ArrayList of the string split by periods. -
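The two splitting helpers above can be sketched as follows. This is an illustration only, not the code in WordNet.java: the class and method names are hypothetical, and trimming whitespace and dropping empty fragments are assumptions the documentation does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;

final class SplitSketch {

    /** Splits on runs of whitespace (assumption: leading/trailing space is ignored). */
    static ArrayList<String> splitBySpaces(String s) {
        return new ArrayList<>(Arrays.asList(s.trim().split("\\s+")));
    }

    /** Splits on periods, trimming each fragment and skipping empty ones (assumption). */
    static ArrayList<String> splitByPeriods(String s) {
        ArrayList<String> result = new ArrayList<>();
        for (String part : s.split("\\.")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }
}
```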
getSUMOMapping
Get the SUMO mapping for a POS-prefixed synset -
setMaxNounSynsetID
-
setMaxVerbSynsetID
-
processNounLine
-
mergeWordCoFrequencies
Merge a new set of word co-occurrence statistics into the existing set. -
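A minimal sketch of such a merge, using the nested-map shape documented for wordCoFrequencies (word sense to co-occurring word to count). It assumes that merging means summing the counts of matching entries, which the documentation does not spell out; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class CoFrequencyMergeSketch {

    /** Adds every count in 'incoming' into 'existing', creating inner maps as needed. */
    static void merge(Map<String, Map<String, Integer>> existing,
                      Map<String, Map<String, Integer>> incoming) {
        for (Map.Entry<String, Map<String, Integer>> sense : incoming.entrySet()) {
            Map<String, Integer> counts =
                    existing.computeIfAbsent(sense.getKey(), k -> new HashMap<>());
            for (Map.Entry<String, Integer> word : sense.getValue().entrySet()) {
                // Assumption: counts for the same sense/word pair are summed.
                counts.merge(word.getKey(), word.getValue(), Integer::sum);
            }
        }
    }
}
```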
writeWordCoFrequencies
Write a HashMap of HashMaps where the key is a word sense of the form word_POS_num signifying the word, part of speech and number of the sense in WordNet. The value is a HashMap of words and the number of times that word occurs in sentences with the word sense given in the key. -
readWordCoFrequencies
public void readWordCoFrequencies()
Read a HashMap of HashMaps where the key is a word sense of the form word_POS_num signifying the word, part of speech and number of the sense in WordNet. The value is a HashMap of words and the number of times that word occurs in sentences with the word sense given in the key. -
readStopWords
public void readStopWords() -
readSenseIndex
Note that WordNet forces all these words to lowercase in the index.xxx files -
readSenseCount
public void readSenseCount()
Read word sense frequencies into a HashMap of PriorityQueues containing AVPairs where the value is a word and the attribute (on which PriorityQueue is sorted) is an 8 digit String representation of an integer count. -
addToWordFreq
Add an entry to the wordFrequencies list, checking whether it has a valid count and synset pair. -
sumoSentenceDisplay
A routine which looks up a given list of words in the hashtables to find the relevant word definitions and SUMO mappings.
- Parameters:
input- is the target sentence to be parsed. See WordSenseBody.jsp for usage.
context- is the larger context of the sentence. Supplying it can give more accurate results.
params- is the set of HTML parameters
-
sumoSentimentDisplay
A routine that uses computeSentiment in DB.java to display a sentiment score for a single sentence as well as the individual scores of scored descriptors.
- Parameters:
sentence- is the target sentence to be scored. See WordSenseBody.jsp for usage.
-
sumoFileDisplay
A routine which takes a full pathname as input and returns a sentence-by-sentence display of sense and sentiment analysis.
- Parameters:
pathname-
counter- is used to keep track of which sentence is being displayed
params- is the set of HTML parameters
-
isFile
- Returns:
- true if the input String is a file pathname. Determined by whether the string contains a forward or backward slash. This is only used in WordSense.jsp and will fail if a sentence that is not a file contains a forward or back slash.
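The slash heuristic described above, including its stated weakness, can be sketched as follows. The class and method names are hypothetical and not taken from WordNet.java.

```java
final class IsFileSketch {

    /** Treats any input containing '/' or '\' as a file pathname. */
    static boolean looksLikeFile(String input) {
        return input.contains("/") || input.contains("\\");
    }
}
```

A plain sentence such as "the cat sat" is rejected, but "this and/or that" is accepted as a pathname, which is the failure case the description warns about.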
-
isHyponymRecurse
- Returns:
- true if the first POS-prefixed synset is a hyponym of the second POS-prefixed synset. This is a recursive method.
-
isHyponym
- Returns:
- true if the first POS-prefixed synset is a hyponym of the second POS-prefixed synset. This is a recursive method.
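A minimal sketch of a recursive hyponym test with a visited list, in the spirit of isHyponym and isHyponymRecurse. The hypernym table, the synset numbers in it, and all names are hypothetical; the real class draws its links from the relations field, and its exact traversal may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class HyponymSketch {

    // Hypothetical hypernym links: synset -> its direct hypernyms.
    static final Map<String, List<String>> HYPERNYMS = Map.of(
            "100000003", List.of("100000002"),
            "100000002", List.of("100000001"));

    /** True if 'ancestor' is reachable from 'synset' through one or more hypernym links. */
    static boolean isHyponym(String synset, String ancestor) {
        return recurse(synset, ancestor, new ArrayList<>());
    }

    static boolean recurse(String synset, String ancestor, List<String> visited) {
        if (visited.contains(synset)) {
            return false; // already explored; guards against cycles
        }
        visited.add(synset);
        for (String parent : HYPERNYMS.getOrDefault(synset, List.of())) {
            if (parent.equals(ancestor) || recurse(parent, ancestor, visited)) {
                return true;
            }
        }
        return false;
    }
}
```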
-
removeStopWords
Remove stop words from a sentence. -
removeStopWords
Remove stop words from a sentence. -
isStopWord
Check whether the word is a stop word -
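The stop-word methods above can be sketched as follows. The three-word list reuses the examples given for the stopwords field; the real list is read from a file by readStopWords(). Case-insensitive matching and the class name are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordSketch {

    // Tiny illustrative list; the real class loads its stop words from a file.
    static final Set<String> STOPWORDS = Set.of("a", "at", "them");

    static boolean isStopWord(String word) {
        // Assumption: matching ignores case.
        return STOPWORDS.contains(word.toLowerCase(Locale.ROOT));
    }

    static List<String> removeStopWords(List<String> words) {
        List<String> kept = new ArrayList<>();
        for (String w : words) {
            if (!isStopWord(w)) {
                kept.add(w);
            }
        }
        return kept;
    }

    static String removeStopWords(String sentence) {
        List<String> words = Arrays.asList(sentence.trim().split("\\s+"));
        return String.join(" ", removeStopWords(words));
    }
}
```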
collectCountedWordSenses
Collect all the synsets that represent the best guess at meanings for all the words in a sentence. Keep track of how many times each sense appears. -
encoder
-
decoder
public static <T> T decoder() -
serializedExists
public static boolean serializedExists() -
serializedOld
public static boolean serializedOld()
Check whether sources are newer than the serialized version. -
loadSerialized
public static void loadSerialized()
Loads the most recently saved serialized version. -
serialize
public static void serialize()
Save the serialized version. -
initOnce
public static void initOnce()
Read the WordNet files only on initialization of the class. -
nounRootForm
Return the root form of the noun, or null if it's not in the lexicon. -
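One way to picture this lookup is a sketch that consults an irregular-plural table first (plural as key, singular as value, as documented for exceptionNounHash), then the lexicon, and returns null otherwise. The tiny tables, the single "strip a trailing s" rule, and all names are assumptions made for illustration; the actual method takes two arguments and may apply more rules.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class NounRootSketch {

    // Hypothetical stand-ins for exceptionNounHash and the noun lexicon.
    static final Map<String, String> EXCEPTIONS = Map.of("mice", "mouse", "children", "child");
    static final Set<String> LEXICON = Set.of("mouse", "child", "cat");

    /** Returns the singular root form, or null if the word is not in the lexicon. */
    static String nounRoot(String input) {
        String word = input.toLowerCase(Locale.ROOT);
        String irregular = EXCEPTIONS.get(word);
        if (irregular != null) {
            return irregular;
        }
        if (LEXICON.contains(word)) {
            return word;
        }
        if (word.endsWith("s")) {
            String stem = word.substring(0, word.length() - 1);
            if (LEXICON.contains(stem)) {
                return stem;
            }
        }
        return null;
    }
}
```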
verbRootForm
Return the present tense singular form of the verb, or null if it's not in the lexicon. -
prependPOS
Prepend a POS number to a set of 8 digit synsets.
- Returns:
- an ArrayList of 9 digit synset Strings
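The conversion from 8-digit to 9-digit POS-prefixed synsets amounts to string concatenation, as in this minimal sketch. The parameter types follow the prependPOS(Set<String> synsets, String POS) signature listed in the method summary; the class name and the sample synset numbers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class PrependPosSketch {

    /** Puts the single-digit POS code in front of each 8-digit synset. */
    static List<String> prependPOS(Set<String> synsets, String pos) {
        List<String> result = new ArrayList<>();
        for (String synset : synsets) {
            result.add(pos + synset);
        }
        return result;
    }
}
```

With POS "1" (noun), the synset "01234567" becomes "101234567".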
-
getSynsetsFromWord
Get all the synsets for a given word. Print an error if this routine gives a result and getSenseKeysFromWord() doesn't.
- Returns:
- an ArrayList of 9 digit synset Strings
-
getSenseKeysFromWord
Get all the synsets for a given word.
- Returns:
- a TreeMap of sense keys in the form of word_POS_num and values that are ArrayLists of synset Strings
-
getWordsFromTerm
Get the words and synsets corresponding to a SUMO term. The return is a Map of words with their corresponding synset number. -
getWordsFromSynset
-
containsWord
Does WordNet contain the given word. -
containsWord
Does WordNet contain the given word. -
containsWordIgnoreCase
Does WordNet contain the given word, ignoring case. -
page
This is the regular point of entry for this class. It takes the word the user is searching for and the part of speech index, does the search, and returns a string with HTML formatting codes to present to the user. The part of speech codes must be the same as in the menu options in WordNet.jsp and Browse.jsp.
- Parameters:
inp- The string the user is searching for.
pos- The part of speech of the word: 1=noun, 2=verb, 3=adjective, 4=adverb
- Returns:
- A string containing the HTML-formatted search result.
-
getDocumentation
- Parameters:
synset- is a synset with POS-prefix
-
displaySynset
- Parameters:
synset- is a synset with POS-prefix
-
displayByKey
- Parameters:
key- is a WordNet sense key
- Returns:
- 9-digit POS-prefix and synset number
-
writeXML
public void writeXML() -
getTransitivity
Frame transitivity:
intransitive - 1,2,3,4,7,23,35
transitive - everything else
ditransitive - 15,16,17,18,19
-
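The frame-number classification listed for getTransitivity can be sketched as a simple lookup. The frame numbers come from the description; the class name, the method name, and the returned label strings are hypothetical, and the actual method works from a synset and a word rather than a single frame number.

```java
import java.util.Set;

final class TransitivitySketch {

    static final Set<Integer> INTRANSITIVE = Set.of(1, 2, 3, 4, 7, 23, 35);
    static final Set<Integer> DITRANSITIVE = Set.of(15, 16, 17, 18, 19);

    /** Classifies one verb frame number; anything not listed counts as transitive. */
    static String classify(int frame) {
        if (INTRANSITIVE.contains(frame)) {
            return "intransitive";
        }
        if (DITRANSITIVE.contains(frame)) {
            return "ditransitive";
        }
        return "transitive";
    }
}
```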
writeProlog
-
senseKeyPOS
-
writeWordNetS
public void writeWordNetS()
Write WordNet data to a Prolog file with a single kind of clause in the following format: s(Synset_ID, Word_No_in_the_Synset, Word, SS_Type, Synset_Rank_By_the_Word, Tag_Count) -
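As an illustration of the clause format above, this sketch formats one s/6 clause as a string. The class and method names and the sample values are illustrative only; quoting the word and doubling embedded single quotes follows the processPrologString description and is otherwise an assumption about the output.

```java
final class PrologClauseSketch {

    /** Formats s(Synset_ID, Word_No, Word, SS_Type, Rank, Tag_Count) as one Prolog clause. */
    static String sClause(String synsetId, int wordNum, String word,
                          char ssType, int rank, int tagCount) {
        String quoted = "'" + word.replace("'", "''") + "'";
        return String.format("s(%s,%d,%s,%c,%d,%d).",
                synsetId, wordNum, quoted, ssType, rank, tagCount);
    }
}
```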
writeWordNetHyp
public void writeWordNetHyp() -
processPrologString
Double any single quotes that appear. -
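Doubling single quotes is the usual way to escape them inside a quoted Prolog atom. A minimal sketch, with a hypothetical class and method name:

```java
final class PrologQuoteSketch {

    /** Replaces every single quote with two single quotes. */
    static String doubleSingleQuotes(String s) {
        return s.replace("'", "''");
    }
}
```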
writeWordNetG
public void writeWordNetG() -
writeWordNetProlog
- Throws:
IOException
-
generateSynsetID
Generate a new 8 digit synset ID that doesn't have an existing hash -
generateNounSynsetID
Generate a new eight digit noun synset ID that doesn't have an existing hash -
generateVerbSynsetID
Generate a new eight digit verb synset ID that doesn't have an existing hash -
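One plausible way to generate an unused ID, sketched under the assumption that the class counts upward from a tracked maximum (suggested by the maxNounSynsetID and maxVerbSynsetID fields, but not confirmed by this page). The names and the zero-padding choice are hypothetical.

```java
import java.util.Set;

final class SynsetIdSketch {

    /** Returns the first zero-padded 8-digit ID above 'max' that is not already used. */
    static String nextId(Set<String> usedIds, int max) {
        int candidate = max + 1;
        String id = String.format("%08d", candidate);
        while (usedIds.contains(id)) {
            candidate++;
            id = String.format("%08d", candidate);
        }
        return id;
    }
}
```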
nounSynsetFromTermFormat
Generate a new noun synset from a termFormat -
verbSynsetFromTermFormat
Generate a new verb synset from a termFormat -
synsetFromTermFormat
Generate a new synset from a termFormat statement.
- Parameters:
form- is the entire termFormat statement
tf- is the lexical item (word). Note that in the case of a multi-word lexical item it should already have had spaces replaced by underscores
SUMOterm- is the SUMO term that the lexical item is mapped to
-
termFormatsToSynsets
Generate a new synset from a termFormat -
testWordFreq
public static void testWordFreq()
A method used only for testing. It should not be called during normal operation. -
testProcessPointers
public static void testProcessPointers()
A method used only for testing. It should not be called during normal operation. -
checkWordsToSenses
public static void checkWordsToSenses() -
getEntailments
public static void getEntailments() -
showHelp
public static void showHelp() -
main
A main method, used only for testing. It should not be called during normal operation.
-