Class WordNet

java.lang.Object
com.articulate.sigma.wordNet.WordNet
All Implemented Interfaces:
Serializable

public class WordNet extends Object implements Serializable
This program finds and displays SUMO terms that are related in meaning to the English expressions entered as input. It uses four WordNet data files ("NOUN.EXC", "VERB.EXC", etc.) as well as four WordNet-to-SUMO mapping files ("WordNetMappings-nouns.txt", "WordNetMappings-verbs.txt", etc.). The main part of the program prompts the user for an English term and then returns the associated SUMO concepts. The two primary public methods are initOnce() and page().
Author:
Ian Niles, Adam Pease
  • Field Details

    • disable

      public static boolean disable
    • debug

      public static boolean debug
    • wn

      public static WordNet wn
    • baseDir

      public static String baseDir
    • baseDirFile

      public static File baseDirFile
    • initNeeded

      public static boolean initNeeded
    • regexPatterns

      public static Pattern[] regexPatterns
      This array contains all of the compiled Pattern objects that will be used by methods in this file.
    • nounSynsetHash

      public Map<String,Set<String>> nounSynsetHash
    • verbSynsetHash

      public Map<String,Set<String>> verbSynsetHash
    • adjectiveSynsetHash

      public Map<String,Set<String>> adjectiveSynsetHash
    • adverbSynsetHash

      public Map<String,Set<String>> adverbSynsetHash
    • ignoreCaseSynsetHash

      public Map<String,Set<String>> ignoreCaseSynsetHash
    • verbDocumentationHash

      public Map<String,String> verbDocumentationHash
    • adjectiveDocumentationHash

      public Map<String,String> adjectiveDocumentationHash
    • adverbDocumentationHash

      public Map<String,String> adverbDocumentationHash
    • nounDocumentationHash

      public Map<String,String> nounDocumentationHash
    • nounSUMOHash

      public Map<String,String> nounSUMOHash
    • verbSUMOHash

      public Map<String,String> verbSUMOHash
    • adjectiveSUMOHash

      public Map<String,String> adjectiveSUMOHash
    • adverbSUMOHash

      public Map<String,String> adverbSUMOHash
    • maxNounSynsetID

      public String maxNounSynsetID
    • maxVerbSynsetID

      public String maxVerbSynsetID
    • origMaxNounSynsetID

      public String origMaxNounSynsetID
    • origMaxVerbSynsetID

      public String origMaxVerbSynsetID
    • SUMOHash

      public Map<String,List<String>> SUMOHash
      Keys are SUMO terms; values are ArrayLists of POS-prefixed 9-digit synset Strings, meaning that the part-of-speech code is prepended to the synset number.
    • synsetsToWords

      public Map<String,List<String>> synsetsToWords
      Keys are String POS-prefixed synsets. Values are ArrayList(s) of String(s) which are words. Note that the order of words in the file is preserved.
    • exceptionVerbHash

      public Map<String,String> exceptionVerbHash
    • exceptionVerbPastProgHash

      public Map<String,String> exceptionVerbPastProgHash
    • exceptionVerbPastHash

      public Map<String,String> exceptionVerbPastHash
    • exceptVerbProgHash

      public Map<String,String> exceptVerbProgHash
    • exceptionNounHash

      public Map<String,String> exceptionNounHash
      Irregular plural forms, where the key is the plural and the value is the singular.
    • exceptionNounPluralHash

      public Map<String,String> exceptionNounPluralHash
    • relations

      public Map<String,List<com.articulate.sigma.utils.AVPair>> relations
      Keys are POS-prefixed synsets; values are ArrayLists of AVPairs in which the attribute is a pointer type according to http://wordnet.princeton.edu/man/wninput.5WN.html#sect3 and the value is a POS-prefixed synset. See WordNetUtilities.convertWordNetPointer.
    • wordCoFrequencies

      public Map<String,Map<String,Integer>> wordCoFrequencies
      a HashMap of HashMaps where the key is a word sense of the form word_POS_num signifying the word, part of speech and number of the sense in WordNet. The value is a HashMap of words and the number of times that word occurs in sentences with the word sense given in the key.
    • wordFrequencies

      protected Map<String,Set<com.articulate.sigma.utils.AVPair>> wordFrequencies
      A map where the key is a word and the value is a Set of AVPairs. In each AVPair the value is a 9-digit POS-prefixed sense and the attribute is the number of times that sense occurs in the Brown corpus.
    • caseMap

      public Map<String,String> caseMap
    • senseFrequencies

      public Map<String,Integer> senseFrequencies
      A HashMap where the key is a 9-digit POS-prefixed sense and the value is the number of times that sense occurs in the Brown corpus.
    • stopwords

      public List<String> stopwords
      English "stop words" such as "a", "at" and "them", which have little or no inherent meaning when taken alone.
    • senseIndex

      public Map<String,String> senseIndex
      A HashMap where the keys are of the form word_POS_sensenum (alpha POS like "VB") and values are 8-digit WordNet synset byte offsets. Note that all words are from index.sense, which reduces all words to lower case.
    • senseKeys

      public Map<String,String> senseKeys
      A HashMap where the keys are of the form word%POS:lex_filenum:lex_id (numeric POS) and values are 8-digit WordNet synset byte offsets. Note that all words are from index.sense, which reduces all words to lower case.
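To make the senseKeys key layout concrete, here is a minimal standalone sketch that splits a key of the form word%POS:lex_filenum:lex_id into its parts. The class name, the method name and the sample key are illustrative assumptions and are not part of the WordNet class.

```java
final class SenseKeySketch {

    /**
     * Splits a key of the form word%POS:lex_filenum:lex_id.
     * Returns { word, POS, lex_filenum, lex_id }.
     * Hypothetical helper: not a method of WordNet.
     */
    static String[] parse(String senseKey) {
        int percent = senseKey.indexOf('%');
        if (percent < 0)
            throw new IllegalArgumentException("no '%' in sense key: " + senseKey);
        String word = senseKey.substring(0, percent);
        // Any fields after the third are ignored by this sketch
        String[] fields = senseKey.substring(percent + 1).split(":");
        if (fields.length < 3)
            throw new IllegalArgumentException("too few fields in sense key: " + senseKey);
        return new String[] { word, fields[0], fields[1], fields[2] };
    }

    public static void main(String[] args) {
        // Sample key chosen for illustration only
        String[] parts = parse("dog%1:05:00");
        System.out.println(String.join(" | ", parts));
    }
}
```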
    • reverseSenseIndex

      public Map<String,String> reverseSenseIndex
      A HashMap where the keys are 9-digit POS-prefixed WordNet synset byte offsets, and the values are of the form word_POS_sensenum (alpha POS like "VB"). Note that all words are from index.sense, which reduces all words to lower case.
    • verbFrames

      public Map<String,List<String>> verbFrames
      A HashMap whose keys take one of two forms: an 8-digit WordNet synset byte offset alone, when the frame applies to the entire synset, or a synset followed by a dash and a specific word, such as "12345678-foo". Values are ArrayLists of String verb frame numbers.
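The two key forms of verbFrames suggest a lookup that consults both the whole-synset key and the word-specific key. The sketch below shows one plausible way to do that on a plain map. Combining both results is an assumption, and the class, method and sample data are hypothetical rather than taken from the WordNet class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class VerbFrameLookupSketch {

    /** Collects frames stored for the whole synset and for "synset-word". */
    static List<String> framesFor(Map<String, List<String>> verbFrames,
                                  String synset, String word) {
        List<String> result = new ArrayList<>();
        List<String> wholeSynset = verbFrames.get(synset);
        if (wholeSynset != null)
            result.addAll(wholeSynset);
        List<String> wordSpecific = verbFrames.get(synset + "-" + word);
        if (wordSpecific != null)
            result.addAll(wordSpecific);
        return result;
    }

    public static void main(String[] args) {
        // Made-up entries using the key shapes described above
        Map<String, List<String>> verbFrames = new HashMap<>();
        verbFrames.put("12345678", Arrays.asList("8"));
        verbFrames.put("12345678-foo", Arrays.asList("2"));
        System.out.println(framesFor(verbFrames, "12345678", "foo"));
    }
}
```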
    • wordsToSenseKeys

      public Map<String,List<String>> wordsToSenseKeys
      A HashMap with words as keys and ArrayLists as values. Each ArrayList contains word senses, which are Strings of the form word_POS_num (alpha POS like "VB") signifying the word, part of speech and number of the sense in WordNet. Note that all words are from index.sense, which reduces all words to lower case.
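Because multi-word entries can themselves contain underscores, a word_POS_num string is safest to take apart from the right. The following sketch illustrates that idea. The helper and the sample key are assumptions for illustration and do not reproduce code from the WordNet class.

```java
final class WordSenseSketch {

    /**
     * Splits word_POS_num into { word, POS, num }, scanning from the right
     * so that underscores inside the word (e.g. "hot_dog") are preserved.
     */
    static String[] split(String key) {
        int last = key.lastIndexOf('_');
        int prev = key.lastIndexOf('_', last - 1);
        if (prev < 0)
            throw new IllegalArgumentException("not of the form word_POS_num: " + key);
        return new String[] {
            key.substring(0, prev),        // word, possibly containing underscores
            key.substring(prev + 1, last), // alpha POS such as "NN" or "VB"
            key.substring(last + 1)        // sense number
        };
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", split("hot_dog_NN_1")));
    }
}
```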
    • multiWords

      public MultiWords multiWords
    • NOUN

      public static final int NOUN
    • VERB

      public static final int VERB
    • ADJECTIVE

      public static final int ADJECTIVE
    • ADVERB

      public static final int ADVERB
    • ADJECTIVE_SATELLITE

      public static final int ADJECTIVE_SATELLITE
    • OMW

      public Map<String,Map<String,String>> OMW
      A HashMap with language name keys and HashMap<String,String> values. The interior HashMap has String keys which are PWN30 synsets, each an 8-digit synset number followed by a dash and an alphabetic part-of-speech character. Values are words in the target language.
    • regexPatternStrings

      public static final String[] regexPatternStrings
      This array contains all of the regular expression strings that will be compiled to Pattern objects for use in the methods in this file.
    • VerbFrames

      public static List<String> VerbFrames
  • Constructor Details

    • WordNet

      public WordNet()
  • Method Details

    • getMultiWords

      public MultiWords getMultiWords()
    • compileRegexPatterns

      public void compileRegexPatterns()
      This method compiles all of the regular expression pattern strings in regexPatternStrings and puts the resulting compiled Pattern objects in the Pattern[] regexPatterns.
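The relationship between regexPatternStrings and regexPatterns can be sketched as a one-to-one compilation of an array of strings into an array of Pattern objects. The pattern strings below are placeholders, since the actual contents of regexPatternStrings are not listed here, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class RegexCompileSketch {

    // Placeholder patterns, not the real regexPatternStrings
    static final String[] PATTERN_STRINGS = {
        "^(\\d{8})\\s",
        "^\\S+$"
    };

    /** Compiles each source string, keeping the same index in the result. */
    static Pattern[] compileAll(String[] sources) {
        Pattern[] compiled = new Pattern[sources.length];
        for (int i = 0; i < sources.length; i++)
            compiled[i] = Pattern.compile(sources[i]);
        return compiled;
    }

    public static void main(String[] args) {
        Pattern[] patterns = compileAll(PATTERN_STRINGS);
        System.out.println(patterns.length + " patterns compiled");
    }
}
```

Compiling once and reusing the Pattern objects avoids recompiling the same expression each time a line is parsed.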
    • getWnFile

      public File getWnFile(String key, String override)
      Returns the WordNet File object corresponding to key.
      Parameters:
      key - A descriptive literal String that maps to a regular expression pattern used to obtain a WordNet file.
      Returns:
      A File object
    • splitToArrayList

      public static List<String> splitToArrayList(String st)
      Return an ArrayList of the string split by spaces.
    • splitToArrayListSentence

      public static List<String> splitToArrayListSentence(String st)
      Return an ArrayList of the string split by periods.
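A minimal sketch of the two splitting helpers follows, assuming that runs of whitespace count as one separator and that empty pieces are dropped. How the real methods treat those edge cases is not stated here, so the class and method names below are deliberately different from the originals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SplitSketch {

    /** Splits on whitespace; an empty or blank input yields an empty list. */
    static List<String> splitBySpaces(String st) {
        String trimmed = st.trim();
        if (trimmed.isEmpty())
            return new ArrayList<>();
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    /** Splits on periods, trimming each piece and skipping empty ones. */
    static List<String> splitByPeriods(String st) {
        List<String> result = new ArrayList<>();
        for (String piece : st.split("\\.")) {
            String sentence = piece.trim();
            if (!sentence.isEmpty())
                result.add(sentence);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splitBySpaces("the quick  fox"));
        System.out.println(splitByPeriods("It rains. It pours."));
    }
}
```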
    • getSUMOMapping

      public String getSUMOMapping(String synset)
      Get the SUMO mapping for a POS-prefixed synset
    • setMaxNounSynsetID

      protected void setMaxNounSynsetID(String synset)
    • setMaxVerbSynsetID

      protected void setMaxVerbSynsetID(String synset)
    • processNounLine

      protected boolean processNounLine(String line)
    • mergeWordCoFrequencies

      public void mergeWordCoFrequencies(Map<String,Map<String,Integer>> senses)
      Merge a new set of word co-occurrence statistics into the existing set.
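One way to picture the merge is as a nested map update in which counts for the same word sense and co-occurring word are added together. Summing the counts is an assumption about the merge policy, and the class, method and sample keys below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class CoFrequencyMergeSketch {

    /** Adds every count in incoming to existing, creating entries as needed. */
    static void merge(Map<String, Map<String, Integer>> existing,
                      Map<String, Map<String, Integer>> incoming) {
        for (Map.Entry<String, Map<String, Integer>> sense : incoming.entrySet()) {
            Map<String, Integer> target =
                existing.computeIfAbsent(sense.getKey(), k -> new HashMap<>());
            for (Map.Entry<String, Integer> word : sense.getValue().entrySet())
                target.merge(word.getKey(), word.getValue(), Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> existing = new HashMap<>();
        existing.computeIfAbsent("bank_NN_1", k -> new HashMap<>()).put("money", 3);

        Map<String, Map<String, Integer>> incoming = new HashMap<>();
        incoming.computeIfAbsent("bank_NN_1", k -> new HashMap<>()).put("money", 2);
        incoming.computeIfAbsent("bank_NN_2", k -> new HashMap<>()).put("river", 1);

        merge(existing, incoming);
        System.out.println(existing);
    }
}
```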
    • writeWordCoFrequencies

      public static void writeWordCoFrequencies(String fname, Map<String,Map<String,Integer>> senses)
      Write a HashMap of HashMaps where the key is a word sense of the form word_POS_num signifying the word, part of speech and number of the sense in WordNet. The value is a HashMap of words and the number of times that word occurs in sentences with the word sense given in the key.
    • readWordCoFrequencies

      public void readWordCoFrequencies()
      Return a HashMap of HashMaps where the key is a word sense of the form word_POS_num signifying the word, part of speech and number of the sense in WordNet. The value is a HashMap of words and the number of times that word occurs in sentences with the word sense given in the key.
    • readStopWords

      public void readStopWords()
    • readSenseIndex

      public void readSenseIndex(String filename)
      Note that WordNet forces all these words to lowercase in the index.xxx files.
    • readSenseCount

      public void readSenseCount()
      Read word sense frequencies into a HashMap of PriorityQueues containing AVPairs where the value is a word and the attribute (on which PriorityQueue is sorted) is an 8 digit String representation of an integer count.
    • addToWordFreq

      public void addToWordFreq(String word, com.articulate.sigma.utils.AVPair avp)
      Add an entry to the wordFrequencies list, checking whether it has a valid count and synset pair.
    • sumoSentenceDisplay

      public String sumoSentenceDisplay(String input, String context, String params)
      A routine which looks up a given list of words in the hashtables to find the relevant word definitions and SUMO mappings.
      Parameters:
      input - is the target sentence to be parsed. See WordSenseBody.jsp for usage.
      context - is the larger context of the sentence. Can mean more accurate results.
      params - is the set of html parameters
    • sumoSentimentDisplay

      public String sumoSentimentDisplay(String sentence)
      A routine that uses computeSentiment in DB.java to display a sentiment score for a single sentence as well as the individual scores of scored descriptors.
      Parameters:
      sentence - is the target sentence to be scored. See WordSenseBody.jsp for usage.
    • sumoFileDisplay

      public String sumoFileDisplay(String pathname, String counter, String params)
      A routine which takes a full pathname as input and returns a sentence by sentence display of sense and sentiment analysis
      Parameters:
      pathname -
      counter - is used to keep track of which sentence is being displayed
      params - is the set of html parameters
    • isFile

      public boolean isFile(String s)
      Returns:
      true if the input String is a file pathname. Determined by whether the string contains a forward or backward slash. This is only used in WordSense.jsp and will fail if a sentence that is not a file contains a forward or back slash.
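The slash heuristic described for isFile, including its known false positive, can be reproduced in a few lines. This is a standalone sketch of the stated rule with a hypothetical method name, not the method's actual source.

```java
final class IsFileSketch {

    /** Treats any string containing a forward or backward slash as a pathname. */
    static boolean looksLikePath(String s) {
        return s.contains("/") || s.contains("\\");
    }

    public static void main(String[] args) {
        System.out.println(looksLikePath("/home/user/text.txt")); // a Unix-style path
        System.out.println(looksLikePath("The cat sat on the mat."));
        // A plain sentence containing a slash is misclassified as a path
        System.out.println(looksLikePath("Bring a pen and/or a pencil."));
    }
}
```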
    • isHyponymRecurse

      public boolean isHyponymRecurse(String synset, String hypo, List<String> visited)
      Returns:
      true if the first POS-prefixed synset is a hyponym of the second POS-prefixed synset. This is a recursive method.
    • isHyponym

      public boolean isHyponym(String synset, String hypo)
      Returns:
      true if the first POS-prefixed synset is a hyponym of the second POS-prefixed synset. This is a recursive method.
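The pairing of isHyponym with isHyponymRecurse and its visited list suggests a depth-first walk up the hypernym links that guards against cycles. The sketch below shows that pattern on a plain map from each synset to its direct hypernyms. The map, the synset IDs and the extra map parameter are invented for illustration, whereas the real methods presumably consult the relations field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HyponymSketch {

    /** Entry point: starts the recursion with an empty visited list. */
    static boolean isHyponym(Map<String, List<String>> hypernyms,
                             String synset, String hyper) {
        return isHyponymRecurse(hypernyms, synset, hyper, new ArrayList<>());
    }

    /** True if hyper is reachable from synset by following hypernym links. */
    static boolean isHyponymRecurse(Map<String, List<String>> hypernyms,
                                    String synset, String hyper,
                                    List<String> visited) {
        if (visited.contains(synset))
            return false; // already explored, so stop to avoid looping on a cycle
        visited.add(synset);
        for (String parent : hypernyms.getOrDefault(synset, Collections.emptyList())) {
            if (parent.equals(hyper) || isHyponymRecurse(hypernyms, parent, hyper, visited))
                return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up 9-digit POS-prefixed IDs: 100000003 -> 100000002 -> 100000001
        Map<String, List<String>> hypernyms = new HashMap<>();
        hypernyms.put("100000003", Arrays.asList("100000002"));
        hypernyms.put("100000002", Arrays.asList("100000001"));
        System.out.println(isHyponym(hypernyms, "100000003", "100000001"));
        System.out.println(isHyponym(hypernyms, "100000001", "100000003"));
    }
}
```

In this sketch a synset is not counted as a hyponym of itself. Whether the real method behaves the same way is not documented here.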
    • removeStopWords

      public String removeStopWords(String sentence)
      Remove stop words from a sentence.
    • removeStopWords

      public List<String> removeStopWords(List<String> sentence)
      Remove stop words from a sentence.
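The two removeStopWords overloads can be illustrated with a small filter over a fixed stop word set. The set below contains only the three examples given for the stopwords field, and case-insensitive matching is an assumption. This is a sketch, not the class's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordSketch {

    // Tiny illustrative list; the real list is read by readStopWords()
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("a", "at", "them"));

    static boolean isStopWord(String word) {
        return STOPWORDS.contains(word.toLowerCase(Locale.ROOT));
    }

    /** List form: keeps the words that are not stop words, in order. */
    static List<String> removeStopWords(List<String> words) {
        List<String> kept = new ArrayList<>();
        for (String w : words)
            if (!isStopWord(w))
                kept.add(w);
        return kept;
    }

    /** String form: splits on whitespace, filters, and rejoins with spaces. */
    static String removeStopWords(String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty())
            return "";
        return String.join(" ", removeStopWords(Arrays.asList(trimmed.split("\\s+"))));
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("Look at them run"));
    }
}
```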
    • isStopWord

      public boolean isStopWord(String word)
      Check whether the word is a stop word
    • collectCountedWordSenses

      public Map<String,Integer> collectCountedWordSenses(String sentence)
      Collect all the synsets that represent the best guess at meanings for all the words in a sentence. Keep track of how many times each sense appears.
    • encoder

      public static void encoder(Object object)
    • decoder

      public static <T> T decoder()
    • serializedExists

      public static boolean serializedExists()
    • serializedOld

      public static boolean serializedOld()
      Check whether sources are newer than serialized version.
    • loadSerialized

      public static void loadSerialized()
      Loads the most recently saved serialized version.
    • serialize

      public static void serialize()
      Save the serialized version.
    • initOnce

      public static void initOnce()
      Read the WordNet files only on initialization of the class.
    • nounRootForm

      public String nounRootForm(String mixedCase, String input)
      Return the root form of the noun, or null if it's not in the lexicon.
    • verbRootForm

      public String verbRootForm(String mixedCase, String input)
      Return the present tense singular form of the verb, or null if it's not in the lexicon.
    • prependPOS

      public Set<String> prependPOS(Set<String> synsets, String POS)
      Prepend a POS number to a set of 8 digit synsets
      Returns:
      a Set of 9-digit synset Strings
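Prepending a part-of-speech digit turns each 8-digit synset into the 9-digit POS-prefixed form used as a key elsewhere in this class. The sketch below uses the digit codes listed for page() (1=noun, 2=verb, 3=adjective, 4=adverb). The sample synset number is made up, and the helper is a standalone illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

final class PrependPosSketch {

    /** Returns a new set in which pos is prepended to every synset string. */
    static Set<String> prependPOS(Set<String> synsets, String pos) {
        Set<String> result = new LinkedHashSet<>();
        for (String synset : synsets)
            result.add(pos + synset);
        return result;
    }

    public static void main(String[] args) {
        // Invented 8-digit synset numbers
        Set<String> synsets = new LinkedHashSet<>(Arrays.asList("01234567", "07654321"));
        System.out.println(prependPOS(synsets, "1"));
    }
}
```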
    • getSynsetsFromWord

      public Set<String> getSynsetsFromWord(String word)
      Get all the synsets for a given word. Print an error if this routine gives a result and getSenseKeysFromWord() doesn't.
      Returns:
      a Set of 9-digit synset Strings
    • getSenseKeysFromWord

      public Map<String,List<String>> getSenseKeysFromWord(String word)
      Get all the synsets for a given word.
      Returns:
      a TreeMap of sense keys in the form of word_POS_num and values that are ArrayLists of synset Strings
    • getWordsFromTerm

      public Map<String,String> getWordsFromTerm(String SUMOterm)
      Get the words and synsets corresponding to a SUMO term. The return is a Map of words with their corresponding synset number.
    • getWordsFromSynset

      public List<String> getWordsFromSynset(String synset)
    • containsWord

      public boolean containsWord(String word, int pos)
      Does WordNet contain the given word.
    • containsWord

      public boolean containsWord(String word)
      Does WordNet contain the given word.
    • containsWordIgnoreCase

      public boolean containsWordIgnoreCase(String word)
      Does WordNet contain the given word, ignoring case.
    • page

      public String page(String inp, int pos, String kbname, String synset, String params)
      This is the regular point of entry for this class. It takes the word the user is searching for and the part of speech index, does the search, and returns a string with HTML formatting codes to present to the user. The part of speech codes must be the same as in the menu options in WordNet.jsp and Browse.jsp.
      Parameters:
      inp - The string the user is searching for.
      pos - The part of speech of the word 1=noun, 2=verb, 3=adjective, 4=adverb
      Returns:
      A string containing the HTML-formatted search result.
    • getDocumentation

      public String getDocumentation(String synset)
      Parameters:
      synset - is a synset with POS-prefix
    • displaySynset

      public String displaySynset(String sumokbname, String synset, String params)
      Parameters:
      synset - is a synset with POS-prefix
    • displayByKey

      public String displayByKey(String sumokbname, String key, String params)
      Parameters:
      key - is a WordNet sense key
      Returns:
      9-digit POS-prefix and synset number
    • writeXML

      public void writeXML()
    • getTransitivity

      public String getTransitivity(String synset, String word)
      Frame transitivity: intransitive - 1,2,3,4,7,23,35; ditransitive - 15,16,17,18,19; transitive - everything else.
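The frame-number table above can be expressed as a small classifier over a single verb frame number. The returned labels and the method shape are assumptions. The real getTransitivity takes a synset and a word, and its exact return values are not documented here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class TransitivitySketch {

    static final Set<Integer> INTRANSITIVE =
        new HashSet<>(Arrays.asList(1, 2, 3, 4, 7, 23, 35));
    static final Set<Integer> DITRANSITIVE =
        new HashSet<>(Arrays.asList(15, 16, 17, 18, 19));

    /** Classifies one verb frame number; labels are illustrative. */
    static String classify(int frame) {
        if (INTRANSITIVE.contains(frame))
            return "intransitive";
        if (DITRANSITIVE.contains(frame))
            return "ditransitive";
        return "transitive"; // everything else
    }

    public static void main(String[] args) {
        for (int frame : new int[] { 1, 8, 15 })
            System.out.println(frame + " -> " + classify(frame));
    }
}
```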
    • writeProlog

      public void writeProlog(KB kb)
    • senseKeyPOS

      public String senseKeyPOS(String senseKey)
    • writeWordNetS

      public void writeWordNetS()
      Write WordNet data to a Prolog file with a single kind of clause in the following format: s(Synset_ID, Word_No_in_the_Synset, Word, SS_Type, Synset_Rank_By_the_Word, Tag_Count)
    • writeWordNetHyp

      public void writeWordNetHyp()
    • processPrologString

      public String processPrologString(String doc)
      Double any single quotes that appear.
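Doubling single quotes is the usual way to escape them inside a single-quoted Prolog atom. A minimal sketch of that transformation, using a hypothetical method name, is:

```java
final class PrologQuoteSketch {

    /** Replaces every single quote with two single quotes. */
    static String doubleSingleQuotes(String doc) {
        return doc.replace("'", "''");
    }

    public static void main(String[] args) {
        System.out.println(doubleSingleQuotes("a dog's breakfast"));
    }
}
```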
    • writeWordNetG

      public void writeWordNetG()
    • writeWordNetProlog

      public void writeWordNetProlog() throws IOException
      Throws:
      IOException
    • generateSynsetID

      public String generateSynsetID(String l)
      Generate a new 8 digit synset ID that doesn't have an existing hash
    • generateNounSynsetID

      public String generateNounSynsetID()
      Generate a new eight digit noun synset ID that doesn't have an existing hash
    • generateVerbSynsetID

      public String generateVerbSynsetID()
      Generate a new eight digit verb synset ID that doesn't have an existing hash
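One plausible way to generate an unused eight-digit ID is to count upward from the current maximum, zero-padding each candidate and skipping any that is already a key. This is a guess at the approach suggested by the maxNounSynsetID and maxVerbSynsetID fields rather than the actual algorithm, and all names below are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SynsetIdSketch {

    /** Returns the first zero-padded 8-digit ID above maxId that is not in used. */
    static String nextSynsetId(String maxId, Set<String> used) {
        long n = Long.parseLong(maxId);
        String candidate;
        do {
            n++;
            candidate = String.format("%08d", n);
        } while (used.contains(candidate));
        return candidate;
    }

    public static void main(String[] args) {
        // Invented IDs standing in for the keys of an existing synset hash
        Set<String> used = new HashSet<>(Arrays.asList("00000001", "00000002"));
        System.out.println(nextSynsetId("00000001", used));
    }
}
```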
    • nounSynsetFromTermFormat

      public String nounSynsetFromTermFormat(String tf, String SUMOterm, KB kb)
      Generate a new noun synset from a termFormat
    • verbSynsetFromTermFormat

      public String verbSynsetFromTermFormat(String tf, String SUMOterm, KB kb)
      Generate a new verb synset from a termFormat
    • synsetFromTermFormat

      public void synsetFromTermFormat(Formula form, String tf, String SUMOterm, KB kb)
      Generate a new synset from a termFormat statement
      Parameters:
      form - is the entire termFormat statement
      tf - is the lexical item (word). note that in the case of a multi-word lexical item it should already have had spaces replaced by underscores
      SUMOterm - is the SUMO term that the lexical item is mapped to
    • termFormatsToSynsets

      public void termFormatsToSynsets(KB kb)
      Generate a new synset from a termFormat
    • testWordFreq

      public static void testWordFreq()
      A method used only for testing. It should not be called during normal operation.
    • testProcessPointers

      public static void testProcessPointers()
      A method used only for testing. It should not be called during normal operation.
    • checkWordsToSenses

      public static void checkWordsToSenses()
    • getEntailments

      public static void getEntailments()
    • showHelp

      public static void showHelp()
    • main

      public static void main(String[] args)
      A main method, used only for testing. It should not be called during normal operation.