Friday, July 8, 2016

Natural Language Processing (NLP) Notes

These are working notes on Natural Language Processing (NLP), which is part of the bioinformatics project.
  • Definition: Natural Language Processing (NLP) is the study of the computational treatment of natural (human) language, i.e., teaching computers how to understand and generate human language.
  • Research Resource: NLP draws on research in 
    • Linguistics 
    • Theoretical Computer Science
    • Mathematics
    • Statistics
    • Artificial Intelligence
    • Psychology
    • Databases
  • Language and Communication (Speaker):
    • Intention- goals, shared knowledge and beliefs
    • Generation- tactical
    • Synthesis- text or speech
  • Language and Communication (Listener):
    • Perception
    • Interpretation- syntactic, semantic, pragmatic
    • Incorporation- internalization, understanding
  • Basic NLP Pipeline: Language --(Understanding)--> Computer --(Generation)--> Language
  • Challenges coming from the text:
    • current events, background events, speculation, property, reference to previous sentence
    • genres of text: blogs, emails, press releases, chats, debates, etc.
    • special types of text and terminologies
    • incomplete sentences
    • fiction text, rare words
    • extremely long and complex sentences
    • multiple possible interpretations, ambiguous sentences
    • lexical, structural, scope ambiguity
  • Textbook
  • Courses
    • JHU (Jason Eisner)
    • Cornell (Lillian Lee)
    • Stanford (Chris Manning)
    • U. Maryland (Hal Daume)
    • Berkeley (Dan Klein)
    • U. Texas (Ray Mooney)
    • Coursera (Manning/Jurafsky, survey)
    • Coursera (Michael Collins, advanced, 2013)
  • Related Fields
    • CL (Computational Linguistics)- a more mathematical and formal treatment of linguistics, less focused on applications.
    • IR (Information Retrieval)- study of finding information and documents (text, speech or video).
    • SP (Speech Processing)- deals with the understanding and generation of spoken signals.
    • HLT (Human Language Technology)- covers the applied component of NLP.
    • NLE (Natural Language Engineering)- a synonym of HLT.
    • ML (Machine Learning)- computational, statistical study of learning.
  • Research in NLP
    • Conferences: ACL/NAACL, EMNLP, SIGIR, AAAI/IJCAI, Coling, HLT, EACL, AMTA/MT Summit, ICSLP/Eurospeech
    • Journals: Computational Linguistics, TACL, Natural Language Engineering, Information Retrieval, Information Processing and Management, ACM Trans. on Information Systems, ACM TALIP, ACM TSLP
    • ACL Anthology
    • ACL Anthology Network (AAN)
  • Difficulties of NLP
    • Similar forms do not imply similar meanings (e.g., "Beverly Hills/Sills")
    • Computers are not good at metaphors (e.g., "Time flies like an arrow.")
    • Word order is important (e.g., "The box/pen is in the pen/box.")
    • Not the same semantic relationship (e.g., "Mary and Sue are mothers/sisters.")
    • Not the same order of magnitude due to semantic change (e.g., "Every American has a mother/president.")
    • Draw on common sense (e.g., "We gave monkeys bananas because they were hungry/over-ripe.")
    • Syntactic correctness does not guarantee semantic correctness (e.g., "Colorless green ideas sleep furiously.")
    • Ambiguous words due to multiple meanings (e.g., "ball"), part of speech (e.g., "fly"), different pronunciations (e.g., "address"), noun-noun phrases (e.g., "Science fiction writer"), etc.
  • Types of Ambiguity
    • Morphological- "Joe is quite impossible/important."
    • Phonetic- "Joe's finger got number."
    • Part of Speech- "Joe won the first round."
    • Syntactic- "Call Joe a taxi."
    • PP Attachment- "Joe ate pizza with a fork/with meatballs/with Sam/with pleasure."
    • Sense- "Joe took the bar exam."
    • Modality- "Joe may win the lottery."
    • Subjectivity- "Joe believes the stocks will rise."
    • CC Attachment- "Joe likes ripe apples and pears."
    • Negation- "Joe likes his pizza with no cheese and tomatoes."
    • Referential- "Joe yelled at Mike. He had broken the bike." "Joe yelled at Mike.  He was angry at him."
    • Reflexive- "John brought him/himself a present."
    • Ellipsis and Parallelism- "Joe gave Mike a beer and James a glass of wine."
    • Metonymy- "Boston called and left a message for Joe."
  • Other Difficulties
    • Non-standard- "+1-212-772-1220", "A360"
    • Slang- "friend (verb)", "spam"
    • Novel words and usages- "yolo", "selfie"
    • Inconsistencies- "Junior college" vs. "college junior"
    • Typos and grammatical errors
    • Parsing problems- "cup holder", "Federal Reserve Board Chairman"
    • Complex sentences
    • Counterfactual sentences- "if you were to do..."
    • Humor and sarcasm
    • Implicature/inference/world knowledge- "I was late because my car broke down." implies that I have a car, that I use the car to get to places, etc.
    • Semantics vs. pragmatics- "Do you know the time?"
    • Language is hard even for humans (both L1 and L2 speakers).
  • Synonyms and Paraphrases
    • Synonyms- "climbed", "gained" and "rose"
    • Paraphrases- "its best close", "for its best showing" and "its highest level"
  • Linguistic Knowledge
    • Constituents- "My cousin's neighbor's children eat pizza." and "Eat pizza"
    • Collocations- "strong beer" but *"powerful beer"
  • Getting Linguistic Knowledge into the System
    • Manual rules
    • Automatically acquired from large text collections (corpora)
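
To make the "automatically acquired from corpora" idea concrete, here is a minimal sketch of how collocations such as "strong beer" could be picked up from raw text. It scores adjacent word pairs by pointwise mutual information (PMI), which is one common choice and not the only one. The toy corpus, the function names (tokenize, bigram_pmi) and the min_count threshold are all assumptions made up for illustration; a real system would need a far larger corpus and a proper tokenizer.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bigram_pmi(sentences, min_count=2):
    """Return {(w1, w2): PMI} for adjacent word pairs seen at least min_count times."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        unigrams.update(tokens)
        # Pair each token with its right-hand neighbor inside the same sentence.
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy corpus, invented for illustration only.
corpus = [
    "He ordered a strong beer",
    "This strong beer is too strong for me",
    "A powerful engine needs a powerful battery",
    "She likes strong beer and a powerful engine",
]

scores = bigram_pmi(corpus)
for pair, pmi in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(pmi, 2))
```

In this toy corpus "strong beer" is scored because the pair recurs, while "powerful beer" never occurs and so gets no score at all. That is the sense in which the knowledge comes from the corpus rather than from a hand-written rule.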