Named entities

Satoshi Sekine at NYU has created an Extended Named Entity (ENE) hierarchy which organizes approximately 200 different types of entities.  Prototypes and both positive and negative examples are provided, along with “problematic points” that indicate ambiguities and other difficulties.

Sekine’s set of entity types is neither complete nor completely correct for many uses; he developed it for a question-answering task based on a corpus of news articles.  However, it is interesting and directionally correct in its description of named entity phenomena.

I have converted Sekine’s ENE data to LexiLink format and have self-aligned entities after normalizing the problematic points.  By loading the data in this way, the type of problem leading to an ambiguity and the types of entities that may be confused are readily inspected.  Below is a screenshot of ENE loaded into LexiLink focused on “Company”.  Note the expansion of relationships in the right-hand pane.  (I will replace the image below with the Headspace Sprockets version shortly).

[Screenshot: ENE loaded into LexiLink, focused on “Company”, with relationships expanded in the right-hand pane]

Named entity identification is performed using a variety of techniques.  The Headspace Sprockets approach is consistent with SBCG and views the different classes of named entities as semi-productive constructions.

 

Proper Name Variations

Proper names for the same entity appear in varying forms within a single document (usually in a more abbreviated form on later mentions) and, of course, across documents.  The problem is to determine whether two proper names refer to the same entity.  This is complicated because the same proper name is often used to refer to different entities (a form of the sense disambiguation problem).

Proper name variations follow combinatorial regularities that can be partly enumerated by considering the features that may be expressed.  Other variations (such as misspellings) are best evaluated using an appropriate distance metric.

The feature space for proper name variations differs for each type of named entity.  For people, the feature space includes the following (roughly ordered from most expressive to least expressive):

  • Honorific Equal: active if both tokens are honorifics and identical. An honorific is a title such as “Mr.”, “Mrs.”, “President”, “Senator” or “Professor”.
  • Honorific Equivalence: active if both tokens are honorifics, not identical, but equivalent (“Professor”, “Prof.”).
  • Honorific Mismatch: active if the tokens are different, non-equivalent honorifics.
  • Equality: active if both tokens are identical.
  • Case-Insensitive Equal: active if the tokens are case-insensitive equal.
  • Nickname: active if tokens have a “nickname” relation (“Thomas” and “Tom”).
  • Prefix Equality: active if the prefixes of both tokens are equal. The prefix is defined as the substring starting at the beginning of a token and running up to the first vowel (or the second vowel if the first letter is a vowel).
  • Substring: active if one of the tokens is a substring of the other.
  • Abbreviation: active if one of the tokens is an abbreviation of the other; e.g., “Corporation” and “Corp.”.
  • Prefix Edit Distance: active if the prefixes of both tokens have an edit-distance of 1.
  • Edit Distance: active if the tokens have an edit-distance of 1.
  • Initial: active if one of the tokens is an initial of the other; e.g., “Paul” and “P.”.
  • Symbol Map: active if one token is a symbolic representative of the other (“and” and “&”).

(Adapted from Li et al., “Semantic Integration in Text: From Ambiguous Names to Identifiable Entities,” AI Magazine: Special Issue on Semantic Integration, 2005.)

Once the names are suitably tokenized, a similarity metric can be established by comparing the token strings using this prioritized feature space.
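As a rough illustration, here is a minimal sketch in Python of a few of these token-pair features.  The honorific and nickname tables, the prefix helper, and the feature names are assumptions made for the example, not a description of Li et al.’s implementation.

# Sketch: a few of the token-pair features above, in priority order.
# The honorific/nickname tables and prefix() are illustrative only.

HONORIFICS = {"mr.": "mr.", "mister": "mr.", "prof.": "professor", "professor": "professor"}
NICKNAMES = {("thomas", "tom"), ("robert", "bob")}
VOWELS = set("aeiou")

def prefix(token: str) -> str:
    """Substring from the start of the token up to the first vowel
    (or up to the second vowel if the token begins with one)."""
    t = token.lower()
    skip = 1 if t[:1] in VOWELS else 0
    for i, ch in enumerate(t):
        if ch in VOWELS:
            if skip:
                skip -= 1
            else:
                return t[:i]
    return t

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def token_features(t1: str, t2: str) -> list:
    """Return the names of the active features for a pair of name tokens."""
    a, b = t1.lower(), t2.lower()
    feats = []
    if a in HONORIFICS and b in HONORIFICS:
        feats.append("honorific-equal" if t1 == t2
                     else "honorific-equivalence" if HONORIFICS[a] == HONORIFICS[b]
                     else "honorific-mismatch")
    if t1 == t2:
        feats.append("equality")
    elif a == b:
        feats.append("case-insensitive-equal")
    if (a, b) in NICKNAMES or (b, a) in NICKNAMES:
        feats.append("nickname")
    if prefix(a) == prefix(b):
        feats.append("prefix-equality")
    if a != b and (a in b or b in a):
        feats.append("substring")
    if edit_distance(prefix(a), prefix(b)) == 1:
        feats.append("prefix-edit-distance")
    if edit_distance(a, b) == 1:
        feats.append("edit-distance")
    if (len(a.rstrip(".")) == 1 and b.startswith(a[0])) or \
       (len(b.rstrip(".")) == 1 and a.startswith(b[0])):
        feats.append("initial")
    return feats

print(token_features("Prof.", "Professor"))  # ['honorific-equivalence', 'prefix-equality']
print(token_features("Thomas", "Tom"))       # ['nickname', 'prefix-edit-distance']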

 

Proper Name Regularities

Below is the beginning of a discussion of proper name regularities, taken from an ESL handout; refer to CGEL for more.  (A rough code sketch of these rules appears after Rule #5.)

Rule #1: If the word “of” is part of the name, you need to use “the”.

For example, we say:

the University of West Georgia, but we say

Georgia State University.

In this example, the preposition “of” helps specify which university we are talking about (the University of West Georgia).

Rule #2: Place names that are plural usually use “the”.

For example, we say:

the Philippines, but we say

Mexico.

We also say:

the Rocky Mountains, but we say:

Whistler Mountain.

Rule #3: When a place name includes geographical words like ocean, sea, gulf, peninsula, river and desert, we use “the”. However, place names with some other geographical words like lake, mountain, bay, hill, island and park do not use an article if they are singular.

For example:

 

Use “the”             No article
--------------------  ----------------
the Pacific Ocean     Cultus Lake
the Caspian Sea       Grouse Mountain
the Persian Gulf      English Bay
the Sinai Peninsula   Beacon Hill
the Fraser River      Vancouver Island
the Gobi Desert       Stanley Park

 

Rule #4: When a place name is the name of a geographical region, we use “the”.

For example, we say:

the Middle East

the Prairies

the North

Rule #5: Names of organizations often need “the”.

For example, we say:

the World Health Organization

the Supreme Court

the Vancouver Art Gallery

the New Westminster Public Library

the Coquitlam Chamber of Commerce

the National Hockey League

the Conservative Party
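
As a rough sketch of how Rules 1–5 might be applied programmatically, the toy Python heuristic below encodes them; the word lists and the needs_the name are illustrative assumptions, not a complete treatment of article usage.

# Toy heuristic for Rules 1-5: decide whether a place/organization name
# usually takes "the". The word lists are illustrative, not exhaustive.

THE_GEO_WORDS = {"ocean", "sea", "gulf", "peninsula", "river", "desert"}
NO_ARTICLE_GEO_WORDS = {"lake", "mountain", "bay", "hill", "island", "park"}
REGIONS = {"middle east", "prairies", "north"}
ORG_WORDS = {"organization", "court", "gallery", "library", "league", "party", "chamber"}

def needs_the(name: str) -> bool:
    words = name.lower().split()
    last = words[-1]
    if "of" in words:                    # Rule 1: "University of ..."
        return True
    if last.endswith("s"):               # Rule 2: plural place names (rough)
        return True
    if last in THE_GEO_WORDS:            # Rule 3: ocean, sea, gulf, ...
        return True
    if last in NO_ARTICLE_GEO_WORDS:     # Rule 3: lake, mountain, bay, ...
        return False
    if " ".join(words) in REGIONS:       # Rule 4: geographical regions
        return True
    return last in ORG_WORDS             # Rule 5: organizations (rough)

for name in ["University of West Georgia", "Georgia State University",
             "Pacific Ocean", "Cultus Lake", "Philippines", "Whistler Mountain"]:
    print(name, "->", "the" if needs_the(name) else "(no article)")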

 

Context of proper names

• {TITLE} {PERSON}
  Ex: “U.S. President George Bush”, “Mr. Frank Leonard”

• {PERSON}, the {TITLE} of {ORGANIZATION}
  Ex: “Fred Martin, the CEO of XYZ Corp.”

• {PERSON} joined {COMPANY}
  Ex: “Mary Smith joined Microsoft.”

• headquarters in {LOCATION}
  Ex: “headquarters in London”

• {LOCATION}, {LOCATION}
  Ex: “Salt Lake City, Utah”
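
A minimal sketch in Python of how such context patterns might be applied once candidate entity spans have been tagged.  The inline tag markup, the pattern names, and the pattern set are assumptions for illustration only.

import re

# Sketch: apply context patterns to text in which candidate entities have
# already been wrapped in simple inline tags, e.g. "<PERSON>Fred Martin</PERSON>".
PATTERNS = {
    "title-person": re.compile(r"<TITLE>[^<]+</TITLE>\s+<PERSON>[^<]+</PERSON>"),
    "person-title-org": re.compile(
        r"<PERSON>[^<]+</PERSON>,\s+the\s+<TITLE>[^<]+</TITLE>\s+of\s+<ORG>[^<]+</ORG>"),
    "person-joined-company": re.compile(
        r"<PERSON>[^<]+</PERSON>\s+joined\s+<COMPANY>[^<]+</COMPANY>"),
    "headquarters-in": re.compile(r"headquarters in\s+<LOCATION>[^<]+</LOCATION>"),
    "location-location": re.compile(
        r"<LOCATION>[^<]+</LOCATION>,\s+<LOCATION>[^<]+</LOCATION>"),
}

def match_contexts(tagged_text: str) -> list:
    """Return the names of the context patterns found in pre-tagged text."""
    return [name for name, pat in PATTERNS.items() if pat.search(tagged_text)]

text = "<PERSON>Fred Martin</PERSON>, the <TITLE>CEO</TITLE> of <ORG>XYZ Corp.</ORG>"
print(match_contexts(text))  # ['person-title-org']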

 

Candidate Proper Names

Ontological concepts are usually proper names that are generated in patterns that may be identified using the shallow techniques of tagging and regular-expression recognition.  (Of course, a lexicon, including domain-specific lexicons, may include other parts of speech, particularly non-auxiliary verbs; here our discussion concerns NP ontological concepts.)

Candidate noun phrases that may serve as concept identifiers (names or synonyms) can be taken to have the structure proposed by J. Justeson and S. Katz, “Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text,” Natural Language Engineering 1, 9–27 (1995).

Noun = NN | NP | NNS | NPS

AdjectivalModifier = Adj . (CC . Adj)*

Candidate = (DT . (VBG | VBN)+) . (Noun | AdjectivalModifier)* . Noun

Where DT denotes a determiner; VBG and VBN denote gerund-participle and past participle forms of verbs, respectively; Adj denotes an adjective; CC denotes a conjunction; NN and NNS denote singular and plural nouns; NP and NPS denote singular and plural proper nouns.

An alternative formulation is:

((Adj | Noun)+ | (((Adj | Noun)* (Noun Prep)? ) (Adj | Noun)* )) Noun

(Adapted from Kozakov et al., “Glossary extraction and utilization in the information search and delivery system for IBM Technical Support,” IBM Systems Journal, Vol 43, No 3 (2004). )
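
Below is a rough Python sketch of extracting candidates by matching a regular expression over the part-of-speech tag sequence.  It covers only the simpler (Adj | Noun)+ Noun portion of the pattern and assumes input that is already tokenized and tagged with Penn Treebank tags (where NNP/NNPS correspond to the NP/NPS above); it is not the Justeson-Katz or Kozakov implementation.

import re

# Sketch: find candidate terms by mapping POS tags to single-character codes
# and running a regex over the resulting code string.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}

def candidates(tagged: list) -> list:
    """tagged: list of (word, Penn Treebank tag) pairs."""
    # Map each token to N(oun), A(djective), or x (other).
    codes = "".join(
        "N" if tag in NOUN_TAGS else "A" if tag in ADJ_TAGS else "x"
        for _, tag in tagged)
    found = []
    # (Adj | Noun)* Noun, kept only if at least two tokens long.
    for m in re.finditer(r"[AN]*N", codes):
        if m.end() - m.start() >= 2:
            found.append(" ".join(w for w, _ in tagged[m.start():m.end()]))
    return found

tagged = [("the", "DT"), ("novel", "JJ"), ("kinase", "NN"), ("target", "NN"),
          ("was", "VBD"), ("reported", "VBN")]
print(candidates(tagged))  # ['novel kinase target']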

Filtering is then applied to candidates to discard seemingly irrelevant terms and, in particular, non-domain modifiers from noun phrases.  Irrelevant terms may include long collocations (more than six words), proper nouns, and strings that denote special identifiers such as compound names, addresses, and URLs.

An example of a non-domain modifier is “new” in the candidate “new kinase target”.
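
A small filtering sketch in Python; the length threshold, the stop-modifier list, and the identifier patterns are illustrative assumptions.

import re

# Sketch: discard seemingly irrelevant candidates and strip non-domain modifiers.
NON_DOMAIN_MODIFIERS = {"new", "other", "such", "various"}
SPECIAL_ID = re.compile(r"(https?://|www\.|\d{3}-\d{2}-\d{4}|@)")  # URLs, SSNs, emails

def filter_candidates(cands: list) -> list:
    kept = []
    for cand in cands:
        words = cand.split()
        if len(words) > 6:              # overly long collocations
            continue
        if SPECIAL_ID.search(cand):     # URLs, addresses, other special identifiers
            continue
        while words and words[0].lower() in NON_DOMAIN_MODIFIERS:
            words = words[1:]           # strip leading non-domain modifiers ("new", ...)
        if words:
            kept.append(" ".join(words))
    return kept

print(filter_candidates(["new kinase target", "see http://example.com now"]))
# ['kinase target']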

Named Entity Recognition Feature Space

Nadeau, David and Sekine, Satoshi, “A survey of named entity recognition and classification,” Lingvisticae Investigationes 30(1), 2007.

 

Word-level features

Features         Examples
---------------  ------------------------------------------------------------
Case             - Starts with a capital letter
                 - Word is all uppercased
                 - Word is mixed case (e.g., iXL, eBay)
Punctuation      - Ends with period, has internal period (e.g., St., I.B.M.)
                 - Internal apostrophe, hyphen or ampersand (e.g., O’Connor)
Digit            - Digit pattern (3.1.1, dates, percentages, SSNs, etc.)
                 - Cardinal and ordinal
                 - Roman numeral
                 - Word with digits (e.g., W3C, 3M)
Character        - Possessive mark, first-person pronoun
                 - Greek letters
Morphology       - Prefix, suffix, singular version, stem
                 - Common ending (-ist, -ese, -ian, -tion, …)
Part-of-speech   - Proper name, verb, noun, foreign word
Function         - Alpha, non-alpha, n-gram
                 - Lowercase, uppercase version
                 - Pattern, summarized pattern
                 - Token length, phrase length
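
A minimal Python sketch that computes a handful of the word-level features in the table above for a single token; the feature names are assumptions made for the example.

import re

# Sketch: compute a few of the word-level features listed above for one token.
def word_features(token: str) -> dict:
    return {
        "init-cap":         token[:1].isupper(),
        "all-caps":         token.isalpha() and token.isupper(),
        "mixed-case":       (token.isalpha() and not token.isupper()
                             and not token.islower() and not token.istitle()),
        "ends-with-period": token.endswith("."),
        "internal-period":  "." in token[:-1],
        "internal-punct":   bool(re.search(r"[’'&-]", token[1:-1])),
        "has-digit":        any(c.isdigit() for c in token),
        "roman-numeral":    bool(re.fullmatch(r"[IVXLCDM]+", token)),
        "common-ending":    token.lower().endswith(("ist", "ese", "ian", "tion")),
        "token-length":     len(token),
    }

print(word_features("I.B.M."))   # ends-with-period, internal-period, ...
print(word_features("eBay"))     # mixed-case, ...
print(word_features("W3C"))      # has-digit, ...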

 

 

Named Entity Recognition Clues

Words that indicate type – for example, XXX Associates, XXX Foundation, XXX and Sons.

Designators – Inc., Ltd.
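
A tiny Python sketch of using such clue words and designators to suggest an ORGANIZATION reading; the word lists and the function name are illustrative assumptions.

# Sketch: type-indicating words and designators as organization clues.
TYPE_WORDS = {"associates", "foundation", "sons", "institute"}
DESIGNATORS = {"inc.", "inc", "ltd.", "ltd", "corp.", "corp", "llc"}

def looks_like_organization(name: str) -> bool:
    words = {w.lower() for w in name.split()}
    return bool(words & (TYPE_WORDS | DESIGNATORS))

print(looks_like_organization("Smith and Sons"))   # True
print(looks_like_organization("XYZ Corp."))        # True
print(looks_like_organization("Mary Smith"))       # False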

 

Document and corpus features

Features              Examples
--------------------  ------------------------------------------------------------
Multiple occurrences  - Other entities in the context
                      - Uppercased and lowercased occurrences (often common nouns)
                      - Anaphora, coreference
Local syntax          - Enumeration, apposition
                      - Position in sentence, in paragraph, and in document
Meta information      - URL, URI, email header, XML section
                      - Bulleted / numbered lists, tables, figures
Corpus frequency      - Word and phrase frequency
                      - Co-occurrences

 

Finding coreferences and aliases in a text can be reduced to the same problem: finding all occurrences of an entity in a document. This problem is of great complexity. R. Gaizauskas et al. (1995) use 31 heuristic rules to match multiple occurrences of company names. For instance, two multi-word expressions match if one is an initial subsequence of the other. An even more complex task is the recognition of entity mentions across documents. X. Li et al. (2004) propose and compare a supervised and an unsupervised model for this task. They propose the use of word-level features engineered to handle equivalences (e.g., “prof.” is equivalent to “professor”) and relational features to encode the relative order of tokens between two occurrences.
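
As an illustration of the initial-subsequence rule mentioned above (just one of the many heuristics such systems use), here is a minimal Python sketch; whitespace tokenization and the function name are assumptions.

# Sketch of the initial-subsequence heuristic: two multi-word names match
# if one is an initial subsequence of the other.
def initial_subsequence_match(name1: str, name2: str) -> bool:
    a, b = name1.lower().split(), name2.lower().split()
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    return len(shorter) > 0 and longer[:len(shorter)] == shorter

print(initial_subsequence_match("International Business Machines",
                                "International Business Machines Corp."))  # True
print(initial_subsequence_match("XYZ Corp.", "ABC Corp."))                 # False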