Named Entity Recognition and Normalization

Named entities (NEs) are expressions that denote real-world entities, such as organizations, persons, locations, and geographical entities, and are usually mentioned by proper names. Numerals and temporal and monetary expressions are often classified as named entities as well.

Named Entity Recognition (NER) is the task of identifying the named entities in a given text. It is a non-trivial task because no simple lookup list can cover all possible NEs. We therefore usually train machine learning or deep learning models to identify the NEs in a sentence and classify their types. In essence, NER marks the NEs in a text, as in the following example:

Ukrayna Devlet Başkanı Zelenski yarın Cumhurbaşkanı Erdoğan ile görüşmek üzere İstanbul'a geliyor. (Ukrainian President Zelensky is coming to Istanbul tomorrow to meet with President Erdoğan.)

In addition to identifying NEs, it is usually preferable to determine their classes. The common classes are person, organization, and location names. More advanced NER systems have fine-grained classes such as product names, titles, and event names. Domain-specific solutions also require custom entity types; for example, IBANs and account numbers are potential NE types in a banking solution.
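One common way to represent both the boundaries and the classes of NEs is token-level BIO tagging, where each token is marked as the beginning of an entity (B-), inside one (I-), or outside any entity (O). The sketch below is only an illustration: the tags for the example sentence are assigned by hand rather than predicted by a model, and the label names PER and LOC as well as the function name bio_to_spans are assumptions made for this example.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse token-level BIO tags into (entity text, class label) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the one that was open, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" tag (or an I- tag with no matching open entity): close any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Hand-assigned tags for the example sentence (an assumption, not model output).
tokens = ["Ukrayna", "Devlet", "Başkanı", "Zelenski", "yarın", "Cumhurbaşkanı",
          "Erdoğan", "ile", "görüşmek", "üzere", "İstanbul", "'a", "geliyor", "."]
tags = ["B-LOC", "O", "O", "B-PER", "O", "O",
        "B-PER", "O", "O", "O", "B-LOC", "O", "O", "O"]

print(bio_to_spans(tokens, tags))
```

Multi-token entities are handled the same way: a sequence tagged B-PER, I-PER, I-PER collapses into a single person span.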

While NE classes usually include numerals and monetary and temporal expressions, most NER systems concentrate on person, location, and organization names. For real-world solutions, however, even recognizing numerals can be tricky. For example, "20 milyon" (20 million) is a valid representation of a number in Turkish, with numeric and textual components that together form the value. With additional expressions such as "buçuk" (half), ranges, or ordinal suffixes as in "iki bininci" (two thousandth), the problem quickly escalates. Identifying these numerals is already a complex problem, but real-life applications need more: the exact normalized value of each numeral, so that the resulting structured data is accurate. Returning to the previous example, the numerical expression "20 milyon" should be annotated with the value 20000000.
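To make the normalization step concrete, here is a minimal rule-based sketch that converts an already-recognized Turkish numeral expression into its value. It is an illustration under simplifying assumptions, not a production normalizer: it covers only cardinal numbers written with digits, basic number words, the scale words "bin", "milyon" and "milyar", and "buçuk", and it deliberately ignores ranges and ordinal suffixes. The names normalize_number and turkish_lower are hypothetical helpers introduced for this example.

```python
from fractions import Fraction

UNITS = {"sıfır": 0, "bir": 1, "iki": 2, "üç": 3, "dört": 4,
         "beş": 5, "altı": 6, "yedi": 7, "sekiz": 8, "dokuz": 9}
TENS = {"on": 10, "yirmi": 20, "otuz": 30, "kırk": 40, "elli": 50,
        "altmış": 60, "yetmiş": 70, "seksen": 80, "doksan": 90}
SCALES = {"bin": 10**3, "milyon": 10**6, "milyar": 10**9}


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish casing rules (İ -> i, I -> ı) before str.lower()."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def normalize_number(expression: str):
    """Return the numeric value of a simple Turkish cardinal expression."""
    total = Fraction(0)    # sum of the groups already closed by a scale word
    current = Fraction(0)  # value of the group currently being read

    for token in turkish_lower(expression).split():
        if token.isdecimal():
            current += int(token)
        elif token in UNITS:
            current += UNITS[token]
        elif token in TENS:
            current += TENS[token]
        elif token == "yüz":
            # "yüz" alone means 100; "iki yüz" means 200.
            current = (current or 1) * 100
        elif token == "buçuk":
            # "buçuk" adds one half to the current group ("iki buçuk" = 2.5).
            current += Fraction(1, 2)
        elif token in SCALES:
            # A scale word multiplies the current group and closes it.
            total += (current or 1) * SCALES[token]
            current = Fraction(0)
        else:
            raise ValueError(f"unsupported token: {token!r}")

    value = total + current
    return int(value) if value.denominator == 1 else float(value)


print(normalize_number("20 milyon"))         # 20000000
print(normalize_number("iki buçuk milyon"))  # 2500000
```

Exact fractions are used internally so that "buçuk" combined with a scale word, as in "iki buçuk milyon", yields an exact integer rather than a rounded floating-point value.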

NE linking is another task, in which the mention of an NE in the text is associated with an entity in a knowledge base such as Wikidata. For instance, the example sentence above contains the mention "Erdoğan", which should be linked to the Wikidata entity with the unique identifier Q39259, referring to Recep Tayyip Erdoğan, the president of Türkiye. Again, this is not a trivial task, since the mention does not include the full name and the knowledge base contains multiple candidate entities, such as Yılmaz Erdoğan, a film director and screenwriter; Ömer Erdoğan, a Turkish soccer player; and Nehir Erdoğan, a Turkish actress.
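The disambiguation problem can be illustrated with a toy linker that picks, among the candidates sharing a surname, the one whose keywords overlap most with the words around the mention. This is a deliberately simplified sketch and not how a real linker queries Wikidata: the in-memory candidate table, the keyword sets, and the names Candidate and link_entity are all assumptions made for this example. Only the identifier Q39259 comes from the text; the other identifiers are left as None because they are not given.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    kb_id: Optional[str]  # knowledge-base identifier, None if not given in the text
    keywords: frozenset   # hypothetical context words describing the entity


# Toy in-memory "knowledge base", keyed by the lowercased surface form.
CANDIDATES = {
    "erdoğan": [
        Candidate("Recep Tayyip Erdoğan", "Q39259",
                  frozenset({"cumhurbaşkanı", "türkiye", "ankara"})),
        Candidate("Yılmaz Erdoğan", None,
                  frozenset({"yönetmen", "senarist", "film"})),
        Candidate("Ömer Erdoğan", None,
                  frozenset({"futbolcu", "futbol", "maç"})),
        Candidate("Nehir Erdoğan", None,
                  frozenset({"oyuncu", "dizi", "rol"})),
    ]
}


def link_entity(mention: str, context: str) -> Optional[Candidate]:
    """Return the candidate whose keywords overlap most with the context, or None."""
    def lower(text: str) -> str:
        # Turkish-aware lowercasing (İ -> i, I -> ı).
        return text.replace("İ", "i").replace("I", "ı").lower()

    context_tokens = set(lower(context).split())
    best, best_score = None, 0
    for candidate in CANDIDATES.get(lower(mention), []):
        score = len(candidate.keywords & context_tokens)
        if score > best_score:
            best, best_score = candidate, score
    return best  # None when no candidate shares any keyword with the context


sentence = ("Ukrayna Devlet Başkanı Zelenski yarın Cumhurbaşkanı Erdoğan "
            "ile görüşmek üzere İstanbul'a geliyor.")
match = link_entity("Erdoğan", sentence)
print(match.name, match.kb_id)
```

Here the word "Cumhurbaşkanı" (president) in the context is what separates the intended entity from the other candidates; with no such cue, the sketch returns None instead of guessing.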