Technological Overview

Turkish.AI provides multiple web services for various tasks. Our fundamental service is the Core API, which performs basic linguistic processing on the input text and returns JSON. The Content Moderation service is built for companies that manually moderate user-generated content for vulgarity, hate speech, and illegal or sensitive content. This page presents some details of the technologies under the hood of our products.

Core API

Our Core API performs basic linguistic analysis on Turkish texts, ranging from tokenization to sentiment analysis. The major analysis components are text normalization, advanced tokenization (which covers sentence splitting), morphological analysis and disambiguation, syntactic analysis, named entity recognition and sentiment analysis. All of these modules run by default, but a customer can customize the pipeline and get a faster response by skipping unnecessary linguistic processing components.

Like all our other services, the Core API works as a web service. See our documentation for more technical details.
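As a rough illustration of component selection, the sketch below builds a JSON request body that names only the modules to run. The component identifiers and the field names `text` and `components` are assumptions made for this example and are not taken from the actual API; the documentation remains the authoritative reference.

```python
import json

# Hypothetical labels for the modules listed above; the real API may
# use different identifiers.
ALL_COMPONENTS = [
    "normalization",
    "tokenization",
    "morphology",
    "syntax",
    "ner",
    "sentiment",
]


def build_request(text, components=None):
    """Return a JSON request body; every component runs when none are given."""
    selected = ALL_COMPONENTS if components is None else list(components)
    unknown = [name for name in selected if name not in ALL_COMPONENTS]
    if unknown:
        raise ValueError(f"unknown components: {unknown}")
    # The field names "text" and "components" are assumptions for illustration.
    return json.dumps({"text": text, "components": selected}, ensure_ascii=False)


# Request only tokenization and named entity recognition.
print(build_request("İhale 5 milyon dolar ile Sabancı Holding'te kaldı.", ["tokenization", "ner"]))
```

Skipping modules in this way is what would shorten the response time, since fewer processing steps are executed per request.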

Text Normalization

Almost all NLP tools assume correctly spelled words. However, most user-generated content contains misspellings, typos, informal abbreviations and intentional mistakes. To get the most out of the text, our text normalization module corrects most of these issues prior to further processing.

Below, you may find actual user generated content instances gathered from product reviews, Twitter, news comments and forum entries:

Actual:     o kadar kucukki ama buyuk olsaydi mukkemle olurdu
Normalized: O kadar küçük ki ama büyük olsaydı mükemmel olurdu.

Actual:     Bayıldımmmm harika 1.70 55 kilo s aldım
Normalized: Bayıldım harika 1.70 55 kilo S aldım.

Actual:     Sifir yildiz vermek istiyorum ama mumkun degil oyuzden bir veriyorum
Normalized: Sıfır yıldız vermek istiyorum ama mümkün değil, o yüzden bir veriyorum.

Actual:     gıttıgım yolda mazot parası kdr ücret ödeyeceksem neden kullanayım
Normalized: Gittiğim yolda mazot parası kadar ücret ödeyeceksem neden kullanayım?

Actual:     ülkerin menfaat için mecburuzz bazıları memnun olmucak ama mecbur
Normalized: Ülkenin menfaati için mecburuz bazıları memnun olmayacak ama mecbur.

Actual:     @jkookjr maci fransa alir gibi geldi izleyemiyom ama
Normalized: @jkookjr maçı Fransa alır gibi geldi izleyemiyorum ama.

Most NLP components, such as morphological analysis, morphological disambiguation and parsing, will fail to analyze the original user-generated texts correctly. Our text normalization module is usually the very first pre-processing step; it resolves most of these issues and outputs the normalized forms of the words.

Running this component is optional: there is no need to recover from errors if your text is not noisy, that is, if it comes from a reliable source that obeys linguistic rules and produces well-formed Turkish sentences. Conversely, if your text is user-generated, such as product comments or complaint e-mails, it is likely to be noisy, and applying the text normalization component may help improve overall accuracy.
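To give a feel for the kind of noise involved, here is a deliberately small sketch that handles only two of the phenomena seen in the examples above: exaggerated letter repetition and one informal abbreviation. It is not how our normalization module works; the lookup table and the three-character threshold are assumptions chosen for illustration, and issues such as missing diacritics or casing are out of its scope.

```python
import re

# Tiny illustrative lookup; a real system would need far broader coverage.
ABBREVIATIONS = {"kdr": "kadar"}


def normalize(text):
    # Collapse runs of three or more identical characters ("mmmm" -> "m").
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Expand known informal abbreviations token by token.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


print(normalize("Bayıldımmmm harika"))
print(normalize("mazot parası kdr ücret"))
```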

Tokenization

One of the most critical steps in text analytics is decomposing a sequence of words and punctuation marks into sentences and tokens. Although this task is often assumed to be trivial, real-world text does not fit that simplifying assumption.

For example, determining sentence boundaries becomes difficult when a sentence contains non-terminating punctuation, such as the period in an abbreviation or a decimal number. It gets harder still when multiple sub-sentences are enclosed in quotation marks.

A token may be a word, a symbol, an abbreviation, or any other element such as a URL.
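The following sketch shows one simple way to keep such elements intact with an ordered regular expression. It is a toy under stated assumptions, not our tokenizer: the pattern list is hypothetical, it does no sentence splitting, and a URL followed directly by a period would absorb that period.

```python
import re

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN_RE = re.compile(
    "|".join(
        [
            r"https?://\S+",      # URLs stay in one piece
            r"[@#]\w+",           # user mentions and hashtags
            r"\d+(?:[.,:]\d+)*",  # numbers such as 1.70 or 18:30
            r"\w+(?:'\w+)*",      # words, including apostrophe suffixes
            r"[^\w\s]",           # any other single symbol
        ]
    )
)


def tokenize(text):
    """Split text into tokens without breaking numbers, mentions or URLs."""
    return TOKEN_RE.findall(text)


print(tokenize("Bayıldım harika 1.70 55 kilo S aldım."))
```

Note that the period inside 1.70 stays within the number token, while the final period becomes a token of its own.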

Morphological Analysis and Disambiguation

The productive nature of Turkish morphology raises serious issues when adapting NLP models that perform well in other languages. Because of the resulting large vocabulary size, morphological analysis is a key part of the Turkish.AI NLP system: each token (strictly speaking, each word form) is processed to extract its lemma and morphological tags.
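As a minimal sketch of what "lemma plus tags" means, the toy below strips a handful of suffixes from the end of a word form. The suffix table and the tag names are hypothetical and cover only a few cases; there is no lexicon, so it will happily over-strip words that merely end in these letters. A real analyzer returns several candidate analyses per word form, which is why a disambiguation step is needed.

```python
# Outermost suffix group first; tag names are hypothetical labels.
SUFFIX_GROUPS = [
    {"den": "ABL", "dan": "ABL", "de": "LOC", "da": "LOC"},            # case
    {"iniz": "P2PL", "ınız": "P2PL", "imiz": "P1PL", "ımız": "P1PL"},  # possessive
    {"ler": "PL", "lar": "PL"},                                        # plural
]


def analyze(word):
    """Return (lemma, tags) by peeling at most one suffix per group."""
    tags = []
    for group in SUFFIX_GROUPS:
        # Prefer the longest matching suffix within a group.
        for suffix in sorted(group, key=len, reverse=True):
            if word.endswith(suffix) and len(word) > len(suffix):
                tags.append(group[suffix])
                word = word[: -len(suffix)]
                break
    # Suffixes were collected from the outside in; report them root-first.
    return word, tags[::-1]


print(analyze("evlerinizden"))
```

Here the single word form "evlerinizden" ("from your houses") decomposes into the lemma "ev" with plural, possessive and ablative markers, which illustrates how one lemma can surface as a very large number of distinct word forms.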

You may also prefer to read our blog post on Turkish Morphology.

Lemmatization


Syntactical Analysis


Coreference Resolution


Noun Phrase Chunking


Services


Named Entity Recognition

Named entities play a crucial role in many NLP applications. Real-world entities such as persons, locations and organizations are referred to as named entities. Recognizing them is a non-trivial task, since a lookup list is not sufficient. Although there are many studies on Turkish named entity recognition, almost all of them are limited to the Location, Person and Organization entity types. Our Core API, however, offers a very broad range of entity types, augmented with entity normalization.

Here is a full list of entity types that can be recognized by our core analysis service:

TYPE               DESCRIPTION
PER                Person
NORP               Nationality, religious or political group
GPE                Geographical/Social/Political Entities
LOC                Location
ORG                Organization
FAC                Facility (Building, highway, bridge)
EVENT              Event
LANG               Language
TITLE              Title
PRODUCT            Product
WORK_OF_ART        Book, song titles
LAW                Named documents
NUM_CARD           Numbers
NUM_ORD            Ordinal numbers
NUM_DIST           Distributional numbers
FRACTION           Fractional numbers
PERCENTAGE         Percentage
RANGE              Number Range
MONEY              Monetary expressions
E-MAIL             E-mail addresses
MENTION            User Mention (such as @elonmusk)
HASHTAG            Hashtag (such as #trendtopic)
URL                URL addresses
IP                 IP addresses (IPv4, IPv6)
IBAN               International Bank Account Number
TCKN               Turkish Identity Number
TEMP               Temperature
VELOCITY           Velocity
VOLUME             Volume measurement
AREA               Area measurement
LENGTH             Length measurement
WEIGHT             Weight measurement
DATA               Data
ENERGY             Energy
POWER              Power
PRESSURE           Pressure
ELECTRICAL_CHARGE  Electrical Charge
FREQUENCY          Frequency
ADDRESS            Address (currently only Turkey based)

We are aware that you need structured and actionable information from the text. That is why we built our system to process even complex cases and normalize them for you. Let's see how it works on this input:

İhale 5 milyon dolar ile Sabancı Holding'te kaldı.

Here is a portion of the output JSON for the named entities captured from this text:

As you may notice, we have normalized the monetary value in the text along with its currency. The organization is also typed as a commercial organization, which is indicated by the ORG_COM field.
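To make the idea of monetary normalization concrete, the sketch below turns an expression like "5 milyon dolar" into a numeric amount and a currency code. It is only an illustration: the output keys `amount` and `currency`, the small word tables and the regular expression are assumptions for this example and do not describe the fields or the method of the actual service. Thousands separators such as "1.000" are deliberately not handled.

```python
import re

# Small illustrative tables; real coverage would be much broader.
MULTIPLIERS = {"bin": 10**3, "milyon": 10**6, "milyar": 10**9}
CURRENCIES = {"dolar": "USD", "euro": "EUR", "lira": "TRY", "tl": "TRY"}

MONEY_RE = re.compile(
    r"(?<![\d.,])(\d+(?:,\d+)?)\s+"      # number, with an optional comma decimal
    r"(?:(bin|milyon|milyar)\s+)?"       # optional multiplier word
    r"(dolar|euro|lira|tl)\b",           # currency word
    re.IGNORECASE,
)


def normalize_money(text):
    """Return the first monetary expression as amount and currency, or None."""
    match = MONEY_RE.search(text)
    if match is None:
        return None
    number, multiplier, currency = match.groups()
    amount = float(number.replace(",", "."))
    if multiplier:
        amount *= MULTIPLIERS[multiplier.lower()]
    # The keys "amount" and "currency" are hypothetical, not the API's schema.
    return {"amount": amount, "currency": CURRENCIES[currency.lower()]}


print(normalize_money("İhale 5 milyon dolar ile Sabancı Holding'te kaldı."))
```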

Each entity type may have different attributes. For a detailed list of entity types and their possible attributes, see our documentation.
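Some of the entity types listed above follow a fixed format that can be checked mechanically. As one example, the sketch below applies the publicly known checksum rules for Turkish identity numbers (TCKN). It shows why a candidate 11-digit string can be accepted or rejected without any lookup list; it is not a description of how our recognizer is implemented, and the function name is hypothetical.

```python
def is_valid_tckn(value):
    """Check the length, leading digit and two check digits of a TCKN string."""
    if len(value) != 11 or not (value.isascii() and value.isdigit()):
        return False
    if value[0] == "0":
        return False
    digits = [int(ch) for ch in value]
    odd_sum = sum(digits[0:9:2])   # 1st, 3rd, 5th, 7th and 9th digits
    even_sum = sum(digits[1:8:2])  # 2nd, 4th, 6th and 8th digits
    tenth = (odd_sum * 7 - even_sum) % 10
    eleventh = sum(digits[:10]) % 10
    return digits[9] == tenth and digits[10] == eleventh


print(is_valid_tckn("10000000146"))
```

Passing the checksum only means the number is well formed; it does not mean the number has been issued to anyone.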

Temporal Expressions

Temporal expressions are time-related expressions in text. Our comprehensive temporal expression tagger can process specific dates, specific times, durations and time intervals. Our API not only captures both absolute and relative expressions but also normalizes them according to the TIMEX3 standard.

Sample Expression                     Normalized Value
dün akşam saatlerinde                 2022-12-31TEV
geçen yıl 19 Haziran saat 18:30'da    2021-06-19T18:30
45 dakika sonra                       2022-06-03T13:45
beş yıl süreyle                       P5Y
Mart ayları                           XXXX-03 P1Y

Please note that normalizing relative temporal expressions requires a reference date and time for the analyzed text.
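The role of the reference time can be sketched with a small resolver for expressions of the form "number, unit, sonra/önce". This is a toy under stated assumptions: the function name, the unit table and the reference time of 2022-06-03 13:00 are chosen for illustration, and the tagger itself covers far more patterns.

```python
import re
from datetime import datetime, timedelta

# Turkish unit words mapped to timedelta keyword names.
UNITS = {"dakika": "minutes", "saat": "hours", "gün": "days", "hafta": "weeks"}
# "sonra" means later, "önce" means ago.
DIRECTIONS = {"sonra": 1, "önce": -1}

RELATIVE_RE = re.compile(r"(\d+)\s+(dakika|saat|gün|hafta)\s+(sonra|önce)")


def normalize_relative(expression, reference):
    """Resolve a relative expression against a reference datetime."""
    match = RELATIVE_RE.search(expression)
    if match is None:
        return None
    count, unit, direction = match.groups()
    delta = timedelta(**{UNITS[unit]: int(count)}) * DIRECTIONS[direction]
    return (reference + delta).strftime("%Y-%m-%dT%H:%M")


# The reference time here is an assumption made for the example.
print(normalize_relative("45 dakika sonra", datetime(2022, 6, 3, 13, 0)))
```

The same expression resolves to a different value for every reference time, which is why the reference must be supplied along with the text.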

Entity Linking


Text Classification


Sentiment Analysis


Profanity Filter
