Technology

Technological Overview

Turkish.AI provides multiple web services for various tasks. Our fundamental API is the Core API, which conducts basic linguistic processing on the input text and returns JSON. The Content Moderation service is built for companies that manually moderate user-generated content for vulgarity, hate speech, and illegal or sensitive content. This page presents some details of the technologies under the hood of our products.

Core API

Our Core API is built for basic linguistic analysis of Turkish texts, ranging from tokenization to sentiment analysis. The major analysis components are text normalization, advanced tokenization (which covers sentence splitting), morphological analysis and disambiguation, syntactic analysis, named entity recognition and sentiment analysis. All of these modules run by default, but you can customize the pipeline and get faster responses by skipping the linguistic processing components you do not need.

Like all our other services, the Core API works as a web service. See our documentation for more technical details.

Text Normalization

Almost all NLP tools assume correctly spelled words. However, most user-generated content contains misspellings, typos, informal abbreviations and intentional mistakes. To get the most out of such text, our text normalization module corrects most of these issues prior to further processing.

Below, you may find actual user generated content instances gathered from product reviews, Twitter, news comments and forum entries:

Actual: o kadar kucukki ama buyuk olsaydi mukkemle olurdu
Normalized: O kadar küçük ki ama büyük olsaydı mükemmel olurdu.

Actual: Bayıldımmmm harika 1.70 55 kilo s aldım
Normalized: Bayıldım harika 1.70 55 kilo S aldım.

Actual: Sifir yildiz vermek istiyorum ama mumkun degil oyuzden bir veriyorum
Normalized: Sıfır yıldız vermek istiyorum ama mümkün değil, o yüzden bir veriyorum.

Actual: gıttıgım yolda mazot parası kdr ücret ödeyeceksem neden kullanayım
Normalized: Gittiğim yolda mazot parası kadar ücret ödeyeceksem neden kullanayım?

Actual: ülkerin menfaat için mecburuzz bazıları memnun olmucak ama mecbur
Normalized: Ülkenin menfaati için mecburuz bazıları memnun olmayacak ama mecbur.

Actual: @jkookjr maci fransa alir gibi geldi izleyemiyom ama
Normalized: @jkookjr maçı Fransa alır gibi geldi izleyemiyorum ama.

Most NLP components, such as morphological analysis, morphological disambiguation and parsing, will fail to correctly analyze the original user-generated texts. Our text normalization module is usually the very first pre-processing step, where most of those issues are resolved and the normalized forms of the words are produced as output.

This component is optional: there is no need to recover errors if your text is not noisy, that is, if it comes from a reliable source that follows linguistic rules and produces well-formed Turkish sentences. Conversely, if your text is user-generated, such as product comments or complaint e-mails, it is likely to be noisy, and applying text normalization may improve overall accuracy.
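
To make the idea concrete, here is a minimal Python sketch of two simple normalization steps: collapsing exaggerated letter repetitions and expanding informal abbreviations. It is only an illustration and not the Turkish.AI module; the `ABBREVIATIONS` table and the function names are hypothetical.

```python
import re

# Hypothetical abbreviation table; a real system would need a far larger lexicon.
ABBREVIATIONS = {"kdr": "kadar"}


def collapse_repeats(word: str) -> str:
    """Reduce any character repeated three or more times to a single occurrence."""
    return re.sub(r"(.)\1{2,}", r"\1", word)


def normalize(text: str) -> str:
    """Apply the two toy rules to every whitespace-separated word."""
    words = []
    for word in text.split():
        word = collapse_repeats(word)
        words.append(ABBREVIATIONS.get(word, word))
    return " ".join(words)


print(normalize("Bayıldımmmm harika"))      # Bayıldım harika
print(normalize("mazot parası kdr ücret"))  # mazot parası kadar ücret
```

Rules like these do not restore missing diacritics (for example "kucuk" to "küçük") or fix casing and punctuation; those corrections require lexical or statistical knowledge that this sketch does not attempt.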

Tokenization

One of the most critical components in text analytics is decomposing a sequence of words and punctuation marks into sentences and tokens. Although this task is frequently assumed to be trivial, real-world text rarely fits that simplifying assumption.

For example, determining sentence boundaries becomes difficult when a sentence contains non-terminating punctuation, such as the period in an abbreviation. It gets harder still when multiple sub-sentences are enclosed in quotation marks.

A token may be a word, a symbol, an abbreviation, or any other element such as a URL.
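
As a rough illustration of why a token is more than a whitespace-separated word, the following Python sketch uses one regular expression whose alternatives keep URLs, mentions, hashtags, numbers and apostrophe-suffixed words together. The pattern and the `tokenize` name are assumptions made for this example, not the tokenizer used by the Core API, and sentence splitting is not covered.

```python
import re

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN_PATTERN = re.compile("|".join([
    r"https?://\S+",               # URL kept as a single token
    r"[@#]\w+",                    # user mention or hashtag
    r"\d+(?:[.,]\d+)*(?:'\w+)?",   # number, optional separators and suffix (123'üncü)
    r"\w+(?:'\w+)?",               # word, optional apostrophe suffix (Holding'te)
    r"[^\w\s]",                    # any other single symbol or punctuation mark
]))


def tokenize(text: str) -> list[str]:
    """Return the tokens of `text` in order of appearance."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("İhale 5 milyon dolar ile Sabancı Holding'te kaldı."))
# ['İhale', '5', 'milyon', 'dolar', 'ile', 'Sabancı', "Holding'te", 'kaldı', '.']
```

Even this small pattern has to make ordering decisions: the number alternative must precede the word alternative, otherwise "1.70" would be split at the decimal point.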

Morphological Analysis and Disambiguation

The productive nature of Turkish morphology raises serious issues when adapting NLP models that perform well in other languages. Because of the resulting large vocabulary size, morphological analysis is a key part of the Turkish.AI NLP system, where each token (more precisely, each word form) is processed to extract its lemma and morphological tags.

You may also prefer to read our blog post on Turkish Morphology.

Lemmatization


Syntactical Analysis

Coreference Resolution

Noun Phrase Chunking

Services

Named Entity Recognition

Named entities play a crucial role in many NLP applications. Real-world entities such as persons, locations and organizations are referred to as named entities. Recognizing them is a non-trivial task, since a lookup list is not sufficient. Although there are many studies on Turkish Named Entity Recognition, almost all of them are limited to the Location, Person and Organization entity types. Our Core Service, however, offers a very broad range of entity types, augmented with entity normalization.

Here is a full list of entity types that can be recognized by our core analysis service:

TYPE	DESCRIPTION
PER	Person
NORP	Nationality, religious or political group
GPE	Geographical/Social/Political Entities
LOC	Location
ORG	Organization
FAC	Facility (Building, highway, bridge)
EVENT	Event
LANG	Language
TITLE	Title
PRODUCT	Product
WORK_OF_ART	Book, song titles
LAW	Named documents
NUM_CARD	Numbers
NUM_ORD	Ordinal numbers
NUM_DIST	Distributional numbers
FRACTION	Fractional numbers
PERCENTAGE	Percentage
RANGE	Number Range
MONEY	Monetary expressions
E-MAIL	E-mail addresses
MENTION	User Mention (such as @elonmusk)
HASHTAG	Hashtag (such as #trendtopic)
URL	URL addresses
IP	IP addresses (IPv4, IPv6)
IBAN	International Bank Account Number
TCKN	Turkish Identity Number
TEMP	Temperature
VELOCITY	Velocity
VOLUME	Volume measurement
AREA	Area measurement
LENGTH	Length measurement
WEIGHT	Weight measurement
DATA	Data
ENERGY	Energy
POWER	Power
PRESSURE	Pressure
ELECTRICAL_CHARGE	Electrical Charge
FREQUENCY	Frequency
ADDRESS	Address (currently only Turkey based)

We know that you need structured, actionable information from text. That is why our system is built to process even complex cases and normalize them for you. Let's see how it works on this input:

İhale 5 milyon dolar ile Sabancı Holding'te kaldı.

Here is a portion of the output JSON for the named entities captured from this text:

As you may notice, we have normalized the monetary value in the text along with its currency. The organization is also identified as a commercial organization, which is indicated by the ORG_COM field.

Each entity type may have different attributes. For a detailed list of entity types and their possible attributes, see our documentation.

#	Entity	Type	Features	Normalized
1	123	NUMBER	type: numeric	123.0
2	123.45	NUMBER	type: numeric	123.45
3	123.	NUMBER	category: ordinal type: numeric	123.0
4	123'üncü	NUMBER	category: ordinal type: numeric	123.0
5	üçer	NUMBER	category: distributional type: text	3.0
6	2 buçuk	NUMBER	type: numeric	2.5
7	2 milyon 500 bin	NUMBER	type: numeric	2500000.0
8	XXII	NUMBER	type: roman	22.0
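
The normalized values in the number table can be reproduced for simple cases with a small accumulator over Turkish number words. The Python sketch below is an assumption-laden illustration rather than the service's implementation: the function name is hypothetical, input is expected in lowercase, and ordinals, distributives and Roman numerals are not handled.

```python
UNITS = {"sıfır": 0, "bir": 1, "iki": 2, "üç": 3, "dört": 4,
         "beş": 5, "altı": 6, "yedi": 7, "sekiz": 8, "dokuz": 9}
TENS = {"on": 10, "yirmi": 20, "otuz": 30, "kırk": 40, "elli": 50,
        "altmış": 60, "yetmiş": 70, "seksen": 80, "doksan": 90}
SCALES = {"bin": 1_000, "milyon": 1_000_000, "milyar": 1_000_000_000}


def normalize_number(text: str) -> float:
    """Convert a lowercase Turkish number expression to a float."""
    total = 0.0   # sum of the groups already closed by a scale word
    group = 0.0   # value collected since the last scale word
    for word in text.split():
        if word in UNITS:
            group += UNITS[word]
        elif word in TENS:
            group += TENS[word]
        elif word == "yüz":                  # hundred: "yüz" alone means 100
            group = (group or 1) * 100
        elif word in SCALES:                 # "bin" alone means 1000
            total += (group or 1) * SCALES[word]
            group = 0.0
        elif word == "buçuk":                # "and a half"
            group += 0.5
        else:
            group += float(word)             # digits; raises ValueError otherwise
    return total + group


print(normalize_number("2 milyon 500 bin"))  # 2500000.0
print(normalize_number("2 buçuk"))           # 2.5
```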

#	Entity	Type	Features	Normalized
1	üç bin lira	MONEY	type: exact currency: TRY	3000.0
2	7-8 milyon dolar	MONEY	type: range min: 7000000.0 max: 8000000.0 currency: USD

#	Entity	Type	Features	Normalized
1	20 Mayıs	TIMEX3	tid: t5 type: DATE	2023-05-20
2	19.06.1987	TIMEX3	tid: t6 type: DATE	1987-06-19
3	geçen ay	TIMEX3	tid: t7 type: DATE	2023-05
4	2000'ler	TIMEX3	tid: t8 type: DATE	2000
5	her hafta	TIMEX3	periodicity: P1W quant: EVERY tid: t11 type: SET	P1W
6	dün akşam saatlerinde	TIMEX3	tid: t1 type: TIME	2023-06-24TEV
7	geçen yıl 19 Haziran saat 18:30'da	TIMEX3	tid: t2 type: TIME	2023-06-19T18:30
8	45 dakika sonra	TIMEX3	tid: t3 type: TIME	2023-06-25T05:46
9	beş yıl süreyle	TIMEX3	tid: t10 type: DURATION	P5Y
10	Mart	TIMEX3	tid: t9 type: DATE	2023-03

#	Entity	Type	Features
1	Pınar Mah. Katar Cad, İstinye Park AVMNo:73, 34460 Sarıyer/İstanbul	ADDRESS	MAHALLE: Pınar Mah. CADDE: Katar Cad BINA: İstinye Park Avm NO: 73 PK: 34460 ILCE: Sarıyer IL: İstanbul

#	Entity	Type	Features	Normalized
1	TR33 0006 1005 1978 6457 8413 26	IBAN	validated: true country: Turkey national_bank_code: 00061 account_number: 0519786457841326	TR330006100519786457841326
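
The `validated` flag for an IBAN can be computed with the standard mod-97 check: move the first four characters to the end, replace letters with numbers (A=10 through Z=35), and test that the resulting integer leaves a remainder of 1 when divided by 97. The Python sketch below applies that check and splits the Turkish layout (5-digit bank code, one reserved digit, 16-digit account number). The function name and the returned keys are illustrative assumptions modeled on the table above, not the service's actual output schema.

```python
def parse_tr_iban(raw: str) -> dict:
    """Normalize a Turkish IBAN, run the mod-97 check and split its fields."""
    iban = "".join(raw.split()).upper()          # drop spaces, unify case
    rearranged = iban[4:] + iban[:4]             # country code and check digits go last
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    validated = (
        len(iban) == 26
        and iban.startswith("TR")
        and int(digits) % 97 == 1
    )
    return {
        "normalized": iban,
        "validated": validated,
        "national_bank_code": iban[4:9],
        "account_number": iban[10:26],
    }


result = parse_tr_iban("TR33 0006 1005 1978 6457 8413 26")
print(result["validated"], result["national_bank_code"], result["account_number"])
# True 00061 0519786457841326
```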

#	Entity	Type	Features	Normalized
1	on metre 30 cm	LENGTH	type: mixed unit: cm	1030.0

Temporal Expressions

Temporal expressions are time-related expressions in text. Our comprehensive temporal expression tagger can process specific dates, specific times, durations and time intervals. Our API not only captures both absolute and relative expressions, but also normalizes them according to the TIMEX3 standard.

Sample Expression	Normalized Value
dün akşam saatlerinde	2022-12-31TEV
geçen yıl 19 Haziran saat 18:30'da	2021-06-19T18:30
45 dakika sonra	2022-06-03T13:45
beş yıl süreyle	P5Y
Mart ayları	XXXX-03 P1Y

Please note that the normalization of relative temporal expressions requires a reference date and time for the analyzed text.
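
The role of the reference date and time can be shown with a short Python sketch that resolves two kinds of relative expressions against an explicit reference. It is a simplified illustration, not the tagger itself: the function name and the supported patterns are assumptions, and the reference value below was chosen only because it reproduces the "45 dakika sonra" row of the TIMEX3 table in the Named Entity Recognition section.

```python
import re
from datetime import datetime, timedelta

# Turkish unit word -> timedelta keyword argument
UNIT_TO_KWARG = {"dakika": "minutes", "saat": "hours"}


def normalize_relative(expression: str, reference: datetime) -> str:
    """Resolve a relative expression against `reference` into an ISO-style value."""
    if expression == "dün":                                   # "yesterday"
        return (reference - timedelta(days=1)).strftime("%Y-%m-%d")
    match = re.fullmatch(r"(\d+) (dakika|saat) (sonra|önce)", expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    amount, unit, direction = match.groups()
    delta = timedelta(**{UNIT_TO_KWARG[unit]: int(amount)})
    # "sonra" means later, "önce" means earlier.
    moment = reference + delta if direction == "sonra" else reference - delta
    return moment.strftime("%Y-%m-%dT%H:%M")


reference = datetime(2023, 6, 25, 5, 1)    # assumed reference date and time
print(normalize_relative("45 dakika sonra", reference))  # 2023-06-25T05:46
print(normalize_relative("dün", reference))              # 2023-06-24
```

The same expression yields a different normalized value for every reference, which is why the reference must accompany the analyzed text.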
