Technologies Involved
Technological Overview
Turkish.AI provides multiple web services for various tasks. Our fundamental API is the Core Task, which performs basic linguistic processing on the input text and returns a JSON response. The Content Moderation service is built for companies that manually moderate user-generated content for vulgarity, hate speech, and illegal or sensitive content. This page presents some details of the technologies under the hood of our products.
Core API
Our Core API performs basic linguistic analysis on Turkish texts, ranging from basic tokenization to sentiment analysis. The major analysis components are text normalization, advanced tokenization (which covers sentence splitting), morphological analysis and disambiguation, syntax analysis, named entity recognition and sentiment analysis. All of these modules are executed by default, but a customer can customize the processing components and get a faster response by skipping unnecessary linguistic processing components.
Like all our other services, the Core API works as a web service. See our documentation for more technical details.
Text Normalization
Almost all NLP tools assume correctly spelled words. However, the majority of user-generated content contains misspellings, typos, informal abbreviations and intentional mistakes. To get the most benefit from the text, our text normalization module corrects most of those issues prior to further processing.
Below, you may find actual user generated content instances gathered from product reviews, Twitter, news comments and forum entries:
Actual | Normalized |
---|---|
o kadar kucukki ama buyuk olsaydi mukkemle olurdu | O kadar küçük ki ama büyük olsaydı mükemmel olurdu. |
Bayıldımmmm harika 1.70 55 kilo s aldım | Bayıldım harika 1.70 55 kilo S aldım. |
Sifir yildiz vermek istiyorum ama mumkun degil oyuzden bir veriyorum | Sıfır yıldız vermek istiyorum ama mümkün değil, o yüzden bir veriyorum. |
gıttıgım yolda mazot parası kdr ücret ödeyeceksem neden kullanayım | Gittiğim yolda mazot parası kadar ücret ödeyeceksem neden kullanayım? |
ülkerin menfaat için mecburuzz bazıları memnun olmucak ama mecbur | Ülkenin menfaati için mecburuz bazıları memnun olmayacak ama mecbur. |
@jkookjr maci fransa alir gibi geldi izleyemiyom ama | @jkookjr maçı Fransa alır gibi geldi izleyemiyorum ama. |
Most NLP components, such as morphological analysis, morphological disambiguation and parsing, will fail to analyze the original user-generated texts correctly. Our text normalization module is usually the very first pre-processing step, where most of those issues are resolved and the normalized forms of the words are generated as output.
Running this component is optional: there is no need to recover from errors if your text is not noisy, that is, if it comes from a more reliable data source that obeys linguistic rules and produces well-formed Turkish sentences. Conversely, if your text is user-generated, such as product comments or complaint e-mails, the odds of it being noisy are high, and applying the text normalization component may help improve overall accuracy.
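To make the idea of normalization concrete, here is a minimal, rule-based sketch in Python. It is not the Turkish.AI normalization module: it only collapses elongated letters (as in "Bayıldımmmm") and expands abbreviations found in a tiny, hypothetical lookup table (as in "kdr"). The names `normalize`, `ABBREVIATIONS` and `ELONGATION_RE` are assumptions made for this illustration. Restoring diacritics, casing and punctuation, as in the samples above, needs far more than these two rules.

```python
import re

# Hypothetical abbreviation table for illustration; a real system needs a much larger one.
ABBREVIATIONS = {"kdr": "kadar"}

# A letter (no digits, no underscore) repeated three or more times in a row.
ELONGATION_RE = re.compile(r"([^\W\d_])\1{2,}")


def normalize(text):
    """Collapse elongated letters and expand known abbreviations, word by word."""
    words = []
    for word in text.split():
        # "Bayıldımmmm" -> "Bayıldım"; legitimate double letters ("mükemmel") are kept.
        word = ELONGATION_RE.sub(r"\1", word)
        # "kdr" -> "kadar" when the word is in the lookup table.
        words.append(ABBREVIATIONS.get(word, word))
    return " ".join(words)


print(normalize("Bayıldımmmm harika 1.70 55 kilo s aldım"))
print(normalize("mazot parası kdr ücret"))
```

Digits are deliberately excluded from the elongation rule so that numbers such as 1000 are left untouched.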
Tokenization
One of the most critical components in text analytics is decomposing a sequence of words and punctuation marks into sentences and tokens. Although this task is frequently assumed to be trivial, real-world text defies this simplifying assumption.
For example, determining sentence boundaries can easily become painful when a sentence contains non-terminating punctuation. It can quickly become a nightmare if there are multiple sub-sentences enclosed in quotation marks.
A token may be a word, a symbol, an abbreviation, or any other element such as a URL.
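The following Python sketch illustrates why tokenization needs more than splitting on whitespace and periods. It is a simplified regular-expression tokenizer, not the tokenizer used by the service. The pattern order and the `tokenize` helper are assumptions made for this example. It keeps URLs, user mentions, hashtags, decimal numbers and apostrophe-attached suffixes together as single tokens.

```python
import re

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN_RE = re.compile(
    r"""
    https?://\S+            # URLs
  | [@#]\w+                 # user mentions and hashtags
  | \d+(?:[.,:]\d+)*        # numbers such as 1.70 or 18:30
  | \w+(?:'\w+)*            # words, including apostrophe suffixes (Holding'te)
  | [^\w\s]                 # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_RE.findall(text)


print(tokenize("Bayıldım harika 1.70 55 kilo S aldım."))
print(tokenize("@jkookjr maçı Fransa alır gibi geldi"))
```

Note that the period in "1.70" stays inside the number while the final period becomes its own token. A sentence splitter has to draw the same distinction.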
Morphological Analysis and Disambiguation
The productive nature of Turkish morphology raises serious issues when adapting NLP models that perform well in other languages. Because of the resulting large vocabulary, morphological analysis is a key part of the Turkish.AI NLP system, where each token (more precisely, each wordform) is processed to extract its lemma and morphological tags.
You may also want to read our blog post on Turkish morphology.
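As a rough illustration of what extracting the lemma and morphological tags means, the Python sketch below strips suffixes from the right until it reaches a known root. It is a toy, not the analyzer behind the API. The two-word lexicon, the handful of suffixes, the tag names and the `analyze` function are all hypothetical. It also ignores vowel harmony, consonant alternations and suffix ordering rules, which a real Turkish analyzer must model.

```python
# Hypothetical root lexicon and suffix table, for illustration only.
LEXICON = {"ev", "kitap"}
SUFFIXES = [
    ("ler", "Plural"), ("lar", "Plural"),
    ("im", "P1sg"), ("ım", "P1sg"),
    ("de", "Loc"), ("da", "Loc"), ("te", "Loc"), ("ta", "Loc"),
]


def analyze(word):
    """Return every (lemma, tags) reading reachable by stripping known suffixes."""
    results = []

    def strip(rest, tags):
        # A reading is complete when what remains is a known root.
        if rest in LEXICON:
            results.append((rest, list(reversed(tags))))
        # Otherwise (or additionally) try removing one more suffix from the right.
        for surface, tag in SUFFIXES:
            if rest.endswith(surface) and len(rest) > len(surface):
                strip(rest[: -len(surface)], tags + [tag])

    strip(word, [])
    return results


print(analyze("evlerimde"))   # "in my houses"
print(analyze("kitaplarda"))  # "in the books"
```

Because one root can take long chains of suffixes, the number of distinct wordforms grows very quickly, which is the vocabulary-size problem described above. A word can also have several readings, which is why a disambiguation step follows the analysis.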
Lemmatization
Syntactical Analysis
Coreference Resolution
Noun Phrase Chunking
Services
Named Entity Recognition
Named entities play a crucial role in many NLP applications. Real-world entities such as persons, locations and organizations are referred to as named entities. Recognizing these entities is a non-trivial task, since a lookup list is not sufficient. Although there are many studies on Turkish Named Entity Recognition, almost all of them are limited to the Location, Person and Organization entity types. In contrast, our Core Service offers a very broad range of entity types, augmented with entity normalization.
Here is a full list of entity types that can be recognized by our core analysis service:
TYPE | DESCRIPTION |
---|---|
PER | Person |
NORP | Nationality, religious or political group |
GPE | Geographical/Social/Political Entities |
LOC | Location |
ORG | Organization |
FAC | Facility (Building, highway, bridge) |
EVENT | Event |
LANG | Language |
TITLE | Title |
PRODUCT | Product |
WORK_OF_ART | Book, song titles |
LAW | Named documents |
NUM_CARD | Numbers |
NUM_ORD | Ordinal numbers |
NUM_DIST | Distributional numbers |
FRACTION | Fractional numbers |
PERCENTAGE | Percentage |
RANGE | Number Range |
MONEY | Monetary expressions |
E-mail addresses | |
MENTION | User Mention (such as @elonmusk) |
HASHTAG | Hashtag (such as #trendtopic) |
URL | URL addresses |
IP | IP addresses (IPv4, IPv6) |
IBAN | International Bank Account Number |
TCKN | Turkish Identity Number |
TEMP | Temperature |
VELOCITY | Velocity |
VOLUME | Volume measurement |
AREA | Area measurement |
LENGTH | Length measurement |
WEIGHT | Weight measurement |
DATA | Data |
ENERGY | Energy |
POWER | Power |
PRESSURE | Pressure |
ELECTRICAL_CHARGE | Electrical Charge |
FREQUENCY | Frequency |
ADDRESS | Address (currently only Turkey based) |
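Some of the types listed above, such as MENTION, HASHTAG and URL, follow surface patterns. The Python sketch below shows how such pattern-based entities could be located with regular expressions. It is only an illustration under our own assumptions: the patterns are simplified, the output field names (`type`, `text`, `start`, `end`) are hypothetical, and none of it reflects how the service recognizes entities or structures its JSON. Types such as PER or ORG cannot be captured this way and require context-aware models.

```python
import re

# Simplified, illustrative patterns for three of the pattern-like entity types.
PATTERNS = {
    "MENTION": re.compile(r"(?<!\w)@\w+"),
    "HASHTAG": re.compile(r"(?<!\w)#\w+"),
    "URL": re.compile(r"https?://\S+"),
}


def find_pattern_entities(text):
    """Return pattern-based entities with character offsets, ordered by position."""
    entities = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(
                {
                    "type": entity_type,
                    "text": match.group(),
                    "start": match.start(),
                    "end": match.end(),
                }
            )
    return sorted(entities, key=lambda entity: entity["start"])


for entity in find_pattern_entities("@elonmusk bugün #trendtopic oldu"):
    print(entity)
```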
Consider the following sample text:
İhale 5 milyon dolar ile Sabancı Holding'te kaldı.
Here is a portion of the output JSON for the named entities captured from this text:
As you may notice, we have normalized the value of the monetary expression in the text along with its currency. In addition, the organization is identified as a commercial organization, as indicated by the ORG_COM field.
Each entity type may have different attributes. For a detailed list of entity types and possible attributes, you may click here.
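To illustrate what normalizing a monetary expression involves, the Python sketch below converts a phrase such as "5 milyon dolar" into a numeric value and a currency code. It is a simplified assumption-based example, not the service's implementation: the `normalize_money` function, the small word lists and the output keys `value` and `currency` are hypothetical, and only lowercase, digit-based expressions are handled.

```python
import re

# Small illustrative word lists; real coverage would be much broader.
MULTIPLIERS = {"bin": 1_000, "milyon": 1_000_000, "milyar": 1_000_000_000}
CURRENCIES = {"dolar": "USD", "euro": "EUR", "lira": "TRY"}

# Number (Turkish style: "." groups thousands, "," marks decimals),
# optional multiplier word, then a currency word.
MONEY_RE = re.compile(
    r"(\d+(?:\.\d{3})*(?:,\d+)?)\s*(bin|milyon|milyar)?\s*(dolar|euro|lira)\b"
)


def normalize_money(text):
    """Return the first monetary expression as value and currency, or None."""
    match = MONEY_RE.search(text)
    if match is None:
        return None
    number, multiplier, currency = match.groups()
    # "1.500,5" -> 1500.5
    value = float(number.replace(".", "").replace(",", "."))
    if multiplier:
        value *= MULTIPLIERS[multiplier]
    return {"value": value, "currency": CURRENCIES[currency]}


print(normalize_money("İhale 5 milyon dolar ile Sabancı Holding'te kaldı."))
```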
Temporal Expressions
Temporal expressions are time-related expressions in text. Our comprehensive temporal expression tagger can process specific dates, specific time references, durations and time intervals. Our API not only captures both absolute and relative expressions but also normalizes them according to the TIMEX3 standard.
Sample Expression | Normalized Value |
---|---|
dün akşam saatlerinde | 2022-12-31TEV |
geçen yıl 19 Haziran saat 18:30'da | 2021-06-19T18:30 |
45 dakika sonra | 2022-06-03T13:45 |
beş yıl süreyle | P5Y |
Mart ayları | XXXX-03 P1Y |
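Relative expressions can only be normalized against a reference time. The Python sketch below resolves expressions of the form "45 dakika sonra" ("45 minutes later") into an ISO-style timestamp. It is a minimal illustration, not the tagger itself. The `normalize_relative` function and the unit table are hypothetical, and the reference time 2022-06-03 13:00 is an assumption inferred from the sample row above.

```python
import re
from datetime import datetime, timedelta

# Turkish unit words mapped to timedelta keyword arguments (illustrative subset).
UNITS = {"dakika": "minutes", "saat": "hours", "gün": "days", "hafta": "weeks"}

# "<number> <unit> sonra" (later) or "<number> <unit> önce" (earlier).
RELATIVE_RE = re.compile(r"(\d+)\s+(dakika|saat|gün|hafta)\s+(sonra|önce)")


def normalize_relative(expression, reference):
    """Resolve a relative expression against reference; return None if unsupported."""
    match = RELATIVE_RE.search(expression)
    if match is None:
        return None
    amount, unit, direction = match.groups()
    delta = timedelta(**{UNITS[unit]: int(amount)})
    moment = reference + delta if direction == "sonra" else reference - delta
    return moment.strftime("%Y-%m-%dT%H:%M")


# Assumed reference time, chosen to match the sample table.
reference = datetime(2022, 6, 3, 13, 0)
print(normalize_relative("45 dakika sonra", reference))
```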
Entity Linking
Text Classification
Sentiment Analysis
Profanity Filter