Explorations in imbalanced data classification


Human vs. automatic classification

As human beings, we are born classifiers. For our ancestors, distinguishing a lion from a cat was essential for survival. Thousands of years later we still do the same: each morning when we start working, our brain has to classify the objects around us as a chair, a desk, a computer, a mouse, and so on. Classification is one of the most basic and, even more importantly, one of the most complex cognitive abilities of humans. One would therefore assume that humans are the masters of classification, and that learning the way humans acquire and apply their categories must be a great challenge for AI-aided machines. So, are computers worse classifiers than humans?

It depends on the task in question. In some image recognition tasks, such as sorting images into categories, computers already achieve a lower error rate than humans. However, humans still cope better than computers with evaluations that depend strongly on subjective factors. Our data science team accepted the challenge and decided to train an algorithm that can handle just such a highly complex classification problem.

Human-annotated and imbalanced data

Our data, provided by our customer, Járókelő, was made up of suppliers’ responses to users’ complaints. You can read more about our joint project with Járókelő here. As a first step, annotators evaluated the texts produced by the authorities in charge of tackling the problems raised by the users on a 1 to 5 scale: the better the performance was deemed, the higher the score. The scores were not given randomly; certain factors were taken into consideration, e.g. the degree of politeness, whether the user was addressed, the length of the reply, etc. These factors were later mapped into the automatic classifier as features, so that our algorithm would learn to classify the responses as similarly to humans as possible. Naturally, we also included features that computers can identify more precisely than humans, such as the proportion of nouns and the proportions of positive, negative and neutral sentiments in the texts. Human annotators may also have had a general impression of these factors and been influenced by them in their overall evaluation. In the end, we had a dataset containing automatically extracted features and five classes.

However, it soon turned out that the distribution of the five target classes was uneven: we had far less data for classes 1 and 2 than for classes 3, 4 and 5. The overrepresentation of certain classes led us to expect that our algorithm would overfit them and perform worse on the underrepresented classes. Consequently, we needed to train a classifier that could handle human-annotated, imbalanced data.

And the winner is…

To pursue our mission, Random Forest was chosen as the learning algorithm, due to its impressive performance compared to other learning methods. The task was carried out on the Orange platform, which, as a great advantage, can be fed with separate training and testing sets. The Random Forest classifier was trained on four versions of the same training set: the original imbalanced set, an oversampled one, an undersampled one, and a combination of the latter two. To be able to contrast these methods, we evaluated all four models on the same imbalanced test data. Let us see which one could get closest to the classification done by the annotator team.

1. Imbalanced training set

Teaching the algorithm on the imbalanced training set resulted in a precision of 0.420, a recall of 0.500 and a classification accuracy of 0.479. The number of instances of class 3 was predicted to be much higher than it actually is (see the confusion matrix below). This can be explained by the lack of clear boundaries among the classes: the algorithm is unable to properly distinguish the instances of class 3 from those of classes 2 and 4. Furthermore, since classes 3, 4 and 5 are overrepresented in the training set, the algorithm tends to over-predict these classes on the test set, too.
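The experiments themselves ran in Orange, but the reported metrics are easy to reproduce by hand. Here is a minimal sketch in plain Python (with illustrative labels, not our actual data) of how macro-averaged precision, recall and classification accuracy are computed from actual and predicted classes:

```python
def evaluate(actual, predicted):
    """Macro-averaged precision/recall and plain accuracy for multi-class labels."""
    classes = sorted(set(actual))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for a, p in zip(actual, predicted) if a == c and p == c)
        fp = sum(1 for a, p in zip(actual, predicted) if a != c and p == c)
        fn = sum(1 for a, p in zip(actual, predicted) if a == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
    return (sum(precisions) / len(classes),
            sum(recalls) / len(classes),
            accuracy)
```

On an imbalanced test set, a model that over-predicts the majority class can keep a respectable accuracy while its macro precision stays low, which is why all three numbers are worth reporting.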


2. Undersampled training set

When classes 3, 4 and 5 were undersampled in the training data, precision increased significantly: it rose from 0.420 to 0.472. In addition, the numbers of predicted and actual instances per class are far closer than before, as the confusion matrix below shows. The recall and classification accuracy, however, did not show a similar improvement.


3. Oversampled training set

Surprisingly, balancing the data by oversampling the poorly represented classes 1 and 2 did not improve precision or recall. The classification accuracy did improve, however: it rose from 0.479 to 0.491, which means that the classifier trained on oversampled data classified more examples correctly than the one trained on the imbalanced data.

4. Smoteenn training set

The combination of over- and undersampling the training set (SMOTEENN) turned out to be our number one solution in terms of precision, with a remarkable 0.667. However, as the confusion matrix below suggests, it heavily over-predicts the number of instances in classes 1 and 2: exactly the opposite of what happens with the imbalanced training set.
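SMOTEENN chains SMOTE, which synthesizes new minority samples, with Edited Nearest Neighbours cleaning. The core SMOTE step, interpolating between a minority point and one of its k nearest minority neighbours, can be sketched with numpy (the ENN cleaning pass is omitted here for brevity):

```python
import numpy as np

def smote(X_min, n_new, k=3, seed=42):
    """Generate n_new synthetic minority samples: for each one, pick a
    random minority point, pick one of its k nearest minority-class
    neighbours, and interpolate at a random position between the two."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from point i to every minority point
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # position along the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```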


Boosting our achievements further

What this experiment suggests is that approximating human performance in automatic classification is much harder on imbalanced, human-annotated data with five classes than in less complex classification tasks. There is no single one-size-fits-all method, but several solutions that can be adopted according to the needs of your clients. Focusing on our clients’ satisfaction, we are still improving our automatic classifier by optimizing the distance measures of the algorithms.


Real-World Deep Learning for NLP @ Budapest Data Forum

Today we are presenting our experiments at the Budapest Data Forum. The talk is a summary of two tasks: finding synonyms (or, more exactly, word pairs that can be synonyms) using the word2vec algorithm, and training a spelling corrector with seq2seq on synthetic misspelling data.

The Sounds of Migration – Data Sonification Experiment

Migration is a central issue of political debates in Europe, especially in Hungary. We collected more than 40,000 articles on this topic to see how it is described in the online media. We did a “serious” analysis of the data; you can read the first part of it here. However, we think this issue has a very strong emotional side, and we are looking for tools to make it explicit. A simple time series of the emotional tone of texts is just part of the story. That’s why we have been experimenting with data sonification for a while; you can read about our first try here. We know it is far from perfect, but we are at the beginning of our journey in the field of data sonification. This time we chose a different tool, the wavesurfer JavaScript library. We used our original data, but this time we created a six-channel audio file from the time series. Each emotion has its own channel in the wav file, and the visualization shows the waveform of each channel. When you press the play button, a time bar helps you follow the progress of the music and compare the channels. You can find more details on the project site. Have fun, and keep in mind that this is just an experiment!


lda2vec: The Best of Both Worlds

In our previous analysis, we used LDA to discover the topics of the discourse on CEU and NGOs in the Hungarian online media. We love LDA, so we were shocked when it put articles on the same issue (the legislation process, the reaction of the EU, and the affected institutions) into two separate topics (check out the pyLDAvis output here). This was due to the very different word usage of the sites: the independent (or left, or liberal, choose your favorite term) media prefer official names such as Central European University, NGOs and the EU, while the quasi state-financed (or right, or pro-government) side uses terms like “Soros University”, “foreign organizations” and “Brussels”. Our word2vec model built on the corpus shows that similar terms are close to each other in the semantic space, i.e. “Soros University” and “Central European University” occupy very similar positions (you can explore the 3D t-SNE projection of the word2vec model here). That’s why we gave Christopher E. Moody’s lda2vec algorithm a try; we hoped it could overcome this word-usage problem.

Although the algorithm needs more work before we can use it in real-life scenarios, the first results are very promising: in our case, we got more descriptive topics, which opens up the possibility of finding articles on the same issue across the opposing narratives.


Migrants, refugees, immigrants: what is the media suggesting?

Visual and textual representation of immigration in the Hungarian online media

In the autumn of 2016, the referendum on the so-called forced settlement of migrants was looming over our heads. The media was certainly putting enormous pressure on society; but what was this whole fuss about? Although modern computational linguistics cannot come up with exact answers, it can assist us in getting an idea of the wide range of emotions stirred by various sites. The Precognox research team presents its in-depth analysis of how the media attempted to sell the referendum.

Kitti Balogh, Nóra Fülöp, Virág Ilyés, Zoltán Varjú

Originally published: nyest.hu, September 29, 2016

It is indisputable that although a huge number of refugees reached Hungary, we could not bump into them around every corner, since most of them left the country almost immediately. For the public, it is the media that represents a direct link to the refugees, so we wanted to find out how news on migration is presented in the online media. We analyzed more than 40,000 articles published between September 27, 2014 and July 11, 2016 with text mining and image processing methods. The texts and their metadata are available in searchable form on our dashboard. In this article, we give a broad outline of what the dashboard offers. We also try to present the information content of the images in an easily understandable form.


As opposed to the mostly qualitative research widely applied in media content and representation analysis, we used methods that support the automatic processing of large amounts of data as well as the simultaneous analysis of both visual and textual content. This way the research period can be extended and the number of content providers increased. The simultaneous analysis is to be completed in our intern’s thesis on fine-tuning the application and evaluation of cluster analysis. The aim is to make the interpretation of both textual and the growing proportion of visual content easier in the future.

The necessary data was collected from 25 online news sites, including the prominent index.hu and origo.hu, the online versions of mno and hvg, as well as minor portals. The selected sites cover a wide spectrum of the Hungarian online media; articles were taken from pestisracok.hu, abcug.hu and kuruc.info as well as from popular tabloid pages. We also collected data from online TV channels (atv.hu, rtl.hu, hirek.hu) and the official police reports from police.hu.


The number of articles in the corpus based on content providers


On most sites we could find the articles related to migration with their own search engines, but in some cases this was not feasible. In the absence of, or in addition to, the search function, labels and headings guided us to the relevant content. First, we collected the article URLs manually with the Link Klipper Google Chrome extension. Then, having these references, we automated the crawling of both the visual and the textual content.


The number of articles in the corpus based on keywords

To interpret the composition of the corpus, it is essential to describe how the URLs and the content were filtered, since this process shrank the reference list by 30 thousand items. Several methods were used to keep only relevant and unique articles in the corpus. With simhash, we got rid of invalid links that led to recommendation pages or pages listing search results. Duplicates within one domain were filtered with similarity measures based on tf-idf statistics. We also removed duplicated URLs, applying a simple heuristic: the article published earliest was kept in the corpus. We also discarded articles with no timestamp, whose date of publication therefore could not be identified. Although we did our best to eliminate irrelevant articles with these statistical tools, we cannot be certain that only proper content remained. It is also important to remember that, based on the corpus composition, only careful conclusions can be drawn about either the number of articles published on a certain subject or which site was the most active on a given topic. The reason is that an article only entered the corpus if it was actually published, if we could crawl the site, and if the content met the filter criteria: it was not identified as a duplicate, it was not an invalid link, and it had a timestamp. At the end of the process we had a corpus of 42,845 articles.
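As an illustration of the duplicate filtering, here is a minimal tf-idf plus cosine similarity check in plain Python (the production pipeline also used simhash and other heuristics; the tokenization and any similarity threshold are assumptions):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into tf-idf weighted term vectors."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse term vectors; near 1.0 means
    the documents are likely duplicates."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```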


We worked with the following keywords: “immigration”, “immigrant”, “migrant”, “migration”, “refugee” and “asylum-seeker”. Almost half of the articles in the corpus are hits for the keyword “refugee”. The second most frequent was “immigration”, followed by “immigrant”, “migrant”, “migration” and “asylum-seeker”. On three sites we also used site-specific labels and headings: on kuruc.info we collected articles under the heading “immigrant crime” besides the keywords mentioned above; on kettősmérce.blog.hu the column “immigrant affairs” was a great help in finding the relevant articles; and on blikk.hu the label “refugee crisis” was used to get the news on this topic.

Word usage is a crucial element of media representation research. The modality of expressions can be alienating or fear-provoking. It would certainly be wrong to jump to far-reaching conclusions in the absence of context and to judge the strategies content providers used to present refugee affairs based exclusively on the keywords. Below, however, we can see the hits for our keywords broken down by site: which expressions were preferred and which ones were ignored.


Hit results for keywords based on sites

To get a more profound understanding after the descriptive analysis of the corpus, we also carried out a content analysis of the articles. In the preprocessing phase, the first step was to remove parts with incorrectly coded characters. Then, with magyarlánc, we stemmed the words and carried out part-of-speech tagging: we classified words into their parts of speech and labelled them accordingly. To achieve more relevant results, we removed the words that are most frequent simply due to natural language usage, with the help of a stopword list.
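magyarlánc (a Java tool) did the tagging itself; the filtering around it can be sketched in Python. The part-of-speech whitelist and the minimum length below are illustrative assumptions, not our exact settings:

```python
def filter_tokens(tagged, stopwords, keep_pos=("NOUN", "ADJ", "VERB"), min_len=2):
    """Keep only content-word stems: drop stopwords, short fragments and
    parts of speech outside the whitelist.
    `tagged` is a list of (stem, pos) pairs, as a tagger would emit."""
    return [stem for stem, pos in tagged
            if pos in keep_pos
            and len(stem) >= min_len
            and stem.lower() not in stopwords]
```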

To present the results of our complex analysis, we created an interactive dashboard, which will hopefully complete, correct or refine our general intuitions about the representation of the refugee crisis and give an overall picture of the world the Hungarian online media paints of immigration.


Trends in time

We can easily get an idea of how the media reacted to immigration from the timing and number of the articles published. The dashboard created for the text corpus clearly shows an increase in the number of published articles from May 2015 on. Most of them appeared between the end of August and the middle of September 2015. From October 2015 to May 2016 articles were published at an even rate; then in July 2016, right at the end of the collection period, another rise can be seen.


Time distribution of all news

It is possible to search for the words and expressions used in the articles with the Search field. For instance, if we search for the words “refugee”, “migrant”, “migration” or “immigrant”, we get a trend fairly similar to the original one. However, there are expressions that were not in use throughout the whole period. One such instance is “immigrant-for-a-living”, which can be found by searching for “living” AND “immigrant”, or for “migrant crime” in the keywords category. The timeline of these expressions shows that the former phrase was favored roughly until the middle of 2015, mostly in the news of nepszava.hu, while the tag “immigrant crime” became a pet expression on kuruc.info from early 2016.


Time distribution of articles with the phrase “immigrant-for-a-living”


Time distribution of “immigrant crime” tag

We can also find words that are more generally connected to the topic, such as “immigration”, whose time distribution peaks in several places, indicating the unfolding of the phenomenon well before the height of media attention.


Time distribution of articles containing the phrase “immigration”

Emotions and sentiments

When analyzing the discourse of the online media, it is important to identify the emotions and attitudes evoked by events. Although journalists generally aim to be objective and neutral, the phrases they use often give away their mindset, not to mention articles where the opinion of the author is not even disguised.

On two tabs of the dashboard it is possible to study the sentiments and emotions identified in the articles. During the sentiment and emotion analysis our goal was to identify the opinions, attitudes and emotions expressed in the articles. Sentiment analysis normally uses three categories (negative, neutral and positive) or finer gradations of them, while emotion analysis tries to detect the six basic human emotions (sadness, anger, joy, disgust, fear and surprise). We used our Precognox dictionaries to identify sentiments and emotions. The sentiment dictionaries are freely available here for research purposes. Although the emotion dictionaries can still be improved and should therefore be used carefully, they are appropriate for a rough analysis.

To obtain the sentiment or emotion value of an article, we divided the number of words identified by our dictionaries by the total number of words. For each article this gave a value between 0 and 1 for the negative and positive sentiments, as well as for the emotions of sadness, anger, joy, disgust, fear and surprise. Then we summed the positive and negative values, so the cumulative sentiment of a single article lies between -1 and 1. On the dashboard, however, the values of all articles published on a specific day are added up, which is why the displayed sentiment values may range from about -8 to 10.
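The scoring above is straightforward to reproduce. A sketch with tiny, hypothetical dictionaries (the real Precognox dictionaries are much larger and Hungarian):

```python
def sentiment_scores(tokens, positive_dict, negative_dict):
    """Return (positive, negative, cumulative) scores for one article.
    Each score is the share of tokens found in the dictionary; the
    negative share is subtracted, so cumulative lies in [-1, 1]."""
    total = len(tokens)
    if total == 0:
        return 0.0, 0.0, 0.0
    pos = sum(1 for t in tokens if t in positive_dict) / total
    neg = sum(1 for t in tokens if t in negative_dict) / total
    return pos, neg, pos - neg
```

The per-day dashboard values are then just the sum of these per-article cumulative scores over all articles published that day.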

The cumulative sentiment of the news on immigration is neither clearly positive nor negative in nature. The daily value is rather neutral, with only one or two peaks. When positive and negative sentiment values are considered separately, we can see that both are present in significant numbers; summed up, however, they cancel each other out. This means that the sentiments of the collected sources cover a wide spectrum and, with some exceptions, they are balanced.


Time distribution of cumulative sentiments


Time distribution of negative sentiments


Time distribution of positive sentiments

Based on the emotion timelines, sadness and fear are the first to stand out in the news. However, since the dictionaries differ in length, the volumes of the emotions should be compared with care. When selecting a certain date with the Time window panel, it is possible to read the news published on that very date and to find out what event triggered the surge of emotions. For instance, on 31 August 2015 both sadness and fear peaked. Many articles were focusing on the following topics: the humanitarian catastrophe of the refugees gathered at Keleti station, the congestion on both public roads and railways, the negative reception of Hungary’s immigration policy, the rejection of the quota system, the number of refugees entering the country, the high alert of border control, and the impossible situation of volunteers in the transit zones.


Time distribution of sadness and fear


It is also worth checking the domains to see which sentiment or emotion dominates each online news portal. Let us look at 444.hu, where all emotions except surprise show constant, radical shifts, similarly to the cumulative value, which also swings dramatically between positive and negative.

Besides the timelines, the words belonging to the given sentiments and emotions are also shown on the dashboard. Let’s look at two examples: expressions like “unpleasantness”, “problem”, “war”, “terrorist” and “illness” are typical in news where negative sentiments are dominant, while in articles where the emotion of fear is powerful, words like “concern”, “dread”, “terror” and “worry” appear in the greatest numbers.


Word cloud of negative sentiments


Word cloud of fear



To make the content of more than 40,000 news items more manageable, we created thematic groups sharing the same semantic features. For this, we used the topic model implemented in the Mallet tool, Latent Dirichlet Allocation (LDA). The LDA algorithm groups documents based on how the words within them are distributed; naming the topics is left to the analysts. The output of the algorithm is two lists: one containing the most typical words of each topic, and another showing the proportion of each topic within each document. We got 47 topics altogether, which we named based on either their keywords or their most typical news items. In topic modeling, every piece of news is assigned to every topic to a certain extent: it may be prominent in one to three topics and relatively insignificant in the others. For the sake of simplicity, each piece of news was assigned to its most relevant topic only. Therefore, we may have the impression in some cases that only a few sentences refer to the given topic, but all in all this method gives a good model of the thematic structure of the corpus.
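The simplification described above, assigning each article to its single most relevant topic, is an argmax over the document-topic proportions that an LDA implementation such as Mallet outputs. A sketch with a hypothetical doc-topic matrix:

```python
def dominant_topics(doc_topic_matrix):
    """For each document, pick the topic with the highest proportion.
    `doc_topic_matrix[d][t]` is the share of topic t in document d."""
    assignments = []
    for weights in doc_topic_matrix:
        best = max(range(len(weights)), key=lambda t: weights[t])
        assignments.append(best)
    return assignments
```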

The dashboard created for the texts and their metadata has a separate tab for topic analysis. Here is the list of the 15 topics containing the most news, with the number of news items in parentheses:

  • The EU-Turkey Refugee Deal (2313)
  • The criticism of the EU’s immigration policy (FIDESZ-KDNP) (2124)
  • Catching illegal immigrants and human traffickers (2123)
  • Migrant surge in Southeastern Europe (2060)
  • The journey of migrants to Western-Europe through Hungary (2028)
  • Accidents of refugee boats (1723)
  • Restriction on the right of Asylum (1696)
  • Refugee incidents in Germany (1555)
  • War in the Middle East (1469)
  • Hungarian border barrier (1440)
  • Merkel’s refugee policy and its criticism (1412)
  • Aid programs of international and civilian organizations to help Syrian refugees (1311)
  • Foreign reaction to the refugee crisis (1238)
  • Austrian-Hungarian border barrier (1225)
  • The political crisis caused by refugees (1121)

The dashboard clearly shows which words are typical and which positive and negative expressions are favored when a certain topic is being discussed. For instance, the most frequent words of the topic “The EU-Turkey Refugee Deal” are the following: “unio”, “refugee”, “state”, “world” and “role”. Among the negative words “burden”, “nuisance”, “inconvenience” and “problem” stand out, while the positive ones are “important”, “free”, “entitled” and “respect”. In contrast, the topic “Catching illegal immigrants and human traffickers” most frequently contained the words “police officer”, “police station”, “male”, “illegal” and “Syrian”. Here the word “forbidden” is the most important negative one, whereas the positive expressions seem rather insignificant.

We chose two topics out of the 47, “The criticism of the EU’s immigration policy (FIDESZ-KDNP)” and “Liberal attitude towards the migrants”; these are the subject of a further analysis at the end of this article.


Who are mentioned in the news?

With DBpedia Spotlight, we extracted the named entities from the collected articles (Named Entity Recognition) and examined three types: personal names, geographical names and institution names. We created graphs in which the nodes represent the entities and the edges indicate that two entities were mentioned together in at least one article.
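The edges of such graphs come directly from co-mentions. A sketch that builds weighted edges from per-article entity lists (the entity names are illustrative):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(articles):
    """Count how many articles mention each pair of entities together.
    `articles` is a list of entity-name lists, one list per article."""
    edges = Counter()
    for entities in articles:
        # sort so that each unordered pair gets one canonical key
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges
```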

The graph of personal names contains a relatively high number of nodes, 2345 entities altogether, with 13473 edges. For the sake of clarity, here are some informative graph parameters: the average path length is 3.3; the diameter, the distance between the two farthest nodes, is 10; and the clustering coefficient, which indicates how frequently two nodes that are both connected to a third one are also connected to each other, is 0.75. Since we have a relatively complicated network, it seemed practical to reduce its size during the analysis and visualization to make the central nodes more visible. Therefore, the graph below shows only nodes with at least 12 connections, which is above the average degree of the original network. Each of them belongs to the giant component of the network, i.e. there are no isolated nodes and there is a path between any two entities.
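The quoted parameters are shortest-path statistics; on an unweighted graph they can be computed with breadth-first search. A plain-Python sketch on a toy adjacency list (our real graphs were, of course, far larger):

```python
from collections import deque

def path_stats(adj):
    """Average shortest-path length and diameter of an unweighted,
    connected graph given as {node: set_of_neighbours}."""
    total, count, diameter = 0, 0, 0
    for source in adj:
        # BFS from this source gives shortest paths to all other nodes
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for node, d in dist.items():
            if node != source:
                total += d
                count += 1
                diameter = max(diameter, d)
    return total / count, diameter
```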

In the name graph, numerous relevant groups can be identified. Among them, the ones with a political character are the most dominant; these form the central core, which is the biggest connected component, and the entities with the highest degrees can also be found here. The impressive blue cluster in the center of the graph is basically the collection point of the Hungarian political scene. Prime Minister Viktor Orbán has the highest degree not only here but in the entire graph. Other key characters of the Fidesz regime with a relatively high degree are Péter Szijjártó, Antal Rogán and János Lázár, together with other past and present party leaders such as Gábor Vona or Ferenc Gyurcsány. The political elite of Western Europe also forms a well-defined block (magenta). The graph shows that, within the same cluster, politicians with either similar or rather different opinions on migrants are often mentioned in the same piece of news. Angela Merkel, with an impressive degree, is a good example: she is linked to politicians like Francois Hollande, Federica Mogherini and Martin Schulz, all sharing her liberal views on refugee policy. Among the politicians supporting anti-migrant policies, Donald Tusk, David Cameron and Nicolas Sárközy and their connections are worth mentioning. Connections spanning the two blocks are not rare either. The green cluster contains the political elites of Russia and America as well as the central figures and terrorists of the war in Iraq and Syria.

Close to the center, the group of Church-related people (shown in light grey) and the circle of Hungarian writers, poets and actors (shown in orange) can be seen. Groups unrelated to politics, such as Nobel prize-winning scientists and explorers, footballers, foreign actors and celebrities, are located further away from the core.


Connections between personal names

For institution names, we have a relatively smaller network with 602 nodes and 3215 edges. Some of the graph’s interesting parameters: the average path length is 2.535, the diameter is 6 and the clustering coefficient is 0.74. For the visualization we again filtered by degree: entities with at least 10 connections (the average degree in this network) were put on the dashboard. The green cluster represents political parties. Fidesz is mentioned together with other parties such as Jobbik, Demokratikus Koalíció and the Ellenzéki Párt in several articles, and the latter two are strongly connected to Jobbik as well. The political parties and the traditional and community media (TV and radio channels, Facebook and Twitter) are intertwined; a nicely highlighted thick edge is visible between M1 and the governing party. The reddish nodes indicate the German political parties, while the grey nodes refer to the Austrian ones. The light blue cluster shows mostly international organizations. The violet one looks like a “melting pot” with MTI as its primary core, and telecommunication companies, foreign parties and charity organizations as its other members. MTI (the Hungarian Telegraphic Office) is the entity with the highest degree, connected to almost every institution on the graph. Knowing MTI’s profile, a Hungarian news agency and one of the oldest in the world, this is hardly surprising.


Connections between institutions


For the sake of clarity, the sizes of the nodes for geographical names are unified. Altogether 28,147 geographical names and their 46,907 connections are shown. The diameter is 6 and the average path length is 2.667. Most nodes are located in Hungary. The source countries of migration, as well as the target ones, are also significantly represented on the graph. Hungarian settlements close to the border have the highest degrees; these are the ones mentioned most frequently in the news: Bácsborsód and Zákányszék near the Serbian-Hungarian border; Csanádpalota, Mátészalka, Nyírmada and Nyírbogát near the Romanian-Hungarian border. Moving away from Hungary, Brussels has a considerably high degree, with connections spanning continents. Not surprisingly, it is mentioned together with several Hungarian settlements along the border.


Connections between geographical names


Visual representation

Nowadays most articles contain not only text but images too, and their role is becoming more and more important, since they draw more people into reading the story. As for social media, a good photo is simply a must. Therefore, together with the news, we also collected the images. For a reader it is easy to decide which image goes with which piece of news, but for a computer this is a challenge. We used several heuristics to tackle the problem; we assumed, for instance, that images of a tiny size were logos or other design elements. On several websites we took the date of first publication into account because of the visual recommendations at the end of the articles. Finally, some of the most frequent images were removed manually. Since processing images requires extensive hardware resources, it was also important to remove duplicates. In the end, we had 38,266 images left, appearing 62,762 times altogether in 28,456 documents.
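Removing exact duplicates can be as simple as hashing the raw image bytes; a sketch (this catches byte-identical copies only; near-duplicate images would need perceptual hashing, which this does not attempt):

```python
import hashlib

def deduplicate_images(images):
    """Keep one copy of each byte-identical image.
    `images` maps filename -> raw bytes; returns the surviving filenames."""
    seen = {}
    for name, data in sorted(images.items()):
        digest = hashlib.md5(data).hexdigest()
        seen.setdefault(digest, name)   # the first filename wins
    return sorted(seen.values())
```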

However, it is impossible, and not even worthwhile, to go through all of them; to get any kind of idea of what these images are about, a tool is needed. Luckily, there is more than one way to process images. We chose Clarifai, which adds tags to the photos, even in Hungarian. Having a rather special dataset, we could not use the results as they were. Clarifai seems to have done its internship on images of white, middle-class Western people, since photos of crowds shot in refugee camps were consistently tagged as “festival”, and tags like “rally” and “entertainment” were also over-represented. Since we have to live with these shortcomings, we simply got rid of certain tags (e.g. musician), while we kept others (e.g. festival) but with a significantly modified meaning: the festival tag in our case may refer to a crowd, often behind a wall of law-enforcement officers, or to refugees resting somewhere. Although imperfect, the tags enable us to transform visual information into textual information, and this way we can analyze the dataset.
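The tag clean-up described above boils down to a drop list and a rename map; a sketch with hypothetical lists (the actual drop and remap sets were curated by hand):

```python
def clean_tags(tags, drop, remap):
    """Drop misleading tags entirely and rename the ones that were
    kept but with a changed meaning."""
    return [remap.get(t, t) for t in tags if t not in drop]
```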

We classified the images into eight topics using the LDA method. For embedded images, it is worth consulting research on the visual representation of minorities, such as Bernáth and Messing, or Wright. Representation strategies, which often aim to alienate, are well known from the literature and can also be found among our categories. A typical example is when refugees are shown as masses, their faces hardly recognizable, or as "waves of humans" flowing towards Europe. In sharp contrast, politicians are shown clearly and openly, with their names and faces. This contrast and its negative connotation are intensified by the fact that in most cases the face of a refugee becomes known only when they are wanted by the police; very often the first photo of the person is shot during a police action. The topic model results reveal other representation strategies as well: there are images of war zones, and of smaller groups and families with children on their way, which sensitize us to their fate. The following photomontages show the images most characteristic of certain topics.


Faceless crowd


Maps, charts and screenshots




War areas, refugee camps and temporary residence of refugees


Members of armed forces, soldiers of war and target countries


Portraits, close-ups and “wanted” photos


At the border, at the fence, on the road and on the water


Images of smaller groups: children, families and young people

Time distribution of the topics above


Migrants, refugees, immigrants: what is the media suggesting?


The extreme values for the topics "Faceless crowd" and "Images of smaller groups and families" are partly due to the fact that we are not yet able to perfectly separate the images belonging to a given article from the other images on the same page.

House Prices 3D Visualization

We collected almost 200,000 house ads from the Hungarian web. First, we extracted the basic information for each unit and calculated its price per square metre, then we calculated the median price for each district. Finally, we made a 3D three.js visualization with QGIS.
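The aggregation step is straightforward; here is a minimal sketch with made-up ads (district, price, floor area) showing the per-square-metre and per-district median computation:

```python
from statistics import median

# Hypothetical ads: (district, price in HUF, floor area in square metres).
ads = [
    ("V.", 60_000_000, 50),
    ("V.", 90_000_000, 60),
    ("VIII.", 30_000_000, 50),
    ("VIII.", 35_000_000, 70),
]

# Price per square metre for each ad, grouped by district.
per_sqm = {}
for district, price, area in ads:
    per_sqm.setdefault(district, []).append(price / area)

# Median price per square metre in each district.
medians = {district: median(values) for district, values in per_sqm.items()}
```

The median is preferable to the mean here because a handful of luxury listings would otherwise drag a whole district's figure upwards.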


You can find the visualization here.


We used Python for crawling and for data processing – we love BeautifulSoup! We used the fantastic open source QGIS program and its qgis2threejs plugin to visualize our data.


Geo search is one of the hottest topics in search right now, and since Precognox specializes in NLP and search, we thought it was high time to get our hands dirty with geo data. Housing is a big issue everywhere in the world, and we believe technology can help us understand it and perhaps even come up with solutions (yes, we are idealists).

Analyzing discourse on recent issues in Hungary


The Hungarian government passed a restrictive bill against the Central European University. A storm of social media posts followed, protesters flooded the streets to express their opinion, and torrents of articles appeared in the media. We are touched by the recent issues in our country, and we wanted to see the national and international discourse around them.

Global discourse on lex-CEU

We collected data from Twitter to visualize the discourse (topic models) and to show the geographic distribution of the participants of the discussion.


You can find the visualization of the topics here.

You can find the 3D visualization of the geographic distribution of tweets here.

We used the Twitter API to collect 7,822 tweets written in English containing one of the terms 'CEU' or 'Central European University', or the hashtag #istandwithceu. We used the Stanford CoreNLP tool for lemmatization and named entity extraction. Topic modelling was done with the gensim package, and the interactive topic model visualization was generated with the pyLDAvis library. For the interactive globe visualization, we ran the same search, which gave us 9,745 tweets; we extracted the geo-location data from the tweets to map them on Google's WebGL Globe.
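The post does not show the geo-extraction step; a minimal sketch, assuming tweets arrive as JSON objects in the classic Twitter API shape (where geotagged tweets carry a GeoJSON `coordinates` field in longitude-first order), could look like this:

```python
import json

def extract_points(raw_tweets):
    """Pull (latitude, longitude) pairs from tweets that carry an exact
    GeoJSON point; tweets without coordinates are skipped."""
    points = []
    for line in raw_tweets:
        tweet = json.loads(line)
        geo = tweet.get("coordinates")
        if geo and geo.get("type") == "Point":
            lon, lat = geo["coordinates"]  # GeoJSON order: longitude first
            points.append((lat, lon))
    return points

# Two toy tweets; only the first one is geotagged.
raw = [
    '{"text": "#istandwithceu", "coordinates": '
    '{"type": "Point", "coordinates": [19.04, 47.5]}}',
    '{"text": "CEU", "coordinates": null}',
]
points = extract_points(raw)
```

In practice only a small fraction of tweets carry exact coordinates, which is why the globe shows far fewer points than the total number of tweets collected.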


Local discourse on mass protests

After lex CEU, the government targeted NGOs with a new proposal requiring civil organizations that accept financial help from abroad to register at court as "foreign funded organizations", despite the fact that they are already obliged to publish their books, like any other NGO in the world. Citizens responded with peaceful mass protests, and the online media followed the story closely. However, the pro-government media interpreted the news in a very different way.

We collected articles related to lex CEU, the anti-NGO bill and the protests from four Hungarian news sites (two independent: 444.hu and index.hu, and two pro-government: 888.hu and origo.hu). We analyzed 513 articles that appeared between 1 April and 13 April 2017. We found no significant differences between the two groups at the level of text statistics (lexical diversity and article length).
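Lexical diversity can be operationalized in several ways; the simplest, sketched below, is the type-token ratio, the number of distinct word forms divided by the total word count (our actual measure is not specified in the post, so treat this as an illustration):

```python
def lexical_diversity(text):
    """Type-token ratio: distinct word forms divided by total word count.
    A crude but common proxy for lexical diversity."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# A repetitive text scores lower than a varied one of the same length.
varied = lexical_diversity("one two three four")   # all four tokens distinct
repetitive = lexical_diversity("one one one two")  # only two distinct tokens
```

Note that the raw type-token ratio shrinks as texts get longer, so articles should be compared at equal length (or with a length-corrected variant).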

Below, you can have a look at the top 150 most frequent words of each site. The word clouds were made by using Processing and the WordCram library.

There are no big differences between the raw word frequencies, so we examined the keywords of each site with the help of the fantastic AntConc corpus linguistics software. The resulting word cloud shows a great divide between the pro-government and the independent media.
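Keyword analysis compares a word's frequency in one corpus against a reference corpus; a standard statistic for this (and one AntConc offers) is Dunning's log-likelihood. A minimal sketch with made-up counts:

```python
import math

def log_likelihood(a, b, corpus_a_size, corpus_b_size):
    """Dunning's log-likelihood keyness for a word occurring `a` times in
    corpus A and `b` times in corpus B. Higher scores mean the word is
    more strongly associated with one of the corpora."""
    # Expected counts under the null hypothesis of identical frequency.
    e_a = corpus_a_size * (a + b) / (corpus_a_size + corpus_b_size)
    e_b = corpus_b_size * (a + b) / (corpus_a_size + corpus_b_size)
    ll = 0.0
    if a:
        ll += a * math.log(a / e_a)
    if b:
        ll += b * math.log(b / e_b)
    return 2 * ll

# A word with 30 hits in one 10,000-word corpus and 5 in another of equal
# size is a strong keyword; an evenly spread word scores zero.
skewed = log_likelihood(30, 5, 10_000, 10_000)
even = log_likelihood(20, 20, 10_000, 10_000)
```

Ranking all words by this score and keeping the top of the list is essentially what a keyword cloud visualizes.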


We found that the volume of coverage is much lower on the pro-government sites (162 vs. 351 articles). We used Latent Dirichlet Allocation to analyze the topics of the articles, and we found that although the two sides cover the same issues, LDA identifies two big topics due to the different linguistic features of the two groups. While the pro-government media prefers terms like Soros University (for CEU), Soros-funded NGOs and foreign actors, the independent media uses a more neutral language and the official names of persons and institutions. You can find our interactive visualization of the topics here.


Topic 1 is mainly about the mass demonstrations against lex CEU and the proposed anti-NGO bill. Surprisingly, this topic contains articles exclusively from 444.hu and index.hu.


Topic 2 is also about the mass demonstrations against lex CEU and the proposed anti-NGO bill. However, this topic contains articles exclusively from origo.hu and 888.hu, owing to their very different vocabulary.

A walk into the semantic space

Having found dramatic differences between the independent and the pro-government media, we wondered how strong the difference between the languages used by these sites really is. We trained a word2vec model on the corpus and plotted its 3D t-SNE projection using the threejs R package to see how the words used in the articles relate to each other.
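The trained model itself cannot be reproduced here, but the underlying idea is simple: once every word has a vector, relatedness is just cosine similarity between vectors. A toy sketch with made-up 3-dimensional "embeddings" standing in for word2vec output:

```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return dot / norm

# Made-up vectors; a real model would assign 100+ dimensions per word.
vectors = {
    "kormany":   (0.9, 0.1, 0.0),  # "government"
    "parlament": (0.8, 0.2, 0.1),  # "parliament"
    "tuntetes":  (0.1, 0.9, 0.2),  # "protest"
}

def nearest(word):
    """The most similar other word in the toy vocabulary."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))
```

t-SNE then squeezes these high-dimensional neighbourhoods into three dimensions for plotting, which is why words used in similar contexts cluster together on the visualization.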


You can find our visualization here.

We plotted only the five hundred most frequent words from each site. There are commonly used words in the top 500, and these occupy the central part of the plot. It seems that origo has no distinct language, as we can barely see yellow dots on the plot. Given the recent history of the site, this is no surprise: origo was bought by a group close to the government, most of its staff left, and the site recruited new people and started to collaborate with sites on the right side of the political spectrum. Although 444 and Index covered the same stories, it seems the two sites have developed their own languages.

Follow the narratives, but don’t be a solutionist

Tracking down narratives on an issue and visualizing your findings is super easy in 2017, thanks to the open source community. We love technology, and we are happy whenever it helps us see the big picture. We see that there are two narratives on the same topic, but closing the gap between the two groups and starting a rational discussion between citizens is not about technology. There is no app that can help us. We hope the followers of these distinct narratives can find common ground and start a discussion in real life before they lose the ability to understand each other.

Read on

Bill Bishop: The Big Sort: Why the Clustering of Like-Minded America is Tearing Us Apart, Mariner Books, 2009

Eli Pariser: The Filter Bubble: How the New Personalized Web Is Changing What We Read and How We Think, Penguin Books, 2012

George Lakoff: Don’t Think of an Elephant!: Know Your Values and Frame the Debate–The Essential Guide for Progressives, Chelsea Green Publishing, 2004

Norman Fairclough: Language and Power, 3rd Edition, Routledge, 2014

What did we present at the Applied Linguistics Conference?

On 3 February 2017, our company had the honor of participating in the 11th Conference for PhD Students of Applied Linguistics, not with one or two, but with three presentations. Our colleague Martina Szabó has recently finished her PhD in applied linguistics at the University of Szeged, and has led multiple research projects with our NLP team in the field of Hungarian emotion and sentiment analysis.

Martina, Gergő, Berni and Zsófi

We presented our findings in three papers. Martina Szabó and Fanni Drávucz wrote about the problem of subjectivity in connection with emotions and sentiments. Looking for linguistic signs of uncertainty in our emotion and sentiment corpora, they found that the emotion corpus contains 2.5 times more linguistic signs of uncertainty, which suggests that emotions are indeed more personal and subjective than sentiments. They also found that negative emotions and negative sentiment are more closely connected to uncertainty, which may arise from the more polite and indirect expression of such emotions or opinions.

The second paper was based on a corpus of Hungarian tweets that we collected and analyzed, looking for polarity-changing elements. These lexically negative linguistic items can lose or change their polarity and carry a positive or neutral value as intensifiers. Martina Szabó, Zsófi Nyíri, Bernadett Lázár and Gergő Morvay analyzed the use of such intensifiers by male and female Twitter users, and found that while female users preferred to use them with negative adjectives, male users used them more often with positive or negative adjectives and showed an overall preference for swear words.

In another study, Martina Szabó, Zsófi Nyíri and Bernadett Lázár examined the translatability of negative intensifiers from Russian to English. These linguistic elements are so delicate and complex that their full meaning is often lost in translation. Analyzing a parallel Russian-English corpus, they found that such Russian intensifiers are often translated into English with a neutral intensifier, partly losing the original meaning, and that negative intensifiers are interpreted differently depending on whether they modify negative or positive adjectives.

We are really proud of Martina and our NLP team for such hard work!

2016 in Retrospect

Time flies and the end of the year is coming, so it’s high time to summarize what’s happened to us in 2016.

Precognox in the world


We are participating in the KConnect Horizon 2020 project, which aims to bring semantic technologies into the medical field. We are proud to be a partner in a truly European project!

This year, Precognox visited the New World and built a partnership with Basis Technology. One of our colleagues spent three months in Boston, MA as the first step of our co-operation.

We are truly multilingual: we worked with texts in Mandarin Chinese (in Simplified script), Spanish, Arabic, Russian, English and Hungarian. We gained experience with these languages as part of our projects with Meltwater, the biggest social media monitoring company.

Business as usual

According to the basic law of software development, projects occupy the available resources, and more resources mean more projects. Precognox is no exception: our team is growing, so we are managing more and more projects. We are continuously working on large-scale Java-based software development projects for various customers; have a look at the list of our customers and you'll understand why I mention only one of them here. We are about to start major enterprise search and text mining projects, one of them being an upgrade of the semantic search solutions developed for Profession's online job search portals. Precognox has been working on the backend of Profession's sites for years, so we literally grew up with it; it taught us a lot about enterprise search, and we are excited about the upgrade.


We have a new product called TAS (Text Analytics System). We had several data collection and cleaning projects, and we distilled our experiences into a new tool. TAS helps you collect, clean and analyze unstructured data; learn more about it on our website.

For profit, and for the greater good

Precognox has employed trainees for years. Usually, we have software developer and data analyst trainees who work with us on a part-time basis, and we welcome students for summer internships too. We are very proud of our former trainees: many of them started their careers at top companies, one is doing his PhD in the Netherlands, and many are now our full-time colleagues. From this September, we are participating in the new collaborative teaching scheme, which means the incoming students spend one or two days a week at the university as ordinary students and the rest of the week at our company as full-time employees. We believe this practice-oriented scheme will help students jumpstart their careers upon graduation.

This year we were working on data driven projects with two NGOs and two research institutions.

We worked on an information visualization dashboard with EMMA (an NGO dedicated to informing and helping pregnant women). As part of a European project, EMMA's volunteers interviewed women across the country about their experiences during pregnancy and motherhood, and we analyzed this data using various text mining tools. The project helped us design a workflow for rapidly prototyping text mining solutions; you can find projects based on it here and here. We hope EMMA can use our dashboard to analyze their data and that we can work together on interesting projects in the future.


This summer, we started working with Járókelő, a platform for reporting potholes and other anomalies in the city to the authorities. We’d like to develop a scoring mechanism for the stakeholders.



We are processing public procurement data for the Corruption Research Centre Budapest and the Government Transparency Institute. Our partners' research on monitoring procurement-related corruption has recently been featured in The Economist.


Precognox is committed to open data; that's why we published our Hungarian sentiment lexicon on opendata.hu under a permissive licence.


We publish about our research projects on Nyelv és Tudomány (Language and Science, a popular science online magazine). For example, we wrote a long article on the representation of migrants in the Hungarian online media, published several pieces on the social and ethical questions of AI and big data, and made style transfer videos for the portal in 2016.

Work should be fun!

While we have lots of projects, we are continuously improving ourselves. That's why we have been organizing the Hungarian Natural Language Processing Meetup since 2012. This year, we teamed up with Meltwater and nyest.hu and took the meetup to the next level. We had six meetings with speakers from industry and academia; two meetups were held in English, with speakers from Oxford, San Francisco (Meltwater) and London (BlackSwan).


Precognox is a distributed company with offices in Kaposvár and Budapest, and team members from Szeged and other parts of the country. Several times a year, we get together to talk about our projects and just to have a blast. Of course, we are real geeks, so we organized in-house hackathons at these events, and we loved hacking on data projects.


We are addicted to conferences. Every year, we attend MSZNY (the Hungarian Computational Linguistics conference), BI Forum (the yearly business intelligence conference in Hungary) and many more. We are happy to present our research to the public and get feedback from the community. We also love sharing our knowledge: this year, we gave a lesson on text mining at Kürt Academy's Data Science course, and one on content analysis and text mining for master's students at the Statistics Department of ELTE TATK.


This year, we made lots of dashboards, both for profit and to help scientific inquiries. Having finished these projects, we felt the need for introspection. Although we worked hard to show what the data tells us, we did not use the full potential of data analysis for advancing humanity. We needed a reason to continue our efforts, we needed a new goal. So we turned to the Jedi Church for consolation; the church connected us to the Force, and the Force helped us visualize the Star Wars texts.


We are so artsy

Everything started with a job ad. We were looking for a new intern and needed a photo for the post describing the ideal applicant. It seemed a good idea to give style transfer a try and attach an image of our team in the style of Iranian mosaics.


Later, our Budapest unit moved to a new office, so we thought it would be a good idea to design our own decoration for the new place. The results are hilarious: a new typeface (yes!) with characters composed from graphs, and the following pictures.

Finally, “SEMMI” (meaning “nothing”) got onto the wall of our room.


We are very keen on style transfer, so we made videos too.


Having worked with pictures and characters, we needed a new challenge, so we sonified emotion time series extracted from Hungarian news published during the migration crisis.

And now for something completely different

This year we were working hard and playing hard, it’s time to have a short break. Next year, Precognox will start offering new solutions to its customers and exciting new projects to its employees. Stay tuned, we’re going to blog about these!