Migrants, refugees, immigrants: what is the media suggesting?

Visual and textual representation of immigration in the Hungarian online media

In the autumn of 2016, the referendum on the so-called forced-settlement of migrants was looming over our heads. Media was certainly putting an enormous pressure on society; but what was this whole fuss about? Although modern computational linguistics cannot come up with exact answers, it may assist us in getting an idea of the wide range of emotions stirred by various sites. Precognox research team is presenting their profound analysis on how media is attempting to sell the referendum.

Kitti Balogh, Nóra Fülöp, Virág Ilyés, Zoltán Varjú

Originally published: nyest.hu, September 29, 2016

It is indisputable that although a huge number of refugees reached Hungary, yet we could not bump into them around every corner since most of them almost immediately left the country. For the public, it is the media that represents a direct link to the refugees, so we wanted to find out how the news on migration are presented in the online media. We analyzed more than 40000 articles published between Sep 27, 2014 and July 11, 2016 with text mining and image processing methods. The texts and their metadata are available in a searchable form on our dashboard. In this article, we are giving a broad outline of what possibilities the dashboard provides. We are also trying to present the information content of the images in an easily understandable form.


As opposed to the mostly qualitative research widely applied in media content and media representation analysis, we used methods which support the automatic processing of large amount of data as well as the simultaneous analysis of both visual and textual content. This way the research period can be extended and the number of content providers can be increased. The simultaneous analysis is to be completed in our intern’s thesis on the fine-tuning of cluster analysis application and evaluation. The aim is to make the interpretation of both textual and the increasing proportion of visual content easier in the future.

The necessary data was collected from 25 online news sites including the prominent index.hu, origo.hu, the online version of mno and hvg as well as minor portals. The selected sites cover a wide spectrum of the Hungarian online media; articles have been taken from pestisracok.hu, abcug.hu, kuruc.info as well as from the popular pages of yellow journalism. We also collected data from online TV channels- atv.hu, rtl.hu, hirek.hu- and the official police reports from police.hu.


The number of articles in the corpus based on content providers


We could find the articles related to migration on most sites with their own search engines but there were some cases when it was not feasible. In lack of- or besides- the search function, labels and headings guided us to find the relevant content. First, we collected the article URLs manually with Link Klipper Google Chrome extension. Then, having these references, we automatized the crawling process of both the visual and the textual content.


The number of articles in the corpus based on keywords

To be able to interpret the composition of the corpus, it is essential to mention the process of how the URLs and the content are filtered since it decreased the reference list by 30 thousand items. During the process, several methods were used to add relevant and unique articles to the corpus. We got rid of invalid links which led to either recommendations or sites listing search results with simhash. Duplications within one domain were filtered with similarity measures based on tf-id statistics. We also removed duplicated URLs. We applied the heuristic technique to filter duplicates; the article published the earliest was left in the corpus. We also discarded the articles with no timestamp where, for this reason, the date of publication could not be identified. Although we did our best to eliminate irrelevant articles with our statistical tools, we cannot be certain to have only proper content. It is also important to remember that based on the corpus composition only careful conclusions can be drawn concerning either the number of articles published on a certain subject, or which site was the most active to represent a given topic. The reason for this is that adding an article to the corpus meant its real publication, the fact that we could use the crawling method on the site, and that the content met the filter criteria: it was not identified as a duplicate, or it wasn’t an invalid link and had a timestamp. At the end of the process what we got was a corpus of 42845 articles.


We worked with the following keywords: “immigration”, “immigrant”, “migrant”, “migration”, “refugee” and “asylum-seeker”. Almost half of the articles in the corpus are hits for the keyword “refugee”. The second most frequent was “immigration” followed by “immigrant”, “migrant”, “migration” and “asylum-seeker”. Unique labels and headings were used in case of three sites: on kuruc.info we collected articles under the heading “immigrant crime” as well besides the keywords mentioned above. On “kettősmérce.blog.hu” the column “immigrant affairs” was great help to find the relevant articles. On blikk.hu the label “refugee crisis” was used to get the news on this topic.

Word usage is a crucial element of media representation research. The modality of expressions can be either alienating or fear-provoking. It would certainly be wrong to jump to far-reaching conclusions in the lack of context and judge the strategies content providers used to present refugee-affairs based exclusively on the keywords. Below however, we can see the hit results of our keywords categorized by sites: which expressions were preferred and which ones were ignored.


Hit results for keywords based on sites

To get a more profound understanding after the descriptive analysis of the corpus we also completed the content analysis of the articles as well. In the pre-work phase the first step was to remove parts with incorrectly coded characters. Then, with the use of magyarlánc we stemmed the words and carried out part of speech tagging: we classified words into their parts of speech and labelled them. To achieve more relevant results, we removed certain words- the most frequently used ones due to natural language usage- with the help of a stopword list.

To present the results of our complex analysis an interactive dashboard was created which hopefully completes, corrects or specifies our general suspicion on the representation of the refugee crisis and gives an overall picture of the world the Hungarian online media shows about immigration.


Trends in time

We can easily get an idea of how media reacted to immigration based on the time and the number of the articles published. This phenomenon was clearly presented on the dashboard created for the text corpus which showed an increase from May 2015 in the number of published articles. Most of them were created between the end of August and the middle of September 2015. From October 2015 to May 2016 articles were being published evenly, then in July 2016, right at the end of the collecting period, another rise can be seen.


Time distribution of all news

It is possible to find the words and expressions used in the articles with the Search field on the sites. For instance, if we search for the words “refugee”, “migrant”, “migration” or “immigrant” what we get is a trend fairly similar to the original one. However, there are expressions which were not typically used during the whole period. One such instance is “immigrant-for-a-living” which can be found by searching for “living” AND “immigrant” or “migrant crime” in the keywords category. If we check the timeline of these expressions we can see that this phrase was favored roughly until the middle of 2015, mostly in the news of nepszava.hu. The tag “immigrant crime” became a pet expression on kurucinfo.hu from the early 2016.


Time distribution of articles with the phrase “immigrant-for-a-living”


Time distribution of “immigrant crime” tag

We can also find words which are more generally connected to the topic. Such as “immigration” for example where the time distribution has a peak in several places indicating the unfolding of the phenomenon well before media attention.


Time distribution of articles containing the phrase “immigration”

Emotions and sentiments

It is important to identify the emotions and attitudes evoked by events when analyzing the discourse of online media. Although journalists generally aim to be objective and neutral, the phrases they use are often giveaways of their mindset- not to mention articles where the opinion of the author is far from being disguised.

With the two tabs of the dashboard it is possible to study the sentiments and emotions identified in the articles. During the sentiment and emotion analysis our goal was to identify opinions, attitudes and emotions expressed by the articles. Sentiment analysis normally uses 3 categories (negative, neutral and positive) or their various stages while emotion analysis tries to detect the 6 basic human emotions (sadness, anger, joy, disgust, fear and surprise). We used our Precognox dictionaries to identify sentiments and emotions. The sentiment dictionaries are available free here for research purposes. Although the emotion dictionaries can still be improved on and therefore should be used carefully, they are appropriate for a rough analysis.

To receive the sentiment or emotive value of an article we divided the number of words identified by our dictionaries with the total number of words. We got a value between 0 and 1 for negative and positive sentiments respectively for each article as well as for the emotions of sadness, anger, joy, disgust, fear and surprise. Then we added the positive and negative values. The cumulative sentiment had a value between -1 and 1 within one article. However, on the dashboard the values of articles published on a specific day are summed up- that’s why the sentiment values may range from 10 to -8.

The cumulative sentiment of the news on immigration is neither positive nor negative in nature. The daily value is rather neutral with only one or two peaks. When positive and negative sentiment values are considered separately we can see that they are both represented in significant numbers. When summing them up however, they cancel each other out. This means that the sentiments of the collected sources cover a wide range of spectrum and with some exceptions they are balanced.


Time distribution of cumulative sentiments


Time distribution of negative sentiments


Time distribution of positive sentiments

Based on the emotion timelines sadness and fear are the first to be revealed in the news. Since however, the dictionaries differ in length, comparing the volume of emotions should be done with care. When looking for a certain date with the Time window panel it is possible to read the news published on the very date. Also, we can find out what event triggered the increase of emotions. For instance, 31 August 2015 was a day when both sadness and fear were at a high peak. We can see that lots of articles were focusing on the following topics: a humanitarian catastrophe due to the refugees gathered at the Keleti station, the congestion on both public roads and railways, the negative reception of Hungary’s immigration policy, the rejection of the quota system, the number of refugees entering the country, the high alert of border control and the impossible situation of volunteers in the transit zones.


Time distribution of sadness and fear


It is also worth checking the domains to see which sentiment or emotion dominates the online news portals. Let us look at 444.hu where all sentiments except surprise show a constant radical shift similarly to the cumulated value which also changes dramatically between the positive and negative sentiments.

Besides the timelines, the words belonging to given sentiments and emotions are also shown on the Dashboard. Let’s look at two examples: expressions like “unpleasantness”, “problem”, “war”, “terrorist” and “illness” are typical in news where negative sentiments are dominant. In articles where the emotion “fear” is powerful, words like “concern”, “dread”, “terror” and “worry” appear in the greatest number.


Word cloud of negative sentiments


Word cloud of fear



To make the content of more than 40.000 news more manageable thematic groups sharing the same semantic features were created. For this process, we used the Mallet tool’s topic model called Latent Dirichlet Allocation (LDA). The classification process of the LDA algorithm is based on how the words in the document are distributed. Naming the topics is done by analysts. The output of the algorithm is two lists: one containing the most typical words used in each topic, and another one showing the rate of how the various topics are represented in each document. We got as many as 47 topics altogether which were named based on either their keywords or their most typical news. When modeling a topic each piece of news gets assigned to each topic to a certain extent. It may be more prominent in case of 1-3 items and may be relatively insignificant in case of others. For the sake of simplicity each piece of news was assigned to the most relevant topic. Therefore, we may have the impression in some cases that only few sentences refer to the given topic but all in all this method gives a good model of the thematic structure of the corpus.

The dashboard created for the texts and their metadata has a separate tab for topic analysis. Here’s the list of 15 topics embracing the most news, the number of the news in parenthesis:

  • The EU-Turkey Refugee Deal (2313)
  • The criticism of the EU’s immigration policy (FIDESZ-KDNP) (2124)
  • Catching illegal immigrants and human traffickers (2123)
  • Migrant surge in Southeastern Europe (2060)
  • The journey of migrants to Western-Europe through Hungary (2028)
  • Accidents of refugee boats (1723)
  • Restriction on the right of Asylum (1696)
  • Refugee incidents in Germany (1555)
  • War in the Middle East (1469)
  • Hungarian border barrier (1440)
  • Merkel’s refugee policy and its criticism (1412)
  • Aid programs of international and civilian organizations to help Syrian refugees (1311)
  • Foreign reaction to the refugee crisis (1238)
  • Austrian-Hungarian border barrier (1225)
  • The political crisis caused by refugees (1121)

The dashboard clearly shows which words are typical and which positive and negative expressions are favored when a certain topic is being discussed. For instance, the most frequently occurring words of the topic “The EU-Turkey Refugee Deal” are the following: “unio”, “refugee”, “state”, “world” and “role”. As for negative words “burden”, “nuisance”, “inconvenience”, “problem” stand out while the positive ones are: “important”, “free”, “entitled” and “respect”. As opposed to this here are the words the topic “Catching illegal immigrants and human traffickers” had in the greatest number: “police officer”, “police station”, “male”, “illegal” and “Syrian”. The word “forbidden” is the most important negative one whereas the positive expressions seem somewhat insignificant.

We chose two topics out of the 47: “The criticism of the EU’s immigration policy (FIDESZ-KDNP)” and “Liberal attitude with the migrants”. These topics are the subject of a further analysis at the end of this article.


Who are mentioned in the news?

With DBpedia Spotlight we extracted the name entities from the collected articles (Named Entity Recognition) and we examined three types: personal names, geographical names and institution names. We created graphs where the nodes represent the entities and the edges show that they have been mentioned together in one article.

The graph of personal names contains a relatively high number of nodes- 2345 entities altogether- with 13473 edges. For the sake of clarity here are some informative graph parameters: the average path length is 3,3, the diameter- the distance of the two farthest nodes- is 10, the clustering coefficient- which indicates how frequently two nodes that are both connected to a third one is connected- is 0,75. Since we have a relatively complicated network, it seemed practical to reduce its size during the analysis and the representation to make the central nodes more visible. Therefore, the graph below shows nodes with at least 12 connections which is above the average degree in the original network. Each of them belongs to the giant component of the network- i.e. there are no isolated nodes and there must be at least one edge between any two random entities.

In case of the name-graph numerous relevant groups can be identified. Among them the ones with political characteristics are the most dominant- these form the central core which is the biggest related component. Also, the entities with the highest degree can be found here. The impressive blue cluster in the center of the graph is basically the collection point of the Hungarian political scene. Prime Minister Viktor Orbán has the highest degree not only here but also in the entire graph. Other key characters of the Fidesz regime with a relatively high degree are Péter Szijjártó, Antal Rogán, János Lázár, together with other past and present party leaders such as Gábor Vona or Ferenc Gyurcsány. The political elite of Western Europe also form a well-defined block (magenta). The graph shows that within the same cluster politicians of either similar or rather different opinions on migrants are mentioned several times in the same piece of news. Angela Merkel with an impressive degree is a good example. She relates to politicians like Francois Hollande, Federica Mogherini and Martin Schulz- all sharing her liberal views on refugee policy. Out of the politicians supporting anti-migrant policy Donald Tusk, David Cameron and Nicolas Sárközy with their connections could be mentioned. Connections spanning the two blocks aren’t rare either; they can also be found within the graph. The green cluster indicates the political elite of Russia and America as well as the central figures and terrorists of the war in Iraq and Syria.

Close to the center the group of the Church-related people- shown in light grey-, and the circle of Hungarian writers, poets and actors- shown in orange- can be seen. Not politically-related groups like Nobel-prize winner scientists and explorers, footballers, foreign actors and celebrities take place further away from the core.


Connections between personal names

In case of institution names, we have a relatively smaller network with 602 nodes and 3215 edges. Here are some of the graph’s interesting parameters: the average path length is 2,535, the diameter is 6 and the clustering coefficient is 0,74. When representing the results, we used filtering also based on the degree. Entities with at least 10 connections- this is the average degree in the network- were put on the dashboard. The green cluster represents political parties. Fidesz is mentioned together with other parties such as Jobbik, Demokratikus Koalíció and the Ellenzéki Párt in several articles. It also shows the strong connection between Jobbik and the last two. The political parties, the traditional and community sites- TV and radio channels, Facebook and Twitter- are sort of intertwined. A nicely highlighted thick edge is visible between M1 and the governing party. The reddish nodes indicate the German political parties while the grey nodes refer to the Austrian ones. The light blue cluster shows mostly international organizations. The violet one looks like a “melting pot” with MTI as its primary core and telecommunication companies, foreign parties and charity organizations as other members. MTI (Hungarian Telegraphic Office), is the entity with the highest degree being connected to almost every single institution on the graph. Knowing MTI’s profile- a Hungarian news agency, one of the oldest news agencies in the world-, this fact may not be surprising.


Connections between institutions


For the sake of clarity, the size of nodes in case of geographical names is unified. Altogether 28147 geographical names and their 46907 connections are shown. The diameter is 6, the average path length is 2667. Most nodes are situated in Hungary. Source countries of migration as well as the target ones are also significantly represented on the graph. Hungarian settlements close to the border have the highest degree; these are the ones mentioned the most frequently in the news: Bácsborsód and Zákányszék in case of the Serbian-Hungarian border; Csanádpalota, Mátészalka, Nyírmada and Nyírbogát in case of the Romanian-Hungarian border. Moving away from Hungary, Brussels has a considerably high degree with its connections spanning continents. Not surprisingly it is mentioned together with several Hungarian settlements along the border.


Connections between geographical names


Visual representation

Nowadays most articles contain images too, not only texts and their role is getting more and more important since they make more people read the story. As for social media, a good photo is simply a must. Therefore, together with the news we also collected the images. For a reader, it is easy to decide which image goes with which piece of news but to do the same is a challenge for a computer. We used several heuristics to tackle this problem; we assumed, for instance, that images of a tiny size were either logos or other design elements. We took the date of the first publication into account on several websites because of the visual recommendations at the end of the articles. Finally, some of the most frequent images were removed manually. Since processing images requires extensive hardware resources, it was important to remove duplicates. Finally, we had 38266 images left appearing in 28456 documents 62762 times altogether.

However, it is impossible and not even worth going through all of them. To have any kind of idea of what these images are about, a tool is needed. Luckily there are more than just one way of processing images. We chose Clarifai which adds tags- even in Hungarian- to the photos. Having a special dataset, we couldn’t use our results in an instant. Clarifai seems to have done its internship on images of white, middle-class western people, since photos of masses shot in refugee camps were constantly tagged as “festival”, but some tags like “rally” and “entertainment” were also over-represented. It seems we need to learn to live with these shortcomings so we simply got rid of certain tags (e.g. musician), while we kept others (e.g. festival) but with a significantly modified meaning. For instance, the festival tag in this case may refer to either a crowd, often behind the wall of law enforcement officers, or refugees resting somewhere. Although imperfect, the tags enable us to transform visual information to textual one and this way we can analyze the dataset.

We classified the images into eight topics by using the LDA method. In case of embedded images it’s worth studying research on the visual representation of minorities, such as for example Bernáth- Messing or Wright. Representation strategies, which often aim to alienate, are well-known from the literature and can also be found among the categories. A typical example for this is, when refugees are shown as masses, their faces hardly recognizable; or as “waves of humans” flowing towards Europe. A sharp contrast with this representation is how the politicians are shown: clearly and openly with their names and faces. This contrast and the negative connotation are intensified by the fact that in most cases the face of a refugee gets known only when they are wanted by the police. Very often the first photo of the person is shot during police action. Other representation strategies are also revealed by the topic model results. There are images of war areas or smaller groups and families with children on their way which make us more sensitive to their fate. The following photomontages show images which are the most characteristic of certain topics.


Faceless crowd


Maps, charts and screenshots




War areas, refugee camps and temporary residence of refugees


Members of armed forces, soldiers of war and target countries


Portraits, close-ups and “wanted” photos


At the border, at the fence, on the road and on the water


Images of smaller groups: children, families and young people

Time distribution of the topics above


Migrants, refugees, immigrants: what is the media suggesting?


Migrants, refugees, immigrants: what is the media suggesting?


Migrants, refugees, immigrants: what is the media suggesting?


The extreme values at the topics “Faceless crowd” and “Images of smaller groups and families” are present partly because we are not yet able to perfectly detach the images belonging to the given article from the other images on the same page.

Is Kenny Baker the Kevin Bacon of Star Wars? Does every movie have a happy ending?

How do we quantify the importance of the nodes in a network? To answer this question mathematicians came up with the so-called Erdős number to show how far someone is from “the master” in a network of publications. Movie-enthusiasts have created the Bacon number as its analogy, based on co-occurrences in movies. But what does it have to do with Star Wars? Which character or actor is the key person in this universe? Is it really true that every blockbuster has a happy ending? We are trying to answer these questions with the revised version of our study carried out last year and hope to find answers with the help of interactive visualisations.

Erdős and Bacon

What is needed to create a new theory in network science? Apparently, a windy winter night is enough when Footloose and The Air Up There are on TV one after the other. And of course three American university students who having watched the movies begin to speculate: Kevin Bacon has played in so many movies that maybe there is no actor in Hollywood who hasn’t played with him yet. Well, probably it is not true, but backing up the theory with a bit of mathematics and research, a new term, the Bacon number has been born.

The Erdős number was defined in 1969 by Casper Goffmann in his famous article ‘And what is your Erdős number?’ It is based on a similar observation about the legendary productive Hungarian mathematician Paul Erdős who had so many publications in his life (approx. 1525 articles) in so different fields, that it was possible and worth classifing mathematicians and scientists based on their distance from Erdős in a network of publications. According to this, Paul Erdős’s Erdős nuber is 0, since he is the origo of this theory. Any scientist who has ever published anything together with Erdős, has the Erdős number 1. Anyone who has published together with someone with the Erdős number 1 will get the Erdős number 2, and so on. Generally speaking, everyone has the Erdős number of the person of the lowest Erdős number they have published with, plus one.

In case of Kevin Bacon and Hollywood the principle is the same, but instead of publications it is based on movies and the connection is not authoring an article with someone but playing in the same movie with someone. It is only a coincidence and a historical legacy that it is called Bacon number, because although Erdős is the most productive mathematician in history with almost twice as many publications as Euler, who came second on the list, Bacon is not really a central figure in Hollywood. If we check the network of actors in Hollywood, the average distance from everyone else is 2.79 in case of Bacon, which is enough only for the 876th place in the ranking. As a comparison Rod Steiger, who is the first on this list, has a value of 2.53.

One Saga, Seven Episodes

But what does it have to do with Kenny Baker? We were wondering who the Kevin Bacon of the Star Wars universe was therefore we collected the cast members of both the original and the prequel trilogy also adding the actors of Episode VII that was released last December. We visualised our findings on an interactive graph. The title – ‘The center of the Star Wars universe’ – is honorary, because the concept of distance related to the Bacon number can hardly be interpreted on this graph. Nevertheless, the prestige value of the origo and the position it occupies within the network can be a valid basis of comparison as well as the definition of the relations based on the co-starring of the actors.

On the visualisation – to make the network more transparent – we only show the actors who played in at least two different Star Wars movies. There is a relationship between two actors if they have starred in the same movie. The more movies the actors have co-starred in, the stronger their relationship is.


Network of actors having played in at least two different Star Wars movies. The interactive version of the graph can be found here.

By clicking on the nodes of the interactive visualisation you can see the number of movies the actors played in, which characters they embodied, as well as the number of their relations. The colors of the nodes correspond to the set of trilogies the actors played in. There is a clear distinction between actors only starring in the original – light blue – and the ones who played in the prequel trilogy – dark blue. This may not be so surprising considering that 16 years passed between the releases of Episode VI and I and 28 years between Episode IV and Episode III.

Naturally, there are actors who connect the two trilogies’ crew, although their number is limited. They are forming the nodes in the center of the network and as for their size they are the biggest ones. This also indicates that these actors have the largest number of relations and the highest number of shortest paths between two peaks. Actors of this group played in both the original and the prequel trilogy (light green nodes), another group of them additionally got roles in Episode VII as well (dark green nodes).

We can also find two additional subgroups on the graph. The light blue one shows the actors playing a key role in the original trilogy and in Episode VII as well. Carry Fisher playing Leia and Harrison Ford playing Han Solo are the most typical representatives of this category. Alec Guinness, who played Obi-Wan Kenobi in the original trilogy, may certainly be the most interesting member of this group. Although he passed away in 2000 he still appears in the credits of Episode VII, due to an archive voice recording. Finally, the only actor appearing in both the prequel trilogy and Episode VII is Ewan McGregor, also with a voice recording – it seems that the latest episode couldn’t decide which Jedi master to favor: the young or the old one.

The Big Four

Let’s take a look from a different angle and see how the actors, according to their characters, are set in the network of the Star Wars universe and who is the luckiest to call himself the origo.

There are four characters altogether who have appeared in all the seven Star Wars movies so far: Anakin Skywalker, Obi-Wan Kenobi, C-3PO and R2-D2. Of course the young and the old Anakin and Obi-Wan are played by different actors, therefore they can’t make it to the very top with the two droids. There was a very close competition between Kenny Baker (R2-D2) and Anthony Daniels (C-3PO), but in Episode VII Anthony Daniels took the leading role, since Kenny Baker was only a consultant in playing R2-D2. This fact however doesn’t affect their roles in the network since both of them appeared in the credits of all seven movies. What is more, they are both versatile actors; they played more than one character – Kenny Baker was also Paploo, the ewok in Episode VI and Anthony Daniels was also Dannl Faytonni in Episode II. Considering the recent death of Kenny Baker however, we decided to claim him the winner of the title ‘Kevin Bacon of the Star Wars universe’ as a posthumous award. (In reality, Anthony Daniels is just as worthy of the title as he is.)

The runner-up is of course Frank Oz, who played Yoda in six of the seven Star Wars movies (in Episode VII with his voice). Actors like Ian McDiarmid playing Senator Palpatine (in Episode V only on the DVD-edition) and Peter Mayhew playing Chewbacca- both actors played in five-five movies- have a distinctive place on the list. Last but not least actors of the original trilogy also appearing in Episode VII, like Carry Fisher or Mark Hamill may claim the third place.

The most universal node of the network is no doubt Natalie Portman playing Padmé Amidala in the prequel trilogy. Her Baker number is of course 1, her Bacon number is 2 and her Erdős number is 5. She did a PhD in Psychology at Harvard and published several papers, earning a decent Erdős number (among the 134 thousand scientists with Erdős number, the median is 5).

Sentimental Scenes

We automatically split the Star Wars movie scripts stored in the IMSDb database into scenes, then we analysed them with the help of Hu and Liu’s sentiment dictionary. The sentiment scores of each scene from all the episodes can be seen in the interactive visualisations below. The bars marked with brighter colors represent scenes with positive sentiment, the darker bars denote negative ones. The deeper a dark bar reaches the more negative the sentiment of a scene is; the higher a bright bar reaches, the more positive its sentiment is. In case of neutral sentiment scores there is no visible bar. If we point our cursor at the visualizations’ bars beside the exact sentiment score we can see the given scene’s location and the top 3 characters as well – i.e. the characters who either played or are mentioned in the scene.

Generally speaking, the episodes of Star Wars can be characterized mainly by negative sentiment – which is especially true for the episodes of the original trilogy (Episode IV, V and VI). The most negative ones are Episode V and VI and the most positive one is Episode II. In Episode VII the distribution of positive-negative sentiments is more similar to the movies of the prequel trilogy. If we look for the indicators of happy ending, we can find them in Episode I, III and V; these movies end with either positive or neutral scenes. Although positive scenes can be found at the end of each movie, based on the script analysis only half of the movies have a ‘happy ending’.


The sentiment scores of the original trilogy’s movies . The interactive version of the graph can be found here.


The sentiment scores of the prequel trilogy . The interactive version of the graph can be found here.


The sentiment scores of Episode VII. The interactive version of the graph can be found here.

Movies are worth analysing from the characters’ point of view as well. To do this another interactive data visualisation lends a helping hand which presents the dialogs between characters in a network format. It also shows which characters play most frequently in the movies and what kind of sentiment is typical when they occur.


The conversation graph of the prequel trilogy. The interactive version can be found here.

The conversation graphs reveal that the dialogs in the original trilogy were more focused and mainly the main characters were involved – several supporting actors didn’t even get an opportunity to speak out. In contrast, the conversations are more equally distributed between the main and the supporting characters in the episodes of the prequel trilogy. This trend can also be seen on the graph of Episode VII. The characters of Anakin Skywalker and Darth Vader are good examples of sentiment changes; since in the first two episodes Anakin equally appears in both negative and positive roles then a shift occurs: in the third episode he takes part in more and more scenes filled with negative sentiment, and after his transition to Darth Vader he appears almost only in negative scenes.


Written by Kitti Balogh, Virág Ilyés, and Gergely Morvay

Culture independence vs context dependency- Ekman’s “dangerous” theory

This post is part of a case study of emotion analysis focusing primarily on the theoretical background of text based emotion representation.

Here I wish to point out that exploring the field of text based emotions may reveal information otherwise inaccessible in sentiment analysis. Therefore it may even result in a different kind of benefit that enhances its value.

In order to find out what kind of emotions are “hiding” in texts it is first needed to be defined what we are actually looking for. The simplest solution seems to search for linguistic expressions explicitly indicating a certain emotion. Let’s take a look at some real-life examples:

1 XDDDDDDD well, you know even an innocent smiley can freak you out 🙂

2 Still terrified, the actress turned to the public.

The highlighted items seem worth collecting and adding to a dictionary based on the emotions they express. In order to do that however first the system of categorization need to be defined. The next obvious step for a linguist therefore is to check what psychology has to say about which emotion categories are worth the time.

The method above is the so-called current beaten track of emotion analysis– if such a track exists at all considering the insignificant number of international and Hungarian publications. While searching for the relevant psychological data the language technologist comes across Paul Ekman’s theory. According to Ekman there are six basic emotions– sadness, anger, fear, surprise, happiness and disgust– the facial expressions of which are universal, i.e. independent of the person’s cultural background and mean the same emotional state for everyone.


In the 1970s Ekman and Friesen developed the Facial Action Coding System (FACS) to taxonomize every human facial expression. The method, which is the result of decades of research, describes all observable facial movements for every emotion and by analysing them it determines the emotional state of the person. The fact that both genuine and fake emotions can be precisely identified is the eloquent proof of its reliability.

No wonder Ekman was named one of the top 100 most influential people in the May 2009 edition of Time magazine.


Paul Ekman and Tim Roth, the star of the TV series “Lie to me”.


The widespread popularity of this categorization provided a solid base for emotion analysis in language technology as well. Most relevant studies categorize emotion expressions either directly based on Ekman’s theory (Liu et al. 2003; Alm et al. 2005; Neviarouskaya et al. 2007 a,b; Aman-Szpakowicz 2007) or like us take it for their basis adding some other classes as well e.g.: attraction or tension (Szabó et al. 2015). The argument that these emotions are universal is so convincing that computational linguists almost forget to ask whether this is the very feature they need at all or if this otherwise important fact disguises features which should be an essential part of the analysis?

As I promised in the title of the post I intend to write about Ekman’s “dangerous” theory. I am referring to the book “Darwin’s Dangerous Idea: Evolution and the Meanings of life (1995)” by Daniel C. Dennett here and also drawing a parallel with Ekman’s theory. According to Dennett there are two reasons why Darwin’s theory may be dangerous: First, because his thoughts questioning the privileged role humans were said to enjoy in the universe profoundly shook the foundation of the traditional cosmological approach. He also doubted that life itself should actually have a peculiar ontological status. Second, according to Dennett Darwin’s theory is easy to misunderstand therefore it may generate dangerous misinterpretations. The reason why Ekman’s theory- the ability to read emotions on faces is innately hardwired- is “dangerous” is that it’s so convincing that other aspects of expressing emotions– like facial or linguistic– are easily ignored. One important factor is the role of context in the interpretation of emotions, and it is not exclusively about text analysis.  Let us take a closer look at the phenomenon:

In their article– Language as context for the perception of emotion, 2007– Barrett and her co-authors challenge the idea of innate emotion perception by using a certain photo as an example. The photo was taken of United States Senator Jim Webb celebrating his 2007 electoral victory. Experiments revealed when subjects saw the image of the Senator taken out of context (see image a.) they all said he looked angry and aggressive. When situated however in the original context subjects agreed that he appeared happy and excited.

The result is remarkable considering that not once did the subjects find the senator’s facial expression misunderstandable or confusing but came to the conflicting conclusions automatically and effortlessly.


Barrett (Barrett at al. 2007) considers this phenomenon a paradoxon since it’s rather controversial that there are six facial expressions which are biologically perfectly distinguishable but their interpretations may be absolutely context-dependent. The authors try to come up with an explanation such as words ground category acquisition, but in my opinion this argument is not convincing enough.

In exchange for the Ekman categories here linguistics seems to lend psychology a conceptual framework which needs to be traced back as far as Wilson and Sperber’s Relevance theory (2004). It argues that in any given communication the hearer or audience will search for meaning and having found the one that fits their expectation of relevance will stop processing. In the conceptual framework of lexical pragmatics it all means that the lexeme itself is nothing but an underspecified semantic representation. Consequently it gains its complete meaning only in context (Bibok 2014). Where does this underdetermined meaning come from? Obviously there must be a pragmatic knowledge embracing all information necessary for code development.

As all this sounds rather complicated let us demonstrate how the theory works with an example from the field of sentiment and emotion analysis.

3.a. Suspect of bestial double murder in custody. (mno.hu)

b An American lady had a formidable experience while taking part in a shark cage watch program in Mossel Bay, South-Africa. (www.erdekesvilag.hu)

4 Debut of a bestial Volkswagen GTI Supersport Vision Gran Turismo (…) A formidable fastback implementing other aspects of the “GTI” concept.(http://auto-live.hu/)

According to the idea introduced above in sentences 3a and 3b understanding the highlighted words is based on encyclopaedic information stored in our pragmatic knowledge. This means we have some kind of an idea based on our previous experience of what something bestial or formidable is like. This is basically the encyclopaedic information stored in the underspecified semantic representations of the expressions in question. Using these pieces of information we can find out what they meant to express in the given context. In sentence 4 this encyclopaedic information is not perfectly in line with the current context so the encyclopaedic information in the underspecified semantic representation is not enough and therefore „further” information is necessary. In example 4 the “further” information is the emotive feature of the expressions “bestial” and “formidable”. Consequently we can say that in a situation like this during interpretation it’s the semantic feature indicating emotion or intensity of the studied lexemes that gets activated instead of the prototypical or stereotypical meaning. Put more simply: we don’t think that the new Volkswagen is as bestial as a murder and we need to be scared but instead we know that it’s as effective, impressive and surprising as the amount of emotiveness the phrases “bestial” and “formidable” have.

Considering this process of interpretation a certain parallel may easily be detected between expressing emotions at a textual level and understanding the emotional information faces display. It is evident how these two processes are similar; we are able to interpret the word “bestial” correctly in a context where this interpretation is required based on the sheer emotive semantic features and ignore its prototypical or stereotypical meaning. We are also able to interpret the face of the senator displaying the obvious signs of anger as the expression of excitement and joy if this is the interpretation the context requires.

Although obviously exciting and remarkable in itself I did have a specific reason to discuss the theoretical parallel above. My primary goal was to point out that while emotion analysts (and let’s face it: sentiment analysts as well) often focus on categories, their problems and possibilities, they sometimes forget about significant aspects like the role of context in the interpretation of linguistic- in case of facial expressions non-linguistic- signs. As a result a relevant psychological theory that can successfully be applied in linguistics may easily become “dangerous”.


Alm, C.O.-Roth, D.-Sproat, R. 2005. Emotions from text: machine learning for textbased emotion prediction. In Proceedings of the Joint Conference on Human Language Technology / Empirical Methods in Natural Language Processing (HLT/EMNLP 2005). Vancouver, Canada. 579-586.

Aman, S.-Szpakowicz, S. 2007. Identifying Expressions of Emotion in Text. In Proceedings of the 10th International Conference on Text, Speech, and Dialogue (TSD- 2007), Plzeň, Czech Republic, Lecture Notes in Computer Science (LNCS). SpringerVerlag. 196-205.

Barrett, L.F.-Lindquist, K.A.-Gendron, M. 2007. Language as context in the perception of emotion.Trends in Cognitive Sciences 11. 327-332.

Bibok, K. 2014. Lexical semantics meets pragmatics. Argumentum 10. Debrecen University Press 221-231.

Ekman, P.-Friesen, W.V. 1969. The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica 1. 49-98.

Ekman, P.-Friesen, W. V.-Ellsworth, P. 1982. What emotion categories or dimensions can observers judge from facial behavior? In P. Ekman Ed. Emotion in the human face. New York: Cambridge University Press. 39-55.

Liu, H.-Lieberman, H.-Selker, T. 2003. A Model of Textual Affect Sensing using Real World Knowledge. In Proceedings of the International Conference on Intelligent User Interfaces, IUI 2003, Miami, Florida, USA.Wilson, D.-Sperber, D. 2004. Relevance Theory. In Ward, G.-Horn, L. eds. Handbook of Pragmatics. Oxford, Blackwell. 607−632.

Neviarouskaya, A.-Prendinger, H.-Ishizuka, M. 2007a. Analysis of affect expressed through the evolving language of online communication. In Proceedings of the 12th International Conference on Intelligent User Interfaces (IUI-07). Honolulu, Hawaii, USA. 278-281.

Neviarouskaya, A.-Prendinger, H.-Ishizuka, M. 2007b. Narrowing the Social Gap among People involved in Global Dialog: Automatic Emotion Detection in Blog Posts, In Proceedings of the International Conference on Weblogs and Social Media (ICWSM 2007). Boulder, Colorado, USA. 293-294.

M.K.Szabó– V. Vincze– G. Morvay 2015. Challenges in theoretical linguistics and language technology of Hungarian textbased emotion analysis. Language– Language technology– Language Pedagogy 21st century outlook 25 MANYE Congress, Budapest

Our Hungarian Sentiment Lexicon Is Available on opendata.hu

We’ve just released our Hungarian sentiment lexicons on opendata.hu, the Hungarian open data hub. You can download our sentiment lexicons here.


The sentiment dictionaries were created for automated sentiment analysis of Hungarian texts. The dictionaries were manually created on the basis of this English lexicon. The dictionaries are in plain text format with UTF-8 encoding.
The sentiment dictionary consists of lists of positive and negative polarity words .
The lexicon consist of two lists, one with 1748 positive, and one with 5940 negative words.

The dictionaries are freely available for research purposes under Attribution-NonCommercial 4.0 international Creative Commons License provided that the user properly cites the appropriate paper in the Reference section below.

For use of the sentiment dictionaries please refer to Szabó (2014).
Commercial users should contact us at labs(at)precognox(dot)com

Martina Katalin Szabo, Gergely Morvay, Zoltan Varju, Zsofi Nyiri, Zsolt Hajnal

Szabó Martina Katalin 2014. Egy magyar nyelvű szentimentlexikon létrehozásának tapasztalatai [Experiences of creation of a Hungarian sentiment lexicon]. Conference „Nyelv, kultúra, társadalom”, Budapest, Hungary.