2016 in Retrospect


Time flies and the end of the year is coming, so it’s high time to summarize what’s happened to us in 2016.

Precognox in the world


We are participating in the KConnect Horizon 2020 project, which aims to bring semantic technologies into the medical field. We are proud to be a partner in a truly European project!

This year, Precognox visited the New World and built a partnership with Basis Technology. One of our colleagues spent three months in Boston, MA as the first step of our co-operation.

We are truly multilingual: we worked with texts in (Simplified) Mandarin Chinese, Spanish, Arabic, Russian, English, and Hungarian. We gained experience with these languages as part of our projects with Meltwater, the biggest social media monitoring company.

Business as usual

According to the basic law of software development, projects occupy the available resources, and more resources mean more projects. Precognox is no exception: our team is growing, so we are managing more and more projects. We are continuously working on large-scale, Java-based software development projects for various customers; just have a look at the list of our customers and you'll understand why I mention only one of them here. We are about to start major enterprise search and text mining projects, one of them being an upgrade of the semantic search solutions developed for Profession's online job search portals. Precognox has been working on the backend of Profession's sites for years, so we have literally grown up with it; it taught us a lot about enterprise search, and we are excited about the upgrade.


We have a new product called TAS (Text Analytics System). We have had several data collection and cleaning projects, and we distilled our experiences into a new tool. TAS helps you collect, clean and analyze unstructured data; learn more about it on our website.

For profit, and for the greater good

For years, Precognox has employed trainees. Usually, we have software developer and data analyst trainees who work with us on a part-time basis, and we welcome students for summer internships too. We are very proud of our former trainees: many of them started their careers at top companies, one is doing his PhD in the Netherlands, and several are now our full-time colleagues. From this September, we are participating in the new collaborative teaching scheme, which means the incoming students spend one or two days a week at the university as ordinary students and the rest of the week at our company as full-time employees. We believe that this practice-oriented scheme will help students jumpstart their careers upon graduation.

This year we were working on data driven projects with two NGOs and two research institutions.

We worked on an information visualization dashboard with EMMA (an NGO dedicated to informing and helping pregnant women). As part of a European project, EMMA's volunteers interviewed many women across the country about their experiences during pregnancy and motherhood, and we analyzed this data using various text mining tools. This project helped us design a workflow for rapidly prototyping text mining solutions; you can find projects based on it here and here. We do hope EMMA can use our dashboard for analyzing their data and that we can work together on interesting projects in the future.


This summer, we started working with Járókelő, a platform for reporting potholes and other urban problems to the authorities. We'd like to develop a scoring mechanism for the stakeholders.


 

We are processing public procurement data for the Corruption Research Centre Budapest and the Government Transparency Institute. Our partners' research on monitoring procurement-related corruption was recently featured in The Economist.

 

Precognox is committed to open data; that's why we published our Hungarian sentiment lexicon on opendata.hu under a permissive licence.

 

We publish our research projects on Nyelv és Tudomány (Language and Science, a popular science online magazine). For example, in 2016 we wrote a long article on the representation of migrants in the Hungarian online media, published several pieces on the social and ethical questions of AI and big data, and made style transfer videos for the portal.

Work should be fun!

While we have lots of projects, we are continuously improving ourselves. That's why we have been organizing the Hungarian Natural Language Processing Meetup since 2012. This year, we teamed up with Meltwater and nyest.hu and took the meetup to the next level. We had six meetings with speakers from industry and academia. Two meetups were held in English, with speakers from Oxford, San Francisco (Meltwater), and London (BlackSwan).


Precognox is a distributed company with offices in Kaposvár and Budapest, and team members from Szeged and other parts of the country. Several times a year, we get together to talk about our projects and just to have a blast. Of course, we are real geeks, so we organized in-house hackathons at these events and loved hacking on data projects.

 

We are addicted to conferences. Every year, we attend MSZNY (the Hungarian Computational Linguistics conference), BI Forum (the yearly business intelligence conference in Hungary) and many more. We are happy to present our research to the public and get feedback from the community. Also, we love sharing our knowledge with others; this year, for example, we gave a lesson on text mining at Kürt Academy's Data Science course and a lesson on content analysis and text mining for master's students at the Statistics Department of ELTE TATK.

Infoviz

This year, we made lots of dashboards for profit and to help scientific inquiries. Having finished these projects, we felt the need for introspection. Although we worked hard to show what data tells us, we didn't use the full potential of data analysis for advancing humanity. We needed a reason to continue our efforts, we needed a new goal. We turned to the Jedi Church for consolation, the church connected us to the Force, and the Force helped us visualize the Star Wars texts.


We are so artsy

Everything started with a job ad. We were looking for a new intern and needed a photo for the post describing the ideal applicant. It seemed a good idea to give style transfer a try and attach an image of our team in the style of Iranian mosaics.


Later, our Budapest unit moved to a new office, so we thought it would be a good idea to develop our own decoration for the new place. The results are hilarious: a new typeface (yes!) with characters composed from graphs, and the following pictures.

Finally, “SEMMI” (meaning “nothing”) got on the wall of our room.


We are very keen on style transfer, so we made videos too.

 

Having worked with pictures and characters, we needed a new challenge, so we sonified emotion time series extracted from Hungarian news published during the migration crisis.

And now for something completely different

This year we worked hard and played hard; it's time to have a short break. Next year, Precognox will start offering new solutions to its customers and exciting new projects to its employees. Stay tuned, we're going to blog about these!

 

 

Is Kenny Baker the Kevin Bacon of Star Wars? Does every movie have a happy ending?


How do we quantify the importance of the nodes in a network? To answer this question, mathematicians came up with the so-called Erdős number to show how far someone is from “the master” in a network of publications. Movie enthusiasts have created the Bacon number as its analogy, based on co-occurrences in movies. But what does this have to do with Star Wars? Which character or actor is the key person in this universe? Is it really true that every blockbuster has a happy ending? We try to answer these questions with the revised version of our study carried out last year, and hope to find answers with the help of interactive visualisations.

Erdős and Bacon

What is needed to create a new theory in network science? Apparently, a windy winter night is enough when Footloose and The Air Up There are on TV one after the other. And of course three American university students who, having watched the movies, begin to speculate: Kevin Bacon has played in so many movies that maybe there is no actor in Hollywood who hasn't played with him yet. Well, that is probably not true, but backed up with a bit of mathematics and research, a new term, the Bacon number, was born.

The Erdős number was defined in 1969 by Casper Goffman in his famous article ‘And what is your Erdős number?’. It is based on a similar observation about the legendarily productive Hungarian mathematician Paul Erdős, who had so many publications in his life (approx. 1525 articles) in so many different fields that it was possible and worthwhile to classify mathematicians and scientists based on their distance from Erdős in a network of publications. According to this, Paul Erdős's Erdős number is 0, since he is the origin of this theory. Any scientist who has ever published anything together with Erdős has the Erdős number 1. Anyone who has published together with someone with the Erdős number 1 gets the Erdős number 2, and so on. Generally speaking, everyone has the Erdős number of the lowest-numbered person they have published with, plus one.

In the case of Kevin Bacon and Hollywood the principle is the same, but instead of publications it is based on movies, and the connection is not co-authoring an article but playing in the same movie. It is only a coincidence and a historical legacy that it is called the Bacon number, because although Erdős is the most productive mathematician in history, with almost twice as many publications as Euler, who comes second on the list, Bacon is not really a central figure in Hollywood. If we check the network of actors in Hollywood, Bacon's average distance from everyone else is 2.79, which is enough only for the 876th place in the ranking. As a comparison, Rod Steiger, who is first on this list, has a value of 2.53.

One Saga, Seven Episodes

But what does this have to do with Kenny Baker? We were wondering who the Kevin Bacon of the Star Wars universe was, so we collected the cast members of both the original and the prequel trilogy, also adding the actors of Episode VII, which was released last December. We visualised our findings on an interactive graph. The title – ‘The center of the Star Wars universe’ – is honorary, because the concept of distance related to the Bacon number can hardly be interpreted on this graph. Nevertheless, the prestige value of the origin and the position it occupies within the network can be a valid basis of comparison, as can the relations defined by the actors' co-starring.

On the visualisation – to make the network more transparent – we only show the actors who played in at least two different Star Wars movies. There is a relationship between two actors if they have starred in the same movie. The more movies the actors have co-starred in, the stronger their relationship is.
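Translated into code, this is a straightforward weighted co-occurrence graph. Below is a minimal networkx sketch with placeholder cast lists (not our actual dataset); edge weights count shared movies, and actors appearing in only one movie are dropped, mirroring the filtering described above.

```python
# Sketch: building the actor co-occurrence graph with networkx.
# The cast lists are illustrative placeholders, not our full dataset.
from itertools import combinations
import networkx as nx

casts = {
    "Episode IV": ["Mark Hamill", "Carrie Fisher", "Anthony Daniels", "Kenny Baker"],
    "Episode V":  ["Mark Hamill", "Carrie Fisher", "Anthony Daniels", "Kenny Baker"],
    "Episode I":  ["Ewan McGregor", "Natalie Portman", "Anthony Daniels", "Kenny Baker"],
}

G = nx.Graph()
for cast in casts.values():
    for a, b in combinations(sorted(set(cast)), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1      # edge weight = number of shared movies
        else:
            G.add_edge(a, b, weight=1)

# Keep only actors who appear in at least two different movies, as on the visualisation.
appearances = {}
for cast in casts.values():
    for actor in set(cast):
        appearances[actor] = appearances.get(actor, 0) + 1
G.remove_nodes_from([a for a in list(G.nodes) if appearances[a] < 2])

print(G["Anthony Daniels"]["Kenny Baker"]["weight"])   # -> 3 shared movies in this toy data
```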


Network of actors having played in at least two different Star Wars movies. The interactive version of the graph can be found here.

By clicking on the nodes of the interactive visualisation you can see the number of movies the actors played in, which characters they embodied, as well as the number of their relations. The colors of the nodes correspond to the set of trilogies the actors played in. There is a clear distinction between actors only starring in the original – light blue – and the ones who played in the prequel trilogy – dark blue. This may not be so surprising considering that 16 years passed between the releases of Episode VI and I and 28 years between Episode IV and Episode III.

Naturally, there are actors who connect the two trilogies' casts, although their number is limited. They form the nodes in the center of the network and are also the largest ones. This indicates that these actors have the largest number of relations and lie on the highest number of shortest paths between nodes. Actors of this group played in both the original and the prequel trilogy (light green nodes), and some of them additionally got roles in Episode VII as well (dark green nodes).

We can also find two additional subgroups on the graph. The light blue one shows the actors playing a key role in the original trilogy and in Episode VII as well. Carrie Fisher, playing Leia, and Harrison Ford, playing Han Solo, are the most typical representatives of this category. Alec Guinness, who played Obi-Wan Kenobi in the original trilogy, may be the most interesting member of this group: although he passed away in 2000, he still appears in the credits of Episode VII thanks to an archive voice recording. Finally, the only actor appearing in both the prequel trilogy and Episode VII is Ewan McGregor, also with a voice recording – it seems the latest episode couldn't decide which Jedi master to favor: the young or the old one.

The Big Four

Let's take a look from a different angle and see how the actors, according to their characters, are placed in the network of the Star Wars universe, and who is the luckiest to call himself the origin.

There are four characters altogether who have appeared in all seven Star Wars movies so far: Anakin Skywalker, Obi-Wan Kenobi, C-3PO and R2-D2. Of course, the young and the old Anakin and Obi-Wan are played by different actors, so they can't make it to the very top alongside the two droids. There was a very close competition between Kenny Baker (R2-D2) and Anthony Daniels (C-3PO), but in Episode VII Anthony Daniels took the leading role, since Kenny Baker was only a consultant for R2-D2. This fact, however, doesn't affect their roles in the network, since both of them appear in the credits of all seven movies. What is more, they are both versatile actors who played more than one character – Kenny Baker was also Paploo the Ewok in Episode VI, and Anthony Daniels was also Dannl Faytonni in Episode II. Considering the recent death of Kenny Baker, however, we decided to declare him the winner of the title ‘Kevin Bacon of the Star Wars universe’ as a posthumous award. (In reality, Anthony Daniels is just as worthy of the title as he is.)

The runner-up is of course Frank Oz, who played Yoda in six of the seven Star Wars movies (in Episode VII only with his voice). Actors like Ian McDiarmid, playing Senator Palpatine (in Episode V only in the DVD edition), and Peter Mayhew, playing Chewbacca – both of whom played in five movies – have a distinctive place on the list. Last but not least, actors of the original trilogy also appearing in Episode VII, like Carrie Fisher or Mark Hamill, may claim the third place.

The most universal node of the network is no doubt Natalie Portman, who played Padmé Amidala in the prequel trilogy. Her Baker number is of course 1, her Bacon number is 2 and her Erdős number is 5. She studied psychology at Harvard and published several papers, earning a decent Erdős number (among the roughly 134 thousand scientists with an Erdős number, the median is 5).

Sentimental Scenes

We automatically split the Star Wars movie scripts stored in the IMSDb database into scenes, then analysed them with the help of Hu and Liu's sentiment lexicon. The sentiment scores of each scene from all the episodes can be seen in the interactive visualisations below. The bars marked with brighter colors represent scenes with positive sentiment, the darker bars denote negative ones. The deeper a dark bar reaches, the more negative the sentiment of the scene; the higher a bright bar reaches, the more positive its sentiment. In the case of neutral sentiment scores there is no visible bar. Hovering over a bar shows, besides the exact sentiment score, the scene's location and its top 3 characters – i.e. the characters who either appear or are mentioned in the scene.
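For readers curious about the scoring step, here is a rough Python sketch of lexicon-based scene scoring. The tiny word lists below merely stand in for Hu and Liu's lexicon, and the scoring rule (positive hits minus negative hits per scene) is our assumption of a simple, plausible variant rather than the exact formula behind the charts.

```python
import re

# Tiny stand-ins for the Hu & Liu opinion lexicon (the real lists contain thousands of words).
positive = {"hope", "love", "victory", "free", "happy"}
negative = {"fear", "dark", "destroy", "dead", "trap"}

def scene_sentiment(scene_text):
    """Score = positive hits minus negative hits (an assumed, simplified rule)."""
    tokens = re.findall(r"[a-z']+", scene_text.lower())
    return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)

scenes = [
    "INT. DEATH STAR - the rebels fear the dark trap set for them",
    "EXT. THRONE ROOM - a happy victory celebration, hope is restored",
]
print([scene_sentiment(s) for s in scenes])   # -> [-3, 3]
```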

Generally speaking, the episodes of Star Wars are characterized mainly by negative sentiment – which is especially true for the episodes of the original trilogy (Episodes IV, V and VI). The most negative ones are Episodes V and VI, and the most positive one is Episode II. In Episode VII the distribution of positive and negative sentiments is more similar to the movies of the prequel trilogy. If we look for the indicators of a happy ending, we can find them in Episodes I, III and V; these movies end with either positive or neutral scenes. Although positive scenes can be found near the end of each movie, based on the script analysis only half of the movies have a ‘happy ending’.


The sentiment scores of the original trilogy's movies. The interactive version of the graph can be found here.


The sentiment scores of the prequel trilogy. The interactive version of the graph can be found here.


The sentiment scores of Episode VII. The interactive version of the graph can be found here.

Movies are also worth analysing from the characters' point of view. Another interactive data visualisation lends a helping hand here: it presents the dialogs between characters in a network format, and also shows which characters appear most frequently in the movies and what kind of sentiment is typical when they do.


The conversation graph of the prequel trilogy. The interactive version can be found here.

The conversation graphs reveal that the dialogs in the original trilogy were more focused and mainly involved the main characters – several supporting characters didn't even get an opportunity to speak. In contrast, the conversations are more equally distributed between the main and the supporting characters in the episodes of the prequel trilogy. This trend can also be seen on the graph of Episode VII. The characters of Anakin Skywalker and Darth Vader are good examples of sentiment changes: in the first two episodes Anakin appears equally in negative and positive roles, then a shift occurs – in the third episode he takes part in more and more scenes filled with negative sentiment, and after his transformation into Darth Vader he appears almost only in negative scenes.

 

Written by Kitti Balogh, Virág Ilyés, and Gergely Morvay

Sounds of a Story: Sonification of emotion time series extracted from Hungarian news published during the migration crisis

We harvested more than forty-two thousand articles on migration published on the main Hungarian news portals between 27/09/2014 and 11/06/2016. You can find an information visualization dashboard based on the corpus here. This sonification and the accompanying visualization are experimental tools; their sole purpose is to give you a glimpse into how the emotions related to migration flowed through the online media. If you'd like to know more about the data, use our dashboard. If you speak Hungarian, you can read our article on nyest.hu.

How it’s made

  • Emotion time series were extracted by using our own emotion lexicons.
  • Time series were mapped to midi notes by using the MIDITime Python library.
  • We used the Music21 library for assigning instruments to emotions.
    • distress: Violin
    • joy: Xylophone
    • fear: ChurchBells
    • anger: Woodblock
    • surprise: Bagpipes
    • disgust: Horn
  • The separate MIDI files were merged into a sound file by using LMMS.
  • The video was made by using the ggplot2 R package for plotting the emotion scores for every week.
  • Finally, we used ffmpeg to make a video from the plots and the sonified time series.
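As an illustration of the second step, here is a minimal sketch of mapping one emotion time series to MIDI notes. It assumes MIDITime's basic API (a MIDITime(tempo, outfile) object plus add_track() and save_midi()); the weekly scores are made-up sample values, and the pitch scaling is deliberately simplified.

```python
# Sketch: mapping one weekly emotion time series (e.g. "fear") to MIDI notes.
from miditime.miditime import MIDITime

fear_scores = [0.1, 0.4, 0.9, 0.3, 0.7]          # made-up weekly emotion scores in [0, 1]
mymidi = MIDITime(120, "fear.mid")               # 120 BPM, output file

notes = []
for week, score in enumerate(fear_scores):
    pitch = int(48 + score * 24)                 # scale [0, 1] onto two octaves above C3
    notes.append([week * 4, pitch, 100, 4])      # [beat, pitch, velocity, duration]

mymidi.add_track(notes)
mymidi.save_midi()
# The per-emotion MIDI files were then given instruments (Music21) and merged (LMMS).
```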

Team
Kitti Balogh
Zoltan Varju

Are keyboards changing our thinking? The QWERTY-effect

The QWERTY-effect as a concept first appeared in a study by Daniel Casasanto & Kyle Jasmin. In their research paper, Casasanto and Jasmin (hereinafter C&J) argue that because of the keyboard's asymmetrical shape (more letters on the left than on the right on English, Spanish or Dutch keyboards), letter combinations that fall on the right side of the keyboard tend to be easier to type than those on the left. Therefore words dominated by right-side letters subtly gain favor in our mind and are regarded as more appealing.
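To make the "right-side letters" idea concrete, here is a toy computation of a word's right-side advantage (RSA), i.e. right-hand letters minus left-hand letters on a standard QWERTY layout. The metric follows C&J's description; the code itself is only an illustration.

```python
# Toy "right-side advantage" (RSA) on a standard QWERTY layout:
# RSA = (letters typed by the right hand) - (letters typed by the left hand).
LEFT = set("qwertasdfgzxcvb")
RIGHT = set("yuiophjklnm")

def rsa(word):
    w = word.lower()
    return sum(c in RIGHT for c in w) - sum(c in LEFT for c in w)

print(rsa("lion"), rsa("sad"))   # 4 vs. -3: "lion" is right-heavy, "sad" is left-heavy
```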


What C&J say is that the position of the keys and the emotional valence of words are related. This effect may be even stronger in the case of words coined after the 60s.


Well, so much for theory.

The researchers went even further by suggesting that if people tend to favor the positive side of the keyboard it may influence parents when picking names for their babies.


Language Log “made mincemeat” of this theory: it ripped the whole article and the QWERTY-effect apart, questioning practically every single sentence while re-examining and statistically analyzing the data on the same corpora. They didn't find any significant effects, but they came up with lots of interesting questions, for example: why should the 60s be the dividing line for name-giving tendencies? The phenomenon could be studied on a wider spectrum. The blog did exactly that and found that the name preference discovered by C&J appears under different circumstances as well. This, however, suggests that the popularity of certain names, and not the QWERTY-effect, could be the explanation.

Despite all this, the authors (C&J) wanted to do a proper job and eventually they did find a relevant significant influence – although others were not so easily convinced. All in all, it seems there must be something there, so this theory is well worth a mass or two.


Even if we don't go as far as to say that QWERTY influences name-giving trends, it is remarkable that since the birth and rapid spread of the internet the way we communicate has dramatically changed. Language is no longer solely oral; more and more of our word production happens on our keyboards. Although the source – our thoughts – is still the same, the way of expression has considerably changed, and a great part of it has shifted to the keyboard.
That there is some influence here can hardly be debated. What it affects and how is a difficult question to answer, though. What I find fascinating in Casasanto and Jasmin's work is the part which says that, to a certain extent, the keyboard is shaping the meaning of words. I also have the impression that popular media somewhat overlooks this point. No matter how slight this meaning-modifying effect might be, and even if the emotional valence of the word itself – whether it has a negative or positive connotation – probably outweighs the QWERTY-induced associations, its presence is still a remarkable phenomenon.

That's why we decided to experiment a little using a Hungarian keyboard – which is special in this case because more letters can be found on the right and fewer on the left, so the asymmetry is reversed.
Should we find even a tiny difference in the reverse direction, it would be one more piece of evidence that the assumption is correct and that the way keys are positioned does have an effect on physical and, consequently, psychological well-being – which in turn influences the meaning of words when we read, speak or listen. We have chosen to test the effect traceable while reading. Our findings will be reported in our next post.

For those who wish to lose themselves in the topic, here’s the link to the original article:

http://link.springer.com/article/10.3758%2Fs13423-012-0229-7

Here’s a short summary presented by WIRED:  http://www.wired.com/2012/03/qwerty-effect-language/

Here’s the post of Language Log on the QWERTY-effect. The comments are worth reading too: http://languagelog.ldc.upenn.edu/nll/?p=3829

Another post from Language Log on the name giving trends with neat little graphs showing their results: http://languagelog.ldc.upenn.edu/nll/?p=12378

by Anna Régeni

Young Statistician Meeting 2016


This week we are presenting our research on using topic models in search and content analysis at the Young Statistician Meeting 2016. You can find our abstract and the accompanying slides below.


Kitti Balogh: Unveiling latent topic structure in anti-Roma discourse using Latent Dirichlet Allocation 

From the mid-2000s the number of anti-Roma and racist utterances has been increasing in Hungary, and this manner of speech has also become accepted in common discourse. The research focused on extracting anti-Roma topics over this period using a hierarchical Bayesian model called Latent Dirichlet Allocation (LDA). The source of the analysis was collected from the kuruc.info online news portal, which is the flagship of the far-right media in Hungary. The corpus consists of more than 10,000 anti-Roma news items from 2006 until 2015. 27 anti-Roma topics were extracted using LDA, which makes it possible to analyze the distribution of the various topics over time and see how they are connected to the most influential events of the period under investigation. The identified topics correspond to categories identified by qualitative studies on Roma media representation in Hungary. Our research suggests that topic modeling could be a useful addition to the toolbox of traditional qualitative discourse analysis researchers. Our research project culminated in an interactive data visualization and a data visualization dashboard, which can be accessed at the following links:
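As an illustration of the modeling step, here is a minimal gensim sketch of fitting an LDA model. The documents are placeholders, and the Hungarian preprocessing (tokenisation, stop-word removal, lemmatisation) used in the real project is omitted.

```python
# Sketch: fitting an LDA topic model with gensim (placeholder documents).
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["roma", "crime", "village", "police"],
    ["segregation", "school", "roma", "children"],
    ["police", "investigation", "village", "suspect"],
]  # in the real project: tokenised, stop-word-filtered kuruc.info articles

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# num_topics=27 matches the number of anti-Roma topics reported above.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=27, passes=10, random_state=1)
for topic_id, words in lda.print_topics(num_topics=3, num_words=4):
    print(topic_id, words)
```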

 

Culture independence vs context dependency – Ekman's “dangerous” theory

This post is part of a case study of emotion analysis, focusing primarily on the theoretical background of text-based emotion representation.

Here I wish to point out that exploring the field of text-based emotions may reveal information otherwise inaccessible to sentiment analysis, and may therefore yield a different kind of benefit that enhances its value.

In order to find out what kind of emotions are “hiding” in texts, we first need to define what we are actually looking for. The simplest solution seems to be to search for linguistic expressions explicitly indicating a certain emotion. Let's take a look at some real-life examples:

1 XDDDDDDD well, you know even an innocent smiley can freak you out 🙂

2 Still terrified, the actress turned to the public.

The highlighted items seem worth collecting and adding to a dictionary based on the emotions they express. In order to do that, however, the system of categorization first needs to be defined. The next obvious step for a linguist is therefore to check what psychology has to say about which emotion categories are worth the time.

The method above is the current beaten track of emotion analysis – if such a track exists at all, considering the small number of international and Hungarian publications. While searching for the relevant psychological literature, the language technologist comes across Paul Ekman's theory. According to Ekman there are six basic emotions – sadness, anger, fear, surprise, happiness and disgust – whose facial expressions are universal, i.e. independent of the person's cultural background, and mean the same emotional state for everyone.


In the 1970s Ekman and Friesen developed the Facial Action Coding System (FACS) to taxonomize every human facial expression. The method, the result of decades of research, describes the observable facial movements for every emotion, and by analysing them it determines the emotional state of the person. The fact that both genuine and fake emotions can be precisely identified is eloquent proof of its reliability.

No wonder Ekman was named one of the top 100 most influential people in the May 2009 edition of Time magazine.


Paul Ekman and Tim Roth, the star of the TV series “Lie to me”.

(www.paulekman.com)

The widespread popularity of this categorization provided a solid basis for emotion analysis in language technology as well. Most relevant studies categorize emotion expressions either directly based on Ekman's theory (Liu et al. 2003; Alm et al. 2005; Neviarouskaya et al. 2007a,b; Aman-Szpakowicz 2007) or, like us, take it as their basis while adding some other classes as well, e.g. attraction or tension (Szabó et al. 2015). The argument that these emotions are universal is so convincing that computational linguists almost forget to ask whether this is the very feature they need at all, or whether this otherwise important fact disguises features that should be an essential part of the analysis.

As I promised in the title of the post, I intend to write about Ekman's “dangerous” theory. I am referring here to Daniel C. Dennett's book “Darwin's Dangerous Idea: Evolution and the Meanings of Life” (1995) and drawing a parallel with Ekman's theory. According to Dennett there are two reasons why Darwin's theory may be dangerous. First, his thoughts questioning the privileged role humans were said to enjoy in the universe profoundly shook the foundations of the traditional cosmological approach; he also doubted that life itself should have a peculiar ontological status. Second, according to Dennett, Darwin's theory is easy to misunderstand and may therefore generate dangerous misinterpretations. The reason why Ekman's theory – that the ability to read emotions on faces is innately hardwired – is “dangerous” is that it is so convincing that other aspects of expressing emotions, facial or linguistic, are easily ignored. One important factor is the role of context in the interpretation of emotions, and not exclusively in text analysis. Let us take a closer look at the phenomenon:

In their article – Language as context for the perception of emotion, 2007 – Barrett and her co-authors challenge the idea of innate emotion perception by using a certain photo as an example. The photo was taken of United States Senator Jim Webb celebrating his 2007 electoral victory. Experiments revealed that when subjects saw the image of the senator taken out of context (see image a), they all said he looked angry and aggressive. When it was placed back in its original context, however, subjects agreed that he appeared happy and excited.

The result is remarkable considering that not once did the subjects find the senator's facial expression confusing or hard to read; they came to the conflicting conclusions automatically and effortlessly.


Barrett (Barrett et al. 2007) considers this phenomenon a paradox, since it is rather controversial that there are six facial expressions which are biologically perfectly distinguishable, yet their interpretations may be absolutely context-dependent. The authors try to come up with an explanation, such as that words ground category acquisition, but in my opinion this argument is not convincing enough.

In exchange for the Ekman categories, here linguistics seems to lend psychology a conceptual framework, which can be traced back as far as Wilson and Sperber's Relevance Theory (2004). It argues that in any given communication the hearer or audience will search for meaning and, having found the one that fits their expectation of relevance, will stop processing. In the conceptual framework of lexical pragmatics this means that the lexeme itself is nothing but an underspecified semantic representation. Consequently, it gains its complete meaning only in context (Bibok 2014). Where does this underdetermined meaning come from? Obviously there must be a pragmatic knowledge embracing all the information necessary for interpretation.

As all this sounds rather complicated, let us demonstrate how the theory works with an example from the field of sentiment and emotion analysis.

3.a. Suspect of bestial double murder in custody. (mno.hu)

3.b. An American lady had a formidable experience while taking part in a shark cage watching program in Mossel Bay, South Africa. (www.erdekesvilag.hu)

4. Debut of a bestial Volkswagen GTI Supersport Vision Gran Turismo (…) A formidable fastback implementing other aspects of the “GTI” concept. (http://auto-live.hu/)

According to the idea introduced above, in sentences 3a and 3b understanding the highlighted words is based on encyclopaedic information stored in our pragmatic knowledge. This means we have some kind of idea, based on our previous experience, of what something bestial or formidable is like. This is basically the encyclopaedic information stored in the underspecified semantic representations of the expressions in question. Using these pieces of information we can find out what they are meant to express in the given context. In sentence 4 this encyclopaedic information is not perfectly in line with the current context, so the encyclopaedic information in the underspecified semantic representation is not enough and “further” information is necessary. In example 4 the “further” information is the emotive feature of the expressions “bestial” and “formidable”. Consequently, we can say that in a situation like this it is the semantic feature indicating emotion or intensity of the studied lexemes that gets activated during interpretation, instead of the prototypical or stereotypical meaning. Put more simply: we don't think that the new Volkswagen is as bestial as a murder and that we need to be scared; instead we know that it is as effective, impressive and surprising as the emotiveness of the phrases “bestial” and “formidable” suggests.

Considering this process of interpretation, a parallel may easily be detected between expressing emotions at the textual level and understanding the emotional information faces display. It is evident how the two processes are similar: we are able to interpret the word “bestial” correctly in a context where this interpretation is required, based on its sheer emotive semantic features, ignoring its prototypical or stereotypical meaning. We are likewise able to interpret the face of the senator displaying the obvious signs of anger as the expression of excitement and joy, if this is the interpretation the context requires.

Although the theoretical parallel above is exciting and remarkable in itself, I did have a specific reason to discuss it. My primary goal was to point out that while emotion analysts (and let's face it: sentiment analysts as well) often focus on categories, their problems and their possibilities, they sometimes forget about significant aspects like the role of context in the interpretation of linguistic – and, in the case of facial expressions, non-linguistic – signs. As a result, a relevant psychological theory that can successfully be applied in linguistics may easily become “dangerous”.

References

Alm, C.O.-Roth, D.-Sproat, R. 2005. Emotions from text: machine learning for text-based emotion prediction. In Proceedings of the Joint Conference on Human Language Technology / Empirical Methods in Natural Language Processing (HLT/EMNLP 2005). Vancouver, Canada. 579-586.

Aman, S.-Szpakowicz, S. 2007. Identifying Expressions of Emotion in Text. In Proceedings of the 10th International Conference on Text, Speech, and Dialogue (TSD-2007), Plzeň, Czech Republic, Lecture Notes in Computer Science (LNCS). Springer-Verlag. 196-205.

Barrett, L.F.-Lindquist, K.A.-Gendron, M. 2007. Language as context in the perception of emotion. Trends in Cognitive Sciences 11. 327-332.

Bibok, K. 2014. Lexical semantics meets pragmatics. Argumentum 10. Debrecen University Press 221-231.

Ekman, P.-Friesen, W.V. 1969. The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica 1. 49-98.

Ekman, P.-Friesen, W. V.-Ellsworth, P. 1982. What emotion categories or dimensions can observers judge from facial behavior? In P. Ekman Ed. Emotion in the human face. New York: Cambridge University Press. 39-55.

Liu, H.-Lieberman, H.-Selker, T. 2003. A Model of Textual Affect Sensing using Real World Knowledge. In Proceedings of the International Conference on Intelligent User Interfaces, IUI 2003, Miami, Florida, USA.

Wilson, D.-Sperber, D. 2004. Relevance Theory. In Ward, G.-Horn, L. eds. Handbook of Pragmatics. Oxford, Blackwell. 607−632.

Neviarouskaya, A.-Prendinger, H.-Ishizuka, M. 2007a. Analysis of affect expressed through the evolving language of online communication. In Proceedings of the 12th International Conference on Intelligent User Interfaces (IUI-07). Honolulu, Hawaii, USA. 278-281.

Neviarouskaya, A.-Prendinger, H.-Ishizuka, M. 2007b. Narrowing the Social Gap among People involved in Global Dialog: Automatic Emotion Detection in Blog Posts, In Proceedings of the International Conference on Weblogs and Social Media (ICWSM 2007). Boulder, Colorado, USA. 293-294.

Szabó, M.K.-Vincze, V.-Morvay, G. 2015. Challenges in theoretical linguistics and language technology of Hungarian text-based emotion analysis. Language – Language technology – Language Pedagogy: 21st century outlook. 25th MANYE Congress, Budapest.

Budapest BI Forum

Today, we are presenting our recent work on information visualization and on using images in content analysis at the Budapest BI Forum.

Kitti Balogh: Text Visualization Dashboards (in Hungarian)

This talk gives an overview of our visualization dashboards. If you speak Hungarian, you can find a short description along with the accompanying slides here.

Varjú Zoltán: A Review of Image Retrieval Methods – a journey from image descriptors to neural networks

Although visual information is becoming more and more common in the online world and researchers have given us plenty of tools to deal with it, it is still hard to find the right solution to the most common information retrieval tasks, like finding duplicates, finding similar items, and forming meaningful clusters of images. On a dataset of about 50k images we went through the traditional approaches, like using image hashing and image descriptors for finding duplicates and clusters; we tried out image labeling solutions and we tested state-of-the-art variational autoencoders too. Of course, we compared and evaluated each and every solution, and now we would like to share our experiences with you.
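As a taste of the duplicate-finding part, here is a minimal sketch of perceptual hashing with the Python imagehash library. The file names are placeholders and the Hamming-distance threshold is an assumption that has to be tuned per dataset.

```python
# Sketch: near-duplicate detection with perceptual hashes (imagehash + Pillow).
from PIL import Image
import imagehash

h1 = imagehash.phash(Image.open("photo_a.jpg"))   # placeholder file names
h2 = imagehash.phash(Image.open("photo_b.jpg"))

distance = h1 - h2            # Hamming distance between the two hashes
if distance <= 8:             # threshold is an assumption; tune it on your data
    print("likely duplicates or near-duplicates")
else:
    print("probably different images")
```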

Budapest.AI

Today, we are going to speak at Budapest.AI, the first city.ai meeting in Budapest. Our talk focuses on the dashboards we developed during the last year. You can find our slides and links to our visualizations and dashboards below.

 

How to censor the internet?

After exploring, analysing and making presentations on a network of political blogs, we decided it was time to demolish it like a neat little LEGO house. How can we know whether the network is robust enough? Will it fall apart from one well-aimed blow, or will we need to fight tooth and nail to dismantle it? What should a wicked little goblin do if it doesn't want us to find connections and paths between websites expressing various views of the world? This is what our research group was trying to find out by comparing two network attack strategies.


To bombard our network of 747 blogs and news sites connected by 1195 links, we adopted the strategies described in the article by Réka Albert, Hawoong Jeong and Albert-László Barabási. The first strategy imitated random failures: since an error comes up randomly, we also picked a node at random, deleted it with all of its connections, then moved on to the next node, deleted that one as well, and so on. When adopting the second strategy, however, we didn't rely on mere chance but targeted the most vulnerable parts of the network. Unlike Barabási's demonstration, where the nodes with the highest degree – the most connected nodes – were eliminated, here we deleted the ones with the highest PageRank value. (We had tried this previously, and PageRank proved to be the more useful destruction tool.) We could have carried on deleting nodes until all of them disappeared, but our aim was to feed our destructive tendencies, so we were happy if we could rip the whole network apart by deleting the minimum number of pages.
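For the curious, here is a small networkx sketch of the two strategies on a scale-free toy graph (a Barabási–Albert graph standing in for our blog network), tracking the share of nodes left in the largest connected component.

```python
# Sketch: random vs. PageRank-targeted node removal on a scale-free toy graph.
import random
import networkx as nx

def attack(graph, order, steps=100):
    """Remove nodes in the given order; report the largest component's share of nodes."""
    G = graph.copy()
    n0 = G.number_of_nodes()
    shares = []
    for node in order[:steps]:
        G.remove_node(node)
        largest = max(nx.connected_components(G), key=len)
        shares.append(len(largest) / n0)
    return shares

G = nx.barabasi_albert_graph(747, 2, seed=1)      # stand-in for the 747-site network

random_order = random.sample(list(G.nodes), G.number_of_nodes())   # random failure order
pr = nx.pagerank(G)
targeted_order = sorted(pr, key=pr.get, reverse=True)              # highest PageRank first

print(attack(G, random_order)[-1])     # remains comparatively large after 100 random deletions
print(attack(G, targeted_order)[-1])   # much smaller: the targeted attack fragments the network
```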

The following two videos show which strategy was the winner:

Both of them show the elimination of 100 sites following the two strategies respectively. As we might have expected, when the sites with the highest PageRank values were deleted, the network suffered major injuries and soon fell apart. What may be surprising, though, is that random attacks were a mere trifle to the network, with hardly any effect on its structure.

As Barabási's article explains, this is because the network in question – just like the majority of real networks – is scale-free. This means that there are a lot of nodes in the network which have only a few connections, and only a few which have many. This is why, in the case of a random attack, we will most probably hit a site with only a few connections, whose deletion has hardly any effect on the structure of the whole network. When, however, we attack by PageRank value, it is exactly the most significant nodes of the graph structure that get destroyed.

This phenomenon is shown from a different point of view in the next chart, which shows how the average path length of the network changes as a result of random or PageRank attacks. In the original network the average path length was 3.26, which means that one could generally get from one site to another through approximately three pages. Under the PageRank attack the average path length almost immediately begins to increase in the deteriorating network, showing that significant connecting elements have been removed. After deleting no more than one tenth of the nodes, the whole network falls into pieces and the average path length starts to decrease as well. Random attacks do not have a big influence on the average path length: at least three quarters of the nodes need to be deleted to make the network fall apart.


To sum up: should we one day have the burning desire to rip a scale-free network to pieces, the best thing to do is to eliminate its most important nodes. The level of importance can be established based on degree, PageRank, betweenness centrality or other measures.

Machines are not rising (yet)!

While movies based on artificial intelligence are flooding the market, Elon Musk and scientists like Stephen Hawking and Stuart Russell are becoming more and more worried about how modern technology is slowly but surely getting the upper hand. Even Russell, co-author of the remarkable book “Artificial Intelligence: A Modern Approach”, asked researchers in his open letter to work on systems that are beneficial and whose operation is reliable. Is there a real threat in artificial intelligence?


What is it that we are so afraid of?

People who tend to be scared of artificial intelligence are basically worried about two things:

  • during a process in the unpredictable swirl of algorithms, somehow it's always the human who ends up being the weakest link, so the machine coolly decides to eliminate them;
  • machines may become conscious and turn against their makers.

Warning us – or rather his colleagues – Russell finds the first threat feasible. The good news is that with proper consideration this problem can be resolved. But why does it require attention? Most systems labelled artificial intelligence belong to the realm of machine learning. Essentially they do not imitate the human way of problem solving, but carry out tasks that we are unable to specify and program by hand. Weather forecasting may be the most widely known example, but there are others too, such as nowcasting or the classification methods applied in medical diagnosis. Procedures like these may save lives or even determine the fate of entire communities, as with “predictive policing”, a practice becoming more and more widespread nowadays. Luckily, however, research in this field has to meet strict requirements: code can be monitored with the QA methods used in software development, and statistics provides us with the means to evaluate results. Therefore, we can say with confidence that by keeping an ear to the ground the first threat can be reduced, even if not eliminated entirely.

How about the second one? It assumes a general, not task-specific machine that is able to set its own goals. Last year, Google DeepMind learnt to play Atari games fairly well, and this year it beat Lee Sedol, one of the world's best Go players. Now the same basic techniques that mastered various games are being used to analyze patients' records at Moorfields Eye Hospital.

One may wonder… in the famous film Blade Runner, some genetically engineered replicants – Nexus-6 models – are trying to escape their pre-programmed demise. While confronting their maker they create beautiful poetic images, like the Tears in Rain monologue… Is the arrival of such a group imminent?

The unresolved issue of 2000 years

It’s not enough to understand reality in order to experience reality but understanding it should previously be enlightened. Understanding existence is in itself placed on a generally speaking shining horizon. (Heidegger: Basic problems of metaphysics, p 351)

Most AI books include a passage on the limits of artificial intelligence somewhere in the introduction. Interestingly enough, it is Hubert Dreyfus, an expert in the continental, phenomenological and hermeneutical tradition – an entirely different field – who gets quoted in these works, rather than the aces of classical analytical philosophy. The reason is that his study “Alchemy and Artificial Intelligence”, published in 1965, and his books “What Computers Can't Do” and “What Computers Still Can't Do” have withstood the test of time and brilliantly forecast the limits and pitfalls of artificial intelligence research.

Studying traditional artificial intelligence Dreyfus found it to be based on four assumptions:

  1. the biological assumption – the brain processes information in discrete operations;
  2. the psychological assumption – the mind can be viewed as a device operating on bits of information according to formal rules, which can be executed on a discrete information processing unit;
  3. the epistemological assumption – knowledge can be formalized, i.e. everything that can be understood by human beings can be expressed by context-independent formal rules or definitions;
  4. the ontological assumption – the world itself consists of independent facts that can be represented by independent symbols.

This is exactly the program that western philosophy and science commenced two thousand years ago. Traditional AI (or GOFAI, good old-fashioned AI) did believe that making an artificial intelligence would help understand natural human intelligence, and this is exactly the idea behind the psychological assumption. Being independent of the other three assumptions – which together make up modern artificial intelligence – it was, however, quickly discarded and handed over to cognitive science.

Dreyfus acknowledges both the benefits and the constant development of artificial intelligence. He points out, however, that working from the premises of AI, it took western intellectuals a two-thousand-year-long struggle to realize that the problem of brain and mind cannot be resolved within the old-fashioned framework.

Dreyfus claims that only a certain part of human intelligence is built according to a method easily applicable by science: there are basic problem-solving principles and operations applying certain patterns that can be expressed by rules or assumptions. Human experience and intelligence, however, are also crucial and cannot be ignored; we are the products of our environment at least as much as we are its observers and creators. Just like Quine, Dreyfus is a holist: in order to recognize the basic elements of either the world or our knowledge, a prior comprehensive image of the world itself is needed. Let us take an example:

As the famous gavagai example goes: should we find ourselves with an isolated tribe and want to note down their language, we would conduct observations, collect linguistic data and try to “fabricate” the rules of the language from the behavior and reactions of the speakers. If we follow a member of the tribe who suddenly sees a rabbit and cries out “gavagai”, we take notes and try to analyze this behavior. How to translate it into English is another matter.

“Rabbit”, or “there's a rabbit”, or it may mean “it's a rabbit over there”, but it may also be “there goes today's dinner”. With some practical tools we can certainly narrow the possible interpretations down. For instance, if the same word “gavagai” is said in the evening when we have a piece of meat on our dinner plate, the options may be restricted, but the translation can still be “dinner” or “rabbit”. Quine says this happens because, in order to make correct interpretations, we would have to know the entire language “all together”. We don't simply learn isolated sentences, but their coherence and the related empirical experience as well; the sentences of a language are therefore mere abstractions. Their meaning comes from the language as a whole, not from the individual sentences constructing the language.

All this is true of intelligence as well. It must be remembered that we humans are “embedded” in the world surrounding us. The world is its own best representation, the way it is given to us, and we use it unconsciously on a daily basis. We extend our minds when using a given object – for instance a church tower – as a signpost for direction. Our mind, however, is very different from our brain. Our perception, just like our presence in the world, is defined by our physical body, since we experience the world through our senses and change it with our body. This thought of Dreyfus is also a precursor of embodied cognition.

Robots!

Before anyone thinks this is only philosophy, let's consider the Moravec paradox. Moravec and Brooks, the pioneers of modern robotics, became interested in embodied cognition partly as a result of Dreyfus's influence. They were attempting to break down the boundaries of the traditional approach by providing their intelligent systems with a body. Along the way they discovered the following paradox: high-level, symbolic processing requires very little computation, while low-level, sensorimotor processing requires enormous computational resources. What is more, symbolic processing is built on the lower levels.

Let us suppose we are able to make a robot that can carry out embodied cognitive processes. Let us assume that it has consciousness, whatever that may be. This means that in its build it must be very similar to a human – so similar, perhaps, that the Voight-Kampff test conducted in Blade Runner would be needed to decide whether it is a human being or an android we are interacting with.

Significant steps have been taken towards deeper discoveries in the field of artificial intelligence; the Google DeepMind project is now learning general concepts. Dreyfus warns us that all this covers only a small part of the actual operations of the mind. To recognize unique elements a comprehensive approach is needed, since the various concepts are acquired together with their relations to each other. These relations are perceived when embodied in a world; without a body and a surrounding environment only partial success is possible.