Near real-time information extraction from social media streams, online marketplaces and online forums is providing a new 'virtual sensor' capability for Open Source Intelligence (OSINT). End users such as law enforcement agencies (e.g. UK Border Force, National Crime Agency), emergency response agencies (e.g. tsunami early warning centres, civil protection authorities) and news agencies (e.g. Deutsche Welle, BBC News) are all actively investigating how best to use this new data to support their day-to-day decision making.
Research challenges in the area of online information extraction include how to assess the veracity of content, how to respond to its dynamic nature and how to develop algorithms which can cope with the growing volumes of online data. In the online world fake news is rife, so analysing the sentiment, stance and provenance of sources is very important. Automated fact extraction and checking is a challenging task and still very much in its infancy. However, techniques are emerging to support human verification processes, identifying contextual information around factual claims for cross-checking and collating content from different viewpoints and sources to develop a balanced picture of what is going on. Natural Language Processing (NLP) and Information Extraction (IE) approaches typically work with either large web-scale corpora of example posts, or small hand-crafted corpora with annotated language patterns and/or vocabularies. In domains like breaking news the topic of interest changes every few hours, so compiling training data is not practical. In domains like cybercrime, information exchanges are often hard to obtain and fragmented, with discussion threads frequently switching between public forum exchanges and hidden private messaging. Unsupervised Open Information Extraction (OpenIE) approaches are able to work with little or no training data, and incremental self-learning strategies can use relevance feedback to boost precision. Algorithm scalability is critical for near real-time processing, so efficient indexing and/or naive parallelization are also becoming increasingly important.
In this seminar Dr Middleton will chart a path through his research into information extraction over the last 5 years, starting with algorithms to help breaking news verification and leading on to supporting sensemaking from OSINT for military intelligence analysis and law enforcement agencies. He will explain the algorithms used, present the results obtained and suggest some lessons learnt along the way.
Dr Stuart E. Middleton is a senior research engineer at the University of Southampton, Electronics and Computer Science (ECS), IT Innovation Centre. Over the last 16 years he has made internationally recognized contributions to research in the computational linguistics and information extraction areas, often including interdisciplinary work. He has been a PI and CoI on various EU H2020, Innovate UK, Home Office and Research Council projects. Recent projects include the EU FP7 REVEAL project (geoparsing, information extraction and social media verification for breaking news), the EU H2020 GRAVITATE project (natural language processing and semantic enrichment of cultural heritage databases), the DSTL ACE ‘Human-machine teaming for intelligence analysis’ project (open source information extraction for military intelligence analysis) and the ESRC FloraGuard project (information extraction around UK illegal plant trade from online marketplaces). He has over 40 peer-reviewed research papers, journal articles and book chapters, and created the Python geoparsing library ‘geoparsepy’, available on PyPI.
When: Thursday 25th February 2019
Where: Meeting Room 2.13, CSRI, Level 2, Friary House, Greyfriars Road, Cardiff, CF10 3AE
Contact: Lydia Ball BallL2@cardiff.ac.uk
This event is open to Cardiff University staff and students. Please complete the registration form below to reserve your place.