Hedge scope detection

This summer my supervisor and I wrote and published my first scientific article, “Combining manual rules and supervised learning for hedge cue and scope detection” [LINK]. The idea is that authors use hedged, uncertain language when they are not sure of their claims. For example:

This gene is probably a pseudogene because it contains an
inframe TAA codon.

The gene might be a pseudogene or it might not; based on the text alone, we cannot know for sure. At the same time, the author is certain that it contains an inframe TAA codon. Systems that automatically analyse text and build databases from it want to leave out the uncertain parts and extract only the facts. That is why hedged language needs to be detected first.

My goal was to develop a system that could do that. The sentence above will become:

This gene is (<probably> a pseudogene) because it contains an
inframe TAA codon.

Here “probably” is the hedge cue, the word that signals hedged language, and “probably a pseudogene” is its scope, the complete hedged part of the sentence. The final system uses eight manually written rules and supervised machine learning to mark the correct area as speculation.
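To make the annotation format concrete, here is a small toy sketch in Python. It is not the system from the paper: the cue list, the clause-boundary list and the single scope rule (run from the cue to the next clause boundary) are all assumptions made up for illustration, and the function name `mark_hedge` is hypothetical.

```python
# Toy illustration only: the cue list and the scope rule below are assumptions,
# not the eight rules or the classifier described in the paper.
HEDGE_CUES = {"probably", "might", "may", "possibly", "suggests"}
CLAUSE_BOUNDARIES = {"because", "but", "although", ",", ";", "."}


def mark_hedge(sentence):
    """Wrap the first hedge cue in <...> and its guessed scope in (...)."""
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        if token.lower() in HEDGE_CUES:
            # Guess: the scope runs from the cue up to the next clause boundary.
            end = i + 1
            while end < len(tokens) and tokens[end].lower() not in CLAUSE_BOUNDARIES:
                end += 1
            scope = ["<" + token + ">"] + tokens[i + 1:end]
            return " ".join(tokens[:i] + ["(" + " ".join(scope) + ")"] + tokens[end:])
    return sentence  # no cue found, so nothing is marked


print(mark_hedge("This gene is probably a pseudogene because it contains an inframe TAA codon."))
# prints: This gene is (<probably> a pseudogene) because it contains an inframe TAA codon.
```

A rule this naive breaks quickly on real sentences, which is part of why the actual system combines several hand-written rules with a learned model.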

The paper was published as part of the CoNLL 2010 shared task, essentially a competition where the organisers set the goal, everyone develops their own system, and the systems are then evaluated under the same conditions. This system ranked second in the task of hedge scope detection; however, since this is a rather difficult problem, the accuracy still has room for improvement.
