Chinese Text Project

Semantic annotation

Introduction

Semantic annotation involves adding to a text computer-readable data about the meaning of words and phrases in their given context. This enables further processing and allows the system to display additional relevant information. For example, in the following passage, the semantically annotated version (left) provides useful contextual information about dates, people, and written works:

With annotation | Without annotation
1 夏四月乙巳呂夷簡上《景祐法寶新錄》。甲子呂夷簡王曾宋綬蔡齊罷,以王隨門下侍郎同中書門下平章事昭文館大學士陳堯佐同中書門下平章事集賢殿大學士盛度知樞密院事韓億程琳石中立參知政事王鬷同知樞密院事
1 夏四月乙巳,呂夷簡上《景祐法寶新錄》。甲子,呂夷簡、王曾、宋綬、蔡齊罷,以王隨為門下侍郎、同中書門下平章事、昭文館大學士,陳堯佐同中書門下平章事、集賢殿大學士,盛度知樞密院事,韓億、程琳、石中立參知政事,王鬷同知樞密院事。

General principles

Semantic annotation in the Chinese Text Project involves creating three types of closely related data:

  1. Annotations. An annotation locates a short region of text - usually a word or short phrase - and provides information about what that word or phrase means in the particular context in which it occurs. For example, in the sentence "孔子適齊。" we might want to add an annotation for the word "孔子" indicating that in this sentence, "孔子" refers to a particular person: the historical individual Confucius.
    Two types of annotation are supported in ctext:
    • Entity annotations - indicate that the annotated text refers to a particular entity, such as "ctext:855132" (王安石).
    • Date annotations - indicate that the annotated text refers to a particular historical date. The date is specified by recording the era (or ruler) to which the date belongs, such as "ctext:27110" (天禧 era), as well as data about the meaning of the date, such as "year 1, month 2".
  2. Entity records. An entity record represents a unique thing. This may be a concrete object - such as a person, or a physical building - or an abstract or constructed object, like a bureaucratic office. For example, factual and fictional historical people - like Wang Anshi - have entity records; so do works - like the History of Song - and dynasties - like Northern Song. Entity records are used to contain information about entities, and as a reference point for annotations: the annotation of "孔子" in the example above would point to the entity record for Confucius. Entity records help distinguish between different things that sometimes have the same name, and identify the same thing when it may be referred to by different names. Every entity record has a unique identifier, e.g. "ctext:27110" (天禧 era). Using these identifiers allows us to precisely distinguish between entities with the same name - such as "ctext:474358" for the 紹興 era of the Song dynasty, and "ctext:63988" for the 紹興 era of the Western Liao dynasty. The page for each entity lists its identifier immediately below the title.
  3. Knowledge claims. A knowledge claim represents one piece of information about an entity; entity records are made up of knowledge claims about that entity. A knowledge claim primarily connects three things: a subject (the entity the claim relates to), a verb or relation, and an object or target of the relation. For example, a knowledge claim about Wang Anshi might connect Wang Anshi (subject) and Wang Yi (object), with the verb "father" - thus recording the fact that Wang Anshi's father is Wang Yi. As a second example, we might connect Wang Anshi and the office Hanlin Academic through the relation "held-office", to indicate that Wang Anshi held this particular bureaucratic office.
    Sometimes it is useful to record additional information about a claim. This can be done by adding one or more qualifiers to the claim. A qualifier is an additional part of a claim which connects that claim with two other pieces of information: an additional verb (the qualifier), and an additional object. For example, while it is true to say that Wang Anshi held the office of Hanlin Academic, it is useful to further explain this by indicating that he held the office starting from a particular date - this is done by adding the from-date qualifier to the claim, together with an object representing that particular date.
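The three kinds of data described above can be pictured as a small data model. The following Python sketch is purely illustrative: the class names, field names, and every identifier beginning with "example:" are assumptions made for this example and do not reflect how ctext stores its data. Only "ctext:855132" (王安石) is taken from this page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Qualifier:
    verb: str  # the additional verb, e.g. "from-date"
    obj: str   # the additional object, e.g. a date


@dataclass
class KnowledgeClaim:
    subject: str  # identifier of the entity the claim is about
    verb: str     # the relation, e.g. "father" or "held-office"
    obj: str      # the target of the relation
    qualifiers: List[Qualifier] = field(default_factory=list)


@dataclass
class EntityRecord:
    identifier: str  # unique identifier, e.g. "ctext:855132"
    name: str
    claims: List[KnowledgeClaim] = field(default_factory=list)


@dataclass
class EntityAnnotation:
    text: str    # the annotated word or phrase
    start: int   # character offset of the region in the passage
    entity: str  # identifier of the entity record referred to


# Entity record for Wang Anshi with the two claims described above.
# The "example:" identifiers are hypothetical placeholders.
wang_anshi = EntityRecord("ctext:855132", "王安石")
wang_anshi.claims.append(
    KnowledgeClaim("ctext:855132", "father", "example:wang-yi")
)
wang_anshi.claims.append(
    KnowledgeClaim(
        "ctext:855132",
        "held-office",
        "example:hanlin-academic",
        qualifiers=[Qualifier("from-date", "example:some-date")],
    )
)

# Annotation of "孔子" in the sentence "孔子適齊。"
passage = "孔子適齊。"
annotation = EntityAnnotation(text="孔子", start=0, entity="example:confucius")
```

Keeping qualifiers attached to the claim, rather than to the entity, mirrors the description above: the from-date qualifier says something about the holding of the office, not about Wang Anshi in general.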

Citations

Citations are required for most types of claim. A citation is a specific textual reference in ctext citation format. A citation is composed of two parts: a URN identifying a particular chapter of one edition of a text, and the literal content of the text being cited (in Traditional Chinese); these two parts are combined using the symbol "@". For example:

The citation should be chosen to be a complete sentence or meaningful sentence fragment that justifies the claim. Context does not need to be cited, because the text will be linked directly to its source.
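As a rough illustration of the two-part structure, the sketch below splits and joins a citation on the "@" symbol. It assumes the URN comes first and the quoted text second, and the URN value used here is a made-up placeholder rather than a real ctext URN; the function and class names are likewise hypothetical.

```python
from typing import NamedTuple


class Citation(NamedTuple):
    urn: str          # identifies a chapter of one edition of a text
    quoted_text: str  # literal content being cited (Traditional Chinese)


def parse_citation(citation: str) -> Citation:
    """Split a citation string at the first "@" (assumed order: URN, then text)."""
    urn, separator, quoted_text = citation.partition("@")
    if not separator or not urn or not quoted_text:
        raise ValueError("expected a citation of the form URN@text")
    return Citation(urn, quoted_text)


def format_citation(urn: str, quoted_text: str) -> str:
    """Combine the two parts of a citation using "@"."""
    return f"{urn}@{quoted_text}"


# "example-urn" is a placeholder, not a real ctext URN.
citation_string = format_citation("example-urn", "孔子適齊。")
parsed = parse_citation(citation_string)
```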

Most claims require evidence, with the following exceptions:

Annotation conventions

In order to promote consistency in the data and facilitate effective automated processing, please observe the following conventions when marking up texts:

Dates

Dates are important pieces of historical data that need to be annotated carefully. A date annotation connects a date in a text (e.g. "二月") with enough additional data to make the date unambiguous - for example, the information that the date refers to a particular year and month within some specific era. The annotation client provides a mechanism to input this information, by connecting each date annotation to an era. In many cases, dates in a text do not directly contain all of this information, as it is provided contextually - as in the following passage:

1 開寶九年冬十月癸丑太祖崩,帝遂即皇帝位。乙卯,大赦,常赦所不原者咸除之。

The first of the two dates in the above passage is "complete": it directly contains enough information, taken together with the era, to unambiguously point to a particular date - specifically, the information year 9, month 10, day 癸丑. The second date ("乙卯") does not directly contain this information because the information is implied by the context. Date annotation involves explicitly recording these separate values, so that digital systems can correctly process the date.
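A minimal sketch of this idea in Python: a partial date takes any values it lacks from the date that precedes it. The DateValue class and resolve function are hypothetical names used only for illustration, and the era is written by name here for readability, whereas an actual annotation would point to the era's entity record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DateValue:
    era: Optional[str] = None    # era name (simplification of an era identifier)
    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[str] = None    # sexagenary day, e.g. "癸丑"


def resolve(partial: DateValue, context: DateValue) -> DateValue:
    """Fill in values missing from `partial` using the preceding date."""
    if partial.era is not None:
        # A date that names its own era is treated as starting a new context.
        return partial
    year, month = partial.year, partial.month
    if year is None:
        year = context.year
        if month is None:
            month = context.month
    return DateValue(context.era, year, month, partial.day)


# "開寶九年冬十月癸丑" is complete; "乙卯" supplies only the day.
first = DateValue("開寶", 9, 10, "癸丑")
second = resolve(DateValue(day="乙卯"), first)
```

Here `second` carries era 開寶, year 9, and month 10 explicitly, even though the text itself only says "乙卯".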

The annotation client will attempt to suggest appropriate values, but these will sometimes be incorrect. It is important to pay attention to the contextual flow of information when annotating dates, especially where parenthetical references to other years and eras do not affect the interpretation of dates later in the text. For example, in the following passage, purple arrows indicate the correct contextual flow of date information:

The annotation client will help by suggesting the correct values automatically in most cases - e.g. suggesting that "乙卯" refers to year 9, month 10 of the 開寶 era - but in this example it will incorrectly propose that "十一月癸亥" refers to the 11th month of year 8 of 開寶, because year 8 was referenced immediately before it. In cases like these it is important to pay attention to the date flow: if "十一月癸亥" is marked as referring to year 8, then the annotation client will infer that 甲子 and 庚午 should also be marked as year 8, whereas in this passage they actually refer to year 9. Mistakes of this kind easily cascade to affect many dates in historical texts because much of the date information is implied contextually.
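The cascade can be reproduced with a toy model. The sketch below is not the annotation client's algorithm; it is a simplified stand-in that carries the year and month forward from one date mention to the next, with an option to stop parenthetical references from updating the context. The mention list is an abstraction of the dates discussed above, and the text of the year 8 reference is a placeholder.

```python
from typing import List, NamedTuple, Optional, Tuple


class Mention(NamedTuple):
    text: str
    year: Optional[int] = None    # year stated in the text, if any
    month: Optional[int] = None   # month stated in the text, if any
    parenthetical: bool = False   # aside that should not affect later dates


def propagate(
    mentions: List[Mention], skip_parenthetical: bool
) -> List[Tuple[str, Optional[int], Optional[int]]]:
    """Assign a year and month to each mention from the running context."""
    year: Optional[int] = None
    month: Optional[int] = None
    resolved = []
    for mention in mentions:
        y = mention.year if mention.year is not None else year
        m = mention.month if mention.month is not None else month
        resolved.append((mention.text, y, m))
        # Parenthetical references leave the context unchanged when skipped.
        if not (skip_parenthetical and mention.parenthetical):
            year, month = y, m
    return resolved


mentions = [
    Mention("開寶九年冬十月癸丑", year=9, month=10),
    Mention("乙卯"),
    Mention("(reference to year 8)", year=8, parenthetical=True),
    Mention("十一月癸亥", month=11),
    Mention("甲子"),
    Mention("庚午"),
]

naive = propagate(mentions, skip_parenthetical=False)
aware = propagate(mentions, skip_parenthetical=True)

for (text, wrong_year, _), (_, right_year, _) in zip(naive, aware):
    print(f"{text}: naive year {wrong_year}, context-aware year {right_year}")
```

With naive propagation, a single misread reference pushes "十一月癸亥", "甲子", and "庚午" all into year 8; treating the reference as parenthetical keeps them in year 9.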

Texts and editions

Only one edition of each text should be annotated. This should normally be the representative edition.

Some annotations have been added to the following texts; please use the editions linked below when adding or correcting annotations:

Standard Histories

  1. 史記
  2. 漢書
  3. 後漢書
  4. 三國志
  5. 晉書
  6. 宋書
  7. 南齊書
  8. 梁書
  9. 陳書
  10. 魏書
  11. 北齊書
  12. 周書
  13. 南史
  14. 北史
  15. 隋書
  16. 舊唐書
  17. 新唐書
  18. 舊五代史
  19. 新五代史
  20. 宋史
  21. 遼史
  22. 金史
  23. 元史
  24. 明史
  25. 清史稿

Other historical works

Bibliographic works and catalogs

The above are only partial lists; other texts can also be annotated, provided that: