Help:Special characters

Systems for character encoding

From MediaWiki 1.5, all projects use Unicode (UTF-8) character encoding.

Until the end of June 2005, when this new version came into use on Wikimedia projects, the English, Dutch, Danish, and Swedish Wikipedias used windows-1252 (they declared themselves to be ISO-8859-1, but in practice browsers treat the two as synonymous, and the MediaWiki software made no attempt to prevent the use of windows-1252 characters). Wikitext from before the upgrade remains stored in their databases in windows-1252 and is converted on load; edits made since the upgrade are stored as UTF-8. This conversion on load is invisible to users.

  • Unicode (UTF-8)
    • a variable number of bytes per character
    • special characters, including CJK characters, can be treated like normal ones: both the rendered page and the edit box show the character itself; character references (multi-character codes) may still be used, but they are not automatically converted in the edit box.
  • ISO 8859-1
    • one byte per character
    • special characters that are not available in the limited character set are stored in the form of a multi-character code; there are usually two or three equivalent representations, e.g. for the character € the named character reference &euro;, the decimal character reference &#8364; and the hexadecimal character reference &#x20AC;. The edit box shows the entered code, the webpage the resulting character. Unavailable characters copied into the edit box are first displayed as the character, and are automatically converted to their decimal codes on Preview or Save.
    • the most common special characters, such as é, are in the character set, so code like &eacute;, although allowed, is not needed (the sketch after this list illustrates the difference).
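
A minimal Python sketch (an illustration, not part of MediaWiki) of the difference described above: é fits in a single ISO 8859-1 byte, while € does not and on a Latin-1 wiki has to be stored as a character reference, whereas UTF-8 encodes both directly using a variable number of bytes.

# Sketch only: compare how the two encodings handle é and €.
for ch in ("é", "€"):
    print(ch, "UTF-8 bytes:", ch.encode("utf-8"))
    try:
        print(ch, "ISO 8859-1 byte:", ch.encode("iso-8859-1"))
    except UnicodeEncodeError:
        # not in the Latin-1 repertoire, so a Latin-1 wiki stores a reference instead
        print(ch, "has no ISO 8859-1 byte -> store as &#%d; or &#x%X;" % (ord(ch), ord(ch)))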

Note that Special:Export exports using UTF-8 even if the database is encoded in ISO 8859-1; this was already the case for the English Wikipedia when it still ran version 1.4.

To find out which character set applies in a project, use the browser's "View Source" feature and look for a line such as this:

<meta http-equiv="Content-type" content="text/html; charset=iso-8859-1" />

or

<meta http-equiv="Content-type" content="text/html; charset=utf-8" />

Editing

Many characters not in the repertoire of standard ASCII will be useful—even necessary—for wiki pages, especially for foreign language textbooks. This page contains recommendations for which characters are safe to use and how to use them. There are three ways to enter a non-ASCII character into the wikitext:

  1. Enter the character directly from a foreign keyboard, by cut and paste from a "character map" type application, or by some special means provided by the operating system or text editing application. On ISO-8859-1 wikis some browsers will change characters outside the wiki's character set into HTML numeric character entities (see below).
  2. Use an HTML named character entity reference like &agrave;. This is unambiguous even when the server does not announce the use of any special character set, and even when the character does not display properly on some browsers. However, it may cause difficulties with searches (see below).
  3. Use an HTML numeric character entity reference like &#161;. Unfortunately, some old browsers incorrectly interpret these as references to the native character set. It is, however, the only way to enter Unicode characters for which there is no named entity, such as the Turkish letters. Note that character references in the range 128 to 159, such as &#131;, are illegal and ambiguous, though they are commonly used by many web sites: those code points are not assigned to printable characters in either ISO-8859-1 or Unicode (they map to rare control codes that are illegal in HTML). Almost all browsers nevertheless treat ISO-8859-1 as windows-1252, which does have printable characters in that range, and such characters often find their way into article titles on en:, which causes real confusion when trying to create interwiki links to those pages (see the sketch after this list).
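
The ambiguity of references in the 128–159 range can be seen with a short Python sketch (an illustration, not MediaWiki code): formally those code points are C1 control codes, but windows-1252 assigns printable characters to the same byte values, and HTML5-era parsers quietly apply the windows-1252 mapping.

import html

print(repr(bytes([0x83]).decode("iso-8859-1")))     # '\x83' - an unprintable C1 control code
print(repr(bytes([0x83]).decode("windows-1252")))   # 'ƒ'    - what browsers actually display
print(repr(html.unescape("&#131;")))                # 'ƒ'    - HTML5-style parsing applies the same fix-up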

Generally speaking, Western European languages such as Spanish, French, and German pose few problems. For specific details about other languages, see: Turkish. (More will be added to this list as contributors in other languages appear.)

For the purpose of searching, a word with a special character is best written using the first method. If the second method is used, a word like Odiliënberg can only be found by searching for Odili, euml and/or nberg (see the sketch below); this is actually a bug that should be fixed: the entities should be folded into their raw character equivalents so that all searches on them are equivalent. See also Help:Searching.
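
A small Python sketch (hypothetical, purely to illustrate the search problem): a word-based index that splits on non-letter characters breaks the entity form of Odiliënberg into separate tokens, while the raw-character form stays in one piece.

import re

# entity form: the word falls apart into three tokens
print(re.findall(r"[A-Za-z]+", "Odili&euml;nberg"))   # ['Odili', 'euml', 'nberg']
# raw character form: one token, as a reader would expect
print(re.findall(r"\w+", "Odiliënberg"))              # ['Odiliënberg']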

Esperanto

  In edit box    In database and output
  S              S
  Sx             Ŝ
  Sxx            Sx
  Sxxx           Ŝx
  Sxxxx          Sxx
  Sxxxxx         Ŝxx

MediaWiki installations configured for Esperanto use UTF-8 for storage and display. However, when editing, the text is converted to a form designed to be easier to type on a standard keyboard.

The characters to which this applies are: Ĉ, Ĝ, Ĥ, Ĵ, Ŝ, Ŭ, ĉ, ĝ, ĥ, ĵ, ŝ, ŭ. You may enter these directly in the edit box if you have the facilities to do so; however, when you edit the page again you will see them encoded as Sx and so on. This form is referred to as "x-sistemo" or "x-kodo". To preserve round-trip capability when one or more x's follow these characters or their non-accented forms (C, G, H, J, S, U, c, g, h, j, s, u), the number of x's in the edit box is double the number in the actual stored article text (see the table above).

For example, the interlanguage link [[en:Luxury car]] to en:Luxury car has to be entered in the edit box as [[en:Luxxury car]] on eo:. This has caused problems with interwiki update bots in the past.
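
The following Python sketch of the edit-box-to-storage direction only illustrates the doubling rule described above; it is not MediaWiki's actual implementation, and the function name decode_x is made up.

# Map of convertible letters to their accented forms.
PAIRS = {"C": "Ĉ", "G": "Ĝ", "H": "Ĥ", "J": "Ĵ", "S": "Ŝ", "U": "Ŭ",
         "c": "ĉ", "g": "ĝ", "h": "ĥ", "j": "ĵ", "s": "ŝ", "u": "ŭ"}

def decode_x(text):
    """Edit-box form (x-sistemo with doubled x's) -> stored form."""
    out, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch in PAIRS:
            j = i + 1
            while j < len(text) and text[j] in "xX":   # count the run of x's
                j += 1
            run = j - i - 1
            if run % 2:                 # odd run: one x marks the accent, the rest were doubled literals
                out.append(PAIRS[ch] + "x" * (run // 2))
            else:                       # even run: plain letter, literal x's were doubled
                out.append(ch + "x" * (run // 2))
            i = j
        else:
            out.append(ch)
            i += 1
    return "".join(out)

print(decode_x("Sxxx"))                  # Ŝx
print(decode_x("[[en:Luxxury car]]"))    # [[en:Luxury car]]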

Browser issues

Some browsers are known to do nasty things to text in the edit box. Most commonly they convert it to an encoding native to the platform (while the NT line of Windows is internally UTF-16, it has a complete duplicate set of APIs using the Windows ANSI code page, and many older applications tend to use these, especially for things like edit boxes), let the user edit it in a standard edit control, and then convert it back. The result is that any character that does not exist in the encoding used for editing gets replaced with something that does (often a question mark, though at least one browser has been reported to actually transliterate text!).

IE for the Mac

This relatively common browser translates to Mac Roman for the edit box, with the result that it munges most Unicode content (usually, but not always, by replacing it with a question mark). It also munges characters that are in ISO-8859-1 but not Mac Roman (specifically ¤ ¦ ¹ ² ³ ¼ ½ ¾ Ð × Ý Þ ð ý þ and the soft hyphen), so the problems it causes are not limited to Unicode wikis (though they tend to be much worse on Unicode wikis, because they affect actual text and interwiki links rather than just fairly obscure symbols).

Netscape 4.x

Similar issues to IE for the Mac, though the character set converted to and from will obviously not always be Mac Roman.

Lynx

Appears to transliterate edit box text depending on settings; more research is needed.

The workaround

  In database and edit box     In edit box
  for normal browsers          for bad browsers
  œ                            &#x153;
  &#x153;                      &#x0153;
  &#x0153;                     &#x00153;

After en: switched to UTF-8 and interwiki bots started replacing HTML entities in interwiki links with literal Unicode text, edits that broke Unicode characters became so common that they could no longer be ignored. A workaround was developed to allow broken browsers to edit safely, provided MediaWiki knew they were broken.

Browsers listed in the setting $wgBrowserBlackList (a list of regular expressions matched against user agent strings) are supplied text for editing in a special form: existing hexadecimal HTML entities in the page have an extra leading zero added, and non-ASCII characters that are stored in the wikitext are represented as hexadecimal HTML entities with no leading zeros (see the table above and the sketch below).

Currently the default settings only include IE for the Mac and one specific version of Netscape 4.x for Linux in the blacklist. Nevertheless, this seems to have stopped most of the problem. Hopefully the default list will be expanded in future, but that relies on getting someone with CVS access to commit the changes.
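
A rough Python sketch of the idea (an illustration of the rule described above, not MediaWiki's code; the names armour and unarmour are made up): on the way out to a blacklisted browser, existing hex entities gain one leading zero and raw non-ASCII characters become zero-free hex entities; on the way back in, the leading zero tells the two apart.

import re

def armour(wikitext):
    # stored hex entities get one extra leading zero
    text = re.sub(r"&#x([0-9A-Fa-f]+);", lambda m: "&#x0%s;" % m.group(1), wikitext)
    # raw non-ASCII characters become hex entities with no leading zero
    return re.sub(r"[^\x00-\x7F]", lambda m: "&#x%X;" % ord(m.group()), text)

def unarmour(edited):
    def restore(m):
        digits = m.group(1)
        if digits.startswith("0"):       # was an entity already: strip one zero
            return "&#x%s;" % digits[1:]
        return chr(int(digits, 16))      # was a raw character: restore it
    return re.sub(r"&#x([0-9A-Fa-f]+);", restore, edited)

text = "œ and &#x153;"
print(armour(text))              # &#x153; and &#x0153;
print(unarmour(armour(text)))    # œ and &#x153;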

Egyptian Hieroglyphs

E.g. <hiero>P2</hiero> gives

P2

See Help:WikiHiero syntax.

This is not dependent on browser capabilities, because it uses images on the servers.

Hieroglyphs could also be represented using Unicode; however, browser support for this is likely to be near nonexistent.

See also

Help contents