<person id="MJ" name="Marc Johnson" email="mjohnson@apache.org"/>
<person id="NKB" name="Nicola Ken Barozzi" email="barozzi@nicolaken.com"/>
<person id="POI-DEVELOPERS" name="POI Developers" email="poi-dev@jakarta.apache.org"/>
+ <person id="RK" name="Rainer Klute" email="klute@apache.org"/>
</devs>
+ <release version="2.0-pre3" date="unreleased">
+ <action dev="RK" type="add">HPSF: Much better codepage support</action>
+ </release>
<release version="2.0-pre1" date="unreleased">
<action dev="POI-DEVELOPERS" type="add">Patch applied for deep cloning of worksheets was provided</action>
<action dev="POI-DEVELOPERS" type="add">Patch applied to allow sheet reordering</action>
<td>The property's value is the number of a <strong>codepage</strong>,
i.e. a mapping from character codes to characters. All strings in the
section containing this property must be interpreted using this
- codepage. Typical property values are 1252 (8-bit "western" characters)
- or 1200 (16-bit Unicode characters).</td>
+ codepage. Typical property values are 1252 (8-bit "western" characters,
+ Windows-1252), 1200 (16-bit Unicode characters, UTF-16), or 65001 (8-bit
+ Unicode characters, UTF-8).</td>
</tr>
</table>
</section>
</section>
<section><title>Codepage support</title>
- <fixme author="Rainer Klute">Improve codepage support!</fixme>
<p>The property with ID 1 holds the number of the codepage which was used
- to encode the strings in this section. The present HPSF codepage support
- is still very limited: When reading property value strings, HPSF
- distinguishes between 16-bit characters and 8-bit characters. 16-bit
- characters should be Unicode characters and thus be okay. 8-bit
- characters are interpreted according to the platform's default character
- set. This is fine as long as the document being read has been written on
- a platform with the same default character set. However, if you receive a
- document from another region of the world and want to process it with
- HPSF you are in trouble - unless the creator used Unicode, of course.</p>
+ to encode the strings in this section. If this property is not available
+ in a section, the platform's default character encoding will be
+ used. This works fine as long as the document being read has been written
+ on a platform with the same default character encoding. However, if you
+ receive a document from another region of the world and the codepage is
+ undefined, you are in trouble.</p>
+
+ <p>HPSF's codepage support is as good as the character encoding support of
+ the Java Virtual Machine (JVM) the application runs on. If HPSF
+ encounters a codepage number, it assumes that the JVM has a character
+ encoding with a corresponding name. For example, if the codepage is 1252,
+ HPSF uses the character encoding "cp1252" to read or write strings. If
+ the JVM does not have that character encoding installed or if the
+ codepage number is illegal, an UnsupportedEncodingException will be
+ thrown.</p>
+
+ <p>There are two exceptions to the rule that a character encoding's name
+ is derived from the codepage number by prepending the string "cp" to
+ it:</p>
+
+ <dl>
+ <dt>Codepage 1200</dt>
+ <dd>is mapped to the character encoding "UTF-16".</dd>
+ <dt>Codepage 65001</dt>
+ <dd>is mapped to the character encoding "UTF-8".</dd>
+ </dl>
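The mapping rule described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not HPSF's actual code: the class name `CodepageSketch` and the helpers `encodingFor` and `decode` are hypothetical, and the value -1 for "no codepage defined" follows the convention used elsewhere in this patch.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical sketch of the codepage-to-encoding rule; not part of HPSF.
class CodepageSketch {

    /** Maps a codepage number to a Java character encoding name. */
    static String encodingFor(final int codepage)
        throws UnsupportedEncodingException {
        if (codepage <= 0)
            throw new UnsupportedEncodingException
                ("Codepage number may not be " + codepage);
        switch (codepage) {
            case 1200:
                return "UTF-16";   // first exception to the "cp" rule
            case 65001:
                return "UTF-8";    // second exception to the "cp" rule
            default:
                return "cp" + codepage;
        }
    }

    /** Decodes raw bytes; -1 (assumed marker) means "no codepage defined",
     *  in which case the platform's default encoding is used. */
    static String decode(final byte[] raw, final int codepage)
        throws UnsupportedEncodingException {
        return codepage == -1
            ? new String(raw)
            : new String(raw, encodingFor(codepage));
    }

    public static void main(final String[] args) throws Exception {
        // 0xE4 0xF6 0xFC are the letters a-umlaut, o-umlaut, u-umlaut
        // in codepage 1252.
        final byte[] raw = {(byte) 0xE4, (byte) 0xF6, (byte) 0xFC};
        System.out.println(decode(raw, 1252));
    }
}
```

Whether a name such as "cp1252" resolves depends on the character encodings the JVM ships with, which is exactly the limitation stated above.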
</section>
</section>
+ <section>
+ <title>The Dictionary</title>
+
+ <p>What a dictionary is good for is explained in the <link
+ href="how-to.html">HPSF HOW-TO</link>. This chapter explains how it is
+ organized internally.</p>
+
+ <p>The dictionary has a simple header consisting of a single UInt value. It
+ tells how many entries the dictionary comprises:</p>
+
+ <table>
+ <tr>
+ <th>Name</th>
+ <th>Data type</th>
+ <th>Description</th>
+ </tr>
+ <tr>
+ <td>nrEntries</td>
+ <td>UInt</td>
+ <td>Number of dictionary entries</td>
+ </tr>
+ </table>
+
+ <p>The dictionary entries follow the header. Each one looks like this:</p>
+
+ <table>
+ <tr>
+ <th>Name</th>
+ <th>Data type</th>
+ <th>Description</th>
+ </tr>
+ <tr>
+ <td>key</td>
+ <td>UInt</td>
+ <td>The unique number of this property, i.e. the PID</td>
+ </tr>
+ <tr>
+ <td>length</td>
+ <td>UInt</td>
+ <td>The length of the property name associated with the key</td>
+ </tr>
+ <tr>
+ <td>value</td>
+ <td>String</td>
+ <td>The property's name, terminated with a 0x00 character</td>
+ </tr>
+ </table>
+
+ <p>The entries are not aligned, i.e. each one follows its predecessor
+ without any gap or fill characters.</p>
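To make the layout concrete, here is a minimal sketch that walks a dictionary stored in a byte array. It rests on several assumptions that the description above does not state: UInt values are taken to be 4 bytes in little-endian order, the length field is taken to count bytes including the terminating 0x00, and the strings are taken to use an 8-bit codepage. The class `DictionarySketch` and its methods are hypothetical names, not HPSF API.

```java
import java.io.UnsupportedEncodingException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of reading the dictionary layout; not part of HPSF.
class DictionarySketch {

    /** Reads an unsigned 32-bit value, assuming little-endian byte order. */
    static long readUInt(final byte[] src, final int offset) {
        return (src[offset] & 0xFFL)
            | (src[offset + 1] & 0xFFL) << 8
            | (src[offset + 2] & 0xFFL) << 16
            | (src[offset + 3] & 0xFFL) << 24;
    }

    /** Reads nrEntries followed by unaligned (key, length, value) entries. */
    static Map<Long, String> readDictionary(final byte[] src,
                                            final int offset,
                                            final String encoding)
        throws UnsupportedEncodingException {
        int o = offset;
        final long nrEntries = readUInt(src, o);
        o += 4;
        final Map<Long, String> dictionary = new LinkedHashMap<Long, String>();
        for (long i = 0; i < nrEntries; i++) {
            final long key = readUInt(src, o);
            o += 4;
            // Assumption: length counts bytes, including the final 0x00.
            final int length = (int) readUInt(src, o);
            o += 4;
            final int end = o + length;
            int last = end;
            while (last > o && src[last - 1] == 0)
                last--;                      // strip terminating 0x00 bytes
            dictionary.put(Long.valueOf(key),
                           new String(src, o, last - o, encoding));
            o = end;                         // next entry follows without gap
        }
        return dictionary;
    }

    public static void main(final String[] args) throws Exception {
        final byte[] data = {
            2, 0, 0, 0,                                  // nrEntries = 2
            2, 0, 0, 0,  4, 0, 0, 0,  'F', 'o', 'o', 0,  // PID 2 -> "Foo"
            3, 0, 0, 0,  3, 0, 0, 0,  'H', 'i', 0        // PID 3 -> "Hi"
        };
        System.out.println(readDictionary(data, 0, "cp1252"));
        // prints {2=Foo, 3=Hi}
    }
}
```

Note that the second entry in the sample data starts immediately after the first one's 0x00 terminator, which mirrors the "not aligned" rule.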
+ </section>
+
+
+
<section><title>References</title>
<p>In order to assemble the HPSF description I used information publically
information streams.
</li>
<li>
- Add codepage support: Presently the bytes making out the string in a
- property's value are interpreted using the platform's default character
- set.
- </li>
- <li>
- Add resource bundles to
- <code>org.apache.poi.hpsf.wellknown</code> to ease
- localizations. This would be useful for mapping standard property IDs to
- localized strings. Example: The property ID 4 could be mapped to "Author"
- in English or "Verfasser" in German.
+ Add resource bundles to
+ <code>org.apache.poi.hpsf.wellknown</code> to ease
+ localizations. This would be useful for mapping standard property IDs to
+ localized strings. Example: The property ID 4 could be mapped to "Author"
+ in English or "Verfasser" in German.
</li>
<li>
Implement reading functionality for those property types that are not
- yet supported. HPSF should return proper Java types instead of just byte
- arrays.
+ yet supported. HPSF should return proper Java types instead of just byte
+ arrays.
</li>
<li>
- Add WMF to <code>java.awt.Image</code> example code in <link
- href="thumbnails.html">Thumbnail HOW TO</link>.
+ Add WMF to <code>java.awt.Image</code> example code in the <link
+ href="thumbnails.html">Thumbnail HOW-TO</link>.
</li>
</ol>
</section>
* exists. However, since we have full control about directory
* creation we can ensure that this will never happen. */
ex.printStackTrace(System.err);
- throw new RuntimeException(ex);
+ throw new RuntimeException(ex.toString());
+ /* FIXME (2): Replace the previous line by the following once we
+ * no longer need JDK 1.3 compatibility. */
+ // throw new RuntimeException(ex);
}
}
}
* exists. However, since we have full control about directory
* creation we can ensure that this will never happen. */
ex.printStackTrace(System.err);
- throw new RuntimeException(ex);
+ throw new RuntimeException(ex.toString());
+ /* FIXME (2): Replace the previous line by the following once we
+ * no longer need JDK 1.3 compatibility. */
+ // throw new RuntimeException(ex);
}
}
}
* <p>Writes the property to an output stream.</p>
*
* @param out The output stream to write to.
+ * @param codepage The codepage to use for writing non-wide strings
* @return the number of bytes written to the stream
*
* @exception IOException if an I/O error occurs
* @exception WritingNotSupportedException if a variant type is to be
* written that is not yet supported
*/
- public int write(final OutputStream out)
+ public int write(final OutputStream out, final int codepage)
throws IOException, WritingNotSupportedException
{
int length = 0;
long variantType = getType();
length += TypeWriter.writeUIntToStream(out, variantType);
- length += VariantSupport.write(out, variantType, getValue());
+ length += VariantSupport.write(out, variantType, getValue(), codepage);
return length;
}
/* If the property ID is not equal 0 we write the property and all
* is fine. However, if it equals 0 we have to write the section's
- * dictionary which does not have a type but just a value. */
+ * dictionary which has an implicit type only and an explicit
+ * value. */
if (id != 0)
/* Write the property and update the position to the next
* property. */
- position += p.write(propertyStream);
+ position += p.write(propertyStream, getCodepage());
else
{
- final Integer codepage =
- (Integer) getProperty(PropertyIDMap.PID_CODEPAGE);
- if (codepage == null)
+ final int codepage = getCodepage();
+ if (codepage == -1)
throw new IllegalPropertySetDataException
("Codepage (property 1) is undefined.");
position += writeDictionary(propertyStream, dictionary);
*/
package org.apache.poi.hpsf;
+import java.io.UnsupportedEncodingException;
import java.util.HashMap;
import java.util.Map;
+import org.apache.poi.util.HexDump;
import org.apache.poi.util.LittleEndian;
/**
* @param length The property's type/value pair's length in bytes.
* @param codepage The section's and thus the property's
* codepage. It is needed only when reading string values.
+ *
+ * @exception UnsupportedEncodingException if the specified codepage is not
+ * supported
*/
public Property(final long id, final byte[] src, final long offset,
final int length, final int codepage)
+ throws UnsupportedEncodingException
{
this.id = id;
try
{
- value = VariantSupport.read(src, o, length, (int) type);
+ value = VariantSupport.read(src, o, length, (int) type, codepage);
}
catch (UnsupportedVariantTypeException ex)
{
b.append(getID());
b.append(", type: ");
b.append(getType());
+ final Object value = getValue();
b.append(", value: ");
- b.append(getValue());
+ b.append(value);
+ if (value instanceof String)
+ {
+ final String s = (String) value;
+ final int l = s.length();
+ final byte[] bytes = new byte[l * 2];
+ for (int i = 0; i < l; i++)
+ {
+ final char c = s.charAt(i);
+ final byte high = (byte) ((c & 0x00ff00) >> 8);
+ final byte low = (byte) ((c & 0x0000ff) >> 0);
+ bytes[i * 2] = high;
+ bytes[i * 2 + 1] = low;
+ }
+ final String hex = HexDump.dump(bytes, 0L, 0);
+ b.append(" [");
+ b.append(hex);
+ b.append("]");
+ }
b.append(']');
return b.toString();
}
import java.io.IOException;
import java.io.InputStream;
+import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;
* @param length The length of the stream data.
* @throws NoPropertySetStreamException if the byte array is not a
* property set stream.
+ *
+ * @exception UnsupportedEncodingException if the codepage is not supported
*/
public PropertySet(final byte[] stream, final int offset, final int length)
- throws NoPropertySetStreamException
+ throws NoPropertySetStreamException, UnsupportedEncodingException
{
if (isPropertySetStream(stream, offset, length))
init(stream, offset, length);
* complete byte array contents is the stream data.
* @throws NoPropertySetStreamException if the byte array is not a
* property set stream.
+ *
+ * @exception UnsupportedEncodingException if the codepage is not supported
*/
- public PropertySet(final byte[] stream) throws NoPropertySetStreamException
+ public PropertySet(final byte[] stream)
+ throws NoPropertySetStreamException, UnsupportedEncodingException
{
this(stream, 0, stream.length);
}
* @param length Length of the property set stream.
*/
private void init(final byte[] src, final int offset, final int length)
+ throws UnsupportedEncodingException
{
/* FIXME (3): Ensure that at most "length" bytes are read. */
final PropertySet ps = (PropertySet) o;
int byteOrder1 = ps.getByteOrder();
int byteOrder2 = getByteOrder();
- ClassID classId1 = ps.getClassID();
+ ClassID classID1 = ps.getClassID();
ClassID classID2 = getClassID();
int format1 = ps.getFormat();
int format2 = getFormat();
int sectionCount1 = ps.getSectionCount();
int sectionCount2 = getSectionCount();
if (byteOrder1 != byteOrder2 ||
- !classId1.equals(classID2) ||
+ !classID1.equals(classID2) ||
format1 != format2 ||
osVersion1 != osVersion2 ||
sectionCount1 != sectionCount2)
*/
package org.apache.poi.hpsf;
+import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
* @param src Contains the complete property set stream.
* @param offset The position in the stream that points to the
* section's format ID.
+ *
+ * @exception UnsupportedEncodingException if the section's codepage is not
+ * supported.
*/
public Section(final byte[] src, final int offset)
+ throws UnsupportedEncodingException
{
int o1 = offset;
return dictionary;
}
+
+
+ /**
+ * <p>Gets the section's codepage, if any.</p>
+ *
+ * @return The section's codepage if one is defined, else -1.
+ */
+ public int getCodepage()
+ {
+ final Integer codepage =
+ (Integer) getProperty(PropertyIDMap.PID_CODEPAGE);
+ return codepage != null ? codepage.intValue() : -1;
+ }
+
}
* @exception IOException if an I/O error occurs
*/
public static void writeToStream(final OutputStream out,
- final Property[] properties)
+ final Property[] properties,
+ final int codepage)
throws IOException, UnsupportedVariantTypeException
{
/* If there are no properties don't write anything. */
final Property p = (Property) properties[i];
long type = p.getType();
writeUIntToStream(out, type);
- VariantSupport.write(out, (int) type, p.getValue());
+ VariantSupport.write(out, (int) type, p.getValue(), codepage);
}
}
import java.io.IOException;
import java.io.OutputStream;
+import java.io.UnsupportedEncodingException;
import java.util.Date;
import java.util.LinkedList;
import java.util.List;
* @param length The length of the variant including the variant
* type field
* @param type The variant type to read
+ * @param codepage The codepage to use to read non-wide strings
* @return A Java object that corresponds best to the variant
* field. For example, a VT_I4 is returned as a {@link Long}, a
* VT_LPSTR as a {@link String}.
* @exception ReadingNotSupportedException if a property is to be written
* who's variant type HPSF does not yet support
+ * @exception UnsupportedEncodingException if the specified codepage is not
+ * supported
*
* @see Variant
*/
public static Object read(final byte[] src, final int offset,
- final int length, final long type)
- throws ReadingNotSupportedException
+ final int length, final long type,
+ final int codepage)
+ throws ReadingNotSupportedException, UnsupportedEncodingException
{
Object value;
int o1 = offset;
* Read a byte string. In Java it is represented as a
* String object. The 0x00 bytes at the end must be
* stripped.
- *
- * FIXME (2): Reading an 8-bit string should pay attention
- * to the codepage. Currently the byte making out the
- * property's value are interpreted according to the
- * platform's default character set.
*/
final int first = o1 + LittleEndian.INT_SIZE;
long last = first + LittleEndian.getUInt(src, o1) - 1;
o1 += LittleEndian.INT_SIZE;
+ final int rawLength = (int) (last - first + 1);
while (src[(int) last] == 0 && first <= last)
last--;
- value = new String(src, (int) first, (int) (last - first + 1));
+ final int l = (int) (last - first + 1);
+ value = codepage != -1 ?
+ new String(src, (int) first, l,
+ codepageToEncoding(codepage)) :
+ new String(src, (int) first, l);
break;
}
case Variant.VT_LPWSTR:
+ /**
+ * <p>Turns a codepage number into the equivalent character encoding's
+ * name.</p>
+ *
+ * @param codepage The codepage number
+ *
+ * @return The character encoding's name. If the codepage number is 1200,
+ * the encoding name is "UTF-16"; if it is 65001, the encoding name is
+ * "UTF-8". All other positive numbers are mapped to "cp" followed by the
+ * number, e.g. if the codepage number is 1252 the returned character
+ * encoding name will be "cp1252".
+ *
+ * @exception UnsupportedEncodingException if the specified codepage is
+ * less than or equal to zero.
+ */
+ public static String codepageToEncoding(final int codepage)
+ throws UnsupportedEncodingException
+ {
+ if (codepage <= 0)
+ throw new UnsupportedEncodingException
+ ("Codepage number may not be " + codepage);
+ switch (codepage)
+ {
+ case 1200:
+ return "UTF-16";
+ case 65001:
+ return "UTF-8";
+ default:
+ return "cp" + codepage;
+ }
+ }
+
+
/**
* <p>Writes a variant value to an output stream. This method ensures that
* always a multiple of 4 bytes is written.</p>
* @param out The stream to write the value to.
* @param type The variant's type.
* @param value The variant's value.
+ * @param codepage The codepage to use to write non-wide strings
* @return The number of entities that have been written. In many cases an
* "entity" is a byte but this is not always the case.
* @exception IOException if an I/O exceptions occurs
* who's variant type HPSF does not yet support
*/
public static int write(final OutputStream out, final long type,
- final Object value)
+ final Object value, final int codepage)
throws IOException, WritingNotSupportedException
{
int length = 0;
}
case Variant.VT_LPSTR:
{
- length = TypeWriter.writeUIntToStream
- (out, ((String) value).length() + 1);
- char[] s = Util.pad4((String) value);
- /* FIXME (2): The following line forces characters to bytes.
- * This is generally wrong and should only be done according to
- * a codepage. Alternatively Unicode could be written (see
- * Variant.VT_LPWSTR). */
- byte[] b = new byte[s.length + 1];
- for (int i = 0; i < s.length; i++)
- b[i] = (byte) s[i];
+ final byte[] bytes =
+ (codepage == -1 ?
+ ((String) value).getBytes() :
+ ((String) value).getBytes(codepageToEncoding(codepage)));
+ length = TypeWriter.writeUIntToStream(out, bytes.length + 1);
+ final byte[] b = new byte[bytes.length + 1];
+ System.arraycopy(bytes, 0, b, 0, bytes.length);
b[b.length - 1] = 0x00;
out.write(b);
length += b.length;
}
}
- /* Add 0x00 character to write a multiple of four bytes: */
- while (length % 4 != 0)
- {
- out.write(0);
- length++;
- }
+ /* Add 0x00 characters to write a multiple of four bytes: */
+ // FIXME (1) Try this!
+// while (length % 4 != 0)
+// {
+// out.write(0);
+// length++;
+// }
return length;
}
catch (Exception ex)
{
ex.printStackTrace();
- throw new RuntimeException(ex);
+ throw new RuntimeException(ex.toString());
+ /* FIXME (2): Replace the previous line by the following
+ * one once we no longer need JDK 1.3 compatibility. */
+ // throw new RuntimeException(ex);
}
}
},
public void testVariantTypes()
{
Throwable t = null;
+ final int codepage = -1;
+ /* FIXME (2): Add tests for various codepages! */
try
{
- check(Variant.VT_EMPTY, null);
- check(Variant.VT_BOOL, new Boolean(true));
- check(Variant.VT_BOOL, new Boolean(false));
- check(Variant.VT_CF, new byte[]{0});
- check(Variant.VT_CF, new byte[]{0, 1});
- check(Variant.VT_CF, new byte[]{0, 1, 2});
- check(Variant.VT_CF, new byte[]{0, 1, 2, 3});
- check(Variant.VT_CF, new byte[]{0, 1, 2, 3, 4});
- check(Variant.VT_CF, new byte[]{0, 1, 2, 3, 4, 5});
- check(Variant.VT_CF, new byte[]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10});
- check(Variant.VT_I2, new Integer(27));
- check(Variant.VT_I4, new Long(28));
- check(Variant.VT_FILETIME, new Date());
- check(Variant.VT_LPSTR, "");
- check(Variant.VT_LPSTR, "ä");
- check(Variant.VT_LPSTR, "äö");
- check(Variant.VT_LPSTR, "äöü");
- check(Variant.VT_LPSTR, "äöüÄ");
- check(Variant.VT_LPSTR, "äöüÄÖ");
- check(Variant.VT_LPSTR, "äöüÄÖÜ");
- check(Variant.VT_LPSTR, "äöüÄÖÜß");
- check(Variant.VT_LPWSTR, "");
- check(Variant.VT_LPWSTR, "ä");
- check(Variant.VT_LPWSTR, "äö");
- check(Variant.VT_LPWSTR, "äöü");
- check(Variant.VT_LPWSTR, "äöüÄ");
- check(Variant.VT_LPWSTR, "äöüÄÖ");
- check(Variant.VT_LPWSTR, "äöüÄÖÜ");
- check(Variant.VT_LPWSTR, "äöüÄÖÜß");
+ check(Variant.VT_EMPTY, null, codepage);
+ check(Variant.VT_BOOL, new Boolean(true), codepage);
+ check(Variant.VT_BOOL, new Boolean(false), codepage);
+ check(Variant.VT_CF, new byte[]{0}, codepage);
+ check(Variant.VT_CF, new byte[]{0, 1}, codepage);
+ check(Variant.VT_CF, new byte[]{0, 1, 2}, codepage);
+ check(Variant.VT_CF, new byte[]{0, 1, 2, 3}, codepage);
+ check(Variant.VT_CF, new byte[]{0, 1, 2, 3, 4}, codepage);
+ check(Variant.VT_CF, new byte[]{0, 1, 2, 3, 4, 5}, codepage);
+ check(Variant.VT_CF, new byte[]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
+ codepage);
+ check(Variant.VT_I2, new Integer(27), codepage);
+ check(Variant.VT_I4, new Long(28), codepage);
+ check(Variant.VT_FILETIME, new Date(), codepage);
+ check(Variant.VT_LPSTR, "", codepage);
+ check(Variant.VT_LPSTR, "ä", codepage);
+ check(Variant.VT_LPSTR, "äö", codepage);
+ check(Variant.VT_LPSTR, "äöü", codepage);
+ check(Variant.VT_LPSTR, "äöüÄ", codepage);
+ check(Variant.VT_LPSTR, "äöüÄÖ", codepage);
+ check(Variant.VT_LPSTR, "äöüÄÖÜ", codepage);
+ check(Variant.VT_LPSTR, "äöüÄÖÜß", codepage);
+ check(Variant.VT_LPWSTR, "", codepage);
+ check(Variant.VT_LPWSTR, "ä", codepage);
+ check(Variant.VT_LPWSTR, "äö", codepage);
+ check(Variant.VT_LPWSTR, "äöü", codepage);
+ check(Variant.VT_LPWSTR, "äöüÄ", codepage);
+ check(Variant.VT_LPWSTR, "äöüÄÖ", codepage);
+ check(Variant.VT_LPWSTR, "äöüÄÖÜ", codepage);
+ check(Variant.VT_LPWSTR, "äöüÄÖÜß", codepage);
}
catch (Exception ex)
{
* @throws UnsupportedVariantTypeException if the variant is not supported.
* @throws IOException if an I/O exception occurs.
*/
- private void check(final long variantType, final Object value)
+ private void check(final long variantType, final Object value,
+ final int codepage)
throws UnsupportedVariantTypeException, IOException
{
final ByteArrayOutputStream out = new ByteArrayOutputStream();
- VariantSupport.write(out, variantType, value);
+ VariantSupport.write(out, variantType, value, codepage);
out.close();
final byte[] b = out.toByteArray();
final Object objRead =
VariantSupport.read(b, 0, b.length + LittleEndian.INT_SIZE,
- variantType);
+ variantType, codepage);
if (objRead instanceof byte[])
{
- final int diff = diff(org.apache.poi.hpsf.Util.pad4
- ((byte[]) value), (byte[]) objRead);
+// final int diff = diff(org.apache.poi.hpsf.Util.pad4
+// ((byte[]) value), (byte[]) objRead);
+ final int diff = diff((byte[]) value, (byte[]) objRead);
if (diff >= 0)
fail("Byte arrays are different. First different byte is at " +
"index " + diff + ".");