Garbage Characters in XML

Polar’s GPS often takes time to establish a satellite connection and retrieve a position. After export, a “strange character” sits in the time node, causing jQuery’s XML parser to choke and die.

gpx

There are options to filter unwanted characters in preprocessing using sed and perl.

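# BSD/macOS sed syntax; GNU sed wants the suffix attached, as in -i.orig
sed -i '.orig' 's/[^[:print:]]//g' 13011601.gpx
# -l chomps each line first, so the filter does not strip the newlines too
perl -i.bak -lpe 's/[^[:print:]]//g' 13011601.gpx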

I don’t want a general filter that could have other side effects. I want to target the character that’s causing me problems. A helpful Stack Overflow post suggested that I revisit Joel Spolsky’s article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. I don’t often mess with character encodings, so I wasn’t positive where to begin.
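To make the side-effect concern concrete, here is a minimal sketch comparing the general filter with a targeted one. The `sample` helper and its contents are made up for illustration; the point is only that a tab is also a non-printable character, so the general filter removes it along with the problem byte.

```shell
# Hypothetical sample line: 'a', TAB, 'b', RS (octal 036), 'c'.
sample() { printf 'a\tb\036c\n'; }

# General filter: drops every non-printable byte, so the tab goes too.
sample | LC_ALL=C sed 's/[^[:print:]]//g' | od -c

# Targeted filter: drops only the record separator and keeps the tab.
sample | tr -d '\036' | od -c
```

Piping through `od -c` just makes the surviving bytes visible.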

I dump the hex value of every byte in the file and search for “30”, the hex value of the character zero, to find the 0.0000000 coordinate readings for the “trkpt” node.

hexdump 13011601.gpx

less
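Scrolling through a hex dump works, but once the byte is suspected it can also be located directly. The following is a rough sketch, assuming the offending byte is the record separator (hex 1e); `make_sample` is a hypothetical stand-in for the exported file, and `od` is used in place of `hexdump` only because it is available on any POSIX system.

```shell
# Hypothetical stand-in for the exported GPX file.
make_sample() {
  printf '<trkpt lat="0.000000000" lon="0.000000000">\n<time>\036</time>\n</trkpt>\n'
}

# The RS byte itself, produced by printf so the script stays portable.
rs=$(printf '\036')

# Print the numbers of the lines that contain the byte.
make_sample | LC_ALL=C grep -n "$rs" | cut -d: -f1

# Dump just those lines in hex; the RS byte shows up as 1e.
make_sample | LC_ALL=C grep "$rs" | od -A x -t x1
```

With a real file, the same `grep -n "$rs" 13011601.gpx` call would list every affected line.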

Just below is the hex value 1e, the ASCII control character for a record separator (RS). It’s probably no coincidence that the decimal value of the record separator, 30, is the same as the hex value for a zero.

Table
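The number juggling is easy to check from the shell. This small sketch converts hex 1e to decimal and octal, and prints the hex value of the character "0"; it uses only `printf`, `od`, and `tr`.

```shell
# The record separator, hex 1e, in decimal and in octal.
printf '%d\n' 0x1e   # prints 30
printf '%o\n' 0x1e   # prints 36, which is why tr takes '\036'

# The ASCII digit "0" as a hex byte.
printf '0' | od -An -tx1 | tr -d ' \n'; echo   # prints 30
```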

Now I can target the record separator and remove it with tr. The octal escape '\036' is the same byte as hex 1e.

$ tr -d '\036' < 13011601.gpx | head -n15
<?xml version="1.0" encoding="UTF-8"?>
<gpx
version="1.0"
creator="Polar ProTrainer 5 - www.polar.fi"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.topografix.com/GPX/1/0"
xsi:schemaLocation="http://www.topografix.com/GPX/1/0 http://www.topografix.com/GPX/1/0/gpx.xsd">
<time>2013-01-16T07:25:50Z</time>
<trk>
<trkseg>
<trkpt lat="0.000000000" lon="0.000000000">
<time></time>
 <fix>none</fix>
 <sat>0</sat>
</trkpt>
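The command above only prints to the terminal. A possible next step, sketched here under the assumption that the cleaned data should go to a separate file, is to write the output and confirm that no record separators remain. The `clean_rs` helper and the temporary files are hypothetical names for the example.

```shell
# Hypothetical helper: copy $1 to $2 with every RS byte removed.
clean_rs() { tr -d '\036' < "$1" > "$2"; }

# Demo on a throwaway file standing in for the GPX export.
src=$(mktemp)
dst=$(mktemp)
printf '<time>\036</time>\n' > "$src"
clean_rs "$src" "$dst"

# Count the RS bytes left in the result; 0 means the file is clean.
tr -cd '\036' < "$dst" | wc -c

cat "$dst"
rm -f "$src" "$dst"
```

Writing to a new file rather than editing in place keeps the original export around in case something else goes wrong.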
