Garbage Characters in XML
Polar’s GPS often takes time to establish a satellite connection and retrieve a position. After export, a “strange character” sits in the time node, causing jQuery’s XML parser to choke and die.
There are options to filter unwanted characters in preprocessing using tools like sed and awk; the two one-liners here use sed and perl.
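# BSD/macOS sed: edit in place with a .orig backup; the g flag strips every match on a line
sed -i '.orig' 's/[^[:print:]]//g' 13011601.gpx
# perl: -l chomps each line first, so the newline itself is not stripped along with the rest
perl -i.bak -lpe 's/[^[:print:]]//g' 13011601.gpx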
I don’t want a general filter that could have other side effects. I want to target the character that’s causing me problems. A helpful Stack Overflow post suggested that I revisit Joel Spolsky’s article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. I don’t often mess with character encodings, so I’m not positive where to begin.
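To make the side-effect worry concrete, here is a small sketch of what a blanket [^[:print:]] filter does to a line that contains nothing wrong at all. The file sample.gpx and its tab-indented contents are made up for illustration; they are not taken from the Polar export.

```shell
# Hypothetical stand-in file: one tab-indented node, no bad characters in it.
printf '\t<name>run</name>\n' > sample.gpx

# The general filter treats the tab as non-printable and removes it too.
filtered=$(sed 's/[^[:print:]]//g' sample.gpx)
printf '[%s]\n' "$filtered"   # [<name>run</name>]
```

Losing indentation is harmless here, but it shows the filter touching bytes that were never the problem.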
I dump the hex value of each character and search for “30”, the hex value of the character zero, to find the 0.0000000 coordinate readings in the “trkpt” node.
hexdump 13011601.gpx
Just below is the hex value 1e, which is the ASCII control character for a record separator. It’s probably no coincidence that the decimal value of a record separator (30) is the same as the hex value for a zero (0x30).
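The number relationships can be checked with the shell’s printf. This is only an illustrative sketch, not part of the original session; the variable names rs and zero are made up here.

```shell
# Record separator: hex 1e shown as decimal, then as octal.
rs=$(printf '%d %o' 0x1e 0x1e)
echo "$rs"      # 30 36

# The character zero: a leading quote makes printf use the character's byte value.
zero=$(printf '%x' "'0")
echo "$zero"    # 30
```

The octal form matters because byte escapes in many command-line tools are written in octal.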
Now I can target the record separator directly, using tr to delete the bothersome character. The escape \036 is the same byte, hex 1e, written in octal.
$ tr -d '\036' < 13011601.gpx | head -n15
<?xml version="1.0" encoding="UTF-8"?>
<gpx
version="1.0"
creator="Polar ProTrainer 5 - www.polar.fi"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.topografix.com/GPX/1/0"
xsi:schemaLocation="http://www.topografix.com/GPX/1/0 http://www.topografix.com/GPX/1/0/gpx.xsd">
<time>2013-01-16T07:25:50Z</time>
<trk>
<trkseg>
<trkpt lat="0.000000000" lon="0.000000000">
<time></time>
<fix>none</fix>
<sat>0</sat>
</trkpt>
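One way to confirm the cleanup is to count the record separators before and after. The sketch below runs on a small stand-in file rather than the real export; sample.gpx and clean.gpx are hypothetical names, and the sample content only mimics the track point above.

```shell
# Hypothetical stand-in: a track point whose time node holds one record separator.
printf '<trkpt lat="0.000000000" lon="0.000000000">\n<time>\036</time>\n</trkpt>\n' > sample.gpx

# tr -cd keeps only the listed byte, so wc -c counts its occurrences.
# The $(( )) wrapper strips the padding some wc implementations add.
before=$(( $(tr -cd '\036' < sample.gpx | wc -c) ))

# Write the cleaned copy to a new file instead of overwriting the input.
tr -d '\036' < sample.gpx > clean.gpx
after=$(( $(tr -cd '\036' < clean.gpx | wc -c) ))

echo "before=$before after=$after"   # before=1 after=0
```

Writing to a separate file keeps the original export around in case the parser still complains about something else.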