[LRUG] UTF8 errors parsing mail file

Najaf Ali ali at happybearsoftware.com
Thu Aug 22 09:31:00 PDT 2013


> You could try tinkering with Encoding.default_external – maybe set it to
> Encoding::ASCII_8BIT and see if that helps.

Theory: a plain ASCII file is already valid UTF-8, so if Ruby is
complaining there must be some weird non-ASCII bytes in the file. Doing the
above will leave the weird bytes in, but Ruby will treat them as raw bytes
instead of trying to read them as UTF-8, and they'll show up as whatever
your terminal makes of the individual bytes (their ISO-8859-1 characters,
say). So for example, if the weird bytes are a UTF-16 byte order mark, you
might see "þÿ" in the output. I need to get out and spend more time with
nature.
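
To make the theory concrete, here is a minimal sketch. The sample line and
the helper names are made up for illustration (they are not from gvim's
mail files), and it tags the strings directly rather than touching
Encoding.default_external, which I'm assuming has the same effect on what
File#gets hands back:

```ruby
# Hypothetical sample: a mail line containing a stray ISO-8859-1 byte (0xE9).
SAMPLE = "Paul says caf\xE9\n".b # String#b returns an ASCII-8BIT (binary) copy

# Tagged as UTF-8 the bytes are invalid, so the regex match raises.
# Returns the error message, or nil if nothing went wrong.
def utf8_match_error(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).match(/^\w{4}\b/)
  nil
rescue ArgumentError => e
  e.message
end

# Tagged as binary, the same bytes can be matched without an error.
# Returns the match offset, or nil if the line does not match.
def binary_match(bytes)
  bytes.b =~ /^\w{4}\b/
end

# View raw bytes as ISO-8859-1 and transcode to UTF-8 for display.
def as_latin1(bytes)
  bytes.b.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

puts utf8_match_error(SAMPLE).inspect # an error message, not nil
puts binary_match(SAMPLE).inspect     # the offset of the match
puts as_latin1("\xFE\xFF")            # a UTF-16 BOM seen as Latin-1
```

If that holds, opening each file in binary mode, e.g. File.open(file, 'rb'),
should get the original script through the whole Maildir without changing
the global default.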


On Thu, Aug 22, 2013 at 5:05 PM, Leo Cassarani <leonardo.cassarani at gmail.com> wrote:

> Ruby 2.0 has changed the default encoding of Ruby files to be UTF-8. So
> the magic comment you suggested should be implicit in all files, unless
> specified otherwise.
>
> It sounds like, because your script is running under UTF-8 encoding, it's
> treating the strings it reads as UTF-8, and throwing a fit when it finds
> bytes that aren't valid UTF-8.
>
> You could try tinkering with Encoding.default_external – maybe set it to
> Encoding::ASCII_8BIT and see if that helps.
>
> Leo
>
>
> On 22 Aug 2013, at 16:24, George Drummond <drummond at rentify.com> wrote:
>
> Try running it with the magic UTF-8 comment at the top of the file
>
> # encoding: UTF-8
>
>
>
> On 22 Aug 2013, at 16:21, gvim <gvimrc at gmail.com> wrote:
>
> I'm encountering some UTF-8 errors in Ruby 2.0. When installing gems I
> often see non-fatal errors relating to conversion of ASCII characters to
> UTF-8. The following script is designed to search a large Maildir folder
> for lines beginning with 4 word characters:
>
> ---------------------------------------------------------
> dir = 'my/maildir/path'
> Dir.chdir(dir)
>
> Dir.foreach(dir) do |file|
>  next unless file =~ /^\d{4}/
>  print "\n\n************* Opening #{file} *************\n"
>  fh = File.open(file)
>  while fh.gets do
>    print if $_ =~ /^\w{4}\b/
>  end
>  fh.close
> end
>
> -------------------------------------------------------------
>
>
> After successfully scanning 7 email files it dies with a UTF-8 error:
>
>
> ************* Opening 1270516984.M407293P18051.mac,S=1601,W=1645:2,Sb *************
> Paul
> ./1.rb:13:in `block in <main>': invalid byte sequence in UTF-8 (ArgumentError)
>  from ./1.rb:8:in `foreach'
>  from ./1.rb:8:in `<main>'
>
>
> The equivalent Perl script parses the whole directory without any errors:
>
> ------------------------------------------------------------
> use 5.016;
> use autodie;
>
> my $dir = 'my/mail/path';
> chdir $dir;
> opendir my $dh, $dir;
>
> while (readdir $dh) {
>  next unless /^\d{4}/;
>  open my $fh, '<', $_;
>  say "\n\n************* Opening $_ *************";
>  while (<$fh>) {
>    chomp;
>    say if /^\w{4}\s/;
>  }
>  close $fh;
> }
> closedir $dh;
>
> -------------------------------------------------------------
>
> gvim
> _______________________________________________
> Chat mailing list
> Chat at lists.lrug.org
> http://lists.lrug.org/listinfo.cgi/chat-lrug.org


-- 
Ali, http://happybearsoftware.com