NAME
Lingua::EN::AddressParse - extract components of a street address from free format text

SYNOPSIS
    use Lingua::EN::AddressParse;

    my %args = (
        country                     => 'US',
        auto_clean                  => 1,
        force_case                  => 1,
        abbreviate_subcountry       => 0,
        abbreviated_subcountry_only => 0,
        force_post_code             => 0
    );

    my $address = Lingua::EN::AddressParse->new(%args);
    $error = $address->parse("40 1/2 N OLD MASSACHUSETTS AVE APT 3B Washington Valley Washington 98100: HOLD MAIL");

    print $address->report;

    Country address format    'US'
    Address type              'suburban'
    Non matching part         'HOLD MAIL '
    Error                     '1'
    Error descriptions        'non matching section : HOLD MAIL '
    Warning                   '1'
    Warning description       ''
    Case all                  '40 1/2 N Old Massachusetts Ave Apt 3B Washington Valley WA 98100'
    COMPONENTS                ''
    base_street_name          'Old Massachusetts'
    post_code                 '98100'
    property_identifier       '40 1/2'
    street_direction_prefix   'N'
    street_name               'N Old Massachusetts'
    street_type               'Ave'
    sub_property_identifier   '3B'
    sub_property_type         'Apt'
    subcountry                'WASHINGTON'
    suburb                    'Washington Valley'

    %address_components = $address->components;
    print $address_components{sub_property_type};        # APT
    print $address_components{sub_property_identifier};  # 3B
    print $address_components{property_identifier};      # 40 1/2

    %address_properties = $address->properties;
    print $address_properties{type};          # suburban
    print $address_properties{non_matching};  # : HOLD MAIL

    $correct_casing = $address->case_all;

DESCRIPTION
This module takes as input a suburban, rural or postal address in free format text such as,

    3080 28TH AVE N ST PETERSBURG, FL 33713-3810
    12 1st Avenue N Suite # 2 Somewhere CA 12345 USA
    C/O JOHN, KENNETH JR POA 744 WIND RIVER DR SYLVANIA, OH 43560-4317
    9 Church Street, Abertillery, Mid Glamorgan NP13 1DA
    27 Bury Street, Abingdon, Oxfordshire OX14 3QT
    2A O'CONNELL ST KEW NSW 2123
    12/3-5 AUBREY ST MOUNT VICTORIA VICTORIA 3133
    "OLD REGRET" WENTWORTH FALLS NSW 2782 AUSTRALIA
    GPO Box K318, HAYMARKET, NSW 2000

and attempts to parse it.
If successful, the address is broken down into its components and useful functions can be performed, such as:

    converting upper or lower case values to title case (2A O'Connell St Kew NSW 2123)
    extracting the address's individual components (2A,O'Connell,St,KEW,NSW,2123)
    determining the type of format the address is in ('suburban')

If the address cannot be parsed, you have the option of cleaning the address of bad characters, or extracting any portion that was parsed and the portion that failed.

This module can be used for analysing and improving the quality of lists of residential and postal addresses.

By using a large combination of regular expressions with look ahead analysis, patterns can be parsed that confuse many other parsers. Examples are:

    Street names with several street types: Lane Cove Road
    Suburbs which include street types: Smith Road St Marys
    Suburbs that include state names: Fort Washington Washington

DEFINITIONS
The following terms are used by AddressParse to define the components that can make up an address.

    Pre cursor              : C/O MR A Smith...
    Sub property identifier : Level 1A Unit 2, Apartment B, Lot 12, Suite # 12 ...
    Property Identifier     : 12/66A, 24-34, 2A, 23B/12C, 12/42-44, 2.5
    Property name           : "Old Regret"
    Post Box                : GPO Box K123, LPO 2345, RMS 23 ...
    Road Box                : RMB 24A, RMS 234 ...
    Street Direction        : North, SE, Sth. etc
    Street name             : O'Hare, New South Head, The Causeway, Broadway
    Street type             : Road, Rd., St, Lane, Highway, Crescent, Circuit ...
    Suburb                  : Dee Why, St. John's Wood ...
    Sub country             : NSW, New South Wales, ACT, NY, New Jersey, AZ ...
    Post (zip) code         : 2062, 34532-1234, SG12A 9ET
    Country                 : Australia, UK, US or Canada

The main address formats currently supported are as follows (a ? means the component is optional):

    'suburban' : sub_property(?) property_identifier(?) street street_type suburb subcountry post_code(?) country(?)

    OR for the USA

    'suburban' : property_identifier(?) street street_type sub_property(?) suburb subcountry post_code(?) country(?)
    'rural'    : property_name suburb subcountry post_code(?) country(?)
    'post_box' : post_box suburb subcountry post_code(?) country(?)
    'road_box' : road_box street street_type suburb subcountry post_code(?) country(?)
    'road_box' : road_box suburb subcountry post_code(?) country(?)

Note that suburb and subcountry are not optional. The accuracy of the parser is improved by providing as much context as possible. Providing a suburb can help to identify street names that would otherwise be ambiguous.

For the case where you only have a street address, dummy (but still valid) values can be used for suburb (such as 'Somewhere') and sub country (such as 'NY'). These dummy values will be parsed but can be ignored.

All formats may contain a precursor.

Refer to the component grammar defined in the Lingua::EN::AddressParse::Grammar module for a complete list of combinations.

METHODS
new
The "new" method creates an instance of an address object and sets up the grammar used to parse addresses. This must be called before any of the following methods are invoked. Note that the object only needs to be created once, and can be reused with new input data.

Various setup options may be defined in a hash that is passed as an optional argument to the "new" method.

    my %args = (
        country                     => 'US',
        auto_clean                  => 1,
        force_case                  => 1,
        abbreviate_subcountry       => 1,
        abbreviated_subcountry_only => 1,
        force_post_code             => 1
    );

    my $address = Lingua::EN::AddressParse->new(%args);
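
As a rough sketch of two points made above (creating the object once and reusing it, and padding a street-only address with dummy suburb and sub country values), the following may help. The helper pad_street_only and the sample street strings are hypothetical and not part of the module; the option values are copied from the SYNOPSIS. No output is shown because the result depends on the installed grammar.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::EN::AddressParse): append the
# dummy suburb 'Somewhere' and sub country 'NY' to a street-only address.
sub pad_street_only {
    my ($street) = @_;
    return "$street Somewhere NY";
}

# Sample street-only inputs, invented for illustration.
my @streets = ( '12 Main St', '40 Lane Cove Road' );
my @padded  = map { pad_street_only($_) } @streets;

# Only attempt parsing when the module is installed.
if ( eval { require Lingua::EN::AddressParse; 1 } ) {
    my %args = (
        country                     => 'US',
        auto_clean                  => 1,
        force_case                  => 1,
        abbreviate_subcountry       => 0,
        abbreviated_subcountry_only => 0,
        force_post_code             => 0
    );

    # Create the parser once and reuse it for every input string.
    my $address = Lingua::EN::AddressParse->new(%args);

    for my $input (@padded) {
        my $error      = $address->parse($input);
        my %components = $address->components;
        my %properties = $address->properties;

        # suburb and subcountry hold the dummy values and can be ignored.
        print "$input\n";
        print "  error flag  : $error\n";
        print "  type        : $properties{type}\n";
        print "  street_name : $components{street_name}\n";
        print "  street_type : $components{street_type}\n";
    }
}
else {
    # Without the module, just show the padded strings.
    print "$_\n" for @padded;
}
```

Building the object inside the loop would also work, but it repeats the grammar setup for every address, which the note on reuse above suggests avoiding.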

parse
    $error = $address->parse("12/3-5 AUBREY ST VERMONT VIC 3133");

The "parse" method takes a single parameter of a text string containing an address. It attempts to parse the address and break it down into the components described below. If the address is parsed successfully, a 0 is returned, otherwise a 1.

Note that you can successfully parse all the components of an address and still have an error returned. This occurs when you have non matching data following a valid address. To check if the data is unusable, you also need to use the "properties" method to check whether the address type is 'unknown'.

This method is a prerequisite for all the following methods.

components
    %address = $address->components($upper_case_all);
    $suburb = $address{suburb};

If the optional argument $upper_case_all is set to a positive value, all components are converted to upper case.

The "components" method returns all the address components in a hash. The following keys are used for each component:

    pre_cursor - such as 'C/O Mr A Smith'
    po_box_type - such as 'Private Boxes'
    post_box
    road_box
    sub_property_type
    sub_property_identifier
    property_identifier
    property_name
    level - such as 12th Floor
    building - such as Tower A
    street_direction_prefix (such as East, NW, North etc)
    base_street_name (the name with direction removed, such as "Main" in "East Main St")
    street_name (the full street name such as "East Main")
    street_type
    street_direction_suffix (US only, abbreviated only such as N, SE etc)
    suburb
    subcountry
    post_code
    country

If a component has no matching data for a given address, its value will be set to the empty string.

Each component is converted to title case, meaning the first letter of each component is set to capitals and the remainder to lower case.
Proper name capitalisations such as MacNay and O'Brien are observed.

The following components are not converted to title case:

    post_box
    road_box
    subcountry
    post_code
    country
    street_direction_suffix

If your input data is all upper case and you want to retain that format for parsed data, you will need to apply the 'uc' function to each component.

case_all
    $correct_casing = $address->case_all;

The "case_all" method does the same thing as the "components" method except the entire address is returned as a title cased text string.

If the force_case option was set in the "new" method above, the entire input string is title cased, including any unmatched sections after a recognisable address that failed parsing. This option is useful when you know you have invalid data, but you still want to title case what you have.

properties
The "properties" method returns several properties of the address as a hash. The following keys are used for each property:

    type - either suburban, rural, post_box, road_box or unknown
    non_matching - any trailing string not part of the address

Additional properties can be accessed with the following:

    $address->{original_input}
    $address->{input_string} - string after auto_clean option has been applied
    $address->{country_code} - abbreviated Country address format (as defined in the "new" method)
    $address->{error} - error flag, 0 = good, 1 = error
    $address->{error_desc} - text to describe the type of parsing error
    $address->{warning} - warning flag, 0 = good, 1 = warning
    $address->{warning_desc} - text to describe the type of parsing warning(s)

Warnings mean that the address has parsed but there may still be errors within its components.

report
Create a formatted text report of:

    the input string
    the cleaned input string
    the country type
    the address type
    any non matching part of input string
    if any parsing errors occurred
    error description
    if any parsing warning occurred
    warning description
    the name and value of each defined component

Returns a string containing a multi line formatted text report.

DEPENDENCIES
Lingua::EN::NameParse, Locale::SubCountry, Parse::RecDescent

BUGS

LIMITATIONS
Streets such as 'The Esplanade' will return a street of 'The Esplanade' and a street type of null string.

The abbreviation 'St' can be interpreted as either street or Saint. This leads to ambiguities such as '12 East St Thomas Lane'. This could be 'East Street', suburb of 'Thomas Lane' or 'East St Thomas Lane'. As the first pattern is the more common, that is what will match.

For US addresses, an ambiguity arises between a street directional suffix and a suburb directional prefix, such as '12 Main St S Springfield CA 92345'. Is it South Main St, or South Springfield? The parser assumes that 'S' belongs to the street description.

The huge number of character combinations that can form a valid address makes it impossible to correctly identify them all.

Valid addresses must contain: property address, suburb, subcountry (aka state) in that order. This format is widely accepted in Australia and the US. UK addresses will often include suburb, town, city and county, formats that are very difficult to parse.

Property names must be enclosed in single or double quotes like "Old Regret".

Because of the large combination of possible addresses defined in the grammar, the program is not very fast.

REFERENCES
"The Wordsworth Dictionary of Abbreviations & Acronyms" (1997)

Australian Standard AS4212-1994 "Geographic Information Systems - Data Dictionary for transfer of street addressing information"

ISO 3166-2:1998, Codes for the representation of names of countries and their subdivisions. Also released as AS/NZS 2632.2:1999

SEE ALSO
AddressParse is designed to identify properties, which have a unique physical location.
Geo::StreetAddress::US will also parse addresses for the USA, and can handle locations defined by street intersections, such as:

    "Hollywood & Vine, Los Angeles, CA"
    "Mission Street at Valencia Street, San Francisco, CA"

Lingua::EN::NameParse, Geo::StreetAddress::US, Parse::RecDescent, Locale::SubCountry

See <http://www.upu.int/post_code/en/postal_addressing_systems_member_countries.shtml> for a list of different addressing formats from around the world. And also <http://www.bitboost.com/ref/international-address-formats.html>

REPOSITORY
<https://github.com/kimryan/Lingua-EN-AddressParse>

TO DO
Define grammar for other languages. Hopefully, all that would be needed is to specify a new module with its own grammar, and inherit all the existing methods. I don't have the knowledge of the naming conventions for non-English languages.

AUTHOR
AddressParse was written by Kim Ryan <kimryan at cpan d o t org>

COPYRIGHT AND LICENSE
Copyright (c) 2018 Kim Ryan. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.