Google results linking to www.domainname.com/wb/admin/media/basic_header.html

chio

OK,
what is the URL of the homepage?
1) www.domain.de
or
2) www.domain.de/index.php
?

Correct answer: 1)
A normal WB installation never links to www.domain.de/index.php anywhere; it always links to www.domain.de

www.domain.de == /
That is also where the bot starts (necessarily; without deep links it does not know anything else).
But now it is not allowed to, because / is blocked.
So it will not even take an interest in /index.php, let alone /pages/ - how is it supposed to know those pages exist?

Sure, you can argue: even if the homepage is blocked via robots.txt, the spider might still crawl it anyway, and then it would find the links - and then it would also find the pages.

I wonder, though, whether Google actually says the same.

And what about the source you quoted?
http://janeandrobot.com/robots.txt
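
For later readers, a hedged aside: Google's own robots.txt documentation (not anything confirmed in this thread) resolves Allow/Disallow conflicts by the most specific, i.e. longest, matching path; other crawlers may behave differently. Annotated against the rules in question, that would mean:


User-agent: *
Disallow: /          # matches every URL; matching path length is 1
Allow: /index.php    # longer match, so for Google it wins for /index.php
Allow: /pages/       # likewise wins for anything under /pages/
# "/" itself matches only "Disallow: /", so the homepage stays blocked
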

kweitzel

Well Chio, I am not quite sure where your knowledge comes from, or how old it is ...

The rule block I set up does the following:


User-agent: * # applies to ALL user agents
Disallow: / # to begin with, nothing at all may be crawled, except:
Allow: /index.php # the index.php page
Allow: /pages/ # the pages folder


Maybe you should also read up here: http://janeandrobot.com/post/Managing-Robots-Access-To-Your-Website.aspx

The good crawlers have one trait ... they read these rules and stick to them, all of them.

Regards

Klaus

chio

Because the homepage is not called index.php (hence the parentheses)
but: /
If you block the root directory via robots.txt, the crawler never even gets onto the domain, no matter what is opened up after it. Where is it supposed to start?
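
If the goal is to open up the homepage itself, one possible sketch relies on the $ end-of-URL anchor - a Google/Bing extension, not part of the original robots.txt standard, so treat it as an assumption:


User-agent: *
Disallow: /
Allow: /$            # exactly the bare homepage URL "/", nothing more
Allow: /index.php
Allow: /pages/
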

kweitzel

Yes, you are right ... but then why don't you simply write the addition:
Allow: /index.php

That might have been helpful ...

Regards

Klaus


kweitzel

That is what we want to achieve ... the crawling allow list is in part 2 ...

Allow: /pages/

cheers

Klaus

chio

Quote:
User-agent: *
Disallow: /
Allow: /pages/

Whoa - and what about the homepage?
If you do this, your whole site isn't crawled if you have no deep links.

kweitzel

Guys,

I would actually disallow everything for everybody and then, as a next step, allow the pages directory and, if wanted, selected other directories. This way you do not give away your whole folder structure.


User-agent: *
Disallow: /


Then you open the pages directory, since you want to have the pages crawled


Allow: /pages/


Secure it the way host systems used to be secured ... close everything, then open only the required folders for indexing.

cheers

Klaus

chio

In some cases the apache "index of" pages might cause the problem.
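
If directory listings are indeed the culprit, a minimal .htaccess sketch for Apache (assuming the host permits Options overrides in .htaccess):


# .htaccess in the web root: turn off automatic "Index of" listings
Options -Indexes
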

I always disallow "modules" and "admin" in robots.txt. Of course I know a "bad guy" can read that too, but it is more dangerous when a bad guy can simply use a Google search to find security holes.
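
A minimal sketch of those two rules (the folder names assume a default WB layout):


User-agent: *
Disallow: /admin/
Disallow: /modules/
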

Ruud

The serious search engines will respect robots.txt, but they are not your worry.
The bad guys will use your robots.txt to find the places where you don't want the search engines to go.
Scanning for /admin will (on most WB sites) directly tell them they are dealing with a WB site.
You can be sure the admin structure is already known to anyone serious about hacking.

Protecting the admin area with a .htaccess is something you could do, but then you will need to log on twice every time you want to do something over there. The admin area is already protected at the PHP level.
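
For reference, a minimal sketch of that extra .htaccess login - the file path and realm name are placeholders, not taken from any WB documentation:


# .htaccess inside /admin/: HTTP basic auth in front of the WB login
AuthType Basic
AuthName "Restricted admin area"
# the password file is created with: htpasswd -c /path/to/.htpasswd someuser
AuthUserFile /path/to/.htpasswd
Require valid-user
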

There is nothing wrong with the HTML templates or images being accessible. They will not do anything anyway.
Personally I am a pretty paranoid guy, but as long as no vulnerabilities are popping up on sites like http://secunia.com/advisories/23828/, I don't worry too much.

As I explained in my previous post, pages in the admin area are not indexed by Google at all, unless you go to the effort of asking Google to do so.

Ruud

[url=https://dev4me.com/modules-snippets/]Dev4me - WebsiteBaker modules[/url] - [url=https://wbhelp.org/]WBhelp.org[/url]

albatros

Hi,

Correct me if I am wrong, but I am sure that a robots.txt as a solution is very insecure. Search engines can respect robots.txt, but they don't have to. And reading robots.txt files could well be a help and an inspiration for bad guys. :-D

And what if a bad guy knows the structure of WB (or any other CMS)? He only has to find a website built with that CMS, and he can look at the admin files. Whether the files are in Google's index or not doesn't matter.

The very simple and only safe solution is password protection of admin via .htaccess. Then you can't see any file in any subfolder, and the admin functions are safely locked.

So why don't you use this? Am I completely wrong, or thinking much too simply?

Regards

Uwe


I am not a terrorist. I have no bomb workshop in the basement. I have no anthrax either. I will not buy any ricin. I will never travel to the United States of America.

Ruud

I agree on the last part.
It should not be that difficult to prevent that. (using robots.txt or .php templates instead of .html)

BTW: I wonder how Google found those pages.
There are no links pointing to those files, and Google doesn't look for pages without links pointing to them.
I think the only way that could happen is when a WB tree is installed on a server without PHP enabled and with directory browsing enabled.

While writing this message, I found one site that had a google_sitemap.xml document (generated with some external generator) which included ALL the admin .html, .js, .gif, .png files, etc.
That is the opposite of a robots.txt: it is asking Google to "please index my hidden stuff".
(Do a search on Google like this: "/media/basic_header.html google_sitemap.xml" and you will see what I mean.)

Conclusion: don't worry. Your pages will not be indexed by Google unless you ask for it.

Ruud
[url=https://dev4me.com/modules-snippets/]Dev4me - WebsiteBaker modules[/url] - [url=https://wbhelp.org/]WBhelp.org[/url]

Argos

Although I understand your explanation, I do feel it's weird to have loose admin header files linked in Google. I'll do some robots.txt stuff, but maybe it's an idea for an updated WB version to simply prevent this possibility.
Jurgen Nijhuis
Argos Media
Heiloo, The Netherlands
----------------------------------------------------------------
Please don't request personal support, use the forums!

Ruud

There are lots of WB sites out there.
Every site has the /admin/media/basic_header.html file (and all the other non-.php files in the admin area).
Everybody can download it and look at the source code to find out what it can do.

It should not be dangerous.
Typically an HTML page will not do much more than display data.
PHP pages in the admin area (or the modules area) should all have a bit of code to prevent running without the WB framework (and its security).
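
A minimal sketch of such a guard, assuming WB_PATH is the constant the framework defines when a page is loaded the intended way - treat the exact constant name as an assumption, not a quote from the WB source:


<?php
// Top of an admin or module page: refuse to run outside the framework.
// WB_PATH is assumed to be set by WebsiteBaker's config when the page is
// included through the framework; a direct browser request never has it.
if (!defined('WB_PATH')) {
    header('HTTP/1.1 403 Forbidden');
    exit('Cannot access this file directly.');
}
?>
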

Although I can imagine that using HTML templates in the admin area, together with .htaccess mods that allow HTML files to run PHP code, could be dangerous.

Ruud

[url=https://dev4me.com/modules-snippets/]Dev4me - WebsiteBaker modules[/url] - [url=https://wbhelp.org/]WBhelp.org[/url]

kweitzel

You can always use a robots.txt to protect the folders.

cheers

Klaus

Argos

I searched Google for "website baker media" and found a lot of results linking to "WebsiteBaker Administration - Media-" with URLs that point to "www.domainname.com/wb/admin/media/basic_header.html".

I feel a bit uncomfortable about it somehow. Can it be dangerous?
Jurgen Nijhuis
Argos Media
Heiloo, The Netherlands
----------------------------------------------------------------
Please don't request personal support, use the forums!